Linux Power Management development
 help / color / mirror / Atom feed
* Re: [RFC PATCH v2 02/10] CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
From: Tejun Heo @ 2012-12-05 20:57 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, oleg, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BFAF27.9060203@linux.vnet.ibm.com>

Hello,

On Thu, Dec 06, 2012 at 02:01:35AM +0530, Srivatsa S. Bhat wrote:
> Yes, that _sounds_ sufficient, but IMHO it won't be, in practice. The
> *number* of call-sites that you need to convert from preempt_disable/enable
> to get/put_online_cpus_atomic() won't be too many, however the *frequency*
> of usage of those call-sites can potentially be very high.

I don't think that will be the case and, even if it is, doing it this
way would make it difficult to tell.  The right thing to do is
replacing stop_machine with finer grained percpu locking first.
Refining it further should happen iff that isn't enough and there
isn't an simpler solution.  So, let's please do the simple conversion
first.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 6/6 v8] cpufreq, highbank: add support for highbank cpufreq
From: Mark Langsdorf @ 2012-12-05 22:09 UTC (permalink / raw)
  To: Mike Turquette
  Cc: linux-kernel@vger.kernel.org, cpufreq@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Shawn Guo
In-Reply-To: <CAPtuhTg4N7zAMf_Vfw1Xfbn3OmZ7hVffVuc=kMByxf_hJLkpPw@mail.gmail.com>

On 12/05/2012 12:49 PM, Mike Turquette wrote:
> On Wed, Dec 5, 2012 at 8:48 AM, Mark Langsdorf
> <mark.langsdorf@calxeda.com> wrote:
>> diff --git a/drivers/cpufreq/highbank-cpufreq.c b/drivers/cpufreq/highbank-cpufreq.c
>> new file mode 100644
>> index 0000000..1f28fa6
>> --- /dev/null
>> +++ b/drivers/cpufreq/highbank-cpufreq.c
>> @@ -0,0 +1,102 @@
> 
> Looks pretty good to me.  Some tedious nitpicks and discussion below.
> <snip>
> 
>> +static int hb_voltage_change(unsigned int freq)
>> +{
>> +       int i;
>> +       u32 msg[7];
>> +
>> +       msg[0] = HB_CPUFREQ_CHANGE_NOTE;
>> +       msg[1] = freq / 1000000;
>> +       for (i = 2; i < 7; i++)
>> +               msg[i] = 0;
>> +
>> +       return pl320_ipc_transmit(msg);
>> +}
>> +
>> +static int hb_cpufreq_clk_notify(struct notifier_block *nb,
>> +                               unsigned long action, void *hclk)
>> +{
>> +       struct clk_notifier_data *clk_data = hclk;
>> +       int i = 0;
>> +
>> +       if (action == PRE_RATE_CHANGE) {
>> +               if (clk_data->new_rate > clk_data->old_rate)
>> +                       while (hb_voltage_change(clk_data->new_rate))
>> +                               if (i++ > 15)
> 
> There are a few magic numbers here.  How about something like:
> 
> #define HB_VOLT_CHANGE_MAX_TRIES 15
> 
> Maybe do the same for the i2c message length?

Fixed.

>> +                                       return NOTIFY_STOP;
> 
> How about NOTIFY_BAD?  It more clearly signals that an error has occurred.

> Same as above.  It is true that the clock framework does nothing with
> post-rate change notifier aborts but that might change in the future.

Changed and added.

>> +       }
>> +
>> +       return NOTIFY_DONE;
>> +}
>> +
>> +static struct notifier_block hb_cpufreq_clk_nb = {
>> +       .notifier_call = hb_cpufreq_clk_notify,
>> +};
>> +
> 
> Do you have any plans to convert your voltage change routine over to
> the regulator framework?  Likewise do you plan to use the OPP library
> in the future?  I can understand if you do not do that since your
> regulator/dvfs programming model makes things very simple for you.

I looked at treating the ECME as a voltage regulator, but it was a very
bad fit. The ECME has a certain amount of intelligence built into it and
corporate plans are to treat voltage control as a black box.

The current solution is actually nicely generic from my perspective. The
clk notifiers guarantee we can make the voltage changes at the right
time regardless of the underlying cpufreq driver implementation. I don't
think we need more until we get into cpufreq QoS issues, and even then
I'd want to stick with something like the current structure.

--Mark Langsdorf
Calxeda, Inc.

^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Andy Green @ 2012-12-06  0:05 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, Ming Lei, Linux-pm mailing list, linux-omap,
	USB list
In-Reply-To: <Pine.LNX.4.44L0.1212051129140.1659-100000@iolanthe.rowland.org>

On 06/12/12 00:42, the mail apparently from Alan Stern included:
> On Wed, 5 Dec 2012, Andy Green wrote:
>
>>> The details of this aren't clear yet.  For instance, should the device
>>> core try to match the port with the asset info, or should this be done
>>> by the USB code when the port is created?
>>
>> Currently what I have (this is before changing it to pm domain, but this
>> should be unchanged) lets the asset define a callback op which will
>> receive notifications for all added child devices that have the device
>> the asset is bound to as an ancestor.
>>
>> So you would bind an asset to the host controller device...
>>
>> static struct device_asset assets_ehci_omap0[] = {
>> 	{
>>                   .name = "-0:1.0/port1",
>>                   .asset = assets_ehci_omap0_smsc_port,
>>                   .ops = &device_descendant_attach_default_asset_ops,
>> 	},
>>           { }
>> };
>>
>> ...with this descendant filter callback pointing to a generic "end of
>> the device path" matcher.
>>
>> when device_descendant_attach_default_asset_ops() sees the child that
>> was foretold has appeared (and it only hears about children of
>> ehci-omap.0 in this case), it binds the assets pointed to by .asset to
>> the new child before its probe.  assets_ehci_omap0_smsc_port is an array
>> of the actual regulator and clock that need switching by the child.  So
>> the effect is to magic the right assets to the child device just before
>> it gets added (and probe called etc).
>>
>> This is working well and the matcher helper is small and simple.
>
> Right.  The question is should it be done in this somewhat roundabout
> way (you've got two separate assets for setting up this one thing), or
> should the USB port code be responsible for explicitly checking if
> there are any applicable assets when the port is created?
>
> The advantange of the second approach is that it doesn't involve
> checking all the ancestors every time a new device is added (and it
> involves only one asset).  The disadvantage is that it works only for
> USB ports.

It's done in two steps to strongly filter candidate child devices 
against being children of a known platform device.  If you go around 
that, you will be back to full device path matching with wildcards at 
the USB child to determine if he is "the one".  So that's a feature not 
an issue I think.

I can see doing it generically or not is equally a political issue as a 
technical one, but I think if we bother to add this kind of support we 
should prefer to do it generally.  It's going to be about the same 
amount of code.

As Tony Lindgren said, even board-omap4panda.c itself is deprecated, to 
project platform info into USB stack you either have to add DT code into 
usb stack then to go find things directly, or provide a generic 
methodology like assets which have the dt bindings done outside of any 
subsystem and apply their operations outside the subsystem too.

> To answer the question, we need to know how other subsystems might
> want to use the asset-matching approach.  My guess is that only a small
> number of device types would care about assets (in the same way that
> assets matter to USB ports but not to other USB device types).  And it
> might not be necessary to check against every ancestor; we might know
> beforehand where to check (just as the USB port would know to look for
> an asset attached to the host controller device).

Yes I think it boils down to "buses" in general can benefit from this. 
They're the thing that dynamically - later - create child devices you 
might need to target with what was ultimately platform information.

On Panda the other bus thing that can benefit is the WLAN module on 
SDIO.  In fact that's a very illuminating case for your question above. 
  Faced with exactly the same problem, that they needed to project 
platform info on to SDIO-probed device instance to tell it what clock 
rate it had been given, they invented an api which you can see today in 
board-omap4panda.c and other boards there, wl12xx_set_platform_data(). 
This is from board-4430sdp:

static struct wl12xx_platform_data omap4_sdp4430_wlan_data __initdata = {
         .board_ref_clock = WL12XX_REFCLOCK_26,
         .board_tcxo_clock = WL12XX_TCXOCLOCK_26,
};

...

	ret = wl12xx_set_platform_data(&omap4_sdp4430_wlan_data);

You can find the other end of it here in 
drivers/net/wireless/ti/wlcore/wl12xx_platform_data.c

static struct wl12xx_platform_data *platform_data;

int __init wl12xx_set_platform_data(const struct wl12xx_platform_data *data)
{
	if (platform_data)
		return -EBUSY;
	if (!data)
		return -EINVAL;

	platform_data = kmemdup(data, sizeof(*data), GFP_KERNEL);
	if (!platform_data)
		return -ENOMEM;

	return 0;
}

when the driver for the device instance wants to "get" its platform data 
it reads the static pointer via another api.  Obviously if you want two 
modules on your platform that's not so good.

I doubt that's the only SDIO device that wants to know platform info. 
So I think by providing a generic way we help other buses and devices.

-Andy

-- 
Andy Green | TI Landing Team Leader
Linaro.org │ Open source software for ARM SoCs | Follow Linaro
http://facebook.com/pages/Linaro/155974581091106  - 
http://twitter.com/#!/linaroorg - http://linaro.org/linaro-blog
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Rafael J. Wysocki @ 2012-12-06  1:00 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andy Green, Ming Lei, Linux-pm mailing list,
	linux-omap-u79uwXL29TY76Z2rM5mHXA, USB list
In-Reply-To: <Pine.LNX.4.44L0.1212041150430.1800-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>

Hi,

Well, I'm not any less busy than yesterday, as it turns out, but I'm expecting
that to continue tomorrow, so I may as well have a look at it now. :-)

On Tuesday, December 04, 2012 12:10:32 PM Alan Stern wrote:
> [CC: list trimmed; the people who were on it should be subscribed to at 
> least one of the lists anyway.]
> 

[...]

> 
> > Not only regulators involved, clock or other things might be involved too.
> > Also the same power domain might be shared with more than one port,
> > so it is better to introduce power domain to the problem. Looks
> > generic_pm_domain is overkill, so I introduced power controller which
> > only focuses on power on/off and being shared by multiple devices.   
> 
> Even though it is overkill, it may be better than introducing a new 
> abstraction.  After all, this is exactly the sort of thing that PM 
> domains were originally created to handle.
> 
> Rafael, do you have any advice on this?  The generic_pm_domain 
> structure is fairly complicated, there's a lot of code in 
> drivers/base/power/domain.c (it's the biggest source file in its 
> directory), and I'm not aware of any documentation.

Yeah, documentation is missing, which obviously is my fault.

That code is designed to cover the case in which multiple devices share
a "power switch" that can be used to remove power from all of them at
once (eg. through a register write).  It also assumes that there will
be a "stop" operation, such as "disable clock", allowing each device in
the domain to be put into a shallow low-power state (individually) without
the necessity to save its state.  Device states only have to be saved
when the "power switch" is about to be used, which generally happens
when they all have been "stopped" (their runtime PM status is RPM_SUSPENDED).

The "stop" operation may be defined in a different way for each device in the
domain (actually, that applies to the "save state", "restore state", and
"start" - opposite to "stop" - operations too) and PM QoS latency constraints
are used to decide if and when to turn the whole domain off.  Moreover, it
supports hierarchies of power domains that may be pretty much arbitarily
complicated.

A big part of this code is for the handling of system suspend/resume
in such a way that it is consistent with runtime PM.

For USB ports I'd recommend to use something simpler. :-)

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 1/5] Device Power: introduce power controller
From: Ming Lei @ 2012-12-06  1:27 UTC (permalink / raw)
  To: Roger Quadros
  Cc: Andy Green, Alan Stern, Greg Kroah-Hartman, Lan Tianyu,
	Sarah Sharp, Rafael J. Wysocki, linux-pm, Oliver Neukum,
	linux-omap, linux-usb, Felipe Balbi
In-Reply-To: <50BF7B30.9050904@ti.com>

On Thu, Dec 6, 2012 at 12:49 AM, Roger Quadros <rogerq@ti.com> wrote:
> On 12/03/2012 05:00 AM, Ming Lei wrote:
>> On Mon, Dec 3, 2012 at 12:02 AM, Andy Green <andy.green@linaro.org> wrote:
>>> On 02/12/12 23:01, the mail apparently from Ming Lei included:
>>>
>>>> Power controller is an abstract on simple power on/off switch.
>>>>
>>>> One power controller can bind to more than one device, which
>>>> provides power logically, for example, we can think one usb port
>>>> in hub provides power to the usb device attached to the port, even
>>>> though the power is supplied actually by other ways, eg. the usb
>>>> hub is a self-power device. From hardware view, more than one
>>>> device can share one power domain, and power controller can power
>>>> on if one of these devices need to provide power, and power off if
>>>> all these devices don't need to provide power.
>>>
>>>
>>> What stops us using struct regulator here?  If you have child regulators
>>> supplied by a parent supply, isn't that the right semantic already without
>>> introducing a whole new thing?  Apologies if I missed the point.
>>
>> There are two purposes:
>>
>> One is to hide the implementation details of the power controller because
>> the user doesn't care how it is implemented, maybe clock, regulator, gpio
>> and other platform dependent stuffs involved, so the patch simplify the usage
>> from the view of users.
>>
>
> Which user are you talking about?

Here it is the usb port device.

At least, there are many boards which have hardwired and self-powered usb
devices, so in theory they can benefits from the power controller.  Maybe
only regulator and clock can't be covered completely for other boards.

The patch can make usb port deal with the 'power controller' only, and make it
avoid to deal with regulators/clocks/... directly.


Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [RFC PATCH 1/5] Device Power: introduce power controller
From: Jassi Brar @ 2012-12-06  3:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Roger Quadros, Andy Green, Alan Stern, Greg Kroah-Hartman,
	Lan Tianyu, Sarah Sharp, Rafael J. Wysocki, linux-pm,
	Oliver Neukum, linux-omap, linux-usb, Felipe Balbi
In-Reply-To: <CACVXFVOiu5+M3A14Q-U2FiHYrBjAswp4Jv+Aybs_kV9DiWNHZw@mail.gmail.com>

On 6 December 2012 06:57, Ming Lei <tom.leiming@gmail.com> wrote:
> On Thu, Dec 6, 2012 at 12:49 AM, Roger Quadros <rogerq@ti.com> wrote:
>> On 12/03/2012 05:00 AM, Ming Lei wrote:
>>> On Mon, Dec 3, 2012 at 12:02 AM, Andy Green <andy.green@linaro.org> wrote:
>>>> On 02/12/12 23:01, the mail apparently from Ming Lei included:
>>>>
>>>>> Power controller is an abstract on simple power on/off switch.
>>>>>
>>>>> One power controller can bind to more than one device, which
>>>>> provides power logically, for example, we can think one usb port
>>>>> in hub provides power to the usb device attached to the port, even
>>>>> though the power is supplied actually by other ways, eg. the usb
>>>>> hub is a self-power device. From hardware view, more than one
>>>>> device can share one power domain, and power controller can power
>>>>> on if one of these devices need to provide power, and power off if
>>>>> all these devices don't need to provide power.
>>>>
>>>>
>>>> What stops us using struct regulator here?  If you have child regulators
>>>> supplied by a parent supply, isn't that the right semantic already without
>>>> introducing a whole new thing?  Apologies if I missed the point.
>>>
>>> There are two purposes:
>>>
>>> One is to hide the implementation details of the power controller because
>>> the user doesn't care how it is implemented, maybe clock, regulator, gpio
>>> and other platform dependent stuffs involved, so the patch simplify the usage
>>> from the view of users.
>>>
>>
>> Which user are you talking about?
>
> Here it is the usb port device.
>
> At least, there are many boards which have hardwired and self-powered usb
> devices, so in theory they can benefits from the power controller.  Maybe
> only regulator and clock can't be covered completely for other boards.
>
> The patch can make usb port deal with the 'power controller' only, and make it
> avoid to deal with regulators/clocks/... directly.
>
I am curious too, except for clocks and voltage supplies (regulators),
what other external resources does a chip need ?

^ permalink raw reply

* Re: [RFC PATCH v2 02/10] CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-06  4:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, oleg, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205205744.GB27465@mtj.dyndns.org>

On 12/06/2012 02:27 AM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 06, 2012 at 02:01:35AM +0530, Srivatsa S. Bhat wrote:
>> Yes, that _sounds_ sufficient, but IMHO it won't be, in practice. The
>> *number* of call-sites that you need to convert from preempt_disable/enable
>> to get/put_online_cpus_atomic() won't be too many, however the *frequency*
>> of usage of those call-sites can potentially be very high.
> 
> I don't think that will be the case and, even if it is, doing it this
> way would make it difficult to tell.  The right thing to do is
> replacing stop_machine with finer grained percpu locking first.
> Refining it further should happen iff that isn't enough and there
> isn't an simpler solution.  So, let's please do the simple conversion
> first.
> 

Hmm, OK, that sounds like a good plan. So I'll drop the "light" and
"full" variants for now and work on providing a straight-forward
get/put_online_cpus_atomic() APIs.

Thank you!
 
Regards,
Srivatsa S. Bhat


^ permalink raw reply

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-12-06  6:32 UTC (permalink / raw)
  To: wujianguo
  Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	gargankita, amit.kachhap, svaidy, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <50BDD5AB.9070706@gmail.com>

Hi Jianguo,

On 12/04/2012 04:21 PM, wujianguo wrote:
> Hi Srivatsa,
> 
> I applied this patchset, and run genload(from LTP) test: numactl --membind=1 ./genload -m 100,
> then got a "general protection fault", and system was going to reboot.
> 
> If I revert [RFC PATCH 7/8], and run this test again, genload will be killed due to OOM,
> but the system is OK, no coredump.
> 

Sorry for the delay in replying. Thanks a lot for testing and for the bug-report!
I could recreate the issue in one of my machines using the LTP test you mentioned.
I'll try to dig and find out what is going wrong.

Regards,
Srivatsa S. Bhat

> ps: node1 has 8G memory.
> 
> [ 3647.020666] general protection fault: 0000 [#1] SMP
> [ 3647.026232] Modules linked in: edd cpufreq_conservative cpufreq_userspace cpu
> freq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_
> intel ixgbe ipv6 i7core_edac igb iTCO_wdt i2c_i801 iTCO_vendor_support ioatdma e
> dac_core tpm_tis joydev lpc_ich i2c_core microcode mfd_core rtc_cmos pcspkr sr_m
> od tpm sg dca hid_generic mdio tpm_bios cdrom button ext3 jbd mbcache usbhid hid
>  uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hw
> mon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_
> piix libata megaraid_sas scsi_mod
> [ 3647.084565] CPU 19
> [ 3647.086709] Pid: 33708, comm: genload Not tainted 3.7.0-rc7-mem-region+ #11 Q
> CI QSSC-S4R/QSSC-S4R
> [ 3647.096799] RIP: 0010:[<ffffffff8110979c>]  [<ffffffff8110979c>] add_to_freel
> ist+0x8c/0x100
> [ 3647.106125] RSP: 0000:ffff880a7f6c3e58  EFLAGS: 00010086
> [ 3647.112042] RAX: dead000000200200 RBX: 0000000000000001 RCX: 0000000000000000
> 
> [ 3647.119990] RDX: ffffea001211a3a0 RSI: ffffea001211ffa0 RDI: 0000000000000001
> 
> [ 3647.127936] RBP: ffff880a7f6c3e58 R08: ffff88067ff6d240 R09: ffff88067ff6b180
> 
> [ 3647.135884] R10: 0000000000000002 R11: 0000000000000001 R12: 00000000000007fe
> 
> [ 3647.143831] R13: 0000000000000001 R14: 0000000000000001 R15: ffffea001211ff80
> 
> [ 3647.151778] FS:  00007f0b2a674700(0000) GS:ffff880a7f6c0000(0000) knlGS:00000
> 00000000000
> [ 3647.160790] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 3647.167188] CR2: 00007f0b1a000000 CR3: 0000000484723000 CR4: 00000000000007e0
> 
> [ 3647.175136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 
> [ 3647.183083] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> 
> [ 3647.191030] Process genload (pid: 33708, threadinfo ffff8806852bc000, task ff
> ff880688288000)
> [ 3647.200428] Stack:
> [ 3647.202667]  ffff880a7f6c3f08 ffffffff8110e9c0 ffff88067ff66100 0000000000000
> 7fe
> [ 3647.210954]  ffff880a7f6d5bb0 0000000000000030 0000000000002030 ffff88067ff66
> 168
> [ 3647.219244]  0000000000000002 ffff880a7f6d5b78 0000000e88288000 ffff88067ff66
> 100
> [ 3647.227530] Call Trace:
> [ 3647.230252]  <IRQ>
> [ 3647.232394]  [<ffffffff8110e9c0>] free_pcppages_bulk+0x350/0x450
> [ 3647.239297]  [<ffffffff8110f0d0>] ? drain_pages+0xd0/0xd0
> [ 3647.245313]  [<ffffffff8110f0c3>] drain_pages+0xc3/0xd0
> [ 3647.251135]  [<ffffffff8110f0e6>] drain_local_pages+0x16/0x20
> [ 3647.257540]  [<ffffffff810a3bce>] generic_smp_call_function_interrupt+0xae/0x
> 260
> [ 3647.265783]  [<ffffffff810282c7>] smp_call_function_interrupt+0x27/0x40
> [ 3647.273156]  [<ffffffff8147f272>] call_function_interrupt+0x72/0x80
> [ 3647.280136]  <EOI>
> [ 3647.282278]  [<ffffffff81077936>] ? mutex_spin_on_owner+0x76/0xa0
> [ 3647.289292]  [<ffffffff81473116>] __mutex_lock_slowpath+0x66/0x180
> [ 3647.296181]  [<ffffffff8113afe7>] ? try_to_unmap_one+0x277/0x440
> [ 3647.302872]  [<ffffffff81472b93>] mutex_lock+0x23/0x40
> [ 3647.308595]  [<ffffffff8113b657>] rmap_walk+0x137/0x240
> [ 3647.314417]  [<ffffffff8115c230>] ? get_page+0x40/0x40
> [ 3647.320133]  [<ffffffff8115d036>] move_to_new_page+0xb6/0x110
> [ 3647.326526]  [<ffffffff8115d452>] __unmap_and_move+0x192/0x230
> [ 3647.333023]  [<ffffffff8115d612>] unmap_and_move+0x122/0x140
> [ 3647.339328]  [<ffffffff8115d6c9>] migrate_pages+0x99/0x150
> [ 3647.345433]  [<ffffffff81129f10>] ? isolate_freepages+0x220/0x220
> [ 3647.352220]  [<ffffffff8112ace2>] compact_zone+0x2f2/0x5d0
> [ 3647.358332]  [<ffffffff8112b4a0>] try_to_compact_pages+0x180/0x240
> [ 3647.365218]  [<ffffffff8110f1e7>] __alloc_pages_direct_compact+0x97/0x200
> [ 3647.372780]  [<ffffffff810a45a3>] ? on_each_cpu_mask+0x63/0xb0
> [ 3647.379279]  [<ffffffff8110f84f>] __alloc_pages_slowpath+0x4ff/0x780
> [ 3647.386349]  [<ffffffff8110fbf1>] __alloc_pages_nodemask+0x121/0x180
> [ 3647.393430]  [<ffffffff811500d6>] alloc_pages_vma+0xd6/0x170
> [ 3647.399737]  [<ffffffff81162198>] do_huge_pmd_anonymous_page+0x148/0x210
> [ 3647.407203]  [<ffffffff81132f6b>] handle_mm_fault+0x33b/0x340
> [ 3647.413609]  [<ffffffff814799d3>] __do_page_fault+0x2a3/0x4e0
> [ 3647.420017]  [<ffffffff8126316a>] ? trace_hardirqs_off_thunk+0x3a/0x6c
> [ 3647.427290]  [<ffffffff81479c1e>] do_page_fault+0xe/0x10
> [ 3647.433208]  [<ffffffff81475f68>] page_fault+0x28/0x30
> [ 3647.438921] Code: 8d 78 01 48 89 f8 48 c1 e0 04 49 8d 04 00 48 8b 50 08 48 83
>  40 10 01 48 85 d2 74 1b 48 8b 42 08 48 89 72 08 48 89 16 48 89 46 08 <48> 89 30
>  c9 c3 0f 1f 80 00 00 00 00 4d 3b 00 74 4b 83 e9 01 79
> [ 3647.460607] RIP  [<ffffffff8110979c>] add_to_freelist+0x8c/0x100
> [ 3647.467308]  RSP <ffff880a7f6c3e58>
> [    0.000000] Linux version 3.7.0-rc7-mem-region+ (root@linux-intel) (gcc versi
> on 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #11 SMP Tue Dec 4 15:23
> :15 CST 2012
> .
> 
> Thanks,
> Jianguo Wu
> 
> On 2012-11-7 3:52, Srivatsa S. Bhat wrote:
>> Hi,
>>
>> This is an alternative design for Memory Power Management, developed based on
>> some of the suggestions[1] received during the review of the earlier patchset
>> ("Hierarchy" design) on Memory Power Management[2]. This alters the buddy-lists
>> to keep them region-sorted, and is hence identified as the "Sorted-buddy" design.
>>
>> One of the key aspects of this design is that it avoids the zone-fragmentation
>> problem that was present in the earlier design[3].
>>
>>
>> Quick overview of Memory Power Management and Memory Regions:
>> ------------------------------------------------------------
>>
>> Today memory subsystems are offer a wide range of capabilities for managing
>> memory power consumption. As a quick example, if a block of memory is not
>> referenced for a threshold amount of time, the memory controller can decide to
>> put that chunk into a low-power content-preserving state. And the next
>> reference to that memory chunk would bring it back to full power for read/write.
>> With this capability in place, it becomes important for the OS to understand
>> the boundaries of such power-manageable chunks of memory and to ensure that
>> references are consolidated to a minimum number of such memory power management
>> domains.
>>
>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>> the firmware can expose information regarding the boundaries of such memory
>> power management domains to the OS in a standard way.
>>
>> How can Linux VM help memory power savings?
>>
>> o Consolidate memory allocations and/or references such that they are
>> not spread across the entire memory address space.  Basically area of memory
>> that is not being referenced, can reside in low power state.
>>
>> o Support targeted memory reclaim, where certain areas of memory that can be
>> easily freed can be offlined, allowing those areas of memory to be put into
>> lower power states.
>>
>> Memory Regions:
>> ---------------
>>
>> "Memory Regions" is a way of capturing the boundaries of power-managable
>> chunks of memory, within the MM subsystem.
>>
>>
>> Short description of the "Sorted-buddy" design:
>> -----------------------------------------------
>>
>> In this design, the memory region boundaries are captured in a parallel
>> data-structure instead of fitting regions between nodes and zones in the
>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>> zones' freelists in region-sorted-order and thus do page allocation in the
>> order of increasing memory regions. (The freelists need not be fully
>> address-sorted, they just need to be region-sorted. Patch 6 explains this
>> in more detail).
>>
>> The idea is to do page allocation in increasing order of memory regions
>> (within a zone) and perform page reclaim in the reverse order, as illustrated
>> below.
>>
>> ---------------------------- Increasing region number---------------------->
>>
>> Direction of allocation--->                         <---Direction of reclaim
>>
>>
>> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
>> lies in the page-free path and not the page-allocation path and hence the
>> critical page allocation paths remain fast. Moreover, the heart of the page
>> allocation algorithm itself remains largely unchanged, and the region-related
>> data-structures are optimized to avoid unnecessary updates during the
>> page-allocator's runtime.
>>
>> Advantages of this design:
>> --------------------------
>> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
>>    hence we avoid its associated problems (like too many zones, extra page
>>    reclaim threads, question of choosing watermarks etc).
>>    [This is an advantage over the "Hierarchy" design]
>>
>> 2. Performance overhead is expected to be low: Since we retain the simplicity
>>    of the algorithm in the page allocation path, page allocation can
>>    potentially remain as fast as it would be without memory regions. The
>>    overhead is pushed to the page-freeing paths which are not that critical.
>>
>>
>> Results:
>> =======
>>
>> Test setup:
>> -----------
>> This patchset applies cleanly on top of 3.7-rc3.
>>
>> x86 dual-socket quad core HT-enabled machine booted with mem=8G
>> Memory region size = 512 MB
>>
>> Functional testing:
>> -------------------
>>
>> Ran pagetest, a simple C program that allocates and touches a required number
>> of pages.
>>
>> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
>> of allocations from pagetest.
>>
>> 	     Present pages   |	Free pages at various allocations        |
>> 			     |  start	|  512 MB  |  1024 MB | 2048 MB  |
>>   Region 0      16	     |   0      |    0     |     0    |    0     |
>>   Region 1      131072       |  87219   |  8066    |   7892   |  7387    |
>>   Region 2      131072       | 131072   |  79036   |     0    |    0     |
>>   Region 3      131072       | 131072   | 131072   |   79061  |    0     |
>>   Region 4      131072       | 131072   | 131072   |  131072  |    0     |
>>   Region 5      131072       | 131072   | 131072   |  131072  |  79051   |
>>   Region 6      131072       | 131072   | 131072   |  131072  |  131072  |
>>   Region 7      131072       | 131072   | 131072   |  131072  |  131072  |
>>   Region 8      131056       | 105475   | 105472   |  105472  |  105472  |
>>
>> This shows that page allocation occurs in the order of increasing region
>> numbers, as intended in this design.
>>
>> Performance impact:
>> -------------------
>>
>> Kernbench results didn't show much of a difference between the performance
>> of vanilla 3.7-rc3 and this patchset.
>>
>>
>> Todos:
>> =====
>>
>> 1. Memory-region aware page-reclamation:
>> ----------------------------------------
>>
>> We would like to do page reclaim in the reverse order of page allocation
>> within a zone, ie., in the order of decreasing region numbers.
>> To achieve that, while scanning lru pages to reclaim, we could potentially
>> look for pages belonging to higher regions (considering region boundaries)
>> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
>> reclaim candidates.
>>
>> 2. Compile-time exclusion of Memory Power Management, and extending the
>> support to also work with other features such as Mem cgroups, kexec etc.
>>
>> References:
>> ----------
>>
>> [1]. Review comments suggesting modifying the buddy allocator to be aware of
>>      memory regions:
>>      http://article.gmane.org/gmane.linux.power-management.general/24862
>>      http://article.gmane.org/gmane.linux.power-management.general/25061
>>      http://article.gmane.org/gmane.linux.kernel.mm/64689
>>
>> [2]. Patch series that implemented the node-region-zone hierarchy design:
>>      http://lwn.net/Articles/445045/
>>      http://thread.gmane.org/gmane.linux.kernel.mm/63840
>>
>>      Summary of the discussion on that patchset:
>>      http://article.gmane.org/gmane.linux.power-management.general/25061
>>
>>      Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
>>      http://thread.gmane.org/gmane.linux.kernel.mm/89202
>>
>> [3]. Disadvantages of having memory regions in the hierarchy between nodes and
>>      zones:
>>      http://article.gmane.org/gmane.linux.kernel.mm/63849
>>
>> [4]. Estimate of potential power savings on Samsung exynos board
>>      http://article.gmane.org/gmane.linux.kernel.mm/65935
>>
>> [5]. ACPI 5.0 and MPST support
>>      http://www.acpi.info/spec.htm
>>      Section 5.2.21 Memory Power State Table (MPST)
>>
>>  Srivatsa S. Bhat (8):
>>       mm: Introduce memory regions data-structure to capture region boundaries within node
>>       mm: Initialize node memory regions during boot
>>       mm: Introduce and initialize zone memory regions
>>       mm: Add helpers to retrieve node region and zone region for a given page
>>       mm: Add data-structures to describe memory regions within the zones' freelists
>>       mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
>>       mm: Add an optimized version of del_from_freelist to keep page allocation fast
>>       mm: Print memory region statistics to understand the buddy allocator behavior
>>
>>
>>   include/linux/mm.h     |   38 +++++++
>>  include/linux/mmzone.h |   52 +++++++++
>>  mm/compaction.c        |    8 +
>>  mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
>>  mm/vmstat.c            |   59 ++++++++++-
>>  5 files changed, 390 insertions(+), 30 deletions(-)
>>
>>
>> Thanks,
>> Srivatsa S. Bhat
>> IBM Linux Technology Center
>>
>> --


^ permalink raw reply

* Re: Race condition between driver_probe_device and device_shutdown
From: Wedson Almeida Filho @ 2012-12-06  9:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Alan Stern, Greg KH, Eric W. Biederman, Andrew Morton,
	linux-kernel, Linux PM List
In-Reply-To: <CACVXFVPu_nbVCVR_HTt9H764+4JKC3=GPyhsf8Lpsch4Fa=xqQ@mail.gmail.com>

[Sorry for taking so long to respond, after a week of silence I
assumed I wouldn't get any responses, plus I had moved on to other
things.]

I happen to still have the logs, the relevant part is pasted at the end.

Answering some of the questions: the driver is on the platform bus, in
fact, it's drivers/usb/host/ehci-tegra.c; after seeing the oops below,
I added printks when entering and exiting tegra_ehci_probe to try to
understand it better.

Note that in the log we see some thread entering tegra_ehci_probe, but
nothing indicates that it has exited before we get the oops.

As for how to reproduce this, I was running a "reboot stress" test; I
would boot the device and reboot it as soon as possible. It appears
that drivers were still being loaded when I managed to start the
reboot. Given that it's a race condition, it wouldn't always
reproduce, but it happened often enough that caught my attention.

With the patch I sent out in my first email I wouldn't run into this at all.

[   58.759906] tegra_ehci_probe instance 1
[   58.764958] [USBHHCD] : usb_create_hcd start
[   58.769342] [USBHHCD] : usb_bus_init start
[   58.772507] Entering device_shutdown
[   58.772516] [USBHv2] tegra_ehci_hcd_shutdown
[   58.772534] Unable to handle kernel paging request at virtual
address ffffffa8
[   58.772541] pgd = ef1e0000
[   58.772545] [ffffffa8] *pgd=af7fe021, *pte=00000000, *ppte=00000000
[   58.772557] Internal error: Oops: 17 [#1] PREEMPT SMP
[   58.772563] last sysfs file:
/sys/devices/platform/tegra-ehci.1/usb1/1-1/bbusb_ioctl
[   58.772588] CPU: 0    Tainted: G        WC
(2.6.36.3-02116-gb523cbe-dirty #3)
[   58.772610] PC is at tegra_ehci_hcd_shutdown+0x28/0x64
[   58.772617] LR is at tegra_ehci_hcd_shutdown+0x28/0x64
[   58.772624] pc : [<c04e3c04>]    lr : [<c04e3c04>]    psr: 60000013
[   58.772628] sp : e5ea3e50  ip : 00010135  fp : 00000001
[   58.772634] r10: 40819060  r9 : e5ea2000  r8 : c023c284
[   58.772639] r7 : 4081b090  r6 : c09326d0  r5 : fffffef8  r4 : 00000000
[   58.772645] r3 : 00000000  r2 : 00000001  r1 : 60000093  r0 : 00000036
[   58.772652] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   58.772659] Control: 10c5387d  Table: af1e004a  DAC: 00000015
[   58.772664]
[   58.772666] PC: 0xc04e3b84:
[   58.772669] 3b84  10800003 eaffffd3 e591c000 e59c6008 e1120006
15d1302b 10800003 e1140006
[   58.772680] 3ba4  15d1302d e2811008 10800003 eaffffc9 e2803f42
e92d4010 e5932004 e592000c
[   58.772691] 3bc4  f57ff05f e5931024 e1a001a0 ebfcc5b7 e1a00001
e8bd8010 e92d4070 e2800008
[   58.772702] 3be4  ebfe615a e3500000 08bd8070 e5904000 e59f1038
e59f0038 e2445f42 eb07d6b4
[   58.772713] 3c04  e5143058 e5933028 e3530000 08bd8070 e59f1018
e59f001c eb07d6ad e5143058
[   58.772723] 3c24  e1a00005 e1a0e00f e593f028 e8bd8070 c07098f8
c07ec784 c07ec794 e92d41f0
[   58.772734] 3c44  e59d4018 e1a05001 e1a06002 e1a07003 e5952000
f57ff05f e3720001 e0021006
[   58.772745] 3c64  e2444001 e3a00001 0a000006 e1510007 0a000006
ebf5c2ba e3540000 cafffff3
[   58.772757]
[   58.772758] LR: 0xc04e3b84:
[   58.772761] 3b84  10800003 eaffffd3 e591c000 e59c6008 e1120006
15d1302b 10800003 e1140006
[   58.772772] 3ba4  15d1302d e2811008 10800003 eaffffc9 e2803f42
e92d4010 e5932004 e592000c
[   58.772783] 3bc4  f57ff05f e5931024 e1a001a0 ebfcc5b7 e1a00001
e8bd8010 e92d4070 e2800008
[   58.772793] 3be4  ebfe615a e3500000 08bd8070 e5904000 e59f1038
e59f0038 e2445f42 eb07d6b4
[   58.772804] 3c04  e5143058 e5933028 e3530000 08bd8070 e59f1018
e59f001c eb07d6ad e5143058
[   58.772814] 3c24  e1a00005 e1a0e00f e593f028 e8bd8070 c07098f8
c07ec784 c07ec794 e92d41f0
[   58.772825] 3c44  e59d4018 e1a05001 e1a06002 e1a07003 e5952000
f57ff05f e3720001 e0021006
[   58.772836] 3c64  e2444001 e3a00001 0a000006 e1510007 0a000006
ebf5c2ba e3540000 cafffff3
[   58.772847]
[   58.772849] SP: 0xe5ea3dd0:
[   58.772852] 3dd0  58b8609a 2020205b 372e3835 31353237 00205d36
00000000 3b9aca00 10624dd3
[   58.772863] 3df0  60000013 ffffffff e5ea3e3c c09326d0 4081b090
c023bbac 00000036 60000093
[   58.772874] 3e10  00000001 00000000 00000000 fffffef8 c09326d0
4081b090 c023c284 e5ea2000
[   58.772884] 3e30  40819060 00000001 00010135 e5ea3e50 c04e3c04
c04e3c04 60000013 ffffffff
[   58.772895] 3e50  ef044a08 ef044a14 c09326d0 c047d500 ef044a08
c047908c c07b89b8 e5ea3e90
[   58.772905] 3e70  00000000 c02b27f4 e5ea3e90 e5ea3e90 00000000
c02b28a4 00000000 c02b2a80
[   58.772916] 3e90  e5ea2000 403adc40 00000000 403adc48 00000100
00000000 00000000 e5ea3fb0
[   58.772927] 3eb0  e5ea2000 eb1f0080 e5ea2000 c02ae37c e5ea3fb0
c023eeb0 00000011 c11cbf14
[   58.772938]
[   58.772940] R5: 0xfffffe78:
[   58.772943] fe78  ******** ******** ******** ******** ********
******** ******** ********
[   58.772957] fe98  ******** ******** ******** ******** ********
******** ******** ********
[   58.772968] feb8  ******** ******** ******** ******** ********
******** ******** ********
[   58.772979] fed8  ******** ******** ******** ******** ********
******** ******** ********
[   58.772990] fef8  ******** ******** ******** ******** ********
******** ******** ********
[   58.773000] ff18  ******** ******** ******** ******** ********
******** ******** ********
[   58.773011] ff38  ******** ******** ******** ******** ********
******** ******** ********
[   58.773022] ff58  ******** ******** ******** ******** ********
******** ******** ********
[   58.773033]
[   58.773035] R6: 0xc0932650:
[   58.773038] 2650  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773049] 2670  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773059] 2690  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773069] 26b0  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773079] 26d0  ef015420 ef096c60 ef0153e0 ef0153a0 ef015360
00000000 00000000 ef0151e0
[   58.773090] 26f0  00000000 00000000 ef015320 00000001 ef0152e0
00000000 ef0152a0 00000000
[   58.773100] 2710  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773110] 2730  00000000 00000000 00000000 00000000 ef3f5bc0
00000000 00000000 00000000
[   58.773121]
[   58.773123] R8: 0xc023c204:
[   58.773126] c204  e31c0c01 1a000008 e3570f5d e24fef46 3798f107
e28d1008 e3a08000 e357080f
[   58.773137] c224  e2270000 2a000f9a ea01fd5c e1a02007 e28d1008
e3a00000 eb000583 e28fe014
[   58.773147] c244  e1a07000 e28d1008 e3570f5d 3891000f 3798f107
eaffffef e5ad0008 e1a02007
[   58.773158] c264  e1a0100d e3a00001 eb000577 eaffffba e320f000
e320f000 e320f000 c088360c
[   58.773169] c284  c02ad098 c02a5208 c023c87c c031ebf8 c031e9c4
c031c940 c031c6f8 c02bb7a4
[   58.773179] c2a4  c031c958 c032a680 c0329fe8 c023c88c c031d2c4
c02bb7a4 c032a3dc c031d1b8
[   58.773191] c2c4  c02cf7ac c02bb7a4 c02bb7a4 c031db00 c02ac334
c03381b8 c02bb7a4 c02cf744
[   58.773201] c2e4  c02cf4e8 c02bb7a4 c02ab000 c02bb7a4 c02bb7a4
c02ad218 c02bb7a4 c02bb7a4
[   58.773213]
[   58.773215] R9: 0xe5ea1f80:
[   58.773217] 1f80  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773228] 1fa0  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773238] 1fc0  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773248] 1fe0  00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773258] 2000  00000000 00000002 00000000 eb1f0080 c08a5268
00000000 00000015 eb1f0080
[   58.773269] 2020  c11bcb00 af1765a8 0000000d ee93da20 c08827e0
00000000 e5ea3bb4 e5ea3b18
[   58.773279] 2040  c06d9f30 00000000 00000000 00000000 00000000
00000000 01000000 00000000
[   58.773289] 2060  403adf00 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   58.773303] Process adbd (pid: 662, stack limit = 0xe5ea22f0)
[   58.773310] Stack: (0xe5ea3e50 to 0xe5ea4000)
[   58.773317] 3e40:                                     ef044a08
ef044a14 c09326d0 c047d500
[   58.773326] 3e60: ef044a08 c047908c c07b89b8 e5ea3e90 00000000
c02b27f4 e5ea3e90 e5ea3e90
[   58.773335] 3e80: 00000000 c02b28a4 00000000 c02b2a80 e5ea2000
403adc40 00000000 403adc48
[   58.773344] 3ea0: 00000100 00000000 00000000 e5ea3fb0 e5ea2000
eb1f0080 e5ea2000 c02ae37c
[   58.773354] 3ec0: e5ea3fb0 c023eeb0 00000011 c11cbf14 c08a8100
60000013 00000000 00000011
[   58.773363] 3ee0: 00000011 00000000 00040001 0000029b 00000000
00000000 00000000 00000001
[   58.773372] 3f00: 00000001 00000000 00000001 ee2064e0 e5ea3f60
ee856bd4 e5ea3f78 c02baa78
[   58.773381] 3f20: ee01db78 e5ea3f60 ee01dda8 c02a4890 00000001
e5ea3f78 00000000 00000005
[   58.773390] 3f40: 403adb4c 00000000 00000003 00000000 e5ea2000
00000000 e5ea2000 00000000
[   58.773399] 3f60: e5ea2000 e5ea21b0 00000100 00000000 403adb58
403add58 00000000 00000000
[   58.773408] 3f80: e5ea3fb0 e5ea2000 0000001b 00000077 00000000
000374d0 4081b090 0000001b
[   58.773417] 3fa0: 00000058 c023c100 000374d0 4081b090 fee1dead
28121969 a1b2c3d4 4081b090
[   58.773426] 3fc0: 000374d0 4081b090 0000001b 00000058 0000ee29
00100000 40819060 00000001
[   58.773436] 3fe0: 403adffc 403ade48 0000f1f7 000084fc 20000010
fee1dead ffffffff ffffffff
[   58.773468] [<c04e3c04>] (tegra_ehci_hcd_shutdown+0x28/0x64) from
[<c047d500>] (platform_drv_shutdown+0x18/0x1c)
[   58.773485] [<c047d500>] (platform_drv_shutdown+0x18/0x1c) from
[<c047908c>] (device_shutdown+0x78/0xd0)
[   58.773503] [<c047908c>] (device_shutdown+0x78/0xd0) from
[<c02b27f4>] (kernel_restart_prepare+0x58/0x70)
[   58.773516] [<c02b27f4>] (kernel_restart_prepare+0x58/0x70) from
[<c02b28a4>] (kernel_restart+0x2c/0x7c)
[   58.773527] [<c02b28a4>] (kernel_restart+0x2c/0x7c) from
[<c02b2a80>] (sys_reboot+0x184/0x1cc)
[   58.773540] [<c02b2a80>] (sys_reboot+0x184/0x1cc) from [<c023c100>]
(ret_fast_syscall+0x0/0x30)
[   58.773551] Code: e59f1038 e59f0038 e2445f42 eb07d6b4 (e5143058)
[   58.773559] ---[ end trace 1b75b31a2719ed32 ]---

On Thu, May 24, 2012 at 5:33 PM, Ming Lei <ming.lei@canonical.com> wrote:
> On Thu, May 24, 2012 at 10:37 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>>
>> The code there is racy already.  It does:
>>
>>                } else if (dev->driver && dev->driver->shutdown) {
>>
>> without any locking protection.  If the driver is unbound while this
>> statement runs then dev->driver could be non-NULL for the first test
>> and NULL for the second.
>
> Yes,  I missed this one, :-)
>
>>
>>> > to fix the race by prevent driver core from probing or releasing once
>>> > shutdown is started.
>>> >
>>> > How about the below patch?
>>>
>>> How about waiting for the original poster to respond as to exactly how
>>> they are hitting this race before doing anything?
>>
>> In addition, the patch is too complicated.  For this type of
>> synchronization you should use SRCU.  See
>> Documentation/RCU/whatisRCU.txt and related files.
>
> Yes, the synchronization should be a many reader vs. one
> writer problem, RCU should be suitable.
>
> Looks we think alike, :-)
>
> I have studied RCU yesterday, but was afraid that may introduce
> much more code, so not applied it in the patch. Will study it further
> to figure out a new version.
>
>
> Thanks,
> --
> Ming Lei

^ permalink raw reply

* Re: [PATCH 6/6 v8] cpufreq, highbank: add support for highbank cpufreq
From: Shawn Guo @ 2012-12-06  9:37 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: linux-kernel, cpufreq, linux-pm, linux-arm-kernel, mturquette
In-Reply-To: <1354726121-17190-7-git-send-email-mark.langsdorf@calxeda.com>

On Wed, Dec 05, 2012 at 10:48:41AM -0600, Mark Langsdorf wrote:
> Highbank processors depend on the external ECME to perform voltage
> management based on a requested frequency. Communication between the
> A9 cores and the ECME happens over the pl320 IPC channel.
> 
> Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
> Cc: shawn.guo@linaro.org

Reviewed-by: Shawn Guo <shawn.guo@linaro.org>

^ permalink raw reply

* Re: Race condition between driver_probe_device and device_shutdown
From: Ming Lei @ 2012-12-06 10:52 UTC (permalink / raw)
  To: Wedson Almeida Filho
  Cc: Alan Stern, Greg KH, Eric W. Biederman, Andrew Morton,
	linux-kernel, Linux PM List
In-Reply-To: <CANeycqo8sFVeWVEeW3mdPPVRt35+kfkdUzenLB+YgNs4y-T7Cg@mail.gmail.com>

On Thu, Dec 6, 2012 at 5:11 PM, Wedson Almeida Filho <wedsonaf@gmail.com> wrote:
> [Sorry for taking so long to respond, after a week of silence I
> assumed I wouldn't get any responses, plus I had moved on to other
> things.]
>
> I happen to still have the logs, the relevant part is pasted at the end.
>
> Answering some of the questions: the driver is on the platform bus, in
> fact, it's drivers/usb/host/ehci-tegra.c; after seeing the oops below,
> I added printks when entering and exiting tegra_ehci_probe to try to
> understand it better.
>
> Note that in the log we see some thread entering tegra_ehci_probe, but
> nothing indicates that it has exited before we get the oops.

The commit d1c6c030fcec6f860d9bb6c632a3ebe62e28440b(driver core:
fix shutdown races with probe/remove(v3)) has been merged to v3.6, so
device_shutdown will wait for completing of probe.

Could you test kernel 3.6 or the latest kernel to see if the problem can be
reproduced?

Thanks,
--
Ming Lei

^ permalink raw reply

* Re: Unreliable USB3 with NEC uPD720200 and Delock Cardreader
From: Rafael J. Wysocki @ 2012-12-06 12:43 UTC (permalink / raw)
  To: Huang Ying
  Cc: Sarah Sharp, Bjørn Mork, Ulrich Eckhardt, Andrew Lutomirski,
	linux-usb@vger.kernel.org, Alan Stern, Ming Lei, Bjorn Helgaas,
	linux-pci, Linux PM list
In-Reply-To: <1354778234.7216.9.camel@yhuang-dev>

On Thursday, December 06, 2012 03:17:14 PM Huang Ying wrote:
> On Wed, 2012-12-05 at 16:33 -0800, Sarah Sharp wrote:
> > On Wed, Nov 28, 2012 at 02:54:06PM -0800, Sarah Sharp wrote:
> > > On Mon, Nov 26, 2012 at 10:48:03PM +0100, Bjørn Mork wrote:
> > > > Sarah Sharp <sarah.a.sharp@linux.intel.com> writes:
> > > > 
> > > > > It looks like both Ulrich and Andrew have the same issue.  I also have a
> > > > > Lenovo x220, and I confirmed that when I turn on PCI runtime suspend,
> > > > > the NEC host controller does not report port status changes when a new
> > > > > USB device is plugged in.
> > > > >
> > > > > I'm running 3.6.7, and I'm pretty sure that runtime suspend worked for
> > > > > the NEC host on some older kernel.  I don't think the NEC host went into
> > > > > D3cold on that kernel, though.  Is there a way to disable D3cold and
> > > > > just use D3hot instead?
> > > > 
> > > > Yes, you have /sys/bus/pci/devices/.../d3cold_allowed
> > > > See Documentation/ABI/testing/sysfs-bus-pci
> > > > 
> > > > If this really is a problem with the D3cold support that went into 3.6
> > > > then I guess you should include Huang Ying in the discussions as well
> > > > (CCed).
> > > 
> > > Turning off D3 cold didn't help.  Once the PCI device is suspended,
> > > connect events do not generate an interrupt.
> > > 
> > > I'll go see if I can figure out which kernel this worked on and bisect.
> > 
> > Wakeup from D3 works fine on the 3.5.0 kernel, but fails on 3.6.2.  I
> > haven't fully bisected yet.
> > 
> > In debugging, I found that if you only enable runtime suspend for the
> > NEC host controller, the host successfully comes out of D3 when you plug
> > in a USB device.  However, if you enable runtime PM for the parent PCIe root
> > port, it stops working.  Disabling D3cold for both devices did not help.
> > 
> > It looks like a PCI issue, so what sort of debugging info do you need
> > from me?
> 
> I do some debug with my NEC uPD720200, the patch is attached.  I found
> that NEC uPD720200 can not generate PME interrupt, so it can be remote
> waken up only via polling.  But if the PCIe bridge connected to NEC
> uPD720200 goes into suspended, we can not poll it.  Maybe we need a way
> to disable PCIe bridge suspended if the underlying device can only be
> waken up via polling.

I think we need to be more intelligent about that.  It looks like we can
only suspend a port (or bridge in general) if all devices below it have
the pme_poll flag unset.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH 1/5] Device Power: introduce power controller
From: Ming Lei @ 2012-12-06 13:18 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Roger Quadros, Andy Green, Alan Stern, Greg Kroah-Hartman,
	Lan Tianyu, Sarah Sharp, Rafael J. Wysocki, linux-pm,
	Oliver Neukum, linux-omap, linux-usb, Felipe Balbi
In-Reply-To: <CAJe_ZhcUV7mYaNSEw1kLe8-Ym6RcVEab-1-ZZMT20xsWAb=szQ@mail.gmail.com>

On Thu, Dec 6, 2012 at 11:46 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> The patch can make usb port deal with the 'power controller' only, and make it
>> avoid to deal with regulators/clocks/... directly.
>>
> I am curious too, except for clocks and voltage supplies (regulators),
> what other external resources does a chip need ?

For example, one indicator LED which doesn't connect to the same power
domain might need to be triggered after the power switch state is changed.

Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [RFC PATCH 1/5] Device Power: introduce power controller
From: Jassi Brar @ 2012-12-06 14:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Roger Quadros, Andy Green, Alan Stern, Greg Kroah-Hartman,
	Lan Tianyu, Sarah Sharp, Rafael J. Wysocki,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, Oliver Neukum,
	linux-omap-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, Felipe Balbi
In-Reply-To: <CACVXFVMKYAANsNJKBZ90ThaJ7KNOTzpyvARGnNcHsVVczxyO4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 6 December 2012 18:48, Ming Lei <tom.leiming-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, Dec 6, 2012 at 11:46 AM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>> The patch can make usb port deal with the 'power controller' only, and make it
>>> avoid to deal with regulators/clocks/... directly.
>>>
>> I am curious too, except for clocks and voltage supplies (regulators),
>> what other external resources does a chip need ?
>
> For example, one indicator LED which doesn't connect to the same power
> domain might need to be triggered after the power switch state is changed.
>
Hmm.. I am not sure if we could call an indicator LED a resource that
a chip needs to be functional. Isn't how mmc core manages the
indicator led, a better way?

cheers.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Alan Stern @ 2012-12-06 15:25 UTC (permalink / raw)
  To: Andy Green
  Cc: Rafael J. Wysocki, Ming Lei, Linux-pm mailing list, linux-omap,
	USB list
In-Reply-To: <50BFE159.1010009@linaro.org>

On Thu, 6 Dec 2012, Andy Green wrote:

> > Right.  The question is should it be done in this somewhat roundabout
> > way (you've got two separate assets for setting up this one thing), or
> > should the USB port code be responsible for explicitly checking if
> > there are any applicable assets when the port is created?
> >
> > The advantange of the second approach is that it doesn't involve
> > checking all the ancestors every time a new device is added (and it
> > involves only one asset).  The disadvantage is that it works only for
> > USB ports.
> 
> It's done in two steps to strongly filter candidate child devices 
> against being children of a known platform device.  If you go around 
> that, you will be back to full device path matching with wildcards at 
> the USB child to determine if he is "the one".  So that's a feature not 
> an issue I think.

I don't follow.  Suppose an asset is attached to ehci-omap.0 with its
string set to "-0:1.0/port1" and the other members pointing to the
regulator, clock and so on.  And suppose the USB port code, when
creating a new port, checks for a name match against all the assets
attached to its bus's host controller device structure.  That should
do exactly what you want while using only one asset.

> I can see doing it generically or not is equally a political issue as a 
> technical one, but I think if we bother to add this kind of support we 
> should prefer to do it generally.  It's going to be about the same 
> amount of code.

Not quite.  If you do it generally then you have to insert hooks at all 
the places where a subsystem _might_ need it.  If you do it 
specifically then no hooks are needed at all -- just a match call at 
the right place in the subsystem that needs it.

> As Tony Lindgren said, even board-omap4panda.c itself is deprecated, to 
> project platform info into USB stack you either have to add DT code into 
> usb stack then to go find things directly, or provide a generic 
> methodology like assets which have the dt bindings done outside of any 
> subsystem and apply their operations outside the subsystem too.

Assuming DT can be extended to support assets, adding one asset will be 
simpler than adding two.

> > To answer the question, we need to know how other subsystems might
> > want to use the asset-matching approach.  My guess is that only a small
> > number of device types would care about assets (in the same way that
> > assets matter to USB ports but not to other USB device types).  And it
> > might not be necessary to check against every ancestor; we might know
> > beforehand where to check (just as the USB port would know to look for
> > an asset attached to the host controller device).
> 
> Yes I think it boils down to "buses" in general can benefit from this. 
> They're the thing that dynamically - later - create child devices you 
> might need to target with what was ultimately platform information.
> 
> On Panda the other bus thing that can benefit is the WLAN module on 
> SDIO.  In fact that's a very illuminating case for your question above. 
>   Faced with exactly the same problem, that they needed to project 
> platform info on to SDIO-probed device instance to tell it what clock 
> rate it had been given, they invented an api which you can see today in 
> board-omap4panda.c and other boards there, wl12xx_set_platform_data(). 
> This is from board-4430sdp:
> 
> static struct wl12xx_platform_data omap4_sdp4430_wlan_data __initdata = {
>          .board_ref_clock = WL12XX_REFCLOCK_26,
>          .board_tcxo_clock = WL12XX_TCXOCLOCK_26,
> };
> 
> ...
> 
> 	ret = wl12xx_set_platform_data(&omap4_sdp4430_wlan_data);
> 
> You can find the other end of it here in 
> drivers/net/wireless/ti/wlcore/wl12xx_platform_data.c
> 
> static struct wl12xx_platform_data *platform_data;
> 
> int __init wl12xx_set_platform_data(const struct wl12xx_platform_data *data)
> {
> 	if (platform_data)
> 		return -EBUSY;
> 	if (!data)
> 		return -EINVAL;
> 
> 	platform_data = kmemdup(data, sizeof(*data), GFP_KERNEL);
> 	if (!platform_data)
> 		return -ENOMEM;
> 
> 	return 0;
> }
> 
> when the driver for the device instance wants to "get" its platform data 
> it reads the static pointer via another api.  Obviously if you want two 
> modules on your platform that's not so good.

Also it isn't type-safe, although that's probably not a big concern.

> I doubt that's the only SDIO device that wants to know platform info. 
> So I think by providing a generic way we help other buses and devices.

I agree, assets look like a good way to improve this.  In fact, to some
extent assets appear to be a generalization or extension of platform
data.

Alan Stern


^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Oleg Nesterov @ 2012-12-06 16:18 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tj, tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BFAB17.3090603@linux.vnet.ibm.com>

On 12/06, Srivatsa S. Bhat wrote:
>
> +void get_online_cpus_atomic(void)
> +{
> +	int c, old;
> +
> +	preempt_disable();
> +	read_lock(&hotplug_rwlock);

Confused... Why it also takes hotplug_rwlock?

> +
> +	for (;;) {
> +		c = atomic_read(&__get_cpu_var(atomic_reader_refcount));
> +		if (unlikely(writer_active(c))) {
> +			cpu_relax();
> +			continue;
> +		}
> +
> +		old = atomic_cmpxchg(&__get_cpu_var(atomic_reader_refcount),
> +				     c, c + 1);
> +
> +		if (likely(old == c))
> +			break;
> +
> +		c = old;
> +	}
> +}

	while (!atomic_inc_unless_negative(...))
		cpu_relax();
	
and atomic_dec_unless_positive() in disable_atomic_reader().


Obviously you can't use get_online_cpus_atomic() under rq->lock or
task->pi_lock or any other lock CPU_DYING can take. Probably this is
fine, but perhaps it makes sense to add the lockdep annotations.

Oleg.


^ permalink raw reply

* [k.org requests #104] Add 2 categories to linux-pm bugzilla
From: Konstantin Ryabitsev via RT @ 2012-12-06 17:51 UTC (permalink / raw)
  To: lenb; +Cc: linux-pm
In-Reply-To: <rt-3.8.13-18279-1353984226-332.104-6-0@kernel.org>

On Tue Nov 27 02:43:46 2012, lenb@kernel.org wrote:
> Konstantin,
> Please sent the default assignees as below.
> They may not be perfect, but they are better than totally broken:-)

All requested changes have been made. Sorry for the delay.

Best,
-- 
Konstantin Ryabitsev
Systems Administrator
Linux Foundation, kernel.org
Montréal, Québec

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-06 18:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tj, tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121206161850.GA6710@redhat.com>

On 12/06/2012 09:48 PM, Oleg Nesterov wrote:
> On 12/06, Srivatsa S. Bhat wrote:
>>
>> +void get_online_cpus_atomic(void)
>> +{
>> +	int c, old;
>> +
>> +	preempt_disable();
>> +	read_lock(&hotplug_rwlock);
> 
> Confused... Why it also takes hotplug_rwlock?

To avoid ABBA deadlocks.

hotplug_rwlock was meant for the "light" readers.
The atomic counters were meant for the "heavy/full" readers.
I wanted them to be able to nest in any manner they wanted,
such as:

Full inside light:

get_online_cpus_atomic_light()
	...
	get_online_cpus_atomic_full()
	...
	put_online_cpus_atomic_full()
	...
put_online_cpus_atomic_light()

Or, light inside full:

get_online_cpus_atomic_full()
	...
	get_online_cpus_atomic_light()
	...
	put_online_cpus_atomic_light()
	...
put_online_cpus_atomic_full()

To allow this, I made the two sets of APIs take the locks
in the same order internally.

(I had some more description of this logic in the changelog
of 2/10; the only difference there is that instead of atomic
counters, I used rwlocks for the full-readers as well.
https://lkml.org/lkml/2012/12/5/320)

> 
>> +
>> +	for (;;) {
>> +		c = atomic_read(&__get_cpu_var(atomic_reader_refcount));
>> +		if (unlikely(writer_active(c))) {
>> +			cpu_relax();
>> +			continue;
>> +		}
>> +
>> +		old = atomic_cmpxchg(&__get_cpu_var(atomic_reader_refcount),
>> +				     c, c + 1);
>> +
>> +		if (likely(old == c))
>> +			break;
>> +
>> +		c = old;
>> +	}
>> +}
> 
> 	while (!atomic_inc_unless_negative(...))
> 		cpu_relax();
> 	
> and atomic_dec_unless_positive() in disable_atomic_reader().
> 

Ah, great! I was searching for them while writing the code, but somehow
overlooked them and rolled out my own. ;-)

> 
> Obviously you can't use get_online_cpus_atomic() under rq->lock or
> task->pi_lock or any other lock CPU_DYING can take. Probably this is
> fine, but perhaps it makes sense to add the lockdep annotations.
> 

Hmm, you are right. We can't use _atomic() in the CPU_DYING path.
So how about altering it to _allow_ that, instead of teaching lockdep
that we don't allow it? I mean, just like how the existing
get_online_cpus() allows such calls in the writer?

(I haven't thought it through; just thinking aloud...)

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-06 19:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tj, tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50C0E88E.9050909@linux.vnet.ibm.com>

On 12/07/2012 12:18 AM, Srivatsa S. Bhat wrote:
> On 12/06/2012 09:48 PM, Oleg Nesterov wrote:
>> On 12/06, Srivatsa S. Bhat wrote:
>>>
>>> +void get_online_cpus_atomic(void)
>>> +{
>>> +	int c, old;
>>> +
>>> +	preempt_disable();
>>> +	read_lock(&hotplug_rwlock);
>>
>> Confused... Why it also takes hotplug_rwlock?
> 
> To avoid ABBA deadlocks.
> 
> hotplug_rwlock was meant for the "light" readers.
> The atomic counters were meant for the "heavy/full" readers.
> I wanted them to be able to nest in any manner they wanted,
> such as:
> 
> Full inside light:
> 
> get_online_cpus_atomic_light()
> 	...
> 	get_online_cpus_atomic_full()
> 	...
> 	put_online_cpus_atomic_full()
> 	...
> put_online_cpus_atomic_light()
> 
> Or, light inside full:
> 
> get_online_cpus_atomic_full()
> 	...
> 	get_online_cpus_atomic_light()
> 	...
> 	put_online_cpus_atomic_light()
> 	...
> put_online_cpus_atomic_full()
> 
> To allow this, I made the two sets of APIs take the locks
> in the same order internally.
> 
> (I had some more description of this logic in the changelog
> of 2/10; the only difference there is that instead of atomic
> counters, I used rwlocks for the full-readers as well.
> https://lkml.org/lkml/2012/12/5/320)
> 

One of the reasons why I changed everybody to global rwlocks
instead of per-cpu atomic counters was to avoid lock ordering
related deadlocks associated with per-cpu locking.

Eg:

CPU 0                         CPU 1
------                        ------

1. Acquire lock A             Increment CPU1's
                              atomic counter



2. Increment CPU0's            Try to acquire lock A
   atomic counter


Now consider what happens if a hotplug writer (cpu_down) begins,
and starts looking at CPU0, to try to decrement its atomic counter,
in between steps 1 and 2.

The hotplug writer will be successful in CPU0 because CPU0 hasn't
yet incremented its counter. So, now CPU0 spins waiting for the
hotplug writer to reset the atomic counter again.

When the hotplug writer looks at CPU1, it can't decrement the
atomic counter because CPU1 has already incremented it. So the
hotplug writer waits. CPU1 goes ahead, only to start spinning on
lock A which was acquired by CPU0. So we end up in a deadlock due
to circular locking dependency between the 3 entities.

One way to deal with this would be that the writer should abort
its loop of trying to atomic_dec per-cpu counters, whenever it has
to wait. But that might prove to be too wasteful (and overly
paranoid) in practice.

So, instead, if we have global locks (like global rwlocks that I
finally used when posting out this v2), then we won't end up in
such messy issues.

And why exactly was I concerned about all this?
Because, the current code uses the extremely flexible preempt_disable()
/preempt_enable() pair which impose absolutely no ordering
restrictions. And probably the existing code _depends_ on that.
So if our new APIs that replace preempt_disable/enable start
imposing ordering restrictions, I feared that they might become
unusable. Hence I used global rwlocks, thereby trading cache-line
bouncing (due to global rwlocks) for lock-safety and flexibility.

Regards,
Srivatsa S. Bhat


^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Steven Rostedt @ 2012-12-06 19:28 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Oleg Nesterov, tj, tglx, peterz, paulmck, rusty, mingo, akpm,
	namhyung, vincent.guittot, sbw, amit.kucheria, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50C0E88E.9050909@linux.vnet.ibm.com>

On Fri, 2012-12-07 at 00:18 +0530, Srivatsa S. Bhat wrote:
> On 12/06/2012 09:48 PM, Oleg Nesterov wrote:
> > On 12/06, Srivatsa S. Bhat wrote:
> >>
> >> +void get_online_cpus_atomic(void)
> >> +{
> >> +	int c, old;
> >> +
> >> +	preempt_disable();
> >> +	read_lock(&hotplug_rwlock);
> > 
> > Confused... Why it also takes hotplug_rwlock?
> 
> To avoid ABBA deadlocks.
> 
> hotplug_rwlock was meant for the "light" readers.
> The atomic counters were meant for the "heavy/full" readers.
> I wanted them to be able to nest in any manner they wanted,
> such as:
> 
> Full inside light:
> 
> get_online_cpus_atomic_light()
> 	...
> 	get_online_cpus_atomic_full()
> 	...
> 	put_online_cpus_atomic_full()
> 	...
> put_online_cpus_atomic_light()
> 
> Or, light inside full:
> 
> get_online_cpus_atomic_full()
> 	...
> 	get_online_cpus_atomic_light()
> 	...
> 	put_online_cpus_atomic_light()
> 	...
> put_online_cpus_atomic_full()
> 
> To allow this, I made the two sets of APIs take the locks
> in the same order internally.
> 
> (I had some more description of this logic in the changelog
> of 2/10; the only difference there is that instead of atomic
> counters, I used rwlocks for the full-readers as well.
> https://lkml.org/lkml/2012/12/5/320)
> 

You know reader locks can deadlock with each other, right? And this
isn't caught be lockdep yet. This is because rwlocks have been made to
be fair with writers. Before writers could be starved if a CPU always
let a reader in. Now if a writer is waiting, a reader will block behind
the writer. But this has introduced new issues with the kernel as
follows:


   CPU0			   CPU1	 	   CPU2		   CPU3
   ----			   ----		   ----		   ----
read_lock(A);
			read_lock(B)
					write_lock(A) <- block
							write_lock(B) <- block
read_lock(B) <-block

			read_lock(A) <- block

DEADLOCK!

-- Steve

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-06 19:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, tj, tglx, peterz, paulmck, rusty, mingo, akpm,
	namhyung, vincent.guittot, sbw, amit.kucheria, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <1354822103.17101.24.camel@gandalf.local.home>

On 12/07/2012 12:58 AM, Steven Rostedt wrote:
> On Fri, 2012-12-07 at 00:18 +0530, Srivatsa S. Bhat wrote:
>> On 12/06/2012 09:48 PM, Oleg Nesterov wrote:
>>> On 12/06, Srivatsa S. Bhat wrote:
>>>>
>>>> +void get_online_cpus_atomic(void)
>>>> +{
>>>> +	int c, old;
>>>> +
>>>> +	preempt_disable();
>>>> +	read_lock(&hotplug_rwlock);
>>>
>>> Confused... Why it also takes hotplug_rwlock?
>>
>> To avoid ABBA deadlocks.
>>
>> hotplug_rwlock was meant for the "light" readers.
>> The atomic counters were meant for the "heavy/full" readers.
>> I wanted them to be able to nest in any manner they wanted,
>> such as:
>>
>> Full inside light:
>>
>> get_online_cpus_atomic_light()
>> 	...
>> 	get_online_cpus_atomic_full()
>> 	...
>> 	put_online_cpus_atomic_full()
>> 	...
>> put_online_cpus_atomic_light()
>>
>> Or, light inside full:
>>
>> get_online_cpus_atomic_full()
>> 	...
>> 	get_online_cpus_atomic_light()
>> 	...
>> 	put_online_cpus_atomic_light()
>> 	...
>> put_online_cpus_atomic_full()
>>
>> To allow this, I made the two sets of APIs take the locks
>> in the same order internally.
>>
>> (I had some more description of this logic in the changelog
>> of 2/10; the only difference there is that instead of atomic
>> counters, I used rwlocks for the full-readers as well.
>> https://lkml.org/lkml/2012/12/5/320)
>>
> 
> You know reader locks can deadlock with each other, right? And this
> isn't caught be lockdep yet. This is because rwlocks have been made to
> be fair with writers. Before writers could be starved if a CPU always
> let a reader in. Now if a writer is waiting, a reader will block behind
> the writer. But this has introduced new issues with the kernel as
> follows:
> 
> 
>    CPU0			   CPU1	 	   CPU2		   CPU3
>    ----			   ----		   ----		   ----
> read_lock(A);
> 			read_lock(B)
> 					write_lock(A) <- block
> 							write_lock(B) <- block
> read_lock(B) <-block
> 
> 			read_lock(A) <- block
> 
> DEADLOCK!
> 

The root-cause of this deadlock is again lock-ordering mismatch right?
CPU0 takes locks in order A, B
CPU1 takes locks in order B, A

And the writer facilitates in actually getting deadlocked.

I avoid this in this patchset by always taking the locks in the same
order. So we won't be deadlocking like this.

Regards,
Srivatsa S. Bhat



^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Steven Rostedt @ 2012-12-06 22:02 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Oleg Nesterov, tj, tglx, peterz, paulmck, rusty, mingo, akpm,
	namhyung, vincent.guittot, sbw, amit.kucheria, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50C0F3CA.7070105@linux.vnet.ibm.com>

On Fri, 2012-12-07 at 01:06 +0530, Srivatsa S. Bhat wrote:
> On 12/07/2012 12:58 AM, Steven Rostedt wrote:
> > On Fri, 2012-12-07 at 00:18 +0530, Srivatsa S. Bhat wrote:
> >> On 12/06/2012 09:48 PM, Oleg Nesterov wrote:
> >>> On 12/06, Srivatsa S. Bhat wrote:
> >>>>
> >>>> +void get_online_cpus_atomic(void)
> >>>> +{
> >>>> +	int c, old;
> >>>> +
> >>>> +	preempt_disable();
> >>>> +	read_lock(&hotplug_rwlock);
> >>>
> >>> Confused... Why it also takes hotplug_rwlock?
> >>
> >> To avoid ABBA deadlocks.
> >>
> >> hotplug_rwlock was meant for the "light" readers.
> >> The atomic counters were meant for the "heavy/full" readers.
> >> I wanted them to be able to nest in any manner they wanted,
> >> such as:
> >>
> >> Full inside light:
> >>
> >> get_online_cpus_atomic_light()
> >> 	...
> >> 	get_online_cpus_atomic_full()
> >> 	...
> >> 	put_online_cpus_atomic_full()
> >> 	...
> >> put_online_cpus_atomic_light()
> >>
> >> Or, light inside full:
> >>
> >> get_online_cpus_atomic_full()
> >> 	...
> >> 	get_online_cpus_atomic_light()
> >> 	...
> >> 	put_online_cpus_atomic_light()
> >> 	...
> >> put_online_cpus_atomic_full()
> >>
> >> 



> The root-cause of this deadlock is again lock-ordering mismatch right?
> CPU0 takes locks in order A, B
> CPU1 takes locks in order B, A
> 
> And the writer facilitates in actually getting deadlocked.
> 
> I avoid this in this patchset by always taking the locks in the same
> order. So we won't be deadlocking like this.

OK, I haven't looked closely at the patch yet. I'm currently hacking on
my own problems. But just from the description above, it looked like you
were using rw_locks() to be able to inverse the order of the locks.

-- Steve



^ permalink raw reply

* [PATCH 0/6 v9] cpufreq: add support for Calxeda ECX-1000 (highbank)
From: Mark Langsdorf @ 2012-12-06 22:42 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
In-Reply-To: <1351631056-25938-1-git-send-email-mark.langsdorf@calxeda.com>

This patch series adds cpufreq support for the Calxeda
ECX-1000 (highbank) SoCs. The EnergyCore Management Engine (ECME) on
the ECX-1000 manages the voltage for the part and communications with
Linux through a pl320 mailbox. clk notifications are used to control
when to send messages to the ECME.

--Mark Langsdorf
Calxeda, Inc.


^ permalink raw reply

* [PATCH 3/6 v9] cpufreq: tolerate inexact values when collecting stats
From: Mark Langsdorf @ 2012-12-06 22:42 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel; +Cc: Mark Langsdorf
In-Reply-To: <1354833773-22845-1-git-send-email-mark.langsdorf@calxeda.com>

This patch is withdrawn due to a need for severe rework.

Changes from v4
        Withdrawn.      
Changes from v3, v2
        None.
Changes from v1
        Implemented a simple round-up algorithm instead of the over/under
method that could cause errors on Intel processors with boost mode.


^ permalink raw reply

* [PATCH 4/6 v9] arm highbank: add support for pl320 IPC
From: Mark Langsdorf @ 2012-12-06 22:42 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
  Cc: Mark Langsdorf, Rob Herring, Omar Ramirez Luna, Arnd Bergmann
In-Reply-To: <1354833773-22845-1-git-send-email-mark.langsdorf@calxeda.com>

From: Rob Herring <rob.herring@calxeda.com>

The pl320 IPC allows for interprocessor communication between the highbank A9
and the EnergyCore Management Engine. The pl320 implements a straightforward
mailbox protocol.

This patch depends on Omar Ramirez Luna's <omar.luna@linaro.org>
mailbox driver patch series.

Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Omar Ramirez Luna <omar.luna@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
---
Changes from v6, v7, v8
        None.
Changes from v5
        Renamed ipc_transmit() to pl320_ipc_transmit().
        Properly exported pl320_ipc_{un}register_notifier().
Changes from v4
        Moved pl320-ipc.c from arch/arm/mach-highbank to drivers/mailbox.
        Moved header information to include/linux/mailbox.h.
        Added Kconfig options to reflect the new code location.
        Change drivers/mailbox/Makefile to build the omap mailboxes only 
        when they are configured.
        Removed ipc_call_fast and renamed ipc_call_slow ipc_transmit.
Changes from v3, v2
        None.
Changes from v1
        Removed erroneous changes for cpufreq Kconfig.

 arch/arm/mach-highbank/Kconfig |   2 +
 drivers/mailbox/Kconfig        |   9 ++
 drivers/mailbox/Makefile       |   4 +
 drivers/mailbox/pl320-ipc.c    | 199 +++++++++++++++++++++++++++++++++++++++++
 include/linux/mailbox.h        |  19 +++-
 5 files changed, 232 insertions(+), 1 deletion(-)
 create mode 100644 drivers/mailbox/Makefile
 create mode 100644 drivers/mailbox/pl320-ipc.c

diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig
index 0e1d0a4..2896881 100644
--- a/arch/arm/mach-highbank/Kconfig
+++ b/arch/arm/mach-highbank/Kconfig
@@ -11,5 +11,7 @@ config ARCH_HIGHBANK
 	select GENERIC_CLOCKEVENTS
 	select HAVE_ARM_SCU
 	select HAVE_SMP
+	select MAILBOX
+	select PL320_MBOX
 	select SPARSE_IRQ
 	select USE_OF
diff --git a/drivers/mailbox/Kconfig b/drivers/mailbox/Kconfig
index be8cac0..e89fdb4 100644
--- a/drivers/mailbox/Kconfig
+++ b/drivers/mailbox/Kconfig
@@ -34,4 +34,13 @@ config OMAP_MBOX_KFIFO_SIZE
 	  This can also be changed at runtime (via the mbox_kfifo_size
 	  module parameter).
 
+config PL320_MBOX
+	bool "ARM PL320 Mailbox"
+	help
+	  An implementation of the ARM PL320 Interprocessor Communication
+	  Mailbox (IPCM), tailored for the Calxeda Highbank. It is used to
+	  send short messages between Highbank's A9 cores and the EnergyCore
+	  Management Engine, primarily for cpufreq. Say Y here if you want
+	  to use the PL320 IPCM support.
+
 endif
diff --git a/drivers/mailbox/Makefile b/drivers/mailbox/Makefile
new file mode 100644
index 0000000..c9f14c3
--- /dev/null
+++ b/drivers/mailbox/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_OMAP1_MBOX)	+= mailbox.o mailbox-omap1.o
+obj-$(CONFIG_OMAP2PLUS_MBOX)	+= mailbox.o mailbox-omap2.o
+obj-$(CONFIG_PL320_MBOX)	+= pl320-ipc.o
+
diff --git a/drivers/mailbox/pl320-ipc.c b/drivers/mailbox/pl320-ipc.c
new file mode 100644
index 0000000..1a9d8e4
--- /dev/null
+++ b/drivers/mailbox/pl320-ipc.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright 2012 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#include <linux/types.h>
+#include <linux/err.h>
+#include <linux/delay.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/interrupt.h>
+#include <linux/completion.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/spinlock.h>
+#include <linux/device.h>
+#include <linux/amba/bus.h>
+
+#include <linux/mailbox.h>
+
+#define IPCMxSOURCE(m)		((m) * 0x40)
+#define IPCMxDSET(m)		(((m) * 0x40) + 0x004)
+#define IPCMxDCLEAR(m)		(((m) * 0x40) + 0x008)
+#define IPCMxDSTATUS(m)		(((m) * 0x40) + 0x00C)
+#define IPCMxMODE(m)		(((m) * 0x40) + 0x010)
+#define IPCMxMSET(m)		(((m) * 0x40) + 0x014)
+#define IPCMxMCLEAR(m)		(((m) * 0x40) + 0x018)
+#define IPCMxMSTATUS(m)		(((m) * 0x40) + 0x01C)
+#define IPCMxSEND(m)		(((m) * 0x40) + 0x020)
+#define IPCMxDR(m, dr)		(((m) * 0x40) + ((dr) * 4) + 0x024)
+
+#define IPCMMIS(irq)		(((irq) * 8) + 0x800)
+#define IPCMRIS(irq)		(((irq) * 8) + 0x804)
+
+#define MBOX_MASK(n)		(1 << (n))
+#define IPC_TX_MBOX		1
+#define IPC_RX_MBOX		2
+
+#define CHAN_MASK(n)		(1 << (n))
+#define A9_SOURCE		1
+#define M3_SOURCE		0
+
+static void __iomem *ipc_base;
+static int ipc_irq;
+static DEFINE_MUTEX(ipc_m1_lock);
+static DECLARE_COMPLETION(ipc_completion);
+static ATOMIC_NOTIFIER_HEAD(ipc_notifier);
+
+static inline void set_destination(int source, int mbox)
+{
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxDSET(mbox));
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxMSET(mbox));
+}
+
+static inline void clear_destination(int source, int mbox)
+{
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxDCLEAR(mbox));
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxMCLEAR(mbox));
+}
+
+static void __ipc_send(int mbox, u32 *data)
+{
+	int i;
+	for (i = 0; i < 7; i++)
+		__raw_writel(data[i], ipc_base + IPCMxDR(mbox, i));
+	__raw_writel(0x1, ipc_base + IPCMxSEND(mbox));
+}
+
+static u32 __ipc_rcv(int mbox, u32 *data)
+{
+	int i;
+	for (i = 0; i < 7; i++)
+		data[i] = __raw_readl(ipc_base + IPCMxDR(mbox, i));
+	return data[1];
+}
+
+/* blocking implmentation from the A9 side, not usuable in interrupts! */
+int pl320_ipc_transmit(u32 *data)
+{
+	int ret;
+
+	mutex_lock(&ipc_m1_lock);
+
+	init_completion(&ipc_completion);
+	__ipc_send(IPC_TX_MBOX, data);
+	ret = wait_for_completion_timeout(&ipc_completion,
+					  msecs_to_jiffies(1000));
+	if (ret == 0) {
+		ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	ret = __ipc_rcv(IPC_TX_MBOX, data);
+out:
+	mutex_unlock(&ipc_m1_lock);
+	return ret;
+}
+EXPORT_SYMBOL(pl320_ipc_transmit);
+
+irqreturn_t ipc_handler(int irq, void *dev)
+{
+	u32 irq_stat;
+	u32 data[7];
+
+	irq_stat = __raw_readl(ipc_base + IPCMMIS(1));
+	if (irq_stat & MBOX_MASK(IPC_TX_MBOX)) {
+		__raw_writel(0, ipc_base + IPCMxSEND(IPC_TX_MBOX));
+		complete(&ipc_completion);
+	}
+	if (irq_stat & MBOX_MASK(IPC_RX_MBOX)) {
+		__ipc_rcv(IPC_RX_MBOX, data);
+		atomic_notifier_call_chain(&ipc_notifier, data[0], data + 1);
+		__raw_writel(2, ipc_base + IPCMxSEND(IPC_RX_MBOX));
+	}
+
+	return IRQ_HANDLED;
+}
+
+int pl320_ipc_register_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&ipc_notifier, nb);
+}
+EXPORT_SYMBOL(pl320_ipc_register_notifier);
+
+int pl320_ipc_unregister_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&ipc_notifier, nb);
+}
+EXPORT_SYMBOL(pl320_ipc_unregister_notifier);
+
+static int __devinit pl320_probe(struct amba_device *adev,
+				const struct amba_id *id)
+{
+	int ret;
+
+	ipc_base = ioremap(adev->res.start, resource_size(&adev->res));
+	if (ipc_base == NULL)
+		return -ENOMEM;
+
+	__raw_writel(0, ipc_base + IPCMxSEND(IPC_TX_MBOX));
+
+	ipc_irq = adev->irq[0];
+	ret = request_irq(ipc_irq, ipc_handler, 0, dev_name(&adev->dev), NULL);
+	if (ret < 0)
+		goto err;
+
+	/* Init slow mailbox */
+	__raw_writel(CHAN_MASK(A9_SOURCE),
+			ipc_base + IPCMxSOURCE(IPC_TX_MBOX));
+	__raw_writel(CHAN_MASK(M3_SOURCE),
+			ipc_base + IPCMxDSET(IPC_TX_MBOX));
+	__raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
+		     ipc_base + IPCMxMSET(IPC_TX_MBOX));
+
+	/* Init receive mailbox */
+	__raw_writel(CHAN_MASK(M3_SOURCE),
+			ipc_base + IPCMxSOURCE(IPC_RX_MBOX));
+	__raw_writel(CHAN_MASK(A9_SOURCE),
+			ipc_base + IPCMxDSET(IPC_RX_MBOX));
+	__raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
+		     ipc_base + IPCMxMSET(IPC_RX_MBOX));
+
+	return 0;
+err:
+	iounmap(ipc_base);
+	return ret;
+}
+
+static struct amba_id pl320_ids[] = {
+	{
+		.id	= 0x00041320,
+		.mask	= 0x000fffff,
+	},
+	{ 0, 0 },
+};
+
+static struct amba_driver pl320_driver = {
+	.drv = {
+		.name	= "pl320",
+	},
+	.id_table	= pl320_ids,
+	.probe		= pl320_probe,
+};
+
+static int __init ipc_init(void)
+{
+	return amba_driver_register(&pl320_driver);
+}
+module_init(ipc_init);
diff --git a/include/linux/mailbox.h b/include/linux/mailbox.h
index e8e4131..e7829e5 100644
--- a/include/linux/mailbox.h
+++ b/include/linux/mailbox.h
@@ -1,4 +1,16 @@
-/* mailbox.h */
+/*
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
 
 typedef u32 mbox_msg_t;
 struct omap_mbox;
@@ -20,3 +32,8 @@ void omap_mbox_save_ctx(struct omap_mbox *mbox);
 void omap_mbox_restore_ctx(struct omap_mbox *mbox);
 void omap_mbox_enable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq);
 void omap_mbox_disable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq);
+
+int pl320_ipc_transmit(u32 *data);
+int pl320_ipc_register_notifier(struct notifier_block *nb);
+int pl320_ipc_unregister_notifier(struct notifier_block *nb);
+
-- 
1.7.11.7


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox