[REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor

* [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor
@ 2013-12-27  3:21 Bradley Baetz
  2013-12-27  3:22 ` Bradley Baetz
  2014-01-06 17:13 ` Johannes Weiner
  0 siblings, 2 replies; 6+ messages in thread
From: Bradley Baetz @ 2013-12-27  3:21 UTC (permalink / raw)
  To: platform-driver-x86, linux-mm, hannes; +Cc: hdegoede

Hi,

I have a Dell laptop (Vostro 3560). When I boot Fedora 20 with the
acpi_backlight=vendor option, the kernel locks up hard during the boot
process, when systemd runs udevadm trigger. This is a hard lockup -
magic-sysrq doesn't work, and neither does caps lock/vt-change/etc.

I've bisected this to:

commit 81c0a2bb515fd4daae8cab64352877480792b515
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed Sep 11 14:20:47 2013 -0700

    mm: page_alloc: fair zone allocator policy

which seemed really unrelated, but I've confirmed that:

 - the commit before this patch doesn't cause the problem, and the commit
afterwards does
 - reverting that patch from 3.12.0 fixes the problem
 - reverting that patch (and the partial revert
fff4068cba484e6b0abe334ed6b15d5a215a3b25) from master also fixes the problem
 - reverting that patch from the fedora 3.12.5-302.fc20 kernel fixes the
problem
 - applying that patch to 3.11.0 causes the problem

so I'm pretty sure that that is the patch that causes (or at least
triggers) this issue.

I'm using the acpi_backlight option to get the backlight working - without
this the backlight doesn't work at all. Removing 'acpi_backlight=vendor'
(or blacklisting the dell-laptop module, which is effectively the same
thing) fixes the issue.

The lockup happens when systemd runs "udevadm trigger", not when the module
is loaded - I can reproduce the issue by booting into emergency mode,
remounting the filesystem as rw, starting up systemd-udevd and running
udevadm trigger manually. It dies a few seconds after loading the
dell-laptop module.

This happens even if I don't boot into X (using
systemd.unit=multi-user.target)

Triggering udev individually for each item doesn't trigger the issue, i.e.:

for i in $(udevadm --debug trigger --type=devices --action=add \
           --dry-run --verbose); do
    echo "$i"
    udevadm --debug trigger --type=devices --action=add --verbose \
            --parent-match="$i"
    sleep 1
done

works, so I haven't been able to work out what specific combination of
actions is causing this.

With the acpi_backlight option, I can manually read/write to the sysfs
dell-laptop backlight file, and it works (and changes the backlight as
expected).

This is 100% reproducible. I've also tested by powering off the laptop and
pulling the battery just in case one of the previous boots with the bisect
left the hardware in a strange state - no change.

I did successfully boot a 3.12 kernel on F19 (before I upgraded to F20), so
there's presumably something that F20 is doing differently. It was only one
boot though.

I reported this to fedora (
https://bugzilla.redhat.com/show_bug.cgi?id=1045807) but it looks like this
is an upstream issue so I was asked to report it here.

This is an 8-core single i7 cpu (one numa node) - it's a laptop, so nothing
fancy. DMI data is attached to the fedora bug.

Bradley


* Re: [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor
  2013-12-27  3:21 [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor Bradley Baetz
@ 2013-12-27  3:22 ` Bradley Baetz
  2014-01-06 17:13 ` Johannes Weiner
  1 sibling, 0 replies; 6+ messages in thread
From: Bradley Baetz @ 2013-12-27  3:22 UTC (permalink / raw)
  To: platform-driver-x86, linux-mm, hannes; +Cc: hdegoede

Resending in plain text mode....

Bradley

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor
  2013-12-27  3:21 [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor Bradley Baetz
  2013-12-27  3:22 ` Bradley Baetz
@ 2014-01-06 17:13 ` Johannes Weiner
  2014-01-07 12:06   ` Bradley Baetz
  1 sibling, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2014-01-06 17:13 UTC (permalink / raw)
  To: Bradley Baetz; +Cc: platform-driver-x86, linux-mm, hdegoede

Hi Bradley,

My patch aggressively spreads allocations over all zones in the
system, but it should still respect dell-laptop's requirements for
DMA32 memory.
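
A minimal userspace sketch of that constraint, assuming the usual
x86_64 zone boundaries (ZONE_DMA below 16 MiB, ZONE_DMA32 below 4 GiB)
and assuming the requirement exists because the buffer's physical
address has to fit in a 32-bit register. The names zone_of_phys() and
fits_in_u32() are hypothetical helpers for illustration, not kernel
functions:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only. The boundaries below are the usual x86_64 ones
 * and are an assumption, not something read from a running kernel. */
enum zone_model { MODEL_ZONE_DMA, MODEL_ZONE_DMA32, MODEL_ZONE_NORMAL };

/* Classify a physical address into the zone that would contain it. */
static enum zone_model zone_of_phys(uint64_t phys)
{
	if (phys < (UINT64_C(16) << 20))	/* below 16 MiB */
		return MODEL_ZONE_DMA;
	if (phys < (UINT64_C(1) << 32))		/* below 4 GiB */
		return MODEL_ZONE_DMA32;
	return MODEL_ZONE_NORMAL;
}

/* An address handed over in a 32-bit register is only usable if it
 * survives truncation to 32 bits unchanged. */
static int fits_in_u32(uint64_t phys)
{
	return (uint64_t)(uint32_t)phys == phys;
}
```

In this model every address in the DMA or DMA32 zone passes
fits_in_u32(), while anything in the normal zone does not, which is the
property the WARN_ON in the patch below is meant to check.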

I wonder if the drastic change in allocation placement exposes an
existing memory corruption.  In fact, the dell-laptop module is
confused when it comes to the page allocator interface.  It does
  free_page((unsigned long)bufferpage);

in the error path, where bufferpage is a page pointer that came out of
alloc_page(), which will cause the page allocator to try to free the
mem_map(!) page that backs the bufferpage page struct.  So one failed
load attempt of the module could plausibly corrupt internal state.
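
The mix-up can be modelled in userspace with plain integer arithmetic.
This is only a sketch: the base address, the 64-byte descriptor size,
and all model_* names are made up for illustration, and the page
descriptors are assumed to live in page 0 of the modelled memory.
free_page() takes a kernel virtual address and __free_page() takes a
page descriptor, so feeding a descriptor pointer to the address-based
variant resolves to the page that holds the descriptor array:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE   UINT64_C(4096)
#define MODEL_DESC_SIZE   UINT64_C(64)        /* assumed descriptor size */
#define MODEL_VBASE       UINT64_C(0x100000)  /* made-up base of mapped RAM */
#define MODEL_MEMMAP_BASE MODEL_VBASE         /* descriptors sit in page 0 */

/* Like alloc_page(): hand back the descriptor address for a given pfn. */
static uint64_t model_alloc_page(uint64_t pfn)
{
	return MODEL_MEMMAP_BASE + pfn * MODEL_DESC_SIZE;
}

/* Like page_address(): descriptor address -> address of the page data. */
static uint64_t model_page_address(uint64_t desc)
{
	uint64_t pfn = (desc - MODEL_MEMMAP_BASE) / MODEL_DESC_SIZE;

	return MODEL_VBASE + pfn * MODEL_PAGE_SIZE;
}

/* Like __free_page(): takes a descriptor, returns the pfn it frees. */
static uint64_t model_free_by_desc(uint64_t desc)
{
	return (desc - MODEL_MEMMAP_BASE) / MODEL_DESC_SIZE;
}

/* Like free_page(): takes a data address, returns the pfn it frees. */
static uint64_t model_free_by_addr(uint64_t addr)
{
	return (addr - MODEL_VBASE) / MODEL_PAGE_SIZE;
}
```

With a descriptor for pfn 5, model_free_by_desc(desc) and
model_free_by_addr(model_page_address(desc)) both release pfn 5, while
model_free_by_addr(desc) releases pfn 0, the page backing the
descriptors themselves.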

Does the following resolve the problem?  And if not, what are the
"dell-laptop:" lines in the good and the bad kernel, and does the bad
kernel trigger the WARNING?

---

diff --git a/drivers/platform/x86/dell-laptop.c b/drivers/platform/x86/dell-laptop.c
index c608b1d33f4a..92088b228573 100644
--- a/drivers/platform/x86/dell-laptop.c
+++ b/drivers/platform/x86/dell-laptop.c
@@ -819,6 +819,18 @@ static int __init dell_init(void)
 		ret = -ENOMEM;
 		goto fail_buffer;
 	}
+
+	{
+		struct zone *zone = page_zone(bufferpage);
+		int idx = zone_idx(zone);
+
+		printk("dell-laptop: bufferpage (%p) in node %d zone %d (%s)\n", bufferpage, zone->node, idx, zone->name);
+		if (WARN_ON(idx > ZONE_DMA32)) {
+			ret = -EINVAL;
+			goto fail_rfkill;
+		}
+	}
+
 	buffer = page_address(bufferpage);
 
 	ret = dell_setup_rfkill();
@@ -888,7 +900,7 @@ fail_backlight:
 fail_filter:
 	dell_cleanup_rfkill();
 fail_rfkill:
-	free_page((unsigned long)bufferpage);
+	__free_page(bufferpage);
 fail_buffer:
 	platform_device_del(platform_device);
 fail_platform_device2:
@@ -914,7 +926,7 @@ static void __exit dell_exit(void)
 		platform_driver_unregister(&platform_driver);
 	}
 	kfree(da_tokens);
-	free_page((unsigned long)buffer);
+	__free_page(bufferpage);
 }
 
 module_init(dell_init);


* Re: [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor
  2014-01-06 17:13 ` Johannes Weiner
@ 2014-01-07 12:06   ` Bradley Baetz
  2014-01-08 14:51     ` Bradley Baetz
  0 siblings, 1 reply; 6+ messages in thread
From: Bradley Baetz @ 2014-01-07 12:06 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: platform-driver-x86, linux-mm, Hans De Goede

Hi,

On Tue, Jan 7, 2014 at 4:13 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Does the following resolve the problem?  And if not, what are the
> "dell-laptop:" lines in the good and the bad kernel, and does the bad
> kernel trigger the WARNING?

Nope, no luck. I added some more printks around the use of SMI. I've
transcribed the logs from a screenshot for the failing kernel (i.e.
master+your patch). "Sending command" logs class, select, and
&command.ebx (with the %pa format string):

dell-laptop: bufferpage (ffffea000263c680) in node 0 zone 1 (DMA32)
Sending command: 0, 2, 0x4253493198f1a000
Command sent
dell-laptop: getting intensity
Sending command: 0, 2, 0x4253493198f1a000
Command sent
dell-laptop: got intensity
dell-laptop: Setting intensity
Sending command: 1, 2, 0x4253493198f1a000

and then it locks up before returning from the SMI.

So some of the commands work, and they also return the same value for
the brightness, AND have parsed the same value from the SMBIOS table
for the ioport/value to use. (I added that later, but didn't take a
photo - they all return brightness of 2, which is the at-boot default
value)

Without acpi_backlight=vendor:

dell-laptop: bufferpage (ffffea0000fa0dc0) in node 0 zone 1 (DMA32)

(no other logs, because the module's backlight interface isn't used
without that boot param)

With your mm patches reverted:

[   12.773884] dell-laptop: bufferpage (ffffea0000fe0180) in node 0 zone 1 (DMA32)
[   12.775502] Sending command: 0, 2, 0x425349313f806000
[   12.777293] Command sent
[   12.778950] dell-laptop: getting intensity
[   12.780589] Sending command: 0, 2, 0x425349313f806000
[   12.782185] Command sent
[   12.783679] dell-laptop: got intensity
[   12.785202] dell-laptop: Setting intensity
[   12.786715] Sending command: 1, 2, 0x425349313f806000
[   12.788892] Command sent
[   12.790379] dell-laptop: set intensity

(with the get/set repeated a bit later when X starts up)
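
The logged values can be cross-checked against each other. The sketch
below rests on assumptions that are not stated in this thread: x86_64
with the page descriptor array at 0xffffea0000000000, 64-byte page
descriptors, 4 KiB pages, and a little-endian 8-byte read at
&command.ebx, so that the low half of the %pa value is ebx and the high
half is whatever field follows it (presumably ecx). The helper names
are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants; see the caveats in the text above. */
#define ASSUMED_VMEMMAP_BASE     UINT64_C(0xffffea0000000000)
#define ASSUMED_STRUCT_PAGE_SIZE UINT64_C(64)
#define ASSUMED_PAGE_SHIFT       12

/* Convert a logged page descriptor pointer to a physical address. */
static uint64_t page_ptr_to_phys(uint64_t page_ptr)
{
	uint64_t pfn = (page_ptr - ASSUMED_VMEMMAP_BASE) /
		       ASSUMED_STRUCT_PAGE_SIZE;

	return pfn << ASSUMED_PAGE_SHIFT;
}

/* Low 32 bits of the logged %pa value (the ebx field itself). */
static uint32_t logged_low32(uint64_t value)
{
	return (uint32_t)value;
}

/* High 32 bits of the logged %pa value (the field after ebx). */
static uint32_t logged_high32(uint64_t value)
{
	return (uint32_t)(value >> 32);
}
```

Under those assumptions, ffffea000263c680 maps to physical address
0x98f1a000 and ffffea0000fe0180 maps to 0x3f806000, which match the low
halves of the two "Sending command" values. The high half is 0x42534931
in both cases, which reads as ASCII "BSI1". So in both kernels the
address handed to the SMI looks like the page that was allocated, and
only its placement differs.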

And on the broken kernel, when I boot into 'emergency' mode and manually
load dell-laptop, I get the same logs as the 'working' bit (including
the getting/got/setting/set lines).

Looking at the code, I notice a few odd things in the dcdbas code,
although I don't think that they're the issue here:

1. dcdbas_smi_request does outb/inb, and marks eax as an input, but
doesn't mark it as clobbered (I think; I don't have much experience
with gcc's asm). In practice, I can't see that being an issue.
2. dcdbas_smi_request says that it is "Called with smi_data_lock" but
that's only true for the calls *within* dcdbas.c. I think that that's
only a documentation issue, since it is protecting a buffer that isn't
used here. (Dell-laptop has its own buffer and mutex).

I'm still unable to manually reproduce this - the only way to repro is
'try to boot normally', and while that's 100% reliable, it makes it a
bit hard to narrow a trigger down...

Bradley


* Re: [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor
  2014-01-07 12:06   ` Bradley Baetz
@ 2014-01-08 14:51     ` Bradley Baetz
  2014-01-11  2:59       ` Bradley Baetz
  0 siblings, 1 reply; 6+ messages in thread
From: Bradley Baetz @ 2014-01-08 14:51 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: platform-driver-x86, linux-mm, Hans De Goede

On Tue, Jan 7, 2014 at 11:06 PM, Bradley Baetz <bbaetz@gmail.com> wrote:
> I'm still unable to manually reproduce this - the only way to repro is
> 'try to boot normally', and while that's 100% reliable, it makes it a
> bit hard to narrow a trigger down...

So if I boot into 'emergency' mode and modprobe dell-laptop, it only
locks up about 50% of the time. And if I boot with init=/bin/bash, and
then load the module, it doesn't lock up at all (tried 5 times).

I also tried making dell-laptop use the DMA zone (instead of DMA32),
and that didn't help.

Bradley


* Re: [REGRESSION] [BISECTED] MM patch causes kernel lockup with 3.12 and acpi_backlight=vendor
  2014-01-08 14:51     ` Bradley Baetz
@ 2014-01-11  2:59       ` Bradley Baetz
  0 siblings, 0 replies; 6+ messages in thread
From: Bradley Baetz @ 2014-01-11  2:59 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: platform-driver-x86, linux-mm, Hans De Goede

On Thu, Jan 9, 2014 at 1:51 AM, Bradley Baetz <bbaetz@gmail.com> wrote:
> So if I boot into 'emergency' mode and modprobe dell-laptop, it only
> locks up about 50% of the time. And if I boot with init=/bin/bash, and
> then load the module, it doesn't lock up at all (tried 5 times)
>
> I also tried making dell-laptop use the DMA zone (instead of DMA32),
> and that didn't help.

I played around with memmap= boot options to try to work out what bits
of memory were being used that shouldn't be. Unfortunately I ran into
trouble reserving the EFI areas, because any memory areas marked as
reserved don't end up with the execute bit set, and the early calls
into EFI crash due to the NX checks.

I hacked efi_free_boot_services to just return without freeing
anything, and it seems to not lock up. Whether that's just because the
memory allocation ends up being different, or because the SMI way of
adjusting the backlight isn't meant to be used with EFI is another
story...

So it looks like this would always have been a problem, but the
allocation scheme change just exposed it early?

Bradley

