* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
[not found] <bug-15214-10286@http.bugzilla.kernel.org/>
@ 2010-02-03 22:39 ` Andrew Morton
2010-02-05 11:20 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2010-02-03 22:39 UTC (permalink / raw)
To: linux-mm; +Cc: bugzilla-daemon, bugme-daemon, Mel Gorman, Johannes Weiner,
ajlill
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Wed, 3 Feb 2010 02:30:22 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15214
>
> Summary: Oops at __rmqueue+0x51/0x2b3
> Product: Memory Management
> Version: 2.5
> Kernel Version: 2.6.32.7
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Page Allocator
> AssignedTo: akpm@linux-foundation.org
> ReportedBy: ajlill@ajlc.waterloo.on.ca
> Regression: Yes
>
>
> Created an attachment (id=24887)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=24887)
> .config file
>
> I get an Oops when doing a lot of filesystem reads. The process, cfagent, is
> running through the filesystem checksumming files when it dies. It doesn't
> happen every time cfagent runs, but there's a pretty good chance it will.
> This problem happens on 2.6.31.* as well, 2.6.30.10 appears to be stable. It
> happens on two different computers, so it's unlikely to be hardware. Also, in
> 2.6.32.*, I get an Oops at
>
> BUG_ON(page_zone(start_page) != page_zone(end_page));
>
> in move_freepages when I do sysctl -w vm.min_free_kbytes=16384
>
> but I can only reliably reproduce it when I do the sysctl from the boot
> scripts, and I'm having trouble getting netconsole started beforehand to
> capture the full output.
>
> gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
>
> Full text of Oops:
>
> BUG: unable to handle kernel paging request at 6eae67fc
> IP: [<c0192a38>] __rmqueue+0x51/0x2b3
> *pdpt = 00000000351be001 *pde = 0000000000000000
> Oops: 0002 [#1] SMP
> last sysfs file: /sys/class/firmware/0000:00:0b.0/loading
> Modules linked in: netconsole af_packet autofs4 nfsd nfs lockd fscache nfs_acl
> auth_rpcgss sunrpc ipv6 nls_iso8859_1 nls_cp437 vfat fat xfs exportfs fuse
> configfs dm_snapshot dm_mirror dm_region_hash dm_log dm_mod eeprom w83781d
> hwmon_vid hwmon r128 drm tuner_simple tuner_types tuner msp3400 saa7115 button
> processor ivtv i2c_algo_bit cx2341x v4l2_common videodev psmouse parport_pc
> v4l1_compat rtc_cmos parport tveeprom i2c_piix4 rtc_core intel_agp serio_raw
> rtc_lib agpgart i2c_core shpchp pci_hotplug pcspkr evdev ext3 jbd mbcache raid1
> sg sr_mod sd_mod cdrom crc_t10dif ata_generic pata_acpi pata_pdc202xx_old
> ata_piix floppy e1000 uhci_hcd libata thermal fan unix [last unloaded:
> scsi_wait_scan]
>
> Pid: 6629, comm: cfagent Not tainted (2.6.32.7 #1) System Name
> EIP: 0060:[<c0192a38>] EFLAGS: 00210002 CPU: 0
> EIP is at __rmqueue+0x51/0x2b3
> EAX: c146a018 EBX: 0000000a ECX: 6eae67f8 EDX: c050b654
> ESI: c050b644 EDI: 00200246 EBP: f51c9d1c ESP: f51c9cec
> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process cfagent (pid: 6629, ti=f51c8000 task=f51b40b0 task.ti=f51c8000)
> Stack:
> 00000002 00000000 c050b260 00000001 f6ba8280 00200002 c0193c92 c019404e
> <0> c146a000 c1479ff8 c050b260 00200246 f51c9d78 c0193cd5 f51c9d7c 00000002
> <0> 00000000 00000000 000201da c050c16c 00000000 c050b280 00000001 0000001f
> Call Trace:
> [<c0193c92>] ? get_page_from_freelist+0xdf/0x3a8
> [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
> [<c0193cd5>] ? get_page_from_freelist+0x122/0x3a8
> [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
> [<c01caa57>] ? _d_rehash+0x3c/0x40
> [<c01961e3>] ? __do_page_cache_readahead+0x80/0x15b
> [<c01cb95f>] ? __d_lookup+0xa1/0xd5
> [<c01962d5>] ? ra_submit+0x17/0x1c
> [<c01964e4>] ? ondemand_readahead+0x150/0x15c
> [<c0196569>] ? page_cache_sync_readahead+0x16/0x1b
> [<c0190def>] ? generic_file_aio_read+0x212/0x507
> [<c01bd512>] ? do_sync_read+0xab/0xe9
> [<c01a86f5>] ? mmap_region+0x25b/0x334
> [<c014823f>] ? autoremove_wake_function+0x0/0x33
> [<c020edd8>] ? security_file_permission+0xf/0x11
> [<c01bd467>] ? do_sync_read+0x0/0xe9
> [<c01bdc1d>] ? vfs_read+0x8a/0x13f
> [<c01be026>] ? sys_read+0x3b/0x60
> [<c010296f>] ? sysenter_do_call+0x12/0x27
> Code: 2c c1 e1 03 8d 94 30 20 02 00 00 e9 8a 00 00 00 8d 72 0c 8d 04 0e 39 00
> 74 7c 8b 55 d0 8b 04 d6 8d 48 e8 89 4d f0 8b 08 8b 50 04 <89> 51 04 89 0a c7 40
> 04 00 02 20 00 c7 00 00 01 10 00 0f ba 70
> EIP: [<c0192a38>] __rmqueue+0x51/0x2b3 SS:ESP 0068:f51c9cec
> CR2: 000000006eae67fc
> ---[ end trace db0096b2091950d0 ]---
>
Strange regression. I'd be suspecting that we've mucked up the initial
mem_map, perhaps because of a wart in the e820 or acpi tables.
Or perhaps it's something else.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-03 22:39 ` [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3 Andrew Morton
@ 2010-02-05 11:20 ` Mel Gorman
2010-02-07 18:34 ` Tony Lill
0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-05 11:20 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, bugzilla-daemon, bugme-daemon, Johannes Weiner, ajlill
On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
>
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Wed, 3 Feb 2010 02:30:22 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=15214
> >
> > Summary: Oops at __rmqueue+0x51/0x2b3
> > Product: Memory Management
> > Version: 2.5
> > Kernel Version: 2.6.32.7
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Page Allocator
> > AssignedTo: akpm@linux-foundation.org
> > ReportedBy: ajlill@ajlc.waterloo.on.ca
> > Regression: Yes
> >
> >
> > Created an attachment (id=24887)
> > --> (http://bugzilla.kernel.org/attachment.cgi?id=24887)
> > .config file
> >
> > I get an Oops when doing a lot of filesystem reads. The process, cfagent, is
> > running through the filesystem checksumming files when it dies. It doesn't
> > happen every time cfagent runs, but there's a pretty good chance it will.
> > This problem happens on 2.6.31.* as well, 2.6.30.10 appears to be stable. It
> > happens on two different computers, so it's unlikely to be hardware. Also, in
> > 2.6.32.*, I get an Oops at
> >
> > BUG_ON(page_zone(start_page) != page_zone(end_page));
> >
> > in move_freepages when I do sysctl -w vm.min_free_kbytes=16384
> >
> > but I can only reliably reproduce it when I do the sysctl from the boot
> > scripts, and I'm having trouble getting netconsole started beforehand to
> > capture the full output.
> >
This point on sysctl is truly bizarre. It implies that the struct pages
have been corrupted in some fashion. Just before this check is made, we
do
	/* Do not cross zone boundaries */
	if (start_pfn < zone->zone_start_pfn)
		start_page = page;
	if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
		return 0;

	return move_freepages(zone, start_page, end_page, migratetype);
So, for that bug to be triggered, two pages between
zone->zone_start_pfn and
zone->zone_start_pfn + zone->spanned_pages
have to have different results for page_zone(). That would be outright
wrong.
Ordinarily at this point, I would assume that your memory is bad with
small errors occurring. The early-in-boot problem might indicate that
there is a specific region of memory that is bust rather than something
like a power problem.
You said that the checksumming problem happens on two separate machines,
but can you confirm that this problem also happens on both please?
> > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> >
This is a bit of a reach, but how confident are you that this version of
gcc is building kernels correctly?
There are a few disconnected reports of kernel problems with this
particular version of gcc although none that I can connect with this
problem or on x86 for that matter. One example is
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
which reported problems building kernels on the s390 with that compiler.
Moving to 4.2 helped them and it *should* have been fixed according to
this bug
http://bugzilla.kernel.org/show_bug.cgi?id=13012
It might be a red herring, but just to be sure, would you mind trying
gcc 4.2 or 4.3 just to be sure please?
> > Full text of Oops:
> >
> > BUG: unable to handle kernel paging request at 6eae67fc
> > IP: [<c0192a38>] __rmqueue+0x51/0x2b3
> > *pdpt = 00000000351be001 *pde = 0000000000000000
> > Oops: 0002 [#1] SMP
> > last sysfs file: /sys/class/firmware/0000:00:0b.0/loading
> > Modules linked in: netconsole af_packet autofs4 nfsd nfs lockd fscache nfs_acl
> > auth_rpcgss sunrpc ipv6 nls_iso8859_1 nls_cp437 vfat fat xfs exportfs fuse
> > configfs dm_snapshot dm_mirror dm_region_hash dm_log dm_mod eeprom w83781d
> > hwmon_vid hwmon r128 drm tuner_simple tuner_types tuner msp3400 saa7115 button
> > processor ivtv i2c_algo_bit cx2341x v4l2_common videodev psmouse parport_pc
> > v4l1_compat rtc_cmos parport tveeprom i2c_piix4 rtc_core intel_agp serio_raw
> > rtc_lib agpgart i2c_core shpchp pci_hotplug pcspkr evdev ext3 jbd mbcache raid1
> > sg sr_mod sd_mod cdrom crc_t10dif ata_generic pata_acpi pata_pdc202xx_old
> > ata_piix floppy e1000 uhci_hcd libata thermal fan unix [last unloaded:
> > scsi_wait_scan]
> >
> > Pid: 6629, comm: cfagent Not tainted (2.6.32.7 #1) System Name
> > EIP: 0060:[<c0192a38>] EFLAGS: 00210002 CPU: 0
> > EIP is at __rmqueue+0x51/0x2b3
What line does addr2line say c0192a38 corresponds to?
> > EAX: c146a018 EBX: 0000000a ECX: 6eae67f8 EDX: c050b654
> > ESI: c050b644 EDI: 00200246 EBP: f51c9d1c ESP: f51c9cec
> > DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > Process cfagent (pid: 6629, ti=f51c8000 task=f51b40b0 task.ti=f51c8000)
> > Stack:
> > 00000002 00000000 c050b260 00000001 f6ba8280 00200002 c0193c92 c019404e
> > <0> c146a000 c1479ff8 c050b260 00200246 f51c9d78 c0193cd5 f51c9d7c 00000002
> > <0> 00000000 00000000 000201da c050c16c 00000000 c050b280 00000001 0000001f
> > Call Trace:
> > [<c0193c92>] ? get_page_from_freelist+0xdf/0x3a8
> > [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
> > [<c0193cd5>] ? get_page_from_freelist+0x122/0x3a8
> > [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
> > [<c01caa57>] ? _d_rehash+0x3c/0x40
> > [<c01961e3>] ? __do_page_cache_readahead+0x80/0x15b
> > [<c01cb95f>] ? __d_lookup+0xa1/0xd5
> > [<c01962d5>] ? ra_submit+0x17/0x1c
> > [<c01964e4>] ? ondemand_readahead+0x150/0x15c
> > [<c0196569>] ? page_cache_sync_readahead+0x16/0x1b
> > [<c0190def>] ? generic_file_aio_read+0x212/0x507
> > [<c01bd512>] ? do_sync_read+0xab/0xe9
> > [<c01a86f5>] ? mmap_region+0x25b/0x334
> > [<c014823f>] ? autoremove_wake_function+0x0/0x33
> > [<c020edd8>] ? security_file_permission+0xf/0x11
> > [<c01bd467>] ? do_sync_read+0x0/0xe9
> > [<c01bdc1d>] ? vfs_read+0x8a/0x13f
> > [<c01be026>] ? sys_read+0x3b/0x60
> > [<c010296f>] ? sysenter_do_call+0x12/0x27
> > Code: 2c c1 e1 03 8d 94 30 20 02 00 00 e9 8a 00 00 00 8d 72 0c 8d 04 0e 39 00
> > 74 7c 8b 55 d0 8b 04 d6 8d 48 e8 89 4d f0 8b 08 8b 50 04 <89> 51 04 89 0a c7 40
> > 04 00 02 20 00 c7 00 00 01 10 00 0f ba 70
> > EIP: [<c0192a38>] __rmqueue+0x51/0x2b3 SS:ESP 0068:f51c9cec
> > CR2: 000000006eae67fc
> > ---[ end trace db0096b2091950d0 ]---
> >
>
> Strange regression. I'd be suspecting that we've mucked up the initial
> mem_map, perhaps because of a wart in the e820 or acpi tables.
>
> Or perhaps it's something else.
>
Let's see what the early boot looked like.
Tony, would you mind booting with "mminit_loglevel=4 loglevel=9" and send
the full dmesg please?
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-05 11:20 ` Mel Gorman
@ 2010-02-07 18:34 ` Tony Lill
2010-02-08 10:10 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Tony Lill @ 2010-02-07 18:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner
[-- Attachment #1: Type: Text/Plain, Size: 1506 bytes --]
On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
>
> This is a bit of a reach, but how confident are you that this version of
> gcc is building kernels correctly?
>
> There are a few disconnected reports of kernel problems with this
> particular version of gcc although none that I can connect with this
> problem or on x86 for that matter. One example is
>
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
>
> which reported problems building kernels on the s390 with that compiler.
> Moving to 4.2 helped them and it *should* have been fixed according to
> this bug
>
> http://bugzilla.kernel.org/show_bug.cgi?id=13012
>
> It might be a red herring, but just to be sure, would you mind trying
> gcc 4.2 or 4.3 just to be sure please?
Well, it was producing working kernels up until 2.6.30, but I recompiled with
gcc (Debian 4.3.2-1.1) 4.3.2
and the box has been running nearly 48 hours without incident. My previous
record was 2. So I guess we can put this down to a new compiler bug.
I probably should have checked this before reporting a bug. Mea culpa
--
Tony Lill, Tony.Lill@AJLC.Waterloo.ON.CA
President, A. J. Lill Consultants (519) 650 0660
539 Grand Valley Dr., Cambridge, Ont. N3H 2S2 (519) 241 2461
--------------- http://www.ajlc.waterloo.on.ca/ ----------------
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-07 18:34 ` Tony Lill
@ 2010-02-08 10:10 ` Mel Gorman
2010-02-08 19:18 ` Andrew Morton
0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-08 10:10 UTC (permalink / raw)
To: Tony Lill
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner
On Sun, Feb 07, 2010 at 01:34:58PM -0500, Tony Lill wrote:
> On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> > On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> >
> > This is a bit of a reach, but how confident are you that this version of
> > gcc is building kernels correctly?
> >
> > There are a few disconnected reports of kernel problems with this
> > particular version of gcc although none that I can connect with this
> > problem or on x86 for that matter. One example is
> >
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> >
> > which reported problems building kernels on the s390 with that compiler.
> > Moving to 4.2 helped them and it *should* have been fixed according to
> > this bug
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=13012
> >
> > It might be a red herring, but just to be sure, would you mind trying
> > gcc 4.2 or 4.3 just to be sure please?
>
> Well, it was producing working kernels up until 2.6.30, but I recompiled with
> gcc (Debian 4.3.2-1.1) 4.3.2
> and the box has been running nearly 48 hours without incident. My previous
> record was 2. So I guess we can put this down to a new compiler bug.
>
Well, it's great the problem source is known but pinning down compiler bugs
is a bit of a pain. Andrew, I don't recall an easy-as-in-bisection-easy
means of identifying which part of the compile unit went wrong and why so
it can be marked with #error for known broken compilers. Is there one or is
it a case of asking for two objdumps of __rmqueue and making a stab at it?
> I probably should have checked this before reporting a bug. Mea culpa
Not at all. Miscompiles like this are rare and usually caught a lot quicker
than this. If you hadn't reported the problem with two different machines,
I would have blamed hardware and asked for a memtest. The only reason I
suspected the compiler was that the type of error you reported
"couldn't happen".
Thanks for reporting and testing.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-08 10:10 ` Mel Gorman
@ 2010-02-08 19:18 ` Andrew Morton
2010-02-09 14:45 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2010-02-08 19:18 UTC (permalink / raw)
To: Mel Gorman
Cc: Tony Lill, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner
On Mon, 8 Feb 2010 10:10:46 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> On Sun, Feb 07, 2010 at 01:34:58PM -0500, Tony Lill wrote:
> > On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> > > On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > >
> > > This is a bit of a reach, but how confident are you that this version of
> > > gcc is building kernels correctly?
> > >
> > > There are a few disconnected reports of kernel problems with this
> > > particular version of gcc although none that I can connect with this
> > > problem or on x86 for that matter. One example is
> > >
> > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> > >
> > > which reported problems building kernels on the s390 with that compiler.
> > > Moving to 4.2 helped them and it *should* have been fixed according to
> > > this bug
> > >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13012
> > >
> > > It might be a red herring, but just to be sure, would you mind trying
> > > gcc 4.2 or 4.3 just to be sure please?
> >
> > Well, it was producing working kernels up until 2.6.30, but I recompiled with
> > gcc (Debian 4.3.2-1.1) 4.3.2
> > and the box has been running nearly 48 hours without incident. My previous
> > record was 2. So I guess we can put this down to a new compiler bug.
> >
>
> Well, it's great the problem source is known but pinning down compiler bugs
> is a bit of a pain. Andrew, I don't recall an easy-as-in-bisection-easy
> means of identifying which part of the compile unit went wrong and why so
> it can be marked with #error for known broken compilers. Is there one or is
> it a case of asking for two objdumps of __rmqueue and making a stab at it?
ugh. This is pretty rare.
Probably the best strategy is to generate the two page_alloc.s files,
fish out the __rmqueue part and then try to compare them. The key
part is to Cc Linus then thrash around stupidly for long enough for him
to take pity and find the bug for you.
> > I probably should have checked this before reporting a bug. Mea culpa
>
> Not at all. Miscompiles like this are rare and usually caught a lot quicker
> than this. If you hadn't reported the problem with two different machines,
> I would have blamed hardware and asked for a memtest. The only reason I
> spotted this might be a compiler was because the type of error you reported
> "couldn't happen".
>
> Thanks for reporting and testing.
Yup.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-08 19:18 ` Andrew Morton
@ 2010-02-09 14:45 ` Mel Gorman
[not found] ` <201002101217.34131.ajlill@ajlc.waterloo.on.ca>
0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-09 14:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Tony Lill, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner
On Mon, Feb 08, 2010 at 11:18:52AM -0800, Andrew Morton wrote:
> On Mon, 8 Feb 2010 10:10:46 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > On Sun, Feb 07, 2010 at 01:34:58PM -0500, Tony Lill wrote:
> > > On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> > > > On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > > > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > > >
> > > > This is a bit of a reach, but how confident are you that this version of
> > > > gcc is building kernels correctly?
> > > >
> > > > There are a few disconnected reports of kernel problems with this
> > > > particular version of gcc although none that I can connect with this
> > > > problem or on x86 for that matter. One example is
> > > >
> > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> > > >
> > > > which reported problems building kernels on the s390 with that compiler.
> > > > Moving to 4.2 helped them and it *should* have been fixed according to
> > > > this bug
> > > >
> > > > http://bugzilla.kernel.org/show_bug.cgi?id=13012
> > > >
> > > > It might be a red herring, but just to be sure, would you mind trying
> > > > gcc 4.2 or 4.3 just to be sure please?
> > >
> > > Well, it was producing working kernels up until 2.6.30, but I recompiled with
> > > gcc (Debian 4.3.2-1.1) 4.3.2
> > > and the box has been running nearly 48 hours without incident. My previous
> > > record was 2. So I guess we can put this down to a new compiler bug.
> > >
> >
> > Well, it's great the problem source is known but pinning down compiler bugs
> > is a bit of a pain. Andrew, I don't recall an easy-as-in-bisection-easy
> > means of identifying which part of the compile unit went wrong and why so
> > it can be marked with #error for known broken compilers. Is there one or is
> > it a case of asking for two objdumps of __rmqueue and making a stab at it?
>
> ugh. This is pretty rare.
>
Indeed. It does appear to be the case here and, judging from search results
on Google, it's not the first bug related to gcc 4.1 and the kernel.
> Probably the best strategy is to generate the two page_alloc.s files,
> fish out the __rmqueue part and then try to compare them. The key
> part is to Cc Linus then thrash around stupidly for long enough for him
> to take pity and find the bug for you.
>
Ok, step 1 then before I do the Team America Super Secret Signal to
Linus for help.
Tony, can you generate the .s files for me please? It should be a case
of
	make clean
	rm *.s
	make CC=gcc-$BAD_VERSION KCFLAGS=-save-temps mm
	tar -czf kernel-s-files-bad-compiler.tar.gz .config *.s mm/*.c mm/*.h mm/Makefile mm/Kconfig

	make clean
	rm *.s
	make CC=gcc-$GOOD_VERSION KCFLAGS=-save-temps mm
	tar -czf kernel-s-files-good-compiler.tar.gz .config *.s mm/*.c mm/*.h mm/Makefile mm/Kconfig
where $BAD_VERSION and $GOOD_VERSION are the two compiler versions and
then post the two tarballs. They should contain what is needed.
Thanks
> > > I probably should have checked this before reporting a bug. Mea culpa
> >
> > Not at all. Miscompiles like this are rare and usually caught a lot quicker
> > than this. If you hadn't reported the problem with two different machines,
> > I would have blamed hardware and asked for a memtest. The only reason I
> > spotted this might be a compiler was because the type of error you reported
> > "couldn't happen".
> >
> > Thanks for reporting and testing.
>
> Yup.
>
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
[not found] ` <201002101217.34131.ajlill@ajlc.waterloo.on.ca>
@ 2010-02-11 18:20 ` Mel Gorman
2010-02-11 18:49 ` Linus Torvalds
0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-11 18:20 UTC (permalink / raw)
To: Tony Lill
Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner, Linus Torvalds
On Wed, Feb 10, 2010 at 12:17:28PM -0500, Tony Lill wrote:
> On Tuesday 09 February 2010 09:45:38 Mel Gorman wrote:
>
> > Tony, can you generate the .s files for me please?
>
> Here they are
Thanks Tony,
I made a diff of the assembler generated for __rmqueue and move_freepages_block
and wrote up some notes below. The problem is that I cannot spot where the
bad assembler is or why it causes bad references. At least, I cannot see
what sequence of events has to happen for the BUG_ON check to be triggered.
(adds Linus to cc)
Linus, the background to this bug
(http://bugzilla.kernel.org/show_bug.cgi?id=15214) is bad memory references in
2.6.31.* and a BUG_ON being triggered in mm/page_alloc.c#move_freepages_block()
in 2.6.32* with no problems in 2.6.30.10. The BUG_ON "shouldn't" happen as
discussed in http://marc.info/?l=linux-mm&m=126536882627752&w=2 so a suggestion
was made that the compiler might be at fault. Using a newer compiler was
reported to work fine at http://marc.info/?l=linux-mm&m=126556778403048&w=2
broken compiler: gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
working compiler: gcc (Debian 4.3.2-1.1) 4.3.2
Tony posted the assembler files (KCFLAGS=-save-temps) from
the broken and working compilers, a copy of which is available at
http://www.csn.ul.ie/~mel/postings/bug-20100211/ . Have you any suggestions
on what the best way to go about finding where the badly generated code
might be so a warning can be added for gcc 4.1? My strongest suspicion is
that the problem is in the assembler that looks up the struct page from a
PFN in sparsemem but I'm failing to prove it.
--- bad/__rmqueue.s 2010-02-11 14:48:10.000000000 +0000
+++ good/__rmqueue.s 2010-02-11 14:48:46.000000000 +0000
@@ -1,278 +1,315 @@
__rmqueue:
+.L163:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
# Looks like normal entry stuff for
# __rmqueue. There are differences
# in the stack usage between the
# compilers
- subl $36, %esp
- movl %eax, -40(%ebp)
- movl %edx, -44(%ebp)
- movl %ecx, -48(%ebp)
- jmp .L265
-.L266:
- movl $3, -48(%ebp)
-.L265:
- movl -44(%ebp), %ebx
- movl -48(%ebp), %ecx
- movl -40(%ebp), %esi
- imull $44, %ebx, %eax
- sall $3, %ecx
- leal 544(%eax,%esi), %edx
- jmp .L267
# This looks like __rmqueue_smallest
# Main loop. There are significant
# differences in the compiled code but
# I couldn't spot anything obviously
# wrong
-.L268:
- leal 12(%edx), %esi
- leal (%esi,%ecx), %eax
- cmpl %eax, (%eax)
- je .L269
- movl -48(%ebp), %edx
- movl (%esi,%edx,8), %eax
- leal -24(%eax), %ecx
- movl %ecx, -16(%ebp)
- movl (%eax), %ecx
- movl 4(%eax), %edx
- movl %edx, 4(%ecx)
- movl %ecx, (%edx)
- movl $2097664, 4(%eax)
- movl $1048832, (%eax)
+ subl $76, %esp
+ movl %eax, -56(%ebp)
+ imull $44, %edx, %eax
+ movl %edx, -60(%ebp)
+ movl %eax, -84(%ebp)
+.L188:
+ movl %ecx, -28(%ebp)
+.L187:
+ movl -28(%ebp), %edx
+ movl -84(%ebp), %ecx
+ movl -56(%ebp), %ebx
+ movl -60(%ebp), %esi
+ sall $3, %edx
+ leal 544(%edx,%ecx), %eax
+ leal 12(%ebx,%eax), %ecx
+ movl %esi, -24(%ebp)
+ movl %edx, -80(%ebp)
+ jmp .L164
+.L171:
+ imull $44, -24(%ebp), %edi
+ movl -56(%ebp), %eax
+ movl -80(%ebp), %ebx
+ movl -56(%ebp), %esi
+ leal 544(%eax,%edi), %eax
+ movl %eax, -64(%ebp)
+ addl $12, %eax
+ movl %eax, -48(%ebp)
+ movl (%ecx), %eax
+ leal 544(%edi,%ebx), %edx
+ leal 12(%esi,%edx), %edx
+ addl $44, %ecx
+ cmpl %edx, %eax
+ je .L165
+ movl -28(%ebp), %ebx
+ sall $3, %ebx
+ leal (%ebx,%edi), %eax
+ movl 556(%esi,%eax), %ecx
+ leal -24(%ecx), %esi
+ movl 24(%esi), %edx
+ movl 28(%esi), %eax
+ movl %eax, 4(%edx)
+ movl %edx, (%eax)
+ movl $2097664, 28(%esi)
+ movl $1048832, 24(%esi)
#APP
- btr $18,-24(%eax)
+# 127 "/misc/m1/kernel/linux-2.6.32.7/arch/x86/include/asm/bitops.h" 1
+ btr $18,-24(%ecx)
+# 0 "" 2
#NO_APP
# More of __rmqueue_smallest I
# think it's roughly in the
# following place in the code.
# The main differences appear to
# be in how registers are used
# and the stack is laid out.
# Again, I can't actually see
# anything wrong as such
static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
						int migratetype)
{
	....
	rmv_page_order(page);
	area->nr_free--;
	....
}
# rmv_page_order() ?
- movl -16(%ebp), %eax
- movb %bl, %cl
- movl %ebx, %edi
- movl $0, 12(%eax)
# area->nr_free--
- decl 40(%esi)
- movl $1, -36(%ebp)
- sall %cl, -36(%ebp)
- jmp .L271
-.L272:
- shrl -36(%ebp)
- movl -16(%ebp), %ebx
- movl -36(%ebp), %eax
+ movl $0, 12(%esi)
+ movl -56(%ebp), %eax
+ decl 596(%eax,%edi)
+ movl -64(%ebp), %edx
+ movl -80(%ebp), %eax
+ movl -24(%ebp), %edi
+ movl $1, -52(%ebp)
+ movl %ebx, -76(%ebp)
+ leal -32(%edx,%eax), %eax
+ movl %edi, %ecx
+ movl %eax, -32(%ebp)
+ sall %cl, -52(%ebp)
+ jmp .L166
+.L169:
+ shrl -52(%ebp)
+ movl -52(%ebp), %eax
sall $5, %eax
- addl %eax, %ebx
- movl -40(%ebp), %eax
+ leal (%esi,%eax), %ebx
+ movl -56(%ebp), %eax
movl %ebx, %edx
call bad_range
testl %eax, %eax
- je .L273
+ je .L167
#APP
# Think the following
# means we are looking
# in expand() which is
# at line 665
+# 665 "/misc/m1/kernel/linux-2.6.32.7/mm/page_alloc.c" 1
1: ud2
.pushsection __bug_table,"a"
2: .long 1b, .LC0
.word 665, 0
.org 2b+12
.popsection
+# 0 "" 2
#NO_APP
-.L275:
- jmp .L275
-.L273:
- movl -48(%ebp), %edx
- subl $44, %esi
- decl %edi
- leal (%esi,%edx,8), %eax
- movl (%eax), %ecx
+.L168:
+ jmp .L168
+.L167:
+ movl -32(%ebp), %ecx
leal 24(%ebx), %edx
- movl %edx, 4(%ecx)
- movl %ecx, 24(%ebx)
- movl %eax, 4(%edx)
- movl %edx, (%eax)
- incl 40(%esi)
+ decl %edi
+ subl $44, -48(%ebp)
+ movl (%ecx), %eax
+ movl %edx, 4(%eax)
+ movl %eax, 24(%ebx)
+ movl -48(%ebp), %eax
+ addl -76(%ebp), %eax
+ movl %edx, (%ecx)
+ movl %eax, 28(%ebx)
+ movl -48(%ebp), %eax
+ incl 40(%eax)
movl %edi, 12(%ebx)
#APP
+# 84 "/misc/m1/kernel/linux-2.6.32.7/arch/x86/include/asm/bitops.h" 1
bts $18,(%ebx)
+# 0 "" 2
#NO_APP
# Looks like more of expand. Think the
# cmpl and jgs towards the end of this
# section for the bad compiler are the
# while (high > low) {}
# part.
#
-.L271:
- cmpl -44(%ebp), %edi
- jg .L272
- jmp .L316
-.L269:
- incl %ebx
- addl $44, %edx
-.L267:
- cmpl $10, %ebx
- jbe .L268
- movl $0, -16(%ebp)
- jmp .L310
-.L316:
- cmpl $0, -16(%ebp)
- jne .L278
-.L310:
- cmpl $3, -48(%ebp)
- je .L278
- movl $10, -32(%ebp)
- jmp .L280
-.L281:
- movl (%ecx), %ebx
- cmpl $3, %ebx
- movl %ebx, -20(%ebp)
- je .L282
- imull $44, -32(%ebp), %eax
- movl -40(%ebp), %esi
- leal 556(%eax,%esi), %eax
- movl %eax, -28(%ebp)
- leal (%eax,%ebx,8), %eax
- cmpl %eax, (%eax)
- je .L282
- movl -32(%ebp), %eax
- movl -28(%ebp), %edx
- movl %eax, -24(%ebp)
- movl (%edx,%ebx,8), %esi
- subl $24, %esi
- movl %esi, -16(%ebp)
- decl 40(%edx)
- cmpl $4, -32(%ebp)
- jg .L285
- cmpl $1, -48(%ebp)
- je .L285
+ subl $44, %ecx
+ movl %ecx, -32(%ebp)
+.L166:
+ cmpl -60(%ebp), %edi
+ jg .L169
+ jmp .L202
+.L165:
+ incl -24(%ebp)
+.L164:
+ cmpl $10, -24(%ebp)
+ jbe .L171
+ xorl %esi, %esi
+ jmp .L199
+.L202:
+ testl %esi, %esi
+ jne .L172
+.L199:
+ cmpl $3, -28(%ebp)
+ je .L172
+ movl -28(%ebp), %eax
+ movl $10, %edi
+ sall $4, %eax
+ addl $fallbacks, %eax
+ movl %eax, -72(%ebp)
+ movl -28(%ebp), %eax
+ sall $4, %eax
+ addl $fallbacks+16, %eax
+ movl %eax, -88(%ebp)
+ jmp .L173
+.L185:
+ movl (%ecx), %edx
+ cmpl $3, %edx
+ movl %edx, -20(%ebp)
+ je .L174
+ imull $44, %edi, %edx
+ movl -20(%ebp), %ebx
+ movl -56(%ebp), %esi
+ leal (%edx,%ebx,8), %eax
+ leal 544(%esi,%eax), %ebx
+ leal 556(%esi,%eax), %eax
+ cmpl %eax, 12(%ebx)
+ je .L174
+ movl -56(%ebp), %eax
+ movl %edi, -16(%ebp)
+ movl 12(%ebx), %ebx
+ decl 596(%eax,%edx)
+ cmpl $4, %edi
+ leal -24(%ebx), %esi
+ jg .L175
+ cmpl $1, -28(%ebp)
+ je .L175
cmpl $0, page_group_by_mobility_disabled
# This is the section that
# actually calls move_freepages_block
# presumably with bad parameters in
# the bad compiler
- je .L287
-.L285:
- movl -48(%ebp), %ecx
+ je .L176
+.L175:
+ movl -28(%ebp), %ecx
movl %esi, %edx
- movl -40(%ebp), %eax
+ movl -56(%ebp), %eax
<SNIP>
Cutting the rest of the differences from __rmqueue and seeing what
move_freepages_block looks like
--- bad/move_freepages_block.s 2010-02-11 16:36:15.000000000 +0000
+++ good/move_freepages_block.s 2010-02-11 16:36:49.000000000 +0000
@@ -1,18 +1,15 @@
- .size split_page, .-split_page
- .type move_freepages_block, @function
move_freepages_block:
pushl %ebp
movl %esp, %ebp
pushl %edi
- movl %edx, %edi
pushl %esi
+ movl %ecx, %esi
pushl %ebx
+ movl %edx, %ecx
subl $16, %esp
movl %eax, -24(%ebp)
- movl %ecx, -28(%ebp)
- movl %edx, %ecx
- movl -24(%ebp), %esi
movl (%edx), %eax
+ movl -24(%ebp), %edi
shrl $25, %eax
sall $4, %eax
movl mem_section(%eax), %eax
@@ -21,23 +18,25 @@
sarl $5, %ecx
andl $-1024, %ecx
movl %ecx, %eax
+ movl %ecx, %ebx
shrl $17, %eax
sall $4, %eax
# This is looking up the page in
# the sparsemem section map.
# While there are differences, I
# don't see where it goes wrong
# although this is the most
# likely problem code
- movl mem_section(%eax), %ebx
- movl %ecx, %eax
- sall $5, %eax
- andl $-4, %ebx
- addl %eax, %ebx
- movl 1264(%esi), %eax
+ movl mem_section(%eax), %eax
+ sall $5, %ebx
+ andl $-4, %eax
+ leal (%eax,%ebx), %ebx
+ movl 1264(%edi), %eax
+ movl %edx, %edi
+ movl -24(%ebp), %edx
cmpl %eax, %ecx
cmovae %ebx, %edi
addl $1023, %ecx
- addl 1268(%esi), %eax
+ addl 1268(%edx), %eax
movl $0, -16(%ebp)
cmpl %eax, %ecx
- jae .L218
- leal 32736(%ebx), %eax
- movl %eax, -20(%ebp)
+ jae .L20
+ leal 32736(%ebx), %ecx
+ movl %ecx, -20(%ebp)
movl (%edi), %edx
movl 32736(%ebx), %eax
shrl $23, %edx
@@ -46,55 +45,61 @@
andl $3, %eax
imull $1280, %edx, %edx
imull $1280, %eax, %eax
- addl $contig_page_data, %edx
- addl $contig_page_data, %eax
cmpl %eax, %edx
- je .L235
+ jne .L21
+ sall $3, %esi
+ movl $0, -16(%ebp)
+ movl %esi, -28(%ebp)
+ jmp .L27
+.L21:
#APP
+# 775 "/misc/m1/kernel/linux-2.6.32.7/mm/page_alloc.c" 1
1: ud2
.pushsection __bug_table,"a"
2: .long 1b, .LC0
.word 775, 0
.org 2b+12
.popsection
+# 0 "" 2
#NO_APP
-.L222:
- jmp .L222
-.L223:
+.L23:
+ jmp .L23
+.L25:
movl (%edi), %eax
testl $262144, %eax
- jne .L226
+ jne .L24
addl $32, %edi
- jmp .L235
-.L226:
- leal 24(%edi), %ecx
+ jmp .L27
+.L24:
+ movl 12(%edi), %ebx
+ leal 24(%edi), %esi
movl 24(%edi), %edx
- movl 4(%ecx), %eax
- movl 12(%edi), %esi
- movl %eax, 4(%edx)
+ movl 28(%edi), %eax
movl %edx, (%eax)
- imull $44, %esi, %eax
+ movl %eax, 4(%edx)
+ imull $44, %ebx, %eax
movl $1048832, 24(%edi)
- movl -28(%ebp), %edx
- leal 544(%eax,%edx,8), %eax
- addl -24(%ebp), %eax
- movl 12(%eax), %edx
- leal 12(%eax), %ebx
- movl %ecx, 4(%edx)
+ movl -24(%ebp), %edx
+ addl -28(%ebp), %eax
+ leal 544(%edx,%eax), %ecx
+ movl 12(%ecx), %edx
movl %edx, 24(%edi)
- movl %ebx, 4(%ecx)
- movl %ecx, 12(%eax)
- movl %esi, %ecx
+ movl %esi, 4(%edx)
+ movl -24(%ebp), %edx
+ movl %esi, 12(%ecx)
+ movb %bl, %cl
+ leal 556(%edx,%eax), %eax
+ movl %eax, 28(%edi)
movl $32, %eax
sall %cl, %eax
addl %eax, %edi
movl $1, %eax
sall %cl, %eax
addl %eax, -16(%ebp)
-.L235:
+.L27:
cmpl -20(%ebp), %edi
- jbe .L223
-.L218:
+ jbe .L25
+.L20:
movl -16(%ebp), %eax
addl $16, %esp
popl %ebx
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-11 18:20 ` Mel Gorman
@ 2010-02-11 18:49 ` Linus Torvalds
2010-02-12 12:17 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2010-02-11 18:49 UTC (permalink / raw)
To: Mel Gorman
Cc: Tony Lill, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner
On Thu, 11 Feb 2010, Mel Gorman wrote:
>
> Tony posted the assembler files (KCFLAGS=-save-temps) from
> the broken and working compilers, a copy of which is available at
> http://www.csn.ul.ie/~mel/postings/bug-20100211/ . Have you any suggestions
> on what the best way to go about finding where the badly generated code
> might be so a warning can be added for gcc 4.1? My strongest suspicion is
> that the problem is in the assembler that looks up the struct page from a
> PFN in sparsemem but I'm failing to prove it.
Try contacting the gcc people. They are (well, _some_ of them are) much
more used to walking through asm differences, and may have more of a clue
about where the difference is likely to be for those compiler versions.
I'm personally very comfortable with x86 assembly, but having tried to
find compiler bugs in the past I can also say that despite my x86 comfort
I've almost always failed. The trivial stupid differences tend to always
just totally overwhelm the actual real difference that causes the bug.
One thing to try is to see if the buggy compiler version can be itself
triggered to create a non-buggy asm listing by using some compiler flag.
That way the "trivial differences" tend to be smaller, and the bug stands
out more.
For example, that's how we found the problem with "-fwrapv" - testing the
same compiler version with different flags (see commit a137802ee83).
Sometimes if the trivial differences are mostly register allocation, you
can get a "feel" for the differences by replacing all register names with
just the string "REG" (and "[0-9x](%e[sb]p)" with "STACKSLOT"), and try to
do the diff that way. If everything else is roughly the same, you then see
the place where the code is _really_ different.
But when the compiler actually re-orders basic blocks etc, then diffs are
basically impossible to get anything sane out of.
Linus
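A minimal sketch of the register-name normalisation described above, assuming Python 3 and assembler listings in the AT&T syntax shown earlier in this thread. The function names, the exact regular expressions, and the extra rewrite of compiler-generated labels (.L275 and similar) to LABEL are illustrative assumptions, not something posted in the thread. The stack-slot pattern is a tightened version of the loose "[0-9x](%e[sb]p)" suggestion.

```python
import difflib
import re

# Stack slots: an optional decimal or hex offset followed by (%ebp) or (%esp).
STACKSLOT_RE = re.compile(r"-?(?:0x[0-9a-fA-F]+|\d+)?\(%e[sb]p\)")
# 32-bit, 16-bit and 8-bit x86 general-purpose register names.
REG_RE = re.compile(r"%(?:e?(?:ax|bx|cx|dx|si|di|bp|sp)|[abcd][lh])\b")
# Compiler-generated local labels such as .L275 (an assumption added here,
# since label numbering also differs between the two listings).
LABEL_RE = re.compile(r"\.L\d+")


def normalize_line(line):
    """Replace stack slots, registers and local labels with placeholders."""
    # Stack slots must be rewritten first, otherwise %ebp/%esp would
    # already have been turned into REG.
    line = STACKSLOT_RE.sub("STACKSLOT", line)
    line = REG_RE.sub("REG", line)
    line = LABEL_RE.sub("LABEL", line)
    # Collapse whitespace so indentation differences do not show up.
    return " ".join(line.split())


def normalized_diff(bad_lines, good_lines):
    """Return a unified diff of the two listings after normalisation."""
    bad = [normalize_line(l) for l in bad_lines]
    good = [normalize_line(l) for l in good_lines]
    return list(difflib.unified_diff(bad, good, "bad", "good", lineterm=""))
```

To use it, read the bad and good .s files into lists of lines and print the result of normalized_diff(). Hunks that differ only in register allocation or stack layout disappear, so what is left is more likely to be a real difference in the generated code. As noted above, this does not help once the compiler has reordered basic blocks.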
* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
2010-02-11 18:49 ` Linus Torvalds
@ 2010-02-12 12:17 ` Mel Gorman
0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2010-02-12 12:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tony Lill, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
Johannes Weiner
On Thu, Feb 11, 2010 at 10:49:44AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 11 Feb 2010, Mel Gorman wrote:
> >
> > Tony posted the assembler files (KCFLAGS=-save-temps) from
> > the broken and working compilers, a copy of which is available at
> > http://www.csn.ul.ie/~mel/postings/bug-20100211/ . Have you any suggestions
> > on what the best way to go about finding where the badly generated code
> > might be so a warning can be added for gcc 4.1? My strongest suspicion is
> > that the problem is in the assembler that looks up the struct page from a
> > PFN in sparsemem but I'm failing to prove it.
>
> Try contacting the gcc people. They are (well, _some_ of them are) much
> more used to walking through asm differences, and may have more of a clue
> about where the difference is likely to be for those compiler versions.
>
Ok, thanks. Will get on to them if the other suggestions don't work out.
> I'm personally very comfortable with x86 assembly, but having tried to
> find compiler bugs in the past I can also say that despite my x86 comfort
> I've almost always failed. The trivial stupid differences tend to always
> just totally overwhelm the actual real difference that causes the bug.
>
I don't feel quite as bad then. I was hoping it would be "obvious" but
was getting tripped up by reordering and slightly-different ways of
achieving the same end result.
> One thing to try is to see if the buggy compiler version can be itself
> triggered to create a non-buggy asm listing by using some compiler flag.
> That way the "trivial differences" tend to be smaller, and the bug stands
> out more.
>
> For example, that's how we found the problem with "-fwrapv" - testing the
> same compiler version with different flags (see commit a137802ee83).
>
The compiler of interest is still available so I should be able to reproduce
the problem locally once I get an old distro installed.
> Sometimes if the trivial differences are mostly register allocation, you
> can get a "feel" for the differences by replacing all register names with
> just the string "REG" (and "[0-9x](%e[sb]p)" with "STACKSLOT"), and try to
> do the diff that way. If everything else is roughly the same, you then see
> the place where the code is _really_ different.
>
Will try this first, then installing an old distro, before resorting to
the gcc people. There is a good chance their response will be "go away"
once they realise it's fixed in later compilers.
> But when the compiler actually re-orders basic blocks etc, then diffs are
> basically impossible to get anything sane out of.
>
Thanks for the suggestions.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Thread overview: 9+ messages
[not found] <bug-15214-10286@http.bugzilla.kernel.org/>
2010-02-03 22:39 ` [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3 Andrew Morton
2010-02-05 11:20 ` Mel Gorman
2010-02-07 18:34 ` Tony Lill
2010-02-08 10:10 ` Mel Gorman
2010-02-08 19:18 ` Andrew Morton
2010-02-09 14:45 ` Mel Gorman
[not found] ` <201002101217.34131.ajlill@ajlc.waterloo.on.ca>
2010-02-11 18:20 ` Mel Gorman
2010-02-11 18:49 ` Linus Torvalds
2010-02-12 12:17 ` Mel Gorman