Re: [Bugme-new] [Bug 15214] New: Oops at _

linux-mm.kvack.org archive mirror

* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
       [not found] <bug-15214-10286@http.bugzilla.kernel.org/>
@ 2010-02-03 22:39 ` Andrew Morton
  2010-02-05 11:20   ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2010-02-03 22:39 UTC (permalink / raw)
  To: linux-mm; +Cc: bugzilla-daemon, bugme-daemon, Mel Gorman, Johannes Weiner,
	ajlill


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 3 Feb 2010 02:30:22 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=15214
> 
>            Summary: Oops at __rmqueue+0x51/0x2b3
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 2.6.32.7
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>         AssignedTo: akpm@linux-foundation.org
>         ReportedBy: ajlill@ajlc.waterloo.on.ca
>         Regression: Yes
> 
> 
> Created an attachment (id=24887)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=24887)
> .config file
> 
> I get an Oops when doing a lot of filesystem reads. The process, cfagent, is
> running through the filesystem checksumming files when it dies. It doesn't
> happen every time cfagent runs, but there's a pretty good chance it will.
> This problem happens on 2.6.31.* as well; 2.6.30.10 appears to be stable. It
> happens on two different computers, so it's unlikely to be hardware. Also, in
> 2.6.32.*, I get an Oops at
> 
>     BUG_ON(page_zone(start_page) != page_zone(end_page));
> 
> in move_freepages when I do sysctl -w vm.min_free_kbytes=16384
> 
> but I can only reliably reproduce it when I do the sysctl from the boot
> scripts, and I'm having trouble getting netconsole started beforehand to
> capture the full output.
> 
> gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> 
> Full text of Oops:
> 
> BUG: unable to handle kernel paging request at 6eae67fc
> IP: [<c0192a38>] __rmqueue+0x51/0x2b3
> *pdpt = 00000000351be001 *pde = 0000000000000000 
> Oops: 0002 [#1] SMP 
> last sysfs file: /sys/class/firmware/0000:00:0b.0/loading
> Modules linked in: netconsole af_packet autofs4 nfsd nfs lockd fscache nfs_acl
> auth_rpcgss sunrpc ipv6 nls_iso8859_1 nls_cp437 vfat fat xfs exportfs fuse
> configfs dm_snapshot dm_mirror dm_region_hash dm_log dm_mod eeprom w83781d
> hwmon_vid hwmon r128 drm tuner_simple tuner_types tuner msp3400 saa7115 button
> processor ivtv i2c_algo_bit cx2341x v4l2_common videodev psmouse parport_pc
> v4l1_compat rtc_cmos parport tveeprom i2c_piix4 rtc_core intel_agp serio_raw
> rtc_lib agpgart i2c_core shpchp pci_hotplug pcspkr evdev ext3 jbd mbcache raid1
> sg sr_mod sd_mod cdrom crc_t10dif ata_generic pata_acpi pata_pdc202xx_old
> ata_piix floppy e1000 uhci_hcd libata thermal fan unix [last unloaded:
> scsi_wait_scan]
> 
> Pid: 6629, comm: cfagent Not tainted (2.6.32.7 #1) System Name
> EIP: 0060:[<c0192a38>] EFLAGS: 00210002 CPU: 0
> EIP is at __rmqueue+0x51/0x2b3
> EAX: c146a018 EBX: 0000000a ECX: 6eae67f8 EDX: c050b654
> ESI: c050b644 EDI: 00200246 EBP: f51c9d1c ESP: f51c9cec
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process cfagent (pid: 6629, ti=f51c8000 task=f51b40b0 task.ti=f51c8000)
> Stack:
>  00000002 00000000 c050b260 00000001 f6ba8280 00200002 c0193c92 c019404e
> <0> c146a000 c1479ff8 c050b260 00200246 f51c9d78 c0193cd5 f51c9d7c 00000002
> <0> 00000000 00000000 000201da c050c16c 00000000 c050b280 00000001 0000001f
> Call Trace:
>  [<c0193c92>] ? get_page_from_freelist+0xdf/0x3a8
>  [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
>  [<c0193cd5>] ? get_page_from_freelist+0x122/0x3a8
>  [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
>  [<c01caa57>] ? _d_rehash+0x3c/0x40
>  [<c01961e3>] ? __do_page_cache_readahead+0x80/0x15b
>  [<c01cb95f>] ? __d_lookup+0xa1/0xd5
>  [<c01962d5>] ? ra_submit+0x17/0x1c
>  [<c01964e4>] ? ondemand_readahead+0x150/0x15c
>  [<c0196569>] ? page_cache_sync_readahead+0x16/0x1b
>  [<c0190def>] ? generic_file_aio_read+0x212/0x507
>  [<c01bd512>] ? do_sync_read+0xab/0xe9
>  [<c01a86f5>] ? mmap_region+0x25b/0x334
>  [<c014823f>] ? autoremove_wake_function+0x0/0x33
>  [<c020edd8>] ? security_file_permission+0xf/0x11
>  [<c01bd467>] ? do_sync_read+0x0/0xe9
>  [<c01bdc1d>] ? vfs_read+0x8a/0x13f
>  [<c01be026>] ? sys_read+0x3b/0x60
>  [<c010296f>] ? sysenter_do_call+0x12/0x27
> Code: 2c c1 e1 03 8d 94 30 20 02 00 00 e9 8a 00 00 00 8d 72 0c 8d 04 0e 39 00
> 74 7c 8b 55 d0 8b 04 d6 8d 48 e8 89 4d f0 8b 08 8b 50 04 <89> 51 04 89 0a c7 40
> 04 00 02 20 00 c7 00 00 01 10 00 0f ba 70 
> EIP: [<c0192a38>] __rmqueue+0x51/0x2b3 SS:ESP 0068:f51c9cec
> CR2: 000000006eae67fc
> ---[ end trace db0096b2091950d0 ]---
> 

Strange regression.  I'd be suspecting that we've mucked up the initial
mem_map, perhaps because of a wart in the e820 or acpi tables.

Or perhaps it's something else.
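
As a cross-check on the oops text quoted above, the faulting address can
be recomputed from the register dump. The sketch below is a reader's aid,
not kernel code: the function and macro names are made up for
illustration, and the decoding of the marked opcode bytes (89 51 04) as a
store of EDX to 4 bytes past ECX was done by hand. It also unpacks the
"Oops: 0002" error code using the standard x86 page-fault bits.

```c
#include <stdint.h>

/* Values copied from the oops text in the report. */
#define OOPS_ECX 0x6eae67f8u   /* register dump: ECX */
#define OOPS_CR2 0x6eae67fcu   /* faulting address reported in CR2 */

/* Low bits of the x86 page-fault error code ("Oops: 0002"). */
#define PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2u  /* 0: read access,      1: write access */
#define PF_USER  0x4u  /* 0: kernel mode,      1: user mode */

/*
 * The marked opcode bytes "<89> 51 04" decode to "mov %edx,0x4(%ecx)",
 * so the store targets ECX plus a 4-byte displacement.
 * (Hypothetical helper name, for illustration only.)
 */
uint32_t faulting_store_address(uint32_t ecx)
{
    return ecx + 4u;
}

/* Nonzero when the error code describes a kernel-mode write to a page
 * that is not present. (Hypothetical helper name.) */
int is_kernel_write_to_unmapped(uint32_t error_code)
{
    return (error_code & PF_WRITE) != 0 &&
           (error_code & PF_PROT) == 0 &&
           (error_code & PF_USER) == 0;
}
```

With the values from the report, ECX + 4 equals CR2, and error code 0002
reads as a kernel-mode write to a not-present page. That is consistent
with a store through a bad pointer held in ECX, although it says nothing
about how the pointer went bad.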

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-03 22:39 ` [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3 Andrew Morton
@ 2010-02-05 11:20   ` Mel Gorman
  2010-02-07 18:34     ` Tony Lill
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-05 11:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, bugzilla-daemon, bugme-daemon, Johannes Weiner, ajlill

On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Wed, 3 Feb 2010 02:30:22 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=15214
> > 
> >            Summary: Oops at __rmqueue+0x51/0x2b3
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 2.6.32.7
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Page Allocator
> >         AssignedTo: akpm@linux-foundation.org
> >         ReportedBy: ajlill@ajlc.waterloo.on.ca
> >         Regression: Yes
> > 
> > 
> > Created an attachment (id=24887)
> >  --> (http://bugzilla.kernel.org/attachment.cgi?id=24887)
> > .config file
> > 
> > I get an Oops when doing a lot of filesystem reads. The process, cfagent, is
> > running through the filesystem checksumming files when it dies. It doesn't
> > happen every time cfagent runs, but there's a pretty good chance it will.
> > This problem happens on 2.6.31.* as well; 2.6.30.10 appears to be stable. It
> > happens on two different computers, so it's unlikely to be hardware. Also, in
> > 2.6.32.*, I get an Oops at
> > 
> >     BUG_ON(page_zone(start_page) != page_zone(end_page));
> > 
> > in move_freepages when I do sysctl -w vm.min_free_kbytes=16384
> > 
> > but I can only reliably reproduce it when I do the sysctl from the boot
> > scripts, and I'm having trouble getting netconsole started beforehand to
> > capture the full output.
> > 

This point on sysctl is truly bizarre. It implies that the struct pages
have been corrupted in some fashion. Just before this check is made, we
do

        /* Do not cross zone boundaries */
        if (start_pfn < zone->zone_start_pfn)
                start_page = page;
        if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
                return 0;

        return move_freepages(zone, start_page, end_page, migratetype);

So, for that bug to be triggered, two pages between 
zone->zone_start_pfn and
zone->zone_start_pfn + zone->spanned_pages
have to have different results for page_zone(). That would be outright
wrong.
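
To make the argument concrete, here is a minimal userspace model of that
boundary handling. It is only a sketch: struct zone_model, block_range()
and the pageblock size of 1024 pages are assumptions for illustration,
and the rounding of the PFN down to its pageblock is assumed from the
surrounding kernel function rather than shown in the snippet above.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
    unsigned long zone_start_pfn;  /* first PFN spanned by the zone */
    unsigned long spanned_pages;   /* number of PFNs the zone spans */
};

/* Assumed pageblock size in pages; the real value depends on the config. */
#define PAGEBLOCK_NR_PAGES 1024UL

struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
    bool valid;  /* false models the early "return 0" */
};

/*
 * Round the PFN down to its pageblock, fall back to the given page if the
 * block begins before the zone, and give up if the block ends past it.
 */
struct pfn_range block_range(const struct zone_model *zone, unsigned long pfn)
{
    struct pfn_range r;

    r.start_pfn = pfn & ~(PAGEBLOCK_NR_PAGES - 1);
    r.end_pfn = r.start_pfn + PAGEBLOCK_NR_PAGES - 1;
    r.valid = true;

    /* Do not cross zone boundaries */
    if (r.start_pfn < zone->zone_start_pfn)
        r.start_pfn = pfn;
    if (r.end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
        r.valid = false;

    return r;
}

/* True when a PFN lies inside the zone's spanned range. */
bool pfn_in_zone(const struct zone_model *zone, unsigned long pfn)
{
    return pfn >= zone->zone_start_pfn &&
           pfn < zone->zone_start_pfn + zone->spanned_pages;
}
```

Provided the starting PFN is itself inside the zone, every range this
model accepts has both ends inside the zone, so differing page_zone()
results would have to come from somewhere other than the check itself.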

Ordinarily at this point, I would assume that your memory is bad with
small errors occurring. The early-in-boot problem might indicate that
there is a specific region of memory that is bust rather than something
like a power problem.

You said that the checksumming problem happens on two separate machines,
but can you confirm that the sysctl problem also happens on both, please?

> > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > 

This is a bit of a reach, but how confident are you that this version of
gcc is building kernels correctly?

There are a few disconnected reports of kernel problems with this
particular version of gcc although none that I can connect with this
problem or on x86 for that matter. One example is

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354

which reported problems building kernels on the s390 with that compiler.
Moving to 4.2 helped them and it *should* have been fixed according to
this bug

http://bugzilla.kernel.org/show_bug.cgi?id=13012

It might be a red herring, but would you mind trying
gcc 4.2 or 4.3 just to be sure, please?

> > Full text of Oops:
> > 
> > BUG: unable to handle kernel paging request at 6eae67fc
> > IP: [<c0192a38>] __rmqueue+0x51/0x2b3
> > *pdpt = 00000000351be001 *pde = 0000000000000000 
> > Oops: 0002 [#1] SMP 
> > last sysfs file: /sys/class/firmware/0000:00:0b.0/loading
> > Modules linked in: netconsole af_packet autofs4 nfsd nfs lockd fscache nfs_acl
> > auth_rpcgss sunrpc ipv6 nls_iso8859_1 nls_cp437 vfat fat xfs exportfs fuse
> > configfs dm_snapshot dm_mirror dm_region_hash dm_log dm_mod eeprom w83781d
> > hwmon_vid hwmon r128 drm tuner_simple tuner_types tuner msp3400 saa7115 button
> > processor ivtv i2c_algo_bit cx2341x v4l2_common videodev psmouse parport_pc
> > v4l1_compat rtc_cmos parport tveeprom i2c_piix4 rtc_core intel_agp serio_raw
> > rtc_lib agpgart i2c_core shpchp pci_hotplug pcspkr evdev ext3 jbd mbcache raid1
> > sg sr_mod sd_mod cdrom crc_t10dif ata_generic pata_acpi pata_pdc202xx_old
> > ata_piix floppy e1000 uhci_hcd libata thermal fan unix [last unloaded:
> > scsi_wait_scan]
> > 
> > Pid: 6629, comm: cfagent Not tainted (2.6.32.7 #1) System Name
> > EIP: 0060:[<c0192a38>] EFLAGS: 00210002 CPU: 0
> > EIP is at __rmqueue+0x51/0x2b3

What line does addr2line say c0192a38 corresponds to?

> > EAX: c146a018 EBX: 0000000a ECX: 6eae67f8 EDX: c050b654
> > ESI: c050b644 EDI: 00200246 EBP: f51c9d1c ESP: f51c9cec
> >  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > Process cfagent (pid: 6629, ti=f51c8000 task=f51b40b0 task.ti=f51c8000)
> > Stack:
> >  00000002 00000000 c050b260 00000001 f6ba8280 00200002 c0193c92 c019404e
> > <0> c146a000 c1479ff8 c050b260 00200246 f51c9d78 c0193cd5 f51c9d7c 00000002
> > <0> 00000000 00000000 000201da c050c16c 00000000 c050b280 00000001 0000001f
> > Call Trace:
> >  [<c0193c92>] ? get_page_from_freelist+0xdf/0x3a8
> >  [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
> >  [<c0193cd5>] ? get_page_from_freelist+0x122/0x3a8
> >  [<c019404e>] ? __alloc_pages_nodemask+0xdd/0x481
> >  [<c01caa57>] ? _d_rehash+0x3c/0x40
> >  [<c01961e3>] ? __do_page_cache_readahead+0x80/0x15b
> >  [<c01cb95f>] ? __d_lookup+0xa1/0xd5
> >  [<c01962d5>] ? ra_submit+0x17/0x1c
> >  [<c01964e4>] ? ondemand_readahead+0x150/0x15c
> >  [<c0196569>] ? page_cache_sync_readahead+0x16/0x1b
> >  [<c0190def>] ? generic_file_aio_read+0x212/0x507
> >  [<c01bd512>] ? do_sync_read+0xab/0xe9
> >  [<c01a86f5>] ? mmap_region+0x25b/0x334
> >  [<c014823f>] ? autoremove_wake_function+0x0/0x33
> >  [<c020edd8>] ? security_file_permission+0xf/0x11
> >  [<c01bd467>] ? do_sync_read+0x0/0xe9
> >  [<c01bdc1d>] ? vfs_read+0x8a/0x13f
> >  [<c01be026>] ? sys_read+0x3b/0x60
> >  [<c010296f>] ? sysenter_do_call+0x12/0x27
> > Code: 2c c1 e1 03 8d 94 30 20 02 00 00 e9 8a 00 00 00 8d 72 0c 8d 04 0e 39 00
> > 74 7c 8b 55 d0 8b 04 d6 8d 48 e8 89 4d f0 8b 08 8b 50 04 <89> 51 04 89 0a c7 40
> > 04 00 02 20 00 c7 00 00 01 10 00 0f ba 70 
> > EIP: [<c0192a38>] __rmqueue+0x51/0x2b3 SS:ESP 0068:f51c9cec
> > CR2: 000000006eae67fc
> > ---[ end trace db0096b2091950d0 ]---
> > 
> 
> Strange regression.  I'd be suspecting that we've mucked up the initial
> mem_map, perhaps because of a wart in the e820 or acpi tables.
> 
> Or perhaps it's something else.
> 

Let's see what the early boot looked like.

Tony, would you mind booting with "mminit_loglevel=4 loglevel=9" and sending
the full dmesg please?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-05 11:20   ` Mel Gorman
@ 2010-02-07 18:34     ` Tony Lill
  2010-02-08 10:10       ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Tony Lill @ 2010-02-07 18:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner

On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> 
> This is a bit of a reach, but how confident are you that this version of
> gcc is building kernels correctly?
>
> There are a few disconnected reports of kernel problems with this
> particular version of gcc although none that I can connect with this
> problem or on x86 for that matter. One example is
> 
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> 
> which reported problems building kernels on the s390 with that compiler.
> Moving to 4.2 helped them and it *should* have been fixed according to
> this bug
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=13012
> 
> It might be a red herring, but just to be sure, would you mind trying
> gcc 4.2 or 4.3 just to be sure please?

Well, it was producing working kernels up until 2.6.30, but I recompiled with
gcc (Debian 4.3.2-1.1) 4.3.2
and the box has been running nearly 48 hours without incident. My previous
record was 2. So I guess we can put this down to a new compiler bug.

I probably should have checked this before reporting a bug. Mea culpa
-- 
Tony Lill,                         Tony.Lill@AJLC.Waterloo.ON.CA
President, A. J. Lill Consultants                 (519) 650 0660
539 Grand Valley Dr., Cambridge, Ont. N3H 2S2     (519) 241 2461
--------------- http://www.ajlc.waterloo.on.ca/ ----------------




* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-07 18:34     ` Tony Lill
@ 2010-02-08 10:10       ` Mel Gorman
  2010-02-08 19:18         ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-08 10:10 UTC (permalink / raw)
  To: Tony Lill
  Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner

On Sun, Feb 07, 2010 at 01:34:58PM -0500, Tony Lill wrote:
> On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> > On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > 
> > This is a bit of a reach, but how confident are you that this version of
> > gcc is building kernels correctly?
> >
> > There are a few disconnected reports of kernel problems with this
> > particular version of gcc although none that I can connect with this
> > problem or on x86 for that matter. One example is
> > 
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> > 
> > which reported problems building kernels on the s390 with that compiler.
> > Moving to 4.2 helped them and it *should* have been fixed according to
> > this bug
> > 
> > http://bugzilla.kernel.org/show_bug.cgi?id=13012
> > 
> > It might be a red herring, but just to be sure, would you mind trying
> > gcc 4.2 or 4.3 just to be sure please?
> 
> Well, it was producing working kernels up until 2.6.30, but I recompiled with
> gcc (Debian 4.3.2-1.1) 4.3.2
> and the box has been running nearly 48 hour without incident. My previous 
> record was 2. So I guess we can put this down to a new compiler bug.
> 

Well, it's great the problem source is known but pinning down compiler bugs
is a bit of a pain. Andrew, I don't recall an easy-as-in-bisection-easy
means of identifying which part of the compile unit went wrong and why so
it can be marked with #error for known broken compilers. Is there one or is
it a case of asking for two objdumps of __rmqueue and making a stab at it?

> I probably should have checked this before reporting a bug. Mea culpa

Not at all. Miscompiles like this are rare and usually caught a lot quicker
than this. If you hadn't reported the problem with two different machines,
I would have blamed hardware and asked for a memtest. The only reason I
suspected the compiler was that the type of error you reported
"couldn't happen".

Thanks for reporting and testing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-08 10:10       ` Mel Gorman
@ 2010-02-08 19:18         ` Andrew Morton
  2010-02-09 14:45           ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2010-02-08 19:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tony Lill, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner

On Mon, 8 Feb 2010 10:10:46 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Sun, Feb 07, 2010 at 01:34:58PM -0500, Tony Lill wrote:
> > On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> > > On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > > 
> > > This is a bit of a reach, but how confident are you that this version of
> > > gcc is building kernels correctly?
> > >
> > > There are a few disconnected reports of kernel problems with this
> > > particular version of gcc although none that I can connect with this
> > > problem or on x86 for that matter. One example is
> > > 
> > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> > > 
> > > which reported problems building kernels on the s390 with that compiler.
> > > Moving to 4.2 helped them and it *should* have been fixed according to
> > > this bug
> > > 
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13012
> > > 
> > > It might be a red herring, but just to be sure, would you mind trying
> > > gcc 4.2 or 4.3 just to be sure please?
> > 
> > Well, it was producing working kernels up until 2.6.30, but I recompiled with
> > gcc (Debian 4.3.2-1.1) 4.3.2
> > and the box has been running nearly 48 hour without incident. My previous 
> > record was 2. So I guess we can put this down to a new compiler bug.
> > 
> 
> Well, it's great the problem source is known but pinning down compiler bugs
> is a bit of a pain. Andrew, I don't recall an easy-as-in-bisection-easy
> means of identifying which part of the compile unit went wrong and why so
> it can be marked with #error for known broken compilers. Is there one or is
> it a case of asking for two objdumps of __rmqueue and making a stab at it?

ugh.  This is pretty rare.

Probably the best strategy is to generate the two page_alloc.s files,
fish out the __rmqueue part and then try to compare them.  The key
part is to Cc Linus then thrash around stupidly for long enough for him
to take pity and find the bug for you.

> > I probably should have checked this before reporting a bug. Mea culpa
> 
> Not at all. Miscompiles like this are rare and usually caught a lot quicker
> than this. If you hadn't reported the problem with  two different machines,
> I would have blamed hardware and asked for a memtest. The only reason I
> spotted this might be a compiler was because the type of error you reported
> "couldn't happen".
> 
> Thanks for reporting and testing.

Yup.


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-08 19:18         ` Andrew Morton
@ 2010-02-09 14:45           ` Mel Gorman
       [not found]             ` <201002101217.34131.ajlill@ajlc.waterloo.on.ca>
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-09 14:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Lill, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner

On Mon, Feb 08, 2010 at 11:18:52AM -0800, Andrew Morton wrote:
> On Mon, 8 Feb 2010 10:10:46 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On Sun, Feb 07, 2010 at 01:34:58PM -0500, Tony Lill wrote:
> > > On Friday 05 February 2010 06:20:00 Mel Gorman wrote:
> > > > On Wed, Feb 03, 2010 at 02:39:21PM -0800, Andrew Morton wrote:
> > > > > > gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
> > > > 
> > > > This is a bit of a reach, but how confident are you that this version of
> > > > gcc is building kernels correctly?
> > > >
> > > > There are a few disconnected reports of kernel problems with this
> > > > particular version of gcc although none that I can connect with this
> > > > problem or on x86 for that matter. One example is
> > > > 
> > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=536354
> > > > 
> > > > which reported problems building kernels on the s390 with that compiler.
> > > > Moving to 4.2 helped them and it *should* have been fixed according to
> > > > this bug
> > > > 
> > > > http://bugzilla.kernel.org/show_bug.cgi?id=13012
> > > > 
> > > > It might be a red herring, but just to be sure, would you mind trying
> > > > gcc 4.2 or 4.3 just to be sure please?
> > > 
> > > Well, it was producing working kernels up until 2.6.30, but I recompiled with
> > > gcc (Debian 4.3.2-1.1) 4.3.2
> > > and the box has been running nearly 48 hour without incident. My previous 
> > > record was 2. So I guess we can put this down to a new compiler bug.
> > > 
> > 
> > Well, it's great the problem source is known but pinning down compiler bugs
> > is a bit of a pain. Andrew, I don't recall an easy-as-in-bisection-easy
> > means of identifying which part of the compile unit went wrong and why so
> > it can be marked with #error for known broken compilers. Is there one or is
> > it a case of asking for two objdumps of __rmqueue and making a stab at it?
> 
> ugh.  This is pretty rare.
> 

Indeed. It does appear to be the case here and it's not the first bug
related to gcc 4.1 and the kernel, judging from search results on Google.

> Probably the best strategy is to generate the two page_alloc.s files,
> fish out the __rmqueue part and then try to compare them.  The key
> part is to Cc Linus then thrash around stupidly for long enough for him
> to take pity and find the bug for you.
> 

Ok, step 1 then before I do the Team America Super Secret Signal to
Linus for help.

Tony, can you generate the .s files for me please? It should be a case
of

make clean
rm *.s
make CC=gcc-$BAD_VERSION KCFLAGS=-save-temps mm
tar -czf kernel-s-files-bad-compiler.tar.gz .config *.s mm/*.c mm/*.h mm/Makefile mm/Kconfig

make clean
rm *.s
make CC=gcc-$GOOD_VERSION KCFLAGS=-save-temps mm
tar -czf kernel-s-files-good-compiler.tar.gz .config *.s mm/*.c mm/*.h mm/Makefile mm/Kconfig

where $BAD_VERSION and $GOOD_VERSION are the two compiler versions and
then post the two tarballs. They should contain what is needed.

Thanks

> > > I probably should have checked this before reporting a bug. Mea culpa
> > 
> > Not at all. Miscompiles like this are rare and usually caught a lot quicker
> > than this. If you hadn't reported the problem with  two different machines,
> > I would have blamed hardware and asked for a memtest. The only reason I
> > spotted this might be a compiler was because the type of error you reported
> > "couldn't happen".
> > 
> > Thanks for reporting and testing.
> 
> Yup.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
       [not found]             ` <201002101217.34131.ajlill@ajlc.waterloo.on.ca>
@ 2010-02-11 18:20               ` Mel Gorman
  2010-02-11 18:49                 ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2010-02-11 18:20 UTC (permalink / raw)
  To: Tony Lill
  Cc: Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner, Linus Torvalds

On Wed, Feb 10, 2010 at 12:17:28PM -0500, Tony Lill wrote:
> On Tuesday 09 February 2010 09:45:38 Mel Gorman wrote:
> 
> > Tony, can you generate the .s files for me please?  
> 
> Here they are

Thanks Tony,

I made a diff of the assembler generated for __rmqueue and move_freepages_block
and wrote up some notes below. The problem is that I cannot spot where the
bad assembler is or why it causes bad references. At least, I cannot see
what sequence of events has to happen for the BUG_ON check to be triggered.

(adds Linus to cc)

Linus, the background to this bug
(http://bugzilla.kernel.org/show_bug.cgi?id=15214) is bad memory references in
2.6.31.* and a BUG_ON being triggered in mm/page_alloc.c#move_freepages_block()
in 2.6.32* with no problems in 2.6.30.10. The BUG_ON "shouldn't" happen as
discussed in http://marc.info/?l=linux-mm&m=126536882627752&w=2 so a suggestion
was made that the compiler might be at fault. Using a newer compiler was
reported to work fine at http://marc.info/?l=linux-mm&m=126556778403048&w=2

broken compiler:  gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
working compiler: gcc (Debian 4.3.2-1.1) 4.3.2

Tony posted the assembler files (KCFLAGS=-save-temps) from
the broken and working compilers, a copy of which is available at
http://www.csn.ul.ie/~mel/postings/bug-20100211/ . Have you any suggestions
on the best way to go about finding where the badly generated code
might be so a warning can be added for gcc 4.1?  My strongest suspicion is
that the problem is in the assembler that looks up the struct page from a
PFN in sparsemem but I'm failing to prove it.
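
For readers unfamiliar with the lookup in question, the following is a
much simplified userspace model of a sparsemem-style PFN to struct page
translation. All names and sizes here are hypothetical. The real kernel
stores a biased mem_map pointer per section and does no NULL checking in
this path, so treat this purely as an illustration of the two-step
indexing, not as the code gcc was compiling.

```c
#include <stddef.h>

/* Assumed for illustration only: 16 pages per section, 4 sections. */
#define MODEL_PFN_SECTION_SHIFT 4
#define MODEL_PAGES_PER_SECTION (1UL << MODEL_PFN_SECTION_SHIFT)
#define MODEL_NR_SECTIONS       4UL

/* Hypothetical stand-in for struct page. */
struct page_model {
    unsigned long flags;
};

/* Hypothetical stand-in for a memory section descriptor. */
struct section_model {
    struct page_model *map;  /* per-section page array, NULL if absent */
};

/*
 * Step 1: the high bits of the PFN select the section.
 * Step 2: the low bits index into that section's page array.
 * Returns NULL for a PFN with no backing section (a model-only nicety).
 */
struct page_model *model_pfn_to_page(const struct section_model *sections,
                                     unsigned long pfn)
{
    unsigned long nr = pfn >> MODEL_PFN_SECTION_SHIFT;

    if (nr >= MODEL_NR_SECTIONS || sections[nr].map == NULL)
        return NULL;

    return sections[nr].map + (pfn & (MODEL_PAGES_PER_SECTION - 1));
}
```

A wrong section index or a wrong offset in either step would hand back a
struct page belonging to a different part of memory, which is the kind of
result that could make two pages in one zone disagree about page_zone().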

--- bad/__rmqueue.s	2010-02-11 14:48:10.000000000 +0000
+++ good/__rmqueue.s	2010-02-11 14:48:46.000000000 +0000
@@ -1,278 +1,315 @@
 __rmqueue:
+.L163:
 	pushl	%ebp
 	movl	%esp, %ebp
 	pushl	%edi
 	pushl	%esi
 	pushl	%ebx

					# Looks like normal entry stuff for
					# __rmqueue. There are differences
					# in the stack usage between the
					# compilers
-	subl	$36, %esp
-	movl	%eax, -40(%ebp)
-	movl	%edx, -44(%ebp)
-	movl	%ecx, -48(%ebp)
-	jmp	.L265
-.L266:
-	movl	$3, -48(%ebp)
-.L265:
-	movl	-44(%ebp), %ebx
-	movl	-48(%ebp), %ecx
-	movl	-40(%ebp), %esi
-	imull	$44, %ebx, %eax
-	sall	$3, %ecx
-	leal	544(%eax,%esi), %edx
-	jmp	.L267

					# This looks like __rmqueue_smallest
					# Main loop. There are significant
					# differences in the compiled code but
					# I couldn't spot anything obviously
					# wrong
-.L268:
-	leal	12(%edx), %esi
-	leal	(%esi,%ecx), %eax
-	cmpl	%eax, (%eax)
-	je	.L269
-	movl	-48(%ebp), %edx
-	movl	(%esi,%edx,8), %eax
-	leal	-24(%eax), %ecx
-	movl	%ecx, -16(%ebp)
-	movl	(%eax), %ecx
-	movl	4(%eax), %edx
-	movl	%edx, 4(%ecx)
-	movl	%ecx, (%edx)
-	movl	$2097664, 4(%eax)
-	movl	$1048832, (%eax)
+	subl	$76, %esp
+	movl	%eax, -56(%ebp)
+	imull	$44, %edx, %eax
+	movl	%edx, -60(%ebp)
+	movl	%eax, -84(%ebp)
+.L188:
+	movl	%ecx, -28(%ebp)
+.L187:
+	movl	-28(%ebp), %edx
+	movl	-84(%ebp), %ecx
+	movl	-56(%ebp), %ebx
+	movl	-60(%ebp), %esi
+	sall	$3, %edx
+	leal	544(%edx,%ecx), %eax
+	leal	12(%ebx,%eax), %ecx
+	movl	%esi, -24(%ebp)
+	movl	%edx, -80(%ebp)
+	jmp	.L164
+.L171:
+	imull	$44, -24(%ebp), %edi
+	movl	-56(%ebp), %eax
+	movl	-80(%ebp), %ebx
+	movl	-56(%ebp), %esi
+	leal	544(%eax,%edi), %eax
+	movl	%eax, -64(%ebp)
+	addl	$12, %eax
+	movl	%eax, -48(%ebp)
+	movl	(%ecx), %eax
+	leal	544(%edi,%ebx), %edx
+	leal	12(%esi,%edx), %edx
+	addl	$44, %ecx
+	cmpl	%edx, %eax
+	je	.L165
+	movl	-28(%ebp), %ebx
+	sall	$3, %ebx
+	leal	(%ebx,%edi), %eax
+	movl	556(%esi,%eax), %ecx
+	leal	-24(%ecx), %esi
+	movl	24(%esi), %edx
+	movl	28(%esi), %eax
+	movl	%eax, 4(%edx)
+	movl	%edx, (%eax)
+	movl	$2097664, 28(%esi)
+	movl	$1048832, 24(%esi)
 #APP
-	btr $18,-24(%eax)
+# 127 "/misc/m1/kernel/linux-2.6.32.7/arch/x86/include/asm/bitops.h" 1
+	btr $18,-24(%ecx)
+# 0 "" 2
 #NO_APP

					# More of __rmqueue_smallest I
					# think it's roughly in the
					# following place in the code.
					# The main differences appear to
					# be in how registers are used
					# and the stack is laid out.
					# Again, I can't actually see
					# anything wrong as such

static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                                                int migratetype)
{
....
                rmv_page_order(page);
                area->nr_free--;
....
}

					# rmv_page_order() ?
-	movl	-16(%ebp), %eax
-	movb	%bl, %cl
-	movl	%ebx, %edi
-	movl	$0, 12(%eax)
					# area->nr_free--
-	decl	40(%esi)
-	movl	$1, -36(%ebp)
-	sall	%cl, -36(%ebp)
-	jmp	.L271
-.L272:
-	shrl	-36(%ebp)
-	movl	-16(%ebp), %ebx
-	movl	-36(%ebp), %eax
+	movl	$0, 12(%esi)
+	movl	-56(%ebp), %eax
+	decl	596(%eax,%edi)
+	movl	-64(%ebp), %edx
+	movl	-80(%ebp), %eax
+	movl	-24(%ebp), %edi
+	movl	$1, -52(%ebp)
+	movl	%ebx, -76(%ebp)
+	leal	-32(%edx,%eax), %eax
+	movl	%edi, %ecx
+	movl	%eax, -32(%ebp)
+	sall	%cl, -52(%ebp)
+	jmp	.L166
+.L169:
+	shrl	-52(%ebp)
+	movl	-52(%ebp), %eax
 	sall	$5, %eax
-	addl	%eax, %ebx
-	movl	-40(%ebp), %eax
+	leal	(%esi,%eax), %ebx
+	movl	-56(%ebp), %eax
 	movl	%ebx, %edx
 	call	bad_range
 	testl	%eax, %eax
-	je	.L273
+	je	.L167
 #APP
					# Think the following
					# means we are looking
					# in expand() which is
					# at line 665
+# 665 "/misc/m1/kernel/linux-2.6.32.7/mm/page_alloc.c" 1
 	1:	ud2
 .pushsection __bug_table,"a"
 2:	.long 1b, .LC0
 	.word 665, 0
 	.org 2b+12
 .popsection
+# 0 "" 2
 #NO_APP

-.L275:
-	jmp	.L275
-.L273:
-	movl	-48(%ebp), %edx
-	subl	$44, %esi
-	decl	%edi
-	leal	(%esi,%edx,8), %eax
-	movl	(%eax), %ecx
+.L168:
+	jmp	.L168
+.L167:
+	movl	-32(%ebp), %ecx
 	leal	24(%ebx), %edx
-	movl	%edx, 4(%ecx)
-	movl	%ecx, 24(%ebx)
-	movl	%eax, 4(%edx)
-	movl	%edx, (%eax)
-	incl	40(%esi)
+	decl	%edi
+	subl	$44, -48(%ebp)
+	movl	(%ecx), %eax
+	movl	%edx, 4(%eax)
+	movl	%eax, 24(%ebx)
+	movl	-48(%ebp), %eax
+	addl	-76(%ebp), %eax
+	movl	%edx, (%ecx)
+	movl	%eax, 28(%ebx)
+	movl	-48(%ebp), %eax
+	incl	40(%eax)
 	movl	%edi, 12(%ebx)
 #APP
+# 84 "/misc/m1/kernel/linux-2.6.32.7/arch/x86/include/asm/bitops.h" 1
 	bts $18,(%ebx)
+# 0 "" 2
 #NO_APP

				# Looks like more of expand. Think the
				# cmpl and jgs towards the end of this
				# section for the bad compiler are the
				# while (high > low) {}
				# part.
				#
-.L271:
-	cmpl	-44(%ebp), %edi
-	jg	.L272
-	jmp	.L316
-.L269:
-	incl	%ebx
-	addl	$44, %edx
-.L267:
-	cmpl	$10, %ebx
-	jbe	.L268
-	movl	$0, -16(%ebp)
-	jmp	.L310
-.L316:
-	cmpl	$0, -16(%ebp)
-	jne	.L278
-.L310:
-	cmpl	$3, -48(%ebp)
-	je	.L278
-	movl	$10, -32(%ebp)
-	jmp	.L280
-.L281:
-	movl	(%ecx), %ebx
-	cmpl	$3, %ebx
-	movl	%ebx, -20(%ebp)
-	je	.L282
-	imull	$44, -32(%ebp), %eax
-	movl	-40(%ebp), %esi
-	leal	556(%eax,%esi), %eax
-	movl	%eax, -28(%ebp)
-	leal	(%eax,%ebx,8), %eax
-	cmpl	%eax, (%eax)
-	je	.L282
-	movl	-32(%ebp), %eax
-	movl	-28(%ebp), %edx
-	movl	%eax, -24(%ebp)
-	movl	(%edx,%ebx,8), %esi
-	subl	$24, %esi
-	movl	%esi, -16(%ebp)
-	decl	40(%edx)
-	cmpl	$4, -32(%ebp)
-	jg	.L285
-	cmpl	$1, -48(%ebp)
-	je	.L285
+	subl	$44, %ecx
+	movl	%ecx, -32(%ebp)
+.L166:
+	cmpl	-60(%ebp), %edi
+	jg	.L169
+	jmp	.L202
+.L165:
+	incl	-24(%ebp)
+.L164:
+	cmpl	$10, -24(%ebp)
+	jbe	.L171
+	xorl	%esi, %esi
+	jmp	.L199
+.L202:
+	testl	%esi, %esi
+	jne	.L172
+.L199:
+	cmpl	$3, -28(%ebp)
+	je	.L172
+	movl	-28(%ebp), %eax
+	movl	$10, %edi
+	sall	$4, %eax
+	addl	$fallbacks, %eax
+	movl	%eax, -72(%ebp)
+	movl	-28(%ebp), %eax
+	sall	$4, %eax
+	addl	$fallbacks+16, %eax
+	movl	%eax, -88(%ebp)
+	jmp	.L173
+.L185:
+	movl	(%ecx), %edx
+	cmpl	$3, %edx
+	movl	%edx, -20(%ebp)
+	je	.L174
+	imull	$44, %edi, %edx
+	movl	-20(%ebp), %ebx
+	movl	-56(%ebp), %esi
+	leal	(%edx,%ebx,8), %eax
+	leal	544(%esi,%eax), %ebx
+	leal	556(%esi,%eax), %eax
+	cmpl	%eax, 12(%ebx)
+	je	.L174
+	movl	-56(%ebp), %eax
+	movl	%edi, -16(%ebp)
+	movl	12(%ebx), %ebx
+	decl	596(%eax,%edx)
+	cmpl	$4, %edi
+	leal	-24(%ebx), %esi
+	jg	.L175
+	cmpl	$1, -28(%ebp)
+	je	.L175
 	cmpl	$0, page_group_by_mobility_disabled


					# This is the section that
					# actually calls move_freepages_block
					# presumably with bad parameters in
					# the bad compiler
-	je	.L287
-.L285:
-	movl	-48(%ebp), %ecx
+	je	.L176
+.L175:
+	movl	-28(%ebp), %ecx
 	movl	%esi, %edx
-	movl	-40(%ebp), %eax
+	movl	-56(%ebp), %eax

<SNIP>

Cutting the rest of the differences from __rmqueue and seeing what
move_freepages_block looks like


--- bad/move_freepages_block.s	2010-02-11 16:36:15.000000000 +0000
+++ good/move_freepages_block.s	2010-02-11 16:36:49.000000000 +0000
@@ -1,18 +1,15 @@
-	.size	split_page, .-split_page
-	.type	move_freepages_block, @function
 move_freepages_block:
 	pushl	%ebp
 	movl	%esp, %ebp
 	pushl	%edi
-	movl	%edx, %edi
 	pushl	%esi
+	movl	%ecx, %esi
 	pushl	%ebx
+	movl	%edx, %ecx
 	subl	$16, %esp
 	movl	%eax, -24(%ebp)
-	movl	%ecx, -28(%ebp)
-	movl	%edx, %ecx
-	movl	-24(%ebp), %esi
 	movl	(%edx), %eax
+	movl	-24(%ebp), %edi
 	shrl	$25, %eax
 	sall	$4, %eax
 	movl	mem_section(%eax), %eax
@@ -21,23 +18,25 @@
 	sarl	$5, %ecx
 	andl	$-1024, %ecx
 	movl	%ecx, %eax
+	movl	%ecx, %ebx
 	shrl	$17, %eax
 	sall	$4, %eax

					# This is looking up the page in
					# the sparsemem section map.
					# While there are differences, I
					# don't see where it goes wrong
					# although this is the most
					# likely problem code
-	movl	mem_section(%eax), %ebx
-	movl	%ecx, %eax
-	sall	$5, %eax
-	andl	$-4, %ebx
-	addl	%eax, %ebx
-	movl	1264(%esi), %eax
+	movl	mem_section(%eax), %eax
+	sall	$5, %ebx
+	andl	$-4, %eax
+	leal	(%eax,%ebx), %ebx
+	movl	1264(%edi), %eax
+	movl	%edx, %edi
+	movl	-24(%ebp), %edx
 	cmpl	%eax, %ecx
 	cmovae	%ebx, %edi
 	addl	$1023, %ecx
-	addl	1268(%esi), %eax
+	addl	1268(%edx), %eax
 	movl	$0, -16(%ebp)
 	cmpl	%eax, %ecx
-	jae	.L218
-	leal	32736(%ebx), %eax
-	movl	%eax, -20(%ebp)
+	jae	.L20
+	leal	32736(%ebx), %ecx
+	movl	%ecx, -20(%ebp)
 	movl	(%edi), %edx
 	movl	32736(%ebx), %eax
 	shrl	$23, %edx
@@ -46,55 +45,61 @@
 	andl	$3, %eax
 	imull	$1280, %edx, %edx
 	imull	$1280, %eax, %eax
-	addl	$contig_page_data, %edx
-	addl	$contig_page_data, %eax
 	cmpl	%eax, %edx
-	je	.L235
+	jne	.L21
+	sall	$3, %esi
+	movl	$0, -16(%ebp)
+	movl	%esi, -28(%ebp)
+	jmp	.L27
+.L21:
 #APP
+# 775 "/misc/m1/kernel/linux-2.6.32.7/mm/page_alloc.c" 1
 	1:	ud2
 .pushsection __bug_table,"a"
 2:	.long 1b, .LC0
 	.word 775, 0
 	.org 2b+12
 .popsection
+# 0 "" 2
 #NO_APP
-.L222:
-	jmp	.L222
-.L223:
+.L23:
+	jmp	.L23
+.L25:
 	movl	(%edi), %eax
 	testl	$262144, %eax
-	jne	.L226
+	jne	.L24
 	addl	$32, %edi
-	jmp	.L235
-.L226:
-	leal	24(%edi), %ecx
+	jmp	.L27
+.L24:
+	movl	12(%edi), %ebx
+	leal	24(%edi), %esi
 	movl	24(%edi), %edx
-	movl	4(%ecx), %eax
-	movl	12(%edi), %esi
-	movl	%eax, 4(%edx)
+	movl	28(%edi), %eax
 	movl	%edx, (%eax)
-	imull	$44, %esi, %eax
+	movl	%eax, 4(%edx)
+	imull	$44, %ebx, %eax
 	movl	$1048832, 24(%edi)
-	movl	-28(%ebp), %edx
-	leal	544(%eax,%edx,8), %eax
-	addl	-24(%ebp), %eax
-	movl	12(%eax), %edx
-	leal	12(%eax), %ebx
-	movl	%ecx, 4(%edx)
+	movl	-24(%ebp), %edx
+	addl	-28(%ebp), %eax
+	leal	544(%edx,%eax), %ecx
+	movl	12(%ecx), %edx
 	movl	%edx, 24(%edi)
-	movl	%ebx, 4(%ecx)
-	movl	%ecx, 12(%eax)
-	movl	%esi, %ecx
+	movl	%esi, 4(%edx)
+	movl	-24(%ebp), %edx
+	movl	%esi, 12(%ecx)
+	movb	%bl, %cl
+	leal	556(%edx,%eax), %eax
+	movl	%eax, 28(%edi)
 	movl	$32, %eax
 	sall	%cl, %eax
 	addl	%eax, %edi
 	movl	$1, %eax
 	sall	%cl, %eax
 	addl	%eax, -16(%ebp)
-.L235:
+.L27:
 	cmpl	-20(%ebp), %edi
-	jbe	.L223
-.L218:
+	jbe	.L25
+.L20:
 	movl	-16(%ebp), %eax
 	addl	$16, %esp
 	popl	%ebx
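
For anyone trying to follow the sparsemem lookup flagged in the comments
above, here is a minimal sketch of the arithmetic the listing appears to
perform, written as a user-space model. The constants are read off the
immediates in the assembly (shrl $17, sall $5, andl $-4, andl $-1024); the
function names and the dictionary standing in for mem_section[] are
illustrative assumptions, not kernel code.

```python
# Constants inferred from the immediates in the listing (32-bit build).
PFN_SECTION_SHIFT = 17      # shrl $17: pfn -> section number
STRUCT_PAGE_SIZE = 32       # sall $5: pfn * size of one struct page
SECTION_FLAG_MASK = 0x3     # andl $-4 clears the two low flag bits
PAGEBLOCK_NR_PAGES = 1024   # andl $-1024 and addl $1023


def pageblock_bounds(pfn):
    """Return (start_pfn, end_pfn) of the pageblock containing pfn."""
    start = pfn & ~(PAGEBLOCK_NR_PAGES - 1)
    return start, start + PAGEBLOCK_NR_PAGES - 1


def pfn_to_page_addr(pfn, section_map):
    """Model of the lookup: section_map is a hypothetical dict mapping a
    section number to its encoded map word (base address plus flag bits)."""
    encoded = section_map[pfn >> PFN_SECTION_SHIFT]
    base = encoded & ~SECTION_FLAG_MASK
    # Pointer arithmetic wraps at 32 bits on this build.
    return (base + pfn * STRUCT_PAGE_SIZE) & 0xFFFFFFFF


if __name__ == "__main__":
    # Made-up example: section 1 has its struct pages at 0xC1000000.
    first_pfn = 1 << PFN_SECTION_SHIFT
    encoded = ((0xC1000000 - first_pfn * STRUCT_PAGE_SIZE) & 0xFFFFFFFF) | 0x3
    start, end = pageblock_bounds(first_pfn + 5)
    print(hex(pfn_to_page_addr(start, {1: encoded})), start, end)
```

Under this model the end page sits 1023 * 32 = 32736 bytes past the start
page, which matches the "leal 32736(%ebx)" in both listings, so if the
lookup is where things go wrong it is more likely in which register ends up
holding the result than in the arithmetic itself.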
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-11 18:20               ` Mel Gorman
@ 2010-02-11 18:49                 ` Linus Torvalds
  2010-02-12 12:17                   ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2010-02-11 18:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tony Lill, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner

On Thu, 11 Feb 2010, Mel Gorman wrote:
> 
> Tony posted the assembler files (KCFLAGS=-save-temps) from
> the broken and working compilers, a copy of which is available at
> http://www.csn.ul.ie/~mel/postings/bug-20100211/ . Have you any suggestions
> on what the best way to go about finding where the badly generated code
> might be so a warning can be added for gcc 4.1?  My strongest suspicion is
> that the problem is in the assembler that looks up the struct page from a
> PFN in sparsemem but I'm failing to prove it.

Try contacting the gcc people. They are (well, _some_ of them are) much 
more used to walking through asm differences, and may have more of a clue 
about where the difference is likely to be for those compiler versions.

I'm personally very comfortable with x86 assembly, but having tried to 
find compiler bugs in the past I can also say that despite my x86 comfort 
I've almost always failed. The trivial stupid differences tend to always 
just totally overwhelm the actual real difference that causes the bug.

One thing to try is to see if the buggy compiler version can be itself 
triggered to create a non-buggy asm listing by using some compiler flag. 
That way the "trivial differences" tend to be smaller, and the bug stands 
out more.

For example, that's how we found the problem with "-fwrapv" - testing the 
same compiler version with different flags (see commit a137802ee83).

Sometimes if the trivial differences are mostly register allocation, you 
can get a "feel" for the differences by replacing all register names with 
just the string "REG" (and "[0-9x](%e[sb]p)" with "STACKSLOT"), and try to 
do the diff that way. If everything else is roughly the same, you then see 
the place where the code is _really_ different.
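
A minimal sketch of that normalisation, assuming AT&T-syntax 32-bit x86
listings like the ones in this thread. The regular expressions and function
names here are illustrative; they are not taken from any existing tool, and
the stack-slot pattern is a stricter reading of the informal one above.

```python
import difflib
import re

# Stack slots such as -16(%ebp) or 8(%esp); applied before the register
# pattern so the frame/stack register inside the slot is not rewritten first.
STACK = re.compile(r"-?(?:0x[0-9a-fA-F]+|\d+)\(%e[sb]p\)")

# General-purpose registers: %eax, %bx, %esi, %ebp, %cl, %bh, and so on.
REG = re.compile(r"%(?:e?[abcd]x|e?[sd]i|e?[sb]p|[abcd][lh])\b")


def normalize(line):
    """Replace stack slots with STACKSLOT and register names with REG."""
    line = STACK.sub("STACKSLOT", line)
    return REG.sub("REG", line)


def normalized_diff(bad_lines, good_lines):
    """Unified diff of two listings after normalisation."""
    bad = [normalize(l) for l in bad_lines]
    good = [normalize(l) for l in good_lines]
    return list(difflib.unified_diff(bad, good, "bad", "good", lineterm=""))


if __name__ == "__main__":
    bad = ["\tmovl\t-36(%ebp), %eax", "\tsall\t$5, %eax"]
    good = ["\tmovl\t-52(%ebp), %eax", "\tsall\t$5, %eax"]
    # Lines differing only in stack offset or register choice compare equal.
    print(normalized_diff(bad, good))
```

Local labels (.L271 versus .L166) will still show up as differences; they
could be folded into a placeholder the same way if they turn out to be
noise.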

But when the compiler actually re-orders basic blocks etc, then diffs are 
basically impossible to get anything sane out of.

		Linus


* Re: [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3
  2010-02-11 18:49                 ` Linus Torvalds
@ 2010-02-12 12:17                   ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2010-02-12 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tony Lill, Andrew Morton, linux-mm, bugzilla-daemon, bugme-daemon,
	Johannes Weiner

On Thu, Feb 11, 2010 at 10:49:44AM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 11 Feb 2010, Mel Gorman wrote:
> > 
> > Tony posted the assembler files (KCFLAGS=-save-temps) from
> > the broken and working compilers, a copy of which is available at
> > http://www.csn.ul.ie/~mel/postings/bug-20100211/ . Have you any suggestions
> > on what the best way to go about finding where the badly generated code
> > might be so a warning can be added for gcc 4.1?  My strongest suspicion is
> > that the problem is in the assembler that looks up the struct page from a
> > PFN in sparsemem but I'm failing to prove it.
> 
> Try contacting the gcc people. They are (well, _some_ of them are) much 
> more used to walking through asm differences, and may have more of a clue 
> about where the difference is likely to be for those compiler versions.
> 

Ok, thanks. Will get on to them if the other suggestions don't work out.

> I'm personally very comfortable with x86 assembly, but having tried to 
> find compiler bugs in the past I can also say that despite my x86 comfort 
> I've almost always failed. The trivial stupid differences tend to always 
> just totally overwhelm the actual real difference that causes the bug.
> 

I don't feel quite as bad then. I was hoping it would be "obvious" but
was getting tripped up by reordering and slightly-different ways of
achieving the same end result.

> One thing to try is to see if the buggy compiler version can be itself 
> triggered to create a non-buggy asm listing by using some compiler flag. 
> That way the "trivial differences" tend to be smaller, and the bug stands 
> out more.
> 
> For example, that's how we found the problem with "-fwrapv" - testing the 
> same compiler version with different flags (see commit a137802ee83).
> 

The compiler of interest is still available so I should be able to reproduce
the problem locally once I get an old distro installed.

> Sometimes if the trivial differences are mostly register allocation, you 
> can get a "feel" for the differences by replacing all register names with 
> just the string "REG" (and "[0-9x](%e[sb]p)" with "STACKSLOT"), and try to 
> do the diff that way. If everything else is roughly the same, you then see 
> the place where the code is _really_ different.
> 

Will try this first, then install an old distro before resorting to
the gcc people. There is a good chance their response will be "go away"
once they realise it's fixed in later compilers.

> But when the compiler actually re-orders basic blocks etc, then diffs are 
> basically impossible to get anything sane out of.
> 

Thanks for the suggestions.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


Thread overview: 9+ messages
     [not found] <bug-15214-10286@http.bugzilla.kernel.org/>
2010-02-03 22:39 ` [Bugme-new] [Bug 15214] New: Oops at __rmqueue+0x51/0x2b3 Andrew Morton
2010-02-05 11:20   ` Mel Gorman
2010-02-07 18:34     ` Tony Lill
2010-02-08 10:10       ` Mel Gorman
2010-02-08 19:18         ` Andrew Morton
2010-02-09 14:45           ` Mel Gorman
     [not found]             ` <201002101217.34131.ajlill@ajlc.waterloo.on.ca>
2010-02-11 18:20               ` Mel Gorman
2010-02-11 18:49                 ` Linus Torvalds
2010-02-12 12:17                   ` Mel Gorman
