* .17rc5 cfq slab corruption. @ 2006-05-26 21:39 Dave Jones 2006-05-27 0:00 ` Andrew Morton 2006-05-27 2:56 ` Dave Jones 0 siblings, 2 replies; 25+ messages in thread From: Dave Jones @ 2006-05-26 21:39 UTC (permalink / raw) To: Linux Kernel Was playing with googles new picasa toy, which hammered the disks hunting out every image file it could find, when this popped out: Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 Redzone: 0x5a2cf071/0x5a2cf071. Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Prev obj: start=ffff810012b99808, len=168 Redzone: 0x5a2cf071/0x5a2cf071. Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Next obj: start=ffff810012b99988, len=168 Redzone: 0x5a2cf071/0x5a2cf071. Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-26 21:39 .17rc5 cfq slab corruption Dave Jones @ 2006-05-27 0:00 ` Andrew Morton 2006-05-27 7:07 ` Jens Axboe 2006-05-27 2:56 ` Dave Jones 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2006-05-27 0:00 UTC (permalink / raw) To: Dave Jones; +Cc: linux-kernel, Jens Axboe Dave Jones <davej@redhat.com> wrote: > > Was playing with googles new picasa toy, which hammered the disks > hunting out every image file it could find, when this popped out: > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > Prev obj: start=ffff810012b99808, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > Next obj: start=ffff810012b99988, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > Just to compare notes... I hit the below a few times last week. That was with the subsystem-trees part of the mm lineup applied, but it doesn't have much in the way of CFQ alterations. audit(1147884480.106:2): selinux=0 auid=4294967295 general protection fault: 0000 [1] ipv6 ppdev autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_mod video sony_acpi button battery asus_acpi ac lp parport_pc parport nvram snd_hda_intel snd_hda_codec i2c_i801 sr_mod snd_seq_dummy i2c_core sg snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ehci_hcd snd_pcm_oss ohci1394 ieee1394 snd_mixer_oss uhci_hcd snd_pcm snd_timer snd soundcore snd_page_alloc hw_random ext3 jbd ata_piix libata sd_mod scsi_mod Pid: 15, comm: kblockd/1 Not tainted 2.6.17-rc4-mm1 #3 RIP: 0010:[<ffffffff802f0cc3>] <ffffffff802f0cc3>{cfq_dispatch_requests+816} RSP: 0018:ffff81012f4dfd48 EFLAGS: 00010087 RAX: ffff81012c89ab58 RBX: ffff81012edd8060 RCX: 00000001009985e6 RDX: 0000000000000001 RSI: ffff81012edd8068 RDI: 0000000000000000 RBP: ffff81012edd8000 R08: ffff81012edd8088 R09: 0000000000000000 R10: ffff81012f91d900 R11: ffff81012edd8000 R12: 0002000500030002 R13: 0000000000000004 R14: 0000000000000000 R15: ffff81012f91d900 FS: 0000000000000000(0000) GS:ffff81012fcb5cc0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000003052576460 CR3: 0000000119422000 CR4: 00000000000006e0 Process kblockd/1 (pid: 15, threadinfo ffff81012f4de000, task ffff81012f4d9080) Stack: ffffffff880027a8 0000000000000000 ffff81012f91d800 ffff81012eecb550 ffff81012eecb550 ffff81012eecb748 ffff81012f91d900 ffffffff802e5e51 ffff81012f91da30 ffff8101157030a8 Call Trace: <ffffffff880027a8>{:scsi_mod:scsi_done+0} <ffffffff802e5e51>{elv_next_request+321} <ffffffff88008174>{:scsi_mod:scsi_request_fn+115} <ffffffff802f1208>{cfq_kick_queue+0} <ffffffff802f1287>{cfq_kick_queue+127} <ffffffff8023f269>{run_workqueue+159} <ffffffff8023fc51>{worker_thread+0} <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff8023fd5a>{worker_thread+265} <ffffffff8022904d>{default_wake_function+0} <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff802428df>{kthread+254} <ffffffff8020a14a>{child_rip+8} <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff8043fb47>{thread_return+0} <ffffffff802427e1>{kthread+0} <ffffffff8020a142>{child_rip+0} Code: 49 8b 54 24 20 83 7b 20 01 48 19 c0 48 83 e0 fc 8b 84 30 e8 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-27 0:00 ` Andrew Morton @ 2006-05-27 7:07 ` Jens Axboe 2006-05-27 13:31 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Jens Axboe @ 2006-05-27 7:07 UTC (permalink / raw) To: Andrew Morton; +Cc: Dave Jones, linux-kernel On Fri, May 26 2006, Andrew Morton wrote: > Dave Jones <davej@redhat.com> wrote: > > > > Was playing with googles new picasa toy, which hammered the disks > > hunting out every image file it could find, when this popped out: > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > Redzone: 0x5a2cf071/0x5a2cf071. > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > Prev obj: start=ffff810012b99808, len=168 > > Redzone: 0x5a2cf071/0x5a2cf071. > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > Next obj: start=ffff810012b99988, len=168 > > Redzone: 0x5a2cf071/0x5a2cf071. > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > Just to compare notes... > > I hit the below a few times last week. That was with the subsystem-trees > part of the mm lineup applied, but it doesn't have much in the way of CFQ > alterations. > > > audit(1147884480.106:2): selinux=0 auid=4294967295 > general protection fault: 0000 [1] ipv6 ppdev autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_mod video sony_acpi button battery asus_acpi ac lp parport_pc parport nvram snd_hda_intel snd_hda_codec i2c_i801 sr_mod snd_seq_dummy i2c_core sg snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ehci_hcd snd_pcm_oss ohci1394 ieee1394 snd_mixer_oss uhci_hcd snd_pcm snd_timer snd soundcore snd_page_alloc hw_random ext3 jbd ata_piix libata sd_mod scsi_mod > Pid: 15, comm: kblockd/1 Not tainted 2.6.17-rc4-mm1 #3 > RIP: 0010:[<ffffffff802f0cc3>] <ffffffff802f0cc3>{cfq_dispatch_requests+816} > RSP: 0018:ffff81012f4dfd48 EFLAGS: 00010087 > RAX: ffff81012c89ab58 RBX: ffff81012edd8060 RCX: 00000001009985e6 > RDX: 0000000000000001 RSI: ffff81012edd8068 RDI: 0000000000000000 > RBP: ffff81012edd8000 R08: ffff81012edd8088 R09: 0000000000000000 > R10: ffff81012f91d900 R11: ffff81012edd8000 R12: 0002000500030002 > R13: 0000000000000004 R14: 0000000000000000 R15: ffff81012f91d900 > FS: 0000000000000000(0000) GS:ffff81012fcb5cc0(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000003052576460 CR3: 0000000119422000 CR4: 00000000000006e0 > Process kblockd/1 (pid: 15, threadinfo ffff81012f4de000, task ffff81012f4d9080) > Stack: ffffffff880027a8 0000000000000000 ffff81012f91d800 ffff81012eecb550 > ffff81012eecb550 ffff81012eecb748 ffff81012f91d900 ffffffff802e5e51 > ffff81012f91da30 ffff8101157030a8 > Call Trace: <ffffffff880027a8>{:scsi_mod:scsi_done+0} > <ffffffff802e5e51>{elv_next_request+321} <ffffffff88008174>{:scsi_mod:scsi_request_fn+115} > <ffffffff802f1208>{cfq_kick_queue+0} <ffffffff802f1287>{cfq_kick_queue+127} > <ffffffff8023f269>{run_workqueue+159} <ffffffff8023fc51>{worker_thread+0} > <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff8023fd5a>{worker_thread+265} > <ffffffff8022904d>{default_wake_function+0} <ffffffff802425b0>{keventd_create_kthread+0} > <ffffffff802425b0>{keventd_create_kthread+0} <ffffffff802428df>{kthread+254} > <ffffffff8020a14a>{child_rip+8} <ffffffff802425b0>{keventd_create_kthread+0} > <ffffffff8043fb47>{thread_return+0} <ffffffff802427e1>{kthread+0} > <ffffffff8020a142>{child_rip+0} > > Code: 49 8b 54 24 20 83 7b 20 01 48 19 c0 48 83 e0 fc 8b 84 30 e8 Pretty baffling... cfq has been hammered pretty thoroughly over the last months and _nothing_ has shown up except some performance anomalies that are now fixed. Since daves case (at least) seems to be use-after-free, I'll see if I can reproduce with some contrived case. I'm asuming that picasa forks and exits a lot with submitted io in between than may not have finished at exit. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-27 7:07 ` Jens Axboe @ 2006-05-27 13:31 ` Dave Jones 2006-05-27 14:52 ` Jens Axboe 2006-05-30 13:17 ` Jens Axboe 0 siblings, 2 replies; 25+ messages in thread From: Dave Jones @ 2006-05-27 13:31 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-kernel On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > On Fri, May 26 2006, Andrew Morton wrote: > > Dave Jones <davej@redhat.com> wrote: > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > hunting out every image file it could find, when this popped out: > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > Prev obj: start=ffff810012b99808, len=168 > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > Next obj: start=ffff810012b99988, len=168 > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > Pretty baffling... cfq has been hammered pretty thoroughly over the > last months and _nothing_ has shown up except some performance anomalies > that are now fixed. Since daves case (at least) seems to be > use-after-free, I'll see if I can reproduce with some contrived case. > I'm asuming that picasa forks and exits a lot with submitted io in > between than may not have finished at exit. The second time I hit it, was actually during boot up. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-27 13:31 ` Dave Jones @ 2006-05-27 14:52 ` Jens Axboe 2006-05-30 13:17 ` Jens Axboe 1 sibling, 0 replies; 25+ messages in thread From: Jens Axboe @ 2006-05-27 14:52 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Sat, May 27 2006, Dave Jones wrote: > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > > On Fri, May 26 2006, Andrew Morton wrote: > > > Dave Jones <davej@redhat.com> wrote: > > > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > > hunting out every image file it could find, when this popped out: > > > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > > Prev obj: start=ffff810012b99808, len=168 > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > Next obj: start=ffff810012b99988, len=168 > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > last months and _nothing_ has shown up except some performance anomalies > > that are now fixed. Since daves case (at least) seems to be > > use-after-free, I'll see if I can reproduce with some contrived case. > > I'm asuming that picasa forks and exits a lot with submitted io in > > between than may not have finished at exit. > > The second time I hit it, was actually during boot up. Boot up sees a lot of that same pattern. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-27 13:31 ` Dave Jones 2006-05-27 14:52 ` Jens Axboe @ 2006-05-30 13:17 ` Jens Axboe 2006-05-30 13:44 ` Dave Jones 2006-05-30 16:12 ` Dave Jones 1 sibling, 2 replies; 25+ messages in thread From: Jens Axboe @ 2006-05-30 13:17 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Sat, May 27 2006, Dave Jones wrote: > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > > On Fri, May 26 2006, Andrew Morton wrote: > > > Dave Jones <davej@redhat.com> wrote: > > > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > > hunting out every image file it could find, when this popped out: > > > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > > Prev obj: start=ffff810012b99808, len=168 > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > Next obj: start=ffff810012b99988, len=168 > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > last months and _nothing_ has shown up except some performance anomalies > > that are now fixed. Since daves case (at least) seems to be > > use-after-free, I'll see if I can reproduce with some contrived case. > > I'm asuming that picasa forks and exits a lot with submitted io in > > between than may not have finished at exit. > > The second time I hit it, was actually during boot up. Dave, do you have any io scheduler switching going on? -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 13:17 ` Jens Axboe @ 2006-05-30 13:44 ` Dave Jones 2006-05-30 13:50 ` Jens Axboe 2006-05-30 16:12 ` Dave Jones 1 sibling, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-05-30 13:44 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-kernel On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote: > On Sat, May 27 2006, Dave Jones wrote: > > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > > > On Fri, May 26 2006, Andrew Morton wrote: > > > > Dave Jones <davej@redhat.com> wrote: > > > > > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > > > hunting out every image file it could find, when this popped out: > > > > > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > > > Prev obj: start=ffff810012b99808, len=168 > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > Next obj: start=ffff810012b99988, len=168 > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > > last months and _nothing_ has shown up except some performance anomalies > > > that are now fixed. Since daves case (at least) seems to be > > > use-after-free, I'll see if I can reproduce with some contrived case. > > > I'm asuming that picasa forks and exits a lot with submitted io in > > > between than may not have finished at exit. > > > > The second time I hit it, was actually during boot up. > > Dave, do you have any io scheduler switching going on? Nope, everything set to use CFQ as default, and left that way. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 13:44 ` Dave Jones @ 2006-05-30 13:50 ` Jens Axboe 2006-05-30 13:52 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Jens Axboe @ 2006-05-30 13:50 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Dave Jones wrote: > On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote: > > On Sat, May 27 2006, Dave Jones wrote: > > > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > > > > On Fri, May 26 2006, Andrew Morton wrote: > > > > > Dave Jones <davej@redhat.com> wrote: > > > > > > > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > > > > hunting out every image file it could find, when this popped out: > > > > > > > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > Prev obj: start=ffff810012b99808, len=168 > > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > Next obj: start=ffff810012b99988, len=168 > > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > > > last months and _nothing_ has shown up except some performance anomalies > > > > that are now fixed. Since daves case (at least) seems to be > > > > use-after-free, I'll see if I can reproduce with some contrived case. > > > > I'm asuming that picasa forks and exits a lot with submitted io in > > > > between than may not have finished at exit. > > > > > > The second time I hit it, was actually during boot up. > > > > Dave, do you have any io scheduler switching going on? > > Nope, everything set to use CFQ as default, and left that way. Hrmpf ok, I had hoped perhaps something in your init scripts modified the scheduler value. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 13:50 ` Jens Axboe @ 2006-05-30 13:52 ` Dave Jones 2006-05-30 14:13 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-05-30 13:52 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-kernel On Tue, May 30, 2006 at 03:50:59PM +0200, Jens Axboe wrote: > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > > > > last months and _nothing_ has shown up except some performance anomalies > > > > > that are now fixed. Since daves case (at least) seems to be > > > > > use-after-free, I'll see if I can reproduce with some contrived case. > > > > > I'm asuming that picasa forks and exits a lot with submitted io in > > > > > between than may not have finished at exit. > > > > > > > > The second time I hit it, was actually during boot up. > > > > > > Dave, do you have any io scheduler switching going on? > > > > Nope, everything set to use CFQ as default, and left that way. > > Hrmpf ok, I had hoped perhaps something in your init scripts modified > the scheduler value. grep doesn't show anything in init scripts, and ttbomk, hald isn't messing with this. (Actually I'm seeing it trigger before that gets started anyway, so that can't be it). Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 13:52 ` Dave Jones @ 2006-05-30 14:13 ` Dave Jones 0 siblings, 0 replies; 25+ messages in thread From: Dave Jones @ 2006-05-30 14:13 UTC (permalink / raw) To: Jens Axboe, Andrew Morton, linux-kernel On Tue, May 30, 2006 at 09:52:24AM -0400, Dave Jones wrote: > On Tue, May 30, 2006 at 03:50:59PM +0200, Jens Axboe wrote: > > > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > > > > > last months and _nothing_ has shown up except some performance anomalies > > > > > > that are now fixed. Since daves case (at least) seems to be > > > > > > use-after-free, I'll see if I can reproduce with some contrived case. > > > > > > I'm asuming that picasa forks and exits a lot with submitted io in > > > > > > between than may not have finished at exit. > > > > > > > > > > The second time I hit it, was actually during boot up. > > > > > > > > Dave, do you have any io scheduler switching going on? > > > > > > Nope, everything set to use CFQ as default, and left that way. > > > > Hrmpf ok, I had hoped perhaps something in your init scripts modified > > the scheduler value. > > grep doesn't show anything in init scripts, and ttbomk, hald isn't messing > with this. (Actually I'm seeing it trigger before that gets started > anyway, so that can't be it). In case it makes a difference to help with reproducing -- the hardware of this system is: two sata drives sda1/sdb1 in single-volume device-mapper configuration as VolGroup00-LogVol00 pata cd drive on /dev/hdc usb memory stick on sdc1 Nothing too out of the ordinary there.. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 13:17 ` Jens Axboe 2006-05-30 13:44 ` Dave Jones @ 2006-05-30 16:12 ` Dave Jones 2006-05-30 16:49 ` Jens Axboe 1 sibling, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-05-30 16:12 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-kernel On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote: > On Sat, May 27 2006, Dave Jones wrote: > > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > > > On Fri, May 26 2006, Andrew Morton wrote: > > > > Dave Jones <davej@redhat.com> wrote: > > > > > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > > > hunting out every image file it could find, when this popped out: > > > > > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > > > Prev obj: start=ffff810012b99808, len=168 > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > Next obj: start=ffff810012b99988, len=168 > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > > last months and _nothing_ has shown up except some performance anomalies > > > that are now fixed. Since daves case (at least) seems to be > > > use-after-free, I'll see if I can reproduce with some contrived case. > > > I'm asuming that picasa forks and exits a lot with submitted io in > > > between than may not have finished at exit. > > > > The second time I hit it, was actually during boot up. > > Dave, do you have any io scheduler switching going on? Here's something interesting (possibly unrelated). https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=193534 I added this patch to our devel kernel (based on 17rc5-git5 right now) It's similar to the list_head debugging patch from -mm --- linux-2.6.12/include/linux/list.h~ 2005-08-08 15:34:50.000000000 -0400 +++ linux-2.6.12/include/linux/list.h 2005-08-08 15:35:22.000000000 -0400 @@ -5,7 +5,9 @@ #include <linux/stddef.h> #include <linux/prefetch.h> +#include <linux/kernel.h> #include <asm/system.h> +#include <asm/bug.h> /* * These are non-NULL pointers that will result in page faults @@ -52,6 +52,16 @@ static inline void __list_add(struct lis struct list_head *prev, struct list_head *next) { + if (next->prev != prev) { + printk("List corruption. next->prev should be %p, but was %p\n", + prev, next->prev); + BUG(); + } + if (prev->next != next) { + printk("List corruption. prev->next should be %p, but was %p\n", + next, prev->next); + BUG(); + } next->prev = new; new->next = next; new->prev = prev; @@ -162,6 +162,16 @@ static inline void __list_del(struct lis */ static inline void list_del(struct list_head *entry) { + if (entry->prev->next != entry) { + printk("List corruption. prev->next should be %p, but was %p\n", + entry, entry->prev->next); + BUG(); + } + if (entry->next->prev != entry) { + printk("List corruption. next->prev should be %p, but was %p\n", + entry, entry->next->prev); + BUG(); + } __list_del(entry->prev, entry->next); entry->next = LIST_POISON1; entry->prev = LIST_POISON2; And then it turned up this: List corruption. next->prev should be f74a5e2c, but was ea7ed31c Pointing at cfq_set_request. Now, *anything* could have corrupted that list, not necessarily cfq, but it's something of a coincidence. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 16:12 ` Dave Jones @ 2006-05-30 16:49 ` Jens Axboe 2006-05-30 16:56 ` Dave Jones 0 siblings, 1 reply; 25+ messages in thread From: Jens Axboe @ 2006-05-30 16:49 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Dave Jones wrote: > On Tue, May 30, 2006 at 03:17:28PM +0200, Jens Axboe wrote: > > On Sat, May 27 2006, Dave Jones wrote: > > > On Sat, May 27, 2006 at 09:07:24AM +0200, Jens Axboe wrote: > > > > On Fri, May 26 2006, Andrew Morton wrote: > > > > > Dave Jones <davej@redhat.com> wrote: > > > > > > > > > > > > Was playing with googles new picasa toy, which hammered the disks > > > > > > hunting out every image file it could find, when this popped out: > > > > > > > > > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > Prev obj: start=ffff810012b99808, len=168 > > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > Next obj: start=ffff810012b99988, len=168 > > > > > > Redzone: 0x5a2cf071/0x5a2cf071. > > > > > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > > > > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > > > > > > > Pretty baffling... cfq has been hammered pretty thoroughly over the > > > > last months and _nothing_ has shown up except some performance anomalies > > > > that are now fixed. Since daves case (at least) seems to be > > > > use-after-free, I'll see if I can reproduce with some contrived case. > > > > I'm asuming that picasa forks and exits a lot with submitted io in > > > > between than may not have finished at exit. > > > > > > The second time I hit it, was actually during boot up. > > > > Dave, do you have any io scheduler switching going on? > > Here's something interesting (possibly unrelated). > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=193534 > > I added this patch to our devel kernel (based on 17rc5-git5 right now) > > It's similar to the list_head debugging patch from -mm > > --- linux-2.6.12/include/linux/list.h~ 2005-08-08 15:34:50.000000000 -0400 > +++ linux-2.6.12/include/linux/list.h 2005-08-08 15:35:22.000000000 -0400 > @@ -5,7 +5,9 @@ > > #include <linux/stddef.h> > #include <linux/prefetch.h> > +#include <linux/kernel.h> > #include <asm/system.h> > +#include <asm/bug.h> > > /* > * These are non-NULL pointers that will result in page faults > @@ -52,6 +52,16 @@ static inline void __list_add(struct lis > struct list_head *prev, > struct list_head *next) > { > + if (next->prev != prev) { > + printk("List corruption. next->prev should be %p, but was %p\n", > + prev, next->prev); > + BUG(); > + } > + if (prev->next != next) { > + printk("List corruption. prev->next should be %p, but was %p\n", > + next, prev->next); > + BUG(); > + } > next->prev = new; > new->next = next; > new->prev = prev; > @@ -162,6 +162,16 @@ static inline void __list_del(struct lis > */ > static inline void list_del(struct list_head *entry) > { > + if (entry->prev->next != entry) { > + printk("List corruption. prev->next should be %p, but was %p\n", > + entry, entry->prev->next); > + BUG(); > + } > + if (entry->next->prev != entry) { > + printk("List corruption. next->prev should be %p, but was %p\n", > + entry, entry->next->prev); > + BUG(); > + } > __list_del(entry->prev, entry->next); > entry->next = LIST_POISON1; > entry->prev = LIST_POISON2; > > > And then it turned up this: > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > Pointing at cfq_set_request. I think I'm missing a piece of this - what list was corrupted, in what function did it trigger? -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 16:49 ` Jens Axboe @ 2006-05-30 16:56 ` Dave Jones 2006-05-30 17:04 ` Jens Axboe 0 siblings, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-05-30 16:56 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-kernel On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > Pointing at cfq_set_request. > > I think I'm missing a piece of this - what list was corrupted, in what > function did it trigger? If you look at the attachment in the bugzilla url in my previous msg, you'll see this: ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] May 30 05:31:33 mandril kernel: SMP May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod May 30 05:31:33 mandril kernel: CPU: 0 May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff May 30 05:31:33 mandril kernel: eax: 00000044 ebx: f605af58 ecx: c06c77d0 edx: 00200282 May 30 05:31:33 mandril kernel: esi: f4c31d58 edi: f74a5e2c ebp: f4c31da8 esp: c57c1abc May 30 05:31:33 mandril kernel: ds: 007b es: 007b ss: 0068 May 30 05:31:33 mandril kernel: Process httpd (pid: 30323, threadinfo=c57c1000 task=f1bf4550) May 30 05:31:33 mandril kernel: Stack: c0624ac7 f74a5e2c ea7ed31c d0c71c34 f74cb25c f74a5d04 f1bf4550 00000000 May 30 05:31:33 mandril kernel: 00007673 f3822398 00000000 c04e310e f74cb25c d0c71c34 f74cb26c c04d7342 May 30 05:31:33 mandril kernel: 00000010 f74cb25c 00000000 c04d9da8 00000010 d954b84c c046ca2b 0142781d May 30 05:31:33 mandril kernel: Call Trace: May 30 05:31:33 mandril kernel: <c04e310e> cfq_set_request+0x0/0x3ff <c04d7342> elv_set_request+0x1e/0x2d May 30 05:31:33 mandril kernel: <c04d9da8> get_request+0x164/0x31d <c046ca2b> __find_get_block+0x161/0x16b May 30 05:31:33 mandril kernel: <c04da58e> get_request_wait+0x1b/0x15e <c046ca6e> __getblk+0x39/0x257 May 30 05:31:33 mandril kernel: <c04dbeeb> __make_request+0x322/0x3cb <c0465b7a> cache_alloc_debugcheck_after+0x134/0x13e May 30 05:31:33 mandril kernel: <f8951626> ext3_get_blocks_handle+0x921/0x954 [ext3] <c04d96f6> generic_make_request+0x2b7/0x2c7 May 30 05:31:33 mandril kernel: <c046c6f9> __find_get_block_slow+0x11c/0x128 <c0465b7a> cache_alloc_debugcheck_after+0x134/0x13e May 30 05:31:33 mandril kernel: <c046ca2b> __find_get_block+0x161/0x16b <c0465b13> cache_alloc_debugcheck_after+0xcd/0x13e May 30 05:31:33 mandril kernel: <c044ec05> mempool_alloc+0x37/0xd3 <c04db30c> submit_bio+0xb7/0xbe May 30 05:31:33 mandril kernel: <c044ec05> mempool_alloc+0x37/0xd3 <f8951774> ext3_getblk+0x11b/0x2a6 [ext3] May 30 05:31:33 mandril kernel: <c046f5f2> bio_alloc_bioset+0x9b/0xf3 <c046c3da> submit_bh+0xe4/0x102 May 30 05:31:33 mandril kernel: <c046d3d5> ll_rw_block+0x8d/0xa9 <f89527a3> ext3_bread+0x39/0x83 [ext3] May 30 05:31:33 mandril kernel: <f895571c> ext3_find_entry+0x174/0x58d [ext3] <c0406470> do_IRQ+0x79/0x84 May 30 05:31:33 mandril kernel: <c040485e> common_interrupt+0x1a/0x20 <c04390ef> debug_mutex_add_waiter+0x97/0xa9 May 30 05:31:33 mandril kernel: <c047fd1e> d_alloc+0x4b/0x206 <f8956f38> ext3_lookup+0x21/0x77 [ext3] May 30 05:31:33 mandril kernel: <c0476e95> do_lookup+0xb2/0x14d <c0478c03> __link_path_walk+0x894/0xd31 May 30 05:31:33 mandril kernel: <c04790e9> link_path_walk+0x49/0xbd <c0435994> autoremove_wake_function+0x0/0x35 May 30 05:31:33 mandril kernel: <c0479484> do_path_lookup+0x1e3/0x251 <c047823b> getname+0x91/0x98 May 30 05:31:33 mandril kernel: <c0479c7d> __user_walk_fd+0x2f/0x40 <c04736ab> vfs_stat_fd+0x19/0x40 May 30 05:31:33 mandril kernel: <c0435994> autoremove_wake_function+0x0/0x35 <c047375f> sys_stat64+0xf/0x23 May 30 05:31:33 mandril kernel: <c0406470> do_IRQ+0x79/0x84 <c0403e1f> syscall_call+0x7/0xb May 30 05:31:33 mandril kernel: Code: 8b 7c 24 14 8b 98 28 01 00 00 81 c7 28 01 00 00 8b 43 04 39 f8 74 46 89 7c 24 04 89 44 24 08 c7 04 24 c7 4a 62 c0 e8 44 1e f4 ff <0f> 0b 3a 00 b2 4a 62 c0 8b 54 24 14 8b 82 28 01 00 00 39 d8 74 Not sure which list is being pointed at. I'll see if I can decipher more after lunch. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 16:56 ` Dave Jones @ 2006-05-30 17:04 ` Jens Axboe 2006-05-30 18:49 ` Jens Axboe 0 siblings, 1 reply; 25+ messages in thread From: Jens Axboe @ 2006-05-30 17:04 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Dave Jones wrote: > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > > Pointing at cfq_set_request. > > > > I think I'm missing a piece of this - what list was corrupted, in what > > function did it trigger? > > If you look at the attachment in the bugzilla url in my previous msg, > you'll see this: > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] > May 30 05:31:33 mandril kernel: SMP > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod > May 30 05:31:33 mandril kernel: CPU: 0 > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff Just do a l *cfq_set_request+0x202 from gdb if you have CONFIG_DEBUG_INFO enabled in your vmlinux. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 17:04 ` Jens Axboe @ 2006-05-30 18:49 ` Jens Axboe 2006-05-30 18:51 ` Jens Axboe 0 siblings, 1 reply; 25+ messages in thread From: Jens Axboe @ 2006-05-30 18:49 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Jens Axboe wrote: > On Tue, May 30 2006, Dave Jones wrote: > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > > > Pointing at cfq_set_request. > > > > > > I think I'm missing a piece of this - what list was corrupted, in what > > > function did it trigger? > > > > If you look at the attachment in the bugzilla url in my previous msg, > > you'll see this: > > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] > > May 30 05:31:33 mandril kernel: SMP > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod > > May 30 05:31:33 mandril kernel: CPU: 0 > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff > > Just do a l *cfq_set_request+0x202 from gdb if you have > CONFIG_DEBUG_INFO enabled in your vmlinux. Doh, found it. Dave, please try and reproduce with this applied: --- [PATCH] cfq-iosched: kill list entry on dead cic We must remember to drop the cic circular list when we prune a dead entry. Signed-off-by: Jens Axboe <axboe@suse.de> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 2540dfa..309e0f4 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -1477,6 +1477,7 @@ cfq_drop_dead_cic(struct io_context *ioc { read_lock(&cfq_exit_lock); rb_erase(&cic->rb_node, &ioc->cic_root); + list_del_init(&cic->queue_list); read_unlock(&cfq_exit_lock); kmem_cache_free(cfq_ioc_pool, cic); atomic_dec(&ioc_count); -- Jens Axboe ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 18:49 ` Jens Axboe @ 2006-05-30 18:51 ` Jens Axboe 2006-05-30 19:11 ` Jens Axboe 0 siblings, 1 reply; 25+ messages in thread From: Jens Axboe @ 2006-05-30 18:51 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Jens Axboe wrote: > On Tue, May 30 2006, Jens Axboe wrote: > > On Tue, May 30 2006, Dave Jones wrote: > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > > > > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > > > > Pointing at cfq_set_request. > > > > > > > > I think I'm missing a piece of this - what list was corrupted, in what > > > > function did it trigger? > > > > > > If you look at the attachment in the bugzilla url in my previous msg, > > > you'll see this: > > > > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] > > > May 30 05:31:33 mandril kernel: SMP > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod > > > May 30 05:31:33 mandril kernel: CPU: 0 > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff > > > > Just do a l *cfq_set_request+0x202 from gdb if you have > > CONFIG_DEBUG_INFO enabled in your vmlinux. > > Doh, found it. Dave, please try and reproduce with this applied: Nevermind, that's not it either. Damn. Stay tuned. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 18:51 ` Jens Axboe @ 2006-05-30 19:11 ` Jens Axboe 2006-05-30 19:23 ` Dave Jones 2006-05-30 19:28 ` OGAWA Hirofumi 0 siblings, 2 replies; 25+ messages in thread From: Jens Axboe @ 2006-05-30 19:11 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Jens Axboe wrote: > On Tue, May 30 2006, Jens Axboe wrote: > > On Tue, May 30 2006, Jens Axboe wrote: > > > On Tue, May 30 2006, Dave Jones wrote: > > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > > > > > > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > > > > > Pointing at cfq_set_request. > > > > > > > > > > I think I'm missing a piece of this - what list was corrupted, in what > > > > > function did it trigger? > > > > > > > > If you look at the attachment in the bugzilla url in my previous msg, > > > > you'll see this: > > > > > > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ > > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! > > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] > > > > May 30 05:31:33 mandril kernel: SMP > > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 > > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod > > > > May 30 05:31:33 mandril kernel: CPU: 0 > > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI > > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) > > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff > > > > > > Just do a l *cfq_set_request+0x202 from gdb if you have > > > CONFIG_DEBUG_INFO enabled in your vmlinux. > > > > Doh, found it. Dave, please try and reproduce with this applied: > > Nevermind, that's not it either. Damn. Stay tuned. Try this instead, please. [PATCH] cfq-iosched: fixup locking and ->queue_list list management - Drop cic from the list when seen as dead. - Fixup the locking, just use a simple spinlock. Signed-off-by: Jens Axboe <axboe@suse.de> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 2540dfa..11ce6aa 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -33,7 +33,7 @@ #define CFQ_SLICE_SCALE (5) #define CFQ_KEY_ASYNC (0) -static DEFINE_RWLOCK(cfq_exit_lock); +static DEFINE_SPINLOCK(cfq_exit_lock); /* * for the hash of cfqq inside the cfqd @@ -1284,7 +1284,7 @@ static void cfq_exit_io_context(struct i /* * put the reference this task is holding to the various queues */ - read_lock_irqsave(&cfq_exit_lock, flags); + spin_lock_irqsave(&cfq_exit_lock, flags); n = rb_first(&ioc->cic_root); while (n != NULL) { @@ -1294,7 +1294,7 @@ static void cfq_exit_io_context(struct i n = rb_next(n); } - read_unlock_irqrestore(&cfq_exit_lock, flags); + spin_unlock_irqrestore(&cfq_exit_lock, flags); } static struct cfq_io_context * @@ -1400,17 +1400,17 @@ static int cfq_ioc_set_ioprio(struct io_ struct cfq_io_context *cic; struct rb_node *n; - write_lock(&cfq_exit_lock); + spin_lock(&cfq_exit_lock); n = rb_first(&ioc->cic_root); while (n != NULL) { cic = rb_entry(n, struct cfq_io_context, rb_node); - + changed_ioprio(cic); n = rb_next(n); } - write_unlock(&cfq_exit_lock); + spin_unlock(&cfq_exit_lock); return 0; } @@ -1475,9 +1475,10 @@ out: static void cfq_drop_dead_cic(struct io_context *ioc, struct cfq_io_context *cic) { - read_lock(&cfq_exit_lock); + spin_lock(&cfq_exit_lock); rb_erase(&cic->rb_node, &ioc->cic_root); - read_unlock(&cfq_exit_lock); + list_del_init(&cic->queue_list); + spin_unlock(&cfq_exit_lock); kmem_cache_free(cfq_ioc_pool, cic); atomic_dec(&ioc_count); } @@ -1545,11 +1546,11 @@ restart: BUG(); } - read_lock(&cfq_exit_lock); + spin_lock(&cfq_exit_lock); rb_link_node(&cic->rb_node, parent, p); rb_insert_color(&cic->rb_node, &ioc->cic_root); list_add(&cic->queue_list, &cfqd->cic_list); - read_unlock(&cfq_exit_lock); + spin_unlock(&cfq_exit_lock); } /* @@ -2187,7 +2188,7 @@ static void cfq_exit_queue(elevator_t *e cfq_shutdown_timer_wq(cfqd); - write_lock(&cfq_exit_lock); + spin_lock(&cfq_exit_lock); spin_lock_irq(q->queue_lock); if (cfqd->active_queue) @@ -2210,7 +2211,7 @@ static void cfq_exit_queue(elevator_t *e } spin_unlock_irq(q->queue_lock); - write_unlock(&cfq_exit_lock); + spin_unlock(&cfq_exit_lock); cfq_shutdown_timer_wq(cfqd); -- Jens Axboe ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 19:11 ` Jens Axboe @ 2006-05-30 19:23 ` Dave Jones 2006-05-30 19:27 ` Jens Axboe 2006-05-30 19:28 ` OGAWA Hirofumi 1 sibling, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-05-30 19:23 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, linux-kernel On Tue, May 30, 2006 at 09:11:28PM +0200, Jens Axboe wrote: > > > > Just do a l *cfq_set_request+0x202 from gdb if you have > > > > CONFIG_DEBUG_INFO enabled in your vmlinux. > > > > > > Doh, found it. Dave, please try and reproduce with this applied: > > > > Nevermind, that's not it either. Damn. Stay tuned. > > Try this instead, please. Heh, I was waiting forever to get that debuginfo package downloaded & unpacked. I'll throw this into my next build, and see what happens. Is this likely to fix the slab corruption bug I first reported, or the list corruption as seen in the bugzilla, or (optimistically) both ? Thanks, Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 19:23 ` Dave Jones @ 2006-05-30 19:27 ` Jens Axboe 0 siblings, 0 replies; 25+ messages in thread From: Jens Axboe @ 2006-05-30 19:27 UTC (permalink / raw) To: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Dave Jones wrote: > On Tue, May 30, 2006 at 09:11:28PM +0200, Jens Axboe wrote: > > > > > > Just do a l *cfq_set_request+0x202 from gdb if you have > > > > > CONFIG_DEBUG_INFO enabled in your vmlinux. > > > > > > > > Doh, found it. Dave, please try and reproduce with this applied: > > > > > > Nevermind, that's not it either. Damn. Stay tuned. > > > > Try this instead, please. > > Heh, I was waiting forever to get that debuginfo package downloaded & unpacked. > I'll throw this into my next build, and see what happens. > Is this likely to fix the slab corruption bug I first reported, > or the list corruption as seen in the bugzilla, or (optimistically) both ? Both, it's the same bug. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 19:11 ` Jens Axboe 2006-05-30 19:23 ` Dave Jones @ 2006-05-30 19:28 ` OGAWA Hirofumi 2006-05-30 19:42 ` Jens Axboe 1 sibling, 1 reply; 25+ messages in thread From: OGAWA Hirofumi @ 2006-05-30 19:28 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Jones, Andrew Morton, linux-kernel Jens Axboe <axboe@suse.de> writes: > On Tue, May 30 2006, Jens Axboe wrote: >> On Tue, May 30 2006, Jens Axboe wrote: >> > On Tue, May 30 2006, Jens Axboe wrote: >> > > On Tue, May 30 2006, Dave Jones wrote: >> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: >> > > > >> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c >> > > > > > Pointing at cfq_set_request. >> > > > > >> > > > > I think I'm missing a piece of this - what list was corrupted, in what >> > > > > function did it trigger? >> > > > >> > > > If you look at the attachment in the bugzilla url in my previous msg, >> > > > you'll see this: >> > > > >> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c >> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ >> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! >> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] >> > > > May 30 05:31:33 mandril kernel: SMP >> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 >> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod >> > > > May 30 05:31:33 mandril kernel: CPU: 0 >> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI >> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) >> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff >> > > >> > > Just do a l *cfq_set_request+0x202 from gdb if you have >> > > CONFIG_DEBUG_INFO enabled in your vmlinux. >> > >> > Doh, found it. Dave, please try and reproduce with this applied: >> >> Nevermind, that's not it either. Damn. Stay tuned. > > Try this instead, please. Umm.. don't we need this line? static void cfq_free_io_context(struct io_context *ioc) { struct cfq_io_context *__cic; struct rb_node *n; int freed = 0; while ((n = rb_first(&ioc->cic_root)) != NULL) { __cic = rb_entry(n, struct cfq_io_context, rb_node); rb_erase(&__cic->rb_node, &ioc->cic_root); list_del(&__cic->queue_list); ^^^^^^^^ <---- this line kmem_cache_free(cfq_ioc_pool, __cic); freed++; } if (atomic_sub_and_test(freed, &ioc_count) && ioc_gone) complete(ioc_gone); } -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 19:28 ` OGAWA Hirofumi @ 2006-05-30 19:42 ` Jens Axboe 2006-05-30 19:48 ` Jens Axboe 2006-05-30 19:49 ` OGAWA Hirofumi 0 siblings, 2 replies; 25+ messages in thread From: Jens Axboe @ 2006-05-30 19:42 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: Dave Jones, Andrew Morton, linux-kernel On Wed, May 31 2006, OGAWA Hirofumi wrote: > Jens Axboe <axboe@suse.de> writes: > > > On Tue, May 30 2006, Jens Axboe wrote: > >> On Tue, May 30 2006, Jens Axboe wrote: > >> > On Tue, May 30 2006, Jens Axboe wrote: > >> > > On Tue, May 30 2006, Dave Jones wrote: > >> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > >> > > > > >> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > >> > > > > > Pointing at cfq_set_request. > >> > > > > > >> > > > > I think I'm missing a piece of this - what list was corrupted, in what > >> > > > > function did it trigger? > >> > > > > >> > > > If you look at the attachment in the bugzilla url in my previous msg, > >> > > > you'll see this: > >> > > > > >> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c > >> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ > >> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! > >> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] > >> > > > May 30 05:31:33 mandril kernel: SMP > >> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 > >> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod > >> > > > May 30 05:31:33 mandril kernel: CPU: 0 > >> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI > >> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) > >> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff > >> > > > >> > > Just do a l *cfq_set_request+0x202 from gdb if you have > >> > > CONFIG_DEBUG_INFO enabled in your vmlinux. > >> > > >> > Doh, found it. Dave, please try and reproduce with this applied: > >> > >> Nevermind, that's not it either. Damn. Stay tuned. > > > > Try this instead, please. > > Umm.. don't we need this line? > > static void cfq_free_io_context(struct io_context *ioc) > { > struct cfq_io_context *__cic; > struct rb_node *n; > int freed = 0; > > while ((n = rb_first(&ioc->cic_root)) != NULL) { > __cic = rb_entry(n, struct cfq_io_context, rb_node); > rb_erase(&__cic->rb_node, &ioc->cic_root); > list_del(&__cic->queue_list); > ^^^^^^^^ <---- this line > kmem_cache_free(cfq_ioc_pool, __cic); > freed++; > } > > if (atomic_sub_and_test(freed, &ioc_count) && ioc_gone) > complete(ioc_gone); Yep, looks like that is missing as well. Care to send a proper patch and I'll shove it in, too. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 19:42 ` Jens Axboe @ 2006-05-30 19:48 ` Jens Axboe 2006-05-30 19:49 ` OGAWA Hirofumi 1 sibling, 0 replies; 25+ messages in thread From: Jens Axboe @ 2006-05-30 19:48 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: Dave Jones, Andrew Morton, linux-kernel On Tue, May 30 2006, Jens Axboe wrote: > On Wed, May 31 2006, OGAWA Hirofumi wrote: > > Jens Axboe <axboe@suse.de> writes: > > > > > On Tue, May 30 2006, Jens Axboe wrote: > > >> On Tue, May 30 2006, Jens Axboe wrote: > > >> > On Tue, May 30 2006, Jens Axboe wrote: > > >> > > On Tue, May 30 2006, Dave Jones wrote: > > >> > > > On Tue, May 30, 2006 at 06:49:18PM +0200, Jens Axboe wrote: > > >> > > > > > >> > > > > > List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > >> > > > > > Pointing at cfq_set_request. > > >> > > > > > > >> > > > > I think I'm missing a piece of this - what list was corrupted, in what > > >> > > > > function did it trigger? > > >> > > > > > >> > > > If you look at the attachment in the bugzilla url in my previous msg, > > >> > > > you'll see this: > > >> > > > > > >> > > > ay 30 05:31:33 mandril kernel: List corruption. next->prev should be f74a5e2c, but was ea7ed31c > > >> > > > May 30 05:31:33 mandril kernel: ------------[ cut here ]------------ > > >> > > > May 30 05:31:33 mandril kernel: kernel BUG at include/linux/list.h:58! > > >> > > > May 30 05:31:33 mandril kernel: invalid opcode: 0000 [#1] > > >> > > > May 30 05:31:33 mandril kernel: SMP > > >> > > > May 30 05:31:33 mandril kernel: last sysfs file: /devices/pci0000:00/0000:00:1f.3/i2c-0/0-002e/pwm3 > > >> > > > May 30 05:31:33 mandril kernel: Modules linked in: iptable_filter ipt_DSCP iptable_mangle ip_tables x_tables eeprom lm85 hwmon_vid hwmon i2c_isa ipv6 nls_utf8 loop dm_mirror dm_mod video button battery ac lp parport_pc parport ehci_hcd uhci_hcd floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus sg snd_seq_dummy matroxfb_base snd_seq_oss snd_seq_midi_event matroxfb_DAC1064 snd_seq matroxfb_accel matroxfb_Ti3026 3w_9xxx matroxfb_g450 snd_seq_device g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd e1000 soundcore snd_page_alloc i2c_i801 i2c_core ext3 jbd 3w_xxxx ata_piix libata sd_mod scsi_mod > > >> > > > May 30 05:31:33 mandril kernel: CPU: 0 > > >> > > > May 30 05:31:33 mandril kernel: EIP: 0060:[<c04e3310>] Not tainted VLI > > >> > > > May 30 05:31:33 mandril kernel: EFLAGS: 00210292 (2.6.16-1.2227_FC6 #1) > > >> > > > May 30 05:31:33 mandril kernel: EIP is at cfq_set_request+0x202/0x3ff > > >> > > > > >> > > Just do a l *cfq_set_request+0x202 from gdb if you have > > >> > > CONFIG_DEBUG_INFO enabled in your vmlinux. > > >> > > > >> > Doh, found it. Dave, please try and reproduce with this applied: > > >> > > >> Nevermind, that's not it either. Damn. Stay tuned. > > > > > > Try this instead, please. > > > > Umm.. don't we need this line? > > > > static void cfq_free_io_context(struct io_context *ioc) > > { > > struct cfq_io_context *__cic; > > struct rb_node *n; > > int freed = 0; > > > > while ((n = rb_first(&ioc->cic_root)) != NULL) { > > __cic = rb_entry(n, struct cfq_io_context, rb_node); > > rb_erase(&__cic->rb_node, &ioc->cic_root); > > list_del(&__cic->queue_list); > > ^^^^^^^^ <---- this line > > kmem_cache_free(cfq_ioc_pool, __cic); > > freed++; > > } > > > > if (atomic_sub_and_test(freed, &ioc_count) && ioc_gone) > > complete(ioc_gone); > > Yep, looks like that is missing as well. Care to send a proper patch and > I'll shove it in, too. Spoke too quickly, it's not needed. The cic will already have gone through the exit function, so it must already have been pruned from the list. It was a little head scratching though, if we really did need that it should have shown very quickly I think. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-30 19:42 ` Jens Axboe 2006-05-30 19:48 ` Jens Axboe @ 2006-05-30 19:49 ` OGAWA Hirofumi 1 sibling, 0 replies; 25+ messages in thread From: OGAWA Hirofumi @ 2006-05-30 19:49 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Jones, Andrew Morton, linux-kernel Jens Axboe <axboe@suse.de> writes: > Yep, looks like that is missing as well. Care to send a proper patch and > I'll shove it in, too. Sorry. -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> [PATCH] Fix missing list_del(&__cic->queue_list) We should kill cic from cfqd->cic_list before freeing it. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> --- block/cfq-iosched.c | 1 + 1 file changed, 1 insertion(+) diff -puN block/cfq-iosched.c~cfq-missing-list_del block/cfq-iosched.c --- linux-2.6/block/cfq-iosched.c~cfq-missing-list_del 2006-05-31 04:42:36.000000000 +0900 +++ linux-2.6-hirofumi/block/cfq-iosched.c 2006-05-31 04:42:55.000000000 +0900 @@ -1225,6 +1225,7 @@ static void cfq_free_io_context(struct i while ((n = rb_first(&ioc->cic_root)) != NULL) { __cic = rb_entry(n, struct cfq_io_context, rb_node); rb_erase(&__cic->rb_node, &ioc->cic_root); + list_del(&__cic->queue_list); kmem_cache_free(cfq_ioc_pool, __cic); freed++; } _ ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-26 21:39 .17rc5 cfq slab corruption Dave Jones 2006-05-27 0:00 ` Andrew Morton @ 2006-05-27 2:56 ` Dave Jones 2006-05-27 3:03 ` Dave Jones 1 sibling, 1 reply; 25+ messages in thread From: Dave Jones @ 2006-05-27 2:56 UTC (permalink / raw) To: Linux Kernel On Fri, May 26, 2006 at 05:39:15PM -0400, Dave Jones wrote: > Was playing with googles new picasa toy, which hammered the disks > hunting out every image file it could find, when this popped out: > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > Prev obj: start=ffff810012b99808, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > Next obj: start=ffff810012b99988, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b After a reboot, I just hit this again. This time whilst the box was mostly idle (just picking up some email via fetchmail) Slab corruption: (Not tainted) start=ffff81003dcde8c8, len=168 Redzone: 0x5a2cf071/0x5a2cf071. Last user: [<ffffffff8032bb5f>](cfq_free_io_context+0x2f/0x74) 090: d0 1a 65 3b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b Prev obj: start=ffff81003dcde808, len=168 Redzone: 0x170fc2a5/0x170fc2a5. Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f) 000: 00 00 00 00 00 00 00 00 01 00 00 00 5a 5a 5a 5a 010: 00 00 00 00 00 00 00 00 08 78 44 3c 00 81 ff ff Next obj: start=ffff81003dcde988, len=168 Redzone: 0x170fc2a5/0x170fc2a5. Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f) 000: 88 e6 cd 3d 00 81 ff ff 00 00 00 00 5a 5a 5a 5a 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 What's interesting is the 00 81 ff ff part of the corruption is there in both cases. Anyone have any clues what this could be? Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: .17rc5 cfq slab corruption. 2006-05-27 2:56 ` Dave Jones @ 2006-05-27 3:03 ` Dave Jones 0 siblings, 0 replies; 25+ messages in thread From: Dave Jones @ 2006-05-27 3:03 UTC (permalink / raw) To: Linux Kernel On Fri, May 26, 2006 at 10:56:23PM -0400, Dave Jones wrote: > On Fri, May 26, 2006 at 05:39:15PM -0400, Dave Jones wrote: > > Was playing with googles new picasa toy, which hammered the disks > > hunting out every image file it could find, when this popped out: > > > > Slab corruption: (Not tainted) start=ffff810012b998c8, len=168 > > Redzone: 0x5a2cf071/0x5a2cf071. > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > 090: 10 bd 28 1b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > > Prev obj: start=ffff810012b99808, len=168 > > Redzone: 0x5a2cf071/0x5a2cf071. > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > Next obj: start=ffff810012b99988, len=168 > > Redzone: 0x5a2cf071/0x5a2cf071. > > Last user: [<ffffffff8032c319>](cfq_free_io_context+0x2f/0x74) > > 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > After a reboot, I just hit this again. This time whilst the box was > mostly idle (just picking up some email via fetchmail) > > Slab corruption: (Not tainted) start=ffff81003dcde8c8, len=168 > Redzone: 0x5a2cf071/0x5a2cf071. > Last user: [<ffffffff8032bb5f>](cfq_free_io_context+0x2f/0x74) > 090: d0 1a 65 3b 00 81 ff ff 6b 6b 6b 6b 6b 6b 6b 6b > Prev obj: start=ffff81003dcde808, len=168 > Redzone: 0x170fc2a5/0x170fc2a5. > Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f) > 000: 00 00 00 00 00 00 00 00 01 00 00 00 5a 5a 5a 5a > 010: 00 00 00 00 00 00 00 00 08 78 44 3c 00 81 ff ff > Next obj: start=ffff81003dcde988, len=168 > Redzone: 0x170fc2a5/0x170fc2a5. > Last user: [<ffffffff8021ce1c>](cfq_set_request+0x3bb/0x41f) > 000: 88 e6 cd 3d 00 81 ff ff 00 00 00 00 5a 5a 5a 5a > 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > What's interesting is the 00 81 ff ff part of the corruption > is there in both cases. Anyone have any clues what this could be? <sudden realisation> oh, it's high 32 bits of the address, duh. Back to head-scratching. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2006-05-30 19:49 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-05-26 21:39 .17rc5 cfq slab corruption Dave Jones 2006-05-27 0:00 ` Andrew Morton 2006-05-27 7:07 ` Jens Axboe 2006-05-27 13:31 ` Dave Jones 2006-05-27 14:52 ` Jens Axboe 2006-05-30 13:17 ` Jens Axboe 2006-05-30 13:44 ` Dave Jones 2006-05-30 13:50 ` Jens Axboe 2006-05-30 13:52 ` Dave Jones 2006-05-30 14:13 ` Dave Jones 2006-05-30 16:12 ` Dave Jones 2006-05-30 16:49 ` Jens Axboe 2006-05-30 16:56 ` Dave Jones 2006-05-30 17:04 ` Jens Axboe 2006-05-30 18:49 ` Jens Axboe 2006-05-30 18:51 ` Jens Axboe 2006-05-30 19:11 ` Jens Axboe 2006-05-30 19:23 ` Dave Jones 2006-05-30 19:27 ` Jens Axboe 2006-05-30 19:28 ` OGAWA Hirofumi 2006-05-30 19:42 ` Jens Axboe 2006-05-30 19:48 ` Jens Axboe 2006-05-30 19:49 ` OGAWA Hirofumi 2006-05-27 2:56 ` Dave Jones 2006-05-27 3:03 ` Dave Jones
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.