2.6.18 ext3 panic.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* 2.6.18 ext3 panic.
@ 2006-10-02 19:47 Dave Jones
  2006-10-03  5:22 ` Dave Jones
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Jones @ 2006-10-02 19:47 UTC (permalink / raw)
  To: Linux Kernel

Not sure what exactly happened here. Was running fsx on ext3 over 2 disk raid0,
and running a yum update. Box locked up solid after a few minutes..
http://www.codemonkey.org.uk/junk/DSC00747.JPG

The unwinder getting stuck meant I lost the top of the trace though.
(I have backporting the .19 fixes to .18 on my todo unless someone
 beats me to it and they end up in -stable).

Will try to reproduce with a serial console hooked up.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-02 19:47 2.6.18 ext3 panic Dave Jones
@ 2006-10-03  5:22 ` Dave Jones
  2006-10-03  5:43   ` Eric Sandeen
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Jones @ 2006-10-03  5:22 UTC (permalink / raw)
  To: Linux Kernel; +Cc: esandeen

<Cc'd Eric as he's been looking into this>

On Mon, Oct 02, 2006 at 03:47:11PM -0400, Dave Jones wrote:
 > Not sure what exactly happened here. Was running fsx on ext3 over 2 disk raid0,
 > and running a yum update. Box locked up solid after a few minutes..
 > http://www.codemonkey.org.uk/junk/DSC00747.JPG
 > 
 > The unwinder getting stuck meant I lost the top of the trace though.
 > (I have backporting the .19 fixes to .18 on my todo unless someone
 >  beats me to it and they end up in -stable).
 > 
 > Will try to reproduce with a serial console hooked up.

So I managed to reproduce it with an 'fsx foo' and a
'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
a vanilla 2.6.18 with none of the Fedora patches..

I'll give 2.6.18-git a try next.

		Dave

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/buffer.c:2791
invalid opcode: 0000 [1] SMP 
CPU 1 
Modules linked in: hidp l2cap bluetooth nfs lockd nfs_acl sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sr_mod snd_seq_device cdrom intel_rng snd_pcm_oss sg snd_mixer_oss snd_pcm shpchp floppy serio_raw pcspkr i2c_i801 ohci1394 ieee1394 snd_timer snd e1000 i2c_core soundcore snd_page_alloc sata_sil ahci libata sd_mod scsi_mod raid0 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 408, comm: kjournald Not tainted 2.6.18 #1
RIP: 0010:[<ffffffff8021b425>]  [<ffffffff8021b425>] submit_bh+0x29/0x124
RSP: 0018:ffff81007ebcbd40  EFLAGS: 00010246
RAX: 0000000000000005 RBX: ffff810063dd0ec8 RCX: 8000000000000000
RDX: ffff81007f1f5430 RSI: ffff810063dd0ec8 RDI: 0000000000000001
RBP: ffff81007ebcbd60 R08: 0000000000800000 R09: 0000000000000003
R10: ffff810068621510 R11: 0000000000000400 R12: ffff81007f7f48c8
R13: 0000000000000001 R14: 0000000000000033 R15: 0000000000000080
FS:  0000000000000000(0000) GS:ffff810037ff1cc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b76d5365000 CR3: 00000000658e3000 CR4: 00000000000006e0
Process kjournald (pid: 408, threadinfo ffff81007ebca000, task ffff810037f52040)
Stack:  0000000000000003 ffff810063dd0ec8 ffff81007f7f48c8 0000000000000003
 ffff81007ebcbda0 ffffffff80217ca1 ffff81007aa572a0 ffff810063f8d400
 ffff810064145478 ffff81007f1ea208 ffff81007aa572a0 0000000000000080
Call Trace:
 [<ffffffff80217ca1>] ll_rw_block+0xa6/0xcd
 [<ffffffff88035991>] :jbd:journal_commit_transaction+0x40b/0x10ce
 [<ffffffff8803a033>] :jbd:kjournald+0xc7/0x222
 [<ffffffff80236089>] kthread+0x100/0x136
 [<ffffffff802624a0>] child_rip+0xa/0x12
DWARF2 unwinder stuck at child_rip+0xa/0x12
Leftover inexact backtrace:
 [<ffffffff80268c22>] _spin_unlock_irq+0x2b/0x31
 [<ffffffff80261adc>] restore_args+0x0/0x30
 [<ffffffff80250ec3>] run_workqueue+0x19/0xfa
 [<ffffffff80250ec3>] run_workqueue+0x19/0xfa
 [<ffffffff80235f89>] kthread+0x0/0x136
 [<ffffffff80262496>] child_rip+0x0/0x12


Code: 0f 0b 68 c8 86 49 80 c2 e7 0a 48 83 7b 38 00 75 0a 0f 0b 68 
RIP  [<ffffffff8021b425>] submit_bh+0x29/0x124
 RSP <ffff81007ebcbd40>
 <0>general protection fault: 0000 [2] SMP 
CPU 1 
Modules linked in: hidp l2cap bluetooth nfs lockd nfs_acl sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sr_mod snd_seq_device cdrom intel_rng snd_pcm_oss sg snd_mixer_oss snd_pcm shpchp floppy serio_raw pcspkr i2c_i801 ohci1394 ieee1394 snd_timer snd e1000 i2c_core soundcore snd_page_alloc sata_sil ahci libata sd_mod scsi_mod raid0 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18 #1
RIP: 0010:[<ffffffff8028e1a1>]  [<ffffffff8028e1a1>] task_rq_lock+0x2b/0x74
RSP: 0018:ffff810037e17df0  EFLAGS: 00010006
RAX: 6b6b6b6b6b6b6b6b RBX: ffffffff80aae480 RCX: ffff81007f1ea5a8
RDX: ffff81007ee71080 RSI: ffff810037e17e78 RDI: ffff810037f52040
RBP: ffff810037e17e10 R08: ffff810037e17eb0 R09: 0000000000000001
R10: 0000000000000001 R11: ffffffff8029995d R12: ffffffff80aae480
R13: ffff810037e17e78 R14: ffff810037f52040 R15: 0000000000000100
FS:  0000000000000000(0000) GS:ffff810037ff1cc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b76d535d000 CR3: 000000007e290000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810037e10000, task ffff81007ee71080)
Stack:  000000000000000f ffff810037e08000 ffff810037f52040 ffffffff880398d9
 ffff810037e17eb0 ffffffff80249e98 ffff810037e08000 000000007ee71080
 ffffffff80268c22 0000000200000001 0000000000000000 0000000100000000
Call Trace:
 [<ffffffff80249e98>] try_to_wake_up+0x27/0x421
 [<ffffffff8028e3ce>] wake_up_process+0x10/0x12
 [<ffffffff880398e2>] :jbd:commit_timeout+0x9/0xb
 [<ffffffff80299a67>] run_timer_softirq+0x14c/0x1d5
 [<ffffffff80212724>] __do_softirq+0x68/0xf5
 [<ffffffff802627f8>] call_softirq+0x1c/0x28
DWARF2 unwinder stuck at call_softirq+0x1c/0x28
Leftover inexact backtrace:
 <IRQ> [<ffffffff80270c65>] do_softirq+0x39/0x9f
 [<ffffffff802962a3>] irq_exit+0x57/0x59
 [<ffffffff8027b063>] smp_apic_timer_interrupt+0x5d/0x62
 [<ffffffff8025b784>] mwait_idle+0x0/0x54
 [<ffffffff8026216f>] apic_timer_interrupt+0x6b/0x70
 <EOI> [<ffffffff80266026>] __sched_text_start+0xaa6/0xadd
 [<ffffffff8025b7c3>] mwait_idle+0x3f/0x54
 [<ffffffff8025b78d>] mwait_idle+0x9/0x54
 [<ffffffff8024c916>] cpu_idle+0xa2/0xc5
 [<ffffffff8027a674>] start_secondary+0x468/0x477


Code: 8b 40 18 48 8b 04 c5 c0 59 a6 80 4c 03 60 08 4c 89 e7 e8 5e 
RIP  [<ffffffff8028e1a1>] task_rq_lock+0x2b/0x74
 RSP <ffff810037e17df0>
 <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():1, irqs_disabled():1

Call Trace:
 [<ffffffff8026f956>] show_trace+0xae/0x336
 [<ffffffff8026fbf3>] dump_stack+0x15/0x17
 [<ffffffff8020bb01>] __might_sleep+0xb2/0xb4
 [<ffffffff802a5160>] down_read+0x1d/0x4a
 [<ffffffff8029cf62>] blocking_notifier_call_chain+0x1b/0x41
 [<ffffffff80293e10>] profile_task_exit+0x15/0x17
 [<ffffffff80215a74>] do_exit+0x25/0x96a
 [<ffffffff8026fe21>] kernel_math_error+0x0/0x96
 [<ffff810037e17d48>]
DWARF2 unwinder stuck at 0xffff810037e17d48
Leftover inexact backtrace:
 <IRQ> [<ffffffff8026993f>] do_general_protection+0x10a/0x115
 [<ffffffff8808bab2>] :scsi_mod:scsi_run_queue+0x1ab/0x1ba
 [<ffffffff802622d1>] error_exit+0x0/0x96
 [<ffffffff8029995d>] run_timer_softirq+0x42/0x1d5
 [<ffffffff8028e1a1>] task_rq_lock+0x2b/0x74
 [<ffffffff880398d9>] :jbd:commit_timeout+0x0/0xb
 [<ffffffff80249e98>] try_to_wake_up+0x27/0x421
 [<ffffffff80268c22>] _spin_unlock_irq+0x2b/0x31
 [<ffffffff80268c22>] _spin_unlock_irq+0x2b/0x31
 [<ffffffff880398d9>] :jbd:commit_timeout+0x0/0xb
 [<ffffffff880398d9>] :jbd:commit_timeout+0x0/0xb
 [<ffffffff8028e3ce>] wake_up_process+0x10/0x12
 [<ffffffff880398e2>] :jbd:commit_timeout+0x9/0xb
 [<ffffffff80299a67>] run_timer_softirq+0x14c/0x1d5
 [<ffffffff80212724>] __do_softirq+0x68/0xf5
 [<ffffffff802627f8>] call_softirq+0x1c/0x28
 [<ffffffff80270c65>] do_softirq+0x39/0x9f
 [<ffffffff802962a3>] irq_exit+0x57/0x59
 [<ffffffff8027b063>] smp_apic_timer_interrupt+0x5d/0x62
 [<ffffffff8025b784>] mwait_idle+0x0/0x54
 [<ffffffff8026216f>] apic_timer_interrupt+0x6b/0x70
 <EOI> [<ffffffff80266026>] __sched_text_start+0xaa6/0xadd
 [<ffffffff8025b7c3>] mwait_idle+0x3f/0x54
 [<ffffffff8025b78d>] mwait_idle+0x9/0x54
 [<ffffffff8024c916>] cpu_idle+0xa2/0xc5
 [<ffffffff8027a674>] start_secondary+0x468/0x477

Kernel panic - not syncing: Aiee, killing interrupt handler!
 

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-03  5:22 ` Dave Jones
@ 2006-10-03  5:43   ` Eric Sandeen
  2006-10-03  6:19     ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-03  5:43 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, esandeen

Dave Jones wrote:

> So I managed to reproduce it with an 'fsx foo' and a
> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> a vanilla 2.6.18 with none of the Fedora patches..
> 
> I'll give 2.6.18-git a try next.
> 
> 		Dave
> 
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at fs/buffer.c:2791

I had thought/hoped that this was fixed by Jan's patch at 
http://lkml.org/lkml/2006/9/7/236 from the thread started at 
http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
first by going through that new codepath....

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-03  5:43   ` Eric Sandeen
@ 2006-10-03  6:19     ` Andrew Morton
  2006-10-03  6:40       ` Dave Jones
  2006-10-09 19:46       ` Eric Sandeen
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-10-03  6:19 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Dave Jones, Linux Kernel, esandeen, Badari Pulavarty, Jan Kara

On Tue, 03 Oct 2006 00:43:01 -0500
Eric Sandeen <sandeen@sandeen.net> wrote:

> Dave Jones wrote:
> 
> > So I managed to reproduce it with an 'fsx foo' and a
> > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > a vanilla 2.6.18 with none of the Fedora patches..
> > 
> > I'll give 2.6.18-git a try next.
> > 
> > 		Dave
> > 
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at fs/buffer.c:2791
> 
> I had thought/hoped that this was fixed by Jan's patch at 
> http://lkml.org/lkml/2006/9/7/236 from the thread started at 
> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
> first by going through that new codepath....

Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
Badari was hitting that BUG and was able to confirm that Jan's patch
(3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-03  6:19     ` Andrew Morton
@ 2006-10-03  6:40       ` Dave Jones
  2006-10-03 16:45         ` Dave Jones
  2006-10-09 19:46       ` Eric Sandeen
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Jones @ 2006-10-03  6:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Sandeen, Linux Kernel, esandeen, Badari Pulavarty, Jan Kara

On Mon, Oct 02, 2006 at 11:19:45PM -0700, Andrew Morton wrote:
 > On Tue, 03 Oct 2006 00:43:01 -0500
 > Eric Sandeen <sandeen@sandeen.net> wrote:
 > 
 > > Dave Jones wrote:
 > > 
 > > > So I managed to reproduce it with an 'fsx foo' and a
 > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
 > > > a vanilla 2.6.18 with none of the Fedora patches..
 > > > 
 > > > I'll give 2.6.18-git a try next.
 > > > 
 > > > 		Dave
 > > > 
 > > > ----------- [cut here ] --------- [please bite here ] ---------
 > > > Kernel BUG at fs/buffer.c:2791
 > > 
 > > I had thought/hoped that this was fixed by Jan's patch at 
 > > http://lkml.org/lkml/2006/9/7/236 from the thread started at 
 > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
 > > first by going through that new codepath....
 > 
 > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
 > Badari was hitting that BUG and was able to confirm that Jan's patch
 > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
 > it.

Ok, this afternoon I was definitly running a kernel with that patch in it,
and managed to get a trace (It was the one from the top of this thread
that unfortunatly got truncated).

Now, I can't reproduce it on a plain 2.6.18+that patch.
I'll leave the stress test running overnight, and see if anything
falls out in the morning.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-03  6:40       ` Dave Jones
@ 2006-10-03 16:45         ` Dave Jones
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Jones @ 2006-10-03 16:45 UTC (permalink / raw)
  To: Andrew Morton, Eric Sandeen, Linux Kernel, esandeen,
	Badari Pulavarty, Jan Kara

On Tue, Oct 03, 2006 at 02:40:30AM -0400, Dave Jones wrote:
 
 >  > > > ----------- [cut here ] --------- [please bite here ] ---------
 >  > > > Kernel BUG at fs/buffer.c:2791
 >  > > 
 >  > > I had thought/hoped that this was fixed by Jan's patch at 
 >  > > http://lkml.org/lkml/2006/9/7/236 from the thread started at 
 >  > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
 >  > > first by going through that new codepath....
 >  > 
 >  > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
 >  > Badari was hitting that BUG and was able to confirm that Jan's patch
 >  > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
 >  > it.
 > 
 > Ok, this afternoon I was definitly running a kernel with that patch in it,
 > and managed to get a trace (It was the one from the top of this thread
 > that unfortunatly got truncated).
 > 
 > Now, I can't reproduce it on a plain 2.6.18+that patch.
 > I'll leave the stress test running overnight, and see if anything
 > falls out in the morning.
 
Been chugging away for 10 hrs now without repeating that incident. Hmm.
That patch looks like good -stable material. I'll keep digging to
see if I can somehow reproduce the problem I saw with the patch applied,
but in absense of something better, I think we should go with it.

One thing that did happen in the 10hrs was fsx-over-NFS spewed some
nasty looking trace. I'll post that separately next.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-03  6:19     ` Andrew Morton
  2006-10-03  6:40       ` Dave Jones
@ 2006-10-09 19:46       ` Eric Sandeen
  2006-10-09 19:59         ` Eric Sandeen
                           ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Eric Sandeen @ 2006-10-09 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Jones, Linux Kernel, esandeen, Badari Pulavarty, Jan Kara

Andrew Morton wrote:
> On Tue, 03 Oct 2006 00:43:01 -0500
> Eric Sandeen <sandeen@sandeen.net> wrote:
> 
>> Dave Jones wrote:
>>
>>> So I managed to reproduce it with an 'fsx foo' and a
>>> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
>>> a vanilla 2.6.18 with none of the Fedora patches..
>>>
>>> I'll give 2.6.18-git a try next.
>>>
>>> 		Dave
>>>
>>> ----------- [cut here ] --------- [please bite here ] ---------
>>> Kernel BUG at fs/buffer.c:2791
>> I had thought/hoped that this was fixed by Jan's patch at 
>> http://lkml.org/lkml/2006/9/7/236 from the thread started at 
>> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
>> first by going through that new codepath....
> 
> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
> Badari was hitting that BUG and was able to confirm that Jan's patch
> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> it.

Looking at some BH traces*, it appears that what Dave hit is a truncate
racing with a sync...

truncate ...
  ext3_invalidate_page
    journal_invalidatepage
      journal_unmap buffer

going off at the same time as

sync ...
  journal_dirty_data
    sync_dirty_buffer
      submit_bh <-- finds unmapped buffer, boom.

I'm not sure what should be coordinating this, and I'm not sure why
we've not yet seen it on a stock kernel, but only FC6... I haven't found
anything in FC6 that looks like it may affect this.

-Eric

*http://people.redhat.com/esandeen/traces/davej_ext3_oops1.txt

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-09 19:46       ` Eric Sandeen
@ 2006-10-09 19:59         ` Eric Sandeen
  2006-10-09 21:59         ` Badari Pulavarty
  2006-10-09 22:40         ` Jan-Benedict Glaw
  2 siblings, 0 replies; 37+ messages in thread
From: Eric Sandeen @ 2006-10-09 19:59 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andrew Morton, Dave Jones, Linux Kernel, Badari Pulavarty,
	Jan Kara

Eric Sandeen wrote:

>>> I had thought/hoped that this was fixed by Jan's patch at 
>>> http://lkml.org/lkml/2006/9/7/236 from the thread started at 
>>> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
>>> first by going through that new codepath....
>> Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
>> Badari was hitting that BUG and was able to confirm that Jan's patch
>> (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
>> it.
> 
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...

(oh btw this is -with the above patch from Jan in place...)

-Eric

> truncate ...
>   ext3_invalidate_page
>     journal_invalidatepage
>       journal_unmap buffer
> 
> going off at the same time as
> 
> sync ...
>   journal_dirty_data
>     sync_dirty_buffer
>       submit_bh <-- finds unmapped buffer, boom.
> 
> I'm not sure what should be coordinating this, and I'm not sure why
> we've not yet seen it on a stock kernel, but only FC6... I haven't found
> anything in FC6 that looks like it may affect this.
> 
> -Eric
> 
> *http://people.redhat.com/esandeen/traces/davej_ext3_oops1.txt


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-09 19:46       ` Eric Sandeen
  2006-10-09 19:59         ` Eric Sandeen
@ 2006-10-09 21:59         ` Badari Pulavarty
  2006-10-09 22:50           ` Dave Jones
  2006-10-09 22:40         ` Jan-Benedict Glaw
  2 siblings, 1 reply; 37+ messages in thread
From: Badari Pulavarty @ 2006-10-09 21:59 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Andrew Morton, Dave Jones, Linux Kernel, esandeen, Jan Kara

On Mon, 2006-10-09 at 14:46 -0500, Eric Sandeen wrote:
> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 00:43:01 -0500
> > Eric Sandeen <sandeen@sandeen.net> wrote:
> > 
> >> Dave Jones wrote:
> >>
> >>> So I managed to reproduce it with an 'fsx foo' and a
> >>> 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> >>> a vanilla 2.6.18 with none of the Fedora patches..
> >>>
> >>> I'll give 2.6.18-git a try next.
> >>>
> >>> 		Dave
> >>>
> >>> ----------- [cut here ] --------- [please bite here ] ---------
> >>> Kernel BUG at fs/buffer.c:2791
> >> I had thought/hoped that this was fixed by Jan's patch at 
> >> http://lkml.org/lkml/2006/9/7/236 from the thread started at 
> >> http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
> >> first by going through that new codepath....
> > 
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
> 
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
> 
> truncate ...
>   ext3_invalidate_page
>     journal_invalidatepage
>       journal_unmap buffer
> 
> going off at the same time as
> 
> sync ...
>   journal_dirty_data
>     sync_dirty_buffer
>       submit_bh <-- finds unmapped buffer, boom.
> 

I don't understand how this can happen ..

journal_unmap_buffer() zapping the buffer since its not attached to any
transaction. 

journal_unmap_buffer():[fs/jbd/transaction.c:1789] not on any
transaction: zap
     b_state:0x10402f b_jlist:BJ_None cpu:0 b_count:3 b_blocknr:52735707
     b_jbd:1 b_frozen_data:0000000000000000
b_committed_data:0000000000000000
     b_transaction:0 b_next_transaction:0 b_cp_transaction:0
b_trans_is_running:0
     b_trans_is_comitting:0 b_jcount:2 pg_dirty:1


journal_dirty_data() would do submit_bh() ONLY if its part of the older
transaction.

I need to take a closer look to understand the race.

BTW, is this 1k or 2k filesystem ? How easy is to reproduce the
problem ?

Thanks,
Badari




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-09 21:59         ` Badari Pulavarty
@ 2006-10-09 22:50           ` Dave Jones
  2006-10-10 14:11             ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Jones @ 2006-10-09 22:50 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Eric Sandeen, Andrew Morton, Linux Kernel, esandeen, Jan Kara

On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:

 > journal_dirty_data() would do submit_bh() ONLY if its part of the older
 > transaction.
 > 
 > I need to take a closer look to understand the race.
 > 
 > BTW, is this 1k or 2k filesystem ?

(18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0  | grep size
Block size:               1024
Fragment size:            1024
Inode size:               128
(18:41:16:davej@gelk:~)$ 

 > How easy is to reproduce the problem ?

I can reproduce it within a few hours of stressing, but only
on that one box.  I've not figured out exactly what's so
special about it yet (though the 1k block thing may be it).
I had been thinking it was a raid0 only thing, as none of
my other boxes have that.

I'm not entirely sure how it got set up that way either.
The Fedora installer being too smart for its own good perhaps.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-09 22:50           ` Dave Jones
@ 2006-10-10 14:11             ` Jan Kara
  2006-10-10 18:42               ` Andrew Morton
  2006-10-10 22:03               ` Eric Sandeen
  0 siblings, 2 replies; 37+ messages in thread
From: Jan Kara @ 2006-10-10 14:11 UTC (permalink / raw)
  To: Dave Jones, Badari Pulavarty, Eric Sandeen, Andrew Morton,
	Linux Kernel, esandeen

> On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
> 
>  > journal_dirty_data() would do submit_bh() ONLY if its part of the older
>  > transaction.
>  > 
>  > I need to take a closer look to understand the race.
>  > 
>  > BTW, is this 1k or 2k filesystem ?
> 
> (18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0  | grep size
> Block size:               1024
> Fragment size:            1024
> Inode size:               128
> (18:41:16:davej@gelk:~)$ 
> 
>  > How easy is to reproduce the problem ?
> 
> I can reproduce it within a few hours of stressing, but only
> on that one box.  I've not figured out exactly what's so
> special about it yet (though the 1k block thing may be it).
> I had been thinking it was a raid0 only thing, as none of
> my other boxes have that.
> 
> I'm not entirely sure how it got set up that way either.
> The Fedora installer being too smart for its own good perhaps.
  I think it's really the 1KB block size that makes it happen.
I've looked at journal_dirty_data() code and I think the following can
happen:
  sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
There is finds dirty buffer attached to the comitting transaction. So it drops
all locks and calls sync_dirty_buffer(bh).
  Now in other process, file is truncated so that 'bh' gets just after EOF.
As we have 1kb buffers, it can happen that bh is in the partially
truncated page. Buffer is marked unmapped and clean. But in a moment the page
is marked dirty and msync() is called. That eventually calls
set_page_dirty() and all buffers in the page are marked dirty.
  The first process now wakes up, locks the buffer, clears the dirty bit
and does submit_bh() - Oops.

  This is essentially the same problem Badari found but in a different
place. There are two places that are arguably wrong...
  1) We mark buffer dirty after EOF. But actually that may be needed -
or what is the expected behaviour when we write into mmapped file after
EOF, then extend the file and do msync()?
  2) We submit a buffer after EOF for IO. This should be clearly avoided
but getting the needed info from bh is really ugly...

  What we could do is: Instead od calling sync_dirty_buffer() we do
something like:

lock_buffer(bh);
jbd_lock_bh_state(bh);
if (!buffer_jbd(bh) || jh != bh2jh(bh) || jh->b_transaction !=
  journal->j_committing_transaction) {
	jbd_unlock_bh_state(bh);
	unlock_buffer(bh);
}
jbd_unlock_bh_state(bh);
if (test_clear_buffer_dirty(bh)) {
	get_bh(bh);
	bh->b_end_io = end_buffer_write_sync;
	submit_bh(WRITE, bh);
	wait_on_buffer(bh);
}
else
	unlock_buffer(bh);

That should deal with the problem... Much more adventurous change would
be to remove the syncing code altogether - the new commit code makes
sure to write out each buffer just once so the livelock should not
happen now. But then we'd have to put running transaction in
j_next_transaction and refile data buffers instead of unfiling them.
That should actually give quite some performance improvement when
intensively rewriting files. But I guess that is JBD2 matter ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-10 14:11             ` Jan Kara
@ 2006-10-10 18:42               ` Andrew Morton
  2006-10-10 22:03               ` Eric Sandeen
  1 sibling, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2006-10-10 18:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Jones, Badari Pulavarty, Eric Sandeen, Linux Kernel,
	esandeen

On Tue, 10 Oct 2006 16:11:45 +0200
Jan Kara <jack@suse.cz> wrote:

> > On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
> > 
> >  > journal_dirty_data() would do submit_bh() ONLY if its part of the older
> >  > transaction.
> >  > 
> >  > I need to take a closer look to understand the race.
> >  > 
> >  > BTW, is this 1k or 2k filesystem ?
> > 
> > (18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0  | grep size
> > Block size:               1024
> > Fragment size:            1024
> > Inode size:               128
> > (18:41:16:davej@gelk:~)$ 
> > 
> >  > How easy is to reproduce the problem ?
> > 
> > I can reproduce it within a few hours of stressing, but only
> > on that one box.  I've not figured out exactly what's so
> > special about it yet (though the 1k block thing may be it).
> > I had been thinking it was a raid0 only thing, as none of
> > my other boxes have that.
> > 
> > I'm not entirely sure how it got set up that way either.
> > The Fedora installer being too smart for its own good perhaps.
>   I think it's really the 1KB block size that makes it happen.
> I've looked at journal_dirty_data() code and I think the following can
> happen:
>   sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> There is finds dirty buffer attached to the comitting transaction. So it drops
> all locks and calls sync_dirty_buffer(bh).
>   Now in other process, file is truncated so that 'bh' gets just after EOF.
> As we have 1kb buffers, it can happen that bh is in the partially
> truncated page. Buffer is marked unmapped and clean. But in a moment the page
> is marked dirty and msync() is called. That eventually calls
> set_page_dirty() and all buffers in the page are marked dirty.
>   The first process now wakes up, locks the buffer, clears the dirty bit
> and does submit_bh() - Oops.
> 
>   This is essentially the same problem Badari found but in a different
> place. There are two places that are arguably wrong...
>   1) We mark buffer dirty after EOF. But actually that may be needed -
> or what is the expected behaviour when we write into mmapped file after
> EOF, then extend the file and do msync()?

yup.

>   2) We submit a buffer after EOF for IO. This should be clearly avoided
> but getting the needed info from bh is really ugly...

Things like __block_write_full_page() avoid this by checking the block's
offset against i_size.  (Not racy against truncate-down because the page is
locked, not racy against truncate-up because the bh is zero and
up-to-date).

But for jbd writeout we don't hold the page lock, so checking against
bh->b_page->host->i_size is a bit racy.

hm.  But we do lock the buffers in journal_invalidatepage(), so checking
i_size after locking the buffer in the writeout path might get us there.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-10 14:11             ` Jan Kara
  2006-10-10 18:42               ` Andrew Morton
@ 2006-10-10 22:03               ` Eric Sandeen
  2006-10-10 22:25                 ` Badari Pulavarty
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-10 22:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Jones, Badari Pulavarty, Eric Sandeen, Andrew Morton,
	Linux Kernel

Jan Kara wrote:

>   I think it's really the 1KB block size that makes it happen.
> I've looked at journal_dirty_data() code and I think the following can
> happen:
>   sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> There is finds dirty buffer attached to the comitting transaction. So it drops
> all locks and calls sync_dirty_buffer(bh).
>   Now in other process, file is truncated so that 'bh' gets just after EOF.
> As we have 1kb buffers, it can happen that bh is in the partially
> truncated page. Buffer is marked unmapped and clean. But in a moment the page
> is marked dirty and msync() is called. That eventually calls
> set_page_dirty() and all buffers in the page are marked dirty.
>   The first process now wakes up, locks the buffer, clears the dirty bit
> and does submit_bh() - Oops.

Hm, just FWIW I have a couple traces* of the buffer getting unmapped
-before- journal_submit_data_buffers ever even finds it...

 journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
adding to array pid 1836
     b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
     b_jbd:1 b_frozen_data:0000000000000000
b_committed_data:0000000000000000
     b_transaction:1 b_next_transaction:0 b_cp_transaction:0
b_trans_is_running:0
     b_trans_is_comitting:1 b_jcount:0 pg_dirty:0

so it's already unmapped at this point.  Could
journal_submit_data_buffers benefit from some buffer_mapped checks?  Or
is that just a bandaid too late...

-Eric

*http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
 http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-10 22:03               ` Eric Sandeen
@ 2006-10-10 22:25                 ` Badari Pulavarty
  2006-10-11  1:43                   ` Eric Sandeen
  0 siblings, 1 reply; 37+ messages in thread
From: Badari Pulavarty @ 2006-10-10 22:25 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Dave Jones, Eric Sandeen, Andrew Morton, Linux Kernel

On Tue, 2006-10-10 at 17:03 -0500, Eric Sandeen wrote:
> Jan Kara wrote:
> 
> >   I think it's really the 1KB block size that makes it happen.
> > I've looked at journal_dirty_data() code and I think the following can
> > happen:
> >   sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> > There is finds dirty buffer attached to the comitting transaction. So it drops
> > all locks and calls sync_dirty_buffer(bh).
> >   Now in other process, file is truncated so that 'bh' gets just after EOF.
> > As we have 1kb buffers, it can happen that bh is in the partially
> > truncated page. Buffer is marked unmapped and clean. But in a moment the page
> > is marked dirty and msync() is called. That eventually calls
> > set_page_dirty() and all buffers in the page are marked dirty.
> >   The first process now wakes up, locks the buffer, clears the dirty bit
> > and does submit_bh() - Oops.
> 
> Hm, just FWIW I have a couple traces* of the buffer getting unmapped
> -before- journal_submit_data_buffers ever even finds it...
> 
>  journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
> adding to array pid 1836
>      b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
>      b_jbd:1 b_frozen_data:0000000000000000
> b_committed_data:0000000000000000
>      b_transaction:1 b_next_transaction:0 b_cp_transaction:0
> b_trans_is_running:0
>      b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
> 
> so it's already unmapped at this point.  Could
> journal_submit_data_buffers benefit from some buffer_mapped checks?  Or
> is that just a bandaid too late...

Hmm..

b_state: 0x114025 
               ^
means BH_Mapped. Isn't it ?

Thanks,
Badari


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-10 22:25                 ` Badari Pulavarty
@ 2006-10-11  1:43                   ` Eric Sandeen
  2006-10-11 10:33                     ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-11  1:43 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Eric Sandeen, Jan Kara, Dave Jones, Andrew Morton, Linux Kernel

Badari Pulavarty wrote:
> On Tue, 2006-10-10 at 17:03 -0500, Eric Sandeen wrote:
>> Jan Kara wrote:
>>
>>>   I think it's really the 1KB block size that makes it happen.
>>> I've looked at journal_dirty_data() code and I think the following can
>>> happen:
>>>   sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
>>> There is finds dirty buffer attached to the comitting transaction. So it drops
>>> all locks and calls sync_dirty_buffer(bh).
>>>   Now in other process, file is truncated so that 'bh' gets just after EOF.
>>> As we have 1kb buffers, it can happen that bh is in the partially
>>> truncated page. Buffer is marked unmapped and clean. But in a moment the page
>>> is marked dirty and msync() is called. That eventually calls
>>> set_page_dirty() and all buffers in the page are marked dirty.
>>>   The first process now wakes up, locks the buffer, clears the dirty bit
>>> and does submit_bh() - Oops.
>> Hm, just FWIW I have a couple traces* of the buffer getting unmapped
>> -before- journal_submit_data_buffers ever even finds it...
>>
>>  journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
>> adding to array pid 1836
>>      b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
>>      b_jbd:1 b_frozen_data:0000000000000000
>> b_committed_data:0000000000000000
>>      b_transaction:1 b_next_transaction:0 b_cp_transaction:0
>> b_trans_is_running:0
>>      b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
>>
>> so it's already unmapped at this point.  Could
>> journal_submit_data_buffers benefit from some buffer_mapped checks?  Or
>> is that just a bandaid too late...
> 
> Hmm..
> 
> b_state: 0x114025 
>                ^
> means BH_Mapped. Isn't it ?

Whoops, I pasted in the wrong one, I guess, from earlier in the trace.  Here are 
the ones I was looking at:

  journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding to 
array pid 1690
      b_state:0x104005 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:30045
      b_jbd:1 b_frozen_data:0000000000000000 b_committed_data:0000000000000000
      b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0
      b_trans_is_comitting:1 b_jcount:0 pg_dirty:1

and

  journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding to 
array pid 1836
      b_state:0x114005 b_jlist:BJ_SyncData cpu:1 b_count:2 b_blocknr:27130
      b_jbd:1 b_frozen_data:0000000000000000 b_committed_data:0000000000000000
      b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0
      b_trans_is_comitting:1 b_jcount:0 pg_dirty:1

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-11  1:43                   ` Eric Sandeen
@ 2006-10-11 10:33                     ` Jan Kara
  2006-10-11 13:44                       ` Eric Sandeen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2006-10-11 10:33 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Badari Pulavarty, Eric Sandeen, Jan Kara, Dave Jones,
	Andrew Morton, Linux Kernel

> Badari Pulavarty wrote:
> >On Tue, 2006-10-10 at 17:03 -0500, Eric Sandeen wrote:
> >>Jan Kara wrote:
> >>
> >>>  I think it's really the 1KB block size that makes it happen.
> >>>I've looked at journal_dirty_data() code and I think the following can
> >>>happen:
> >>>  sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> >>>There is finds dirty buffer attached to the comitting transaction. So it 
> >>>drops
> >>>all locks and calls sync_dirty_buffer(bh).
> >>>  Now in other process, file is truncated so that 'bh' gets just after 
> >>>  EOF.
> >>>As we have 1kb buffers, it can happen that bh is in the partially
> >>>truncated page. Buffer is marked unmapped and clean. But in a moment the 
> >>>page
> >>>is marked dirty and msync() is called. That eventually calls
> >>>set_page_dirty() and all buffers in the page are marked dirty.
> >>>  The first process now wakes up, locks the buffer, clears the dirty bit
> >>>and does submit_bh() - Oops.
> >>Hm, just FWIW I have a couple traces* of the buffer getting unmapped
> >>-before- journal_submit_data_buffers ever even finds it...
> >>
> >> journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout,
> >>adding to array pid 1836
> >>     b_state:0x114025 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:27130
> >>     b_jbd:1 b_frozen_data:0000000000000000
> >>b_committed_data:0000000000000000
> >>     b_transaction:1 b_next_transaction:0 b_cp_transaction:0
> >>b_trans_is_running:0
> >>     b_trans_is_comitting:1 b_jcount:0 pg_dirty:0
> >>
> >>so it's already unmapped at this point.  Could
> >>journal_submit_data_buffers benefit from some buffer_mapped checks?  Or
> >>is that just a bandaid too late...
> >
> >Hmm..
> >
> >b_state: 0x114025 
> >               ^
> >means BH_Mapped. Isn't it ?
> 
> Whoops, I pasted in the wrong one, I guess, from earlier in the trace.  
> Here are the ones I was looking at:
> 
>  journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding 
>  to array pid 1690
>      b_state:0x104005 b_jlist:BJ_SyncData cpu:0 b_count:2 b_blocknr:30045
>      b_jbd:1 b_frozen_data:0000000000000000 
>      b_committed_data:0000000000000000
>      b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
>      b_trans_is_running:0
>      b_trans_is_comitting:1 b_jcount:0 pg_dirty:1
> 
> and
> 
>  journal_submit_data_buffers():[fs/jbd/commit.c:242] needs writeout, adding 
>  to array pid 1836
>      b_state:0x114005 b_jlist:BJ_SyncData cpu:1 b_count:2 b_blocknr:27130
>      b_jbd:1 b_frozen_data:0000000000000000 
>      b_committed_data:0000000000000000
>      b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
>      b_trans_is_running:0
>      b_trans_is_comitting:1 b_jcount:0 pg_dirty:1
  Umm, but these two traces confuse me:
1) They are different traces that those you wrote about initially,
aren't they? Because here we would not call sync_dirty_buffer() from
journal_dirty_data().
  BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
as the buffer is not dirty...
  Am I right that now there are no Oopses because of buffers submitted
from the commit code? Only those from journal_dirty_data()?

2) Those buffers have strange states - they are !mapped, !dirty (so this
is fine) but they are also JBD_Dirty and ion BJ_SyncData. This is really
strange! Either it is ordered data buffer and should not be JBD_Dirty or
it is not ordered data buffer and then it should not be in BJ_SyncData!
The second buffer even has JBD_JWrite set so it really was metadabuffer
under commit and later something took it from the committing transaction
(without clearing the JWrite bit) and filed it in the SyncData list...
Umm, this reminds me something... <looks into commit code> Oh no, who could
write that BJ_Forget list handling.. me? ;)

I think the problem is in my change to BJ_Forget list handling - if we
find out buffer has b_next_transaction set, we just refile it
(previously we BUGged). That it just fine, except when we are in the
ordered mode and the buffer is reallocated as data. Then we refile the
buffer to BJ_Metadata or BJ_Reserved list, instead of BJ_SyncData.  What
then happens is uncertain but probably something gets (rightfully so)
confused and JBD_Dirty buffer gets to BJ_SyncData list.  This bug does
not seem to cause those problems with unmapped buffers but still we
should fix it as it is asking for trouble.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-11 10:33                     ` Jan Kara
@ 2006-10-11 13:44                       ` Eric Sandeen
  2006-10-11 14:22                         ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-11 13:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Badari Pulavarty, Eric Sandeen, Dave Jones, Andrew Morton,
	Linux Kernel

Jan Kara wrote:

>   Umm, but these two traces confuse me:
> 1) They are different traces that those you wrote about initially,
> aren't they? Because here we would not call sync_dirty_buffer() from
> journal_dirty_data().
>   BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
> as the buffer is not dirty...

They do wind up at the same oops, from the same "testcase" (i.e. beat the tar 
out of the filesystem with multiple fsx's and fsstress...)

The buffer is not dirty at that tracepoint because it has just done
                 if (locked && test_clear_buffer_dirty(bh)) {
prior to the tracepoint...

>   Am I right that now there are no Oopses because of buffers submitted
> from the commit code? Only those from journal_dirty_data()?

Well, it's hard to sort out which thread is doing what; I added a pid to each 
tracepoint to try to make that a little easier.  Both of the traces I posted 
seem to be journal_submit_data_buffers leading to an unmapped buffer submitted 
in submit_bh.

> 2) Those buffers have strange states - they are !mapped, !dirty (so this
> is fine) 

Well, they were just undirtied in journal_submit_data_buffers.

See the whole traces at

http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt

I will try to reproduce with fewer threads, to see if we can find a simpler 
sequence of events...

As an aside, when we do journal_unmap_buffer... should it stay on t_sync_datalist?

Thanks,
-Eric

> but they are also JBD_Dirty and ion BJ_SyncData. This is really
> strange! Either it is ordered data buffer and should not be JBD_Dirty or
> it is not ordered data buffer and then it should not be in BJ_SyncData!
> The second buffer even has JBD_JWrite set so it really was metadabuffer
> under commit and later something took it from the committing transaction
> (without clearing the JWrite bit) and filed it in the SyncData list...
> Umm, this reminds me something... <looks into commit code> Oh no, who could
> write that BJ_Forget list handling.. me? ;)
> 
> I think the problem is in my change to BJ_Forget list handling - if we
> find out buffer has b_next_transaction set, we just refile it
> (previously we BUGged). That it just fine, except when we are in the
> ordered mode and the buffer is reallocated as data. Then we refile the
> buffer to BJ_Metadata or BJ_Reserved list, instead of BJ_SyncData.  What
> then happens is uncertain but probably something gets (rightfully so)
> confused and JBD_Dirty buffer gets to BJ_SyncData list.  This bug does
> not seem to cause those problems with unmapped buffers but still we
> should fix it as it is asking for trouble.
> 
> 								Honza


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-11 13:44                       ` Eric Sandeen
@ 2006-10-11 14:22                         ` Jan Kara
  2006-10-11 17:54                           ` Badari Pulavarty
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2006-10-11 14:22 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Badari Pulavarty, Eric Sandeen, Dave Jones, Andrew Morton,
	Linux Kernel

> Jan Kara wrote:
> 
> >  Umm, but these two traces confuse me:
> >1) They are different traces that those you wrote about initially,
> >aren't they? Because here we would not call sync_dirty_buffer() from
> >journal_dirty_data().
> >  BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
> >as the buffer is not dirty...
> 
> They do wind up at the same oops, from the same "testcase" (i.e. beat the 
> tar out of the filesystem with multiple fsx's and fsstress...)
> 
> The buffer is not dirty at that tracepoint because it has just done
>                 if (locked && test_clear_buffer_dirty(bh)) {
> prior to the tracepoint...
  Oh, I see. OK.

> 
> See the whole traces at
> 
> http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt
  Hmm, those traces look really useful. I just have to digest them ;).

> As an aside, when we do journal_unmap_buffer... should it stay on 
> t_sync_datalist?
  Yes, it should and it seems it really was removed from it at some
point. Only later journal_dirty_data() came and filed it back to the
BJ_SyncData list. And the buffer remained unmapped till the commit time
and then *bang*... It may even be a race in ext3 itself that it called
journal_dirty_data() on an unmapped buffer but I have to read some more
code.

							Bye
								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-11 14:22                         ` Jan Kara
@ 2006-10-11 17:54                           ` Badari Pulavarty
  2006-10-12  2:36                             ` Eric Sandeen
  0 siblings, 1 reply; 37+ messages in thread
From: Badari Pulavarty @ 2006-10-11 17:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Eric Sandeen, Dave Jones, Andrew Morton,
	Linux Kernel

On Wed, 2006-10-11 at 16:22 +0200, Jan Kara wrote:
> > Jan Kara wrote:
> > 
> > >  Umm, but these two traces confuse me:
> > >1) They are different traces that those you wrote about initially,
> > >aren't they? Because here we would not call sync_dirty_buffer() from
> > >journal_dirty_data().
> > >  BTW: Does this buffer trace lead to that Oops in submit_bh()? I guess not
> > >as the buffer is not dirty...
> > 
> > They do wind up at the same oops, from the same "testcase" (i.e. beat the 
> > tar out of the filesystem with multiple fsx's and fsstress...)
> > 
> > The buffer is not dirty at that tracepoint because it has just done
> >                 if (locked && test_clear_buffer_dirty(bh)) {
> > prior to the tracepoint...
>   Oh, I see. OK.
> 
> > 
> > See the whole traces at
> > 
> > http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> > http://people.redhat.com/esandeen/traces/eric_ext3_oops2.txt
>   Hmm, those traces look really useful. I just have to digest them ;).
> 
> > As an aside, when we do journal_unmap_buffer... should it stay on 
> > t_sync_datalist?
>   Yes, it should and it seems it really was removed from it at some
> point. Only later journal_dirty_data() came and filed it back to the
> BJ_SyncData list. And the buffer remained unmapped till the commit time
> and then *bang*... It may even be a race in ext3 itself that it called
> journal_dirty_data() on an unmapped buffer but I have to read some more
> code.
> 

Yes. calling journal_dirty_data() on unmapped buffer can definitely
happen. (only thing i am not sure is - why doesn't happen with a
simple testcase like dirtying only a part of a page in 1k filesystem.
I am not sure why we need journal_unmap_buffer() in the sequence).

Here is what I think is happening..

journal_unmap_buffer() - cleaned the buffer, since its outside EOF, but
its a part of the same page. So it remained on the page->buffers
list. (at this time its not part of any transaction).

Then, ordererd_commit_write() called journal_dirty_data() and we added
all these buffers to BJ_SyncData list. (at this time buffer is clean -
not dirty).

Now msync() called __set_page_dirty_buffers() and dirtied *all* the
buffers attached to this page.

journal_submit_data_buffers() got around to this buffer and tried to
submit the buffer...

Andrew is right - only option for us to check the filesize in the
write out path and skip the buffers beyond EOF.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-11 17:54                           ` Badari Pulavarty
@ 2006-10-12  2:36                             ` Eric Sandeen
  2006-10-12  4:34                               ` John Wendel
  2006-10-12 12:28                               ` Jan Kara
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Sandeen @ 2006-10-12  2:36 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Jan Kara, Eric Sandeen, Dave Jones, Andrew Morton, Linux Kernel

Badari Pulavarty wrote:

> Here is what I think is happening..
> 
> journal_unmap_buffer() - cleaned the buffer, since its outside EOF, but
> its a part of the same page. So it remained on the page->buffers
> list. (at this time its not part of any transaction).
> 
> Then, ordererd_commit_write() called journal_dirty_data() and we added
> all these buffers to BJ_SyncData list. (at this time buffer is clean -
> not dirty).
> 
> Now msync() called __set_page_dirty_buffers() and dirtied *all* the
> buffers attached to this page.
> 
> journal_submit_data_buffers() got around to this buffer and tried to
> submit the buffer...

This seems about right, but one thing bothers me in the traces; it seems like 
there is some locking that is missing.  In
http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
for example, it looks like journal_dirty_data gets started, but then the 
buffer_head is acted on by journal_unmap_buffer, which decides this buffer is 
part of the running transaction, past EOF, and clears mapped, dirty, etc.  Then 
journal_dirty_data picks up again, decides that the buffer is not on the right 
list (now BJ_None) and puts it back on BJ_SyncData.  Then it gets picked up by 
journal_submit_data_buffers and submitted, and oops.

Talking with Stephen, it seemed like the page lock should synchronize these 
threads, but I've found that we can get to journal_dirty_data acting on the 
buffer heads w/o having the page locked...

I'm still digging, and, er, grasping at straws here... Am I off base?

-Eric

> Andrew is right - only option for us to check the filesize in the
> write out path and skip the buffers beyond EOF.
> 
> Thanks,
> Badari
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12  2:36                             ` Eric Sandeen
@ 2006-10-12  4:34                               ` John Wendel
  2006-10-12  6:57                                 ` Jan-Benedict Glaw
  2006-10-12 12:28                               ` Jan Kara
  1 sibling, 1 reply; 37+ messages in thread
From: John Wendel @ 2006-10-12  4:34 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Badari Pulavarty, Jan Kara, Eric Sandeen, Dave Jones,
	Andrew Morton, Linux Kernel

Eric Sandeen wrote:
> Badari Pulavarty wrote:
>
>> Here is what I think is happening..
>>
>> journal_unmap_buffer() - cleaned the buffer, since its outside EOF, but
>> its a part of the same page. So it remained on the page->buffers
>> list. (at this time its not part of any transaction).
>>
>> Then, ordererd_commit_write() called journal_dirty_data() and we added
>> all these buffers to BJ_SyncData list. (at this time buffer is clean -
>> not dirty).
>>
>> Now msync() called __set_page_dirty_buffers() and dirtied *all* the
>> buffers attached to this page.
>>
>> journal_submit_data_buffers() got around to this buffer and tried to
>> submit the buffer...
>
> This seems about right, but one thing bothers me in the traces; it 
> seems like there is some locking that is missing.  In
> http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> for example, it looks like journal_dirty_data gets started, but then 
> the buffer_head is acted on by journal_unmap_buffer, which decides 
> this buffer is part of the running transaction, past EOF, and clears 
> mapped, dirty, etc.  Then journal_dirty_data picks up again, decides 
> that the buffer is not on the right list (now BJ_None) and puts it 
> back on BJ_SyncData.  Then it gets picked up by 
> journal_submit_data_buffers and submitted, and oops.
>
> Talking with Stephen, it seemed like the page lock should synchronize 
> these threads, but I've found that we can get to journal_dirty_data 
> acting on the buffer heads w/o having the page locked...
>
> I'm still digging, and, er, grasping at straws here... Am I off base?
>
> -Eric
>
>
>> Andrew is right - only option for us to check the filesize in the
>> write out path and skip the buffers beyond EOF.
>>
>> Thanks,
>> Badari
>>
Here's another data point for your consideration. I've been seeing this 
error since I started running 2.6.18, I assumed it was hardware, so I've 
tried 3 different disks, a PATA and 2 SATA drives, with VIA and Promise 
controllers, the error has occurred on all of them. I see the error 
infrequently, always when downloading lots of small files from Usenet 
and building, copying and deleting large (200 - 300 MB). I haven't ever 
had an oops/panic, just this error.  When I run fsck, I always see a 
single message that "deleted inode nnn has zero dtime". I hope this will 
be useful.

Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5): 
ext3_free_blocks_sb: bit already cleared for block 4740550
Oct 11 20:37:32 Godzilla kernel: Aborting journal on device hda5.
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_free_blocks_sb: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_free_blocks_sb: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_reserve_inode_write: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_truncate: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_reserve_inode_write: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_orphan_del: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_reserve_inode_write: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in 
ext3_delete_inode: Journal has aborted
Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing 
b_committed_data
Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing 
b_committed_data
Oct 11 20:37:32 Godzilla kernel: ext3_abort called.
Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5): 
ext3_journal_start_sb: Detected aborted journal
Oct 11 20:37:32 Godzilla kernel: Remounting filesystem read-only

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12  4:34                               ` John Wendel
@ 2006-10-12  6:57                                 ` Jan-Benedict Glaw
  0 siblings, 0 replies; 37+ messages in thread
From: Jan-Benedict Glaw @ 2006-10-12  6:57 UTC (permalink / raw)
  To: John Wendel
  Cc: Eric Sandeen, Badari Pulavarty, Jan Kara, Eric Sandeen,
	Dave Jones, Andrew Morton, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 1793 bytes --]

On Wed, 2006-10-11 21:34:13 -0700, John Wendel <jwendel10@comcast.net> wrote:
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5): ext3_free_blocks_sb: bit already cleared for block 4740550
> Oct 11 20:37:32 Godzilla kernel: Aborting journal on device hda5.
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_free_blocks_sb: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_free_blocks_sb: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_reserve_inode_write: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_truncate: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_reserve_inode_write: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_orphan_del: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_reserve_inode_write: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5) in ext3_delete_inode: Journal has aborted
> Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing b_committed_data
> Oct 11 20:37:32 Godzilla kernel: __journal_remove_journal_head: freeing b_committed_data
> Oct 11 20:37:32 Godzilla kernel: ext3_abort called.
> Oct 11 20:37:32 Godzilla kernel: EXT3-fs error (device hda5): ext3_journal_start_sb: Detected aborted journal
> Oct 11 20:37:32 Godzilla kernel: Remounting filesystem read-only

This looks very much like the issue I see.

MfG, JBG

-- 
      Jan-Benedict Glaw      jbglaw@lug-owl.de              +49-172-7608481
Signature of:                http://catb.org/~esr/faqs/smart-questions.html
the second  :

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12  2:36                             ` Eric Sandeen
  2006-10-12  4:34                               ` John Wendel
@ 2006-10-12 12:28                               ` Jan Kara
  2006-10-12 13:20                                 ` Eric Sandeen
  2006-10-12 16:40                                 ` Andrew Morton
  1 sibling, 2 replies; 37+ messages in thread
From: Jan Kara @ 2006-10-12 12:28 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Badari Pulavarty, Eric Sandeen, Dave Jones, Andrew Morton,
	Linux Kernel

> Badari Pulavarty wrote:
> 
> >Here is what I think is happening..
> >
> >journal_unmap_buffer() - cleaned the buffer, since its outside EOF, but
> >its a part of the same page. So it remained on the page->buffers
> >list. (at this time its not part of any transaction).
> >
> >Then, ordererd_commit_write() called journal_dirty_data() and we added
> >all these buffers to BJ_SyncData list. (at this time buffer is clean -
> >not dirty).
> >
> >Now msync() called __set_page_dirty_buffers() and dirtied *all* the
> >buffers attached to this page.
> >
> >journal_submit_data_buffers() got around to this buffer and tried to
> >submit the buffer...
  Yes, this is certainly one we need to fix.

> This seems about right, but one thing bothers me in the traces; it seems 
> like there is some locking that is missing.  In
> http://people.redhat.com/esandeen/traces/eric_ext3_oops1.txt
> for example, it looks like journal_dirty_data gets started, but then the 
> buffer_head is acted on by journal_unmap_buffer, which decides this buffer 
> is part of the running transaction, past EOF, and clears mapped, dirty, 
  It's part of the committing transaction.

> etc.  Then journal_dirty_data picks up again, decides that the buffer is 
> not on the right list (now BJ_None) and puts it back on BJ_SyncData.  Then 
> it gets picked up by journal_submit_data_buffers and submitted, and oops.
  Now it is put on the running transaction's BJ_SyncData list. But
otherwise you're right.

> Talking with Stephen, it seemed like the page lock should synchronize these 
> threads, but I've found that we can get to journal_dirty_data acting on the 
> buffer heads w/o having the page locked...
  Yes, PageLock should protect us. Where can we call
journal_dirty_data() without PageLock? I see the following callers:
  ext3_ordered_commit_write - should have PageLock
  ext3_ordered_writepage - has PageLock
  ext3_block_truncate_page - has PageLock

  And that are all callers from ext3. Am I missing something?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 12:28                               ` Jan Kara
@ 2006-10-12 13:20                                 ` Eric Sandeen
  2006-10-12 16:40                                 ` Andrew Morton
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Sandeen @ 2006-10-12 13:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Badari Pulavarty, Dave Jones, Andrew Morton,
	Linux Kernel

Jan Kara wrote:

>> Talking with Stephen, it seemed like the page lock should synchronize these 
>> threads, but I've found that we can get to journal_dirty_data acting on the 
>> buffer heads w/o having the page locked...
>   Yes, PageLock should protect us. Where can we call
> journal_dirty_data() without PageLock? I see the following callers:
>   ext3_ordered_commit_write - should have PageLock
>   ext3_ordered_writepage - has PageLock
>   ext3_block_truncate_page - has PageLock
> 
>   And that are all callers from ext3. Am I missing something?
> 
> 								Honza

I put an assert about the page being locked in journal_dirty_data, and hit it 
right away.  I'll look a bit more but I think this is how I got there:


ext3_ordered_writepage <-- assert PageLocked
	...
	block_write_full_page
		__block_write_full_page
			unlock_page()
	...
	walk_page_buffers
		journal_dirty_data_fn
			ext3_journal_dirty_data
				journal_dirty_data <-- find page unlocked

there's a comment in ext3_ordered_writepage:

         /*
          * The page can become unlocked at any point now, and
          * truncate can then come in and change things.  So we
          * can't touch *page from now on.  But *page_bufs is
          * safe due to elevated refcount.
          */

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 12:28                               ` Jan Kara
  2006-10-12 13:20                                 ` Eric Sandeen
@ 2006-10-12 16:40                                 ` Andrew Morton
  2006-10-12 16:44                                   ` Eric Sandeen
  2006-10-12 20:07                                   ` Eric Sandeen
  1 sibling, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-10-12 16:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Badari Pulavarty, Eric Sandeen, Dave Jones,
	Linux Kernel

On Thu, 12 Oct 2006 14:28:20 +0200
Jan Kara <jack@suse.cz> wrote:

> Where can we call
> journal_dirty_data() without PageLock?

block_write_full_page() will unlock the page, so ext3_writepage()
will run journal_dirty_data_fn() against an unlocked page.

I haven't looked into the exact details of the race, but it should
be addressable via jbd_lock_bh_state() or j_list_lock coverage.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 16:40                                 ` Andrew Morton
@ 2006-10-12 16:44                                   ` Eric Sandeen
  2006-10-12 20:07                                   ` Eric Sandeen
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Sandeen @ 2006-10-12 16:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Badari Pulavarty, Eric Sandeen, Dave Jones,
	Linux Kernel

Andrew Morton wrote:
> On Thu, 12 Oct 2006 14:28:20 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
>> Where can we call
>> journal_dirty_data() without PageLock?
> 
> block_write_full_page() will unlock the page, so ext3_writepage()
> will run journal_dirty_data_fn() against an unlocked page.
> 
> I haven't looked into the exact details of the race, but it should
> be addressable via jbd_lock_bh_state() or j_list_lock coverage.

Yep, that's what I've been hashing out with Stephen today...

In one of my cases journal_dirty_data has dropped & re-acquired the
bh_state lock and j_list_lock, and journal_unmap_buffer has come along
in the meantime.

So it looks like we are missing some state tests, i.e. buffer_mapped(),
at a couple points after we acquire jbd_lock_bh_state().

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 16:40                                 ` Andrew Morton
  2006-10-12 16:44                                   ` Eric Sandeen
@ 2006-10-12 20:07                                   ` Eric Sandeen
  2006-10-12 21:55                                     ` Badari Pulavarty
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-12 20:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Badari Pulavarty, Eric Sandeen, Dave Jones,
	Linux Kernel

Andrew Morton wrote:

> On Thu, 12 Oct 2006 14:28:20 +0200
> Jan Kara <jack@suse.cz> wrote:
>
>   
>> Where can we call
>> journal_dirty_data() without PageLock?
>>     
>
> block_write_full_page() will unlock the page, so ext3_writepage()
> will run journal_dirty_data_fn() against an unlocked page.
>
> I haven't looked into the exact details of the race, but it should
> be addressable via jbd_lock_bh_state() or j_list_lock coverage
I'm testing with something like this now; seem sane?

journal_dirty_data & journal_unmap_data both check do 
jbd_lock_bh_state(bh) close to the top... journal_dirty_data_fn has checked 
buffer_mapped before getting into journal_dirty_data, but that state may
change before the lock is grabbed.  Similarly re-check after we drop the lock.

-Eric

Index: linux-2.6.18-1.2737.fc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.18-1.2737.fc6.orig/fs/jbd/transaction.c
+++ linux-2.6.18-1.2737.fc6/fs/jbd/transaction.c
@@ -967,6 +967,13 @@ int journal_dirty_data(handle_t *handle,
 	 */
 	jbd_lock_bh_state(bh);
 	spin_lock(&journal->j_list_lock);
+
+	/* Now that we have bh_state locked, are we really still mapped? */
+	if (!buffer_mapped(bh)) {
+		JBUFFER_TRACE(jh, "unmapped, bailing out");
+		goto no_journal;
+	}
+		
 	if (jh->b_transaction) {
 		JBUFFER_TRACE(jh, "has transaction");
 		if (jh->b_transaction != handle->h_transaction) {
@@ -1028,6 +1036,11 @@ int journal_dirty_data(handle_t *handle,
 				sync_dirty_buffer(bh);
 				jbd_lock_bh_state(bh);
 				spin_lock(&journal->j_list_lock);
+				/* Since we dropped the lock... */
+				if (!buffer_mapped(bh)) {
+					JBUFFER_TRACE(jh, "Got unmapped");
+					goto no_journal;
+				}
 				/* The buffer may become locked again at any
 				   time if it is redirtied */
 			}


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 20:07                                   ` Eric Sandeen
@ 2006-10-12 21:55                                     ` Badari Pulavarty
  2006-10-12 21:57                                       ` Eric Sandeen
  2006-10-13  7:56                                       ` Jan Kara
  0 siblings, 2 replies; 37+ messages in thread
From: Badari Pulavarty @ 2006-10-12 21:55 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andrew Morton, Jan Kara, Eric Sandeen, Dave Jones, Linux Kernel

Eric Sandeen wrote:
> Andrew Morton wrote:
>
>   
>> On Thu, 12 Oct 2006 14:28:20 +0200
>> Jan Kara <jack@suse.cz> wrote:
>>
>>   
>>     
>>> Where can we call
>>> journal_dirty_data() without PageLock?
>>>     
>>>       
>> block_write_full_page() will unlock the page, so ext3_writepage()
>> will run journal_dirty_data_fn() against an unlocked page.
>>
>> I haven't looked into the exact details of the race, but it should
>> be addressable via jbd_lock_bh_state() or j_list_lock coverage
>>     
> I'm testing with something like this now; seem sane?
>
> journal_dirty_data & journal_unmap_data both check do 
> jbd_lock_bh_state(bh) close to the top... journal_dirty_data_fn has checked 
> buffer_mapped before getting into journal_dirty_data, but that state may
> change before the lock is grabbed.  Similarly re-check after we drop the lock.
>
>   
This is exactly  the solution I proposed earlier (to check 
buffer_mapped() before calling submit_bh()).
But at that time, Jan pointed out that the whole handling is wrong.

But if this is the only case we need to handle, I am okay with this band 
aid :)

Thanks,
Badari


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 21:55                                     ` Badari Pulavarty
@ 2006-10-12 21:57                                       ` Eric Sandeen
  2006-10-12 22:34                                         ` Badari Pulavarty
  2006-10-13  7:56                                       ` Jan Kara
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-12 21:57 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Eric Sandeen, Andrew Morton, Jan Kara, Dave Jones, Linux Kernel

Badari Pulavarty wrote:

> This is exactly  the solution I proposed earlier (to check 
> buffer_mapped() before calling submit_bh()).
> But at that time, Jan pointed out that the whole handling is wrong.
> 
> But if this is the only case we need to handle, I am okay with this band 
> aid :)

Doh!

And we come full circle... ok let me go reread that thread, it got long
enough I had to swap out... :)

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 21:57                                       ` Eric Sandeen
@ 2006-10-12 22:34                                         ` Badari Pulavarty
  0 siblings, 0 replies; 37+ messages in thread
From: Badari Pulavarty @ 2006-10-12 22:34 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Eric Sandeen, Andrew Morton, Jan Kara, Dave Jones, Linux Kernel

Eric Sandeen wrote:
> Badari Pulavarty wrote:
>
>   
>> This is exactly  the solution I proposed earlier (to check 
>> buffer_mapped() before calling submit_bh()).
>> But at that time, Jan pointed out that the whole handling is wrong.
>>
>> But if this is the only case we need to handle, I am okay with this band 
>> aid :)
>>     
>
> Doh!
>
> And we come full circle... ok let me go reread that thread, it got long
> enough I had to swap out... :)

Don't bother. Lets see if this is the only case that needs fixing and 
move forward..

Thanks,
Badari



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-12 21:55                                     ` Badari Pulavarty
  2006-10-12 21:57                                       ` Eric Sandeen
@ 2006-10-13  7:56                                       ` Jan Kara
  2006-10-13 16:08                                         ` Eric Sandeen
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Kara @ 2006-10-13  7:56 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Eric Sandeen, Andrew Morton, Eric Sandeen, Dave Jones,
	Linux Kernel

> Eric Sandeen wrote:
> >Andrew Morton wrote:
> >
> >  
> >>On Thu, 12 Oct 2006 14:28:20 +0200
> >>Jan Kara <jack@suse.cz> wrote:
> >>
> >>  
> >>    
> >>>Where can we call
> >>>journal_dirty_data() without PageLock?
> >>>    
> >>>      
> >>block_write_full_page() will unlock the page, so ext3_writepage()
> >>will run journal_dirty_data_fn() against an unlocked page.
> >>
> >>I haven't looked into the exact details of the race, but it should
> >>be addressable via jbd_lock_bh_state() or j_list_lock coverage
> >>    
> >I'm testing with something like this now; seem sane?
> >
> >journal_dirty_data & journal_unmap_data both check do 
> >jbd_lock_bh_state(bh) close to the top... journal_dirty_data_fn has 
> >checked buffer_mapped before getting into journal_dirty_data, but that 
> >state may
> >change before the lock is grabbed.  Similarly re-check after we drop the 
> >lock.
> >
> >  
> This is exactly  the solution I proposed earlier (to check 
> buffer_mapped() before calling submit_bh()).
> But at that time, Jan pointed out that the whole handling is wrong.
  Yes, and it was. However it turned out that there are more problems
than I thought ;).

> But if this is the only case we need to handle, I am okay with this band 
> aid :)
  I think Eric's patch may be a part of it. But we still need to check whether
the buffer is not after EOF before submitting it (or better said just
after we manage to lock the buffer). Because while we are waiting for
the buffer lock, journal_unmap_buffer() can still come and steal the
buffer - at least the write-out in journal_dirty_data() definitely needs
the check if I haven't overlooked something.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-13  7:56                                       ` Jan Kara
@ 2006-10-13 16:08                                         ` Eric Sandeen
  2006-10-16 16:54                                           ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Sandeen @ 2006-10-13 16:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Badari Pulavarty, Andrew Morton, Eric Sandeen, Dave Jones,
	Linux Kernel

Jan Kara wrote:

>> This is exactly  the solution I proposed earlier (to check 
>> buffer_mapped() before calling submit_bh()).
>> But at that time, Jan pointed out that the whole handling is wrong.
>   Yes, and it was. However it turned out that there are more problems
> than I thought ;).
> 
>> But if this is the only case we need to handle, I am okay with this band 
>> aid :)
>   I think Eric's patch may be a part of it. But we still need to check whether
> the buffer is not after EOF before submitting it (or better said just
> after we manage to lock the buffer). Because while we are waiting for
> the buffer lock, journal_unmap_buffer() can still come and steal the
> buffer - at least the write-out in journal_dirty_data() definitely needs
> the check if I haven't overlooked something.

Ok, let me think on that today.  My first reaction is that if we have
the bh state lock and pay attention to mapped in journal_dirty_data(),
then any blocks past EOF which have gotten unmapped by
journal_unmap_buffer will be recognized as such (because they are now
unmapped... without needing to check for past EOF...) and we'll be fine.

As a datapoint, davej's stresstest (several fsx's and fsstresses)
survived an overnight run on his box, which used to panic in < 2 hrs.
Survived about 6 hours on my box until I intentionally stopped it; my
box had added a write/truncate test in a loop, with a bunch of periodic
syncs as well....

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-13 16:08                                         ` Eric Sandeen
@ 2006-10-16 16:54                                           ` Jan Kara
  2006-10-16 16:56                                             ` Eric Sandeen
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2006-10-16 16:54 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Badari Pulavarty, Andrew Morton, Eric Sandeen, Dave Jones,
	Linux Kernel

> Jan Kara wrote:
> 
> >> This is exactly  the solution I proposed earlier (to check 
> >> buffer_mapped() before calling submit_bh()).
> >> But at that time, Jan pointed out that the whole handling is wrong.
> >   Yes, and it was. However it turned out that there are more problems
> > than I thought ;).
> > 
> >> But if this is the only case we need to handle, I am okay with this band 
> >> aid :)
> >   I think Eric's patch may be a part of it. But we still need to check whether
> > the buffer is not after EOF before submitting it (or better said just
> > after we manage to lock the buffer). Because while we are waiting for
> > the buffer lock, journal_unmap_buffer() can still come and steal the
> > buffer - at least the write-out in journal_dirty_data() definitely needs
> > the check if I haven't overlooked something.
> 
> Ok, let me think on that today.  My first reaction is that if we have
> the bh state lock and pay attention to mapped in journal_dirty_data(),
> then any blocks past EOF which have gotten unmapped by
> journal_unmap_buffer will be recognized as such (because they are now
> unmapped... without needing to check for past EOF...) and we'll be fine.
  Hmm, yes, you're right. If we do the test in journal_dirty_data() we
should not file unmapped buffer into transaction's list and hence we
should be safe. Fine. In case we eventually hit the assertion, we can
think further ;).

> As a datapoint, davej's stresstest (several fsx's and fsstresses)
> survived an overnight run on his box, which used to panic in < 2 hrs.
> Survived about 6 hours on my box until I intentionally stopped it; my
> box had added a write/truncate test in a loop, with a bunch of periodic
> syncs as well....
  Perfect :).

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-16 16:54                                           ` Jan Kara
@ 2006-10-16 16:56                                             ` Eric Sandeen
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Sandeen @ 2006-10-16 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Badari Pulavarty, Andrew Morton, Eric Sandeen, Dave Jones,
	Linux Kernel

Jan Kara wrote:

>>>   I think Eric's patch may be a part of it. But we still need to check whether
>>> the buffer is not after EOF before submitting it (or better said just
>>> after we manage to lock the buffer). Because while we are waiting for
>>> the buffer lock, journal_unmap_buffer() can still come and steal the
>>> buffer - at least the write-out in journal_dirty_data() definitely needs
>>> the check if I haven't overlooked something.
>> Ok, let me think on that today.  My first reaction is that if we have
>> the bh state lock and pay attention to mapped in journal_dirty_data(),
>> then any blocks past EOF which have gotten unmapped by
>> journal_unmap_buffer will be recognized as such (because they are now
>> unmapped... without needing to check for past EOF...) and we'll be fine.
>   Hmm, yes, you're right. If we do the test in journal_dirty_data() we
> should not file unmapped buffer into transaction's list and hence we
> should be safe. Fine. In case we eventually hit the assertion, we can
> think further ;).

Awww no way that can possibly happen right?  :)

>> As a datapoint, davej's stresstest (several fsx's and fsstresses)
>> survived an overnight run on his box, which used to panic in < 2 hrs.
>> Survived about 6 hours on my box until I intentionally stopped it; my
>> box had added a write/truncate test in a loop, with a bunch of periodic
>> syncs as well....
>   Perfect :).

Ok, thanks Jan!

I'll send a patch for -mm as a new thread, this one has gotten buried
pretty deep.

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-09 19:46       ` Eric Sandeen
  2006-10-09 19:59         ` Eric Sandeen
  2006-10-09 21:59         ` Badari Pulavarty
@ 2006-10-09 22:40         ` Jan-Benedict Glaw
  2006-10-10 13:16           ` Jan Kara
  2 siblings, 1 reply; 37+ messages in thread
From: Jan-Benedict Glaw @ 2006-10-09 22:40 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andrew Morton, Dave Jones, Linux Kernel, esandeen,
	Badari Pulavarty, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 1891 bytes --]

On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <sandeen@sandeen.net> wrote:
> Andrew Morton wrote:
> > On Tue, 03 Oct 2006 00:43:01 -0500
> > Eric Sandeen <sandeen@sandeen.net> wrote:
> > > Dave Jones wrote:
> > > > So I managed to reproduce it with an 'fsx foo' and a
> > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > > a vanilla 2.6.18 with none of the Fedora patches..
> > > >
> > > > I'll give 2.6.18-git a try next.
> > > >
> > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > Kernel BUG at fs/buffer.c:2791
> > > I had thought/hoped that this was fixed by Jan's patch at 
> > > http://lkml.org/lkml/2006/9/7/236 from the thread started at 
> > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
> > > first by going through that new codepath....
> > 
> > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
> > Badari was hitting that BUG and was able to confirm that Jan's patch
> > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > it.
> 
> Looking at some BH traces*, it appears that what Dave hit is a truncate
> racing with a sync...
> 
> truncate ...
>   ext3_invalidate_page
>     journal_invalidatepage
>       journal_unmap buffer
> 
> going off at the same time as
> 
> sync ...
>   journal_dirty_data
>     sync_dirty_buffer
>       submit_bh <-- finds unmapped buffer, boom.

Is this possibly related to the issues that are discussed in another
thread? We're seeing problems while unlinking large files (usually get
it within some hours with 200MB files, but couldn't yet reproduce it
with 20MB.)

MfG, JBG
-- 
      Jan-Benedict Glaw      jbglaw@lug-owl.de              +49-172-7608481
Signature of:         Alles wird gut! ...und heute wirds schon ein bißchen besser.
the second  :

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-09 22:40         ` Jan-Benedict Glaw
@ 2006-10-10 13:16           ` Jan Kara
  2006-10-10 16:39             ` Jan-Benedict Glaw
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2006-10-10 13:16 UTC (permalink / raw)
  To: Eric Sandeen, Andrew Morton, Dave Jones, Linux Kernel, esandeen,
	Badari Pulavarty

> On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <sandeen@sandeen.net> wrote:
> > Andrew Morton wrote:
> > > On Tue, 03 Oct 2006 00:43:01 -0500
> > > Eric Sandeen <sandeen@sandeen.net> wrote:
> > > > Dave Jones wrote:
> > > > > So I managed to reproduce it with an 'fsx foo' and a
> > > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from
> > > > > a vanilla 2.6.18 with none of the Fedora patches..
> > > > >
> > > > > I'll give 2.6.18-git a try next.
> > > > >
> > > > > ----------- [cut here ] --------- [please bite here ] ---------
> > > > > Kernel BUG at fs/buffer.c:2791
> > > > I had thought/hoped that this was fixed by Jan's patch at 
> > > > http://lkml.org/lkml/2006/9/7/236 from the thread started at 
> > > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not.  Dave hit this bug 
> > > > first by going through that new codepath....
> > > 
> > > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion.  iirc,
> > > Badari was hitting that BUG and was able to confirm that Jan's patch
> > > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed
> > > it.
> > 
> > Looking at some BH traces*, it appears that what Dave hit is a truncate
> > racing with a sync...
> > 
> > truncate ...
> >   ext3_invalidate_page
> >     journal_invalidatepage
> >       journal_unmap buffer
> > 
> > going off at the same time as
> > 
> > sync ...
> >   journal_dirty_data
> >     sync_dirty_buffer
> >       submit_bh <-- finds unmapped buffer, boom.
> 
> Is this possibly related to the issues that are discussed in another
> thread? We're seeing problems while unlinking large files (usually get
> it within some hours with 200MB files, but couldn't yet reproduce it
> with 20MB.)
  I don't think this is related (BTW: I've run your test for 5 hours
without any luck ;( Maybe I'll try again for some longer time...).

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: 2.6.18 ext3 panic.
  2006-10-10 13:16           ` Jan Kara
@ 2006-10-10 16:39             ` Jan-Benedict Glaw
  0 siblings, 0 replies; 37+ messages in thread
From: Jan-Benedict Glaw @ 2006-10-10 16:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Andrew Morton, Dave Jones, Linux Kernel, esandeen,
	Badari Pulavarty

[-- Attachment #1: Type: text/plain, Size: 841 bytes --]

On Tue, 2006-10-10 15:16:03 +0200, Jan Kara <jack@suse.cz> wrote:
> > Is this possibly related to the issues that are discussed in another
> > thread? We're seeing problems while unlinking large files (usually get
> > it within some hours with 200MB files, but couldn't yet reproduce it
> > with 20MB.)
>   I don't think this is related (BTW: I've run your test for 5 hours
> without any luck ;( Maybe I'll try again for some longer time...).

Looking at it, I also think it's a different thing.  Just to add, I
couldn't make it bug with a 10 MB file in a day, but it failed again
using a 100MB file.

So file size seems to matter somehow.

MfG, JBG

-- 
      Jan-Benedict Glaw      jbglaw@lug-owl.de              +49-172-7608481
  Signature of:                           Wenn ich wach bin, träume ich.
  the second  :

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2006-10-16 16:57 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-10-02 19:47 2.6.18 ext3 panic Dave Jones
2006-10-03  5:22 ` Dave Jones
2006-10-03  5:43   ` Eric Sandeen
2006-10-03  6:19     ` Andrew Morton
2006-10-03  6:40       ` Dave Jones
2006-10-03 16:45         ` Dave Jones
2006-10-09 19:46       ` Eric Sandeen
2006-10-09 19:59         ` Eric Sandeen
2006-10-09 21:59         ` Badari Pulavarty
2006-10-09 22:50           ` Dave Jones
2006-10-10 14:11             ` Jan Kara
2006-10-10 18:42               ` Andrew Morton
2006-10-10 22:03               ` Eric Sandeen
2006-10-10 22:25                 ` Badari Pulavarty
2006-10-11  1:43                   ` Eric Sandeen
2006-10-11 10:33                     ` Jan Kara
2006-10-11 13:44                       ` Eric Sandeen
2006-10-11 14:22                         ` Jan Kara
2006-10-11 17:54                           ` Badari Pulavarty
2006-10-12  2:36                             ` Eric Sandeen
2006-10-12  4:34                               ` John Wendel
2006-10-12  6:57                                 ` Jan-Benedict Glaw
2006-10-12 12:28                               ` Jan Kara
2006-10-12 13:20                                 ` Eric Sandeen
2006-10-12 16:40                                 ` Andrew Morton
2006-10-12 16:44                                   ` Eric Sandeen
2006-10-12 20:07                                   ` Eric Sandeen
2006-10-12 21:55                                     ` Badari Pulavarty
2006-10-12 21:57                                       ` Eric Sandeen
2006-10-12 22:34                                         ` Badari Pulavarty
2006-10-13  7:56                                       ` Jan Kara
2006-10-13 16:08                                         ` Eric Sandeen
2006-10-16 16:54                                           ` Jan Kara
2006-10-16 16:56                                             ` Eric Sandeen
2006-10-09 22:40         ` Jan-Benedict Glaw
2006-10-10 13:16           ` Jan Kara
2006-10-10 16:39             ` Jan-Benedict Glaw

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox