ext3: slow symlink corruption on umount...

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* ext3: slow symlink corruption on umount...
@ 2008-10-24 18:37 Arthur Jones
  2008-10-27 16:54 ` Arthur Jones
  0 siblings, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-10-24 18:37 UTC (permalink / raw)
  To: linux-ext4

Hi All,  I'm seeing slow symlink corruption on ext3 on linux-2.6.27,
yesterday's linux-2.6 git tree and 2.6.9 RHEL4.7.  I.e. every kernel
I've tried I see this effect.  To reproduce this, I need:

* 250MB + tar file in memory (tmpfs or in the buffer cache)
* long symlinks in the tar file (over 60 characters)
* umount immediately after untarring

What I see is that the symlinks are corrupted, e.g.:

# ls -l etc/vmware-vix-disklib
etc/vmware-vix-disklib -> ??f

fsck shows:

Symlink /etc/vmware-vix-disklib (inode #16454) is invalid.

Debugfs shows:

debugfs:  stat <16454>
Inode: 16454   Type: symlink    Mode:  0777   Flags: 0x0   Generation: 1431972005
User:     0   Group:     0   Size: 65
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
atime: 0x4900ac84 -- Thu Oct 23 09:55:32 2008
mtime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
BLOCKS:
(0):56034
TOTAL: 1

I'm still tracking down exactly what's going on.  Anyone seen
anything like this before?  ext2 does not show this effect (I've
not tried ext4).  It happens when the backing block device is
a SATA drive or flash.

Thanks,

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-24 18:37 ext3: slow symlink corruption on umount Arthur Jones
@ 2008-10-27 16:54 ` Arthur Jones
  2008-10-29 19:54   ` Arthur Jones
  0 siblings, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-10-27 16:54 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org; +Cc: sct, akpm, linux-kernel

Some additional info -- and attempting to cast a wider
net on the CC:

I do not see the long symlink corruption with mount -o data=writeback
and we've now seen a couple cases where the symlink corruption
does not require a umount...

Arthur

On Fri, Oct 24, 2008 at 11:37:34AM -0700, Arthur Jones wrote:
> Hi All,  I'm seeing slow symlink corruption on ext3 on linux-2.6.27,
> yesterday's linux-2.6 git tree and 2.6.9 RHEL4.7.  I.e. every kernel
> I've tried I see this effect.  To reproduce this, I need:
> 
> * 250MB + tar file in memory (tmpfs or in the buffer cache)
> * long symlinks in the tar file (over 60 characters)
> * umount immediately after untarring
> 
> What I see is that the symlinks are corrupted, e.g.:
> 
> # ls -l etc/vmware-vix-disklib
> etc/vmware-vix-disklib -> ??f
> 
> fsck shows:
> 
> Symlink /etc/vmware-vix-disklib (inode #16454) is invalid.
> 
> Debugfs shows:
> 
> debugfs:  stat <16454>
> Inode: 16454   Type: symlink    Mode:  0777   Flags: 0x0   Generation: 1431972005
> User:     0   Group:     0   Size: 65
> File ACL: 0    Directory ACL: 0
> Links: 1   Blockcount: 8
> Fragment:  Address: 0    Number: 0    Size: 0
> ctime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> atime: 0x4900ac84 -- Thu Oct 23 09:55:32 2008
> mtime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> BLOCKS:
> (0):56034
> TOTAL: 1
> 
> I'm still tracking down exactly what's going on.  Anyone seen
> anything like this before?  ext2 does not show this effect (I've
> not tried ext4).  It happens when the backing block device is
> a SATA drive or flash.
> 
> Thanks,
> 
> Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-27 16:54 ` Arthur Jones
@ 2008-10-29 19:54   ` Arthur Jones
  2008-10-29 20:36     ` Eric Sandeen
  0 siblings, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-10-29 19:54 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org; +Cc: sct, akpm, linux-kernel

This one's turning out to be a slippery fish.
I have found that the corruption appears to
be due to ->writepage() not getting called at all
for any of the long symlinks...

Ring any bells anyone?  Any ideas where to look
or what to test?  This is my first foray into
ext3 and I could definitely use some expert advice...

Arthur

On Mon, Oct 27, 2008 at 09:54:23AM -0700, Arthur Jones wrote:
> Some additional info -- and attempting to cast a wider
> net on the CC:
> 
> I do not see the long symlink corruption with mount -o data=writeback
> and we've now seen a couple cases where the symlink corruption
> does not require a umount...
> 
> Arthur
> 
> On Fri, Oct 24, 2008 at 11:37:34AM -0700, Arthur Jones wrote:
> > Hi All,  I'm seeing slow symlink corruption on ext3 on linux-2.6.27,
> > yesterday's linux-2.6 git tree and 2.6.9 RHEL4.7.  I.e. every kernel
> > I've tried I see this effect.  To reproduce this, I need:
> > 
> > * 250MB + tar file in memory (tmpfs or in the buffer cache)
> > * long symlinks in the tar file (over 60 characters)
> > * umount immediately after untarring
> > 
> > What I see is that the symlinks are corrupted, e.g.:
> > 
> > # ls -l etc/vmware-vix-disklib
> > etc/vmware-vix-disklib -> ??f
> > 
> > fsck shows:
> > 
> > Symlink /etc/vmware-vix-disklib (inode #16454) is invalid.
> > 
> > Debugfs shows:
> > 
> > debugfs:  stat <16454>
> > Inode: 16454   Type: symlink    Mode:  0777   Flags: 0x0   Generation: 1431972005
> > User:     0   Group:     0   Size: 65
> > File ACL: 0    Directory ACL: 0
> > Links: 1   Blockcount: 8
> > Fragment:  Address: 0    Number: 0    Size: 0
> > ctime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> > atime: 0x4900ac84 -- Thu Oct 23 09:55:32 2008
> > mtime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> > BLOCKS:
> > (0):56034
> > TOTAL: 1
> > 
> > I'm still tracking down exactly what's going on.  Anyone seen
> > anything like this before?  ext2 does not show this effect (I've
> > not tried ext4).  It happens when the backing block device is
> > a SATA drive or flash.
> > 
> > Thanks,
> > 
> > Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-29 19:54   ` Arthur Jones
@ 2008-10-29 20:36     ` Eric Sandeen
  2008-10-29 21:09       ` Theodore Tso
                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Eric Sandeen @ 2008-10-29 20:36 UTC (permalink / raw)
  To: Arthur Jones; +Cc: linux-ext4@vger.kernel.org, sct, akpm, linux-kernel

Arthur Jones wrote:
> This one's turning out to be a slippery fish.
> I have found that the corruption appears to
> be due to ->writepage() not getting called at all
> for any of the long symlinks...
> 
> Ring any bells anyone?  Any ideas where to look
> or what to test?  This is my first foray into
> ext3 and I could definitely use some expert advice...
> 
> Arthur


Sorry for the silence, this is a nice bug you've found :)

I can reproduce this at least, with this script:

#!/bin/bash

umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch
/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s
/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
umount /mnt/test2

I'll look into it ...

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-29 20:36     ` Eric Sandeen
@ 2008-10-29 21:09       ` Theodore Tso
  2008-10-30 13:38         ` Eric Sandeen
  2008-10-30 17:40       ` Arthur Jones
  2008-11-03 18:44       ` [PATCH] ext3: wait on all pending commits in ext3_sync_fs Arthur Jones
  2 siblings, 1 reply; 49+ messages in thread
From: Theodore Tso @ 2008-10-29 21:09 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Arthur Jones, linux-ext4@vger.kernel.org, sct, akpm, linux-kernel

On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
> 
> Sorry for the silence, this is a nice bug you've found :)
> 
> I'll look into it ...

You may want to take a quick look at this thread:

    http://lkml.org/lkml/2008/10/28/413

					- Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-29 21:09       ` Theodore Tso
@ 2008-10-30 13:38         ` Eric Sandeen
  2008-10-30 13:55           ` Arthur Jones
  2008-10-31  9:47           ` Nick Piggin
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Sandeen @ 2008-10-30 13:38 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, Arthur Jones,
	linux-ext4@vger.kernel.org, sct, akpm

Theodore Tso wrote:
> On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
>> Sorry for the silence, this is a nice bug you've found :)
>>
>> I'll look into it ...
> 
> You may want to take a quick look at this thread:
> 
>     http://lkml.org/lkml/2008/10/28/413
> 
> 					- Ted

I thought about that, but at least at first glance I don't see how the
gfp mask change would cause this behavior...?  At least, I don't think
we're seeing recursion back into the filesystem... but I'll ponder that.

(Also, Arthur reports seeing this as long ago as 2.6.9...)

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-30 13:38         ` Eric Sandeen
@ 2008-10-30 13:55           ` Arthur Jones
  2008-10-31  9:47           ` Nick Piggin
  1 sibling, 0 replies; 49+ messages in thread
From: Arthur Jones @ 2008-10-30 13:55 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Tso, linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Hi Eric, ...

On Thu, Oct 30, 2008 at 06:38:30AM -0700, Eric Sandeen wrote:
> Theodore Tso wrote:
> > On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
> >> Sorry for the silence, this is a nice bug you've found :)
> >>
> >> I'll look into it ...
> >
> > You may want to take a quick look at this thread:
> >
> >     http://lkml.org/lkml/2008/10/28/413
> >
> >                                       - Ted
> 
> I thought about that, but at least at first glance I don't see how the
> gfp mask change would cause this behavior...?  At least, I don't think
> we're seeing recursion back into the filesystem... but I'll ponder that.

Back when I thought this was a 2.6.9 bug, I tried
that patch.  There was no change in behavior...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-30 13:38         ` Eric Sandeen
  2008-10-30 13:55           ` Arthur Jones
@ 2008-10-31  9:47           ` Nick Piggin
  1 sibling, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2008-10-31  9:47 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Tso, Arthur Jones, linux-ext4@vger.kernel.org, sct, akpm,
	linux-kernel

On Friday 31 October 2008 00:38, Eric Sandeen wrote:
> Theodore Tso wrote:
> > On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
> >> Sorry for the silence, this is a nice bug you've found :)
> >>
> >> I'll look into it ...
> >
> > You may want to take a quick look at this thread:
> >
> >     http://lkml.org/lkml/2008/10/28/413
> >
> > 					- Ted
>
> I thought about that, but at least at first glance I don't see how the
> gfp mask change would cause this behavior...?  At least, I don't think
> we're seeing recursion back into the filesystem... but I'll ponder that.

That's definitely a bug which I'll have to fix for 2.6.28, but
I agree it's unlikely to recurse frequently like this (would only
happen under high memory pressure, and only when writeout from
page reclaim happens).


> (Also, Arthur reports seeing this as long ago as 2.6.9...)

OK, well that would confirm it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-29 20:36     ` Eric Sandeen
  2008-10-29 21:09       ` Theodore Tso
@ 2008-10-30 17:40       ` Arthur Jones
  2008-10-30 18:03         ` Eric Sandeen
  2008-10-30 18:32         ` Arthur Jones
  2008-11-03 18:44       ` [PATCH] ext3: wait on all pending commits in ext3_sync_fs Arthur Jones
  2 siblings, 2 replies; 49+ messages in thread
From: Arthur Jones @ 2008-10-30 17:40 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Hi Eric, ...

On Wed, Oct 29, 2008 at 01:36:33PM -0700, Eric Sandeen wrote:
> [...]
> I'll look into it ...

In the cases that fail, I'm seeing bdi_write_congested()
return 2 at the top of write_cache_pages().  In the working
case, bdi_write_congested() returns 0 and the inodes are
found twice in the s_io list in generic_sync_sb_inodes,
first as i_state==7, where they are skipped, then a second
time as i_state==4, where ->writepage() is then called...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-30 17:40       ` Arthur Jones
@ 2008-10-30 18:03         ` Eric Sandeen
  2008-10-30 21:34           ` Arthur Jones
  2008-10-30 18:32         ` Arthur Jones
  1 sibling, 1 reply; 49+ messages in thread
From: Eric Sandeen @ 2008-10-30 18:03 UTC (permalink / raw)
  To: Arthur Jones
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Arthur Jones wrote:
> Hi Eric, ...
> 
> On Wed, Oct 29, 2008 at 01:36:33PM -0700, Eric Sandeen wrote:
>> [...]
>> I'll look into it ...
> 
> In the cases that fail, I'm seeing bdi_write_congested()
> return 2 at the top of write_cache_pages().  In the working
> case, bdi_write_congested() returns 0 and the inodes are
> found twice in the s_io list in generic_sync_sb_inodes,
> first as i_state==7, where they are skipped, then a second
> time as i_state==4, where ->writepage() is then called...
> 
> Arthur

Something is definitely racy here; in my simple testcase I get failures
maybe 30-50% of the time...

If I add a mark_buffer_dirty to write_end_fn, it always passes but I
need to think a bit about that.

-Eric


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-30 18:03         ` Eric Sandeen
@ 2008-10-30 21:34           ` Arthur Jones
  2008-10-31 17:24             ` Arthur Jones
  0 siblings, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-10-30 21:34 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Hi Eric,  ...

On Thu, Oct 30, 2008 at 11:03:49AM -0700, Eric Sandeen wrote:
> [...]
> Something is definitely racy here; in my simple testcase I get failures
> maybe 30-50% of the time...

Some more info: in the working case, the inodes are put
back on sb->s_dirty at then next ext3_sync_fs() call:

__fsync_super -> DQUOT_SYNC -> ext3_sync_fs -> log_wait_commit

In the failing case, journal_start_commit returns 0 in ext_sync_fs
and the inodes disappear into never-never land...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-30 21:34           ` Arthur Jones
@ 2008-10-31 17:24             ` Arthur Jones
  2008-10-31 18:37               ` Eric Sandeen
  0 siblings, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-10-31 17:24 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

On Thu, Oct 30, 2008 at 02:34:00PM -0700, Arthur Jones wrote:
> Hi Eric,  ...
> 
> On Thu, Oct 30, 2008 at 11:03:49AM -0700, Eric Sandeen wrote:
> > [...]
> > Something is definitely racy here; in my simple testcase I get failures
> > maybe 30-50% of the time...
> 
> Some more info: in the working case, the inodes are put
> back on sb->s_dirty at then next ext3_sync_fs() call:
> 
> __fsync_super -> DQUOT_SYNC -> ext3_sync_fs -> log_wait_commit
> 
> In the failing case, journal_start_commit returns 0 in ext_sync_fs
> and the inodes disappear into never-never land...

More details, these are dumps at __log_start_commit in the
call chain described above, the first column is the failing
case, the next column is working case, t_expires is the delta
from the time the dump was taken:

        journal->j_flags                                0x10    0x10
        journal->j_tail_sequence                         515     519
        journal->j_transaction_sequence                  517     522
        journal->j_commit_sequence                       514     519
        journal->j_commit_request                        516     520

        journal->j_running_transaction->t_tid            516     521
        journal->j_running_transaction->t_state            0       0
        journal->j_running_transaction->t_updates          0       0
        journal->j_running_transaction->t_handle_count 27305   27344
        journal->j_running_transaction->t_expires       -566      28

Can you tell from this whether the transactions
are messed up or whether we're just missing a
wake_up?  Any other info you'd like to see?

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-31 17:24             ` Arthur Jones
@ 2008-10-31 18:37               ` Eric Sandeen
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2008-10-31 18:37 UTC (permalink / raw)
  To: Arthur Jones
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Arthur Jones wrote:
> On Thu, Oct 30, 2008 at 02:34:00PM -0700, Arthur Jones wrote:
>> Hi Eric,  ...
>>
>> On Thu, Oct 30, 2008 at 11:03:49AM -0700, Eric Sandeen wrote:
>>> [...]
>>> Something is definitely racy here; in my simple testcase I get failures
>>> maybe 30-50% of the time...
>> Some more info: in the working case, the inodes are put
>> back on sb->s_dirty at then next ext3_sync_fs() call:
>>
>> __fsync_super -> DQUOT_SYNC -> ext3_sync_fs -> log_wait_commit
>>
>> In the failing case, journal_start_commit returns 0 in ext_sync_fs
>> and the inodes disappear into never-never land...
> 
> More details, these are dumps at __log_start_commit in the
> call chain described above, the first column is the failing
> case, the next column is working case, t_expires is the delta
> from the time the dump was taken:
> 
>         journal->j_flags                                0x10    0x10
>         journal->j_tail_sequence                         515     519
>         journal->j_transaction_sequence                  517     522
>         journal->j_commit_sequence                       514     519
>         journal->j_commit_request                        516     520
> 
>         journal->j_running_transaction->t_tid            516     521
>         journal->j_running_transaction->t_state            0       0
>         journal->j_running_transaction->t_updates          0       0
>         journal->j_running_transaction->t_handle_count 27305   27344
>         journal->j_running_transaction->t_expires       -566      28
> 
> Can you tell from this whether the transactions
> are messed up or whether we're just missing a
> wake_up?  Any other info you'd like to see?

That's kind of along the lines of what I'm seeing; also, in particular,
I'm never seeing the buffer_head in question (the one for the block
which contains the slow link's data) transition from jbddirty to normal
BH_Dirty.  I've had to take a break from this today, but will be back at
it a bit later... since I have a solid testcase I'm sure I'll get to the
bottom of it ... :)  I'll probably hook up akpm's buffer tracing
infrastructure, just need to find a decent thing to trigger on to dump
out the history.

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: ext3: slow symlink corruption on umount...
  2008-10-30 17:40       ` Arthur Jones
  2008-10-30 18:03         ` Eric Sandeen
@ 2008-10-30 18:32         ` Arthur Jones
  1 sibling, 0 replies; 49+ messages in thread
From: Arthur Jones @ 2008-10-30 18:32 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Hi Eric, ...

On Thu, Oct 30, 2008 at 10:40:57AM -0700, Arthur Jones wrote:
> [...]
> return 2 at the top of write_cache_pages().  In the working
> case, bdi_write_congested() returns 0 and the inodes are
> found twice in the s_io list in generic_sync_sb_inodes,
> first as i_state==7, where they are skipped, then a second
> time as i_state==4, where ->writepage() is then called...

To clarify, they are not on the list twice, rather
a separate call to generic_sync_sb_inodes sees them
again as i_state==4...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-10-29 20:36     ` Eric Sandeen
  2008-10-29 21:09       ` Theodore Tso
  2008-10-30 17:40       ` Arthur Jones
@ 2008-11-03 18:44       ` Arthur Jones
  2008-11-03 19:33         ` Andrew Morton
  2008-11-03 19:59         ` Eric Sandeen
  2 siblings, 2 replies; 49+ messages in thread
From: Arthur Jones @ 2008-11-03 18:44 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Hi Eric,   This patch fixes the problem for me, and
seems to put the buffers on the dirty list at the
place where they are put on the list during the working
case.  Despite having rooted around in the innards of
ext3 for the last few days, I cannot say that I have
any sense of whether this patch will cause problems
elsewhere or even if this is the best place to
intercede.

I post the complete patch not because I think it
should be committed as is, but rather to try
to explain the logic that brought it about.  At the
very least, this should be reviewed by the experts
here to make sure there is no collateral damage.

Arthur

-------------------
In ext3_sync_fs, we only wait for a commit to
finish if we started it, but there may be one
already in progress which will not be synced.

In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list
of other I/O on the backing block device, this
causes the buffer associated with the long symlinks
to not be moved to the inode dirty list in the second
phase of fsync_super.  Then, before they can be dirtied
again, kjournald exits, seeing the UMOUNT flag and the
dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or
previously freed block data to userspace.

This can be reproduced with a script created
by Eric Sandeen <sandeen@redhat.com>:

	#!/bin/bash

	umount /mnt/test2
	mount /dev/sdb4 /mnt/test2
	rm -f /mnt/test2/*
	dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
	touch
	/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
	ln -s
	/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
	/mnt/test2/link
	umount /mnt/test2
	mount /dev/sdb4 /mnt/test2
	ls /mnt/test2/
	umount /mnt/test2

To ensure all commits are synced, we flush
all journal commits now when sync_fs'ing ext3.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
---
 fs/ext3/super.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18eaa78..053659a 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
 	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
 		if (wait)
 			log_wait_commit(EXT3_SB(sb)->s_journal, target);
-	}
+	} else if (wait)
+		/*
+		 * We may have a commit in progress, clear it out
+		 * before we go on...
+		 */
+		ext3_force_commit(sb);
+
 	return 0;
 }

-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 18:44       ` [PATCH] ext3: wait on all pending commits in ext3_sync_fs Arthur Jones
@ 2008-11-03 19:33         ` Andrew Morton
  2008-11-03 20:14           ` Arthur Jones
  2008-11-03 19:59         ` Eric Sandeen
  1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-11-03 19:33 UTC (permalink / raw)
  To: Arthur Jones; +Cc: sandeen, linux-ext4, sct, linux-kernel

On Mon, 3 Nov 2008 10:44:26 -0800
Arthur Jones <ajones@riverbed.com> wrote:

> Hi Eric,   This patch fixes the problem for me, and
> seems to put the buffers on the dirty list at the
> place where they are put on the list during the working
> case.  Despite having rooted around in the innards of
> ext3 for the last few days, I cannot say that I have
> any sense of whether this patch will cause problems
> elsewhere or even if this is the best place to
> intercede.
> 
> I post the complete patch not because I think it
> should be committed as is, but rather to try
> to explain the logic that brought it about.  At the
> very least, this should be reviewed by the experts
> here to make sure there is no collateral damage.
> 
> Arthur
> 
> -------------------
> In ext3_sync_fs, we only wait for a commit to
> finish if we started it, but there may be one
> already in progress which will not be synced.

argh.

> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
>  	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
>  		if (wait)
>  			log_wait_commit(EXT3_SB(sb)->s_journal, target);
> -	}
> +	} else if (wait)
> +		/*
> +		 * We may have a commit in progress, clear it out
> +		 * before we go on...
> +		 */
> +		ext3_force_commit(sb);
> +
>  	return 0;
>  }

Can we do

	sb->s_dirt = 0;
	if (wait)
		ext3_force_commit(...);
	else
		journal_start_commit(...);

?


Also, I wonder if that `sb->s_dirt = 0' is correct if
journal_start_commit() didn't start a commit?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 19:33         ` Andrew Morton
@ 2008-11-03 20:14           ` Arthur Jones
  2008-11-03 20:37             ` Andrew Morton
  2008-12-18 23:17             ` Jan Kara
  0 siblings, 2 replies; 49+ messages in thread
From: Arthur Jones @ 2008-11-03 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: sandeen@redhat.com, linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

Hi Andrew, ...

On Mon, Nov 03, 2008 at 11:33:18AM -0800, Andrew Morton wrote:
> [...]
> > --- a/fs/ext3/super.c
> > +++ b/fs/ext3/super.c
> > @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> >       if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> >               if (wait)
> >                       log_wait_commit(EXT3_SB(sb)->s_journal, target);
> > -     }
> > +     } else if (wait)
> > +             /*
> > +              * We may have a commit in progress, clear it out
> > +              * before we go on...
> > +              */
> > +             ext3_force_commit(sb);
> > +
> >       return 0;
> >  }
> 
> Can we do
> 
>         sb->s_dirt = 0;
>         if (wait)
>                 ext3_force_commit(...);
>         else
>                 journal_start_commit(...);

I think this is what you had in mind:


diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18eaa78..2b48b85 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2386,13 +2386,12 @@ static void ext3_write_super (struct super_block * sb)
 
 static int ext3_sync_fs(struct super_block *sb, int wait)
 {
-	tid_t target;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 20:14           ` Arthur Jones
@ 2008-11-03 20:37             ` Andrew Morton
  2008-11-03 20:58               ` Arthur Jones
  2008-11-03 21:48               ` Arthur Jones
  2008-12-18 23:17             ` Jan Kara
  1 sibling, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2008-11-03 20:37 UTC (permalink / raw)
  To: Arthur Jones; +Cc: sandeen, linux-ext4, sct, linux-kernel

On Mon, 3 Nov 2008 12:14:28 -0800
Arthur Jones <ajones@riverbed.com> wrote:

> Hi Andrew, ...
> 
> On Mon, Nov 03, 2008 at 11:33:18AM -0800, Andrew Morton wrote:
> > [...]
> > > --- a/fs/ext3/super.c
> > > +++ b/fs/ext3/super.c
> > > @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> > >       if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> > >               if (wait)
> > >                       log_wait_commit(EXT3_SB(sb)->s_journal, target);
> > > -     }
> > > +     } else if (wait)
> > > +             /*
> > > +              * We may have a commit in progress, clear it out
> > > +              * before we go on...
> > > +              */
> > > +             ext3_force_commit(sb);
> > > +
> > >       return 0;
> > >  }
> > 
> > Can we do
> > 
> >         sb->s_dirt = 0;
> >         if (wait)
> >                 ext3_force_commit(...);
> >         else
> >                 journal_start_commit(...);
> 
> I think this is what you had in mind:

yup.

> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18eaa78..2b48b85 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2386,13 +2386,12 @@ static void ext3_write_super (struct super_block * sb)
>  
>  static int ext3_sync_fs(struct super_block *sb, int wait)
>  {
> -	tid_t target;
> -
>  	sb->s_dirt = 0;
> -	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> -		if (wait)
> -			log_wait_commit(EXT3_SB(sb)->s_journal, target);
> -	}
> +	if (wait)
> +		ext3_force_commit(sb);
> +	else
> +		journal_start_commit(EXT3_SB(sb)->s_journal, NULL);
> +
>  	return 0;
>  }
>  
> I tried this and it too fixes the problem.  FWIW I agree it
> looks better...
> 

OK.  But still, questions remain.

- should we clear s_dirt if log_wait_commit() didn't work?

- ext3_force_commit() clears s_dirt also

- should ext3_sync_fs() be dropping the ext3_force_commit() return
  value on the floor?

This is all rather worrisome.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 20:37             ` Andrew Morton
@ 2008-11-03 20:58               ` Arthur Jones
  2008-11-03 21:13                 ` Andrew Morton
  2008-11-03 21:48               ` Arthur Jones
  1 sibling, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-11-03 20:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: sandeen@redhat.com, linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

Hi Andrew, ...

On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> [...]
> OK.  But still, questions remain.
> 
> - should we clear s_dirt if log_wait_commit() didn't work?
> 
> - ext3_force_commit() clears s_dirt also
> 
> - should ext3_sync_fs() be dropping the ext3_force_commit() return
>   value on the floor?

- does ext4 have a similar issue (ext4_sync_fs looks the same)?

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 20:58               ` Arthur Jones
@ 2008-11-03 21:13                 ` Andrew Morton
  2008-11-03 21:19                   ` Theodore Tso
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-11-03 21:13 UTC (permalink / raw)
  To: Arthur Jones; +Cc: sandeen, linux-ext4, sct, linux-kernel

On Mon, 3 Nov 2008 12:58:54 -0800
Arthur Jones <ajones@riverbed.com> wrote:

> Hi Andrew, ...
> 
> On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> > [...]
> > OK.  But still, questions remain.
> > 
> > - should we clear s_dirt if log_wait_commit() didn't work?
> > 
> > - ext3_force_commit() clears s_dirt also
> > 
> > - should ext3_sync_fs() be dropping the ext3_force_commit() return
> >   value on the floor?
> 
> - does ext4 have a similar issue (ext4_sync_fs looks the same)?
> 

I'm sure it does.  Someone(tm) should prepare the ext4 patch once we've
settled on the ext3 patch.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 21:13                 ` Andrew Morton
@ 2008-11-03 21:19                   ` Theodore Tso
  2008-11-03 21:27                     ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: Theodore Tso @ 2008-11-03 21:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Arthur Jones, sandeen, linux-ext4, sct, linux-kernel

On Mon, Nov 03, 2008 at 01:13:13PM -0800, Andrew Morton wrote:
> On Mon, 3 Nov 2008 12:58:54 -0800
> Arthur Jones <ajones@riverbed.com> wrote:
> 
> > Hi Andrew, ...
> > 
> > On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> > > [...]
> > > OK.  But still, questions remain.
> > > 
> > > - should we clear s_dirt if log_wait_commit() didn't work?
> > > 
> > > - ext3_force_commit() clears s_dirt also
> > > 
> > > - should ext3_sync_fs() be dropping the ext3_force_commit() return
> > >   value on the floor?

Should all the callers of ->sync_fs also be dropping the error returns
on the floor?

> > 
> > - does ext4 have a similar issue (ext4_sync_fs looks the same)?
> > 
> 
> I'm sure it does.  Someone(tm) should prepare the ext4 patch once we've
> settled on the ext3 patch.

I'll take care of it.  And I would reflect the error returns up to
ext3_sync_fs's callers, although at the moment it's not clear it would
make much of a difference.  :-(

						- Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 21:19                   ` Theodore Tso
@ 2008-11-03 21:27                     ` Andrew Morton
  2008-11-03 21:48                       ` Theodore Tso
  2008-11-03 22:01                       ` Theodore Tso
  0 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2008-11-03 21:27 UTC (permalink / raw)
  To: Theodore Tso; +Cc: ajones, sandeen, linux-ext4, sct, linux-kernel

On Mon, 3 Nov 2008 16:19:29 -0500
Theodore Tso <tytso@mit.edu> wrote:

> On Mon, Nov 03, 2008 at 01:13:13PM -0800, Andrew Morton wrote:
> > On Mon, 3 Nov 2008 12:58:54 -0800
> > Arthur Jones <ajones@riverbed.com> wrote:
> > 
> > > Hi Andrew, ...
> > > 
> > > On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> > > > [...]
> > > > OK.  But still, questions remain.
> > > > 
> > > > - should we clear s_dirt if log_wait_commit() didn't work?
> > > > 
> > > > - ext3_force_commit() clears s_dirt also
> > > > 
> > > > - should ext3_sync_fs() be dropping the ext3_force_commit() return
> > > >   value on the floor?
> 
> Should all the callers of ->sync_fs also be dropping the error returns
> on the floor?

Oh, this is sync_fs.  How meaningful would it be to return an error
from sys_umount()?  Would there be any point in leaving the errored fs
mounted?  Dunno.

But if we're going to drop this error on the floor, we should do
that at a higher level, not on a per-fs basis ;)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 21:27                     ` Andrew Morton
@ 2008-11-03 21:48                       ` Theodore Tso
  2008-11-03 22:01                       ` Theodore Tso
  1 sibling, 0 replies; 49+ messages in thread
From: Theodore Tso @ 2008-11-03 21:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ajones, sandeen, linux-ext4, sct, linux-kernel

On Mon, Nov 03, 2008 at 01:27:35PM -0800, Andrew Morton wrote:
> Oh, this is sync_fs.  How meaningful would it be to return an error
> from sys_umount()?  Would there be any point in leaving the errored fs
> mounted?  Dunno.
> 
> But if we're going to drop this error on the floor, we should do
> that at a higher level, not on a per-fs basis ;)

At the very least we should log it at the VFS layer, IMHO.

       	    	     	    	      	  - Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 21:27                     ` Andrew Morton
  2008-11-03 21:48                       ` Theodore Tso
@ 2008-11-03 22:01                       ` Theodore Tso
  2008-11-03 22:18                         ` Arthur Jones
  2008-11-03 22:27                         ` Andrew Morton
  1 sibling, 2 replies; 49+ messages in thread
From: Theodore Tso @ 2008-11-03 22:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ajones, sandeen, linux-ext4, sct, linux-kernel

Here's the ext4 version of the patch.  It's slightly altered from the
ext3 version in that we do reflect the errors up to sync_fs's callers
(which then throw it on the floor, but we might as well take away some
of the fun from all of those academic researchers who like to write
papers complaining about how often Linux doesn't do appropriat eerror
checking :-).

After doing some testing, I plan to carry it in the ext4 patch queue.
I think similar changes should be made to the ext3 version of the
patch; agreed?

						- Ted

commit 8106ea5364c2680a385ed240e8f898611babf661
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Nov 3 17:01:22 2008 -0500

    ext4: wait on all pending commits in ext4_sync_fs()
    
    In ext4_sync_fs, we only wait for a commit to finish if we started it,
    but there may be one already in progress which will not be synced.
    
    In the case of a data=ordered umount with pending long symlinks which
    are delayed due to a long list of other I/O on the backing block
    device, this causes the buffer associated with the long symlinks to
    not be moved to the inode dirty list in the second phase of
    fsync_super.  Then, before they can be dirtied again, kjournald exits,
    seeing the UMOUNT flag and the dirty pages are never written to the
    backing block device, causing long symlink corruption and exposing new
    or previously freed block data to userspace.
    
    To ensure all commits are synced, we flush all journal commits now
    when sync_fs'ing ext4.
    
    Signed-off-by: Arthur Jones <ajones@riverbed.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Eric Sandeen <sandeen@redhat.com>
    Cc: <linux-ext4@vger.kernel.org>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 97cb896..e199773 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2899,8 +2899,9 @@ int ext4_force_commit(struct super_block *sb)
 		return 0;
 
 	journal = EXT4_SB(sb)->s_journal;
-	sb->s_dirt = 0;
 	ret = ext4_journal_force_commit(journal);
+	if (!ret)
+		sb->s_dirt = 0;
 	return ret;
 }
 
@@ -2922,15 +2923,16 @@ static void ext4_write_super(struct super_block *sb)
 
 static int ext4_sync_fs(struct super_block *sb, int wait)
 {
-	tid_t target;
+	int ret;
 
 	trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
-	sb->s_dirt = 0;
-	if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
-		if (wait)
-			jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
-	}
-	return 0;
+	if (wait)
+		ret = ext4_force_commit(sb);
+	else
+		ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+	if (!ret)
+		sb->s_dirt = 0;
+	return ret;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 22:01                       ` Theodore Tso
@ 2008-11-03 22:18                         ` Arthur Jones
  2008-11-03 22:27                         ` Andrew Morton
  1 sibling, 0 replies; 49+ messages in thread
From: Arthur Jones @ 2008-11-03 22:18 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, sandeen, linux-ext4, sct,
	linux-kernel

Hi Ted, ...

On Mon, Nov 03, 2008 at 02:01:44PM -0800, Theodore Tso wrote:
> [...]
> @@ -2922,15 +2923,16 @@ static void ext4_write_super(struct super_block *sb)
> 
>  static int ext4_sync_fs(struct super_block *sb, int wait)
>  {
> -       tid_t target;
> +       int ret;
> 
>         trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
> -       sb->s_dirt = 0;
> -       if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
> -               if (wait)
> -                       jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
> -       }
> -       return 0;
> +       if (wait)
> +               ret = ext4_force_commit(sb);
> +       else
> +               ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
> +       if (!ret)
> +               sb->s_dirt = 0;
> +       return ret;

This bit will clear sb->s_dirt if we did _not_ start
a commit in jbd2_journal_start_commit.  It will return
1 if we did start a commit which I don't think is what
we want to do here...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 22:01                       ` Theodore Tso
  2008-11-03 22:18                         ` Arthur Jones
@ 2008-11-03 22:27                         ` Andrew Morton
  2008-11-03 22:55                           ` Theodore Tso
  1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2008-11-03 22:27 UTC (permalink / raw)
  To: Theodore Tso; +Cc: ajones, sandeen, linux-ext4, sct, linux-kernel

On Mon, 3 Nov 2008 17:01:44 -0500
Theodore Tso <tytso@mit.edu> wrote:

>  static int ext4_sync_fs(struct super_block *sb, int wait)
>  {
> -	tid_t target;
> +	int ret;
>  
>  	trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
> -	sb->s_dirt = 0;
> -	if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
> -		if (wait)
> -			jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
> -	}
> -	return 0;
> +	if (wait)
> +		ret = ext4_force_commit(sb);
> +	else
> +		ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
> +	if (!ret)
> +		sb->s_dirt = 0;
> +	return ret;
>  }

It should clear s_dirt before doing the "i/o", methinks?

The usual pattern is

	foo->dirty = 0;
	do_io_on(foo);

because


	do_io_on(foo);
					modify(foo);
					foo->dirty = 1;
	foo->dirty = 0;

is racy.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 22:27                         ` Andrew Morton
@ 2008-11-03 22:55                           ` Theodore Tso
  2008-11-03 23:01                             ` Arthur Jones
  0 siblings, 1 reply; 49+ messages in thread
From: Theodore Tso @ 2008-11-03 22:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ajones, sandeen, linux-ext4, sct, linux-kernel

On Mon, Nov 03, 2008 at 02:27:06PM -0800, Andrew Morton wrote:
> It should clear s_dirt before doing the "i/o", methinks?

Yep, good point.  As I mentioned earlier, though, I'm about 99% sure
that the right fix is to remove all mention of s_dirt entirely, and in
fact we can make super_operations.write_super be NULL for ext3 and
ext4.  But for now we should just keep it in its usual place for now,
and save that for a cleanup commit later on.

						- Ted

commit b20506dc713db1105287b691390563d2aace6d84
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Nov 3 17:54:41 2008 -0500

    ext4: wait on all pending commits in ext4_sync_fs()
    
    In ext4_sync_fs, we only wait for a commit to finish if we started it,
    but there may be one already in progress which will not be synced.
    
    In the case of a data=ordered umount with pending long symlinks which
    are delayed due to a long list of other I/O on the backing block
    device, this causes the buffer associated with the long symlinks to
    not be moved to the inode dirty list in the second phase of
    fsync_super.  Then, before they can be dirtied again, kjournald exits,
    seeing the UMOUNT flag and the dirty pages are never written to the
    backing block device, causing long symlink corruption and exposing new
    or previously freed block data to userspace.
    
    To ensure all commits are synced, we flush all journal commits now
    when sync_fs'ing ext4.
    
    Signed-off-by: Arthur Jones <ajones@riverbed.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Eric Sandeen <sandeen@redhat.com>
    Cc: <linux-ext4@vger.kernel.org>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 97cb896..5b5e38e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2907,12 +2907,9 @@ int ext4_force_commit(struct super_block *sb)
 /*
  * Ext4 always journals updates to the superblock itself, so we don't
  * have to propagate any other updates to the superblock on disk at this
- * point.  Just start an async writeback to get the buffers on their way
- * to the disk.
- *
- * This implicitly triggers the writebehind on sync().
+ * point.  (We can probably nuke this function altogether, and remove
+ * any mention to sb->s_dirt in all of fs/ext4; eventual cleanup...)
  */
-
 static void ext4_write_super(struct super_block *sb)
 {
 	if (mutex_trylock(&sb->s_lock) != 0)
@@ -2922,15 +2919,15 @@ static void ext4_write_super(struct super_block *sb)
 
 static int ext4_sync_fs(struct super_block *sb, int wait)
 {
-	tid_t target;
+	int ret;
 
 	trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
 	sb->s_dirt = 0;
-	if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
-		if (wait)
-			jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
-	}
-	return 0;
+	if (wait)
+		ret = ext4_force_commit(sb);
+	else
+		ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+	return ret;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 22:55                           ` Theodore Tso
@ 2008-11-03 23:01                             ` Arthur Jones
  2008-11-03 23:12                               ` Theodore Tso
  0 siblings, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-11-03 23:01 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, sandeen, linux-ext4, sct,
	linux-kernel

Hi Ted, ...

On Mon, Nov 03, 2008 at 02:55:44PM -0800, Theodore Tso wrote:
> @@ -2922,15 +2919,15 @@ static void ext4_write_super(struct super_block *sb)
> 
>  static int ext4_sync_fs(struct super_block *sb, int wait)
>  {
> -       tid_t target;
> +       int ret;
> 
>         trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
>         sb->s_dirt = 0;
> -       if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
> -               if (wait)
> -                       jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
> -       }
> -       return 0;
> +       if (wait)
> +               ret = ext4_force_commit(sb);
> +       else
> +               ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
> +       return ret;
>  }
> 
>  /*

Same problem as the last one, jbd2_journal_start_commit returns
true if it started a commit, not an error...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 23:01                             ` Arthur Jones
@ 2008-11-03 23:12                               ` Theodore Tso
  2008-11-04 16:26                                 ` Arthur Jones
  0 siblings, 1 reply; 49+ messages in thread
From: Theodore Tso @ 2008-11-03 23:12 UTC (permalink / raw)
  To: Arthur Jones; +Cc: Andrew Morton, sandeen, linux-ext4, sct, linux-kernel

On Mon, Nov 03, 2008 at 03:01:04PM -0800, Arthur Jones wrote:
> 
> Same problem as the last one, jbd2_journal_start_commit returns
> true if it started a commit, not an error...

Ah, right.  !@#! differing return conventions....

BTW, did you ever open a bugzilla entry for this the symlink
corruption problem that we should reference in the commit log?

					- Ted

commit fd7384f8243a957386af3767532d035346f0d149
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Nov 3 18:10:55 2008 -0500

    ext4: wait on all pending commits in ext4_sync_fs()
    
    In ext4_sync_fs, we only wait for a commit to finish if we started it,
    but there may be one already in progress which will not be synced.
    
    In the case of a data=ordered umount with pending long symlinks which
    are delayed due to a long list of other I/O on the backing block
    device, this causes the buffer associated with the long symlinks to
    not be moved to the inode dirty list in the second phase of
    fsync_super.  Then, before they can be dirtied again, kjournald exits,
    seeing the UMOUNT flag and the dirty pages are never written to the
    backing block device, causing long symlink corruption and exposing new
    or previously freed block data to userspace.
    
    To ensure all commits are synced, we flush all journal commits now
    when sync_fs'ing ext4.
    
    Signed-off-by: Arthur Jones <ajones@riverbed.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Eric Sandeen <sandeen@redhat.com>
    Cc: <linux-ext4@vger.kernel.org>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 97cb896..9e7e66c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2907,12 +2907,9 @@ int ext4_force_commit(struct super_block *sb)
 /*
  * Ext4 always journals updates to the superblock itself, so we don't
  * have to propagate any other updates to the superblock on disk at this
- * point.  Just start an async writeback to get the buffers on their way
- * to the disk.
- *
- * This implicitly triggers the writebehind on sync().
+ * point.  (We can probably nuke this function altogether, and remove
+ * any mention to sb->s_dirt in all of fs/ext4; eventual cleanup...)
  */
-
 static void ext4_write_super(struct super_block *sb)
 {
 	if (mutex_trylock(&sb->s_lock) != 0)
@@ -2922,15 +2919,15 @@ static void ext4_write_super(struct super_block *sb)
 
 static int ext4_sync_fs(struct super_block *sb, int wait)
 {
-	tid_t target;
+	int ret = 0;
 
 	trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
 	sb->s_dirt = 0;
-	if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
-		if (wait)
-			jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
-	}
-	return 0;
+	if (wait)
+		ret = ext4_force_commit(sb);
+	else
+		jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+	return ret;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 23:12                               ` Theodore Tso
@ 2008-11-04 16:26                                 ` Arthur Jones
  0 siblings, 0 replies; 49+ messages in thread
From: Arthur Jones @ 2008-11-04 16:26 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, sandeen, linux-ext4, sct,
	linux-kernel

Hi Ted, ...

On Mon, Nov 03, 2008 at 03:12:09PM -0800, Theodore Tso wrote:
> [...]
> BTW, did you ever open a bugzilla entry for this the symlink
> corruption problem that we should reference in the commit log?

Nope...

Patch looks good now -- thanks for looking into this...

Arthur

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 20:37             ` Andrew Morton
  2008-11-03 20:58               ` Arthur Jones
@ 2008-11-03 21:48               ` Arthur Jones
  2008-11-03 22:47                 ` Theodore Tso
  1 sibling, 1 reply; 49+ messages in thread
From: Arthur Jones @ 2008-11-03 21:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: sandeen@redhat.com, linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

Hi Andrew, ...

On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> [...]
> OK.  But still, questions remain.
> 
> - should we clear s_dirt if log_wait_commit() didn't work?
> 
> - ext3_force_commit() clears s_dirt also

I'm not sure if it makes sense, but ATM I don't think it
hurts anything given that ext3_write_super is just sb->s_dirt = 0;

> - should ext3_sync_fs() be dropping the ext3_force_commit() return
>   value on the floor?

Something like this (tested)?

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 2b48b85..743acf1 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2386,13 +2386,15 @@ static void ext3_write_super (struct super_block * sb)
 
 static int ext3_sync_fs(struct super_block *sb, int wait)
 {
+	int ret = 0;
+
 	sb->s_dirt = 0;
 	if (wait)
-		ext3_force_commit(sb);
+		ret = ext3_force_commit(sb);
 	else
 		journal_start_commit(EXT3_SB(sb)->s_journal, NULL);
 
-	return 0;
+	return ret;
 }
 
 /*

Arthur

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 21:48               ` Arthur Jones
@ 2008-11-03 22:47                 ` Theodore Tso
  0 siblings, 0 replies; 49+ messages in thread
From: Theodore Tso @ 2008-11-03 22:47 UTC (permalink / raw)
  To: Arthur Jones
  Cc: Andrew Morton, sandeen@redhat.com, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

On Mon, Nov 03, 2008 at 01:48:40PM -0800, Arthur Jones wrote:
> > - should we clear s_dirt if log_wait_commit() didn't work?
> > 
> > - ext3_force_commit() clears s_dirt also
> 
> I'm not sure if it makes sense, but ATM I don't think it
> hurts anything given that ext3_write_super is just sb->s_dirt = 0;

I want to look at this some more, but it's possible we should just
remove all of the references to sb->s_dirt completely.  Also, the
comment just above ext[34]_write_super() is out of date since before
2.6.0.

						- Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 20:14           ` Arthur Jones
  2008-11-03 20:37             ` Andrew Morton
@ 2008-12-18 23:17             ` Jan Kara
  2008-12-18 23:37               ` Eric Sandeen
  2009-01-13 22:14               ` Eric Sandeen
  1 sibling, 2 replies; 49+ messages in thread
From: Jan Kara @ 2008-12-18 23:17 UTC (permalink / raw)
  To: Arthur Jones
  Cc: Andrew Morton, sandeen@redhat.com, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

  Hello,

  I'm sorry I'm replying late but I got time to react to this only now...

> On Mon, Nov 03, 2008 at 11:33:18AM -0800, Andrew Morton wrote:
> > [...]
> > > --- a/fs/ext3/super.c
> > > +++ b/fs/ext3/super.c
> > > @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> > >       if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> > >               if (wait)
> > >                       log_wait_commit(EXT3_SB(sb)->s_journal, target);
> > > -     }
> > > +     } else if (wait)
> > > +             /*
> > > +              * We may have a commit in progress, clear it out
> > > +              * before we go on...
> > > +              */
> > > +             ext3_force_commit(sb);
> > > +
> > >       return 0;
> > >  }
> > 
> > Can we do
> > 
> >         sb->s_dirt = 0;
> >         if (wait)
> >                 ext3_force_commit(...);
> >         else
> >                 journal_start_commit(...);
> 
> I think this is what you had in mind:
> 
> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18eaa78..2b48b85 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2386,13 +2386,12 @@ static void ext3_write_super (struct super_block * sb)
>  
>  static int ext3_sync_fs(struct super_block *sb, int wait)
>  {
> -	tid_t target;
> -
>  	sb->s_dirt = 0;
> -	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> -		if (wait)
> -			log_wait_commit(EXT3_SB(sb)->s_journal, target);
> -	}
> +	if (wait)
> +		ext3_force_commit(sb);
> +	else
> +		journal_start_commit(EXT3_SB(sb)->s_journal, NULL);
> +
>  	return 0;
>  }
>  
> I tried this and it too fixes the problem.  FWIW I agree it
> looks better...
  Well, shouldn't we rather fix what journal_start_commit() returns?
The interface which returns 1 when a transaction is already committing or
a transaction commit has just been started but 0 when we race with
somebody staring the commit is fairly unusable. Moreover
ext3_force_commit() will unnecessarily create new sync transaction and
commit it if there's no transaction running which is quite expensive
(even merging empty sync handle is not for free because of sync
transaction batching). But this is minor problem since we probably
don't care too much about sync() performance - BTW this is probably a
cause for bug 12224, isn't it?
  BTW: ocfs2 would need fixing as well if done your way since it's
sync_fs function has been copied over from ext3.
  To summarized I'd rather see a patch like below (untested) going in
and your patch reverted... Opinions? I can cookup a JBD2 version of
the patch in case we agree to go this way.

								Honza

>From 0a578ba1b56fe655570ee6dad41748863a120dbc Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 19 Dec 2008 00:05:34 +0100
Subject: [PATCH] jbd: Fix return value of journal_start_commit()

journal_start_commit() returns 1 if either a transaction is committing or the
function has queued a transaction commit. But it returns 0 if we raced with
somebody queueing the transaction commit as well. This resulted in
ext3_sync_fs() not functioning correctly (description from Arthur Jones):
In the case of a data=ordered umount with pending long symlinks which are
delayed due to a long list of other I/O on the backing block device, this
causes the buffer associated with the long symlinks to not be moved to the
inode dirty list in the second phase of fsync_super.  Then, before they can be
dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are
never written to the backing block device, causing long symlink corruption and
exposing new or previously freed block data to userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd/journal.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 9e4fa52..e79c078 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -427,7 +427,7 @@ int __log_space_left(journal_t *journal)
 }
 
 /*
- * Called under j_state_lock.  Returns true if a transaction was started.
+ * Called under j_state_lock.  Returns true if a transaction commit was started.
  */
 int __log_start_commit(journal_t *journal, tid_t target)
 {
@@ -495,7 +495,8 @@ int journal_force_commit_nested(journal_t *journal)
 
 /*
  * Start a commit of the current running transaction (if any).  Returns true
- * if a transaction was started, and fills its tid in at *ptid
+ * if a transaction is going to be committed (or is currently already
+ * committing), and fills its tid in at *ptid
  */
 int journal_start_commit(journal_t *journal, tid_t *ptid)
 {
@@ -505,15 +506,19 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
-		ret = __log_start_commit(journal, tid);
-		if (ret && ptid)
+		__log_start_commit(journal, tid);
+		/* There's a running transaction and we've just made sure
+		 * it's commit has been scheduled. */
+		if (ptid)
 			*ptid = tid;
-	} else if (journal->j_committing_transaction && ptid) {
+		ret = 1;
+	} else if (journal->j_committing_transaction) {
 		/*
 		 * If ext3_write_super() recently started a commit, then we
 		 * have to wait for completion of that transaction
 		 */
-		*ptid = journal->j_committing_transaction->t_tid;
+		if (ptid)
+			*ptid = journal->j_committing_transaction->t_tid;
 		ret = 1;
 	}
 	spin_unlock(&journal->j_state_lock);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-18 23:17             ` Jan Kara
@ 2008-12-18 23:37               ` Eric Sandeen
  2008-12-19  0:27                 ` Jan Kara
  2009-01-12 22:28                 ` Jan Kara
  2009-01-13 22:14               ` Eric Sandeen
  1 sibling, 2 replies; 49+ messages in thread
From: Eric Sandeen @ 2008-12-18 23:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

Jan Kara wrote:
>   Hello,
> 
>   I'm sorry I'm replying late but I got time to react to this only now...
> 

<snip>

>> I tried this and it too fixes the problem.  FWIW I agree it
>> looks better...
>   Well, shouldn't we rather fix what journal_start_commit() returns?
> The interface which returns 1 when a transaction is already committing or
> a transaction commit has just been started but 0 when we race with
> somebody staring the commit is fairly unusable. Moreover
> ext3_force_commit() will unnecessarily create new sync transaction and
> commit it if there's no transaction running which is quite expensive
> (even merging empty sync handle is not for free because of sync
> transaction batching). But this is minor problem since we probably
> don't care too much about sync() performance - BTW this is probably a
> cause for bug 12224, isn't it?

Yep, it is!  :)

>   BTW: ocfs2 would need fixing as well if done your way since it's
> sync_fs function has been copied over from ext3.
>   To summarized I'd rather see a patch like below (untested) going in
> and your patch reverted... Opinions? I can cookup a JBD2 version of
> the patch in case we agree to go this way.

Thanks, I'll look that over.

In looking at what we have today, I wonder if we can make things smarter
so that we don't commit empty transactions in any case?

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-18 23:37               ` Eric Sandeen
@ 2008-12-19  0:27                 ` Jan Kara
  2008-12-19  1:34                   ` Eric Sandeen
  2009-01-12 22:28                 ` Jan Kara
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Kara @ 2008-12-19  0:27 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

On Thu 18-12-08 17:37:23, Eric Sandeen wrote:
> Jan Kara wrote:
> >   Hello,
> > 
> >   I'm sorry I'm replying late but I got time to react to this only now...
> > 
> 
> <snip>
> 
> >> I tried this and it too fixes the problem.  FWIW I agree it
> >> looks better...
> >   Well, shouldn't we rather fix what journal_start_commit() returns?
> > The interface which returns 1 when a transaction is already committing or
> > a transaction commit has just been started but 0 when we race with
> > somebody staring the commit is fairly unusable. Moreover
> > ext3_force_commit() will unnecessarily create new sync transaction and
> > commit it if there's no transaction running which is quite expensive
> > (even merging empty sync handle is not for free because of sync
> > transaction batching). But this is minor problem since we probably
> > don't care too much about sync() performance - BTW this is probably a
> > cause for bug 12224, isn't it?
> 
> Yep, it is!  :)
> 
> >   BTW: ocfs2 would need fixing as well if done your way since it's
> > sync_fs function has been copied over from ext3.
> >   To summarized I'd rather see a patch like below (untested) going in
> > and your patch reverted... Opinions? I can cookup a JBD2 version of
> > the patch in case we agree to go this way.
> 
> Thanks, I'll look that over.
  Thanks.

> In looking at what we have today, I wonder if we can make things smarter
> so that we don't commit empty transactions in any case?
  Probably it does not make sence to commit such transactions and we might
save some time in sync paths if we do so. So yes, I think skipping empty
transaction commit might be worthwhile and it shouldn't be hard to do
either. But I'd give it serious testing just in case some unexpectedly
relies on this behaviour - wouldn't this interfere e.g. with sync
transaction batching autotuning code? Untested patch below...
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

>From 3e1a8dc6a64fdce5c99d7edca0a2812178f7463c Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 19 Dec 2008 01:20:47 +0100
Subject: [PATCH] jbd2: Skip commit of a transaction without any buffers

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index ebc667b..798e021 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -401,6 +401,21 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	J_ASSERT (commit_transaction->t_outstanding_credits <=
 			journal->j_max_transaction_buffers);
 
+	/* Skip commit of a transaction without any buffers */
+	spin_lock(&journal->j_list_lock);
+	if (commit_transaction->t_reserved_list == NULL &&
+	    commit_transaction->t_buffers == NULL &&
+	    commit_transaction->t_forget == NULL &&
+	    list_empty(commit_transaction->t_inode_list)) {
+		BUG_ON(commit_transaction->t_checkpoint_list);
+		BUG_ON(commit_transaction->t_checkpoint_io_list);
+		BUG_ON(commit_transaction->t_iobuf_list);
+		BUG_ON(commit_transaction->t_shadow_list);
+		BUG_ON(commit_transaction->t_log_list);
+		goto out_committed;
+	}
+	spin_unlock(&journal->j_list_lock);
+
 	/*
 	 * First thing we are allowed to do is to discard any remaining
 	 * BJ_Reserved buffers.  Note, it is _not_ permissible to assume
@@ -934,6 +949,7 @@ restart_loop:
 
 	/* Done with this transaction! */
 
+out_committed:
 	jbd_debug(3, "JBD: commit phase 7\n");
 
 	J_ASSERT(commit_transaction->t_state == T_COMMIT);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-19  0:27                 ` Jan Kara
@ 2008-12-19  1:34                   ` Eric Sandeen
  2008-12-22 19:15                     ` Ric Wheeler
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Sandeen @ 2008-12-19  1:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

Jan Kara wrote:

>> In looking at what we have today, I wonder if we can make things smarter
>> so that we don't commit empty transactions in any case?
>   Probably it does not make sence to commit such transactions and we might
> save some time in sync paths if we do so. So yes, I think skipping empty
> transaction commit might be worthwhile and it shouldn't be hard to do
> either. But I'd give it serious testing just in case some unexpectedly
> relies on this behaviour - wouldn't this interfere e.g. with sync
> transaction batching autotuning code? Untested patch below...
> 								Honza


Cool, thanks!  This's stop:

# sync

from spinning up disks under idle filesystems too, I think.

I was looking at something similar but was still working out how many
things to check before deciding if the transaction was in fact empty.  :)

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-19  1:34                   ` Eric Sandeen
@ 2008-12-22 19:15                     ` Ric Wheeler
  2008-12-22 22:57                       ` Andreas Dilger
  0 siblings, 1 reply; 49+ messages in thread
From: Ric Wheeler @ 2008-12-22 19:15 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

Eric Sandeen wrote:
> Jan Kara wrote:
>
>   
>>> In looking at what we have today, I wonder if we can make things smarter
>>> so that we don't commit empty transactions in any case?
>>>       
>>   Probably it does not make sence to commit such transactions and we might
>> save some time in sync paths if we do so. So yes, I think skipping empty
>> transaction commit might be worthwhile and it shouldn't be hard to do
>> either. But I'd give it serious testing just in case some unexpectedly
>> relies on this behaviour - wouldn't this interfere e.g. with sync
>> transaction batching autotuning code? Untested patch below...
>> 								Honza
>>     
>
>
> Cool, thanks!  This's stop:
>
> # sync
>
> from spinning up disks under idle filesystems too, I think.
>
> I was looking at something similar but was still working out how many
> things to check before deciding if the transaction was in fact empty.  :)
>
> -Eric
>   

Without having dived into the patch in detail, one worry I would have is 
that we still might care to spin up a drive for empty transactions in 
order to invalidate the drive's write cache.

For example, if we have the following sequence:

    (1) user app performs series of writes to file A
    (2) pages dirtied from writes to A are destaged to the disk over time
    (3) user app issues fsync(file A) to make sure that the data will 
survive a power outage

At this point in time, would this change prevent us from spinning up the 
drive and invalidating the disk write cache for that fsync() ?

Regards,

Ric


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-22 19:15                     ` Ric Wheeler
@ 2008-12-22 22:57                       ` Andreas Dilger
  2008-12-23  0:09                         ` Ric Wheeler
  2008-12-23 15:56                         ` Eric Sandeen
  0 siblings, 2 replies; 49+ messages in thread
From: Andreas Dilger @ 2008-12-22 22:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Sandeen, Jan Kara, Arthur Jones, Andrew Morton,
	linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

On Dec 22, 2008  14:15 -0500, Ric Wheeler wrote:
> Without having dived into the patch in detail, one worry I would have is  
> that we still might care to spin up a drive for empty transactions in  
> order to invalidate the drive's write cache.
>
> For example, if we have the following sequence:
>
>    (1) user app performs series of writes to file A
>    (2) pages dirtied from writes to A are destaged to the disk over time
>    (3) user app issues fsync(file A) to make sure that the data will  
> survive a power outage
>
> At this point in time, would this change prevent us from spinning up the  
> drive and invalidating the disk write cache for that fsync() ?

Well, if the writes themselves didn't spin up the drive, it is uncertain
whether the write of the journal commit block would be any more helpful
in getting that to happen.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-22 22:57                       ` Andreas Dilger
@ 2008-12-23  0:09                         ` Ric Wheeler
  2008-12-23 15:56                         ` Eric Sandeen
  1 sibling, 0 replies; 49+ messages in thread
From: Ric Wheeler @ 2008-12-23  0:09 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Sandeen, Jan Kara, Arthur Jones, Andrew Morton,
	linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

Andreas Dilger wrote:
> On Dec 22, 2008  14:15 -0500, Ric Wheeler wrote:
>   
>> Without having dived into the patch in detail, one worry I would have is  
>> that we still might care to spin up a drive for empty transactions in  
>> order to invalidate the drive's write cache.
>>
>> For example, if we have the following sequence:
>>
>>    (1) user app performs series of writes to file A
>>    (2) pages dirtied from writes to A are destaged to the disk over time
>>    (3) user app issues fsync(file A) to make sure that the data will  
>> survive a power outage
>>
>> At this point in time, would this change prevent us from spinning up the  
>> drive and invalidating the disk write cache for that fsync() ?
>>     
>
> Well, if the writes themselves didn't spin up the drive, it is uncertain
> whether the write of the journal commit block would be any more helpful
> in getting that to happen.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>   
To make the example cleared, add a step:

    (2.5) between the writes and the fsync(), the drive spins down

In this case, if the app issues an fsync(), will we still have volatile 
data in its write cache that has not been flushed?

Ric


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-22 22:57                       ` Andreas Dilger
  2008-12-23  0:09                         ` Ric Wheeler
@ 2008-12-23 15:56                         ` Eric Sandeen
  1 sibling, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2008-12-23 15:56 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ric Wheeler, Jan Kara, Arthur Jones, Andrew Morton,
	linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

Andreas Dilger wrote:
> On Dec 22, 2008  14:15 -0500, Ric Wheeler wrote:
>> Without having dived into the patch in detail, one worry I would have is  
>> that we still might care to spin up a drive for empty transactions in  
>> order to invalidate the drive's write cache.
>>
>> For example, if we have the following sequence:
>>
>>    (1) user app performs series of writes to file A
>>    (2) pages dirtied from writes to A are destaged to the disk over time
>>    (3) user app issues fsync(file A) to make sure that the data will  
>> survive a power outage
>>
>> At this point in time, would this change prevent us from spinning up the  
>> drive and invalidating the disk write cache for that fsync() ?
> 
> Well, if the writes themselves didn't spin up the drive, it is uncertain
> whether the write of the journal commit block would be any more helpful
> in getting that to happen.

So, ext4_sync_file() calls blkdev_issue_flush() which would should do
the right thing even if the drive is spun down, I think (rather than
hoping that some other journal activity would flush this out...)

I guess I don't know for sure what blkdev_issue_flush does on a
spun-down drive but I'd hope it does the right thing.

Pretty sure I sent a patch for ext3 to do the same, but it was
ignored/dropped/forgotten along with the barriers-by-default patch.
Suppose I could try again.

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-18 23:37               ` Eric Sandeen
  2008-12-19  0:27                 ` Jan Kara
@ 2009-01-12 22:28                 ` Jan Kara
  2009-01-13 17:21                   ` Eric Sandeen
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Kara @ 2009-01-12 22:28 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

  Hi Eric,

> Jan Kara wrote:
> >   Hello,
> > 
> >   I'm sorry I'm replying late but I got time to react to this only now...
> > 
> 
> >> I tried this and it too fixes the problem.  FWIW I agree it
> >> looks better...
> >   Well, shouldn't we rather fix what journal_start_commit() returns?
> > The interface which returns 1 when a transaction is already committing or
> > a transaction commit has just been started but 0 when we race with
> > somebody staring the commit is fairly unusable. Moreover
> > ext3_force_commit() will unnecessarily create new sync transaction and
> > commit it if there's no transaction running which is quite expensive
> > (even merging empty sync handle is not for free because of sync
> > transaction batching). But this is minor problem since we probably
> > don't care too much about sync() performance - BTW this is probably a
> > cause for bug 12224, isn't it?
> 
> Yep, it is!  :)
> 
> >   BTW: ocfs2 would need fixing as well if done your way since it's
> > sync_fs function has been copied over from ext3.
> >   To summarized I'd rather see a patch like below (untested) going in
> > and your patch reverted... Opinions? I can cookup a JBD2 version of
> > the patch in case we agree to go this way.
> 
> Thanks, I'll look that over.
  Any chance you've looked over that patch? Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2009-01-12 22:28                 ` Jan Kara
@ 2009-01-13 17:21                   ` Eric Sandeen
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-01-13 17:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

Jan Kara wrote:
>   Hi Eric,
> 
>> Jan Kara wrote:
>>>   Hello,
>>>
>>>   I'm sorry I'm replying late but I got time to react to this only now...
>>>
>>>> I tried this and it too fixes the problem.  FWIW I agree it
>>>> looks better...
>>>   Well, shouldn't we rather fix what journal_start_commit() returns?
>>> The interface which returns 1 when a transaction is already committing or
>>> a transaction commit has just been started but 0 when we race with
>>> somebody staring the commit is fairly unusable. Moreover
>>> ext3_force_commit() will unnecessarily create new sync transaction and
>>> commit it if there's no transaction running which is quite expensive
>>> (even merging empty sync handle is not for free because of sync
>>> transaction batching). But this is minor problem since we probably
>>> don't care too much about sync() performance - BTW this is probably a
>>> cause for bug 12224, isn't it?
>> Yep, it is!  :)
>>
>>>   BTW: ocfs2 would need fixing as well if done your way since it's
>>> sync_fs function has been copied over from ext3.
>>>   To summarized I'd rather see a patch like below (untested) going in
>>> and your patch reverted... Opinions? I can cookup a JBD2 version of
>>> the patch in case we agree to go this way.
>> Thanks, I'll look that over.
>   Any chance you've looked over that patch? Thanks.
> 
> 								Honza

Sorry, kind of slipped through the cracks.  I'll do that and run it
through the testcase today.

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-12-18 23:17             ` Jan Kara
  2008-12-18 23:37               ` Eric Sandeen
@ 2009-01-13 22:14               ` Eric Sandeen
  2009-01-14  4:24                 ` Theodore Tso
  1 sibling, 1 reply; 49+ messages in thread
From: Eric Sandeen @ 2009-01-13 22:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

Jan Kara wrote:

> From 0a578ba1b56fe655570ee6dad41748863a120dbc Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Fri, 19 Dec 2008 00:05:34 +0100
> Subject: [PATCH] jbd: Fix return value of journal_start_commit()
> 
> journal_start_commit() returns 1 if either a transaction is committing or the
> function has queued a transaction commit. But it returns 0 if we raced with
> somebody queueing the transaction commit as well. This resulted in
> ext3_sync_fs() not functioning correctly (description from Arthur Jones):
> In the case of a data=ordered umount with pending long symlinks which are
> delayed due to a long list of other I/O on the backing block device, this
> causes the buffer associated with the long symlinks to not be moved to the
> inode dirty list in the second phase of fsync_super.  Then, before they can be
> dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are
> never written to the backing block device, causing long symlink corruption and
> exposing new or previously freed block data to userspace.

This looks sane to me, and it does fix the below testcase.

Care to formally propose it?

Thanks,
-Eric

> This can be reproduced with a script created by Eric Sandeen
> <sandeen@redhat.com>:
> 
>         #!/bin/bash
> 
>         umount /mnt/test2
>         mount /dev/sdb4 /mnt/test2
>         rm -f /mnt/test2/*
>         dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
>         touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
>         ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
>         /mnt/test2/link
>         umount /mnt/test2
>         mount /dev/sdb4 /mnt/test2
>         ls /mnt/test2/
> 
> This patch fixes journal_start_commit() to always return 1 when there's
> a transaction committing or queued for commit.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/jbd/journal.c |   17 +++++++++++------
>  1 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index 9e4fa52..e79c078 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -427,7 +427,7 @@ int __log_space_left(journal_t *journal)
>  }
>  
>  /*
> - * Called under j_state_lock.  Returns true if a transaction was started.
> + * Called under j_state_lock.  Returns true if a transaction commit was started.
>   */
>  int __log_start_commit(journal_t *journal, tid_t target)
>  {
> @@ -495,7 +495,8 @@ int journal_force_commit_nested(journal_t *journal)
>  
>  /*
>   * Start a commit of the current running transaction (if any).  Returns true
> - * if a transaction was started, and fills its tid in at *ptid
> + * if a transaction is going to be committed (or is currently already
> + * committing), and fills its tid in at *ptid
>   */
>  int journal_start_commit(journal_t *journal, tid_t *ptid)
>  {
> @@ -505,15 +506,19 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
>  	if (journal->j_running_transaction) {
>  		tid_t tid = journal->j_running_transaction->t_tid;
>  
> -		ret = __log_start_commit(journal, tid);
> -		if (ret && ptid)
> +		__log_start_commit(journal, tid);
> +		/* There's a running transaction and we've just made sure
> +		 * it's commit has been scheduled. */
> +		if (ptid)
>  			*ptid = tid;
> -	} else if (journal->j_committing_transaction && ptid) {
> +		ret = 1;
> +	} else if (journal->j_committing_transaction) {
>  		/*
>  		 * If ext3_write_super() recently started a commit, then we
>  		 * have to wait for completion of that transaction
>  		 */
> -		*ptid = journal->j_committing_transaction->t_tid;
> +		if (ptid)
> +			*ptid = journal->j_committing_transaction->t_tid;
>  		ret = 1;
>  	}
>  	spin_unlock(&journal->j_state_lock);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2009-01-13 22:14               ` Eric Sandeen
@ 2009-01-14  4:24                 ` Theodore Tso
  2009-01-14 17:26                   ` Eric Sandeen
  2009-01-14 17:27                   ` Jan Kara
  0 siblings, 2 replies; 49+ messages in thread
From: Theodore Tso @ 2009-01-14  4:24 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
> 
> This looks sane to me, and it does fix the below testcase.
> 
> Care to formally propose it?

Can we confirm what is being proposed?  From following this thread, I
think what folks are suggesting is:

1)  Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"

2)  Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"

3)  Also apply Jan's patch "jbd2: Skip commit of a transaction without
any buffers" since it appears to be a good optimization (although it's
not clear it would happen once we revert (1), above.

    	     	   	       	  	      - Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2009-01-14  4:24                 ` Theodore Tso
@ 2009-01-14 17:26                   ` Eric Sandeen
  2009-01-14 17:27                   ` Jan Kara
  1 sibling, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-01-14 17:26 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, Jan Kara, Arthur Jones, Andrew Morton,
	"linux-ext4@vg

Theodore Tso wrote:
> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>> This looks sane to me, and it does fix the below testcase.
>>
>> Care to formally propose it?
> 
> Can we confirm what is being proposed?  From following this thread, I
> think what folks are suggesting is:
> 
> 1)  Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
> 
> 2)  Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"

The above seems like the right plan to me.  It should address the
kernel.org bug about IO on idle partitions.

> 3)  Also apply Jan's patch "jbd2: Skip commit of a transaction without
> any buffers" since it appears to be a good optimization (although it's
> not clear it would happen once we revert (1), above.

Maybe not yet, it could probably use more testing/review/soak.

-Eric

>     	     	   	       	  	      - Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2009-01-14  4:24                 ` Theodore Tso
  2009-01-14 17:26                   ` Eric Sandeen
@ 2009-01-14 17:27                   ` Jan Kara
  2009-01-29 18:27                     ` Mike Snitzer
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Kara @ 2009-01-14 17:27 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, Jan Kara, Arthur Jones, Andrew Morton,
	"linux-ext4@vg

On Tue 13-01-09 23:24:02, Theodore Tso wrote:
> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
> > 
> > This looks sane to me, and it does fix the below testcase.
> > 
> > Care to formally propose it?
> 
> Can we confirm what is being proposed?  From following this thread, I
> think what folks are suggesting is:
> 
> 1)  Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
  Yes.

> 2)  Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
  Yes.

> 3)  Also apply Jan's patch "jbd2: Skip commit of a transaction without
> any buffers" since it appears to be a good optimization (although it's
> not clear it would happen once we revert (1), above.
  Yes, it's an optimization but I'm still a bit afraid about something
relying on jbd2_journal_force_commit() implying a barrier which would not
always be a case after this patch... So we should probably audit all users of
ext4_force_commit() and check that this change is fine with them.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2009-01-14 17:27                   ` Jan Kara
@ 2009-01-29 18:27                     ` Mike Snitzer
  2009-01-29 20:05                       ` Eric Sandeen
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Snitzer @ 2009-01-29 18:27 UTC (permalink / raw)
  To: Theodore Tso, Jan Kara, Eric Sandeen
  Cc: Arthur Jones, Andrew Morton, linux-ext4@vger.kernel.org,
	sct@redhat.com, linux-kernel@vger.kernel.org

On Wed, Jan 14, 2009 at 12:27 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 13-01-09 23:24:02, Theodore Tso wrote:
>> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>> >
>> > This looks sane to me, and it does fix the below testcase.
>> >
>> > Care to formally propose it?
>>
>> Can we confirm what is being proposed?  From following this thread, I
>> think what folks are suggesting is:
>>
>> 1)  Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
>  Yes.
>
>> 2)  Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
>  Yes.
>
>> 3)  Also apply Jan's patch "jbd2: Skip commit of a transaction without
>> any buffers" since it appears to be a good optimization (although it's
>> not clear it would happen once we revert (1), above.
>  Yes, it's an optimization but I'm still a bit afraid about something
> relying on jbd2_journal_force_commit() implying a barrier which would not
> always be a case after this patch... So we should probably audit all users of
> ext4_force_commit() and check that this change is fine with them.

Ted/Jan/Eric,

I just wanted to followup on this to see what the plan is.  Items 1
and 2 haven't occurred in any of the ext4.git branches that I can see.
 I could be missing something but it seems this may have slipped
through the ext[34] cracks?

Mike

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2009-01-29 18:27                     ` Mike Snitzer
@ 2009-01-29 20:05                       ` Eric Sandeen
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-01-29 20:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Theodore Tso, Jan Kara, Arthur Jones, Andrew Morton,
	linux-ext4@vger.kernel.org, sct@redhat.com,
	linux-kernel@vger.kernel.org

Mike Snitzer wrote:
> On Wed, Jan 14, 2009 at 12:27 PM, Jan Kara <jack@suse.cz> wrote:
>> On Tue 13-01-09 23:24:02, Theodore Tso wrote:
>>> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>>>> This looks sane to me, and it does fix the below testcase.
>>>>
>>>> Care to formally propose it?
>>> Can we confirm what is being proposed?  From following this thread, I
>>> think what folks are suggesting is:
>>>
>>> 1)  Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
>>  Yes.
>>
>>> 2)  Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
>>  Yes.
>>
>>> 3)  Also apply Jan's patch "jbd2: Skip commit of a transaction without
>>> any buffers" since it appears to be a good optimization (although it's
>>> not clear it would happen once we revert (1), above.
>>  Yes, it's an optimization but I'm still a bit afraid about something
>> relying on jbd2_journal_force_commit() implying a barrier which would not
>> always be a case after this patch... So we should probably audit all users of
>> ext4_force_commit() and check that this change is fine with them.
> 
> Ted/Jan/Eric,
> 
> I just wanted to followup on this to see what the plan is.  Items 1
> and 2 haven't occurred in any of the ext4.git branches that I can see.
>  I could be missing something but it seems this may have slipped
> through the ext[34] cracks?

Hm, I agree.

Jan, do you want to re-send it in its own message rather than buried in
the other thread?  I don't know how we technically handle a "revert"
upstream, to be honest.

-Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] ext3: wait on all pending commits in ext3_sync_fs
  2008-11-03 18:44       ` [PATCH] ext3: wait on all pending commits in ext3_sync_fs Arthur Jones
  2008-11-03 19:33         ` Andrew Morton
@ 2008-11-03 19:59         ` Eric Sandeen
  1 sibling, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2008-11-03 19:59 UTC (permalink / raw)
  To: Arthur Jones
  Cc: linux-ext4@vger.kernel.org, sct@redhat.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

Arthur Jones wrote:
> Hi Eric,   This patch fixes the problem for me, and
> seems to put the buffers on the dirty list at the
> place where they are put on the list during the working
> case.  Despite having rooted around in the innards of
> ext3 for the last few days, I cannot say that I have
> any sense of whether this patch will cause problems
> elsewhere or even if this is the best place to
> intercede.
> 
> I post the complete patch not because I think it
> should be committed as is, but rather to try
> to explain the logic that brought it about.  At the
> very least, this should be reviewed by the experts
> here to make sure there is no collateral damage.
> 
> Arthur

Seems like the right approach; I too was thinking that the problem was
we just weren't either kicking off, or waiting for, the log commit at
unmount time.

I've had to step away from this problem for a couple days but will
eyeball this soon, it seems like the right root cause & general approach
to a fix, to me.

Thanks!

-Eric

> -------------------
> In ext3_sync_fs, we only wait for a commit to
> finish if we started it, but there may be one
> already in progress which will not be synced.
> 
> In the case of a data=ordered umount with pending long
> symlinks which are delayed due to a long list
> of other I/O on the backing block device, this
> causes the buffer associated with the long symlinks
> to not be moved to the inode dirty list in the second
> phase of fsync_super.  Then, before they can be dirtied
> again, kjournald exits, seeing the UMOUNT flag and the
> dirty pages are never written to the backing block device,
> causing long symlink corruption and exposing new or
> previously freed block data to userspace.
> 
> This can be reproduced with a script created
> by Eric Sandeen <sandeen@redhat.com>:
> 
> 	#!/bin/bash
> 
> 	umount /mnt/test2
> 	mount /dev/sdb4 /mnt/test2
> 	rm -f /mnt/test2/*
> 	dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
> 	touch
> 	/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
> 	ln -s
> 	/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
> 	/mnt/test2/link
> 	umount /mnt/test2
> 	mount /dev/sdb4 /mnt/test2
> 	ls /mnt/test2/
> 	umount /mnt/test2
> 
> To ensure all commits are synced, we flush
> all journal commits now when sync_fs'ing ext3.
> 
> Signed-off-by: Arthur Jones <ajones@riverbed.com>
> ---
>  fs/ext3/super.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18eaa78..053659a 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
>  	if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
>  		if (wait)
>  			log_wait_commit(EXT3_SB(sb)->s_journal, target);
> -	}
> +	} else if (wait)
> +		/*
> +		 * We may have a commit in progress, clear it out
> +		 * before we go on...
> +		 */
> +		ext3_force_commit(sb);
> +
>  	return 0;
>  }
>  


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2009-01-29 20:05 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-24 18:37 ext3: slow symlink corruption on umount Arthur Jones
2008-10-27 16:54 ` Arthur Jones
2008-10-29 19:54   ` Arthur Jones
2008-10-29 20:36     ` Eric Sandeen
2008-10-29 21:09       ` Theodore Tso
2008-10-30 13:38         ` Eric Sandeen
2008-10-30 13:55           ` Arthur Jones
2008-10-31  9:47           ` Nick Piggin
2008-10-30 17:40       ` Arthur Jones
2008-10-30 18:03         ` Eric Sandeen
2008-10-30 21:34           ` Arthur Jones
2008-10-31 17:24             ` Arthur Jones
2008-10-31 18:37               ` Eric Sandeen
2008-10-30 18:32         ` Arthur Jones
2008-11-03 18:44       ` [PATCH] ext3: wait on all pending commits in ext3_sync_fs Arthur Jones
2008-11-03 19:33         ` Andrew Morton
2008-11-03 20:14           ` Arthur Jones
2008-11-03 20:37             ` Andrew Morton
2008-11-03 20:58               ` Arthur Jones
2008-11-03 21:13                 ` Andrew Morton
2008-11-03 21:19                   ` Theodore Tso
2008-11-03 21:27                     ` Andrew Morton
2008-11-03 21:48                       ` Theodore Tso
2008-11-03 22:01                       ` Theodore Tso
2008-11-03 22:18                         ` Arthur Jones
2008-11-03 22:27                         ` Andrew Morton
2008-11-03 22:55                           ` Theodore Tso
2008-11-03 23:01                             ` Arthur Jones
2008-11-03 23:12                               ` Theodore Tso
2008-11-04 16:26                                 ` Arthur Jones
2008-11-03 21:48               ` Arthur Jones
2008-11-03 22:47                 ` Theodore Tso
2008-12-18 23:17             ` Jan Kara
2008-12-18 23:37               ` Eric Sandeen
2008-12-19  0:27                 ` Jan Kara
2008-12-19  1:34                   ` Eric Sandeen
2008-12-22 19:15                     ` Ric Wheeler
2008-12-22 22:57                       ` Andreas Dilger
2008-12-23  0:09                         ` Ric Wheeler
2008-12-23 15:56                         ` Eric Sandeen
2009-01-12 22:28                 ` Jan Kara
2009-01-13 17:21                   ` Eric Sandeen
2009-01-13 22:14               ` Eric Sandeen
2009-01-14  4:24                 ` Theodore Tso
2009-01-14 17:26                   ` Eric Sandeen
2009-01-14 17:27                   ` Jan Kara
2009-01-29 18:27                     ` Mike Snitzer
2009-01-29 20:05                       ` Eric Sandeen
2008-11-03 19:59         ` Eric Sandeen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).