* [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
@ 2004-09-16 12:56 Seiji Kihara
2004-09-16 21:50 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Seiji Kihara @ 2004-09-16 12:56 UTC (permalink / raw)
To: Andrew Morton, Stephen C. Tweedie, Andreas Dilger
Cc: linux-fsdevel, ospfs, ext3-users, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2488 bytes --]
Hello,
We found that fsync and fdatasync syscalls sometimes don't sync
data in an ext3 file system under the following conditions.
1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
2. Ext3's journalling mode is "data=journal".
3. Create a file (whose size is 1Mbytes) and execute umount/mount.
4. lseek to a random position within the file, write 8192 bytes
data, and fsync or fdatasync.
We presume the data was not written to the corresponding disk
before returning from fsync or fdatasync syscall on the evidence
as follows:
1. The response time of fsync() and fdatasync() was extremely
short.
We use the "diskio" tool, which is downloadable from OSDL page
(http://developer.osdl.jp/projects/doubt/). The program showed
that the response time was under 10 microseconds. This time
cannot be achieved with data transfer on IDE and PCI bus!
2. The IDE writing routine ide_start_dma() was not called under
DMA enabled.
We inserted the print messages in the sys_write(), sys_fsync()
and ide_start_dma() by the attached patch. Sometimes the
"ide_start_dma: ..." message was not shown between "write: in
..." and "fsync: out ...".
The problem was occurred since 2.6.5-bk1, which includes the patch
"[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
problem was solved by deleting the part of the patch which
modifies ext3_sync_file(). Maybe, i_state is not correctly set to
I_DIRTY when the related page cache is dirty (is it true?)
Attached file is tarball (tar + bzip2) which contains following
files. The patches are for 2.6.8.1-kernel (applicable to
2.6.9-rc2), and the results were also produced with
2.6.8.1-kernel.
- fsync.c.patch: patch for solving the problem
- kernel.printk.patch: patch for showing the problem by printks
- kernel.printk.log: printks from the kernel with the printk patch
- results/*: output from "diskio" program
(result for fsync, and O_SYNC for comparison)
- kernel-2.6.8.1.config: .config file which we used to build kernel
We use the following hardware to make the data:
CPU: Intel Pentium 4 3GHz Hyper Threading, 2nd cache 512KB
RAM: 1GBytes
IDE controller: on-board ICH5 (ATA100)
HDD: E-IDE (7200rpm, 8MB cache, ATA133) x 2
(one for system and the other for writing test data)
(The problem was also reproduced with a SCSI system)
Regards,
Seiji
--
Seiji Kihara
Open Source Software Computing Project,
NTT Cyber Space Laboratories, Yokosuka, JAPAN
[-- Attachment #2: patches+results.tar.bz2 --]
[-- Type: application/octet-stream, Size: 80725 bytes --]
[-- Attachment #3: Type: text/plain, Size: 144 bytes --]
_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
2004-09-16 12:56 [PATCH] BUG on fsync/fdatasync with Ext3 data=journal Seiji Kihara
@ 2004-09-16 21:50 ` Andrew Morton
0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2004-09-16 21:50 UTC (permalink / raw)
To: Seiji Kihara; +Cc: sct, adilger, ext3-users, linux-fsdevel, linux-kernel, ospfs
Seiji Kihara <kihara.seiji@lab.ntt.co.jp> wrote:
>
> We found that fsync and fdatasync syscalls sometimes don't sync
> data in an ext3 file system under the following conditions.
>
> 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
> 2. Ext3's journalling mode is "data=journal".
> 3. Create a file (whose size is 1Mbytes) and execute umount/mount.
> 4. lseek to a random position within the file, write 8192 bytes
> data, and fsync or fdatasync.
>
> We presume the data was not written to the corresponding disk
> before returning from fsync or fdatasync syscall on the evidence
> as follows:
>
> 1. The response time of fsync() and fdatasync() was extremely
> short.
>
> We use the "diskio" tool, which is downloadable from OSDL page
> (http://developer.osdl.jp/projects/doubt/). The program showed
> that the response time was under 10 microseconds. This time
> cannot be achieved with data transfer on IDE and PCI bus!
>
> 2. The IDE writing routine ide_start_dma() was not called under
> DMA enabled.
>
> We inserted the print messages in the sys_write(), sys_fsync()
> and ide_start_dma() by the attached patch. Sometimes the
> "ide_start_dma: ..." message was not shown between "write: in
> ..." and "fsync: out ...".
>
> The problem was occurred since 2.6.5-bk1, which includes the patch
> "[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
> problem was solved by deleting the part of the patch which
> modifies ext3_sync_file(). Maybe, i_state is not correctly set to
> I_DIRTY when the related page cache is dirty (is it true?)
I forgot about this one.
> Attached file is tarball (tar + bzip2) which contains following
> files. The patches are for 2.6.8.1-kernel (applicable to
> 2.6.9-rc2), and the results were also produced with
> 2.6.8.1-kernel.
We really don't need a 100k tarball to communicate a three line patch :(
Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
write() time when the filesystem is mounted in data=journal mode.
However your patch will disable the above optimisation for data=writeback
and data-ordered modes as well. I don't think that's necessary?
How about this?
--- 25/fs/ext3/fsync.c~ext3-journal-data-fsync-fix Thu Sep 16 14:47:21 2004
+++ 25-akpm/fs/ext3/fsync.c Thu Sep 16 14:47:33 2004
@@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s
J_ASSERT(ext3_journal_current_handle() == 0);
- smp_mb(); /* prepare for lockless i_state read */
- if (!(inode->i_state & I_DIRTY))
- goto out;
-
/*
* data=writeback:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -76,6 +72,10 @@ int ext3_sync_file(struct file * file, s
goto out;
}
+ smp_mb(); /* prepare for lockless i_state read */
+ if (!(inode->i_state & I_DIRTY))
+ goto out;
+
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
_
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
@ 2004-09-16 21:50 ` Andrew Morton
0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2004-09-16 21:50 UTC (permalink / raw)
To: Seiji Kihara; +Cc: ext3-users, ospfs, linux-fsdevel, sct, adilger, linux-kernel
Seiji Kihara <kihara.seiji@lab.ntt.co.jp> wrote:
>
> We found that fsync and fdatasync syscalls sometimes don't sync
> data in an ext3 file system under the following conditions.
>
> 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
> 2. Ext3's journalling mode is "data=journal".
> 3. Create a file (whose size is 1Mbytes) and execute umount/mount.
> 4. lseek to a random position within the file, write 8192 bytes
> data, and fsync or fdatasync.
>
> We presume the data was not written to the corresponding disk
> before returning from fsync or fdatasync syscall on the evidence
> as follows:
>
> 1. The response time of fsync() and fdatasync() was extremely
> short.
>
> We use the "diskio" tool, which is downloadable from OSDL page
> (http://developer.osdl.jp/projects/doubt/). The program showed
> that the response time was under 10 microseconds. This time
> cannot be achieved with data transfer on IDE and PCI bus!
>
> 2. The IDE writing routine ide_start_dma() was not called under
> DMA enabled.
>
> We inserted the print messages in the sys_write(), sys_fsync()
> and ide_start_dma() by the attached patch. Sometimes the
> "ide_start_dma: ..." message was not shown between "write: in
> ..." and "fsync: out ...".
>
> The problem was occurred since 2.6.5-bk1, which includes the patch
> "[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
> problem was solved by deleting the part of the patch which
> modifies ext3_sync_file(). Maybe, i_state is not correctly set to
> I_DIRTY when the related page cache is dirty (is it true?)
I forgot about this one.
> Attached file is tarball (tar + bzip2) which contains following
> files. The patches are for 2.6.8.1-kernel (applicable to
> 2.6.9-rc2), and the results were also produced with
> 2.6.8.1-kernel.
We really don't need a 100k tarball to communicate a three line patch :(
Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
write() time when the filesystem is mounted in data=journal mode.
However your patch will disable the above optimisation for data=writeback
and data-ordered modes as well. I don't think that's necessary?
How about this?
--- 25/fs/ext3/fsync.c~ext3-journal-data-fsync-fix Thu Sep 16 14:47:21 2004
+++ 25-akpm/fs/ext3/fsync.c Thu Sep 16 14:47:33 2004
@@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s
J_ASSERT(ext3_journal_current_handle() == 0);
- smp_mb(); /* prepare for lockless i_state read */
- if (!(inode->i_state & I_DIRTY))
- goto out;
-
/*
* data=writeback:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -76,6 +72,10 @@ int ext3_sync_file(struct file * file, s
goto out;
}
+ smp_mb(); /* prepare for lockless i_state read */
+ if (!(inode->i_state & I_DIRTY))
+ goto out;
+
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
_
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
2004-09-16 21:50 ` Andrew Morton
@ 2004-09-17 1:27 ` Seiji Kihara
-1 siblings, 0 replies; 8+ messages in thread
From: Seiji Kihara @ 2004-09-17 1:27 UTC (permalink / raw)
To: Andrew Morton
Cc: sct, adilger, ext3-users, linux-fsdevel, linux-kernel, ospfs
Hi,
Thank you for reply, Mr. Morton. I apologize to everyone that the
mail I sent last night was not delivered to lists because of its
size (maybe).
At Thu, 16 Sep 2004 14:50:59 -0700,
Andrew Morton wrote:
(snip)
> Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
> write() time when the filesystem is mounted in data=journal mode.
>
> However your patch will disable the above optimisation for data=writeback
> and data-ordered modes as well. I don't think that's necessary?
>
> How about this?
I have not understood what is the real problem yet, but I agree
that your patch is better than mine, and confirmed that no problem
occured during the test by the "diskio" and the kernel 2.6.8.1
with your patch.
Thank you again.
Seiji
--
Seiji Kihara
Open Source Software Computing Project,
NTT Cyber Space Laboratories, Yokosuka, JAPAN
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
@ 2004-09-17 1:27 ` Seiji Kihara
0 siblings, 0 replies; 8+ messages in thread
From: Seiji Kihara @ 2004-09-17 1:27 UTC (permalink / raw)
To: Andrew Morton
Cc: ext3-users, ospfs, linux-fsdevel, sct, adilger, linux-kernel
Hi,
Thank you for reply, Mr. Morton. I apologize to everyone that the
mail I sent last night was not delivered to lists because of its
size (maybe).
At Thu, 16 Sep 2004 14:50:59 -0700,
Andrew Morton wrote:
(snip)
> Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
> write() time when the filesystem is mounted in data=journal mode.
>
> However your patch will disable the above optimisation for data=writeback
> and data-ordered modes as well. I don't think that's necessary?
>
> How about this?
I have not understood what is the real problem yet, but I agree
that your patch is better than mine, and confirmed that no problem
occured during the test by the "diskio" and the kernel 2.6.8.1
with your patch.
Thank you again.
Seiji
--
Seiji Kihara
Open Source Software Computing Project,
NTT Cyber Space Laboratories, Yokosuka, JAPAN
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
2004-09-16 21:50 ` Andrew Morton
@ 2004-09-17 8:51 ` Christopher Chan
-1 siblings, 0 replies; 8+ messages in thread
From: Christopher Chan @ 2004-09-17 8:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Seiji Kihara, ext3-users, ospfs, linux-fsdevel, sct, adilger,
linux-kernel
Andrew Morton wrote:
> Seiji Kihara <kihara.seiji@lab.ntt.co.jp> wrote:
>
>>We found that fsync and fdatasync syscalls sometimes don't sync
>>data in an ext3 file system under the following conditions.
>>
>>1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
>>2. Ext3's journalling mode is "data=journal".
>>
>>The problem was occurred since 2.6.5-bk1, which includes the patch
>>"[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
>>problem was solved by deleting the part of the patch which
>>modifies ext3_sync_file(). Maybe, i_state is not correctly set to
>>I_DIRTY when the related page cache is dirty (is it true?)
>
I have a few qmail (about the heaviest fsync using mta software around)
boxes that have their queues on ext3.
On a 2.6.7 kernel, these guys are guaranteed to crash within hours if I
used data=journal for the fs on which the qmail queues are. I say this
because I ran two of them with data=journal mode and they crashed once
or more a day. Another one which stayed with ordered had no problems
during the same period.
Going back to ordered meant that they ran stable for days (weeks now).
The only thing I could get from the logs is:
---------------------------
Aug 17 05:58:22 mta1-7 kernel: Assertion failure in
__journal_drop_transaction() at fs/jbd/checkpoint.c:613: "transaction->t
_forget == NULL"
Aug 17 05:58:22 mta1-7 kernel: ------------[ cut here ]------------
Aug 17 05:58:22 mta1-7 kernel: kernel BUG at fs/jbd/checkpoint.c:613!
Aug 17 05:58:22 mta1-7 kernel: invalid operand: 0000 [#1]
Aug 17 05:58:22 mta1-7 kernel: SMP
Aug 17 05:58:22 mta1-7 kernel: Modules linked in: nfs lockd sunrpc e1000
e100 mii usbcore
Aug 17 05:58:22 mta1-7 kernel: CPU: 0
Aug 17 05:58:22 mta1-7 kernel: EIP: 0060:[<c0193db0>] Not tainted
Aug 17 05:58:22 mta1-7 kernel: EFLAGS: 00010202 (2.6.7)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
@ 2004-09-17 8:51 ` Christopher Chan
0 siblings, 0 replies; 8+ messages in thread
From: Christopher Chan @ 2004-09-17 8:51 UTC (permalink / raw)
To: Andrew Morton
Cc: sct, linux-fsdevel, ospfs, adilger, linux-kernel, ext3-users
Andrew Morton wrote:
> Seiji Kihara <kihara.seiji@lab.ntt.co.jp> wrote:
>
>>We found that fsync and fdatasync syscalls sometimes don't sync
>>data in an ext3 file system under the following conditions.
>>
>>1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
>>2. Ext3's journalling mode is "data=journal".
>>
>>The problem was occurred since 2.6.5-bk1, which includes the patch
>>"[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
>>problem was solved by deleting the part of the patch which
>>modifies ext3_sync_file(). Maybe, i_state is not correctly set to
>>I_DIRTY when the related page cache is dirty (is it true?)
>
I have a few qmail (about the heaviest fsync using mta software around)
boxes that have their queues on ext3.
On a 2.6.7 kernel, these guys are guaranteed to crash within hours if I
used data=journal for the fs on which the qmail queues are. I say this
because I ran two of them with data=journal mode and they crashed once
or more a day. Another one which stayed with ordered had no problems
during the same period.
Going back to ordered meant that they ran stable for days (weeks now).
The only thing I could get from the logs is:
---------------------------
Aug 17 05:58:22 mta1-7 kernel: Assertion failure in
__journal_drop_transaction() at fs/jbd/checkpoint.c:613: "transaction->t
_forget == NULL"
Aug 17 05:58:22 mta1-7 kernel: ------------[ cut here ]------------
Aug 17 05:58:22 mta1-7 kernel: kernel BUG at fs/jbd/checkpoint.c:613!
Aug 17 05:58:22 mta1-7 kernel: invalid operand: 0000 [#1]
Aug 17 05:58:22 mta1-7 kernel: SMP
Aug 17 05:58:22 mta1-7 kernel: Modules linked in: nfs lockd sunrpc e1000
e100 mii usbcore
Aug 17 05:58:22 mta1-7 kernel: CPU: 0
Aug 17 05:58:22 mta1-7 kernel: EIP: 0060:[<c0193db0>] Not tainted
Aug 17 05:58:22 mta1-7 kernel: EFLAGS: 00010202 (2.6.7)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] BUG on fsync/fdatasync with Ext3 data=journal
2004-09-16 21:50 ` Andrew Morton
` (2 preceding siblings ...)
(?)
@ 2004-09-18 20:47 ` Kenichi Okuyama
-1 siblings, 0 replies; 8+ messages in thread
From: Kenichi Okuyama @ 2004-09-18 20:47 UTC (permalink / raw)
To: akpm
Cc: kihara.seiji, sct, adilger, ext3-users, linux-fsdevel,
linux-kernel, ospfs
Dear Mr. Morton, Seiji, and all,
>>>>> "AM" == Andrew Morton <akpm@osdl.org> writes:
AM> Yes, the I_DIRTY test is bogus because data pages are not marked dirty at
AM> write() time when the filesystem is mounted in data=journal mode.
AM> However your patch will disable the above optimisation for data=writeback
AM> and data-ordered modes as well. I don't think that's necessary?
I don't think Mr. Morton's code have any advantages over Seiji's patch.
Please look at lines below. Line starting with AM> + are the point
Mr. Morton have added the code ( point where you removed are bit
above, and not in the lines ).
74 if (ext3_should_journal_data(inode)) {
75 ret = ext3_force_commit(inode->i_sb);
76 goto out;
77 }
AM> + smp_mb(); /* prepare for lockless i_state read */
AM> + if (!(inode->i_state & I_DIRTY))
AM> + goto out;
AM> +
78
79 /*
80 * The VFS has written the file data. If the inode is unaltered
81 * then we need not start a commit.
82 */
83 if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
84 struct writeback_control wbc = {
85 .sync_mode = WB_SYNC_ALL,
86 .nr_to_write = 0, /* sys_fsync did this */
87 };
88 ret = sync_inode(inode, &wbc);
89 }
90 out:
91 return ret;
Now. Please note that
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
is definition of macro 'I_DIRTY'. As result, Mr. Morton's patch is
saying that:
if (!(inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGE))
goto out;
if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
}
out:
But this is equivalent to following code ( think carefully :-)
if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
}
out:
whch turns out to be what Seiji's patch was.
Hence, Mr. Morton's patch have no OPTIMIZATION over Seiji's code.
( if gcc is smart enough, Mr. Morton's code should have no effect to
binary. If not, it's overhead. ).
My worry is follows.
Basically, Seiji's patch is better. But in that case,
smp_mb() call right before accessing to inode->i_state will
disappear. Is this safe.....
I am not sure because even without Seiji's patch,
codes at line 83 did exist. And it was working... wasn't it?
In the case smp_mb() was simply not nessasary, Seiji's patch
will do everything. In case smp_mb() was nessasary, we were
lacking one right before line 83.
best regards,
----
Kenichi Okuyama
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2004-09-18 20:48 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-16 12:56 [PATCH] BUG on fsync/fdatasync with Ext3 data=journal Seiji Kihara
2004-09-16 21:50 ` Andrew Morton
2004-09-16 21:50 ` Andrew Morton
2004-09-17 1:27 ` Seiji Kihara
2004-09-17 1:27 ` Seiji Kihara
2004-09-17 8:51 ` Christopher Chan
2004-09-17 8:51 ` Christopher Chan
2004-09-18 20:47 ` Kenichi Okuyama
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.