* [PATCH] Write the inode itself in block_fsync()
@ 2006-03-09 16:22 OGAWA Hirofumi
2006-03-10 1:05 ` Sam Vilain
0 siblings, 1 reply; 6+ messages in thread
From: OGAWA Hirofumi @ 2006-03-09 16:22 UTC (permalink / raw)
To: linux-kernel
Hi,
For block device's inode, we don't write a inode's meta data
itself. But, I think we should write inode's meta data for fsync().
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
---
fs/block_dev.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff -puN fs/block_dev.c~block_fsync-write-inode fs/block_dev.c
--- linux-2.6/fs/block_dev.c~block_fsync-write-inode 2006-03-05 21:51:15.000000000 +0900
+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-05 22:28:28.000000000 +0900
@@ -19,6 +19,7 @@
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/buffer_head.h>
+#include <linux/writeback.h>
#include <linux/mpage.h>
#include <linux/mount.h>
#include <linux/uio.h>
@@ -232,7 +233,20 @@ static loff_t block_llseek(struct file *
static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
- return sync_blockdev(I_BDEV(filp->f_mapping->host));
+ struct inode *inode = dentry->d_inode;
+ int ret = 0, err;
+
+ if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0, /* sys_fsync did this */
+ };
+ ret = sync_inode(inode, &wbc);
+ }
+ err = sync_blockdev(I_BDEV(filp->f_mapping->host));
+ if (!ret)
+ ret = err;
+ return err;
}
/*
_
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync()
2006-03-09 16:22 [PATCH] Write the inode itself in block_fsync() OGAWA Hirofumi
@ 2006-03-10 1:05 ` Sam Vilain
2006-03-10 4:10 ` Andrew Morton
0 siblings, 1 reply; 6+ messages in thread
From: Sam Vilain @ 2006-03-10 1:05 UTC (permalink / raw)
To: OGAWA Hirofumi; +Cc: linux-kernel
OGAWA Hirofumi wrote:
>Hi,
>
>For block device's inode, we don't write a inode's meta data
>itself. But, I think we should write inode's meta data for fsync().
>
>
Ouch... won't that halve performance of database transaction logs?
>Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
>---
>
> fs/block_dev.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
>diff -puN fs/block_dev.c~block_fsync-write-inode fs/block_dev.c
>--- linux-2.6/fs/block_dev.c~block_fsync-write-inode 2006-03-05 21:51:15.000000000 +0900
>+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-05 22:28:28.000000000 +0900
>@@ -19,6 +19,7 @@
> #include <linux/module.h>
> #include <linux/blkpg.h>
> #include <linux/buffer_head.h>
>+#include <linux/writeback.h>
> #include <linux/mpage.h>
> #include <linux/mount.h>
> #include <linux/uio.h>
>@@ -232,7 +233,20 @@ static loff_t block_llseek(struct file *
>
> static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
> {
>- return sync_blockdev(I_BDEV(filp->f_mapping->host));
>+ struct inode *inode = dentry->d_inode;
>+ int ret = 0, err;
>+
>+ if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
>+ struct writeback_control wbc = {
>+ .sync_mode = WB_SYNC_ALL,
>+ .nr_to_write = 0, /* sys_fsync did this */
>+ };
>+ ret = sync_inode(inode, &wbc);
>+ }
>+ err = sync_blockdev(I_BDEV(filp->f_mapping->host));
>+ if (!ret)
>+ ret = err;
>+ return err;
> }
>
> /*
>_
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync()
2006-03-10 1:05 ` Sam Vilain
@ 2006-03-10 4:10 ` Andrew Morton
2006-03-10 14:12 ` Bart Samwel
0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2006-03-10 4:10 UTC (permalink / raw)
To: Sam Vilain; +Cc: hirofumi, linux-kernel
Sam Vilain <sam@vilain.net> wrote:
>
> OGAWA Hirofumi wrote:
>
> >Hi,
> >
> >For block device's inode, we don't write a inode's meta data
> >itself. But, I think we should write inode's meta data for fsync().
> >
> >
>
> Ouch... won't that halve performance of database transaction logs?
Yes, it could well cause a lot more seeking to do atime and/or mtime
writes. Which aren't terribly important, really.
Unless I'm missing something, I suspect we'd be better off without this,
even though it's a correctness fix :(
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync()
2006-03-10 4:10 ` Andrew Morton
@ 2006-03-10 14:12 ` Bart Samwel
2006-03-10 15:18 ` OGAWA Hirofumi
0 siblings, 1 reply; 6+ messages in thread
From: Bart Samwel @ 2006-03-10 14:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Sam Vilain, hirofumi, linux-kernel
Andrew Morton wrote:
> Sam Vilain <sam@vilain.net> wrote:
>> OGAWA Hirofumi wrote:
>>
>> >For block device's inode, we don't write a inode's meta data
>> >itself. But, I think we should write inode's meta data for fsync().
>>
>> Ouch... won't that halve performance of database transaction logs?
>
> Yes, it could well cause a lot more seeking to do atime and/or mtime
> writes. Which aren't terribly important, really.
>
> Unless I'm missing something, I suspect we'd be better off without this,
> even though it's a correctness fix :(
Maybe atime/mtime aren't important, but I would be unhappy if a file
size change wasn't written to disk on fsync.
Anyway, shouldn't databases be using a combination of fixed-size files
and fdatasync? fsync doesn't perform well by definition, and I guess the
only reason databases still use it is because the kernel failed to
implement the sucky part of the behaviour.
A complex but perhaps viable suggestion: as atime/mtime are stored on
disk in second granularity (on ext3 at least, don't know about other
fss), wouldn't it somehow be possible to only regard atime/mtime changes
as real changes when i_(a|c)time.tv_sec changes? This would enable fsync
to write the inode once every second instead of on every fsync. The
performance drop would be much less dramatic than writing the inode on
every fsync, and it would at least yield correct behaviour.
Cheers,
Bart
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync()
2006-03-10 14:12 ` Bart Samwel
@ 2006-03-10 15:18 ` OGAWA Hirofumi
2006-03-10 17:32 ` Bart Samwel
0 siblings, 1 reply; 6+ messages in thread
From: OGAWA Hirofumi @ 2006-03-10 15:18 UTC (permalink / raw)
To: Bart Samwel; +Cc: Andrew Morton, Sam Vilain, linux-kernel
Bart Samwel <bart@samwel.tk> writes:
> Andrew Morton wrote:
>> Sam Vilain <sam@vilain.net> wrote:
>>> OGAWA Hirofumi wrote:
> >>
>>> >For block device's inode, we don't write a inode's meta data
>>> >itself. But, I think we should write inode's meta data for fsync().
>>>
>>> Ouch... won't that halve performance of database transaction logs?
>>
>> Yes, it could well cause a lot more seeking to do atime and/or mtime
>> writes. Which aren't terribly important, really.
>>
>> Unless I'm missing something, I suspect we'd be better off without this,
>> even though it's a correctness fix :(
>
> Maybe atime/mtime aren't important, but I would be unhappy if a file
> size change wasn't written to disk on fsync.
Please don't worry, we should be doing a right thing for normal files
already. This patch is just for block device file.
> Anyway, shouldn't databases be using a combination of fixed-size files
> and fdatasync? fsync doesn't perform well by definition, and I guess the
> only reason databases still use it is because the kernel failed to
> implement the sucky part of the behaviour.
Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so,
usually this patch doesn't change fdatasync() at all.
Umm... however, I also can understand what akpm says.... check some databases.
berkeley db 4.4: use fdatasync() if available
mysql 5.0: use fdatasync() if available (innobase)
use fsync() (bdb)
postgresql: use fdatasync() if available
sqlite: use fsync
After all, I don't know.
> A complex but perhaps viable suggestion: as atime/mtime are stored
> on disk in second granularity (on ext3 at least, don't know about
> other fss), wouldn't it somehow be possible to only regard
> atime/mtime changes as real changes when i_(a|c)time.tv_sec changes?
> This would enable fsync to write the inode once every second instead
> of on every fsync. The performance drop would be much less dramatic
> than writing the inode on every fsync, and it would at least yield
> correct behaviour.
Yes, and we are already doing it.
Thanks.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync()
2006-03-10 15:18 ` OGAWA Hirofumi
@ 2006-03-10 17:32 ` Bart Samwel
0 siblings, 0 replies; 6+ messages in thread
From: Bart Samwel @ 2006-03-10 17:32 UTC (permalink / raw)
To: OGAWA Hirofumi; +Cc: Andrew Morton, Sam Vilain, linux-kernel
OGAWA Hirofumi wrote:
> Bart Samwel <bart@samwel.tk> writes:
>
>> Andrew Morton wrote:
>>> Sam Vilain <sam@vilain.net> wrote:
>>>> OGAWA Hirofumi wrote:
>> >>
>>>> Ouch... won't that halve performance of database transaction logs?
>>> Yes, it could well cause a lot more seeking to do atime and/or mtime
>>> writes. Which aren't terribly important, really.
>>>
>>> Unless I'm missing something, I suspect we'd be better off without this,
>>> even though it's a correctness fix :(
>> Maybe atime/mtime aren't important, but I would be unhappy if a file
>> size change wasn't written to disk on fsync.
>
> Please don't worry, we should be doing a right thing for normal files
> already. This patch is just for block device file.
Ahhh, I missed that. I interpreted:
>For block device's inode, we don't write a inode's meta data
>itself. But, I think we should write inode's meta data for fsync().
as "for block devices we don't, for normal files, yes", but apparently
that's not what you meant. :-)
>> Anyway, shouldn't databases be using a combination of fixed-size files
>> and fdatasync? fsync doesn't perform well by definition, and I guess the
>> only reason databases still use it is because the kernel failed to
>> implement the sucky part of the behaviour.
>
> Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so,
> usually this patch doesn't change fdatasync() at all.
>
> Umm... however, I also can understand what akpm says.... check some databases.
>
> berkeley db 4.4: use fdatasync() if available
> mysql 5.0: use fdatasync() if available (innobase)
> use fsync() (bdb)
> postgresql: use fdatasync() if available
> sqlite: use fsync
Nice piece of info. Apparently all of the "large" database engines can
use fdatasync, only the smaller ones (bdb, sqlite) don't. I've done some
extra research:
* From a quick look at the docs it seems to me that bdb can't be
configured to put its transaction log directly on a block device, so bdb
won't be affected.
* SQLite definitely can't write logs to a block device, the docs
explicitly say that the transaction log is a regular file with a
specific name, so we can write off sqlite as well. (It does seem to use
fdatasync btw, since version 3.2.6, see http://www.sqlite.org/changes.html.)
If we've missed none, that leaves only proprietary databases at risk.
But I would be genuinely surprised if a database like Oracle would use
fsync. If we assume that Oracle et al. are not a problem, the risks of
this patch are very low.
Cheers,
Bart
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-03-10 17:33 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-09 16:22 [PATCH] Write the inode itself in block_fsync() OGAWA Hirofumi
2006-03-10 1:05 ` Sam Vilain
2006-03-10 4:10 ` Andrew Morton
2006-03-10 14:12 ` Bart Samwel
2006-03-10 15:18 ` OGAWA Hirofumi
2006-03-10 17:32 ` Bart Samwel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox