* [PATCH] Write the inode itself in block_fsync()
@ 2006-03-09 16:22 OGAWA Hirofumi
2006-03-10 1:05 ` Sam Vilain
0 siblings, 1 reply; 6+ messages in thread
From: OGAWA Hirofumi @ 2006-03-09 16:22 UTC (permalink / raw)
To: linux-kernel
Hi,
For block device's inode, we don't write a inode's meta data
itself. But, I think we should write inode's meta data for fsync().
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
---
fs/block_dev.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff -puN fs/block_dev.c~block_fsync-write-inode fs/block_dev.c
--- linux-2.6/fs/block_dev.c~block_fsync-write-inode 2006-03-05 21:51:15.000000000 +0900
+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-05 22:28:28.000000000 +0900
@@ -19,6 +19,7 @@
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/buffer_head.h>
+#include <linux/writeback.h>
#include <linux/mpage.h>
#include <linux/mount.h>
#include <linux/uio.h>
@@ -232,7 +233,20 @@ static loff_t block_llseek(struct file *
static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
- return sync_blockdev(I_BDEV(filp->f_mapping->host));
+ struct inode *inode = dentry->d_inode;
+ int ret = 0, err;
+
+ if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0, /* sys_fsync did this */
+ };
+ ret = sync_inode(inode, &wbc);
+ }
+ err = sync_blockdev(I_BDEV(filp->f_mapping->host));
+ if (!ret)
+ ret = err;
+ return err;
}
/*
_
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: [PATCH] Write the inode itself in block_fsync() 2006-03-09 16:22 [PATCH] Write the inode itself in block_fsync() OGAWA Hirofumi @ 2006-03-10 1:05 ` Sam Vilain 2006-03-10 4:10 ` Andrew Morton 0 siblings, 1 reply; 6+ messages in thread From: Sam Vilain @ 2006-03-10 1:05 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: linux-kernel OGAWA Hirofumi wrote: >Hi, > >For block device's inode, we don't write a inode's meta data >itself. But, I think we should write inode's meta data for fsync(). > > Ouch... won't that halve performance of database transaction logs? >Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> >--- > > fs/block_dev.c | 16 +++++++++++++++- > 1 file changed, 15 insertions(+), 1 deletion(-) > >diff -puN fs/block_dev.c~block_fsync-write-inode fs/block_dev.c >--- linux-2.6/fs/block_dev.c~block_fsync-write-inode 2006-03-05 21:51:15.000000000 +0900 >+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-05 22:28:28.000000000 +0900 >@@ -19,6 +19,7 @@ > #include <linux/module.h> > #include <linux/blkpg.h> > #include <linux/buffer_head.h> >+#include <linux/writeback.h> > #include <linux/mpage.h> > #include <linux/mount.h> > #include <linux/uio.h> >@@ -232,7 +233,20 @@ static loff_t block_llseek(struct file * > > static int block_fsync(struct file *filp, struct dentry *dentry, int datasync) > { >- return sync_blockdev(I_BDEV(filp->f_mapping->host)); >+ struct inode *inode = dentry->d_inode; >+ int ret = 0, err; >+ >+ if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) { >+ struct writeback_control wbc = { >+ .sync_mode = WB_SYNC_ALL, >+ .nr_to_write = 0, /* sys_fsync did this */ >+ }; >+ ret = sync_inode(inode, &wbc); >+ } >+ err = sync_blockdev(I_BDEV(filp->f_mapping->host)); >+ if (!ret) >+ ret = err; >+ return err; > } > > /* >_ >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ > > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync() 2006-03-10 1:05 ` Sam Vilain @ 2006-03-10 4:10 ` Andrew Morton 2006-03-10 14:12 ` Bart Samwel 0 siblings, 1 reply; 6+ messages in thread From: Andrew Morton @ 2006-03-10 4:10 UTC (permalink / raw) To: Sam Vilain; +Cc: hirofumi, linux-kernel Sam Vilain <sam@vilain.net> wrote: > > OGAWA Hirofumi wrote: > > >Hi, > > > >For block device's inode, we don't write a inode's meta data > >itself. But, I think we should write inode's meta data for fsync(). > > > > > > Ouch... won't that halve performance of database transaction logs? Yes, it could well cause a lot more seeking to do atime and/or mtime writes. Which aren't terribly important, really. Unless I'm missing something, I suspect we'd be better off without this, even though it's a correctness fix :( ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync() 2006-03-10 4:10 ` Andrew Morton @ 2006-03-10 14:12 ` Bart Samwel 2006-03-10 15:18 ` OGAWA Hirofumi 0 siblings, 1 reply; 6+ messages in thread From: Bart Samwel @ 2006-03-10 14:12 UTC (permalink / raw) To: Andrew Morton; +Cc: Sam Vilain, hirofumi, linux-kernel Andrew Morton wrote: > Sam Vilain <sam@vilain.net> wrote: >> OGAWA Hirofumi wrote: >> >> >For block device's inode, we don't write a inode's meta data >> >itself. But, I think we should write inode's meta data for fsync(). >> >> Ouch... won't that halve performance of database transaction logs? > > Yes, it could well cause a lot more seeking to do atime and/or mtime > writes. Which aren't terribly important, really. > > Unless I'm missing something, I suspect we'd be better off without this, > even though it's a correctness fix :( Maybe atime/mtime aren't important, but I would be unhappy if a file size change wasn't written to disk on fsync. Anyway, shouldn't databases be using a combination of fixed-size files and fdatasync? fsync doesn't perform well by definition, and I guess the only reason databases still use it is because the kernel failed to implement the sucky part of the behaviour. A complex but perhaps viable suggestion: as atime/mtime are stored on disk in second granularity (on ext3 at least, don't know about other fss), wouldn't it somehow be possible to only regard atime/mtime changes as real changes when i_(a|c)time.tv_sec changes? This would enable fsync to write the inode once every second instead of on every fsync. The performance drop would be much less dramatic than writing the inode on every fsync, and it would at least yield correct behaviour. Cheers, Bart ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync() 2006-03-10 14:12 ` Bart Samwel @ 2006-03-10 15:18 ` OGAWA Hirofumi 2006-03-10 17:32 ` Bart Samwel 0 siblings, 1 reply; 6+ messages in thread From: OGAWA Hirofumi @ 2006-03-10 15:18 UTC (permalink / raw) To: Bart Samwel; +Cc: Andrew Morton, Sam Vilain, linux-kernel Bart Samwel <bart@samwel.tk> writes: > Andrew Morton wrote: >> Sam Vilain <sam@vilain.net> wrote: >>> OGAWA Hirofumi wrote: > >> >>> >For block device's inode, we don't write a inode's meta data >>> >itself. But, I think we should write inode's meta data for fsync(). >>> >>> Ouch... won't that halve performance of database transaction logs? >> >> Yes, it could well cause a lot more seeking to do atime and/or mtime >> writes. Which aren't terribly important, really. >> >> Unless I'm missing something, I suspect we'd be better off without this, >> even though it's a correctness fix :( > > Maybe atime/mtime aren't important, but I would be unhappy if a file > size change wasn't written to disk on fsync. Please don't worry, we should be doing a right thing for normal files already. This patch is just for block device file. > Anyway, shouldn't databases be using a combination of fixed-size files > and fdatasync? fsync doesn't perform well by definition, and I guess the > only reason databases still use it is because the kernel failed to > implement the sucky part of the behaviour. Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so, usually this patch doesn't change fdatasync() at all. Umm... however, I also can understand what akpm says.... check some databases. berkeley db 4.4: use fdatasync() if available mysql 5.0: use fdatasync() if available (innobase) use fsync() (bdb) postgresql: use fdatasync() if available sqlite: use fsync After all, I don't know. > A complex but perhaps viable suggestion: as atime/mtime are stored > on disk in second granularity (on ext3 at least, don't know about > other fss), wouldn't it somehow be possible to only regard > atime/mtime changes as real changes when i_(a|c)time.tv_sec changes? > This would enable fsync to write the inode once every second instead > of on every fsync. The performance drop would be much less dramatic > than writing the inode on every fsync, and it would at least yield > correct behaviour. Yes, and we are already doing it. Thanks. -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Write the inode itself in block_fsync() 2006-03-10 15:18 ` OGAWA Hirofumi @ 2006-03-10 17:32 ` Bart Samwel 0 siblings, 0 replies; 6+ messages in thread From: Bart Samwel @ 2006-03-10 17:32 UTC (permalink / raw) To: OGAWA Hirofumi; +Cc: Andrew Morton, Sam Vilain, linux-kernel OGAWA Hirofumi wrote: > Bart Samwel <bart@samwel.tk> writes: > >> Andrew Morton wrote: >>> Sam Vilain <sam@vilain.net> wrote: >>>> OGAWA Hirofumi wrote: >> >> >>>> Ouch... won't that halve performance of database transaction logs? >>> Yes, it could well cause a lot more seeking to do atime and/or mtime >>> writes. Which aren't terribly important, really. >>> >>> Unless I'm missing something, I suspect we'd be better off without this, >>> even though it's a correctness fix :( >> Maybe atime/mtime aren't important, but I would be unhappy if a file >> size change wasn't written to disk on fsync. > > Please don't worry, we should be doing a right thing for normal files > already. This patch is just for block device file. Ahhh, I missed that. I interpreted: >For block device's inode, we don't write a inode's meta data >itself. But, I think we should write inode's meta data for fsync(). as "for block devices we don't, for normal files, yes", but apparently that's not what you meant. :-) >> Anyway, shouldn't databases be using a combination of fixed-size files >> and fdatasync? fsync doesn't perform well by definition, and I guess the >> only reason databases still use it is because the kernel failed to >> implement the sucky part of the behaviour. > > Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so, > usually this patch doesn't change fdatasync() at all. > > Umm... however, I also can understand what akpm says.... check some databases. > > berkeley db 4.4: use fdatasync() if available > mysql 5.0: use fdatasync() if available (innobase) > use fsync() (bdb) > postgresql: use fdatasync() if available > sqlite: use fsync Nice piece of info. Apparently all of the "large" database engines can use fdatasync, only the smaller ones (bdb, sqlite) don't. I've done some extra research: * From a quick look at the docs it seems to me that bdb can't be configured to put its transaction log directly on a block device, so bdb won't be affected. * SQLite definitely can't write logs to a block device, the docs explicitly say that the transaction log is a regular file with a specific name, so we can write off sqlite as well. (It does seem to use fdatasync btw, since version 3.2.6, see http://www.sqlite.org/changes.html.) If we've missed none, that leaves only proprietary databases at risk. But I would be genuinely surprised if a database like Oracle would use fsync. If we assume that Oracle et al. are not a problem, the risks of this patch are very low. Cheers, Bart ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-03-10 17:33 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-03-09 16:22 [PATCH] Write the inode itself in block_fsync() OGAWA Hirofumi 2006-03-10 1:05 ` Sam Vilain 2006-03-10 4:10 ` Andrew Morton 2006-03-10 14:12 ` Bart Samwel 2006-03-10 15:18 ` OGAWA Hirofumi 2006-03-10 17:32 ` Bart Samwel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox