linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] ext3[34] barrier changes
@ 2008-05-16 14:37 Eric Sandeen
  2008-05-16 14:37 ` [PATCH 1/4] ext3: enable barriers by default Eric Sandeen
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Eric Sandeen @ 2008-05-16 14:37 UTC (permalink / raw)
  To: sandeen


A collection of patches to make ext3 & 4 use barriers by
default, and to call blkdev_issue_flush on fsync if they
are enabled.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] ext3: enable barriers by default
  2008-05-16 14:37 [PATCH 0/4] ext3[34] barrier changes Eric Sandeen
@ 2008-05-16 14:37 ` Eric Sandeen
  2008-05-16 14:37 ` [PATCH 2/4] ext3: call blkdev_issue_flush on fsync Eric Sandeen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2008-05-16 14:37 UTC (permalink / raw)
  To: sandeen; +Cc: Eric Sandeen

I can't think of any valid reason for ext3 to not use barriers when
they are available;  I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.

An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.

SuSE has carried such a patch for quite some time now.

Also document the mount option while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
---
 Documentation/filesystems/ext3.txt |   12 ++++++++++--
 fs/ext3/super.c                    |   11 +++++++++--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index b45f3c1..daab1f5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -52,8 +52,16 @@ commit=nrsec	(*)	Ext3 can be told to sync all its data and metadata
 			Setting it to very large values will improve
 			performance.
 
-barrier=1		This enables/disables barriers.  barrier=0 disables
-			it, barrier=1 enables it.
+barrier=<0|1(*)>	This enables/disables the use of write barriers in
+			the jbd code.  barrier=0 disables, barrier=1 enables.
+			This also requires an IO stack which can support
+			barriers, and if jbd gets an error on a barrier
+			write, it will disable again with a warning.
+			Write barriers enforce proper on-disk ordering
+			of journal commits, making volatile disk write caches
+			safe to use, at some performance penalty.  If
+			your disks are battery-backed in one way or another,
+			disabling barriers may safely improve performance.
 
 orlov		(*)	This enables the new Orlov block allocator. It is
 			enabled by default.
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index fe3119a..9c30dc7 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -555,6 +555,7 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	struct super_block *sb = vfs->mnt_sb;
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	struct ext3_super_block *es = sbi->s_es;
+	journal_t *journal = sbi->s_journal;
 	unsigned long def_mount_opts;
 
 	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
@@ -613,8 +614,13 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_printf(seq, ",commit=%u",
 			   (unsigned) (sbi->s_commit_interval / HZ));
 	}
-	if (test_opt(sb, BARRIER))
-		seq_puts(seq, ",barrier=1");
+	/*
+ 	 * jbd inherits the barrier flag from ext3, and may actually
+ 	 * turn off barriers if a write fails, so it's the real test.
+ 	 */
+	if (!test_opt(sb, BARRIER) ||
+	    (journal && !(journal->j_flags & JFS_BARRIER)))
+		seq_puts(seq, ",barrier=0");
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
 
@@ -1589,6 +1595,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
 
 	set_opt(sbi->s_mount_opt, RESERVATION);
+	set_opt(sbi->s_mount_opt, BARRIER);
 
 	if (!parse_options ((char *) data, sb, &journal_inum, &journal_devnum,
 			    NULL, 0))
-- 
1.5.3.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/4] ext3: call blkdev_issue_flush on fsync
  2008-05-16 14:37 [PATCH 0/4] ext3[34] barrier changes Eric Sandeen
  2008-05-16 14:37 ` [PATCH 1/4] ext3: enable barriers by default Eric Sandeen
@ 2008-05-16 14:37 ` Eric Sandeen
  2008-05-16 14:37 ` [PATCH 3/4] ext4: enable barriers by default Eric Sandeen
  2008-05-16 14:37 ` [PATCH 4/4] ext4: call blkdev_issue_flush on fsync Eric Sandeen
  3 siblings, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2008-05-16 14:37 UTC (permalink / raw)
  To: sandeen; +Cc: Eric Sandeen

To ensure that bits are truly on-disk after an fsync,
ext3 should call blkdev_issue_flush if barriers are supported.

Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
 fs/ext3/fsync.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..f6167ec 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -27,6 +27,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/jbd.h>
+#include <linux/blkdev.h>
 #include <linux/ext3_fs.h>
 #include <linux/ext3_jbd.h>
 
@@ -45,6 +46,7 @@
 int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
 	int ret = 0;
 
 	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+
+		if (journal && (journal->j_flags & JFS_BARRIER))
+			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	}
 out:
 	return ret;
-- 
1.5.3.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/4] ext4: enable barriers by default
  2008-05-16 14:37 [PATCH 0/4] ext3[34] barrier changes Eric Sandeen
  2008-05-16 14:37 ` [PATCH 1/4] ext3: enable barriers by default Eric Sandeen
  2008-05-16 14:37 ` [PATCH 2/4] ext3: call blkdev_issue_flush on fsync Eric Sandeen
@ 2008-05-16 14:37 ` Eric Sandeen
  2008-05-16 14:37 ` [PATCH 4/4] ext4: call blkdev_issue_flush on fsync Eric Sandeen
  3 siblings, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2008-05-16 14:37 UTC (permalink / raw)
  To: sandeen; +Cc: Eric Sandeen

I can't think of any valid reason for ext4 to not use barriers when
they are available;  I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.

An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.

SuSE has carried such a patch for ext3 for quite some time now.

Also document the mount option while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
 Documentation/filesystems/ext4.txt |   12 ++++++++++--
 fs/ext4/super.c                    |   11 +++++++++--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 560f88d..0c5086d 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -139,8 +139,16 @@ commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
 			Setting it to very large values will improve
 			performance.
 
-barrier=1		This enables/disables barriers.  barrier=0 disables
-			it, barrier=1 enables it.
+barrier=<0|1(*)>	This enables/disables the use of write barriers in
+			the jbd code.  barrier=0 disables, barrier=1 enables.
+			This also requires an IO stack which can support
+			barriers, and if jbd gets an error on a barrier
+			write, it will disable again with a warning.
+			Write barriers enforce proper on-disk ordering
+			of journal commits, making volatile disk write caches
+			safe to use, at some performance penalty.  If
+			your disks are battery-backed in one way or another,
+			disabling barriers may safely improve performance.
 
 orlov		(*)	This enables the new Orlov block allocator. It is
 			enabled by default.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 52dd067..77b036a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -671,6 +671,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	unsigned long def_mount_opts;
 	struct super_block *sb = vfs->mnt_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	journal_t *journal = sbi->s_journal;
 	struct ext4_super_block *es = sbi->s_es;
 
 	def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
@@ -729,8 +730,13 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_printf(seq, ",commit=%u",
 			   (unsigned) (sbi->s_commit_interval / HZ));
 	}
-	if (test_opt(sb, BARRIER))
-		seq_puts(seq, ",barrier=1");
+	/*
+	 * jbd2 inherits the barrier flag from ext4, and may actually
+	 * turn off barriers if a write fails, so it's the real test.
+	 */
+	if (!test_opt(sb, BARRIER) ||
+	    (journal && !(journal->j_flags & JBD2_BARRIER)))
+		seq_puts(seq, ",barrier=0");
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
 	if (!test_opt(sb, EXTENTS))
@@ -1890,6 +1896,7 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
 
 	set_opt(sbi->s_mount_opt, RESERVATION);
+	set_opt(sbi->s_mount_opt, BARRIER);
 
 	/*
 	 * turn on extents feature by default in ext4 filesystem
-- 
1.5.3.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-16 14:37 [PATCH 0/4] ext3[34] barrier changes Eric Sandeen
                   ` (2 preceding siblings ...)
  2008-05-16 14:37 ` [PATCH 3/4] ext4: enable barriers by default Eric Sandeen
@ 2008-05-16 14:37 ` Eric Sandeen
  3 siblings, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2008-05-16 14:37 UTC (permalink / raw)
  To: sandeen; +Cc: Eric Sandeen

To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.

Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
 fs/ext4/fsync.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 1c8ba48..a45c373 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -27,6 +27,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/jbd2.h>
+#include <linux/blkdev.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 
@@ -45,6 +46,7 @@
 int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 	int ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -85,6 +87,8 @@ int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+		if (journal && (journal->j_flags & JBD2_BARRIER))
+			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	}
 out:
 	return ret;
-- 
1.5.3.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-16 19:02 [PATCH 0/4] (RESEND) ext3[34] barrier changes Eric Sandeen
@ 2008-05-16 19:09 ` Eric Sandeen
  2008-05-20  2:34   ` Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2008-05-16 19:09 UTC (permalink / raw)
  To: ext4 development, linux-kernel Mailing List, linux-fsdevel
  Cc: Andrew Morton, Jamie Lokier

To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.

Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
 fs/ext4/fsync.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 1c8ba48..a45c373 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -27,6 +27,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/jbd2.h>
+#include <linux/blkdev.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 
@@ -45,6 +46,7 @@
 int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 	int ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -85,6 +87,8 @@ int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+		if (journal && (journal->j_flags & JBD2_BARRIER))
+			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	}
 out:
 	return ret;
-- 1.5.3.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-16 19:09 ` [PATCH 4/4] ext4: call blkdev_issue_flush on fsync Eric Sandeen
@ 2008-05-20  2:34   ` Theodore Tso
  2008-05-20 15:43     ` Jamie Lokier
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2008-05-20  2:34 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: ext4 development, linux-kernel Mailing List, linux-fsdevel,
	Andrew Morton, Jamie Lokier

On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
> To ensure that bits are truly on-disk after an fsync,
> we should call blkdev_issue_flush if barriers are supported.

This patch isn't necessary, and in fact will cause a double flush.
When you call fsync(), it calls ext4_force_commit(), and we do a the
equivalent of a blkdev_issue_flush() today (which is what happenes
when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
set_ordered_mode(bh) ends up causing.

						- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-20  2:34   ` Theodore Tso
@ 2008-05-20 15:43     ` Jamie Lokier
  2008-05-20 15:52       ` Eric Sandeen
  2008-05-20 19:54       ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread
From: Jamie Lokier @ 2008-05-20 15:43 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, ext4 development,
	linux-kernel Mailing List, linux-fsdevel

Theodore Tso wrote:
> On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
> > To ensure that bits are truly on-disk after an fsync,
> > we should call blkdev_issue_flush if barriers are supported.
> 
> This patch isn't necessary, and in fact will cause a double flush.
> When you call fsync(), it calls ext4_force_commit(), and we do a the
> equivalent of a blkdev_issue_flush() today (which is what happenes
> when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
> set_ordered_mode(bh) ends up causing.

ISTR fsync() on ext3 did not always force a commit, if in-place data
writes did not change any metadata.  Has this been fixed in ext4 but
not ext3 then?

Does WRITE_BARRIER always cause a flush?  It does not have to
according to Documentation/block/barrier.txt.  There are caveats about
tagged queuing "not yet implemented" in the text, but can we rely on
that?  The documentation is older than the current implementation;
those caveats might no longer apply.

-- Jamie

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-20 15:43     ` Jamie Lokier
@ 2008-05-20 15:52       ` Eric Sandeen
  2008-05-20 20:14         ` Jens Axboe
  2008-05-20 19:54       ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2008-05-20 15:52 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, ext4 development,
	linux-kernel Mailing List, linux-fsdevel

Jamie Lokier wrote:
> Theodore Tso wrote:
>> On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
>>> To ensure that bits are truly on-disk after an fsync,
>>> we should call blkdev_issue_flush if barriers are supported.
>> This patch isn't necessary, and in fact will cause a double flush.
>> When you call fsync(), it calls ext4_force_commit(), and we do a the
>> equivalent of a blkdev_issue_flush() today (which is what happenes
>> when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
>> set_ordered_mode(bh) ends up causing.
> 
> ISTR fsync() on ext3 did not always force a commit, if in-place data
> writes did not change any metadata.

I think that might still be true, but I'm still looking through it (in
the background...)

I tried blktrace to see what was going on but I'm not sure what an "NB"
in the RWBS field means, anyone know?

-Eric


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-20 15:43     ` Jamie Lokier
  2008-05-20 15:52       ` Eric Sandeen
@ 2008-05-20 19:54       ` Jens Axboe
  2008-05-20 22:02         ` Jamie Lokier
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2008-05-20 19:54 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, ext4 development,
	linux-kernel Mailing List, linux-fsdevel

On Tue, May 20 2008, Jamie Lokier wrote:
> Does WRITE_BARRIER always cause a flush?  It does not have to
> according to Documentation/block/barrier.txt.  There are caveats about
> tagged queuing "not yet implemented" in the text, but can we rely on
> that?  The documentation is older than the current implementation;
> those caveats might no longer apply.

It does, if you use ordered tags then that assumes write through
caching (or ordered tag + drain + flush after completion).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-20 15:52       ` Eric Sandeen
@ 2008-05-20 20:14         ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2008-05-20 20:14 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Tso, ext4 development, linux-kernel Mailing List,
	linux-fsdevel, Andrew Morton

On Tue, May 20 2008, Eric Sandeen wrote:
> Jamie Lokier wrote:
> > Theodore Tso wrote:
> >> On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
> >>> To ensure that bits are truly on-disk after an fsync,
> >>> we should call blkdev_issue_flush if barriers are supported.
> >> This patch isn't necessary, and in fact will cause a double flush.
> >> When you call fsync(), it calls ext4_force_commit(), and we do a the
> >> equivalent of a blkdev_issue_flush() today (which is what happenes
> >> when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
> >> set_ordered_mode(bh) ends up causing.
> > 
> > ISTR fsync() on ext3 did not always force a commit, if in-place data
> > writes did not change any metadata.
> 
> I think that might still be true, but I'm still looking through it (in
> the background...)
> 
> I tried blktrace to see what was going on but I'm not sure what an "NB"
> in the RWBS field means, anyone know?

Eric already knows this now, but for the benefit of anyone else that may
be curious - it's an empty (data-less) barrier. 'N' is basically 'no
data' (eg not a read nor a write) and 'B' is barrier.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-20 19:54       ` Jens Axboe
@ 2008-05-20 22:02         ` Jamie Lokier
  2008-05-21  7:30           ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Jamie Lokier @ 2008-05-20 22:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Theodore Tso, Eric Sandeen, ext4 development,
	linux-kernel Mailing List, linux-fsdevel, Andrew Morton

Jens Axboe wrote:
> On Tue, May 20 2008, Jamie Lokier wrote:
> > Does WRITE_BARRIER always cause a flush?  It does not have to
> > according to Documentation/block/barrier.txt.  There are caveats about
> > tagged queuing "not yet implemented" in the text, but can we rely on
> > that?  The documentation is older than the current implementation;
> > those caveats might no longer apply.
> 
> It does, if you use ordered tags then that assumes write through
> caching (or ordered tag + drain + flush after completion).

Oh.  That's really unclear from the opening paragraph of barrier.txt,
which _defines_ what I/O barriers are for, and does not mention flushing:

   I/O barrier requests are used to guarantee ordering around the barrier
   requests.  Unless you're crazy enough to use disk drives for
   implementing synchronization constructs (wow, sounds interesting...),
   the ordering is meaningful only for write requests for things like
   journal checkpoints.  All requests queued before a barrier request
   must be finished (made it to the physical medium) before the barrier
   request is started, and all requests queued after the barrier request
   must be started only after the barrier request is finished (again,
   made it to the physical medium).

So I assumed the reason flush is talked about later was only because
most devices don't offer an alternative.

Later in barrier.txt, in the section about flushing, it says:

   the reason you use I/O barriers is mainly to protect filesystem
   integrity when power failure or some other events abruptly stop the
   drive from operating and possibly make the drive lose data in its
   cache.  So, I/O barriers need to guarantee that requests actually
   get written to non-volatile medium in order

Woa!  Nothing about flushing being important, just "to guarantee
... in order".

Thus flushing looks like an implementation detail - all we could do at
the time.  It does not seem to be the _point_ of WRITE_BARRIER
(according to the text), which is to ensure journalling integrity by
ordering writes.

Really, the main reason I was confused was that I imagine some
SCSI-like devices letting you do partially ordered writes to
write-back cache - with their cache preserving ordering constraints
the same way as some CPU or database caches.  (Perhaps I've been
thinking about CPUs too long!)

Anyway, moving on....  Let's admit I am wrong about that :-)

And get back to my idea.  Ignoring actual disks for a moment ( ;-) ),
there are some I/O scheduling optimisations possible in the kernel
itself by distinguishing between barriers (for journalling) and
flushes (for fsync).

Basically, barriers can be moved around between ordered writes,
including postponing indefinitely (i.e. a "needs barrier" flag).
Unordered writes (in-place data?) can be reordered somewhat around
barriers and other writes.  Nobody should wait for a barrier to complete.

On the other hand, flushes must be issued soon, and fdatasync/fsync
wait for the result.  Reordering possibilities are different: all
writes can be moved before a flush (unless it's a barrier too), and
in-place data writes cannot be moved after one.

Both barriers and flushes can be merged if there are no intervening
writes except unordered writes.  Double flushing, e.g. calling fsync
twice, or calling blkdev_issue_flush just to be sure somewhere,
shouldn't have any overhead.

The request elevator seems a good place to apply those heuristics.
I've written earlier about how to remove some barriers from ext3/4
journalling.  This stuff seems to suggest even more I/O scheduling
optimisations with tree-like journalling (as in BTRFS?).

-- Jamie

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] ext4: call blkdev_issue_flush on fsync
  2008-05-20 22:02         ` Jamie Lokier
@ 2008-05-21  7:30           ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2008-05-21  7:30 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Theodore Tso, Eric Sandeen, ext4 development,
	linux-kernel Mailing List, linux-fsdevel, Andrew Morton

On Tue, May 20 2008, Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 20 2008, Jamie Lokier wrote:
> > > Does WRITE_BARRIER always cause a flush?  It does not have to
> > > according to Documentation/block/barrier.txt.  There are caveats about
> > > tagged queuing "not yet implemented" in the text, but can we rely on
> > > that?  The documentation is older than the current implementation;
> > > those caveats might no longer apply.
> > 
> > It does, if you use ordered tags then that assumes write through
> > caching (or ordered tag + drain + flush after completion).
> 
> Oh.  That's really unclear from the opening paragraph of barrier.txt,
> which _defines_ what I/O barriers are for, and does not mention flushing:
> 
>    I/O barrier requests are used to guarantee ordering around the barrier
>    requests.  Unless you're crazy enough to use disk drives for
>    implementing synchronization constructs (wow, sounds interesting...),
>    the ordering is meaningful only for write requests for things like
>    journal checkpoints.  All requests queued before a barrier request
>    must be finished (made it to the physical medium) before the barrier
>    request is started, and all requests queued after the barrier request
>    must be started only after the barrier request is finished (again,
>    made it to the physical medium).
> 
> So I assumed the reason flush is talked about later was only because
> most devices don't offer an alternative.

It may not mention flushing explicitly, but it does not have to since
the flushing is one way to implement what the above describes. Note how
it says that request must have made it to physical medium before
allowing others to continue? That means you have to either write through
or flush caches, otherwise you cannot make that guarentee.

> Later in barrier.txt, in the section about flushing, it says:
> 
>    the reason you use I/O barriers is mainly to protect filesystem
>    integrity when power failure or some other events abruptly stop the
>    drive from operating and possibly make the drive lose data in its
>    cache.  So, I/O barriers need to guarantee that requests actually
>    get written to non-volatile medium in order
> 
> Woa!  Nothing about flushing being important, just "to guarantee
> ... in order".
> 
> Thus flushing looks like an implementation detail - all we could do at
> the time.  It does not seem to be the _point_ of WRITE_BARRIER
> (according to the text), which is to ensure journalling integrity by
> ordering writes.

Yeah, that is precisely what it is and why it does not mention flushing
explicitly!

> Really, the main reason I was confused was that I imagine some
> SCSI-like devices letting you do partially ordered writes to
> write-back cache - with their cache preserving ordering constraints
> the same way as some CPU or database caches.  (Perhaps I've been
> thinking about CPUs too long!)

Right, that is what ordered tags give you. But if you have write back
caching enabled, then you get a completion event before the data is
actually on disk. Perhaps that is OK for some cases, perhaps not. The
Linux barrier implementation has always guarenteed that the data is
actually on platter before considering a barrier write done, as
described in the text you quote.

> Anyway, moving on....  Let's admit I am wrong about that :-)
> 
> And get back to my idea.  Ignoring actual disks for a moment ( ;-) ),
> there are some I/O scheduling optimisations possible in the kernel
> itself by distinguishing between barriers (for journalling) and
> flushes (for fsync).
> 
> Basically, barriers can be moved around between ordered writes,
> including postponing indefinitely (i.e. a "needs barrier" flag).
> Unordered writes (in-place data?) can be reordered somewhat around
> barriers and other writes.  Nobody should wait for a barrier to complete.
> 
> On the other hand, flushes must be issued soon, and fdatasync/fsync
> wait for the result.  Reordering possibilities are different: all
> writes can be moved before a flush (unless it's a barrier too), and
> in-place data writes cannot be moved after one.
> 
> Both barriers and flushes can be merged if there are no intervening
> writes except unordered writes.  Double flushing, e.g. calling fsync
> twice, or calling blkdev_issue_flush just to be sure somewhere,
> shouldn't have any overhead.
> 
> The request elevator seems a good place to apply those heuristics.
> I've written earlier about how to remove some barriers from ext3/4
> journalling.  This stuff seems to suggest even more I/O scheduling
> optimisations with tree-like journalling (as in BTRFS?).

There are some good ideas in there, I'll punt that to the fs people.
Ignoring actual disk drives makes things easier :-). While SCSI has
ordered IO with write back caching, SATA does not. You basically have to
drain and flush there, or use one of the other variants for getting data
in disk - see blkdev.h:

        /*
         * Hardbarrier is supported with one of the following methods.
         *
         * NONE         : hardbarrier unsupported
         * DRAIN        : ordering by draining is enough
         * DRAIN_FLUSH  : ordering by draining w/ pre and post flushes
         * DRAIN_FUA    : ordering by draining w/ pre flush and FUA
         * write
         * TAG          : ordering by tag is enough
         * TAG_FLUSH    : ordering by tag w/ pre and post flushes
         * TAG_FUA      : ordering by tag w/ pre flush and FUA write

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-05-21  7:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-16 14:37 [PATCH 0/4] ext3[34] barrier changes Eric Sandeen
2008-05-16 14:37 ` [PATCH 1/4] ext3: enable barriers by default Eric Sandeen
2008-05-16 14:37 ` [PATCH 2/4] ext3: call blkdev_issue_flush on fsync Eric Sandeen
2008-05-16 14:37 ` [PATCH 3/4] ext4: enable barriers by default Eric Sandeen
2008-05-16 14:37 ` [PATCH 4/4] ext4: call blkdev_issue_flush on fsync Eric Sandeen
  -- strict thread matches above, loose matches on Subject: below --
2008-05-16 19:02 [PATCH 0/4] (RESEND) ext3[34] barrier changes Eric Sandeen
2008-05-16 19:09 ` [PATCH 4/4] ext4: call blkdev_issue_flush on fsync Eric Sandeen
2008-05-20  2:34   ` Theodore Tso
2008-05-20 15:43     ` Jamie Lokier
2008-05-20 15:52       ` Eric Sandeen
2008-05-20 20:14         ` Jens Axboe
2008-05-20 19:54       ` Jens Axboe
2008-05-20 22:02         ` Jamie Lokier
2008-05-21  7:30           ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).