Re: [PATCH] jbd2: avoid the concurrent data writeback

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] jbd2: avoid the concurrent data writeback
       [not found] <1289827533-2576-1-git-send-email-feng.tang@intel.com>
@ 2010-11-15  5:54 ` Wu Fengguang
  2010-11-15  9:59   ` Feng Tang
  0 siblings, 1 reply; 6+ messages in thread
From: Wu Fengguang @ 2010-11-15  5:54 UTC (permalink / raw)
  To: Tang, Feng; +Cc: linux-ext4, linux-fsdevel, LKML

[add CC to mailing lists]

Tang,

Good catch!

On Mon, Nov 15, 2010 at 09:25:33PM +0800, Tang, Feng wrote:
> When dd a big file to an ext4 partition, it is very likely to happen
> that both the background flush thread and kjounald try to do data
> writeback for it, and ext4_witepage may be called 100, 000 times by
> journal_submit_inode_data_buffers() without really writing one page
> back (skipped), as those pages haven't had disk blocks allocated yet.

The above changelog could show a bit more details (to help me
understand it :).

Does it happen frequently and hence has measurable overheads?

Is it safe to skip the inode? Another alternative is to wait for it:
with inode_wait_for_writeback(inode).

> This could be find by a simple test case with ftrace:
> $ sync;
> $ echo 40960 > buffer_size_kb;echo 1 > events/writeback/enable;echo 1 > events/jbd2/enable;echo 1 > events/ext4/enable;
> $ dd if=/dev/zero of=/home/test/1g.bin bs=1M count=1024;sync;
> $ cat trace > /home/test/jbd2_ext4_1g_dd.log
> 
> This patch will check if the inode is under data syncing, if yes then
> don't start the writeback from kjournald
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  fs/jbd2/commit.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index f3ad159..8a1978d 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -181,6 +181,14 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping)
>  		.range_end = i_size_read(mapping->host),
>  	};
>  
> +	spin_lock(&inode_lock);
> +	/* If this inode is under data syncing, then just quit */

Comments should go above the whole code block. And the above comment
does not really say anything. Say something *why* code like this
rather than *what* the code is doing. 

> +	if (mapping->host->i_state & I_SYNC) {
> +		spin_unlock(&inode_lock);
> +		return 0;
> +	}
> +	spin_unlock(&inode_lock);
> +
>  	ret = generic_writepages(mapping, &wbc);
>  	return ret;
>  }

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] jbd2: avoid the concurrent data writeback
  2010-11-15  5:54 ` [PATCH] jbd2: avoid the concurrent data writeback Wu Fengguang
@ 2010-11-15  9:59   ` Feng Tang
  2010-11-15 11:27     ` Christoph Hellwig
  2010-11-16 12:13     ` Jan Kara
  0 siblings, 2 replies; 6+ messages in thread
From: Feng Tang @ 2010-11-15  9:59 UTC (permalink / raw)
  Cc: Wu, Fengguang, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, LKML

Hi Fengguang,

Thanks for the review

On Mon, 15 Nov 2010 13:54:20 +0800
"Wu, Fengguang" <fengguang.wu@intel.com> wrote:

> [add CC to mailing lists]
> 
> Tang,
> 
> Good catch!
> 
> On Mon, Nov 15, 2010 at 09:25:33PM +0800, Tang, Feng wrote:
> > When dd a big file to an ext4 partition, it is very likely to happen
> > that both the background flush thread and kjounald try to do data
> > writeback for it, and ext4_witepage may be called 100, 000 times by
> > journal_submit_inode_data_buffers() without really writing one page
> > back (skipped), as those pages haven't had disk blocks allocated
> > yet.
> 
> The above changelog could show a bit more details (to help me
> understand it :).
OK, will update the commit log. 
> 
> Does it happen frequently and hence has measurable overheads?

It could always be reproduced on my Core Duo 2 + 4G RAM + SATA + EXT4 FS.
dd 1G file's time on my machine is reduced to 11.964s from 12.396s.

> 
> Is it safe to skip the inode? Another alternative is to wait for it:
> with inode_wait_for_writeback(inode).

Yes, I think it's safe to skip this inode, as current 
journal_submit_inode_data_buffers() will just return without do any real job
in this case, and following jbd2 transaction process will ensure the
data coherence.

Following is the updated patch, pls help to review. thanks!

- Feng
---------------

>From b16cfc5a560f2549ac69dbb235a550500ea1719f Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Mon, 15 Nov 2010 21:06:44 +0800
Subject: [PATCH] jbd2: avoid the concurrent data writeback

When dd a big file to an ext4 partition, it is very likely to happen
that both the background flush thread and kjounald try to do data
writeback for it, that the flush thread is doing the writeback for
this file and jbd2 thread are also waken up to commit the transaction.
Because kjounald only calls the generic_writepages() whose path
doesn't really allocate disk blocks, the ext4_witepage() may be called
lots of times (100000+ for a 1g file dd) without really writing one page
back (skipped), which will consume lots of unnecessary CPU time

This could be found by a simple test case with ftrace:
$ sync;
$ echo 40960 > buffer_size_kb;echo 1 > events/writeback/enable;echo 1 > events/jbd2/enable;echo 1 > events/ext4/enable;
$ dd if=/dev/zero of=/home/test/1g.bin bs=1M count=1024;sync;
$ cat trace > /home/test/jbd2_ext4_1g_dd.log
$ grep -c wcb_writepage /home/test/jbd2_ext4_1g_dd.log

This patch will check if the inode is under data syncing, if yes then
don't start the writeback from kjournald

The Perf statics (On my Core Duo 2 + 4G RAM + SATA disk + Ext4 in all default modes):
before the patch >  112191  writeback:wbc_writepage  #      0.005 M/sec
after the patch  >  54  writeback:wbc_writepage  #      0.000 M/sec

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 fs/jbd2/commit.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..0f3e356 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -170,6 +170,10 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * We don't do block allocation here even for delalloc. We don't
  * use writepages() because with dealyed allocation we may be doing
  * block allocation in writepages().
+ *
+ * Sometimes when this get called, the host inode may be under data
+ * syncing initiated by flush thread(especially for a large file), and 
+ * in such situation, we should skip this path of writeback
  */
 static int journal_submit_inode_data_buffers(struct address_space *mapping)
 {
@@ -181,6 +185,13 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping)
 		.range_end = i_size_read(mapping->host),
 	};
 
+	spin_lock(&inode_lock);
+	if (mapping->host->i_state & I_SYNC) {
+		spin_unlock(&inode_lock);
+		return 0;
+	}
+	spin_unlock(&inode_lock);
+
 	ret = generic_writepages(mapping, &wbc);
 	return ret;
 }
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] jbd2: avoid the concurrent data writeback
  2010-11-15  9:59   ` Feng Tang
@ 2010-11-15 11:27     ` Christoph Hellwig
  2010-11-16  8:13       ` Feng Tang
  2010-11-16 12:13     ` Jan Kara
  1 sibling, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2010-11-15 11:27 UTC (permalink / raw)
  To: Feng Tang
  Cc: Wu, Fengguang, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, LKML

On Mon, Nov 15, 2010 at 05:59:43PM +0800, Feng Tang wrote:
> + *
> + * Sometimes when this get called, the host inode may be under data
> + * syncing initiated by flush thread(especially for a large file), and 
> + * in such situation, we should skip this path of writeback
>   */
>  static int journal_submit_inode_data_buffers(struct address_space *mapping)
>  {
> @@ -181,6 +185,13 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping)
>  		.range_end = i_size_read(mapping->host),
>  	};
>  
> +	spin_lock(&inode_lock);
> +	if (mapping->host->i_state & I_SYNC) {
> +		spin_unlock(&inode_lock);
> +		return 0;
> +	}
> +	spin_unlock(&inode_lock);
> +

inode_lock is not exported to modules, and that's for a pretty good
reason.  I think you want to change this code at a higher level to not
compete with the flusher threads at all.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] jbd2: avoid the concurrent data writeback
  2010-11-15 11:27     ` Christoph Hellwig
@ 2010-11-16  8:13       ` Feng Tang
  0 siblings, 0 replies; 6+ messages in thread
From: Feng Tang @ 2010-11-16  8:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu, Fengguang, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, LKML

Hi Hellwig,

On Mon, 15 Nov 2010 19:27:32 +0800
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Nov 15, 2010 at 05:59:43PM +0800, Feng Tang wrote:
> > + *
> > + * Sometimes when this get called, the host inode may be under data
> > + * syncing initiated by flush thread(especially for a large file),
> > and 
> > + * in such situation, we should skip this path of writeback
> >   */
> >  static int journal_submit_inode_data_buffers(struct address_space
> > *mapping) {
> > @@ -181,6 +185,13 @@ static int
> > journal_submit_inode_data_buffers(struct address_space
> > *mapping) .range_end = i_size_read(mapping->host), };
> >  
> > +	spin_lock(&inode_lock);
> > +	if (mapping->host->i_state & I_SYNC) {
> > +		spin_unlock(&inode_lock);
> > +		return 0;
> > +	}
> > +	spin_unlock(&inode_lock);
> > +
> 
> inode_lock is not exported to modules, and that's for a pretty good
> reason.  I think you want to change this code at a higher level to not
> compete with the flusher threads at all.
> 
Good point. The alternative I can think of is to use writeback_in_progress(),
thus the check will be changed to:

	if (writeback_in_progress(mapping->backing_dev_info))
		return 0;
This have the same effect as the original patch.


Thanks,
Feng

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] jbd2: avoid the concurrent data writeback
  2010-11-15  9:59   ` Feng Tang
  2010-11-15 11:27     ` Christoph Hellwig
@ 2010-11-16 12:13     ` Jan Kara
  2010-11-17  1:36       ` Feng Tang
  1 sibling, 1 reply; 6+ messages in thread
From: Jan Kara @ 2010-11-16 12:13 UTC (permalink / raw)
  To: Feng Tang
  Cc: no To-header on input <""@suse.de>, Wu, Fengguang,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, LKML

  Hi,

  sorry for chiming in a bit late...
On Mon 15-11-10 17:59:43, Feng Tang wrote:
> From b16cfc5a560f2549ac69dbb235a550500ea1719f Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@intel.com>
> Date: Mon, 15 Nov 2010 21:06:44 +0800
> Subject: [PATCH] jbd2: avoid the concurrent data writeback
> 
> When dd a big file to an ext4 partition, it is very likely to happen
> that both the background flush thread and kjounald try to do data
> writeback for it, that the flush thread is doing the writeback for
> this file and jbd2 thread are also waken up to commit the transaction.
> Because kjounald only calls the generic_writepages() whose path
> doesn't really allocate disk blocks, the ext4_witepage() may be called
> lots of times (100000+ for a 1g file dd) without really writing one page
> back (skipped), which will consume lots of unnecessary CPU time
> 
> This could be found by a simple test case with ftrace:
> $ sync;
> $ echo 40960 > buffer_size_kb;echo 1 > events/writeback/enable;echo 1 > events/jbd2/enable;echo 1 > events/ext4/enable;
> $ dd if=/dev/zero of=/home/test/1g.bin bs=1M count=1024;sync;
> $ cat trace > /home/test/jbd2_ext4_1g_dd.log
> $ grep -c wcb_writepage /home/test/jbd2_ext4_1g_dd.log
> 
> This patch will check if the inode is under data syncing, if yes then
> don't start the writeback from kjournald
> 
> The Perf statics (On my Core Duo 2 + 4G RAM + SATA disk + Ext4 in all default modes):
> before the patch >  112191  writeback:wbc_writepage  #      0.005 M/sec
> after the patch  >  54  writeback:wbc_writepage  #      0.000 M/sec
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  fs/jbd2/commit.c |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index f3ad159..0f3e356 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -170,6 +170,10 @@ static int journal_wait_on_commit_record(journal_t *journal,
>   * We don't do block allocation here even for delalloc. We don't
>   * use writepages() because with dealyed allocation we may be doing
>   * block allocation in writepages().
> + *
> + * Sometimes when this get called, the host inode may be under data
> + * syncing initiated by flush thread(especially for a large file), and 
> + * in such situation, we should skip this path of writeback
>   */
>  static int journal_submit_inode_data_buffers(struct address_space *mapping)
>  {
> @@ -181,6 +185,13 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping)
>  		.range_end = i_size_read(mapping->host),
>  	};
>  
> +	spin_lock(&inode_lock);
> +	if (mapping->host->i_state & I_SYNC) {
> +		spin_unlock(&inode_lock);
> +		return 0;
> +	}
> +	spin_unlock(&inode_lock);
> +
  Sorry, but this is just wrong. Not only because of inode_lock as
Christoph pointed out but mainly principially. ext4 and ocfs2 in
data=ordered mode rely on data pages (with underlying blocks already
allocated) being written out before transaction commit proceeds for data
integrity. So you cannot just go and remove the writeback saying it
improves performance.

I'm not saying that ext4 handling of ordered mode does not need a revision
(we actually talked with Ted about it at Kernel Summit). But the solution
for it is to use IO completion callback to do extent tree manipulations
and stop using JBD2 for data syncing. We already do that for direct IO and
conversion of preallocated space so doing it in all cases should be
reasonably easy. Until that happens, you can run ext4 in data=writeback
mode which will also stop JBD2 from doing the writeback (and effectively is
rather similar to your patch).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] jbd2: avoid the concurrent data writeback
  2010-11-16 12:13     ` Jan Kara
@ 2010-11-17  1:36       ` Feng Tang
  0 siblings, 0 replies; 6+ messages in thread
From: Feng Tang @ 2010-11-17  1:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: no To-header on input <""@suse.de>, Wu, Fengguang,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, LKML

Hi Jan,

On Tue, 16 Nov 2010 20:13:23 +0800
Jan Kara <jack@suse.cz> wrote:

>   Hi,
> 
>   sorry for chiming in a bit late...
> On Mon 15-11-10 17:59:43, Feng Tang wrote:
> > From b16cfc5a560f2549ac69dbb235a550500ea1719f Mon Sep 17 00:00:00
> > 2001 From: Feng Tang <feng.tang@intel.com>
> > Date: Mon, 15 Nov 2010 21:06:44 +0800
> > Subject: [PATCH] jbd2: avoid the concurrent data writeback
> > 
> > When dd a big file to an ext4 partition, it is very likely to happen
> > that both the background flush thread and kjounald try to do data
> > writeback for it, that the flush thread is doing the writeback for
> > this file and jbd2 thread are also waken up to commit the
> > transaction. Because kjounald only calls the generic_writepages()
> > whose path doesn't really allocate disk blocks, the ext4_witepage()
> > may be called lots of times (100000+ for a 1g file dd) without
> > really writing one page back (skipped), which will consume lots of
> > unnecessary CPU time
> > 
> > This could be found by a simple test case with ftrace:
> > $ sync;
> > $ echo 40960 > buffer_size_kb;echo 1 > events/writeback/enable;echo
> > 1 > events/jbd2/enable;echo 1 > events/ext4/enable; $ dd
> > if=/dev/zero of=/home/test/1g.bin bs=1M count=1024;sync; $ cat
> > trace > /home/test/jbd2_ext4_1g_dd.log $ grep -c
> > wcb_writepage /home/test/jbd2_ext4_1g_dd.log
> > 
> > This patch will check if the inode is under data syncing, if yes
> > then don't start the writeback from kjournald
> > 
> > The Perf statics (On my Core Duo 2 + 4G RAM + SATA disk + Ext4 in
> > all default modes): before the patch >  112191
> > writeback:wbc_writepage  #      0.005 M/sec after the patch  >  54
> > writeback:wbc_writepage  #      0.000 M/sec
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  fs/jbd2/commit.c |   11 +++++++++++
> >  1 files changed, 11 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index f3ad159..0f3e356 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -170,6 +170,10 @@ static int
> > journal_wait_on_commit_record(journal_t *journal,
> >   * We don't do block allocation here even for delalloc. We don't
> >   * use writepages() because with dealyed allocation we may be doing
> >   * block allocation in writepages().
> > + *
> > + * Sometimes when this get called, the host inode may be under data
> > + * syncing initiated by flush thread(especially for a large file),
> > and 
> > + * in such situation, we should skip this path of writeback
> >   */
> >  static int journal_submit_inode_data_buffers(struct address_space
> > *mapping) {
> > @@ -181,6 +185,13 @@ static int
> > journal_submit_inode_data_buffers(struct address_space
> > *mapping) .range_end = i_size_read(mapping->host), };
> >  
> > +	spin_lock(&inode_lock);
> > +	if (mapping->host->i_state & I_SYNC) {
> > +		spin_unlock(&inode_lock);
> > +		return 0;
> > +	}
> > +	spin_unlock(&inode_lock);
> > +
>   Sorry, but this is just wrong. Not only because of inode_lock as
> Christoph pointed out but mainly principially. ext4 and ocfs2 in
> data=ordered mode rely on data pages (with underlying blocks already
> allocated) being written out before transaction commit proceeds for
> data integrity. So you cannot just go and remove the writeback saying
> it improves performance.
> 
> I'm not saying that ext4 handling of ordered mode does not need a
> revision (we actually talked with Ted about it at Kernel Summit). But
> the solution for it is to use IO completion callback to do extent
> tree manipulations and stop using JBD2 for data syncing. We already
> do that for direct IO and conversion of preallocated space so doing
> it in all cases should be reasonably easy. Until that happens, you
> can run ext4 in data=writeback mode which will also stop JBD2 from
> doing the writeback (and effectively is rather similar to your patch).

Glad to know that the revision is on the way, and thanks for the 
detailed clarification.

- Feng 

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-11-17  1:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1289827533-2576-1-git-send-email-feng.tang@intel.com>
2010-11-15  5:54 ` [PATCH] jbd2: avoid the concurrent data writeback Wu Fengguang
2010-11-15  9:59   ` Feng Tang
2010-11-15 11:27     ` Christoph Hellwig
2010-11-16  8:13       ` Feng Tang
2010-11-16 12:13     ` Jan Kara
2010-11-17  1:36       ` Feng Tang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).