[PATCH] improve jbd fsync batching

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] improve jbd fsync batching
@ 2008-10-28 20:16 Josef Bacik
  2008-10-28 21:38 ` Andreas Dilger
  2008-11-03 20:27 ` Andrew Morton
  0 siblings, 2 replies; 12+ messages in thread
From: Josef Bacik @ 2008-10-28 20:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-ext4, rwheeler

Hello,

This is a rework of the patch I did a few months ago, taking into account some
comments from Andrew and using the new schedule_hrtimeout function (thanks
Arjan!).

There is a flaw with the way jbd handles fsync batching.  If we fsync() a file
and we were not the last person to run fsync() on this fs then we automatically
sleep for 1 jiffie in order to wait for new writers to join into the transaction
before forcing the commit.  The problem with this is that with really fast
storage (ie a Clariion) the time it takes to commit a transaction to disk is way
faster than 1 jiffie in most cases, so sleeping means waiting longer with
nothing to do than if we just committed the transaction and kept going.  Ric
Wheeler noticed this when using fs_mark with more than 1 thread, the throughput
would plummet as he added more threads.

This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time we
started the transaction.  If we run an fsync() and we have been running for less
time than it takes to commit the transaction to disk, we sleep for the delta
amount of time and then commit to disk.  We acheive sub-jiffie sleeping using
schedule_hrtimeout.  This means that the wait time is auto-tuned to the speed of
the underlying disk, instead of having this static timeout.  I weighted the
average according to somebody's comments (Andreas Dilger I think) in order to
help normalize random outliers where we take way longer or way less time to
commit than the average.  I also have a min() check in there to make sure we
don't sleep longer than a jiffie in case our storage is super slow, this was
requested by Andrew.

I unfortunately do not have access to a Clariion, so I had to use a ramdisk to
represent a super fast array.  I tested with a SATA drive with barrier=1 to make
sure there was no regression with local disks, I tested with a 4 way multipathed
Apple Xserve RAID array and of course the ramdisk.  I ran the following command

fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
results

type	threads		with patch	without patch
sata	2		24.6		26.3
sata	4		49.2		48.1
sata	8		70.1		67.0
sata	16		104.0		94.1
sata	32		153.6		142.7

xserve	2		246.4		222.0
xserve	4		480.0		440.8
xserve	8		829.5		730.8
xserve	16		1172.7		1026.9
xserve	32		1816.3		1650.5

ramdisk	2		2538.3		1745.6
ramdisk	4		2942.3		661.9
ramdisk	8		2882.5		999.8
ramdisk	16		2738.7		1801.9
ramdisk	32		2541.9		2394.0

All comments are welcome.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>


diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 25719d9..3fbffb1 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -306,6 +306,8 @@ void journal_commit_transaction(journal_t *journal)
 	int flags;
 	int err;
 	unsigned long blocknr;
+	ktime_t start_time;
+	u64 commit_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -418,6 +420,7 @@ void journal_commit_transaction(journal_t *journal)
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = ktime_get();
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -913,6 +916,18 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+
+	/*
+	 * weight the commit time higher than the average time so we don't
+	 * react too strongly to vast changes in commit time
+	 */
+	if (likely(journal->j_average_commit_time))
+		journal->j_average_commit_time = (commit_time*3 +
+				journal->j_average_commit_time) / 4;
+	else
+		journal->j_average_commit_time = commit_time;
+
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index d15cd6e..59dd181 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -25,6 +25,7 @@
 #include <linux/timer.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include <linux/hrtimer.h>
 
 static void __journal_temp_unlink_buffer(struct journal_head *jh);
 
@@ -49,6 +50,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = ktime_get();
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1371,7 +1373,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
+	int err;
 	pid_t pid;
 
 	J_ASSERT(journal_current_handle() == handle);
@@ -1407,11 +1409,26 @@ int journal_stop(handle_t *handle)
 	 */
 	pid = current->pid;
 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		u64 commit_time, trans_time;
+
 		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
+						   transaction->t_start_time));
+
+		commit_time = min_t(u64, commit_time,
+				    1000*jiffies_to_usecs(1));
+
+		if (trans_time < commit_time) {
+			ktime_t expires = ktime_add_ns(ktime_get(),
+						       commit_time);
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+		}
 	}
 
 	current->journal_info = NULL;
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 346e2b8..d842230 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
 	unsigned long		t_expires;
 
 	/*
+	 * When this transaction started, in nanoseconds [no locking]
+	 */
+	ktime_t			t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
 	pid_t			j_last_sync_writer;
 
+	u64			j_average_commit_time;
+
 	/*
 	 * An opaque pointer to fs-private information.  ext3 puts its
 	 * superblock pointer here

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-10-28 20:16 [PATCH] improve jbd fsync batching Josef Bacik
@ 2008-10-28 21:38 ` Andreas Dilger
  2008-10-28 21:33   ` Josef Bacik
  2008-10-28 21:44   ` Arjan van de Ven
  2008-11-03 20:27 ` Andrew Morton
  1 sibling, 2 replies; 12+ messages in thread
From: Andreas Dilger @ 2008-10-28 21:38 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Oct 28, 2008  16:16 -0400, Josef Bacik wrote:
> I also have a min() check in there to make sure we don't sleep longer than
> a jiffie in case our storage is super slow, this was requested by Andrew.

Is there a particular reason why 1 jiffie is considered the "right amount"
of time to sleep, given this is a kernel config parameter and has nothing
to do with the storage?  Considering a seek time in the range of ~10ms
this would only be right for HZ=100 and the wait would otherwise be too
short to maximize batching within a single transaction.

> type	threads		with patch	without patch
> sata	2		24.6		26.3
> sata	4		49.2		48.1
> sata	8		70.1		67.0
> sata	16		104.0		94.1
> sata	32		153.6		142.7

In the previous patch where this wasn't limited it had better performance
even for the 2 thread case.  With the current 1-jiffie wait it likely
isn't long enough to batch every pair of operations and every other
operation waits an extra amount before giving up too soon.  Previous patch:

type    threads       patch     unpatched
sata    2              34.6     26.2
sata    4              58.0     48.0
sata    8              75.2     70.4
sata    16            101.1     89.6

I'd recommend changing the patch to have a maximum sleep time that has a
fixed maximum number of milliseconds (15ms should be enough for even very
old disks).

That said, this would be a minor enhancement and should NOT be considered
a reason to delay this patch's inclusion into -mm or the ext4 tree.

PS - it should really go into jbd2 also

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-10-28 21:38 ` Andreas Dilger
@ 2008-10-28 21:33   ` Josef Bacik
  2008-10-28 21:44   ` Arjan van de Ven
  1 sibling, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2008-10-28 21:33 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Tue, Oct 28, 2008 at 03:38:05PM -0600, Andreas Dilger wrote:
> On Oct 28, 2008  16:16 -0400, Josef Bacik wrote:
> > I also have a min() check in there to make sure we don't sleep longer than
> > a jiffie in case our storage is super slow, this was requested by Andrew.
> 
> Is there a particular reason why 1 jiffie is considered the "right amount"
> of time to sleep, given this is a kernel config parameter and has nothing
> to do with the storage?  Considering a seek time in the range of ~10ms
> this would only be right for HZ=100 and the wait would otherwise be too
> short to maximize batching within a single transaction.
>

I wouldn't say "right amount", more of "traditional amount".  If you have super
slow storage this patch will not result in you waiting any longer than you did
originally, which I think is what the concern was, that we not wait a super long
time just because the disk is slow.
 
> > type	threads		with patch	without patch
> > sata	2		24.6		26.3
> > sata	4		49.2		48.1
> > sata	8		70.1		67.0
> > sata	16		104.0		94.1
> > sata	32		153.6		142.7
> 
> In the previous patch where this wasn't limited it had better performance
> even for the 2 thread case.  With the current 1-jiffie wait it likely
> isn't long enough to batch every pair of operations and every other
> operation waits an extra amount before giving up too soon.  Previous patch:
> 
> type    threads       patch     unpatched
> sata    2              34.6     26.2
> sata    4              58.0     48.0
> sata    8              75.2     70.4
> sata    16            101.1     89.6
> 
> I'd recommend changing the patch to have a maximum sleep time that has a
> fixed maximum number of milliseconds (15ms should be enough for even very
> old disks).
> 

This stat gathering process has been very unscientific :), I just ran once and
took that number.  Sometimes the patched version would come out on top,
sometimes it wouldn't.  If I were to do this the way my stat teacher taught me
I'm sure the patched/unpatched versions would come out to be relatively the same
in the 2 thread case.

> 
> That said, this would be a minor enhancement and should NOT be considered
> a reason to delay this patch's inclusion into -mm or the ext4 tree.
> 
> PS - it should really go into jbd2 also
>

Yes I will be doing a jbd2 version of this patch provided there are no issues
with this patch.  Thanks much for the comments,

Josef 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-10-28 21:38 ` Andreas Dilger
  2008-10-28 21:33   ` Josef Bacik
@ 2008-10-28 21:44   ` Arjan van de Ven
  2008-10-28 21:56     ` Ric Wheeler
  1 sibling, 1 reply; 12+ messages in thread
From: Arjan van de Ven @ 2008-10-28 21:44 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Tue, 28 Oct 2008 15:38:05 -0600
Andreas Dilger <adilger@sun.com> wrote:

> On Oct 28, 2008  16:16 -0400, Josef Bacik wrote:
> > I also have a min() check in there to make sure we don't sleep
> > longer than a jiffie in case our storage is super slow, this was
> > requested by Andrew.
> 
> Is there a particular reason why 1 jiffie is considered the "right
> amount" of time to sleep, given this is a kernel config parameter and
> has nothing to do with the storage?  Considering a seek time in the
> range of ~10ms this would only be right for HZ=100 and the wait would

well... my disk does a 50 usec seek time or so.. so I don't mind ;-)

in fact it sounds awefully long to me.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-10-28 21:44   ` Arjan van de Ven
@ 2008-10-28 21:56     ` Ric Wheeler
  0 siblings, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-10-28 21:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andreas Dilger, Josef Bacik, linux-kernel, linux-fsdevel,
	linux-ext4

Arjan van de Ven wrote:
> On Tue, 28 Oct 2008 15:38:05 -0600
> Andreas Dilger <adilger@sun.com> wrote:
>
>   
>> On Oct 28, 2008  16:16 -0400, Josef Bacik wrote:
>>     
>>> I also have a min() check in there to make sure we don't sleep
>>> longer than a jiffie in case our storage is super slow, this was
>>> requested by Andrew.
>>>       
>> Is there a particular reason why 1 jiffie is considered the "right
>> amount" of time to sleep, given this is a kernel config parameter and
>> has nothing to do with the storage?  Considering a seek time in the
>> range of ~10ms this would only be right for HZ=100 and the wait would
>>     
>
> well... my disk does a 50 usec seek time or so.. so I don't mind ;-)
>
> in fact it sounds awefully long to me.
>   

For small writes as well as reads?

If so, it would be great to test Josef's patch against your shiny new 
SSD  :-)

ric


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-10-28 20:16 [PATCH] improve jbd fsync batching Josef Bacik
  2008-10-28 21:38 ` Andreas Dilger
@ 2008-11-03 20:27 ` Andrew Morton
  2008-11-03 20:24   ` Josef Bacik
  2008-11-04  5:24   ` Andreas Dilger
  1 sibling, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2008-11-03 20:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Tue, 28 Oct 2008 16:16:15 -0400
Josef Bacik <jbacik@redhat.com> wrote:

> Hello,
> 
> This is a rework of the patch I did a few months ago, taking into account some
> comments from Andrew and using the new schedule_hrtimeout function (thanks
> Arjan!).
> 
> There is a flaw with the way jbd handles fsync batching.  If we fsync() a file
> and we were not the last person to run fsync() on this fs then we automatically
> sleep for 1 jiffie in order to wait for new writers to join into the transaction
> before forcing the commit.  The problem with this is that with really fast
> storage (ie a Clariion) the time it takes to commit a transaction to disk is way
> faster than 1 jiffie in most cases, so sleeping means waiting longer with
> nothing to do than if we just committed the transaction and kept going.  Ric
> Wheeler noticed this when using fs_mark with more than 1 thread, the throughput
> would plummet as he added more threads.
> 
> ...
>
> ...
>  
> @@ -49,6 +50,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
>  {
>  	transaction->t_journal = journal;
>  	transaction->t_state = T_RUNNING;
> +	transaction->t_start_time = ktime_get();
>  	transaction->t_tid = journal->j_transaction_sequence++;
>  	transaction->t_expires = jiffies + journal->j_commit_interval;
>  	spin_lock_init(&transaction->t_handle_lock);
> @@ -1371,7 +1373,7 @@ int journal_stop(handle_t *handle)
>  {
>  	transaction_t *transaction = handle->h_transaction;
>  	journal_t *journal = transaction->t_journal;
> -	int old_handle_count, err;
> +	int err;
>  	pid_t pid;
>  
>  	J_ASSERT(journal_current_handle() == handle);
> @@ -1407,11 +1409,26 @@ int journal_stop(handle_t *handle)
>  	 */
>  	pid = current->pid;
>  	if (handle->h_sync && journal->j_last_sync_writer != pid) {

It would be nice to have a comment here explaining the overall design. 
it's a bit opaque working that out from the raw implementation.

> +		u64 commit_time, trans_time;
> +
>  		journal->j_last_sync_writer = pid;
> -		do {
> -			old_handle_count = transaction->t_handle_count;
> -			schedule_timeout_uninterruptible(1);
> -		} while (old_handle_count != transaction->t_handle_count);
> +
> +		spin_lock(&journal->j_state_lock);
> +		commit_time = journal->j_average_commit_time;
> +		spin_unlock(&journal->j_state_lock);

OK, the lock is needed on 32-bit machines, I guess.


> +		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
> +						   transaction->t_start_time));
> +
> +		commit_time = min_t(u64, commit_time,
> +				    1000*jiffies_to_usecs(1));

OK.  The multiplication of an unsigned by 1000 could overflow, but only
if HZ is less than 0.25.  I don't think we need worry about that ;)


> +		if (trans_time < commit_time) {
> +			ktime_t expires = ktime_add_ns(ktime_get(),
> +						       commit_time);
> +			set_current_state(TASK_UNINTERRUPTIBLE);
> +			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);

We should have schedule_hrtimeout_uninterruptible(), but we don't.

> +		}
>  	}
>  
>  	current->journal_info = NULL;
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index 346e2b8..d842230 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -543,6 +543,11 @@ struct transaction_s
>  	unsigned long		t_expires;
>  
>  	/*
> +	 * When this transaction started, in nanoseconds [no locking]
> +	 */
> +	ktime_t			t_start_time;
> +
> +	/*
>  	 * How many handles used this transaction? [t_handle_lock]
>  	 */
>  	int t_handle_count;
> @@ -800,6 +805,8 @@ struct journal_s
>  
>  	pid_t			j_last_sync_writer;
>  
> +	u64			j_average_commit_time;

Every field in that structure is carefully documented (except for
j_last_sync_writer - what vandal did that?)

please fix.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-11-03 20:27 ` Andrew Morton
@ 2008-11-03 20:24   ` Josef Bacik
  2008-11-03 20:55     ` Andrew Morton
  2008-11-03 22:13     ` Theodore Tso
  2008-11-04  5:24   ` Andreas Dilger
  1 sibling, 2 replies; 12+ messages in thread
From: Josef Bacik @ 2008-11-03 20:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Mon, Nov 03, 2008 at 12:27:29PM -0800, Andrew Morton wrote:
> On Tue, 28 Oct 2008 16:16:15 -0400
> Josef Bacik <jbacik@redhat.com> wrote:
> 
> > Hello,
> > 
> > This is a rework of the patch I did a few months ago, taking into account some
> > comments from Andrew and using the new schedule_hrtimeout function (thanks
> > Arjan!).
> > 
> > There is a flaw with the way jbd handles fsync batching.  If we fsync() a file
> > and we were not the last person to run fsync() on this fs then we automatically
> > sleep for 1 jiffie in order to wait for new writers to join into the transaction
> > before forcing the commit.  The problem with this is that with really fast
> > storage (ie a Clariion) the time it takes to commit a transaction to disk is way
> > faster than 1 jiffie in most cases, so sleeping means waiting longer with
> > nothing to do than if we just committed the transaction and kept going.  Ric
> > Wheeler noticed this when using fs_mark with more than 1 thread, the throughput
> > would plummet as he added more threads.
> > 
> > ...
> >
> > ...
> >  
> > @@ -49,6 +50,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
> >  {
> >  	transaction->t_journal = journal;
> >  	transaction->t_state = T_RUNNING;
> > +	transaction->t_start_time = ktime_get();
> >  	transaction->t_tid = journal->j_transaction_sequence++;
> >  	transaction->t_expires = jiffies + journal->j_commit_interval;
> >  	spin_lock_init(&transaction->t_handle_lock);
> > @@ -1371,7 +1373,7 @@ int journal_stop(handle_t *handle)
> >  {
> >  	transaction_t *transaction = handle->h_transaction;
> >  	journal_t *journal = transaction->t_journal;
> > -	int old_handle_count, err;
> > +	int err;
> >  	pid_t pid;
> >  
> >  	J_ASSERT(journal_current_handle() == handle);
> > @@ -1407,11 +1409,26 @@ int journal_stop(handle_t *handle)
> >  	 */
> >  	pid = current->pid;
> >  	if (handle->h_sync && journal->j_last_sync_writer != pid) {
> 
> It would be nice to have a comment here explaining the overall design. 
> it's a bit opaque working that out from the raw implementation.
> 
> > +		u64 commit_time, trans_time;
> > +
> >  		journal->j_last_sync_writer = pid;
> > -		do {
> > -			old_handle_count = transaction->t_handle_count;
> > -			schedule_timeout_uninterruptible(1);
> > -		} while (old_handle_count != transaction->t_handle_count);
> > +
> > +		spin_lock(&journal->j_state_lock);
> > +		commit_time = journal->j_average_commit_time;
> > +		spin_unlock(&journal->j_state_lock);
> 
> OK, the lock is needed on 32-bit machines, I guess.
> 
> 
> > +		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
> > +						   transaction->t_start_time));
> > +
> > +		commit_time = min_t(u64, commit_time,
> > +				    1000*jiffies_to_usecs(1));
> 
> OK.  The multiplication of an unsigned by 1000 could overflow, but only
> if HZ is less than 0.25.  I don't think we need worry about that ;)
> 
> 
> > +		if (trans_time < commit_time) {
> > +			ktime_t expires = ktime_add_ns(ktime_get(),
> > +						       commit_time);
> > +			set_current_state(TASK_UNINTERRUPTIBLE);
> > +			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
> 
> We should have schedule_hrtimeout_uninterruptible(), but we don't.
> 
> > +		}
> >  	}
> >  
> >  	current->journal_info = NULL;
> > diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> > index 346e2b8..d842230 100644
> > --- a/include/linux/jbd.h
> > +++ b/include/linux/jbd.h
> > @@ -543,6 +543,11 @@ struct transaction_s
> >  	unsigned long		t_expires;
> >  
> >  	/*
> > +	 * When this transaction started, in nanoseconds [no locking]
> > +	 */
> > +	ktime_t			t_start_time;
> > +
> > +	/*
> >  	 * How many handles used this transaction? [t_handle_lock]
> >  	 */
> >  	int t_handle_count;
> > @@ -800,6 +805,8 @@ struct journal_s
> >  
> >  	pid_t			j_last_sync_writer;
> >  
> > +	u64			j_average_commit_time;
> 
> Every field in that structure is carefully documented (except for
> j_last_sync_writer - what vandal did that?)
> 
> please fix.

I see you already pulled this into -mm, would you like me to repost with the
same changelog and the patch updated, or just reply to this with the updated
patch?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-11-03 20:24   ` Josef Bacik
@ 2008-11-03 20:55     ` Andrew Morton
  2008-11-04 15:41       ` Josef Bacik
  2008-11-03 22:13     ` Theodore Tso
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2008-11-03 20:55 UTC (permalink / raw)
  To: Josef Bacik; +Cc: jbacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Mon, 3 Nov 2008 15:24:43 -0500
Josef Bacik <jbacik@redhat.com> wrote:

> > please fix.
> 
> I see you already pulled this into -mm, would you like me to repost with the
> same changelog and the patch updated, or just reply to this with the updated
> patch?  Thanks,

Either works for me at this stage.  If it's a replacement then I'll turn
it into an incremental so I can see what changed, which takes me about 2.15
seconds.

If it had been a large patch or if it had been under test in someone's
tree for a while then a replacement patch would be unwelcome.  But for a
small, fresh patch like this one it's no big deal either way.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-11-03 20:55     ` Andrew Morton
@ 2008-11-04 15:41       ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2008-11-04 15:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Mon, Nov 03, 2008 at 12:55:09PM -0800, Andrew Morton wrote:
> On Mon, 3 Nov 2008 15:24:43 -0500
> Josef Bacik <jbacik@redhat.com> wrote:
> 
> > > please fix.
> > 
> > I see you already pulled this into -mm, would you like me to repost with the
> > same changelog and the patch updated, or just reply to this with the updated
> > patch?  Thanks,
> 
> Either works for me at this stage.  If it's a replacement then I'll turn
> it into an incremental so I can see what changed, which takes me about 2.15
> seconds.
> 
> If it had been a large patch or if it had been under test in someone's
> tree for a while then a replacement patch would be unwelcome.  But for a
> small, fresh patch like this one it's no big deal either way.
>

Ok here is a replacement patch with the comments as requested, as well as a
comment for j_last_sync_writer.  Thank you,

Josef

 
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 25719d9..3fbffb1 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -306,6 +306,8 @@ void journal_commit_transaction(journal_t *journal)
 	int flags;
 	int err;
 	unsigned long blocknr;
+	ktime_t start_time;
+	u64 commit_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -418,6 +420,7 @@ void journal_commit_transaction(journal_t *journal)
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = ktime_get();
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -913,6 +916,18 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+
+	/*
+	 * weight the commit time higher than the average time so we don't
+	 * react too strongly to vast changes in commit time
+	 */
+	if (likely(journal->j_average_commit_time))
+		journal->j_average_commit_time = (commit_time*3 +
+				journal->j_average_commit_time) / 4;
+	else
+		journal->j_average_commit_time = commit_time;
+
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index d15cd6e..90c2cd4 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -25,6 +25,7 @@
 #include <linux/timer.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include <linux/hrtimer.h>
 
 static void __journal_temp_unlink_buffer(struct journal_head *jh);
 
@@ -49,6 +50,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = ktime_get();
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1371,7 +1373,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
+	int err;
 	pid_t pid;
 
 	J_ASSERT(journal_current_handle() == handle);
@@ -1400,6 +1402,17 @@ int journal_stop(handle_t *handle)
 	 * on IO anyway.  Speeds up many-threaded, many-dir operations
 	 * by 30x or more...
 	 *
+	 * We try and optimize the sleep time against what the underlying disk
+	 * can do, instead of having a static sleep time.  This is usefull for
+	 * the case where our storage is so fast that it is more optimal to go
+	 * ahead and force a flush and wait for the transaction to be committed
+	 * than it is to wait for an arbitrary amount of time for new writers to
+	 * join the transaction.  We acheive this by measuring how long it takes
+	 * to commit a transaction, and compare it with how long this
+	 * transaction has been running, and if run time < commit time then we
+	 * sleep for the delta and commit.  This greatly helps super fast disks
+	 * that would see slowdowns as more threads started doing fsyncs.
+	 *
 	 * But don't do this if this process was the most recent one to
 	 * perform a synchronous write.  We do this to detect the case where a
 	 * single process is doing a stream of sync writes.  No point in waiting
@@ -1407,11 +1420,26 @@ int journal_stop(handle_t *handle)
 	 */
 	pid = current->pid;
 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		u64 commit_time, trans_time;
+
 		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
+						   transaction->t_start_time));
+
+		commit_time = min_t(u64, commit_time,
+				    1000*jiffies_to_usecs(1));
+
+		if (trans_time < commit_time) {
+			ktime_t expires = ktime_add_ns(ktime_get(),
+						       commit_time);
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+		}
 	}
 
 	current->journal_info = NULL;
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 346e2b8..6384b19 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
 	unsigned long		t_expires;
 
 	/*
+	 * When this transaction started, in nanoseconds [no locking]
+	 */
+	ktime_t			t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -798,9 +803,19 @@ struct journal_s
 	struct buffer_head	**j_wbuf;
 	int			j_wbufsize;
 
+	/*
+	 * this is the pid of the last person to run a synchronous operation
+	 * through the journal.
+	 */
 	pid_t			j_last_sync_writer;
 
 	/*
+	 * the average amount of time in nanoseconds it takes to commit a
+	 * transaction to the disk.  [j_state_lock]
+	 */
+	u64			j_average_commit_time;
+
+	/*
 	 * An opaque pointer to fs-private information.  ext3 puts its
 	 * superblock pointer here
 	 */

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-11-03 20:24   ` Josef Bacik
  2008-11-03 20:55     ` Andrew Morton
@ 2008-11-03 22:13     ` Theodore Tso
  1 sibling, 0 replies; 12+ messages in thread
From: Theodore Tso @ 2008-11-03 22:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Mon, Nov 03, 2008 at 03:24:43PM -0500, Josef Bacik wrote:
> I see you already pulled this into -mm, would you like me to repost with the
> same changelog and the patch updated, or just reply to this with the updated
> patch?  Thanks,
> 

Josef,

When you have a moment, any chance you could spin an ext4 version of
the patch, too?  Thanks!!

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-11-03 20:27 ` Andrew Morton
  2008-11-03 20:24   ` Josef Bacik
@ 2008-11-04  5:24   ` Andreas Dilger
  2008-11-04  9:12     ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2008-11-04  5:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Nov 03, 2008  12:27 -0800, Andrew Morton wrote:
> On Tue, 28 Oct 2008 16:16:15 -0400
> Josef Bacik <jbacik@redhat.com> wrote:
> > +		spin_lock(&journal->j_state_lock);
> > +		commit_time = journal->j_average_commit_time;
> > +		spin_unlock(&journal->j_state_lock);
> 
> OK, the lock is needed on 32-bit machines, I guess.

Should we pessimize the 64-bit performance in that case, for 32-bit
increasingly rare 32-bit platforms?

It might be useful to have a spin_{,un}lock_64bit_word() helper that
evaluates to a no-op on plaforms that don't need a hammer to do an atomic
64-bit update.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] improve jbd fsync batching
  2008-11-04  5:24   ` Andreas Dilger
@ 2008-11-04  9:12     ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2008-11-04  9:12 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Josef Bacik, linux-kernel, linux-fsdevel, linux-ext4, rwheeler

On Mon, 03 Nov 2008 22:24:28 -0700 Andreas Dilger <adilger@sun.com> wrote:

> On Nov 03, 2008  12:27 -0800, Andrew Morton wrote:
> > On Tue, 28 Oct 2008 16:16:15 -0400
> > Josef Bacik <jbacik@redhat.com> wrote:
> > > +		spin_lock(&journal->j_state_lock);
> > > +		commit_time = journal->j_average_commit_time;
> > > +		spin_unlock(&journal->j_state_lock);
> > 
> > OK, the lock is needed on 32-bit machines, I guess.
> 
> Should we pessimize the 64-bit performance in that case, for 32-bit
> increasingly rare 32-bit platforms?

In general no.

But spinlocks also do memory ordering stuff on both 32- and 64-bit
machines.  Introducing differences there needs thinking about.

In this case it's fsync which is going to be monster slow anyway.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-11-04 15:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-28 20:16 [PATCH] improve jbd fsync batching Josef Bacik
2008-10-28 21:38 ` Andreas Dilger
2008-10-28 21:33   ` Josef Bacik
2008-10-28 21:44   ` Arjan van de Ven
2008-10-28 21:56     ` Ric Wheeler
2008-11-03 20:27 ` Andrew Morton
2008-11-03 20:24   ` Josef Bacik
2008-11-03 20:55     ` Andrew Morton
2008-11-04 15:41       ` Josef Bacik
2008-11-03 22:13     ` Theodore Tso
2008-11-04  5:24   ` Andreas Dilger
2008-11-04  9:12     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).