[PATCH] imporve jbd2 fsync batching

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] imporve jbd2 fsync batching
@ 2008-11-04 16:10 Josef Bacik
  2008-11-04 20:52 ` Theodore Tso
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Josef Bacik @ 2008-11-04 16:10 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hello,

This is the jbd2 version of the jbd fsync batching patch.

This patch removes the static sleep time in favor of a more self optimizing
approach where we measure the average amount of time it takes to commit a
transaction to disk and the ammount of time a transaction has been running.  If
somebody does a sync write or an fsync() traditionally we would sleep for 1
jiffies, which depending on the value of HZ could be a significant amount of
time compared to how long it takes to commit a transaction to the underlying
storage.  With this patch instead of sleeping for a jiffie, we check to see if
the amount of time this transaction has been running is less than the average
commit time, and if it is we sleep for the delta using schedule_hrtimeout to
give us a higher precision sleep time.  This greatly benefits high end storage
where you could end up sleeping for longer than it takes to commit the
transaction and therefore sitting idle instead of allowing the transaction to be
committed by keeping the sleep time to a minimum so you are sure to always be
doing something.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8b119e1..3169db9 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -332,6 +332,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int flags;
 	int err;
 	unsigned long long blocknr;
+	ktime_t start_time;
+	u64 commit_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -458,6 +460,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = ktime_get();
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -972,6 +975,17 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+
+	/*
+	 * weight the commit time higher than the average time so we don't
+	 * react too strongly to vast changes in the commit time
+	 */
+	if (likely(journal->j_average_commit_time))
+		journal->j_average_commit_time = (commit_time*3 +
+				journal->j_average_commit_time) / 4;
+	else
+		journal->j_average_commit_time = commit_time;
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 39b7805..ef705f9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -25,6 +25,7 @@
 #include <linux/timer.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include <linux/hrtimer.h>
 
 static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);
 
@@ -48,6 +49,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = ktime_get();
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1193,7 +1195,7 @@ int jbd2_journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
+	int err;
 	pid_t pid;
 
 	J_ASSERT(journal_current_handle() == handle);
@@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle)
 	 * on IO anyway.  Speeds up many-threaded, many-dir operations
 	 * by 30x or more...
 	 *
+	 * We try and optimize the sleep time against what the underlying disk
+	 * can do, instead of having a static sleep time.  This is usefull for
+	 * the case where our storage is so fast that it is more optimal to go
+	 * ahead and force a flush and wait for the transaction to be committed
+	 * than it is to wait for an arbitrary amount of time for new writers to
+	 * join the transaction.  We acheive this by measuring how long it takes
+	 * to commit a transaction, and compare it with how long this
+	 * transaction has been running, and if run time < commit time then we
+	 * sleep for the delta and commit.  This greatly helps super fast disks
+	 * that would see slowdowns as more threads started doing fsyncs.
+	 *
 	 * But don't do this if this process was the most recent one to
 	 * perform a synchronous write.  We do this to detect the case where a
 	 * single process is doing a stream of sync writes.  No point in waiting
@@ -1229,11 +1242,26 @@ int jbd2_journal_stop(handle_t *handle)
 	 */
 	pid = current->pid;
 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		u64 commit_time, trans_time;
+
 		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
+						   transaction->t_start_time));
+
+		commit_time = min_t(u64, commit_time,
+				    1000*jiffies_to_usecs(1));
+
+		if (trans_time < commit_time) {
+			ktime_t expires = ktime_add_ns(ktime_get(),
+						       commit_time);
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+		}
 	}
 
 	current->journal_info = NULL;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index c7d106e..b8b8744 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -637,6 +637,11 @@ struct transaction_s
 	unsigned long		t_expires;
 
 	/*
+	 * When this transaction started, in nanoseconds [no locking]
+	 */
+	ktime_t			t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -938,8 +943,18 @@ struct journal_s
 	struct buffer_head	**j_wbuf;
 	int			j_wbufsize;
 
+	/*
+	 * this is the pid of hte last person to run a synchronous operation
+	 * through the journal
+	 */
 	pid_t			j_last_sync_writer;
 
+	/*
+	 * the average amount of time in nanoseconds it takes to commit a
+	 * transaction to disk. [j_state_lock]
+	 */
+	u64			j_average_commit_time;
+
 	/* This function is called when a transaction is closed */
 	void			(*j_commit_callback)(journal_t *,
 						     transaction_t *);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-04 16:10 [PATCH] imporve jbd2 fsync batching Josef Bacik
@ 2008-11-04 20:52 ` Theodore Tso
  2008-11-04 22:15   ` Leroy van Logchem
  2008-11-05 23:10 ` Andreas Dilger
  2008-11-25 10:23 ` Theodore Tso
  2 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2008-11-04 20:52 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-ext4

Thanks!!

Once tiny spelling nit, which I'll fix up before I put it in the ext4
patch queue.

On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> @@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle)
>  	 * on IO anyway.  Speeds up many-threaded, many-dir operations
>  	 * by 30x or more...
>  	 *
> +	 * We try and optimize the sleep time against what the underlying disk
> +	 * can do, instead of having a static sleep time.  This is usefull for

s/usefull/useful/

							- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-04 20:52 ` Theodore Tso
@ 2008-11-04 22:15   ` Leroy van Logchem
  0 siblings, 0 replies; 13+ messages in thread
From: Leroy van Logchem @ 2008-11-04 22:15 UTC (permalink / raw)
  To: linux-ext4


> > +	 * can do, instead of having a static sleep time.  This is usefull for
> 
> s/usefull/useful/

and:

+	 * join the transaction.  We acheive this by measuring how long it takes

s/acheive/achieve/

-- 
Leroy



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-04 16:10 [PATCH] imporve jbd2 fsync batching Josef Bacik
  2008-11-04 20:52 ` Theodore Tso
@ 2008-11-05 23:10 ` Andreas Dilger
  2008-11-06  0:27   ` Theodore Tso
  2008-11-06 14:35   ` [PATCH] imporve jbd2 fsync batching Josef Bacik
  2008-11-25 10:23 ` Theodore Tso
  2 siblings, 2 replies; 13+ messages in thread
From: Andreas Dilger @ 2008-11-05 23:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: tytso, linux-ext4

On Nov 04, 2008  11:10 -0500, Josef Bacik wrote:
> somebody does a sync write or an fsync() traditionally we would sleep for 1
> jiffies, which depending on the value of HZ could be a significant amount of
> time compared to how long it takes to commit a transaction to the underlying
> storage.  With this patch instead of sleeping for a jiffie, we check to see if
> the amount of time this transaction has been running is less than the average
> commit time, and if it is we sleep for the delta using schedule_hrtimeout to
> give us a higher precision sleep time.  This greatly benefits high end storage
> where you could end up sleeping for longer than it takes to commit the
> transaction and therefore sitting idle instead of allowing the transaction to
> be committed by keeping the sleep time to a minimum so you are sure to always
> be doing something.

There was no reply to my previous comments about making the maximum sleep
time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
on the jiffies value, which may change between 1ms and 10ms.

I don't object to this being included in ext4, but I suspect it could do
even better, or at least be more consistent than depending on the HZ value.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-05 23:10 ` Andreas Dilger
@ 2008-11-06  0:27   ` Theodore Tso
  2008-11-06 12:45     ` Ric Wheeler
  2008-11-06 14:35   ` [PATCH] imporve jbd2 fsync batching Josef Bacik
  1 sibling, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2008-11-06  0:27 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Josef Bacik, linux-ext4

On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
> There was no reply to my previous comments about making the maximum sleep
> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
> on the jiffies value, which may change between 1ms and 10ms.

I know the whole point is to make it be self-tuning, but I wonder if
we should have some knobs to control the min and max sleep times.

   	          	       	    	     	     - Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-06  0:27   ` Theodore Tso
@ 2008-11-06 12:45     ` Ric Wheeler
  2008-11-25 10:22       ` [PATCH] ext4: add fsync batch tuning knobs Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Ric Wheeler @ 2008-11-06 12:45 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andreas Dilger, Josef Bacik, linux-ext4

Theodore Tso wrote:
> On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
>   
>> There was no reply to my previous comments about making the maximum sleep
>> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
>> on the jiffies value, which may change between 1ms and 10ms.
>>     
>
> I know the whole point is to make it be self-tuning, but I wonder if
> we should have some knobs to control the min and max sleep times.
>
>    	          	       	    	     	     - Ted
>   

I think that it could be useful to have a tunable knob for the max sleep 
time. It would certainly be capped at a much lower level for a RAM disk 
than it would be for a very large RAID volume...

Ric


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] ext4: add fsync batch tuning knobs
  2008-11-06 12:45     ` Ric Wheeler
@ 2008-11-25 10:22       ` Theodore Tso
  2008-12-02 14:45         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2008-11-25 10:22 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Andreas Dilger, Josef Bacik, linux-ext4

On Thu, Nov 06, 2008 at 07:45:38AM -0500, Ric Wheeler wrote:
> I think that it could be useful to have a tunable knob for the max sleep  
> time. It would certainly be capped at a much lower level for a RAM disk  
> than it would be for a very large RAID volume...

Here is a patch which implements tuning knobs for Josef's fsync
batching patch, at least for ext4.  I've implemented both a minimum
and maximum batch time.  I've also changed the default max batch time
to be 15ms, independent of HZ setting.

It'd be great if someone could run some benchmarks to see what would
be ideal ways to twiddle these knobs for different types of storage
devices (and workloads, presumably).

							- Ted

-------------
ext4: add fsync batch tuning knobs

From: "Theodore Ts'o" <tytso@mit.edu>

Add new mount options, min_batch_time and max_batch_time, which
controls how long the jbd2 layer should wait for additional filesystem
operations to get batched into a synchronous handle.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8370ffd..6244c5d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -328,6 +328,7 @@ struct ext4_mount_options {
 	uid_t s_resuid;
 	gid_t s_resgid;
 	unsigned long s_commit_interval;
+	u32 s_min_batch_time, s_max_batch_time;
 #ifdef CONFIG_QUOTA
 	int s_jquota_fmt;
 	char *s_qf_names[MAXQUOTAS];
@@ -806,6 +807,12 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 #define EXT4_DEFM_JMODE_WBACK	0x0060
 
 /*
+ * Default journal batch times
+ */
+#define EXT4_DEF_MIN_BATCH_TIME	0
+#define EXT4_DEF_MAX_BATCH_TIME	15000 /* 15ms */
+
+/*
  * Structure of a directory entry
  */
 #define EXT4_NAME_LEN 255
diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 2f3b8b1..f5b5d56 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -74,6 +74,8 @@ struct ext4_sb_info {
 	struct journal_s *s_journal;
 	struct list_head s_orphan;
 	unsigned long s_commit_interval;
+	u32 s_max_batch_time;
+	u32 s_min_batch_time;
 	struct block_device *journal_bdev;
 #ifdef CONFIG_JBD2_DEBUG
 	struct timer_list turn_ro_timer;	/* For turning read-only (crash simulation) */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4aa2040..35500ed 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -681,10 +681,19 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 #endif
 	if (!test_opt(sb, RESERVATION))
 		seq_puts(seq, ",noreservation");
-	if (sbi->s_commit_interval) {
+	if (sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ) {
 		seq_printf(seq, ",commit=%u",
 			   (unsigned) (sbi->s_commit_interval / HZ));
 	}
+	if (sbi->s_min_batch_time != EXT4_DEF_MIN_BATCH_TIME) {
+		seq_printf(seq, ",min_batch_time=%u",
+			   (unsigned) sbi->s_min_batch_time);
+	}
+	if (sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME) {
+		seq_printf(seq, ",max_batch_time=%u",
+			   (unsigned) sbi->s_min_batch_time);
+	}
+
 	/*
 	 * We're changing the default of barrier mount option, so
 	 * let's always display its mount state so it's clear what its
@@ -850,7 +859,8 @@ enum {
 	Opt_nouid32, Opt_debug, Opt_oldalloc, Opt_orlov,
 	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
 	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
-	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+	Opt_commit, Opt_min_batch_time, Opt_max_batch_time,
+	Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
 	Opt_journal_checksum, Opt_journal_async_commit,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
 	Opt_data_err_abort, Opt_data_err_ignore,
@@ -889,6 +899,8 @@ static const match_table_t tokens = {
 	{Opt_nobh, "nobh"},
 	{Opt_bh, "bh"},
 	{Opt_commit, "commit=%u"},
+	{Opt_min_batch_time, "min_batch_time=%u"},
+	{Opt_max_batch_time, "max_batch_time=%u"},
 	{Opt_journal_update, "journal=update"},
 	{Opt_journal_inum, "journal=%u"},
 	{Opt_journal_dev, "journal_dev=%u"},
@@ -1107,6 +1119,20 @@ static int parse_options(char *options, struct super_block *sb,
 				option = JBD2_DEFAULT_MAX_COMMIT_AGE;
 			sbi->s_commit_interval = HZ * option;
 			break;
+		case Opt_max_batch_time:
+			if (match_int(&args[0], &option))
+				return 0;
+			if (option < 0)
+				return 0;
+			if (option == 0)
+				option = EXT4_DEF_MAX_BATCH_TIME;
+			sbi->s_max_batch_time = option;
+		case Opt_min_batch_time:
+			if (match_int(&args[0], &option))
+				return 0;
+			if (option < 0)
+				return 0;
+			sbi->s_min_batch_time = option;
 		case Opt_data_journal:
 			data_opt = EXT4_MOUNT_JOURNAL_DATA;
 			goto datacheck;
@@ -1955,6 +1981,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
+	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
+	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
+	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
 
 	set_opt(sbi->s_mount_opt, RESERVATION);
 	set_opt(sbi->s_mount_opt, BARRIER);
@@ -2484,11 +2513,9 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
-	if (sbi->s_commit_interval)
-		journal->j_commit_interval = sbi->s_commit_interval;
-	/* We could also set up an ext4-specific default for the commit
-	 * interval here, but for now we'll just fall back to the jbd
-	 * default. */
+	journal->j_commit_interval = sbi->s_commit_interval;
+	journal->j_min_batch_time = sbi->s_min_batch_time;
+	journal->j_max_batch_time = sbi->s_max_batch_time;
 
 	spin_lock(&journal->j_state_lock);
 	if (test_opt(sb, BARRIER))
@@ -2977,6 +3004,8 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	old_opts.s_resuid = sbi->s_resuid;
 	old_opts.s_resgid = sbi->s_resgid;
 	old_opts.s_commit_interval = sbi->s_commit_interval;
+	old_opts.s_min_batch_time = sbi->s_min_batch_time;
+	old_opts.s_max_batch_time = sbi->s_max_batch_time;
 #ifdef CONFIG_QUOTA
 	old_opts.s_jquota_fmt = sbi->s_jquota_fmt;
 	for (i = 0; i < MAXQUOTAS; i++)
@@ -3106,6 +3135,8 @@ restore_opts:
 	sbi->s_resuid = old_opts.s_resuid;
 	sbi->s_resgid = old_opts.s_resgid;
 	sbi->s_commit_interval = old_opts.s_commit_interval;
+	sbi->s_min_batch_time = old_opts.s_min_batch_time;
+	sbi->s_max_batch_time = old_opts.s_max_batch_time;
 #ifdef CONFIG_QUOTA
 	sbi->s_jquota_fmt = old_opts.s_jquota_fmt;
 	for (i = 0; i < MAXQUOTAS; i++) {
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e70d657..dfacd9e 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -961,6 +961,8 @@ static journal_t * journal_init_common (void)
 	spin_lock_init(&journal->j_state_lock);
 
 	journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
+	journal->j_min_batch_time = 0;
+	journal->j_max_batch_time = 15000; /* 15ms */
 
 	/* The journal is marked for error until we succeed with recovery! */
 	journal->j_flags = JBD2_ABORT;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 0e1fc34..e7de7ac 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1255,9 +1255,11 @@ int jbd2_journal_stop(handle_t *handle)
 		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
 						   transaction->t_start_time));
 
+		commit_time = max_t(u64, commit_time,
+				    1000*journal->j_min_batch_time);
 		commit_time = min_t(u64, commit_time,
-				    1000*jiffies_to_usecs(1));

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] ext4: add fsync batch tuning knobs
  2008-11-25 10:22       ` [PATCH] ext4: add fsync batch tuning knobs Theodore Tso
@ 2008-12-02 14:45         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 13+ messages in thread
From: Aneesh Kumar K.V @ 2008-12-02 14:45 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Ric Wheeler, Andreas Dilger, Josef Bacik, linux-ext4

On Tue, Nov 25, 2008 at 05:22:32AM -0500, Theodore Tso wrote:
....
...

>  	{Opt_bh, "bh"},
>  	{Opt_commit, "commit=%u"},
> +	{Opt_min_batch_time, "min_batch_time=%u"},
> +	{Opt_max_batch_time, "max_batch_time=%u"},
>  	{Opt_journal_update, "journal=update"},
>  	{Opt_journal_inum, "journal=%u"},
>  	{Opt_journal_dev, "journal_dev=%u"},
> @@ -1107,6 +1119,20 @@ static int parse_options(char *options, struct super_block *sb,
>  				option = JBD2_DEFAULT_MAX_COMMIT_AGE;
>  			sbi->s_commit_interval = HZ * option;
>  			break;
> +		case Opt_max_batch_time:
> +			if (match_int(&args[0], &option))
> +				return 0;
> +			if (option < 0)
> +				return 0;
> +			if (option == 0)
> +				option = EXT4_DEF_MAX_BATCH_TIME;
> +			sbi->s_max_batch_time = option;


We need a break here .


> +		case Opt_min_batch_time:
> +			if (match_int(&args[0], &option))
> +				return 0;
> +			if (option < 0)
> +				return 0;
> +			sbi->s_min_batch_time = option;

same here

>  		case Opt_data_journal:
>  			data_opt = EXT4_MOUNT_JOURNAL_DATA;
>  			goto datacheck;
> @@ -1955,6 +1981,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> 
>  	sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
>  	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
> +	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
> +	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
> +	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
> 
>  	set_opt(sbi->s_mount_opt, RESERVATION);
>  	set_opt(sbi->s_mount_opt, BARRIER);
....
..
The patch also needs documentation of new mount options

-aneesh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-05 23:10 ` Andreas Dilger
  2008-11-06  0:27   ` Theodore Tso
@ 2008-11-06 14:35   ` Josef Bacik
  1 sibling, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2008-11-06 14:35 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Josef Bacik, tytso, linux-ext4

On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
> On Nov 04, 2008  11:10 -0500, Josef Bacik wrote:
> > somebody does a sync write or an fsync() traditionally we would sleep for 1
> > jiffies, which depending on the value of HZ could be a significant amount of
> > time compared to how long it takes to commit a transaction to the underlying
> > storage.  With this patch instead of sleeping for a jiffie, we check to see if
> > the amount of time this transaction has been running is less than the average
> > commit time, and if it is we sleep for the delta using schedule_hrtimeout to
> > give us a higher precision sleep time.  This greatly benefits high end storage
> > where you could end up sleeping for longer than it takes to commit the
> > transaction and therefore sitting idle instead of allowing the transaction to
> > be committed by keeping the sleep time to a minimum so you are sure to always
> > be doing something.
> 
> There was no reply to my previous comments about making the maximum sleep
> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
> on the jiffies value, which may change between 1ms and 10ms.
> 
> I don't object to this being included in ext4, but I suspect it could do
> even better, or at least be more consistent than depending on the HZ value.
>

Sorry I missed that comment.  The reason I set the top to be 1 jiffies was
because thats what it used to be, so in an effort to try and keep regressions
from happening I just left it as the max.  I'm fine with setting the max to some
other more constant value.  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-04 16:10 [PATCH] imporve jbd2 fsync batching Josef Bacik
  2008-11-04 20:52 ` Theodore Tso
  2008-11-05 23:10 ` Andreas Dilger
@ 2008-11-25 10:23 ` Theodore Tso
  2008-11-25 22:41   ` Andreas Dilger
  2 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2008-11-25 10:23 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-ext4

On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> +	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> +
> +	/*
> +	 * weight the commit time higher than the average time so we don't
> +	 * react too strongly to vast changes in the commit time
> +	 */
> +	if (likely(journal->j_average_commit_time))
> +		journal->j_average_commit_time = (commit_time*3 +
> +				journal->j_average_commit_time) / 4;

Stupid question... if the goal is to not have the average commit time
not react too strongly to sudden and vast changes to the commit time,
would it be better to do this instead:

> +		journal->j_average_commit_time = (commit_time +
> +				journal->j_average_commit_time*3) / 4;


						    - Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-25 10:23 ` Theodore Tso
@ 2008-11-25 22:41   ` Andreas Dilger
  2008-11-26  5:10     ` Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2008-11-25 22:41 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Josef Bacik, linux-ext4

On Nov 25, 2008  05:23 -0500, Theodore Ts'o wrote:
> On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> > +	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> > +
> > +	/*
> > +	 * weight the commit time higher than the average time so we don't
> > +	 * react too strongly to vast changes in the commit time
> > +	 */
> > +	if (likely(journal->j_average_commit_time))
> > +		journal->j_average_commit_time = (commit_time*3 +
> > +				journal->j_average_commit_time) / 4;
> 
> Stupid question... if the goal is to not have the average commit time
> not react too strongly to sudden and vast changes to the commit time,
> would it be better to do this instead:
> 
> > +		journal->j_average_commit_time = (commit_time +
> > +				journal->j_average_commit_time*3) / 4;

Actually, yes.  That is my fault, I'd suggested the original change to
Josef.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-25 22:41   ` Andreas Dilger
@ 2008-11-26  5:10     ` Theodore Tso
  2008-11-26 13:18       ` Josef Bacik
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2008-11-26  5:10 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Josef Bacik, linux-ext4

On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote:
> > Stupid question... if the goal is to not have the average commit time
> > not react too strongly to sudden and vast changes to the commit time,
> > would it be better to do this instead:
> > 
> > > +		journal->j_average_commit_time = (commit_time +
> > > +				journal->j_average_commit_time*3) / 4;
> 
> Actually, yes.  That is my fault, I'd suggested the original change to
> Josef.

BTW, I've added a patch to display the average commit time it does
vary wildly, especially on a laptop hard drive.  While the system is
idle, and the occasional commits need to wait for the hard drive to
wake up, leads to a average commit time of around 80-140ms.  If the
disk is just getting lightly tickled, such that it never has a chance
to go to sleep, the average commit time can drop down to around
20-25ms.  If the hard drive is really busy, then the average commit
time can go up to 40-50ms.

Increasing the weight as described below will slow down the move to
these long-term averages, but at least for laptop or Green Star drives
with power savings enabled, the average commit time does seem to vary
in some non-inintuitive ways.  Of course, if we are capping the max
transaction time at 15ms, most of this will never be visible, but it
would probably be interesting to test out this patch on a fast SSD or
an expensive enterprise array to see whether are similar surprising
variations in the average commit time.

						- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] imporve jbd2 fsync batching
  2008-11-26  5:10     ` Theodore Tso
@ 2008-11-26 13:18       ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2008-11-26 13:18 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andreas Dilger, Josef Bacik, linux-ext4

On Wed, Nov 26, 2008 at 12:10:57AM -0500, Theodore Tso wrote:
> On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote:
> > > Stupid question... if the goal is to not have the average commit time
> > > not react too strongly to sudden and vast changes to the commit time,
> > > would it be better to do this instead:
> > > 
> > > > +		journal->j_average_commit_time = (commit_time +
> > > > +				journal->j_average_commit_time*3) / 4;
> > 
> > Actually, yes.  That is my fault, I'd suggested the original change to
> > Josef.
> 
> BTW, I've added a patch to display the average commit time it does
> vary wildly, especially on a laptop hard drive.  While the system is
> idle, and the occasional commits need to wait for the hard drive to
> wake up, leads to a average commit time of around 80-140ms.  If the
> disk is just getting lightly tickled, such that it never has a chance
> to go to sleep, the average commit time can drop down to around
> 20-25ms.  If the hard drive is really busy, then the average commit
> time can go up to 40-50ms.
> 
> Increasing the weight as described below will slow down the move to
> these long-term averages, but at least for laptop or Green Star drives
> with power savings enabled, the average commit time does seem to vary
> in some non-inintuitive ways.  Of course, if we are capping the max
> transaction time at 15ms, most of this will never be visible, but it
> would probably be interesting to test out this patch on a fast SSD or
> an expensive enterprise array to see whether are similar surprising
> variations in the average commit time.
>

I was printing out the commit time when I was testing this and with high end
storage the commit time didn't vary all that much.  With my SATA drive on a
normal running box it didn't vary all that much either, it would drop down to
like 1ms because of syslog, but other than that it would ramp up when there was
load and it would slowly come back down as the load went away.  Also do you
want a new patch with that fix, or will you just fix it up?  Thanks,

Josef 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-12-02 14:45 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-04 16:10 [PATCH] imporve jbd2 fsync batching Josef Bacik
2008-11-04 20:52 ` Theodore Tso
2008-11-04 22:15   ` Leroy van Logchem
2008-11-05 23:10 ` Andreas Dilger
2008-11-06  0:27   ` Theodore Tso
2008-11-06 12:45     ` Ric Wheeler
2008-11-25 10:22       ` [PATCH] ext4: add fsync batch tuning knobs Theodore Tso
2008-12-02 14:45         ` Aneesh Kumar K.V
2008-11-06 14:35   ` [PATCH] imporve jbd2 fsync batching Josef Bacik
2008-11-25 10:23 ` Theodore Tso
2008-11-25 22:41   ` Andreas Dilger
2008-11-26  5:10     ` Theodore Tso
2008-11-26 13:18       ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).