linux-ext4.vger.kernel.org archive mirror
* transaction batching performance & multi-threaded synchronous writers
@ 2008-07-14 16:15 Ric Wheeler
  2008-07-14 16:58 ` Josef Bacik
  2008-07-15 18:39 ` Josef Bacik
  0 siblings, 2 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-07-14 16:15 UTC (permalink / raw)
  To: linux-ext4


Here is a pointer to the older patch & some results:

http://www.spinics.net/lists/linux-fsdevel/msg13121.html

I will retry this on some updated kernels, but would not expect to see a 
difference since the code has not been changed ;-)

ric



* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-14 16:15 transaction batching performance & multi-threaded synchronous writers Ric Wheeler
@ 2008-07-14 16:58 ` Josef Bacik
  2008-07-14 17:26   ` Ric Wheeler
  2008-07-15  7:58   ` Andreas Dilger
  2008-07-15 18:39 ` Josef Bacik
  1 sibling, 2 replies; 12+ messages in thread
From: Josef Bacik @ 2008-07-14 16:58 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-ext4

On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
> Here is a pointer to the older patch & some results:
>
> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>
> I will retry this on some updated kernels, but would not expect to see a 
> difference since the code has not been changed ;-)
>

I've been thinking: the problem with this for slower disks is that with the
patch I provided we're not really allowing multiple things to be batched, since
one thread will come up, do the sync and wait for the sync to finish.  In the
meantime the next thread will come up and do the log_wait_commit() in order to
let more threads join the transaction, but in the case of fs_mark with only 2
threads there won't be another one, since the original is waiting for the log to
commit.  So when the log finishes committing, thread 1 gets woken up to do its
thing, and thread 2 gets woken up as well; thread 2 does its commit and waits
for it to finish, and then thread 1 comes in and gets stuck in
log_wait_commit().  This essentially kills the optimization, which is why the
patch makes everything go better on faster disks: the faster disks don't need
the original optimization.
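
For reference, the existing batching logic in journal_stop() that we're
talking about looks like this today (fs/jbd/transaction.c):

	pid = current->pid;
	if (handle->h_sync && journal->j_last_sync_writer != pid) {
		journal->j_last_sync_writer = pid;
		do {
			old_handle_count = transaction->t_handle_count;
			schedule_timeout_uninterruptible(1);
		} while (old_handle_count != transaction->t_handle_count);
	}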

So this is what I was thinking about.  Perhaps we track the average time a
commit takes to occur, and then if the current transaction has been running
for less than the average commit time we sleep and wait for more things to
join the transaction, and then we commit.  How does that idea sound?  Thanks,

Josef


* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-14 16:58 ` Josef Bacik
@ 2008-07-14 17:26   ` Ric Wheeler
  2008-07-15  7:58   ` Andreas Dilger
  1 sibling, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-07-14 17:26 UTC (permalink / raw)
  To: Josef Bacik, jens.axboe; +Cc: linux-ext4

Josef Bacik wrote:
> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>   
>> Here is a pointer to the older patch & some results:
>>
>> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>>
>> I will retry this on some updated kernels, but would not expect to see a 
>> difference since the code has not been changed ;-)
>>
>>     
>
> I've been thinking: the problem with this for slower disks is that with the
> patch I provided we're not really allowing multiple things to be batched, since
> one thread will come up, do the sync and wait for the sync to finish.  In the
> meantime the next thread will come up and do the log_wait_commit() in order to
> let more threads join the transaction, but in the case of fs_mark with only 2
> threads there won't be another one, since the original is waiting for the log to
> commit.  So when the log finishes committing, thread 1 gets woken up to do its
> thing, and thread 2 gets woken up as well; thread 2 does its commit and waits
> for it to finish, and then thread 1 comes in and gets stuck in
> log_wait_commit().  This essentially kills the optimization, which is why the
> patch makes everything go better on faster disks: the faster disks don't need
> the original optimization.
>
> So this is what I was thinking about.  Perhaps we track the average time a
> commit takes to occur, and then if the current transaction start time is < than
> the avg commit time we sleep and wait for more things to join the transaction,
> and then we commit.  How does that idea sound?  Thanks,
>
> Josef
>   
I think that this is moving in the right direction. If you think about it,
we are basically trying to do the same kind of thing that the IO scheduler
does - anticipate future requests and plug the file-system-level queue for
a reasonable bit of time. The problem space is very similar - devices of
various speeds and a need to self-tune the batching dynamically.

It would be great to be able to share the approach (if not the actual 
code) ;-)

ric



* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-14 16:58 ` Josef Bacik
  2008-07-14 17:26   ` Ric Wheeler
@ 2008-07-15  7:58   ` Andreas Dilger
  2008-07-15 11:29     ` Ric Wheeler
  2008-07-15 12:51     ` Josef Bacik
  1 sibling, 2 replies; 12+ messages in thread
From: Andreas Dilger @ 2008-07-15  7:58 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Ric Wheeler, linux-ext4

On Jul 14, 2008  12:58 -0400, Josef Bacik wrote:
> Perhaps we track the average time a commit takes to occur, and then if
> the current transaction start time is < than the avg commit time we sleep
> and wait for more things to join the transaction, and then we commit.
> How does that idea sound?  Thanks,

The drawback of this approach is that if the thread waits an extra "average
transaction time" for the transaction to commit then this will increase the
average transaction time each time, and it still won't tell you if there
needs to be a wait at all.

What might be more interesting is tracking how many processes had sync
handles on the previous transaction(s); once that number of processes has
done its work, or the timeout is reached, the transaction is committed.

While this might seem like a hack for this particular benchmark, it will
also optimize real-world workloads like mail servers, NFS/file servers, and
HTTP servers, where the number of threads running at one time is generally
fixed.

The best way to do that would be to keep a field in the task struct to
track whether a given thread has participated in transaction "T" when
it starts a new handle, and if not then increment the "number of sync
threads on this transaction" counter.

In journal_stop(), if t_num_sync_thr >= the previous transaction's
num_sync_thr then the transaction can be committed right away, and if not
then it does a
wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).

While the number of sync threads is growing or constant the commits will
be rapid, and any "slow" threads will block on the next transaction and
increment its num_sync_thr until the thread count stabilizes (i.e. after a
small number of transactions at startup).  After that the wait will be
exactly as long as needed for each thread to participate.  If some threads
are too slow, or stop processing entirely, there will be a single sleep and
the next transaction will wait for fewer threads.
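
In rough code, the journal_stop() side of this might look something like
the following.  This is an untested sketch: t_num_sync_thr,
j_prev_num_sync_thr, j_wait_sync_thr and the last_sync_tid task-struct
field are all hypothetical and do not exist in the tree today.

	if (handle->h_sync) {
		spin_lock(&transaction->t_handle_lock);
		if (current->last_sync_tid != transaction->t_tid) {
			/* first sync handle from this task on this
			 * transaction (last_sync_tid is hypothetical) */
			current->last_sync_tid = transaction->t_tid;
			transaction->t_num_sync_thr++;
			wake_up(&journal->j_wait_sync_thr);
		}
		spin_unlock(&transaction->t_handle_lock);

		/* commit right away once the usual number of sync writers
		 * has joined, otherwise wait at most one jiffy for
		 * stragglers */
		wait_event_interruptible_timeout(journal->j_wait_sync_thr,
			transaction->t_num_sync_thr >=
					journal->j_prev_num_sync_thr, 1);
	}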


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15  7:58   ` Andreas Dilger
@ 2008-07-15 11:29     ` Ric Wheeler
  2008-07-15 12:51     ` Josef Bacik
  1 sibling, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-07-15 11:29 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Josef Bacik, linux-ext4

Andreas Dilger wrote:
> On Jul 14, 2008  12:58 -0400, Josef Bacik wrote:
>   
>> Perhaps we track the average time a commit takes to occur, and then if
>> the current transaction start time is < than the avg commit time we sleep
>> and wait for more things to join the transaction, and then we commit.
>> How does that idea sound?  Thanks,
>>     
>
> The drawback of this approach is that if the thread waits an extra "average
> transaction time" for the transaction to commit then this will increase the
> average transaction time each time, and it still won't tell you if there
> needs to be a wait at all.
>
> What might be more interesting is tracking how many processes had sync
> handles on the previous transaction(s), and once that number of processes
> have done that work, or the timeout reached, the transaction is committed.
>
> While this might seem like a hack for the particular benchmark, this
> will also optimize real-world workloads like mailserver, NFS/fileserver,
> http where the number of threads running at one time is generally fixed.
>
> The best way to do that would be to keep a field in the task struct to
> track whether a given thread has participated in transaction "T" when
> it starts a new handle, and if not then increment the "number of sync
> threads on this transaction" counter.
>
> In journal_stop() if t_num_sync_thr >= prev num_sync_thr then
> the transaction can be committed earlier, and if not then it does a
> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
>
> While the number of sync threads is growing or constant the commits will 
> be rapid, and any "slow" threads will block on the next transaction and
> increment its num_sync_thr until the thread count stabilizes (i.e. a small
> number of transactions at startup).  After that the wait will be exactly
> as long as needed for each thread to participate.  If some threads are
> too slow, or stop processing then there will be a single sleep and the
> next transaction will wait for fewer threads the next time.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>   
This really sounds like one of those math problems (queueing theory?) that
I was never able to completely wrap my head around back at university, but
the basic inputs that we have are:

    (1) the average time it takes to complete an independent transaction.
This will be different for each target device and will possibly change over
time (a specific odd case is a shared disk, like an array).
    (2) the average cost of adding "one more" thread to a transaction. I
think the assumption is that this cost is close to zero.
    (3) the rate of arrival of threads trying to join a transaction.
    (4) some knowledge of the history of which threads did the past
transactions. It is quite reasonable to never wait if a single thread is
the author of the last (or most of the last) sequence, which is what the
current heuristic gets right.
    (5) the minimum time we can effectively wait with a given mechanism
(4ms or 1ms, for example, depending on HZ in the code today).

I think the trick here is to try and get a heuristic that works without 
going nuts in complexity.

The obvious thing we need to keep is the heuristic to not wait if we
detect a single-thread workload.

It would also seem reasonable not to wait if the latency of the device (1)
is lower than the minimum time the chosen mechanism can wait (5). For
example, if transactions complete in microseconds, as on a ramdisk, just
blast away ;-)
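
Concretely, those two "don't wait" checks might look something like this
(an untested sketch; j_last_sync_writer exists today, but
j_average_commit_time is a hypothetical field here):

	/* Returns 0 when batching can't win: a single sync writer, or a
	 * device that commits faster than our 1-jiffy sleep granularity. */
	static int worth_waiting(journal_t *journal)
	{
		if (journal->j_last_sync_writer == current->pid)
			return 0;
		if (journal->j_average_commit_time < 1)
			return 0;
		return 1;
	}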

What would be left is figuring out whether the arrival rate (3) predicts
that a new thread will come along before we could finish the current
transaction without waiting.

Does this make any sense? It is close to the idea that Josef proposed
above; we would just tweak his proposal to avoid sleeping in the
single-threaded case.

Ric






* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15  7:58   ` Andreas Dilger
  2008-07-15 11:29     ` Ric Wheeler
@ 2008-07-15 12:51     ` Josef Bacik
  2008-07-15 14:05       ` Josef Bacik
  2008-07-15 14:22       ` Ric Wheeler
  1 sibling, 2 replies; 12+ messages in thread
From: Josef Bacik @ 2008-07-15 12:51 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Josef Bacik, Ric Wheeler, linux-ext4

On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
> On Jul 14, 2008  12:58 -0400, Josef Bacik wrote:
> > Perhaps we track the average time a commit takes to occur, and then if
> > the current transaction start time is < than the avg commit time we sleep
> > and wait for more things to join the transaction, and then we commit.
> > How does that idea sound?  Thanks,
> 
> The drawback of this approach is that if the thread waits an extra "average
> transaction time" for the transaction to commit then this will increase the
> average transaction time each time, and it still won't tell you if there
> needs to be a wait at all.
>

I'm not talking about the average transaction lifetime; as you say, that
would be highly dependent on random things that have nothing to do with the
commit time (waiting for locks and such).  I'm measuring the time it takes
for the actual commit to take place: I record the start time in
journal_commit_transaction() when we set running_transaction = NULL, and
the end time right before the wakeup() at the end of
journal_commit_transaction(), so there is an idea of how long committing a
transaction to disk takes.  If we only have two threads doing work and
fsyncing, it's going to be a roughly constant time, because we'll only be
writing a certain number of buffers each time.
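
Roughly, the measurement points would be (a sketch only;
j_average_commit_time would be a new journal field, and jiffies wraparound
handling is omitted here):

	/* when the commit actually starts */
	journal->j_running_transaction = NULL;
	start_time = jiffies;

	/* ... the commit itself ... */

	/* right before the final wakeup: fold this commit's duration
	 * into a running average */
	commit_time = jiffies - start_time;
	journal->j_average_commit_time =
			(commit_time + journal->j_average_commit_time) / 2;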
 
> What might be more interesting is tracking how many processes had sync
> handles on the previous transaction(s), and once that number of processes
> have done that work, or the timeout reached, the transaction is committed.
> 
> While this might seem like a hack for the particular benchmark, this
> will also optimize real-world workloads like mailserver, NFS/fileserver,
> http where the number of threads running at one time is generally fixed.
> 
> The best way to do that would be to keep a field in the task struct to
> track whether a given thread has participated in transaction "T" when
> it starts a new handle, and if not then increment the "number of sync
> threads on this transaction" counter.
> 
> In journal_stop() if t_num_sync_thr >= prev num_sync_thr then
> the transaction can be committed earlier, and if not then it does a
> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
> 
> While the number of sync threads is growing or constant the commits will 
> be rapid, and any "slow" threads will block on the next transaction and
> increment its num_sync_thr until the thread count stabilizes (i.e. a small
> number of transactions at startup).  After that the wait will be exactly
> as long as needed for each thread to participate.  If some threads are
> too slow, or stop processing then there will be a single sleep and the
> next transaction will wait for fewer threads the next time.
>

This idea is good, but I'm wondering about the normal user's use case, i.e.
where syslog is the only thing that does fsync.  If we get into a position
where prev_num_sync_thr is always 1, we'll just bypass sleeping and waiting
for other stuff to join the transaction, and sync whenever syslog pleases.
That will likely affect most normal users, especially if they are like me
and have crappy wireless cards that like to spit stuff into syslog
constantly ;).  Thanks much,

Josef 


* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 12:51     ` Josef Bacik
@ 2008-07-15 14:05       ` Josef Bacik
  2008-07-15 14:22       ` Ric Wheeler
  1 sibling, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2008-07-15 14:05 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Andreas Dilger, Ric Wheeler, linux-ext4

On Tue, Jul 15, 2008 at 08:51:27AM -0400, Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
> > On Jul 14, 2008  12:58 -0400, Josef Bacik wrote:
> > > Perhaps we track the average time a commit takes to occur, and then if
> > > the current transaction start time is < than the avg commit time we sleep
> > > and wait for more things to join the transaction, and then we commit.
> > > How does that idea sound?  Thanks,
> > 
> > The drawback of this approach is that if the thread waits an extra "average
> > transaction time" for the transaction to commit then this will increase the
> > average transaction time each time, and it still won't tell you if there
> > needs to be a wait at all.
> >
> 
> I'm not talking about the average transaction life, as you say it would be
> highly dependant on random things that have nothing to do with the transaction
> time (waiting for locks and such).  I'm measuring the time it takes for the
> actual commit to take place, so I record the start time in
> journal_commit_transaction when we set running_transaction = NULL, and then the
> end time right before the wakeup() at the end of journal_commit_transaction,
> that way there is an idea of how long the committing of a transaction to disk
> happens.  If we only have two threads doing work and fsyncing, its going to be a
> constant time, because we'll only be writing a certain number of buffers each
> time.
>  

Hmm, now that I think about it you are right: if the thread waits, then the
time the transaction has been running could be greater than the commit
time, and this wouldn't help.  Though one would hope that dirtying
pagecache is still several times faster than writing to disk, so a small
amount of waiting wouldn't be too big a deal, and waiting too long would
hurt in our current case anyway.  Thanks,

Josef


* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 12:51     ` Josef Bacik
  2008-07-15 14:05       ` Josef Bacik
@ 2008-07-15 14:22       ` Ric Wheeler
  1 sibling, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-07-15 14:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Andreas Dilger, linux-ext4

Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
>   
>> On Jul 14, 2008  12:58 -0400, Josef Bacik wrote:
>>     
>>> Perhaps we track the average time a commit takes to occur, and then if
>>> the current transaction start time is < than the avg commit time we sleep
>>> and wait for more things to join the transaction, and then we commit.
>>> How does that idea sound?  Thanks,
>>>       
>> The drawback of this approach is that if the thread waits an extra "average
>> transaction time" for the transaction to commit then this will increase the
>> average transaction time each time, and it still won't tell you if there
>> needs to be a wait at all.
>>
>>     
>
> I'm not talking about the average transaction life, as you say it would be
> highly dependant on random things that have nothing to do with the transaction
> time (waiting for locks and such).  I'm measuring the time it takes for the
> actual commit to take place, so I record the start time in
> journal_commit_transaction when we set running_transaction = NULL, and then the
> end time right before the wakeup() at the end of journal_commit_transaction,
> that way there is an idea of how long the committing of a transaction to disk
> happens.  If we only have two threads doing work and fsyncing, its going to be a
> constant time, because we'll only be writing a certain number of buffers each
> time.
>   

I think that this is exactly the interesting measurement to capture. It
will change over time (depending on how loaded the target device is, etc.).

The single-thread case that is special-cased is also an important one to
handle, since it should be fairly common.
>  
>   
>> What might be more interesting is tracking how many processes had sync
>> handles on the previous transaction(s), and once that number of processes
>> have done that work, or the timeout reached, the transaction is committed.
>>
>> While this might seem like a hack for the particular benchmark, this
>> will also optimize real-world workloads like mailserver, NFS/fileserver,
>> http where the number of threads running at one time is generally fixed.
>>
>> The best way to do that would be to keep a field in the task struct to
>> track whether a given thread has participated in transaction "T" when
>> it starts a new handle, and if not then increment the "number of sync
>> threads on this transaction" counter.
>>
>> In journal_stop() if t_num_sync_thr >= prev num_sync_thr then
>> the transaction can be committed earlier, and if not then it does a
>> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
>>
>> While the number of sync threads is growing or constant the commits will 
>> be rapid, and any "slow" threads will block on the next transaction and
>> increment its num_sync_thr until the thread count stabilizes (i.e. a small
>> number of transactions at startup).  After that the wait will be exactly
>> as long as needed for each thread to participate.  If some threads are
>> too slow, or stop processing then there will be a single sleep and the
>> next transaction will wait for fewer threads the next time.
>>
>>     
>
> This idea is good, but I'm wondering about the normal user use case, ie where
> syslog is the only thing that does fsync.  If we get into a position where
> prev_num_sync_thr is always 1, we'll just bypass sleeping and waiting for other
> stuff to join the transaction and sync whenever syslog pleases, which will
> likely affect most normal users, especially if they are like me and have crappy
> wireless cards that like to spit stuff into syslog constantly ;).  Thanks much,
>
> Josef 
>   

I think this syslog case is just a common example of the same thread doing
a sequence of I/Os.

ric



* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-14 16:15 transaction batching performance & multi-threaded synchronous writers Ric Wheeler
  2008-07-14 16:58 ` Josef Bacik
@ 2008-07-15 18:39 ` Josef Bacik
  2008-07-15 20:10   ` Josef Bacik
  1 sibling, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2008-07-15 18:39 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-ext4, adilger

On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
> Here is a pointer to the older patch & some results:
>
> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>
> I will retry this on some updated kernels, but would not expect to see a 
> difference since the code has not been changed ;-)
>

OK, here are the numbers with the original idea I had proposed:

type	threads		base	patch	speedup
sata	1		17.9	17.3	0.97
sata	2		33.2	34.2	1.03
sata	4		58.4	63.6	1.09
sata	8		78.8	80.8	1.03
sata	16		94.4	97.6	1.03

ram	1		2394.4	1878.0	0.78
ram	2		989.6	2041.1	2.06
ram	4		1466.1	3201.8	2.18
ram	8		1858.1	3362.8	1.81
ram	16		3008.0	3227.7	1.07

I've got to find a fast disk array to test this with, but the ramdisk
results make me happy, though they were kind of erratic, so I think the
fast disk array will be a more stable measure of how well this patch does.
In any case, it definitely doesn't hurt the slow case, and it brings
stability to the fast case.  Thanks much,

Josef


Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -273,6 +273,15 @@ write_out_data:
 	journal_do_submit_data(wbuf, bufs);
 }
 
+static inline unsigned long elapsed_jiffies(unsigned long start,
+					    unsigned long end)
+{
+	if (end >= start)
+		return end - start;
+
+	return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
 /*
  * journal_commit_transaction
  *
@@ -288,6 +297,7 @@ void journal_commit_transaction(journal_
 	int flags;
 	int err;
 	unsigned long blocknr;
+	unsigned long long commit_time, start_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -400,6 +410,7 @@ void journal_commit_transaction(journal_
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = jiffies;
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -873,6 +884,12 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = elapsed_jiffies(start_time, jiffies);
+	if (unlikely(!journal->j_average_commit_time))
+		journal->j_average_commit_time = commit_time;
+	else
+		journal->j_average_commit_time = (commit_time +
+					journal->j_average_commit_time) / 2;
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -49,6 +49,7 @@ get_transaction(journal_t *journal, tran
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = jiffies;
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1361,8 +1362,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
-	pid_t pid;
+	int err;
 
 	J_ASSERT(journal_current_handle() == handle);
 
@@ -1395,13 +1395,17 @@ int journal_stop(handle_t *handle)
 	 * single process is doing a stream of sync writes.  No point in waiting
 	 * for joiners in that case.
 	 */
-	pid = current->pid;
-	if (handle->h_sync && journal->j_last_sync_writer != pid) {
-		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
+	if (handle->h_sync) {
+		unsigned long commit_time;
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		while (time_before(jiffies, commit_time +
+				   transaction->t_start_time)) {
 			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+		}
 	}
 
 	current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
 	unsigned long		t_expires;
 
 	/*
+	 * When this transaction started, in jiffies [no locking]
+	 */
+	unsigned long		t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
 	pid_t			j_last_sync_writer;
 
+	unsigned long long	j_average_commit_time;
+
 	/*
 	 * An opaque pointer to fs-private information.  ext3 puts its
 	 * superblock pointer here


* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 18:39 ` Josef Bacik
@ 2008-07-15 20:10   ` Josef Bacik
  2008-07-15 20:43     ` Josef Bacik
  0 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2008-07-15 20:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Ric Wheeler, linux-ext4, adilger

On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
> >
> > Here is a pointer to the older patch & some results:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg13121.html
> >
> > I will retry this on some updated kernels, but would not expect to see a 
> > difference since the code has not been changed ;-)
> >
> 
> Ok here are the numbers with the original idea I had proposed.
> 
> type	threads		base	patch	speedup
> sata	1		17.9	17.3	0.97
> sata	2		33.2	34.2	1.03
> sata	4		58.4	63.6	1.09
> sata	8		78.8	80.8	1.03
> sata	16		94.4	97.6	1.16
> 
> ram	1		2394.4	1878.0	0.78
> ram	2		989.6	2041.1	2.06
> ram	4		1466.1	3201.8	2.18
> ram	8		1858.1	3362.8	1.81
> ram	16		3008.0	3227.7	1.07
> 
> I've got to find a fast disk array to test this with, but the ramdisk results
> make me happy, though they were kind of irratic, so I think the fast disk array
> will be a more stable measure of how well this patch does, but it definitely
> doesn't hurt the slow case, and brings stability to the fast case.  Thanks much,
>

Hmm, after talking with Ric, I should just leave the single-thread special
case alone.  That removes the slight speed regression seen above.  Thanks,

Josef


Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -273,6 +273,15 @@ write_out_data:
 	journal_do_submit_data(wbuf, bufs);
 }
 
+static inline unsigned long elapsed_jiffies(unsigned long start,
+					    unsigned long end)
+{
+	if (end >= start)
+		return end - start;
+
+	return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
 /*
  * journal_commit_transaction
  *
@@ -288,6 +297,7 @@ void journal_commit_transaction(journal_
 	int flags;
 	int err;
 	unsigned long blocknr;
+	unsigned long long commit_time, start_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -400,6 +410,7 @@ void journal_commit_transaction(journal_
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = jiffies;
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -873,6 +884,12 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = elapsed_jiffies(start_time, jiffies);
+	if (unlikely(!journal->j_average_commit_time))
+		journal->j_average_commit_time = commit_time;
+	else
+		journal->j_average_commit_time = (commit_time +
+					journal->j_average_commit_time) / 2;
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -49,6 +49,7 @@ get_transaction(journal_t *journal, tran
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = jiffies;
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1361,7 +1362,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
+	int err;
 	pid_t pid;
 
 	J_ASSERT(journal_current_handle() == handle);
@@ -1397,11 +1398,17 @@ int journal_stop(handle_t *handle)
 	 */
 	pid = current->pid;
 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		unsigned long commit_time;
+
 		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		while (time_before(jiffies, commit_time +
+				   transaction->t_start_time))
 			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
 	}
 
 	current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
 	unsigned long		t_expires;
 
 	/*
+	 * When this transaction started, in jiffies [no locking]
+	 */
+	unsigned long		t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
 	pid_t			j_last_sync_writer;
 
+	unsigned long long	j_average_commit_time;
+
 	/*
 	 * An opaque pointer to fs-private information.  ext3 puts its
 	 * superblock pointer here


* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 20:10   ` Josef Bacik
@ 2008-07-15 20:43     ` Josef Bacik
  2008-07-15 22:33       ` Ric Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2008-07-15 20:43 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Ric Wheeler, linux-ext4, adilger

On Tue, Jul 15, 2008 at 04:10:10PM -0400, Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
> > On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
> > >
> > > Here is a pointer to the older patch & some results:
> > >
> > > http://www.spinics.net/lists/linux-fsdevel/msg13121.html
> > >
> > > I will retry this on some updated kernels, but would not expect to see a 
> > > difference since the code has not been changed ;-)
> > >
> > 
> > Ok here are the numbers with the original idea I had proposed.
> > 
> > type	threads		base	patch	speedup
> > sata	1		17.9	17.3	0.97
> > sata	2		33.2	34.2	1.03
> > sata	4		58.4	63.6	1.09
> > sata	8		78.8	80.8	1.03
> > sata	16		94.4	97.6	1.16
> > 
> > ram	1		2394.4	1878.0	0.78
> > ram	2		989.6	2041.1	2.06
> > ram	4		1466.1	3201.8	2.18
> > ram	8		1858.1	3362.8	1.81
> > ram	16		3008.0	3227.7	1.07
> > 
> > I've got to find a fast disk array to test this with, but the ramdisk results
> > make me happy, though they were kind of irratic, so I think the fast disk array
> > will be a more stable measure of how well this patch does, but it definitely
> > doesn't hurt the slow case, and brings stability to the fast case.  Thanks much,
> >
> 
> Hmm talking with ric I should just leave the single thread stuff alone.  That
> removes the slight speed regression seen above.  Thanks,
>

Here are the results with the single-thread special case put back in, and
with HZ=250 instead of HZ=1000 as before:

type	threads		base	patch
sata	1		21.8	21.6
sata	2		26.2	34.6
sata	4		48.0	58.0
sata	8		70.4	75.2
sata	16		89.6	101.1

ram	1		2505.4	2422.0
ram	2		463.8	3462.3
ram	4		330.4	3653.9
ram	8		995.1	3592.4
ram	16		1335.2	3806.5

Thanks,

Josef 


* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 20:43     ` Josef Bacik
@ 2008-07-15 22:33       ` Ric Wheeler
  0 siblings, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-07-15 22:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-ext4, adilger

Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 04:10:10PM -0400, Josef Bacik wrote:
>   
>> On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
>>     
>>> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>>>       
>>>> Here is a pointer to the older patch & some results:
>>>>
>>>> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>>>>
>>>> I will retry this on some updated kernels, but would not expect to see a 
>>>> difference since the code has not been changed ;-)
>>>>
>>>>         
>>> Ok here are the numbers with the original idea I had proposed.
>>>
>>> type	threads		base	patch	speedup
>>> sata	1		17.9	17.3	0.97
>>> sata	2		33.2	34.2	1.03
>>> sata	4		58.4	63.6	1.09
>>> sata	8		78.8	80.8	1.03
>>> sata	16		94.4	97.6	1.16
>>>
>>> ram	1		2394.4	1878.0	0.78
>>> ram	2		989.6	2041.1	2.06
>>> ram	4		1466.1	3201.8	2.18
>>> ram	8		1858.1	3362.8	1.81
>>> ram	16		3008.0	3227.7	1.07
>>>
>>> I've got to find a fast disk array to test this with, but the ramdisk results
>>> make me happy, though they were kind of irratic, so I think the fast disk array
>>> will be a more stable measure of how well this patch does, but it definitely
>>> doesn't hurt the slow case, and brings stability to the fast case.  Thanks much,
>>>
>>>       
>> Hmm talking with ric I should just leave the single thread stuff alone.  That
>> removes the slight speed regression seen above.  Thanks,
>>
>>     
>
> Here are the results with the single thread stuff put back in and with 250HZ
> instead of 1000HZ from before
>
> type	threads		base	patch
> sata	1		21.8	21.6
> sata	2		26.2	34.6
> sata	4		48.0	58.0
> sata	8		70.4	75.2
> sata	16		89.6	101.1
>
> ram	1		2505.4	2422.0
> ram	2		463.8	3462.3
> ram	4		330.4	3653.9
> ram	8		995.1	3592.4
> ram	16		1335.2	3806.5
>
> Thanks,
>
> Josef 
>   
These numbers are pretty impressive - we need to get a run on an
array-backed file system as well to round out the picture, and possibly on
an SSD (anyone out there have one to play with?).

ric


