* transaction batching performance & multi-threaded synchronous writers @ 2008-07-14 16:15 Ric Wheeler 2008-07-14 16:58 ` Josef Bacik 2008-07-15 18:39 ` Josef Bacik 0 siblings, 2 replies; 12+ messages in thread From: Ric Wheeler @ 2008-07-14 16:15 UTC (permalink / raw) To: linux-ext4 Here is a pointer to the older patch & some results: http://www.spinics.net/lists/linux-fsdevel/msg13121.html I will retry this on some updated kernels, but would not expect to see a difference since the code has not been changed ;-) ric ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-14 16:15 transaction batching performance & multi-threaded synchronous writers Ric Wheeler @ 2008-07-14 16:58 ` Josef Bacik 2008-07-14 17:26 ` Ric Wheeler 2008-07-15 7:58 ` Andreas Dilger 2008-07-15 18:39 ` Josef Bacik 1 sibling, 2 replies; 12+ messages in thread From: Josef Bacik @ 2008-07-14 16:58 UTC (permalink / raw) To: Ric Wheeler; +Cc: linux-ext4

On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
> Here is a pointer to the older patch & some results:
>
> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>
> I will retry this on some updated kernels, but would not expect to see a
> difference since the code has not been changed ;-)
>

I've been thinking: the problem with this for slower disks is that with the patch I provided we're not really allowing multiple things to be batched, since one thread will come up, do the sync, and wait for the sync to finish. In the meantime the next thread will come up and do the log_wait_commit() in order to let more threads join the transaction, but in the case of fs_mark with only 2 threads there won't be another one, since the original is waiting for the log to commit. So when the log finishes committing, thread 1 gets woken up to do its thing and thread 2 gets woken up as well; thread 1 does its commit and waits for it to finish, while thread 2 comes in and gets stuck in log_wait_commit(). So this essentially kills the optimization, which is why everything goes better on faster disks: the faster disks don't need the original optimization.

So this is what I was thinking about: perhaps we track the average time a commit takes to occur, and then, if the time since the current transaction started is less than the average commit time, we sleep and wait for more things to join the transaction, and then we commit. How does that idea sound? Thanks,

Josef

^ permalink raw reply [flat|nested] 12+ messages in thread
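[Editor's note] Josef's proposed heuristic can be sketched in plain C. This is a userspace sketch only, not jbd code: the half-weighted running average mirrors the update his later patch in this thread uses, and the names `update_avg_commit_time` and `should_wait_for_joiners` are illustrative inventions.

```c
#include <assert.h>

/* Half-weighted running average of commit time, seeded by the first
 * sample; a cheap approximation rather than a true mean. */
static unsigned long update_avg_commit_time(unsigned long avg,
                                            unsigned long commit_time)
{
	if (avg == 0)
		return commit_time;		/* first sample */
	return (commit_time + avg) / 2;
}

/* A sync writer sleeps to let more handles join only while the
 * transaction is younger than one average commit. */
static int should_wait_for_joiners(unsigned long now,
                                   unsigned long t_start_time,
                                   unsigned long avg_commit_time)
{
	return now - t_start_time < avg_commit_time;
}
```

On a slow disk the average commit time is large, so writers keep batching; on a ramdisk it approaches zero and the wait disappears, which is the behavior the fast-device numbers later in the thread are after.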
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-14 16:58 ` Josef Bacik @ 2008-07-14 17:26 ` Ric Wheeler 2008-07-15 7:58 ` Andreas Dilger 1 sibling, 0 replies; 12+ messages in thread From: Ric Wheeler @ 2008-07-14 17:26 UTC (permalink / raw) To: Josef Bacik, jens.axboe; +Cc: linux-ext4 Josef Bacik wrote: > On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote: > >> Here is a pointer to the older patch & some results: >> >> http://www.spinics.net/lists/linux-fsdevel/msg13121.html >> >> I will retry this on some updated kernels, but would not expect to see a >> difference since the code has not been changed ;-) >> >> > > I've been thinking, the problem with this for slower disks is that with the > patch I provided we're not really allowing multiple things to be batched, since > one thread will come up, do the sync and wait for the sync to finish. In the > meantime the next thread will come up and do the log_wait_commit() in order to > let more threads join the transaction, but in the case of fs_mark with only 2 > threads there won't be another one, since the original is waiting for the log to > commit. So when the log finishes committing, thread 1 gets woken up to do its > thing, and thread 2 gets woken up as well, it does its commit and waits for it > to finish, and thread 2 comes in and gets stuck in log_wait_commit(). So this > essentially kills the optimization, which is why on faster disks this makes > everything go better, as the faster disks don't need the original optimization. > > So this is what I was thinking about. Perhaps we track the average time a > commit takes to occur, and then if the current transaction start time is < than > the avg commit time we sleep and wait for more things to join the transaction, > and then we commit. How does that idea sound? Thanks, > > Josef > I think that this is moving in the right direction. 
If you think about this, we are basically trying to do the same kind of thing that the IO scheduler does - anticipate future requests and plug the file system level queue for a reasonable bit of time. The problem space is very similar - various speed devices and a need to self tune the batching dynamically. It would be great to be able to share the approach (if not the actual code) ;-) ric ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-14 16:58 ` Josef Bacik 2008-07-14 17:26 ` Ric Wheeler @ 2008-07-15 7:58 ` Andreas Dilger 2008-07-15 11:29 ` Ric Wheeler 2008-07-15 12:51 ` Josef Bacik 1 sibling, 2 replies; 12+ messages in thread From: Andreas Dilger @ 2008-07-15 7:58 UTC (permalink / raw) To: Josef Bacik; +Cc: Ric Wheeler, linux-ext4

On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
> Perhaps we track the average time a commit takes to occur, and then if
> the current transaction start time is < than the avg commit time we sleep
> and wait for more things to join the transaction, and then we commit.
> How does that idea sound? Thanks,

The drawback of this approach is that if the thread waits an extra "average transaction time" for the transaction to commit then this will increase the average transaction time each time, and it still won't tell you if there needs to be a wait at all.

What might be more interesting is tracking how many processes had sync handles on the previous transaction(s); once that many processes have done their work, or the timeout is reached, the transaction is committed.

While this might seem like a hack for the particular benchmark, this will also optimize real-world workloads like mailserver, NFS/fileserver, http, where the number of threads running at one time is generally fixed.

The best way to do that would be to keep a field in the task struct to track whether a given thread has participated in transaction "T" when it starts a new handle, and if not then increment the "number of sync threads on this transaction" counter.

In journal_stop(), if t_num_sync_thr >= prev_num_sync_thr then the transaction can be committed earlier, and if not then it does a wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
While the number of sync threads is growing or constant the commits will be rapid, and any "slow" threads will block on the next transaction and increment its num_sync_thr until the thread count stabilizes (i.e. a small number of transactions at startup). After that the wait will be exactly as long as needed for each thread to participate. If some threads are too slow, or stop processing then there will be a single sleep and the next transaction will wait for fewer threads the next time. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 12+ messages in thread
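[Editor's note] Andreas's counter scheme can be sketched as follows. This is an illustrative userspace model — `struct txn_stats` and the function names are invented here, and the task-struct "already counted" flag he describes is assumed, not shown.

```c
#include <assert.h>

/* Per-transaction bookkeeping for the sync-thread-count heuristic. */
struct txn_stats {
	int cur_num_sync_thr;	/* sync threads seen on this transaction */
	int prev_num_sync_thr;	/* count carried over from the last one  */
};

/* Called the first time a thread opens a sync handle on this
 * transaction (a per-task flag would guard against double counting). */
static void note_sync_thread(struct txn_stats *s)
{
	s->cur_num_sync_thr++;
}

/* journal_stop() may commit early once as many sync threads have
 * joined as participated last time; otherwise it sleeps briefly. */
static int can_commit_early(const struct txn_stats *s)
{
	return s->cur_num_sync_thr >= s->prev_num_sync_thr;
}

/* At commit, the next transaction inherits the count as its target,
 * so the wait self-tunes as the thread population changes. */
static void rollover(struct txn_stats *s)
{
	s->prev_num_sync_thr = s->cur_num_sync_thr;
	s->cur_num_sync_thr = 0;
}
```

If threads stall or exit, one transaction pays a single short sleep and the next one waits for fewer threads, which is the self-correcting behavior described above.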
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-15 7:58 ` Andreas Dilger @ 2008-07-15 11:29 ` Ric Wheeler 2008-07-15 12:51 ` Josef Bacik 1 sibling, 0 replies; 12+ messages in thread From: Ric Wheeler @ 2008-07-15 11:29 UTC (permalink / raw) To: Andreas Dilger; +Cc: Josef Bacik, linux-ext4 Andreas Dilger wrote: > On Jul 14, 2008 12:58 -0400, Josef Bacik wrote: > >> Perhaps we track the average time a commit takes to occur, and then if >> the current transaction start time is < than the avg commit time we sleep >> and wait for more things to join the transaction, and then we commit. >> How does that idea sound? Thanks, >> > > The drawback of this approach is that if the thread waits an extra "average > transaction time" for the transaction to commit then this will increase the > average transaction time each time, and it still won't tell you if there > needs to be a wait at all. > > What might be more interesting is tracking how many processes had sync > handles on the previous transaction(s), and once that number of processes > have done that work, or the timeout reached, the transaction is committed. > > While this might seem like a hack for the particular benchmark, this > will also optimize real-world workloads like mailserver, NFS/fileserver, > http where the number of threads running at one time is generally fixed. > > The best way to do that would be to keep a field in the task struct to > track whether a given thread has participated in transaction "T" when > it starts a new handle, and if not then increment the "number of sync > threads on this transaction" counter. > > In journal_stop() if t_num_sync_thr >= prev num_sync_thr then > the transaction can be committed earlier, and if not then it does a > wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1). 
> >
> > While the number of sync threads is growing or constant the commits will
> > be rapid, and any "slow" threads will block on the next transaction and
> > increment its num_sync_thr until the thread count stabilizes (i.e. a small
> > number of transactions at startup). After that the wait will be exactly
> > as long as needed for each thread to participate. If some threads are
> > too slow, or stop processing then there will be a single sleep and the
> > next transaction will wait for fewer threads the next time.
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> >

This really sounds like one of those math problems (queueing theory?) that I was never able to completely wrap my head around back at university, but the basic things that we have are:

(1) the average time it takes to complete an independent transaction. This will be different for each target device and will possibly change over time (a specific odd case is a shared disk, like an array).

(2) the average cost of adding "one more" thread to a transaction. I think the assumption is that this cost is close to zero.

(3) the rate of arrival of threads trying to join a transaction.

(4) some knowledge of the history of which threads did the past transactions. It is quite reasonable to never wait if a single thread is the author of the last (or most of the last?) sequence, which is the good thing in there now.

(5) the minimum time we can effectively wait with a given mechanism (4ms or 1ms, for example, depending on HZ in the code today).

I think the trick here is to try to get a heuristic that works without going nuts in complexity. The obvious thing we need to keep is the heuristic to not wait if we detect a single-thread workload. It would seem reasonable not to wait if the latency of the device (1 above) is lower than the time the chosen mechanism can wait (5).
For example, if transactions are done in microseconds, as on a ramdisk, just blast away ;-)

What would be left is the need to figure out whether the arrival rate (3) predicts that a new thread will come along before we could finish the current transaction without waiting. Does this make any sense?

This sounds close to the idea that Josef proposed above; we would just tweak his proposal to avoid sleeping in the single-threaded case.

Ric

^ permalink raw reply [flat|nested] 12+ messages in thread
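[Editor's note] Ric's points (1), (4) and (5) combine into a simple gate. A sketch under the assumption that we can estimate device commit latency and detect the single-thread case; `worth_batching` and the microsecond units are invented for illustration.

```c
#include <assert.h>

/* Decide whether batching is worth a sleep at all: never wait for a
 * single-threaded writer, and never wait when the device commits
 * faster than the shortest sleep we can schedule (one tick: ~4ms at
 * 250HZ, ~1ms at 1000HZ). */
static int worth_batching(unsigned long avg_commit_us,
                          unsigned long min_sleep_us,
                          int single_threaded)
{
	if (single_threaded)
		return 0;	/* no joiners expected: commit now */
	if (avg_commit_us < min_sleep_us)
		return 0;	/* e.g. ramdisk: the sleep costs more
				 * than the batching could save */
	return 1;
}
```

The arrival-rate term (3) would refine the third branch — wait only if a joiner is likely to show up within the sleep window — but the two early-outs already cover the ramdisk and syslog-style single-writer cases discussed in this thread.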
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-15 7:58 ` Andreas Dilger 2008-07-15 11:29 ` Ric Wheeler @ 2008-07-15 12:51 ` Josef Bacik 2008-07-15 14:05 ` Josef Bacik 2008-07-15 14:22 ` Ric Wheeler 1 sibling, 2 replies; 12+ messages in thread From: Josef Bacik @ 2008-07-15 12:51 UTC (permalink / raw) To: Andreas Dilger; +Cc: Josef Bacik, Ric Wheeler, linux-ext4

On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
> On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
> > Perhaps we track the average time a commit takes to occur, and then if
> > the current transaction start time is < than the avg commit time we sleep
> > and wait for more things to join the transaction, and then we commit.
> > How does that idea sound? Thanks,
>
> The drawback of this approach is that if the thread waits an extra "average
> transaction time" for the transaction to commit then this will increase the
> average transaction time each time, and it still won't tell you if there
> needs to be a wait at all.
>

I'm not talking about the average transaction life; as you say, it would be highly dependent on random things that have nothing to do with the transaction time (waiting for locks and such). I'm measuring the time it takes for the actual commit to take place, so I record the start time in journal_commit_transaction when we set running_transaction = NULL, and then the end time right before the wakeup() at the end of journal_commit_transaction; that way there is an idea of how long committing a transaction to disk takes. If we only have two threads doing work and fsyncing, it's going to be a constant time, because we'll only be writing a certain number of buffers each time.

> What might be more interesting is tracking how many processes had sync
> handles on the previous transaction(s), and once that number of processes
> have done that work, or the timeout reached, the transaction is committed.
> > While this might seem like a hack for the particular benchmark, this > will also optimize real-world workloads like mailserver, NFS/fileserver, > http where the number of threads running at one time is generally fixed. > > The best way to do that would be to keep a field in the task struct to > track whether a given thread has participated in transaction "T" when > it starts a new handle, and if not then increment the "number of sync > threads on this transaction" counter. > > In journal_stop() if t_num_sync_thr >= prev num_sync_thr then > the transaction can be committed earlier, and if not then it does a > wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1). > > While the number of sync threads is growing or constant the commits will > be rapid, and any "slow" threads will block on the next transaction and > increment its num_sync_thr until the thread count stabilizes (i.e. a small > number of transactions at startup). After that the wait will be exactly > as long as needed for each thread to participate. If some threads are > too slow, or stop processing then there will be a single sleep and the > next transaction will wait for fewer threads the next time. > This idea is good, but I'm wondering about the normal user use case, ie where syslog is the only thing that does fsync. If we get into a position where prev_num_sync_thr is always 1, we'll just bypass sleeping and waiting for other stuff to join the transaction and sync whenever syslog pleases, which will likely affect most normal users, especially if they are like me and have crappy wireless cards that like to spit stuff into syslog constantly ;). Thanks much, Josef ^ permalink raw reply [flat|nested] 12+ messages in thread
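[Editor's note] The measurement window Josef describes — stamp when the running transaction is detached, stamp again just before the wakeup — needs wraparound-safe subtraction on a tick counter. A userspace sketch mirroring the elapsed_jiffies() helper from his patch later in the thread; `MAX_TICK_OFFSET` stands in for the kernel's MAX_JIFFY_OFFSET.

```c
#include <assert.h>
#include <limits.h>

#define MAX_TICK_OFFSET (ULONG_MAX >> 1)	/* stand-in for MAX_JIFFY_OFFSET */

/* Elapsed ticks between two samples, tolerating one wrap of the
 * counter between start and end. */
static unsigned long elapsed_ticks(unsigned long start, unsigned long end)
{
	if (end >= start)
		return end - start;

	/* the counter wrapped between the two samples */
	return end + (MAX_TICK_OFFSET - start) + 1;
}
```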
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-15 12:51 ` Josef Bacik @ 2008-07-15 14:05 ` Josef Bacik 2008-07-15 14:22 ` Ric Wheeler 1 sibling, 0 replies; 12+ messages in thread From: Josef Bacik @ 2008-07-15 14:05 UTC (permalink / raw) To: Josef Bacik; +Cc: Andreas Dilger, Ric Wheeler, linux-ext4 On Tue, Jul 15, 2008 at 08:51:27AM -0400, Josef Bacik wrote: > On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote: > > On Jul 14, 2008 12:58 -0400, Josef Bacik wrote: > > > Perhaps we track the average time a commit takes to occur, and then if > > > the current transaction start time is < than the avg commit time we sleep > > > and wait for more things to join the transaction, and then we commit. > > > How does that idea sound? Thanks, > > > > The drawback of this approach is that if the thread waits an extra "average > > transaction time" for the transaction to commit then this will increase the > > average transaction time each time, and it still won't tell you if there > > needs to be a wait at all. > > > > I'm not talking about the average transaction life, as you say it would be > highly dependant on random things that have nothing to do with the transaction > time (waiting for locks and such). I'm measuring the time it takes for the > actual commit to take place, so I record the start time in > journal_commit_transaction when we set running_transaction = NULL, and then the > end time right before the wakeup() at the end of journal_commit_transaction, > that way there is an idea of how long the committing of a transaction to disk > happens. If we only have two threads doing work and fsyncing, its going to be a > constant time, because we'll only be writing a certain number of buffers each > time. > Hmm now that I think about it you are right, if the thread waits then the time the transaction was running could be greater than the commit time and this wouldn't help. 
Though one would hope that dirtying pagecache would still be several times faster than writing to disk, so a reasonable amount of waiting wouldn't be a big deal; waiting too long would hurt in our current case anyway. Thanks,

Josef

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-15 12:51 ` Josef Bacik 2008-07-15 14:05 ` Josef Bacik @ 2008-07-15 14:22 ` Ric Wheeler 1 sibling, 0 replies; 12+ messages in thread From: Ric Wheeler @ 2008-07-15 14:22 UTC (permalink / raw) To: Josef Bacik; +Cc: Andreas Dilger, linux-ext4 Josef Bacik wrote: > On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote: > >> On Jul 14, 2008 12:58 -0400, Josef Bacik wrote: >> >>> Perhaps we track the average time a commit takes to occur, and then if >>> the current transaction start time is < than the avg commit time we sleep >>> and wait for more things to join the transaction, and then we commit. >>> How does that idea sound? Thanks, >>> >> The drawback of this approach is that if the thread waits an extra "average >> transaction time" for the transaction to commit then this will increase the >> average transaction time each time, and it still won't tell you if there >> needs to be a wait at all. >> >> > > I'm not talking about the average transaction life, as you say it would be > highly dependant on random things that have nothing to do with the transaction > time (waiting for locks and such). I'm measuring the time it takes for the > actual commit to take place, so I record the start time in > journal_commit_transaction when we set running_transaction = NULL, and then the > end time right before the wakeup() at the end of journal_commit_transaction, > that way there is an idea of how long the committing of a transaction to disk > happens. If we only have two threads doing work and fsyncing, its going to be a > constant time, because we'll only be writing a certain number of buffers each > time. > I think that this is exactly the interesting measurement to capture. It will change over time (depending on how loaded the target device is, etc). The single thread case that is special cased is also an important one to handle since it should be a fairly common one. 
> > >> What might be more interesting is tracking how many processes had sync >> handles on the previous transaction(s), and once that number of processes >> have done that work, or the timeout reached, the transaction is committed. >> >> While this might seem like a hack for the particular benchmark, this >> will also optimize real-world workloads like mailserver, NFS/fileserver, >> http where the number of threads running at one time is generally fixed. >> >> The best way to do that would be to keep a field in the task struct to >> track whether a given thread has participated in transaction "T" when >> it starts a new handle, and if not then increment the "number of sync >> threads on this transaction" counter. >> >> In journal_stop() if t_num_sync_thr >= prev num_sync_thr then >> the transaction can be committed earlier, and if not then it does a >> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1). >> >> While the number of sync threads is growing or constant the commits will >> be rapid, and any "slow" threads will block on the next transaction and >> increment its num_sync_thr until the thread count stabilizes (i.e. a small >> number of transactions at startup). After that the wait will be exactly >> as long as needed for each thread to participate. If some threads are >> too slow, or stop processing then there will be a single sleep and the >> next transaction will wait for fewer threads the next time. >> >> > > This idea is good, but I'm wondering about the normal user use case, ie where > syslog is the only thing that does fsync. If we get into a position where > prev_num_sync_thr is always 1, we'll just bypass sleeping and waiting for other > stuff to join the transaction and sync whenever syslog pleases, which will > likely affect most normal users, especially if they are like me and have crappy > wireless cards that like to spit stuff into syslog constantly ;). 
Thanks much, > > Josef > I think this syslog case is just a common example of the same thread doing a sequence of IO's. ric ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-14 16:15 transaction batching performance & multi-threaded synchronous writers Ric Wheeler 2008-07-14 16:58 ` Josef Bacik @ 2008-07-15 18:39 ` Josef Bacik 2008-07-15 20:10 ` Josef Bacik 1 sibling, 1 reply; 12+ messages in thread From: Josef Bacik @ 2008-07-15 18:39 UTC (permalink / raw) To: Ric Wheeler; +Cc: linux-ext4, adilger

On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
> Here is a pointer to the older patch & some results:
>
> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>
> I will retry this on some updated kernels, but would not expect to see a
> difference since the code has not been changed ;-)
>

Ok here are the numbers with the original idea I had proposed.

type    threads base    patch   speedup
sata    1       17.9    17.3    0.97
sata    2       33.2    34.2    1.03
sata    4       58.4    63.6    1.09
sata    8       78.8    80.8    1.03
sata    16      94.4    97.6    1.16

ram     1       2394.4  1878.0  0.78
ram     2       989.6   2041.1  2.06
ram     4       1466.1  3201.8  2.18
ram     8       1858.1  3362.8  1.81
ram     16      3008.0  3227.7  1.07

I've got to find a fast disk array to test this with, but the ramdisk results make me happy, though they were kind of erratic, so I think the fast disk array will be a more stable measure of how well this patch does, but it definitely doesn't hurt the slow case, and brings stability to the fast case.
Thanks much,

Josef

Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -273,6 +273,15 @@ write_out_data:
 	journal_do_submit_data(wbuf, bufs);
 }
 
+static inline unsigned long elapsed_jiffies(unsigned long start,
+					    unsigned long end)
+{
+	if (end >= start)
+		return end - start;
+
+	return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
 /*
  * journal_commit_transaction
  *
@@ -288,6 +297,7 @@ void journal_commit_transaction(journal_
 	int flags;
 	int err;
 	unsigned long blocknr;
+	unsigned long long commit_time, start_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -400,6 +410,7 @@ void journal_commit_transaction(journal_
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = jiffies;
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -873,6 +884,12 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = elapsed_jiffies(start_time, jiffies);
+	if (unlikely(!journal->j_average_commit_time))
+		journal->j_average_commit_time = commit_time;
+	else
+		journal->j_average_commit_time = (commit_time +
+				journal->j_average_commit_time) / 2;
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -49,6 +49,7 @@ get_transaction(journal_t *journal, tran
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = jiffies;
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1361,8 +1362,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
-	pid_t pid;
+	int err;
 
 	J_ASSERT(journal_current_handle() == handle);
 
@@ -1395,13 +1395,17 @@ int journal_stop(handle_t *handle)
 	 * single process is doing a stream of sync writes. No point in waiting
 	 * for joiners in that case.
 	 */
-	pid = current->pid;
-	if (handle->h_sync && journal->j_last_sync_writer != pid) {
-		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
+	if (handle->h_sync) {
+		unsigned long commit_time;
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		while (time_before(jiffies, commit_time +
+				transaction->t_start_time)) {
 			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+		}
 	}
 
 	current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
 	unsigned long t_expires;
 
 	/*
+	 * When this transaction started, in jiffies [no locking]
+	 */
+	unsigned long t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
 	pid_t j_last_sync_writer;
 
+	unsigned long long j_average_commit_time;
+
 	/*
 	 * An opaque pointer to fs-private information. ext3 puts its
 	 * superblock pointer here

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers 2008-07-15 18:39 ` Josef Bacik @ 2008-07-15 20:10 ` Josef Bacik 2008-07-15 20:43 ` Josef Bacik 0 siblings, 1 reply; 12+ messages in thread From: Josef Bacik @ 2008-07-15 20:10 UTC (permalink / raw) To: Josef Bacik; +Cc: Ric Wheeler, linux-ext4, adilger

On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
> >
> > Here is a pointer to the older patch & some results:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg13121.html
> >
> > I will retry this on some updated kernels, but would not expect to see a
> > difference since the code has not been changed ;-)
> >
>
> Ok here are the numbers with the original idea I had proposed.
>
> type    threads base    patch   speedup
> sata    1       17.9    17.3    0.97
> sata    2       33.2    34.2    1.03
> sata    4       58.4    63.6    1.09
> sata    8       78.8    80.8    1.03
> sata    16      94.4    97.6    1.16
>
> ram     1       2394.4  1878.0  0.78
> ram     2       989.6   2041.1  2.06
> ram     4       1466.1  3201.8  2.18
> ram     8       1858.1  3362.8  1.81
> ram     16      3008.0  3227.7  1.07
>
> I've got to find a fast disk array to test this with, but the ramdisk results
> make me happy, though they were kind of erratic, so I think the fast disk array
> will be a more stable measure of how well this patch does, but it definitely
> doesn't hurt the slow case, and brings stability to the fast case.

Hmm talking with ric I should just leave the single thread stuff alone. That removes the slight speed regression seen above.
Thanks,

Josef

Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -273,6 +273,15 @@ write_out_data:
 	journal_do_submit_data(wbuf, bufs);
 }
 
+static inline unsigned long elapsed_jiffies(unsigned long start,
+					    unsigned long end)
+{
+	if (end >= start)
+		return end - start;
+
+	return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
 /*
  * journal_commit_transaction
  *
@@ -288,6 +297,7 @@ void journal_commit_transaction(journal_
 	int flags;
 	int err;
 	unsigned long blocknr;
+	unsigned long long commit_time, start_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -400,6 +410,7 @@ void journal_commit_transaction(journal_
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = jiffies;
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -873,6 +884,12 @@ restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = elapsed_jiffies(start_time, jiffies);
+	if (unlikely(!journal->j_average_commit_time))
+		journal->j_average_commit_time = commit_time;
+	else
+		journal->j_average_commit_time = (commit_time +
+				journal->j_average_commit_time) / 2;
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -49,6 +49,7 @@ get_transaction(journal_t *journal, tran
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = jiffies;
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1361,7 +1362,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
+	int err;
 	pid_t pid;
 
 	J_ASSERT(journal_current_handle() == handle);
@@ -1397,11 +1398,17 @@ int journal_stop(handle_t *handle)
 	 */
 	pid = current->pid;
 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		unsigned long commit_time;
+
 		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		while (time_before(jiffies, commit_time +
+				transaction->t_start_time))
 			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
 	}
 
 	current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
 	unsigned long t_expires;
 
 	/*
+	 * When this transaction started, in jiffies [no locking]
+	 */
+	unsigned long t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
 	pid_t j_last_sync_writer;
 
+	unsigned long long j_average_commit_time;
+
 	/*
 	 * An opaque pointer to fs-private information. ext3 puts its
 	 * superblock pointer here

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 20:10 ` Josef Bacik
@ 2008-07-15 20:43 ` Josef Bacik
  2008-07-15 22:33   ` Ric Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2008-07-15 20:43 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Ric Wheeler, linux-ext4, adilger

On Tue, Jul 15, 2008 at 04:10:10PM -0400, Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
> > On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
> > >
> > > Here is a pointer to the older patch & some results:
> > >
> > > http://www.spinics.net/lists/linux-fsdevel/msg13121.html
> > >
> > > I will retry this on some updated kernels, but would not expect to see a
> > > difference since the code has not been changed ;-)
> > >
> >
> > Ok here are the numbers with the original idea I had proposed.
> >
> > type	threads	base	patch	speedup
> > sata	1	17.9	17.3	0.97
> > sata	2	33.2	34.2	1.03
> > sata	4	58.4	63.6	1.09
> > sata	8	78.8	80.8	1.03
> > sata	16	94.4	97.6	1.16
> >
> > ram	1	2394.4	1878.0	0.78
> > ram	2	989.6	2041.1	2.06
> > ram	4	1466.1	3201.8	2.18
> > ram	8	1858.1	3362.8	1.81
> > ram	16	3008.0	3227.7	1.07
> >
> > I've got to find a fast disk array to test this with, but the ramdisk results
> > make me happy, though they were kind of erratic, so I think the fast disk array
> > will be a more stable measure of how well this patch does, but it definitely
> > doesn't hurt the slow case, and brings stability to the fast case.  Thanks much,
> >
>
> Hmm talking with ric I should just leave the single thread stuff alone.  That
> removes the slight speed regression seen above.
> Thanks,
>

Here are the results with the single thread stuff put back in and with 250HZ
instead of 1000HZ from before

type	threads	base	patch
sata	1	21.8	21.6
sata	2	26.2	34.6
sata	4	48.0	58.0
sata	8	70.4	75.2
sata	16	89.6	101.1

ram	1	2505.4	2422.0
ram	2	463.8	3462.3
ram	4	330.4	3653.9
ram	8	995.1	3592.4
ram	16	1335.2	3806.5

Thanks,
Josef

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: transaction batching performance & multi-threaded synchronous writers
  2008-07-15 20:43 ` Josef Bacik
@ 2008-07-15 22:33 ` Ric Wheeler
  0 siblings, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2008-07-15 22:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-ext4, adilger

Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 04:10:10PM -0400, Josef Bacik wrote:
>
>> On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
>>
>>> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>>>
>>>> Here is a pointer to the older patch & some results:
>>>>
>>>> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>>>>
>>>> I will retry this on some updated kernels, but would not expect to see a
>>>> difference since the code has not been changed ;-)
>>>>
>>>>
>>> Ok here are the numbers with the original idea I had proposed.
>>>
>>> type	threads	base	patch	speedup
>>> sata	1	17.9	17.3	0.97
>>> sata	2	33.2	34.2	1.03
>>> sata	4	58.4	63.6	1.09
>>> sata	8	78.8	80.8	1.03
>>> sata	16	94.4	97.6	1.16
>>>
>>> ram	1	2394.4	1878.0	0.78
>>> ram	2	989.6	2041.1	2.06
>>> ram	4	1466.1	3201.8	2.18
>>> ram	8	1858.1	3362.8	1.81
>>> ram	16	3008.0	3227.7	1.07
>>>
>>> I've got to find a fast disk array to test this with, but the ramdisk results
>>> make me happy, though they were kind of erratic, so I think the fast disk array
>>> will be a more stable measure of how well this patch does, but it definitely
>>> doesn't hurt the slow case, and brings stability to the fast case.  Thanks much,
>>>
>>>
>> Hmm talking with ric I should just leave the single thread stuff alone.  That
>> removes the slight speed regression seen above.
>> Thanks,
>>
>>
>
> Here are the results with the single thread stuff put back in and with 250HZ
> instead of 1000HZ from before
>
> type	threads	base	patch
> sata	1	21.8	21.6
> sata	2	26.2	34.6
> sata	4	48.0	58.0
> sata	8	70.4	75.2
> sata	16	89.6	101.1
>
> ram	1	2505.4	2422.0
> ram	2	463.8	3462.3
> ram	4	330.4	3653.9
> ram	8	995.1	3592.4
> ram	16	1335.2	3806.5
>
> Thanks,
>
> Josef
>

These numbers are pretty impressive - we need to get a run on an array-backed
file system as well to round out the picture, and possibly an SSD (anyone have
one out there to play with)?

ric

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-07-15 22:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-14 16:15 transaction batching performance & multi-threaded synchronous writers Ric Wheeler
2008-07-14 16:58 ` Josef Bacik
2008-07-14 17:26   ` Ric Wheeler
2008-07-15  7:58   ` Andreas Dilger
2008-07-15 11:29     ` Ric Wheeler
2008-07-15 12:51       ` Josef Bacik
2008-07-15 14:05         ` Josef Bacik
2008-07-15 14:22           ` Ric Wheeler
2008-07-15 18:39 ` Josef Bacik
2008-07-15 20:10   ` Josef Bacik
2008-07-15 20:43     ` Josef Bacik
2008-07-15 22:33       ` Ric Wheeler