linux-ext4.vger.kernel.org archive mirror
From: Ric Wheeler <rwheeler@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: rwheeler@redhat.com, adilger@sun.com, jbacik@redhat.com,
	linux-kernel@vger.kernel.org, tglx@linutronix.de,
	linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH 2/2] improve ext3 fsync batching
Date: Tue, 19 Aug 2008 16:55:47 -0400	[thread overview]
Message-ID: <48AB3353.5000308@redhat.com> (raw)
In-Reply-To: <20080819132914.6fcc9f29.akpm@linux-foundation.org>

Andrew Morton wrote:
> Could I just point out that this is a very very painful way of writing
> a changelog?  All these new revelations are important and relevant and
> should have been there!
>   

Agreed - some of this was only bounced around in the email thread and 
should have made it into the changelog.
> On Tue, 19 Aug 2008 14:08:31 -0400
> Ric Wheeler <rwheeler@redhat.com> wrote:
>
>   
>> Andrew Morton wrote:
>>     
>>> On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <rwheeler@redhat.com> wrote:
>>>
>>>   
>>>       
>>>> It would be great to be able to use this batching technique for faster 
>>>> devices, but we currently sleep 3-4 times longer waiting to batch for an 
>>>> array than it takes to complete the transaction.
>>>>     
>>>>         
>>> Obviously, tuning that delay down to the minimum necessary is a good
>>> thing.  But doing it based on commit-time seems indirect at best.  What
>>> happens on a slower disk when commit times are in the tens of
>>> milliseconds?  When someone runs a concurrent `dd if=/dev/zero of=foo'
>>> when commit times go up to seconds?
>>>   
>>>       
>> Transactions on that busier drive would take longer, so we would sleep 
>> longer, which would allow us to batch more into one transaction. That 
>> should be a good result, and the sleep time should reset to a shorter 
>> value when the drive gets less busy (and transactions get shorter).
>>     
>
> Has this been empirically confirmed?
>   

No, this was just me thinking about how a shared disk can become 
sluggish. It is a good experiment to try (run the test on one host, 
inject a heavy load from a second, watch the behaviour).

> Commits can take tens of seconds and causing an fsync() to block
> itself for such periods could have quite bad effects.  How do we (ie:
> I) know that there are no such scenarios with this change?
>   

A very good point - I like your suggestion in the follow-up message to 
take the minimum of the average sleep time and one jiffy.
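A minimal C sketch of that min(average commit time, one jiffy) clamp. The constants and the helper name `batch_sleep_ns` are hypothetical illustrations, not the kernel's actual identifiers, and HZ=250 is just an assumed configuration:

```c
#include <stdint.h>

/* Assumed configuration: a 250 HZ kernel, so one jiffy = 4 ms. */
#define HZ 250
#define NSEC_PER_SEC 1000000000ULL
#define ONE_JIFFY_NS (NSEC_PER_SEC / HZ)   /* 4,000,000 ns at HZ=250 */

/*
 * Clamp the measured average commit time so an fsync() never sleeps
 * longer than one jiffy waiting for other threads to join the
 * transaction, no matter how sluggish the device currently is.
 */
static uint64_t batch_sleep_ns(uint64_t avg_commit_time_ns)
{
	return avg_commit_time_ns < ONE_JIFFY_NS ? avg_commit_time_ns
						 : ONE_JIFFY_NS;
}
```

With this clamp, a fast array averaging 500 us commits sleeps only 500 us, while a heavily loaded disk averaging seconds per commit still sleeps at most 4 ms - avoiding the pathological long fsync() blocking Andrew describes above.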

>   
>>> Perhaps a better scheme would be to tune it based on how many other
>>> processes are joining that transaction.  If it's "zero" then decrease
>>> the timeout.  But one would need to work out how to increase it, which
>>> perhaps could be done by detecting the case where process A runs an
>>> fsync when a commit is currently in progress, and that commit was
>>> caused by process B's fsync.
>>>   
>>>       
>> This is really, really a property of the device's latency at any given 
>> point in time. If there are no other processes running, we could do an 
>> optimization and not wait.
>>     
>
> well yes.  This represents yet another attempt to predict future
> application behaviour.  The way in which we _usually_ do that is by
> monitoring past application behaviour.
>   
This whole area is very similar to the I/O elevator code, where some of 
the same device characteristics are measured.


> Only this patch didn't do that (directly) and I'm wondering why not.
>   

The average transaction commit time is a direct measurement of the past 
behaviour, right?
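As an illustration of that "measure the past to predict the future" point, a running average of commit times might be kept like this. This is only a sketch; the 50/50 weighting and the function name are assumptions, not the actual jbd code:

```c
#include <stdint.h>

/*
 * Decaying average of past commit times: each new sample gets equal
 * weight with the accumulated history, so recent device behaviour
 * dominates while old measurements fade out.
 */
static uint64_t update_avg_commit_ns(uint64_t avg, uint64_t measured)
{
	if (avg == 0)
		return measured;          /* first sample seeds the average */
	return (avg + measured) / 2;      /* blend history with new sample */
}
```

The point is that the sleep heuristic is driven by direct measurements of how the device has actually behaved, rather than by a static tunable.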

>   
>>> But before doing all that I would recommend/ask that the following be
>>> investigated:
>>>
>>> - How effective is the present code?
>>>   
>>>       
>> It causes the most expensive storage (arrays) to run 3-4 times slower 
>> than it should on a synchronous write workload (NFS server, mail 
>> server?) with more than 1 thread. For example, against a small EMC 
>> array, I saw write rates against ext3 of 720 files/sec with 1 thread 
>> and 225 (if I remember correctly) with 2 ;-)
>>     
>
> Current code has:
>
> 	/*
> 	 * Implement synchronous transaction batching.  If the handle
> 	 * was synchronous, don't force a commit immediately.  Let's
> 	 * yield and let another thread piggyback onto this transaction.
> 	 * Keep doing that while new threads continue to arrive.
> 	 * It doesn't cost much - we're about to run a commit and sleep
> 	 * on IO anyway.  Speeds up many-threaded, many-dir operations
> 	 * by 30x or more...
> 	 *
> 	 * But don't do this if this process was the most recent one to
> 	 * perform a synchronous write.  We do this to detect the case where a
> 	 * single process is doing a stream of sync writes.  No point in waiting
> 	 * for joiners in that case.
> 	 */
>
> has the 30x been reproduced?  If not, what broke?  If so, what effect
> did the proposed change have upon it?
>   

The huge gain was only in the case of a RAM disk test, which I assume 
was not tested against the original patch early on. Against an array 
(with a kernel running at HZ=250), we saw a 2.5x speedup with the new 
code.
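The heuristic in the kernel comment Andrew quotes can be sketched roughly as below. The struct, field, and function names here are illustrative only - the real jbd code tracks this differently inside the journal:

```c
#include <stdbool.h>
#include <sys/types.h>

/* Illustrative stand-in for the journal's record of sync writers. */
struct journal_sketch {
	pid_t last_sync_writer;   /* pid of the most recent sync writer */
};

/*
 * Wait for joiners on a synchronous handle, unless this process was
 * also the previous synchronous writer: a single process doing a
 * stream of sync writes gains nothing by sleeping for company.
 */
static bool should_wait_for_joiners(struct journal_sketch *j,
				    pid_t current_pid, bool handle_sync)
{
	if (!handle_sync)
		return false;             /* async handle: no batching wait */
	if (j->last_sync_writer == current_pid)
		return false;             /* lone streaming sync writer */
	j->last_sync_writer = current_pid;
	return true;                      /* sleep; let others piggyback */
}
```

This is the part that produced the "30x or more" claim for many-threaded workloads; the patch under discussion changes only how long the wait lasts, not whether to wait.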

>   
>>>   - What happens when it is simply removed?
>>>   
>>>       
>> If you remove the code, you will not see the throughput rise when you go 
>> multithreaded on existing slow devices (S-ATA/ATA, for example). Faster 
>> devices will not see that two-thread drop.
>>     
>
> See above - has this been tested and confirmed?
>   

Yes - we (back at EMC) did remove the logic, and the fast devices then 
wrote at least at their starting rate (700+ files/sec).

>   
>>>   - Add instrumentation (a counter and a printk) to work out how
>>>     many other tasks are joining this task's transaction.
>>>
>>>     - If the answer is "zero" or "small", work out why.
>>>
>>>   - See if we can increase its effectiveness.
>>>
>>> Because it could be that the code broke.  There might be issues with
>>> higher-level locks which are preventing the batching.  For example, if
>>> all the files which the test app is syncing are in the same directory,
>>> perhaps all the tasks are piling up on that directory's i_mutex?
>>>   
>>>       
>> I have to admit that I don't see the down side here - we have shown a 
>> huge increase for arrays (an embarrassingly huge increase for RAM disks) 
>> and see no degradation for the S-ATA/ATA case.
>>
>> The code is not broken (having been there and done the performance 
>> tuning on the original code), it just did not account for the widely 
>> varying average response times for different classes of storage ;-)
>>     
>
> Well, as I said - last time I checked, it did seem to be broken.  By
> what means did you confirm that it is still effective, and what were
> the results?
>
>   

I think Josef posted those results for S-ATA earlier in the thread, and 
they showed the batching still working. We can repost/rerun to give 
more detail...

ric




Thread overview: 19+ messages
2008-08-06 19:08 [PATCH 1/2] add hrtimer_sleep_ns helper function Josef Bacik
2008-08-06 19:15 ` [PATCH 2/2] improve ext3 fsync batching Josef Bacik
2008-08-06 19:23   ` Josef Bacik
2008-08-19  4:31   ` Andrew Morton
2008-08-19  5:44     ` Andreas Dilger
2008-08-19 11:01       ` Ric Wheeler
2008-08-19 17:56         ` Andrew Morton
2008-08-19 18:08           ` Ric Wheeler
2008-08-19 20:29             ` Andrew Morton
2008-08-19 20:55               ` Ric Wheeler [this message]
2008-08-19 21:18                 ` Andrew Morton
2008-08-19 21:29                   ` Ric Wheeler
2008-08-19 18:43           ` Ric Wheeler
2008-08-19 20:34             ` Andrew Morton
2008-08-19 19:18           ` Josef Bacik
2008-08-19 19:15 ` [PATCH 1/2] add hrtimer_sleep_ns helper function Matthew Wilcox
2008-08-19 19:22   ` Josef Bacik
2008-08-19 19:36     ` Matthew Wilcox
2008-08-19 19:39       ` Josef Bacik
