* [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync
@ 2010-11-29 22:05 Darrick J. Wong
From: Darrick J. Wong @ 2010-11-29 22:05 UTC
  To: Jens Axboe, Theodore Ts'o, Neil Brown, Andreas Dilger,
	Alasdair G Kergon, Darrick J. Wong
  Cc: Jan Kara, Mike Snitzer, linux-kernel, linux-raid, Keith Mannthey,
	dm-devel, Mingming Cao, Tejun Heo, linux-ext4, Ric Wheeler,
	Christoph Hellwig, Josef Bacik

On certain types of hardware, issuing a write cache flush takes a considerable
amount of time.  Typically, these are simple storage systems with the write
cache enabled and no battery to preserve that cache across a power failure.
On such a system, when many I/O threads write data and then call fsync after
more transactions accumulate, ext4_sync_file performs a data-only flush for
each caller.  Performance here is suboptimal because each of those threads
issues its own flush command to the drive instead of coordinating a single
flush, wasting execution time.

Instead of each fsync call initiating its own flush, there is now a flag that
indicates whether (0) no flush is in progress, (1) we are delaying briefly to
collect other fsync threads, or (2) a flush is actually in progress.

So, if someone calls ext4_sync_file and no flushes are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are any
other threads that are close behind in ext4_sync_file.  After that wait, the
state transitions to 2 and the flush is issued.  Once that's done, the state
goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled.  Instead of issuing a flush themselves, they simply
wait for that first thread to do it for them.  If they see that the flag is 2,
they wait for the current flush to finish, and start over.
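
To make the scheme concrete, here is a rough sketch of that state machine.
This is not the patch code; the names (struct flush_coordinator,
coordinated_flush, and so on) are made up for illustration, and the real
implementation may differ:

#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/delay.h>
#include <linux/blkdev.h>

/* Hypothetical sketch of the fsync flush coordination described above. */
enum flush_state {
	FLUSH_IDLE = 0,		/* no flush in progress */
	FLUSH_GATHERING,	/* waiting briefly for other fsync threads */
	FLUSH_RUNNING,		/* flush command has been issued */
};

struct flush_coordinator {
	spinlock_t		lock;	/* initialized with spin_lock_init() */
	enum flush_state	state;
	wait_queue_head_t	wait;	/* initialized with init_waitqueue_head() */
};

static void coordinated_flush(struct flush_coordinator *fc,
			      struct block_device *bdev)
{
again:
	spin_lock(&fc->lock);
	switch (fc->state) {
	case FLUSH_IDLE:
		/* Become the leader: wait briefly for followers, then flush. */
		fc->state = FLUSH_GATHERING;
		spin_unlock(&fc->lock);

		msleep(1);	/* collect close-behind fsync threads */

		spin_lock(&fc->lock);
		fc->state = FLUSH_RUNNING;
		spin_unlock(&fc->lock);

		/* 2.6.37-era signature */
		blkdev_issue_flush(bdev, GFP_KERNEL, NULL);

		spin_lock(&fc->lock);
		fc->state = FLUSH_IDLE;
		spin_unlock(&fc->lock);
		wake_up_all(&fc->wait);
		break;
	case FLUSH_GATHERING:
		/* The leader will flush for us; wait until it signals done. */
		spin_unlock(&fc->lock);
		wait_event(fc->wait, fc->state == FLUSH_IDLE);
		break;
	case FLUSH_RUNNING:
		/* Too late to join this flush; wait for it, then start over. */
		spin_unlock(&fc->lock);
		wait_event(fc->wait, fc->state == FLUSH_IDLE);
		goto again;
	}
}

In ext4 the coordinator would presumably live in per-filesystem data so that
all fsync callers on the same device share it.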

However, there are a couple of exceptions to this rule.  First, there exist
high-end storage arrays with battery-backed write caches for which flush
commands take very little time (< 2ms); on these systems, performing the
coordination actually lowers performance.  Given the earlier patch to the block
layer to report low-level device flush times, we can detect this situation and
have all threads issue flushes without coordinating, as we did before.  The
second case is when there's a single thread issuing flushes, in which case it
can skip the coordination.
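
A sketch of how that decision might look is below.  The threshold follows
from the 2ms figure above; blk_average_flush_time_ns() is a hypothetical
stand-in for whatever interface the block-layer patch actually exports, not
a real kernel function:

#include <linux/types.h>
#include <linux/time.h>
#include <linux/blkdev.h>

/* Below ~2ms average flush time, coordination costs more than it saves. */
#define FAST_FLUSH_THRESHOLD_NS	(2 * NSEC_PER_MSEC)

/*
 * Decide whether an fsync caller should join the coordination scheme or
 * simply issue its own flush.  blk_average_flush_time_ns() is a made-up
 * name standing in for the interface added by patch 1.
 */
static bool fsync_should_coordinate(struct block_device *bdev,
				    unsigned int nr_flush_threads)
{
	u64 avg_ns = blk_average_flush_time_ns(bdev);	/* hypothetical */

	if (avg_ns && avg_ns < FAST_FLUSH_THRESHOLD_NS)
		return false;	/* fast battery-backed cache: flush directly */
	if (nr_flush_threads <= 1)
		return false;	/* a lone flusher gains nothing by waiting */
	return true;
}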

The author of this patch is aware that jbd2 has a similar flush coordination
scheme for journal commits.  An earlier version of this patch simply created a
new empty journal transaction and committed it, but that approach was shown to
increase the amount of write traffic heading towards the disk, which in turn
lowered performance considerably, especially in the case where directio was in
use.  Therefore, this patch adds the coordination code directly to ext4.

To test the performance and safety of this patchset, I crafted an ffsb profile
named fsync-happy that performs a bunch of disk I/O with periodic fsync()s to
flush the data out to disk.  Performance results can be seen here:
http://bit.ly/fYAclV

The data presented in blue text represent results obtained on high performance
disk arrays that have battery-backed write cache enabled.  Red results on the
"speed differences" page represent performance regressions, of course.
Descriptions of the disk hardware tested are on the rightmost page.  In no case
were any of the benchmarks CPU-bound.

The speed differences page shows some interesting results.  Before Tejun Heo's
barrier -> flush conversion in 2.6.37-rc1, enabling barriers caused a 30-80
percent performance regression across a fairly large variety of test programs;
generally, the more fsyncs, the bigger the drop, since if one never fsyncs any
data, the only flushes that ever happen are those during the periodic journal
commits.  Now the cost of enabling flushes in ext4 on the fsync-happy workload
has dropped from about 80 percent to about 25-30 percent.  With this fsync
coordination patch, that drop becomes about 5-14 percent.

I see some small (< 1 percent) performance regressions on some hardware.  These
are generally acceptable because repeated runs of fsync-happy show larger
variance than that.  The two larger regressions (elm3a4_ipr_nowc and
elm3c44_sata_nowc) are somewhat questionable cases because those two disks
have no write cache, yet ext4 was not properly detecting this and setting
barrier=0.  That bug will be addressed separately.

In terms of data safety, I've been performing power failure testing with a
bunch of blades that have slow IDE disks with fairly large write caches.  So
far I haven't seen any more FS errors after reset than I see with 2.6.36.

This patchset consists of four patches.  The first adds to the block layer the
ability to measure the amount of time it takes for a lower-level block device
to issue and complete a flush command.  The middle two patches add to md and
dm, respectively, the ability to report the component devices' flush times.
For 2.6.37-rc3, the md patch also requires my earlier patch to md to enable
REQ_FLUSH support.  The fourth patch adds the auto-tuning fsync flush
coordination to ext4.
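
For readers unfamiliar with the first patch, the measurement it adds amounts
to timing each flush from issue to completion and keeping a running average.
The sketch below is illustrative only; the names (struct flush_stats,
flush_stats_record) are made up and do not reflect the actual patch:

#include <linux/ktime.h>
#include <linux/spinlock.h>
#include <linux/math64.h>

/* Illustrative only: names do not match the actual block-layer patch. */
struct flush_stats {
	spinlock_t	lock;
	u64		avg_ns;		/* running average flush round-trip time */
	u64		nr_samples;
};

/* Call on flush completion with the timestamp taken when it was issued. */
static void flush_stats_record(struct flush_stats *fs, ktime_t issued)
{
	u64 delta = ktime_to_ns(ktime_sub(ktime_get(), issued));
	unsigned long flags;

	spin_lock_irqsave(&fs->lock, flags);
	/* cumulative moving average over all completed flushes */
	fs->avg_ns = div64_u64(fs->avg_ns * fs->nr_samples + delta,
			       fs->nr_samples + 1);
	fs->nr_samples++;
	spin_unlock_irqrestore(&fs->lock, flags);
}

md and dm would then report, for the composite device, a figure derived from
the flush times of their component devices (e.g. an average across components,
per the patch descriptions above).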

To everyone who has reviewed this patch set so far, thank you for your help!
As usual, I welcome any questions or comments.

--D


Thread overview: 23+ messages
2010-11-29 22:05 [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent by fsync Darrick J. Wong
2010-11-29 22:05 ` [PATCH 1/4] block: Measure flush round-trip times and report average value Darrick J. Wong
2010-12-02  9:49   ` Lukas Czerner
2010-11-29 22:05 ` [PATCH 2/4] md: Compute average flush time from component devices Darrick J. Wong
2010-11-29 22:05 ` [PATCH 3/4] dm: " Darrick J. Wong
2010-11-30  5:21   ` Mike Snitzer
2010-11-29 22:06 ` [PATCH 4/4] ext4: Coordinate data-only flush requests sent by fsync Darrick J. Wong
2010-11-29 23:48 ` [PATCH v6 0/4] " Ric Wheeler
2010-11-30  0:19   ` Darrick J. Wong
2010-12-01  0:14   ` Mingming Cao
2010-11-30  0:39 ` Neil Brown
2010-11-30  0:48   ` Ric Wheeler
2010-11-30  1:26     ` Neil Brown
2010-11-30 23:32       ` Darrick J. Wong
2010-11-30 13:45   ` Tejun Heo
2010-11-30 13:58     ` Ric Wheeler
2010-11-30 16:43   ` Christoph Hellwig
2010-11-30 23:31   ` Darrick J. Wong
2010-11-30 16:41 ` Christoph Hellwig
2011-01-07 23:54   ` Patch to issue pure flushes directly (Was: Re: [PATCH v6 0/4] ext4: Coordinate data-only flush requests sent) " Ted Ts'o
2011-01-08  7:45     ` Christoph Hellwig
     [not found]     ` <20110108074524.GA13024@lst.de>
2011-01-08 14:08       ` Tejun Heo
2011-01-04 16:27 ` [RFC PATCH v7] ext4: Coordinate data-only flush requests sent " Darrick J. Wong
