* Tux3 Report: Untar Unleashed
From: Daniel Phillips @ 2014-04-25  0:55 UTC
  To: linux-kernel; +Cc: linux-fsdevel, tux3

Hi,


It is about time to post Tux3 patches for review. Almost. Running the 
classic kernel untar test on an older 60 MB/sec hard disk turned up a 
performance gap versus Ext4 by a factor of six. Hmm, that is in the 
range where it could be a basic design bug, so we need to do something.


Tux3 performed fine on the actual untar, but syncing those thousands of 
small files to disk was slow. The cause turned out to be multiple 
issues. Two were unimplemented design features. For every file, we were 
writing a btree root and a leaf block with pointers to data extents. 
Instead, most btrees can be "zero depth" with just a leaf block and no 
index block. And many files are just a single extent, so the btree is not 
needed at all. When writing lots of small files, we were transferring up 
to three times more blocks than necessary.
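
To make that concrete, here is a rough sketch of the idea, with invented 
names and field sizes rather than the actual Tux3 on-disk format: a file 
whose data is a single contiguous extent carries that extent directly in 
its inode data attribute, and only multi-extent files need a btree root, 
which itself can be zero depth:

    /* Illustrative only -- names and sizes are made up, not the real layout. */
    #include <stdint.h>

    struct direct_extent {              /* small file: data is one extent */
            uint64_t block;             /* starting physical block */
            uint32_t count;             /* contiguous blocks in the extent */
    };

    struct btree_root {                 /* larger file: needs a btree */
            uint64_t root_block;        /* block holding the (possibly lone) leaf */
            uint16_t depth;             /* 0 = zero depth, leaf only */
    };

    enum data_kind { DATA_DIRECT, DATA_BTREE };

    struct data_attr {                  /* per-inode data attribute */
            enum data_kind kind;
            union {
                    struct direct_extent direct;
                    struct btree_root btree;
            } u;
    };

Writing a small file then touches only its inode table block and the data 
itself, instead of inode, btree root, leaf and data.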


The kernel tarball showed a metadata-to-data ratio of about 1.2. This is 
way too high; it should be less than a tenth of that. After factoring 
out those extra blocks, a huge performance gap still remains. This must 
have something to do with seeking, but our disk layout for this load is 
pretty good, so what is going on?


It turns out to be important not only to write to the right place, but 
at the right time. The block scheduler tries to merge physically 
contiguous requests even when submitted out of order, but if the 
requests are too far apart in time, an earlier request may have already 
left the queue by the time an adjacent request shows up. This causes 
extra, costly seeks on spinning disks. Ideally, we want our writes 
contiguous in both time and space. Then the disk hardware should be able 
to make seek costs effectively disappear and stream the data out at 
close to media speed.
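
As a user space analogy of what the backend wants to achieve (this is not 
Tux3 code, just a minimal sketch with invented names): collect a batch of 
dirty blocks, sort them by physical position, and submit the whole batch 
back to back, so adjacent requests reach the block scheduler close enough 
in time to be merged:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    struct dirty_block {
            uint64_t phys;              /* physical block number on the device */
            char data[BLOCK_SIZE];
    };

    static int cmp_phys(const void *a, const void *b)
    {
            const struct dirty_block *x = a, *y = b;
            return (x->phys > y->phys) - (x->phys < y->phys);
    }

    /* Flush one batch: sorted by address, submitted back to back, synced once. */
    static void flush_batch(int fd, struct dirty_block *blocks, size_t count)
    {
            qsort(blocks, count, sizeof *blocks, cmp_phys);
            for (size_t i = 0; i < count; i++)
                    pwrite(fd, blocks[i].data, BLOCK_SIZE,
                           (off_t)(blocks[i].phys * BLOCK_SIZE));
            fsync(fd);                  /* one sync per batch, not per block */
    }

Inside the kernel the mechanics are different, but the principle is the 
same: do not let adjacent writes drift apart in time.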


Fixes for these issues took the form of patches from Hirofumi to 
eliminate redundant btree roots and submit metadata writes in a better 
order for block scheduling, and a patch from me to implement a planned 
"direct extent" feature to eliminate btrees completely for small files. 
Iterative improvements went something like this:


* Start: 60 seconds to sync but Ext4 only needs 10 seconds

* Eliminate most btree roots => now sync in 53 seconds

* Submit file btree together with data => sync in 19 seconds

* Eliminate the btree completely for small files => 15 seconds

* Flush inode table blocks after all data => 8 seconds


We ended up with:


untar:
real    0m2.706s
user    0m0.360s
sys     0m2.160s

sync:
real    0m8.651s
user    0m0.000s
sys     0m0.024s


Ext4 takes about 4 seconds to untar and 10 seconds to sync, turning in a 
respectable 50 MB/sec write bandwidth on a 62 MB/sec disk. Tux3 now 
syncs at 60 MB/sec, or 97% of raw media bandwidth. So we went from 500% 
slower to 23% faster, woohoo. The cost for this is that we dropped out 
of sight for a few weeks. Maybe it was worth it because the performance 
artifact was so big that it could have been a major design deficiency 
instead of what it really was: leaving some details for later.


When we checked read performance on the untarred tree, we immediately 
saw mixed results. Re-tarring the kernel tree is faster than Ext4, but 
directory listing is slower by a multiple. So we need to analyze and fix 
ls without breaking the good tar and untar behavior. The question is, is 
it worth another delay before putting Tux3 patches up for review?


I think not. In fact, by going quiet when we hit these things, we 
detract from the spectator sport aspect of open source. It might not be 
any faster to work in public, but it is more fun. Plus, it engraves a 
record on the internet as a guide for the next effort to invent a new 
and wonderful filesystem. It is hard to overstate the value to our 
project of all the historical chatter about design and development 
process for Ext4 and other Linux filesystems. We often find ourselves 
retracing the same learning processes. By doing this work in public, we 
give something back.


Improving ls performance to Ext4 standards may just be a matter of 
implementing inode table readahead, or it might be that plus something 
else. In any case, this will go on the longish list of important issues 
that are not central.
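
For what inode table readahead might look like, here is a speculative 
sketch (invented names, not an existing interface, and assuming inode 
numbers map more or less linearly onto inode table blocks): while reading 
a directory, note the range of inode table blocks holding its entries' 
inodes, so one readahead can be issued before the stat() calls start 
seeking one inode at a time:

    #include <stddef.h>
    #include <stdint.h>

    struct itable_readahead {
            uint64_t first_block;       /* first inode table block to prefetch */
            uint64_t last_block;        /* last inode table block to prefetch */
    };

    /* inums_per_block comes from the inode table geometry. */
    static struct itable_readahead plan_readahead(const uint64_t *inums,
                                                  size_t count,
                                                  uint64_t inums_per_block)
    {
            struct itable_readahead ra = { UINT64_MAX, 0 };
            for (size_t i = 0; i < count; i++) {
                    uint64_t block = inums[i] / inums_per_block;
                    if (block < ra.first_block)
                            ra.first_block = block;
                    if (block > ra.last_block)
                            ra.last_block = block;
            }
            return ra;                  /* caller issues one readahead over the range */
    }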


Inline data is a related item already on that list. There is a nice plan 
for it, where the same design feature handles inline files and tail 
packing, similar to extended attributes. In particular, most directories 
in the kernel tarball are small enough to inline, which could speed up 
ls significantly. With most files and directories inlined, a Tux3 
filesystem becomes a single, fatter btree, with different issues and 
tradeoffs. On the whole, I expect substantial improvements in both space 
utilization and performance. The final chapter in the tar performance 
saga has yet to be written.
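
For flavor, a tiny sketch of the inline case, again with invented names 
and an assumed size budget rather than the real attribute format: content 
small enough to fit in the inode's attribute area lives right there, so 
reading a small file or directory costs no I/O beyond the inode table 
block itself, and tail packing falls out of the same mechanism:

    #include <stdint.h>

    #define INLINE_MAX 256              /* assumed attribute budget, made up */

    struct inline_attr {                /* hypothetical inline data attribute */
            uint16_t length;            /* bytes of inline content in use */
            uint8_t  body[INLINE_MAX];  /* whole small file or directory, or a tail */
    };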

We also need to ask why we are putting so much effort into performing 
well on spinning disks, which are rapidly disappearing. Two reasons. 
First, spinning disks are not gone yet; they are just migrating to a 
backend storage role. Second, optimizations for spinning disk are 
helpful for solid state storage more often than not. In this case, 
keeping related requests close together in time lets the flash 
translation layer pack its erase blocks better, reducing write 
amplification during space recovery, and in turn increasing media life 
and performance. So it is too soon to forget about the idiosyncrasies 
and challenges of traditional rotating media, perhaps ten or twenty 
years too soon.


Regards,


Daniel


* Re: Tux3 Report: Untar Unleashed
From: Daniel Phillips @ 2014-04-25 22:27 UTC
  To: linux-kernel; +Cc: linux-fsdevel, tux3

Yesterday I wrote:
> When we checked read performance on the untarred tree, we immediately saw
> mixed results. Re-tarring the kernel tree is faster than Ext4, but
> directory listing is slower by a multiple. So we need to analyze and fix
> ls without breaking the good tar and untar behavior. The question is, is
> it worth another delay before putting Tux3 patches up for review?

Hirofumi would not let me slink cowardly away from that open question. 
We noticed that Tux3 does slightly more than one seek per directory, 
which is entirely reasonable. But Ext4 goes way beyond that and does 
some special magic to read multiple directories per seek. The only 
possible way to do that is to pack directories together so there are many 
per track. A bit of sleuthing confirmed that this is indeed the case, and 
the behavior apparently comes from a patch posted by Ted Ts'o a few years back:

     lwn.net/Articles/319829/
     [PATCH, RFC] ext4: New inode/block allocation algorithms for flex_bg filesystems

That patch was aimed at speeding up fsck, and the huge ls speedup appears 
to have gone unnoticed.

Thus inspired, Hirofumi whipped up a prototype patch to allocate new 
directories first, per delta. Result: Tux3 went from 400% slower to 25% 
faster than Ext4, for "ls -R" of the kernel source. Even better, tar and 
untar performance stayed about the same, with Tux3 topping the untar test 
at 20% faster and the tar test at 350% faster. (The lopsided tar result looks like a 
performance bug for Ext4.)
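
In rough outline (the sketch below is illustrative, not Hirofumi's actual 
code): within each delta, run allocation in two passes so that new 
directories land next to each other at the front of the delta's region, 
and file data follows behind them. Packed directories are what let one 
seek pull several of them into cache.

    #include <stddef.h>
    #include <stdint.h>

    enum alloc_kind { ALLOC_DIR, ALLOC_FILE };

    struct pending_alloc {
            enum alloc_kind kind;
            uint64_t inum;
            uint32_t blocks_wanted;
            uint64_t start_block;       /* filled in by the allocator */
    };

    static uint64_t next_free;          /* stand-in for the real allocator state */

    static void allocate_blocks(struct pending_alloc *req)
    {
            req->start_block = next_free;       /* stub: hand out blocks in order */
            next_free += req->blocks_wanted;
    }

    /* Lay out one delta: directories first so they pack together, then the rest. */
    static void layout_delta(struct pending_alloc *reqs, size_t count)
    {
            for (size_t i = 0; i < count; i++)
                    if (reqs[i].kind == ALLOC_DIR)
                            allocate_blocks(&reqs[i]);

            for (size_t i = 0; i < count; i++)
                    if (reqs[i].kind != ALLOC_DIR)
                            allocate_blocks(&reqs[i]);
    }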

This optimization only applies to spinning disk. It is pretty hard to 
think of a reason why packing directories together would benefit flash. 
Maybe directories that are written together are more likely to be 
updated together? But it doesn't hurt flash either, and is another data 
point to support our theory that optimizing for spinning disk also 
optimizes for flash. We are still waiting for the first counterexample 
to show up.

There are a few reasons why Tux3 has an edge for the case exercised by 
the kernel source loads:

   * Defer everything

Tux3 takes the idea of delayed allocation much further and delays nearly 
everything. Directory updates and inode number selection are the only 
exceptions. (In future we will attempt to defer the namespace updates as 
well.)

   * Front/back separation

Besides enabling defer-everything, this simplifies locking and reduces 
contention a lot, for both read and update. For now, a naive locking 
strategy serves us well. Eventually we will multithread the backend, 
which will help with high processor core counts, once we get there.

   * Big deltas

Under heavy update load, Tux3 deltas grow as big as cache will allow, so 
per-delta layout algorithms have a big data window available to optimize 
over. With our current strategy, we observe an effect similar to Ext4 
flex_bg, where directories and other metadata tend to self-organize 
along delta boundaries, with beneficial performance effects. We might 
control this behavior more explicitly in future.

   * More inodes per inode table block

Tux3 stores about 57 inodes per block while Ext4 typically has 16 or 
fewer. Packing multiple inodes per block already resembles a kind of inode table 
readahead. Without this, there would be two seeks per directory even 
with directories packed together.
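
To put rough numbers on that last point (assuming 4 KB inode table blocks, 
which is an assumption here, and comparable inode locality): 57 inodes per 
block works out to roughly 4096 / 57, about 72 bytes per inode, while 
Ext4's default 256-byte inodes give 4096 / 256 = 16 per block. The same 
set of inodes therefore spans roughly a quarter as many inode table 
blocks, and a quarter as many cold-cache seeks.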

Anyway, I don't think we need to hang our heads in shame for performance 
reasons at this point, even though plenty of major optimization issues 
still remain on the list. For example, you can embarrass Tux3 just by 
running a benchmark with 10,000 files per directory. The answer to that 
one is Shardmap, which needs a couple of months to bring up and solves a 
problem that does not come up on your home server or phone. Not a reason 
to get sidetracked again.

Regards,

Daniel

