Tux3 report: New news for the new year

From: Daniel Phillips
Date: 2013-01-01 10:55 UTC
To: lkml, linux-fsdevel, tux3
Hi everybody,
The Tux3 project has some interesting news to report for the new year. In
brief, the first time Hirofumi ever put together all the kernel pieces in his
magical lab over in Tokyo, our Tux3 rocket took off and made it straight to
orbit. Or in less metaphorical terms, our first meaningful benchmarks turned in
numbers that meet or even slightly beat the illustrious incumbent, Ext4:
fsstress -f dread=0 -f dwrite=0 -f fsync=0 -f fdatasync=0 \
         -s 1000 -l 200 -n 200 -p 3

ext4
    time    cpu    wait
    46.338, 1.244, 5.096
    49.101, 1.144, 5.896
    49.838, 1.152, 5.776

tux3
    time    cpu    wait
    46.684, 0.592, 1.860
    44.011, 0.684, 1.764
    43.773, 0.556, 1.888
Fsstress runs a mix of filesystem operations typical of a Linux system under
heavy load. In this test, Tux3 spends less time waiting than Ext4, uses less
CPU (see below) and finishes faster on average. This was exciting for us,
though we must temper our enthusiasm by noting that these are still early
results and several important bits of Tux3 are as yet unfinished. While we do
not expect the current code to excel at extreme scale just yet, it seems we
are already doing well at the scale of the computers you are running at this
very moment.
About Tux3
Here is a short Tux3 primer. Tux3 is a general purpose Linux filesystem
developed by a group of us mainly for the fun of it. Tux3 started in the summer of
2008, as a container for a new storage versioning algorithm originally meant
to serve as a new engine for the ddsnap volume snapshot virtual device:
http://lwn.net/Articles/288896/
"Versioned pointers: a new method of representing snapshots"
As design work proceeded on a suitably simple filesystem with modern features,
the focus shifted from versioning to the filesystem itself, as the latter is a
notoriously challenging and engaging project. Initial prototyping was done in
user space by me and others, and later ran under Fuse, a spectacular drive-by
contribution from one Tero Roponen. Hirofumi joined the team with an amazing
utility that makes graphs of the disk structure of Tux3 volumes, and soon took
charge of the kernel port. I stand in awe of Hirofumi's design sense, detail
work and general developer prowess.
Like a German car, Tux3 is both old school and modern. Closer in spirit to
Ext4 than Btrfs, Tux3 sports an inode table, allocates blocks with bitmaps,
puts directories in files, and stores attributes in inodes. Like Ext4 and
Btrfs, Tux3 uses extents indexed by btrees. Source file names are familiar:
balloc.c, namei.c, etc. But Tux3 has some new files like filemap.c and log.c that
help make it fast, compact, and very ACID.
Unlike Ext4, Tux3 keeps inodes in a btree; inodes are variable length, and all
inode attributes are variable length and optional. Also unlike Ext4, Tux3
writes nondestructively and uses a write-anywhere log instead of a journal.
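
To give the flavor of variable length attributes, here is a tiny sketch of how
such an encoding might look. The names and layout are invented for illustration
and are not Tux3's actual on-disk format:

/* Invented sketch of variable length, optional inode attributes. Each
 * attribute is self-describing, so an inode stores only the attributes
 * it actually uses. */
#include <stdint.h>

struct attr {
        uint16_t kind;          /* which attribute: mode, owner, extent root... */
        uint16_t size;          /* payload bytes that follow this header */
        /* 'size' bytes of payload follow */
};

/* Step from one packed attribute to the next within an inode. */
static struct attr *next_attr(struct attr *attr)
{
        return (struct attr *)((char *)(attr + 1) + attr->size);
}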
Differences with Btrfs are larger. The code base is considerably smaller,
though to be sure, some of that can be accounted for by incomplete features.
The Tux3 filesystem tree is single-rooted; there is no forest of shared trees.
There is no built-in volume manager. Names and inodes are stored separately.
And so on. But our goal is the same: a modern, snapshotting, replicating
general purpose filesystem, which, I am happy to say, seems to have just gotten
a lot closer.
Front/Back Separation
At the heart of Tux3's kernel implementation lies a technique we call
"front/back separation", which partly accounts for the surprising kernel CPU
advantage in the above benchmark results. Tux3 runs as two loosely coupled
pieces: the frontend, which handles Posix filesystem operations entirely in
cache, and the backend, which does the brute work of preparing dirty cache for
atomic transfer to media. The frontend shows up as kernel CPU accounted to the
Fsstress task, while the backend is largely invisible, running on some
otherwise idle CPU. We suspect that the total of frontend and backend CPU is
less than Ext4's as well, but so far nobody has checked. What we do know is
that filesystem operations tend to complete faster when they only need to deal
with cache and not with little details such as backing store.
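
To make the division of labor concrete, here is a toy userspace model of
front/back separation. This is not Tux3 code and every name in it is invented;
it only shows the shape of the idea: frontend operations complete by touching
memory, while a backend thread does the committing on its own time.

/* Toy model of front/back separation -- not Tux3 code; names invented. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct dirty { struct dirty *next; int inum; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct dirty *dirty_list;        /* dirty cache shared with the backend */

/* Frontend: a Posix operation just updates cache and returns -- no IO wait. */
static void frontend_op(int inum)
{
        struct dirty *d = malloc(sizeof *d);
        d->inum = inum;
        pthread_mutex_lock(&lock);
        d->next = dirty_list;
        dirty_list = d;
        pthread_mutex_unlock(&lock);
}

/* Backend: atomically steal the whole dirty list, then do the brute work. */
static void *backend(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&lock);
                struct dirty *batch = dirty_list;
                dirty_list = NULL;
                pthread_mutex_unlock(&lock);
                while (batch) {
                        struct dirty *next = batch->next;
                        printf("commit inode %d\n", batch->inum); /* stand-in for media transfer */
                        free(batch);
                        batch = next;
                }
                usleep(10000);          /* stand-in for delta pacing */
        }
        return NULL;
}

int main(void)
{
        pthread_t thread;
        pthread_create(&thread, NULL, backend, NULL);
        for (int i = 0; i < 100; i++)
                frontend_op(i);         /* completes at cache speed */
        sleep(1);                       /* let the backend catch up */
        return 0;
}

The point of the model is that frontend_op() never blocks on IO; all the heavy
lifting happens on the backend thread's time, on some otherwise idle CPU.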
Front/back separation is like taking delayed allocation to its logical
conclusion: every kind of structural change is delayed, not just block
allocation. I credit Matt Dillon of Dragonfly fame for this idea. He described
the way he used it in Hammer as part of this dialog:
http://kerneltrap.org/Linux/Comparing_HAMMER_And_Tux3
"Comparing HAMMER And Tux3"
Hammer is a cluster filesystem, but front/back separation turns out to be
equally effective on a single node. Of course, the tricky part is making the
two pieces run asynchronously without stalling on each other. Which brings us
to...
Block Forking
Block forking is an idea that has been part of Tux3 from the beginning, and
roughly resembles the "stable pages" work now underway. Unlike stable pages,
block forking does not reduce performance. Quite the contrary: block forking
enables front/back separation, which boosted Tux3's Fsstress performance by
about 40%. The basic idea of block forking is to never wait on pages under IO, but
clone them instead. This protects in-flight pages from damage by VFS syscalls
without forcing page cache updates to stall on writeback.
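
Here is a minimal sketch of that idea, again with invented names rather than
the actual implementation:

/* Sketch of block forking -- not the Tux3 code. A write that hits a page
 * under IO gets a private copy; the in-flight original stays stable until
 * writeback completes. */
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page {
        char data[PAGE_SIZE];
        int under_io;           /* set while the backend owns this page */
};

/* Return a page safe to modify: the original if idle, else a fork. */
static struct page *fork_if_busy(struct page **slot)
{
        struct page *page = *slot;
        if (!page->under_io)
                return page;            /* fast path: modify in place */
        struct page *copy = malloc(sizeof *copy);
        memcpy(copy->data, page->data, PAGE_SIZE);
        copy->under_io = 0;
        *slot = copy;                   /* cache now points at the fork; the
                                           old page is freed after its IO ends */
        return copy;
}

int main(void)
{
        struct page *slot = calloc(1, sizeof *slot);
        slot->under_io = 1;                     /* pretend writeback owns it */
        struct page *writable = fork_if_busy(&slot);
        memset(writable->data, 'x', PAGE_SIZE); /* in-flight data undisturbed */
        return 0;
}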
Implementing this simple idea is harder than it sounds. We need to deal with
multiple blocks being accessed asynchronously on the same page, and we need to
worry a lot about cache object lifetimes and locking. Especially in truncate,
things can get pretty crazy. Hirofumi's work here can only be described by one
word: brilliant.
Deltas and Strong Consistency
Tux3 groups frontend update transactions into "deltas". According to some
heuristic, one delta ends and the next one begins, such that all dirty cache
objects affected by the operations belonging to a given delta may be
transferred to media in a single atomic operation. In particular, we take care
that directory updates always lie in the same delta as associated updates such
as creating or deleting inode representations in the inode table.
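
The delta_get/delta_put calls visible in Tux3's trace output (quoted later in
this thread) hint at the mechanism; here is a hedged sketch of how such delta
pinning might work, with the bodies invented for illustration:

/* Sketch of delta pinning -- the delta_get/delta_put names appear in Tux3's
 * trace output, but these bodies are invented. A frontend op pins the
 * current delta while it modifies cache; the backend commits a delta only
 * once its pin count drains. Assumes at most two deltas in flight, matching
 * the front/back pipeline. */
#include <pthread.h>

static pthread_mutex_t delta_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t delta_drained = PTHREAD_COND_INITIALIZER;
static unsigned current_delta = 1;
static int pins[2];             /* in-flight ops per delta, indexed by delta & 1 */

/* Frontend op entry: pin the current delta so it cannot commit under us. */
static unsigned delta_get(void)
{
        pthread_mutex_lock(&delta_lock);
        unsigned delta = current_delta;
        pins[delta & 1]++;
        pthread_mutex_unlock(&delta_lock);
        return delta;
}

/* Frontend op exit: unpin; a drained delta may now be committed. */
static void delta_put(unsigned delta)
{
        pthread_mutex_lock(&delta_lock);
        if (!--pins[delta & 1])
                pthread_cond_broadcast(&delta_drained);
        pthread_mutex_unlock(&delta_lock);
}

/* Backend: close the current delta and wait until it is safe to commit. */
static unsigned delta_close(void)
{
        pthread_mutex_lock(&delta_lock);
        unsigned closed = current_delta++;      /* new ops land in the next delta */
        while (pins[closed & 1])
                pthread_cond_wait(&delta_drained, &delta_lock);
        pthread_mutex_unlock(&delta_lock);
        return closed;          /* every dirty object tagged 'closed' may now
                                   be transferred to media as one atomic unit */
}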
Tux3 always cleans dirty cache completely on each delta commit. This is not
traditional behavior for Linux filesystems, which normally let the core VM
memory flusher tell them which dirty pages of which inodes should be flushed to
disk. We largely ignore the VM's opinion about that and flush everything, every
delta. You might think this would hurt performance, but apparently it does
not. It does, however, allow us to implement stronger consistency guarantees
than are typical for Linux.
We provide two main guarantees:
* Atomicity: File data never appears on media in an intermediate state,
with the single exception of large file writes, which may be broken
across multiple deltas, but with write ordering preserved.
* Ordering: If one filesystem transaction ends before another transaction
begins, then the second transaction will never appear on durable media
unless the first does too.
Our atomicity guarantee resembles Ext4's data=journal but performs more like
data=ordered. This is interesting, considering that Tux3 always writes
nondestructively. Finding a new, empty location for each block written and
updating the associated metadata would seem to carry a fairly hefty cost, but
apparently it does not.
Our ordering guarantee has not been seen on Linux before, as far as we know.
We get it "for free" from Tux3's atomic update algorithm. This could possibly
prove useful to developers of file-based databases, for example, mailers and
MTAs. (Kmail devs, please take note!)
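
As a purely hypothetical illustration of what ordering buys such an
application, consider a mail store that writes a message body and then updates
an index. Under our ordering guarantee, the index entry can never reach media
unless the message body did too, with no fsync needed in between for crash
consistency (the file names below are invented):

/* Hypothetical mail-spool pattern relying on the ordering guarantee: the
 * body write ends before the index update begins, so the index never
 * appears on media referencing a missing message. An fsync is still needed
 * if the application must know the data is durable *now*; ordering alone
 * only rules out inconsistent states. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *body = "Subject: hello\n\nmessage text\n";
        int fd = open("spool/msg.1", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0)
                return 1;
        if (write(fd, body, strlen(body)) < 0)
                return 1;
        close(fd);                              /* transaction 1 ends here */

        FILE *index = fopen("spool/index", "a");
        if (!index)
                return 1;
        fprintf(index, "msg.1 %zu\n", strlen(body));    /* transaction 2 */
        fclose(index);
        return 0;       /* no fsync between the two steps needed for consistency */
}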
Logging and Rollup
Tux3 goes out of its way to avoid recursive copy on write, that is, the
expensive behavior where a change to a data leaf must be propagated all the
way up the filesystem tree to the root, to avoid altering data that belongs to
a previously committed consistent filesystem image. (Btrfs extends this
recursive copy on write idea to implement snapshots, but Tux3 does not.)
Instead of writing out changes to parents of altered blocks, Tux3 only changes
the parents in cache, and writes a description of each change to a log on
media. This prevents recursive copy-on-write. Tux3 will eventually write out
such retained dirty metadata blocks in a process we call "rollup", which
retires log blocks and writes out dirty metadata blocks in full. A delta
containing a rollup also tidily avoids recursive copy on write: just like any
other delta, changes to the parents of redirected blocks are made only in
cache, and new log entries are generated.
Tux3 further employs logging to make the allocation bitmap overhead largely
vanish. Tux3 retains dirty bitmaps in memory and writes a description of each
allocate/free to the log. It is much cheaper to write out one log block than
potentially many dirty bitmap blocks, each containing only a few changed bits.
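
A sketch of the idea follows; the record layout is invented for illustration
and does not match Tux3's actual log format:

/* Logging an allocation instead of writing bitmap blocks. The bitmap is
 * updated only in cache; one small record in the current log block stands
 * in for however many bitmap blocks the change dirtied. */
#include <stdint.h>
#include <string.h>

enum { LOG_ALLOC, LOG_FREE };

struct logrec {
        uint8_t  type;          /* LOG_ALLOC or LOG_FREE */
        uint8_t  count;         /* extent length in blocks */
        uint64_t block;         /* first block of the extent */
};

struct logblock {
        unsigned used;          /* bytes of records so far */
        char data[4096 - sizeof(unsigned)];
};

static void log_balloc(struct logblock *log, uint8_t *bitmap,
                       uint64_t block, uint8_t count)
{
        struct logrec rec = { .type = LOG_ALLOC, .count = count, .block = block };

        for (uint8_t i = 0; i < count; i++)     /* dirty the cached bitmap only */
                bitmap[(block + i) >> 3] |= 1 << ((block + i) & 7);
        /* append one tiny record; overflow check omitted in this sketch */
        memcpy(log->data + log->used, &rec, sizeof rec);
        log->used += sizeof rec;
}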
Tux3's rollup not only avoids expensive recursive copy on write, it also
optimizes updating in at least three ways.
* Multiple deltas may dirty the same metadata block multiple times but
rollup only writes those blocks once.
* Multiple metadata blocks may be written out in a single, linear pass
across spinning media.
* Backend structure changes are batched in a cache friendly way.
One curious side effect of Tux3's log+rollup strategy is that in normal
operation, the image of a Tux3 filesystem is never entirely consistent if
considered only as literal block images. Instead, the log must be replayed in
order to reconstruct dirty cache, then the view of the filesystem tree from
dirty cache is consistent.
This is more or less the inverse of the traditional view where a replay
changes the media image. Tux3 replay is a true read-only operation that leaves
media untouched and changes cache instead. In fact, this theme runs
consistently through Tux3's entire design. As a filesystem, Tux3 cares about
updating cache, moving data between cache and media, and little else.
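
Here is a toy model of that read-only replay walk, with invented structures;
the starting point corresponds to the logchain field that tux3 mkfs reports in
its superblock output (quoted later in this thread):

/* Toy model of read-only replay. Mount walks the log chain recorded in the
 * superblock and reapplies each log block to cache; nothing is ever written
 * back to the volume by replay itself. */
#include <stdint.h>
#include <stdio.h>

struct logblock {
        uint64_t prev;          /* previous log block in the chain, 0 = end */
        /* records would follow here */
};

static struct logblock volume[8];       /* stand-in for the block device */

static struct logblock *read_block(uint64_t blocknr)
{
        return &volume[blocknr];        /* a read, never a write */
}

static void replay(uint64_t logchain)
{
        /* Real replay would collect the chain and apply records oldest
           first; this loop only shows the read-only walk. */
        for (uint64_t blocknr = logchain; blocknr; ) {
                struct logblock *log = read_block(blocknr);
                printf("reapply log block %llu to cache\n",
                       (unsigned long long)blocknr);
                blocknr = log->prev;
        }
}

int main(void)
{
        volume[3].prev = 1;     /* toy chain: 3 -> 1 -> end */
        volume[1].prev = 0;
        replay(3);              /* as the superblock's logchain would point */
        return 0;
}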
Tux3 does not normally update the media view of its filesystem tree even at
unmount. Instead, it replays the log on each mount. One excellent reason for
doing this is to exercise our replay code. (You surely would not want to
discover replay flaws only on the rare occasions you crash.) Another reason is
that we view sudden interruption as the normal way a filesystem should shut
down. We uphold your right to hit the power switch on a computing device and
expect to find nothing but consistent data when you turn it back on.
Fast Sync
Tux3 can sync a minimal file data change to disk by writing four blocks, or a
minimal file create and write with seven blocks:
http://phunq.net/pipermail/tux3/2012-December/000011.html
"Full volume sync performance"
This is so fast that we are tempted to implement fsync as sync. However, we
intend to resist that temptation in the long run, and implement an optimized
fsync that "jumps the queue" of Tux3's delta update pipeline and completes
without waiting for a potentially large amount of unrelated dirty cache to be
flushed to media.
Still to do
There is a significant amount of work still needed to bring Tux3 to a
production state. As of today, Tux3 does not have snapshots, despite versioning
being the main motivation for starting the project in the first place. The new
PHtree directory index is designed but not yet implemented. Freespace management
needs acceleration before it will benchmark well at extreme scale. Block
allocation needs to be much smarter before it will age well and resist read
fragmentation. There are several major optimizations still left to implement.
We need a good fsck that approaches the effectiveness of e2fsck. There is a
long list of shiny features to add: block migration, volume growing and
shrinking, defragmentation, deduplication, replication, and so on.
We have made plausible plans for all of the above, but indeed the devil is in
the doing. So we are considering the merits of invoking the "many hands make
light work" principle. Tux3 is pretty well documented and the code base is, if
not completely obvious, at least small and orthogonal. Tux3 runs in userspace
in two different ways: via the tux3 command and under Fuse. Prototyping in user
space is
a rare luxury that could almost make one lazy. Tux3 is an entirely grassroots
effort driven by volunteers. Nonetheless, we would welcome offers of
assistance from wherever they may come, especially testers.
Regards,
Daniel
Re: Tux3 report: New news for the new year

From: Shentino
Date: 2013-01-02 6:58 UTC
To: Daniel Phillips
Cc: Martin Steigerwald, linux-kernel, linux-fsdevel, tux3
Haven't had a chance to benchmark it yet due to system troubles in the
wake of a botched emerge, but from what I can tell the "write a
promise" logic seems to be much smoother than btrfs's recursive
cow'ing.
From what I can tell on the design, tux3 is "fsync satiating" with a
single disk write. It writes the data to the final location, updates
the log, and at that point the data is considered committed and it can
let userspace go on its merry way and take care of rolling up the
changes later. If I understand btrfs correctly though it has to block
until the cow logic percolates all the way up to the superblock.
One other thing that interests me is this "page forking" that allows
userspace to write to a page that's already busy being written to
disk. From what I heard it bypasses a stall caused by userspace I/O
hitting a locked page.
Finally, atime handling. I personally dislike the forced default of
"relatime" for mount options and anything that can let atime updates
happen without being a bottleneck is a plus for me.
I'll probably have more to say once I get my new computer set up but
from what I gather talking to these guys on IRC, tux3 seems pretty
promising.
I actually followed its development in its heyday two years ago, before it
dropped off the radar.
On Tue, Jan 1, 2013 at 1:49 PM, Daniel Phillips <lkml@phunq.net> wrote:
> Hi Martin,
>
> Thanks for the "tux3 howto". Obviously tux3.org needs a refresh, but you hit the
> main points.
>
> On Tuesday, January 01, 2013 03:37:08 PM you wrote:
>> Writing a file with
>>
>> ./tux3 write tux3.img /etc/fstab
>>
>> also seemed to work, but I gave up holding down the enter key at:
>>
>> delta_get: delta 448, refcount 2
>> tuxio: write 1 bytes at 4484, isize = 0x1184
>> delta_put: delta 448, refcount 1
>>
>> /etc/fstab is 1714 bytes long.
>
> Indeed, the trace output is too chatty, but it's nice that it wrote a file. The
> tux3 command is cool, it can access and update an unmounted tux3 volume, even
> make diagrams of it. This will form the basis of our maintenance suite. A
> basic "tux3 fsck" is under construction:
>
> http://phunq.net/pipermail/tux3/2012-December/000012.html
> "Towards basic filesystem checking"
>
> http://phunq.net/pipermail/tux3/2012-December/000013.html
> "Towards basic filesystem checking (simplified)"
>
>> No tux3fuse, but then, I lacked libfuse-dev, after installing, compiling
>> worked:
>>
>> martin@merkaba:~[…]> make tux3fuse
>> gcc -MF ./.deps/tux3fuse.d -MP -MMD -m64 -std=gnu99 -Wall -g -rdynamic \
>>     -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 \
>>     -I/home/martin/Linux/Dateisysteme/tux3/tux3/user -Wall -Wextra -Werror \
>>     -Wundef -Wstrict-prototypes -Werror-implicit-function-declaration \
>>     -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers \
>>     -D_FORTIFY_SOURCE=2 -DLOCK_DEBUG=1 -DROLLUP_DEBUG=1 \
>>     -DDISABLE_ASYNC_BACKEND=1 $(pkg-config --cflags fuse) tux3fuse.c -lfuse \
>>     -o tux3fuse libtux3.a libklib/libklib.a
>>
>> Then I could use it:
>>
>> martin@merkaba:~[…]> ./tux3 mkfs tux3.img
>> __setup_sb: blocksize 4096, blockbits 12, blockmask 00000fff
>> __setup_sb: volblocks 25600, freeblocks 25600, freeinodes 281474976710656,
>>             nextalloc 0
>> __setup_sb: atom_dictsize 0, freeatom 0, atomgen 1
>> __setup_sb: logchain 0, logcount 0
>> make tux3 filesystem on tux3.img (0x6400000 bytes)
>> […]
>>
>> martin@merkaba:~[…]> sudo ./tux3fuse tux3.img /mnt/zeit
>> [sudo] password for martin:
>>
>> martin@merkaba:~[…]> mount | grep fuse
>> fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
>> tux3.img on /mnt/zeit type fuse.tux3.img
>> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>
>> But I am stuck with accessing it:
>>
>> martin@merkaba:~[…]> LANG=C ls -l /mnt/zeit
>> ls: cannot access /mnt/zeit: Permission denied
>>
>> martin@merkaba:~[…]> LANG=C sudo chown martin:martin /mnt/zeit
>> chown: cannot access '/mnt/zeit': Transport endpoint is not connected
>> martin@merkaba:~[…]> LANG=C sudo ls -l /mnt/zeit
>> ls: cannot access /mnt/zeit: Transport endpoint is not connected
>> martin@merkaba:~[…]>
>>
>> Unmounting it again worked nicely.
>
> That would be a bug, most probably in the fuse glue. Nearly all the testing
> recently has been on the kernel port (mostly under kvm), so it's nice to hear
> that tux3 write still works and tux3fuse almost works. Probably, the fuse glue
> hit an assert and bailed out, causing the "endpoint is not connected" state. A
> good example of why it would be nice to beef up our team a little. Anybody who
> wants to take charge of the fuse glue is welcome.
>
>> I will leave it at that for now, until I can take the time for a closer look.
>>
>> I think it's better to continue this on the tux3 mailing list, which I have
>> subscribed to. But I thought I would post this here, to give others some
>> starting point for their own experiments, as I did not find any documentation
>> about this in the git repo.
>
> Thanks, that was great, and quick. We look forward to seeing you on the Tux3
> mailing list.
>
> Another thing I forgot to post is a link to the tux3 mailing list:
>
> http://phunq.net/mailman/listinfo/tux3
>
> Also, we are on oftc.net, #tux3 channel.
>
> We will do our best to improve the howto documentation. A wiki would be nice.
> We may start one on github, then set one up on tux3.org later. Currently, the
> main focus is on filling in the last few big pieces needed to scale well, and
> of course, debugging. A few more helping hands on things like wikis and
> documentation refresh would be most appreciated. For now, design documentation
> and howtos get posted to the Tux3 mailing list. You are more than welcome to
> post your recipes above.
>
> Thanks again,
>
> Daniel
>
Re: Tux3 report: New news for the new year

From: Daniel Phillips
Date: 2013-01-02 11:03 UTC
To: Shentino
Cc: Martin Steigerwald, linux-kernel, linux-fsdevel, tux3
On Tuesday, January 01, 2013 10:58:35 PM Shentino wrote:
> From what I can tell on the design, tux3 is "fsync satiating" with a
> single disk write. It writes the data to the final location, updates
> the log, and at that point the data is considered committed and it can
> let userspace go on its merry way and take care of rolling up the
> changes later.
Yes, correct. I think we currently sync a small file create+write with seven
blocks and a file rewrite with four blocks, including the commit block, with
only one long seek. We haven't benchmarked that yet, but it sounds fast. There are
two synchronous waits in the backend, but the frontend only waits on the
commit block completion in the task doing the sync while other concurrent
filesystem operations just keep going.
> If I understand btrfs correctly though it has to block
> until the cow logic percolates all the way up to the superblock.
A careful reading of the Btrfs design doc left me confused about that. Perhaps
Btrfs devs could clarify?
> One other thing that interests me is this "page forking" that allows
> userspace to write to a page that's already busy being written to
> disk. From what I heard it bypasses a stall caused by userspace I/O
> hitting a locked page.
Page forking is an amazing thing and should really head into core, after being
thoroughly proven out, of course.
> Finally, atime handling. I personally dislike the forced default of
> "relatime" for mount options and anything that can let atime updates
> happen without being a bottleneck is a plus for me.
Atime is an odious invention indeed from a developer's perspective, but it is
apparently well loved by some users and has real applications; knowing which
videos you watched recently is said to be one of them. We have a pretty
good plan for it that is actually just a small development item, the main
feature of which is avoiding polluting the inode table btree, which would
cause a lot of churn and aggravate allocate-on-write issues that are already
difficult, plus be horribly unfriendly to flash. Instead, we churn a dedicated
btree array (actually a regular file) where the write-on-reads are densely
concentrated. It somehow feels good to quarantine this craziness at least.
Regards,
Daniel