* Tux3 Report: Faster than tmpfs, what?
@ 2013-05-07 23:24 Daniel Phillips
From: Daniel Phillips @ 2013-05-07 23:24 UTC (permalink / raw)
To: linux-kernel, tux3, linux-fsdevel
When something sounds too good to be true, it usually is. But not always. Today
Hirofumi posted some nigh on unbelievable dbench results that show Tux3
beating tmpfs. To put this in perspective, we normally regard tmpfs as
unbeatable because it is just a thin shim between the standard VFS mechanisms
that every filesystem must use and the swap device. Our usual definition of
successful optimization is that we end up somewhere between Ext4 and tmpfs,
or in other words, faster than Ext4. This time we got an excellent surprise.
The benchmark:
dbench -t 30 -c client2.txt 1 & (while true; do sync; sleep 4; done)
Configuration:
KVM guest with two CPUs and 4 GB of memory, running on a four-core Sandy
Bridge host at 3.4 GHz with 8 GB of memory. Spinning disk. (Disk drive
details to follow.)
Summary of results:
tmpfs: Throughput 1489.00 MB/sec max_latency=1.758 ms
tux3: Throughput 1546.81 MB/sec max_latency=12.950 ms
ext4: Throughput 1017.84 MB/sec max_latency=1441.585 ms
Tux3 edged out tmpfs and stomped Ext4 righteously. What is going on?
Simple: Tux3 has a frontend/backend design that runs on two CPUs. This
allows handing off some of the work of unlink and delete to the kernel tux3d,
which runs asynchronously with respect to the dbench task. All Tux3 needs to
do in the dbench context is set a flag in the deleted inode and add it to a
dirty list. The remaining work, such as truncating page cache pages, is
handled by the backend tux3d. The effect is easily visible in the dbench
details below (see the Unlink and Deltree lines), and is sketched in the code
example that follows.
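To make the division of labour concrete, here is a minimal sketch in
kernel-style C of how such a deferred delete could be structured. This is
purely illustrative, not actual Tux3 code: the structures, the
TUX3_I_DEFER_DELETE flag and the function names (tux3_defer_delete,
tux3_backend_flush) are hypothetical.

    /*
     * Illustrative sketch only. The frontend marks a deleted inode and
     * queues it; a backend worker later does the expensive page cache
     * truncation and metadata updates.
     */
    #include <linux/fs.h>
    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/spinlock.h>

    struct tux3_sb_info {
            spinlock_t dirty_lock;
            struct list_head deferred_delete; /* inodes awaiting backend cleanup */
    };

    #define TUX3_I_DEFER_DELETE 0x1

    struct tux3_inode_info {
            unsigned long flags;
            struct list_head dirty_link;
            struct inode vfs_inode;
    };

    /* Frontend path: runs in the caller's (e.g. dbench's) context. */
    static void tux3_defer_delete(struct tux3_sb_info *sbi,
                                  struct tux3_inode_info *ti)
    {
            spin_lock(&sbi->dirty_lock);
            ti->flags |= TUX3_I_DEFER_DELETE;       /* cheap: set a flag... */
            list_add_tail(&ti->dirty_link, &sbi->deferred_delete); /* ...and list it */
            spin_unlock(&sbi->dirty_lock);
            /* return immediately; no page cache truncation here */
    }

    /* Backend path: runs in the tux3d context, asynchronously to the frontend. */
    static void tux3_backend_flush(struct tux3_sb_info *sbi)
    {
            struct tux3_inode_info *ti, *next;
            LIST_HEAD(batch);

            spin_lock(&sbi->dirty_lock);
            list_splice_init(&sbi->deferred_delete, &batch);
            spin_unlock(&sbi->dirty_lock);

            list_for_each_entry_safe(ti, next, &batch, dirty_link) {
                    /* the expensive work happens here, off the application's path */
                    truncate_inode_pages(ti->vfs_inode.i_mapping, 0);
                    list_del_init(&ti->dirty_link);
                    /* freeing on-disk blocks and updating the itable would follow */
            }
    }

The point is simply that the frontend path is a flag and a list insertion,
while the truncation and metadata updates run later in the backend's context.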
It is hard to overstate how pleased we are with these results, particularly
because our first dbench tests a couple of days ago were embarrassing: more
than five times slower than Ext4. The issue turned out to be inefficient inode
allocation. Hirofumi changed the horribly slow itable btree search to a
simple "allocate the next inode number" counter, and shazam! The slowpoke
became a superstar. Now, this comes with a caveat: the code that produced
these numbers currently relies on that benchmark-specific hack to speed up
inode number allocation. However, we are pretty sure that our production inode
allocation algorithm will add insignificant overhead compared to this
temporary hack, if only because "allocate the next inode number" is nearly
always the best strategy.
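For illustration, the temporary hack amounts to little more than the
following. Again, this is a hypothetical sketch rather than the real Tux3
code; in the real filesystem the counter would live in the in-memory
superblock and be seeded from the itable at mount time.

    #include <linux/spinlock.h>
    #include <linux/types.h>

    static u64 next_inum;               /* next unused inode number */
    static DEFINE_SPINLOCK(inum_lock);

    /* Purely illustrative: a simple counter replacing the itable btree search. */
    static u64 alloc_inum_hack(void)
    {
            u64 inum;

            spin_lock(&inum_lock);
            inum = next_inum++;         /* no btree probe at all */
            spin_unlock(&inum_lock);
            return inum;
    }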
With directory indexing now considered a solved problem, the only big
issue we feel needs to be addressed before offering Tux3 for merge is
allocation. For now we use the same overly simplistic strategy to allocate
both disk blocks and inode numbers, which is trivially easy to defeat,
producing horrible benchmark numbers on spinning disk. So the next round
of work, which I hope will only take a few weeks, consists of improving
these allocators to an at least somewhat respectable level.
For inode number allocation, I have proposed a strategy that looks a lot
like the Ext2/3/4 inode bitmaps. Tux3's twist is that these bitmaps are just
volatile cache objects, never transferred to disk. My expectation is that the
overhead of allocating from these bitmaps will hardly affect today's
benchmark numbers at all, but that remains to be proven.
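A minimal sketch of allocating from such a volatile bitmap follows. The
chunk structure, the chunk size and all names are assumptions made for
illustration; the real design may differ, and rebuilding the bitmap from
the itable is left out entirely.

    #include <linux/bitops.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    #define INUMS_PER_CHUNK (1u << 15)  /* arbitrary chunk size for the sketch */

    struct inum_bitmap_chunk {
            u64 base;                   /* first inode number covered */
            unsigned long *map;         /* one bit per inode number, memory only */
            unsigned int next_hint;     /* where the last search ended */
    };

    /* Allocate the next free inode number at or after the hint. */
    static int alloc_inum(struct inum_bitmap_chunk *chunk, u64 *inum)
    {
            unsigned int bit;

            bit = find_next_zero_bit(chunk->map, INUMS_PER_CHUNK, chunk->next_hint);
            if (bit >= INUMS_PER_CHUNK)
                    return -ENOSPC;     /* caller would move on to another chunk */

            __set_bit(bit, chunk->map);
            chunk->next_hint = bit + 1;
            *inum = chunk->base + bit;
            return 0;
    }

Because the bitmap is never written out, there is no extra on-disk metadata
to keep consistent; after a crash or an unmount it would presumably be
rebuilt from the itable on demand.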
Detailed dbench results:
tux3:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    1477980     0.003    12.944
Close        1085650     0.001     0.307
Rename         62579     0.006     0.288
Unlink        298496     0.002     0.345
Deltree           38     0.083     0.157
Mkdir             19     0.001     0.002
Qpathinfo    1339597     0.002     0.468
Qfileinfo     234761     0.000     0.231
Qfsinfo       245654     0.001     0.259
Sfileinfo     120379     0.001     0.342
Find          517948     0.005     0.352
WriteX        736964     0.007     0.520
ReadX        2316653     0.002     0.499
LockX           4812     0.002     0.207
UnlockX         4812     0.001     0.221
Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms
tmpfs:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    1423080     0.004     1.155
Close        1045354     0.001     0.578
Rename         60260     0.007     0.470
Unlink        287392     0.004     0.607
Deltree           36     0.651     1.352
Mkdir             18     0.001     0.002
Qpathinfo    1289893     0.002     0.575
Qfileinfo     226045     0.000     0.346
Qfsinfo       236518     0.001     0.383
Sfileinfo     115924     0.001     0.405
Find          498705     0.007     0.614
WriteX        709522     0.005     0.679
ReadX        2230794     0.002     1.271
LockX           4634     0.002     0.021
UnlockX         4634     0.001     0.324
Throughput 1489 MB/sec 1 clients 1 procs max_latency=1.758 ms
ext4:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX     988446     0.005    29.226
Close         726028     0.001     0.247
Rename         41857     0.011     0.238
Unlink        199651     0.022  1441.552
Deltree           24     1.517     3.358
Mkdir             12     0.002     0.002
Qpathinfo     895940     0.003    15.849
Qfileinfo     156970     0.001     0.429
Qfsinfo       164303     0.001     0.210
Sfileinfo      80501     0.002     1.037
Find          346400     0.010     2.885
WriteX        492615     0.009    13.676
ReadX        1549654     0.002     0.808
LockX           3220     0.002     0.015
UnlockX         3220     0.001     0.010
Throughput 1017.84 MB/sec 1 clients 1 procs max_latency=1441.585 ms
Apologies for the formatting. I will get back to a real mailer soon.
Regards,
Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: Dave Chinner @ 2013-05-10 4:50 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel, tux3, linux-fsdevel

On Tue, May 07, 2013 at 04:24:05PM -0700, Daniel Phillips wrote:
> When something sounds too good to be true, it usually is. But not always.
> Today Hirofumi posted some nigh on unbelievable dbench results that show
> Tux3 beating tmpfs. To put this in perspective, we normally regard tmpfs
> as unbeatable because it is just a thin shim between the standard VFS
> mechanisms that every filesystem must use and the swap device. Our usual
> definition of successful optimization is that we end up somewhere between
> Ext4 and tmpfs, or in other words, faster than Ext4. This time we got an
> excellent surprise.
>
> The benchmark:
>
> dbench -t 30 -c client2.txt 1 & (while true; do sync; sleep 4; done)

I'm deeply suspicious of what is in that client2.txt file. dbench on
ext4 on a 4 SSD RAID0 array with a single process gets 130MB/s
(kernel is 3.9.0). Your workload gives you over 1GB/s on ext4.....

> tux3:
> Operation      Count    AvgLat    MaxLat
> ----------------------------------------
> NTCreateX    1477980     0.003    12.944
....
> ReadX        2316653     0.002     0.499
> LockX           4812     0.002     0.207
> UnlockX         4812     0.001     0.221
> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms

Hmmm... No "Flush" operations. Gotcha - you've removed the data
integrity operations from the benchmark.

Ah, I get it now - you've done that so the front end of tux3 won't
encounter any blocking operations and so can offload 100% of
operations. It also explains the sync call every 4 seconds to keep
the tux3 back end writing out to disk so that a) all the offloaded work
is done by the sync process and not measured by the benchmark, and
b) the front end doesn't overrun queues and throttle or run out
of memory.

Oh, so nicely contrived. But terribly obvious now that I've found
it. You've carefully crafted the benchmark to demonstrate a best
case workload for the tux3 architecture, then carefully not
measured the overhead of the work tux3 has offloaded, and then not
disclosed any of this in the hope that the headline is all people
will look at.

This would make a great case study for a "BenchMarketing For
Dummies" book.

Shame for you that you sent it to a list where people see the dbench
numbers for ext4 and immediately think "that's not right" and then
look deeper. Phoronix might swallow your sensationalist headline
grab without analysis, but I don't think I'm alone in my suspicion
that there was something stinky about your numbers.

Perhaps in future you'll disclose such information with your
results, otherwise nobody is ever going to trust anything you say
about tux3....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Tux3 Report: Faster than tmpfs, what?
From: Christian Stroetmann @ 2013-05-10 5:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: Linux FS Devel, Linux Kernel

Aloha hardcore coders,

Thank you very much for working out the facts, Dave. You proved why, over
all these years, I had such a suspicious feeling when reading between the
lines of the Tux3 e-mails sent to the mailing list, which should not mean
that I do not like the work around the Tux3 file system in general. Quite
the contrary: it is highly interesting to watch whether there are
possibilities to push the whole field further. But this kind of marketing,
as seen in the past, is truly not constructive, merely contemporary.

Have fun in the sun
Christian Stroetmann

> On Tue, May 07, 2013 at 04:24:05PM -0700, Daniel Phillips wrote:
>> When something sounds too good to be true, it usually is. But not always.
....
> Hmmm... No "Flush" operations. Gotcha - you've removed the data
> integrity operations from the benchmark.
....
> Perhaps in future you'll disclose such information with your
> results, otherwise nobody is ever going to trust anything you say
> about tux3....
>
> Cheers,
>
> Dave.
* Re: Tux3 Report: Faster than tmpfs, what?
From: OGAWA Hirofumi @ 2013-05-10 5:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: Daniel Phillips, linux-fsdevel, tux3, linux-kernel

Dave Chinner <david@fromorbit.com> writes:

>> tux3:
>> Operation      Count    AvgLat    MaxLat
>> ----------------------------------------
>> NTCreateX    1477980     0.003    12.944
> ....
>> ReadX        2316653     0.002     0.499
>> LockX           4812     0.002     0.207
>> UnlockX         4812     0.001     0.221
>> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms
>
> Hmmm... No "Flush" operations. Gotcha - you've removed the data
> integrity operations from the benchmark.

Right. Because tux3 does not implement fsync() yet. So, I did

grep -v Flush /usr/share/dbench/client.txt > client2.txt

Why is that important for the comparison?

> Ah, I get it now - you've done that so the front end of tux3 won't
> encounter any blocking operations and so can offload 100% of
> operations. It also explains the sync call every 4 seconds to keep
> the tux3 back end writing out to disk so that a) all the offloaded work
> is done by the sync process and not measured by the benchmark, and
> b) the front end doesn't overrun queues and throttle or run out
> of memory.

Our backend is still running in a debugging mode (flush every 10
transactions, for stress testing). There is no interface yet for using
normal writeback timing, and I am not tackling that yet. And if normal
writeback cannot beat a crude fixed timer (4 seconds), that rather means
we have to improve the writeback timing. I.e., sync should be slower than
the best possible timing, right?

> Oh, so nicely contrived. But terribly obvious now that I've found
> it. You've carefully crafted the benchmark to demonstrate a best
> case workload for the tux3 architecture, then carefully not
> measured the overhead of the work tux3 has offloaded, and then not
> disclosed any of this in the hope that the headline is all people
> will look at.
>
> This would make a great case study for a "BenchMarketing For
> Dummies" book.

Simply wrong. I did this to start optimizing tux3 (we know we have many
places to optimize in tux3), and that post was the result. If you cannot
see at all from it what we achieved with the frontend/backend design, I am
a bit sad about that.

From this result, I think I could improve tmpfs/ext4 in the same way as
tux3 (see the Unlink/Deltree lines) if I wanted to.

Thanks.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
* Re: Tux3 Report: Faster than tmpfs, what?
From: Dave Chinner @ 2013-05-14 6:34 UTC (permalink / raw)
To: OGAWA Hirofumi; +Cc: Daniel Phillips, linux-fsdevel, tux3, linux-kernel

On Fri, May 10, 2013 at 02:47:35PM +0900, OGAWA Hirofumi wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> >> tux3:
> >> Operation      Count    AvgLat    MaxLat
> >> ----------------------------------------
> >> NTCreateX    1477980     0.003    12.944
> > ....
> >> ReadX        2316653     0.002     0.499
> >> LockX           4812     0.002     0.207
> >> UnlockX         4812     0.001     0.221
> >> Throughput 1546.81 MB/sec 1 clients 1 procs max_latency=12.950 ms
> >
> > Hmmm... No "Flush" operations. Gotcha - you've removed the data
> > integrity operations from the benchmark.
>
> Right. Because tux3 does not implement fsync() yet. So, I did
>
> grep -v Flush /usr/share/dbench/client.txt > client2.txt
>
> Why is that important for the comparison?

Because nobody could reproduce your results without working that
out. You didn't disclose that you'd made these changes, and that
makes it extremely misleading as to what the results mean. Given the
headline-grab nature of it, it's deceptive at best.

I don't care how fast tux3 is - I care about being able to reproduce
other people's results. Hence if you are going to report benchmark
results comparing filesystems then you need to tell everyone exactly
what you've tweaked and why, from the hardware all the way up to the
benchmark config.

Work on how *you* report *your* results - don't let Daniel turn them
into some silly marketing fluff that tries to grab headlines.

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Tux3 Report: Faster than tmpfs, what?
From: OGAWA Hirofumi @ 2013-05-14 7:59 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, tux3, Daniel Phillips, linux-kernel

Dave Chinner <david@fromorbit.com> writes:

>> Right. Because tux3 does not implement fsync() yet. So, I did
>>
>> grep -v Flush /usr/share/dbench/client.txt > client2.txt
>>
>> Why is that important for the comparison?
>
> Because nobody could reproduce your results without working that
> out. You didn't disclose that you'd made these changes, and that
> makes it extremely misleading as to what the results mean. Given the
> headline-grab nature of it, it's deceptive at best.
>
> I don't care how fast tux3 is - I care about being able to reproduce
> other people's results. Hence if you are going to report benchmark
> results comparing filesystems then you need to tell everyone exactly
> what you've tweaked and why, from the hardware all the way up to the
> benchmark config.

Thanks for the advice.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-11 6:12 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, tux3, linux-kernel

Hi Dave,

Thanks for the catch - I should indeed have noted that "modified dbench"
was used for this benchmark, thus amplifying Tux3's advantage in delete
performance. This literary oversight does not make the results any less
interesting: we beat tmpfs on that particular load. Beating tmpfs at
anything is worthy of note. Obviously, all three filesystems ran the same
load.

We agree that "classic unadulterated dbench" is an important Linux
benchmark for comparison with other filesystems. I think we should
implement a proper fsync for that one and not just use fsync = sync. That
isn't very far in the future; however, our main focus right now is
optimizing spinning disk allocation. It probably makes logistical sense to
leave fsync as it is for now and concentrate on the more important issues.

I do not agree with your assertion that the benchmark as run is invalid,
only that the modified load should have been described in detail. I
presume you would like to see a new bakeoff using "classic" dbench.
Patience please, this will certainly come down the pipe in due course. We
might not beat tmpfs on that load, but we certainly expect to outperform
some other filesystems.

Note that Tux3 ran this benchmark using its normal strong consistency
semantics, roughly similar to Ext4's data=journal. In that light, the
results are even more interesting.

> ...you've done that so the front end of tux3 won't
> encounter any blocking operations and so can offload 100% of
> operations.

Yes, that is the entire point of our front/back design: reduce
application latency for buffered filesystem transactions.

> It also explains the sync call every 4 seconds to keep
> the tux3 back end writing out to disk so that a) all the offloaded work
> is done by the sync process and not measured by the benchmark, and
> b) the front end doesn't overrun queues and throttle or run out
> of memory.

Entirely correct. That's really nice, don't you think? You nicely
described a central part of Tux3's design: our "delta" mechanism. We
expect to spend considerable effort tuning the details of our delta
transition behaviour as time goes by. However, this is not an immediate
priority because the simplistic "flush every 4 seconds" hack already works
pretty well for a lot of loads.

Thanks for your feedback,

Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: james northrup @ 2013-05-11 18:35 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Dave Chinner, linux-kernel, tux3, linux-fsdevel

Also interesting information... a study of 2,047 papers on PubMed finds
that two-thirds of retracted papers were down to scientific misconduct,
not error.

On Fri, May 10, 2013 at 11:12 PM, Daniel Phillips
<daniel.raymond.phillips@gmail.com> wrote:
> Hi Dave,
>
> Thanks for the catch - I should indeed have noted that "modified dbench"
> was used for this benchmark, thus amplifying Tux3's advantage in delete
> performance. This literary oversight does not make the results any less
> interesting: we beat tmpfs on that particular load. Beating tmpfs at
> anything is worthy of note. Obviously, all three
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-12 4:39 UTC (permalink / raw)
To: james northrup; +Cc: Dave Chinner, linux-kernel, tux3, linux-fsdevel

On Sat, May 11, 2013 at 11:35 AM, james northrup
<northrup.james@gmail.com> wrote:
> Also interesting information... a study of 2,047 papers on PubMed finds
> that two-thirds of retracted papers were down to scientific misconduct,
> not error.

Could you please be specific about the meaning you intend? Because
innuendo is less than useful in this forum.

If you mean to say that our posted results might not be independently
verifiable, then I invite you to run the tests as described (including
removing fsync) yourself. If you require any assistance from us in doing
that, we will be pleased to provide it.

Regards,

Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: Theodore Ts'o @ 2013-05-11 21:26 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Dave Chinner, linux-kernel, tux3, linux-fsdevel

On Fri, May 10, 2013 at 11:12:27PM -0700, Daniel Phillips wrote:
> Hi Dave,
>
> Thanks for the catch - I should indeed have noted that "modified dbench"
> was used for this benchmark, thus amplifying Tux3's advantage in delete
> performance.

Dropping fsync() does a lot more than "amplify Tux3's advantage in
delete performance". Since fsync(2) is defined as not returning until
the data written to the file descriptor is flushed out to stable
storage --- so it is guaranteed to be seen after a system crash --- it
means that the foreground application must not continue until the data
is written by Tux3's back-end.

So it also means that any advantage of decoupling the front/back end
is nullified, since fsync(2) requires a temporal coupling. In fact,
if there are any delays introduced between when the front-end sends the
fsync request and when the back-end finishes writing the data and
then communicates this back to the front-end --- i.e., caused by
scheduler latencies --- this may end up being a disadvantage compared to
more traditional file system designs.

Like many things in file system design, there are tradeoffs. It's
perhaps more useful when having these discussions to be clear what
you are trading off for what; in this case, the front/back design may
be good for some things, and less good for others, such as mail server
workloads where fsync(2) semantics is extremely important for
application correctness.

Best regards,

- Ted
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-12 1:10 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-fsdevel, tux3, Dave Chinner, linux-kernel

On May 11, 2013 2:26 PM, "Theodore Ts'o" <tytso@mit.edu> wrote:
> Dropping fsync() does a lot more than "amplify Tux3's advantage in
> delete performance". Since fsync(2) is defined as not returning until
> the data written to the file descriptor is flushed out to stable
> storage --- so it is guaranteed to be seen after a system crash --- it
> means that the foreground application must not continue until the data
> is written by Tux3's back-end.
....
> Like many things in file system design, there are tradeoffs. It's
> perhaps more useful when having these discussions to be clear what
> you are trading off for what; in this case, the front/back design may
> be good for some things, and less good for others, such as mail server
> workloads where fsync(2) semantics is extremely important for
> application correctness.

Exactly, Ted. We avoided measuring the fsync load on this particular
benchmark because we have not yet optimized fsync. When we do get to it
(not an immediate priority) I expect we will perform competitively,
because Tux3 does manage to get deltas onto disk with a minimal number of
block writes.

Regards,

Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-12 4:16 UTC (permalink / raw)
To: Theodore Ts'o, Daniel Phillips, Dave Chinner, linux-kernel, tux3, linux-fsdevel

(resent as plain text)

On Sat, May 11, 2013 at 2:26 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Dropping fsync() does a lot more than "amplify Tux3's advantage in
> delete performance". Since fsync(2) is defined as not returning until
> the data written to the file descriptor is flushed out to stable
> storage --- so it is guaranteed to be seen after a system crash --- it
> means that the foreground application must not continue until the data
> is written by Tux3's back-end.
....
> Like many things in file system design, there are tradeoffs. It's
> perhaps more useful when having these discussions to be clear what
> you are trading off for what; in this case, the front/back design may
> be good for some things, and less good for others, such as mail server
> workloads where fsync(2) semantics is extremely important for
> application correctness.

Exactly, Ted. We avoided measuring the fsync load on this particular
benchmark because we have not yet optimized fsync. When we do get to it
(not an immediate priority) I expect Tux3 to perform competitively,
because our delta commit scheme does manage the job with a minimal number
of block writes.

To have a really efficient fsync we need to isolate just the changes for
the fsynced file into a special "half delta" that gets its own commit,
ahead of any other pending changes to the filesystem. There is a plan for
this, however we would rather not get sidetracked on that now, while we
are getting ready for merge.

The point that seems to be getting a little lost in this thread is that
the benchmark, just as we ran it, models an important and common type of
workload, arguably the most common workload for real users, and the
resulting performance measurement is easily reproducible for anyone who
cares to try. In fact, I think we should prepare and post a detailed
recipe for doing just that, since the interest level seems to be high.

Regards,

Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-12 4:28 UTC (permalink / raw)
To: Theodore Ts'o, Daniel Phillips, Dave Chinner, linux-kernel, tux3, linux-fsdevel

(resent as plain text)

On Sat, May 11, 2013 at 2:26 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Dropping fsync() does a lot more than "amplify Tux3's advantage in
> delete performance". Since fsync(2) is defined as not returning until
> the data written to the file descriptor is flushed out to stable
> storage --- so it is guaranteed to be seen after a system crash --- it
> means that the foreground application must not continue until the data
> is written by Tux3's back-end.
....
> Like many things in file system design, there are tradeoffs. It's
> perhaps more useful when having these discussions to be clear what
> you are trading off for what; in this case, the front/back design may
> be good for some things, and less good for others, such as mail server
> workloads where fsync(2) semantics is extremely important for
> application correctness.

Exactly, Ted. We avoided measuring the fsync load on this particular
benchmark because we have not yet optimized fsync. When we do get to it
(not an immediate priority) I expect Tux3 to perform competitively,
because our delta commit scheme does manage the job with a minimal number
of block writes.

To have a really efficient fsync we need to isolate just the changes for
the fsynced file into a special "half delta" that gets its own commit,
ahead of any other pending changes to the filesystem. There is a plan for
this, however we would rather not get sidetracked on that project now
while we are getting ready for merge.

The point that seems to be getting a little lost in this thread is that
the benchmark, just as we ran it, models an important and common type of
workload, arguably the most common workload for real users, and the
resulting performance measurement is easily reproducible for anyone who
cares to try. In fact, I think we should prepare and post a detailed
recipe for doing just that, since the interest level seems to be high.

Regards,

Daniel

PS for any Googlers reading: do you know that using Gmail to post to LKML
is simply maddening for all concerned? If you want to know why, then try
it yourself. Plain text. Some people need it, and need it to be reliable
instead of gratuitously changing back to html at surprising times. And
static word wrap. Necessary.
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-13 23:22 UTC (permalink / raw)
To: Theodore Ts'o, Daniel Phillips, Dave Chinner, linux-kernel, tux3, linux-fsdevel

Hi Ted,

You said:

> ...any advantage of decoupling the front/back end
> is nullified, since fsync(2) requires a temporal coupling

After pondering it for a while, I realized that is not completely
accurate. The reduced delete latency will allow the dbench process to
proceed to the fsync point faster, then if our fsync is reasonably
efficient (not the case today, but planned) we may still see an overall
speedup.

> if there are any delays introduced between when the
> front-end sends the fsync request and when the back-
> end finishes writing the data and then communicates
> this back to the front-end --- i.e., caused by scheduler
> latencies --- this may end up being a disadvantage
> compared to more traditional file system designs.

Nothing stops our frontend from calling its backend synchronously, which
is just what we intend to do for fsync. The real design issue for Tux3
fsync is writing out the minimal set of blocks to update a single file.

As it is now, Tux3 commits all dirty file data at each delta, which is
fine for many common loads, but not all. Two examples of loads where this
may be less than optimal:

  1) fsync (as you say)
  2) multiple tasks accessing different files

To excel under those loads, Tux3 needs to be able to break its "always
commit everything" rule in an organized way. We have considered several
design options for this but not yet prototyped any, because we feel that
work can reasonably be attacked later. As always, we will seek the most
rugged, efficient and simple solution.

Regards,

Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: Andreas Dilger @ 2013-05-14 0:08 UTC (permalink / raw)
To: Daniel Phillips
Cc: Daniel Phillips, Theodore Ts'o, tux3@tux3.org, Dave Chinner,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 2013-05-13, at 17:22, Daniel Phillips
<daniel.raymond.phillips@gmail.com> wrote:
> Hi Ted,
>
> You said:
>> ...any advantage of decoupling the front/back end
>> is nullified, since fsync(2) requires a temporal coupling
>
> After pondering it for a while, I realized that is not completely
> accurate. The reduced delete latency will allow the dbench process to
> proceed to the fsync point faster, then if our fsync is reasonably
> efficient (not the case today, but planned) we may still see an overall
> speedup.

Ages ago, before we implemented extents for ext3, we had an asynchronous
unlink/truncate-to-zero thread that was handling the busywork of
traversing the indirect tree and updating all of the bitmaps. This was
transactionally safe, since the blocks were moved over to a temporary
inode in the main process' transaction, and the unlinked inode was on the
orphan list.

With the extent-mapped inodes the latency of the unlink/truncate-to-zero
was greatly reduced, and we dropped that code. If anyone is interested in
reviving this for some reason, the newest version I could find was for
2.4.24:

http://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/kernel_patches/patches/ext3-delete_thread-2.4.24.patch;hb=21420e6d66eaaf8de0342beab266460c207c054d

IIRC, it only pushed unlink/truncate to the thread if the file had
indirect blocks, since the effort of allocating a separate inode and
transferring over the allocated blocks wasn't worthwhile otherwise.

Cheers, Andreas
* Re: Tux3 Report: Faster than tmpfs, what?
From: Daniel Phillips @ 2013-05-14 6:25 UTC (permalink / raw)
To: Andreas Dilger
Cc: Theodore Ts'o, Dave Chinner, linux-kernel@vger.kernel.org,
    tux3@tux3.org, linux-fsdevel@vger.kernel.org

Interesting, Andreas. We don't do anything as heavyweight as allocating
an inode in this path, just mark the inode dirty (which puts it on a
list) and set a bit in the inode flags.

Regards,

Daniel
* Re: Tux3 Report: Faster than tmpfs, what?
From: Andreas Dilger @ 2013-05-15 17:10 UTC (permalink / raw)
To: Daniel Phillips
Cc: Theodore Ts'o, Dave Chinner, linux-kernel@vger.kernel.org,
    tux3@tux3.org, linux-fsdevel@vger.kernel.org

On 2013-05-14, at 0:25, Daniel Phillips
<daniel.raymond.phillips@gmail.com> wrote:
> Interesting, Andreas. We don't do anything as heavyweight as allocating
> an inode in this path, just mark the inode dirty (which puts it on a
> list) and set a bit in the inode flags.

The new inode allocation is only needed for the truncate-to-zero case. If
the inode is being deleted it is used directly. Sorry for the confusion,
it has been a long time since I looked at that code.

Cheers, Andreas
Thread overview: 17+ messages

2013-05-07 23:24 Tux3 Report: Faster than tmpfs, what? Daniel Phillips
2013-05-10  4:50 ` Dave Chinner
2013-05-10  5:06   ` Christian Stroetmann
2013-05-10  5:47   ` OGAWA Hirofumi
2013-05-14  6:34     ` Dave Chinner
2013-05-14  7:59       ` OGAWA Hirofumi
2013-05-11  6:12   ` Daniel Phillips
2013-05-11 18:35     ` james northrup
2013-05-12  4:39       ` Daniel Phillips
2013-05-11 21:26     ` Theodore Ts'o
2013-05-12  1:10       ` Daniel Phillips
2013-05-12  4:16       ` Daniel Phillips
2013-05-12  4:28       ` Daniel Phillips
2013-05-13 23:22       ` Daniel Phillips
2013-05-14  0:08         ` Andreas Dilger
2013-05-14  6:25           ` Daniel Phillips
2013-05-15 17:10             ` Andreas Dilger