- * Re: Tux3 Report: How fast can we fsync?
  2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
@ 2015-04-29  2:21 ` Mike Galbraith
  2015-04-29  6:01   ` Daniel Phillips
  2015-04-30  1:46 ` Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-04-29  2:21 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o
Where does tux3 live?  What I found looked abandoned.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29  2:21 ` Mike Galbraith
@ 2015-04-29  6:01   ` Daniel Phillips
  2015-04-29  6:20     ` Richard Weinberger
  2015-04-29  6:33     ` Mike Galbraith
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-29  6:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-fsdevel, tux3, Theodore Ts'o, linux-kernel
On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
> Where does tux3 live?  What I found looked abandoned.
Current work is here:
   https://github.com/OGAWAHirofumi/linux-tux3
Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
optimized syncfs is already in there, which isn't a lot slower.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29  6:01   ` Daniel Phillips
@ 2015-04-29  6:20     ` Richard Weinberger
  2015-04-29  6:56       ` Daniel Phillips
  2015-04-29  6:33     ` Mike Galbraith
  1 sibling, 1 reply; 160+ messages in thread
From: Richard Weinberger @ 2015-04-29  6:20 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Mike Galbraith, LKML, linux-fsdevel, tux3, Theodore Ts'o
On Wed, Apr 29, 2015 at 8:01 AM, Daniel Phillips <daniel@phunq.net> wrote:
> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
>>
>> Where does tux3 live?  What I found looked abandoned.
>
>
> Current work is here:
>
>   https://github.com/OGAWAHirofumi/linux-tux3
>
> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
> optimized syncfs is already in there, which isn't a lot slower.
Where can I find the fsync code?
IOW how to reproduce your results? :)
-- 
Thanks,
//richard
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29  6:20     ` Richard Weinberger
@ 2015-04-29  6:56       ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-29  6:56 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: linux-fsdevel, tux3, Mike Galbraith, LKML, Theodore Ts'o
On Tuesday, April 28, 2015 11:20:08 PM PDT, Richard Weinberger wrote:
> On Wed, Apr 29, 2015 at 8:01 AM, Daniel Phillips <daniel@phunq.net> wrote:
>> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote: ...
>
> Where can I find the fsync code?
> IOW how to reproduce your results? :)
Hi Richard,
If you can bear with us, the latest code needs to make it through
Hirofumi's QA before it appears on github. If you are impatient, the
fsync in current head benchmarks pretty well too, I don't think I need
to apologize for it.
In any case, you build userspace tools from the hirofumi-user branch,
by doing make in fs/tux3/user. This builds the tux3 command, and you
make a filesystem with "tux3 mkfs <volume>".
You can build the kernel including Tux3 from the hirofumi branch, or
the hirofumi-user branch, providing you do make clean SUBDIRS=fs/tux3
before building the kernel or make clean in tux3/user before building
the user space, so user and kernel .o files don't collide. A little
awkward indeed, but still it is pretty cool that we can even build
that code for user space.
The wiki might be helpful:
   https://github.com/OGAWAHirofumi/linux-tux3/wiki
   https://github.com/OGAWAHirofumi/linux-tux3/wiki/Compile
This is current, except you want to build from hirofumi and hirofumi-user
rather than master and user because the latter two are a bit old.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29  6:01   ` Daniel Phillips
  2015-04-29  6:20     ` Richard Weinberger
@ 2015-04-29  6:33     ` Mike Galbraith
  2015-04-29  7:23       ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-04-29  6:33 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o
On Tue, 2015-04-28 at 23:01 -0700, Daniel Phillips wrote:
> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
> > Where does tux3 live?  What I found looked abandoned.
> 
> Current work is here:
> 
>    https://github.com/OGAWAHirofumi/linux-tux3
> 
> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
> optimized syncfs is already in there, which isn't a lot slower.
Ah, I did find the right spot, it's just been idle a while.  Where does
one find mkfs.tux3?
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29  6:33     ` Mike Galbraith
@ 2015-04-29  7:23       ` Daniel Phillips
  2015-04-29 16:42         ` Mike Galbraith
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-04-29  7:23 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-fsdevel, tux3, Theodore Ts'o, linux-kernel,
	OGAWA Hirofumi
On Tuesday, April 28, 2015 11:33:33 PM PDT, Mike Galbraith wrote:
> On Tue, 2015-04-28 at 23:01 -0700, Daniel Phillips wrote:
>> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
>>> Where does tux3 live?  What I found looked abandoned.
>> 
>> Current work is here:
>> 
>>    https://github.com/OGAWAHirofumi/linux-tux3
>> 
>> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
>> optimized syncfs is already in there, which isn't a lot slower.
>
> Ah, I did find the right spot, it's just been idle a while.  Where does
> one find mkfs.tux3?
Hi Mike,
See my reply to Richard. You are right, we have been developing on
Hirofumi's branch and master is getting old. Short version:
  checkout hirofumi-user
  cd fs/tux3/user
  make
  ./tux3 mkfs <volume>
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29  7:23       ` Daniel Phillips
@ 2015-04-29 16:42         ` Mike Galbraith
  2015-04-29 19:05           ` xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?) Mike Galbraith
  2015-04-29 20:40           ` Tux3 Report: How fast can we fsync? Daniel Phillips
  0 siblings, 2 replies; 160+ messages in thread
From: Mike Galbraith @ 2015-04-29 16:42 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
On Wed, 2015-04-29 at 00:23 -0700, Daniel Phillips wrote:
> On Tuesday, April 28, 2015 11:33:33 PM PDT, Mike Galbraith wrote:
> > On Tue, 2015-04-28 at 23:01 -0700, Daniel Phillips wrote:
> >> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
> >>> Where does tux3 live?  What I found looked abandoned.
> >> 
> >> Current work is here:
> >> 
> >>    https://github.com/OGAWAHirofumi/linux-tux3
> >> 
> >> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
> >> optimized syncfs is already in there, which isn't a lot slower.
> >
> > Ah, I did find the right spot, it's just been idle a while.  Where does
> > one find mkfs.tux3?
> 
> Hi Mike,
> 
> See my reply to Richard. You are right, we have been developing on
> Hirofumi's branch and master is getting old. Short version:
> 
>   checkout hirofumi-user
>   cd fs/tux3/user
>   make
>   ./tux3 mkfs <volume>
Ok, thanks.
I was curious about the horrible looking plain ole dbench numbers you
posted, as when I used to play with it, default looked like a kinda
silly non-io test most frequently used to pile threads on a box to see
when the axles started bending.  Seems default load has changed.
With dbench v4.00, tux3 seems to be king of the max_latency hill, but
btrfs took throughput on my box.  With v3.04, tux3 took 1st place at
splashing about in pagecache, but last place at dbench -S.
Hohum, curiosity satisfied.
/usr/local/bin/dbench -t 30 (version 4.00)
ext4	Throughput 31.6148 MB/sec  8 clients  8 procs  max_latency=1696.854 ms
xfs	Throughput 26.4005 MB/sec  8 clients  8 procs  max_latency=1508.581 ms
btrfs	Throughput 82.3654 MB/sec  8 clients  8 procs  max_latency=1274.960 ms
tux3	Throughput 93.0047 MB/sec  8 clients  8 procs  max_latency=99.712 ms
ext4	Throughput 49.9795 MB/sec  16 clients  16 procs  max_latency=2180.108 ms
xfs	Throughput 35.038 MB/sec  16 clients  16 procs  max_latency=3107.321 ms
btrfs	Throughput 148.894 MB/sec  16 clients  16 procs  max_latency=618.070 ms
tux3	Throughput 130.532 MB/sec  16 clients  16 procs  max_latency=141.743 ms
ext4	Throughput 69.2642 MB/sec  32 clients  32 procs  max_latency=3166.374 ms
xfs	Throughput 55.3805 MB/sec  32 clients  32 procs  max_latency=4921.660 ms
btrfs	Throughput 230.488 MB/sec  32 clients  32 procs  max_latency=3673.387 ms
tux3	Throughput 179.473 MB/sec  32 clients  32 procs  max_latency=194.046 ms
/usr/local/bin/dbench -B fileio -t 30 (version 4.00)
ext4	Throughput 84.7361 MB/sec  32 clients  32 procs  max_latency=1401.683 ms
xfs	Throughput 57.9369 MB/sec  32 clients  32 procs  max_latency=1397.910 ms
btrfs	Throughput 268.738 MB/sec  32 clients  32 procs  max_latency=639.411 ms
tux3	Throughput 186.172 MB/sec  32 clients  32 procs  max_latency=167.389 ms
/usr/bin/dbench -t 30 32 (version 3.04)
ext4	Throughput 7920.95 MB/sec 32 procs
xfs	Throughput 674.993 MB/sec 32 procs
btrfs	Throughput 1910.63 MB/sec 32 procs
tux3	Throughput 8262.68 MB/sec 32 procs
/usr/bin/dbench -S -t 30 32 (version 3.04)
ext4	Throughput 87.2774 MB/sec (sync dirs) 32 procs
xfs	Throughput 89.3977 MB/sec (sync dirs) 32 procs
btrfs	Throughput 101.888 MB/sec (sync dirs) 32 procs
tux3	Throughput 78.7463 MB/sec (sync dirs) 32 procs
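For anyone wanting to compare such runs mechanically, here is a minimal
sketch (not part of the original mail) that parses dbench summary lines of
the shape shown above and picks out the best throughput and the lowest
max_latency. The function name parse_dbench and the embedded sample (the
four 32-client lines from the v4.00 run) are illustrative choices only.

```python
import re

# The four 32-client result lines from the dbench 4.00 run above.
SAMPLE = """\
ext4  Throughput 69.2642 MB/sec  32 clients  32 procs  max_latency=3166.374 ms
xfs   Throughput 55.3805 MB/sec  32 clients  32 procs  max_latency=4921.660 ms
btrfs Throughput 230.488 MB/sec  32 clients  32 procs  max_latency=3673.387 ms
tux3  Throughput 179.473 MB/sec  32 clients  32 procs  max_latency=194.046 ms
"""


def parse_dbench(text):
    """Return (filesystem, MB/sec, max_latency_ms or None) per result line."""
    rows = []
    for line in text.splitlines():
        m = re.search(r"^(\S+)\s+Throughput\s+([\d.]+)\s+MB/sec", line)
        if not m:
            continue  # not a result line
        # dbench 3.04 output has no max_latency field, so this may be absent.
        lat = re.search(r"max_latency=([\d.]+)\s*ms", line)
        rows.append((m.group(1), float(m.group(2)),
                     float(lat.group(1)) if lat else None))
    return rows


rows = parse_dbench(SAMPLE)
best_throughput = max(rows, key=lambda r: r[1])
lowest_latency = min((r for r in rows if r[2] is not None), key=lambda r: r[2])
print("best throughput:", best_throughput[0])
print("lowest max_latency:", lowest_latency[0])
```

On this sample the two winners differ, which is the point made in the
message: throughput and worst-case latency rank the filesystems differently.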
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-29 16:42         ` Mike Galbraith
@ 2015-04-29 19:05           ` Mike Galbraith
  2015-04-29 19:20             ` Austin S Hemmelgarn
                               ` (2 more replies)
  2015-04-29 20:40           ` Tux3 Report: How fast can we fsync? Daniel Phillips
  1 sibling, 3 replies; 160+ messages in thread
From: Mike Galbraith @ 2015-04-29 19:05 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
Here's something that _might_ interest xfs folks.
cd git (source repository of git itself)
make clean
echo 3 > /proc/sys/vm/drop_caches
time make -j8 test
ext4    2m20.721s
xfs     6m41.887s <-- ick
btrfs   1m32.038s
tux3    1m30.262s
Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
really hate whatever git selftests are doing this much?
	-Mike
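To put the wall-clock numbers above on a common scale, the following small
sketch (an illustration, not from the original mail) converts the shell's
"XmY.YYYs" format to seconds and expresses each run relative to the fastest
one. The helper name to_seconds is made up for this example.

```python
def to_seconds(stamp):
    """Convert a `time` value such as '6m41.887s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# "real" times for `make -j8 test` as reported above.
timings = {
    "ext4": "2m20.721s",
    "xfs": "6m41.887s",
    "btrfs": "1m32.038s",
    "tux3": "1m30.262s",
}

seconds = {fs: to_seconds(t) for fs, t in timings.items()}
fastest = min(seconds, key=seconds.get)

for fs, secs in sorted(seconds.items(), key=lambda kv: kv[1]):
    print(f"{fs:6s} {secs:8.3f}s  {secs / seconds[fastest]:.2f}x of {fastest}")
```

On these figures xfs needs more than four times as long as the fastest run.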
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-29 19:05           ` xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?) Mike Galbraith
@ 2015-04-29 19:20             ` Austin S Hemmelgarn
  2015-04-29 21:12             ` Daniel Phillips
  2015-04-30  0:20             ` Dave Chinner
  2 siblings, 0 replies; 160+ messages in thread
From: Austin S Hemmelgarn @ 2015-04-29 19:20 UTC (permalink / raw)
  To: Mike Galbraith, Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
[-- Attachment #1: Type: text/plain, Size: 1095 bytes --]
On 2015-04-29 15:05, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
>
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
>
> ext4    2m20.721s
> xfs     6m41.887s <-- ick
> btrfs   1m32.038s
> tux3    1m30.262s
>
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
>
> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?
>
> 	-Mike
>
I've been using the defaults for it and have been perfectly happy, 
although I do use a few non-default mount options (like noatime and 
noquota).  It may just be a factor of what exactly the tests are doing.
Based on my experience, xfs _is_ better performance wise with a few
large files instead of a lot of small ones when used with the default 
mkfs options.  Of course, my uses for it are more focused on stability 
and reliability than performance (my primary use for XFS is /boot, and I 
use BTRFS for pretty much everything else).
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-29 19:05           ` xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?) Mike Galbraith
  2015-04-29 19:20             ` Austin S Hemmelgarn
@ 2015-04-29 21:12             ` Daniel Phillips
  2015-04-30  4:40               ` Mike Galbraith
  2015-04-30  0:20             ` Dave Chinner
  2 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-04-29 21:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
On Wednesday, April 29, 2015 12:05:26 PM PDT, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
>
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
>
> ext4    2m20.721s
> xfs     6m41.887s <-- ick
> btrfs   1m32.038s
> tux3    1m30.262s
>
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
>
> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?
I'm more interested in the fact that we eked out a win :)
Btrfs appears to optimize tiny files by storing them in its big btree,
the equivalent of our itree, and Tux3 doesn't do that yet, so we are a
bit hobbled for a make load. Eventually, that gap should widen.
The pattern I noticed where the write-anywhere designs are beating the
journal designs seems to continue here. I am sure there are exceptions,
but maybe it is a real thing.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-29 21:12             ` Daniel Phillips
@ 2015-04-30  4:40               ` Mike Galbraith
  0 siblings, 0 replies; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30  4:40 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
On Wed, 2015-04-29 at 14:12 -0700, Daniel Phillips wrote:
> Btrfs appears to optimize tiny files by storing them in its big btree,
> the equivalent of our itree, and Tux3 doesn't do that yet, so we are a
> bit hobbled for a make load.
That's not a build load, it's a git load.  btrfs beat all others at the
various git/quilt things I tried (since that's what I do lots of in real
life), but tux3 looked quite good too.
As Dave noted though, an orchard produces oodles of apples over its
lifetime, these shiny new apples may lose luster over time ;-)
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-29 19:05           ` xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?) Mike Galbraith
  2015-04-29 19:20             ` Austin S Hemmelgarn
  2015-04-29 21:12             ` Daniel Phillips
@ 2015-04-30  0:20             ` Dave Chinner
  2015-04-30  3:35               ` Mike Galbraith
                                 ` (2 more replies)
  2 siblings, 3 replies; 160+ messages in thread
From: Dave Chinner @ 2015-04-30  0:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On Wed, Apr 29, 2015 at 09:05:26PM +0200, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
> 
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
> 
> ext4    2m20.721s
> xfs     6m41.887s <-- ick
> btrfs   1m32.038s
> tux3    1m30.262s
> 
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
TL;DR: Results are *very different* on a 256GB Samsung 840 EVO SSD
with slightly slower CPUs (E5-4620 @ 2.20GHz), all filesystems
using defaults:
	real		user		sys
xfs	3m16.138s	7m8.341s	14m32.462s
ext4	3m18.045s	7m7.840s	14m32.994s
btrfs	3m45.149s	7m10.184s	16m30.498s
What you are seeing is physical seek distances impacting read
performance.  XFS does not optimise for minimal physical seek
distance, and hence is slower than filesystems that do optimise for
minimal seek distance. This shows up especially well on slow single
spindles.
XFS is *adequate* for the use on slow single drives, but it is
really designed for best performance on storage hardware that is not
seek distance sensitive.
IOWs, XFS just hates your disk. Spend $50 and buy a cheap SSD and
the problem goes away. :)
----
And now in more detail.
It's easy to be fast on empty filesystems. XFS does not aim to be
fast in such situations - it aims to have consistent performance
across the life of the filesystem.
In this case, ext4, btrfs and tux3 have optimal allocation filling
from the outside of the disk, while XFS is spreading the files
across (at least) 4 separate regions of the whole disk. Hence XFS is
seeing seek times on read that are much larger than the other filesystems
when the filesystem is empty, as it is doing full disk seeks rather
than being confined to the outer edges of the spindle.
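The effect described in the paragraph above can be illustrated with a toy
model (added here as a sketch, not something from the original mail). It
treats the disk as the interval [0, 1), places the same amount of data either
packed at the start or spread over 4 equally spaced regions, and measures the
mean head movement between consecutive random accesses. The 5% fill level,
the region count and all function names are assumptions chosen for
illustration; real seek time is not linear in distance and real allocators
are far more subtle.

```python
import random


def mean_seek(offsets):
    """Mean absolute head movement between consecutive accesses."""
    hops = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(hops) / len(hops)


def packed_layout(rng, n, used=0.05):
    """All data sits in the first `used` fraction of the disk."""
    return [rng.uniform(0.0, used) for _ in range(n)]


def spread_layout(rng, n, groups=4, used=0.05):
    """Same amount of data, split evenly across `groups` distant regions."""
    width = used / groups
    return [rng.randrange(groups) / groups + rng.uniform(0.0, width)
            for _ in range(n)]


rng = random.Random(0)  # fixed seed so runs are repeatable
packed = mean_seek(packed_layout(rng, 10_000))
spread = mean_seek(spread_layout(rng, 10_000))
print(f"packed: {packed:.4f}  spread: {spread:.4f}  "
      f"ratio: {spread / packed:.1f}x")
```

In this model the spread layout moves the head much further per access on an
empty disk, which is the disadvantage on a single spindle; the model says
nothing about aged filesystems, where the argument below applies.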
Thing is, once you've abused those filesystems for a couple of
months, the files in ext4, btrfs and tux3 are not going to be laid
out perfectly on the outer edge of the disk. They'll be spread all
over the place and so all the filesystems will be seeing large seeks
on read. The thing is, XFS will have roughly the same performance as
when the filesystem is empty because the spreading of the allocation
allows it to maintain better locality and separation and hence
doesn't fragment free space nearly as badly as the other filesystems.
Free space fragmentation is what leads to performance degradation in
filesystems, and all the other filesystems will have degraded to be
*much worse* than XFS.
Put simply: empty filesystem benchmarking does not show the real
performance of the filesystem under sustained production workloads.
Hence benchmarks like this - while interesting from a theoretical
point of view and widely used for bragging about who's got the
fastest - are mostly irrelevant to determining how the filesystem
will perform in production environments.
We can also look at this algorithm in a different way: take a large
filesystem (say a few hundred TB) across a few tens of disks in a
linear concat.  ext4, btrfs and tux3 will only hit the first disk in
the concat, and so go no faster because they are still bound by
physical seek times.  XFS, however, will spread the load across many
(if not all) of the disks, and so effectively reduce the average
seek time by the number of disks doing concurrent IO. Then you'll
see that application level IO concurrency becomes the performance
limitation, not the physical seek time of the hardware.
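As a back-of-the-envelope illustration of the concat argument above (a
sketch added for clarity, not from the original mail): if every random read
costs one seek and each busy disk seeks independently, aggregate random-read
throughput scales with the number of disks actually receiving IO. The 10 ms
seek time and the 20-disk concat are assumed numbers, and the function name
is made up.

```python
def random_read_iops(seek_ms, busy_disks):
    """Aggregate seek-bound IOPS when `busy_disks` disks seek in parallel."""
    return busy_disks * 1000.0 / seek_ms


SEEK_MS = 10.0   # assumed average seek time of one spindle
DISKS = 20       # assumed number of disks in the linear concat

# Allocator that fills from the start: only the first disk sees IO.
print("first-disk-only:", random_read_iops(SEEK_MS, 1), "IOPS")
# Allocator that spreads across the whole address space: all disks busy.
print("spread:         ", random_read_iops(SEEK_MS, DISKS), "IOPS")
```

The second figure is only reachable if the application issues enough
concurrent IO, which is why application-level concurrency becomes the limit.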
IOWs, what you don't see here is that the XFS algorithms that make
your test slow will keep *lots* of disks busy. i.e. testing empty
filesystem performance on a single, slow disk demonstrates that an
algorithm designed for scalability isn't designed to achieve
physical seek distance minimisation.  Hence your storage makes XFS
look particularly poor in comparison to filesystems that are being
designed and optimised for the limitations of single slow spindles...
To further demonstrate that it is physical seek distance that is the
issue here, let's take the seek time out of the equation (e.g. use an
SSD).  Doing that will result in basically no difference in
performance between all 4 filesystems as performance will now be
determined by application level concurrency and that is the same for
all tests.
e.g. on a 16p, 16GB RAM VM with storage on SSDs, a "make -j 8"
compile test on a kernel source tree (using my normal test machine
.config) gives:
	real		user		sys
xfs:	4m6.723s	26m21.087s	2m49.426s
ext4:	4m11.415s	26m21.122s	2m49.786s
btrfs:	4m8.118s	26m26.440s	2m50.357s
i.e. take seek times out of the picture, and XFS is just as fast as
any of the other filesystems.
Just about everyone I know uses SSDs in their laptops and machines
that build kernels these days, and spinning disks are rapidly
disappearing from enterprise and HPC environments which also happen
to be the target markets for XFS.  Hence filesystem performance on
slow single spindles is the furthest thing away from what we really
need to optimise XFS for.
Indeed, I'll point you to where we are going with fsync optimisation
- it's completely the other end of the scale:
http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
i.e. being able to scale effectively to tens of thousands of fsync
calls every second because that's what applications like ceph and
gluster really need from XFS....
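For readers who want a rough feel for fsync rates on their own storage, here
is a minimal single-threaded sketch (added for illustration; it is not the
benchmark used anywhere in this thread). It appends a 4 KiB block to a
scratch file and calls fsync after every write. The function name, block
size and iteration count are arbitrary choices, and a single thread will not
show the concurrent scaling discussed above.

```python
import os
import tempfile
import time


def fsync_rate(directory, count=200, payload=b"x" * 4096):
    """Return fsync calls per second for small appends in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # force data (and needed metadata) to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed


if __name__ == "__main__":
    # Point this at a directory on the filesystem under test.
    target = tempfile.gettempdir()
    print(f"{fsync_rate(target):.0f} fsync/s in {target}")
```

Results depend heavily on the device's write cache and on whether the target
directory is backed by real storage or by tmpfs.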
> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?
It just hates your disk. Spend $50 and buy a cheap SSD and the
problem goes away. :)
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30  0:20             ` Dave Chinner
@ 2015-04-30  3:35               ` Mike Galbraith
  2015-04-30  9:00               ` Martin Steigerwald
  2015-04-30 11:14               ` Daniel Phillips
  2 siblings, 0 replies; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30  3:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On Thu, 2015-04-30 at 10:20 +1000, Dave Chinner wrote:
> IOWs, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> the problem goes away. :)
I'd love to.  Too bad sorry sack of sh*t MB manufacturer only applied
_connectors_ to 4 of 6 available ports, and they're all in use :)
> ----
> 
> And now in more detail.
Thanks for those details, made perfect sense.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30  0:20             ` Dave Chinner
  2015-04-30  3:35               ` Mike Galbraith
@ 2015-04-30  9:00               ` Martin Steigerwald
  2015-04-30 14:57                 ` Theodore Ts'o
  2015-04-30 11:14               ` Daniel Phillips
  2 siblings, 1 reply; 160+ messages in thread
From: Martin Steigerwald @ 2015-04-30  9:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, tux3, linux-kernel, linux-fsdevel,
	Mike Galbraith, Daniel Phillips, OGAWA Hirofumi
Am Donnerstag, 30. April 2015, 10:20:08 schrieb Dave Chinner:
> On Wed, Apr 29, 2015 at 09:05:26PM +0200, Mike Galbraith wrote:
> > Here's something that _might_ interest xfs folks.
> > 
> > cd git (source repository of git itself)
> > make clean
> > echo 3 > /proc/sys/vm/drop_caches
> > time make -j8 test
> > 
> > ext4    2m20.721s
> > xfs     6m41.887s <-- ick
> > btrfs   1m32.038s
> > tux3    1m30.262s
> > 
> > Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
> 
> TL;DR: Results are *very different* on a 256GB Samsung 840 EVO SSD
> with slightly slower CPUs (E5-4620 @ 2.20GHz), all filesystems
> using defaults:
> 
> 	real		user		sys
> xfs	3m16.138s	7m8.341s	14m32.462s
> ext4	3m18.045s	7m7.840s	14m32.994s
> btrfs	3m45.149s	7m10.184s	16m30.498s
> 
> What you are seeing is physical seek distances impacting read
> performance.  XFS does not optimise for minimal physical seek
> distance, and hence is slower than filesystems that do optimise for
> minimal seek distance. This shows up especially well on slow single
> spindles.
> 
> XFS is *adequate* for the use on slow single drives, but it is
> really designed for best performance on storage hardware that is not
> seek distance sensitive.
> 
> IOWs, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> the problem goes away. :)
I am quite surprised that a traditional filesystem that was created in the 
age of rotating media does not like this kind of media and even seems to 
excel over BTRFS on the new non-rotating media available.
But…
> ----
> 
> And now in more detail.
> 
> It's easy to be fast on empty filesystems. XFS does not aim to be
> fast in such situations - it aims to have consistent performance
> across the life of the filesystem.
… this is a quite important addition.
> Thing is, once you've abused those filesystems for a couple of
> months, the files in ext4, btrfs and tux3 are not going to be laid
> out perfectly on the outer edge of the disk. They'll be spread all
> over the place and so all the filesystems will be seeing large seeks
> on read. The thing is, XFS will have roughly the same performance as
> when the filesystem is empty because the spreading of the allocation
> allows it to maintain better locality and separation and hence
> doesn't fragment free space nearly as badly as the other filesystems.
> Free space fragmentation is what leads to performance degradation in
> filesystems, and all the other filesystems will have degraded to be
> *much worse* than XFS.
I even still see hangs caused by what I take to be free space fragmentation
in BTRFS. My /home on a dual (!) BTRFS SSD setup can basically grind to a
halt when it has reserved all space of the device for chunks. So this
merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 129.48GiB
        devid    1 size 170.00GiB used 146.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 146.03GiB path /dev/mapper/sata-home
Btrfs v3.18
merkaba:~> btrfs fi df /home
Data, RAID1: total=142.00GiB, used=126.72GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.00GiB, used=2.76GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
is safe, but once I have size 170 GiB used 170 GiB, even if inside the
chunks there is enough free space to allocate from, enough as in 30-40
GiB, it can happen that writes are stalled up to the point that
applications on the desktop freeze and I see hung task messages in the
kernel log.
This is the case up to kernel 4.0. I have seen Chris Mason fixing some write
stalls for big Facebook setups, maybe it will help here, but until this
issue is fixed, I think BTRFS is not yet fully production ready, unless you
leave a *huge* amount of free space, as in: for 200 GiB of data you want to
write, make a 400 GiB volume.
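The distinction drawn above, between space the device has handed out to
chunks and space actually used inside those chunks, can be read off the
"btrfs fi df" output. The following sketch (an illustration added here, with
made-up helper names) parses the output quoted above and reports the slack
inside the already allocated chunks per block group type.

```python
import re

# `btrfs fi df /home` output as quoted above.
FI_DF = """\
Data, RAID1: total=142.00GiB, used=126.72GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.00GiB, used=2.76GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_bytes(text):
    """Convert a size such as '142.00GiB' to a byte count."""
    m = re.fullmatch(r"([\d.]+)(B|KiB|MiB|GiB|TiB)", text)
    return float(m.group(1)) * UNITS[m.group(2)]


def slack_per_type(output):
    """Map block group type to bytes allocated to chunks but not used."""
    slack = {}
    for line in output.splitlines():
        m = re.match(r"(\w+), \w+: total=(\S+), used=(\S+)", line)
        if m:
            slack[m.group(1)] = to_bytes(m.group(2)) - to_bytes(m.group(3))
    return slack


for kind, free in slack_per_type(FI_DF).items():
    print(f"{kind:14s} {free / 2**30:8.2f} GiB free inside chunks")
```

Comparing this slack with the per-device "size" and "used" figures from
"btrfs fi sh" shows how much unallocated device space is left for new
chunks, which is the quantity that reaches zero in the stalling case
described above.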
> Put simply: empty filesystem benchmarking does not show the real
> performance of the filesystem under sustained production workloads.
> Hence benchmarks like this - while interesting from a theoretical
> point of view and widely used for bragging about who's got the
> fastest - are mostly irrelevant to determining how the filesystem
> will perform in production environments.
> 
> We can also look at this algorithm in a different way: take a large
> filesystem (say a few hundred TB) across a few tens of disks in a
> linear concat.  ext4, btrfs and tux3 will only hit the first disk in
> the concat, and so go no faster because they are still bound by
> physical seek times.  XFS, however, will spread the load across many
> (if not all) of the disks, and so effectively reduce the average
> seek time by the number of disks doing concurrent IO. Then you'll
> see that application level IO concurrency becomes the performance
> limitation, not the physical seek time of the hardware.
Those are the allocation groups. I always wondered how it can be beneficial
to spread the allocations onto 4 areas of one partition on seek-expensive
media. Now that makes better sense to me. I always had the gut impression
that XFS may not be the fastest in all cases, but it is one of the
filesystems with the most consistent performance over time, but never was
able to fully explain why that is.
Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
_______________________________________________
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30  9:00               ` Martin Steigerwald
@ 2015-04-30 14:57                 ` Theodore Ts'o
  2015-04-30 15:59                   ` Daniel Phillips
  2015-04-30 17:59                   ` Martin Steigerwald
  0 siblings, 2 replies; 160+ messages in thread
From: Theodore Ts'o @ 2015-04-30 14:57 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Dave Chinner, Mike Galbraith, Daniel Phillips, linux-kernel,
	linux-fsdevel, tux3, OGAWA Hirofumi
On Thu, Apr 30, 2015 at 11:00:05AM +0200, Martin Steigerwald wrote:
> > IOWs, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> > the problem goes away. :)
> 
> I am quite surprised that a traditional filesystem that was created in the 
> age of rotating media does not like this kind of media and even seems to 
> excel on BTRFS on the new non rotating media available.
You shouldn't be surprised; XFS was designed in an era where RAID was
extremely important.  To this day, on very large RAID arrays, I'm
pretty sure none of the other file systems will come close to touching
XFS, because it was optimized by some really, really good file system
engineers for that hardware.  And while RAID systems are certainly not
identical to SSD, the fact that you have multiple disk heads means
that a good file system will optimize for that parallelism, and that's
how SSDs get their speed (individual SSD channels aren't really all
that fast; it's from the fact that you can be reading or writing large
numbers of them in parallel that high end flash gets its really great
performance numbers).
> > Thing is, once you've abused those filesystems for a couple of
> > months, the files in ext4, btrfs and tux3 are not going to be laid
> > out perfectly on the outer edge of the disk. They'll be spread all
> > over the place and so all the filesystems will be seeing large seeks
> > on read. The thing is, XFS will have roughly the same performance as
> > when the filesystem is empty because the spreading of the allocation
> > allows it to maintain better locality and separation and hence
> > doesn't fragment free space nearly as badly as the other filesystems.
> > Free space fragmentation is what leads to performance degradation in
> > filesystems, and all the other filesystems will have degraded to be
> > *much worse* than XFS.
In fact, ext4 doesn't actually lay out things perfectly on the outer
edge of the disk either, because we try to do spreading as well.
Worse, we use a random algorithm to try to do the spreading, so that
means that results from run to run on an empty file system will show a
lot more variation.  I won't claim that we're best in class with
either our spreading techniques or our ability to manage free space
fragmentation, although we do a lot of work to manage free space
fragmentation as well.
One of the problems is that it's *hard* to get good benchmarking
numbers that take into account file system aging and measure how well
the free space has been fragmented over time.  Most of the benchmark
results that I've seen do a really lousy job at this, and the vast
majority don't even try.
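As a rough illustration of what "how well the free space has been fragmented" could mean in numbers, here is a minimal sketch that summarizes free-space fragmentation from a block bitmap. The bitmap representation (a list of booleans, True meaning "in use") and the function names are assumptions made for this example; no real file system exposes its allocation state this way.

```python
def free_extents(bitmap):
    """Return the length of each maximal run of free blocks.

    `bitmap` is a hypothetical list of booleans: True = block in use.
    """
    runs, current = [], 0
    for used in bitmap:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def fragmentation_summary(bitmap):
    """Summarize free space: total, extent count, largest and mean extent."""
    runs = free_extents(bitmap)
    free = sum(runs)
    return {
        "free_blocks": free,
        "extents": len(runs),
        "largest": max(runs, default=0),
        "mean": free / len(runs) if runs else 0.0,
    }


# A freshly made volume: one big free extent.
fresh = [True] * 4 + [False] * 12
# A toy "aged" volume: every other block in use.
aged = [True, False] * 8
print(fragmentation_summary(fresh))
print(fragmentation_summary(aged))
```

A benchmark that accounts for aging would need to record something like this summary alongside its timings, so that two runs can be compared at a similar fragmentation level rather than only on an empty volume.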
This is one of the reasons why I find head-to-head "competitions"
between file systems to be not very helpful for anything other than
benchmarketing.  It's almost certain that the benchmark won't be
"fair" in some way, and it doesn't really matter whether the person
doing the benchmark was doing it with malice aforethought, or was just
incompetent and didn't understand the issues --- or did understand the
issues and didn't really care, because what they _really_ wanted to do
was to market their file system.
And even if the benchmark is fair, it might not match up with the end
user's hardware, or their use case.  There will always be some use
case where file system A is better than file system B, for pretty much
any file system.  Don't get me wrong --- I will do comparisons between
file systems, but only so I can figure out ways of making _my_ file
system better.  And more often than not, it's comparisons of the same
file system before and after adding some new feature which is the most
interesting.
> That are the allocation groups. I always wondered how it can be beneficial 
> to spread the allocations onto 4 areas of one partition on expensive seek 
> media. Now that makes better sense for me. I always had the gut impression 
> that XFS may not be the fastest in all cases, but it is one of the 
> filesystem with the most consistent performance over time, but never was 
> able to fully explain why that is.
Yep, pretty much all of the traditional update-in-place file systems
since the BSD FFS have done this, and for the same reason.  For COW
file systems which are constantly moving data and metadata blocks
around, they will need different strategies for trying to avoid the
free space fragmentation problem as the file system ages.
Cheers,
      	      	       	    	      	       - Ted
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 14:57                 ` Theodore Ts'o
@ 2015-04-30 15:59                   ` Daniel Phillips
  2015-04-30 17:59                   ` Martin Steigerwald
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 15:59 UTC (permalink / raw)
  To: Theodore Ts'o, Martin Steigerwald, Dave Chinner,
	Mike Galbraith, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
Hi Ted,
On 04/30/2015 07:57 AM, Theodore Ts'o wrote:
> This is one of the reasons why I find head-to-head "competitions"
> between file systems to be not very helpful for anything other than
> benchmarketing.  It's almost certain that the benchmark won't be
> "fair" in some way, and it doesn't really matter whether the person
> doing the benchmark was doing it with malice aforethought, or was just
> incompetent and didn't understand the issues --- or did understand the
> issues and didn't really care, because what they _really_ wanted to do
> was to market their file system.
Your proposition, as I understand it, is that nobody should ever do
benchmarks because any benchmark must be one of: 1) malicious; 2)
incompetent; or 3) careless. When in fact, a benchmark may be perfectly
honest, competently done, and informative.
> And even if the benchmark is fair, it might not match up with the end
> user's hardware, or their use case.  There will always be some use
> case where file system A is better than file system B, for pretty much
> any file system.  Don't get me wrong --- I will do comparisons between
> file systems, but only so I can figure out ways of making _my_ file
> system better.  And more often than not, it's comparisons of the same
> file system before and after adding some new feature which is the most
> interesting.
I cordially invite you to replicate our fsync benchmarks, or invent
your own. I am confident that you will find that the numbers are
accurate, that the test cases were well chosen, that the results are
informative, and that there is no sleight of hand.
As for whether or not people should "market" their filesystems as you
put it, that is easy for you to disparage when you are the incumbent.
If we don't tell people what is great about Tux3 then how will they
ever find out? Sure, it might be "advertising", but the important
question is, is it _truthful_ advertising? Surely you remember how
Linus got started... that was really blatant, and I am glad he did it.
>> That are the allocation groups. I always wondered how it can be beneficial 
>> to spread the allocations onto 4 areas of one partition on expensive seek 
>> media. Now that makes better sense for me. I always had the gut impression 
>> that XFS may not be the fastest in all cases, but it is one of the 
>> filesystem with the most consistent performance over time, but never was 
>> able to fully explain why that is.
> 
> Yep, pretty much all of the traditional update-in-place file systems
> since the BSD FFS have done this, and for the same reason.  For COW
> file systems which are constantly moving data and metadata blocks
> around, they will need different strategies for trying to avoid the
> free space fragmentation problem as the file system ages.
Right, different problems, but I have a pretty good idea how to go
about it now. I made a failed attempt a while back and learned a lot;
my mistake was to try to give every object a fixed home position based
on where it was first written, and the result was worse for both read
and write. Now the interesting thing is, naive linear allocation is
great for both read and write, so my effort now is directed towards
ways of doing naive linear allocation but choosing carefully which
order we do the allocation in. I will keep you posted on how that
progresses of course.
Anyway, how did we get onto allocation? I thought my post was about
fsync, and after all, you are the guest of honor.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 14:57                 ` Theodore Ts'o
  2015-04-30 15:59                   ` Daniel Phillips
@ 2015-04-30 17:59                   ` Martin Steigerwald
  1 sibling, 0 replies; 160+ messages in thread
From: Martin Steigerwald @ 2015-04-30 17:59 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Chinner, Mike Galbraith, Daniel Phillips,
	linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On Thursday, 30 April 2015, 10:57:10, Theodore Ts'o wrote:
> One of the problems is that it's *hard* to get good benchmarking
> numbers that take into account file system aging and measure how well
> the free space has been fragmented over time.  Most of the benchmark
> results that I've seen do a really lousy job at this, and the vast
> majority don't even try.
> 
> This is one of the reasons why I find head-to-head "competitions"
> between file systems to be not very helpful for anything other than
> benchmarketing.  It's almost certain that the benchmark won't be
> "fair" in some way, and it doesn't really matter whether the person
> doing the benchmark was doing it with malice aforethought, or was just
> incompetent and didn't understand the issues --- or did understand the
> issues and didn't really care, because what they _really_ wanted to do
> was to market their file system.
I agree with that.
One benchmark measures one thing, and if it is run on a fresh filesystem, it 
measures a fresh filesystem.
Benchmarks that aim to test an aged filesystem are much more 
expensive in time and resources, unless one reuses an aged 
filesystem image again and again.
Thanks for your explanations, Ted,
Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30  0:20             ` Dave Chinner
  2015-04-30  3:35               ` Mike Galbraith
  2015-04-30  9:00               ` Martin Steigerwald
@ 2015-04-30 11:14               ` Daniel Phillips
  2015-04-30 12:07                 ` Mike Galbraith
  2 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 11:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mike Galbraith, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On Wednesday, April 29, 2015 5:20:08 PM PDT, Dave Chinner wrote:
> It's easy to be fast on empty filesystems. XFS does not aim to be
> fast in such situations - it aims to have consistent performance
> across the life of the filesystem.
>
> In this case, ext4, btrfs and tux3 have optimal allocation filling
> from the outside of the disk, while XFS is spreading the files
> across (at least) 4 separate regions of the whole disk. Hence XFS is
> seeing seek times on read are much larger than the other filesystems
> when the filesystem is empty as it is doing full disk seeks rather
> than being confined to the outer edges of spindle.
>
> Thing is, once you've abused those filesystems for a couple of
> months, the files in ext4, btrfs and tux3 are not going to be laid
> out perfectly on the outer edge of the disk. They'll be spread all
> over the place and so all the filesystems will be seeing large seeks
> on read. The thing is, XFS will have roughly the same performance as
> when the filesystem is empty because the spreading of the allocation
> allows it to maintain better locality and separation and hence
> doesn't fragment free space nearly as badly as the other filesystems.
> Free space fragmentation is what leads to performance degradation in
> filesystems, and all the other filesystem will have degraded to be
> *much worse* than XFS.
>
> Put simply: empty filesystem benchmarking does not show the real
> performance of the filesystem under sustained production workloads.
> Hence benchmarks like this - while interesting from a theoretical
> point of view and are widely used for bragging about who's got the
> fastest - are mostly irrelevant to determining how the filesystem
> will perform in production environments.
>
> We can also look at this algorithm in a different way: take a large
> filesystem (say a few hundred TB) across a few tens of disks in a
> linear concat.  ext4, btrfs and tux3 will only hit the first disk in
> the concat, and so go no faster because they are still bound by
> physical seek times.  XFS, however, will spread the load across many
> (if not all) of the disks, and so effectively reduce the average
> seek time by the number of disks doing concurrent IO. Then you'll
> see that application level IO concurrency becomes the performance
> limitation, not the physical seek time of the hardware.
>
> IOWs, what you don't see here is that the XFS algorithms that make
> your test slow will keep *lots* of disks busy. i.e. testing empty
> filesystem performance on a single, slow disk demonstrates that an
> algorithm designed for scalability isn't designed to achieve
> physical seek distance minimisation.  Hence your storage makes XFS
> look particularly poor in comparison to filesystems that are being
> designed and optimised for the limitations of single slow spindles...
>
> To further demonstrate that it is physical seek distance that is the
> issue here, let's take the seek time out of the equation (e.g. use a
> SSD).  Doing that will result in basically no difference in
> performance between all 4 filesystems as performance will now be
> determined by application level concurrency and that is the same for
> all tests.
Lovely sounding argument, but it is wrong because Tux3 still beats XFS
even with seek time factored out of the equation.
Even with SSD, if you just go splattering files all over the disk you
will pay for it in latency and lifetime when the disk goes into
continuous erase and your messy layout causes write multiplication.
But of course you can design your filesystem any way you want. Tux3
is designed to be fast on the hardware that people actually have.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 11:14               ` Daniel Phillips
@ 2015-04-30 12:07                 ` Mike Galbraith
  2015-04-30 12:58                   ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30 12:07 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
> even with seek time factored out of the equation.
Hm.  Do you have big-storage comparison numbers to back that?  I'm no
storage guy (waiting for holographic crystal arrays to obsolete all this
crap;), but Dave's big-storage guy words made sense to me.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 12:07                 ` Mike Galbraith
@ 2015-04-30 12:58                   ` Daniel Phillips
  2015-04-30 13:48                     ` Mike Galbraith
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 12:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Theodore Ts'o, tux3, Dave Chinner, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>
>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>> even with seek time factored out of the equation.
>
> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
> storage guy (waiting for holographic crystal arrays to obsolete all this
> crap;), but Dave's big-storage guy words made sense to me.
This has nothing to do with big storage. The proposition was that seek
time is the reason for Tux3's fsync performance. That claim was easily
falsified by removing the seek time.
Dave's big storage words are there to draw attention away from the fact
that XFS ran the Git tests four times slower than Tux3 and three times
slower than Ext4. Whatever the big storage excuse is for that, the fact
is, XFS obviously sucks at little storage.
He also posted nonsense: "XFS, however, will spread the load across
many (if not all) of the disks, and so effectively reduce the average
seek time by the number of disks doing concurrent IO." False. No matter
how big an array of spinning disks you have, seek latency and
synchronous write latency stay the same. It is just an attempt to
bamboozle you. If instead he had talked about throughput, he would have
a point. But he didn't, because he knows that does not help his
argument. If fsync sucks on one disk, it will suck just as much on
a thousand disks.
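The latency-versus-throughput distinction being argued here can be put in a minimal sketch. This is a toy model, not a measurement: it assumes every synchronous write costs exactly one seek on exactly one disk, and the function name and parameters are invented for the example.

```python
def fsync_model(seek_ms, n_disks, n_concurrent):
    """Toy model of synchronous writes on an array of spinning disks.

    Assumption: each synchronous write waits for one seek on one disk,
    so an individual operation never finishes faster than `seek_ms`.
    Extra disks only help when there are concurrent operations to
    spread across them.
    """
    per_op_latency_ms = seek_ms
    busy_disks = min(n_disks, n_concurrent)
    throughput_ops_per_s = busy_disks * 1000.0 / seek_ms
    return per_op_latency_ms, throughput_ops_per_s


# One writer doing back-to-back fsyncs: a thousand disks change nothing.
print(fsync_model(10, 1, 1))
print(fsync_model(10, 1000, 1))
# 64 independent writers: latency per fsync is unchanged, throughput scales.
print(fsync_model(10, 1000, 64))
```

Under these assumptions the per-operation latency is the same in all three cases; only aggregate throughput grows, and only when the workload itself is concurrent.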
The talk about filling up from the outside of disk is disingenuous.
Dave should know that Ext4 does not do that; it spreads out allocations
exactly to give good aging, and it does deliver that. Ext4's aging
performance is second to none. What XFS does is just stupid, and
instead of admitting that and fixing it, Dave claims it would be great
if the disk was an array or an SSD instead of what it actually is.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 12:58                   ` Daniel Phillips
@ 2015-04-30 13:48                     ` Mike Galbraith
  2015-04-30 14:07                       ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30 13:48 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
> > On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
> >
> >> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
> >> even with seek time factored out of the equation.
> >
> > Hm.  Do you have big-storage comparison numbers to back that?  I'm no
> > storage guy (waiting for holographic crystal arrays to obsolete all this
> > crap;), but Dave's big-storage guy words made sense to me.
> 
> This has nothing to do with big storage. The proposition was that seek
> time is the reason for Tux3's fsync performance. That claim was easily
> falsified by removing the seek time.
> 
> Dave's big storage words are there to draw attention away from the fact
> that XFS ran the Git tests four times slower than Tux3 and three times
> slower than Ext4. Whatever the big storage excuse is for that, the fact
> is, XFS obviously sucks at little storage.
If you allocate spanning the disk from start of life, you're going to
eat seeks that others don't until later.  That seemed rather obvious and
straightforward.  He flat stated that xfs has passable performance on
a single bit of rust, and openly explained why.  I see no misdirection,
only some evidence of bad blood between you two.
No, I won't be switching to xfs any time soon, but then it would take a
hell of a lot of evidence to get me to move away from ext4.  I trust
ext[n] deeply because it has proven many times over the years that it
can take one hell of a lot (of self inflicted wounds;).
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 13:48                     ` Mike Galbraith
@ 2015-04-30 14:07                       ` Daniel Phillips
  2015-04-30 14:28                         ` Howard Chu
  2015-04-30 14:33                         ` Mike Galbraith
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 14:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On 04/30/2015 06:48 AM, Mike Galbraith wrote:
> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>
>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>> even with seek time factored out of the equation.
>>>
>>> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>> crap;), but Dave's big-storage guy words made sense to me.
>>
>> This has nothing to do with big storage. The proposition was that seek
>> time is the reason for Tux3's fsync performance. That claim was easily
>> falsified by removing the seek time.
>>
>> Dave's big storage words are there to draw attention away from the fact
>> that XFS ran the Git tests four times slower than Tux3 and three times
>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>> is, XFS obviously sucks at little storage.
> 
> If you allocate spanning the disk from start of life, you're going to
> eat seeks that others don't until later.  That seemed rather obvious and
> straight forward.
It is a logical fallacy. It mixes a grain of truth (spreading all over the
disk causes extra seeks) with an obvious falsehood (it is not necessarily
the only possible way to avoid long term fragmentation).
> He flat stated that xfs has passable performance on
> single bit of rust, and openly explained why.  I see no misdirection,
> only some evidence of bad blood between you two.
Raising the spectre of theoretical fragmentation issues when we have not
even begun that work is a straw man and intellectually dishonest. You have
to wonder why he does it. It is destructive to our community image and
harmful to progress.
> No, I won't be switching to xfs any time soon, but then it would take a
> hell of a lot of evidence to get me to move away from ext4.  I trust
> ext[n] deeply because it has proven many times over the years that it
> can take one hell of a lot (of self inflicted wounds;).
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 14:07                       ` Daniel Phillips
@ 2015-04-30 14:28                         ` Howard Chu
  2015-04-30 15:14                           ` Daniel Phillips
  2015-04-30 14:33                         ` Mike Galbraith
  1 sibling, 1 reply; 160+ messages in thread
From: Howard Chu @ 2015-04-30 14:28 UTC (permalink / raw)
  To: Daniel Phillips, Mike Galbraith
  Cc: Theodore Ts'o, tux3, Dave Chinner, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
Daniel Phillips wrote:
>
>
> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>
>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>> even with seek time factored out of the equation.
>>>>
>>>> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>
>>> This has nothing to do with big storage. The proposition was that seek
>>> time is the reason for Tux3's fsync performance. That claim was easily
>>> falsified by removing the seek time.
>>>
>>> Dave's big storage words are there to draw attention away from the fact
>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>> is, XFS obviously sucks at little storage.
>>
>> If you allocate spanning the disk from start of life, you're going to
>> eat seeks that others don't until later.  That seemed rather obvious and
>> straight forward.
>
> It is a logical fallacy. It mixes a grain of truth (spreading all over the
> disk causes extra seeks) with an obvious falsehood (it is not necessarily
> the only possible way to avoid long term fragmentation).
You're reading into it what isn't there. Spreading over the disk isn't 
(just) about avoiding fragmentation - it's about delivering consistent 
and predictable latency. It is undeniable that if you start by only 
allocating from the fastest portion of the platter, you are going to see 
performance slow down over time. If you start by spreading allocations 
across the entire platter, you make the worst-case and average-case 
latency equal, which is exactly what a lot of folks are looking for.
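A minimal sketch of the seek-distance effect described above, under a deliberately crude model: seek cost is taken to be proportional to the distance between two consecutive accesses, and accesses are uniformly random within the occupied span of the platter. The function name, the sample count and the fixed seed are choices made for this example only.

```python
import random


def mean_seek_distance(span, samples=100_000, seed=0):
    """Estimate the mean distance between two uniformly random positions
    in [0, span], where 1.0 stands for the full width of the platter.

    For a uniform distribution the exact value is span / 3; the Monte
    Carlo estimate here just makes that visible without the algebra.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += abs(rng.uniform(0, span) - rng.uniform(0, span))
    return total / samples


# Fresh filesystem packed into the first 10% of the disk.
print(mean_seek_distance(0.1))
# Allocations spread over the whole disk from the start.
print(mean_seek_distance(1.0))
```

In this model the packed layout starts with roughly a tenth of the mean seek distance of the spread layout, and drifts toward the spread figure as the occupied span grows. The spread layout pays that cost from day one but has nowhere further to degrade, at least as far as seek distance alone is concerned.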
>> He flat stated that xfs has passable performance on
>> single bit of rust, and openly explained why.  I see no misdirection,
>> only some evidence of bad blood between you two.
>
> Raising the spectre of theoretical fragmentation issues when we have not
> even begun that work is a straw man and intellectually dishonest. You have
> to wonder why he does it. It is destructive to our community image and
> harmful to progress.
It is a fact of life that when you change one aspect of an intimately 
interconnected system, something else will change as well. You have 
naive/nonexistent free space management now; when you design something 
workable there it is going to impact everything else you've already 
done. It's an easy bet that the impact will be negative, the only 
question is to what degree.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 14:28                         ` Howard Chu
@ 2015-04-30 15:14                           ` Daniel Phillips
  2015-04-30 16:00                             ` Howard Chu
                                               ` (2 more replies)
  0 siblings, 3 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 15:14 UTC (permalink / raw)
  To: Howard Chu, Mike Galbraith
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On 04/30/2015 07:28 AM, Howard Chu wrote:
> Daniel Phillips wrote:
>>
>>
>> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>>
>>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>>> even with seek time factored out of the equation.
>>>>>
>>>>> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
>>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>>
>>>> This has nothing to do with big storage. The proposition was that seek
>>>> time is the reason for Tux3's fsync performance. That claim was easily
>>>> falsified by removing the seek time.
>>>>
>>>> Dave's big storage words are there to draw attention away from the fact
>>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>>> is, XFS obviously sucks at little storage.
>>>
>>> If you allocate spanning the disk from start of life, you're going to
>>> eat seeks that others don't until later.  That seemed rather obvious and
>>> straight forward.
>>
>> It is a logical fallacy. It mixes a grain of truth (spreading all over the
>> disk causes extra seeks) with an obvious falsehood (it is not necessarily
>> the only possible way to avoid long term fragmentation).
> 
> You're reading into it what isn't there. Spreading over the disk isn't (just) about avoiding
> fragmentation - it's about delivering consistent and predictable latency. It is undeniable that if
> you start by only allocating from the fastest portion of the platter, you are going to see
> performance slow down over time. If you start by spreading allocations across the entire platter,
> you make the worst-case and average-case latency equal, which is exactly what a lot of folks are
> looking for.
Another fallacy: intentionally running slower than necessary is not necessarily
the only way to deliver consistent and predictable latency. Not only that, but
intentionally running slower than necessary does not necessarily guarantee
performing better than some alternate strategy later.
Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
slower with no guarantee of any benefit in the future, please raise your hand.
>>> He flat stated that xfs has passable performance on
>>> single bit of rust, and openly explained why.  I see no misdirection,
>>> only some evidence of bad blood between you two.
>>
>> Raising the spectre of theoretical fragmentation issues when we have not
>> even begun that work is a straw man and intellectually dishonest. You have
>> to wonder why he does it. It is destructive to our community image and
>> harmful to progress.
> 
> It is a fact of life that when you change one aspect of an intimately interconnected system,
> something else will change as well. You have naive/nonexistent free space management now; when you
> design something workable there it is going to impact everything else you've already done. It's an
> easy bet that the impact will be negative, the only question is to what degree.
You might lose that bet. For example, suppose we do strictly linear allocation
each delta, and just leave nice big gaps between the deltas for future
expansion. Clearly, we run at similar or identical speed to the current naive
strategy until we must start filling in the gaps, and at that point our layout
is not any worse than XFS, which started bad and stayed that way.
Now here is where you lose the bet: we already know that linear allocation
with wrap ends horribly, right? However, as above, we start linear, without
compromise, but because of the gaps we leave, we are able to switch to a
slower strategy, but not nearly as slow as the ugly tangle we get with
simple wrap. So impact over the lifetime of the filesystem is positive, not
negative, and what seemed to be self evident to you turns out to be wrong.
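To make the idea concrete, here is a minimal sketch of "linear allocation with gaps" as described in the two paragraphs above. It is purely hypothetical: the class name, the fixed gap size and the gap-filling order are assumptions for illustration, frees are not modeled at all, and none of this is taken from Tux3 code.

```python
class GapAllocator:
    """Hypothetical allocator: linear per delta, with a gap after each delta."""

    def __init__(self, volume_blocks, gap_blocks):
        self.volume_blocks = volume_blocks
        self.gap_blocks = gap_blocks
        self.cursor = 0   # next block for strictly linear allocation
        self.gaps = []    # reserved gaps as (start, length), in address order

    def alloc_delta(self, nblocks):
        """Return the extents, as (start, length) pairs, for one delta."""
        if self.cursor + nblocks <= self.volume_blocks:
            # Fast phase: one contiguous extent, then reserve a gap after it.
            extent = (self.cursor, nblocks)
            self.cursor += nblocks
            gap = min(self.gap_blocks, self.volume_blocks - self.cursor)
            if gap:
                self.gaps.append((self.cursor, gap))
                self.cursor += gap
            return [extent]
        # Slow phase: the linear region is exhausted, so fill gaps in
        # address order instead of wrapping around blindly.
        if self.cursor < self.volume_blocks:
            self.gaps.append((self.cursor, self.volume_blocks - self.cursor))
            self.cursor = self.volume_blocks
        extents, remaining, need = [], [], nblocks
        for start, length in self.gaps:
            take = min(need, length)
            if take:
                extents.append((start, take))
                need -= take
            if length > take:
                remaining.append((start + take, length - take))
        if need:
            raise RuntimeError("no space left in this toy volume")
        self.gaps = remaining
        return extents


alloc = GapAllocator(volume_blocks=20, gap_blocks=4)
print(alloc.alloc_delta(4))   # linear phase
print(alloc.alloc_delta(4))   # linear phase
print(alloc.alloc_delta(4))   # linear phase, reaches the end of the volume
print(alloc.alloc_delta(6))   # gap-filling phase, split across two gaps
```

The sketch only shows the two phases; whether the gap-filling phase actually ages better than simple wrap depends on how gap sizes are chosen and how freed blocks are reused, which is exactly the part still to be designed.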
In short, we would rather deliver as much performance as possible, all the
time. I really don't need to think about it very hard to know that is what I
want, and what most users want.
I will make you a bet in return: when we get to doing that part properly, the
quality of the work will be just as high as everything else we have completed
so far. Why would we suddenly get lazy?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 15:14                           ` Daniel Phillips
@ 2015-04-30 16:00                             ` Howard Chu
  2015-04-30 18:22                             ` Christian Stroetmann
  2015-05-11 22:12                             ` Pavel Machek
  2 siblings, 0 replies; 160+ messages in thread
From: Howard Chu @ 2015-04-30 16:00 UTC (permalink / raw)
  To: Daniel Phillips, Mike Galbraith
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
Daniel Phillips wrote:
> On 04/30/2015 07:28 AM, Howard Chu wrote:
>> You're reading into it what isn't there. Spreading over the disk isn't (just) about avoiding
>> fragmentation - it's about delivering consistent and predictable latency. It is undeniable that if
>> you start by only allocating from the fastest portion of the platter, you are going to see
>> performance slow down over time. If you start by spreading allocations across the entire platter,
>> you make the worst-case and average-case latency equal, which is exactly what a lot of folks are
>> looking for.
>
> Another fallacy: intentionally running slower than necessary is not necessarily
> the only way to deliver consistent and predictable latency.
Totally agree with you there.
> Not only that, but
> intentionally running slower than necessary does not necessarily guarantee
> performing better than some alternate strategy later.
True, it's a question of algorithmic efficiency - does the performance 
decay linearly or logarithmically?
> Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
> slower with no guarantee of any benefit in the future, please raise your hand.
git is an important workload for us as developers, but I don't think 
that's the only workload that's important for us.
>>>> He flat stated that xfs has passable performance on
>>>> single bit of rust, and openly explained why.  I see no misdirection,
>>>> only some evidence of bad blood between you two.
>>>
>>> Raising the spectre of theoretical fragmentation issues when we have not
>>> even begun that work is a straw man and intellectually dishonest. You have
>>> to wonder why he does it. It is destructive to our community image and
>>> harmful to progress.
>>
>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>> something else will change as well. You have naive/nonexistent free space management now; when you
>> design something workable there it is going to impact everything else you've already done. It's an
>> easy bet that the impact will be negative, the only question is to what degree.
>
> You might lose that bet. For example, suppose we do strictly linear allocation
> each delta, and just leave nice big gaps between the deltas for future
> expansion. Clearly, we run at similar or identical speed to the current naive
> strategy until we must start filling in the gaps, and at that point our layout
> is not any worse than XFS, which started bad and stayed that way.
>
> Now here is where you lose the bet: we already know that linear allocation
> with wrap ends horribly right? However, as above, we start linear, without
> compromise, but because of the gaps we leave, we are able to switch to a
> slower strategy, but not nearly as slow as the ugly tangle we get with
> simple wrap. So impact over the lifetime of the filesystem is positive, not
> negative, and what seemed to be self evident to you turns out to be wrong.
>
> In short, we would rather deliver as much performance as possible, all the
> time. I really don't need to think about it very hard to know that is what I
> want, and what most users want.
>
> I will make you a bet in return: when we get to doing that part properly, the
> quality of the work will be just as high as everything else we have completed
> so far. Why would we suddenly get lazy?
I never said anything about getting lazy. You're working in a closed 
system though. If you run today's version on a system, and then you run 
your future version on that same hardware, you're doing more CPU work 
and probably more I/O work to do the more complex space management. It's 
not quite zero-sum but close enough, when you're talking about highly 
optimized designs.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 15:14                           ` Daniel Phillips
  2015-04-30 16:00                             ` Howard Chu
@ 2015-04-30 18:22                             ` Christian Stroetmann
  2015-05-11 22:12                             ` Pavel Machek
  2 siblings, 0 replies; 160+ messages in thread
From: Christian Stroetmann @ 2015-04-30 18:22 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Howard Chu, Mike Galbraith, Dave Chinner, linux-kernel,
	linux-fsdevel, tux3, Theodore Ts'o, OGAWA Hirofumi
On the 30th of April 2015 17:14, Daniel Phillips wrote:
Hallo hardcore coders
> On 04/30/2015 07:28 AM, Howard Chu wrote:
>> Daniel Phillips wrote:
>>>
>>> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>>>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>>>
>>>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>>>> even with seek time factored out of the equation.
>>>>>> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
>>>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>>> This has nothing to do with big storage. The proposition was that seek
>>>>> time is the reason for Tux3's fsync performance. That claim was easily
>>>>> falsified by removing the seek time.
>>>>>
>>>>> Dave's big storage words are there to draw attention away from the fact
>>>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>>>> is, XFS obviously sucks at little storage.
>>>> If you allocate spanning the disk from start of life, you're going to
>>>> eat seeks that others don't until later.  That seemed rather obvious and
>>>> straight forward.
>>> It is a logical fallacy. It mixes a grain of truth (spreading all over the
>>> disk causes extra seeks) with an obvious falsehood (it is not necessarily
>>> the only possible way to avoid long term fragmentation).
>> You're reading into it what isn't there. Spreading over the disk isn't (just) about avoiding
>> fragmentation - it's about delivering consistent and predictable latency. It is undeniable that if
>> you start by only allocating from the fastest portion of the platter, you are going to see
>> performance slow down over time. If you start by spreading allocations across the entire platter,
>> you make the worst-case and average-case latency equal, which is exactly what a lot of folks are
>> looking for.
> Another fallacy: intentionally running slower than necessary is not necessarily
> the only way to deliver consistent and predictable latency. Not only that, but
> intentionally running slower than necessary does not necessarily guarantee
> performing better than some alternate strategy later.
>
> Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
> slower with no guarantee of any benefit in the future, please raise your hand.
>
>>>> He flat stated that xfs has passable performance on
>>>> single bit of rust, and openly explained why.  I see no misdirection,
>>>> only some evidence of bad blood between you two.
>>> Raising the spectre of theoretical fragmentation issues when we have not
>>> even begun that work is a straw man and intellectually dishonest. You have
>>> to wonder why he does it. It is destructive to our community image and
>>> harmful to progress.
>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>> something else will change as well. You have naive/nonexistent free space management now; when you
>> design something workable there it is going to impact everything else you've already done. It's an
>> easy bet that the impact will be negative, the only question is to what degree.
> You might lose that bet. For example, suppose we do strictly linear allocation
> each delta, and just leave nice big gaps between the deltas for future
> expansion. Clearly, we run at similar or identical speed to the current naive
> strategy until we must start filling in the gaps, and at that point our layout
> is not any worse than XFS, which started bad and stayed that way.
>
> Now here is where you lose the bet: we already know that linear allocation
> with wrap ends horribly right? However, as above, we start linear, without
> compromise, but because of the gaps we leave, we are able to switch to a
> slower strategy, but not nearly as slow as the ugly tangle we get with
> simple wrap. So impact over the lifetime of the filesystem is positive, not
> negative, and what seemed to be self evident to you turns out to be wrong.
>
> In short, we would rather deliver as much performance as possible, all the
> time. I really don't need to think about it very hard to know that is what I
> want, and what most users want.
>
> I will make you a bet in return: when we get to doing that part properly, the
> quality of the work will be just as high as everything else we have completed
> so far. Why would we suddenly get lazy?
>
> Regards,
>
> Daniel
> --
>
How?
Maybe this is explained and discussed in a new thread about allocation 
or so.
Thanks
Best Regards
Have fun
C.S.
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 15:14                           ` Daniel Phillips
  2015-04-30 16:00                             ` Howard Chu
  2015-04-30 18:22                             ` Christian Stroetmann
@ 2015-05-11 22:12                             ` Pavel Machek
  2015-05-11 23:17                               ` Theodore Ts'o
  2015-05-11 23:53                               ` Daniel Phillips
  2 siblings, 2 replies; 160+ messages in thread
From: Pavel Machek @ 2015-05-11 22:12 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Howard Chu, Mike Galbraith, Dave Chinner, linux-kernel,
	linux-fsdevel, tux3, Theodore Ts'o, OGAWA Hirofumi
Hi!
> > It is a fact of life that when you change one aspect of an intimately interconnected system,
> > something else will change as well. You have naive/nonexistent free space management now; when you
> > design something workable there it is going to impact everything else you've already done. It's an
> > easy bet that the impact will be negative, the only question is to what degree.
> 
> You might lose that bet. For example, suppose we do strictly linear allocation
> each delta, and just leave nice big gaps between the deltas for future
> expansion. Clearly, we run at similar or identical speed to the current naive
> strategy until we must start filling in the gaps, and at that point our layout
> is not any worse than XFS, which started bad and stayed that way.
Umm, are you sure? If "some areas of disk are faster than others" is
still true on today's hard drives, the gaps will decrease the
performance (as you'll "use up" the fast areas more quickly).
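A rough sketch of that concern, assuming a hypothetical allocator that
writes fixed-size deltas linearly and skips a fixed gap after each one.
The function name and the sizes are illustrative only and are not taken
from Tux3; the point is just how much sooner the highest used block
address moves inward when gaps are left:

```python
def high_water_mark(blocks_written, delta_size, gap_size):
    """Highest block address reached after writing blocks_written blocks
    linearly, one delta at a time, skipping gap_size free blocks after
    every completed delta (hypothetical layout, for illustration)."""
    full_deltas, remainder = divmod(blocks_written, delta_size)
    if remainder == 0 and full_deltas > 0:
        # The last delta is complete, but the gap after it is not yet skipped.
        return full_deltas * delta_size + (full_deltas - 1) * gap_size
    return full_deltas * (delta_size + gap_size) + remainder


if __name__ == "__main__":
    written = 1000
    dense = high_water_mark(written, delta_size=100, gap_size=0)
    gapped = high_water_mark(written, delta_size=100, gap_size=100)
    print(f"dense layout reaches block {dense}")
    print(f"gapped layout reaches block {gapped}")
```

With gaps as large as the deltas, the same amount of data reaches
nearly twice as far into the address space, which is where the slower
tracks are, if outer tracks really are faster.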
Anyway... you have brand new filesystem. Of course it should be
faster/better/nicer than the existing filesystems. So don't be too
harsh with XFS people.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-11 22:12                             ` Pavel Machek
@ 2015-05-11 23:17                               ` Theodore Ts'o
  2015-05-12  2:34                                 ` Daniel Phillips
  2015-05-11 23:53                               ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Theodore Ts'o @ 2015-05-11 23:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Daniel Phillips, Howard Chu, Mike Galbraith, Dave Chinner,
	linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
> Umm, are you sure? If "some areas of disk are faster than others" is
> still true on today's hard drives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).
It's still true.  The difference between O.D. and I.D. (outer diameter
vs inner diameter) LBA's is typically a factor of 2.  This is why
"short-stroking" works as a technique, and another way that people
doing competitive benchmarking can screw up and produce misleading
numbers.  (If you use partitions instead of the whole disk, you have
to use the same partition in order to make sure you aren't comparing
apples with oranges.)
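As a back-of-the-envelope illustration of why short-stroking helps,
assume sequential throughput falls off linearly with LBA from the outer
edge to the inner edge. That linear model is a simplification chosen
for this sketch (real drives step down in zones); only the factor of 2
comes from the paragraph above, and the function name is made up:

```python
def mean_relative_throughput(fraction_used, outer_to_inner_ratio=2.0):
    """Mean sequential throughput over the first fraction_used of the LBA
    space, relative to the outermost track (1.0 = outer-edge speed).

    Assumes throughput at position x in [0, 1] is 1 - (1 - inner) * x,
    where inner = 1 / outer_to_inner_ratio. Averaging that over
    [0, fraction_used] gives the closed form below.
    """
    inner = 1.0 / outer_to_inner_ratio
    return 1.0 - (1.0 - inner) * fraction_used / 2.0


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 1.0):
        rel = mean_relative_throughput(fraction)
        print(f"first {fraction:.0%} of the disk: {rel:.4f} of outer-edge speed")
```

Under this model a benchmark confined to the first quarter of the disk
averages about 94% of outer-edge throughput, against 75% for one spread
over the whole disk, so two runs on different partitions can differ
noticeably before the filesystem does anything at all.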
Cheers,
					- Ted
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-11 23:17                               ` Theodore Ts'o
@ 2015-05-12  2:34                                 ` Daniel Phillips
  2015-05-12  5:38                                   ` Dave Chinner
  2015-05-12  9:03                                   ` Pavel Machek
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12  2:34 UTC (permalink / raw)
  To: Theodore Ts'o, Pavel Machek, Howard Chu, Mike Galbraith,
	Dave Chinner, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
> On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
>> Umm, are you sure? If "some areas of disk are faster than others" is
>> still true on today's hard drives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
> 
> It's still true.  The difference between O.D. and I.D. (outer diameter
> vs inner diameter) LBA's is typically a factor of 2.  This is why
> "short-stroking" works as a technique,
That is true, and the effect is not dominant compared to introducing
a lot of extra seeks.
> and another way that people
> doing competitive benchmarking can screw up and produce misleading
> numbers.
If you think we screwed up or produced misleading numbers, could you
please be up front about it instead of making insinuations and
continuing your tirade against benchmarking and those who do it.
> (If you use partitions instead of the whole disk, you have
> to use the same partition in order to make sure you aren't comparing
> apples with oranges.)
You can rest assured I did exactly that.
Somebody complained that things would look much different with seeks
factored out, so here are some new "competitive benchmarks" using
fs_mark on a ram disk:
   tasks        1        16        64
   ------------------------------------
   ext4:       231      2154       5439
   btrfs:      152       962       2230
   xfs:        268      2729       6466
   tux3:       315      5529      20301
    (Files per second, more is better)
The shell commands are:
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s1048576 -w4096 -n1000 -t1
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s65536 -w4096 -n1000 -t16
   fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s4096 -w4096 -n1000 -t64
The ram disk removes seek overhead and greatly reduces media transfer
overhead. This does not change things much: it confirms that Tux3 is
significantly faster than the others at synchronous loads. This is
apparently true independently of media type, though to be sure SSD
remains to be tested.
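For readers who want the relative numbers, here is a small sketch that
turns the table above into ratios. The figures are copied from the
table; the dictionary layout and the speedup() helper are just one way
to do the arithmetic. Note that the three fs_mark runs also use
different -s file sizes, so each column is a separate workload rather
than a pure task-count scaling curve:

```python
# Files per second from the fs_mark table above, for 1, 16 and 64 tasks.
TASKS = [1, 16, 64]
RESULTS = {
    "ext4": [231, 2154, 5439],
    "btrfs": [152, 962, 2230],
    "xfs": [268, 2729, 6466],
    "tux3": [315, 5529, 20301],
}


def speedup(fs, baseline):
    """Ratio of fs to baseline throughput per task count, rounded to 2 places."""
    return [round(a / b, 2) for a, b in zip(RESULTS[fs], RESULTS[baseline])]


if __name__ == "__main__":
    for baseline in ("ext4", "btrfs", "xfs"):
        ratios = speedup("tux3", baseline)
        cells = ", ".join(f"{t} tasks: {r}x" for t, r in zip(TASKS, ratios))
        print(f"tux3 vs {baseline}: {cells}")
```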
The really interesting result is how much difference there is between
filesystems, even on a ram disk. Is it just CPU or is it synchronization
strategy and lock contention? Does our asynchronous front/back design
actually help a lot, instead of being a disadvantage as you predicted?
It is too bad that fs_mark caps number of tasks at 64, because I am
sure that some embarrassing behavior would emerge at high task counts,
as with my tests on spinning disk.
Anyway, everybody but you loves competitive benchmarks, that is why I
post them. They are not only useful for tracking down performance bugs,
but as you point out, they help us advertise the reasons why Tux3 is
interesting and ought to be merged.
Regards,
Daniel
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  2:34                                 ` Daniel Phillips
@ 2015-05-12  5:38                                   ` Dave Chinner
  2015-05-12  6:18                                     ` Daniel Phillips
  2015-05-12  9:03                                   ` Pavel Machek
  1 sibling, 1 reply; 160+ messages in thread
From: Dave Chinner @ 2015-05-12  5:38 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Ts'o, Pavel Machek, Howard Chu, Mike Galbraith,
	linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On Mon, May 11, 2015 at 07:34:34PM -0700, Daniel Phillips wrote:
> Anyway, everybody but you loves competitive benchmarks, that is why I
I think Ted and I are on the same page here. "Competitive
benchmarks" only matter to the people who are trying to sell
something. You're trying to sell Tux3, but....
> post them. They are not only useful for tracking down performance bugs,
> but as you point out, they help us advertise the reasons why Tux3 is
> interesting and ought to be merged.
.... benchmarks won't get tux3 merged.
Addressing the significant issues that have been raised during
previous code reviews is what will get it merged.  I posted that
list elsewhere in this thread, to which you replied that they were all
"on the list of things to do except for the page forking design".
The "except page forking design" statement is your biggest hurdle
for getting tux3 merged, not performance. Without page forking, tux3
cannot be merged at all. But it's not filesystem developers you need
to convince about the merits of the page forking design and
implementation - it's the mm and core kernel developers that need to
review and accept that code *before* we can consider merging tux3.
IOWs, you need to focus on the important things needed to achieve
your stated goal of getting tux3 merged. New filesystems should be
faster than those based on 20-25 year old designs, so you don't need
to waste time trying to convince people that tux3, when complete,
will be fast.
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  5:38                                   ` Dave Chinner
@ 2015-05-12  6:18                                     ` Daniel Phillips
  2015-05-12 18:39                                       ` David Lang
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12  6:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Howard Chu, linux-kernel, Mike Galbraith,
	Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On Monday, May 11, 2015 10:38:42 PM PDT, Dave Chinner wrote:
> I think Ted and I are on the same page here. "Competitive
> benchmarks" only matter to the people who are trying to sell
> something. You're trying to sell Tux3, but....
By "same page", do you mean "transparently obvious about
obstructing other projects"?
> The "except page forking design" statement is your biggest hurdle
> for getting tux3 merged, not performance.
No, the "except page forking design" is because the design is
already good and effective. The small adjustments needed in core
are well worth merging because the benefits are proved by benchmarks.
So benchmarks are key and will not stop just because you don't like
the attention they bring to XFS issues.
> Without page forking, tux3
> cannot be merged at all. But it's not filesystem developers you need
> to convince about the merits of the page forking design and
> implementation - it's the mm and core kernel developers that need to
> review and accept that code *before* we can consider merging tux3.
Please do not say "we" when you know that I am just as much a "we"
as you are. Merging Tux3 is not your decision. The people whose
decision it actually is are perfectly capable of recognizing your
agenda for what it is.
   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
   "XFS Developer Takes Shots At Btrfs, EXT4"
The real question is, has the Linux development process become
so political and toxic that worthwhile projects fail to benefit
from supposed grassroots community support. You are the poster
child for that.
> IOWs, you need to focus on the important things needed to achieve
> your stated goal of getting tux3 merged. New filesystems should be
> faster than those based on 20-25 year old designs, so you don't need
> to waste time trying to convince people that tux3, when complete,
> will be fast.
You know that Tux3 is already fast. Not just that of course. It
has a higher standard of data integrity than your metadata-only
journalling filesystem and a small enough code base that it can
be reasonably expected to reach the quality expected of an
enterprise class filesystem, quite possibly before XFS gets
there.
Regards,
Daniel
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  6:18                                     ` Daniel Phillips
@ 2015-05-12 18:39                                       ` David Lang
  2015-05-12 20:54                                         ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-12 18:39 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Dave Chinner, Theodore Ts'o, Pavel Machek, Howard Chu,
	Mike Galbraith, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On Mon, 11 May 2015, Daniel Phillips wrote:
> On Monday, May 11, 2015 10:38:42 PM PDT, Dave Chinner wrote:
>> I think Ted and I are on the same page here. "Competitive
>> benchmarks" only matter to the people who are trying to sell
>> something. You're trying to sell Tux3, but....
>
> By "same page", do you mean "transparently obvious about
> obstructing other projects"?
>
>> The "except page forking design" statement is your biggest hurdle
>> for getting tux3 merged, not performance.
>
> No, the "except page forking design" is because the design is
> already good and effective. The small adjustments needed in core
> are well worth merging because the benefits are proved by benchmarks.
> So benchmarks are key and will not stop just because you don't like
> the attention they bring to XFS issues.
>
>> Without page forking, tux3
>> cannot be merged at all. But it's not filesystem developers you need
>> to convince about the merits of the page forking design and
>> implementation - it's the mm and core kernel developers that need to
>> review and accept that code *before* we can consider merging tux3.
>
> Please do not say "we" when you know that I am just as much a "we"
> as you are. Merging Tux3 is not your decision. The people whose
> decision it actually is are perfectly capable of recognizing your
> agenda for what it is.
>
>   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>   "XFS Developer Takes Shots At Btrfs, EXT4"
Umm, Phoronix has no input on what gets merged into the kernel. They also have a
reputation for trying to turn anything into click-bait by making it sound like a
fight when it isn't.
> The real question is, has the Linux development process become
> so political and toxic that worthwhile projects fail to benefit
> from supposed grassroots community support. You are the poster
> child for that.
The linux development process is making code available, responding to concerns 
from the experts in the community, and letting the code talk for itself.
There have been many people pushing code for inclusion that has not gotten into 
the kernel, or has not been used by any distros after it's made it into the 
kernel, in spite of benchmarks being posted that seem to show how wonderful the 
new code is. ReiserFS was one of the first, and part of what tarnished its
reputation with many people was how much they were pushing the benchmarks that 
were shown to be faulty (the one I remember most vividly was that the entire 
benchmark completed in <30 seconds, and they had the FS tuned to not start 
flushing data to disk for 30 seconds, so the entire 'benchmark' ran out of ram 
without ever touching the disk)
So when Ted and Dave point out problems with the benchmark (the difference in 
behavior between a single spinning disk, different partitions on the same disk, 
SSDs, and ramdisks), you would be better off acknowledging them and if you can't 
adjust and re-run the benchmarks, don't start attacking them as a result.
As Dave says above, it's not the other filesystem people you have to convince, 
it's the core VFS and Memory Management folks you have to convince. You may need
a little benchmarking to show that there is a real advantage to be gained, but 
the real discussion is going to be on the impact that page forking is going to 
have on everything else (both in complexity and in performance impact to other 
things)
>> IOWs, you need to focus on the important things needed to achieve
>> your stated goal of getting tux3 merged. New filesystems should be
>> faster than those based on 20-25 year old designs, so you don't need
>> to waste time trying to convince people that tux3, when complete,
>> will be fast.
>
> You know that Tux3 is already fast. Not just that of course. It
> has a higher standard of data integrity than your metadata-only
> journalling filesystem and a small enough code base that it can
> be reasonably expected to reach the quality expected of an
> enterprise class filesystem, quite possibly before XFS gets
> there.
We wouldn't expect anyone developing a new filesystem to believe any 
differently. If they didn't believe this, why would they be working on the 
filesystem instead of just using an existing filesystem.
The ugly reality is that everyone's early versions of their new filesystem look
really good. The problem is when they extend it to cover the corner cases and
when it gets stressed by real-world (as opposed to benchmark) workloads. This
isn't saying that you are wrong in your belief, just that you may not be right,
and nobody will know until you are at a usable state and other people can start
beating on it.
David Lang
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 18:39                                       ` David Lang
@ 2015-05-12 20:54                                         ` Daniel Phillips
  2015-05-12 21:30                                           ` David Lang
                                                             ` (2 more replies)
  0 siblings, 3 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12 20:54 UTC (permalink / raw)
  To: David Lang
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On 05/12/2015 11:39 AM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
>>> ...it's the mm and core kernel developers that need to
>>> review and accept that code *before* we can consider merging tux3.
>>
>> Please do not say "we" when you know that I am just as much a "we"
>> as you are. Merging Tux3 is not your decision. The people whose
>> decision it actually is are perfectly capable of recognizing your
>> agenda for what it is.
>>
>>   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>>   "XFS Developer Takes Shots At Btrfs, EXT4"
> 
> Umm, Phoronix has no input on what gets merged into the kernel. They also have a reputation for
> trying to turn anything into click-bait by making it sound like a fight when it isn't.
Perhaps you misunderstood. Linus decides what gets merged. Andrew
decides. Greg decides. Dave Chinner does not decide, he just does
his level best to create the impression that our project is unfit
to merge. Any chance there might be an agenda?
Phoronix published a headline that identifies Dave Chinner as
someone who takes shots at other projects. Seems pretty much on
the money to me, and it ought to be obvious why he does it.
>> The real question is, has the Linux development process become
>> so political and toxic that worthwhile projects fail to benefit
>> from supposed grassroots community support. You are the poster
>> child for that.
> 
> The linux development process is making code available, responding to concerns from the experts in
> the community, and letting the code talk for itself.
Nice idea, but it isn't working. Did you let the code talk to you?
Right, you let the code talk to Dave Chinner, then you listen to
what Dave Chinner has to say about it. Any chance that there might
be some creative licence acting somewhere in that chain?
> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
> tarnished its reputation with many people was how much they were pushing the benchmarks that were
> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
> 'benchmark' ran out of ram without ever touching the disk)
You know what to do about checking for faulty benchmarks.
> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
> attacking them as a result.
Ted and Dave failed to point out any actual problem with any
benchmark. They invented issues with benchmarks and promoted those
as FUD.
> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
> Memory Management folks you have to convince. You may need a little benchmarking to show that there
> is a real advantage to be gained, but the real discussion is going to be on the impact that page
> forking is going to have on everything else (both in complexity and in performance impact to other
> things)
Yet he clearly wrote "we" as if he believes he is part of it.
Now that ENOSPC is done to a standard way beyond what Btrfs had
when it was merged, the next item on the agenda is writeback. That
involves us and VFS people as you say, and not Dave Chinner, who
only intends to obstruct the process as much as he possibly can. He
should get back to work on his own project. Nobody will miss his
posts if he doesn't make them. They contribute nothing of value,
create a lot of bad blood, and just serve to further besmirch the
famously tarnished reputation of LKML.
>> You know that Tux3 is already fast. Not just that of course. It
>> has a higher standard of data integrity than your metadata-only
>> journalling filesystem and a small enough code base that it can
>> be reasonably expected to reach the quality expected of an
>> enterprise class filesystem, quite possibly before XFS gets
>> there.
> 
> We wouldn't expect anyone developing a new filesystem to believe any differently.
It is not a matter of belief, it is a matter of testable fact. For
example, you can count the lines. You can run the same benchmarks.
Proving the data consistency claims would be a little harder, you
need tools for that, and some of those aren't built yet. Or, if you
have technical ability, you can read the code and the copious design
material that has been posted and convince yourself that, yes, there
is something cool here, why didn't anybody do it that way before?
But of course that starts to sound like work. Debating nontechnical
issues and playing politics seems so much more like fun.
> If they didn't
> believe this, why would they be working on the filesystem instead of just using an existing filesystem.
Right, and it is my job to convince you that what I believe for
perfectly valid, demonstrable technical reasons, is really true. I do
not see why you feel it is your job to convince me that the obviously
broken Linux community process is not in fact broken, and that a
certain person who obviously has an agenda, is not actually obstructing.
> The ugly reality is that everyone's early versions of their new filesystem look really good. The
> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
> may not be right, and nobody will know until you are at a usable state and other people can start
> beating on it.
With ENOSPC we are at that state. Tux3 would get more testing and advance
faster if it was merged. Things like ifdefs, grandiose new schemes for
writeback infrastructure, dumb little hooks in the mkwrite path, those
are all just manufactured red herrings. Somebody wanted those to be
issues, so now they are issues. Fake ones.
Nobody is trying to trick you. Just stating a fact. You ought to be able
to figure out by now that Tux3 is worth merging.
You might possibly have an argument that merging a filesystem that
crashes as soon as it fills the disk is just sheer stupidity that can
only lead to embarrassment in the long run, but then you would need to
explain why Btrfs was merged. As I recall, it went something like, Chris
had it on a laptop, so it must be a filesystem, and wow look at that
feature list. Then it got merged in a completely unusable state and got
worked on. If it had not been merged, Btrfs would most likely be dead
right now. After all, who cares about an out of tree filesystem?
By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
2009, with Tux3 running as my root filesystem. By the standard applied
to Btrfs, Tux3 should have been merged then, right? After all, our
nospace handling worked just as well as theirs at that time.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 20:54                                         ` Daniel Phillips
@ 2015-05-12 21:30                                           ` David Lang
  2015-05-12 22:27                                             ` Daniel Phillips
  2015-05-13  0:31                                             ` Daniel Phillips
  2015-05-12 21:30                                           ` Christian Stroetmann
  2015-05-13  7:20                                           ` Pavel Machek
  2 siblings, 2 replies; 160+ messages in thread
From: David Lang @ 2015-05-12 21:30 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On Tue, 12 May 2015, Daniel Phillips wrote:
> On 05/12/2015 11:39 AM, David Lang wrote:
>> On Mon, 11 May 2015, Daniel Phillips wrote:
>>>> ...it's the mm and core kernel developers that need to
>>>> review and accept that code *before* we can consider merging tux3.
>>>
>>> Please do not say "we" when you know that I am just as much a "we"
>>> as you are. Merging Tux3 is not your decision. The people whose
>>> decision it actually is are perfectly capable of recognizing your
>>> agenda for what it is.
>>>
>>>   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>>>   "XFS Developer Takes Shots At Btrfs, EXT4"
>>
>> umm, Phoronix has no input on what gets merged into the kernel. they also hae a reputation for
>> trying to turn anything into click-bait by making it sound like a fight when it isn't.
>
> Perhaps you misunderstood. Linus decides what gets merged. Andrew
> decides. Greg decides. Dave Chinner does not decide, he just does
> his level best to create the impression that our project is unfit
> to merge. Any chance there might be an agenda?
>
> Phoronix published a headline that identifies Dave Chinner as
> someone who takes shots at other projects. Seems pretty much on
> the money to me, and it ought to be obvious why he does it.
Phoronix turns any correction or criticism into an attack.
You need to get out of the mindset that Ted and Dave are Enemies that you need 
to overcome, they are friendly competitors, not Enemies. They assume that you 
are working in good faith (but are inexperienced compared to them), and you need 
to assume that they are working in good faith. If they ever do resort to 
underhanded means to sabotage you, Linus and the other kernel developers will 
take action. But pointing out limits in your current implementation, problems in 
your benchmarks based on how they are run, and concepts that are going to be 
difficult to merge is not underhanded, it's exactly the type of assistance that 
you should be grateful for in friendly competition.
You were the one who started crowing about how badly XFS performed. Dave gave a 
long and detailed explanation about the reasons for the differences, and showing 
benchmarks on other hardware that showed that XFS works very well there. That's 
not an attack on EXT4 (or Tux3), it's an explanation.
>>> The real question is, has the Linux development process become
>>> so political and toxic that worthwhile projects fail to benefit
>>> from supposed grassroots community support. You are the poster
>>> child for that.
>>
>> The linux development process is making code available, responding to concerns from the experts in
>> the community, and letting the code talk for itself.
>
> Nice idea, but it isn't working. Did you let the code talk to you?
> Right, you let the code talk to Dave Chinner, then you listen to
> what Dave Chinner has to say about it. Any chance that there might
> be some creative licence acting somewhere in that chain?
I have my own concerns about how things are going to work (I've voiced some of 
them), but no, I haven't tried running Tux3 because you say it's not ready yet.
>> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
>> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
>> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
>> tarnished it's reputation with many people was how much they were pushing the benchmarks that were
>> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
>> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
>> 'benchmark' ran out of ram without ever touching the disk)
>
> You know what to do about checking for faulty benchmarks.
That requires that the code be readily available, which last I heard, Tux3 
wasn't. Has this been fixed?
>> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
>> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
>> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
>> attacking them as a result.
>
> Ted and Dave failed to point out any actual problem with any
> benchmark. They invented issues with benchmarks and promoted those
> as FUD.
They pointed out problems with using ramdisk to simulate a SSD and huge 
differences between spinning rust and an SSD (or disk array). Those aren't FUD.
>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>> forking is going to have on everything else (both in complexity and in performance impact to other
>> things)
>
> Yet he clearly wrote "we" as if he believes he is part of it.
He is part of the group of people who use and work with this stuff, so he is 
part of it.
> Now that ENOSPC is done to a standard way beyond what Btrfs had
> when it was merged, the next item on the agenda is writeback. That
> involves us and VFS people as you say, and not Dave Chinner, who
> only intends to obstruct the process as much as he possibly can. He
> should get back to work on his own project. Nobody will miss his
> posts if he doesn't make them. They contribute nothing of value,
> create a lot of bad blood, and just serve to further besmirch the
> famously tarnished reputation of LKML.
BTRFS is a perfect example of how not to introduce a new filesystem. Lots of 
hype, the presumption that it is going to replace all the existing filesystems 
because it's so much better (especially according to benchmarks). But then 
progress stalled before it was really ready, and it's still something most 
people avoid.
>>> You know that Tux3 is already fast. Not just that of course. It
>>> has a higher standard of data integrity than your metadata-only
>>> journalling filesystem and a small enough code base that it can
>>> be reasonably expected to reach the quality expected of an
>>> enterprise class filesystem, quite possibly before XFS gets
>>> there.
>>
>> We wouldn't expect anyone developing a new filesystem to believe any differently.
>
> It is not a matter of belief, it is a matter of testable fact. For
> example, you can count the lines. You can run the same benchmarks.
>
> Proving the data consistency claims would be a little harder, you
> need tools for that, and some of those aren't built yet. Or, if you
> have technical ability, you can read the code and the copious design
> material that has been posted and convince yourself that, yes, there
> is something cool here, why didn't anybody do it that way before?
> But of course that starts to sound like work. Debating nontechnical
> issues and playing politics seems so much more like fun.
why are you picking a fight? there was no attack in my statement?
>> If they didn't
>> believe this, why would they be working on the filesystem instead of just using an existing filesystem.
>
> Right, and it is my job to convince you that what I believe for
> perfectly valid, demonstrable technical reasons, is really true. I do
> not see why you feel it is your job to convince me that the obviously
> broken Linux community process is not in fact broken, and that a
> certain person who obviously has an agenda, is not actually obstructing.
You will need to have a fully working, usable system before you can convince 
people that you are right. A partial system may look good, but how much is 
fixing the corner cases that you haven't gotten to yet going to hurt it? That 
there are going to be such cases is pretty much a given, and that changing 
things to add code to work around the pathological conditions is going to hurt 
the common case is pretty close to a given (it's one of those things that isn't 
mathematically guaranteed, but happens on 99.99999+% of projects)
>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>> may not be right, and nobody will know until you are to a usable state and other people can start
>> beating on it.
>
> With ENOSPC we are at that state. Tux3 would get more testing and advance
> faster if it was merged. Things like ifdefs, grandiose new schemes for
> writeback infrastructure, dumb little hooks in the mkwrite path, those
> are all just manufactured red herrings. Somebody wanted those to be
> issues, so now they are issues. Fake ones.
Ok, so you are happy with your allocation strategy? you didn't seem to be a few 
e-mails ago.
but if you think it's ready for users, then start working to submit it in the 
next merge window. Dave said that except for one part, there was no reason not 
to merge it. That's pretty good. So you need to be discussing that one part with 
the folks that Dave pointed you at.
> Nobody is trying to trick you. Just stating a fact. You ought to be able
> to figure out by now that Tux3 is worth merging.
>
> You might possibly have an argument that merging a filesystem that
> crashes as soon as it fills the disk is just sheer stupidity than can
> only lead to embarrassment in the long run, but then you would need to
> explain why Btrfs was merged. As I recall, it went something like, Chris
> had it on a laptop, so it must be a filesystem, and wow look at that
> feature list. Then it got merged in a completely unusable state and got
> worked on. If it had not been merged, Btrfs would most likely be dead
> right now. After all, who cares about an out of tree filesystem?
As I said above, Btrfs is a perfect example of how not to do things.
The other thing you need to realize is that getting something in the kernel 
isn't a one-time effort, the code needs to be maintained over time (especially 
for a filesystem), and it's very possible for a developer/team/company to be so 
toxic and hostile to others that the Linux folks don't want to deal with the 
hassle of dealing with them. You are starting out on a path to put yourself into 
that category. Calm down and stop taking offense at everything. Your succeeding 
doesn't require that other people lose, so stop talking as if it's a zero sum 
game and you have to beat down the enemy to get your code accepted.
David Lang
> By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
> 2009, with Tux3 running as my root filesystem. By the standard applied
> to Btrfs, Tux3 should have been merged then, right? After all, our
> nospace handling worked just as well as theirs at that time.
>
> Regards,
>
> Daniel
>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 21:30                                           ` David Lang
@ 2015-05-12 22:27                                             ` Daniel Phillips
  2015-05-12 22:35                                               ` David Lang
  2015-05-13  0:31                                             ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12 22:27 UTC (permalink / raw)
  To: David Lang
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On 05/12/2015 02:30 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> Phoronix published a headline that identifies Dave Chinner as
>> someone who takes shots at other projects. Seems pretty much on
>> the money to me, and it ought to be obvious why he does it.
> 
> Phoronix turns any correction or criticism into an attack.
Phoronix gets attacked in an unseemly way by a number of people
in the developer community who should behave better. You are
doing it yourself, seemingly oblivious to the valuable role that
the publication plays in our community. Google for filesystem
benchmarks. Where do you find them? Right. Not to mention the
Xorg coverage, community issues, etc etc. The last thing we
need is a monoculture in Linux news, and we are dangerously
close to that now.
So, how is "EXT4 is not as stable or as well tested as most
people think" not a cheap shot? By my first hand experience,
that claim is absurd. Add to that the first hand experience
of roughly two billion other people. Seems to be a bit self
serving too, or was that just an accident.
> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
> friendly competitors, not Enemies.
You are wrong about Dave. These are not the words of any friend:
   "I don't think I'm alone in my suspicion that there was something
   stinky about your numbers." -- Dave Chinner
Basically allegations of cheating. And wrong. Maybe Dave just
lives in his own dreamworld where everybody is out to get him, so
he has to attack people he views as competitors first.
Ted has more taste and his FUD attack was more artful, but it
still amounted to nothing more than piling on, he just picked
Dave's straw man uncritically and proceeded to knock it down
some more. Nice way of distracting attention from the fact that
we actually did what we claimed, and instead of getting the
appropriate recognition for it, we were called cheaters. More or
less in so many words, and more subtly by Ted, but the intent
is clear and unmistakable. Apologies from both are still in order,
but it
> They assume that you are working in good faith (but are
> inexperienced compared to them), and you need to assume that they are working in good faith. If they
> ever do resort to underhanded means to sabotage you, Linus and the other kernel developers will take
> action. But pointing out limits in your current implementation, problems in your benchmarks based on
> how they are run, and concepts that are going to be difficult to merge is not underhanded, it's
> exactly the type of assistance that you should be greatful for in friendly competition.
> 
> You were the one who started crowing about how badly XFS performed.
Not at all, somebody else posted the terrible XFS benchmark
result, then Dave put up a big smokescreen to try to deflect
attention from it. There is a term for that kind of logical
fallacy:
   http://en.wikipedia.org/wiki/Proof_by_intimidation
Seems to have worked well on you. But after all those words,
XFS does not run any faster, and it clearly needs to.
> Dave gave a long and detailed
> explination about the reasons for the differences, and showing benchmarks on other hardware that
> showed that XFS works very well there. That's not an attack on EXT4 (or Tux3), it's an explination.
> 
>>>> The real question is, has the Linux development process become
>>>> so political and toxic that worthwhile projects fail to benefit
>>>> from supposed grassroots community support. You are the poster
>>>> child for that.
>>>
>>> The linux development process is making code available, responding to concerns from the experts in
>>> the community, and letting the code talk for itself.
>>
>> Nice idea, but it isn't working. Did you let the code talk to you?
>> Right, you let the code talk to Dave Chinner, then you listen to
>> what Dave Chinner has to say about it. Any chance that there might
>> be some creative licence acting somewhere in that chain?
> 
> I have my own concerns about how things are going to work (I've voiced some of them), but no, I
> haven't tried running Tux3 because you say it's not ready yet.
> 
>>> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
>>> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
>>> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
>>> tarnished it's reputation with many people was how much they were pushing the benchmarks that were
>>> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
>>> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
>>> 'benchmark' ran out of ram without ever touching the disk)
>>
>> You know what to do about checking for faulty benchmarks.
> 
> That requires that the code be readily available, which last I heard, Tux3 wasn't. Has this been fixed?
> 
>>> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
>>> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
>>> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
>>> attacking them as a result.
>>
>> Ted and Dave failed to point out any actual problem with any
>> benchmark. They invented issues with benchmarks and promoted those
>> as FUD.
> 
> They pointed out problems with using ramdisk to simulate a SSD and huge differences between spinning
> rust and an SSD (or disk array). Those aren't FUD.
> 
>>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>>> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
>>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>>> forking is going to have on everything else (both in complexity and in performance impact to other
>>> things)
>>
>> Yet he clearly wrote "we" as if he believes he is part of it.
> 
> He is part of the group of people who use and work with this stuff, so he is part of it.
> 
>> Now that ENOSPC is done to a standard way beyond what Btrfs had
>> when it was merged, the next item on the agenda is writeback. That
>> involves us and VFS people as you say, and not Dave Chinner, who
>> only intends to obstruct the process as much as he possibly can. He
>> should get back to work on his own project. Nobody will miss his
>> posts if he doesn't make them. They contribute nothing of value,
>> create a lot of bad blood, and just serve to further besmirch the
>> famously tarnished reputation of LKML.
> 
> BTRFS is a perfect example of how not to introduce a new filesystem. Lots of hype, the presumption
> that is is going to replace all the existing filesystems because it's so much better (especially
> according to benchmarks). But then progress stalled before it was really ready, and it's still
> something most people avoid.
> 
>>>> You know that Tux3 is already fast. Not just that of course. It
>>>> has a higher standard of data integrity than your metadata-only
>>>> journalling filesystem and a small enough code base that it can
>>>> be reasonably expected to reach the quality expected of an
>>>> enterprise class filesystem, quite possibly before XFS gets
>>>> there.
>>>
>>> We wouldn't expect anyone developing a new filesystem to believe any differently.
>>
>> It is not a matter of belief, it is a matter of testable fact. For
>> example, you can count the lines. You can run the same benchmarks.
>>
>> Proving the data consistency claims would be a little harder, you
>> need tools for that, and some of those aren't built yet. Or, if you
>> have technical ability, you can read the code and the copious design
>> material that has been posted and convince yourself that, yes, there
>> is something cool here, why didn't anybody do it that way before?
>> But of course that starts to sound like work. Debating nontechnical
>> issues and playing politics seems so much more like fun.
> 
> why are you picking a fight? there was no attack in my statement?
> 
>>> If they didn't
>>> believe this, why would they be working on the filesystem instead of just using an existing
>>> filesystem.
>>
>> Right, and it is my job to convince you that what I believe for
>> perfectly valid, demonstrable technical reasons, is really true. I do
>> not see why you feel it is your job to convince me that the obviously
>> broken Linux community process is not in fact broken, and that a
>> certain person who obviously has an agenda, is not actually obstructing.
> 
> You will need to have a fully working, usable system before you can convince people that you are
> right. A partial system may look good, but how much is fixing the corner cases that you haven't
> gotten to yet going to hurt it? That there are going to be such cases is pretty much a given, and
> that changing things to add code to work around the pathalogical conditions is going to hurt the
> common case is pretty close to a given (it's one of those things that isn't mathamatically
> guaranteed, but happens on 99.99999+% of projects)
> 
>>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>>> may not be right, and nobody will know until you are to a usable state and other people can start
>>> beating on it.
>>
>> With ENOSPC we are at that state. Tux3 would get more testing and advance
>> faster if it was merged. Things like ifdefs, grandiose new schemes for
>> writeback infrastructure, dumb little hooks in the mkwrite path, those
>> are all just manufactured red herrings. Somebody wanted those to be
>> issues, so now they are issues. Fake ones.
> 
> Ok, so you are happy with your allocation strategy? you didn't seem to be a few e-mail ago.
> 
> but if you think it's ready for users, then start working to submit it in the next merge window.
> Dave said that except for one part, there was no reason not to merge it. That's pretty good. So you
> need to be discussing that one part with the the folks that Dave pointed you at.
> 
>> Nobody is trying to trick you. Just stating a fact. You ought to be able
>> to figure out by now that Tux3 is worth merging.
>>
>> You might possibly have an argument that merging a filesystem that
>> crashes as soon as it fills the disk is just sheer stupidity than can
>> only lead to embarrassment in the long run, but then you would need to
>> explain why Btrfs was merged. As I recall, it went something like, Chris
>> had it on a laptop, so it must be a filesystem, and wow look at that
>> feature list. Then it got merged in a completely unusable state and got
>> worked on. If it had not been merged, Btrfs would most likely be dead
>> right now. After all, who cares about an out of tree filesystem?
> 
> As I said above, Btrfs is a perfect example of how not to do things.
> 
> The other think you need to realize is that getting something in the kernel isn't a one-time effort,
> the code needs to be maintained over time (especially for a filesystem), and it's very possible for
> a developer/team/company to be so toxic and hostile to others that the Linux folks don't want to
> deal with the hassle of dealing with them. You are starting out on a path to put yourself into that
> category. Calm down and stop taking offense at everything. Your succeeding doesn't require that
> other people loose, so stop talking as if it's a zero sum game and you have to beat down the enemy
> to get your code accepted.
> 
> David Lang
> 
>> By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
>> 2009, with Tux3 running as my root filesystem. By the standard applied
>> to Btrfs, Tux3 should have been merged then, right? After all, our
>> nospace handling worked just as well as theirs at that time.
>>
>> Regards,
>>
>> Daniel
>>
> 
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 22:27                                             ` Daniel Phillips
@ 2015-05-12 22:35                                               ` David Lang
  2015-05-12 23:55                                                 ` Theodore Ts'o
  2015-05-13  1:26                                                 ` Daniel Phillips
  0 siblings, 2 replies; 160+ messages in thread
From: David Lang @ 2015-05-12 22:35 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On Tue, 12 May 2015, Daniel Phillips wrote:
> On 05/12/2015 02:30 PM, David Lang wrote:
>> On Tue, 12 May 2015, Daniel Phillips wrote:
>>> Phoronix published a headline that identifies Dave Chinner as
>>> someone who takes shots at other projects. Seems pretty much on
>>> the money to me, and it ought to be obvious why he does it.
>>
>> Phoronix turns any correction or criticism into an attack.
>
> Phoronix gets attacked in an unseemly way by a number of people
> in the developer community who should behave better. You are
> doing it yourself, seemingly oblivious to the valuable role that
> the publication plays in our community. Google for filesystem
> benchmarks. Where do you find them? Right. Not to mention the
> Xorg coverage, community issues, etc etc. The last thing we
> need is a monoculture in Linux news, and we are dangerously
> close to that now.
It's on my 'sites to check daily' list, but they have also had some pretty nasty 
errors in their benchmarks, some of which have been pointed out repeatedly over 
the years (doing fsync dependent workloads in situations where one FS actually 
honors the fsyncs and another doesn't is a classic)
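To make the fsync point concrete, here is a minimal sketch. It is not taken from any published benchmark; the function names, file count, and payload size are illustrative assumptions. It times the same small-file workload twice, once without fsync and once with it. If a filesystem (or a test setup) does not actually honor the fsync, the two numbers come out close together and the comparison against a filesystem that does honor it is meaningless.

```python
import os
import tempfile
import time


def timed_writes(directory, count, payload, do_fsync):
    """Write `count` small files and return the elapsed seconds.

    With do_fsync=False the data may never leave the page cache before
    the timer stops, so the result mostly measures memory speed.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"f{i}")
        with open(path, "wb") as f:
            f.write(payload)
            if do_fsync:
                f.flush()             # push Python's buffer to the kernel
                os.fsync(f.fileno())  # ask the kernel to reach stable storage
    return time.perf_counter() - start


def compare(count=200, size=4096):
    """Run the workload with and without fsync in fresh temp directories."""
    payload = b"x" * size
    results = {}
    for label, do_fsync in (("no_fsync", False), ("fsync", True)):
        with tempfile.TemporaryDirectory() as d:
            results[label] = timed_writes(d, count, payload, do_fsync)
    return results


if __name__ == "__main__":
    for label, seconds in compare().items():
        print(f"{label}: {seconds:.4f} s")
```

The temporary directory lands wherever the system puts temp files, which may be a tmpfs; to say anything about a particular filesystem, the directory would have to be on a mount of that filesystem.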
> So, how is "EXT4 is not as stable or as well tested as most
> people think" not a cheap shot? By my first hand experience,
> that claim is absurd. Add to that the first hand experience
> of roughly two billion other people. Seems to be a bit self
> serving too, or was that just an accident.
I happen to think that it's correct. It's not that Ext4 isn't tested, but that 
people's expectations of how much it's been tested, and at what scale don't 
match the reality.
>> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
>> friendly competitors, not Enemies.
>
> You are wrong about Dave These are not the words of any friend:
>
>   "I don't think I'm alone in my suspicion that there was something
>   stinky about your numbers." -- Dave Chinner
you are looking for offense. That just means that something is wrong with them, 
not that they were deliberately falsified.
> Basically allegations of cheating. And wrong. Maybe Dave just
> lives in his own dreamworld where everybody is out to get him, so
> he has to attack people he views as competitors first.
you are the one doing the attacking. Please stop. Take a break if needed, and 
then get back to producing software rather than complaining about how everyone 
is out to get you.
David Lang
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 22:35                                               ` David Lang
@ 2015-05-12 23:55                                                 ` Theodore Ts'o
  2015-05-13  1:26                                                 ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Theodore Ts'o @ 2015-05-12 23:55 UTC (permalink / raw)
  To: David Lang
  Cc: Daniel Phillips, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On Tue, May 12, 2015 at 03:35:43PM -0700, David Lang wrote:
> 
> I happen to think that it's correct. It's not that Ext4 isn't tested, but
> that people's expectations of how much it's been tested, and at what scale
> don't match the reality.
Ext4 is used at Google, on a very large number of disks.  Exactly how
large is not something I'm allowed to say, but there's a very amusing
Ted Talk by Randall Munroe (of xkcd fame) on that topic:
http://tedsummaries.com/2014/05/14/randall-munroe-comics-that-ask-what-if/
One thing I can say is that shortly after we deployed ext4 at Google,
thanks to having a very large number of disks, and because we have
very good system monitoring, we detected a file system corruption
problem that happened with a very low probability, but we had enough
disks that we could detect the pattern.  (Fortunately, because
Google's cluster file system has replication and/or erasure coding, no
user data was lost.)  Even though we could notice the problem, it took
us several months to track down the problem.
When we finally did, it turned out to be a race condition which only
took place under high memory pressure.  What was *very* amusing was
after fixing the problem for ext4, I looked at ext3, and discovered
that (a) the bug ext4 had inherited was also in ext3, and (b) the
bug in ext3 had not been noticed in several enterprise distribution
testing runs done by Red Hat, SuSE, and IBM --- for well over a
**decade**.
What this means is that it's hard for *any* file system to be that
well tested; it's hard to substitute for years and years of production
use, hopefully in systems that have very rigorous monitoring so you
would notice if data or file system metadata is getting corrupted in
ways that can't be explained as hardware errors.  The fact that we
found a bug that was never discovered in ext3 after years and years of
use in many enterprises is a testimony to that fact.
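The arithmetic behind that can be sketched as follows. The numbers are purely hypothetical (the real per-disk failure rate and fleet size are not stated anywhere in this thread); the point is only how quickly the chance of seeing a rare event at least once grows with the number of disks being watched.

```python
import math


def p_at_least_one(p_per_disk, n_disks):
    """Probability that a rare, independent per-disk event shows up at
    least once across n_disks: 1 - (1 - p)^n, computed stably."""
    return -math.expm1(n_disks * math.log1p(-p_per_disk))


def expected_hits(p_per_disk, n_disks):
    """Expected number of disks that hit the event."""
    return p_per_disk * n_disks


if __name__ == "__main__":
    p = 1e-6  # hypothetical chance per disk over some test period
    for n in (100, 10_000, 1_000_000):
        print(f"{n:>9} disks: P(seen at least once) = {p_at_least_one(p, n):.4f}, "
              f"expected hits = {expected_hits(p, n):.4f}")
```

With these made-up figures a test lab with a hundred machines would almost never see the event, while a fleet of a million disks would see it more often than not, and only the latter has any chance of spotting a pattern.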
(This is also why the fact that Facebook has started using btrfs in
production is going to be a very good thing for btrfs.  I'm sure they
will find all sorts of problems once they start running at large
scale, which is a _good_ thing; that's how those problems get fixed.)
Of course, using xfstests certainly helps a lot, and so in my opinion
all serious file system developers should be regularly using xfstests
as a part of the daily development cycle, and to be extremely
ruthless about not allowing any test regressions.
Best regards,
					- Ted
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 22:35                                               ` David Lang
  2015-05-12 23:55                                                 ` Theodore Ts'o
@ 2015-05-13  1:26                                                 ` Daniel Phillips
  2015-05-13 19:09                                                   ` Martin Steigerwald
  1 sibling, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13  1:26 UTC (permalink / raw)
  To: David Lang
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On 05/12/2015 03:35 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> On 05/12/2015 02:30 PM, David Lang wrote:
>>> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
>>> friendly competitors, not Enemies.
>>
>> You are wrong about Dave. These are not the words of any friend:
>>
>>   "I don't think I'm alone in my suspicion that there was something
>>   stinky about your numbers." -- Dave Chinner
> 
> you are looking for offense. That just means that something is wrong with them, not that they were
> deliberately falsified.
I am not mistaken. Dave made sure to eliminate any doubt about
what he meant. He said "Oh, so nicely contrived. But terribly
obvious now that I've found it" among other things.
Good work, Dave. Never mind that we did not hide it.
Let's look at some more of the story. Hirofumi ran the test and
I posted the results and explained the significance. I did not
even know that dbench had fsyncs at that time, since I had never
used it myself, nor that Hirofumi had taken them out in order to
test the things he was interested in. Which turned out to be very
interesting, don't you agree?
Anyway, Hirofumi followed up with a clear explanation, here:
   http://phunq.net/pipermail/tux3/2013-May/002022.html
Instead of accepting that, Dave chose to ride right over it and
carry on with his thinly veiled allegations of intellectual fraud,
using such words as "it's deceptive at best." Dave managed to
insult two people that day.
Dave dismissed the basic breakthrough we had made as "silly
marketing fluff". By now I hope you understand that the result in
question was anything but silly marketing fluff. There are real,
technical reasons that Tux3 wins benchmarks, and the specific
detail that Dave attacked so ungraciously is one of them.
Are you beginning to see who the victim of this mugging was?
>> Basically allegations of cheating. And wrong. Maybe Dave just
>> lives in his own dreamworld where everybody is out to get him, so
>> he has to attack people he views as competitors first.
> 
> you are the one doing the attacking.
Defending, not attacking. There is a distinction.
> Please stop. Take a break if needed, and then get back to
> producing software rather than complaining about how everyone is out to get you.
Dave is not "everyone", and a "shut up" will not fix this.
What will fix this is a simple, professional statement that
an error was made, that there was no fraud or anything even
remotely resembling it, and that instead a technical
contribution was made. It is not even important that it come
from Dave. But it is important that the aspersions that were
cast be recognized for what they were.
By the way, do you remember the scene from "Unforgiven" where
the sheriff is kicking the guy on the ground and saying "I'm
not kicking you?" It feels like that.
As far as who should take a break goes, note that either of
us can stop the thread. Does it necessarily have to be me?
If you would prefer some light reading, you could read "How fast
can we fail?", which I believe is relevant to the question of
whether Tux3 is mergeable or not.
   https://lkml.org/lkml/2015/5/12/663
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13  1:26                                                 ` Daniel Phillips
@ 2015-05-13 19:09                                                   ` Martin Steigerwald
  2015-05-13 19:37                                                     ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Martin Steigerwald @ 2015-05-13 19:09 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, Pavel Machek, tux3, linux-fsdevel,
	OGAWA Hirofumi
On Tuesday, 12 May 2015, 18:26:28, Daniel Phillips wrote:
> On 05/12/2015 03:35 PM, David Lang wrote:
> > On Tue, 12 May 2015, Daniel Phillips wrote:
> >> On 05/12/2015 02:30 PM, David Lang wrote:
> >>> You need to get out of the mindset that Ted and Dave are Enemies that
> >>> you need to overcome, they are friendly competitors, not Enemies.
> >> 
> >> You are wrong about Dave. These are not the words of any friend:
> >>   "I don't think I'm alone in my suspicion that there was something
> >>   stinky about your numbers." -- Dave Chinner
> >
> > 
> >
> > you are looking for offense. That just means that something is wrong
> > with them, not that they were deliberately falsified.
> 
> I am not mistaken. Dave made sure to eliminate any doubt about
> what he meant. He said "Oh, so nicely contrived. But terribly
> obvious now that I've found it" among other things.
Daniel, what are you trying to achieve here?
I thought you wanted to create interest for your filesystem and acceptance 
for merging it.
What I see you are actually creating though is something different.
Is what you see after you send your mails really what you want to see? If 
not… why not? And if you seek change, where can you create change?
I really like to see Tux3 inside the kernel for easier testing, yet I also 
see that the way you, in your opinion, "defend" it, does not seem to move
that goal any closer, quite the opposite. It triggers polarity and 
resistance.
I believe it to be more productive to work together with the people who will 
decide about what goes into the kernel and the people whose opinions are
respected by them, instead of against them.
"Assume good faith" can help here. No amount of accusing people of bad 
intention will change them. The only thing you have the power to change is 
your approach. You absolutely and ultimately do not have the power to change 
other people. You can't force Tux3 in by sheer willpower or attacking
people.
On any account for anyone discussing here: I believe that any personal 
attacks, counter-attacks or "you are wrong" kind of speech will not help to 
move this discussion out of the circling it seems to be in at the moment.
Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 19:09                                                   ` Martin Steigerwald
@ 2015-05-13 19:37                                                     ` Daniel Phillips
  2015-05-13 20:02                                                       ` Jeremy Allison
  2015-05-13 20:25                                                       ` Martin Steigerwald
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 19:37 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, Pavel Machek, tux3, linux-fsdevel,
	OGAWA Hirofumi
On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
> Daniel, what are you trying to achieve here?
> 
> I thought you wanted to create interest for your filesystem and acceptance 
> for merging it.
> 
> What I see you are actually creating though is something different.
> 
> Is what you see after you send your mails really what you want to see? If 
> not… why not? And if you seek change, where can you create change?
That is the question indeed, whether to try and change the system
while merging, or just keep smiling and get the job done. The problem
is, I am just too stupid to realize that I can't change the system,
which is famously unpleasant for submitters.
> I really like to see Tux3 inside the kernel for easier testing, yet I also 
> see that the way you, in your opinion, "defend" it, does not seem to move
> that goal any closer, quite the opposite. It triggers polarity and 
> resistance.
> 
> I believe it to be more productive to work together with the people who will 
> decide about what goes into the kernel and the people whose opinions are
> respected by them, instead of against them.
Obviously true.
> "Assume good faith" can help here. No amount of accusing people of bad 
> intention will change them. The only thing you have the power to change is 
> your approach. You absolutely and ultimately do not have the power to change 
> other people. You can't force Tux3 in by sheer willpower or attacking
> people.
> 
> On any account for anyone discussing here: I believe that any personal 
> attacks, counter-attacks or "you are wrong" kind of speech will not help to 
> move this discussion out of the circling it seems to be in at the moment.
Thanks for the sane commentary. I have the power to change my behavior.
But if nobody else changes their behavior, the process remains just as
unpleasant for us as it ever was (not just me!). Obviously, this is
not the first time I have been through this, and it has never been
pleasant. After a while, contributors just get tired of the grind and
move on to something more fun. I know I did, and I am far from the
only one.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 19:37                                                     ` Daniel Phillips
@ 2015-05-13 20:02                                                       ` Jeremy Allison
  2015-05-13 20:24                                                         ` Daniel Phillips
  2015-05-13 20:25                                                       ` Martin Steigerwald
  1 sibling, 1 reply; 160+ messages in thread
From: Jeremy Allison @ 2015-05-13 20:02 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Martin Steigerwald, David Lang, Theodore Ts'o, Howard Chu,
	Dave Chinner, linux-kernel, Mike Galbraith, Pavel Machek, tux3,
	linux-fsdevel, OGAWA Hirofumi
On Wed, May 13, 2015 at 12:37:41PM -0700, Daniel Phillips wrote:
> On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
> 
> > "Assume good faith" can help here. No amount of accusing people of bad 
> > intention will change them. The only thing you have the power to change is 
> > your approach. You absolutely and ultimately do not have the power to change 
> > other people. You can't force Tux3 in by sheer willpower or attacking
> > people.
> > 
> > On any account for anyone discussing here: I believe that any personal 
> > attacks, counter-attacks or "you are wrong" kind of speech will not help to 
> > move this discussion out of the circling it seems to be in at the moment.
> 
> Thanks for the sane commentary. I have the power to change my behavior.
> But if nobody else changes their behavior, the process remains just as
> unpleasant for us as it ever was (not just me!). Obviously, this is
> not the first time I have been through this, and it has never been
> pleasant. After a while, contributors just get tired of the grind and
> move on to something more fun. I know I did, and I am far from the
> only one.
Daniel, please listen to Martin. He speaks a fundamental truth
here.
As you know, I am also interested in Tux3, and would love to
see it as a filesystem option for NAS servers using Samba. But
please think about the way you're interacting with people on the
list, and whether that makes this outcome more or less likely.
Cheers,
	Jeremy.
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 20:02                                                       ` Jeremy Allison
@ 2015-05-13 20:24                                                         ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 20:24 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Martin Steigerwald, David Lang, Theodore Ts'o, Howard Chu,
	Dave Chinner, linux-kernel, Mike Galbraith, Pavel Machek, tux3,
	linux-fsdevel, OGAWA Hirofumi
On Wednesday, May 13, 2015 1:02:34 PM PDT, Jeremy Allison wrote:
> On Wed, May 13, 2015 at 12:37:41PM -0700, Daniel Phillips wrote:
>> On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
>>  ...
>
> Daniel, please listen to Martin. He speaks a fundamental truth
> here.
>
> As you know, I am also interested in Tux3, and would love to
> see it as a filesystem option for NAS servers using Samba. But
> please think about the way you're interacting with people on the
> list, and whether that makes this outcome more or less likely.
Thanks Jeremy, that means more from you than anyone.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 19:37                                                     ` Daniel Phillips
  2015-05-13 20:02                                                       ` Jeremy Allison
@ 2015-05-13 20:25                                                       ` Martin Steigerwald
  2015-05-13 20:38                                                         ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Martin Steigerwald @ 2015-05-13 20:25 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, Pavel Machek, tux3, linux-fsdevel,
	OGAWA Hirofumi
On Wednesday, 13 May 2015, 12:37:41, Daniel Phillips wrote:
> On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
> > Daniel, what are you trying to achieve here?
> > 
> > I thought you wanted to create interest for your filesystem and
> > acceptance for merging it.
> > 
> > What I see you are actually creating though is something different.
> > 
> > Is what you see after you send your mails really what you want to see?
> > If
> > not… why not? And if you seek change, where can you create change?
> 
> That is the question indeed, whether to try and change the system
> while merging, or just keep smiling and get the job done. The problem
> is, I am just too stupid to realize that I can't change the system,
> which is famously unpleasant for submitters.
> 
> > I really like to see Tux3 inside the kernel for easier testing, yet I
> > also see that the way you, in your opinion, "defend" it, does not seem
> > to move that goal any closer, quite the opposite. It triggers polarity
> > and resistance.
> > 
> > I believe it to be more productive to work together with the people who
> > will decide about what goes into the kernel and the people whose
> > opinions are respected by them, instead of against them.
> 
> Obviously true.
> 
> > "Assume good faith" can help here. No amount of accusing people of bad
> > intention will change them. The only thing you have the power to change
> > is your approach. You absolutely and ultimately do not have the power
> > to change other people. You can't force Tux3 in by sheer willpower or
> > attacking people.
> > 
> > On any account for anyone discussing here: I believe that any personal
> > attacks, counter-attacks or "you are wrong" kind of speech will not help
> > to move this discussion out of the circling it seems to be in at the
> > moment.
> Thanks for the sane commentary. I have the power to change my behavior.
> But if nobody else changes their behavior, the process remains just as
> unpleasant for us as it ever was (not just me!). Obviously, this is
> not the first time I have been through this, and it has never been
> pleasant. After a while, contributors just get tired of the grind and
> move on to something more fun. I know I did, and I am far from the
> only one.
Daniel, if you want to change the process of patch review and inclusion into 
the kernel, model an example of how you would like it to be. This has way 
better chances to inspire others to change their behaviors themselves than 
accusing them of bad faith.
It's yours to choose.
What outcome do you want to create?
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 20:25                                                       ` Martin Steigerwald
@ 2015-05-13 20:38                                                         ` Daniel Phillips
  2015-05-13 21:10                                                           ` Martin Steigerwald
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 20:38 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, Pavel Machek, tux3, linux-fsdevel,
	OGAWA Hirofumi
On Wednesday, May 13, 2015 1:25:38 PM PDT, Martin Steigerwald wrote:
> On Wednesday, 13 May 2015, 12:37:41, Daniel Phillips wrote:
>> On 05/13/2015 12:09 PM, Martin Steigerwald wrote: ...
>
> Daniel, if you want to change the process of patch review and 
> inclusion into 
> the kernel, model an example of how you would like it to be. This has way 
> better chances to inspire others to change their behaviors themselves than 
> accusing them of bad faith.
>
> It's yours to choose.
>
> What outcome do you want to create?
The outcome I would like is:
  * Everybody has a good think about what has gone wrong in the past,
    not only with troublesome submitters, but with mutual respect and
    collegial conduct.
  * Tux3 is merged on its merits so we get more developers and
    testers and move it along faster.
  * I left LKML better than I found it.
  * Group hugs
Well, group hugs are optional, that one would be situational.
Regards,
Daniel
    
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 20:38                                                         ` Daniel Phillips
@ 2015-05-13 21:10                                                           ` Martin Steigerwald
  0 siblings, 0 replies; 160+ messages in thread
From: Martin Steigerwald @ 2015-05-13 21:10 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, Pavel Machek, tux3, linux-fsdevel,
	OGAWA Hirofumi
On Wednesday, 13 May 2015, 13:38:24, Daniel Phillips wrote:
> On Wednesday, May 13, 2015 1:25:38 PM PDT, Martin Steigerwald wrote:
> > On Wednesday, 13 May 2015, 12:37:41, Daniel Phillips wrote:
> >> On 05/13/2015 12:09 PM, Martin Steigerwald wrote: ...
> > 
> > Daniel, if you want to change the process of patch review and
> > inclusion into
> > the kernel, model an example of how you would like it to be. This has
> > way
> > better chances to inspire others to change their behaviors themselves
> > than accusing them of bad faith.
> > 
> > It's yours to choose.
> > 
> > What outcome do you want to create?
> 
> The outcome I would like is:
> 
>   * Everybody has a good think about what has gone wrong in the past,
>     not only with troublesome submitters, but with mutual respect and
>     collegial conduct.
> 
>   * Tux3 is merged on its merits so we get more developers and
>     testers and move it along faster.
> 
>   * I left LKML better than I found it.
> 
>   * Group hugs
> 
> Well, group hugs are optional, that one would be situational.
Great stuff!
Looking forward to it.
Thank you,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 21:30                                           ` David Lang
  2015-05-12 22:27                                             ` Daniel Phillips
@ 2015-05-13  0:31                                             ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13  0:31 UTC (permalink / raw)
  To: David Lang
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On 05/12/2015 02:30 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> Phoronix published a headline that identifies Dave Chinner as
>> someone who takes shots at other projects. Seems pretty much on
>> the money to me, and it ought to be obvious why he does it.
> 
> Phoronix turns any correction or criticism into an attack.
Phoronix gets attacked in an unseemly way by a number of people
in the developer community who should behave better. You are
doing it yourself, seemingly oblivious to the valuable role that
the publication plays in our community. Google for filesystem
benchmarks. Where do you find them? Right. Not to mention the
Xorg coverage, community issues, etc etc. The last thing we need
is a monoculture in Linux news, and we are dangerously close to
that now.
So, how is "EXT4 is not as stable or as well tested as most
people think" not a cheap shot? By my first hand experience, that
claim is absurd. Add to that the first hand experience of roughly
two billion other people. Seems to be a bit self serving too, or
was that just an accident.
> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
> friendly competitors, not Enemies.
You are wrong about Dave. These are not the words of any friend:
   "I don't think I'm alone in my suspicion that there was something
   stinky about your numbers." -- Dave Chinner
Basically allegations of cheating. And wrong. Maybe Dave just
lives in his own dreamworld where everybody is out to get him, so
he has to attack people he views as competitors first.
Ted has more taste and his FUD attack was more artful, but it
still amounted to nothing more than piling on. He just picked up
Dave's straw man uncritically and proceeded to knock it down
some more. Nice way of distracting attention from the fact that
we actually did what we claimed, and instead of getting the
appropriate recognition for it, we were called cheaters. More or
less in so many words by Dave, and more subtly by Ted, but the
intent is clear and unmistakable. Apologies from both are still
in order, but it will be a rainy day in that hot place before we
ever see either of them do the right thing.
That said, Ted is no enemy, he is brilliant and usually conducts
himself admirably. Except sometimes. I wish I could say the same
about Dave, but what I see there is a guy who has invested his
entire identity in his XFS career and is insecure that something
might conspire against him to disrupt it. I mean, come on, if you
convince Red Hat management to elevate your life's work to the
status of something that most of the paid for servers in the
world are going to run, do you continue attacking your peers or
do you chill a bit?
> They assume that you are working in good faith (but are
> inexperienced compared to them), and you need to assume that they are working in good faith. If they
> ever do resort to underhanded means to sabotage you, Linus and the other kernel developers will take
> action. But pointing out limits in your current implementation, problems in your benchmarks based on
> how they are run, and concepts that are going to be difficult to merge is not underhanded, it's
> exactly the type of assistance that you should be grateful for in friendly competition.
> 
> You were the one who started crowing about how badly XFS performed.
Not at all, somebody else posted the terrible XFS benchmark result,
then Dave put up a big smokescreen to try to deflect attention from
it. There is a term for that kind of logical fallacy:
   http://en.wikipedia.org/wiki/Proof_by_intimidation
Seems to have worked well on you. But after all those words, XFS
does not run any faster, and it clearly needs to.
> Dave gave a long and detailed explanation about the reasons for the differences, and showing
> benchmarks on other hardware that showed that XFS works very well there. That's not an attack on
> EXT4 (or Tux3), it's an explanation.
Long, detailed, and bogus. Summary: "oh, XFS doesn't work well on
that hardware? Get new hardware." Excuse me, but other filesystems
do work well on that hardware, the problem is not with the hardware.
> I have my own concerns about how things are going to work (I've voiced some of them), but no, I
> haven't tried running Tux3 because you say it's not ready yet.
I did not say that. I said it is not ready for users. It is more
than ready for anybody who wants to develop it, or benchmark it,
or put test data on it, and has been for a long time. Except for
enospc, and that was apparently not an issue for Btrfs, was it.
>> You know what to do about checking for faulty benchmarks.
> 
> That requires that the code be readily available, which last I heard, Tux3 wasn't. Has this been fixed?
You heard wrong. The code is readily available and you can clone it
from here:
    https://github.com/OGAWAHirofumi/linux-tux3.git
The hirofumi-user branch has the user tools including mkfs and basic
fsck, and the hirofumi branch is a 3.19 Linus kernel that includes Tux3.
(So is hirofumi-user branch, but Hirofumi likes people to build from
the other one, which is pure kernel.)
We do of course have patches not pushed to the public repository yet,
which includes enospc, so the public code is easily crashable. If I
were you, I would wait for enospc to land, but that is by no means
necessary if your objective is just to verify that we tell the truth.
> They pointed out problems with using ramdisk to simulate a SSD and huge differences between spinning
> rust and an SSD (or disk array). Those aren't FUD.
Not FUD perhaps, but wrong all the same. I have plenty of evidence
at hand to be sure of that, so I don't need to theorize about it.
Ramdisk is surprisingly predictive of performance on other media,
and is arguably closer to what the new generation of NVRAM behaves
like than flash is.
>>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>>> Memory Management folks you have to convince. You may need a little benchmarking to show that there
>>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>>> forking is going to have on everything else (both in complexity and in performance impact to other
>>> things)
>>
>> Yet he clearly wrote "we" as if he believes he is part of it.
> 
> He is part of the group of people who use and work with this stuff, so he is part of it.
He is not part of a committee that decides what to merge, yet he
spoke as if he was. Just a slip maybe? Let's call it that. Slip or
not, it is a divisive and offensive attitude.
> BTRFS is a perfect example of how not to introduce a new filesystem. Lots of hype, the presumption
> that it is going to replace all the existing filesystems because it's so much better (especially
> according to benchmarks). But then progress stalled before it was really ready, and it's still
> something most people avoid.
Disagree. Merging Btrfs was the only way to save it. Not everyone
avoids it. Btrfs has its share of ardent supporters, ready or not.
One day Btrfs will be ready and the rough spots will be a fading
memory. That is healthy. What Dave is trying to do to Tux3 is kind
of sick.
Even though I do not like the Btrfs design, I hope it succeeds and
fills that void where a big, fat, full featured filesystem that does
everything including sending email should be.
>> Proving the data consistency claims would be a little harder, you
>> need tools for that, and some of those aren't built yet. Or, if you
>> have technical ability, you can read the code and the copious design
>> material that has been posted and convince yourself that, yes, there
>> is something cool here, why didn't anybody do it that way before?
>> But of course that starts to sound like work. Debating nontechnical
>> issues and playing politics seems so much more like fun.
> 
> why are you picking a fight? there was no attack in my statement?
Sorry, did I pick a fight? You *are* debating nontechnical issues
and politics, and it *does* sound like work to go do your own
benchmarks. And if it is not fun for you, then why are you doing it?
Please do not take that the wrong way, you obviously enjoy it and
there is nothing wrong with that.
>>> If they didn't
>>> believe this, why would they be working on the filesystem instead of just using an existing
>>> filesystem.
>>
>> Right, and it is my job to convince you that what I believe for
>> perfectly valid, demonstrable technical reasons, is really true. I do
>> not see why you feel it is your job to convince me that the obviously
>> broken Linux community process is not in fact broken, and that a
>> certain person who obviously has an agenda, is not actually obstructing.
> 
> You will need to have a fully working, usable system before you can convince people that you are
> right.
Logical fallacy alert. You say there is only one way to convince
somebody of something, when in fact more ways may exist. And "fully
working" translates as "I get to decide what fully working means".
Ask yourself this: in order to convince you that you will die if you
jump off the Empire State Building, do I actually need to jump off
it, or may I explain to you the principles of gravitation instead?
Anyway, I will offer "has enospc" as a reasonable definition of "fully
working". Tux3 has actually been doing the things (out of space
handling excepted) a normal filesystem does for years. Just not
always as fast or reliably as it now does.
> A partial system may look good, but how much is fixing the corner cases that you haven't
> gotten to yet going to hurt it?
Straw man. To which corner cases do you refer, and why should we fix
them now instead of attending to the issues that we feel are important?
> That there are going to be such cases is pretty much a given, and
> that changing things to add code to work around the pathological conditions is going to hurt the
> common case is pretty close to a given (it's one of those things that isn't mathematically
> guaranteed, but happens on 99.99999+% of projects)
Another straw man. To which pathological condition do you refer, and
why is it so important that we need to drop everything and attend to
it now?
>>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>>> may not be right, and nobody will know until you are to a usable state and other people can start
>>> beating on it.
>>
>> With ENOSPC we are at that state. Tux3 would get more testing and advance
>> faster if it was merged. Things like ifdefs, grandiose new schemes for
>> writeback infrastructure, dumb little hooks in the mkwrite path, those
>> are all just manufactured red herrings. Somebody wanted those to be
>> issues, so now they are issues. Fake ones.
> 
> Ok, so you are happy with your allocation strategy? You didn't seem to be a few e-mails ago.
I am not happy with our allocation strategy, it can be improved
immensely. It is also not the most important thing in the world,
because nobody intends to put their mission critical files on it.
I do see people trying to raise that issue as a merge blocker, which
would be an excellent example of how broken our community process is
if it did actually turn out to block our merge. If it concerns you
then store some files on it yourself and see if it really is a killer
problem. Alternatively, it might be exactly the sort of thing that
an interested contributor could take on, and if that is true, then
delaying merge so it can bottleneck on me instead would not make
sense.
If you actually go look at the code, you will see there is some rather
nice infrastructure in there for supporting allocation policy, and
there actually is a preliminary allocation policy, it just does not
meet our standards for production work.
> but if you think it's ready for users, then start working to submit it in the next merge window.
Red Herring. It is not supposed to be ready for users. It is supposed
to be ready for developers. Development kernel, right? Experimental
status and all that. Users are cordially invited to stay away until
further notice.
> Dave said that except for one part, there was no reason not to merge it. That's pretty good. So you
> need to be discussing that one part with the folks that Dave pointed you at.
Oops, I missed that, are you sure? Perhaps you mean the writeback
interface. Already started on that, already talking. But do keep in
mind that his demand was always a makework project, and frankly, a
nonsensical way to go about things. It's an >internal< api, see.
Internal apis are declared to be flexible, by Linus himself. We
already have a nice, simple patch that implements a simple api that
works fine, we use it all the time. Dave was the one who suggested
we do it exactly like that, so we did. Then Dave moved the goalposts
by insisting that we should throw that one away and tackle a much
bigger project in core that is essentially an R&D project. Not
willing to play that game for a possibly endless number of iterations,
I turned instead to things that actually matter.
Anyway, the writeback project involves us, and VFS developers, you
know who they are. I would prefer that Dave not be involved. For
the record, Jan Kara is great to work with, did you see that patch
set he produced for us? Sadly, I was not able to get into it to
the extent it deserved at the time.
> As I said above, Btrfs is a perfect example of how not to do things.
Unfair. It worked. The alternative is most probably, no Btrfs, ever.
Which do you choose?
The fact that Hirofumi and I kept on with Tux3 got it to where it
is today after all the nasty things that went on and are still going
on is nothing short of a miracle. Thank Hirofumi. If it were not for
him I would have quit years ago and that would have been the end of
it. There are a lot more fun things to do in life than put up with
incessant FUD attacks from the ilk of Dave Chinner. You should tattoo
that on your arm so you can contemplate it when thinking about whether
the Linux community is dysfunctional or not.
> The other thing you need to realize is that getting something in the kernel isn't a one-time effort,
> the code needs to be maintained over time (especially for a filesystem), and it's very possible for
> a developer/team/company to be so toxic and hostile to others that the Linux folks don't want to
> deal with the hassle of dealing with them. You are starting out on a path to put yourself into that
> category. Calm down and stop taking offense at everything. Your succeeding doesn't require that
> other people lose, so stop talking as if it's a zero sum game and you have to beat down the enemy
> to get your code accepted.
That argument is "blame the victim", with a bit of intimidation thrown
in. If we are to work together in an atmosphere of harmony and mutual
respect then let's see some effort from more than one side please.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 20:54                                         ` Daniel Phillips
  2015-05-12 21:30                                           ` David Lang
@ 2015-05-12 21:30                                           ` Christian Stroetmann
  2015-05-13  7:20                                           ` Pavel Machek
  2 siblings, 0 replies; 160+ messages in thread
From: Christian Stroetmann @ 2015-05-12 21:30 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, Pavel Machek, tux3, linux-fsdevel,
	OGAWA Hirofumi
On 12.05.2015 22:54, Daniel Phillips wrote:
> On 05/12/2015 11:39 AM, David Lang wrote:
>> On Mon, 11 May 2015, Daniel Phillips wrote:
>>>> ...it's the mm and core kernel developers that need to
>>>> review and accept that code *before* we can consider merging tux3.
>>> Please do not say "we" when you know that I am just as much a "we"
>>> as you are. Merging Tux3 is not your decision. The people whose
>>> decision it actually is are perfectly capable of recognizing your
>>> agenda for what it is.
>>>
>>>    http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>>>    "XFS Developer Takes Shots At Btrfs, EXT4"
>> umm, Phoronix has no input on what gets merged into the kernel. they also have a reputation for
>> trying to turn anything into click-bait by making it sound like a fight when it isn't.
> Perhaps you misunderstood. Linus decides what gets merged. Andrew
> decides. Greg decides. Dave Chinner does not decide, he just does
> his level best to create the impression that our project is unfit
> to merge. Any chance there might be an agenda?
>
> Phoronix published a headline that identifies Dave Chinner as
> someone who takes shots at other projects. Seems pretty much on
> the money to me, and it ought to be obvious why he does it.
Maybe Dave has convincing arguments that have been misinterpreted by
that website, which is an interesting but also highly manipulative
publication.
>>> The real question is, has the Linux development process become
>>> so political and toxic that worthwhile projects fail to benefit
>>> from supposed grassroots community support. You are the poster
>>> child for that.
>> The linux development process is making code available, responding to concerns from the experts in
>> the community, and letting the code talk for itself.
> Nice idea, but it isn't working. Did you let the code talk to you?
> Right, you let the code talk to Dave Chinner, then you listen to
> what Dave Chinner has to say about it. Any chance that there might
> be some creative licence acting somewhere in that chain?
We are missing the complete useable thing.
>> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
>> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
>> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
>> tarnished its reputation with many people was how much they were pushing the benchmarks that were
>> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
>> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
>> 'benchmark' ran out of ram without ever touching the disk)
> You know what to do about checking for faulty benchmarks.
>
>> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
>> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
>> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
>> attacking them as a result.
> Ted and Dave failed to point out any actual problem with any
> benchmark. They invented issues with benchmarks and promoted those
> as FUD.
In general, benchmarks are a critical issue. In this regard, let me
quote Churchill in a modified form:
Do not trust a benchmark that you have not forged yourself.
>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>> Memory Management folks you have to convince. You may need a little benchmarking to show that there
>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>> forking is going to have on everything else (both in complexity and in performance impact to other
>> things)
> Yet he clearly wrote "we" as if he believes he is part of it.
>
> Now that ENOSPC is done to a standard way beyond what Btrfs had
> when it was merged, the next item on the agenda is writeback. That
> involves us and VFS people as you say, and not Dave Chinner, who
> only intends to obstruct the process as much as he possibly can. He
> should get back to work on his own project. Nobody will miss his
> posts if he doesn't make them. They contribute nothing of value,
> create a lot of bad blood, and just serve to further besmirch the
> famously tarnished reputation of LKML.
At least, I would miss his contributions, specifically his technical 
explanations but also his opinions.
>>> You know that Tux3 is already fast. Not just that of course. It
>>> has a higher standard of data integrity than your metadata-only
>>> journalling filesystem and a small enough code base that it can
>>> be reasonably expected to reach the quality expected of an
>>> enterprise class filesystem, quite possibly before XFS gets
>>> there.
>> We wouldn't expect anyone developing a new filesystem to believe any differently.
> It is not a matter of belief, it is a matter of testable fact. For
> example, you can count the lines. You can run the same benchmarks.
>
> Proving the data consistency claims would be a little harder, you
> need tools for that, and some of those aren't built yet. Or, if you
> have technical ability, you can read the code and the copious design
> material that has been posted and convince yourself that, yes, there
> is something cool here, why didn't anybody do it that way before?
> But of course that starts to sound like work. Debating nontechnical
> issues and playing politics seems so much more like fun.
>
>> If they didn't
>> believe this, why would they be working on the filesystem instead of just using an existing filesystem.
> Right, and it is my job to convince you that what I believe for
> perfectly valid, demonstrable technical reasons, is really true. I do
> not see why you feel it is your job to convince me that the obviously
> broken Linux community process is not in fact broken, and that a
> certain person who obviously has an agenda, is not actually obstructing.
>
>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>> may not be right, and nobody will know until you are to a usable state and other people can start
>> beating on it.
> With ENOSPC we are at that state. Tux3 would get more testing and advance
> faster if it was merged. Things like ifdefs, grandiose new schemes for
> writeback infrastructure, dumb little hooks in the mkwrite path, those
> are all just manufactured red herrings. Somebody wanted those to be
> issues, so now they are issues. Fake ones.
>
> Nobody is trying to trick you. Just stating a fact. You ought to be able
> to figure out by now that Tux3 is worth merging.
>
> You might possibly have an argument that merging a filesystem that
> crashes as soon as it fills the disk is just sheer stupidity that can
> only lead to embarrassment in the long run, but then you would need to
> explain why Btrfs was merged. As I recall, it went something like, Chris
> had it on a laptop, so it must be a filesystem, and wow look at that
> feature list. Then it got merged in a completely unusable state and got
> worked on. If it had not been merged, Btrfs would most likely be dead
> right now. After all, who cares about an out of tree filesystem?
I would like to make two points about this statement:
Firstly, Btrfs was backed by Oracle, an organization of a totally
different size than a small group of developers.
Secondly, you are right with your complaints. That said, we do not want
to make the same mistake again with Tux3 or any other file system.
>
> By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
> 2009, with Tux3 running as my root filesystem. By the standard applied
> to Btrfs, Tux3 should have been merged then, right? After all, our
> nospace handling worked just as well as theirs at that time.
As far as I can remember from the posts on the mailing list, Tux3 has
changed so significantly in the last 6 years, with features that I always
reference, that it can no longer be compared with what was presented
in 2009.
>
> Regards,
>
> Daniel
Thanks
Best regards
Have fun
C.S.
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 20:54                                         ` Daniel Phillips
  2015-05-12 21:30                                           ` David Lang
  2015-05-12 21:30                                           ` Christian Stroetmann
@ 2015-05-13  7:20                                           ` Pavel Machek
  2015-05-13 13:47                                             ` Elifarley Callado Coelho Cruz
  2 siblings, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2015-05-13  7:20 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, Mike Galbraith, tux3, linux-fsdevel, OGAWA Hirofumi
On Tue 2015-05-12 13:54:58, Daniel Phillips wrote:
> On 05/12/2015 11:39 AM, David Lang wrote:
> > On Mon, 11 May 2015, Daniel Phillips wrote:
> >>> ...it's the mm and core kernel developers that need to
> >>> review and accept that code *before* we can consider merging tux3.
> >>
> >> Please do not say "we" when you know that I am just as much a "we"
> >> as you are. Merging Tux3 is not your decision. The people whose
> >> decision it actually is are perfectly capable of recognizing your
> >> agenda for what it is.
> >>
> >>   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
> >>   "XFS Developer Takes Shots At Btrfs, EXT4"
> > 
> > umm, Phoronix has no input on what gets merged into the kernel. they also have a reputation for
> > trying to turn anything into click-bait by making it sound like a fight when it isn't.
> 
> Perhaps you misunderstood. Linus decides what gets merged. Andrew
> decides. Greg decides. Dave Chinner does not decide, he just does
> his level best to create the impression that our project is unfit
> to merge. Any chance there might be an agenda?
Dunno. _Your_ agenda seems to be "attack other maintainers so much
that you can later claim they are biased".
Not going to work, sorry.
> > As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
> > Memory Management folks you have to convince. You may need a little benchmarking to show that there
> > is a real advantage to be gained, but the real discussion is going to be on the impact that page
> > forking is going to have on everything else (both in complexity and in performance impact to other
> > things)
> 
> Yet he clearly wrote "we" as if he believes he is part of it.
> 
> Now that ENOSPC is done to a standard way beyond what Btrfs had
> when it was merged, the next item on the agenda is writeback. That
> involves us and VFS people as you say, and not Dave Chinner, who
> only intends to obstruct the process as much as he possibly can. He
Why would he do that? Aha, maybe because you keep attacking him all
the time. Or maybe because your code is not up to the kernel
standards. You want to claim it is the former, but it really looks
like the latter.
Just stop doing that. You are not creating a nice atmosphere, and it
will not get tux3 merged in any way.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13  7:20                                           ` Pavel Machek
@ 2015-05-13 13:47                                             ` Elifarley Callado Coelho Cruz
  0 siblings, 0 replies; 160+ messages in thread
From: Elifarley Callado Coelho Cruz @ 2015-05-13 13:47 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Lang, Theodore Ts'o, Howard Chu, Dave Chinner,
	linux-kernel, linux-fsdevel, Mike Galbraith, tux3,
	Daniel Phillips, OGAWA Hirofumi
May I suggest a very relevant reading to all, even though the subject is
not file systems nor kernel development:
Rogerian Argument - Solving Problems by Negotiating Differences (academic
writing; psychology)
http://writingcommons.org/open-text/genres/academic-writing/arguments/318-rogerian-argument
Besides, I think everyone would benefit a lot more if most (if not ALL) of
the emotionally loaded sentences were simply omitted from technical discussions.
Please don't use words like "crowing", "stinky" and so on. I mean, what
kind of technical advance can be achieved by using them?
Instead of "stinky", say "your argument is false because [then provide the
minimum set of accurate logical details needed to get your point across,
nothing more, nothing else]"
Appeal to rubber ducking instead if you really need to vent.
I think all of this is valid for any technical discussion.
Elifarley Cruz
-
 " Do not believe anything because it is said by an authority, or if it  is
said to come from angels, or from Gods, or from an inspired source.
Believe it only if you have explored it in your own heart and mind and body
and found it to be true.  Work out your own path, through diligence."
- Gautama Buddha
On Wed, May 13, 2015 at 4:20 AM, Pavel Machek <pavel@ucw.cz> wrote:
> On Tue 2015-05-12 13:54:58, Daniel Phillips wrote:
> > On 05/12/2015 11:39 AM, David Lang wrote:
> > > On Mon, 11 May 2015, Daniel Phillips wrote:
> > >>> ...it's the mm and core kernel developers that need to
> > >>> review and accept that code *before* we can consider merging tux3.
> > >>
> > >> Please do not say "we" when you know that I am just as much a "we"
> > >> as you are. Merging Tux3 is not your decision. The people whose
> > >> decision it actually is are perfectly capable of recognizing your
> > >> agenda for what it is.
> > >>
> > >>   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
> > >>   "XFS Developer Takes Shots At Btrfs, EXT4"
> > >
> > > umm, Phoronix has no input on what gets merged into the kernel. they
> also have a reputation for
> > > trying to turn anything into click-bait by making it sound like a
> fight when it isn't.
> >
> > Perhaps you misunderstood. Linus decides what gets merged. Andrew
> > decides. Greg decides. Dave Chinner does not decide, he just does
> > his level best to create the impression that our project is unfit
> > to merge. Any chance there might be an agenda?
>
> Dunno. _Your_ agenda seems to be "attack other maintainers so much
> that you can later claim they are biased".
>
> Not going to work, sorry.
>
> > > As Dave says above, it's not the other filesystem people you have to
> convince, it's the core VFS and
> > > Memory Management folks you have to convince. You may need a little
> benchmarking to show that there
> > > is a real advantage to be gained, but the real discussion is going to
> be on the impact that page
> > > forking is going to have on everything else (both in complexity and in
> performance impact to other
> > > things)
> >
> > Yet he clearly wrote "we" as if he believes he is part of it.
> >
> > Now that ENOSPC is done to a standard way beyond what Btrfs had
> > when it was merged, the next item on the agenda is writeback. That
> > involves us and VFS people as you say, and not Dave Chinner, who
> > only intends to obstruct the process as much as he possibly can. He
>
> Why would he do that? Aha, maybe because you keep attacking him all
> the time. Or maybe because your code is not up to the kernel
> standards. You want to claim it is the former, but it really looks
> like the latter.
>
> Just stop doing that. You are not creating a nice atmosphere, and it
> will not get tux3 merged in any way.
>
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures)
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>
> _______________________________________________
> Tux3 mailing list
> Tux3@phunq.net
> http://phunq.net/mailman/listinfo/tux3
>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
 
 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  2:34                                 ` Daniel Phillips
  2015-05-12  5:38                                   ` Dave Chinner
@ 2015-05-12  9:03                                   ` Pavel Machek
  2015-05-12 11:22                                     ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2015-05-12  9:03 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, tux3, linux-fsdevel, OGAWA Hirofumi
On Mon 2015-05-11 19:34:34, Daniel Phillips wrote:
> 
> 
> On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
> > On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
> >> Umm, are you sure? If "some areas of disk are faster than others" is
> >> still true on today's hard drives, the gaps will decrease the
> >> performance (as you'll "use up" the fast areas more quickly).
> > 
> > It's still true.  The difference between O.D. and I.D. (outer diameter
> > vs inner diameter) LBA's is typically a factor of 2.  This is why
> > "short-stroking" works as a technique,
> 
> That is true, and the effect is not dominant compared to introducing
> a lot of extra seeks.
> 
> > and another way that people
> > doing competitive benchmarking can screw up and produce misleading
> > numbers.
> 
> If you think we screwed up or produced misleading numbers, could you
> please be up front about it instead of making insinuations and
> continuing your tirade against benchmarking and those who do it.
Aren't you being a little harsh with Ted? He was polite.
> The ram disk removes seek overhead and greatly reduces media transfer
> overhead. This does not change things much: it confirms that Tux3 is
> significantly faster than the others at synchronous loads. This is
> apparently true independently of media type, though to be sure SSD
> remains to be tested.
> 
> The really interesting result is how much difference there is between
> filesystems, even on a ram disk. Is it just CPU or is it synchronization
> strategy and lock contention? Does our asynchronous front/back design
> actually help a lot, instead of being a disadvantage as you predicted?
> 
> It is too bad that fs_mark caps number of tasks at 64, because I am
> sure that some embarrassing behavior would emerge at high task counts,
> as with my tests on spinning disk.
I'd call a system with 65 tasks doing heavy fsync load at the same time
"embarrassingly misconfigured" :-). It is nice if your filesystem can
stay fast in that case, but...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  9:03                                   ` Pavel Machek
@ 2015-05-12 11:22                                     ` Daniel Phillips
  2015-05-12 13:26                                       ` Howard Chu
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12 11:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Ts'o, Howard Chu, Mike Galbraith, Dave Chinner,
	linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On 05/12/2015 02:03 AM, Pavel Machek wrote:
> On Mon 2015-05-11 19:34:34, Daniel Phillips wrote:
>> On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
>>> and another way that people
>>> doing competitive benchmarking can screw up and produce misleading
>>> numbers.
>>
>> If you think we screwed up or produced misleading numbers, could you
>> please be up front about it instead of making insinuations and
>> continuing your tirade against benchmarking and those who do it.
> 
> Aren't you being a little harsh with Ted? He was polite.
Polite language does not include words like "screw up" and "misleading
numbers", those are combative words intended to undermine and disparage.
It is not clear how repeating the same words can be construed as less
polite than the original utterance.
>> The ram disk removes seek overhead and greatly reduces media transfer
>> overhead. This does not change things much: it confirms that Tux3 is
>> significantly faster than the others at synchronous loads. This is
>> apparently true independently of media type, though to be sure SSD
>> remains to be tested.
>>
>> The really interesting result is how much difference there is between
>> filesystems, even on a ram disk. Is it just CPU or is it synchronization
>> strategy and lock contention? Does our asynchronous front/back design
>> actually help a lot, instead of being a disadvantage as you predicted?
>>
>> It is too bad that fs_mark caps number of tasks at 64, because I am
>> sure that some embarrassing behavior would emerge at high task counts,
>> as with my tests on spinning disk.
> 
> I'd call a system with 65 tasks doing heavy fsync load at the same time
> "embarrassingly misconfigured" :-). It is nice if your filesystem can
> stay fast in that case, but...
Well, Tux3 wins the fsync race now whether it is 1 task, 64 tasks or
10,000 tasks. At the high end, maybe it is just a curiosity, or maybe
it tells us something about how Tux3 will scale on the big machines
that XFS currently lays claim to. And Java programmers are busy doing
all kinds of wild and crazy things with lots of tasks. Java almost
makes them do it. If they need their data durable then they can easily
create loads like my test case.
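
A load of that general shape can be sketched roughly as follows. This is
only an illustration, not the actual test case or fs_mark: the names
fsync_worker and run_fsync_load, the file naming and the record format are
all hypothetical, and Python threads stand in for whatever tasks a real
application would use.

```python
import os
import tempfile
import threading


def fsync_worker(directory, task_id, writes, results):
    """Append small records to a private file, calling fsync after each one."""
    path = os.path.join(directory, f"task-{task_id}.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for i in range(writes):
            os.write(fd, f"task {task_id} record {i}\n".encode())
            os.fsync(fd)  # block until this file's data reaches stable storage
    finally:
        os.close(fd)
    results[task_id] = writes


def run_fsync_load(directory, tasks=64, writes=10):
    """Run `tasks` concurrent writers and return the total number of fsyncs."""
    results = {}
    threads = [
        threading.Thread(target=fsync_worker, args=(directory, t, writes, results))
        for t in range(tasks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results.values())


if __name__ == "__main__":
    # The temporary directory lands on whatever filesystem backs the default
    # temp location; point it at the filesystem under test instead.
    with tempfile.TemporaryDirectory() as scratch:
        print(run_fsync_load(scratch, tasks=64, writes=10))
```

Timing the call to run_fsync_load at different task counts gives a crude
picture of how fsync throughput holds up as concurrency grows.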
Suppose you have a web server meant to serve 10,000 transactions
simultaneously and it needs to survive crashes without losing client
state. How will you do it? You could install an expensive, finicky
database, or you could write some Java code that happens to work well
because Linux has a scheduler and a filesystem that can handle it.
Oh wait, we don't have the second one yet, but maybe we soon will.
I will not claim that stupidly fast and scalable fsync is the main
reason that somebody should want Tux3, however, the lack of a high
performance fsync was in fact used as a means of spreading FUD about
Tux3, so I had some fun going way beyond the call of duty to answer
that. By the way, I am still waiting for the original source of the
FUD to concede the point politely, but maybe he is waiting for the
code to land, which it still has not as of today, so I guess that is
fair. Note that it would have landed quite some time ago if Tux3 was
already merged.
Historical note: didn't Java motivate the O(1) scheduler?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12 11:22                                     ` Daniel Phillips
@ 2015-05-12 13:26                                       ` Howard Chu
  0 siblings, 0 replies; 160+ messages in thread
From: Howard Chu @ 2015-05-12 13:26 UTC (permalink / raw)
  To: Daniel Phillips, Pavel Machek
  Cc: Theodore Ts'o, Mike Galbraith, Dave Chinner, linux-kernel,
	linux-fsdevel, tux3, OGAWA Hirofumi
Daniel Phillips wrote:
> On 05/12/2015 02:03 AM, Pavel Machek wrote:
>> I'd call a system with 65 tasks doing heavy fsync load at the same time
>> "embarrassingly misconfigured" :-). It is nice if your filesystem can
>> stay fast in that case, but...
>
> Well, Tux3 wins the fsync race now whether it is 1 task, 64 tasks or
> 10,000 tasks. At the high end, maybe it is just a curiosity, or maybe
> it tells us something about how Tux3 will scale on the big machines
> that XFS currently lays claim to. And Java programmers are busy doing
> all kinds of wild and crazy things with lots of tasks. Java almost
> makes them do it. If they need their data durable then they can easily
> create loads like my test case.
>
> Suppose you have a web server meant to serve 10,000 transactions
> simultaneously and it needs to survive crashes without losing client
> state. How will you do it? You could install an expensive, finicky
> database, or you could write some Java code that happens to work well
> because Linux has a scheduler and a filesystem that can handle it.
> Oh wait, we don't have the second one yet, but maybe we soon will.
>
> I will not claim that stupidly fast and scalable fsync is the main
> reason that somebody should want Tux3, however, the lack of a high
> performance fsync was in fact used as a means of spreading FUD about
> Tux3, so I had some fun going way beyond the call of duty to answer
> that. By the way, I am still waiting for the original source of the
> FUD to concede the point politely, but maybe he is waiting for the
> code to land, which it still has not as of today, so I guess that is
> fair. Note that it would have landed quite some time ago if Tux3 was
> already merged.
Well, stupidly fast and scalable fsync sounds wonderful to me; it's the 
primary pain point in LMDB write performance now.
http://symas.com/mdb/ondisk/
I look forward to testing Tux3 when usable code shows up in a public repo.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-11 22:12                             ` Pavel Machek
  2015-05-11 23:17                               ` Theodore Ts'o
@ 2015-05-11 23:53                               ` Daniel Phillips
  2015-05-12  0:12                                 ` David Lang
  2015-05-13  7:25                                 ` Pavel Machek
  1 sibling, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-11 23:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, tux3, linux-fsdevel, OGAWA Hirofumi
Hi Pavel,
On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>> design something workable there it is going to impact everything else you've already done. It's an
>>> easy bet that the impact will be negative, the only question is to what degree.
>>
>> You might lose that bet. For example, suppose we do strictly linear allocation
>> each delta, and just leave nice big gaps between the deltas for future
>> expansion. Clearly, we run at similar or identical speed to the current naive
>> strategy until we must start filling in the gaps, and at that point our layout
>> is not any worse than XFS, which started bad and stayed that way.
> 
> Umm, are you sure. If "some areas of disk are faster than others" is
> still true on today's hard drives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).
That's why I hedged my claim with "similar or identical". The
difference in media speed seems to be a relatively small effect
compared to extra seeks. It seems that XFS puts big spaces between
new directories, and suffers a lot of extra seeks because of it.
I propose to batch new directories together initially, then change
the allocation goal to a new, relatively empty area if a big batch
of files lands on a directory in a crowded region. The "big" gaps
would be on the order of delta size, so not really very big.
Anyway, some people seem to have pounced on the words "naive" and
"linear allocation" and jumped to the conclusion that our whole
strategy is naive. Far from it. We don't just throw files randomly
at the disk. We sort and partition files and metadata, and we
carefully arrange the order of our allocation operations so that
linear allocation produces a nice layout for both read and write.
This turned out to be so much better than fiddling with the goal
of individual allocations that we concluded we would get best
results by sticking with linear allocation, but improve our sort
step. The new plan is to partition updates into batches according
to some affinity metrics, and set the linear allocation goal per
batch. So for example, big files and append-type files can get
special treatment in separate batches, while files that seem to
be related because of having the same directory parent and being
written in the same delta will continue to be streamed out using
"naive" linear allocation, which is not necessarily as naive as
one might think.
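The batching plan described above could be sketched roughly as follows. This is only an illustration of the idea, not Tux3 code: the `DirtyFile` record, the `BIG_FILE_BLOCKS` threshold, the three batch kinds, and the per-kind starting goals are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

BIG_FILE_BLOCKS = 1024  # hypothetical cutoff for "big" files


@dataclass
class DirtyFile:
    name: str
    parent_dir: str
    blocks: int              # dirty blocks to write in this delta
    append_only: bool = False


def batch_key(f):
    """Choose an affinity batch for a file (illustrative metrics only)."""
    if f.append_only:
        return ("append",)
    if f.blocks >= BIG_FILE_BLOCKS:
        return ("big",)
    # Files with the same parent, written in the same delta, stay together.
    return ("dir", f.parent_dir)


def allocate_delta(files, goals):
    """Lay out one delta: linear allocation within each kind of batch.

    `goals` maps a batch kind ("append", "big", "dir") to the first block
    of the region chosen for it. Returns {file name: (start, length)}.
    """
    batches = defaultdict(list)
    for f in files:
        batches[batch_key(f)].append(f)

    cursors = dict(goals)  # next free block per batch kind
    layout = {}
    for key in sorted(batches):
        kind = key[0]
        for f in sorted(batches[key], key=lambda f: f.name):
            layout[f.name] = (cursors[kind], f.blocks)
            cursors[kind] += f.blocks  # plain linear allocation
    return layout


files = [
    DirtyFile("b.c", "src", 2),
    DirtyFile("a.c", "src", 3),
    DirtyFile("x.log", "var", 1, append_only=True),
    DirtyFile("img", "data", 2000),
    DirtyFile("r.txt", "doc", 1),
]
layout = allocate_delta(files, {"dir": 0, "big": 10000, "append": 50000})
for name, (start, length) in sorted(layout.items()):
    print(name, start, length)
```

Within each batch the allocator is still the simple linear one; the only added decision is which region each batch streams into.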
It will take time and a lot of performance testing to get this
right, but nobody should get the idea that it is any inherent
design limitation. The opposite is true: we have no restrictions
at all in media layout.
Compared to Ext4, we do need to address the issue that data moves
around when updated. This can cause rapid fragmentation. Btrfs has
shown issues with that for big, randomly updated files. We want to
fix it without falling back on update-in-place as Btrfs does.
Actually, Tux3 already has update-in-place, and unlike Btrfs, we
can switch to it for non-empty files. But we think that perfect data
isolation per delta is something worth fighting for, and we would
rather not force users to fiddle around with mode settings just to
make something work as well as it already does on Ext4. We will
tackle this issue by partitioning as above, and use a dedicated
allocation strategy for such files, which are easy to detect.
Metadata moving around per update does not seem to be a problem
because it is all single blocks that need very little slack space
to stay close to home.
> Anyway... you have brand new filesystem. Of course it should be
> faster/better/nicer than the existing filesystems. So don't be too
> harsh with XFS people.
They have done a lot of good work, but they still have a long way
to go. I don't see any shame in that.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-11 23:53                               ` Daniel Phillips
@ 2015-05-12  0:12                                 ` David Lang
  2015-05-12  4:36                                   ` Daniel Phillips
  2015-05-13  7:25                                 ` Pavel Machek
  1 sibling, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-12  0:12 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Pavel Machek, Howard Chu, Mike Galbraith, Dave Chinner,
	linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
On Mon, 11 May 2015, Daniel Phillips wrote:
> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>
>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>> each delta, and just leave nice big gaps between the deltas for future
>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>> strategy until we must start filling in the gaps, and at that point our layout
>>> is not any worse than XFS, which started bad and stayed that way.
>>
>> Umm, are you sure. If "some areas of disk are faster than others" is
>> still true on today's hard drives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
>
> That's why I hedged my claim with "similar or identical". The
> difference in media speed seems to be a relatively small effect
> compared to extra seeks. It seems that XFS puts big spaces between
> new directories, and suffers a lot of extra seeks because of it.
> I propose to batch new directories together initially, then change
> the allocation goal to a new, relatively empty area if a big batch
> of files lands on a directory in a crowded region. The "big" gaps
> would be on the order of delta size, so not really very big.
This is an interesting idea, but what happens if the files don't arrive as a big 
batch, but rather trickle in over time (think a logserver that is putting files 
into a bunch of directories at a fairly modest rate per directory)?
And when you then decide that you have to move the directory/file info, doesn't 
that create a potentially large amount of unexpected IO that could end up 
interfering with what the user is trying to do?
David Lang
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  0:12                                 ` David Lang
@ 2015-05-12  4:36                                   ` Daniel Phillips
  2015-05-12 17:30                                     ` Christian Stroetmann
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12  4:36 UTC (permalink / raw)
  To: David Lang
  Cc: Pavel Machek, Howard Chu, Mike Galbraith, Dave Chinner,
	linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
Hi David,
On 05/11/2015 05:12 PM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
> 
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>>
>>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>>> each delta, and just leave nice big gaps between the deltas for future
>>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>>> strategy until we must start filling in the gaps, and at that point our layout
>>>> is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>> still true on today's hard drives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
>> compared to extra seeks. It seems that XFS puts big spaces between
>> new directories, and suffers a lot of extra seeks because of it.
>> I propose to batch new directories together initially, then change
>> the allocation goal to a new, relatively empty area if a big batch
>> of files lands on a directory in a crowded region. The "big" gaps
>> would be on the order of delta size, so not really very big.
> 
> This is an interesting idea, but what happens if the files don't arrive as a big batch, but rather
> trickle in over time (think a logserver that is putting files into a bunch of directories at a
> fairly modest rate per directory)?
If files are trickling in then we can afford to spend a lot more time
finding nice places to tuck them in. Log server files are an especially
irksome problem for a redirect-on-write filesystem because the final
block tends to be rewritten many times and we must move it to a new
location each time, so every extent ends up as one block. Oh well. If
we just make sure to have some free space at the end of the file that
only that file can use (until everywhere else is full) then the long
term result will be slightly ravelled blocks that nonetheless tend to
be on the same track or flash block as their logically contiguous
neighbours. There will be just zero or one empty data blocks mixed
into the file tail as we commit the tail block over and over with the
same allocation goal. Sometimes there will be a block or two of
metadata as well, which will eventually bake themselves into the
middle of contiguous data and stop moving around.
Putting this together, we have:
  * At delta flush, break out all the log type files
  * Dedicate some block groups to append type files
  * Leave lots of space between files in those block groups
  * Peek at the last block of the file to set the allocation goal
Something like that. What we don't want is to throw those files into
the middle of a lot of rewrite-all files, messing up both kinds of file.
We don't care much about keeping these files near the parent directory
because one big seek per log file in a grep is acceptable, we just need
to avoid thousands of big seeks within the file, and not dribble single
blocks all over the disk.
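A toy model may help show why repeatedly committing the tail block with the same allocation goal leaves at most one empty block between neighbours. It is a sketch under simplifying assumptions, not Tux3 code: the `AppendFile` class is hypothetical, the file owns a private run of free blocks, metadata blocks are ignored, and the old copy of the tail must stay in place until the new copy is committed.

```python
class AppendFile:
    """Toy redirect-on-write model of an append-type (log) file."""

    def __init__(self, region_start):
        self.region_start = region_start
        self.full = []    # physical positions of completed data blocks
        self.tail = None  # physical position of the partly filled last block

    def goal(self):
        """Allocation goal: the block just past the last completed block."""
        return self.full[-1] + 1 if self.full else self.region_start

    def commit_tail(self, tail_now_full=False):
        """Write a new copy of the tail block at the goal.

        The previous copy cannot be overwritten in place, so if it already
        sits on the goal the new copy goes one block further along.
        """
        new = self.goal()
        if new == self.tail:
            new += 1
        self.tail = new
        if tail_now_full:
            self.full.append(new)
            self.tail = None


f = AppendFile(region_start=100)
# Number of commits each successive block receives before it fills up.
for commits in [3, 2, 4, 1, 2]:
    for i in range(commits):
        f.commit_tail(tail_now_full=(i == commits - 1))

gaps = [b - a for a, b in zip(f.full, f.full[1:])]
print(f.full, gaps)
```

In this model the tail simply alternates between two adjacent positions, so consecutive data blocks end up either adjacent or separated by a single hole, however many times the tail is rewritten.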
It would also be nice to merge together extents somehow as the final
block is rewritten. One idea is to retain the final block dirty until
the next delta, and write it again into a contiguous position, so the
final block is always flushed twice. We already have the opportunistic
merge logic, but the redirty behavior and making sure it only happens
to log files would be a bit fiddly.
We will also play the incremental defragmentation card at some point,
but first we should try hard to control fragmentation in the first
place. Tux3 is well suited to online defragmentation because the delta
commit model makes it easy to move things around efficiently and safely,
but it does generate extra IO, so as a basic mechanism it is not ideal.
When we get to piling on features, that will be high on the list,
because it is relatively easy, and having that fallback gives a certain
sense of security.
> And when you then decide that you have to move the directory/file info, doesn't that create a
> potentially large amount of unexpected IO that could end up interfering with what the user is trying
> to do?
Right, we don't like that and don't plan to rely on it. What we hope
for is behavior that, when you slowly stir the pot, tends to improve the
layout just as often as it degrades it. It may indeed become harder to
find ideal places to put things as time goes by, but we also gain more
information to base decisions on.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-12  4:36                                   ` Daniel Phillips
@ 2015-05-12 17:30                                     ` Christian Stroetmann
  0 siblings, 0 replies; 160+ messages in thread
From: Christian Stroetmann @ 2015-05-12 17:30 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Pavel Machek, Howard Chu, Mike Galbraith,
	Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
Am 12.05.2015 06:36, schrieb Daniel Phillips:
> Hi David,
>
> On 05/11/2015 05:12 PM, David Lang wrote:
>> On Mon, 11 May 2015, Daniel Phillips wrote:
>>
>>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>>>> each delta, and just leave nice big gaps between the deltas for future
>>>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>>>> strategy until we must start filling in the gaps, and at that point our layout
>>>>> is not any worse than XFS, which started bad and stayed that way.
>>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>>> still true on today's hard drives, the gaps will decrease the
>>>> performance (as you'll "use up" the fast areas more quickly).
>>> That's why I hedged my claim with "similar or identical". The
>>> difference in media speed seems to be a relatively small effect
>>> compared to extra seeks. It seems that XFS puts big spaces between
>>> new directories, and suffers a lot of extra seeks because of it.
>>> I propose to batch new directories together initially, then change
>>> the allocation goal to a new, relatively empty area if a big batch
>>> of files lands on a directory in a crowded region. The "big" gaps
>>> would be on the order of delta size, so not really very big.
>> This is an interesting idea, but what happens if the files don't arrive as a big batch, but rather
>> trickle in over time (think a logserver that is putting files into a bunch of directories at a
>> fairly modest rate per directory)?
> If files are trickling in then we can afford to spend a lot more time
> finding nice places to tuck them in. Log server files are an especially
> irksome problem for a redirect-on-write filesystem because the final
> block tends to be rewritten many times and we must move it to a new
> location each time, so every extent ends up as one block. Oh well. If
> we just make sure to have some free space at the end of the file that
> only that file can use (until everywhere else is full) then the long
> term result will be slightly ravelled blocks that nonetheless tend to
> be on the same track or flash block as their logically contiguous
> neighbours. There will be just zero or one empty data blocks mixed
> into the file tail as we commit the tail block over and over with the
> same allocation goal. Sometimes there will be a block or two of
> metadata as well, which will eventually bake themselves into the
> middle of contiguous data and stop moving around.
>
> Putting this together, we have:
>
>    * At delta flush, break out all the log type files
>    * Dedicate some block groups to append type files
>    * Leave lots of space between files in those block groups
>    * Peek at the last block of the file to set the allocation goal
>
> Something like that. What we don't want is to throw those files into
> the middle of a lot of rewrite-all files, messing up both kinds of file.
> We don't care much about keeping these files near the parent directory
> because one big seek per log file in a grep is acceptable, we just need
> to avoid thousands of big seeks within the file, and not dribble single
> blocks all over the disk.
>
> It would also be nice to merge together extents somehow as the final
> block is rewritten. One idea is to retain the final block dirty until
> the next delta, and write it again into a contiguous position, so the
> final block is always flushed twice. We already have the opportunistic
> merge logic, but the redirty behavior and making sure it only happens
> to log files would be a bit fiddly.
>
> We will also play the incremental defragmentation card at some point,
> but first we should try hard to control fragmentation in the first
> place. Tux3 is well suited to online defragmentation because the delta
> commit model makes it easy to move things around efficiently and safely,
> but it does generate extra IO, so as a basic mechanism it is not ideal.
> When we get to piling on features, that will be high on the list,
> because it is relatively easy, and having that fallback gives a certain
> sense of security.
So we are again at some more features of SASOS4Fun.
That said, as an alleged troll expert I can see the agenda and strategy 
behind this and related threads, but there is still no usable code or file 
system at all, and hence nothing that might even be ready for merging, as I 
understand the statements of the file system gurus.
So it is time for the developer(s) to decide what should eventually be 
implemented in code, and then show the complete result, so that others 
can run the tests and the benchmarks.
Thanks
Best Regards
Do not feed the trolls.
C.S.
>> And when you then decide that you have to move the directory/file info, doesn't that create a
>> potentially large amount of unexpected IO that could end up interfering with what the user is trying
>> to do?
> Right, we don't like that and don't plan to rely on it. What we hope
> for is behavior that, when you slowly stir the pot, tends to improve the
> layout just as often as it degrades it. It may indeed become harder to
> find ideal places to put things as time goes by, but we also gain more
> information to base decisions on.
>
> Regards,
>
> Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-11 23:53                               ` Daniel Phillips
  2015-05-12  0:12                                 ` David Lang
@ 2015-05-13  7:25                                 ` Pavel Machek
  2015-05-13 11:31                                   ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2015-05-13  7:25 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Howard Chu, Mike Galbraith, Dave Chinner, linux-kernel,
	linux-fsdevel, tux3, Theodore Ts'o, OGAWA Hirofumi
On Mon 2015-05-11 16:53:10, Daniel Phillips wrote:
> Hi Pavel,
> 
> On 05/11/2015 03:12 PM, Pavel Machek wrote:
> >>> It is a fact of life that when you change one aspect of an intimately interconnected system,
> >>> something else will change as well. You have naive/nonexistent free space management now; when you
> >>> design something workable there it is going to impact everything else you've already done. It's an
> >>> easy bet that the impact will be negative, the only question is to what degree.
> >>
> >> You might lose that bet. For example, suppose we do strictly linear allocation
> >> each delta, and just leave nice big gaps between the deltas for future
> >> expansion. Clearly, we run at similar or identical speed to the current naive
> >> strategy until we must start filling in the gaps, and at that point our layout
> >> is not any worse than XFS, which started bad and stayed that way.
> > 
> > Umm, are you sure. If "some areas of disk are faster than others" is
> > still true on today's hard drives, the gaps will decrease the
> > performance (as you'll "use up" the fast areas more quickly).
> 
> That's why I hedged my claim with "similar or identical". The
> difference in media speed seems to be a relatively small effect
When you knew it can't be identical? That's rather confusing, right?
Perhaps you should post more details how your benchmark is structured
next time, so we can see you did not make any trivial mistakes...?
Or just clean the code up so that it can get merged, so that we can
benchmark ourselves...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13  7:25                                 ` Pavel Machek
@ 2015-05-13 11:31                                   ` Daniel Phillips
  2015-05-13 12:41                                     ` Daniel Phillips
  2015-05-13 13:08                                     ` Mike Galbraith
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 11:31 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Howard Chu, Mike Galbraith, Dave Chinner, linux-kernel,
	linux-fsdevel, tux3, Theodore Ts'o, OGAWA Hirofumi
On 05/13/2015 12:25 AM, Pavel Machek wrote:
> On Mon 2015-05-11 16:53:10, Daniel Phillips wrote:
>> Hi Pavel,
>>
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>>
>>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>>> each delta, and just leave nice big gaps between the deltas for future
>>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>>> strategy until we must start filling in the gaps, and at that point our layout
>>>> is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>> still true on today's hard drives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
> 
> When you knew it can't be identical? That's rather confusing, right?
Maybe. The top of thread is about a measured performance deficit of
a factor of five. Next to that, a media transfer rate variation by
a factor of two already starts to look small, and gets smaller when
scrutinized.
Let's say our delta size is 400MB (typical under load) and we leave
a "nice big gap" of 112 MB after flushing each one. Let's say we do
two thousand of those before deciding that we have enough information
available to switch to some smarter strategy. We used one GB of a
4TB disk, say. The media transfer rate decreased by a factor of:
    (1 - 2/1000) = .2%.
The performance deficit in question and the difference in media rate are
three orders of magnitude apart, does that justify the term "similar or
identical?".
> Perhaps you should post more details how your benchmark is structured
> next time, so we can see you did not make any trivial mistakes...?
Makes sense to me, though I do take considerable care to ensure that
my results are reproducible. That is borne out by the fact that Mike
did reproduce, albeit from the published branch, which is a bit behind
current work. And he went on to do some original testing of his own.
I had no idea Tux3 was so much faster than XFS on the Git self test,
because we never specifically tested anything like that, or optimized
for it. Of course I was interested in why. And that was not all, Mike
also noticed a really interesting fact about latency that I failed to
reproduce. That went on to the list of things to investigate as time
permits.
I reproduced Mike's results according to his description, by actually
building Git in the VM and running the selftests just to see if the same
thing happened, which it did. I didn't think that was worth mentioning
at the time, because if somebody publishes benchmarks, my first instinct
is to trust them. Trust and verify.
> Or just clean the code up so that it can get merged, so that we can
> benchmark ourselves...
Third possibility: build from our repository, as Mike did. Obviously,
we need to merge to master so the build process matches the Wiki. But
Hirofumi is busy with other things, so please be patient.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 11:31                                   ` Daniel Phillips
@ 2015-05-13 12:41                                     ` Daniel Phillips
  2015-05-13 13:08                                     ` Mike Galbraith
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 12:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Mike Galbraith, tux3, linux-fsdevel, OGAWA Hirofumi
On 05/13/2015 04:31 AM, Daniel Phillips wrote:
Let me be the first to catch that arithmetic error....
> Let's say our delta size is 400MB (typical under load) and we leave
> a "nice big gap" of 112 MB after flushing each one. Let's say we do
> two thousand of those before deciding that we have enough information
> available to switch to some smarter strategy. We used one GB of a
> 4TB disk, say. The media transfer rate decreased by a factor of:
> 
>     (1 - 2/1000) = .2%.
Ahem, no, we used 1/8th of the disk. The time/data rate increased
from unity to 1.125, for an average of 1.0625 across the region.
If we only use 1/10th of the disk instead, by not leaving gaps,
then the average time/data across the region is 1.05. The
difference is, 1.0625 - 1.05, so the gap strategy increases media
transfer time by 1.25%, which is not significant compared to the
performance deficit in question of 400%. So, same argument:
change in media transfer rate is just a distraction from the
original question.
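The corrected estimate can be checked with a few lines of arithmetic. The sketch below assumes, as the figures above imply, that transfer time per byte rises linearly from 1.0 at the start of the disk to 2.0 at the end (the factor-of-two variation mentioned earlier in the thread), and that 1000 deltas of 400MB plus 112MB gaps cover roughly 1/8 of a 4TB disk versus 1/10 without gaps. The function name is hypothetical.

```python
def mean_time_per_byte(fraction_used, inner_slowdown=2.0):
    """Average relative transfer time over the first part of the disk.

    Assumes time per byte grows linearly from 1.0 at the start of the
    disk to `inner_slowdown` at the very end.
    """
    time_at_end_of_region = 1.0 + (inner_slowdown - 1.0) * fraction_used
    return (1.0 + time_at_end_of_region) / 2.0


with_gaps = mean_time_per_byte(1 / 8)      # deltas plus gaps: ~1/8 of disk
without_gaps = mean_time_per_byte(1 / 10)  # deltas only: ~1/10 of disk
penalty = with_gaps - without_gaps

print(f"with gaps:    {with_gaps:.4f}")
print(f"without gaps: {without_gaps:.4f}")
print(f"penalty:      {penalty:.2%}")
```

Under these assumptions the gap strategy costs about 1.25% in average media transfer time, matching the figure in the message.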
In any case, we probably want to start using a smarter strategy
sooner than 1000 commits, maybe after ten or a hundred commits,
which would make the change in media transfer rate even less
relevant.
The thing is, when data first starts landing on media, we do not
have much information about what the long term load will be. So
just analyze the clues we have in the early commits and put those
early deltas onto disk in the most efficient format, which for
Tux3 seems to be linear per delta. There would be exceptions, but
that is the common case.
Then get smarter later. The intent is to get the best of both:
early efficiency, and long term nice aging behavior. I do not
accept the proposition that one must be sacrificed for the
other, I find that reasoning faulty.
> The performance deficit in question and the difference in media rate are
> three orders of magnitude apart, does that justify the term "similar or
> identical?".
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 11:31                                   ` Daniel Phillips
  2015-05-13 12:41                                     ` Daniel Phillips
@ 2015-05-13 13:08                                     ` Mike Galbraith
  2015-05-13 13:15                                       ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-05-13 13:08 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Pavel Machek, Howard Chu, Dave Chinner, linux-kernel,
	linux-fsdevel, tux3, Theodore Ts'o, OGAWA Hirofumi
On Wed, 2015-05-13 at 04:31 -0700, Daniel Phillips wrote:
> Third possibility: build from our repository, as Mike did.
Sorry about that folks.  I've lost all interest, it won't happen again.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-05-13 13:08                                     ` Mike Galbraith
@ 2015-05-13 13:15                                       ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 13:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Theodore Ts'o, Howard Chu, Dave Chinner, linux-kernel,
	Pavel Machek, tux3, linux-fsdevel, OGAWA Hirofumi
On 05/13/2015 06:08 AM, Mike Galbraith wrote:
> On Wed, 2015-05-13 at 04:31 -0700, Daniel Phillips wrote:
>> Third possibility: build from our repository, as Mike did.
> 
> Sorry about that folks.  I've lost all interest, it won't happen again.
Thanks for your valuable contribution. Now we are seeing a steady
stream of people heading to the repository, after you showed
it could be done.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
 
 
 
 
 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 14:07                       ` Daniel Phillips
  2015-04-30 14:28                         ` Howard Chu
@ 2015-04-30 14:33                         ` Mike Galbraith
  2015-04-30 15:24                           ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30 14:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o, OGAWA Hirofumi
On Thu, 2015-04-30 at 07:07 -0700, Daniel Phillips wrote:
> 
> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
> > On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
> >> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
> >>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
> >>>
> >>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
> >>>> even with seek time factored out of the equation.
> >>>
> >>> Hm.  Do you have big-storage comparison numbers to back that?  I'm no
> >>> storage guy (waiting for holographic crystal arrays to obsolete all this
> >>> crap;), but Dave's big-storage guy words made sense to me.
> >>
> >> This has nothing to do with big storage. The proposition was that seek
> >> time is the reason for Tux3's fsync performance. That claim was easily
> >> falsified by removing the seek time.
> >>
> >> Dave's big storage words are there to draw attention away from the fact
> >> that XFS ran the Git tests four times slower than Tux3 and three times
> >> slower than Ext4. Whatever the big storage excuse is for that, the fact
> >> is, XFS obviously sucks at little storage.
> > 
> > If you allocate spanning the disk from start of life, you're going to
> > eat seeks that others don't until later.  That seemed rather obvious and
> > straightforward.
> 
> It is a logical fallacy. It mixes a grain of truth (spreading all over the
> disk causes extra seeks) with an obvious falsehood (it is not necessarily
> the only possible way to avoid long term fragmentation).
Shrug, but seems it is a solution, and more importantly, an implemented
solution.  What I gleaned as a lay reader is that xfs has no
fragmentation issue, but tux3 still does.  It doesn't seem right to slam
xfs for a conscious design decision unless tux3 can proudly display its
superior solution, which I gathered doesn't yet exist.
> > He flat stated that xfs has passable performance on
> > single bit of rust, and openly explained why.  I see no misdirection,
> > only some evidence of bad blood between you two.
> 
> Raising the spectre of theoretical fragmentation issues when we have not
> even begun that work is a straw man and intellectually dishonest. You have
> to wonder why he does it. It is destructive to our community image and
> harmful to progress.
Well ok, let's forget bad blood, straw men... and answering my question
too I suppose.  Not having any sexy  IO gizmos in my little desktop box,
I don't care deeply which stomps the other flat on beastly boxen.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)
  2015-04-30 14:33                         ` Mike Galbraith
@ 2015-04-30 15:24                           ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 15:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Theodore Ts'o, tux3, Dave Chinner, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
On 04/30/2015 07:33 AM, Mike Galbraith wrote:
> Well ok, let's forget bad blood, straw men... and answering my question
> too I suppose.  Not having any sexy  IO gizmos in my little desktop box,
> I don't care deeply which stomps the other flat on beastly boxen.
I'm with you, especially the forget bad blood part. I did my time in
big storage and I will no doubt do it again, but right now, what I care
about is bringing truth and beauty to small storage, which includes
that spinning rust of yours and also the cheap SSD you are about to
run out and buy.
I hope you caught the bit about how Tux3 is doing really well running
in tmpfs? According to my calculations, that means good things for SSD
performance.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29 16:42         ` Mike Galbraith
  2015-04-29 19:05           ` xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?) Mike Galbraith
@ 2015-04-29 20:40           ` Daniel Phillips
  2015-04-29 22:06             ` OGAWA Hirofumi
  2015-04-30  3:50             ` Mike Galbraith
  1 sibling, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-29 20:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-fsdevel, tux3, Theodore Ts'o, linux-kernel,
	OGAWA Hirofumi
On Wednesday, April 29, 2015 9:42:43 AM PDT, Mike Galbraith wrote:
>
> [dbench bakeoff]
>
> With dbench v4.00, tux3 seems to be king of the max_latency hill, but
> btrfs took throughput on my box.  With v3.04, tux3 took 1st place at
> splashing about in pagecache, but last place at dbench -S.
>
> Hohum, curiosity satisfied.
Hi Mike,
Thanks for that. Please keep in mind, that was our B team, it does a
full fs sync for every fsync. Maybe a rematch when the shiny new one
lands? Also, hardware? It looks like a single 7200 RPM disk, but it
would be nice to know. And it seems, not all dbench 4.0 are equal.
Mine doesn't have a -B option.
That order of magnitude latency difference is striking. It sounds
good, but what does it mean? I see a smaller difference here, maybe
because of running under KVM.
Your results seem to confirm the gap I noticed between Ext4 and XFS
on the one hand and Btrfs and Tux3 on the other, with the caveat that
the anomalous dbench -S result is probably about running with the
older fsync code. Of course, this is just dbench, but maybe something
to keep an eye on.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29 20:40           ` Tux3 Report: How fast can we fsync? Daniel Phillips
@ 2015-04-29 22:06             ` OGAWA Hirofumi
  2015-04-30  3:57               ` Mike Galbraith
  2015-04-30  3:50             ` Mike Galbraith
  1 sibling, 1 reply; 160+ messages in thread
From: OGAWA Hirofumi @ 2015-04-29 22:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Daniel Phillips, linux-fsdevel, Theodore Ts'o, linux-kernel,
	tux3
Daniel Phillips <daniel@phunq.net> writes:
> On Wednesday, April 29, 2015 9:42:43 AM PDT, Mike Galbraith wrote:
>>
>> [dbench bakeoff]
>>
>> With dbench v4.00, tux3 seems to be king of the max_latency hill, but
>> btrfs took throughput on my box.  With v3.04, tux3 took 1st place at
>> splashing about in pagecache, but last place at dbench -S.
>>
>> Hohum, curiosity satisfied.
>
> Thanks for that. Please keep in mind, that was our B team, it does a
> full fs sync for every fsync. Maybe a rematch when the shiny new one
> lands? Also, hardware? It looks like a single 7200 RPM disk, but it
> would be nice to know. And it seems, not all dbench 4.0 are equal.
> Mine doesn't have a -B option.
Yeah, I also want to know the hardware. Also, what size of partition?  And
was each test done on a fresh FS (i.e. after mkfs), or was the same FS used
through all tests?
My "hirofumi" branch in the public repo still has a bug that leaves empty
blocks for inodes behind after repeated create and unlink. This bug
fragments the FS very quickly. (This bug is what I'm fixing now.)
If the same FS was used, your test might have hit this bug.
Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29 22:06             ` OGAWA Hirofumi
@ 2015-04-30  3:57               ` Mike Galbraith
  0 siblings, 0 replies; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30  3:57 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o
On Thu, 2015-04-30 at 07:06 +0900, OGAWA Hirofumi wrote:
> Yeah, I also want to know the hardware. Also, what size of partition?  And
> was each test done on a fresh FS (i.e. after mkfs), or was the same FS used
> through all tests?
1TB rust bucket, with new fs each test.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-29 20:40           ` Tux3 Report: How fast can we fsync? Daniel Phillips
  2015-04-29 22:06             ` OGAWA Hirofumi
@ 2015-04-30  3:50             ` Mike Galbraith
  2015-04-30 10:59               ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Mike Galbraith @ 2015-04-30  3:50 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote:
> On Wednesday, April 29, 2015 9:42:43 AM PDT, Mike Galbraith wrote:
> >
> > [dbench bakeoff]
> >
> > With dbench v4.00, tux3 seems to be king of the max_latency hill, but
> > btrfs took throughput on my box.  With v3.04, tux3 took 1st place at
> > splashing about in pagecache, but last place at dbench -S.
> >
> > Hohum, curiosity satisfied.
> 
> Hi Mike,
> 
> Thanks for that. Please keep in mind, that was our B team, it does a
> full fs sync for every fsync. Maybe a rematch when the shiny new one
> lands? Also, hardware? It looks like a single 7200 RPM disk, but it
> would be nice to know. And it seems, not all dbench 4.0 are equal.
> Mine doesn't have a -B option.
Hm, mine came from git://git.samba.org/sahlberg/dbench.git.  The thing
has all kinds of cool options I have no clue how to use.
Yeah, the box is a modern plain jane, loads of CPU, cheap a$$ spinning
rust IO.  It has an SSD, but that's currently occupied by games OS.
I'll eventually either buy a bigger one or steal it from winders.  The
only thing stopping me is my inherent mistrust of storage media that has
no moving parts, but wears out anyway, and with no bearings whining to
warn you :)
> That order of magnitude latency difference is striking. It sounds
> good, but what does it mean? I see a smaller difference here, maybe
> because of running under KVM.
That max_latency thing is flush.
	-Mike
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-30  3:50             ` Mike Galbraith
@ 2015-04-30 10:59               ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 10:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o,
	OGAWA Hirofumi
On Wednesday, April 29, 2015 8:50:57 PM PDT, Mike Galbraith wrote:
> On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote:
>>
>> That order of magnitude latency difference is striking. It sounds
>> good, but what does it mean? I see a smaller difference here, maybe
>> because of running under KVM.
>
> That max_latency thing is flush.
Right, it is just the max run time of all operations, including flush
(dbench's name for fsync I think) which would most probably be the longest
running one. I would like to know how we manage to pull that off. Now
that you mention it, I see a factor of two or so latency win here, not
the order of magnitude that you saw. Maybe KVM introduces some fuzz
for me.
I checked whether fsync = sync is the reason, and no. Well, that goes
on the back burner, we will no doubt figure it out in due course.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
  2015-04-29  2:21 ` Mike Galbraith
@ 2015-04-30  1:46 ` Dave Chinner
  2015-04-30 10:28   ` Daniel Phillips
  2015-05-12 17:41 ` Daniel Phillips
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 160+ messages in thread
From: Dave Chinner @ 2015-04-30  1:46 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o
On Tue, Apr 28, 2015 at 04:13:18PM -0700, Daniel Phillips wrote:
> Greetings,
> 
> This post is dedicated to Ted, who raised doubts a while back about
> whether Tux3 can ever have a fast fsync:
> 
>   https://lkml.org/lkml/2013/5/11/128
>   "Re: Tux3 Report: Faster than tmpfs, what?"
[snip]
> I measured fsync performance using a 7200 RPM disk as a virtual
> drive under KVM, configured with cache=none so that asynchronous
> writes are cached and synchronous writes translate into direct
> writes to the block device.
Yup, a slow single spindle, so fsync performance is determined by
seek latency of the filesystem. Hence the filesystem that "wins"
will be the filesystem that minimises fsync seek latency above all
other considerations.
http://www.spinics.net/lists/kernel/msg1978216.html
So, to demonstrate, I'll run the same tests but using a 256GB
samsung 840 EVO SSD and show how much the picture changes. I didn't
test tux3, you don't make it easy to get or build.
> To focus purely on fsync, I wrote a
> small utility (at the end of this post) that forks a number of
> tasks, each of which continuously appends to and fsyncs its own
> file. For a single task doing 1,000 fsyncs of 1K each, we have:
> 
>    Ext4:  34.34s
>    XFS:   23.63s
>    Btrfs: 34.84s
>    Tux3:  17.24s
   Ext4:   1.94s
   XFS:    2.06s
   Btrfs:  2.06s
All equally fast, so I can't see how tux3 would be much faster here.
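For readers without the original post at hand, the workload described above can be approximated as follows. This is a minimal sketch in Python, not the C utility from the original report (which is not reproduced in this message). The function name `run_fsync_tasks`, the file naming scheme and the argument order (base name, fsyncs per task, task count, inferred from the invocations shown elsewhere in this thread) are assumptions.

```python
import os
import tempfile
import time


def run_fsync_tasks(basename, syncs, tasks, size=1024):
    """Fork `tasks` children; each appends `size` bytes to its own file
    and fsyncs after every append, `syncs` times.

    Returns the elapsed wall-clock seconds for the whole run.
    """
    payload = b"x" * size
    start = time.monotonic()
    pids = []
    for i in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: append + fsync in a loop, then leave via _exit so
            # none of the parent's cleanup code runs in the child.
            status = 1
            try:
                fd = os.open(f"{basename}{i}",
                             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
                for _ in range(syncs):
                    os.write(fd, payload)
                    os.fsync(fd)
                os.close(fd)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    # Parent: wait for every child and fail loudly if any of them failed.
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            raise RuntimeError(f"child {pid} failed (status {wait_status})")
    return time.monotonic() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        elapsed = run_fsync_tasks(os.path.join(scratch, "foo"),
                                  syncs=10, tasks=10)
        print(f"10 tasks x 10 fsyncs: {elapsed:.2f}s")
```

Interpreter overhead means absolute times from this sketch are not comparable with the C results quoted in this thread; only the shape of the workload is the same.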
> Things get more interesting with parallel fsyncs. In this test, each
> task does ten fsyncs and task count scales from ten to ten thousand.
> We see that all tested filesystems are able to combine fsyncs into
> group commits, with varying degrees of success:
> 
>    Tasks:   10      100    1,000    10,000
>    Ext4:   0.79s   0.98s    4.62s    61.45s
>    XFS:    0.75s   1.68s   20.97s   238.23s
>    Btrfs   0.53s   0.78s    3.80s    84.34s
>    Tux3:   0.27s   0.34s    1.00s     6.86s
   Tasks:   10      100    1,000    10,000
   Ext4:   0.05s   0.12s    0.48s     3.99s
   XFS:    0.25s   0.41s    0.96s     4.07s
   Btrfs   0.22s   0.50s    2.86s   161.04s
             (lower is better)
Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster as most of the elapsed time in the test is from
forking the processes that do the IO and fsyncs.
FWIW, btrfs shows its horrible fsync implementation here, burning
huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
and a half minutes in that 10000 fork test so wasn't IO bound at
all.
> Is there any practical use for fast parallel fsync of tens of thousands
> of tasks? This could be useful for a scalable transaction server
> that sits directly on the filesystem instead of a database, as is
> the fashion for big data these days. It certainly can't hurt to know
> that if you need that kind of scaling, Tux3 will do it.
Ext4 and XFS already do that just fine, too, when you use storage
suited to such a workload and you have a sane interface for
submitting tens of thousands of concurrent fsync operations. e.g
http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
> Of course, a pure fsync load could be viewed as somewhat unnatural. We
> also need to know what happens under a realistic load with buffered
> operations mixed with fsyncs. We turn to an old friend, dbench:
> 
> Dbench -t10
> 
>    Tasks:       8           16           32
>    Ext4:    35.32 MB/s   34.08 MB/s   39.71 MB/s
>    XFS:     32.12 MB/s   25.08 MB/s   30.12 MB/s
>    Btrfs:   54.40 MB/s   75.09 MB/s  102.81 MB/s
>    Tux3:    85.82 MB/s  133.69 MB/s  159.78 MB/s
>                   (higher is better)
On a SSD (256GB samsung 840 EVO), running 4.0.0:
   Tasks:        8             16            32
   Ext4:     598.27 MB/s    981.13 MB/s  1233.77 MB/s
   XFS:      884.62 MB/s   1328.21 MB/s  1373.66 MB/s
   Btrfs:    201.64 MB/s    137.55 MB/s   108.56 MB/s
dbench looks *very different* when there is no seek latency,
doesn't it?
> Dbench -t10 -s (all file operations synchronous)
> 
>    Tasks:       8           16           32
>    Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
>    XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
>    Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
>    Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
>                   (higher is better)
    Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
    XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
    Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s
Again, the numbers are completely the other way around on a SSD,
with the conventional filesystems being 5-10x faster than the
WA/COW style filesystem.
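The 5-10x figure can be checked against the synchronous dbench numbers above. A small sketch, with the throughput values copied from the SSD results in this message; the dictionary layout and the helper name `speedup_over` are illustrative only:

```python
# Throughput in MB/s for dbench -t10 -s on the SSD, at 8, 16 and 32 tasks.
ssd_sync = {
    "Ext4": [173.54, 294.41, 424.11],
    "XFS": [172.98, 342.78, 458.87],
    "Btrfs": [36.92, 34.52, 55.19],
}


def speedup_over(results, baseline):
    """Per-task-count throughput ratio of every other entry to `baseline`."""
    base = results[baseline]
    return {
        name: [value / ref for value, ref in zip(values, base)]
        for name, values in results.items()
        if name != baseline
    }


ratios = speedup_over(ssd_sync, "Btrfs")
for name, row in ratios.items():
    print(name, " ".join(f"{r:.1f}x" for r in row))
```

The ratios run from about 4.7x (8 tasks) to about 9.9x (XFS at 16 tasks), which is in line with the range quoted.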
....
> In the full disclosure department, Tux3 is still not properly
> optimized in some areas. One of them is fragmentation: it is not
> very hard to make Tux3 slow down by running long tests. Our current
Oh, that still hasn't been fixed?
Until you sort out how you are going to scale allocation to tens of
TB and not fragment free space over time, fsync performance of the
filesystem is pretty much irrelevant. Changing the allocation
algorithms will fundamentally alter the IO patterns and so all these
benchmarks are essentially meaningless.
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-30  1:46 ` Dave Chinner
@ 2015-04-30 10:28   ` Daniel Phillips
  2015-05-01 15:38     ` Dave Chinner
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-04-30 10:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, tux3, Theodore Ts'o, linux-kernel
On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
>> I measured fsync performance using a 7200 RPM disk as a virtual
>> drive under KVM, configured with cache=none so that asynchronous
>> writes are cached and synchronous writes translate into direct
>> writes to the block device.
>
> Yup, a slow single spindle, so fsync performance is determined by
> seek latency of the filesystem. Hence the filesystem that "wins"
> will be the filesystem that minimises fsync seek latency above all
> other considerations.
>
> http://www.spinics.net/lists/kernel/msg1978216.html
If you want to declare that XFS only works well on solid state disks 
and big storage arrays, that is your business. But if you do, you can no
longer call XFS a general purpose filesystem. And if you would rather 
disparage people who report genuine performance bugs than get down to
fixing them, that is your business too. Don't expect to be able to stop 
the bug reports by bluster.
> So, to demonstrate, I'll run the same tests but using a 256GB
> samsung 840 EVO SSD and show how much the picture changes.
I will go you one better, I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the same
is true of SSD.
> I didn't test tux3, you don't make it easy to get or build.
There is no need to apologize for not testing Tux3, however, it is 
unseemly to throw mud at the same time. Remember, you are the person 
who put so much energy into blocking Tux3 from merging last summer. If
it now takes you a little extra work to build it then it is hard to be 
really sympathetic. Mike apparently did not find it very hard.
>> To focus purely on fsync, I wrote a
>> small utility (at the end of this post) that forks a number of
>> tasks, each of which continuously appends to and fsyncs its own
>> file. For a single task doing 1,000 fsyncs of 1K each, we have:
>> 
>>    Ext4:  34.34s
>>    XFS:   23.63s
>>    Btrfs: 34.84s
>>    Tux3:  17.24s
>
>    Ext4:   1.94s
>    XFS:    2.06s
>    Btrfs:  2.06s
>
> All equally fast, so I can't see how tux3 would be much faster here.
Running the same thing on tmpfs, Tux3 is significantly faster:
     Ext4:   1.40s
     XFS:    1.10s
     Btrfs:  1.56s
     Tux3:   1.07s
>    Tasks:   10      100    1,000    10,000
>    Ext4:   0.05s   0.12s    0.48s     3.99s
>    XFS:    0.25s   0.41s    0.96s     4.07s
>    Btrfs   0.22s   0.50s    2.86s   161.04s
>              (lower is better)
>
> Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> very much faster as most of the elapsed time in the test is from
> forking the processes that do the IO and fsyncs.
You wish. In fact, Tux3 is a lot faster. You must have made a mistake in 
estimating your fork overhead. It is easy to check, just run "syncs foo 
0 10000". I get 0.23 seconds to fork 10,000 processes, create the files 
and exit. Here are my results on tmpfs, triple checked and reproducible:
    Tasks:   10      100    1,000    10,000
    Ext4:   0.05     0.14    1.53     26.56
    XFS:    0.05     0.16    2.10     29.76
    Btrfs:  0.08     0.37    3.18     34.54
    Tux3:   0.02     0.05    0.18      2.16
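The fork-overhead check described above can be sketched roughly as follows. This is a Python approximation under stated assumptions: `time_fork_only` and `fork_share` are hypothetical names, and the children here exit immediately instead of creating files as the original utility does, so it measures only the fork and wait cost.

```python
import os
import time


def time_fork_only(tasks):
    """Fork `tasks` children that exit at once; return wall-clock seconds."""
    start = time.monotonic()
    pids = []
    for _ in range(tasks):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child does no IO at all
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return time.monotonic() - start


def fork_share(fork_seconds, total_seconds):
    """Fraction of a run's elapsed time that is attributable to forking."""
    return fork_seconds / total_seconds


if __name__ == "__main__":
    baseline = time_fork_only(100)
    print(f"fork-only baseline for 100 tasks: {baseline:.3f}s")
    # Figures from this message: 0.23 s fork-only, versus 2.16 s (Tux3)
    # and 26.56 s (Ext4) for the full 10,000-task run on tmpfs.
    print(f"Tux3 fork share: {fork_share(0.23, 2.16):.1%}")
    print(f"Ext4 fork share: {fork_share(0.23, 26.56):.1%}")
```

With the numbers in this message, forking accounts for roughly a tenth of the Tux3 run and under one percent of the Ext4 run.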
Note: you should recheck your final number for Btrfs. I have seen Btrfs 
fall off the rails and take wildly longer on some tests just like that.
We know Btrfs has corner case issues, I don't think they deny it. 
Unlike you, Chris Mason is a gentleman when faced with issues. Instead 
of insulting his colleagues and hurling around the sort of abuse that 
has gained LKML its current unenviable reputation, he gets down to work 
and fixes things.
You should do that too, your own house is not in order. XFS has major 
issues. One easily reproducible one is a denial of service during the 
10,000 task test where it takes multiple seconds to cat small files. I 
saw XFS do this on both spinning disk and tmpfs, and I have seen it 
hang for minutes trying to list a directory. I looked a bit into it, and 
I see that you are blocking for aeons trying to acquire a lock in open.
Here is an example. While doing "sync6 fs/foo 10 10000":
time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
real    0m2.282s
user    0m0.000s
sys     0m0.000s
You and I both know the truth: Ext4 is the only really reliable general 
purpose filesystem on Linux at the moment. XFS is definitely not, I 
have seen ample evidence with my own eyes. What you need is people 
helping you fix your issues instead of making your colleagues angry at 
you with your incessant attacks.
> FWIW, btrfs shows its horrible fsync implementation here, burning
> huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
> and a half minutes in that 10000 fork test so wasn't IO bound at
> all.
Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high 
task counts. It is actually amazing the progress Btrfs has made in 
performance. I for one appreciate the work they are doing and I admire 
the way Chris conducts both himself and his project. I wish you were 
more like Chris, and I wish I was for that matter.
I agree that Btrfs uses too much CPU, but there is no need to be rude 
about it. I think the Btrfs team knows how to use a profiler.
>> Is there any practical use for fast parallel fsync of tens of thousands
>> of tasks? This could be useful for a scalable transaction server
>> that sits directly on the filesystem instead of a database, as is
>> the fashion for big data these days. It certainly can't hurt to know
>> that if you need that kind of scaling, Tux3 will do it.
>
> Ext4 and XFS already do that just fine, too, when you use storage
> suited to such a workload and you have a sane interface for
> submitting tens of thousands of concurrent fsync operations. e.g
>
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
Tux3 turns in really great performance with an ordinary, cheap spinning 
disk using standard Posix ops. It is not for you to tell people they 
don't care about that, and it is wrong for you to imply that we only 
perform well on spinning disk - you don't know that, and it's not true.
By the way, I like your asynchronous fsync, nice work. It by no means
obviates the need for a fast implementation of the standard operation.
> On a SSD (256GB samsung 840 EVO), running 4.0.0:
>
>    Tasks:       8           16           32
>    Ext4:    598.27 MB/s    981.13 MB/s 1233.77 MB/s
>    XFS:     884.62 MB/s   1328.21 MB/s 1373.66 MB/s
>    Btrfs:   201.64 MB/s    137.55 MB/s  108.56 MB/s
>
> dbench looks *very different* when there is no seek latency,
> doesn't it?
It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
for me earlier this evening. It is rare but it happens. I rebooted and 
got sane numbers. Running dbench -t10 on tmpfs I get:
     Tasks:       8            16            32
     Ext4:    660.69 MB/s   708.81 MB/s   720.12 MB/s
     XFS:     692.01 MB/s   388.53 MB/s   134.84 MB/s
     Btrfs:   229.66 MB/s   341.27 MB/s   377.97 MB/s
     Tux3:   1147.12 MB/s  1401.61 MB/s  1283.74 MB/s
Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
that one many times because I don't want to give you an inaccurate 
report.
Tux3 turned in a great performance. I am not pleased with the negative 
scaling at 32 threads, but it still finishes way ahead.
>> Dbench -t10 -s (all file operations synchronous)
>> 
>>    Tasks:       8           16           32
>>    Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
>>    XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
>>    Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
>>    Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
>>                   (higher is better)
>
>     Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
>     XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
>     Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s
>
> Again, the numbers are completely the other way around on a SSD,
> with the conventional filesystems being 5-10x faster than the
> WA/COW style filesystem.
I wouldn't be so sure about that...
     Tasks:       8            16            32
     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s
     Tux3:    198.49 MB/s   279.00 MB/s   318.41 MB/s
>> In the full disclosure department, Tux3 is still not properly
>> optimized in some areas. One of them is fragmentation: it is not
>> very hard to make Tux3 slow down by running long tests. Our current
>
> Oh, that still hasn't been fixed?
Count your blessings while you can.
> Until you sort out how you are going to scale allocation to tens of
> TB and not fragment free space over time, fsync performance of the
> filesystem is pretty much irrelevant. Changing the allocation
> algorithms will fundamentally alter the IO patterns and so all these
> benchmarks are essentially meaningless.
Ahem, are you the same person for whom fsync was the most important 
issue in the world last time the topic came up, to the extent of 
spreading around FUD and entirely ignoring the great work we had 
accomplished for regular file operations? I said then that when we got 
around to a proper fsync it would be competitive. Now here it is, so you 
want to change the topic. I understand.
Honestly, you would be a lot better off investigating why our fsync 
algorithm is so good.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-30 10:28   ` Daniel Phillips
@ 2015-05-01 15:38     ` Dave Chinner
  2015-05-01 23:20       ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Dave Chinner @ 2015-05-01 15:38 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o
On Thu, Apr 30, 2015 at 03:28:13AM -0700, Daniel Phillips wrote:
> On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
> >>I measured fsync performance using a 7200 RPM disk as a virtual
> >>drive under KVM, configured with cache=none so that asynchronous
> >>writes are cached and synchronous writes translate into direct
> >>writes to the block device.
> >
> >Yup, a slow single spindle, so fsync performance is determined by
> >seek latency of the filesystem. Hence the filesystem that "wins"
> >will be the filesystem that minimises fsync seek latency above
> >all other considerations.
> >
> >http://www.spinics.net/lists/kernel/msg1978216.html
> 
> If you want to declare that XFS only works well on solid state
> disks and big storage arrays, that is your business. But if you
> do, you can no longer call XFS a general purpose filesystem. And
Well, yes - I never claimed XFS is a general purpose filesystem.  It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...
> >So, to demonstrate, I'll run the same tests but using a 256GB
> >samsung 840 EVO SSD and show how much the picture changes.
> 
> I will go you one better, I ran a series of fsync tests using
> tmpfs, and I now have a very clear picture of how the picture
> changes. The executive summary is: Tux3 is still way faster, and
> still scales way better to large numbers of tasks. I have every
> confidence that the same is true of SSD.
/dev/ramX can't be compared to an SSD.  Yes, they both have low
seek/IO latency but they have very different dispatch and IO
concurrency models.  One is synchronous, the other is fully
asynchronous.
This is an important distinction, as we'll see later on....
> >I didn't test tux3, you don't make it easy to get or build.
> 
> There is no need to apologize for not testing Tux3, however, it is
> unseemly to throw mud at the same time. Remember, you are the
These trees:
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git
have not been updated for 11 months. I thought tux3 had died long
ago.
You should keep them up to date, and send patches for xfstests to
support tux3, and then you'll get a lot more people running,
testing and breaking tux3....
> >>To focus purely on fsync, I wrote a
> >>small utility (at the end of this post) that forks a number of
> >>tasks, each of which continuously appends to and fsyncs its own
> >>file. For a single task doing 1,000 fsyncs of 1K each, we have:
.....
> >All equally fast, so I can't see how tux3 would be much faster here.
> 
> Running the same thing on tmpfs, Tux3 is significantly faster:
> 
>     Ext4:   1.40s
>     XFS:    1.10s
>     Btrfs:  1.56s
>     Tux3:   1.07s
3% is not "significantly faster". It's within run to run variation!
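The arithmetic behind that figure, as a short sketch. The times are copied from the tmpfs table quoted above; the helper name `percent_faster` is illustrative.

```python
def percent_faster(baseline_s, candidate_s):
    """How much less time `candidate_s` takes than `baseline_s`, in percent."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# Single-task tmpfs times in seconds, from the quoted table.
others = {"Ext4": 1.40, "XFS": 1.10, "Btrfs": 1.56}
tux3 = 1.07

for name, seconds in others.items():
    print(f"Tux3 vs {name}: {percent_faster(seconds, tux3):.1f}% faster")
```

The gap to XFS comes out at about 2.7%, which is the roughly 3% referred to here; the gaps to Ext4 and Btrfs are larger. Whether any of these differences exceed run to run variation cannot be decided from single measurements.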
> >   Tasks:   10      100    1,000    10,000
> >   Ext4:   0.05s   0.12s    0.48s     3.99s
> >   XFS:    0.25s   0.41s    0.96s     4.07s
> >   Btrfs   0.22s   0.50s    2.86s   161.04s
> >             (lower is better)
> >
> >Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> >very much faster as most of the elapsed time in the test is from
> >forking the processes that do the IO and fsyncs.
> 
> You wish. In fact, Tux3 is a lot faster.
Yes, it's easy to be fast when you have simple, naive algorithms and
an empty filesystem.
> triple checked and reproducible:
> 
>    Tasks:   10      100    1,000    10,000
>    Ext4:   0.05     0.14    1.53     26.56
>    XFS:    0.05     0.16    2.10     29.76
>    Btrfs:  0.08     0.37    3.18     34.54
>    Tux3:   0.02     0.05    0.18      2.16
Yet I can't reproduce those XFS or ext4 numbers you are quoting
there. eg. XFS on a 4GB ram disk:
$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done
real    0m0.030s
user    0m0.000s
sys     0m0.014s
real    0m0.031s
user    0m0.008s
sys     0m0.157s
real    0m0.305s
user    0m0.029s
sys     0m1.555s
real    0m3.624s
user    0m0.219s
sys     0m17.631s
$
That's roughly 10x faster than your numbers. Can you describe your
test setup in detail? e.g.  post the full log from block device
creation to benchmark completion so I can reproduce what you are
doing exactly?
> Note: you should recheck your final number for Btrfs. I have seen
> Btrfs fall off the rails and take wildly longer on some tests just
> like that.
Completely reproducible:
$ sudo mkfs.btrfs -f /dev/vdc
Btrfs v3.16.2
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdc
        nodesize 16384 leafsize 16384 sectorsize 4096 size 500.00TiB
$ sudo mount /dev/vdc /mnt/test
$ sudo chmod 777 /mnt/test
$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done
real    0m0.068s
user    0m0.000s
sys     0m0.061s
real    0m0.563s
user    0m0.001s
sys     0m2.047s
real    0m2.851s
user    0m0.040s
sys     0m24.503s
real    2m38.713s
user    0m0.533s
sys     38m34.831s
Same result - ~160s burning all 16 CPUs, as can be seen by the
system time.
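The CPU count can be read off the `time` output by dividing CPU time (user + sys) by wall time. A minimal sketch of that calculation, using the figures from the 10,000-task btrfs run above; `to_seconds` and `cpus_busy` are illustrative helper names.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '2m38.713s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpus_busy(real, user, sys):
    """Average number of CPUs kept busy over the run."""
    return (to_seconds(user) + to_seconds(sys)) / to_seconds(real)


if __name__ == "__main__":
    busy = cpus_busy(real="2m38.713s", user="0m0.533s", sys="38m34.831s")
    print(f"average CPUs busy: {busy:.1f}")
```

For this run the result is roughly 14.6 CPUs busy on average, almost all of it system time, on the 16-CPU machine described here.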
And even on a 4GB ram disk, the 10000 process test comes in at:
real    0m35.567s
user    0m0.707s
sys     6m1.922s
That's the same wall time as your test, but the CPU burn on my
machine is still clearly evident. You indicated that it's not doing
this on your machine, so I don't think we can really use btrfs
numbers for comparison purposes if it is behaving so differently on
different machines....
[snip]
> One easily reproducible one is a denial of service
> during the 10,000 task test where it takes multiple seconds to cat
> small files. I saw XFS do this on both spinning disk and tmpfs, and
> I have seen it hang for minutes trying to list a directory. I looked
> a bit into it, and I see that you are blocking for aeons trying to
> acquire a lock in open.
Yes, that's the usual case when XFS is waiting on buffer readahead
IO completion. The latency of which is completely determined by
block layer queuing and scheduling behaviour. And the block device
queue is being dominated by the 10,000 concurrent write processes
you just ran.....
"Doctor, it hurts when I do this!"
[snip]
> You and I both know the truth: Ext4 is the only really reliable
> general purpose filesystem on Linux at the moment.
BWAHAHAHAHAHAHAH-*choke*
*cough*
*cough*
/me wipes tears from his eyes
That's the funniest thing I've read in a long time :)
[snip]
> >On a SSD (256GB samsung 840 EVO), running 4.0.0:
> >
> >   Tasks:       8           16           32
> >   Ext4:    598.27 MB/s    981.13 MB/s 1233.77 MB/s
> >   XFS:     884.62 MB/s   1328.21 MB/s 1373.66 MB/s
> >   Btrfs:   201.64 MB/s    137.55 MB/s  108.56 MB/s
> >
> >dbench looks *very different* when there is no seek latency,
> >doesn't it?
> 
> It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
> for me earlier this evening. It is rare but it happens. I rebooted
> and got sane numbers. Running dbench -t10 on tmpfs I get:
> 
>     Tasks:       8            16            32
>     Ext4:    660.69 MB/s   708.81 MB/s   720.12 MB/s
>     XFS:     692.01 MB/s   388.53 MB/s   134.84 MB/s
>     Btrfs:   229.66 MB/s   341.27 MB/s   377.97 MB/s
>     Tux3:   1147.12 MB/s  1401.61 MB/s  1283.74 MB/s
> 
> Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
> that one many times because I don't want to give you an inaccurate
> report.
I can't reproduce those numbers, either. On /dev/ram0:
    Tasks:       8            16            32
    Ext4:    1416.11 MB/s   1585.81 MB/s   1406.18 MB/s
    XFS:     2580.58 MB/s   1367.48 MB/s    994.46 MB/s
    Btrfs:    151.89 MB/s     84.88 MB/s     73.16 MB/s
Still, that negative XFS scalability shouldn't be occurring - it
should level off and be much flatter if everything is working
correctly.
<ding>
Ah.
Ram disks and synchronous IO.....
The XFS journal is a completely asynchronous IO engine and the
synchronous IO done by the ram disk really screws with the
concurrency model. There are journal write aggregation optimisations
that are based on the "buffer under IO" state detection, which is
completely skipped when journal IO is synchronous and completed in
the submission context. This problem doesn't occur on actual storage
devices where IO is asynchronous.
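The effect can be illustrated with a toy model. This is only a sketch
under simplifying assumptions, not XFS code: a single log buffer, commits
that arrive while a write is in flight are batched into the next write,
and a ram disk is modelled as an IO latency of zero, so no commit ever
observes a write in flight. The function name and the time units are
made up for the illustration.

```python
def count_journal_writes(arrivals, io_latency):
    """Count journal writes issued for commits arriving at the given times.

    A commit that arrives while a write is in flight joins a pending batch,
    which goes out as one write when the in-flight write completes. A commit
    that arrives while the journal is idle is written immediately.
    """
    writes = 0
    busy_until = float("-inf")  # completion time of the in-flight write
    pending = False             # True if a batch is waiting for that write
    for t in sorted(arrivals):
        if pending and t >= busy_until:
            # The in-flight write completed earlier; the batch went out then.
            writes += 1
            busy_until += io_latency
            pending = False
        if t < busy_until:
            pending = True      # aggregate into the next write
        else:
            writes += 1         # journal idle: issue a write right away
            busy_until = t + io_latency
    if pending:
        writes += 1             # flush the final batch
    return writes

commits = range(0, 1000, 10)    # 100 commits, one every 10 time units

print(count_journal_writes(commits, io_latency=100))  # asynchronous device
print(count_journal_writes(commits, io_latency=0))    # completes at submission
```

With a non-zero latency most commits share a write; with zero latency
every commit pays for its own write, which is the aggregation loss
described above.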
So, yes, dbench can trigger an interesting behaviour in XFS, but
it's well understood and doesn't actually affect normal storage
devices. If you need a volatile filesystem for performance
reasons then tmpfs is what you want, not XFS....
[
	Feel free to skip the detail:
	Let's go back to that SSD, which does asynchronous IO and so
	the journal operates fully asynchronously:
	$ for i in 8 16 32 64 128 256; do dbench -t10  $i -D /mnt/test; done
	Throughput 811.806 MB/sec  8 clients  8 procs  max_latency=12.152 ms
	Throughput 1285.47 MB/sec  16 clients  16 procs  max_latency=22.880 ms
	Throughput 1516.22 MB/sec  32 clients  32 procs  max_latency=73.381 ms
	Throughput 1724.57 MB/sec  64 clients  64 procs  max_latency=256.681 ms
	Throughput 2046.91 MB/sec  128 clients  128 procs max_latency=1068.169 ms
	Throughput 1895.4 MB/sec  256 clients  256 procs max_latency=3157.738 ms
	So performance improves out to 128 processes and then the
	SSD runs out of capacity - it's doing >400MB/s write IO at
	128 clients. That makes latency blow out as we add more
	load, so it doesn't go any faster and we start to back up on
	the log. Hence we slowly start to go backwards as client
	count continues to increase and contention builds up on
	global wait queues.
	Now, XFS has 8 log buffers and so can issue 8 concurrent
	journal writes. Let's run dbench with fewer processes on a
	ram disk, and see what happens as we increase the number of
	processes doing IO and hence triggering journal writes:
	$ for i in 1 2 4 6 8; do dbench -t10  $i -D /mnt/test |grep Throughput; done
	Throughput 653.163 MB/sec  1 clients  1 procs  max_latency=0.355 ms
	Throughput 1273.65 MB/sec  2 clients  2 procs  max_latency=3.947 ms
	Throughput 2189.19 MB/sec  4 clients  4 procs  max_latency=7.582 ms
	Throughput 2318.33 MB/sec  6 clients  6 procs  max_latency=8.023 ms
	Throughput 2212.85 MB/sec  8 clients  8 procs  max_latency=9.120 ms
	Yeah, ok, we scale out to 4 processes, then level off.
	That's going to be limited by allocation concurrency during
	writes, not the journal (the default is 4 AGs on a
	filesystem so small). Let's make 16 AGs, cause seeks don't
	matter on a ram disk.
	$ sudo mkfs.xfs -f -d agcount=16 /dev/ram0
	....
	$ for i in 1 2 4 6 8; do dbench -t10  $i -D /mnt/test |grep Throughput; done
	Throughput 656.189 MB/sec  1 clients  1 procs  max_latency=0.565 ms
	Throughput 1277.25 MB/sec  2 clients  2 procs  max_latency=3.739 ms
	Throughput 2350.73 MB/sec  4 clients  4 procs  max_latency=5.126 ms
	Throughput 2754.3 MB/sec  6 clients  6 procs  max_latency=8.063 ms
	Throughput 3135.11 MB/sec  8 clients  8 procs  max_latency=6.746 ms
	Yup, as expected we continue to increase performance out
	to 8 processes now that there isn't an allocation
	concurrency limit being hit.
	What happens as we pass 8 processes now?
	$ for i in 4 8 12 16; do dbench -t10  $i -D /mnt/test |grep Throughput; done
	Throughput 2277.53 MB/sec  4 clients  4 procs  max_latency=5.778 ms
	Throughput 3070.3 MB/sec  8 clients  8 procs  max_latency=7.808 ms
	Throughput 2555.29 MB/sec  12 clients  12 procs  max_latency=8.518 ms
	Throughput 1868.96 MB/sec  16 clients  16 procs  max_latency=14.193 ms
	$
	As expected, past 8 processes performance tails off. The
	journal state machine does not schedule after dispatching
	the journal IO, so other threads get no chance to aggregate
	journal writes into the next active log buffer: there is no
	"under IO" stage in the state machine for it to trigger
	log write aggregation delays off.
	I'd completely forgotten about this - I discovered it 3 or 4
	years ago, and then simply stopped using ramdisks for
	performance testing because I could get better performance
	from XFS on highly concurrent workloads from real storage.
]
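For anyone repeating these runs, the peak is easier to spot if the
Throughput lines are parsed rather than read by eye. A small sketch that
does this for the SSD run quoted in the detail above; the regular
expression assumes the line format shown there and the names are
illustrative:

```python
import re

# Matches lines such as:
# Throughput 811.806 MB/sec  8 clients  8 procs  max_latency=12.152 ms
LINE = re.compile(
    r"Throughput\s+([\d.]+)\s+MB/sec\s+(\d+)\s+clients\s+(\d+)\s+procs\s+"
    r"max_latency=([\d.]+)\s+ms"
)

def parse_dbench(text):
    """Return (clients, MB/s, max latency in ms) for each Throughput line."""
    return [
        (int(m.group(2)), float(m.group(1)), float(m.group(4)))
        for m in LINE.finditer(text)
    ]

SSD_RUN = """\
Throughput 811.806 MB/sec  8 clients  8 procs  max_latency=12.152 ms
Throughput 1285.47 MB/sec  16 clients  16 procs  max_latency=22.880 ms
Throughput 1516.22 MB/sec  32 clients  32 procs  max_latency=73.381 ms
Throughput 1724.57 MB/sec  64 clients  64 procs  max_latency=256.681 ms
Throughput 2046.91 MB/sec  128 clients  128 procs max_latency=1068.169 ms
Throughput 1895.4 MB/sec  256 clients  256 procs max_latency=3157.738 ms
"""

results = parse_dbench(SSD_RUN)
peak = max(results, key=lambda row: row[1])  # row with the highest MB/s
print(peak)  # (128, 2046.91, 1068.169)
```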
> >>Dbench -t10 -s (all file operations synchronous)
> >>
> >>   Tasks:       8           16           32
> >>   Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
> >>   XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
> >>   Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
> >>   Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
> >>                  (higher is better)
> >
> >    Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
> >    XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
> >    Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s
> >
> >Again, the numbers are completely the other way around on a SSD,
> >with the conventional filesystems being 5-10x faster than the
> >WA/COW style filesystem.
> 
> I wouldn't be so sure about that...
> 
>     Tasks:       8            16            32
>     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
>     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
>     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s
>     Tux3:    198.49 MB/s   279.00 MB/s   318.41 MB/s
     Ext4:     807.21 MB/s    1089.89 MB/s   867.55 MB/s
     XFS:      997.77 MB/s    1011.51 MB/s   876.49 MB/s
     Btrfs:     55.66 MB/s      56.77 MB/s    60.30 MB/s
Numbers are again very different for XFS and ext4 on /dev/ramX on my
system. Need to work out why yours are so low....
> >Until you sort out how you are going to scale allocation to tens of
> >TB and not fragment free space over time, fsync performance of the
> >filesystem is pretty much irrelevant. Changing the allocation
> >algorithms will fundamentally alter the IO patterns and so all these
> >benchmarks are essentially meaningless.
> 
> Ahem, are you the same person for whom fsync was the most important
> issue in the world last time the topic came up, to the extent of
> spreading around FUD and entirely ignoring the great work we had
> accomplished for regular file operations?
Actually, I don't remember any discussions about fsync.
Things I remember that needed addressing are:
	- the lack of ENOSPC detection
	- the writeback integration issues
	- the code cleanliness issues (ifdef mess, etc)
	- the page forking design problems
	- the lack of scalable inode and space allocation
	  algorithms.
Those are the things I remember, and fsync performance pales in
comparison to those.
> I said then that when we
> got around to a proper fsync it would be competitive. Now here it
> is, so you want to change the topic. I understand.
I haven't changed the topic, just the storage medium. The simple
fact is that the world is moving away from slow SATA storage at a
pretty rapid pace and it's mostly going solid state. Spinning disks
are also changing - they are going to ZBC based SMR, which is a
completely different problem space which doesn't even appear to be
on the tux3 radar....
So where does tux3 fit into a storage future of byte addressable
persistent memory and ZBC based SMR devices?
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
- * Re: Tux3 Report: How fast can we fsync?
  2015-05-01 15:38     ` Dave Chinner
@ 2015-05-01 23:20       ` Daniel Phillips
  2015-05-02  1:07         ` David Lang
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-01 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, tux3, Theodore Ts'o
On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>
> Well, yes - I never claimed XFS is a general purpose filesystem.  It
> is a high performance filesystem. It is also becoming more relevant
> to general purpose systems as low cost storage gains capabilities
> that used to be considered the domain of high performance storage...
OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.
>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>> samsung 840 EVO SSD and show how much the picture changes.
>>
>> I will go you one better, I ran a series of fsync tests using
>> tmpfs, and I now have a very clear picture of how the picture
>> changes. The executive summary is: Tux3 is still way faster, and
>> still scales way better to large numbers of tasks. I have every
>> confidence that the same is true of SSD.
>
> /dev/ramX can't be compared to an SSD.  Yes, they both have low
> seek/IO latency but they have very different dispatch and IO
> concurrency models.  One is synchronous, the other is fully
> asynchronous.
I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.
> This is an important distinction, as we'll see later on....
I regard it as predictive of Tux3 performance on NVM.
> These trees:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git
>
> have not been updated for 11 months. I thought tux3 had died long
> ago.
>
> You should keep them up to date, and send patches for xfstests to
> support tux3, and then you'll get a lot more people running,
> testing and breaking tux3....
People are starting to show up to do testing now, pretty much the first
time, so we must do some housecleaning. It is gratifying that Tux3 never
broke for Mike, but of course it will assert just by running out of
space at the moment. As you rightly point out, that fix is urgent and is
my current project.
>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>
>>     Ext4:   1.40s
>>     XFS:    1.10s
>>     Btrfs:  1.56s
>>     Tux3:   1.07s
>
> 3% is not "significantly faster". It's within run to run variation!
You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:
       Ext4:   1.59s
       XFS:    1.11s
       Btrfs:  1.70s
       Tux3:   1.11s
A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.
>> You wish. In fact, Tux3 is a lot faster. ...
>
> Yes, it's easy to be fast when you have simple, naive algorithms and
> an empty filesystem.
No it isn't or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.
There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work, you know as well as anyone how hard it is. However your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.
>> triple checked and reproducible:
>>
>>    Tasks:   10      100    1,000    10,000
>>    Ext4:   0.05     0.14    1.53     26.56
>>    XFS:    0.05     0.16    2.10     29.76
>>    Btrfs:  0.08     0.37    3.18     34.54
>>    Tux3:   0.02     0.05    0.18      2.16
>
> Yet I can't reproduce those XFS or ext4 numbers you are quoting
> there. eg. XFS on a 4GB ram disk:
>
> $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
> ./test-fsync /mnt/test/foo 10 $i; done
>
> real    0m0.030s
> user    0m0.000s
> sys     0m0.014s
>
> real    0m0.031s
> user    0m0.008s
> sys     0m0.157s
>
> real    0m0.305s
> user    0m0.029s
> sys     0m1.555s
>
> real    0m3.624s
> user    0m0.219s
> sys     0m17.631s
> $
>
> That's roughly 10x faster than your numbers. Can you describe your
> test setup in detail? e.g.  post the full log from block device
> creation to benchmark completion so I can reproduce what you are
> doing exactly?
Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.
Clearly the curve is the same: your numbers increase 10x going from 100
to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
significantly flatter and starts from a lower base, so it ends with a
really wide gap. You will need to take my word for that for now. I
promise that the beer is on me should you not find that reproducible.
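The step ratios can be checked directly from the timings posted in this
thread. A minimal sketch follows; the two runs come from different
machines, so only the shape of each curve is comparable, and the
variable names are illustrative:

```python
def growth_factors(times):
    """Ratio of each timing to the previous one (each step is 10x the tasks)."""
    return [round(later / earlier, 1) for earlier, later in zip(times, times[1:])]

# Wall-clock seconds for 10, 100, 1000 and 10000 tasks.
xfs_ramdisk = [0.030, 0.031, 0.305, 3.624]  # XFS on a 4GB ram disk, quoted above
tux3_tmpfs = [0.02, 0.05, 0.18, 2.16]       # Tux3 row of the table quoted above

print("XFS: ", growth_factors(xfs_ramdisk))
print("Tux3:", growth_factors(tux3_tmpfs))
```

For the XFS run the last two steps come out near the 10x and 12x cited
above.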
The repository delay is just about not bothering Hirofumi for a merge
while he finishes up his inode table anti-fragmentation work.
>> Note: you should recheck your final number for Btrfs. I have seen
>> Btrfs fall off the rails and take wildly longer on some tests just
>> like that.
>
> Completely reproducible...
I believe you. I found that Btrfs does that way too much. So does XFS
from time to time, when it gets up into lots of tasks. Read starvation
on XFS is much worse than Btrfs, and XFS also exhibits some very
undesirable behavior with initial file create. Note: Ext4 and Tux3 have
roughly zero read starvation in any of these tests, which pretty much
proves it is not just a block scheduler thing. I don't think this is
something you should dismiss.
>> One easily reproducible one is a denial of service
>> during the 10,000 task test where it takes multiple seconds to cat
>> small files. I saw XFS do this on both spinning disk and tmpfs, and
>> I have seen it hang for minutes trying to list a directory. I looked
>> a bit into it, and I see that you are blocking for aeons trying to
>> acquire a lock in open.
>
> Yes, that's the usual case when XFS is waiting on buffer readahead
> IO completion. The latency of which is completely determined by
> block layer queuing and scheduling behaviour. And the block device
> queue is being dominated by the 10,000 concurrent write processes
> you just ran.....
>
> "Doctor, it hurts when I do this!"
It only hurts XFS (and sometimes Btrfs) when you do that. I believe
your theory is wrong about the cause, or at least Ext4 and Tux3 skirt
that issue somehow. We definitely did not do anything special to avoid
it.
>> You and I both know the truth: Ext4 is the only really reliable
>> general purpose filesystem on Linux at the moment.
>
> That's the funniest thing I've read in a long time :)
I'm glad I could lighten your day, but I remain uncomfortable with the
read starvation issues and the massive long lock holds I see. Perhaps
XFS is stable if you don't push too many tasks at it.
[snipped the interesting ramdisk performance bug hunt]
OK, fair enough, you get a return match on SSD when I get hold of one.
>> I wouldn't be so sure about that...
>>
>>     Tasks:       8            16            32
>>     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
>>     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
>>     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s ...
>
>      Ext4:     807.21 MB/s    1089.89 MB/s   867.55 MB/s
>      XFS:      997.77 MB/s    1011.51 MB/s   876.49 MB/s
>      Btrfs:     55.66 MB/s      56.77 MB/s    60.30 MB/s
>
> Numbers are again very different for XFS and ext4 on /dev/ramX on my
> system. Need to work out why yours are so low....
Your machine makes mine look like a PCjr.
>> Ahem, are you the same person for whom fsync was the most important
>> issue in the world last time the topic came up, to the extent of
>> spreading around FUD and entirely ignoring the great work we had
>> accomplished for regular file operations? ...
>
> Actually, I don't remember any discussions about fsync.
Here:
   http://www.spinics.net/lists/linux-fsdevel/msg64825.html
   (Re: Tux3 Report: Faster than tmpfs, what?)
It still rankles that you took my innocent omission of the detail that
Hirofumi had removed the fsyncs from dbench and turned it into a major
FUD attack, casting aspersions on our integrity. We removed the fsyncs
because we weren't interested in measuring something we had not
implemented yet, it is that simple.
That, plus Ted's silly pronouncements that I could not answer at the
time, is what motivated me to design and implement an fsync that would
not just be competitive, but would righteously kick the tails of XFS
and Ext4, which is done. If I were you, I would wait for the code drop,
verify it, and then give credit where credit is due. Then I would
figure out how to make XFS work like that.
> Things I remember that needed addressing are:
> 	- the lack of ENOSPC detection
> 	- the writeback integration issues
> 	- the code cleanliness issues (ifdef mess, etc)
> 	- the page forking design problems
> 	- the lack of scalable inode and space allocation
> 	  algorithms.
>
> Those are the things I remember, and fsync performance pales in
> comparison to those.
With the exception of "page forking design", it is the same list as
ours, with progress on all of them. I freely admit that optimized fsync
was not on the critical path, but you made it an issue so I addressed
it. Anyway, I needed to hone my kernel debugging skills and that worked
out well.
>> I said then that when we
>> got around to a proper fsync it would be competitive. Now here it
>> is, so you want to change the topic. I understand.
>
> I haven't changed the topic, just the storage medium. The simple
> fact is that the world is moving away from slow sata storage at a
> pretty rapid pace and it's mostly going solid state. Spinning disks
> also changing - they are going to ZBC based SMR, which is a
> completely different problem space which doesn't even appear to be
> on the tux3 radar....
>
> So where does tux3 fit into a storage future of byte addressable
> persistent memory and ZBC based SMR devices?
You won't convince us to abandon spinning rust, it's going to be around
a lot longer than you think. Obviously, we care about SSD and I believe
you will find that Tux3 is more than competitive there. We lay things
out in a very erase block friendly way. We need to address the volume
wrap issue of course, and that is in progress. This is much easier than
spinning disk.
Tux3's redirect-on-write[1] is obviously a natural for SMR, however
I will not get excited about it unless a vendor waves money.
Regards,
Daniel
[1] Copy-on-write is a misnomer because there is no copy. The proper
term is "redirect-on-write".
- * Re: Tux3 Report: How fast can we fsync?
  2015-05-01 23:20       ` Daniel Phillips
@ 2015-05-02  1:07         ` David Lang
  2015-05-02 10:26           ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-02  1:07 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o
On Fri, 1 May 2015, Daniel Phillips wrote:
> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>
>> Well, yes - I never claimed XFS is a general purpose filesystem.  It
>> is a high performance filesystem. It is also becoming more relevant
>> to general purpose systems as low cost storage gains capabilities
>> that used to be considered the domain of high performance storage...
>
> OK. Well, Tux3 is general purpose and that means we care about single
> spinning disk and small systems.
keep in mind that if you optimize only for the small systems you may not scale 
as well to the larger ones.
>>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>>> samsung 840 EVO SSD and show how much the picture changes.
>>>
>>> I will go you one better, I ran a series of fsync tests using
>>> tmpfs, and I now have a very clear picture of how the picture
>>> changes. The executive summary is: Tux3 is still way faster, and
>>> still scales way better to large numbers of tasks. I have every
>>> confidence that the same is true of SSD.
>>
>> /dev/ramX can't be compared to an SSD.  Yes, they both have low
>> seek/IO latency but they have very different dispatch and IO
>> concurrency models.  One is synchronous, the other is fully
>> asynchronous.
>
> I had ram available and no SSD handy to abuse. I was interested in
> measuring the filesystem overhead with the device factored out. I
> mounted loopback on a tmpfs file, which seems to be about the same as
> /dev/ram, maybe slightly faster, but much easier to configure. I ran
> some tests on a ramdisk just now and was mortified to find that I have
> to reboot to empty the disk. It would take a compelling reason before
> I do that again.
>
>> This is an important distinction, as we'll see later on....
>
> I regard it as predictive of Tux3 performance on NVM.
Per the ramdisk bit, possibly not as relevant as you may think. This is why it's
good to test on as many different systems as you can. As you run into different 
types of performance you can then pick ones to keep and test all the time.
Single spinning disk is interesting now, but will be less interesting later. 
multiple spinning disks in an array of some sort is going to remain very 
interesting for quite a while.
now, some things take a lot more work to test than others. Getting time on a 
system with a high performance, high capacity RAID is hard, but getting hold of 
an SSD from Fry's is much easier. If it's a budget item, ping me directly and I 
can donate one for testing (the cost of a drive is within my unallocated budget 
and using that to improve Linux is worthwhile)
>>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>>
>>>     Ext4:   1.40s
>>>     XFS:    1.10s
>>>     Btrfs:  1.56s
>>>     Tux3:   1.07s
>>
>> 3% is not "significantly faster". It's within run to run variation!
>
> You are right, XFS and Tux3 are within experimental error for single
> syncs on the ram disk, while Ext4 and Btrfs are way slower:
>
>       Ext4:   1.59s
>       XFS:    1.11s
>       Btrfs:  1.70s
>       Tux3:   1.11s
>
> A distinct performance gap appears between Tux3 and XFS as parallel
> tasks increase.
It will be interesting to see if this continues to be true on more systems. I 
hope it does.
>>> You wish. In fact, Tux3 is a lot faster. ...
>>
>> Yes, it's easy to be fast when you have simple, naive algorithms and
>> an empty filesystem.
>
> No it isn't or the others would be fast too. In any case our algorithms
> are far from naive, except for allocation. You can rest assured that
> when allocation is brought up to a respectable standard in the fullness
> of time, it will be competitive and will not harm our clean filesystem
> performance at all.
>
> There is no call for you to disparage our current achievements, which
> are significant. I do not mind some healthy skepticism about the
> allocation work, you know as well as anyone how hard it is. However your
> denial of our current result is irritating and creates the impression
> that you have an agenda. If you want to complain about something real,
> complain that our current code drop is not done yet. I will humbly
> apologize, and the same for enospc.
As I'm reading Dave's comments, he isn't attacking you the way you seem to think 
he is. He is pointing out that there are problems with your data, but he's also
taking a lot of time to explain what's happening (and yes, some of this is 
probably because your simple tests with XFS made it look so bad)
The other filesystems don't use naive algorithms; they use something more
complex, and while your current numbers are interesting, they are only 
preliminary until you add something to handle fragmentation. That can cause very 
significant problems. Remember how fabulous btrfs looked in the initial reports? 
and then corner cases were found that caused real problems and as the algorithms 
have been changed to prevent those corner cases from being so easy to hit, the 
common case has suffered somewhat. This isn't an attack on Tux3 or btrfs, it's
just a reality of programming. If you are not accounting for all the corner 
cases, everything is easier, and faster.
>> That's roughly 10x faster than your numbers. Can you describe your
>> test setup in detail? e.g.  post the full log from block device
>> creation to benchmark completion so I can reproduce what you are
>> doing exactly?
>
> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
> more substantial, so I can't compare my numbers directly to yours.
If you are doing tests with a 4G ramdisk on a machine with only 4G of RAM, it 
seems like you end up testing a lot more than just the filesystem. Testing in 
such low memory situations can identify significant issues, but it is
questionable as a 'which filesystem is better' benchmark.
> Clearly the curve is the same: your numbers increase 10x going from 100
> to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
> significantly flatter and starts from a lower base, so it ends with a
> really wide gap. You will need to take my word for that for now. I
> promise that the beer is on me should you not find that reproducible.
>
> The repository delay is just about not bothering Hirofumi for a merge
> while he finishes up his inode table anti-fragmentation work.
Just a suggestion, but before you do a huge post about how great your filesystem 
is performing, making the code available so that others can test it when
prompted by your post is probably a very good idea. If it means that you have to 
send out your post a week later, it's a very small cost for the benefit of 
having other people able to easily try it on hardware that you don't have access 
to.
If there is a reason to post without the code being in the main, publicised
repo, then your post should point people at what code they can use to duplicate 
it.
but really, 11 months without updating the main repo?? This is Open Source 
development, publish early and often.
>>> Note: you should recheck your final number for Btrfs. I have seen
>>> Btrfs fall off the rails and take wildly longer on some tests just
>>> like that.
>>
>> Completely reproducible...
>
> I believe you. I found that Btrfs does that way too much. So does XFS
> from time to time, when it gets up into lots of tasks. Read starvation
> on XFS is much worse than Btrfs, and XFS also exhibits some very
> undesirable behavior with initial file create. Note: Ext4 and Tux3 have
> roughly zero read starvation in any of these tests, which pretty much
> proves it is not just a block scheduler thing. I don't think this is
> something you should dismiss.
Something to investigate, but I have seen problems on ext* in the past. ext4 may
have fixed this, or it may just have moved the point where it triggers.
>>> I wouldn't be so sure about that...
>>>
>>>     Tasks:       8            16            32
>>>     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
>>>     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
>>>     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s ...
>>
>>      Ext4:     807.21 MB/s    1089.89 MB/s   867.55 MB/s
>>      XFS:      997.77 MB/s    1011.51 MB/s   876.49 MB/s
>>      Btrfs:     55.66 MB/s      56.77 MB/s    60.30 MB/s
>>
>> Numbers are again very different for XFS and ext4 on /dev/ramX on my
>> system. Need to work out why yours are so low....
>
> Your machine makes mine look like a PCjr.
The interesting thing here is that on the faster machine btrfs didn't speed up 
significantly while ext4 and xfs did. It will be interesting to see what the 
results are for tux3
and both of you need to remember that while servers are getting faster, we are 
also seeing much lower power, weaker servers showing up as well. And while these 
smaller servers are not trying to do the 10000 thread fsync workload, they are
using flash based storage more frequently than they are spinning rust 
(frequently through the bottleneck of a SD card) so continuing tests on low end 
devices is good.
>>> I said then that when we
>>> got around to a proper fsync it would be competitive. Now here it
>>> is, so you want to change the topic. I understand.
>>
>> I haven't changed the topic, just the storage medium. The simple
>> fact is that the world is moving away from slow sata storage at a
>> pretty rapid pace and it's mostly going solid state. Spinning disks
>> also changing - they are going to ZBC based SMR, which is a
>> completely different problem space which doesn't even appear to be
>> on the tux3 radar....
>>
>> So where does tux3 fit into a storage future of byte addressable
>> persistent memory and ZBC based SMR devices?
>
> You won't convince us to abandon spinning rust, it's going to be around
> a lot longer than you think. Obviously, we care about SSD and I believe
> you will find that Tux3 is more than competitive there. We lay things
> out in a very erase block friendly way. We need to address the volume
> wrap issue of course, and that is in progress. This is much easier than
> spinning disk.
>
> Tux3's redirect-on-write[1] is obviously a natural for SMR, however
> I will not get excited about it unless a vendor waves money.
what drives are available now? see if you can get a couple (either directly or 
donated)
David Lang
- * Re: Tux3 Report: How fast can we fsync?
  2015-05-02  1:07         ` David Lang
@ 2015-05-02 10:26           ` Daniel Phillips
  2015-05-02 16:00             ` Christian Stroetmann
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-02 10:26 UTC (permalink / raw)
  To: David Lang
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o
On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
> On Fri, 1 May 2015, Daniel Phillips wrote:
>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>> 
>>> Well, yes - I never claimed XFS is a general purpose filesystem.  It
>>> is a high performance filesystem. It is also becoming more relevant
>>> to general purpose systems as low cost storage gains capabilities
>>> that used to be considered the domain of high performance storage...
>> 
>> OK. Well, Tux3 is general purpose and that means we care about single
>> spinning disk and small systems.
>
> keep in mind that if you optimize only for the small systems 
> you may not scale as well to the larger ones.
Tux3 is designed to scale, and it will when the time comes. I look 
forward to putting Shardmap through its billion file test in due course. 
However, right now it would be wise to stay focused on basic 
functionality suited to a workstation because volunteer devs tend to 
have those. After that, phones are a natural direction, where hard core 
ACID commit and really smooth file ops are particularly attractive.
> per the ramdisk but, possibly not as relevant as you may think. 
> This is why it's good to test on as many different systems as 
> you can. As you run into different types of performance you can 
> then pick ones to keep and test all the time.
I keep being surprised how well it works for things we never tested 
before.
> Single spinning disk is interesting now, but will be less 
> interesting later. multiple spinning disks in an array of some 
> sort is going to remain very interesting for quite a while.
The way to do md well is to integrate it into the block layer like 
FreeBSD does (GEOM) and expose a richer interface for the filesystem. 
That is how I think Tux3 should work with big iron raid. I hope to be
able to tackle that sometime before the stars start winking out.
> now, some things take a lot more work to test than others. 
> Getting time on a system with a high performance, high capacity 
> RAID is hard, but getting hold of an SSD from Fry's is much 
> easier. If it's a budget item, ping me directly and I can donate 
> one for testing (the cost of a drive is within my unallocated 
> budget and using that to improve Linux is worthwhile)
Thanks.
> As I'm reading Dave's comments, he isn't attacking you the way 
> you seem to think he is. He is pointing out that there are 
> problems with your data, but he's also taking a lot of time to 
> explain what's happening (and yes, some of this is probably 
> because your simple tests with XFS made it look so bad)
I hope the lightening up trend is a trend.
> the other filesystems don't use naive algorithms, they use 
> something more complex, and while your current numbers are 
> interesting, they are only preliminary until you add something 
> to handle fragmentation. That can cause very significant 
> problems.
Fsync is pretty much agnostic to fragmentation, so those results are 
unlikely to change substantially even if we happen to do a lousy job on 
allocation policy, which I naturally consider unlikely. In fact, Tux3 
fsync is going to get faster over time for a couple of reasons: the 
minimum blocks per commit will be reduced, and we will get rid of most 
of the seeks to beginning of volume that we currently suffer per commit.
> Remember how fabulous btrfs looked in the initial 
> reports? and then corner cases were found that caused real 
> problems and as the algorithms have been changed to prevent 
> those corner cases from being so easy to hit, the common case 
> has suffered somewhat. This isn't an attack on Tux2 or btrfs, 
> it's just a reality of programming. If you are not accounting 
> for all the corner cases, everything is easier, and faster.
>> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
>> more substantial, so I can't compare my numbers directly to yours.
>
> If you are doing tests with a 4G ramdisk on a machine with only 
> 4G of RAM, it seems like you end up testing a lot more than just 
> the filesystem. Testing in such low memory situations can 
> identify significant issues, but it is questionable as a 'which 
> filesystem is better' benchmark.
A 1.3 GB tmpfs, and sorry, it is 10 GB (the machine next to it is 4G). 
I am careful to ensure the test environment does not have spurious 
memory or cpu hogs. I will not claim that this is the most sterile test 
environment possible, but it is adequate for the task at hand. Nearly 
always, when I find big variations in the test numbers it turns out to 
be a quirk of one filesystem that is not exhibited by the others. 
Everything gets multiple runs and lands in a spreadsheet. Any fishy 
variance is investigated.
By the way, the low variance kings by far are Ext4 and Tux3, and of 
those two, guess which one is more consistent. XFS is usually steady, 
but can get "emotional" with lots of tasks, and Btrfs has regular wild 
mood swings whenever the stars change alignment. And while I'm making 
gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.
> Just a suggestion, but before you do a huge post about how 
> great your filesystem is performing, making the code available 
> so that others can test it when prompted by your post is 
> probably a very good idea. If it means that you have to send out 
> your post a week later, it's a very small cost for the benefit 
> of having other people able to easily try it on hardware that 
> you don't have access to.
Next time. This time I wanted it off my plate as soon as possible so I 
could move on to enospc work. And this way is more involving, we get a 
little suspense before the rematch.
> If there is a reason to post without the code being in the 
> main, publicised repo, then your post should point people at 
> what code they can use to duplicate it.
I could have included the patch in the post, it is small enough. If it 
still isn't in the repo in a few days then I will post it, to avoid 
giving the impression I'm desperately trying to fix obscure bugs in it, 
which isn't the case.
> but really, 11 months without updating the main repo?? This is 
> Open Source development, publish early and often.
It's not as bad as that:
   https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi
   https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi-user
> something to investigate, but I have seen problems on ext* in 
> the past. ext4 may have fixed this, or it may just have moved 
> the point where it triggers.
My spectrum of tests is small and I am not hunting for anomalies, only 
reporting what happened to come up. It is not very surprising that some
odd things happen with 10,000 tasks, there is probably not much test 
coverage there. On the whole I was surprised and impressed when all 
filesystems mostly just worked. I was expecting to hit scheduler issues 
for one thing, and nothing obvious came up. Also, not one oops on any 
filesystem (even Tux3) and only one assert, already reported upstream 
and turned out to be fixed a week or two ago.
>>> ...
>> Your machine makes mine look like a PCjr. ...
>
> The interesting thing here is that on the faster machine btrfs 
> didn't speed up significantly while ext4 and xfs did. It will be 
> interesting to see what the results are for tux3
The numbers are well into the something-is-really-wrong zone (and I 
should have flagged that earlier but it was a long day). That test is 
supposed to be -s, all synchronous, and his numbers are more typical of
async. Needs double checking all round, including here. Anybody can 
replicate that test, it is only an apt-get install dbench away (hint 
hint).
Differences: my numbers are kvm with loopback mount on tmpfs. His are 
on ramdisk and probably native. I have to reboot to make a ramdisk big 
enough to run dbench and I would rather not right now.
How important is it to get to the bottom of the variance in test 
results running on RAM? Probably important in the long run, because 
storage devices are looking more like RAM all the time, but as of 
today, maybe not very urgent.
Also, I was half expecting somebody to question the wisdom of running 
benchmarks under KVM instead of native, but nobody did. Just for the 
record, I would respond: running virtual probably accounts for the
majority of server instances today.
> and both of you need to remember that while servers are getting 
> faster, we are also seeing much lower power, weaker servers 
> showing up as well. And while these smaller servers are not 
> trying to do the 10000 thread fsync workload, they are using 
> flash based storage more frequently than they are spinning rust 
> (frequently through the bottleneck of a SD card) so continuing 
> tests on low end devices is good.
Low end servers and embedded concern me more, indeed.
> what drives are available now? see if you can get a couple 
> (either directly or donated)
Right, time to hammer on flash.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-05-02 10:26           ` Daniel Phillips
@ 2015-05-02 16:00             ` Christian Stroetmann
  2015-05-02 16:30               ` Richard Weinberger
  0 siblings, 1 reply; 160+ messages in thread
From: Christian Stroetmann @ 2015-05-02 16:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Dave Chinner, linux-kernel, linux-fsdevel, tux3,
	Theodore Ts'o
On the 2nd of May 2015 12:26, Daniel Phillips wrote:
Aloha everybody
> On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
>> On Fri, 1 May 2015, Daniel Phillips wrote:
>>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>>
>>>> Well, yes - I never claimed XFS is a general purpose filesystem.  It
>>>> is a high performance filesystem. It is also becoming more relevant
>>>> to general purpose systems as low cost storage gains capabilities
>>>> that used to be considered the domain of high performance storage...
>>>
>>> OK. Well, Tux3 is general purpose and that means we care about single
>>> spinning disk and small systems.
>>
>> keep in mind that if you optimize only for the small systems you may 
>> not scale as well to the larger ones.
>
> Tux3 is designed to scale, and it will when the time comes. I look 
> forward to putting Shardmap through its billion file test in due 
> course. However, right now it would be wise to stay focused on basic 
> functionality suited to a workstation because volunteer devs tend to 
> have those. After that, phones are a natural direction, where hard 
> core ACID commit and really smooth file ops are particularly attractive.
>
Has anybody else a deja vu?
Nevertheless, why don't you just put your fsync, your interpretations 
(ACID, shardmap, etc.) of my things (OntoFS, file system of SASOS4Fun, 
and OntoBase), and whatever gimmicks you have in mind together into a 
prototypical file system, test it, and send a message with a short 
description focused solely on others' and your foundational ideas and 
the technical features, a WWW address where the code can be found, and 
your test results to this mailing list without your marketing and 
self-promotion BEFORE and NOT DUE COURSE respectively NOT AFTER anybody 
else does a similar work or I am so annoyed that I implement it?
Also, if it is so general that XFS and EXT4 should adapt it, then why 
don't you help to improve these file systems?
Btw.: I have rejected my claims I made in that email, but definitely not 
given up my copyright if it is valid.
>> per the ramdisk but, possibly not as relevant as you may think. This 
>> is why it's good to test on as many different systems as you can. As 
>> you run into different types of performance you can then pick ones to 
>> keep and test all the time.
>
> I keep being surprised how well it works for things we never tested 
> before.
>
>> Single spinning disk is interesting now, but will be less interesting 
>> later. multiple spinning disks in an array of some sort is going to 
>> remain very interesting for quite a while.
>
> The way to do md well is to integrate it into the block layer like 
> FreeBSD does (GEOM) and expose a richer interface for the filesystem. 
> That is how I think Tux3 should work with big iron raid. I hope to be
> able to tackle that sometime before the stars start winking out.
>
>> now, some things take a lot more work to test than others. Getting 
>> time on a system with a high performance, high capacity RAID is hard, 
>> but getting hold of an SSD from Fry's is much easier. If it's a 
>> budget item, ping me directly and I can donate one for testing (the 
>> cost of a drive is within my unallocated budget and using that to 
>> improve Linux is worthwhile)
>
> Thanks.
>
>> As I'm reading Dave's comments, he isn't attacking you the way you 
>> seem to think he is. He is pointing out that there are problems with 
>> your data, but he's also taking a lot of time to explain what's 
>> happening (and yes, some of this is probably because your simple 
>> tests with XFS made it look so bad)
>
> I hope the lightening up trend is a trend.
>
>> the other filesystems don't use naive algorithms, they use something 
>> more complex, and while your current numbers are interesting, they 
>> are only preliminary until you add something to handle fragmentation. 
>> That can cause very significant problems.
>
> Fsync is pretty much agnostic to fragmentation, so those results are 
> unlikely to change substantially even if we happen to do a lousy job 
> on allocation policy, which I naturally consider unlikely. In fact, 
> Tux3 fsync is going to get faster over time for a couple of reasons: 
> the minimum blocks per commit will be reduced, and we will get rid of 
> most of the seeks to beginning of volume that we currently suffer per 
> commit.
>
>> Remember how fabulous btrfs looked in the initial reports? and then 
>> corner cases were found that caused real problems and as the 
>> algorithms have been changed to prevent those corner cases from being 
>> so easy to hit, the common case has suffered somewhat. This isn't an 
>> attack on Tux2 or btrfs, it's just a reality of programming. If you 
>> are not accounting for all the corner cases, everything is easier, 
>> and faster.
>
>>> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
>>> more substantial, so I can't compare my numbers directly to yours.
>>
>> If you are doing tests with a 4G ramdisk on a machine with only 4G of 
>> RAM, it seems like you end up testing a lot more than just the 
>> filesystem. Testing in such low memory situations can identify 
>> significant issues, but it is questionable as a 'which filesystem is 
>> better' benchmark.
>
> A 1.3 GB tmpfs, and sorry, it is 10 GB (the machine next to it is 4G). 
> I am careful to ensure the test environment does not have spurious 
> memory or cpu hogs. I will not claim that this is the most sterile 
> test environment possible, but it is adequate for the task at hand. 
> Nearly always, when I find big variations in the test numbers it turns 
> out to be a quirk of one filesystem that is not exhibited by the 
> others. Everything gets multiple runs and lands in a spreadsheet. Any 
> fishy variance is investigated.
>
> By the way, the low variance kings by far are Ext4 and Tux3, and of 
> those two, guess which one is more consistent. XFS is usually steady, 
> but can get "emotional" with lots of tasks, and Btrfs has regular wild 
> mood swings whenever the stars change alignment. And while I'm making 
> gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.
>
>> Just a suggestion, but before you do a huge post about how great your 
>> filesystem is performing, making the code available so that others 
>> can test it when prompted by your post is probably a very good idea. 
>> If it means that you have to send out your post a week later, it's a 
>> very small cost for the benefit of having other people able to easily 
>> try it on hardware that you don't have access to.
>
> Next time. This time I wanted it off my plate as soon as possible so I 
> could move on to enospc work. And this way is more involving, we get a 
> little suspense before the rematch.
>
>> If there is a reason to post without the code being in the main, 
>> publicised repo, then your post should point people at what code they 
>> can use to duplicate it.
>
> I could have included the patch in the post, it is small enough. If it 
> still isn't in the repo in a few days then I will post it, to avoid 
> giving the impression I'm desperately trying to fix obscure bugs in 
> it, which isn't the case.
>
>> but really, 11 months without updating the main repo?? This is Open 
>> Source development, publish early and often.
>
> It's not as bad as that:
>
>   https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi
>   https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi-user
>
>> something to investigate, but I have seen problems on ext* in the 
>> past. ext4 may have fixed this, or it may just have moved the point 
>> where it triggers.
>
> My spectrum of tests is small and I am not hunting for anomalies, only 
> reporting what happened to come up. It is not very surprising that some
> odd things happen with 10,000 tasks, there is probably not much test 
> coverage there. On the whole I was surprised and impressed when all 
> filesystems mostly just worked. I was expecting to hit scheduler 
> issues for one thing, and nothing obvious came up. Also, not one oops 
> on any filesystem (even Tux3) and only one assert, already reported 
> upstream and turned out to be fixed a week or two ago.
>
>>>> ...
>>> Your machine makes mine look like a PCjr. ...
>>
>> The interesting thing here is that on the faster machine btrfs didn't 
>> speed up significantly while ext4 and xfs did. It will be interesting 
>> to see what the results are for tux3
>
> The numbers are well into the something-is-really-wrong zone (and I 
> should have flagged that earlier but it was a long day). That test is 
> supposed to be -s, all synchronous, and his numbers are more typical of
> async. Needs double checking all round, including here. Anybody can 
> replicate that test, it is only an apt-get install dbench away (hint 
> hint).
>
> Differences: my numbers are kvm with loopback mount on tmpfs. His are 
> on ramdisk and probably native. I have to reboot to make a ramdisk big 
> enough to run dbench and I would rather not right now.
>
> How important is it to get to the bottom of the variance in test 
> results running on RAM? Probably important in the long run, because 
> storage devices are looking more like RAM all the time, but as of 
> today, maybe not very urgent.
>
> Also, I was half expecting somebody to question the wisdom of running 
> benchmarks under KVM instead of native, but nobody did. Just for the 
> record, I would respond: running virtual probably accounts for the
> majority of server instances today.
>
>> and both of you need to remember that while servers are getting 
>> faster, we are also seeing much lower power, weaker servers showing 
>> up as well. And while these smaller servers are not trying to do the 
>> 10000 thread fsync workload, they are using flash based storage more 
>> frequently than they are spinning rust (frequently through the 
>> bottleneck of a SD card) so continuing tests on low end devices is good.
>
> Low end servers and embedded concern me more, indeed.
>> what drives are available now? see if you can get a couple (either 
>> directly or donated)
>
> Right, time to hammer on flash.
>
> Regards,
>
> Daniel
> -- 
Thanks
Best Regards
Have fun
C.S.
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-05-02 16:00             ` Christian Stroetmann
@ 2015-05-02 16:30               ` Richard Weinberger
  2015-05-02 17:00                 ` Christian Stroetmann
  0 siblings, 1 reply; 160+ messages in thread
From: Richard Weinberger @ 2015-05-02 16:30 UTC (permalink / raw)
  To: Christian Stroetmann
  Cc: Daniel Phillips, David Lang, Dave Chinner, LKML, linux-fsdevel,
	tux3, Theodore Ts'o
On Sat, May 2, 2015 at 6:00 PM, Christian Stroetmann
<stroetmann@ontolab.com> wrote:
> On the 2nd of May 2015 12:26, Daniel Phillips wrote:
>
> Aloha everybody
>
>> On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
>>>
>>> On Fri, 1 May 2015, Daniel Phillips wrote:
>>>>
>>>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>>>
>>>>>
>>>>> Well, yes - I never claimed XFS is a general purpose filesystem.  It
>>>>> is a high performance filesystem. It is also becoming more relevant
>>>>> to general purpose systems as low cost storage gains capabilities
>>>>> that used to be considered the domain of high performance storage...
>>>>
>>>>
>>>> OK. Well, Tux3 is general purpose and that means we care about single
>>>> spinning disk and small systems.
>>>
>>>
>>> keep in mind that if you optimize only for the small systems you may not
>>> scale as well to the larger ones.
>>
>>
>> Tux3 is designed to scale, and it will when the time comes. I look forward
>> to putting Shardmap through its billion file test in due course. However,
>> right now it would be wise to stay focused on basic functionality suited to
>> a workstation because volunteer devs tend to have those. After that, phones
>> are a natural direction, where hard core ACID commit and really smooth file
>> ops are particularly attractive.
>>
>
> Has anybody else a deja vu?
Yes, the onto-troll strikes again...
-- 
Thanks,
//richard
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-05-02 16:30               ` Richard Weinberger
@ 2015-05-02 17:00                 ` Christian Stroetmann
  0 siblings, 0 replies; 160+ messages in thread
From: Christian Stroetmann @ 2015-05-02 17:00 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Daniel Phillips, David Lang, Dave Chinner, LKML, linux-fsdevel,
	tux3, Theodore Ts'o
On 2nd of May 2015 18:30, Richard Weinberger wrote:
> On Sat, May 2, 2015 at 6:00 PM, Christian Stroetmann
> <stroetmann@ontolab.com>  wrote:
>> On the 2nd of May 2015 12:26, Daniel Phillips wrote:
>>
>> Aloha everybody
>>
>>> On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
>>>> On Fri, 1 May 2015, Daniel Phillips wrote:
>>>>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>>>>
>>>>>> Well, yes - I never claimed XFS is a general purpose filesystem.  It
>>>>>> is a high performance filesystem. It is also becoming more relevant
>>>>>> to general purpose systems as low cost storage gains capabilities
>>>>>> that used to be considered the domain of high performance storage...
>>>>>
>>>>> OK. Well, Tux3 is general purpose and that means we care about single
>>>>> spinning disk and small systems.
>>>>
>>>> keep in mind that if you optimize only for the small systems you may not
>>>> scale as well to the larger ones.
>>>
>>> Tux3 is designed to scale, and it will when the time comes. I look forward
>>> to putting Shardmap through its billion file test in due course. However,
>>> right now it would be wise to stay focused on basic functionality suited to
>>> a workstation because volunteer devs tend to have those. After that, phones
>>> are a natural direction, where hard core ACID commit and really smooth file
>>> ops are particularly attractive.
>>>
>> Has anybody else a deja vu?
> Yes, the onto-troll strikes again...
>
Everybody has her/his own interpretation about what open source means.
I really thought there could be some kind of a constructive discussion 
about such a file system or at least about interesting technical 
features that can be used for other file systems like e.g. a potential 
EXT5, when I relaxed my position some days ago and proposed that also 
ideas are referenced correctly in relation with open source projects, 
specifically in relation with Tux3.
Now, I have the impression that this is not possible and due to this any 
progress is hard to achieve.
Thanks
Best Regards
Do not feed the trolls.
C.S.
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fsync?
  2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
  2015-04-29  2:21 ` Mike Galbraith
  2015-04-30  1:46 ` Dave Chinner
@ 2015-05-12 17:41 ` Daniel Phillips
  2015-05-12 17:46 ` Tux3 Report: How fast can we fail? Daniel Phillips
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12 17:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, tux3, OGAWA Hirofumi
Tux3 Report: How fast can we fail?
Tux3 now has a preliminary out of space handling algorithm. This might
sound like a small thing, but in fact handling out of space reliably
and efficiently is really hard, especially for Tux3. We developed an
original solution with unusually low overhead in the common case, and
simple enough to prove correct. Reliability seems good so far. But not
to keep anyone in suspense: Tux3 does not fail very fast, but it fails
very reliably. We like to think that Tux3 is better at succeeding than
failing.
We identified the following quality metrics for this algorithm:
 1) Never fails to detect out of space in the front end.
 2) Always fills a volume to 100% before reporting out of space.
 3) Allows rm, rmdir and truncate even when a volume is full.
 4) Writing to a nearly full volume is not excessively slow.
 5) Overhead is insignificant when a volume is far from full.
Like every filesystem that does delayed allocation, Tux3 must guess how
much media space will be needed to commit any update it accepts into
cache. It must not guess low or the commit may fail and lose data. This
is especially tricky for Tux3 because it does not track individual
updates, but instead, partitions updates atomically into delta groups
and commits each delta as an atomic unit. A single delta can be as
large as writable cache, including thousands of individual updates.
This delta scheme ensures perfect namespace, metadata and data
consistency without complex tracking of relationships between thousands
of cache objects, and also does delayed allocation about as well as it
can be done. Given these benefits, it is not too hard to accept some
extra pain in out of space accounting.
Speaking of accounting, we borrow some of that terminology to talk
about the problem. Each delta has a "budget" and computes a "balance"
that declines each time a transaction "cost" is "charged" against it.
The budget is all of free space, plus some space that belongs to
the current disk image that we know will be released soon, and less
a reserve for taking care of certain backend duties. When the balance
goes negative, the transaction backs out its cost, triggers a delta
transition, and tries again. This has the effect of shrinking the delta
size as a volume approaches full. When the delta budget finally shrinks
to less than the transaction cost, the update fails with ENOSPC.
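The charge, back-out and retry cycle described above can be sketched as a small single-threaded model. This is an illustration rather than Tux3 code: the class and method names are hypothetical, plain Python integers stand in for the kernel's atomic counters, space is counted in blocks, and the "space that will be released soon" term of the budget is left out.

```python
import errno


class Volume:
    """Toy model of delta budgeting; all names here are hypothetical."""

    def __init__(self, free, reserve=0):
        self.free = free        # blocks actually free on media
        self.reserve = reserve  # blocks held back for backend duties
        self.pending = []       # real block usage of updates in the front delta
        self.deltas = 0         # delta transitions so far
        self.budget = self.balance = free - reserve

    def transition(self):
        """Flush the front delta, then budget the next one from what is left."""
        self.free -= sum(self.pending)
        self.pending = []
        self.deltas += 1
        self.budget = self.balance = self.free - self.reserve

    def charge(self, cost, actual):
        """Charge a worst case cost; return 0 on success or -ENOSPC."""
        assert actual <= cost, "the estimate must never be low"
        while True:
            self.balance -= cost
            if self.balance >= 0:
                self.pending.append(actual)
                return 0
            self.balance += cost          # back out the charge
            if self.budget < cost:        # even an empty delta cannot hold it
                return -errno.ENOSPC
            self.transition()             # shrink the delta and try again
```

With an estimate ten times the real usage, for example charge(10, actual=1) repeated on a 100 block volume, each delta holds fewer transactions than the last, and the loop ends with ENOSPC once fewer than 10 blocks remain.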
This is where the "how fast can we fail" question comes up. If our guess
at cost is way higher than actual blocks consumed, deltas take a long
time to shrink. Overestimating transaction cost by a factor of ten
can trigger over a hundred deltas before failing. Fortunately, deltas
are pretty fast, so we only keep the user waiting for a second or so
before delivering the bad news. We also slow down measurably, but not
horribly, when getting close to full. Ext4 by contrast flies along at
full speed right until it fills the volume, and stops on a dime at
exactly 100% full. I don't think that Tux3 will ever be as good at
failing as that, but we will try to get close.
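A back-of-the-envelope model shows why an overestimate stretches the failure path. It assumes every transaction is charged the same cost but really consumes only `actual` blocks, and that each delta accepts as many transactions as its budget allows. The function name and the numbers are illustrative, not measurements.

```python
def deltas_until_enospc(free, cost, actual):
    """Count the deltas flushed before the budget drops below one cost."""
    assert 0 < actual <= cost
    deltas = 0
    while free >= cost:
        accepted = free // cost    # transactions that fit in this budget
        free -= accepted * actual  # blocks the flush really consumed
        deltas += 1
    return deltas


# Exact estimate: a single delta fills a million block volume.
print(deltas_until_enospc(1_000_000, cost=10, actual=10))
# Tenfold overestimate: free space shrinks only about 10% per delta.
print(deltas_until_enospc(1_000_000, cost=10, actual=1))
```

With a tenfold overestimate each delta consumes roughly a tenth of what remains, so the count grows at least like ln(free/cost) / ln(10/9), which already exceeds a hundred for these numbers.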
Before I get into how Tux3's out of space behavior stacks up against
other filesystems, there are some interesting details to touch on about
how we go about things.
Tux3's front/back arrangement is lockless, which is great for
performance but turns into a problem when front and back need to
cooperate about something like free space accounting. If we were willing
to add a spinlock between front and back this would be easy, but we don't
want to do that. Not only are we jealously protective of our lockless
design, but if our fast path suddenly became slower because of adding
essential functionality we might need to revise some posted benchmark
results. Better that we should do it right and get our accounting
almost for free.
The world of lockless algorithms is an arcane one indeed, just ask Paul
McKenney about that. The solution we came up with needs just two atomic
adds per transaction, and we will eventually turn one of those into a
per-cpu counter. As mentioned above, a frontend transaction backs out
its cost when the delta balance goes negative, so from the backend's
point of view, the balance is going up and down unpredictably all the
time. Delta transition can happen at any time, and somehow, the backend
must assign the new front delta its budget exactly at transition.
Meanwhile, the front delta balance is still going up and down
unpredictably. See the problem? The issue is, delta transition is truly
asynchronous. We can't change that short of adding locks with the
contention and stalls that go along with them.
Fortunately, one consequence of delta transition is that the total cost
charged to the delta instantly becomes stable when the front delta
becomes the back delta. Volume free space is also stable because only
the backend accesses it. The backend can easily measure the actual
space consumed by the back delta: it is the difference between free
space before and after flushing to media. Updating the front delta
budget is easy because only the backend changes it, but updating the
front delta balance is much harder because the front delta is busy
changing it. If we get this wrong, the resulting slight discrepancies
between budget, balance and charged costs would mean that somebody,
somewhere will hit out of space in the middle of a commit and end up
sticking pins into a voodoo doll that looks like us.
A solution was found that only took a few lines of code and some pencil
pushing. The backend knows what the front delta balance must have been
exactly at transition, because it knows the amount charged to the back
delta, and it knows the original budget. It can therefore deduce how
much the front balance should have increased exactly at transition (it
must always increase) so it adds that difference atomically to the
front delta balance. This has exactly the same effect as setting the
balance atomically at transition time, if that were possible, which it
is not. This technique is airtight, and the whole algorithm ends up
costing less than a hundred nanoseconds per transaction.[1] This is a
good thing because each page of a Tux3 write is a separate transaction,
so any significant overhead would stick out like a sore thumb.
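The transition adjustment can be modeled in a few lines. This sketch is an interpretation, not the Tux3 source: it assumes the two atomic adds per transaction are one on a shared balance and one on a per-delta charge total, it applies the backend's correction to the shared balance, and it uses plain Python integers where the kernel would use atomic counters. All names are hypothetical.

```python
class SpaceAccounting:
    """Single-threaded model of front/back space accounting (hypothetical)."""

    def __init__(self, budget):
        self.budget = budget   # written only by the backend
        self.balance = budget  # shared counter; the frontend adds and subtracts
        self.charged = [0, 0]  # total cost charged to each of two deltas
        self.front = 0         # index of the delta accepting updates

    def charge(self, cost):
        """Frontend: two adds per transaction, no lock."""
        self.balance -= cost                # add 1: charge the shared balance
        if self.balance < 0:
            self.balance += cost            # back out; caller forces a transition
            return False
        self.charged[self.front] += cost    # add 2: per-delta total
        return True

    def transition(self):
        """The front delta becomes the back delta; its charge total is now stable."""
        back, self.front = self.front, self.front ^ 1
        return back

    def rebudget(self, back, new_budget):
        """Backend, after flushing `back`: correct the balance with one add."""
        at_transition = self.budget - self.charged[back]
        self.balance += new_budget - at_transition
        self.budget = new_budget
        self.charged[back] = 0
```

Charges that land between the transition and the backend's correction are not lost, because addition commutes: after charge(30), a transition, charge(20) and rebudget(back, 90) on an initial budget of 100, the balance is 90 - 20 = 70, exactly as if it had been reset to 90 at the instant of transition.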
Accounting cost estimates properly and stopping when actually out of
space is just the core of the algorithm. We must feed that core with
good, provable worst case cost estimates. To get an initial idea of
whether the algorithm works, we just plugged in some magic numbers, and
lo and behold, suddenly we were not running out of space in the
backend any more. But to do the job properly we need to consider things
like the file index btree depth, because just plugging in a number large
enough to handle the deepest possible btree would slow down our failure
path way too much.
The best way to account for btree depth is to make it disappear entirely
by removing the btree updates from the delta commit path. We already do
that for bitmaps, which is a good thing because our bitmaps are just
blocks that live in a normal file. Multiplying our worst case by the
maximum number of bitmaps that could possibly be affected, and then
multiplying that by the worst case change to the bitmap metadata,
including its data btree, its inode, and the inode table btree, would be
a real horror. Instead, we log all changes that affect the bitmap and
only update the bitmaps periodically at unify cycles. A Tux3 filesystem
is consistent whether or not we unify, so if space becomes critically
tight the backend can just disable the unify. The only bad effect is
that the log chain can grow and make replay take longer, but that growth
is limited by the fact that there is not much space left for more log
blocks.
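As an illustration of the idea only, here is a minimal C sketch of deferred bitmap updates: allocations and frees are appended to a log while deltas commit, and the bitmap is brought up to date in a single pass at unify time. The record layout and the names log_rec and unify_bitmap are assumptions made for this example, not Tux3's actual log format or code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record: one extent allocated or freed by a delta. */
enum { LOG_ALLOC, LOG_FREE };

struct log_rec {
	int op;			/* LOG_ALLOC or LOG_FREE */
	unsigned block;		/* first block of the extent */
	unsigned count;		/* number of blocks in the extent */
};

/*
 * Replay logged allocation changes into the bitmap, in log order.
 * Delta commits only append records; the bitmap blocks themselves
 * are touched here, once per unify cycle.
 */
void unify_bitmap(uint8_t *bitmap, const struct log_rec *log, size_t nrecs)
{
	for (size_t i = 0; i < nrecs; i++) {
		unsigned end = log[i].block + log[i].count;
		for (unsigned b = log[i].block; b < end; b++) {
			if (log[i].op == LOG_ALLOC)
				bitmap[b / 8] |= 1u << (b % 8);
			else
				bitmap[b / 8] &= ~(1u << (b % 8));
		}
	}
}
```

Skipping a unify cycle in this model just means the log array keeps growing, which mirrors the longer replay described above.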
If we did not have this nice way of making bitmap overhead disappear,
we would not be anywhere close to a respectable implementation today.
Actually, we weren't even thinking about out of space accounting when
we developed this design element; we were trying to get rid of
the overhead of updating bitmaps per delta. That worked well and is a
significant part of the reason why we can outrun Ext4 while having a
very similar structure. The benefit for space accounting dropped out
just by dumb luck.
The same technique we use for hiding bitmap update cost works just as
well for btree metadata. Eventually, we will move btree leaf redirecting
from the delta flush to the unify flush. That will speed it up by
coalescing some index block writes and also make it vanish from the
transaction cost estimate, saving frontend CPU and speeding up the
failure path. What's not to like? It is on the list of things to do.
Today, I refactored the budgeting algorithm to skip the cost estimate
if a page is already dirty, which tightened up the estimate by a factor
of four or so and made things run smoother. There will be more
incremental improvements as time goes by. For example, we currently
overestimate the cost of a rewrite because we would need to go poking
around in btrees to do that more accurately. Fixing that will be quite
a bit of work, but somebody will probably do it, sometime.
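A minimal sketch of the "skip the estimate if the page is already dirty" rule, assuming a toy page structure with a single dirty flag. The names toy_page and write_cost are invented for this example; the real page state and the way the dirty flag is cleared at delta transition are not modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a cached page. */
struct toy_page {
	bool dirty;	/* already charged for in the current delta */
};

/*
 * Return the cost to charge for a write that touches `page`.
 * Only the write that first dirties the page pays; later writes
 * to the same dirty page are free, because the page will be
 * flushed once no matter how often it is modified.
 */
long write_cost(struct toy_page *page, long cost_per_page)
{
	if (page->dirty)
		return 0;
	page->dirty = true;
	return cost_per_page;
}
```

If, say, four small writes land on the same page, only the first one is charged, so the total estimate for that page drops to a quarter of what charging every write would give.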
Now the fun part: performance and bugs. Being anxious to know where
Tux3 stands with respect to the usual suspects, I ran some tests and
found that Ext4 is amazingly good at this, while XFS and Btrfs have
some serious issues. Details below.
Tux3 slows down when approaching a full state, hopefully not too much.
To quantify that, here is what happens with a 200 MB dd to a loopback
mounted file on tmpfs:
                     Volume Size    Run Time
    No check at all:   1500 MB       0.306s
    Far from full:     1500 MB       0.318s
    Getting full:        30 MB       0.386s
    Just over full:      20 MB       0.624s
The slowdown used to be a lot more before I improved the cost estimate
for write earlier today. Here is how we compare to other filesystems:
            Far from full   Just over full
    tux3:       0.303s          0.468s
    ext4:       0.399s          0.400s
    xfs:        0.293s          0.326s
    btrfs:      0.499s          0.531s
                (20 MB dd to ramdisk)
XFS ekes out a narrow win on straight up dd to the ramdisk, good job.
The gap widens when hitting the failure path, but again, not as much as
it did earlier today.
I do most of these no space tests on a ramdisk (actually, a loopback
mount on tmpfs) because it is easy to fill up. To show that the ramdisk
results are not wildly different from a real disk, here we see that the
pattern is largely unchanged:
           20 MB dd to a real disk
    tux3:       1.568s
    ext4:       1.523s
    xfs:        1.466s
    btrfs:      2.347s
XFS holds its dd lead on a real hard disk. We definitely need to learn
its trick.
Next we look at something with a bit more meat: unzipping the Git
source to multiple directories. Timings on ramdisk are the interesting
ones, because the volume approaches full on the longer test.
              10x to ram  40x to ram  10x to hdd   100x to hdd
    tux3:       2.251s      8.344s      2.671s       21.686s
    ext4:       2.343s      7.923s      3.165s       32.080s
    xfs:        2.682s     10.562s     11.870s      123.435s
    btrfs:      3.744s     15.825s      3.749s       72.405s
Tux3 is the fastest when not close to full, but Ext4 takes a slight
lead when close to full. Yesterday, that lead was much wider, and one
day we would be pleased to tie Ext4, which is really exemplary at this.
The hard disk times are there because they happened to be easy to get,
and it is interesting to see how much XFS and Btrfs are suffering on
traditional rust, XFS being the worst by far at 5.7 times slower than
Tux3.
The next one is a crash test: repeatedly untar a tarball until it
smashes into the wall, and see how long it takes to quit with an error.
Tar is nice for this because its failure handling is so awful: instead
of exiting on the first ENOSPC, it keeps banging at the full disk
until it has failed on each and every file in its archive. First I dd
the volume until just before full, then throw a tarball at it.
Time to fail when tar hits disk full:
    tux3:  0.563s
    ext4:  0.084s
    xfs:   0.116s
    btrfs: 0.460s
We respectfully concede that Ext4 is the king of fail and Tux3 is
worst. However, we only need to be good enough on this one, with less
than a second being a very crude definition of good enough.
The next one is something I ran into when I was testing out of space
detection with rewrites. This uses the "blurt" program at the end of
this post to do 40K writes from 1000 tasks in parallel, 1K at a time,
using the bash one liner:
    for ((j=1;j<10;j++)); do \
       for ((i=1;i<10;i++)); do \
          echo step $j:$i 1>&2 && mkdir -p fs/$i && \
          ~/blurt fs/$i/f 40 1000 || break 2; \
       done; \
    done
    Tux3:    4.136s (28% full when done)
    Ext4:    5.780s (31% full when done)
    XFS:    79.063s (!)
    Btrfs:  15.489s (fails with out of space when 30% full)
Blurt is a minor revision of my fsync load generator without the fsync,
and with an error exit on disk full. The intent of the outer loop is
to do rewrites with a thousand tasks in parallel, and see if out of
space accounting is accurate. XFS and Btrfs both embarrassed themselves
horribly. XFS falls off a performance cliff that makes it 19 times
slower than Tux3, and Btrfs hits ENOSPC when only 30% full according to
df, or 47% full if you prefer to believe its own df command:
    Data, single: total=728.00MiB, used=342.25MiB
    System, DUP: total=8.00MiB, used=16.00KiB
    System, single: total=4.00MiB, used=0.00B
    Metadata, DUP: total=65.00MiB, used=5.97MiB
    Metadata, single: total=8.00MiB, used=0.00B
    GlobalReserve, single: total=16.00MiB, used=0.00B
It seems that Btrfs still has not put its epic ENOSPC nightmare behind
it. I fervently hope that such a fate does not await Tux3, which hope
would appear to be well on its way to being borne out.
XFS should not do such bizarre things after 23 years of development,
while being billed as a mature, enterprise grade filesystem. It simply
is not there yet. Ext4 is exemplary in terms of reliability, and Tux3
has been really good through this round of torture tests, though
I will not claim that it is properly hardened just yet. I know it isn't.
We don't have any open bugs, but that is probably because we only have
two users. But Tux3 is remarkably solid for the number of man years
that have gone into it. Maybe Tux3 really will be ready for the
enterprise before XFS is.
In all of these tests, Tux3, Ext4 and XFS managed to fill up their
volumes to exactly 100%. Tux3 actually has a 100 block emergency
reserve that it never fills, and wastes a few more blocks if the last
transaction does not exactly use up its budget, but apparently that
still falls within the df utility's definition of 100%. Btrfs never gets
this right: full for it tends to range from 96% to 98%, and sometimes is
much lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.
One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.
Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.
Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.
Regards,
Daniel
[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.
 /*
 * Blurt v0.0
 *
 * A trivial multitasking filesystem load generator
 *
 * Daniel Phillips, June 2015
 *
 * to build: c99 -Wall blurt.c -oblurt
 * to run: blurt <basename> <steps> <tasks>
 */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
enum { chunk = 1024, sync = 0 };
char text[chunk] = { "hello world!\n" };
int main(int argc, const char *argv[]) {
	const char *basename = argc < 2 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int fd, status, errors = 0;
	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}
	for (int t = 0; t < tasks; t++) {
		wait(&status);
		if (WIFEXITED(status) && WEXITSTATUS(status))
			errors++;
	}
	return !!errors;
child:
	fd = creat(name, S_IRWXU);
	if (fd == -1)
		goto fail1;
	for (int i = 0; i < steps; i++) {
		int ret = write(fd, text, sizeof text);
		if (ret == -1)
			goto fail2;
		if (sync)
			fsync(fd);
	}
	return 0;
fail1:
	perror("create failed");
	return 1;
fail2:
	perror("write failed");
	return 1;
}
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Tux3 Report: How fast can we fail?
  2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
                   ` (2 preceding siblings ...)
  2015-05-12 17:41 ` Daniel Phillips
@ 2015-05-12 17:46 ` Daniel Phillips
  2015-05-13 22:07   ` Daniel Phillips
  2015-05-26 10:03   ` Pavel Machek
  2015-05-14  7:37 ` [WIP] tux3: Optimized fsync Daniel Phillips
  2015-05-14  8:26 ` [FYI] tux3: Core changes Daniel Phillips
  5 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-12 17:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, tux3, OGAWA Hirofumi
(reposted with correct subject line)
Tux3 now has a preliminary out of space handling algorithm. This might
sound like a small thing, but in fact handling out of space reliably
and efficiently is really hard, especially for Tux3. We developed an
original solution with unusually low overhead in the common case, and
simple enough to prove correct. Reliability seems good so far. But not
to keep anyone in suspense: Tux3 does not fail very fast, but it fails
very reliably. We like to think that Tux3 is better at succeeding than
failing.
We identified the following quality metrics for this algorithm:
 1) Never fails to detect out of space in the front end.
 2) Always fills a volume to 100% before reporting out of space.
 3) Allows rm, rmdir and truncate even when a volume is full.
 4) Writing to a nearly full volume is not excessively slow.
 5) Overhead is insignificant when a volume is far from full.
Like every filesystem that does delayed allocation, Tux3 must guess how
much media space will be needed to commit any update it accepts into
cache. It must not guess low or the commit may fail and lose data. This
is especially tricky for Tux3 because it does not track individual
updates, but instead, partitions updates atomically into delta groups
and commits each delta as an atomic unit. A single delta can be as
large as writable cache, including thousands of individual updates.
This delta scheme ensures perfect namespace, metadata and data
consistency without complex tracking of relationships between thousands
of cache objects, and also does delayed allocation about as well as it
can be done. Given these benefits, it is not too hard to accept some
extra pain in out of space accounting.
Speaking of accounting, we borrow some of that terminology to talk
about the problem. Each delta has a "budget" and computes a "balance"
that declines each time a transaction "cost" is "charged" against it.
The budget is all of free space, plus some space that belongs to
the current disk image that we know will be released soon, and less
a reserve for taking care of certain backend duties. When the balance
goes negative, the transaction backs out its cost, triggers a delta
transition, and tries again. This has the effect of shrinking the delta
size as a volume approaches full. When the delta budget finally shrinks
to less than the transaction cost, the update fails with ENOSPC.
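The charge, back out and retry cycle described above can be sketched in a few lines of C11. This is only a single-threaded toy under stated assumptions: the struct, the function names and the toy_transition helper are invented for illustration, and the toy transition pretends that every charged block was really consumed, which the real backend does not assume.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical accounting state; not taken from the Tux3 source. */
struct delta_account {
	atomic_long balance;	/* blocks the front delta may still charge */
	long budget;		/* balance the current delta started with */
	long free_blocks;	/* toy model of volume free space */
};

/* Toy delta transition: commit what was charged, then rebudget. */
static void toy_transition(struct delta_account *acct)
{
	long charged = acct->budget - atomic_load(&acct->balance);

	acct->free_blocks -= charged;
	acct->budget = acct->free_blocks;
	atomic_store(&acct->balance, acct->budget);
}

/* Charge `cost` blocks; returns 0, or -ENOSPC if no delta can hold it. */
int charge_cost(struct delta_account *acct, long cost)
{
	for (;;) {
		/* Charge first, then look at the resulting balance. */
		if (atomic_fetch_sub(&acct->balance, cost) - cost >= 0)
			return 0;
		/* Went negative: back out the charge and force a new delta. */
		atomic_fetch_add(&acct->balance, cost);
		toy_transition(acct);
		if (acct->budget < cost)
			return -ENOSPC;
	}
}
```

Each failed charge shrinks the next delta's budget, so a run of writes to a nearly full volume produces ever smaller deltas until the budget drops below a single transaction's cost.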
This is where the "how fast can we fail" question comes up. If our guess
at cost is way higher than actual blocks consumed, deltas take a long
time to shrink. Overestimating transaction cost by a factor of ten
can trigger over a hundred deltas before failing. Fortunately, deltas
are pretty fast, so we only keep the user waiting for a second or so
before delivering the bad news. We also slow down measurably, but not
horribly, when getting close to full. Ext4 by contrast flies along at
full speed right until it fills the volume, and stops on a dime at
exactly 100% full. I don't think that Tux3 will ever be as good at
failing as that, but we will try to get close.
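To see why an overestimate multiplies the number of deltas, here is a small simulation in C. It assumes a simplified model that is not taken from the Tux3 code: each delta's budget equals current free space, every transaction is charged est_cost blocks but really consumes real_cost blocks, and a delta ends as soon as the next charge would not fit. The function name is made up for this example.

```c
/*
 * Count how many deltas are committed before a transaction of
 * estimated cost `est_cost` no longer fits in free space.
 * Requires est_cost >= real_cost >= 1.
 */
long deltas_until_enospc(long free_blocks, long est_cost, long real_cost)
{
	long deltas = 0;

	while (free_blocks >= est_cost) {
		/* Transactions accepted before the balance would go negative. */
		long accepted = free_blocks / est_cost;

		/* The commit consumes only what was really needed. */
		free_blocks -= accepted * real_cost;
		deltas++;
	}
	return deltas;
}
```

With an exact estimate a single delta fills the volume. With a tenfold overestimate each delta consumes only about a tenth of the remaining space, so free space decays by roughly 0.9 per delta and a volume with a million free blocks needs more than a hundred deltas to get down to the last ten blocks.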
Before I get into how Tux3's out of space behavior stacks up against
other filesystems, there are some interesting details to touch on about
how we go about things.
Tux3's front/back arrangement is lockless, which is great for
performance but turns into a problem when front and back need to
cooperate about something like free space accounting. If we were willing
to add a spinlock between front and back this would be easy, but we don't
want to do that. Not only are we jealously protective of our lockless
design, but if our fast path suddenly became slower because of adding
essential functionality we might need to revise some posted benchmark
results. Better that we should do it right and get our accounting
almost for free.
The world of lockless algorithms is an arcane one indeed, just ask Paul
McKenney about that. The solution we came up with needs just two atomic
adds per transaction, and we will eventually turn one of those into a
per-cpu counter. As mentioned above, a frontend transaction backs out
its cost when the delta balance goes negative, so from the backend's
point of view, the balance is going up and down unpredictably all the
time. Delta transition can happen at any time, and somehow, the backend
must assign the new front delta its budget exactly at transition.
Meanwhile, the front delta balance is still going up and down
unpredictably. See the problem? The issue is, delta transition is truly
asynchronous. We can't change that short of adding locks with the
contention and stalls that go along with them.
Fortunately, one consequence of delta transition is that the total cost
charged to the delta instantly becomes stable when the front delta
becomes the back delta. Volume free space is also stable because only
the backend accesses it. The backend can easily measure the actual
space consumed by the back delta: it is the difference between free
space before and after flushing to media. Updating the front delta
budget is easy because only the backend changes it, but updating the
front delta balance is much harder because the front delta is busy
changing it. If we get this wrong, the resulting slight discrepancies
between budget, balance and charged costs would mean that somebody,
somewhere will hit out of space in the middle of a commit and end up
sticking pins into a voodoo doll that looks like us.
A solution was found that only took a few lines of code and some pencil
pushing. The backend knows what the front delta balance must have been
exactly at transition, because it knows the amount charged to the back
delta, and it knows the original budget. It can therefore deduce how
much the front balance should have increased exactly at transition (it
must always increase) so it adds that difference atomically to the
front delta balance. This has exactly the same effect as setting the
balance atomically at transition time, if that were possible, which it
is not. This technique is airtight, and the whole algorithm ends up
costing less than a hundred nanoseconds per transaction.[1] This is a
good thing because each page of a Tux3 write is a separate transaction,
so any significant overhead would stick out like a sore thumb.
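One way to read the adjustment described above, sketched in C11. This is an interpretation rather than the actual Tux3 code: it assumes a single shared atomic balance that the frontend keeps charging across the transition, and the function and parameter names are invented for the example.

```c
#include <stdatomic.h>

/*
 * Called by the backend some time after a delta transition.
 *
 *   old_budget: budget the previous front delta started with
 *   charged:    total cost charged to that delta, stable now that
 *               it has become the back delta
 *   new_budget: budget the backend computed for the new front delta
 *
 * The balance at the instant of transition must have been
 * old_budget - charged. Adding the difference between the new budget
 * and that value has the same effect as storing new_budget at that
 * instant, while preserving every charge the frontend has made to
 * the new delta since then.
 */
void rebase_balance(atomic_long *balance, long old_budget, long charged,
		    long new_budget)
{
	long balance_at_transition = old_budget - charged;

	atomic_fetch_add(balance, new_budget - balance_at_transition);
}
```

For example, with an old budget of 100 and 60 blocks charged, the balance at transition was 40. If the frontend has since charged 5 more and the new budget is 90, adding 50 leaves the balance at 85, exactly 90 minus the 5 already charged to the new delta.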
Accounting cost estimates properly and stopping when actually out of
space is just the core of the algorithm. We must feed that core with
good, provable worst case cost estimates. To get an initial idea of
whether the algorithm works, we just plugged in some magic numbers, and
lo and behold, suddenly we were not running out of space in the
backend any more. But to do the job properly we need to consider things
like the file index btree depth, because just plugging in a number large
enough to handle the deepest possible btree would slow down our failure
path way too much.
The best way to account for btree depth is to make it disappear entirely
by removing the btree updates from the delta commit path. We already do
that for bitmaps, which is a good thing because our bitmaps are just
blocks that live in a normal file. Multiplying our worst case by the
maximum number of bitmaps that could possibly be affected, and then
multiplying that by the worst case change to the bitmap metadata,
including its data btree, its inode, and the inode table btree, would be
a real horror. Instead, we log all changes that affect the bitmap and
only update the bitmaps periodically at unify cycles. A Tux3 filesystem
is consistent whether or not we unify, so if space becomes critically
tight the backend can just disable the unify. The only bad effect is
that the log chain can grow and make replay take longer, but that growth
is limited by the fact that there is not much space left for more log
blocks.
If we did not have this nice way of making bitmap overhead disappear,
we would not be anywhere close to a respectable implementation today.
Actually, we weren't even thinking about out of space accounting when
we developed this design element; we were trying to get rid of
the overhead of updating bitmaps per delta. That worked well and is a
significant part of the reason why we can outrun Ext4 while having a
very similar structure. The benefit for space accounting dropped out
just by dumb luck.
The same technique we use for hiding bitmap update cost works just as
well for btree metadata. Eventually, we will move btree leaf redirecting
from the delta flush to the unify flush. That will speed it up by
coalescing some index block writes and also make it vanish from the
transaction cost estimate, saving frontend CPU and speeding up the
failure path. What's not to like? It is on the list of things to do.
Today, I refactored the budgeting algorithm to skip the cost estimate
if a page is already dirty, which tightened up the estimate by a factor
of four or so and made things run smoother. There will be more
incremental improvements as time goes by. For example, we currently
overestimate the cost of a rewrite because we would need to go poking
around in btrees to do that more accurately. Fixing that will be quite
a bit of work, but somebody will probably do it, sometime.
Now the fun part: performance and bugs. Being anxious to know where
Tux3 stands with respect to the usual suspects, I ran some tests and
found that Ext4 is amazingly good at this, while XFS and Btrfs have
some serious issues. Details below.
Tux3 slows down when approaching a full state, hopefully not too much.
To quantify that, here is what happens with a 200 MB dd to a loopback
mounted file on tmpfs:
                     Volume Size    Run Time
    No check at all:   1500 MB       0.306s
    Far from full:     1500 MB       0.318s
    Getting full:        30 MB       0.386s
    Just over full:      20 MB       0.624s
The slowdown used to be a lot more before I improved the cost estimate
for write earlier today. Here is how we compare to other filesystems:
            Far from full   Just over full
    tux3:       0.303s          0.468s
    ext4:       0.399s          0.400s
    xfs:        0.293s          0.326s
    btrfs:      0.499s          0.531s
                (20 MB dd to ramdisk)
XFS ekes out a narrow win on straight up dd to the ramdisk, good job.
The gap widens when hitting the failure path, but again, not as much as
it did earlier today.
I do most of these no space tests on a ramdisk (actually, a loopback
mount on tmpfs) because it is easy to fill up. To show that the ramdisk
results are not wildly different from a real disk, here we see that the
pattern is largely unchanged:
           20 MB dd to a real disk
    tux3:       1.568s
    ext4:       1.523s
    xfs:        1.466s
    btrfs:      2.347s
XFS holds its dd lead on a real hard disk. We definitely need to learn
its trick.
Next we look at something with a bit more meat: unzipping the Git
source to multiple directories. Timings on ramdisk are the interesting
ones, because the volume approaches full on the longer test.
              10x to ram  40x to ram  10x to hdd   100x to hdd
    tux3:       2.251s      8.344s      2.671s       21.686s
    ext4:       2.343s      7.923s      3.165s       32.080s
    xfs:        2.682s     10.562s     11.870s      123.435s
    btrfs:      3.744s     15.825s      3.749s       72.405s
Tux3 is the fastest when not close to full, but Ext4 takes a slight
lead when close to full. Yesterday, that lead was much wider, and one
day we would be pleased to tie Ext4, which is really exemplary at this.
The hard disk times are there because they happened to be easy to get,
and it is interesting to see how much XFS and Btrfs are suffering on
traditional rust, XFS being the worst by far at 5.7 times slower than
Tux3.
The next one is a crash test: repeatedly untar a tarball until it
smashes into the wall, and see how long it takes to quit with an error.
Tar is nice for this because its failure handling is so awful: instead
of exiting on the first ENOSPC, it keeps banging at the full disk
until it has failed on each and every file in its archive. First I dd
the volume until just before full, then throw a tarball at it.
Time to fail when tar hits disk full:
    tux3:  0.563s
    ext4:  0.084s
    xfs:   0.116s
    btrfs: 0.460s
We respectfully concede that Ext4 is the king of fail and Tux3 is
worst. However, we only need to be good enough on this one, with less
than a second being a very crude definition of good enough.
The next one is something I ran into when I was testing out of space
detection with rewrites. This uses the "blurt" program at the end of
this post to do 40K writes from 1000 tasks in parallel, 1K at a time,
using the bash one liner:
    for ((j=1;j<10;j++)); do \
       for ((i=1;i<10;i++)); do \
          echo step $j:$i 1>&2 && mkdir -p fs/$i && \
          ~/blurt fs/$i/f 40 1000 || break 2; \
       done; \
    done
    Tux3:    4.136s (28% full when done)
    Ext4:    5.780s (31% full when done)
    XFS:    79.063s (!)
    Btrfs:  15.489s (fails with out of space when 30% full)
Blurt is a minor revision of my fsync load generator without the fsync,
and with an error exit on disk full. The intent of the outer loop is
to do rewrites with a thousand tasks in parallel, and see if out of
space accounting is accurate. XFS and Btrfs both embarrassed themselves
horribly. XFS falls off a performance cliff that makes it 19 times
slower than Tux3, and Btrfs hits ENOSPC when only 30% full according to
df, or 47% full if you prefer to believe its own df command:
    Data, single: total=728.00MiB, used=342.25MiB
    System, DUP: total=8.00MiB, used=16.00KiB
    System, single: total=4.00MiB, used=0.00B
    Metadata, DUP: total=65.00MiB, used=5.97MiB
    Metadata, single: total=8.00MiB, used=0.00B
    GlobalReserve, single: total=16.00MiB, used=0.00B
It seems that Btrfs still has not put its epic ENOSPC nightmare behind
it. I fervently hope that such a fate does not await Tux3, which hope
would appear to be well on its way to being borne out.
XFS should not do such bizarre things after 23 years of development,
while being billed as a mature, enterprise grade filesystem. It simply
is not there yet. Ext4 is exemplary in terms of reliability, and Tux3
has been really good through this round of torture tests, though
I will not claim that it is properly hardened just yet. I know it isn't.
We don't have any open bugs, but that is probably because we only have
two users. But Tux3 is remarkably solid for the number of man years
that have gone into it. Maybe Tux3 really will be ready for the
enterprise before XFS is.
In all of these tests, Tux3, Ext4 and XFS managed to fill up their
volumes to exactly 100%. Tux3 actually has a 100 block emergency
reserve that it never fills, and wastes a few more blocks if the last
transaction does not exactly use up its budget, but apparently that
still falls within the df utility's definition of 100%. Btrfs never gets
this right: full for it tends to range from 96% to 98%, and sometimes is
much lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.
One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.
Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.
Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.
Regards,
Daniel
[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.
 /*
 * Blurt v0.0
 *
 * A trivial multitasking filesystem load generator
 *
 * Daniel Phillips, June 2015
 *
 * to build: c99 -Wall blurt.c -oblurt
 * to run: blurt <basename> <steps> <tasks>
 */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
enum { chunk = 1024, sync = 0 };
char text[chunk] = { "hello world!\n" };
int main(int argc, const char *argv[]) {
	const char *basename = argc < 2 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int fd, status, errors = 0;
	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}
	for (int t = 0; t < tasks; t++) {
		wait(&status);
		if (WIFEXITED(status) && WEXITSTATUS(status))
			errors++;
	}
	return !!errors;
child:
	fd = creat(name, S_IRWXU);
	if (fd == -1)
		goto fail1;
	for (int i = 0; i < steps; i++) {
		int ret = write(fd, text, sizeof text);
		if (ret == -1)
			goto fail2;
		if (sync)
			fsync(fd);
	}
	return 0;
fail1:
	perror("create failed");
	return 1;
fail2:
	perror("write failed");
	return 1;
}
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: Tux3 Report: How fast can we fail?
  2015-05-12 17:46 ` Tux3 Report: How fast can we fail? Daniel Phillips
@ 2015-05-13 22:07   ` Daniel Phillips
  2015-05-26 10:03   ` Pavel Machek
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-13 22:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, tux3, OGAWA Hirofumi
Addendum to that post...
On 05/12/2015 10:46 AM, I wrote:
> ...For example, we currently
> overestimate the cost of a rewrite because we would need to go poking
> around in btrees to do that more accurately. Fixing that will be quite
> a bit of work...
Ah no, I was wrong about that, it will not be a lot of work because
it does not need to be done.
Charging the full cost of a rewrite as if it is a new write is the
right thing to do because we need to be sure the commit can allocate
space to redirect the existing blocks before it frees the old ones.
So that means there is no need for the front end to go delving into
file metadata, ever, which is a relief because that would have been
expensive and messy.
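The arithmetic behind that can be written down as a tiny C sketch. It counts data blocks only and ignores metadata, and both function names are made up for illustration.

```c
/*
 * Lowest free-block count reached while committing a delta that
 * redirects `rewritten` existing blocks. Every new copy is allocated
 * before any old block is released, so the low point comes after the
 * last allocation. A negative result means the commit would have run
 * out of space part way through.
 */
long min_free_during_commit(long free_blocks, long rewritten)
{
	return free_blocks - rewritten;
}

/*
 * Cost the front end charges for rewriting `rewritten` blocks: the
 * same as for a brand new write, with no look at file metadata.
 */
long rewrite_cost(long rewritten)
{
	return rewritten;
}
```

Charging less than rewrite_cost for a rewrite would let the front end accept an update whose commit dips below zero free blocks before the old copies are released, even though net space use ends up unchanged.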
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-12 17:46 ` Tux3 Report: How fast can we fail? Daniel Phillips
  2015-05-13 22:07   ` Daniel Phillips
@ 2015-05-26 10:03   ` Pavel Machek
  2015-05-27  6:41     ` Mosis Tembo
  2015-05-27  7:37     ` Mosis Tembo
  1 sibling, 2 replies; 160+ messages in thread
From: Pavel Machek @ 2015-05-26 10:03 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-fsdevel, tux3, linux-kernel, OGAWA Hirofumi
> We identified the following quality metrics for this algorithm:
> 
>  1) Never fails to detect out of space in the front end.
>  2) Always fills a volume to 100% before reporting out of space.
>  3) Allows rm, rmdir and truncate even when a volume is full.
Hmm. Can you also overwrite existing data in files when a volume is
full? I guess applications expect that to work..
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-26 10:03   ` Pavel Machek
@ 2015-05-27  6:41     ` Mosis Tembo
  2015-05-27 18:28       ` Daniel Phillips
  2015-05-27  7:37     ` Mosis Tembo
  1 sibling, 1 reply; 160+ messages in thread
From: Mosis Tembo @ 2015-05-27  6:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Daniel Phillips, linux-fsdevel, linux-kernel, OGAWA Hirofumi,
	tux3
[-- Attachment #1.1: Type: text/plain, Size: 1004 bytes --]
On Tue, May 26, 2015 at 6:03 PM, Pavel Machek <pavel@ucw.cz> wrote:
>
> > We identified the following quality metrics for this algorithm:
> >
> >  1) Never fails to detect out of space in the front end.
> >  2) Always fills a volume to 100% before reporting out of space.
> >  3) Allows rm, rmdir and truncate even when a volume is full.
>
This is definitely nonsense. You can not rm, rmdir and truncate
when the volume is full. You will need a free space on disk to perform
such operations. Do you know why?
M.T.
>
> Hmm. Can you also overwrite existing data in files when a volume is
> full? I guess applications expect that to work..
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures)
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27  6:41     ` Mosis Tembo
@ 2015-05-27 18:28       ` Daniel Phillips
  2015-05-27 21:39         ` Pavel Machek
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-27 18:28 UTC (permalink / raw)
  To: Mosis Tembo
  Cc: linux-fsdevel, tux3, OGAWA Hirofumi, linux-kernel, Pavel Machek
On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:
> On Tue, May 26, 2015 at 6:03 PM, Pavel Machek <pavel@ucw.cz> wrote:
>
>> 
>>> We identified the following quality metrics for this algorithm:
>>> 
>>>  1) Never fails to detect out of space in the front end.
>>>  2) Always fills a volume to 100% before reporting out of space.
>>>  3) Allows rm, rmdir and truncate even when a volume is full.
>
> This is definitely nonsense. You can not rm, rmdir and truncate
> when the volume is full. You will need a free space on disk to perform
> such operations. Do you know why?
Because some extra space needs to be on the volume in order to do the
atomic commit. Specifically, there must be enough extra space to keep
both old and new copies of any changed metadata, plus enough space for
new data or metadata. You are almost right: we can't support rm, rmdir
or truncate _with atomic commit_ unless some space is available on the
volume. So we keep a small reserve to handle those operations, which
only those operations can access. We define the volume as "full" when
only the reserve remains. The reserve is not included in "available"
blocks reported to statfs, so the volume appears to be 100% full when
only the reserve remains.
For Tux3, that reserve is variable - about 1% of free space, declining
to a minimum of 10 blocks as free space runs out. Eventually, we will
reduce the minimum a bit as we develop finer control over how free
space is used in very low space conditions, but 10 blocks is not bad
at all. With no journal and only 10 blocks of unusable space, we do
pretty well with tiny volumes.
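As a rough illustration of the numbers above, here is a small user-space sketch. The exact formula is an assumption: the text only says "about 1% of free space" with a floor of 10 blocks, so reserve_blocks and available_blocks are hypothetical helpers, not the Tux3 implementation.

```c
#include <assert.h>
#include <stdint.h>

enum { MIN_RESERVE_BLOCKS = 10 };	/* floor quoted in the text */

/* Assumed rule: 1% of free space, never less than the floor. */
static uint64_t reserve_blocks(uint64_t free_blocks)
{
	uint64_t reserve = free_blocks / 100;

	return reserve < MIN_RESERVE_BLOCKS ? MIN_RESERVE_BLOCKS : reserve;
}

/*
 * Blocks that would be reported as "available": the reserve is
 * excluded, so the volume reads as full once only the reserve remains.
 */
static uint64_t available_blocks(uint64_t free_blocks)
{
	uint64_t reserve = reserve_blocks(free_blocks);

	assert(reserve >= MIN_RESERVE_BLOCKS);
	return free_blocks > reserve ? free_blocks - reserve : 0;
}
```

Under these assumptions a volume with 100000 free blocks holds back 1000, while a volume with 10 or fewer free blocks reports zero available.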
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27 18:28       ` Daniel Phillips
@ 2015-05-27 21:39         ` Pavel Machek
  2015-05-27 22:46           ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2015-05-27 21:39 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Mosis Tembo, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
On Wed 2015-05-27 11:28:50, Daniel Phillips wrote:
> On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:
> >On Tue, May 26, 2015 at 6:03 PM, Pavel Machek <pavel@ucw.cz> wrote:
> >
> >>
> >>>We identified the following quality metrics for this algorithm:
> >>>
> >>> 1) Never fails to detect out of space in the front end.
> >>> 2) Always fills a volume to 100% before reporting out of space.
> >>> 3) Allows rm, rmdir and truncate even when a volume is full.
> >
> >This is definitely nonsense. You can not rm, rmdir and truncate
> >when the volume is full. You will need a free space on disk to perform
> >such operations. Do you know why?
> 
> Because some extra space needs to be on the volume in order to do the
> atomic commit. Specifically, there must be enough extra space to keep
> both old and new copies of any changed metadata, plus enough space for
> new data or metadata. You are almost right: we can't support rm, rmdir
> or truncate _with atomic commit_ unless some space is available on the
> volume. So we keep a small reserve to handle those operations, which
> only those operations can access. We define the volume as "full" when
> only the reserve remains. The reserve is not included in "available"
> blocks reported to statfs, so the volume appears to be 100% full when
> only the reserve remains.
> 
> For Tux3, that reserve is variable - about 1% of free space, declining
> to a minimum of 10 blocks as free space runs out. Eventually, we will
> reduce the minimum a bit as we develop finer control over how free
> space is used in very low space conditions, but 10 blocks is not bad
> at all. With no journal and only 10 blocks of unusable space, we do
> pretty well with tiny volumes.
Yeah. Filesystem that could not do rm on full filesystem would be
braindead.
Now, what about
1) writing to already-allocated space in existing files?
2) writing to already-allocated space in existing files using mmap?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27 21:39         ` Pavel Machek
@ 2015-05-27 22:46           ` Daniel Phillips
  2015-05-28 12:55             ` Austin S Hemmelgarn
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-27 22:46 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-fsdevel, tux3, Mosis Tembo, linux-kernel, OGAWA Hirofumi
On 05/27/2015 02:39 PM, Pavel Machek wrote:
> On Wed 2015-05-27 11:28:50, Daniel Phillips wrote:
>> On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:
>>> On Tue, May 26, 2015 at 6:03 PM, Pavel Machek <pavel@ucw.cz> wrote:
>>>
>>>>
>>>>> We identified the following quality metrics for this algorithm:
>>>>>
>>>>> 1) Never fails to detect out of space in the front end.
>>>>> 2) Always fills a volume to 100% before reporting out of space.
>>>>> 3) Allows rm, rmdir and truncate even when a volume is full.
>>>
>>> This is definitely nonsense. You can not rm, rmdir and truncate
>>> when the volume is full. You will need a free space on disk to perform
>>> such operations. Do you know why?
>>
>> Because some extra space needs to be on the volume in order to do the
>> atomic commit. Specifically, there must be enough extra space to keep
>> both old and new copies of any changed metadata, plus enough space for
>> new data or metadata. You are almost right: we can't support rm, rmdir
>> or truncate _with atomic commit_ unless some space is available on the
>> volume. So we keep a small reserve to handle those operations, which
>> only those operations can access. We define the volume as "full" when
>> only the reserve remains. The reserve is not included in "available"
>> blocks reported to statfs, so the volume appears to be 100% full when
>> only the reserve remains.
>>
>> For Tux3, that reserve is variable - about 1% of free space, declining
>> to a minimum of 10 blocks as free space runs out. Eventually, we will
>> reduce the minimum a bit as we develop finer control over how free
>> space is used in very low space conditions, but 10 blocks is not bad
>> at all. With no journal and only 10 blocks of unusable space, we do
>> pretty well with tiny volumes.
> 
> Yeah. Filesystem that could not do rm on full filesystem would be
> braindead.
> 
> Now, what about
> 
> 1) writing to already-allocated space in existing files?
I mentioned earlier, it seems to work pretty well in Tux3. But do user
applications really expect it to work? I do not know of any, perhaps
you do.
Incidentally, I have been torture testing this very property using a
32K filesystem consisting of 64 x 512 byte blocks, with repeated dd,
mknod, rm, etc. Just to show that we are serious about getting this
part right.
> 2) writing to already-allocated space in existing files using mmap?
Not part of the preliminary nospace patch, but planned. I intend to
work on that detail after merge.
The problem is almost the same as write(2) in that the reserve must be
large enough to accommodate both old and new versions of all data
blocks, otherwise we lose our ACID, which we will go to great lengths
to avoid losing. The thing that makes this work nicely is the way the
delta shrinks as freespace runs out, which is the central point of our
new nospace algorithm.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27 22:46           ` Daniel Phillips
@ 2015-05-28 12:55             ` Austin S Hemmelgarn
  0 siblings, 0 replies; 160+ messages in thread
From: Austin S Hemmelgarn @ 2015-05-28 12:55 UTC (permalink / raw)
  To: Daniel Phillips, Pavel Machek
  Cc: linux-fsdevel, tux3, Mosis Tembo, linux-kernel, OGAWA Hirofumi
[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]
On 2015-05-27 18:46, Daniel Phillips wrote:
>
>
> On 05/27/2015 02:39 PM, Pavel Machek wrote:
>> On Wed 2015-05-27 11:28:50, Daniel Phillips wrote:
>>> On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:
>>>> On Tue, May 26, 2015 at 6:03 PM, Pavel Machek <pavel@ucw.cz> wrote:
>>>>
>>>>>
>>>>>> We identified the following quality metrics for this algorithm:
>>>>>>
>>>>>> 1) Never fails to detect out of space in the front end.
>>>>>> 2) Always fills a volume to 100% before reporting out of space.
>>>>>> 3) Allows rm, rmdir and truncate even when a volume is full.
>>>>
>>>> This is definitely nonsense. You can not rm, rmdir and truncate
>>>> when the volume is full. You will need a free space on disk to perform
>>>> such operations. Do you know why?
>>>
>>> Because some extra space needs to be on the volume in order to do the
>>> atomic commit. Specifically, there must be enough extra space to keep
>>> both old and new copies of any changed metadata, plus enough space for
>>> new data or metadata. You are almost right: we can't support rm, rmdir
>>> or truncate _with atomic commit_ unless some space is available on the
>>> volume. So we keep a small reserve to handle those operations, which
>>> only those operations can access. We define the volume as "full" when
>>> only the reserve remains. The reserve is not included in "available"
>>> blocks reported to statfs, so the volume appears to be 100% full when
>>> only the reserve remains.
>>>
>>> For Tux3, that reserve is variable - about 1% of free space, declining
>>> to a minimum of 10 blocks as free space runs out. Eventually, we will
>>> reduce the minimum a bit as we develop finer control over how free
>>> space is used in very low space conditions, but 10 blocks is not bad
>>> at all. With no journal and only 10 blocks of unusable space, we do
>>> pretty well with tiny volumes.
>>
>> Yeah. Filesystem that could not do rm on full filesystem would be
>> braindead.
>>
>> Now, what about
>>
>> 1) writing to already-allocated space in existing files?
>
> I mentioned earlier, it seems to work pretty well in Tux3. But do user
> applications really expect it to work? I do not know of any, perhaps
> you do.
I don't know of any applications that do, although I do know of quite a 
few users who would expect it to work (myself included).  This kind of 
thing could (depending on how the system in question is configured) 
potentially be critical for recovering from such a situation.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
 
 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-26 10:03   ` Pavel Machek
  2015-05-27  6:41     ` Mosis Tembo
@ 2015-05-27  7:37     ` Mosis Tembo
  2015-05-27 14:04       ` Austin S Hemmelgarn
  1 sibling, 1 reply; 160+ messages in thread
From: Mosis Tembo @ 2015-05-27  7:37 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
On 05/26/2015 12:03 PM, Pavel Machek wrote:
>> We identified the following quality metrics for this algorithm:
>>
>>   1) Never fails to detect out of space in the front end.
>>   2) Always fills a volume to 100% before reporting out of space.
>>   3) Allows rm, rmdir and truncate even when a volume is full.
This is definitely nonsense. You can not rm, rmdir and truncate
when the volume is full. You will need a free space on disk to perform
such operations. Do you know why?
M.T.
> Hmm. Can you also overwrite existing data in files when a volume is
> full? I guess applications expect that to work..
> 									Pavel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27  7:37     ` Mosis Tembo
@ 2015-05-27 14:04       ` Austin S Hemmelgarn
  2015-05-27 15:21         ` Mosis Tembo
  0 siblings, 1 reply; 160+ messages in thread
From: Austin S Hemmelgarn @ 2015-05-27 14:04 UTC (permalink / raw)
  To: Mosis Tembo, linux-kernel, linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 1109 bytes --]
On 2015-05-27 03:37, Mosis Tembo wrote:
>
> On 05/26/2015 12:03 PM, Pavel Machek wrote:
>>> We identified the following quality metrics for this algorithm:
>>>
>>>   1) Never fails to detect out of space in the front end.
>>>   2) Always fills a volume to 100% before reporting out of space.
>>>   3) Allows rm, rmdir and truncate even when a volume is full.
>
> This is definitely nonsense. You can not rm, rmdir and truncate
> when the volume is full. You will need a free space on disk to perform
> such operations. Do you know why?
>
I assume you are referring either to Tux3 specifically or COW 
filesystems in general, because you very much _can_ do any of those on 
any of the non-COW filesystems in the Linux kernel (I know from 
experience).  Also, IIRC, it was mentioned somewhere that Tux3 keeps a 
small reserve of space on the volume for internal operations; and, I 
would assume that if that is the case, it reports the volume full when 
everything *except* that reserve of space is used, in which case rm, 
rmdir, and truncate should work fine when the volume is full.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27 14:04       ` Austin S Hemmelgarn
@ 2015-05-27 15:21         ` Mosis Tembo
  2015-05-27 15:37           ` Austin S Hemmelgarn
  0 siblings, 1 reply; 160+ messages in thread
From: Mosis Tembo @ 2015-05-27 15:21 UTC (permalink / raw)
  To: Austin S Hemmelgarn, linux-kernel, linux-fsdevel
On 05/27/2015 04:04 PM, Austin S Hemmelgarn wrote:
> On 2015-05-27 03:37, Mosis Tembo wrote:
>>
>> On 05/26/2015 12:03 PM, Pavel Machek wrote:
>>>> We identified the following quality metrics for this algorithm:
>>>>
>>>>   1) Never fails to detect out of space in the front end.
>>>>   2) Always fills a volume to 100% before reporting out of space.
>>>>   3) Allows rm, rmdir and truncate even when a volume is full.
>>
>> This is definitely nonsense. You can not rm, rmdir and truncate
>> when the volume is full. You will need a free space on disk to perform
>> such operations. Do you know why?
>>
> I assume you are referring either to Tux3 specifically or COW 
> filesystems in general,
I am referring to modern file systems with transaction models and 
delayed actions.
Tux3 is not the case?
> because you very much _can_ do any of those on any of the non-COW 
> filesystems in the Linux kernel
It is simply incorrect. ReiserFS is a counterexample.
> (I know from experience).  Also, IIRC, it was mentioned somewhere that 
> Tux3 keeps a small reserve of space on the volume for internal 
> operations; and, I would assume that if that is the case, it reports 
> the volume full when everything *except* that reserve of space is 
> used, in which case rm, rmdir, and truncate should work fine when the 
> volume is full.
Sorry, I prefer to not manipulate with rumors and assumptions when it comes
to the review for kernel inclusion.
Thanks,
M.T.
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: Tux3 Report: How fast can we fail?
  2015-05-27 15:21         ` Mosis Tembo
@ 2015-05-27 15:37           ` Austin S Hemmelgarn
  0 siblings, 0 replies; 160+ messages in thread
From: Austin S Hemmelgarn @ 2015-05-27 15:37 UTC (permalink / raw)
  To: Mosis Tembo, linux-kernel, linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 2494 bytes --]
On 2015-05-27 11:21, Mosis Tembo wrote:
>
> On 05/27/2015 04:04 PM, Austin S Hemmelgarn wrote:
>> On 2015-05-27 03:37, Mosis Tembo wrote:
>>>
>>> On 05/26/2015 12:03 PM, Pavel Machek wrote:
>>>>> We identified the following quality metrics for this algorithm:
>>>>>
>>>>>   1) Never fails to detect out of space in the front end.
>>>>>   2) Always fills a volume to 100% before reporting out of space.
>>>>>   3) Allows rm, rmdir and truncate even when a volume is full.
>>>
>>> This is definitely nonsense. You can not rm, rmdir and truncate
>>> when the volume is full. You will need a free space on disk to perform
>>> such operations. Do you know why?
>>>
>> I assume you are referring either to Tux3 specifically or COW
>> filesystems in general,
>
>
> I am referring to modern file systems with transaction models and
> delayed actions.
> Tux3 is not the case?
>
In a sensibly designed non-COW filesystem, unlink, truncate, and 
FALLOC_FL_{PUNCH_HOLE,COLLAPSE_RANGE} should never need to allocate 
anything.  On well designed COW filesystems, you keep a reserve of space 
that is only available for temporary internal use so that these will 
work even when you report the volume as 100% full so you can free space. 
  This is what BTRFS does (although it doesn't always work because of 
segregating data and metadata), and I believe that tux3 does this also, 
although I don't remember for certain.
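A sketch of the reserve idea in user-space C, to make the policy concrete. Everything here is hypothetical (enum op_class, op_frees_space, may_allocate); it is not code from BTRFS or Tux3, only an illustration of "operations that free space may dip into the reserve, everything else may not".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation classes for the admission check. */
enum op_class { OP_WRITE, OP_CREATE, OP_UNLINK, OP_RMDIR, OP_TRUNCATE };

/* Operations whose net effect is to release space. */
static bool op_frees_space(enum op_class op)
{
	return op == OP_UNLINK || op == OP_RMDIR || op == OP_TRUNCATE;
}

/*
 * Decide whether an operation that temporarily needs `need` blocks may
 * proceed. Space-freeing operations may use the whole free pool,
 * including the reserve; all others see only free space minus reserve.
 */
static bool may_allocate(enum op_class op, uint64_t free_blocks,
			 uint64_t reserve, uint64_t need)
{
	uint64_t usable = free_blocks;

	assert(need > 0);
	if (!op_frees_space(op))
		usable = free_blocks > reserve ? free_blocks - reserve : 0;
	return need <= usable;
}
```

With free_blocks equal to the reserve, an unlink that needs a few blocks for its commit still passes, while a one-block write is refused, which is the "100% full but rm still works" behaviour discussed in this thread.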
>
>> because you very much _can_ do any of those on any of the non-COW
>> filesystems in the Linux kernel
>
>
> It is simply incorrect. ReiserFS is a counterexample.
>
Apologies, I didn't know about ReiserFS having issues with that (it's 
the only one that I haven't used, and this is yet another reason I 
probably never will), but I know for a fact that it does work on ext*, 
XFS, and JFS (I'm not entirely certain about OCFS2 and GFS2, and NILFS2 
is technically COW because it's log structured).
>
>> (I know from experience).  Also, IIRC, it was mentioned somewhere that
>> Tux3 keeps a small reserve of space on the volume for internal
>> operations; and, I would assume that if that is the case, it reports
>> the volume full when everything *except* that reserve of space is
>> used, in which case rm, rmdir, and truncate should work fine when the
>> volume is full.
>
>
> Sorry, I prefer to not manipulate with rumors and assumptions when it comes
> to the review for kernel inclusion.
>
Entirely understandable.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]
^ permalink raw reply	[flat|nested] 160+ messages in thread
 
 
 
 
 
- * [WIP] tux3: Optimized fsync
  2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
                   ` (3 preceding siblings ...)
  2015-05-12 17:46 ` Tux3 Report: How fast can we fail? Daniel Phillips
@ 2015-05-14  7:37 ` Daniel Phillips
  2015-05-14  8:26 ` [FYI] tux3: Core changes Daniel Phillips
  5 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-14  7:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, tux3, OGAWA Hirofumi
Greetings,
This diff against head (f59558a04c5ad052dc03ceeda62ccf31f4ab0004) of
   https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi-user
provides the optimized fsync code that was used to generate the
benchmark results here:
   https://lkml.org/lkml/2015/4/28/838
   "How fast can we fsync?"
This patch also applies to:
   https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi
which is a 3.19 kernel cloned from mainline. (Preferred)
Build instructions are on the wiki:
   https://github.com/OGAWAHirofumi/linux-tux3/wiki
There is some slight skew in the instructions because this is
not on master yet.
****************************************************************
*****  Caveat: No out of space handling on this branch!  *******
*** If you run out of space you will get a mysterious assert ***
****************************************************************
Enjoy!
Daniel
diff --git a/fs/tux3/buffer.c b/fs/tux3/buffer.c
index ef0d917..a141687 100644
--- a/fs/tux3/buffer.c
+++ b/fs/tux3/buffer.c
@@ -29,7 +29,7 @@ TUX3_DEFINE_STATE_FNS(unsigned long, buf, BUFDELTA_AVAIL, BUFDELTA_BITS,
  * may not work on all arch (If set_bit() and cmpxchg() is not
  * exclusive, this has race).
  */
-static void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
+void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
 {
 	unsigned long state, old_state;
 
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..955c441a 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -289,12 +289,13 @@ static int commit_delta(struct sb *sb)
 		req_flag |= REQ_NOIDLE | REQ_FLUSH | REQ_FUA;
 	}
 
-	trace("commit %i logblocks", be32_to_cpu(sb->super.logcount));
+	trace("commit %i logblocks", logcount(sb));
 	err = save_metablock(sb, req_flag);
 	if (err)
 		return err;
 
-	tux3_wake_delta_commit(sb);
+	if (!fsync_mode(sb))
+		tux3_wake_delta_commit(sb);
 
 	/* Commit was finished, apply defered bfree. */
 	return unstash(sb, &sb->defree, apply_defered_bfree);
@@ -314,8 +315,7 @@ static void post_commit(struct sb *sb, unsigned delta)
 
 static int need_unify(struct sb *sb)
 {
-	static unsigned crudehack;
-	return !(++crudehack % 3);
+	return logcount(sb) > 300; /* FIXME: should be based on bandwidth and tunable */
 }
 
 /* For debugging */
@@ -359,7 +359,7 @@ static int do_commit(struct sb *sb, int flags)
 	 * FIXME: there is no need to commit if normal inodes are not
 	 * dirty? better way?
 	 */
-	if (!(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
+	if (0 && !(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
 		goto out;
 
 	/* Prepare to wait I/O */
@@ -402,6 +402,7 @@ static int do_commit(struct sb *sb, int flags)
 #endif
 
 	if ((!no_unify && need_unify(sb)) || (flags & __FORCE_UNIFY)) {
+		trace("unify %u, delta %u", sb->unify, delta);
 		err = unify_log(sb);
 		if (err)
 			goto error; /* FIXME: error handling */
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 59d6781..31cd51e 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -198,6 +198,8 @@ long tux3_writeback(struct super_block *super, struct bdi_writeback *wb,
 	if (work->reason == WB_REASON_SYNC)
 		goto out;
 
+	trace("tux3_writeback, reason = %i", work->reason);
+
 	if (work->reason == WB_REASON_TUX3_PENDING) {
 		struct tux3_wb_work *wb_work;
 		/* Specified target delta for staging. */
@@ -343,3 +345,7 @@ static void schedule_flush_delta(struct sb *sb, struct delta_ref *delta_ref)
 	sb->delta_pending++;
 	wake_up_all(&sb->delta_transition_wq);
 }
+
+#ifdef __KERNEL__
+#include "commit_fsync.c"
+#endif
diff --git a/fs/tux3/commit_fsync.c b/fs/tux3/commit_fsync.c
new file mode 100644
index 0000000..9a59c59
--- /dev/null
+++ b/fs/tux3/commit_fsync.c
@@ -0,0 +1,341 @@
+/*
+ * Optimized fsync.
+ *
+ * Copyright (c) 2015 Daniel Phillips
+ */
+
+#include <linux/delay.h>
+
+static inline int fsync_pending(struct sb *sb)
+{
+	return atomic_read(&sb->fsync_pending);
+}
+
+static inline int delta_needed(struct sb *sb)
+{
+	return waitqueue_active(&sb->delta_transition_wq);
+}
+
+static inline int fsync_drain(struct sb *sb)
+{
+	return test_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+}
+
+static inline unsigned fsync_group(struct sb *sb)
+{
+	return atomic_read(&sb->fsync_group);
+}
+
+static int suspend_transition(struct sb *sb)
+{
+	while (sb->suspended == NULL) {
+		if (!test_and_set_bit(TUX3_STATE_TRANSITION_BIT, &sb->backend_state)) {
+			sb->suspended = delta_get(sb);
+			return 1;
+		}
+		cpu_relax();
+	}
+	return 0;
+}
+
+static void resume_transition(struct sb *sb)
+{
+	delta_put(sb, sb->suspended);
+	sb->suspended = NULL;
+
+	if (need_unify(sb))
+		delta_transition(sb);
+
+	/* Make sure !suspended is visible before transition clear */
+	smp_mb__before_atomic();
+	clear_bit(TUX3_STATE_TRANSITION_BIT, &sb->backend_state);
+	/* Make sure transition clear is visible before drain clear */
+	smp_mb__before_atomic();
+	clear_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+	wake_up_all(&sb->delta_transition_wq);
+}
+
+static void tux3_wait_for_free(struct sb *sb, unsigned delta)
+{
+	unsigned free_delta = delta + TUX3_MAX_DELTA;
+	/* FIXME: better to be killable */
+	wait_event(sb->delta_transition_wq,
+		   delta_after_eq(sb->delta_free, free_delta));
+}
+
+/*
+ * Write log and commit. (Mostly borrowed from do_commit)
+ *
+ * This needs specific handling for the commit block, so
+ * maybe add an fsync flag to commit_delta.
+ */
+static int commit_fsync(struct sb *sb, unsigned delta, struct blk_plug *plug)
+{
+	write_btree(sb, delta);
+	write_log(sb);
+	blk_finish_plug(plug);
+	commit_delta(sb);
+	post_commit(sb, delta);
+	return 0;
+}
+
+enum { groups_per_commit = 4 };
+
+/*
+ * Backend fsync commit task, serialized with delta backend.
+ */
+void fsync_backend(struct work_struct *work)
+{
+	struct sb *sb = container_of(work, struct fsync_work, work)->sb;
+	struct syncgroup *back = &sb->fsync[(fsync_group(sb) - 1) % fsync_wrap];
+	struct syncgroup *front = &sb->fsync[fsync_group(sb) % fsync_wrap];
+	struct syncgroup *idle = &sb->fsync[(fsync_group(sb) + 1) % fsync_wrap];
+	unsigned back_delta = sb->suspended->delta - 1;
+	unsigned start = fsync_group(sb), groups = 0;
+	struct blk_plug plug;
+	int err; /* How to report?? */
+
+	trace("enter fsync backend, delta = %i", sb->suspended->delta);
+	tux3_start_backend(sb);
+	sb->flags |= SB_FSYNC_FLUSH_FLAG;
+
+	while (1) {
+		sb->ioinfo = NULL;
+		assert(list_empty(&tux3_sb_ddc(sb, back_delta)->dirty_inodes));
+		while (atomic_read(&front->busy)) {
+			struct ioinfo ioinfo;
+			unsigned i;
+			/*
+			 * Verify that the tail of the group queue is idle in
+			 * the sense that all waiting fsyncs woke up and released
+			 * their busy counts. This busy wait is only theoretical
+			 * because fsync tasks have plenty of time to wake up
+			 * while the next group commits to media, but handle
+			 * it anyway for completeness.
+			 */
+			for (i = 0; atomic_read(&idle->busy); i++)
+				usleep_range(10, 1000);
+			if (i)
+				tux3_warn(sb, "*** %u spins on queue full ***", i);
+			reinit_completion(&idle->wait);
+
+			/*
+			 * Bump the fsync group counter so fsync backend owns the
+			 * next group of fsync inodes and can walk stable lists
+			 * while new fsyncs go onto the new frontend lists.
+			 */
+			spin_lock(&sb->fsync_lock);
+			atomic_inc(&sb->fsync_group);
+			spin_unlock(&sb->fsync_lock);
+
+			back = front;
+			front = idle;
+			idle = &sb->fsync[(fsync_group(sb) + 1) % fsync_wrap];
+
+			trace("fsync flush group %tu, queued = %i, busy = %i",
+				back - sb->fsync, atomic_read(&sb->fsync_pending),
+				atomic_read(&back->busy));
+
+			if (!sb->ioinfo) {
+				tux3_io_init(&ioinfo, REQ_SYNC);
+				sb->ioinfo = &ioinfo;
+				blk_start_plug(&plug);
+			}
+
+			/*
+			 * NOTE: this may flush same inode multiple times, and those
+			 * blocks are submitted under plugging. So, by reordering,
+			 * later requests by tux3_flush_inodes() can be flushed
+			 * before former submitted requests. We do page forking, and
+			 * don't free until commit, so reorder should not be problem.
+			 * But we should remember this surprise.
+			 */
+			err = tux3_flush_inodes_list(sb, back_delta, &back->list);
+			if (err) {
+				tux3_warn(sb, "tux3_flush_inodes_list error %i!", -err);
+				goto ouch;
+			}
+			list_splice_init(&back->list, &tux3_sb_ddc(sb, back_delta)->dirty_inodes);
+			atomic_sub(atomic_read(&back->busy), &sb->fsync_pending);
+
+			if (++groups < groups_per_commit && atomic_read(&front->busy)) {
+				trace("fsync merge group %u", fsync_group(sb));
+				continue;
+			}
+
+			commit_fsync(sb, back_delta, &plug);
+			sb->ioinfo = NULL;
+			wake_up_all(&sb->fsync_collide);
+
+			/*
+			 * Wake up commit waiters for all groups in this commit.
+			 */
+			trace("complete %i groups, %i to %i", groups, start, start + groups - 1);
+			for (i = 0; i < groups; i++) {
+				struct syncgroup *done = &sb->fsync[(start + i) % fsync_wrap];
+				complete_all(&done->wait);
+			}
+
+			if (!fsync_pending(sb) || delta_needed(sb) || need_unify(sb))
+				set_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+
+			start = fsync_group(sb);
+			groups = 0;
+		}
+
+		if (fsync_drain(sb) && !fsync_pending(sb))
+			break;
+
+		usleep_range(10, 500);
+	}
+
+ouch:
+	tux3_end_backend();
+	sb->flags &= ~SB_FSYNC_FLUSH_FLAG;
+	resume_transition(sb);
+	trace("leave fsync backend, group = %i", fsync_group(sb));
+	return; /* FIXME: error? */
+}
+
+int tux3_sync_inode(struct sb *sb, struct inode *inode)
+{
+	void tux3_set_bufdelta(struct buffer_head *buffer, int delta);
+	struct tux3_inode *tuxnode = tux_inode(inode);
+	struct inode_delta_dirty *front_dirty, *back_dirty;
+	struct buffer_head *buffer;
+	struct syncgroup *front;
+	unsigned front_delta;
+	int err = 0, start_backend = 0;
+
+	trace("fsync inode %Lu", (long long)tuxnode->inum);
+
+	/*
+	 * Prevent new fsyncs from queuing if fsync_backend wants to exit.
+	 */
+	if (fsync_drain(sb))
+		wait_event(sb->delta_transition_wq, !fsync_drain(sb));
+
+	/*
+	 * Prevent fsync_backend from exiting and delta from changing until
+	 * this fsync is queued and flushed.
+	 */
+	atomic_inc(&sb->fsync_pending);
+	start_backend = suspend_transition(sb);
+	front_delta = sb->suspended->delta;
+	front_dirty = tux3_inode_ddc(inode, front_delta);
+	back_dirty = tux3_inode_ddc(inode, front_delta - 1);
+	tux3_wait_for_free(sb, front_delta - 1);
+
+	/*
+	 * If another fsync is in progress on this inode then wait to
+	 * avoid block collisions.
+	 */
+	if (tux3_inode_test_and_set_flag(TUX3_INODE_FSYNC_BIT, inode)) {
+		trace("parallel fsync of inode %Lu", (long long)tuxnode->inum);
+		if (start_backend) {
+			queue_work(sb->fsync_workqueue, &sb->fsync_work.work);
+			start_backend = 0;
+		}
+		err = wait_event_killable(sb->fsync_collide,
+			!tux3_inode_test_and_set_flag(TUX3_INODE_FSYNC_BIT, inode));
+		if (err) {
+			tux3_inode_clear_flag(TUX3_INODE_FSYNC_BIT, inode);
+			atomic_dec(&sb->fsync_pending);
+			goto fail;
+		}
+	}
+
+	/*
+	 * We own INODE_FSYNC and the delta backend is not running so
+	 * if inode is dirty here then it will still be dirty when we
+	 * move it to the backend dirty list. Otherwise, the inode is
+	 * clean and fsync should exit here. We owned INODE_FSYNC for a
+	 * short time so there might be tasks waiting on fsync_collide.
+	 * Similarly, we might own FSYNC_RUNNING and therefore must start
+	 * the fsync backend in case some other task failed to own it and
+	 * therefore assumes it is running.
+	 */
+	if (!tux3_dirty_flags1(inode, front_delta)) {
+		trace("inode %Lu is already clean", (long long)tuxnode->inum);
+		tux3_inode_clear_flag(TUX3_INODE_FSYNC_BIT, inode);
+		atomic_dec(&sb->fsync_pending);
+		if (start_backend)
+			queue_work(sb->fsync_workqueue, &sb->fsync_work.work);
+		wake_up_all(&sb->fsync_collide);
+		return 0;
+	}
+
+	/*
+	 * Exclude new dirties.
+	 * Lock order: i_mutex => truncate_lock
+	 */
+	mutex_lock(&inode->i_mutex); /* Exclude most dirty sources */
+	down_write(&tux_inode(inode)->truncate_lock); /* Exclude mmap */
+
+	/*
+	 * Force block dirty state to previous delta for each dirty
+	 * block so block fork protects block data against modification by
+	 * parallel tasks while this task waits for commit.
+	 *
+	 * This walk should not discover any dirty blocks belonging
+	 * to the previous delta due to the above wait for delta
+	 * commit.
+	 */
+	list_for_each_entry(buffer, &front_dirty->dirty_buffers, b_assoc_buffers) {
+		//assert(tux3_bufsta_get_delta(buffer->b_state) != delta - 1);
+		tux3_set_bufdelta(buffer, front_delta - 1);
+	}
+
+	/*
+	 * Move the front end dirty block list to the backend, which
+	 * is now empty because the previous delta was completed. Remove
+	 * the inode from the frontend dirty list and add it to the front
+	 * fsync list. Note: this is not a list move because different
+	 * link fields are involved. Later, the inode will be moved to
+	 * the backend inode dirty list to be flushed but we cannot put
+	 * it there right now because it might clobber the previous fsync
+	 * group. Update the inode dirty flags to indicate the inode is
+	 * dirty in the back, not the front. The list moves must be
+	 * under the spin lock to prevent the back end from bumping
+	 * the group counter and proceeding with the commit.
+	 */
+	trace("fsync queue inode %Lu to group %u",
+		(long long)tuxnode->inum, fsync_group(sb));
+	spin_lock(&tuxnode->lock);
+	spin_lock(&sb->dirty_inodes_lock);
+	//assert(<inode is not dirty in back>);
+	assert(list_empty(&back_dirty->dirty_buffers));
+	assert(list_empty(&back_dirty->dirty_holes));
+	assert(!list_empty(&front_dirty->dirty_list));
+	list_splice_init(&front_dirty->dirty_buffers, &back_dirty->dirty_buffers);
+	list_splice_init(&front_dirty->dirty_holes, &back_dirty->dirty_holes);
+	list_del_init(&front_dirty->dirty_list);
+	spin_unlock(&sb->dirty_inodes_lock);
+
+	tux3_dirty_switch_to_prev(inode, front_delta);
+	spin_unlock(&tuxnode->lock);
+
+	spin_lock(&sb->fsync_lock);
+	front = &sb->fsync[fsync_group(sb) % fsync_wrap];
+	list_add_tail(&back_dirty->dirty_list, &front->list);
+	atomic_inc(&front->busy); /* detect queue full */
+	assert(sb->current_delta->delta == front_delta); /* last chance to check */
+	spin_unlock(&sb->fsync_lock);
+
+	/*
+	 * Allow more dirties during the wait. These will be isolated from
+	 * the commit by block forking.
+	 */
+	up_write(&tux_inode(inode)->truncate_lock);
+	mutex_unlock(&inode->i_mutex);
+
+	if (start_backend)
+		queue_work(sb->fsync_workqueue, &sb->fsync_work.work);
+
+	wait_for_completion(&front->wait);
+	atomic_dec(&front->busy);
+fail:
+	if (err)
+		tux3_warn(sb, "error %i!!!", err);
+	return err;
+}
diff --git a/fs/tux3/iattr.c b/fs/tux3/iattr.c
index 57a383b..7ac73f5 100644
--- a/fs/tux3/iattr.c
+++ b/fs/tux3/iattr.c
@@ -276,6 +276,8 @@ static int iattr_decode(struct btree *btree, void *data, void *attrs, int size)
 	}
 
 	decode_attrs(inode, attrs, size); // error???
+	tux_inode(inode)->nlink_base = inode->i_nlink;
+
 	if (tux3_trace)
 		dump_attrs(inode);
 	if (tux_inode(inode)->xcache)
diff --git a/fs/tux3/inode.c b/fs/tux3/inode.c
index f747c0e..a10ce38 100644
--- a/fs/tux3/inode.c
+++ b/fs/tux3/inode.c
@@ -922,22 +922,18 @@ void iget_if_dirty(struct inode *inode)
 	atomic_inc(&inode->i_count);
 }
 
+enum { fsync_fallback = 0 };
+
 /* Synchronize changes to a file and directory. */
 int tux3_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
 	struct sb *sb = tux_sb(inode->i_sb);
 
-	/* FIXME: this is sync(2). We should implement real one */
-	static int print_once;
-	if (!print_once) {
-		print_once++;
-		tux3_warn(sb,
-			  "fsync(2) fall-back to sync(2): %Lx-%Lx, datasync %d",
-			  start, end, datasync);
-	}
+	if (fsync_fallback || S_ISDIR(inode->i_mode))
+		return sync_current_delta(sb);
 
-	return sync_current_delta(sb);
+	return tux3_sync_inode(sb, inode);
 }
 
 int tux3_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
diff --git a/fs/tux3/log.c b/fs/tux3/log.c
index bb26c73..a934659 100644
--- a/fs/tux3/log.c
+++ b/fs/tux3/log.c
@@ -83,6 +83,7 @@ unsigned log_size[] = {
 	[LOG_BNODE_FREE]	= 7,
 	[LOG_ORPHAN_ADD]	= 9,
 	[LOG_ORPHAN_DEL]	= 9,
+	[LOG_FSYNC_ORPHAN]	= 9,
 	[LOG_FREEBLOCKS]	= 7,
 	[LOG_UNIFY]		= 1,
 	[LOG_DELTA]		= 1,
@@ -470,6 +471,11 @@ void log_bnode_free(struct sb *sb, block_t bnode)
 	log_u48(sb, LOG_BNODE_FREE, bnode);
 }
 
+void log_fsync_orphan(struct sb *sb, unsigned version, tuxkey_t inum)
+{
+	log_u16_u48(sb, LOG_FSYNC_ORPHAN, version, inum);
+}
+
 /*
  * Handle inum as orphan inode
  * (this is log of frontend operation)
diff --git a/fs/tux3/orphan.c b/fs/tux3/orphan.c
index 68d08e8..3ea2d6a 100644
--- a/fs/tux3/orphan.c
+++ b/fs/tux3/orphan.c
@@ -336,7 +336,30 @@ static int load_orphan_inode(struct sb *sb, inum_t inum, struct list_head *head)
 	tux3_mark_inode_orphan(tux_inode(inode));
 	/* List inode up, then caller will decide what to do */
 	list_add(&tux_inode(inode)->orphan_list, head);
+	return 0;
+}
 
+int replay_fsync_orphan(struct replay *rp, unsigned version, inum_t inum)
+{
+	struct sb *sb = rp->sb;
+	struct inode *inode = tux3_iget(sb, inum);
+	if (IS_ERR(inode)) {
+		int err = PTR_ERR(inode);
+		return err == -ENOENT ? 0 : err;
+	}
+
+	/*
+	 * Multiple fsyncs of new inode can create multiple fsync orphan
+	 * log records for the same inode. A later delta may have added a
+	 * link.
+	 */
+	if (inode->i_nlink != 0 || tux3_inode_is_orphan(tux_inode(inode))) {
+		iput(inode);
+		return 0;
+	}
+
+	tux3_mark_inode_orphan(tux_inode(inode));
+	list_add(&tux_inode(inode)->orphan_list, &rp->orphan_in_otree);
 	return 0;
 }
 
diff --git a/fs/tux3/replay.c b/fs/tux3/replay.c
index f1f77e8..99361d6 100644
--- a/fs/tux3/replay.c
+++ b/fs/tux3/replay.c
@@ -29,6 +29,7 @@ static const char *const log_name[] = {
 	X(LOG_BNODE_FREE),
 	X(LOG_ORPHAN_ADD),
 	X(LOG_ORPHAN_DEL),
+	X(LOG_FSYNC_ORPHAN),
 	X(LOG_FREEBLOCKS),
 	X(LOG_UNIFY),
 	X(LOG_DELTA),
@@ -117,20 +118,20 @@ static void replay_unpin_logblocks(struct sb *sb, unsigned i, unsigned logcount)
 static struct replay *replay_prepare(struct sb *sb)
 {
 	block_t logchain = be64_to_cpu(sb->super.logchain);
-	unsigned i, logcount = be32_to_cpu(sb->super.logcount);
+	unsigned i, count = logcount(sb);
 	struct replay *rp;
 	struct buffer_head *buffer;
 	int err;
 
 	/* FIXME: this address array is quick hack. Rethink about log
 	 * block management and log block address. */
-	rp = alloc_replay(sb, logcount);
+	rp = alloc_replay(sb, count);
 	if (IS_ERR(rp))
 		return rp;
 
 	/* FIXME: maybe, we should use bufvec to read log blocks */
-	trace("load %u logblocks", logcount);
-	i = logcount;
+	trace("load %u logblocks", count);
+	i = count;
 	while (i-- > 0) {
 		struct logblock *log;
 
@@ -156,7 +157,7 @@ static struct replay *replay_prepare(struct sb *sb)
 
 error:
 	free_replay(rp);
-	replay_unpin_logblocks(sb, i, logcount);
+	replay_unpin_logblocks(sb, i, count);
 
 	return ERR_PTR(err);
 }
@@ -169,7 +170,7 @@ static void replay_done(struct replay *rp)
 	clean_orphan_list(&rp->log_orphan_add);	/* for error path */
 	free_replay(rp);
 
-	sb->logpos.next = be32_to_cpu(sb->super.logcount);
+	sb->logpos.next = logcount(sb);
 	replay_unpin_logblocks(sb, 0, sb->logpos.next);
 	log_finish_cycle(sb, 0);
 }
@@ -319,6 +320,7 @@ static int replay_log_stage1(struct replay *rp, struct buffer_head *logbuf)
 		case LOG_BFREE_RELOG:
 		case LOG_LEAF_REDIRECT:
 		case LOG_LEAF_FREE:
+		case LOG_FSYNC_ORPHAN:
 		case LOG_ORPHAN_ADD:
 		case LOG_ORPHAN_DEL:
 		case LOG_UNIFY:
@@ -450,6 +452,7 @@ static int replay_log_stage2(struct replay *rp, struct buffer_head *logbuf)
 				return err;
 			break;
 		}
+		case LOG_FSYNC_ORPHAN:
 		case LOG_ORPHAN_ADD:
 		case LOG_ORPHAN_DEL:
 		{
@@ -459,6 +462,9 @@ static int replay_log_stage2(struct replay *rp, struct buffer_head *logbuf)
 			data = decode48(data, &inum);
 			trace("%s: version 0x%x, inum 0x%Lx",
 			      log_name[code], version, inum);
+			if (code == LOG_FSYNC_ORPHAN)
+				err = replay_fsync_orphan(rp, version, inum);
+			else
 			if (code == LOG_ORPHAN_ADD)
 				err = replay_orphan_add(rp, version, inum);
 			else
@@ -514,11 +520,11 @@ static int replay_logblocks(struct replay *rp, replay_log_t replay_log_func)
 {
 	struct sb *sb = rp->sb;
 	struct logpos *logpos = &sb->logpos;
-	unsigned logcount = be32_to_cpu(sb->super.logcount);
+	unsigned count = logcount(sb);
 	int err;
 
 	logpos->next = 0;
-	while (logpos->next < logcount) {
+	while (logpos->next < count) {
 		trace("log block %i, blocknr %Lx, unify %Lx",
 		      logpos->next, rp->blocknrs[logpos->next],
 		      rp->unify_index);
diff --git a/fs/tux3/super.c b/fs/tux3/super.c
index b104dc7..0913d26 100644
--- a/fs/tux3/super.c
+++ b/fs/tux3/super.c
@@ -63,6 +63,7 @@ static void tux3_inode_init_always(struct tux3_inode *tuxnode)
 	tuxnode->xcache		= NULL;
 	tuxnode->generic	= 0;
 	tuxnode->state		= 0;
+	tuxnode->nlink_base	= 0;
 #ifdef __KERNEL__
 	tuxnode->io		= NULL;
 #endif
@@ -246,6 +247,9 @@ static void __tux3_put_super(struct sb *sbi)
 	sbi->idefer_map = NULL;
 	/* FIXME: add more sanity check */
 	assert(link_empty(&sbi->forked_buffers));
+
+	if (sbi->fsync_workqueue)
+		destroy_workqueue(sbi->fsync_workqueue);
 }
 
 static struct inode *create_internal_inode(struct sb *sbi, inum_t inum,
@@ -384,6 +388,21 @@ static int init_sb(struct sb *sb)
 	for (i = 0; i < ARRAY_SIZE(sb->s_ddc); i++)
 		INIT_LIST_HEAD(&sb->s_ddc[i].dirty_inodes);
 
+	for (i = 0; i < fsync_wrap; i++) {
+		INIT_LIST_HEAD(&sb->fsync[i].list);
+		init_completion(&sb->fsync[i].wait);
+		atomic_set(&sb->fsync[i].busy, 0);
+	}
+
+	if (!(sb->fsync_workqueue = create_workqueue("tux3-work")))
+		return -ENOMEM;
+
+	atomic_set(&sb->fsync_group, 0);
+	atomic_set(&sb->fsync_pending, 0);
+	spin_lock_init(&sb->fsync_lock);
+	init_waitqueue_head(&sb->fsync_collide);
+	INIT_WORK(&sb->fsync_work.work, fsync_backend);
+	sb->fsync_work.sb = sb;
 	sb->idefer_map = tux3_alloc_idefer_map();
 	if (!sb->idefer_map)
 		return -ENOMEM;
@@ -773,7 +792,7 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
 			goto error;
 		}
 	}
-	tux3_dbg("s_blocksize %lu", sb->s_blocksize);
+	tux3_dbg("s_blocksize %lu, sb = %p", sb->s_blocksize, tux_sb(sb));
 
 	rp = tux3_init_fs(sbi);
 	if (IS_ERR(rp)) {
@@ -781,6 +800,7 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
 		goto error;
 	}
 
+	sb->s_flags |= MS_ACTIVE;
 	err = replay_stage3(rp, 1);
 	if (err) {
 		rp = NULL;
diff --git a/fs/tux3/tux3.h b/fs/tux3/tux3.h
index e2f2d9b..cf4bcc6 100644
--- a/fs/tux3/tux3.h
+++ b/fs/tux3/tux3.h
@@ -252,6 +252,7 @@ enum {
 	LOG_BNODE_FREE,		/* Log of freeing bnode */
 	LOG_ORPHAN_ADD,		/* Log of adding orphan inode */
 	LOG_ORPHAN_DEL,		/* Log of deleting orphan inode */
+	LOG_FSYNC_ORPHAN,	/* Log inode fsync with no links  */
 	LOG_FREEBLOCKS,		/* Log of freeblocks in bitmap on unify */
 	LOG_UNIFY,		/* Log of marking unify */
 	LOG_DELTA,		/* just for debugging */
@@ -310,6 +311,29 @@ struct tux3_mount_opt {
 	unsigned int flags;
 };
 
+/* Per fsync group dirty inodes and synchronization */
+struct syncgroup {
+	struct list_head list; /* dirty inodes */
+	struct completion wait; /* commit wait */
+	atomic_t busy; /* fsyncs not completed */
+};
+
+struct fsync_work {
+	struct work_struct work;
+	struct sb *sb;
+};
+
+enum { fsync_wrap = 1 << 4 }; /* Maximum fsync groups in flight */
+
+enum sb_state_bits {
+	TUX3_STATE_TRANSITION_BIT,
+	TUX3_FSYNC_DRAIN_BIT, /* force fsync queue to drain */
+};
+
+enum sb_flag_bits {
+	SB_FSYNC_FLUSH_FLAG = 1 << 0, /* fsync specific actions on flush path */
+};
+
 struct tux3_idefer_map;
 /* Tux3-specific sb is a handle for the entire volume state */
 struct sb {
@@ -321,10 +345,8 @@ struct sb {
 	struct delta_ref __rcu *current_delta;	/* current delta */
 	struct delta_ref delta_refs[TUX3_MAX_DELTA];
 	unsigned unify;				/* log unify cycle */
-
-#define TUX3_STATE_TRANSITION_BIT	0
 	unsigned long backend_state;		/* delta state */
-
+	unsigned long flags;			/* non atomic state */
 #ifdef TUX3_FLUSHER_SYNC
 	struct rw_semaphore delta_lock;		/* delta transition exclusive */
 #else
@@ -403,7 +425,28 @@ struct sb {
 #else
 	struct super_block vfs_sb;	/* Userland superblock */
 #endif
-};
+	/*
+	 * Fsync and fsync backend
+	 */
+	spinlock_t fsync_lock;
+	wait_queue_head_t fsync_collide; /* parallel fsync on same inode */
+	atomic_t fsync_group; /* current fsync group */
+	atomic_t fsync_pending; /* fsyncs started but not yet queued */
+	struct syncgroup fsync[fsync_wrap]; /* fsync commit groups */
+	struct workqueue_struct *fsync_workqueue;
+	struct fsync_work fsync_work;
+	struct delta_ref *suspended;
+};
+
+static inline int fsync_mode(struct sb *sb)
+{
+	return sb->flags & SB_FSYNC_FLUSH_FLAG;
+}
+
+static inline unsigned logcount(struct sb *sb)
+{
+	return be32_to_cpu(sb->super.logcount);
+}
 
 /* Block segment (physical block extent) info */
 #define BLOCK_SEG_HOLE		(1 << 0)
@@ -475,6 +518,7 @@ struct tux3_inode {
 	};
 
 	/* Per-delta dirty data for inode */
+	unsigned nlink_base;		/* link count on media for fsync */
 	unsigned state;			/* inode dirty state */
 	unsigned present;		/* Attributes decoded from or
 					 * to be encoded to itree */
@@ -553,6 +597,8 @@ static inline struct list_head *tux3_dirty_buffers(struct inode *inode,
 enum {
 	/* Deferred inum allocation, and not stored into itree yet. */
 	TUX3_I_DEFER_INUM	= 0,
+	/* Fsync in progress (protected by i_mutex) */
+	TUX3_INODE_FSYNC_BIT	= 1,
 
 	/* No per-delta buffers, and no page forking */
 	TUX3_I_NO_DELTA		= 29,
@@ -579,6 +625,11 @@ static inline void tux3_inode_clear_flag(int bit, struct inode *inode)
 	clear_bit(bit, &tux_inode(inode)->flags);
 }
 
+static inline int tux3_inode_test_and_set_flag(int bit, struct inode *inode)
+{
+	return test_and_set_bit(bit, &tux_inode(inode)->flags);
+}
+
 static inline int tux3_inode_test_flag(int bit, struct inode *inode)
 {
 	return test_bit(bit, &tux_inode(inode)->flags);
@@ -723,6 +774,8 @@ static inline block_t bufindex(struct buffer_head *buffer)
 /* commit.c */
 long tux3_writeback(struct super_block *super, struct bdi_writeback *wb,
 		    struct wb_writeback_work *work);
+int tux3_sync_inode(struct sb *sb, struct inode *inode);
+void fsync_backend(struct work_struct *work);
 
 /* dir.c */
 extern const struct file_operations tux_dir_fops;
@@ -967,6 +1020,7 @@ void log_bnode_merge(struct sb *sb, block_t src, block_t dst);
 void log_bnode_del(struct sb *sb, block_t node, tuxkey_t key, unsigned count);
 void log_bnode_adjust(struct sb *sb, block_t node, tuxkey_t from, tuxkey_t to);
 void log_bnode_free(struct sb *sb, block_t bnode);
+void log_fsync_orphan(struct sb *sb, unsigned version, tuxkey_t inum);
 void log_orphan_add(struct sb *sb, unsigned version, tuxkey_t inum);
 void log_orphan_del(struct sb *sb, unsigned version, tuxkey_t inum);
 void log_freeblocks(struct sb *sb, block_t freeblocks);
@@ -995,6 +1049,7 @@ void replay_iput_orphan_inodes(struct sb *sb,
 			       struct list_head *orphan_in_otree,
 			       int destroy);
 int replay_load_orphan_inodes(struct replay *rp);
+int replay_fsync_orphan(struct replay *rp, unsigned version, inum_t inum);
 
 /* super.c */
 struct replay *tux3_init_fs(struct sb *sbi);
@@ -1045,6 +1100,8 @@ static inline void tux3_mark_inode_dirty_sync(struct inode *inode)
 	__tux3_mark_inode_dirty(inode, I_DIRTY_SYNC);
 }
 
+unsigned tux3_dirty_flags1(struct inode *inode, unsigned delta);
+void tux3_dirty_switch_to_prev(struct inode *inode, unsigned delta);
 void tux3_dirty_inode(struct inode *inode, int flags);
 void tux3_mark_inode_to_delete(struct inode *inode);
 void tux3_iattrdirty(struct inode *inode);
@@ -1058,6 +1115,7 @@ void tux3_mark_inode_orphan(struct tux3_inode *tuxnode);
 int tux3_inode_is_orphan(struct tux3_inode *tuxnode);
 int tux3_flush_inode_internal(struct inode *inode, unsigned delta, int req_flag);
 int tux3_flush_inode(struct inode *inode, unsigned delta, int req_flag);
+int tux3_flush_inodes_list(struct sb *sb, unsigned delta, struct list_head *dirty_inodes);
 int tux3_flush_inodes(struct sb *sb, unsigned delta);
 int tux3_has_dirty_inodes(struct sb *sb, unsigned delta);
 void tux3_clear_dirty_inodes(struct sb *sb, unsigned delta);
diff --git a/fs/tux3/user/libklib/libklib.h b/fs/tux3/user/libklib/libklib.h
index 31daad5..ae9bba6 100644
--- a/fs/tux3/user/libklib/libklib.h
+++ b/fs/tux3/user/libklib/libklib.h
@@ -117,4 +117,7 @@ extern int __build_bug_on_failed;
 #define S_IWUGO		(S_IWUSR|S_IWGRP|S_IWOTH)
 #define S_IXUGO		(S_IXUSR|S_IXGRP|S_IXOTH)
 
+struct work_struct { };
+struct workqueue_struct { };
+
 #endif /* !LIBKLIB_H */
diff --git a/fs/tux3/user/super.c b/fs/tux3/user/super.c
index e34a1b4..0743551 100644
--- a/fs/tux3/user/super.c
+++ b/fs/tux3/user/super.c
@@ -15,6 +15,15 @@
 #define trace trace_off
 #endif
 
+static struct workqueue_struct *create_workqueue(char *name) {
+	static struct workqueue_struct fakework = { };
+	return &fakework;
+}
+
+static void destroy_workqueue(struct workqueue_struct *wq) { }
+
+#define INIT_WORK(work, fn)
+
 #include "../super.c"
 
 struct inode *__alloc_inode(struct super_block *sb)
diff --git a/fs/tux3/writeback.c b/fs/tux3/writeback.c
index fc20635..5c6bcf0 100644
--- a/fs/tux3/writeback.c
+++ b/fs/tux3/writeback.c
@@ -102,6 +102,22 @@ static inline unsigned tux3_dirty_flags(struct inode *inode, unsigned delta)
 	return ret;
 }
 
+unsigned tux3_dirty_flags1(struct inode *inode, unsigned delta)
+{
+	return (tux_inode(inode)->state >> tux3_dirty_shift(delta)) & I_DIRTY;
+}
+
+static inline unsigned tux3_iattrsta_update(unsigned state, unsigned delta);
+void tux3_dirty_switch_to_prev(struct inode *inode, unsigned delta)
+{
+	struct tux3_inode *tuxnode = tux_inode(inode);
+	unsigned state = tuxnode->state;
+
+	state |= tux3_dirty_mask(tux3_dirty_flags(inode, delta) & I_DIRTY, delta - 1);
+	state &= ~tux3_dirty_mask(I_DIRTY, delta);
+	tuxnode->state = tux3_iattrsta_update(state, delta - 1);
+}
+
 /* This is hook of __mark_inode_dirty() and called I_DIRTY_PAGES too */
 void tux3_dirty_inode(struct inode *inode, int flags)
 {
@@ -226,6 +242,8 @@ static void tux3_clear_dirty_inode_nolock(struct inode *inode, unsigned delta,
 	/* Update state if inode isn't dirty anymore */
 	if (!(tuxnode->state & ~NON_DIRTY_FLAGS))
 		inode->i_state &= ~I_DIRTY;
+
+	tux3_inode_clear_flag(TUX3_INODE_FSYNC_BIT, inode); /* ugly */
 }
 
 /* Clear dirty flags for delta */
@@ -502,12 +520,31 @@ int tux3_flush_inode(struct inode *inode, unsigned delta, int req_flag)
 		dirty = tux3_dirty_flags(inode, delta);
 
 	if (dirty & (TUX3_DIRTY_BTREE | I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+		struct tux3_inode *tuxnode = tux_inode(inode);
+		struct sb *sb = tux_sb(inode->i_sb);
 		/*
 		 * If there is btree root, adjust present after
 		 * tux3_flush_buffers().
 		 */
 		tux3_iattr_adjust_for_btree(inode, &idata);
 
+		if (fsync_mode(sb)) {
+			if (idata.i_nlink != tuxnode->nlink_base) {
+				/*
+				 * FIXME: we redirty inode attributes here so next delta
+				 * will flush correct nlinks. This means that an fsync
+				 * of the same inode before the next delta will flush
+				 * it again even if it has not been changed.
+				 */
+				tux3_iattrdirty_delta(inode, sb->suspended->delta);
+				tux3_mark_inode_dirty_sync(inode);
+				idata.i_nlink = tuxnode->nlink_base;
+			}
+			if (!idata.i_nlink)
+				log_fsync_orphan(sb, sb->version, tuxnode->inum);
+		} else
+			tuxnode->nlink_base = idata.i_nlink;
+
 		err = tux3_save_inode(inode, &idata, delta);
 		if (err && !ret)
 			ret = err;
@@ -569,10 +606,8 @@ static int inode_inum_cmp(void *priv, struct list_head *a, struct list_head *b)
 	return 0;
 }
 
-int tux3_flush_inodes(struct sb *sb, unsigned delta)
+int tux3_flush_inodes_list(struct sb *sb, unsigned delta, struct list_head *dirty_inodes)
 {
-	struct sb_delta_dirty *s_ddc = tux3_sb_ddc(sb, delta);
-	struct list_head *dirty_inodes = &s_ddc->dirty_inodes;
 	struct inode_delta_dirty *i_ddc, *safe;
 	inum_t private;
 	int err;
@@ -612,6 +647,12 @@ error:
 	return err;
 }
 
+int tux3_flush_inodes(struct sb *sb, unsigned delta)
+{
+	struct sb_delta_dirty *s_ddc = tux3_sb_ddc(sb, delta);
+	return tux3_flush_inodes_list(sb, delta, &s_ddc->dirty_inodes);
+}
+
 int tux3_has_dirty_inodes(struct sb *sb, unsigned delta)
 {
 	struct sb_delta_dirty *s_ddc = tux3_sb_ddc(sb, delta);
@@ -663,3 +704,4 @@ unsigned tux3_check_tuxinode_state(struct inode *inode)
 {
 	return tux_inode(inode)->state & ~NON_DIRTY_FLAGS;
 }
+
diff --git a/fs/tux3/writeback_iattrfork.c b/fs/tux3/writeback_iattrfork.c
index 658c012..c50a8c2 100644
--- a/fs/tux3/writeback_iattrfork.c
+++ b/fs/tux3/writeback_iattrfork.c
@@ -54,10 +54,9 @@ static void idata_copy(struct inode *inode, struct tux3_iattr_data *idata)
  *
  * FIXME: this is better to call tux3_mark_inode_dirty() too?
  */
-void tux3_iattrdirty(struct inode *inode)
+void tux3_iattrdirty_delta(struct inode *inode, unsigned delta)
 {
 	struct tux3_inode *tuxnode = tux_inode(inode);
-	unsigned delta = tux3_inode_delta(inode);
 	unsigned state = tuxnode->state;
 
 	/* If dirtied on this delta, nothing to do */
@@ -107,6 +106,11 @@ void tux3_iattrdirty(struct inode *inode)
 	spin_unlock(&tuxnode->lock);
 }
 
+void tux3_iattrdirty(struct inode *inode)
+{
+	tux3_iattrdirty_delta(inode, tux3_inode_delta(inode));
+}
+
 /* Caller must hold tuxnode->lock */
 static void tux3_iattr_clear_dirty(struct tux3_inode *tuxnode)
 {
^ permalink raw reply related	[flat|nested] 160+ messages in thread
- * [FYI] tux3: Core changes
  2015-04-28 23:13 Tux3 Report: How fast can we fsync? Daniel Phillips
                   ` (4 preceding siblings ...)
  2015-05-14  7:37 ` [WIP] tux3: Optimized fsync Daniel Phillips
@ 2015-05-14  8:26 ` Daniel Phillips
  2015-05-14 12:59   ` Rik van Riel
  2015-05-19 14:00   ` [FYI] tux3: Core changes Jan Kara
  5 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-14  8:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, tux3, OGAWA Hirofumi, Rik van Riel
Hi Rik,
Our linux-tux3 tree currently carries this 652 line diff
against core, to make Tux3 work. This is mainly by Hirofumi, except
the fs-writeback.c hook, which is by me. The main part you may be
interested in is rmap.c, which addresses the issues raised at the
2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
   LSFMM: Page forking
   http://lwn.net/Articles/548091/
This is just a FYI. An upcoming Tux3 report will be a tour of the page
forking design and implementation. For now, this is just to give a
general sense of what we have done. We heard there are concerns about
how ptrace will work. I am not really familiar with the issue; could
you please explain what you were thinking of there?
Enjoy,
Daniel
[1] Which happened to be a 15 minute bus ride away from me at the time.
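For readers who want the shape of the fs-writeback.c hook before reading the
diff, here is a minimal userspace sketch of the dispatch pattern: call an
optional per-filesystem writeback callback with the list lock dropped, and
fall back to the generic path otherwise. Everything in it (struct super,
struct work, list_locked, example_fs_writeback, dispatch_writeback) is a
hypothetical stand-in chosen for illustration, not the kernel's types or the
code in the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct work { long nr_pages; };
struct super;

struct super_ops {
	/* Optional hook: NULL means "use the generic path". */
	long (*writeback)(struct super *sb, struct work *work);
};

struct super {
	const struct super_ops *ops;
	int list_locked;	/* models the writeback list lock being held */
};

/* Generic path: expects to run with the list lock held. */
static long generic_writeback(struct super *sb, struct work *work)
{
	return sb->list_locked ? work->nr_pages : -1;
}

/* Dispatch: prefer the filesystem hook, called with the lock dropped. */
static long dispatch_writeback(struct super *sb, struct work *work)
{
	if (sb->ops && sb->ops->writeback) {
		long ret;

		sb->list_locked = 0;	/* drop the lock before calling out */
		ret = sb->ops->writeback(sb, work);
		sb->list_locked = 1;	/* retake it for the caller */
		return ret;
	}

	return generic_writeback(sb, work);
}

/*
 * Example hook: returns twice the requested page count so the two paths
 * are distinguishable, or -1 if it was called with the lock still held.
 */
static long example_fs_writeback(struct super *sb, struct work *work)
{
	return sb->list_locked ? -1 : 2 * work->nr_pages;
}
```

A superblock whose ops set .writeback = example_fs_writeback goes through the
hook; one that leaves it NULL takes the generic path. In both cases the caller
sees the lock held again on return.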
diffstat tux3.core.patch
 fs/Makefile               |    1 
 fs/fs-writeback.c         |  100 +++++++++++++++++++++++++--------
 include/linux/fs.h        |    6 +
 include/linux/mm.h        |    5 +
 include/linux/pagemap.h   |    2 
 include/linux/rmap.h      |   14 ++++
 include/linux/writeback.h |   23 +++++++
 mm/filemap.c              |   82 +++++++++++++++++++++++++++
 mm/rmap.c                 |  139 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c             |   98 ++++++++++++++++++++------------
 10 files changed, 411 insertions(+), 59 deletions(-)
diff --git a/fs/Makefile b/fs/Makefile
index 91fcfa3..44d7192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS)		+= ext4/
 obj-$(CONFIG_JBD)		+= jbd/
 obj-$(CONFIG_JBD2)		+= jbd2/
 obj-$(CONFIG_TUX3)		+= tux3/
-obj-$(CONFIG_TUX3_MMAP) 	+= tux3/
 obj-$(CONFIG_CRAMFS)		+= cramfs/
 obj-$(CONFIG_SQUASHFS)		+= squashfs/
 obj-y				+= ramfs/
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..fcd1c61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-				struct bdi_writeback *wb,
-				struct wb_writeback_work *work)
+static long generic_writeback_sb_inodes(struct super_block *sb,
+					struct bdi_writeback *wb,
+					struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
@@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
 	return wrote;
 }
 
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
+{
+	if (sb->s_op->writeback) {
+		long ret;
+
+		spin_unlock(&wb->list_lock);
+		ret = sb->s_op->writeback(sb, wb, work);
+		spin_lock(&wb->list_lock);
+		return ret;
+	}
+
+	return generic_writeback_sb_inodes(sb, wb, work);
+}
+
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
 				  struct wb_writeback_work *work)
 {
@@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb)
 }
 
 /**
+ * writeback_queue_work_sb -	schedule writeback work from given super_block
+ * @sb: the superblock
+ * @work: work item to queue
+ *
+ * Schedule writeback work on this super_block. This is usually used to
+ * interact with the sb->s_op->writeback callback. The caller must
+ * guarantee that @work is not freed while the bdi flusher is using it
+ * (for example, be safe against umount).
+ */
+void writeback_queue_work_sb(struct super_block *sb,
+			     struct wb_writeback_work *work)
+{
+	if (sb->s_bdi == &noop_backing_dev_info)
+		return;
+
+	/* Only the following fields may be used. */
+	*work = (struct wb_writeback_work){
+		.sb			= sb,
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.done			= work->done,
+		.nr_pages		= work->nr_pages,
+		.reason			= work->reason,
+	};
+	bdi_queue_work(sb->s_bdi, work);
+}
+EXPORT_SYMBOL(writeback_queue_work_sb);
+
+/**
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
  * @nr: the number of pages to write
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42efe13..29833d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -356,6 +356,8 @@ struct address_space_operations {
 
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
+	void (*truncatepage)(struct address_space *, struct page *,
+			     unsigned int, unsigned int, int);
 	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
 	int (*releasepage) (struct page *, gfp_t);
 	void (*freepage)(struct page *);
@@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 
+struct bdi_writeback;
+struct wb_writeback_work;
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
@@ -1599,6 +1603,8 @@ struct super_operations {
 	int (*drop_inode) (struct inode *);
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
+	long (*writeback)(struct super_block *super, struct bdi_writeback *wb,
+			  struct wb_writeback_work *work);
 	int (*sync_fs)(struct super_block *sb, int wait);
 	int (*freeze_super) (struct super_block *);
 	int (*freeze_fs) (struct super_block *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd5ea30..075f59f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1909,6 +1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
 }
 
 /* truncate.c */
+void generic_truncate_partial_page(struct address_space *mapping,
+				   struct page *page, unsigned int start,
+				   unsigned int len);
+void generic_truncate_full_page(struct address_space *mapping,
+				struct page *page, int wait);
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3736f..13b70160 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 extern void delete_from_page_cache(struct page *page);
 extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
+int cow_replace_page_cache(struct page *oldpage, struct page *newpage);
+void cow_delete_from_page_cache(struct page *page);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d9d7e7e..9b67360 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 int page_mkclean(struct page *);
 
 /*
+ * Make clone page for page forking.
+ *
+ * Note: only clones page state so other state such as buffer_heads
+ * must be cloned by caller.
+ */
+struct page *cow_clone_page(struct page *oldpage);
+
+/*
+ * Changes the PTEs of shared mappings except the PTE in orig_vma.
+ */
+int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
+		  struct page *newpage);
+
+/*
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0004833..0784b9d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -59,6 +59,25 @@ enum wb_reason {
 };
 
 /*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+	long nr_pages;
+	struct super_block *sb;
+	unsigned long *older_than_this;
+	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
+	unsigned int for_kupdate:1;
+	unsigned int range_cyclic:1;
+	unsigned int for_background:1;
+	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	enum wb_reason reason;		/* why was writeback initiated? */
+
+	struct list_head list;		/* pending work list */
+	struct completion *done;	/* set if the caller waits */
+};
+
+/*
  * A control structure which tells the writeback code what to do.  These are
  * always on the stack, and hence need no locking.  They are always initialised
  * in a manner such that unspecified fields are set to zero.
@@ -90,6 +109,10 @@ struct writeback_control {
  * fs/fs-writeback.c
  */	
 struct bdi_writeback;
+void inode_writeback_done(struct inode *inode);
+void inode_writeback_touch(struct inode *inode);
+void writeback_queue_work_sb(struct super_block *sb,
+			     struct wb_writeback_work *work);
 void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
 							enum wb_reason reason);
diff --git a/mm/filemap.c b/mm/filemap.c
index 673e458..8c641d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
+/*
+ * Atomically replace oldpage with newpage.
+ *
+ * Similar to migrate_pages(), but the oldpage is for writeout.
+ */
+int cow_replace_page_cache(struct page *oldpage, struct page *newpage)
+{
+	struct address_space *mapping = oldpage->mapping;
+	void **pslot;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+
+	/* Get refcount for radix-tree */
+	page_cache_get(newpage);
+
+	/* Replace page in radix tree. */
+	spin_lock_irq(&mapping->tree_lock);
+	/* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */
+	if (PageDirty(oldpage))
+		radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage),
+				     PAGECACHE_TAG_DIRTY);
+	/* The refcount to newpage is used for radix tree. */
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
+	radix_tree_replace_slot(pslot, newpage);
+	__inc_zone_page_state(newpage, NR_FILE_PAGES);
+	__dec_zone_page_state(oldpage, NR_FILE_PAGES);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* mem_cgroup codes must not be called under tree_lock */
+	mem_cgroup_migrate(oldpage, newpage, true);
+
+	/* Release refcount for radix-tree */
+	page_cache_release(oldpage);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cow_replace_page_cache);
+
+/*
+ * Delete page from radix-tree, leaving page->mapping unchanged.
+ *
+ * Similar to delete_from_page_cache(), but the deleted page is for writeout.
+ */
+void cow_delete_from_page_cache(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	/* Delete page from radix tree. */
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * if we're uptodate, flush out into the cleancache, otherwise
+	 * invalidate any existing cleancache entries.  We can't leave
+	 * stale data around in the cleancache once our page is gone
+	 */
+	if (PageUptodate(page) && PageMappedToDisk(page))
+		cleancache_put_page(page);
+	else
+		cleancache_invalidate_page(mapping, page);
+
+	page_cache_tree_delete(mapping, page, NULL);
+#if 0 /* FIXME: backend is assuming page->mapping is available */
+	page->mapping = NULL;
+#endif
+	/* Leave page->index set: truncation lookup relies upon it */
+
+	__dec_zone_page_state(page, NR_FILE_PAGES);
+	BUG_ON(page_mapped(page));
+
+	/*
+	 * The following dirty accounting is done by the writeback
+	 * path, so we don't need to do it here.
+	 *
+	 * dec_zone_page_state(page, NR_FILE_DIRTY);
+	 * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+	 */
+	spin_unlock_irq(&mapping->tree_lock);
+
+	page_cache_release(page);
+}
+EXPORT_SYMBOL_GPL(cow_delete_from_page_cache);
+
 #ifdef CONFIG_NUMA
 struct page *__page_cache_alloc(gfp_t gfp)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index 71cd5bd..9125246 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -923,6 +923,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+/*
+ * Make clone page for page forking. (Based on migrate_page_copy())
+ *
+ * Note: only clones page state so other state such as buffer_heads
+ * must be cloned by caller.
+ */
+struct page *cow_clone_page(struct page *oldpage)
+{
+	struct address_space *mapping = oldpage->mapping;
+	gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
+	struct page *newpage = __page_cache_alloc(gfp_mask);
+	int cpupid;
+
+	newpage->mapping = oldpage->mapping;
+	newpage->index = oldpage->index;
+	copy_highpage(newpage, oldpage);
+
+	/* FIXME: right? */
+	BUG_ON(PageSwapCache(oldpage));
+	BUG_ON(PageSwapBacked(oldpage));
+	BUG_ON(PageHuge(oldpage));
+	if (PageError(oldpage))
+		SetPageError(newpage);
+	if (PageReferenced(oldpage))
+		SetPageReferenced(newpage);
+	if (PageUptodate(oldpage))
+		SetPageUptodate(newpage);
+	if (PageActive(oldpage))
+		SetPageActive(newpage);
+	if (PageMappedToDisk(oldpage))
+		SetPageMappedToDisk(newpage);
+
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	cpupid = page_cpupid_xchg_last(oldpage, -1);
+	page_cpupid_xchg_last(newpage, cpupid);
+
+	mlock_migrate_page(newpage, oldpage);
+	ksm_migrate_page(newpage, oldpage);
+
+	/* Lock newpage before visible via radix tree */
+	BUG_ON(PageLocked(newpage));
+	__set_page_locked(newpage);
+
+	return newpage;
+}
+EXPORT_SYMBOL_GPL(cow_clone_page);
+
+static int page_cow_one(struct page *oldpage, struct page *newpage,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t oldptval, ptval, *pte;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	pte = page_check_address(oldpage, mm, address, &ptl, 1);
+	if (!pte)
+		goto out;
+
+	flush_cache_page(vma, address, pte_pfn(*pte));
+	oldptval = ptep_clear_flush(vma, address, pte);
+
+	/* Take refcount for PTE */
+	page_cache_get(newpage);
+
+	/*
+	 * vm_page_prot doesn't have the writable bit, so another page
+	 * fault will occur immediately after returning from this one.
+	 * That second page fault will be resolved with the forked page
+	 * that was set here.
+	 */
+	ptval = mk_pte(newpage, vma->vm_page_prot);
+#if 0
+	/* FIXME: we should check following too? Otherwise, we would
+	 * get additional read-only => write fault at least */
+	if (pte_write(oldptval))
+		ptval = pte_mkwrite(ptval);
+	if (pte_dirty(oldptval))
+		ptval = pte_mkdirty(ptval);
+	if (pte_young(oldptval))
+		ptval = pte_mkyoung(ptval);
+#endif
+	set_pte_at(mm, address, pte, ptval);
+
+	/* Update rmap accounting */
+	BUG_ON(PageMlocked(oldpage));	/* Caller should migrate mlock flag */
+	page_remove_rmap(oldpage);
+	page_add_file_rmap(newpage);
+
+	/* no need to invalidate: a not-present page won't be cached */
+	update_mmu_cache(vma, address, pte);
+
+	pte_unmap_unlock(pte, ptl);
+
+	mmu_notifier_invalidate_page(mm, address);
+
+	/* Release refcount for PTE */
+	page_cache_release(oldpage);
+out:
+	return ret;
+}
+
+/* Change old page in PTEs to new page, excluding orig_vma */
+int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
+		  struct page *newpage)
+{
+	struct address_space *mapping = page_mapping(oldpage);
+	pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	BUG_ON(!PageLocked(oldpage));
+	BUG_ON(!PageLocked(newpage));
+	BUG_ON(PageAnon(oldpage));
+	BUG_ON(mapping == NULL);
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		/*
+		 * The orig_vma's PTE is handled by caller.
+		 * (e.g. ->page_mkwrite)
+		 */
+		if (vma == orig_vma)
+			continue;
+
+		if (vma->vm_flags & VM_SHARED) {
+			unsigned long address = vma_address(oldpage, vma);
+			ret += page_cow_one(oldpage, newpage, vma, address);
+		}
+	}
+	i_mmap_unlock_read(mapping);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(page_cow_file);
+
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
diff --git a/mm/truncate.c b/mm/truncate.c
index f1e4d60..e5b4673 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page)
 	return invalidate_complete_page(mapping, page);
 }
 
+void generic_truncate_partial_page(struct address_space *mapping,
+				   struct page *page, unsigned int start,
+				   unsigned int len)
+{
+	wait_on_page_writeback(page);
+	zero_user_segment(page, start, start + len);
+	if (page_has_private(page))
+		do_invalidatepage(page, start, len);
+}
+EXPORT_SYMBOL(generic_truncate_partial_page);
+
+static void truncate_partial_page(struct address_space *mapping, pgoff_t index,
+				  unsigned int start, unsigned int len)
+{
+	struct page *page = find_lock_page(mapping, index);
+	if (!page)
+		return;
+
+	if (!mapping->a_ops->truncatepage)
+		generic_truncate_partial_page(mapping, page, start, len);
+	else
+		mapping->a_ops->truncatepage(mapping, page, start, len, 1);
+
+	cleancache_invalidate_page(mapping, page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+
+void generic_truncate_full_page(struct address_space *mapping,
+				struct page *page, int wait)
+{
+	if (wait)
+		wait_on_page_writeback(page);
+	else if (PageWriteback(page))
+		return;
+
+	truncate_inode_page(mapping, page);
+}
+EXPORT_SYMBOL(generic_truncate_full_page);
+
+static void truncate_full_page(struct address_space *mapping, struct page *page,
+			       int wait)
+{
+	if (!mapping->a_ops->truncatepage)
+		generic_truncate_full_page(mapping, page, wait);
+	else
+		mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE,
+					     wait);
+}
+
 /**
  * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
  * @mapping: mapping to truncate
@@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
-			if (PageWriteback(page)) {
-				unlock_page(page);
-				continue;
-			}
-			truncate_inode_page(mapping, page);
+			truncate_full_page(mapping, page, 0);
 			unlock_page(page);
 		}
 		pagevec_remove_exceptionals(&pvec);
@@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	}
 
 	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
-		if (page) {
-			unsigned int top = PAGE_CACHE_SIZE;
-			if (start > end) {
-				/* Truncation within a single page */
-				top = partial_end;
-				partial_end = 0;
-			}
-			wait_on_page_writeback(page);
-			zero_user_segment(page, partial_start, top);
-			cleancache_invalidate_page(mapping, page);
-			if (page_has_private(page))
-				do_invalidatepage(page, partial_start,
-						  top - partial_start);
-			unlock_page(page);
-			page_cache_release(page);
-		}
-	}
-	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
-		if (page) {
-			wait_on_page_writeback(page);
-			zero_user_segment(page, 0, partial_end);
-			cleancache_invalidate_page(mapping, page);
-			if (page_has_private(page))
-				do_invalidatepage(page, 0,
-						  partial_end);
-			unlock_page(page);
-			page_cache_release(page);
+		unsigned int top = PAGE_CACHE_SIZE;
+		if (start > end) {
+			/* Truncation within a single page */
+			top = partial_end;
+			partial_end = 0;
 		}
+		truncate_partial_page(mapping, start - 1, partial_start,
+				      top - partial_start);
 	}
+	if (partial_end)
+		truncate_partial_page(mapping, end, 0, partial_end);
+
 	/*
 	 * If the truncation happened within a single page no pages
 	 * will be released, just zeroed, so we can bail out now.
@@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 			lock_page(page);
 			WARN_ON(page->index != index);
-			wait_on_page_writeback(page);
-			truncate_inode_page(mapping, page);
+			truncate_full_page(mapping, page, 1);
 			unlock_page(page);
 		}
 		pagevec_remove_exceptionals(&pvec);
^ permalink raw reply related	[flat|nested] 160+ messages in thread
- * Re: [FYI] tux3: Core changes
  2015-05-14  8:26 ` [FYI] tux3: Core changes Daniel Phillips
@ 2015-05-14 12:59   ` Rik van Riel
  2015-05-15  0:06     ` Daniel Phillips
                       ` (2 more replies)
  2015-05-19 14:00   ` [FYI] tux3: Core changes Jan Kara
  1 sibling, 3 replies; 160+ messages in thread
From: Rik van Riel @ 2015-05-14 12:59 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-fsdevel, tux3, OGAWA Hirofumi
On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> Hi Rik,
> 
> Our linux-tux3 tree currently carries this 652 line diff
> against core, to make Tux3 work. This is mainly by Hirofumi, except
> the fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> 
>    LSFMM: Page forking
>    http://lwn.net/Articles/548091/
> 
> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> forking design and implementation. For now, this is just to give a
> general sense of what we have done. We heard there are concerns about
> how ptrace will work. I really am not familiar with the issue, could
> you please explain what you were thinking of there?
The issue is that things like ptrace, AIO, infiniband
RDMA, and other direct memory access subsystems can take
a reference to page A, which Tux3 clones into a new page B
when the process writes it.
However, while the process now points at page B, ptrace,
AIO, infiniband, etc will still be pointing at page A.
This causes the process and the other subsystem to each
look at a different page, instead of at shared state,
causing ptrace to do nothing, AIO and RDMA data to be
invisible (or corrupted), etc...
-- 
All rights reversed
^ permalink raw reply	[flat|nested] 160+ messages in thread
- * Re: [FYI] tux3: Core changes
  2015-05-14 12:59   ` Rik van Riel
@ 2015-05-15  0:06     ` Daniel Phillips
  2015-05-15  3:06       ` Rik van Riel
  2015-05-15  8:05       ` Mel Gorman
  2015-05-17 13:26     ` Boaz Harrosh
  2015-05-21 19:43     ` [WIP][PATCH] tux3: preliminatry nospace handling Daniel Phillips
  2 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-15  0:06 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: Andrea Arcangeli, Peter Zijlstra, tux3, mgorman, linux-fsdevel,
	OGAWA Hirofumi
Hi Rik,
Added Mel, Andrea and Peterz to CC as interested parties. There are
probably others, please just jump in.
On 05/14/2015 05:59 AM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
>>
>>    LSFMM: Page forking
>>    http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
> 
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
> 
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
> 
> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...
Is this a bit like page migration?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-15  0:06     ` Daniel Phillips
@ 2015-05-15  3:06       ` Rik van Riel
  2015-05-15  8:09         ` Mel Gorman
  2015-05-15  9:38         ` Daniel Phillips
  2015-05-15  8:05       ` Mel Gorman
  1 sibling, 2 replies; 160+ messages in thread
From: Rik van Riel @ 2015-05-15  3:06 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel
  Cc: linux-fsdevel, tux3, OGAWA Hirofumi, mgorman, Andrea Arcangeli,
	Peter Zijlstra
On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> Hi Rik,
> 
> Added Mel, Andrea and Peterz to CC as interested parties. There are
> probably others, please just jump in.
> 
> On 05/14/2015 05:59 AM, Rik van Riel wrote:
>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>> Hi Rik,
>>>
>>> Our linux-tux3 tree currently carries this 652 line diff
>>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>>> the fs-writeback.c hook, which is by me. The main part you may be
>>> interested in is rmap.c, which addresses the issues raised at the
>>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
>>>
>>>    LSFMM: Page forking
>>>    http://lwn.net/Articles/548091/
>>>
>>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>>> forking design and implementation. For now, this is just to give a
>>> general sense of what we have done. We heard there are concerns about
>>> how ptrace will work. I really am not familiar with the issue, could
>>> you please explain what you were thinking of there?
>>
>> The issue is that things like ptrace, AIO, infiniband
>> RDMA, and other direct memory access subsystems can take
>> a reference to page A, which Tux3 clones into a new page B
>> when the process writes it.
>>
>> However, while the process now points at page B, ptrace,
>> AIO, infiniband, etc will still be pointing at page A.
>>
>> This causes the process and the other subsystem to each
>> look at a different page, instead of at shared state,
>> causing ptrace to do nothing, AIO and RDMA data to be
>> invisible (or corrupted), etc...
> 
> Is this a bit like page migration?
Yes. Page migration will fail if there is an "extra"
reference to the page that is not accounted for by
the migration code.
Only pages that have no extra refcount can be migrated.
Similarly, your cow code needs to fail if there is an
extra reference count pinning the page. As long as
the page has a user that you cannot migrate, you cannot
move any of the other users over. They may rely on data
written by the hidden-to-you user, and the hidden-to-you
user may write to the page when you think it is a read
only stable snapshot.
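
As a rough illustration of the refcount check described above, here is a
minimal userspace model in C. The names (model_page, expected_refcount,
can_fork_page) are hypothetical and are not part of the Tux3 patch or the
kernel, and the expected count is an assumption loosely modelled on how
page migration accounts for references: anything beyond the references
the forking code knows about is treated as a pin it cannot move.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct page. */
struct model_page {
	int refcount;     /* total references held on the page */
	int mapcount;     /* references held by PTEs mapping the page */
	bool has_private; /* private data (e.g. buffer_heads) attached */
};

/*
 * References the forking code can account for in this model: one for
 * the page cache, one for the caller, one per mapping PTE, and one for
 * private data if present.
 */
static int expected_refcount(const struct model_page *page)
{
	assert(page->mapcount >= 0);
	return 2 + page->mapcount + (page->has_private ? 1 : 0);
}

/* Fork only when no unaccounted user (e.g. a DMA pin) holds the page. */
static bool can_fork_page(const struct model_page *page)
{
	return page->refcount == expected_refcount(page);
}
```

In this model a page with one mapping and no private data can be forked at
refcount 3, while a fourth reference (say, from an RDMA pin) makes
can_fork_page() return false, so the caller would have to fall back to
some other strategy rather than fork.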
-- 
All rights reversed
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-15  3:06       ` Rik van Riel
@ 2015-05-15  8:09         ` Mel Gorman
  2015-05-15  9:54           ` Daniel Phillips
  2015-05-15  9:38         ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Mel Gorman @ 2015-05-15  8:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, tux3,
	OGAWA Hirofumi, Andrea Arcangeli, Peter Zijlstra
On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> > Hi Rik,
> > 
> > Added Mel, Andrea and Peterz to CC as interested parties. There are
> > probably others, please just jump in.
> > 
> > On 05/14/2015 05:59 AM, Rik van Riel wrote:
> >> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> >>> Hi Rik,
> >>>
> >>> Our linux-tux3 tree currently carries this 652 line diff
> >>> against core, to make Tux3 work. This is mainly by Hirofumi, except
> >>> the fs-writeback.c hook, which is by me. The main part you may be
> >>> interested in is rmap.c, which addresses the issues raised at the
> >>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> >>>
> >>>    LSFMM: Page forking
> >>>    http://lwn.net/Articles/548091/
> >>>
> >>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> >>> forking design and implementation. For now, this is just to give a
> >>> general sense of what we have done. We heard there are concerns about
> >>> how ptrace will work. I really am not familiar with the issue, could
> >>> you please explain what you were thinking of there?
> >>
> >> The issue is that things like ptrace, AIO, infiniband
> >> RDMA, and other direct memory access subsystems can take
> >> a reference to page A, which Tux3 clones into a new page B
> >> when the process writes it.
> >>
> >> However, while the process now points at page B, ptrace,
> >> AIO, infiniband, etc will still be pointing at page A.
> >>
> >> This causes the process and the other subsystem to each
> >> look at a different page, instead of at shared state,
> >> causing ptrace to do nothing, AIO and RDMA data to be
> >> invisible (or corrupted), etc...
> > 
> > Is this a bit like page migration?
> 
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
> 
When I said it's not like page migration, I was referring to the fact
that a COW on a pinned page for RDMA is a different problem to page
migration. The COW of a pinned page can lead to lost writes or
corruption depending on the ordering of events. Page migration fails
when there are unexpected problems to avoid this class of issue which is
fine for page migration but may be a critical failure in a filesystem
depending on exactly why the copy is required.
-- 
Mel Gorman
SUSE Labs
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-15  8:09         ` Mel Gorman
@ 2015-05-15  9:54           ` Daniel Phillips
  2015-05-15 11:00             ` Mel Gorman
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-15  9:54 UTC (permalink / raw)
  To: Mel Gorman, Rik van Riel
  Cc: Andrea Arcangeli, Peter Zijlstra, tux3, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
On 05/15/2015 01:09 AM, Mel Gorman wrote:
> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>> The issue is that things like ptrace, AIO, infiniband
>>>> RDMA, and other direct memory access subsystems can take
>>>> a reference to page A, which Tux3 clones into a new page B
>>>> when the process writes it.
>>>>
>>>> However, while the process now points at page B, ptrace,
>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>
>>>> This causes the process and the other subsystem to each
>>>> look at a different page, instead of at shared state,
>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>> invisible (or corrupted), etc...
>>>
>>> Is this a bit like page migration?
>>
>> Yes. Page migration will fail if there is an "extra"
>> reference to the page that is not accounted for by
>> the migration code.
> 
> When I said it's not like page migration, I was referring to the fact
> that a COW on a pinned page for RDMA is a different problem to page
> migration. The COW of a pinned page can lead to lost writes or
> corruption depending on the ordering of events.
I see the lost writes case, but not the corruption case. Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?
If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.
> Page migration fails
> when there are unexpected problems to avoid this class of issue which is
> fine for page migration but may be a critical failure in a filesystem
> depending on exactly why the copy is required.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-15  9:54           ` Daniel Phillips
@ 2015-05-15 11:00             ` Mel Gorman
  2015-05-16 22:38               ` David Lang
  0 siblings, 1 reply; 160+ messages in thread
From: Mel Gorman @ 2015-05-15 11:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi,
	Andrea Arcangeli, Peter Zijlstra
On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> 
> 
> On 05/15/2015 01:09 AM, Mel Gorman wrote:
> > On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>> The issue is that things like ptrace, AIO, infiniband
> >>>> RDMA, and other direct memory access subsystems can take
> >>>> a reference to page A, which Tux3 clones into a new page B
> >>>> when the process writes it.
> >>>>
> >>>> However, while the process now points at page B, ptrace,
> >>>> AIO, infiniband, etc will still be pointing at page A.
> >>>>
> >>>> This causes the process and the other subsystem to each
> >>>> look at a different page, instead of at shared state,
> >>>> causing ptrace to do nothing, AIO and RDMA data to be
> >>>> invisible (or corrupted), etc...
> >>>
> >>> Is this a bit like page migration?
> >>
> >> Yes. Page migration will fail if there is an "extra"
> >> reference to the page that is not accounted for by
> >> the migration code.
> > 
> > When I said it's not like page migration, I was referring to the fact
> > that a COW on a pinned page for RDMA is a different problem to page
> > migration. The COW of a pinned page can lead to lost writes or
> > corruption depending on the ordering of events.
> 
> I see the lost writes case, but not the corruption case,
Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and forks are combined with writes from another thread, then when
the IO completes the reads may not be visible. The application may take
improper action at that point.
Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.
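
For reference, a minimal sketch of how an application might mark a buffer
with MADV_DONTFORK before handing it to an RDMA-style subsystem. The
helper name alloc_dontfork_buffer is hypothetical; mmap(2) and madvise(2)
with MADV_DONTFORK are the real Linux interfaces. This only covers the
fork() case mentioned above, not page forking inside a filesystem.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate an anonymous buffer and mark it MADV_DONTFORK so that the
 * range is not inherited by a child across fork(), which avoids
 * copy-on-write moving the parent away from the pinned pages.
 * Returns the buffer, or NULL on failure.
 */
static void *alloc_dontfork_buffer(size_t len)
{
	void *buf;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

The caller releases the buffer with munmap(buf, len) once the IO that
pinned it has completed.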
You can choose to not define this as data corruption because the kernel
is not directly involved and that's your call.
> Do you
> mean corruption by changing a page already in writeout? If so,
> don't all filesystems have that problem?
> 
No, the problem is different. Backing devices requiring stable pages will
block the write until the IO is complete. For those that do not require
stable pages it's ok to allow the write as long as the page is dirtied so
that it'll be written out again and no data is lost.
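
The two cases above can be sketched as a small userspace model. The type
and function names here (model_page, write_action, write_to_page) are
hypothetical and only illustrate the decision; in the kernel the waiting
side of this is, as far as I know, handled by wait_for_stable_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page state for illustration only. */
struct model_page {
	bool under_writeback; /* IO for this page is in flight */
	bool dirty;           /* page needs to be written out */
};

enum write_action {
	WRITE_NOW,         /* no IO in flight: write and dirty the page */
	WRITE_MUST_WAIT,   /* stable pages required: wait for IO first */
	WRITE_AND_REDIRTY, /* write during IO, page is written out again */
};

static enum write_action write_to_page(struct model_page *page,
				       bool needs_stable_pages)
{
	assert(page != NULL);

	/* Backing device needs stable pages: block until IO completes. */
	if (page->under_writeback && needs_stable_pages)
		return WRITE_MUST_WAIT;

	/* Write is allowed; dirtying ensures the new data reaches disk. */
	page->dirty = true;
	return page->under_writeback ? WRITE_AND_REDIRTY : WRITE_NOW;
}
```

A caller that gets WRITE_MUST_WAIT would wait for writeback to finish and
retry; in the other two cases no data is lost because the page is left
dirty.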
> If RDMA to a mmapped file races with write(2) to the same file,
> maybe it is reasonable and expected to lose some data.
> 
In the RDMA case, there is at least application awareness to work around
the problems. Normally it's ok to have both mapped and write() access
to data although userspace might need a lock to co-ordinate updates and
event ordering.
-- 
Mel Gorman
SUSE Labs
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-15 11:00             ` Mel Gorman
@ 2015-05-16 22:38               ` David Lang
  2015-05-18 12:57                 ` Mel Gorman
  0 siblings, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-16 22:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-fsdevel, tux3,
	OGAWA Hirofumi, Andrea Arcangeli, Peter Zijlstra
On Fri, 15 May 2015, Mel Gorman wrote:
> On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
>>
>>
>> On 05/15/2015 01:09 AM, Mel Gorman wrote:
>>> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>>>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>>>> The issue is that things like ptrace, AIO, infiniband
>>>>>> RDMA, and other direct memory access subsystems can take
>>>>>> a reference to page A, which Tux3 clones into a new page B
>>>>>> when the process writes it.
>>>>>>
>>>>>> However, while the process now points at page B, ptrace,
>>>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>>>
>>>>>> This causes the process and the other subsystem to each
>>>>>> look at a different page, instead of at shared state,
>>>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>>>> invisible (or corrupted), etc...
>>>>>
>>>>> Is this a bit like page migration?
>>>>
>>>> Yes. Page migration will fail if there is an "extra"
>>>> reference to the page that is not accounted for by
>>>> the migration code.
>>>
>>> When I said it's not like page migration, I was referring to the fact
>>> that a COW on a pinned page for RDMA is a different problem to page
>>> migration. The COW of a pinned page can lead to lost writes or
>>> corruption depending on the ordering of events.
>>
>> I see the lost writes case, but not the corruption case,
>
> Data corruption can occur depending on the ordering of events and the
> application's expectations. If a process starts IO, RDMA pins the page
> for read and forks are combined with writes from another thread then when
> the IO completes the reads may not be visible. The application may take
> improper action at that point.
If tux3 forks the page and writes the copy while the original page is being
modified by other things, this means that some of the changes won't be in the
version written (and this could catch partial writes with 'interesting' results
if the forking happens at the wrong time).
But if the original page gets re-marked as needing to be written out when it's
changed by one of the other things that are accessing it, there shouldn't be any
long-term corruption.
As far as short-term corruption goes, any time you have a page mmapped it could
get written out at any time, with only some of the application changes applied
to it, so this sort of corruption could happen anyway, couldn't it?
> Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> class of problem.
>
> You can choose to not define this as data corruption because the kernel
> is not directly involved and that's your call.
>
>> Do you
>> mean corruption by changing a page already in writeout? If so,
>> don't all filesystems have that problem?
>>
>
> No, the problem is different. Backing devices requiring stable pages will
> block the write until the IO is complete. For those that do not require
> stable pages it's ok to allow the write as long as the page is dirtied so
> that it'll be written out again and no data is lost.
So if tux3 is prevented from forking the page in cases where the write would be
blocked, and the page gets forked again for follow-up writes if it's modified
again otherwise, won't this be the same thing?
David Lang
>> If RDMA to a mmapped file races with write(2) to the same file,
>> maybe it is reasonable and expected to lose some data.
>>
>
> In the RDMA case, there is at least application awareness to work around
> the problems. Normally it's ok to have both mapped and write() access
> to data although userspace might need a lock to co-ordinate updates and
> event ordering.
>
>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-16 22:38               ` David Lang
@ 2015-05-18 12:57                 ` Mel Gorman
  0 siblings, 0 replies; 160+ messages in thread
From: Mel Gorman @ 2015-05-18 12:57 UTC (permalink / raw)
  To: David Lang
  Cc: Andrea Arcangeli, Rik van Riel, Peter Zijlstra, tux3,
	linux-kernel, linux-fsdevel, Daniel Phillips, OGAWA Hirofumi
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote:
> On Fri, 15 May 2015, Mel Gorman wrote:
> 
> >On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> >>
> >>
> >>On 05/15/2015 01:09 AM, Mel Gorman wrote:
> >>>On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >>>>On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>>>>The issue is that things like ptrace, AIO, infiniband
> >>>>>>RDMA, and other direct memory access subsystems can take
> >>>>>>a reference to page A, which Tux3 clones into a new page B
> >>>>>>when the process writes it.
> >>>>>>
> >>>>>>However, while the process now points at page B, ptrace,
> >>>>>>AIO, infiniband, etc will still be pointing at page A.
> >>>>>>
> >>>>>>This causes the process and the other subsystem to each
> >>>>>>look at a different page, instead of at shared state,
> >>>>>>causing ptrace to do nothing, AIO and RDMA data to be
> >>>>>>invisible (or corrupted), etc...
> >>>>>
> >>>>>Is this a bit like page migration?
> >>>>
> >>>>Yes. Page migration will fail if there is an "extra"
> >>>>reference to the page that is not accounted for by
> >>>>the migration code.
> >>>
> >>>When I said it's not like page migration, I was referring to the fact
> >>>that a COW on a pinned page for RDMA is a different problem to page
> >>>migration. The COW of a pinned page can lead to lost writes or
> >>>corruption depending on the ordering of events.
> >>
> >>I see the lost writes case, but not the corruption case,
> >
> >Data corruption can occur depending on the ordering of events and the
> >applications expectations. If a process starts IO, RDMA pins the page
> >for read and forks are combined with writes from another thread then when
> >the IO completes the reads may not be visible. The application may take
> >improper action at that point.
> 
> if tux3 forks the page and writes the copy while the original page
> is being modified by other things, this means that some of the
> changes won't be in the version written (and this could catch
> partial writes with 'interesting' results if the forking happens at
> the wrong time)
> 
Potentially yes. There is likely to be some elevated memory usage but I
imagine that can be controlled.
> But if the original page gets re-marked as needing to be written out
> when it's changed by one of the other things that are accessing it,
> there shouldn't be any long-term corruption.
> 
> As far as short-term corruption goes, any time you have a page
> mmapped it could get written out at any time, with only some of the
> application changes applied to it, so this sort of corruption could
> happen anyway couldn't it?
> 
That becomes the responsibility of the application. It's up to it to sync
appropriately when it knows updates are complete.
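
A minimal sketch of what "sync appropriately" can look like from userspace, assuming a file that is updated through a shared mapping. The helper name update_and_sync() is made up for this example; mmap(), msync() and MS_SYNC are the standard POSIX interfaces. Nothing here is specific to Tux3.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Overwrite the first len bytes of a file through a MAP_SHARED mapping,
 * then ask for writeback once the update is complete.
 * Returns 0 on success, -1 on failure. Assumes len > 0.
 */
int update_and_sync(const char *path, const void *data, size_t len)
{
	struct stat st;
	void *map;
	int fd, ret;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	/* Refuse to touch bytes past end of file (that would SIGBUS). */
	if (fstat(fd, &st) < 0 || (size_t)st.st_size < len) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the file referenced */
	if (map == MAP_FAILED)
		return -1;

	memcpy(map, data, len);		/* the complete update */
	ret = msync(map, len, MS_SYNC);	/* only now request writeback */
	munmap(map, len);
	return ret;
}
```

Before the msync() call the kernel may still write the page out at any moment, possibly with only part of the update applied, which is the short-term window discussed above.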
> >Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> >class of problem.
> >
> >You can choose to not define this as data corruption because the kernel
> >is not directly involved and that's your call.
> >
> >>Do you
> >>mean corruption by changing a page already in writeout? If so,
> >>don't all filesystems have that problem?
> >>
> >
> >No, the problem is different. Backing devices requiring stable pages will
> >block the write until the IO is complete. For those that do not require
> >stable pages it's ok to allow the write as long as the page is dirtied so
> >that it'll be written out again and no data is lost.
> 
> so if tux3 is prevented from forking the page in cases where the
> write would be blocked, and will get forked again for follow-up
> writes if it's modified again otherwise, won't this be the same
> thing?
> 
Functionally and from a correctness point of view, it *might* be
equivalent. It depends on the implementation and the page life cycle,
particularly the details of how the writeback and dirty state are coordinated
between the user-visible pages and the page being written back. I've read
none of the code or background so I cannot answer whether it's really
equivalent or not. Just be aware that it's not the same problem as page
migration and that it's not the same as how writeback and dirty state is
handled today.
-- 
Mel Gorman
SUSE Labs
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
 
 
- * Re: [FYI] tux3: Core changes
  2015-05-15  3:06       ` Rik van Riel
  2015-05-15  8:09         ` Mel Gorman
@ 2015-05-15  9:38         ` Daniel Phillips
  2015-05-27  7:41           ` Pavel Machek
  1 sibling, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-15  9:38 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: Andrea Arcangeli, Peter Zijlstra, tux3, mgorman, linux-fsdevel,
	OGAWA Hirofumi
On 05/14/2015 08:06 PM, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>> This causes the process and the other subsystem to each
>>> look at a different page, instead of at shared state,
>>> causing ptrace to do nothing, AIO and RDMA data to be
>>> invisible (or corrupted), etc...
>>
>> Is this a bit like page migration?
> 
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
> 
> Only pages that have no extra refcount can be migrated.
> 
> Similarly, your cow code needs to fail if there is an
> extra reference count pinning the page. As long as
> the page has a user that you cannot migrate, you cannot
> move any of the other users over. They may rely on data
> written by the hidden-to-you user, and the hidden-to-you
> user may write to the page when you think it is a read
> only stable snapshot.
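
The rule quoted above can be written down as a toy model. This is only a sketch, not kernel code: struct toy_page, expected_refs() and can_fork_page() are invented for illustration, and the "one page cache reference plus one per mapping" accounting is a simplifying assumption.

```c
#include <stdbool.h>

/* Toy stand-in for the kernel's struct page; field names are illustrative. */
struct toy_page {
	int refcount;	/* every reference, including pins by ptrace, AIO, RDMA */
	int mapcount;	/* references held by page tables that rmap can find */
};

/* References the forking code can account for: page cache plus mappings. */
int expected_refs(const struct toy_page *page)
{
	return 1 + page->mapcount;
}

/*
 * A page may be forked only when nobody holds a reference that the
 * forking code cannot see and therefore cannot move to the new page.
 */
bool can_fork_page(const struct toy_page *page)
{
	return page->refcount == expected_refs(page);
}
```

A page mapped twice with refcount 3 passes the check; the same page with one extra pin (refcount 4) does not, and in this model the fork would have to be refused or retried.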
Please bear with me as I study these cases one by one.
First one is ptrace. Only for executable files, right?
Maybe we don't need to fork pages in executable files,
Uprobes... If somebody puts a breakpoint in a page and
we fork it, the replacement page has a copy of the
breakpoint, and all the code on the page. Did anything
break?
Note: we have the option of being cowardly and just not
doing page forking for mmapped files, or certain kinds
of mmapped files, etc. But first we should give it the
old college try, to see if absolute perfection is
possible and practical.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-15  9:38         ` Daniel Phillips
@ 2015-05-27  7:41           ` Pavel Machek
  2015-05-27 18:09             ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2015-05-27  7:41 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi,
	mgorman, Andrea Arcangeli, Peter Zijlstra
On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
> On 05/14/2015 08:06 PM, Rik van Riel wrote:
> > On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>> The issue is that things like ptrace, AIO, infiniband
> >>> RDMA, and other direct memory access subsystems can take
> >>> a reference to page A, which Tux3 clones into a new page B
> >>> when the process writes it.
> >>>
> >>> However, while the process now points at page B, ptrace,
> >>> AIO, infiniband, etc will still be pointing at page A.
> >>>
> >>> This causes the process and the other subsystem to each
> >>> look at a different page, instead of at shared state,
> >>> causing ptrace to do nothing, AIO and RDMA data to be
> >>> invisible (or corrupted), etc...
> >>
> >> Is this a bit like page migration?
> > 
> > Yes. Page migration will fail if there is an "extra"
> > reference to the page that is not accounted for by
> > the migration code.
> > 
> > Only pages that have no extra refcount can be migrated.
> > 
> > Similarly, your cow code needs to fail if there is an
> > extra reference count pinning the page. As long as
> > the page has a user that you cannot migrate, you cannot
> > move any of the other users over. They may rely on data
> > written by the hidden-to-you user, and the hidden-to-you
> > user may write to the page when you think it is a read
> > only stable snapshot.
> 
> Please bear with me as I study these cases one by one.
> 
> First one is ptrace. Only for executable files, right?
> Maybe we don't need to fork pages in executable files,
Umm. Why do you think it is only issue for executable files?
I'm free to mmap() any file, and then execute from it.
/lib/ld-linux.so /path/to/binary
is known way to exec programs that do not have x bit set.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-27  7:41           ` Pavel Machek
@ 2015-05-27 18:09             ` Daniel Phillips
  2015-05-27 21:37               ` Pavel Machek
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-27 18:09 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrea Arcangeli, Rik van Riel, Peter Zijlstra, tux3,
	linux-kernel, mgorman, linux-fsdevel, OGAWA Hirofumi
On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
> On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
>> On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
>
> Umm. Why do you think it is only issue for executable files?
I meant: files with code in them, that will be executed. Please excuse
me for colliding with the chmod sense. I will say "code files" to avoid
ambiguity.
> I'm free to mmap() any file, and then execute from it.
>
> /lib/ld-linux.so /path/to/binary
>
> is known way to exec programs that do not have x bit set.
So... why would I write to a code file at the same time as stepping
through it with ptrace? Should I expect ptrace to work perfectly if
I do that? What would "work perfectly" mean, if the code is changing
at the same time as being traced?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-27 18:09             ` Daniel Phillips
@ 2015-05-27 21:37               ` Pavel Machek
  2015-05-27 22:33                 ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Pavel Machek @ 2015-05-27 21:37 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi,
	mgorman, Andrea Arcangeli, Peter Zijlstra
On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
> On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
> >On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
> >>On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
> >
> >Umm. Why do you think it is only issue for executable files?
> 
> I meant: files with code in them, that will be executed. Please excuse
> me for colliding with the chmod sense. I will say "code files" to avoid
> ambiguity.
> 
> >I'm free to mmap() any file, and then execute from it.
> >
> >/lib/ld-linux.so /path/to/binary
> >
> >is known way to exec programs that do not have x bit set.
> 
> So... why would I write to a code file at the same time as stepping
> through it with ptrace? Should I expect ptrace to work perfectly if
> I do that? What would "work perfectly" mean, if the code is changing
> at the same time as being traced?
Do you have any imagination at all?
Reasons I should expect ptrace to work perfectly if I'm writing to
file:
1) it used to work before
2) it used to work before
3) it used to work before and regressions are not allowed
4) some kind of just in time compiler
5) some kind of malware, playing tricks so that you have trouble
analyzing it
and of course,
6) it used to work before.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-27 21:37               ` Pavel Machek
@ 2015-05-27 22:33                 ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-27 22:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrea Arcangeli, Rik van Riel, Peter Zijlstra, tux3,
	linux-kernel, mgorman, linux-fsdevel, OGAWA Hirofumi
On 05/27/2015 02:37 PM, Pavel Machek wrote:
> On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
>> On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
>>> On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
>>>> On 05/14/2015 08:06 PM, Rik van Riel wrote: ...
>>>
>>> Umm. Why do you think it is only issue for executable files?
>>
>> I meant: files with code in them, that will be executed. Please excuse
>> me for colliding with the chmod sense. I will say "code files" to avoid
>> ambiguity.
>>
>>> I'm free to mmap() any file, and then execute from it.
>>>
>>> /lib/ld-linux.so /path/to/binary
>>>
>>> is known way to exec programs that do not have x bit set.
>>
>> So... why would I write to a code file at the same time as stepping
>> through it with ptrace? Should I expect ptrace to work perfectly if
>> I do that? What would "work perfectly" mean, if the code is changing
>> at the same time as being traced?
> 
> Do you have any imagination at all?
[Non-collegial rhetoric alert, it would be helpful to avoid that.]
> Reasons I should expect ptrace to work perfectly if I'm writing to
> file:
> 
> 1) it used to work before
> 
> 2) it used to work before
> 
> 3) it used to work before and regressions are not allowed
Are you sure that ptrace will work perfectly on a file that you are
writing to at the same time as tracing? If so, it has magic that I
do not understand. Could you please explain.
> 4) some kind of just in time compiler
A JIT that can tolerate being written to by a task it knows nothing
about, at the same time as it is generating code in the file? I do
not know of any such JIT.
> 5) some kind of malware, playing tricks so that you have trouble
> analyzing it
By writing to a code file? Then it already has write access to the
code file, so it has already gotten inside your security perimeter
without needing help from page fork. That said, we should be alert
for any new holes that page fork might open. But if there are any,
they should be actual holes, not theoretical ones.
> and of course,
> 
> 6) it used to work before.
I look forward to your explanation of how.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
 
 
 
- * Re: [FYI] tux3: Core changes
  2015-05-15  0:06     ` Daniel Phillips
  2015-05-15  3:06       ` Rik van Riel
@ 2015-05-15  8:05       ` Mel Gorman
  1 sibling, 0 replies; 160+ messages in thread
From: Mel Gorman @ 2015-05-15  8:05 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi,
	Andrea Arcangeli, Peter Zijlstra
On Thu, May 14, 2015 at 05:06:39PM -0700, Daniel Phillips wrote:
> Hi Rik,
> 
> Added Mel, Andrea and Peterz to CC as interested parties. There are
> probably others, please just jump in.
> 
> On 05/14/2015 05:59 AM, Rik van Riel wrote:
> > On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> >> Hi Rik,
> >>
> >> Our linux-tux3 tree currently carries this 652 line diff
> >> against core, to make Tux3 work. This is mainly by Hirofumi, except
> >> the fs-writeback.c hook, which is by me. The main part you may be
> >> interested in is rmap.c, which addresses the issues raised at the
> >> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> >>
> >>    LSFMM: Page forking
> >>    http://lwn.net/Articles/548091/
> >>
> >> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> >> forking design and implementation. For now, this is just to give a
> >> general sense of what we have done. We heard there are concerns about
> >> how ptrace will work. I really am not familiar with the issue, could
> >> you please explain what you were thinking of there?
> > 
> > The issue is that things like ptrace, AIO, infiniband
> > RDMA, and other direct memory access subsystems can take
> > a reference to page A, which Tux3 clones into a new page B
> > when the process writes it.
> > 
> > However, while the process now points at page B, ptrace,
> > AIO, infiniband, etc will still be pointing at page A.
> > 
> > This causes the process and the other subsystem to each
> > look at a different page, instead of at shared state,
> > causing ptrace to do nothing, AIO and RDMA data to be
> > invisible (or corrupted), etc...
> 
> Is this a bit like page migration?
> 
No, it's not.
-- 
Mel Gorman
SUSE Labs
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
- * Re: [FYI] tux3: Core changes
  2015-05-14 12:59   ` Rik van Riel
  2015-05-15  0:06     ` Daniel Phillips
@ 2015-05-17 13:26     ` Boaz Harrosh
  2015-05-18  2:20       ` Rik van Riel
  2015-05-21 19:43     ` [WIP][PATCH] tux3: preliminary nospace handling Daniel Phillips
  2 siblings, 1 reply; 160+ messages in thread
From: Boaz Harrosh @ 2015-05-17 13:26 UTC (permalink / raw)
  To: Rik van Riel, Daniel Phillips, linux-kernel
  Cc: linux-fsdevel, tux3, OGAWA Hirofumi
On 05/14/2015 03:59 PM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
<>
> 
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
> 
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
> 
All these problems can also happen with truncate+new-extending-write
It is the responsibility of the application to take file/range locks
to prevent these page-pinned problems.
> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...
> 
Again, these problems already exist. Consider each in-place write as
a truncate (punch hole) plus a new write; is that not the same?
Cheers
Boaz
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-17 13:26     ` Boaz Harrosh
@ 2015-05-18  2:20       ` Rik van Riel
  2015-05-18  7:58         ` Boaz Harrosh
  2015-05-19  4:46         ` Daniel Phillips
  0 siblings, 2 replies; 160+ messages in thread
From: Rik van Riel @ 2015-05-18  2:20 UTC (permalink / raw)
  To: Boaz Harrosh, Daniel Phillips, linux-kernel
  Cc: linux-fsdevel, tux3, OGAWA Hirofumi
On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>> Hi Rik,
> <>
>>
>> The issue is that things like ptrace, AIO, infiniband
>> RDMA, and other direct memory access subsystems can take
>> a reference to page A, which Tux3 clones into a new page B
>> when the process writes it.
>>
>> However, while the process now points at page B, ptrace,
>> AIO, infiniband, etc will still be pointing at page A.
>>
> 
> All these problems can also happen with truncate+new-extending-write
> 
> It is the responsibility of the application to take file/range locks
> to prevent these page-pinned problems.
It is unreasonable to expect a process that is being ptraced
(potentially without its knowledge) to take special measures
to protect the ptraced memory from disappearing.
It is impossible for the debugger to take those special measures
for anonymous memory, or unlinked inodes.
I don't think your requirement is workable or reasonable.
-- 
All rights reversed
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-18  2:20       ` Rik van Riel
@ 2015-05-18  7:58         ` Boaz Harrosh
  2015-05-19  4:46         ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Boaz Harrosh @ 2015-05-18  7:58 UTC (permalink / raw)
  To: Rik van Riel, Daniel Phillips, linux-kernel
  Cc: linux-fsdevel, tux3, OGAWA Hirofumi
On 05/18/2015 05:20 AM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>>> Hi Rik,
>> <>
>>>
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
> 
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
If the memory disappears, that's a bug. No, the memory is still there;
it is just not reflecting the latest content of the fs-file.
> 
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
> 
Why? One line of added code: after the open and before the mmap, do an flock.
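
A sketch of the open, flock, mmap sequence suggested here. The helper name map_locked() is hypothetical; open(), flock() and mmap() are the standard calls. Note that flock() locks are advisory, so this only helps against writers that also take the lock.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of a file read-only while holding a shared advisory
 * lock on it. On success the descriptor is stored in *fd_out and the
 * lock lasts until the caller closes it. Returns MAP_FAILED on error.
 */
void *map_locked(const char *path, size_t len, int *fd_out)
{
	void *map;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return MAP_FAILED;

	/* The one added line: lock after open() and before mmap(). */
	if (flock(fd, LOCK_SH) < 0) {
		close(fd);
		return MAP_FAILED;
	}

	map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return MAP_FAILED;
	}

	*fd_out = fd;
	return map;
}
```

A cooperating writer would take LOCK_EX on its own descriptor before modifying the file; a writer that never calls flock() is not stopped by this.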
> I don't think your requirement is workable or reasonable.
> 
Therefore it is unreasonable to write/modify the file of a ptraced
process.
Again, what I'm saying is that COWing a page on write has the same effect
as truncate+write. They are both allowed, and both might give you the same
"stale" effect. So the precedent is there. We are not introducing a new
anomaly, just introducing a new instance of it. I guess the question
is what applications/procedures are going to break. Need lots of testing
and real life installations to answer that, I guess.
Thanks
Boaz
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-18  2:20       ` Rik van Riel
  2015-05-18  7:58         ` Boaz Harrosh
@ 2015-05-19  4:46         ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-19  4:46 UTC (permalink / raw)
  To: Rik van Riel, Boaz Harrosh, linux-kernel
  Cc: linux-fsdevel, tux3, OGAWA Hirofumi
On 05/17/2015 07:20 PM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
> 
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
> 
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
> 
> I don't think your requirement is workable or reasonable.
Hi Rik,
You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive in ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment, I only have heard that it
is a horrible hack we inherited from some place far away and a time
long ago.
A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I did
not find where ptrace actually does that on a quick inspection.
Perhaps you could provide a pointer?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
 
 
- * [WIP][PATCH] tux3: preliminary nospace handling
  2015-05-14 12:59   ` Rik van Riel
  2015-05-15  0:06     ` Daniel Phillips
  2015-05-17 13:26     ` Boaz Harrosh
@ 2015-05-21 19:43     ` Daniel Phillips
  2 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-21 19:43 UTC (permalink / raw)
  To: Josef Bacik, BTRFS FILE SYSTEM
  Cc: linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi
Hi Josef,
This is a rollup patch for preliminary nospace handling in Tux3, in 
line with my post here:
   http://lkml.iu.edu/hypermail/linux/kernel/1505.1/03167.html
You still have ENOSPC issues. Maybe it would be helpful to look at 
what we have done. I saw a reproducible case with 1,000 tasks in 
parallel last week that went nospace while 28% full. You also are not
giving a very good picture of the true full state via df.
Our algorithm is pretty simple, reliable and fast. I do not see any 
reason why Btrfs could not do it basically the same way. In one way it 
is easier for you - you are not forced to commit the entire delta, you 
can choose the bits you want to force to disk as convenient. You have 
more different kinds of cache objects to account, but that should be 
just detail. Your current frontend accounting looks plausible.
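
The core accounting identity, taken from the reset_balance() comment in the patch below (adjust = charged - consumed), can be checked with a toy model. This is a sketch under simplifying assumptions: struct toy_sb and reset_balance_toy() are invented names, the reserve is held fixed, and plain longs replace the atomics.

```c
/* Toy model of the superblock fields involved; names are illustrative. */
struct toy_sb {
	long oldfree;	/* free blocks when the previous budget was set */
	long free;	/* free blocks now, after the back delta was flushed */
	long reserve;	/* held constant in this sketch */
	long budget;
	long balance;	/* what frontend tasks may still charge */
};

/*
 * Apply the adjustment derived in the patch comment:
 *   balance += charged - consumed
 * where charged is what the back delta was billed up front and
 * consumed is what it really used on disk.
 */
void reset_balance_toy(struct toy_sb *sb, long charged)
{
	long consumed = sb->oldfree - sb->free;

	sb->oldfree = sb->free;
	sb->budget = sb->free - sb->reserve;
	sb->balance += charged - consumed;
}
```

For example, start with 1000 free blocks and a reserve of 100, so the budget is 900. The back delta is charged 50 blocks but consumes only 30, and the front delta charges another 20 before the flush finishes, leaving a balance of 830. After the reset the budget is 870 and the balance is 850, which is exactly the new budget minus the 20 blocks the front delta has already charged.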
We're trying something a bit different with df, to see how it flies - 
we don't always return the same number to f_blocks, we actually return 
the volume size less the accounting reserve, which is variable. The 
reserve gets smaller as freespace gets smaller, so it is not a nasty 
surprise to the user to see it change, rather a pleasant surprise. What 
it does is make the 100% really be 100%, less just a handful of blocks, 
and it makes "used" and "available" add up exactly to "blocks". If the 
user wants to know how many blocks they really have, they can look at 
/proc/partitions.
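
A sketch of the df arithmetic described above. The reserve rule (free >> 7, clamped) follows set_budget() in the patch below; the clamp values, the struct, the function names, and capping the reserve at the free count are assumptions made only for this example.

```c
#include <stdint.h>

typedef int64_t block_t;

/* Hypothetical clamp values, chosen only to make the example concrete. */
enum { MIN_RESERVE_BLOCKS = 8, MAX_RESERVE_BLOCKS = 256 };

struct df_report {
	block_t blocks;	/* what df shows as total size */
	block_t used;
	block_t avail;
};

/* The reserve shrinks with free space: free/128, clamped to [min, max]. */
block_t reserve_for(block_t freeblocks)
{
	block_t reserve = freeblocks >> 7;

	if (reserve > MAX_RESERVE_BLOCKS)
		reserve = MAX_RESERVE_BLOCKS;
	if (reserve < MIN_RESERVE_BLOCKS)
		reserve = MIN_RESERVE_BLOCKS;
	if (reserve > freeblocks)	/* assumption: never exceed what is free */
		reserve = freeblocks;
	return reserve;
}

/* Report the volume size less the reserve, so used + avail == blocks. */
struct df_report df_numbers(block_t volblocks, block_t freeblocks)
{
	block_t reserve = reserve_for(freeblocks);
	struct df_report r;

	r.blocks = volblocks - reserve;
	r.used = volblocks - freeblocks;
	r.avail = freeblocks - reserve;
	return r;
}
```

With a 100000 block volume, 50000 free blocks give a reserve of 256 and a reported size of 99744; with 1000 free blocks the reserve drops to 8 and the reported size rises to 99992. In both cases used plus available equals the reported size.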
Regards,
Daniel
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..7043580 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -297,6 +297,7 @@ static int commit_delta(struct sb *sb)
 	tux3_wake_delta_commit(sb);
 
 	/* Commit was finished, apply defered bfree. */
+	sb->defreed = 0;
 	return unstash(sb, &sb->defree, apply_defered_bfree);
 }
 
@@ -321,13 +322,13 @@ static int need_unify(struct sb *sb)
 /* For debugging */
 void tux3_start_backend(struct sb *sb)
 {
-	assert(current->journal_info == NULL);
+	assert(!change_active());
 	current->journal_info = sb;
 }
 
 void tux3_end_backend(void)
 {
-	assert(current->journal_info);
+	assert(change_active());
 	current->journal_info = NULL;
 }
 
@@ -337,12 +338,103 @@ int tux3_under_backend(struct sb *sb)
 	return current->journal_info == sb;
 }
 
+/* Internal use only */
+static struct delta_ref *to_delta_ref(struct sb *sb, unsigned delta)
+{
+	return &sb->delta_refs[tux3_delta(delta)];
+}
+
+static block_t newfree(struct sb *sb)
+{
+	return sb->freeblocks + sb->defreed;
+}
+
+/*
+ * Reserve size should vary with budget. The reserve can include the
+ * log block overhead on the assumption that every block in the budget
+ * is a data block that generates one log record (or two?).
+ */
+block_t set_budget(struct sb *sb)
+{
+	block_t reserve = sb->freeblocks >> 7; /* FIXME: magic number */
+
+	if (1) {
+		if (reserve > max_reserve_blocks)
+			reserve = max_reserve_blocks;
+		if (reserve < min_reserve_blocks)
+			reserve = min_reserve_blocks;
+	} else if (0)
+		reserve = 10;
+
+	block_t budget = newfree(sb) - reserve;
+	if (1)
+		tux3_msg(sb, "set_budget: free %Li, budget %Li, reserve %Li", newfree(sb), budget, reserve);
+	sb->reserve = reserve;
+	atomic_set(&sb->budget, budget);
+	return reserve;
+}
+
+/*
+ * After transition, the front delta may have used some of the balance
+ * left over from this delta. The charged amount of the back delta is
+ * now stable and gives the exact balance at transition by subtracting
+ * from the old budget. The difference between the new budget and the
+ * balance at transition, which must never be negative, is added to
+ * the current balance, so the effect is exactly the same as if we had
+ * set the new budget and balance atomically at transition time. But
+ * we do not know the new balance at transition time and even if we
+ * did, we would need to add serialization against frontend changes,
+ * which are currently lockless and would like to stay that way. So we 
+ * let the current delta charge against the remaining balance until
+ * flush is done, here, then adjust the balance to what it would have
+ * been if the budget had been reset exactly at transition.
+ *
+ * We have:
+ *
+ *    consumed = oldfree - free
+ *    oldbudget = oldfree - reserve
+ *    newbudget = free - reserve
+ *    transition_balance = oldbudget - charged
+ * 
+ * Factoring out the reserve, the balance adjustment is:
+ * 
+ *    adjust = newbudget - transition_balance
+ *           = (free - reserve) - ((oldfree - reserve) - charged)
+ *           = free + (charged - oldfree)
+ *           = charged + (free - oldfree)
+ *           = charged - consumed
+ *
+ * To extend for variable reserve size, add the difference between
+ * old and new reserve size to the balance adjustment.
+ */
+void reset_balance(struct sb *sb, unsigned delta, block_t unify_cost)
+{
+	enum { initial_logblock = 0 };
+	unsigned charged = atomic_read(&to_delta_ref(sb, delta)->charged);
+	block_t consumed = sb->oldfree - newfree(sb);
+	//block_t old_reserve = sb->reserve;
+
+	if (1)
+		tux3_msg(sb, "budget %i, balance %i, charged %u, consumed %Li, free %Lu, defree %Lu, unify %Lu",
+			atomic_read(&sb->budget), atomic_read(&sb->balance), 
+			charged, consumed, sb->freeblocks, sb->defreed, unify_cost);
+
+	sb->oldfree = newfree(sb);
+	set_budget(sb); /* maybe should set in size dependent order */
+	atomic_add(charged - consumed /*+ (old_reserve - sb->reserve)*/, &sb->balance);
+
+	if (consumed - initial_logblock - unify_cost > charged)
+		tux3_warn(sb, "delta %u estimate exceeded by %Lu blocks",
+			delta, consumed - charged);
+}
+
 static int do_commit(struct sb *sb, int flags)
 {
 	unsigned delta = sb->delta_staging;
 	int no_unify = flags & __NO_UNIFY;
 	struct blk_plug plug;
 	struct ioinfo ioinfo;
+	block_t unify_cost = 0;
 	int err = 0;
 
 	trace(">>>>>>>>> commit delta %u", delta);
@@ -359,8 +451,10 @@ static int do_commit(struct sb *sb, int flags)
 	 * FIXME: there is no need to commit if normal inodes are not
 	 * dirty? better way?
 	 */
-	if (!(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
+	if (1 && !(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta)) {
+		reset_balance(sb, delta, 0);
 		goto out;
+	}
 
 	/* Prepare to wait I/O */
 	tux3_io_init(&ioinfo, flags);
@@ -402,9 +496,11 @@ static int do_commit(struct sb *sb, int flags)
 #endif
 
 	if ((!no_unify && need_unify(sb)) || (flags & __FORCE_UNIFY)) {
+		unify_cost = sb->freeblocks;
 		err = unify_log(sb);
 		if (err)
 			goto error; /* FIXME: error handling */
+		unify_cost -= sb->freeblocks;
 
 		/* Add delta log for debugging. */
 		log_delta(sb);
@@ -414,6 +510,8 @@ static int do_commit(struct sb *sb, int flags)
 	write_log(sb);
 	blk_finish_plug(&plug);
 
+	reset_balance(sb, delta, unify_cost);
+
 	/*
 	 * Commit last block (for now, this is sync I/O).
 	 *
@@ -455,12 +553,6 @@ error:
 	 ((int)((a) - (b)) >= 0))
 #define delta_before_eq(a,b)	delta_after_eq(b,a)
 
-/* Internal use only */
-static struct delta_ref *to_delta_ref(struct sb *sb, unsigned delta)
-{
-	return &sb->delta_refs[tux3_delta(delta)];
-}
-
 static int flush_delta(struct sb *sb, int flags)
 {
 	int err;
@@ -510,6 +602,13 @@ static struct delta_ref *delta_get(struct sb *sb)
 	 * free ->current_delta, so we don't need rcu_read_lock().
 	 */
 	do {
+		barrier();
+		/*
+		 * NOTE: Without this barrier(), at least, gcc-4.8.2 ignores
+		 * volatile dereference of sb->current_delta in this loop,
+		 * and instead uses the cached value.
+		 * (Looks like a gcc bug, this barrier() is the workaround)
+		 */
 		delta_ref = rcu_dereference_check(sb->current_delta, 1);
 	} while (!atomic_inc_not_zero(&delta_ref->refcount));
 
@@ -540,6 +639,7 @@ static void __delta_transition(struct sb *sb, struct delta_ref *delta_ref,
 	reinit_completion(&delta_ref->waitref_done);
 	/* Assign the delta number */
 	delta_ref->delta = new_delta;
+	atomic_set(&delta_ref->charged, 0);
 
 	/*
 	 * Update current delta, then release reference.
@@ -587,6 +687,7 @@ void tux3_delta_init(struct sb *sb)
 
 	for (i = 0; i < ARRAY_SIZE(sb->delta_refs); i++) {
 		atomic_set(&sb->delta_refs[i].refcount, 0);
+		atomic_set(&sb->delta_refs[i].charged, 0);
 		init_completion(&sb->delta_refs[i].waitref_done);
 	}
 #ifdef TUX3_FLUSHER_SYNC
@@ -620,11 +721,16 @@ void tux3_delta_setup(struct sb *sb)
 #endif
 }
 
-unsigned tux3_get_current_delta(void)
+static inline struct delta_ref *current_delta(void)
 {
 	struct delta_ref *delta_ref = current->journal_info;
 	assert(delta_ref != NULL);
-	return delta_ref->delta;
+	return delta_ref;
+}
+
+unsigned tux3_get_current_delta(void)
+{
+	return current_delta()->delta;
 }
 
 /* Choice sb->delta or sb->unify from inode */
@@ -654,7 +760,7 @@ unsigned tux3_inode_delta(struct inode *inode)
  */
 void change_begin_atomic(struct sb *sb)
 {
-	assert(current->journal_info == NULL);
+	assert(!change_active());
 	current->journal_info = delta_get(sb);
 }
 
@@ -662,7 +768,7 @@ void change_begin_atomic(struct sb *sb)
 void change_end_atomic(struct sb *sb)
 {
 	struct delta_ref *delta_ref = current->journal_info;
-	assert(delta_ref != NULL);
+	assert(change_active());
 	current->journal_info = NULL;
 	delta_put(sb, delta_ref);
 }
@@ -694,12 +800,52 @@ void change_end_atomic_nested(struct sb *sb, void *ptr)
  * and blocked if disabled asynchronous backend and backend is
  * running.
  */
-void change_begin(struct sb *sb)
+
+int change_begin_nospace(struct sb *sb, int cost, int limit)
 {
 #ifdef TUX3_FLUSHER_SYNC
 	down_read(&sb->delta_lock);
 #endif
+	if (1)
+		tux3_msg(sb, "check space, budget %i, balance %i, cost %u, limit %i",
+			atomic_read(&sb->budget), atomic_read(&sb->balance), cost, limit);
+
 	change_begin_atomic(sb);
+	if (atomic_sub_return(cost, &sb->balance) >= limit) {
+		atomic_add(cost, &current_delta()->charged);
+		return 0;
+	}
+	atomic_add(cost, &sb->balance);
+	if (1)
+		tux3_msg(sb, "wait space, budget %i, balance %i, cost %u, limit %i",
+			atomic_read(&sb->budget), atomic_read(&sb->balance), cost, limit);
+	change_end_atomic(sb);
+	return 1;
+}
+
+int change_nospace(struct sb *sb, int cost, int limit)
+{
+	assert(!change_active());
+	sync_current_delta(sb);
+	if (1)
+		tux3_msg(sb, "final check, budget %i, balance %i, cost %u, limit %i",
+			atomic_read(&sb->budget), atomic_read(&sb->balance), cost, limit);
+	if (cost > atomic_read(&sb->budget) + limit) {
+		if (1)
+			tux3_msg(sb, "*** out of space ***");
+		return 1;
+	}
+	return 0;
+}
+
+int change_begin(struct sb *sb, int cost, int limit)
+{
+	while (change_begin_nospace(sb, cost, limit)) {
+		if (change_nospace(sb, cost, limit)) {
+			return 1;
+		}
+	}
+	return 0;
 }
 
 int change_end(struct sb *sb)
@@ -714,34 +860,3 @@ int change_end(struct sb *sb)
 #endif
 	return err;
 }
-
-/*
- * This is used for simplify the error path, or separates big chunk to
- * small chunk in loop.
- *
- * E.g. the following
- *
- * change_begin()
- * while (stop) {
- *	change_begin_if_need()
- *	if (do_something() < 0)
- *		break;
- *	change_end_if_need()
- * }
- * change_end_if_need()
- */
-void change_begin_if_needed(struct sb *sb, int need_sep)
-{
-	if (current->journal_info == NULL)
-		change_begin(sb);
-	else if (need_sep) {
-		change_end(sb);
-		change_begin(sb);
-	}
-}
-
-void change_end_if_needed(struct sb *sb)
-{
-	if (current->journal_info)
-		change_end(sb);
-}
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 59d6781..f543cfc 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -187,6 +187,10 @@ long tux3_writeback(struct super_block *super, struct bdi_writeback *wb,
 	unsigned target_delta;
 	int err;
 
+	if (0)
+		tux3_msg(sb, "writeback delta %i, reason %i",
+			container_of(work, struct tux3_wb_work, work)->delta, work->reason);
+
 	/* If we didn't finish replay yet, don't flush. */
 	if (!(super->s_flags & MS_ACTIVE))
 		return 0;
diff --git a/fs/tux3/filemap.c b/fs/tux3/filemap.c
index a8811c2..e79a148 100644
--- a/fs/tux3/filemap.c
+++ b/fs/tux3/filemap.c
@@ -834,7 +834,6 @@ static int __tux3_file_write_begin(struct file *file,
 				   int tux3_flags)
 {
 	int ret;
-
 	ret = tux3_write_begin(mapping, pos, len, flags, pagep,
 			       tux3_da_get_block, tux3_flags);
 	if (ret < 0)
@@ -877,8 +876,8 @@ static int tux3_file_write_end(struct file *file, struct address_space *mapping,
 
 	/* Separate big write transaction to small chunk. */
 	assert(S_ISREG(mapping->host->i_mode));
-	change_end_if_needed(tux_sb(mapping->host->i_sb));
-
+	if (change_active())
+		change_end(tux_sb(mapping->host->i_sb));
 	return ret;
 }
 
diff --git a/fs/tux3/filemap_blocklib.c b/fs/tux3/filemap_blocklib.c
index 1e0127f..aa13810 100644
--- a/fs/tux3/filemap_blocklib.c
+++ b/fs/tux3/filemap_blocklib.c
@@ -167,16 +167,33 @@ static int tux3_write_begin(struct address_space *mapping, loff_t pos,
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	struct page *page;
 	int status;
-
 retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
 
 	if (tux3_flags & TUX3_F_SEP_DELTA) {
+		struct sb *sb = tux_sb(mapping->host->i_sb);
+		int cost = one_page_cost(mapping->host);
 		/* Separate big write transaction to small chunk. */
 		assert(S_ISREG(mapping->host->i_mode));
-		change_begin_if_needed(tux_sb(mapping->host->i_sb), 1);
+
+		if (change_active())
+			change_end(sb);
+
+		if (PageDirty(page))
+			change_begin_atomic(sb);
+		else if (change_begin_nospace(sb, cost, 0)) {
+			unlock_page(page);
+			page_cache_release(page);
+			if (change_nospace(sb, cost, 0)) {
+				/* fail path will truncate page */
+				change_begin_atomic(sb);
+				status = -ENOSPC;
+				goto fail;
+			}
+			goto retry;
+		}
 	}
 
 	/*
@@ -207,6 +224,7 @@ retry:
 	if (unlikely(status)) {
 		unlock_page(page);
 		page_cache_release(page);
+fail:
 		page = NULL;
 	}
 
diff --git a/fs/tux3/inode.c b/fs/tux3/inode.c
index f747c0e..7c97285 100644
--- a/fs/tux3/inode.c
+++ b/fs/tux3/inode.c
@@ -984,7 +984,10 @@ int tux3_setattr(struct dentry *dentry, struct iattr *iattr)
 
 	if (need_lock)
 		down_write(&tux_inode(inode)->truncate_lock);
-	change_begin(sb);
+	if (change_begin(sb, 2, 0)) {
+		err = -ENOSPC;
+		goto unlock;
+	}
 
 	if (need_truncate)
 		err = tux3_truncate(inode, iattr->ia_size);
@@ -995,6 +998,7 @@ int tux3_setattr(struct dentry *dentry, struct iattr *iattr)
 	}
 
 	change_end(sb);
+unlock:
 	if (need_lock)
 		up_write(&tux_inode(inode)->truncate_lock);
 
@@ -1060,7 +1064,8 @@ static int tux3_special_update_time(struct inode *inode, struct timespec *time,
 		return 0;
 
 	/* FIXME: no i_mutex, so this is racy */
-	change_begin(sb);
+	if (change_begin(sb, 1, 0))
+		return -ENOSPC;
 	if (flags & S_VERSION)
 		inode_inc_iversion(inode);
 	if (flags & S_CTIME)
diff --git a/fs/tux3/inode_vfslib.c b/fs/tux3/inode_vfslib.c
index afae9b8..bdecf53 100644
--- a/fs/tux3/inode_vfslib.c
+++ b/fs/tux3/inode_vfslib.c
@@ -21,10 +21,15 @@ static ssize_t tux3_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	mutex_lock(&inode->i_mutex);
 	/* For each ->write_end() calls change_end(). */
-	change_begin(sb);
+	if (change_begin(sb, 1, 0)) {
+		mutex_unlock(&inode->i_mutex);
+		return -ENOSPC;
+	}
+
 	/* FIXME: file_update_time() in this can be race with mmap */
 	ret = __generic_file_write_iter(iocb, from);
-	change_end_if_needed(sb);
+	if (change_active())
+		change_end(sb);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0) {
diff --git a/fs/tux3/log.c b/fs/tux3/log.c
index bb26c73..fdc36b0 100644
--- a/fs/tux3/log.c
+++ b/fs/tux3/log.c
@@ -634,6 +634,7 @@ int defer_bfree(struct sb *sb, struct stash *defree,
 
 	assert(count > 0);
 	assert(block + count <= sb->volblocks);
+	sb->defreed += count;
 
 	/*
 	 * count field of stash is 16bits. So, this separates to
diff --git a/fs/tux3/namei.c b/fs/tux3/namei.c
index cb8e0b2..2b1355d 100644
--- a/fs/tux3/namei.c
+++ b/fs/tux3/namei.c
@@ -37,13 +37,16 @@ static int __tux3_mknod(struct inode *dir, struct dentry *dentry,
 			struct tux_iattr *iattr)
 {
 	struct inode *inode;
+	struct sb *sb = tux_sb(dir->i_sb);
 	int err;
 
 	if (!huge_valid_dev(iattr->rdev) &&
 	    (S_ISBLK(iattr->mode) || S_ISCHR(iattr->mode)))
 		return -EINVAL;
 
-	change_begin(tux_sb(dir->i_sb));
+	if (change_begin(sb, 5, 0))
+		return -ENOSPC;
+
 	inode = tux_create_dirent_and_inode(dir, &dentry->d_name, iattr);
 	if (IS_ERR(inode)) {
 		err = PTR_ERR(inode);
@@ -56,7 +59,7 @@ static int __tux3_mknod(struct inode *dir, struct dentry *dentry,
 		inode_inc_link_count(dir);
 	err = 0;
 out:
-	change_end(tux_sb(dir->i_sb));
+	change_end(sb);
 	return err;
 }
 
@@ -93,7 +96,8 @@ static int tux3_link(struct dentry *old_dentry, struct inode *dir,
 	struct sb *sb = tux_sb(inode->i_sb);
 	int err;
 
-	change_begin(sb);
+	if (change_begin(sb, 5, 0))
+		return -ENOSPC;
 	tux3_iattrdirty(inode);
 	inode->i_ctime = gettime();
 	inode_inc_link_count(inode);
@@ -134,7 +138,8 @@ static int __tux3_symlink(struct inode *dir, struct dentry *dentry,
 	if (len > PAGE_CACHE_SIZE)
 		return -ENAMETOOLONG;
 
-	change_begin(sb);
+	if (change_begin(sb, 6, 0))
+		return -ENOSPC;
 	inode = tux_create_dirent_and_inode(dir, &dentry->d_name, iattr);
 	if (IS_ERR(inode)) {
 		err = PTR_ERR(inode);
@@ -181,7 +186,8 @@ static int tux3_unlink(struct inode *dir, struct dentry *dentry)
 	struct inode *inode = dentry->d_inode;
 	struct sb *sb = tux_sb(inode->i_sb);
 
-	change_begin(sb);
+	if (change_begin(sb, 1, min_reserve_blocks * -.75))
+		return -ENOSPC;
 	int err = tux_del_dirent(dir, dentry);
 	if (!err) {
 		tux3_iattrdirty(inode);
@@ -201,7 +207,8 @@ static int tux3_rmdir(struct inode *dir, struct dentry *dentry)
 	int err = tux_dir_is_empty(inode);
 
 	if (!err) {
-		change_begin(sb);
+		if (change_begin(sb, 3, min_reserve_blocks * -.75))
+			return -ENOSPC;
 		err = tux_del_dirent(dir, dentry);
 		if (!err) {
 			tux3_iattrdirty(inode);
@@ -237,7 +244,8 @@ static int tux3_rename(struct inode *old_dir, struct dentry *old_dentry,
 	/* FIXME: is this needed? */
 	assert(be64_to_cpu(old_entry->inum) == tux_inode(old_inode)->inum);
 
-	change_begin(sb);
+	if (change_begin(sb, 20, 0))
+		return -ENOSPC;
 	delta = tux3_get_current_delta();
 
 	new_subdir = S_ISDIR(old_inode->i_mode) && new_dir != old_dir;
diff --git a/fs/tux3/super.c b/fs/tux3/super.c
index b104dc7..29b17e8 100644
--- a/fs/tux3/super.c
+++ b/fs/tux3/super.c
@@ -370,6 +370,7 @@ static int init_sb(struct sb *sb)
 
 	INIT_LIST_HEAD(&sb->orphan_add);
 	INIT_LIST_HEAD(&sb->orphan_del);
+	sb->defreed = 0;
 	stash_init(&sb->defree);
 	stash_init(&sb->deunify);
 	INIT_LIST_HEAD(&sb->unify_buffers);
@@ -421,6 +422,12 @@ static void __setup_sb(struct sb *sb, struct disksuper *super)
 
 	sb->blocksize = 1 << sb->blockbits;
 	sb->blockmask = (1 << sb->blockbits) - 1;
+#ifdef __KERNEL__
+	sb->blocks_per_page_bits = PAGE_CACHE_SHIFT - sb->blockbits;
+#else
+	sb->blocks_per_page_bits = 0;
+#endif
+	sb->blocks_per_page = 1 << sb->blocks_per_page_bits;
 	sb->groupbits = 13; // FIXME: put in disk super?
 	sb->volmask = roundup_pow_of_two64(sb->volblocks) - 1;
 	sb->entries_per_node = calc_entries_per_node(sb->blocksize);
@@ -656,12 +663,13 @@ static int tux3_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct super_block *sb = dentry->d_sb;
 	struct sb *sbi = tux_sb(sb);
-
+	block_t reserve = sbi->reserve, avail = sbi->freeblocks + sbi->defreed;
+	avail -= avail < reserve ? avail : reserve;
 	buf->f_type = sb->s_magic;
 	buf->f_bsize = sbi->blocksize;
-	buf->f_blocks = sbi->volblocks;
-	buf->f_bfree = sbi->freeblocks;
-	buf->f_bavail = sbi->freeblocks;
+	buf->f_blocks = sbi->volblocks - reserve;
+	buf->f_bfree = avail;
+	buf->f_bavail = avail; /* FIXME: no special privilege for root yet */
 	buf->f_files = MAX_INODES;
 	buf->f_ffree = sbi->freeinodes;
 #if 0
@@ -773,7 +781,7 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
 			goto error;
 		}
 	}
-	tux3_dbg("s_blocksize %lu", sb->s_blocksize);
+	tux3_dbg("s_blocksize %lu, sb = %p", sb->s_blocksize, tux_sb(sb));
 
 	rp = tux3_init_fs(sbi);
 	if (IS_ERR(rp)) {
@@ -794,6 +802,9 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
 		goto error;
 	}
 
+	sbi->oldfree = sbi->freeblocks;
+	set_budget(sbi);
+	atomic_set(&sbi->balance, atomic_read(&sbi->budget));
 	return 0;
 
 error:
diff --git a/fs/tux3/tux3.h b/fs/tux3/tux3.h
index e2f2d9b..4eae938 100644
--- a/fs/tux3/tux3.h
+++ b/fs/tux3/tux3.h
@@ -277,6 +277,7 @@ struct delta_ref {
 	atomic_t refcount;
 	unsigned delta;
 	struct completion waitref_done;
+	atomic_t charged; /* block allocation upper bound for this delta */
 };
 
 /* Per-delta data structure for sb */
@@ -355,7 +356,8 @@ struct sb {
 	unsigned blocksize, blockbits, blockmask, groupbits;
 	u64 freeinodes;		/* Number of free inode numbers. This is
 				 * including the deferred allocated inodes */
-	block_t volblocks, volmask, freeblocks, nextblock;
+	block_t volblocks, volmask, freeblocks, oldfree, reserve, nextblock;
+	unsigned blocks_per_page, blocks_per_page_bits;
 	inum_t nextinum;	/* FIXME: temporary hack to avoid to find
 				 * same area in itree for free inum. */
 	unsigned entries_per_node; /* must be per-btree type, get rid of this */
@@ -376,6 +378,7 @@ struct sb {
 	struct list_head orphan_add; /* defered orphan inode add list */
 	struct list_head orphan_del; /* defered orphan inode del list */
 
+	block_t defreed;	/* total deferred free blocks */
 	struct stash defree;	/* defer extent frees until after delta */
 	struct stash deunify;	/* defer extent frees until after unify */
 
@@ -387,6 +390,7 @@ struct sb {
 	/*
 	 * For frontend and backend
 	 */
+	atomic_t budget, balance;
 	spinlock_t countmap_lock;
 	struct countmap_pin countmap_pin;
 	struct tux3_idefer_map *idefer_map;
@@ -515,6 +519,12 @@ static inline struct block_device *sb_dev(struct sb *sb)
 {
 	return sb->vfs_sb->s_bdev;
 }
+#else
+static inline struct sb *tux_sb(struct super_block *sb)
+{
+	return container_of(sb, struct sb, vfs_sb);
+}
+
 #endif /* __KERNEL__ */
 
 /* Get delta from free running counter */
@@ -686,6 +696,15 @@ static inline int has_no_root(struct btree *btree)
 	return btree->root.depth == 0;
 }
 
+/* Estimate backend allocation cost per data page */
+static inline unsigned one_page_cost(struct inode *inode)
+{
+	struct sb *sb = tux_sb(inode->i_sb);
+	struct btree *btree = &tux_inode(inode)->btree;
+	unsigned depth = has_root(btree) ? btree->root.depth : 0;
+	return sb->blocks_per_page + 2 * depth + 1;
+}
+
 /* Redirect ptr which is pointing data of src from src to dst */
 static inline void *ptr_redirect(void *ptr, void *src, void *dst)
 {
@@ -832,6 +851,14 @@ int replay_bnode_del(struct replay *rp, block_t bnode, tuxkey_t key, unsigned co
 int replay_bnode_adjust(struct replay *rp, block_t bnode, tuxkey_t from, tuxkey_t to);
 
 /* commit.c */
+
+enum { min_reserve_blocks = 8, max_reserve_blocks = 128 };
+
+static inline int change_active(void)
+{
+	return !!current->journal_info;
+}
+
 void tux3_start_backend(struct sb *sb);
 void tux3_end_backend(void);
 int tux3_under_backend(struct sb *sb);
@@ -844,10 +871,11 @@ void change_begin_atomic(struct sb *sb);
 void change_end_atomic(struct sb *sb);
 void change_begin_atomic_nested(struct sb *sb, void **ptr);
 void change_end_atomic_nested(struct sb *sb, void *ptr);
-void change_begin(struct sb *sb);
+int change_begin_nospace(struct sb *sb, int cost, int limit);
+int change_nospace(struct sb *sb, int cost, int limit);
+int change_begin(struct sb *sb, int cost, int limit);
 int change_end(struct sb *sb);
-void change_begin_if_needed(struct sb *sb, int need_sep);
-void change_end_if_needed(struct sb *sb);
+block_t set_budget(struct sb *sb);
 
 /* dir.c */
 void tux_set_entry(struct buffer_head *buffer, struct tux3_dirent *entry,
diff --git a/fs/tux3/user/filemap.c b/fs/tux3/user/filemap.c
index 8d8c812..6c99948 100644
--- a/fs/tux3/user/filemap.c
+++ b/fs/tux3/user/filemap.c
@@ -309,7 +309,7 @@ int tuxwrite(struct file *file, const void *data, unsigned len)
 {
 	struct sb *sb = tux_sb(file->f_inode->i_sb);
 	int ret;
-	change_begin(sb);
+	change_begin(sb, 2 * len, 0);
 	ret = tuxio(file, (void *)data, len, 1);
 	change_end(sb);
 	return ret;
diff --git a/fs/tux3/user/inode.c b/fs/tux3/user/inode.c
index 21823a1..70d5602 100644
--- a/fs/tux3/user/inode.c
+++ b/fs/tux3/user/inode.c
@@ -354,7 +354,7 @@ int tuxtruncate(struct inode *inode, loff_t size)
 	struct sb *sb = tux_sb(inode->i_sb);
 	int err;
 
-	change_begin(sb);
+	change_begin(sb, 1, -10);
 	err = __tuxtruncate(inode, size);
 	change_end(sb);
 
diff --git a/fs/tux3/user/tux3user.h b/fs/tux3/user/tux3user.h
index 5b68e5c..a298a91 100644
--- a/fs/tux3/user/tux3user.h
+++ b/fs/tux3/user/tux3user.h
@@ -55,11 +55,6 @@ static inline map_t *mapping(struct inode *inode);
 
 #include "../tux3.h"
 
-static inline struct sb *tux_sb(struct super_block *sb)
-{
-	return container_of(sb, struct sb, vfs_sb);
-}
-
 static inline struct super_block *vfs_sb(struct sb *sb)
 {
 	return &sb->vfs_sb;
diff --git a/fs/tux3/xattr.c b/fs/tux3/xattr.c
index c4ea5f9..9aff2e7 100644
--- a/fs/tux3/xattr.c
+++ b/fs/tux3/xattr.c
@@ -724,7 +724,7 @@ int set_xattr(struct inode *inode, const char *name, unsigned len,
 	struct inode *atable = sb->atable;
 
 	mutex_lock(&atable->i_mutex);
-	change_begin(sb);
+	change_begin(sb, 0, 0);
 
 	atom_t atom;
 	int err = make_atom(atable, name, len, &atom);
@@ -748,7 +748,7 @@ int del_xattr(struct inode *inode, const char *name, unsigned len)
 	int err;
 
 	mutex_lock(&atable->i_mutex);
-	change_begin(sb);
+	change_begin(sb, 0, 0);
 
 	atom_t atom;
 	err = find_atom(atable, name, len, &atom);
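
For readers following the accounting rather than the individual call sites, here is a rough userspace sketch of the budget/balance scheme that change_begin_nospace() and change_begin() implement above. It is not the Tux3 code: struct toy_sb and all toy_* names are made up for illustration, toy_commit() is only an assumed stand-in for sync_current_delta() plus reset_balance() (the real reset logic is not modelled), and the final check is read as "cost > budget + limit", which is an assumption about the intent.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy stand-in for the counters this patch adds to struct sb (hypothetical names). */
struct toy_sb {
	atomic_int budget;	/* blocks the frontend may charge per delta */
	atomic_int balance;	/* budget still unclaimed in the current delta */
	atomic_int charged;	/* total cost admitted into the current delta */
};

static void toy_init(struct toy_sb *sb, int budget)
{
	atomic_init(&sb->budget, budget);
	atomic_init(&sb->balance, budget);
	atomic_init(&sb->charged, 0);
}

/* Like change_begin_nospace(): 0 = admitted, 1 = would overdraw the balance. */
static int toy_begin_nospace(struct toy_sb *sb, int cost, int limit)
{
	/* atomic_fetch_sub() returns the old value, so subtract cost once more. */
	if (atomic_fetch_sub(&sb->balance, cost) - cost >= limit) {
		atomic_fetch_add(&sb->charged, cost);
		return 0;
	}
	atomic_fetch_add(&sb->balance, cost);	/* undo the speculative charge */
	return 1;
}

/*
 * Assumed commit behaviour, for illustration only: blocks charged to the
 * delta are treated as consumed, then the balance restarts from the
 * reduced budget.
 */
static void toy_commit(struct toy_sb *sb)
{
	int used = atomic_exchange(&sb->charged, 0);

	assert(used >= 0);
	atomic_fetch_sub(&sb->budget, used);
	atomic_store(&sb->balance, atomic_load(&sb->budget));
}

/* Like change_begin(): retry after a commit, give up when even that cannot help. */
static int toy_change_begin(struct toy_sb *sb, int cost, int limit)
{
	while (toy_begin_nospace(sb, cost, limit)) {
		toy_commit(sb);
		if (cost > atomic_load(&sb->budget) + limit)
			return 1;	/* the caller would return -ENOSPC */
	}
	return 0;
}
```

With a budget of 4, a first toy_change_begin(sb, 3, 0) is admitted and a second one fails after the commit leaves a budget of 1. A call with limit -6 (the value min_reserve_blocks * -.75 yields for unlink and rmdir) is still admitted at that point, which is how deletes are allowed to dip into the reserve. The sketch is only exercised with limit <= 0, as in the call sites above.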
^ permalink raw reply related	[flat|nested] 160+ messages in thread
 
- * Re: [FYI] tux3: Core changes
  2015-05-14  8:26 ` [FYI] tux3: Core changes Daniel Phillips
  2015-05-14 12:59   ` Rik van Riel
@ 2015-05-19 14:00   ` Jan Kara
  2015-05-19 19:18     ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-05-19 14:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi, Rik van Riel
On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
> Hi Rik,
> 
> Our linux-tux3 tree currently carries this 652 line diff
> against core, to make Tux3 work. This is mainly by Hirofumi, except
> the fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
> 
>    LSFMM: Page forking
>    http://lwn.net/Articles/548091/
> 
> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> forking design and implementation. For now, this is just to give a
> general sense of what we have done. We heard there are concerns about
> how ptrace will work. I really am not familiar with the issue, could
> you please explain what you were thinking of there?
  So here are a few things I find problematic about page forking (besides
the cases with elevated page_count already discussed in this thread - there
I believe that anything more complex than "wait for the IO instead of
forking when page has elevated use count" isn't going to work. There are
too many users depending on too subtle details of the behavior...). Some
of them are actually mentioned in the above LWN article:
When you create a copy of a page and replace it in the radix tree, nobody
in mm subsystem is aware that oldpage may be under writeback. That causes
interesting issues:
* truncate_inode_pages() can finish before all IO for the file is finished.
  So far filesystems rely on the fact that once truncate_inode_pages()
  finishes, all running IO against the file has completed and no new IO
  can be submitted.
* Writeback can come and try to write newpage while oldpage is still under
  IO. Then you'll have two IOs against one block which has undefined
  results.
* filemap_fdatawait() called from fsync() has the additional problem that it
  is not aware of oldpage and thus may return although IO hasn't finished yet.
I understand that Tux3 may avoid these issues due to some other mechanisms
it internally has but if page forking should get into mm subsystem, the
above must work.
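
The second and third points can be illustrated with a toy userspace model. This is only a sketch under heavy simplification, not kernel code: struct toy_page, struct toy_mapping and the toy_* helpers are invented names, and the "mapping" has a single slot standing in for the radix tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal stand-in for a page: just an "IO in flight" flag. */
struct toy_page {
	bool writeback;
};

/* Hypothetical mapping with one slot, standing in for the radix tree. */
struct toy_mapping {
	struct toy_page *slot;
};

/* Page forking: newpage takes over the slot, oldpage keeps its in-flight IO. */
static void toy_fork(struct toy_mapping *m, struct toy_page *newpage)
{
	assert(newpage != NULL);
	m->slot = newpage;
}

/* A waiter that walks the mapping only sees pages reachable from the slot. */
static bool toy_mapping_io_pending(const struct toy_mapping *m)
{
	return m->slot != NULL && m->slot->writeback;
}

/* Number of IOs in flight against the one block both pages map to. */
static int toy_ios_in_flight(const struct toy_page *oldpage,
			     const struct toy_page *newpage)
{
	return (oldpage->writeback ? 1 : 0) + (newpage->writeback ? 1 : 0);
}
```

If oldpage is under writeback when toy_fork() runs, toy_mapping_io_pending() reports nothing pending although the IO is still in flight, which mirrors the fsync() case. Starting writeback on newpage at that point makes toy_ios_in_flight() return 2, the two-IOs-against-one-block case.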
								Honza
> diffstat tux3.core.patch
>  fs/Makefile               |    1 
>  fs/fs-writeback.c         |  100 +++++++++++++++++++++++++--------
>  include/linux/fs.h        |    6 +
>  include/linux/mm.h        |    5 +
>  include/linux/pagemap.h   |    2 
>  include/linux/rmap.h      |   14 ++++
>  include/linux/writeback.h |   23 +++++++
>  mm/filemap.c              |   82 +++++++++++++++++++++++++++
>  mm/rmap.c                 |  139 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/truncate.c             |   98 ++++++++++++++++++++------------
>  10 files changed, 411 insertions(+), 59 deletions(-)
> 
> diff --git a/fs/Makefile b/fs/Makefile
> index 91fcfa3..44d7192 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS)		+= ext4/
>  obj-$(CONFIG_JBD)		+= jbd/
>  obj-$(CONFIG_JBD2)		+= jbd2/
>  obj-$(CONFIG_TUX3)		+= tux3/
> -obj-$(CONFIG_TUX3_MMAP) 	+= tux3/
>  obj-$(CONFIG_CRAMFS)		+= cramfs/
>  obj-$(CONFIG_SQUASHFS)		+= squashfs/
>  obj-y				+= ramfs/
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2d609a5..fcd1c61 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -34,25 +34,6 @@
>   */
>  #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
>  
> -/*
> - * Passed into wb_writeback(), essentially a subset of writeback_control
> - */
> -struct wb_writeback_work {
> -	long nr_pages;
> -	struct super_block *sb;
> -	unsigned long *older_than_this;
> -	enum writeback_sync_modes sync_mode;
> -	unsigned int tagged_writepages:1;
> -	unsigned int for_kupdate:1;
> -	unsigned int range_cyclic:1;
> -	unsigned int for_background:1;
> -	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
> -	enum wb_reason reason;		/* why was writeback initiated? */
> -
> -	struct list_head list;		/* pending work list */
> -	struct completion *done;	/* set if the caller waits */
> -};
> -
>  /**
>   * writeback_in_progress - determine whether there is writeback in progress
>   * @bdi: the device's backing_dev_info structure.
> @@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
>  }
>  
>  /*
> + * Remove inode from writeback list if clean.
> + */
> +void inode_writeback_done(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	spin_lock(&bdi->wb.list_lock);
> +	spin_lock(&inode->i_lock);
> +	if (!(inode->i_state & I_DIRTY))
> +		list_del_init(&inode->i_wb_list);
> +	spin_unlock(&inode->i_lock);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_done);
> +
> +/*
> + * Add inode to writeback dirty list with current time.
> + */
> +void inode_writeback_touch(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	spin_lock(&bdi->wb.list_lock);
> +	inode->dirtied_when = jiffies;
> +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
>   *
>   * Return the number of pages and/or inodes written.
>   */
> -static long writeback_sb_inodes(struct super_block *sb,
> -				struct bdi_writeback *wb,
> -				struct wb_writeback_work *work)
> +static long generic_writeback_sb_inodes(struct super_block *sb,
> +					struct bdi_writeback *wb,
> +					struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> @@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
>  	return wrote;
>  }
>  
> +static long writeback_sb_inodes(struct super_block *sb,
> +				struct bdi_writeback *wb,
> +				struct wb_writeback_work *work)
> +{
> +	if (sb->s_op->writeback) {
> +		long ret;
> +
> +		spin_unlock(&wb->list_lock);
> +		ret = sb->s_op->writeback(sb, wb, work);
> +		spin_lock(&wb->list_lock);
> +		return ret;
> +	}
> +
> +	return generic_writeback_sb_inodes(sb, wb, work);
> +}
> +
>  static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  				  struct wb_writeback_work *work)
>  {
> @@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb)
>  }
>  
>  /**
> + * writeback_queue_work_sb -	schedule writeback work from given super_block
> + * @sb: the superblock
> + * @work: work item to queue
> + *
> + * Schedule writeback work on this super_block. This usually used to
> + * interact with sb->s_op->writeback callback. The caller must
> + * guarantee to @work is not freed while bdi flusher is using (for
> + * example, be safe against umount).
> + */
> +void writeback_queue_work_sb(struct super_block *sb,
> +			     struct wb_writeback_work *work)
> +{
> +	if (sb->s_bdi == &noop_backing_dev_info)
> +		return;
> +
> +	/* Allow only following fields to use. */
> +	*work = (struct wb_writeback_work){
> +		.sb			= sb,
> +		.sync_mode		= work->sync_mode,
> +		.tagged_writepages	= work->tagged_writepages,
> +		.done			= work->done,
> +		.nr_pages		= work->nr_pages,
> +		.reason			= work->reason,
> +	};
> +	bdi_queue_work(sb->s_bdi, work);
> +}
> +EXPORT_SYMBOL(writeback_queue_work_sb);
> +
> +/**
>   * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
>   * @sb: the superblock
>   * @nr: the number of pages to write
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 42efe13..29833d2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -356,6 +356,8 @@ struct address_space_operations {
>  
>  	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
>  	sector_t (*bmap)(struct address_space *, sector_t);
> +	void (*truncatepage)(struct address_space *, struct page *,
> +			     unsigned int, unsigned int, int);
>  	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
>  	int (*releasepage) (struct page *, gfp_t);
>  	void (*freepage)(struct page *);
> @@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
>  extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
>  		unsigned long, loff_t *);
>  
> +struct bdi_writeback;
> +struct wb_writeback_work;
>  struct super_operations {
>     	struct inode *(*alloc_inode)(struct super_block *sb);
>  	void (*destroy_inode)(struct inode *);
> @@ -1599,6 +1603,8 @@ struct super_operations {
>  	int (*drop_inode) (struct inode *);
>  	void (*evict_inode) (struct inode *);
>  	void (*put_super) (struct super_block *);
> +	long (*writeback)(struct super_block *super, struct bdi_writeback *wb,
> +			  struct wb_writeback_work *work);
>  	int (*sync_fs)(struct super_block *sb, int wait);
>  	int (*freeze_super) (struct super_block *);
>  	int (*freeze_fs) (struct super_block *);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dd5ea30..075f59f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1909,6 +1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
>  }
>  
>  /* truncate.c */
> +void generic_truncate_partial_page(struct address_space *mapping,
> +				   struct page *page, unsigned int start,
> +				   unsigned int len);
> +void generic_truncate_full_page(struct address_space *mapping,
> +				struct page *page, int wait);
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>  				       loff_t lstart, loff_t lend);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4b3736f..13b70160 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  extern void delete_from_page_cache(struct page *page);
>  extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
> +int cow_replace_page_cache(struct page *oldpage, struct page *newpage);
> +void cow_delete_from_page_cache(struct page *page);
>  
>  /*
>   * Like add_to_page_cache_locked, but used to add newly allocated pages:
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d9d7e7e..9b67360 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
>  int page_mkclean(struct page *);
>  
>  /*
> + * Make clone page for page forking.
> + *
> + * Note: only clones page state so other state such as buffer_heads
> + * must be cloned by caller.
> + */
> +struct page *cow_clone_page(struct page *oldpage);
> +
> +/*
> + * Changes the PTES of shared mappings except the PTE in orig_vma.
> + */
> +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
> +		  struct page *newpage);
> +
> +/*
>   * called in munlock()/munmap() path to check for other vmas holding
>   * the page mlocked.
>   */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 0004833..0784b9d 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -59,6 +59,25 @@ enum wb_reason {
>  };
>  
>  /*
> + * Passed into wb_writeback(), essentially a subset of writeback_control
> + */
> +struct wb_writeback_work {
> +	long nr_pages;
> +	struct super_block *sb;
> +	unsigned long *older_than_this;
> +	enum writeback_sync_modes sync_mode;
> +	unsigned int tagged_writepages:1;
> +	unsigned int for_kupdate:1;
> +	unsigned int range_cyclic:1;
> +	unsigned int for_background:1;
> +	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
> +	enum wb_reason reason;		/* why was writeback initiated? */
> +
> +	struct list_head list;		/* pending work list */
> +	struct completion *done;	/* set if the caller waits */
> +};
> +
> +/*
>   * A control structure which tells the writeback code what to do.  These are
>   * always on the stack, and hence need no locking.  They are always initialised
>   * in a manner such that unspecified fields are set to zero.
> @@ -90,6 +109,10 @@ struct writeback_control {
>   * fs/fs-writeback.c
>   */	
>  struct bdi_writeback;
> +void inode_writeback_done(struct inode *inode);
> +void inode_writeback_touch(struct inode *inode);
> +void writeback_queue_work_sb(struct super_block *sb,
> +			     struct wb_writeback_work *work);
>  void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
>  void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
>  							enum wb_reason reason);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 673e458..8c641d0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
>  
> +/*
> + * Atomically replace oldpage with newpage.
> + *
> + * Similar to migrate_pages(), but the oldpage is for writeout.
> + */
> +int cow_replace_page_cache(struct page *oldpage, struct page *newpage)
> +{
> +	struct address_space *mapping = oldpage->mapping;
> +	void **pslot;
> +
> +	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> +
> +	/* Get refcount for radix-tree */
> +	page_cache_get(newpage);
> +
> +	/* Replace page in radix tree. */
> +	spin_lock_irq(&mapping->tree_lock);
> +	/* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */
> +	if (PageDirty(oldpage))
> +		radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage),
> +				     PAGECACHE_TAG_DIRTY);
> +	/* The refcount to newpage is used for radix tree. */
> +	pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
> +	radix_tree_replace_slot(pslot, newpage);
> +	__inc_zone_page_state(newpage, NR_FILE_PAGES);
> +	__dec_zone_page_state(oldpage, NR_FILE_PAGES);
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	/* mem_cgroup codes must not be called under tree_lock */
> +	mem_cgroup_migrate(oldpage, newpage, true);
> +
> +	/* Release refcount for radix-tree */
> +	page_cache_release(oldpage);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(cow_replace_page_cache);
> +
> +/*
> + * Delete page from radix-tree, leaving page->mapping unchanged.
> + *
> + * Similar to delete_from_page_cache(), but the deleted page is for writeout.
> + */
> +void cow_delete_from_page_cache(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	/* Delete page from radix tree. */
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * if we're uptodate, flush out into the cleancache, otherwise
> +	 * invalidate any existing cleancache entries.  We can't leave
> +	 * stale data around in the cleancache once our page is gone
> +	 */
> +	if (PageUptodate(page) && PageMappedToDisk(page))
> +		cleancache_put_page(page);
> +	else
> +		cleancache_invalidate_page(mapping, page);
> +
> +	page_cache_tree_delete(mapping, page, NULL);
> +#if 0 /* FIXME: backend is assuming page->mapping is available */
> +	page->mapping = NULL;
> +#endif
> +	/* Leave page->index set: truncation lookup relies upon it */
> +
> +	__dec_zone_page_state(page, NR_FILE_PAGES);
> +	BUG_ON(page_mapped(page));
> +
> +	/*
> +	 * The following dirty accounting is done by the writeback
> +	 * path, so we don't need to do it here.
> +	 *
> +	 * dec_zone_page_state(page, NR_FILE_DIRTY);
> +	 * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +	 */
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	page_cache_release(page);
> +}
> +EXPORT_SYMBOL_GPL(cow_delete_from_page_cache);
> +
>  #ifdef CONFIG_NUMA
>  struct page *__page_cache_alloc(gfp_t gfp)
>  {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 71cd5bd..9125246 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -923,6 +923,145 @@ int page_mkclean(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(page_mkclean);
>  
> +/*
> + * Make clone page for page forking. (Based on migrate_page_copy())
> + *
> + * Note: only clones page state so other state such as buffer_heads
> + * must be cloned by caller.
> + */
> +struct page *cow_clone_page(struct page *oldpage)
> +{
> +	struct address_space *mapping = oldpage->mapping;
> +	gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
> +	struct page *newpage = __page_cache_alloc(gfp_mask);
> +	int cpupid;
> +
> +	newpage->mapping = oldpage->mapping;
> +	newpage->index = oldpage->index;
> +	copy_highpage(newpage, oldpage);
> +
> +	/* FIXME: right? */
> +	BUG_ON(PageSwapCache(oldpage));
> +	BUG_ON(PageSwapBacked(oldpage));
> +	BUG_ON(PageHuge(oldpage));
> +	if (PageError(oldpage))
> +		SetPageError(newpage);
> +	if (PageReferenced(oldpage))
> +		SetPageReferenced(newpage);
> +	if (PageUptodate(oldpage))
> +		SetPageUptodate(newpage);
> +	if (PageActive(oldpage))
> +		SetPageActive(newpage);
> +	if (PageMappedToDisk(oldpage))
> +		SetPageMappedToDisk(newpage);
> +
> +	/*
> +	 * Copy NUMA information to the new page, to prevent over-eager
> +	 * future migrations of this same page.
> +	 */
> +	cpupid = page_cpupid_xchg_last(oldpage, -1);
> +	page_cpupid_xchg_last(newpage, cpupid);
> +
> +	mlock_migrate_page(newpage, oldpage);
> +	ksm_migrate_page(newpage, oldpage);
> +
> +	/* Lock newpage before visible via radix tree */
> +	BUG_ON(PageLocked(newpage));
> +	__set_page_locked(newpage);
> +
> +	return newpage;
> +}
> +EXPORT_SYMBOL_GPL(cow_clone_page);
> +
> +static int page_cow_one(struct page *oldpage, struct page *newpage,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t oldptval, ptval, *pte;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	pte = page_check_address(oldpage, mm, address, &ptl, 1);
> +	if (!pte)
> +		goto out;
> +
> +	flush_cache_page(vma, address, pte_pfn(*pte));
> +	oldptval = ptep_clear_flush(vma, address, pte);
> +
> +	/* Take refcount for PTE */
> +	page_cache_get(newpage);
> +
> +	/*
> +	 * vm_page_prot doesn't have the writable bit, so a write fault
> +	 * will occur again immediately after returning from this page
> +	 * fault. That second fault will be resolved with the forked
> +	 * page that was set here.
> +	 */
> +	ptval = mk_pte(newpage, vma->vm_page_prot);
> +#if 0
> +	/* FIXME: we should check following too? Otherwise, we would
> +	 * get additional read-only => write fault at least */
> +	if (pte_write)
> +		ptval = pte_mkwrite(ptval);
> +	if (pte_dirty(oldptval))
> +		ptval = pte_mkdirty(ptval);
> +	if (pte_young(oldptval))
> +		ptval = pte_mkyoung(ptval);
> +#endif
> +	set_pte_at(mm, address, pte, ptval);
> +
> +	/* Update rmap accounting */
> +	BUG_ON(PageMlocked(oldpage));	/* Caller should migrate mlock flag */
> +	page_remove_rmap(oldpage);
> +	page_add_file_rmap(newpage);
> +
> +	/* no need to invalidate: a not-present page won't be cached */
> +	update_mmu_cache(vma, address, pte);
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	mmu_notifier_invalidate_page(mm, address);
> +
> +	/* Release refcount for PTE */
> +	page_cache_release(oldpage);
> +out:
> +	return ret;
> +}
> +
> +/* Replace oldpage with newpage in all PTEs, excluding orig_vma's */
> +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
> +		  struct page *newpage)
> +{
> +	struct address_space *mapping = page_mapping(oldpage);
> +	pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	struct vm_area_struct *vma;
> +	int ret = 0;
> +
> +	BUG_ON(!PageLocked(oldpage));
> +	BUG_ON(!PageLocked(newpage));
> +	BUG_ON(PageAnon(oldpage));
> +	BUG_ON(mapping == NULL);
> +
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +		/*
> +		 * The orig_vma's PTE is handled by caller.
> +		 * (e.g. ->page_mkwrite)
> +		 */
> +		if (vma == orig_vma)
> +			continue;
> +
> +		if (vma->vm_flags & VM_SHARED) {
> +			unsigned long address = vma_address(oldpage, vma);
> +			ret += page_cow_one(oldpage, newpage, vma, address);
> +		}
> +	}
> +	i_mmap_unlock_read(mapping);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_cow_file);
> +
>  /**
>   * page_move_anon_rmap - move a page to our anon_vma
>   * @page:	the page to move to our anon_vma
> diff --git a/mm/truncate.c b/mm/truncate.c
> index f1e4d60..e5b4673 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page)
>  	return invalidate_complete_page(mapping, page);
>  }
>  
> +void generic_truncate_partial_page(struct address_space *mapping,
> +				   struct page *page, unsigned int start,
> +				   unsigned int len)
> +{
> +	wait_on_page_writeback(page);
> +	zero_user_segment(page, start, start + len);
> +	if (page_has_private(page))
> +		do_invalidatepage(page, start, len);
> +}
> +EXPORT_SYMBOL(generic_truncate_partial_page);
> +
> +static void truncate_partial_page(struct address_space *mapping, pgoff_t index,
> +				  unsigned int start, unsigned int len)
> +{
> +	struct page *page = find_lock_page(mapping, index);
> +	if (!page)
> +		return;
> +
> +	if (!mapping->a_ops->truncatepage)
> +		generic_truncate_partial_page(mapping, page, start, len);
> +	else
> +		mapping->a_ops->truncatepage(mapping, page, start, len, 1);
> +
> +	cleancache_invalidate_page(mapping, page);
> +	unlock_page(page);
> +	page_cache_release(page);
> +}
> +
> +void generic_truncate_full_page(struct address_space *mapping,
> +				struct page *page, int wait)
> +{
> +	if (wait)
> +		wait_on_page_writeback(page);
> +	else if (PageWriteback(page))
> +		return;
> +
> +	truncate_inode_page(mapping, page);
> +}
> +EXPORT_SYMBOL(generic_truncate_full_page);
> +
> +static void truncate_full_page(struct address_space *mapping, struct page *page,
> +			       int wait)
> +{
> +	if (!mapping->a_ops->truncatepage)
> +		generic_truncate_full_page(mapping, page, wait);
> +	else
> +		mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE,
> +					     wait);
> +}
> +
>  /**
>   * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
>   * @mapping: mapping to truncate
> @@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> -			if (PageWriteback(page)) {
> -				unlock_page(page);
> -				continue;
> -			}
> -			truncate_inode_page(mapping, page);
> +			truncate_full_page(mapping, page, 0);
>  			unlock_page(page);
>  		}
>  		pagevec_remove_exceptionals(&pvec);
> @@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	}
>  
>  	if (partial_start) {
> -		struct page *page = find_lock_page(mapping, start - 1);
> -		if (page) {
> -			unsigned int top = PAGE_CACHE_SIZE;
> -			if (start > end) {
> -				/* Truncation within a single page */
> -				top = partial_end;
> -				partial_end = 0;
> -			}
> -			wait_on_page_writeback(page);
> -			zero_user_segment(page, partial_start, top);
> -			cleancache_invalidate_page(mapping, page);
> -			if (page_has_private(page))
> -				do_invalidatepage(page, partial_start,
> -						  top - partial_start);
> -			unlock_page(page);
> -			page_cache_release(page);
> -		}
> -	}
> -	if (partial_end) {
> -		struct page *page = find_lock_page(mapping, end);
> -		if (page) {
> -			wait_on_page_writeback(page);
> -			zero_user_segment(page, 0, partial_end);
> -			cleancache_invalidate_page(mapping, page);
> -			if (page_has_private(page))
> -				do_invalidatepage(page, 0,
> -						  partial_end);
> -			unlock_page(page);
> -			page_cache_release(page);
> +		unsigned int top = PAGE_CACHE_SIZE;
> +		if (start > end) {
> +			/* Truncation within a single page */
> +			top = partial_end;
> +			partial_end = 0;
>  		}
> +		truncate_partial_page(mapping, start - 1, partial_start,
> +				      top - partial_start);
>  	}
> +	if (partial_end)
> +		truncate_partial_page(mapping, end, 0, partial_end);
> +
>  	/*
>  	 * If the truncation happened within a single page no pages
>  	 * will be released, just zeroed, so we can bail out now.
> @@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  
>  			lock_page(page);
>  			WARN_ON(page->index != index);
> -			wait_on_page_writeback(page);
> -			truncate_inode_page(mapping, page);
> +			truncate_full_page(mapping, page, 1);
>  			unlock_page(page);
>  		}
>  		pagevec_remove_exceptionals(&pvec);
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
- * Re: [FYI] tux3: Core changes
  2015-05-19 14:00   ` [FYI] tux3: Core changes Jan Kara
@ 2015-05-19 19:18     ` Daniel Phillips
  2015-05-19 20:33       ` David Lang
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-19 19:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, tux3, OGAWA Hirofumi, Rik van Riel,
	OGAWA Hirofumi
Hi Jan,
On 05/19/2015 07:00 AM, Jan Kara wrote:
> On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage, Filesystem and MM Summit in San Francisco.[1]
>>
>>    LSFMM: Page forking
>>    http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
>   So here are a few things I find problematic about page forking (besides
> the cases with elevated page_count already discussed in this thread - there
> I believe that anything more complex than "wait for the IO instead of
> forking when page has elevated use count" isn't going to work. There are
> too many users depending on too subtle details of the behavior...). Some
> of them are actually mentioned in the above LWN article:
> 
> When you create a copy of a page and replace it in the radix tree, nobody
> in mm subsystem is aware that oldpage may be under writeback. That causes
> interesting issues:
> * truncate_inode_pages() can finish before all IO for the file is finished.
>   So far filesystems rely on the fact that once truncate_inode_pages()
>   finishes all running IO against the file is completed and new cannot be
>   submitted.
We do not use truncate_inode_pages because of issues like that. We use
some truncate helpers, which were available in some cases, or else had
to be implemented in Tux3 to make everything work properly. The details
are Hirofumi's stomping grounds. I am pretty sure that his solution is
good and tight, or Tux3 would not pass its torture tests.
> * Writeback can come and try to write newpage while oldpage is still under
>   IO. Then you'll have two IOs against one block which has undefined
>   results.
Those writebacks only come from Tux3 (or indirectly from fs-writeback,
through our writeback) so we are able to ensure that a dirty block is
only written once. (If redirtied, the block will fork so two dirty
blocks are written in two successive deltas.)
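A minimal userspace sketch of the "fork on redirty" rule described above, as a toy model only. It is not Tux3 code: struct fork_block, struct fork_slot and fork_block_dirty() are hypothetical names, and the single dirty flag plus delta number stand in for whatever per-buffer state the real implementation keeps. The point it illustrates is that a copy dirtied in an earlier delta is handed to the backend unchanged, while the front end continues on a clone.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FORK_BLOCK_SIZE 16

/* Hypothetical stand-in for a cached block; not a kernel struct page. */
struct fork_block {
	unsigned delta;			/* delta in which this copy was dirtied */
	int dirty;			/* set until the backend writes this copy */
	char data[FORK_BLOCK_SIZE];
};

/* One cache slot: the front end always sees *current. */
struct fork_slot {
	struct fork_block *current;
};

/*
 * Prepare the block in @slot for modification in front-end delta @delta.
 *
 * If the current copy is still dirty from an earlier delta, it belongs to
 * the backend, so it is forked: the old copy is returned through @forked
 * for writeout and the slot is pointed at a fresh clone. Redirtying within
 * the same delta reuses the current copy, so each copy is written once.
 *
 * Returns the copy the front end may modify, or NULL if allocation fails.
 */
static struct fork_block *fork_block_dirty(struct fork_slot *slot,
					   unsigned delta,
					   struct fork_block **forked)
{
	struct fork_block *old = slot->current;

	*forked = NULL;
	if (old->dirty && old->delta != delta) {
		struct fork_block *clone = malloc(sizeof(*clone));

		if (!clone)
			return NULL;
		memcpy(clone->data, old->data, FORK_BLOCK_SIZE);
		slot->current = clone;
		*forked = old;
	}
	slot->current->dirty = 1;
	slot->current->delta = delta;
	return slot->current;
}
```

In this model a caller would queue the copy returned through the forked pointer for the older delta's writeout and free it once that IO completes; the clone is written later with its own delta.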
> * filemap_fdatawait() called from fsync() has additional problem that it is
>   not aware of oldpage and thus may return although IO hasn't finished yet.
We do not use filemap_fdatawait, instead, we wait on completion of our
own writeback, which is under our control.
> I understand that Tux3 may avoid these issues due to some other mechanisms
> it internally has but if page forking should get into mm subsystem, the
> above must work.
It does work, and by example, it does not need a lot of code to make
it work, but the changes are not trivial. Tux3's delta writeback model
will not suit everyone, so you can't just lift our code and add it to
Ext4. Using it in Ext4 would require a per-inode writeback model, which
looks practical to me but far from a weekend project. Maybe something
to consider for Ext5.
It is the job of new designs like Tux3 to chase after that final drop
of performance, not our trusty Ext4 workhorse. Though stranger things
have happened - as I recall, Ext4 had O(n) directory operations at one
time. Fixing that was not easy, but we did it because we had to. Fixing
Ext4's write performance is not urgent by comparison, and the barrier
is high, you would want jbd3 for one thing.
I think the meta-question you are asking is, where is the second user
for this new CoW functionality? With a possible implication that if
there is no second user then Tux3 cannot be merged. Is that the
question?
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-19 19:18     ` Daniel Phillips
@ 2015-05-19 20:33       ` David Lang
  2015-05-20 14:44         ` Jan Kara
  0 siblings, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-19 20:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Tue, 19 May 2015, Daniel Phillips wrote:
>> I understand that Tux3 may avoid these issues due to some other mechanisms
>> it internally has but if page forking should get into mm subsystem, the
>> above must work.
>
> It does work, and by example, it does not need a lot of code to make
> it work, but the changes are not trivial. Tux3's delta writeback model
> will not suit everyone, so you can't just lift our code and add it to
> Ext4. Using it in Ext4 would require a per-inode writeback model, which
> looks practical to me but far from a weekend project. Maybe something
> to consider for Ext5.
>
> It is the job of new designs like Tux3 to chase after that final drop
> of performance, not our trusty Ext4 workhorse. Though stranger things
> have happened - as I recall, Ext4 had O(n) directory operations at one
> time. Fixing that was not easy, but we did it because we had to. Fixing
> Ext4's write performance is not urgent by comparison, and the barrier
> is high, you would want jbd3 for one thing.
>
> I think the meta-question you are asking is, where is the second user
> for this new CoW functionality? With a possible implication that if
> there is no second user then Tux3 cannot be merged. Is that the
> question?
I don't think they are asking for a second user. What they are saying is that 
for this functionality to be accepted in the mm subsystem, these problem cases 
need to work reliably, not just work for Tux3 because of your implementation.
So for things that you don't use, you need to make it an error if they get used 
on a page that's been forked (or not be an error and 'do the right thing')
For cases where it doesn't matter because Tux3 controls the writeback, and it's 
undefined in general what happens if writeback is triggered twice on the same 
page, you will need to figure out how to either prevent the second writeback 
from triggering if there's one in process, or define how the two writebacks are 
going to happen so that you can't end up with them re-ordered by some other 
filesystem.
I think that that's what's meant by the top statement that I left in the quote. 
Even if your implementation details make it safe, these need to be safe even 
without your implementation details to be acceptable in the core kernel.
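As a rough illustration of the two options above (reject use on a forked page, and refuse a second writeback while one is in flight), here is a toy userspace sketch. None of it is kernel API: struct wb_page, wb_start() and wb_end() are hypothetical names, and a real implementation would need page flags and locking rather than plain ints.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical page state; the kernel would use page flags instead. */
struct wb_page {
	int forked;	/* page was replaced in the cache by a clone */
	int writeback;	/* a writeback is currently in flight */
};

/*
 * Try to start writeback on @page.
 *
 * Returns -EINVAL if a generic path is used on a forked page, -EBUSY if
 * a writeback is already in flight, and 0 once writeback is started.
 */
static int wb_start(struct wb_page *page)
{
	if (page->forked)
		return -EINVAL;
	if (page->writeback)
		return -EBUSY;
	page->writeback = 1;
	return 0;
}

/* Mark the in-flight writeback on @page as finished. */
static void wb_end(struct wb_page *page)
{
	assert(page->writeback);
	page->writeback = 0;
}
```

The alternative mentioned above, defining an ordering for two concurrent writebacks, is not covered by this sketch.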
David Lang
- * Re: [FYI] tux3: Core changes
  2015-05-19 20:33       ` David Lang
@ 2015-05-20 14:44         ` Jan Kara
  2015-05-20 16:22           ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-05-20 14:44 UTC (permalink / raw)
  To: David Lang
  Cc: Daniel Phillips, Jan Kara, linux-kernel, linux-fsdevel, tux3,
	OGAWA Hirofumi, Rik van Riel
On Tue 19-05-15 13:33:31, David Lang wrote:
> On Tue, 19 May 2015, Daniel Phillips wrote:
> 
> >>I understand that Tux3 may avoid these issues due to some other mechanisms
> >>it internally has but if page forking should get into mm subsystem, the
> >>above must work.
> >
> >It does work, and by example, it does not need a lot of code to make
> >it work, but the changes are not trivial. Tux3's delta writeback model
> >will not suit everyone, so you can't just lift our code and add it to
> >Ext4. Using it in Ext4 would require a per-inode writeback model, which
> >looks practical to me but far from a weekend project. Maybe something
> >to consider for Ext5.
> >
> >It is the job of new designs like Tux3 to chase after that final drop
> >of performance, not our trusty Ext4 workhorse. Though stranger things
> >have happened - as I recall, Ext4 had O(n) directory operations at one
> >time. Fixing that was not easy, but we did it because we had to. Fixing
> >Ext4's write performance is not urgent by comparison, and the barrier
> >is high, you would want jbd3 for one thing.
> >
> >I think the meta-question you are asking is, where is the second user
> >for this new CoW functionality? With a possible implication that if
> >there is no second user then Tux3 cannot be merged. Is that the
> >question?
> 
> I don't think they are asking for a second user. What they are
> saying is that for this functionality to be accepted in the mm
> subsystem, these problem cases need to work reliably, not just work
> for Tux3 because of your implementation.
> 
> So for things that you don't use, you need to make it an error if
> they get used on a page that's been forked (or not be an error and
> 'do the right thing')
> 
> For cases where it doesn't matter because Tux3 controls the
> writeback, and it's undefined in general what happens if writeback
> is triggered twice on the same page, you will need to figure out how
> to either prevent the second writeback from triggering if there's
> one in process, or define how the two writebacks are going to happen
> so that you can't end up with them re-ordered by some other
> filesystem.
> 
> I think that that's what's meant by the top statement that I left in
> the quote. Even if your implementation details make it safe, these
> need to be safe even without your implementation details to be
> acceptable in the core kernel.
  Yeah, that's what I meant. If you create a function which manipulates
page cache, you better make it work with other functions manipulating page
cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
developer. Sure you can document all the conditions under which the
function is safe to use but a function that has several paragraphs in front
of it explaining when it is safe to use isn't very good API...
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
- * Re: [FYI] tux3: Core changes
  2015-05-20 14:44         ` Jan Kara
@ 2015-05-20 16:22           ` Daniel Phillips
  2015-05-20 18:01             ` David Lang
  2015-05-20 19:53             ` Rik van Riel
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-20 16:22 UTC (permalink / raw)
  To: Jan Kara, David Lang
  Cc: Rik van Riel, tux3, linux-kernel, linux-fsdevel, OGAWA Hirofumi
On 05/20/2015 07:44 AM, Jan Kara wrote:
> On Tue 19-05-15 13:33:31, David Lang wrote:
>> On Tue, 19 May 2015, Daniel Phillips wrote:
>>
>>>> I understand that Tux3 may avoid these issues due to some other mechanisms
>>>> it internally has but if page forking should get into mm subsystem, the
>>>> above must work.
>>>
>>> It does work, and by example, it does not need a lot of code to make
>>> it work, but the changes are not trivial. Tux3's delta writeback model
>>> will not suit everyone, so you can't just lift our code and add it to
>>> Ext4. Using it in Ext4 would require a per-inode writeback model, which
>>> looks practical to me but far from a weekend project. Maybe something
>>> to consider for Ext5.
>>>
>>> It is the job of new designs like Tux3 to chase after that final drop
>>> of performance, not our trusty Ext4 workhorse. Though stranger things
>>> have happened - as I recall, Ext4 had O(n) directory operations at one
>>> time. Fixing that was not easy, but we did it because we had to. Fixing
>>> Ext4's write performance is not urgent by comparison, and the barrier
>>> is high, you would want jbd3 for one thing.
>>>
>>> I think the meta-question you are asking is, where is the second user
>>> for this new CoW functionality? With a possible implication that if
>>> there is no second user then Tux3 cannot be merged. Is that the
>>> question?
>>
>> I don't think they are asking for a second user. What they are
>> saying is that for this functionality to be accepted in the mm
>> subsystem, these problem cases need to work reliably, not just work
>> for Tux3 because of your implementation.
>>
>> So for things that you don't use, you need to make it an error if
>> they get used on a page that's been forked (or not be an error and
>> 'do the right thing')
>>
>> For cases where it doesn't matter because Tux3 controls the
>> writeback, and it's undefined in general what happens if writeback
>> is triggered twice on the same page, you will need to figure out how
>> to either prevent the second writeback from triggering if there's
>> one in process, or define how the two writebacks are going to happen
>> so that you can't end up with them re-ordered by some other
>> filesystem.
>>
>> I think that that's what's meant by the top statement that I left in
>> the quote. Even if your implementation details make it safe, these
>> need to be safe even without your implementation details to be
>> acceptable in the core kernel.
>   Yeah, that's what I meant. If you create a function which manipulates
> page cache, you better make it work with other functions manipulating page
> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
> developer. Sure you can document all the conditions under which the
> function is safe to use but a function that has several paragraphs in front
> of it explaining when it is safe to use isn't very good API...
Violent agreement, of course. To put it in concrete terms, each of
the page fork support functions must be examined and determined
sane. They are:
 * cow_replace_page_cache
 * cow_delete_from_page_cache
 * cow_clone_page
 * page_cow_one
 * page_cow_file
Would it be useful to drill down into those, starting from the top
of the list?
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-20 16:22           ` Daniel Phillips
@ 2015-05-20 18:01             ` David Lang
  2015-05-20 19:53             ` Rik van Riel
  1 sibling, 0 replies; 160+ messages in thread
From: David Lang @ 2015-05-20 18:01 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Wed, 20 May 2015, Daniel Phillips wrote:
> On 05/20/2015 07:44 AM, Jan Kara wrote:
>>   Yeah, that's what I meant. If you create a function which manipulates
>> page cache, you better make it work with other functions manipulating page
>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>> developer. Sure you can document all the conditions under which the
>> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't very good API...
>
> Violent agreement, of course. To put it in concrete terms, each of
> the page fork support functions must be examined and determined
> sane. They are:
>
> * cow_replace_page_cache
> * cow_delete_from_page_cache
> * cow_clone_page
> * page_cow_one
> * page_cow_file
>
> Would it be useful to drill down into those, starting from the top
> of the list?
It's a little more than determining that these 5 functions are sane. It's making
sure that if someone mixes the use of these functions with other existing
functions, the result is sane.
But it's probably a good starting point to look at each of these five functions
in detail and consider how they work and how they could interact badly with
other things touching the page cache.
David Lang
- * Re: [FYI] tux3: Core changes
  2015-05-20 16:22           ` Daniel Phillips
  2015-05-20 18:01             ` David Lang
@ 2015-05-20 19:53             ` Rik van Riel
  2015-05-20 22:51               ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Rik van Riel @ 2015-05-20 19:53 UTC (permalink / raw)
  To: Daniel Phillips, Jan Kara, David Lang
  Cc: linux-fsdevel, tux3, linux-kernel, OGAWA Hirofumi
On 05/20/2015 12:22 PM, Daniel Phillips wrote:
> On 05/20/2015 07:44 AM, Jan Kara wrote:
>> On Tue 19-05-15 13:33:31, David Lang wrote:
>>   Yeah, that's what I meant. If you create a function which manipulates
>> page cache, you better make it work with other functions manipulating page
>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>> developer. Sure you can document all the conditions under which the
>> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't very good API...
> 
> Violent agreement, of course. To put it in concrete terms, each of
> the page fork support functions must be examined and determined
> sane. They are:
> 
>  * cow_replace_page_cache
>  * cow_delete_from_page_cache
>  * cow_clone_page
>  * page_cow_one
>  * page_cow_file
> 
> Would it be useful to drill down into those, starting from the top
> of the list?
How do these interact with other page cache functions, like
find_get_page() ?
How does tux3 prevent a user of find_get_page() from reading from
or writing into the pre-COW page, instead of the current page?
-- 
All rights reversed
- * Re: [FYI] tux3: Core changes
  2015-05-20 19:53             ` Rik van Riel
@ 2015-05-20 22:51               ` Daniel Phillips
  2015-05-21  3:24                 ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-20 22:51 UTC (permalink / raw)
  To: Rik van Riel, Jan Kara, David Lang
  Cc: linux-fsdevel, tux3, linux-kernel, OGAWA Hirofumi
On 05/20/2015 12:53 PM, Rik van Riel wrote:
> On 05/20/2015 12:22 PM, Daniel Phillips wrote:
>> On 05/20/2015 07:44 AM, Jan Kara wrote:
>>> On Tue 19-05-15 13:33:31, David Lang wrote:
> 
>>>   Yeah, that's what I meant. If you create a function which manipulates
>>> page cache, you better make it work with other functions manipulating page
>>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>>> developer. Sure you can document all the conditions under which the
>>> function is safe to use but a function that has several paragraphs in front
>>> of it explaining when it is safe to use isn't very good API...
>>
>> Violent agreement, of course. To put it in concrete terms, each of
>> the page fork support functions must be examined and determined
>> sane. They are:
>>
>>  * cow_replace_page_cache
>>  * cow_delete_from_page_cache
>>  * cow_clone_page
>>  * page_cow_one
>>  * page_cow_file
>>
>> Would it be useful to drill down into those, starting from the top
>> of the list?
> 
> How do these interact with other page cache functions, like
> find_get_page() ?
Nicely:
   https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/filemap_mmap.c#L182
> How does tux3 prevent a user of find_get_page() from reading from
> or writing into the pre-COW page, instead of the current page?
Careful control of the dirty bits (we have two of them, one each
for front and back). That is what pagefork_for_blockdirty is about.
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-20 22:51               ` Daniel Phillips
@ 2015-05-21  3:24                 ` Daniel Phillips
  2015-05-21  3:51                   ` David Lang
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-21  3:24 UTC (permalink / raw)
  To: Rik van Riel, Jan Kara, David Lang
  Cc: linux-fsdevel, tux3, linux-kernel, OGAWA Hirofumi
On 05/20/2015 03:51 PM, Daniel Phillips wrote:
> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>> How does tux3 prevent a user of find_get_page() from reading from
>> or writing into the pre-COW page, instead of the current page?
> 
> Careful control of the dirty bits (we have two of them, one each
> for front and back). That is what pagefork_for_blockdirty is about.
Ah, and of course it does not matter if a reader is on the
pre-cow page. It would be reading the earlier copy, which might
no longer be the current copy, but it raced with the write so
nobody should be surprised. That is a race even without page fork.
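The situation under discussion can be modelled in a few lines of userspace C. This is a toy sketch, not the page cache: struct toy_page, struct toy_cache, toy_find_get() and toy_replace() are hypothetical names that loosely mirror a reference-taking lookup and the slot swap done by cow_replace_page_cache(). It shows a holder of the pre-fork page keeping a valid but older copy, while new lookups get the clone.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical page: a refcount and some contents. */
struct toy_page {
	int refcount;
	char data[16];
};

/* Hypothetical one-entry cache standing in for the radix tree. */
struct toy_cache {
	struct toy_page *slot;
};

/* Look up the cached page and take a reference on it. */
static struct toy_page *toy_find_get(struct toy_cache *cache)
{
	struct toy_page *page = cache->slot;

	if (page)
		page->refcount++;
	return page;
}

/*
 * Fork: copy the old contents into @newpage, point the cache slot at it,
 * and move the cache's reference from the old page to the new one.
 * References held by earlier lookups on the old page stay valid.
 */
static void toy_replace(struct toy_cache *cache, struct toy_page *newpage)
{
	struct toy_page *oldpage = cache->slot;

	memcpy(newpage->data, oldpage->data, sizeof(newpage->data));
	newpage->refcount++;
	cache->slot = newpage;
	oldpage->refcount--;
}
```

In this model, a reader that calls toy_find_get() before toy_replace() keeps seeing the old contents until it drops its reference and looks the page up again. A writer in the same position would modify a copy that is no longer in the cache, which is the case the questions in this thread are probing.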
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-21  3:24                 ` Daniel Phillips
@ 2015-05-21  3:51                   ` David Lang
  2015-05-21 19:53                     ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-21  3:51 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, linux-fsdevel, tux3, linux-kernel,
	OGAWA Hirofumi
On Wed, 20 May 2015, Daniel Phillips wrote:
> On 05/20/2015 03:51 PM, Daniel Phillips wrote:
>> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>>> How does tux3 prevent a user of find_get_page() from reading from
>>> or writing into the pre-COW page, instead of the current page?
>>
>> Careful control of the dirty bits (we have two of them, one each
>> for front and back). That is what pagefork_for_blockdirty is about.
>
> Ah, and of course it does not matter if a reader is on the
> pre-cow page. It would be reading the earlier copy, which might
> no longer be the current copy, but it raced with the write so
> nobody should be surprised. That is a race even without page fork.
how do you prevent it from continuing to interact with the old version of the
page and never see updates or have its changes reflected on the current page?
David Lang
- * Re: [FYI] tux3: Core changes
  2015-05-21  3:51                   ` David Lang
@ 2015-05-21 19:53                     ` Daniel Phillips
  2015-05-26  4:25                       ` Rik van Riel
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-21 19:53 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
> how do you prevent it from continuing to interact with the old 
> version of the page and never see updates or have its changes
> reflected on the current page?
Why would it do that, and what would be surprising about it? Did
you have a specific case in mind?
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-21 19:53                     ` Daniel Phillips
@ 2015-05-26  4:25                       ` Rik van Riel
  2015-05-26  4:30                         ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: Rik van Riel @ 2015-05-26  4:25 UTC (permalink / raw)
  To: Daniel Phillips, David Lang
  Cc: linux-fsdevel, tux3, Jan Kara, linux-kernel, OGAWA Hirofumi
On 05/21/2015 03:53 PM, Daniel Phillips wrote:
> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>> how do you prevent it from continuing to interact with the old version
>> of the page and never see updates or have its changes reflected on
>> the current page?
> 
> Why would it do that, and what would be surprising about it? Did
> you have a specific case in mind?
After a get_page(), page_cache_get(), or other equivalent
function, a piece of code has the expectation that it can
continue using that page until after it has released the
reference count.
This can be an arbitrarily long period of time.
-- 
All rights reversed
- * Re: [FYI] tux3: Core changes
  2015-05-26  4:25                       ` Rik van Riel
@ 2015-05-26  4:30                         ` Daniel Phillips
  2015-05-26  6:04                           ` David Lang
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26  4:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Lang, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:
> On 05/21/2015 03:53 PM, Daniel Phillips wrote:
>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>>> how do you prevent it from continuing to interact with the old version
>>> of the page and never see updates or have its changes reflected on
>>> the current page?
>> 
>> Why would it do that, and what would be surprising about it? Did
>> you have a specific case in mind?
>
> After a get_page(), page_cache_get(), or other equivalent
> function, a piece of code has the expectation that it can
> continue using that page until after it has released the
> reference count.
>
> This can be an arbitrarily long period of time.
It is perfectly welcome to keep using that page as long as it
wants; Tux3 does not care. When it lets go of the last reference
(and Tux3 has finished with it) then the page is freeable. Did
you have a more specific example where this would be an issue?
Are you talking about kernel or userspace code?
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-26  4:30                         ` Daniel Phillips
@ 2015-05-26  6:04                           ` David Lang
  2015-05-26  6:11                             ` Daniel Phillips
  0 siblings, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-26  6:04 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Mon, 25 May 2015, Daniel Phillips wrote:
> On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:
>> On 05/21/2015 03:53 PM, Daniel Phillips wrote:
>>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>>>> how do you prevent it from continuing to interact with the old version
>>>> of the page and never see updates or have its changes reflected on
>>>> the current page?
>>> 
>>> Why would it do that, and what would be surprising about it? Did
>>> you have a specific case in mind?
>> 
>> After a get_page(), page_cache_get(), or other equivalent
>> function, a piece of code has the expectation that it can
>> continue using that page until after it has released the
>> reference count.
>> 
>> This can be an arbitrarily long period of time.
>
> It is perfectly welcome to keep using that page as long as it
> wants, Tux3 does not care. When it lets go of the last reference
> (and Tux3 has finished with it) then the page is freeable. Did
> you have a more specific example where this would be an issue?
> Are you talking about kernel or userspace code?
if the page gets modified again, will that cause any issues? what if the page 
gets modified before the copy gets written out, so that there are two dirty 
copies of the page in the process of being written?
David Lang
- * Re: [FYI] tux3: Core changes
  2015-05-26  6:04                           ` David Lang
@ 2015-05-26  6:11                             ` Daniel Phillips
  2015-05-26  6:13                               ` David Lang
  2015-05-26  7:09                               ` Jan Kara
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26  6:11 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, linux-fsdevel, tux3, linux-kernel,
	OGAWA Hirofumi
On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
> if the page gets modified again, will that cause any issues? 
> what if the page gets modified before the copy gets written out, 
> so that there are two dirty copies of the page in the process of 
> being written?
>
> David Lang
How is the page going to get modified again? A forked page isn't
mapped by a pte, so userspace can't modify it by mmap. The forked
page is not in the page cache, so userspace can't modify it by
posix file ops. So the writer would have to be in kernel. Tux3
knows what it is doing, so it won't modify the page. What kernel
code besides Tux3 will modify the page?
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-26  6:11                             ` Daniel Phillips
@ 2015-05-26  6:13                               ` David Lang
  2015-05-26  8:09                                 ` Daniel Phillips
  2015-05-26  7:09                               ` Jan Kara
  1 sibling, 1 reply; 160+ messages in thread
From: David Lang @ 2015-05-26  6:13 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Mon, 25 May 2015, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
>> if the page gets modified again, will that cause any issues? what if the 
>> page gets modified before the copy gets written out, so that there are two 
>> dirty copies of the page in the process of being written?
>> 
>> David Lang
>
> How is the page going to get modified again? A forked page isn't
> mapped by a pte, so userspace can't modify it by mmap. The forked
> page is not in the page cache, so userspace can't modify it by
> posix file ops. So the writer would have to be in kernel. Tux3
> knows what it is doing, so it won't modify the page. What kernel
> code besides Tux3 will modify the page?
I'm assuming that Rik is talking about whatever has the reference to the page 
via one of the methods that he talked about.
David Lang
- * Re: [FYI] tux3: Core changes
  2015-05-26  6:13                               ` David Lang
@ 2015-05-26  8:09                                 ` Daniel Phillips
  2015-05-26 10:13                                   ` Pavel Machek
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26  8:09 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
> I'm assuming that Rik is talking about whatever has the 
> reference to the page via one of the methods that he talked 
> about.
This would be a good moment to provide specifics.
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-26  8:09                                 ` Daniel Phillips
@ 2015-05-26 10:13                                   ` Pavel Machek
  0 siblings, 0 replies; 160+ messages in thread
From: Pavel Machek @ 2015-05-26 10:13 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
On Tue 2015-05-26 01:09:59, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
> >I'm assuming that Rik is talking about whatever has the reference to the
> >page via one of the methods that he talked about.
> 
> This would be a good moment to provide specifics.
Hmm. This seems like a good moment for you to audit the whole kernel, to
make sure it does not do stuff you don't expect it to.
You are changing core semantics, stuff that was allowed before is not
allowed now, so it looks like you should do the auditing...
You may want to start with video4linux, as Jan pointed out.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
- * Re: [FYI] tux3: Core changes
  2015-05-26  6:11                             ` Daniel Phillips
  2015-05-26  6:13                               ` David Lang
@ 2015-05-26  7:09                               ` Jan Kara
  2015-05-26  8:08                                 ` Daniel Phillips
  1 sibling, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-05-26  7:09 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Rik van Riel, Jan Kara, linux-fsdevel, tux3,
	linux-kernel, OGAWA Hirofumi
On Mon 25-05-15 23:11:11, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
> >if the page gets modified again, will that cause any issues? what
> >if the page gets modified before the copy gets written out, so
> >that there are two dirty copies of the page in the process of
> >being written?
> >
> >David Lang
> 
> How is the page going to get modified again? A forked page isn't
> mapped by a pte, so userspace can't modify it by mmap. The forked
> page is not in the page cache, so userspace can't modify it by
> posix file ops. So the writer would have to be in kernel. Tux3
> knows what it is doing, so it won't modify the page. What kernel
> code besides Tux3 will modify the page?
  E.g. video drivers (or infiniband or direct IO for that matter) which
have buffers in user memory (may be mmapped file), grab references to pages
and hand out PFNs of those pages to the hardware to store data in them...
If you fork a page after the driver has handed PFNs to the hardware, you've
just lost all the writes hardware will do.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
- * Re: [FYI] tux3: Core changes
  2015-05-26  7:09                               ` Jan Kara
@ 2015-05-26  8:08                                 ` Daniel Phillips
  2015-05-26  9:00                                   ` Jan Kara
  2015-05-26 10:22                                   ` Sergey Senozhatsky
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26  8:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Lang, Rik van Riel, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
>   E.g. video drivers (or infiniband or direct IO for that matter) which
> have buffers in user memory (may be mmapped file), grab references to pages
> and hand out PFNs of those pages to the hardware to store data in them...
> If you fork a page after the driver has handed PFNs to the hardware, you've
> just lost all the writes hardware will do.
Hi Jan,
The page forked because somebody wrote to it with write(2) or mmap write at
the same time as a video driver (or infiniband or direct IO) was doing io to
it. Isn't the application trying hard to lose data in that case? It would
not need page fork to lose data that way.
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-26  8:08                                 ` Daniel Phillips
@ 2015-05-26  9:00                                   ` Jan Kara
  2015-05-26 20:22                                     ` Daniel Phillips
  2015-05-26 10:22                                   ` Sergey Senozhatsky
  1 sibling, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-05-26  9:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >  E.g. video drivers (or infiniband or direct IO for that matter) which
> >have buffers in user memory (may be mmapped file), grab references to pages
> >and hand out PFNs of those pages to the hardware to store data in them...
> >If you fork a page after the driver has handed PFNs to the hardware, you've
> >just lost all the writes hardware will do.
> 
> Hi Jan,
> 
> The page forked because somebody wrote to it with write(2) or mmap write at
> the same time as a video driver (or infiniband or direct IO) was
> doing io to
> it. Isn't the application trying hard to lose data in that case? It
> would not need page fork to lose data that way.
So I can think of two valid uses:
1) You set up IO to part of a page and modify from userspace a different
   part of a page.
2) At least for video drivers there is one ioctl() which creates object
   with buffers in memory and another ioctl() to actually ship it to hardware
   (may be called repeatedly). So in theory app could validly dirty the pages
   before it ships them to hardware. If this happens repeatedly and interacts
   badly with background writeback, you will end up with a forked page in a
   buffer and from that point on things are broken.
So my opinion is: Don't fork the page if page_count is elevated. You can
just wait for the IO if you need stable pages in that case. It's slow but
it's safe and it should be pretty rare. Is there any problem with that?
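A minimal userspace sketch of that rule, under stated assumptions. The names
`struct policy_page`, `policy_has_extra_refs()` and `policy_choose_action()`
are hypothetical and are not taken from Tux3 or the kernel. The "expected"
reference count (one for the page cache plus one per mapping pte) is a
simplification of what real kernel code would have to account for.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; not the kernel's definition. */
struct policy_page {
    int refcount;    /* all references, including the page cache's own */
    int mapcount;    /* number of ptes mapping the page */
    bool in_flight;  /* contents must stay stable (being written out) */
};

enum policy_action { WRITE_IN_PLACE, FORK_PAGE, WAIT_FOR_IO };

/*
 * Assumption: the only references we can explain are one for the page
 * cache and one per pte. Anything above that belongs to an unknown
 * holder (for example a driver that handed the PFN to hardware).
 */
static bool policy_has_extra_refs(const struct policy_page *p)
{
    assert(p->refcount >= 1 && p->mapcount >= 0);
    return p->refcount > 1 + p->mapcount;
}

/* Decide what a write to the page should do. */
static enum policy_action policy_choose_action(const struct policy_page *p)
{
    if (!p->in_flight)
        return WRITE_IN_PLACE;   /* nothing to protect */
    if (policy_has_extra_refs(p))
        return WAIT_FOR_IO;      /* slow but safe: do not fork */
    return FORK_PAGE;            /* no unknown holder of the old page */
}
```

In this model the fork path is taken only when nobody outside the page cache
and the page tables can still reach the old page. Everything else falls back
to waiting for the IO to finish.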
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
- * Re: [FYI] tux3: Core changes
  2015-05-26  9:00                                   ` Jan Kara
@ 2015-05-26 20:22                                     ` Daniel Phillips
  2015-05-26 21:36                                       ` Rik van Riel
  2015-05-27  8:41                                       ` Jan Kara
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26 20:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Lang, Rik van Riel, tux3, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi
On 05/26/2015 02:00 AM, Jan Kara wrote:
> On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
>> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
>>>  E.g. video drivers (or infiniband or direct IO for that matter) which
>>> have buffers in user memory (may be mmapped file), grab references to pages
>>> and hand out PFNs of those pages to the hardware to store data in them...
>>> If you fork a page after the driver has handed PFNs to the hardware, you've
>>> just lost all the writes hardware will do.
>>
>> Hi Jan,
>>
>> The page forked because somebody wrote to it with write(2) or mmap write at
>> the same time as a video driver (or infiniband or direct IO) was
>> doing io to
>> it. Isn't the application trying hard to lose data in that case? It
>> would not need page fork to lose data that way.
> 
> So I can think of two valid uses:
> 
> 1) You setup IO to part of a page and modify from userspace a different
>    part of a page.
Suppose the use case is reading textures from video memory into a mmapped
file, and at the same time, the application is allowed to update the
textures in the file via mmap or write(2). Fork happens at mkwrite time.
If the page is already dirty, we do not fork it. The video API must have
made the page writable and dirty, so I do not see an issue.
> 2) At least for video drivers there is one ioctl() which creates object
>    with buffers in memory and another ioctl() to actually ship it to hardware
>    (may be called repeatedly). So in theory app could validly dirty the pages
>    before it ships them to hardware. If this happens repeatedly and interacts
>    badly with background writeback, you will end up with a forked page in a
>    buffer and from that point on things are broken.
Writeback does not fork pages. An app may dirty a page that is in the process
of being shipped to hardware (must be a distinct part of the page, or it is
a race) and the data being sent to hardware will not be disturbed. If there
is an issue here, I do not see it.
> So my opinion is: Don't fork the page if page_count is elevated. You can
> just wait for the IO if you need stable pages in that case. It's slow but
> it's safe and it should be pretty rare. Is there any problem with that?
That would be our fallback if anybody discovers a specific case where page
fork breaks something, which so far has not been demonstrated.
With a known fallback, it is hard to see why we should delay merging over
that. Perfection has never been a requirement for merging filesystems. On
the contrary, imperfection is a reason for merging, so that the many
eyeballs effect may prove its value.
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-26 20:22                                     ` Daniel Phillips
@ 2015-05-26 21:36                                       ` Rik van Riel
  2015-05-26 21:49                                         ` Daniel Phillips
  2015-05-27  8:41                                       ` Jan Kara
  1 sibling, 1 reply; 160+ messages in thread
From: Rik van Riel @ 2015-05-26 21:36 UTC (permalink / raw)
  To: Daniel Phillips, Jan Kara
  Cc: David Lang, tux3, linux-kernel, linux-fsdevel, OGAWA Hirofumi
On 05/26/2015 04:22 PM, Daniel Phillips wrote:
> On 05/26/2015 02:00 AM, Jan Kara wrote:
>> So my opinion is: Don't fork the page if page_count is elevated. You can
>> just wait for the IO if you need stable pages in that case. It's slow but
>> it's safe and it should be pretty rare. Is there any problem with that?
> 
> That would be our fallback if anybody discovers a specific case where page
> fork breaks something, which so far has not been demonstrated.
> 
> With a known fallback, it is hard to see why we should delay merging over
> that. Perfection has never been a requirement for merging filesystems. On
However, avoiding data corruption by erring on the side of safety is
a pretty basic requirement.
> the contrary, imperfection is a reason for merging, so that the many
> eyeballs effect may prove its value.
If you skip the page fork when there is an elevated page count, tux3
should be safe (at least from that aspect). Only do the COW when there
is no "strange" use of the page going on.
-- 
All rights reversed
- * Re: [FYI] tux3: Core changes
  2015-05-26 21:36                                       ` Rik van Riel
@ 2015-05-26 21:49                                         ` Daniel Phillips
  0 siblings, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26 21:49 UTC (permalink / raw)
  To: Rik van Riel, Jan Kara
  Cc: David Lang, tux3, linux-fsdevel, linux-kernel, OGAWA Hirofumi
On 05/26/2015 02:36 PM, Rik van Riel wrote:
> On 05/26/2015 04:22 PM, Daniel Phillips wrote:
>> On 05/26/2015 02:00 AM, Jan Kara wrote:
>>> So my opinion is: Don't fork the page if page_count is elevated. You can
>>> just wait for the IO if you need stable pages in that case. It's slow but
>>> it's safe and it should be pretty rare. Is there any problem with that?
>>
>> That would be our fallback if anybody discovers a specific case where page
>> fork breaks something, which so far has not been demonstrated.
>>
>> With a known fallback, it is hard to see why we should delay merging over
>> that. Perfection has never been a requirement for merging filesystems. On
> 
> However, avoiding data corruption by erring on the side of safety is
> a pretty basic requirement.
Erring on the side of safety is still an error. As a community we have
never been fond of adding code or overhead to fix theoretical bugs. I
do not see why we should relax that principle now.
We can fix actual bugs, but theoretical bugs are only shapeless specters
passing in the night. We should not become frozen in fear of them.
>> the contrary, imperfection is a reason for merging, so that the many
>> eyeballs effect may prove its value.
> 
> If you skip the page fork when there is an elevated page count, tux3
> should be safe (at least from that aspect). Only do the COW when there
> is no "strange" use of the page going on.
Then you break the I in ACID. There must be a compelling reason to do
that.
Regards,
Daniel
- * Re: [FYI] tux3: Core changes
  2015-05-26 20:22                                     ` Daniel Phillips
  2015-05-26 21:36                                       ` Rik van Riel
@ 2015-05-27  8:41                                       ` Jan Kara
  2015-06-21 15:36                                         ` OGAWA Hirofumi
  1 sibling, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-05-27  8:41 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jan Kara, David Lang, Rik van Riel, tux3, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
On Tue 26-05-15 13:22:38, Daniel Phillips wrote:
> On 05/26/2015 02:00 AM, Jan Kara wrote:
> > On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
> >> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >>>  E.g. video drivers (or infiniband or direct IO for that matter) which
> >>> have buffers in user memory (may be mmapped file), grab references to pages
> >>> and hand out PFNs of those pages to the hardware to store data in them...
> >>> If you fork a page after the driver has handed PFNs to the hardware, you've
> >>> just lost all the writes hardware will do.
> >>
> >> Hi Jan,
> >>
> >> The page forked because somebody wrote to it with write(2) or mmap write at
> >> the same time as a video driver (or infiniband or direct IO) was
> >> doing io to
> >> it. Isn't the application trying hard to lose data in that case? It
> >> would not need page fork to lose data that way.
> > 
> > So I can think of two valid uses:
> > 
> > 1) You setup IO to part of a page and modify from userspace a different
> >    part of a page.
> 
> Suppose the use case is reading textures from video memory into a mmapped
> file, and at the same time, the application is allowed to update the
> textures in the file via mmap or write(2). Fork happens at mkwrite time.
> If the page is already dirty, we do not fork it. The video API must have
> made the page writable and dirty, so I do not see an issue.
So there are a few things to keep in mind:
1) There is no such thing as a "writeable" page. A page is always writeable
(at least on the x86 architecture). When a page is mapped into one or more
virtual address spaces, each *mapping* can be either writeable or read-only.
mkwrite changes the mapping from read-only to writeable, but the kernel and
hardware are free to write to the page regardless of the mapping.
2) When the kernel or hardware writes to the page, it first modifies the page
and then marks it dirty.
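Point 1 has a rough userspace analogue, sketched here with standard POSIX
calls (tmpfile, ftruncate, mmap). The same file page is mapped twice, once
read-only and once writeable, and a store through the writeable mapping shows
up through the read-only one. This is only an analogy: it shows that
protection belongs to the mapping rather than to the page, and it does not
reproduce kernel or DMA writes. `toy_two_mappings()` is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one file page twice and write 'X' through the writeable mapping.
 * Returns the byte read back through the read-only mapping, or -1 if
 * any setup step fails.
 */
static int toy_two_mappings(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    char *ro, *rw;
    int seen = -1;

    if (!f)
        return -1;
    if (psz <= 0 || ftruncate(fileno(f), psz) != 0)
        goto out_close;

    ro = mmap(NULL, (size_t)psz, PROT_READ, MAP_SHARED, fileno(f), 0);
    if (ro == MAP_FAILED)
        goto out_close;
    rw = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_SHARED,
              fileno(f), 0);
    if (rw == MAP_FAILED)
        goto out_unmap_ro;

    rw[0] = 'X';    /* store through the writeable mapping */
    seen = ro[0];   /* load through the read-only mapping */

    munmap(rw, (size_t)psz);
out_unmap_ro:
    munmap(ro, (size_t)psz);
out_close:
    fclose(f);
    return seen;
}
```

The read-only mapping never takes a write fault, yet the contents it exposes
change. Any check that runs only when a read-only mapping is made writeable
would not see this store.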
 
So what can happen in this scenario is:
1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
   page is dirtied, kernel notes a PFN of the page somewhere internally.
2) Writeback comes and starts writeback for the page.
3) Kernel ships the PFN to the hardware.
4) Userspace comes and wants to write to the page (different part than the
   HW is instructed to use). page_mkwrite is called, page is forked.
   Userspace writes to the forked page.
5) HW stores its data in the original page.
Userspace never sees the data from the HW! Data is corrupted, whereas without
page forking everything would work just fine.
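The steps above can be replayed in a small userspace model. This is a sketch
under stated assumptions, not Tux3 code: `struct sim_page`, `struct sim_file`,
`sim_fork()` and `sim_replay()` are hypothetical names, a pointer stands in
for the PFN, and writeback (step 2) is not modelled because only the fork at
step 4 matters for the outcome.

```c
#include <assert.h>
#include <string.h>

#define SIM_PAGE_SIZE 8

struct sim_page { char data[SIM_PAGE_SIZE]; };

struct sim_file {
    struct sim_page *cached;  /* what the page cache and pte point at */
};

/* Fork: copy the current page into `copy` and make the copy current. */
static void sim_fork(struct sim_file *f, struct sim_page *copy)
{
    memcpy(copy->data, f->cached->data, SIM_PAGE_SIZE);
    f->cached = copy;
}

/*
 * Replay the scenario. Returns the byte userspace reads at the offset
 * the hardware wrote to. `do_fork` selects forking or in-place write.
 */
static char sim_replay(int do_fork)
{
    struct sim_page original, copy;
    struct sim_file file = { .cached = &original };
    struct sim_page *hw_target;

    memset(original.data, '.', SIM_PAGE_SIZE);
    hw_target = file.cached;         /* steps 1 and 3: "PFN" given to HW */
    if (do_fork)
        sim_fork(&file, &copy);      /* step 4: fork at page_mkwrite */
    file.cached->data[0] = 'U';      /* step 4: userspace writes byte 0 */
    hw_target->data[4] = 'H';        /* step 5: HW writes byte 4 */
    return file.cached->data[4];     /* what userspace sees afterwards */
}
```

Here `sim_replay(0)` yields the hardware's byte, while `sim_replay(1)` yields
the stale filler byte, because the hardware is still writing to the page that
was current before the fork.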
Another possible scenario:
1) Userspace app tells kernel to set up a HW buffer in a page.
2) Userspace app fills page with data -> page_mkwrite is called, page is
   dirtied.
3) Userspace app tells kernel to ship buffer to video HW.
4) Writeback comes and starts writeback for the page
5) Video HW is done with the page. Userspace app fills new set of data into
   the page -> page_mkwrite is called, page is forked.
6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
   old data from the original page.
Again a data corruption issue where previously things were working fine.
> > 2) At least for video drivers there is one ioctl() which creates object
> >    with buffers in memory and another ioctl() to actually ship it to hardware
> >    (may be called repeatedly). So in theory app could validly dirty the pages
> >    before it ships them to hardware. If this happens repeatedly and interacts
> >    badly with background writeback, you will end up with a forked page in a
> >    buffer and from that point on things are broken.
> 
> Writeback does not fork pages. An app may dirty a page that is in process
> of being shipped to hardware (must be a distinct part of the page, or it is
> a race) and the data being sent to hardware will not be disturbed. If there
> is an issue here, I do not see it.
> 
> > So my opinion is: Don't fork the page if page_count is elevated. You can
> > just wait for the IO if you need stable pages in that case. It's slow but
> > it's safe and it should be pretty rare. Is there any problem with that?
> 
> That would be our fallback if anybody discovers a specific case where page
> fork breaks something, which so far has not been demonstrated.
> 
> With a known fallback, it is hard to see why we should delay merging over
> that. Perfection has never been a requirement for merging filesystems. On
> the contrary, imperfection is a reason for merging, so that the many
> eyeballs effect may prove its value.
Sorry, but you've got several people telling you they are concerned about
the safety of your approach. That is the many eyeballs effect. And data
corruption issues aren't problems you can just wave away with "let's wait
whether it really happens".
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
- * Re: [FYI] tux3: Core changes
  2015-05-27  8:41                                       ` Jan Kara
@ 2015-06-21 15:36                                         ` OGAWA Hirofumi
  2015-06-23 16:12                                           ` Jan Kara
  0 siblings, 1 reply; 160+ messages in thread
From: OGAWA Hirofumi @ 2015-06-21 15:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Lang, Rik van Riel, tux3, linux-kernel, linux-fsdevel,
	Daniel Phillips
Jan Kara <jack@suse.cz> writes:
Hi,
> So there are a few things to have in mind:
> 1) There is nothing like a "writeable" page. Page is always writeable (at
> least on x86 architecture). When a page is mapped into some virtual address
> space (or more of them), this *mapping* can be either writeable or read-only.
> mkwrite changes the mapping from read-only to writeable but kernel /
> hardware is free to write to the page regardless of the mapping.
>
> 2) When kernel / hardware writes to the page, it first modifies the page
> and then marks it dirty.
>  
> So what can happen in this scenario is:
>
> 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
>    page is dirtied, kernel notes a PFN of the page somewhere internally.
>
> 2) Writeback comes and starts writeback for the page.
>
> 3) Kernel ships the PFN to the hardware.
>
> 4) Userspace comes and wants to write to the page (different part than the
>    HW is instructed to use). page_mkwrite is called, page is forked.
>    Userspace writes to the forked page.
>
> 5) HW stores its data in the original page.
>
> Userspace never sees data from the HW! Data corrupted where without page
> forking everything would work just fine.
I'm not sure I understand your pseudocode logic correctly, though.
This logic doesn't seem to be an issue specific to page forking, and
the pseudocode seems to be missing the locking and revalidation of the
page.
If you can show more details, it would help us discuss the page forking
issue, or think about how to handle the corner cases.
Well, before that, why are more details needed?
For example, replace the page fork at (4) with "truncate", "punch
hole", or "invalidate page".
Those operations remove the old page from the radix tree, so the
userspace write creates a new page while the HW still references the
old page.  (I.e. the situation should be the same as with page forking,
as I understand this pseudocode logic.)
IOW, this pseudocode logic seems to be broken even without page forking
if there is no locking and revalidation.  Usually, we prevent unpleasant
I/O with lock_page or PG_writeback, and an obsolete page is revalidated
under lock_page.
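A minimal single-threaded sketch of the lock-and-revalidate pattern, under
stated assumptions. All `rv_*` names are hypothetical, the "radix tree" is a
single slot, and the lock is a plain flag rather than a real page lock. The
check itself mirrors the familiar idiom of comparing the page's mapping
pointer after taking the page lock.

```c
#include <assert.h>
#include <stddef.h>

struct rv_mapping;

struct rv_page {
    struct rv_mapping *mapping;  /* NULL once removed from the cache */
    int locked;
};

struct rv_mapping {
    struct rv_page *slot;        /* one-entry stand-in for the radix tree */
};

static void rv_lock(struct rv_page *p)   { assert(!p->locked); p->locked = 1; }
static void rv_unlock(struct rv_page *p) { assert(p->locked);  p->locked = 0; }

/* Truncate, hole punch, invalidate or fork: detach old, install new. */
static void rv_replace(struct rv_mapping *m, struct rv_page *fresh)
{
    if (m->slot)
        m->slot->mapping = NULL;
    fresh->mapping = m;
    m->slot = fresh;
}

/*
 * Starting from a possibly stale pointer, return a locked page that is
 * still attached to the mapping, or NULL if the mapping is empty.
 */
static struct rv_page *rv_lock_current(struct rv_mapping *m,
                                       struct rv_page *maybe_stale)
{
    struct rv_page *p = maybe_stale;

    while (p) {
        rv_lock(p);
        if (p->mapping == m)
            return p;            /* still attached: safe to use */
        rv_unlock(p);
        p = m->slot;             /* stale: look the page up again */
    }
    return NULL;
}
```

A holder that goes through `rv_lock_current()` before each use ends up on the
replacement page after a truncate-like or fork-like event. A holder that keeps
using its saved pointer without this step stays on the detached page.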
For page forking, we may also be able to prevent a similar situation with
locking, flags, and revalidation. But the details might differ from the
current code, because the page states are different.
> Another possible scenario:
>
> 1) Userspace app tells kernel to setup a HW buffer in a page.
>
> 2) Userspace app fills page with data -> page_mkwrite is called, page is
>    dirtied.
>
> 3) Userspace app tells kernel to ship buffer to video HW.
>
> 4) Writeback comes and starts writeback for the page
>
> 5) Video HW is done with the page. Userspace app fills new set of data into
>    the page -> page_mkwrite is called, page is forked.
>
> 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
>    old data from the original page.
>
> Again a data corruption issue where previously things were working fine.
This logic seems to be the same as above. Replace the page fork at (5) in
the same way: without revalidation of the page, (6) will use the old page.
Thanks.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
- * Re: [FYI] tux3: Core changes
  2015-06-21 15:36                                         ` OGAWA Hirofumi
@ 2015-06-23 16:12                                           ` Jan Kara
  2015-07-05 12:54                                             ` OGAWA Hirofumi
  0 siblings, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-06-23 16:12 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: Jan Kara, Daniel Phillips, David Lang, Rik van Riel, tux3,
	linux-kernel, linux-fsdevel
On Mon 22-06-15 00:36:00, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> > So there are a few things to have in mind:
> > 1) There is nothing like a "writeable" page. Page is always writeable (at
> > least on x86 architecture). When a page is mapped into some virtual address
> > space (or more of them), this *mapping* can be either writeable or read-only.
> > mkwrite changes the mapping from read-only to writeable but kernel /
> > hardware is free to write to the page regardless of the mapping.
> >
> > 2) When kernel / hardware writes to the page, it first modifies the page
> > and then marks it dirty.
> >  
> > So what can happen in this scenario is:
> >
> > 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
> >    page is dirtied, kernel notes a PFN of the page somewhere internally.
> >
> > 2) Writeback comes and starts writeback for the page.
> >
> > 3) Kernel ships the PFN to the hardware.
> >
> > 4) Userspace comes and wants to write to the page (different part than the
> >    HW is instructed to use). page_mkwrite is called, page is forked.
> >    Userspace writes to the forked page.
> >
> > 5) HW stores its data in the original page.
> >
> > Userspace never sees data from the HW! Data corrupted where without page
> > forking everything would work just fine.
> 
> I'm not sure I'm understanding your pseudocode logic correctly though.
> This logic doesn't seem to be a page forking specific issue.  And
> this pseudocode logic seems to be missing the locking and revalidate of
> page.
> 
> If you can show more details, it would be helpful to see more, and
> discuss the issue of page forking, or we can think about how to handle
> the corner cases.
> 
> Well, before that, why need more details?
> 
> For example, replace the page fork at (4) with "truncate", "punch
> hole", or "invalidate page".
> 
> Those operations remove the old page from radix tree, so the
> userspace's write creates the new page, and HW still references the
> old page.  (I.e. situation should be same with page forking, in my
> understanding of this pseudocode logic.)
Yes, if userspace truncates the file, the situation we end up with is
basically the same. However, for truncate to happen some malicious process
has to come and truncate the file - a failure scenario that is acceptable
for most use cases since it doesn't happen unless someone is actively
trying to screw you. With page forking it is enough for the flusher thread
to start writeback for that page to trigger the problem - an event that is
basically bound to happen without any other userspace application
interfering.
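The first scenario quoted above can be replayed in a toy Python model. This is only a sketch under simplifying assumptions: Page, run() and the fork_on_write flag are hypothetical names, a dict stands in for the page cache, and the non-forking branch simply lets the store hit the same page. It is not Tux3 or kernel code.

```python
class Page:
    """Toy page: eight bytes of data plus a writeback flag."""

    def __init__(self, data=None):
        self.data = bytearray(data if data is not None else 8)
        self.writeback = False


def run(fork_on_write):
    mapping = {0: Page()}           # toy page cache: index -> page
    pinned = mapping[0]             # 1) the kernel notes the page for the HW
    mapping[0].writeback = True     # 2) the flusher starts writeback
    # 3) the kernel ships the page to the hardware (it holds `pinned`)
    # 4) userspace writes to a different part of the page
    page = mapping[0]
    if page.writeback and fork_on_write:
        page = Page(page.data)      # fork: copy, then replace in the mapping
        mapping[0] = page
    page.data[0] = 0xAA             # the userspace store
    # 5) the hardware stores its data through the reference it was given
    pinned.data[4] = 0xBB
    # What userspace sees when it reads through the mapping afterwards
    return mapping[0].data[0], mapping[0].data[4]


no_fork = run(fork_on_write=False)
forked = run(fork_on_write=True)
print(no_fork, forked)
```

In the forking run the hardware's byte lands in the old page, so a read through the mapping returns 0 at that offset. Nothing other than writeback starting was needed to get there.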
> IOW, this pseudocode logic seems to be broken without page forking if
> no lock and revalidate.  Usually, we prevent unpleasant I/O by
> lock_page or PG_writeback, and an obsolete page is revalidated under
> lock_page.
Well, good luck with converting all the get_user_pages() users in kernel to
use lock_page() or PG_writeback checks to avoid issues with page forking. I
don't think that's really feasible.
 
> For page forking, we may also be able to prevent similar situation by
> locking, flags, and revalidate. But those details might be different
> with current code, because page states are different.
Sorry, I don't understand what you mean in this paragraph. Can you
explain it a bit more?
 
> > Another possible scenario:
> >
> > 1) Userspace app tells kernel to setup a HW buffer in a page.
> >
> > 2) Userspace app fills page with data -> page_mkwrite is called, page is
> >    dirtied.
> >
> > 3) Userspace app tells kernel to ship buffer to video HW.
> >
> > 4) Writeback comes and starts writeback for the page
> >
> > 5) Video HW is done with the page. Userspace app fills new set of data into
> >    the page -> page_mkwrite is called, page is forked.
> >
> > 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
> >    old data from the original page.
> >
> > Again a data corruption issue where previously things were working fine.
> 
> This logic seems to be same as above. Replace the page fork at (5).
> With no revalidate of page, (6) will use the old page.
  Yes, the same arguments as above apply...
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-06-23 16:12                                           ` Jan Kara
@ 2015-07-05 12:54                                             ` OGAWA Hirofumi
  2015-07-09 16:05                                               ` Jan Kara
  0 siblings, 1 reply; 160+ messages in thread
From: OGAWA Hirofumi @ 2015-07-05 12:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Lang, Rik van Riel, tux3, linux-kernel, linux-fsdevel,
	Daniel Phillips
Jan Kara <jack@suse.cz> writes:
>> I'm not sure I'm understanding your pseudocode logic correctly though.
>> This logic doesn't seem to be a page forking specific issue.  And
>> this pseudocode logic seems to be missing the locking and revalidate of
>> page.
>> 
>> If you can show more details, it would be helpful to see more, and
>> discuss the issue of page forking, or we can think about how to handle
>> the corner cases.
>> 
>> Well, before that, why need more details?
>> 
>> For example, replace the page fork at (4) with "truncate", "punch
>> hole", or "invalidate page".
>> 
>> Those operations remove the old page from radix tree, so the
>> userspace's write creates the new page, and HW still references the
>> old page.  (I.e. situation should be same with page forking, in my
>> understanding of this pseudocode logic.)
>
> Yes, if userspace truncates the file, the situation we end up with is
> basically the same. However for truncate to happen some malicious process
> has to come and truncate the file - a failure scenario that is acceptable
> for most use cases since it doesn't happen unless someone is actively
> trying to screw you. With page forking it is enough for flusher thread
> to start writeback for that page to trigger the problem - event that is
> basically bound to happen without any other userspace application
> interfering.
Where does the conclusion that this is acceptable come from? That
pseudocode logic doesn't say anything about usage. And even if we assume
it is acceptable, as far as I can see, /proc/sys/vm/drop_caches for
example is enough to trigger it, or a page on a non-existent block (a
sparse file, i.e. the disk space check is missing in your logic). And if
there really is no lock/check at all, there would be other races.
>> IOW, this pseudocode logic seems to be broken without page forking if
>> no lock and revalidate.  Usually, we prevent unpleasant I/O by
>> lock_page or PG_writeback, and an obsolete page is revalidated under
>> lock_page.
>
> Well, good luck with converting all the get_user_pages() users in kernel to
> use lock_page() or PG_writeback checks to avoid issues with page forking. I
> don't think that's really feasible.
What does converting all the get_user_pages() users mean? Well, maybe
you are more or less right; I also think there is an issue in/around
get_user_pages() that we have to tackle.
IMO, if there actually is code that follows that pseudocode logic, that
code is broken. And "it is acceptable, it is a limitation, so give up on
fixing it" is not, I think, the right way to go. If there really is code
broken like your logic, I think we should fix it.
Could you point to the code that uses your logic? Since it seems so
racy, I can't yet believe that such racy code actually exists.
>> For page forking, we may also be able to prevent similar situation by
>> locking, flags, and revalidate. But those details might be different
>> with current code, because page states are different.
>
> Sorry, I don't understand what do you mean in this paragraph. Can you
> explain it a bit more?
This just means a forked page (old page) and a truncated page have
different sets of flags and state, so we may have to adjust revalidation.
Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-05 12:54                                             ` OGAWA Hirofumi
@ 2015-07-09 16:05                                               ` Jan Kara
  2015-07-31  4:44                                                 ` OGAWA Hirofumi
  0 siblings, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-07-09 16:05 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, Daniel Phillips
On Sun 05-07-15 21:54:45, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >> I'm not sure I'm understanding your pseudocode logic correctly though.
> >> This logic doesn't seem to be a page forking specific issue.  And
> >> this pseudocode logic seems to be missing the locking and revalidate of
> >> page.
> >> 
> >> If you can show more details, it would be helpful to see more, and
> >> discuss the issue of page forking, or we can think about how to handle
> >> the corner cases.
> >> 
> >> Well, before that, why need more details?
> >> 
> >> For example, replace the page fork at (4) with "truncate", "punch
> >> hole", or "invalidate page".
> >> 
> >> Those operations remove the old page from radix tree, so the
> >> userspace's write creates the new page, and HW still references the
> >> old page.  (I.e. situation should be same with page forking, in my
> >> understanding of this pseudocode logic.)
> >
> > Yes, if userspace truncates the file, the situation we end up with is
> > basically the same. However for truncate to happen some malicious process
> > has to come and truncate the file - a failure scenario that is acceptable
> > for most use cases since it doesn't happen unless someone is actively
> > trying to screw you. With page forking it is enough for flusher thread
> > to start writeback for that page to trigger the problem - event that is
> > basically bound to happen without any other userspace application
> > interfering.
> 
> Acceptable conclusion is where came from? That pseudocode logic doesn't
> say about usage at all. And even if assume it is acceptable, as far as I
> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
> page on non-exists block (sparse file. i.e. missing disk space check in
> your logic). And if really no any lock/check, there would be another
> races.
So drop_caches won't cause any issues because it avoids mmaped pages.
Also page reclaim or page migration don't cause any issues because
they avoid pages with increased refcount (and increased refcount would stop
drop_caches from reclaiming the page as well if it was not for the mmaped
check before). Generally, elevated page refcount currently guarantees page
isn't migrated, reclaimed, or otherwise detached from the mapping (except
for truncate where the combination of mapping-index becomes invalid) and
your page forking would change that assumption - which IMHO has a big
potential for some breakage somewhere. And frankly I fail to see why you
and Daniel care so much about this corner case because from performance POV
it's IMHO a non-issue and you bother with page forking because of
performance, don't you?
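The refcount argument above can be sketched in a toy Python model. Page, pin(), unpin() and try_detach() are hypothetical names; the checks the real kernel performs in reclaim, migration and drop_caches are considerably more involved than this single condition, so treat it only as an illustration of the assumption being described.

```python
class Page:
    """Toy page with the two counters the argument relies on."""

    def __init__(self):
        self.refcount = 1      # the page cache's own reference
        self.mapcount = 0      # number of user mappings (mmap)


def pin(page):
    """Model a get_user_pages()-style caller taking an extra reference."""
    page.refcount += 1


def unpin(page):
    page.refcount -= 1


def try_detach(cache, index):
    """Model reclaim/migration: refuse mapped or extra-referenced pages."""
    page = cache.get(index)
    if page is None or page.mapcount > 0 or page.refcount > 1:
        return False
    del cache[index]
    return True


cache = {0: Page()}
page = cache[0]

pin(page)                               # a driver holds the page
while_pinned = try_detach(cache, 0)     # refused: refcount is elevated

unpin(page)                             # the driver is done
after_unpin = try_detach(cache, 0)      # allowed now

print(while_pinned, after_unpin)
```

A fork that replaces the page in the mapping without a comparable check is what would change the assumption in this model.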
> >> IOW, this pseudocode logic seems to be broken without page forking if
> >> no lock and revalidate.  Usually, we prevent unpleasant I/O by
> >> lock_page or PG_writeback, and an obsolete page is revalidated under
> >> lock_page.
> >
> > Well, good luck with converting all the get_user_pages() users in kernel to
> > use lock_page() or PG_writeback checks to avoid issues with page forking. I
> > don't think that's really feasible.
> 
> What does all get_user_pages() conversion mean? Well, maybe right more
> or less, I also think there is the issue in/around get_user_pages() that
> we have to tackle.
> 
> 
> IMO, if there is a code that pseudocode logic actually, it is the
> breakage. And "it is acceptable and limitation, and give up to fix", I
> don't think it is the right way to go. If there is really code broken
> like your logic, I think we should fix.
> 
> Could you point which code is using your logic? Since that seems to be
> so racy, I can't believe yet there are that racy codes actually.
So you can have a look for example at
drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
a video device buffer at a virtual address specified by the user. Now I don't
know whether there really is any userspace video program that sets up the
video buffer in an mmaped file. I would agree with you that it would be a
strange thing to do but I've seen enough strange userspace code that I
would not be too surprised.
Another example of a similar kind is at
drivers/infiniband/core/umem.c where we again set up a buffer for infiniband
cards at a user-specified virtual address. And there are more drivers in
the kernel like that.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-09 16:05                                               ` Jan Kara
@ 2015-07-31  4:44                                                 ` OGAWA Hirofumi
  2015-07-31 15:37                                                   ` Raymond Jennings
  2015-08-03 13:42                                                   ` Jan Kara
  0 siblings, 2 replies; 160+ messages in thread
From: OGAWA Hirofumi @ 2015-07-31  4:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Daniel Phillips, David Lang, Rik van Riel, tux3, linux-kernel,
	linux-fsdevel
Jan Kara <jack@suse.cz> writes:
>> > Yes, if userspace truncates the file, the situation we end up with is
>> > basically the same. However for truncate to happen some malicious process
>> > has to come and truncate the file - a failure scenario that is acceptable
>> > for most use cases since it doesn't happen unless someone is actively
>> > trying to screw you. With page forking it is enough for flusher thread
>> > to start writeback for that page to trigger the problem - event that is
>> > basically bound to happen without any other userspace application
>> > interfering.
>> 
>> Acceptable conclusion is where came from? That pseudocode logic doesn't
>> say about usage at all. And even if assume it is acceptable, as far as I
>> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
>> page on non-exists block (sparse file. i.e. missing disk space check in
>> your logic). And if really no any lock/check, there would be another
>> races.
>
> So drop_caches won't cause any issues because it avoids mmaped pages.
> Also page reclaim or page migration don't cause any issues because
> they avoid pages with increased refcount (and increased refcount would stop
> drop_caches from reclaiming the page as well if it was not for the mmaped
> check before). Generally, elevated page refcount currently guarantees page
> isn't migrated, reclaimed, or otherwise detached from the mapping (except
> for truncate where the combination of mapping-index becomes invalid) and
> your page forking would change that assumption - which IMHO has a big
> potential for some breakage somewhere.
Lifetime and visibility from the user are different topics.  The issue
here is visibility. Of course, the two are more or less related, but a
refcount doesn't stop a page from being dropped from the radix tree at all.
Well, anyway, your claim seems to assume that the userspace app works
around the issues. And it sounds like that still doesn't work around the
ENOSPC issue (validation at page fault/GUP) even if we assume userspace
behaves perfectly. Calling that a kernel assumption is strange.
If your claim were that this strange logic is already widely used and
so we can't simply break it for compatibility reasons, I would be able
to agree. But your claim sounds like that logic is sane, well-designed
behavior. So I disagree.
> And frankly I fail to see why you and Daniel care so much about this
> corner case because from performance POV it's IMHO a non-issue and you
> bother with page forking because of performance, don't you?
We should first try to penalize the corner case path instead of the
normal path. Penalizing the normal path to allow the corner case path is
basically insane.
Making the normal path faster and more reliable is what we are trying to do.
> So you can have a look for example at
> drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
> of a video device buffer at virtual address specified by user. Now I don't
> know whether there really is any userspace video program that sets up the
> video buffer in mmaped file. I would agree with you that it would be a
> strange thing to do but I've seen enough strange userspace code that I
> would not be too surprised.
>
> Another example of similar kind is at
> drivers/infiniband/core/umem.c where we again set up buffer for infiniband
> cards at users specified virtual address. And there are more drivers in
> kernel like that.
Unfortunately, I haven't looked at those yet. I guess they would be
helpful for seeing the details.
Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31  4:44                                                 ` OGAWA Hirofumi
@ 2015-07-31 15:37                                                   ` Raymond Jennings
  2015-07-31 17:27                                                     ` Daniel Phillips
  2015-08-03 13:42                                                   ` Jan Kara
  1 sibling, 1 reply; 160+ messages in thread
From: Raymond Jennings @ 2015-07-31 15:37 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: David Lang, Rik van Riel, Jan Kara, tux3,
	Linux Kernel Mailing List, Daniel Phillips, linux-fsdevel
Returning ENOSPC when you have free space you can't yet prove is safer than
not returning it and risking a data loss when you get hit by a write/commit
storm. :)
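A minimal sketch of that conservative policy, as a toy Python model: space is reserved for the worst case up front, and ENOSPC is returned whenever the reservation cannot be proven safe, even if the eventual usage would have fit. SpaceAccounting, reserve() and commit() are hypothetical names and do not describe how any particular filesystem accounts for space.

```python
import errno


class SpaceAccounting:
    """Toy block accounting with worst-case reservations."""

    def __init__(self, free_blocks):
        self.free = free_blocks
        self.reserved = 0

    def reserve(self, worst_case_blocks):
        """Fail early unless the worst case is guaranteed to fit."""
        if self.reserved + worst_case_blocks > self.free:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.reserved += worst_case_blocks

    def commit(self, worst_case_blocks, used_blocks):
        """Release a reservation and charge what was really used."""
        assert used_blocks <= worst_case_blocks
        self.reserved -= worst_case_blocks
        self.free -= used_blocks


acct = SpaceAccounting(free_blocks=10)
acct.reserve(6)                     # first writer: worst case 6 blocks

try:
    acct.reserve(6)                 # second writer cannot be proven safe
    second = "ok"
except OSError as e:
    second = errno.errorcode[e.errno]

acct.commit(6, 3)                   # first writer only needed 3 blocks
acct.reserve(6)                     # now the same request fits
print(second, acct.free, acct.reserved)
```

The second request is refused although both writers together might have needed only six blocks; that early failure is the price paid for never discovering the shortage at commit time.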
On Thu, Jul 30, 2015 at 9:44 PM, OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> wrote:
> Jan Kara <jack@suse.cz> writes:
>
> >> > Yes, if userspace truncates the file, the situation we end up with is
> >> > basically the same. However for truncate to happen some malicious process
> >> > has to come and truncate the file - a failure scenario that is acceptable
> >> > for most use cases since it doesn't happen unless someone is actively
> >> > trying to screw you. With page forking it is enough for flusher thread
> >> > to start writeback for that page to trigger the problem - event that is
> >> > basically bound to happen without any other userspace application
> >> > interfering.
> >>
> >> Acceptable conclusion is where came from? That pseudocode logic doesn't
> >> say about usage at all. And even if assume it is acceptable, as far as I
> >> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
> >> page on non-exists block (sparse file. i.e. missing disk space check in
> >> your logic). And if really no any lock/check, there would be another
> >> races.
> >
> > So drop_caches won't cause any issues because it avoids mmaped pages.
> > Also page reclaim or page migration don't cause any issues because
> > they avoid pages with increased refcount (and increased refcount would stop
> > drop_caches from reclaiming the page as well if it was not for the mmaped
> > check before). Generally, elevated page refcount currently guarantees page
> > isn't migrated, reclaimed, or otherwise detached from the mapping (except
> > for truncate where the combination of mapping-index becomes invalid) and
> > your page forking would change that assumption - which IMHO has a big
> > potential for some breakage somewhere.
>
> Lifetime and visibility from user are different topic.  The issue here
> is visibility. Of course, those has relation more or less though,
> refcount doesn't stop to drop page from radix-tree at all.
>
> Well, anyway, your claim seems to be assuming the userspace app
> workarounds the issues. And it sounds like still not workarounds the
> ENOSPC issue (validate at page fault/GUP) even if assuming userspace
> behave as perfect. Calling it as kernel assumption is strange.
>
> If you claim, there is strange logic widely used already, and of course,
> we can't simply break it because of compatibility. I would be able to
> agree. But your claim sounds like that logic is sane and well designed
> behavior. So I disagree.
>
> > And frankly I fail to see why you and Daniel care so much about this
> > corner case because from performance POV it's IMHO a non-issue and you
> > bother with page forking because of performance, don't you?
>
> Trying to penalize the corner case path, instead of normal path, should
> try at first. Penalizing normal path to allow corner case path is insane
> basically.
>
> Make normal path faster and more reliable is what we are trying.
>
> > So you can have a look for example at
> > drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
> > of a video device buffer at virtual address specified by user. Now I don't
> > know whether there really is any userspace video program that sets up the
> > video buffer in mmaped file. I would agree with you that it would be a
> > strange thing to do but I've seen enough strange userspace code that I
> > would not be too surprised.
> >
> > Another example of similar kind is at
> > drivers/infiniband/core/umem.c where we again set up buffer for infiniband
> > cards at users specified virtual address. And there are more drivers in
> > kernel like that.
>
> Unfortunately, I'm not looking those yet though. I guess those would be
> helpful to see the details.
>
> Thanks.
> --
> OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
>
> _______________________________________________
> Tux3 mailing list
> Tux3@phunq.net
> http://phunq.net/mailman/listinfo/tux3
>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 15:37                                                   ` Raymond Jennings
@ 2015-07-31 17:27                                                     ` Daniel Phillips
  2015-07-31 18:29                                                       ` David Lang
  2015-08-18 16:39                                                       ` Rik van Riel
  0 siblings, 2 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-07-31 17:27 UTC (permalink / raw)
  To: Raymond Jennings
  Cc: David Lang, Rik van Riel, Jan Kara, tux3,
	Linux Kernel Mailing List, linux-fsdevel, OGAWA Hirofumi
On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
> Returning ENOSPC when you have free space you can't yet prove is safer than
> not returning it and risking a data loss when you get hit by a write/commit
> storm. :)
Remember when delayed allocation was scary and unproven, because proving
that ENOSPC will always be returned when needed is extremely difficult?
But the performance advantage was compelling, so we just worked at it
until it worked. There were times when it didn't work properly, but the
code was in the tree so it got fixed.
It's like that now with page forking - a new technique with compelling
advantages, and some challenges. In the past, we (the Linux community)
would rise to the challenge and err on the side of pushing optimizations
in early. That was our mojo, and that is how Linux became the dominant
operating system it is today. Do we, the Linux community, still have that
mojo?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 17:27                                                     ` Daniel Phillips
@ 2015-07-31 18:29                                                       ` David Lang
  2015-07-31 18:43                                                         ` Daniel Phillips
  2015-07-31 22:12                                                         ` Daniel Phillips
  2015-08-18 16:39                                                       ` Rik van Riel
  1 sibling, 2 replies; 160+ messages in thread
From: David Lang @ 2015-07-31 18:29 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, OGAWA Hirofumi
On Fri, 31 Jul 2015, Daniel Phillips wrote:
> Subject: Re: [FYI] tux3: Core changes
> 
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer than
>> not returning it and risking a data loss when you get hit by a write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?
We, the Linux Community, have less tolerance for losing people's data and
preventing them from operating than we used to when it was all tinkerers'
personal data and secondary systems.
So rather than pushing optimizations out to everyone and seeing what breaks, we 
now do more testing and checking for failures before pushing things out.
This means that when something new is introduced, we default to the safe, 
slightly slower way initially (there will be enough other bugs to deal with in 
any case), and then as we gain experience from the tinkerers enabling the 
performance optimizations, we make those optimizations reliable and only then 
push them out to all users.
If you define this as "losing our mojo", then yes we have. But most people see 
the pace of development as still being high, just with more testing and 
polishing before it gets out to users.
David Lang
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 18:29                                                       ` David Lang
@ 2015-07-31 18:43                                                         ` Daniel Phillips
  2015-07-31 22:12                                                         ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-07-31 18:43 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, OGAWA Hirofumi
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> If you define this as "losing our mojo", then yes we have.
A pity. There remains so much to do that simply will not get
done in the absence of mojo.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 18:29                                                       ` David Lang
  2015-07-31 18:43                                                         ` Daniel Phillips
@ 2015-07-31 22:12                                                         ` Daniel Phillips
  2015-07-31 22:27                                                           ` David Lang
  1 sibling, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-07-31 22:12 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, OGAWA Hirofumi
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> We, the Linux Community, have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerers' personal data and secondary systems.
> 
> So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out.
By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 22:12                                                         ` Daniel Phillips
@ 2015-07-31 22:27                                                           ` David Lang
  2015-08-01  0:00                                                             ` Daniel Phillips
  2015-08-01 10:55                                                             ` Elifarley Callado Coelho Cruz
  0 siblings, 2 replies; 160+ messages in thread
From: David Lang @ 2015-07-31 22:27 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, OGAWA Hirofumi
On Fri, 31 Jul 2015, Daniel Phillips wrote:
> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
>> We, the Linux Community, have less tolerance for losing people's data and 
>> preventing them from operating than we used to when it was all tinkerers' 
>> personal data and secondary systems.
>> 
>> So rather than pushing optimizations out to everyone and seeing what 
>> breaks, we now do more testing and checking for failures before pushing 
>> things out.
>
> By the way, I am curious about whose data you think will get lost
> as a result of pushing out Tux3 with a possible theoretical bug
> in a wildly improbable scenario that has not actually been
> described with sufficient specificity to falsify, let alone
> demonstrated.
you weren't asking about any particular feature of Tux, you were asking if we 
were still willing to push out stuff that breaks for users and fix it later.
Especially for filesystems that can lose the data of whoever is using it, the 
answer seems to be a clear no.
there may be bugs in what's pushed out that we don't know about. But we don't 
push out potential data corruption bugs that we do know about (or think we do)
so if you think this should be pushed out with this known corner case that's not 
handled properly, you have to convince people that it's _so_ improbable that 
they shouldn't care about it.
David Lang
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 22:27                                                           ` David Lang
@ 2015-08-01  0:00                                                             ` Daniel Phillips
  2015-08-01  0:16                                                               ` Daniel Phillips
  2015-08-01 10:55                                                             ` Elifarley Callado Coelho Cruz
  1 sibling, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-08-01  0:00 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, OGAWA Hirofumi
On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:
> On Fri, 31 Jul 2015, Daniel Phillips wrote:
>
>> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...
>
> you weren't asking about any particular feature of Tux, you 
> were asking if we were still willing to push out stuff that 
> breaks for users and fix it later.
I think you left a key word out of my ask: "theoretical".
> Especially for filesystems that can lose the data of whoever 
> is using it, the answer seems to be a clear no.
>
> there may be bugs in what's pushed out that we don't know 
> about. But we don't push out potential data corruption bugs that 
> we do know about (or think we do)
>
> so if you think this should be pushed out with this known 
> corner case that's not handled properly, you have to convince 
> people that it's _so_ improbable that they shouldn't care about 
> it.
There should also be an onus on the person posing the worry
to prove their case beyond a reasonable doubt, which has not been
done in the case we are discussing here. Note: that is a technical
assessment to which a technical response is appropriate.
I do think that we should put a cap on this fencing and make
a real effort to get Tux3 into mainline. We should at least
set a ground rule that a problem should be proved real before it
becomes a reason to derail a project in the way that our project
has been derailed. Otherwise, it's hard to see what interest is
served.
OK, let's get back to the program. I accept your assertion that
we should convince people that the issue is improbable. To do
that, I need a specific issue to address. So far, no such issue
has been provided with specificity. Do you see why this is
frustrating?
Please, community. Give us specific issues to address, or give us
some way out of this eternal limbo. Or better, let's go back to the
old way of doing things in Linux, which is what got us where we
are today. Not this.
Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-08-01  0:00                                                             ` Daniel Phillips
@ 2015-08-01  0:16                                                               ` Daniel Phillips
  2015-08-03 13:07                                                                 ` Jan Kara
  0 siblings, 1 reply; 160+ messages in thread
From: Daniel Phillips @ 2015-08-01  0:16 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, OGAWA Hirofumi
On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> Note: Hirofumi's email is clear, logical and speaks to the
> question. This branch of the thread is largely pointless, though
> it essentially says the same thing in non-technical terms. Perhaps
> your next response should be to Hirofumi, and perhaps it should be
> technical.
Now, let me try to lead the way by being specific. RDMA was raised
as a potential failure case for Tux3 page forking. But the RDMA API
does not let you use memory mmapped by Tux3 as a source or destination
of IO. Instead, it sets up its own pages and hands them out to the
RDMA app from a pool. So no issue. One down, right?
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-08-01  0:16                                                               ` Daniel Phillips
@ 2015-08-03 13:07                                                                 ` Jan Kara
  0 siblings, 0 replies; 160+ messages in thread
From: Jan Kara @ 2015-08-03 13:07 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: David Lang, Rik van Riel, Jan Kara, tux3,
	Linux Kernel Mailing List, linux-fsdevel, OGAWA Hirofumi
On Fri 31-07-15 17:16:45, Daniel Phillips wrote:
> On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> >Note: Hirofumi's email is clear, logical and speaks to the
> >question. This branch of the thread is largely pointless, though
> >it essentially says the same thing in non-technical terms. Perhaps
> >your next response should be to Hirofumi, and perhaps it should be
> >technical.
> 
> Now, let me try to lead the way, but being specific. RDMA was raised
> as a potential failure case for Tux3 page forking. But the RDMA api
> does not let you use memory mmaped by Tux3 as a source or destination
> of IO. Instead, it sets up its own pages and hands them out to the
> RDMA app from a pool. So no issue. One down, right?
Can you please tell me how you arrived at that conclusion? As far as I can
see from the code in drivers/infiniband/, there is nothing there
preventing userspace from passing in mmapped memory...
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 22:27                                                           ` David Lang
  2015-08-01  0:00                                                             ` Daniel Phillips
@ 2015-08-01 10:55                                                             ` Elifarley Callado Coelho Cruz
  1 sibling, 0 replies; 160+ messages in thread
From: Elifarley Callado Coelho Cruz @ 2015-08-01 10:55 UTC (permalink / raw)
  To: David Lang
  Cc: Rik van Riel, Jan Kara, tux3, Linux Kernel Mailing List,
	linux-fsdevel, Daniel Phillips, OGAWA Hirofumi
My gosh!!!! This is driving me crazy. Please let's make it crystal clear,
in technical and precise terms, devoid of any ad hominem attacks or the
like, what is preventing Tux3 from being merged.
Maybe a list of issues on GitHub, so that each issue can be scrutinized
more easily. Like this one: https://github.com/tux3fs/tux3-merging/issues/1
Daniel, thank you so much for spending so much energy to fight not only for
tux3, but for logic, reason and rationality, and for a saner dev process in
Linux.
Elifarley Cruz
-
 " Do not believe anything because it is said by an authority, or if it  is
said to come from angels, or from Gods, or from an inspired source.
Believe it only if you have explored it in your own heart and mind and body
and found it to be true.  Work out your own path, through diligence."
- Gautama Buddha
On Fri, Jul 31, 2015 at 7:27 PM, David Lang <david@lang.hm> wrote:
> On Fri, 31 Jul 2015, Daniel Phillips wrote:
>
> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
>>
>>> We, the Linux Community have less tolerance for losing people's data and
>>> preventing them from operating than we used to when it was all tinkerer's
>>> personal data and secondary systems.
>>>
>>> So rather than pushing optimizations out to everyone and seeing what
>>> breaks, we now do more testing and checking for failures before pushing
>>> things out.
>>>
>>
>> By the way, I am curious about whose data you think will get lost
>> as a result of pushing out Tux3 with a possible theoretical bug
>> in a wildly improbable scenario that has not actually been
>> described with sufficient specificity to falsify, let alone
>> demonstrated.
>>
>
> you weren't asking about any particular feature of Tux, you were asking if
> we were still willing to push out stuff that breaks for users and fix it
> later.
>
> Especially for filesystems that can loose the data of whoever is using it,
> the answer seems to be a clear no.
>
> there may be bugs in what's pushed out that we don't know about. But we
> don't push out potential data corruption bugs that we do know about (or
> think we do)
>
> so if you think this should be pushed out with this known corner case
> that's not handled properly, you have to convince people that it's _so_
> improbable that they shouldn't care about it.
>
> David Lang
>
>
> _______________________________________________
> Tux3 mailing list
> Tux3@phunq.net
> http://phunq.net/mailman/listinfo/tux3
>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31 17:27                                                     ` Daniel Phillips
  2015-07-31 18:29                                                       ` David Lang
@ 2015-08-18 16:39                                                       ` Rik van Riel
  1 sibling, 0 replies; 160+ messages in thread
From: Rik van Riel @ 2015-08-18 16:39 UTC (permalink / raw)
  To: Daniel Phillips, Raymond Jennings
  Cc: OGAWA Hirofumi, David Lang, Jan Kara, tux3,
	Linux Kernel Mailing List, linux-fsdevel
On 07/31/2015 01:27 PM, Daniel Phillips wrote:
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer
>> than
>> not returning it and risking a data loss when you get hit by a
>> write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?
Do you have the mojo to come up with a proposal on how
to make things work, in a way that ensures data consistency
for Linux users?
Yes, we know page forking is not compatible with the way
Linux currently uses refcounts.
The question is, does anyone have an idea on how we could
fix that?
Not necessarily an implementation yet, just an idea might
be enough to move forward at this stage.
However, if nobody wants to work on even an idea, page
forking may simply not be a safe thing to do.
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-07-31  4:44                                                 ` OGAWA Hirofumi
  2015-07-31 15:37                                                   ` Raymond Jennings
@ 2015-08-03 13:42                                                   ` Jan Kara
  2015-08-09 13:42                                                     ` OGAWA Hirofumi
  1 sibling, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-08-03 13:42 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, Daniel Phillips
On Fri 31-07-15 13:44:44, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >> > Yes, if userspace truncates the file, the situation we end up with is
> >> > basically the same. However for truncate to happen some malicious process
> >> > has to come and truncate the file - a failure scenario that is acceptable
> >> > for most use cases since it doesn't happen unless someone is actively
> >> > trying to screw you. With page forking it is enough for flusher thread
> >> > to start writeback for that page to trigger the problem - event that is
> >> > basically bound to happen without any other userspace application
> >> > interfering.
> >> 
> >> Acceptable conclusion is where came from? That pseudocode logic doesn't
> >> say about usage at all. And even if assume it is acceptable, as far as I
> >> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
> >> page on non-exists block (sparse file. i.e. missing disk space check in
> >> your logic). And if really no any lock/check, there would be another
> >> races.
> >
> > So drop_caches won't cause any issues because it avoids mmaped pages.
> > Also page reclaim or page migration don't cause any issues because
> > they avoid pages with increased refcount (and increased refcount would stop
> > drop_caches from reclaiming the page as well if it was not for the mmaped
> > check before). Generally, elevated page refcount currently guarantees page
> > isn't migrated, reclaimed, or otherwise detached from the mapping (except
> > for truncate where the combination of mapping-index becomes invalid) and
> > your page forking would change that assumption - which IMHO has a big
> > potential for some breakage somewhere.
> 
> Lifetime and visibility from user are different topic.  The issue here
> is visibility. Of course, those has relation more or less though,
> refcount doesn't stop to drop page from radix-tree at all.
Well, the refcount prevents dropping a page from the radix tree in some
cases - memory pressure and page migration, to name the most prominent ones.
It doesn't prevent a page from being dropped because of truncate, that is
correct. In general, the rule we currently obey is that the kernel doesn't
detach a page with an increased refcount from the radix tree unless there is
a syscall asking the kernel to do that.
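As a rough illustration of the rule above, the check could be written as a
predicate. This is only a toy model under stated assumptions: toy_page,
detach_reason and toy_may_detach() are hypothetical names, not kernel API,
and the "expected" field stands in for the references the inspecting code
legitimately accounts for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reasons for removing a page from the page cache. */
enum detach_reason {
	DETACH_RECLAIM,		/* memory pressure */
	DETACH_MIGRATION,	/* page migration */
	DETACH_TRUNCATE,	/* explicitly requested by a syscall */
	DETACH_PAGE_FORK,	/* fork triggered by writeback */
};

/* Toy stand-in for struct page; not the kernel structure. */
struct toy_page {
	int refcount;	/* all references currently held */
	int expected;	/* references the caller accounts for */
};

/* A refcount is "elevated" when somebody else holds a reference. */
static bool toy_refcount_elevated(const struct toy_page *page)
{
	assert(page->refcount >= page->expected);
	return page->refcount > page->expected;
}

/* The rule: an elevated refcount blocks detaching unless userspace
 * asked for it explicitly, as truncate does. */
static bool toy_may_detach(const struct toy_page *page, enum detach_reason why)
{
	if (!toy_refcount_elevated(page))
		return true;
	return why == DETACH_TRUNCATE;
}
```

Under this model a page fork of a page with extra references falls on the
same side as reclaim and migration, which is the assumption the fork would
change.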
> Well, anyway, your claim seems to be assuming the userspace app
> workarounds the issues. And it sounds like still not workarounds the
> ENOSPC issue (validate at page fault/GUP) even if assuming userspace
> behave as perfect. Calling it as kernel assumption is strange.
Realistically, I don't think userspace apps work around anything. They just
do what happens to work. Nobody deletes files while an application is
working on them and expects the application to handle that gracefully. So
everyone is happy. I'm not sure which ENOSPC issue you are speaking about,
BTW. Can you please elaborate?
> If you claim, there is strange logic widely used already, and of course,
> we can't simply break it because of compatibility. I would be able to
> agree. But your claim sounds like that logic is sane and well designed
> behavior. So I disagree.
To me the rule: "Do not detach a page from a radix tree if it has an elevated
refcount unless explicitly requested by a syscall" looks like a sane one.
Yes.
> > And frankly I fail to see why you and Daniel care so much about this
> > corner case because from performance POV it's IMHO a non-issue and you
> > bother with page forking because of performance, don't you?
> 
> Trying to penalize the corner case path, instead of normal path, should
> try at first. Penalizing normal path to allow corner case path is insane
> basically.
>
> Make normal path faster and more reliable is what we are trying.
Elevated refcount of a page is in my opinion a corner case path. That's why
I think that penalizing that case by waiting for IO instead of forking is
an acceptable cost for the improved compatibility & maintainability of the
code.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-08-03 13:42                                                   ` Jan Kara
@ 2015-08-09 13:42                                                     ` OGAWA Hirofumi
  2015-08-10 12:45                                                       ` Jan Kara
  0 siblings, 1 reply; 160+ messages in thread
From: OGAWA Hirofumi @ 2015-08-09 13:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Daniel Phillips, David Lang, Rik van Riel, tux3, linux-kernel,
	linux-fsdevel
Jan Kara <jack@suse.cz> writes:
> I'm not sure about which ENOSPC issue you are speaking BTW. Can you
> please ellaborate?
1. GUP simulates a page fault and prepares the page for modification
2. writeback clears the dirty bit and makes the PTE read-only
3. a snapshot/reflink marks the block as COW
4. the driver that called GUP modifies the page and dirties it without
   simulating a page fault
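The four steps can be replayed in a small userspace model to show where the
accounting goes missing. This is a sketch only: toy_state and the toy_*
functions are hypothetical, and the "space_reserved" flag is an assumed
stand-in for whatever a filesystem's write-fault hook would reserve.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state of one file page and its backing block. */
struct toy_state {
	bool pinned;		/* a driver holds a GUP reference */
	bool pte_writable;	/* userspace mapping is writable */
	bool dirty;		/* page has unwritten changes */
	bool space_reserved;	/* write-fault hook reserved space */
	bool block_shared;	/* block marked COW by snapshot/reflink */
};

/* Step 1: GUP simulates a write fault, so the filesystem hook runs. */
static void toy_gup_for_write(struct toy_state *s)
{
	s->pinned = true;
	s->pte_writable = true;
	s->dirty = true;
	s->space_reserved = true;
}

/* Step 2: writeback cleans the page and write-protects the PTE. */
static void toy_writeback(struct toy_state *s)
{
	s->dirty = false;
	s->pte_writable = false;
	s->space_reserved = false;	/* reservation consumed by the write */
}

/* Step 3: snapshot/reflink marks the block copy-on-write. */
static void toy_snapshot(struct toy_state *s)
{
	s->block_shared = true;
}

/* Step 4: the driver writes through its pinned reference; no fault
 * is taken, so the filesystem hook never runs. */
static void toy_driver_write(struct toy_state *s)
{
	assert(s->pinned);
	s->dirty = true;
}

/* Dirty data on a shared block with nothing reserved for the copy. */
static bool toy_unaccounted_dirty(const struct toy_state *s)
{
	return s->dirty && s->block_shared && !s->space_reserved;
}

/* Replay steps 1-4 and report whether the problem state is reached. */
static bool toy_run_race(void)
{
	struct toy_state s = { 0 };

	toy_gup_for_write(&s);
	toy_writeback(&s);
	toy_snapshot(&s);
	toy_driver_write(&s);
	return toy_unaccounted_dirty(&s);
}
```

In this model the final state is a dirty page on a shared block with no
reservation behind it, which is the situation the list describes.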
>> If you claim, there is strange logic widely used already, and of course,
>> we can't simply break it because of compatibility. I would be able to
>> agree. But your claim sounds like that logic is sane and well designed
>> behavior. So I disagree.
>
> To me the rule: "Do not detach a page from a radix tree if it has an elevated
> refcount unless explicitely requested by a syscall" looks like a sane one.
> Yes.
>
>> > And frankly I fail to see why you and Daniel care so much about this
>> > corner case because from performance POV it's IMHO a non-issue and you
>> > bother with page forking because of performance, don't you?
>> 
>> Trying to penalize the corner case path, instead of normal path, should
>> try at first. Penalizing normal path to allow corner case path is insane
>> basically.
>>
>> Make normal path faster and more reliable is what we are trying.
>
> Elevated refcount of a page is in my opinion a corner case path. That's why
> I think that penalizing that case by waiting for IO instead of forking is
> acceptable cost for the improved compatibility & maintainability of the
> code.
What is an "elevated refcount"? How does it differ from a normal refcount?
Are you saying "refcount >= specified threshold + waitq/wakeup" or
such? If so, it is not a path, it is a state. IOW, some groups of users
may rarely hit it, but others may hit it often, on the normal path.
So it sounds like yet another "stable page", i.e. unpredictable
performance. (BTW, recalling "stable pages", I noticed that "stable pages"
would not provide stable page data for that logic either.)
Well, assuming "elevated refcount == threshold + waitq/wakeup", IMO it
is not attractive. Rather, it is the last option, if there is no other
design choice.
Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-08-09 13:42                                                     ` OGAWA Hirofumi
@ 2015-08-10 12:45                                                       ` Jan Kara
  2015-08-16 19:42                                                         ` OGAWA Hirofumi
  0 siblings, 1 reply; 160+ messages in thread
From: Jan Kara @ 2015-08-10 12:45 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, Daniel Phillips
On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > I'm not sure about which ENOSPC issue you are speaking BTW. Can you
> > please ellaborate?
> 
> 1. GUP simulate page fault, and prepare to modify
> 2. writeback clear dirty, and make PTE read-only
> 3. snapshot/reflink make block cow
I assume by point 3. you mean that snapshot / reflink happens now and thus
the page / block is marked as COW. Am I right?
> 4. driver called GUP modifies page, and dirty page without simulate page fault
OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
the page gets modified without triggering another page fault so COW for the
modified page isn't triggered. Modified page contents will be in both the
original and the reflinked file, won't it?
And I agree that the fact that snapshotted file's original contents can
still get modified is a bug. One which is difficult to fix.
> >> If you claim, there is strange logic widely used already, and of course,
> >> we can't simply break it because of compatibility. I would be able to
> >> agree. But your claim sounds like that logic is sane and well designed
> >> behavior. So I disagree.
> >
> > To me the rule: "Do not detach a page from a radix tree if it has an elevated
> > refcount unless explicitely requested by a syscall" looks like a sane one.
> > Yes.
> >
> >> > And frankly I fail to see why you and Daniel care so much about this
> >> > corner case because from performance POV it's IMHO a non-issue and you
> >> > bother with page forking because of performance, don't you?
> >> 
> >> Trying to penalize the corner case path, instead of normal path, should
> >> try at first. Penalizing normal path to allow corner case path is insane
> >> basically.
> >>
> >> Make normal path faster and more reliable is what we are trying.
> >
> > Elevated refcount of a page is in my opinion a corner case path. That's why
> > I think that penalizing that case by waiting for IO instead of forking is
> > acceptable cost for the improved compatibility & maintainability of the
> > code.
> 
> What is "elevated refcount"? What is difference with normal refcount?
> Are you saying "refcount >= specified threshold + waitq/wakeup" or
> such? If so, it is not the path.  It is the state. IOW, some group may
> not hit much, but some group may hit much, on normal path.
Yes, by "elevated refcount" I meant refcount > 2 (one for pagecache, one for
your code inspecting the page).
> So it sounds like yet another "stable page". I.e. unpredictable
> performance. (BTW, by recall of "stable page", noticed "stable page"
> would not provide stabled page data for that logic too.)
> 
> Well, assuming "elevated refcount == threshold + waitq/wakeup", so
> IMO, it is not attractive.  Rather the last option if there is no
> others as design choice.
I agree the performance will be less predictable and that is not good. But
changing what is visible in the file when writeback races with GUP is a
worse problem to me.
Maybe GUP could mark the pages it took a reference to, so that we could
trigger the slow behavior only for them (Peter Zijlstra proposed in [1] an
infrastructure so that pages pinned by get_user_pages() would be properly
accounted, and then we could use PG_mlocked and elevated refcount as a more
reliable indication of pages that need special handling).
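A minimal sketch of that idea, assuming pins were accounted per page: only
pinned pages take the slow path, everything else may still fork. All names
here (toy_pinned_page, toy_gup_pin(), toy_choose_write_action()) are
hypothetical and do not correspond to the interface of the patchset in [1].

```c
#include <assert.h>
#include <stdbool.h>

/* What a write does when it hits a page cache page. */
enum toy_write_action {
	TOY_WRITE_IN_PLACE,	/* page is not under writeback */
	TOY_WRITE_FORK,		/* fast path: fork the page */
	TOY_WRITE_WAIT_FOR_IO,	/* slow path: wait for writeback */
};

/* Toy page with an assumed per-page count of GUP pins. */
struct toy_pinned_page {
	int gup_pins;
	bool under_writeback;
};

/* Hypothetical accounting hooks called when GUP takes and drops a pin. */
static void toy_gup_pin(struct toy_pinned_page *page)
{
	page->gup_pins++;
}

static void toy_gup_unpin(struct toy_pinned_page *page)
{
	assert(page->gup_pins > 0);
	page->gup_pins--;
}

/* Penalize only the pinned case; unpinned pages keep the fork path. */
static enum toy_write_action
toy_choose_write_action(const struct toy_pinned_page *page)
{
	if (!page->under_writeback)
		return TOY_WRITE_IN_PLACE;
	if (page->gup_pins > 0)
		return TOY_WRITE_WAIT_FOR_IO;
	return TOY_WRITE_FORK;
}
```

The point of the accounting is that the decision no longer depends on a raw
refcount comparison, which can be elevated for reasons unrelated to GUP.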
								Honza
[1] http://thread.gmane.org/gmane.linux.kernel.mm/117679
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-08-10 12:45                                                       ` Jan Kara
@ 2015-08-16 19:42                                                         ` OGAWA Hirofumi
  0 siblings, 0 replies; 160+ messages in thread
From: OGAWA Hirofumi @ 2015-08-16 19:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Daniel Phillips, David Lang, Rik van Riel, tux3, linux-kernel,
	linux-fsdevel
Jan Kara <jack@suse.cz> writes:
> On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> > I'm not sure about which ENOSPC issue you are speaking BTW. Can you
>> > please ellaborate?
>> 
>> 1. GUP simulate page fault, and prepare to modify
>> 2. writeback clear dirty, and make PTE read-only
>> 3. snapshot/reflink make block cow
>
> I assume by point 3. you mean that snapshot / reflink happens now and thus
> the page / block is marked as COW. Am I right?
Right.
>> 4. driver called GUP modifies page, and dirty page without simulate page fault
>
> OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
> the page gets modified without triggering another page fault so COW for the
> modified page isn't triggered. Modified page contents will be in both the
> original and the reflinked file, won't it?
The above can result in ENOSPC too, depending on the implementation and the
race condition. Also, if the FS converts zeroed blocks to holes, like
hammerfs, ENOSPC simply happens. I.e. another process uses up all the space,
but then there is no ->page_mkwrite() callback to check for ENOSPC.
> And I agree that the fact that snapshotted file's original contents can
> still get modified is a bug. A one which is difficult to fix.
Yes, that is why I think this logic is an issue even before page forking.
>> So it sounds like yet another "stable page". I.e. unpredictable
>> performance. (BTW, by recall of "stable page", noticed "stable page"
>> would not provide stabled page data for that logic too.)
>> 
>> Well, assuming "elevated refcount == threshold + waitq/wakeup", so
>> IMO, it is not attractive.  Rather the last option if there is no
>> others as design choice.
>
> I agree the performance will be less predictable and that is not good. But
> changing what is visible in the file when writeback races with GUP is a
> worse problem to me.
>
> Maybe if GUP marked pages it got ref for so that we could trigger the slow
> behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so
> that pages pinned by get_user_pages() would be properly accounted and then
> we could use PG_mlocked and elevated refcount as a more reliable indication
> of pages that need special handling).
I have not read Peter's patchset fully, but it looks good, and it may be
similar to the strategy I currently have in mind. I'm also thinking of
adding callbacks for the FS at the start and end of GUP's pin window. (Just
as an example, a callback could be used by the FS to stop writeback if it
wants.)
Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-26  8:08                                 ` Daniel Phillips
  2015-05-26  9:00                                   ` Jan Kara
@ 2015-05-26 10:22                                   ` Sergey Senozhatsky
  2015-05-26 12:33                                     ` Jan Kara
  2015-05-26 19:18                                     ` Daniel Phillips
  1 sibling, 2 replies; 160+ messages in thread
From: Sergey Senozhatsky @ 2015-05-26 10:22 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jan Kara, David Lang, Rik van Riel, linux-fsdevel, tux3,
	linux-kernel, OGAWA Hirofumi
On (05/26/15 01:08), Daniel Phillips wrote:
> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >  E.g. video drivers (or infiniband or direct IO for that matter) which
> >have buffers in user memory (may be mmapped file), grab references to pages
> >and hand out PFNs of those pages to the hardware to store data in them...
> >If you fork a page after the driver has handed PFNs to the hardware, you've
> >just lost all the writes hardware will do.
> 
> Hi Jan,
> 
> The page forked because somebody wrote to it with write(2) or mmap write at
> the same time as a video driver (or infiniband or direct IO) was doing io to
> it. Isn't the application trying hard to lose data in that case? It would
> not need page fork to lose data that way.
> 
Hello,
is it possible to page-fork-bomb the system by some 'malicious' app?
	-ss
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-26 10:22                                   ` Sergey Senozhatsky
@ 2015-05-26 12:33                                     ` Jan Kara
  2015-05-26 19:18                                     ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Jan Kara @ 2015-05-26 12:33 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, Daniel Phillips, OGAWA Hirofumi
On Tue 26-05-15 19:22:39, Sergey Senozhatsky wrote:
> On (05/26/15 01:08), Daniel Phillips wrote:
> > On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> > >  E.g. video drivers (or infiniband or direct IO for that matter) which
> > >have buffers in user memory (may be mmapped file), grab references to pages
> > >and hand out PFNs of those pages to the hardware to store data in them...
> > >If you fork a page after the driver has handed PFNs to the hardware, you've
> > >just lost all the writes hardware will do.
> > 
> > Hi Jan,
> > 
> > The page forked because somebody wrote to it with write(2) or mmap write at
> > the same time as a video driver (or infiniband or direct IO) was doing io to
> > it. Isn't the application trying hard to lose data in that case? It would
> > not need page fork to lose data that way.
> > 
> 
> Hello,
> 
> is it possible to page-fork-bomb the system by some 'malicious' app?
  Well, you can have only two copies of each page - the one under writeout
and the one in page cache. Furthermore you are limited by dirty throttling
so I don't think this would allow any out-of-ordinary DOS vector...
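To make the two-copy bound concrete, here is a toy model of one page cache
slot. It assumes a single writeout in flight per page, and toy_slot and its
helpers are hypothetical names, not Tux3 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page cache slot. */
struct toy_slot {
	bool cache_under_writeout;	/* cache copy is being written */
	bool detached_writeout_copy;	/* forked-off copy still in flight */
};

/* Copies in memory: the cache copy plus at most one detached copy. */
static int toy_copies(const struct toy_slot *slot)
{
	return 1 + (slot->detached_writeout_copy ? 1 : 0);
}

/* Writeout starts on the current cache copy. */
static void toy_start_writeout(struct toy_slot *slot)
{
	assert(!slot->cache_under_writeout && !slot->detached_writeout_copy);
	slot->cache_under_writeout = true;
}

/* A write forks only if the cache copy is under writeout; the new
 * cache copy is not, so later writes do not fork again. */
static void toy_write(struct toy_slot *slot)
{
	if (slot->cache_under_writeout) {
		slot->detached_writeout_copy = true;
		slot->cache_under_writeout = false;
	}
}

/* Writeout completes; any detached copy is freed. */
static void toy_end_writeout(struct toy_slot *slot)
{
	slot->cache_under_writeout = false;
	slot->detached_writeout_copy = false;
}
```

However many writes arrive during one writeout in this model, the count
never exceeds two.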
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 160+ messages in thread 
- * Re: [FYI] tux3: Core changes
  2015-05-26 10:22                                   ` Sergey Senozhatsky
  2015-05-26 12:33                                     ` Jan Kara
@ 2015-05-26 19:18                                     ` Daniel Phillips
  1 sibling, 0 replies; 160+ messages in thread
From: Daniel Phillips @ 2015-05-26 19:18 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: David Lang, Rik van Riel, Jan Kara, tux3, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi
Hi Sergey,
On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote:
> 
> Hello,
> 
> is it possible to page-fork-bomb the system by some 'malicious' app?
Not in any new way. A page fork can happen either in the front end,
where it has to wait for memory like any other normal memory user,
or in the backend, where Tux3 may have privileged access to low
memory reserves and therefore must place bounds on its memory use
like any other user of low memory reserves.
This is not specific to page fork. We must place such bounds for
any memory that the backend uses. Fortunately, the backend does not
allocate memory extravagantly, for fork or anything else, so when
this does get to the top of our to-do list it should not be too
hard to deal with. We plan to attack that after merge, as we have
never observed a problem in practice. Rather, Tux3 already seems
to survive low memory situations pretty well compared to some other
filesystems.
Regards,
Daniel
^ permalink raw reply	[flat|nested] 160+ messages in thread