From: Dave Chinner <david@fromorbit.com>
To: Hugh Dickins <hughd@google.com>
Cc: Namjae Jeon <linkinjeon@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <matthew@wil.cx>, Theodore Ts'o <tytso@mit.edu>,
Stephen Rothwell <sfr@canb.auug.org.au>,
viro@zeniv.linux.org.uk, bpm@sgi.com, adilger.kernel@dilger.ca,
jack@suse.cz, mtk.manpages@gmail.com, lczerner@redhat.com,
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com,
linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Namjae Jeon <namjae.jeon@samsung.com>
Subject: Re: [PATCH v5 0/10] fs: Introduce new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
Date: Thu, 27 Feb 2014 12:24:31 +1100 [thread overview]
Message-ID: <20140227012431.GW13647@dastard> (raw)
In-Reply-To: <alpine.LSU.2.11.1402261454270.2808@eggly.anvils>
On Wed, Feb 26, 2014 at 03:08:58PM -0800, Hugh Dickins wrote:
> On Wed, 26 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 08:45:15PM -0800, Hugh Dickins wrote:
> > > On Wed, 26 Feb 2014, Dave Chinner wrote:
> > > > On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > > >
> > > > > I should mention that when "we" implemented this thirty years ago,
> > > > > we had a strong conviction that the system call should be idempotent:
> > > > > that is, the len argument should indicate the final i_size, not the
> > > > > amount being removed from it. Now, I don't remember the grounds for
> > > > > that conviction: maybe it was just an idealistic preference for how
> > > > > to design a good system call. I can certainly see that defining it
> > > > > that way round would surprise many app programmers. Just mentioning
> > > > > this in case anyone on these lists sees a practical advantage to
> > > > > doing it that way instead.
> > > >
> > > > I don't see how specifying the end file size as an improvement. What
> > > > happens if you are collapse a range in a file that is still being
> > > > appended to by the application and so you race with a file size
> > > > update? IOWs, with such an API the range to be collapsed is
> > > > completely unpredictable, and IMO that's a fundamentally broken API.
> > >
> > > That's fine if you don't see the idempotent API as an improvement,
> > > I just wanted to put it on the table in case someone does see an
> > > advantage to it. But I think I'm missing something in your race
> > > example: I don't see a difference between the two APIs there.
> >
> >
> > Userspace can't sample the inode size via stat(2) and then use the value for a
> > syscall atomically. i.e. if you specify the offset you want to
> > collapse at, and the file size you want to have to define the region
> > to collapse, then the length you need to collapse is (current inode
> > size - end file size). If "current inode size" can change between
> > the stat(2) and fallocate() call (and it can), then the length being
> > collapsed is indeterminate....
>
> Thanks for explaining more, I was just about to acknowledge what a good
> example that is. Indeed, it seems not unreasonable to be editing the
> earlier part of a file while the later part of it is still streaming in.
>
> But damn, it now occurs to me that there's still a problem at the
> streaming end: its file write offset won't be updated to reflect
> the collapse, so there would be a sparse hole at that end. And
> collapse returns -EPERM if IS_APPEND(inode).
Well, we figure that most applications won't be using append only
inode flags for files that they know they want to edit at random
offsets later on. ;)
However, I can see how DVR apps would use open(O_APPEND) to obtain
the fd they write to because that sets the write position to the EOF
on every write() call (i.e. in generic_write_checks()). And collapse
range should behave sanely with this sort of usage.
e.g. XFS calls generic_write_checks() after it has taken the IO lock
to set the current write position to EOF. Hence it will be correctly
serialised against collapse range calls and so O_APPEND writes will
not leave sparse holes if collapse range calls are interleaved with
the write stream....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Hugh Dickins <hughd@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
Namjae Jeon <linkinjeon@gmail.com>, Theodore Ts'o <tytso@mit.edu>,
Matthew Wilcox <matthew@wil.cx>,
Namjae Jeon <namjae.jeon@samsung.com>,
linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
linux-mm@kvack.org, bpm@sgi.com, adilger.kernel@dilger.ca,
viro@zeniv.linux.org.uk, lczerner@redhat.com,
linux-fsdevel@vger.kernel.org, jack@suse.cz,
Andrew Morton <akpm@linux-foundation.org>,
linux-ext4@vger.kernel.org, mtk.manpages@gmail.com
Subject: Re: [PATCH v5 0/10] fs: Introduce new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
Date: Thu, 27 Feb 2014 12:24:31 +1100 [thread overview]
Message-ID: <20140227012431.GW13647@dastard> (raw)
In-Reply-To: <alpine.LSU.2.11.1402261454270.2808@eggly.anvils>
On Wed, Feb 26, 2014 at 03:08:58PM -0800, Hugh Dickins wrote:
> On Wed, 26 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 08:45:15PM -0800, Hugh Dickins wrote:
> > > On Wed, 26 Feb 2014, Dave Chinner wrote:
> > > > On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > > >
> > > > > I should mention that when "we" implemented this thirty years ago,
> > > > > we had a strong conviction that the system call should be idempotent:
> > > > > that is, the len argument should indicate the final i_size, not the
> > > > > amount being removed from it. Now, I don't remember the grounds for
> > > > > that conviction: maybe it was just an idealistic preference for how
> > > > > to design a good system call. I can certainly see that defining it
> > > > > that way round would surprise many app programmers. Just mentioning
> > > > > this in case anyone on these lists sees a practical advantage to
> > > > > doing it that way instead.
> > > >
> > > > I don't see how specifying the end file size as an improvement. What
> > > > happens if you are collapse a range in a file that is still being
> > > > appended to by the application and so you race with a file size
> > > > update? IOWs, with such an API the range to be collapsed is
> > > > completely unpredictable, and IMO that's a fundamentally broken API.
> > >
> > > That's fine if you don't see the idempotent API as an improvement,
> > > I just wanted to put it on the table in case someone does see an
> > > advantage to it. But I think I'm missing something in your race
> > > example: I don't see a difference between the two APIs there.
> >
> >
> > Userspace can't sample the inode size via stat(2) and then use the value for a
> > syscall atomically. i.e. if you specify the offset you want to
> > collapse at, and the file size you want to have to define the region
> > to collapse, then the length you need to collapse is (current inode
> > size - end file size). If "current inode size" can change between
> > the stat(2) and fallocate() call (and it can), then the length being
> > collapsed is indeterminate....
>
> Thanks for explaining more, I was just about to acknowledge what a good
> example that is. Indeed, it seems not unreasonable to be editing the
> earlier part of a file while the later part of it is still streaming in.
>
> But damn, it now occurs to me that there's still a problem at the
> streaming end: its file write offset won't be updated to reflect
> the collapse, so there would be a sparse hole at that end. And
> collapse returns -EPERM if IS_APPEND(inode).
Well, we figure that most applications won't be using append only
inode flags for files that they know they want to edit at random
offsets later on. ;)
However, I can see how DVR apps would use open(O_APPEND) to obtain
the fd they write to because that sets the write position to the EOF
on every write() call (i.e. in generic_write_checks()). And collapse
range should behave sanely with this sort of usage.
e.g. XFS calls generic_write_checks() after it has taken the IO lock
to set the current write position to EOF. Hence it will be correctly
serialised against collapse range calls and so O_APPEND writes will
not leave sparse holes if collapse range calls are interleaved with
the write stream....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Hugh Dickins <hughd@google.com>
Cc: Namjae Jeon <linkinjeon@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <matthew@wil.cx>, "Theodore Ts'o" <tytso@mit.edu>,
Stephen Rothwell <sfr@canb.auug.org.au>,
viro@zeniv.linux.org.uk, bpm@sgi.com, adilger.kernel@dilger.ca,
jack@suse.cz, mtk.manpages@gmail.com, lczerner@redhat.com,
linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com,
linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Namjae Jeon <namjae.jeon@samsung.com>
Subject: Re: [PATCH v5 0/10] fs: Introduce new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate
Date: Thu, 27 Feb 2014 12:24:31 +1100 [thread overview]
Message-ID: <20140227012431.GW13647@dastard> (raw)
In-Reply-To: <alpine.LSU.2.11.1402261454270.2808@eggly.anvils>
On Wed, Feb 26, 2014 at 03:08:58PM -0800, Hugh Dickins wrote:
> On Wed, 26 Feb 2014, Dave Chinner wrote:
> > On Tue, Feb 25, 2014 at 08:45:15PM -0800, Hugh Dickins wrote:
> > > On Wed, 26 Feb 2014, Dave Chinner wrote:
> > > > On Tue, Feb 25, 2014 at 03:23:35PM -0800, Hugh Dickins wrote:
> > > >
> > > > > I should mention that when "we" implemented this thirty years ago,
> > > > > we had a strong conviction that the system call should be idempotent:
> > > > > that is, the len argument should indicate the final i_size, not the
> > > > > amount being removed from it. Now, I don't remember the grounds for
> > > > > that conviction: maybe it was just an idealistic preference for how
> > > > > to design a good system call. I can certainly see that defining it
> > > > > that way round would surprise many app programmers. Just mentioning
> > > > > this in case anyone on these lists sees a practical advantage to
> > > > > doing it that way instead.
> > > >
> > > > I don't see how specifying the end file size as an improvement. What
> > > > happens if you are collapse a range in a file that is still being
> > > > appended to by the application and so you race with a file size
> > > > update? IOWs, with such an API the range to be collapsed is
> > > > completely unpredictable, and IMO that's a fundamentally broken API.
> > >
> > > That's fine if you don't see the idempotent API as an improvement,
> > > I just wanted to put it on the table in case someone does see an
> > > advantage to it. But I think I'm missing something in your race
> > > example: I don't see a difference between the two APIs there.
> >
> >
> > Userspace can't sample the inode size via stat(2) and then use the value for a
> > syscall atomically. i.e. if you specify the offset you want to
> > collapse at, and the file size you want to have to define the region
> > to collapse, then the length you need to collapse is (current inode
> > size - end file size). If "current inode size" can change between
> > the stat(2) and fallocate() call (and it can), then the length being
> > collapsed is indeterminate....
>
> Thanks for explaining more, I was just about to acknowledge what a good
> example that is. Indeed, it seems not unreasonable to be editing the
> earlier part of a file while the later part of it is still streaming in.
>
> But damn, it now occurs to me that there's still a problem at the
> streaming end: its file write offset won't be updated to reflect
> the collapse, so there would be a sparse hole at that end. And
> collapse returns -EPERM if IS_APPEND(inode).
Well, we figure that most applications won't be using append only
inode flags for files that they know they want to edit at random
offsets later on. ;)
However, I can see how DVR apps would use open(O_APPEND) to obtain
the fd they write to because that sets the write position to the EOF
on every write() call (i.e. in generic_write_checks()). And collapse
range should behave sanely with this sort of usage.
e.g. XFS calls generic_write_checks() after it has taken the IO lock
to set the current write position to EOF. Hence it will be correctly
serialised against collapse range calls and so O_APPEND writes will
not leave sparse holes if collapse range calls are interleaved with
the write stream....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2014-02-27 1:24 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-18 16:37 [PATCH v5 0/10] fs: Introduce new flag(FALLOC_FL_COLLAPSE_RANGE) for fallocate Namjae Jeon
2014-02-18 16:37 ` Namjae Jeon
2014-02-24 0:57 ` Dave Chinner
2014-02-24 0:57 ` Dave Chinner
2014-02-24 1:34 ` Namjae Jeon
2014-02-24 1:34 ` Namjae Jeon
2014-02-25 3:16 ` Stephen Rothwell
2014-02-25 3:16 ` Stephen Rothwell
2014-02-25 4:13 ` Dave Chinner
2014-02-25 4:13 ` Dave Chinner
2014-02-25 23:23 ` Hugh Dickins
2014-02-25 23:23 ` Hugh Dickins
2014-02-25 23:23 ` Hugh Dickins
2014-02-25 23:41 ` Andrew Morton
2014-02-25 23:41 ` Andrew Morton
2014-02-25 23:41 ` Andrew Morton
2014-02-26 1:34 ` Dave Chinner
2014-02-26 1:34 ` Dave Chinner
2014-02-26 1:34 ` Dave Chinner
2014-02-26 1:52 ` Andrew Morton
2014-02-26 1:52 ` Andrew Morton
2014-02-26 1:52 ` Andrew Morton
2014-02-26 3:42 ` Dave Chinner
2014-02-26 3:42 ` Dave Chinner
2014-02-26 3:42 ` Dave Chinner
2014-02-26 1:13 ` Dave Chinner
2014-02-26 1:13 ` Dave Chinner
2014-02-26 1:13 ` Dave Chinner
2014-02-26 4:45 ` Hugh Dickins
2014-02-26 4:45 ` Hugh Dickins
2014-02-26 4:45 ` Hugh Dickins
2014-02-26 6:42 ` Dave Chinner
2014-02-26 6:42 ` Dave Chinner
2014-02-26 6:42 ` Dave Chinner
2014-02-26 23:08 ` Hugh Dickins
2014-02-26 23:08 ` Hugh Dickins
2014-02-26 23:08 ` Hugh Dickins
2014-02-27 1:24 ` Dave Chinner [this message]
2014-02-27 1:24 ` Dave Chinner
2014-02-27 1:24 ` Dave Chinner
2014-02-27 1:30 ` Hugh Dickins
2014-02-27 1:30 ` Hugh Dickins
2014-02-27 1:30 ` Hugh Dickins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140227012431.GW13647@dastard \
--to=david@fromorbit.com \
--cc=adilger.kernel@dilger.ca \
--cc=akpm@linux-foundation.org \
--cc=bpm@sgi.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=lczerner@redhat.com \
--cc=linkinjeon@gmail.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=matthew@wil.cx \
--cc=mtk.manpages@gmail.com \
--cc=namjae.jeon@samsung.com \
--cc=sfr@canb.auug.org.au \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.