From: Nick Piggin <npiggin@suse.de>
To: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>,
linux-fsdevel@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>
Subject: Filesystem setattr/truncate notes and problems
Date: Thu, 3 Jun 2010 06:08:55 +1000 [thread overview]
Message-ID: <20100602200855.GH6152@laptop> (raw)
In-Reply-To: <20100602195538.GG6152@laptop>
On Thu, Jun 03, 2010 at 05:55:38AM +1000, Nick Piggin wrote:
> Well I looked through filesystems and I cannot seem to find anywhere
> they can use ATTR_SIZE|ATTR_[MC]TIME to do anything strange that would
> let them distinguish truncate from ftruncate and O_TRUNC.
>
> I couldn't quite understand what FUSE was trying to do, or whether I
> improved it.
Looking through the filesystems, I thought I found a number of issues.
I'll try to list what I understand to be important steps required in
setattr. Unless I'm way off, a lot of filesystems get these wrong, so
we may want to add some guideline to docs?
Anyway feedback on these would be appreciated. Below this there are
some questions about specific filesystems too.
Filesystems setattr methods should:
- Perform idempotent error checks as early as possible. Perform actual
i_size change and pagecache truncation last[*]. Truncate means dirty
pagecache is fine to be thrown out at any time[**], what is important
is that it is not thrown out before the operation can complete (eg. by
doing the on-disk truncation).
[*] Pagecache truncation is irreversable unless the filesystem can
guarantee that the page is not mapped and has no dangling
get_user_pages references (it's almost impossible to check this
properly unless get_user_pages is disallowed from the start) or the
file has been readonly. Just writing back the pagecache does not
guarantee no data loss window. So it's probably easiest to do
pagecache truncation last.
[**] Be careful, this might mean the page is subject to writeout
inside i_size after the on-disk truncation. Buffer based filesystems
will probably need rework to handle this properly.
- Update mtime and ctime IFF ATTR_MTIME and ATTR_CTIME are set in
setattr.
- Update mode before owner or size if mode update can fail. Otherwise
killing of suid could fail after owner/size has changed (ideally
these would be all done atomically). Although we do already have weird
races in permission checking in core vfs code anyway.
- Perform time modifications iff others have/will succeeded (these are
applicable for chmod, chown, etc too). Do not succeed a truncate and
then error out because the time cannot be set. Do not set the time but
fail the truncate or chown etc.
- If ATTR_SIZE change cannot succeed, don't ignore it, return something
(like -EPERM or -EINVAL), right? (lots of filesystems just mask it
off, maybe that is preferable behaviour sometimes?)
- Remember to actually trim off disk blocks or underlying object past
i_size in response to setattr shrinking the file.
- In case of write_begin/write_end/direct_IO failure, trim object past
i_size if it was possible to have been allocated before the operation
fails.
Interesting specific filesystem questions: (not comprehensive)
ecryptfs improve mtime/ctime handling for truncate of lower inode
fuse, should be setting ctime on truncate?
gfs2, mtime and ctime should only be set in truncate if the ATTR_?TIME
are set.
hostfs, don't ignore ATTR_SIZE. ftruncate/truncate have interesting
different semantics for timestamp updates, so don't use fd != -1 for
these (could use utimes for this, but it causes an atomicity/error
handling issue).
hpfs could use cont_expand to do expanding truncates?
ncpfs, nfs smbfs, cifs seems to perform inode size checks
(inode_newsize_ok) after truncating the file on the server??
ntfs does not do inode size checks properly.
logfs logfs_truncate without doing inode_change_ok, inode_newsize_ok etc
can fail inode_setattr after successful truncate (truncate(2) would
return failure and yet the file would be truncated).
procfs why not return -EPERM rather than setting size?
reiserfs should do all setattr checks early as possible
(inode_change_ok, inode_newsize_ok).
ubifs should not update access times unconditionally when ATTR_SIZE is
set
ufs simple_setsize still destroys the pagecache and causes concurrent
reads to go EOF past updated i_size. Restoring old i_size is too late
I think. Should do those checks first (in fairness lots of filesystems
have bit problems with error handling, but ufs attempts a little bit and
gets it wrong).
xfs in the 0 length file shortcut, isn't this missing suid kill case?
(ftruncate mandatory mtime update seems like it wasn't working right
there either before this patch).
nfsd should not change attributes in the case of truncated file with no
size change?
next prev parent reply other threads:[~2010-06-02 20:08 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-06-01 13:39 [patch] fix truncate inode time modification breakage Nick Piggin
2010-06-01 13:48 ` Christoph Hellwig
2010-06-01 13:56 ` Nick Piggin
2010-06-02 19:55 ` [patch v2] " Nick Piggin
2010-06-02 20:08 ` Nick Piggin [this message]
2010-06-03 7:28 ` Filesystem setattr/truncate notes and problems Christoph Hellwig
2010-06-03 7:32 ` Christoph Hellwig
2010-06-03 9:50 ` Nick Piggin
2010-06-03 9:18 ` Nick Piggin
2010-06-03 9:26 ` Nick Piggin
2010-06-03 8:18 ` [patch v2] fix truncate inode time modification breakage Miklos Szeredi
2010-06-03 8:40 ` Boaz Harrosh
2010-06-03 9:05 ` Miklos Szeredi
2010-06-03 12:13 ` Boaz Harrosh
2010-06-03 9:14 ` Nick Piggin
2010-06-03 9:28 ` Miklos Szeredi
2010-06-03 10:07 ` Nick Piggin
2010-06-03 10:58 ` Miklos Szeredi
2010-06-03 11:09 ` Christoph Hellwig
2010-06-03 12:01 ` [patch v3] " Nick Piggin
2010-06-03 11:49 ` [patch v2] " Nick Piggin
2010-06-03 12:03 ` Miklos Szeredi
2010-06-01 14:10 ` [patch] " Boaz Harrosh
2010-06-01 14:32 ` Nick Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100602200855.GH6152@laptop \
--to=npiggin@suse.de \
--cc=hch@lst.de \
--cc=linux-fsdevel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).