From: Al Viro <viro@ZenIV.linux.org.uk>
To: Eiichi Tsukata <devel@etsukata.com>
Cc: andi@firstfloor.org, Chris Mason <clm@fb.com>,
Josef Bacik <josef@toxicpanda.com>,
David Sterba <dsterba@suse.com>, Theodore Ts'o <tytso@mit.edu>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Jaegeuk Kim <jaegeuk@kernel.org>, Chao Yu <yuchao0@huawei.com>,
Miklos Szeredi <miklos@szeredi.hu>,
Bob Peterson <rpeterso@redhat.com>,
Andreas Gruenbacher <agruenba@redhat.com>,
linux-btrfs@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-f2fs-devel@lists.sourceforge.net,
linux-fsdevel@vger.kernel.org, cluster-devel@redhat.com,
linux-unionfs@vger.kernel.org
Subject: Re: [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write
Date: Thu, 22 Nov 2018 07:06:50 +0000 [thread overview]
Message-ID: <20181122070650.GN32577@ZenIV.linux.org.uk> (raw)
In-Reply-To: <CANhTXPQzKmsyPO9QrGz6eijjuMFzLN4BhUV=6ABDJysX0xyKfw@mail.gmail.com>
On Thu, Nov 22, 2018 at 02:40:50PM +0900, Eiichi Tsukata wrote:
> 2018年11月21日(水) 13:54 Al Viro <viro@zeniv.linux.org.uk>:
> >
> > On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > > Some file systems (including ext4, xfs, ramfs ...) have the following
> > > problem as I've described in the commit message of the 1/4 patch.
> > >
> > > The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> > > removed almost all locks in llseek() including SEEK_END. It based on the
> > > idea that write() updates size atomically. But in fact, write() can be
> > > divided into two or more parts in generic_perform_write() when pos
> > > straddles over the PAGE_SIZE, which results in updating size multiple
> > > times in one write(). It means that llseek() can see the size being
> > > updated during write().
> >
> > And? Who has ever promised anything that insane? write(2) can take an arbitrary
> > amount of time; another process doing lseek() on independently opened descriptor
> > is *not* going to wait for that (e.g. page-in of the buffer being written, which
> > just happens to be mmapped from a file on NFS over RFC1149 link).
>
> Thanks.
>
> The lock I added in NFS was nothing but slow down lseek() because a file size is
> updated atomically. Even `spin_lock(&inode->i_lock)` is unnecessary.
>
> I'll fix the commit message which only refers to specific local file
> systems that use
> generic_perform_write() and remove unnecessary locks in some
> distributed file systems
> (e.g. nfs, cifs, or more) by replacing generic_file_llseek() with
> generic_file_llseek_unlocked()
> so that `tail` don't have to wait for avian carriers.
fd = open("/mnt/sloooow/foo", O_RDONLY);
p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
local_fd = open("/tmp/foo", O_RDWR);
write(local_fd, p, 8192);
and there you go - extremely slow write(2). To file on a local filesystem.
Can you show me where does POSIX/SuS/whatever it's called these days promise
that kind of atomicity? TBH, I would be very surprised if any Unix promised
to have file size updated no more than once per write(2). Any userland code
that relies on such property is, as minimum, non-portable and I would argue
that it is outright broken.
Note, BTW, that your example depends upon rather non-obvious details of echo
implementation, starting with the bufferization logics used by particular
shell. And AFAICS ash(1) ends up with possibility of more than one write(2)
anyway - get SIGSTOP delivered, followed by SIGCONT and you've got just that.
Transparently for echo(1). I'm fairly sure that the same holds for anything
that isn't outright broken - write(2) *IS* interruptible (must be, for obvious
reasons) and that pair of signals will lead to short write if it comes at the
right time. The only thing userland can do (and does) is to issue further
write(2) on anything that hadn't been written the last time around.
next prev parent reply other threads:[~2018-11-22 17:45 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-21 2:43 [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write Eiichi Tsukata
2018-11-21 2:43 ` [PATCH v1 1/4] vfs: " Eiichi Tsukata
2018-11-21 2:43 ` [PATCH v1 2/4] ext4: " Eiichi Tsukata
2018-11-21 2:43 ` [PATCH v1 3/4] f2fs: " Eiichi Tsukata
2018-11-21 9:23 ` Christoph Hellwig
2018-11-22 5:42 ` Eiichi Tsukata
2018-11-21 2:44 ` [PATCH v1 4/4] overlayfs: " Eiichi Tsukata
2018-11-21 4:54 ` [PATCH v1 0/4] fs: " Al Viro
2018-11-22 5:40 ` Eiichi Tsukata
2018-11-22 7:06 ` Al Viro [this message]
2018-11-26 2:54 ` Eiichi Tsukata
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181122070650.GN32577@ZenIV.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=adilger.kernel@dilger.ca \
--cc=agruenba@redhat.com \
--cc=andi@firstfloor.org \
--cc=clm@fb.com \
--cc=cluster-devel@redhat.com \
--cc=devel@etsukata.com \
--cc=dsterba@suse.com \
--cc=jaegeuk@kernel.org \
--cc=josef@toxicpanda.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-f2fs-devel@lists.sourceforge.net \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-unionfs@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=rpeterso@redhat.com \
--cc=tytso@mit.edu \
--cc=yuchao0@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).