From: "Dilger, Andreas" <andreas.dilger@intel.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"Drokin, Oleg" <oleg.drokin@intel.com>,
Peng Tao <bergwolf@gmail.com>, "greg@kroah.com" <greg@kroah.com>,
Al Viro <viro@ZenIV.linux.org.uk>
Subject: Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
Date: Tue, 11 Feb 2014 11:01:41 +0000
Message-ID: <CF1F4D88.9158F%andreas.dilger@intel.com>
In-Reply-To: <20140211091342.GB25567@infradead.org>

On 2014/02/11, 2:13 AM, "Christoph Hellwig" <hch@infradead.org> wrote:
>On Mon, Feb 10, 2014 at 09:29:29PM +0000, Al Viro wrote:
>> I can live with that; it's a kludge, but it's less broken than that
>> explicit constant - that one is a non-starter, since O_... flag
>> values are arch-dependent.
>
>Grabbing their own O_FLAG is of course not acceptable at all.
>Personally I don't think this version is acceptable for real mainline
>either. What exactly are the semantics of the flag? Why don't you do
>object allocation on demand like all delalloc filesystems by default?

This was described in the original patch and follow-on email, but I'll
repeat it here and expand on the detail a bit further:

In kernel 3.11 O_TMPFILE was introduced, but its open flag value
conflicts with the O_LOV_DELAY_CREATE flag 020000000 previously used
by Lustre-aware applications. O_LOV_DELAY_CREATE allows applications
to defer file layout and object creation from open time (the default)
until the application specifies the layout itself using an ioctl.
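
To make the conflict concrete, here is a minimal userspace sketch in C.
It is an illustration only, not code from the patch: the helper names
and the OLD_O_LOV_DELAY_CREATE macro are hypothetical, the value is the
one quoted above, and O_TMPFILE is arch-dependent, so the result of
old_flag_collides_with_tmpfile() may differ between architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical name for the old flag value quoted in this mail. */
#define OLD_O_LOV_DELAY_CREATE 020000000

/* Two open flags collide if they share at least one bit. */
bool open_flags_collide(int a, int b)
{
	return (a & b) != 0;
}

/*
 * Check the old Lustre value against this system's O_TMPFILE.
 * Returns false if the C library headers do not define O_TMPFILE.
 */
bool old_flag_collides_with_tmpfile(void)
{
#ifdef O_TMPFILE
	return open_flags_collide(OLD_O_LOV_DELAY_CREATE, O_TMPFILE);
#else
	return false;
#endif
}
```

Because the kernel cannot tell which meaning the caller intended when
the bits overlap, an old binary passing the Lustre value would be
treated as asking for O_TMPFILE semantics.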

The main goal of the O_LOV_DELAY_CREATE flag is to allow the file to be
opened in a "preliminary" manner so that the application can specify the
layout of the file across the Lustre storage targets (e.g. whether the
app has millions of separate files, each one written to a single server,
or a single huge file spread across all of the servers, or some
combination of the two, whether it is RAID-0 or RAID-1, or whatever).

FYI, an "object" in Lustre is not a fixed-size chunk of space, as in
Ceph or HDFS, that needs to be continuously allocated as a file grows,
but rather a variable-sized inode-without-a-name that is written at
arbitrary byte offsets and can be sparse, so there is no need for
the client and metadata server to communicate after the initial
file layout has been decided.

The Lustre object(s) are normally allocated by the metadata server at
open time to avoid RPC round-trips and lock contention for files opened
by large numbers of nodes at once. The layout is normally specified by
the filesystem default, or on the parent directory, but some applications
need fine-grained control over the layout to optimize for a particular
filesystem configuration.

Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
or define a Lustre-specific flag that isn't of use to most/any other
filesystems, use (O_NOCTTY|FASYNC) as the new value. These flags
are not meaningful for newly-created regular files, so reusing them
should be OK since O_LOV_DELAY_CREATE is only meaningful for new files.
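
A sketch of how the combined value could be tested, assuming both bits
must be set for the request to count. The macro name and value are the
ones given above; wants_delayed_create() is a hypothetical helper, and
the actual patch may perform the check differently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* glibc only defines FASYNC with default/GNU feature macros enabled. */
#ifndef FASYNC
#define FASYNC O_ASYNC
#endif

/* New value proposed in this mail: a combination of two existing flags. */
#define O_LOV_DELAY_CREATE (O_NOCTTY | FASYNC)

/*
 * Treat the open as a delayed-create request only if BOTH bits are set,
 * so that a plain O_NOCTTY or a plain FASYNC open is not misread.
 */
bool wants_delayed_create(int open_flags)
{
	return (open_flags & O_LOV_DELAY_CREATE) == O_LOV_DELAY_CREATE;
}
```

Using symbolic names rather than a numeric constant keeps the value
correct on every architecture, since O_NOCTTY and FASYNC are themselves
arch-dependent.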

I looked into using O_ACCMODE/FMODE_WRITE_IOCTL, which allows calling
ioctl() on the minimally-opened fd and is close to what is needed,
but that doesn't allow specifying the actual read or write mode for
the file, and fcntl(F_SETFL) doesn't allow O_RDONLY/O_WRONLY/O_RDWR
to be set after the file is opened.
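
The F_SETFL limitation can be seen from userspace with a small C sketch.
accmode_after_setfl() is a hypothetical helper name; it relies on the
documented Linux behaviour that F_SETFL ignores the access mode bits
in its argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open an existing file read-only, try to "upgrade" it to O_RDWR with
 * F_SETFL, and return the access mode the fd ends up with.
 * Returns -1 on any error.
 */
int accmode_after_setfl(const char *path)
{
	assert(path != NULL);

	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The call succeeds, but the access mode bits are ignored. */
	if (fcntl(fd, F_SETFL, O_RDWR) < 0) {
		close(fd);
		return -1;
	}

	int flags = fcntl(fd, F_GETFL);
	close(fd);
	if (flags < 0)
		return -1;

	return flags & O_ACCMODE;
}
```

On Linux the returned mode is still O_RDONLY, which is why the access
mode has to be given in the original open() call.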

We want to avoid needing lots of syscalls to do this, since each one
translates into extra RPCs, which add up when creating potentially
millions of files over the network.
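
The intended two-call flow (one open, one layout call) might look
roughly like the C sketch below. This is an assumption-laden
illustration: set_layout_fn, noop_layout() and create_with_layout() are
hypothetical names, and the callback merely stands in for the
Lustre-specific layout ioctl, which is not shown here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef FASYNC
#define FASYNC O_ASYNC
#endif
#define O_LOV_DELAY_CREATE (O_NOCTTY | FASYNC)

/* Hypothetical callback standing in for the Lustre layout ioctl. */
typedef int (*set_layout_fn)(int fd, const void *layout);

/* Placeholder that does nothing; a real caller would issue the ioctl. */
int noop_layout(int fd, const void *layout)
{
	(void)fd;
	(void)layout;
	return 0;
}

/*
 * Create the file with its final access mode but without objects,
 * then set the layout. Returns the open fd, or -1 on error.
 */
int create_with_layout(const char *path, int access_mode,
		       set_layout_fn set_layout, const void *layout)
{
	assert(path != NULL && set_layout != NULL);

	/* Syscall 1: access mode and delayed-create request together. */
	int fd = open(path, O_CREAT | access_mode | O_LOV_DELAY_CREATE, 0644);
	if (fd < 0)
		return -1;

	/* Syscall 2: the application chooses the layout. */
	if (set_layout(fd, layout) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

On a non-Lustre filesystem the extra flag bits have no effect on a
regular file, so the sketch simply returns a normally opened fd.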

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division