Linux userland API discussions
 help / color / mirror / Atom feed
* Re: kdbus: add node and filesystem implementation
From: Sasha Levin @ 2014-11-21 17:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman, David Herrmann
  Cc: Arnd Bergmann, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Daniel Mack, Djalal Harouni
In-Reply-To: <20141121165638.GA24866-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On 11/21/2014 11:56 AM, Greg Kroah-Hartman wrote:
>>> Maybe it's worth basing your git tree on top of Al's rather than a random
>>> > > -rc, since it's now a filesystem?
>> > 
>> > Sure, sounds good.
> No, I'll keep it as is, we can handle the merge issues later when it
> hits Linus's tree, this makes it easier for me and others to test it
> out properly.

It should be hitting -next, not Linus's tree. This is why we have an
integration tree, no?


Thanks,
Sasha

^ permalink raw reply

* Re: kdbus: add documentation
From: Andy Lutomirski @ 2014-11-21 17:12 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Arnd Bergmann, Eric W. Biederman, One Thousand Gnomes,
	Tom Gundersen, Jiri Kosina, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Daniel Mack,
	David Herrmann, Djalal Harouni
In-Reply-To: <1416546149-24799-2-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On Thu, Nov 20, 2014 at 9:02 PM, Greg Kroah-Hartman
<gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
>
> kdbus is a system for low-latency, low-overhead, easy to use
> interprocess communication (IPC).
>
> The interface to all functions in this driver is implemented through
> ioctls on files exposed through the mount point of a kdbusfs.  This
> patch adds detailed documentation about the kernel level API design.
>
> Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> Signed-off-by: David Herrmann <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>
> ---

> +  Pool:
> +    Each connection allocates a piece of shmem-backed memory that is used
> +    to receive messages and answers to ioctl command from the kernel. It is
> +    never used to send anything to the kernel. In order to access that memory,
> +    userspace must mmap() it into its task.
> +    See section 12 for more details.

At the risk of opening a can of worms, wouldn't this be much more
useful if you could share a pool between multiple connections?


> +
> +
> +4. Items
> +===============================================================================
> +
> +To flexibly augment transport structures used by kdbus, data blobs of type
> +struct kdbus_item are used. An item has a fixed-sized header that only stores
> +the type of the item and the overall size. The total size is variable and is
> +in some cases defined by the item type, in other cases, they can be of
> +arbitrary length (for instance, a string).
> +
> +In the external kernel API, items are used for many ioctls to transport
> +optional information from userspace to kernelspace. They are also used for
> +information stored in a connection's pool, such as messages, name lists or
> +requested connection information.
> +
> +In all such occasions where items are used as part of the kdbus kernel API,
> +they are embedded in structs that have an overall size of their own, so there
> +can be many of them.
> +
> +The kernel expects all items to be aligned to 8-byte boundaries.
> +
> +A simple iterator in userspace would iterate over the items until the items
> +have reached the embedding structure's overall size. An example implementation
> +of such an iterator can be found in tools/testing/selftests/kdbus/kdbus-util.h.

It looks like many (all?) item consumers ignore unknown items.  This
seems like a compatbility problem.

Would it be better to have a bit in each item that toggles between
"ignore me if you don't recognize me" and "error out if you don't
recognize me"?

> +KDBUS_CMD_BUS_MAKE, and KDBUS_CMD_ENDPOINT_MAKE take a
> +struct kdbus_cmd_make argument.
> +
> +struct kdbus_cmd_make {
> +  __u64 size;
> +    The overall size of the struct, including its items.
> +
> +  __u64 flags;
> +    The flags for creation.
> +
> +    KDBUS_MAKE_ACCESS_GROUP
> +      Make the device file group-accessible
> +
> +    KDBUS_MAKE_ACCESS_WORLD
> +      Make the device file world-accessible

This thing is a file.  What's wrong with using a normal POSIX mode?
(And what to the read, write, and exec modes do?)

> +
> +
> +6.2 Creating connections
> +------------------------
> +
> +A connection to a bus is created by opening an endpoint file of a bus and
> +becoming an active client with the KDBUS_CMD_HELLO ioctl. Every connected client
> +connection has a unique identifier on the bus and can address messages to every
> +other connection on the same bus by using the peer's connection id as the
> +destination.
> +
> +The KDBUS_CMD_HELLO ioctl takes the following struct as argument.
> +
> +struct kdbus_cmd_hello {
> +  __u64 size;
> +    The overall size of the struct, including all attached items.
> +
> +  __u64 conn_flags;
> +    Flags to apply to this connection:
> +
> +    KDBUS_HELLO_ACCEPT_FD
> +      When this flag is set, the connection can be sent file descriptors
> +      as message payload. If it's not set, any attempt of doing so will
> +      result in -ECOMM on the sender's side.
> +
> +    KDBUS_HELLO_ACTIVATOR
> +      Make this connection an activator (see below). With this bit set,
> +      an item of type KDBUS_ITEM_NAME has to be attached which describes
> +      the well-known name this connection should be an activator for.
> +
> +    KDBUS_HELLO_POLICY_HOLDER
> +      Make this connection a policy holder (see below). With this bit set,
> +      an item of type KDBUS_ITEM_NAME has to be attached which describes
> +      the well-known name this connection should hold a policy for.
> +
> +    KDBUS_HELLO_MONITOR
> +      Make this connection an eaves-dropping connection that receives all
> +      unicast messages sent on the bus. To also receive broadcast messages,
> +      the connection has to upload appropriate matches as well.
> +      This flag is only valid for privileged bus connections.
> +
> +  __u64 attach_flags_send;
> +      Set the bits for metadata this connection permits to be sent to the
> +      receiving peer. Only metadata items that are both allowed to be sent by
> +      the sender and that are requested by the receiver will effectively be
> +      attached to the message eventually. Note, however, that the bus may
> +      optionally enforce some of those bits to be set. If the match fails,
> +      -ECONNREFUSED will be returned. In either case, this field will be set
> +      to the mask of metadata items that are enforced by the bus. The
> +      KDBUS_FLAGS_KERNEL bit will as well be set.
> +
> +  __u64 attach_flags_recv;
> +      Request the attachment of metadata for each message received by this
> +      connection. The metadata actually attached may actually augment the list
> +      of requested items. See section 13 for more details.
> +
> +  __u64 bus_flags;
> +      Upon successful completion of the ioctl, this member will contain the
> +      flags of the bus it connected to.
> +
> +  __u64 id;
> +      Upon successful completion of the ioctl, this member will contain the
> +      id of the new connection.
> +
> +  __u64 pool_size;
> +      The size of the communication pool, in bytes. The pool can be accessed
> +      by calling mmap() on the file descriptor that was used to issue the
> +      KDBUS_CMD_HELLO ioctl.
> +
> +  struct kdbus_bloom_parameter bloom;
> +      Bloom filter parameter (see below).
> +
> +  __u8 id128[16];
> +      Upon successful completion of the ioctl, this member will contain the
> +      128 bit wide UUID of the connected bus.
> +
> +  struct kdbus_item items[0];
> +      Variable list of items to add optional additional information. The
> +      following items are currently expected/valid:
> +
> +      KDBUS_ITEM_CONN_DESCRIPTION
> +        Contains a string to describes this connection's name, so it can be
> +        identified later.
> +
> +      KDBUS_ITEM_NAME
> +      KDBUS_ITEM_POLICY_ACCESS
> +        For activators and policy holders only, combinations of these two
> +        items describe policy access entries (see section about policy).
> +
> +      KDBUS_ITEM_CREDS
> +      KDBUS_ITEM_SECLABEL
> +        Privileged bus users may submit these types in order to create
> +        connections with faked credentials. The only real use case for this
> +        is a proxy service which acts on behalf of some other tasks. For a
> +        connection that runs in that mode, the message's metadata items will
> +        be limited to what's specified here. See section 13 for more
> +        information.

This is still confusing.  There are multiple places in which metadata
is attached.  Which does this apply to?  And why are only creds and
seclabel listed?


> +
> +6.4 Retrieving information on a connection
> +------------------------------------------
> +
> +The KDBUS_CMD_CONN_INFO ioctl can be used to retrieve credentials and
> +properties of the initial creator of a connection. This ioctl uses the
> +following struct:
> +
> +struct kdbus_cmd_info {
> +  __u64 size;
> +    The overall size of the struct, including the name with its 0-byte string
> +    terminator.
> +
> +  __u64 flags;
> +    Specify which metadata items should be attached to the answer.
> +    See section 13 for more details.
> +
> +    After the ioctl returns, this field will contain the current metadata
> +    attach flags of the connection.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __u64 id;
> +    The connection's numerical ID to retrieve information for. If set to
> +    non-zero value, the 'name' field is ignored.
> +
> +  __u64 offset;
> +    When the ioctl returns, this value will yield the offset of the connection
> +    information inside the caller's pool.
> +
> +  struct kdbus_item items[0];
> +    The optional item list, containing the well-known name to look up as
> +    a KDBUS_ITEM_OWNED_NAME. Only required if the 'id' field is set to 0.
> +    All other items are currently ignored.
> +};
> +
> +After the ioctl returns, the following struct will be stored in the caller's
> +pool at 'offset'.
> +
> +struct kdbus_info {
> +  __u64 size;
> +    The overall size of the struct, including all its items.
> +
> +  __u64 id;
> +    The connection's unique ID.
> +
> +  __u64 flags;
> +    The connection's flags as specified when it was created.
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  struct kdbus_item items[0];
> +    Depending on the 'flags' field in struct kdbus_cmd_info, items of
> +    types KDBUS_ITEM_OWNED_NAME and KDBUS_ITEM_CONN_DESCRIPTION are followed
> +    here.
> +};
> +
> +Once the caller is finished with parsing the return buffer, it needs to call
> +KDBUS_CMD_FREE for the offset.
> +
> +
> +6.5 Getting information about a connection's bus creator
> +--------------------------------------------------------
> +
> +The KDBUS_CMD_BUS_CREATOR_INFO ioctl takes the same struct as
> +KDBUS_CMD_CONN_INFO but is used to retrieve information about the creator of
> +the bus the connection is attached to. The metadata returned by this call is
> +collected during the creation of the bus and is never altered afterwards, so
> +it provides pristine information on the task that created the bus, at the
> +moment when it did so.

What's this for?  I understand the need for the creator of busses to
be authenticated, but doing it like this mean that anyone who will
*fail* authentication can DoS the authentic creator.

> +
> +7.3 Passing of Payload Data
> +---------------------------
> +
> +When connecting to the bus, receivers request a memory pool of a given size,
> +large enough to carry all backlog of data enqueued for the connection. The
> +pool is internally backed by a shared memory file which can be mmap()ed by
> +the receiver.
> +
> +KDBUS_MSG_PAYLOAD_VEC:
> +  Messages are directly copied by the sending process into the receiver's pool,
> +  that way two peers can exchange data by effectively doing a single-copy from
> +  one process to another, the kernel will not buffer the data anywhere else.
> +
> +KDBUS_MSG_PAYLOAD_MEMFD:
> +  Messages can reference memfd files which contain the data.
> +  memfd files are tmpfs-backed files that allow sealing of the content of the
> +  file, which prevents all writable access to the file content.
> +  Only sealed memfd files are accepted as payload data, which enforces
> +  reliable passing of data; the receiver can assume that neither the sender nor
> +  anyone else can alter the content after the message is sent.

This should specify *which* seals are checked.

> +
> +Apart from the sender filling-in the content into memfd files, the data will
> +be passed as zero-copy from one process to another, read-only, shared between
> +the peers.
> +
> +
> +7.4 Receiving messages
> +----------------------
> +
> +Messages are received by the client with the KDBUS_CMD_MSG_RECV ioctl. The
> +endpoint file of the bus supports poll() to wake up the receiving process when
> +new messages are queued up to be received.
> +
> +With the KDBUS_CMD_MSG_RECV ioctl, a struct kdbus_cmd_recv is used.
> +
> +struct kdbus_cmd_recv {
> +  __u64 flags;
> +    Flags to control the receive command.
> +
> +    KDBUS_RECV_PEEK
> +      Just return the location of the next message. Do not install file
> +      descriptors or anything else. This is usually used to determine the
> +      sender of the next queued message.
> +
> +    KDBUS_RECV_DROP
> +      Drop the next message without doing anything else with it, and free the
> +      pool slice. This a short-cut for KDBUS_RECV_PEEK and KDBUS_CMD_FREE.
> +
> +    KDBUS_RECV_USE_PRIORITY
> +      Use the priority field (see below).
> +
> +  __u64 kernel_flags;
> +    Valid flags for this command, returned by the kernel upon each call.
> +
> +  __s64 priority;
> +      With KDBUS_RECV_USE_PRIORITY set in flags, receive the next message in
> +      the queue with at least the given priority. If no such message is waiting
> +      in the queue, -ENOMSG is returned.
> +
> +  __u64 offset;
> +      Upon return of the ioctl, this field contains the offset in the
> +      receiver's memory pool.
> +};
> +
> +Unless KDBUS_RECV_DROP was passed, and given that the ioctl succeeded, the
> +offset field contains the location of the new message inside the receiver's
> +pool. The message is stored as struct kdbus_msg at this offset, and can be
> +interpreted with the semantics described above.

I'm confused here.  Is sent data written to the pool when send is
called or when recv is called?

If the former, what prevents DoS, especially DoS due to sending too many fds?

If the latter, where is the data buffered in the mean time?

> +
> +Also, if the connection allowed for file descriptor to be passed
> +(KDBUS_HELLO_ACCEPT_FD), and if the message contained any, they will be
> +installed into the receiving process after the KDBUS_CMD_MSG_RECV ioctl
> +returns. The receiving task is obliged to close all of them appropriately.

This makes it sound like fds are installed at receive time.  What
prevents resource exhaustion due to having excessive numbers of fds in
transit (that are presumably not accounted to anyone)?

> +
> +7.5 Canceling messages synchronously waiting for replies
> +--------------------------------------------------------
> +
> +When a connection sends a message with KDBUS_MSG_FLAGS_SYNC_REPLY and
> +blocks while waiting for the reply, the KDBUS_CMD_MSG_CANCEL ioctl can be
> +used on the same file descriptor to cancel the message, based on its cookie.
> +If there are multiple messages with the same cookie that are all synchronously
> +waiting for a reply, all of them will be canceled. Obviously, this is only
> +possible in multi-threaded applications.

What does "cancel the message" mean?  Does it just mean that the wait
for the reply is cancelled?

> +11. Policy
> +===============================================================================
> +
> +A policy databases restrict the possibilities of connections to own, see and
> +talk to well-known names. It can be associated with a bus (through a policy
> +holder connection) or a custom endpoint.

ISTM metadata items on bus names should be replaced with policy that
applies to the domain as a whole and governs bus creation.

> +A set of policy rules is described by a name and multiple access rules, defined
> +by the following struct.
> +
> +struct kdbus_policy_access {
> +  __u64 type;  /* USER, GROUP, WORLD */
> +    One of the following.
> +
> +    KDBUS_POLICY_ACCESS_USER
> +      Grant access to a user with the uid stored in the 'id' field.
> +
> +    KDBUS_POLICY_ACCESS_GROUP
> +      Grant access to a user with the gid stored in the 'id' field.
> +
> +    KDBUS_POLICY_ACCESS_WORLD
> +      Grant access to everyone. The 'id' field is ignored.
> +
> +  __u64 access;        /* OWN, TALK, SEE */
> +    The access to grant.
> +
> +    KDBUS_POLICY_SEE
> +      Allow the name to be seen.
> +
> +    KDBUS_POLICY_TALK
> +      Allow the name to be talked to.
> +
> +    KDBUS_POLICY_OWN
> +      Allow the name to be owned.
> +
> +  __u64 id;
> +    For KDBUS_POLICY_ACCESS_USER, stores the uid.
> +    For KDBUS_POLICY_ACCESS_GROUP, stores the gid.
> +};


What happens if there are multiple matches?


> +
> +11.4 TALK access and multiple well-known names per connection
> +-------------------------------------------------------------
> +
> +Note that TALK access is checked against all names of a connection.
> +For example, if a connection owns both 'org.foo.bar' and 'org.blah.baz', and
> +the policy database allows 'org.blah.baz' to be talked to by WORLD, then this
> +permission is also granted to 'org.foo.bar'. That might sound illogical, but
> +after all, we allow messages to be directed to either the name or a well-known
> +name, and policy is applied to the connection, not the name. In other words,
> +the effective TALK policy for a connection is the most permissive of all names
> +the connection owns.

This does seem illogical.  Does the recipient at least know which
well-known name was addressed?

> +11.5 Implicit policies
> +----------------------
> +
> +Depending on the type of the endpoint, a set of implicit rules might be
> +enforced. On default endpoints, the following set is enforced:
> +

How do these rules interact with installed policy?

> +  * Privileged connections always override any installed policy. Those
> +    connections could easily install their own policies, so there is no
> +    reason to enforce installed policies.
> +  * Connections can always talk to connections of the same user. This
> +    includes broadcast messages.

Why?  If anyone ever strengthens the concept of identity to include
things other than users (hmm -- there are already groups), this could
be very limiting.

> +  * Connections that own names might send broadcast messages to other
> +    connections that belong to a different user, but only if that
> +    destination connection does not own any name.
> +

This is weird.  It is also differently illogical than the "illogical"
thing above.

How about restricting access per name and making sure that the
receivers check what name was addressed before taking any action?

> +12. Pool
> +===============================================================================
> +
> +A pool for data received from the kernel is installed for every connection of
> +the bus, and is sized according to kdbus_cmd_hello.pool_size. It is accessed
> +when one of the following ioctls is issued:
> +
> +  * KDBUS_CMD_MSG_RECV, to receive a message
> +  * KDBUS_CMD_NAME_LIST, to dump the name registry
> +  * KDBUS_CMD_CONN_INFO, to retrieve information on a connection
> +
> +Internally, the pool is organized in slices, stored in an rb-tree. The offsets
> +returned by either one of the aforementioned ioctls describe offsets inside the
> +pool. In order to make the slice available for subsequent calls, KDBUS_CMD_FREE
> +has to be called on the offset.

Why are you documenting that the slices are stored in an rb-tree?
That's just an implementation details, right?

> +
> +To access the memory, the caller is expected to mmap() it to its task, like
> +this:
> +
> +  /*
> +   * POOL_SIZE has to be a multiple of PAGE_SIZE, and it must match the
> +   * value that was previously passed in the .pool_size field of struct
> +   * kdbus_cmd_hello.
> +   */
> +
> +  buf = mmap(NULL, POOL_SIZE, PROT_READ, MAP_PRIVATE, conn_fd, 0);
> +

Will mapping with PROT_WRITE fail?  What about MAP_SHARED?

And why are you suggesting MAP_PRIVATE?  That's just strange.

> +
> +13. Metadata
> +===============================================================================
> +
> +When a message is delivered to a receiver connection, it is augmented by
> +metadata items in accordance to the destination's current attach flags. The
> +information stored in those metadata items refer to the sender task at the
> +time of sending the message, so even if any detail of the sender task has
> +already changed upon message reception (or if the sender task does not exist
> +anymore), the information is still preserved and won't be modfied until the
> +message is freed.
> +
> +Note that there are two exceptions to the above rules:
> +
> +  a) Kernel generated messages don't have a source connection, so they won't be
> +     augmented.
> +
> +  b) If a connection was created with faked credentials (see section 6.2),
> +     the only attached metadata items are the ones provided by the connection
> +     itself. Other bits in the destination's attach_flags_recv won't have any
> +     effect in such cases.
> +
> +Also, there are two things to be considered by userspace programs regarding
> +those metadata items:
> +
> +  a) Userspace must cope with the fact that it might get more metadata than
> +     they requested. That happens, for example, when a broadcast message is
> +     sent and receivers have different attach flags. Items that haven't been
> +     requested should hence be silently ignored.
> +
> +  b) Userspace might not always get all requested metadata items that it
> +     requested. That is because some of those items are only added if a
> +     corresponding kernel feature has been enabled. Also, the two exceptions
> +     described above will as well lead to less items be attached than
> +     requested.
> +
> +
> +13.1 Known item types
> +---------------------
> +
> +The following attach flags are currently supported.
> +
> +  KDBUS_ATTACH_TIMESTAMP
> +    Attaches an item of type KDBUS_ITEM_TIMESTAMP which contains both the
> +    monotonic and the realtime timestamp, taken when the message was
> +    processed on the kernel side.
> +
> +  KDBUS_ATTACH_CREDS
> +    Attaches an item of type KDBUS_ITEM_CREDS, containing credentials as
> +    described in kdbus_creds: the uid, gid, pid, tid and starttime of the task.
> +

As mentioned last time, please remove or justify starttime.

> +  KDBUS_ATTACH_AUXGROUPS
> +    Attaches an item of type KDBUS_ITEM_AUXGROUPS, containing a dynamic
> +    number of auxiliary groups the sending task was a member of.
> +
> +  KDBUS_ATTACH_NAMES
> +    Attaches items of type KDBUS_ITEM_OWNED_NAME, one for each name the sending
> +    connection currently owns. The name and flags are stored in kdbus_item.name
> +    for each of them.
> +

That's interesting.  What's it for?

> +  KDBUS_ATTACH_TID_COMM
> +    Attaches an items of type KDBUS_ITEM_TID_COMM, transporting the sending
> +    task's 'comm', for the tid.  The string is stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_PID_COMM
> +    Attaches an items of type KDBUS_ITEM_PID_COMM, transporting the sending
> +    task's 'comm', for the pid.  The string is stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_EXE
> +    Attaches an item of type KDBUS_ITEM_EXE, containing the path to the
> +    executable of the sending task, stored in kdbus_item.str.
> +
> +  KDBUS_ATTACH_CMDLINE
> +    Attaches an item of type KDBUS_ITEM_CMDLINE, containing the command line
> +    arguments of the sending task, as an array of strings, stored in
> +    kdbus_item.str.

Please remove these four items.  They are genuinely useless.  Anything
that uses them for anything is either buggy or should have asked the
sender to put the value in the payload (and immediately wondered why
it was doing that).

> +
> +  KDBUS_ATTACH_CGROUP
> +    Attaches an item of type KDBUS_ITEM_CGROUP with the task's cgroup path.
> +
> +  KDBUS_ATTACH_CAPS
> +    Attaches an item of type KDBUS_ITEM_CAPS, carrying sets of capabilities
> +    that should be accessed via kdbus_item.caps.caps. Also, userspace should
> +    be written in a way that it takes kdbus_item.caps.last_cap into account,
> +    and derive the number of sets and rows from the item size and the reported
> +    number of valid capability bits.
> +

Please remove this, too, or justify its use.

> +  KDBUS_ATTACH_SECLABEL
> +    Attaches an item of type KDBUS_ITEM_SECLABEL, which contains the SELinux
> +    security label of the sending task. Access via kdbus_item->str.
> +

This one, too, and please justify why code that uses it will work in
containers an on non-selinux systems.

> +  KDBUS_ATTACH_AUDIT
> +    Attaches an item of type KDBUS_ITEM_AUDIT, which contains the audio label
> +    of the sending taskj. Access via kdbus_item->str.
> +

I will NAK the hell out of this until, at the very least, someone
documents what this means, how to parse it, what its stability rules
are, who is allowed to see the value it contains, why that value will
never evolve to become *more* security sensitive than it is now, etc.

Audit gets to do crazy sh*t because it's restricted to privileged
receivers.  This isn't restricted like that, so it doesn't deserve the
same dispensation.  (And, honestly, I'm not sure that audit really
deserves its free pass on making sense.)

> +  KDBUS_ATTACH_CONN_DESCRIPTION
> +    Attaches an item of type KDBUS_ITEM_CONN_DESCRIPTION that contains the
> +    sender connection's current name in kdbus_item.str.
> +

Which name?  Can't there be several?

> +
> +13.1 Metadata and namespaces
> +----------------------------
> +
> +Metadata such as PIDs, UIDs or GIDs are automatically translated to the
> +namespaces of the domain that is used to send a message over. The namespaces
> +of a domain are pinned at creation time, which is when the filesystem has been
> +mounted.
> +
> +Metadata items that cannot be translated are dropped.

What if the receiver said that the item was mandatory?


Thanks,
Andy

^ permalink raw reply

* Re: kdbus: add node and filesystem implementation
From: Greg Kroah-Hartman @ 2014-11-21 17:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: David Herrmann, Arnd Bergmann, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	One Thousand Gnomes, Tom Gundersen, Jiri Kosina, Andy Lutomirski,
	Linux API, linux-kernel, Daniel Mack, Djalal Harouni
In-Reply-To: <546F7077.4010200-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On Fri, Nov 21, 2014 at 12:03:51PM -0500, Sasha Levin wrote:
> On 11/21/2014 11:56 AM, Greg Kroah-Hartman wrote:
> >>> Maybe it's worth basing your git tree on top of Al's rather than a random
> >>> > > -rc, since it's now a filesystem?
> >> > 
> >> > Sure, sounds good.
> > No, I'll keep it as is, we can handle the merge issues later when it
> > hits Linus's tree, this makes it easier for me and others to test it
> > out properly.
> 
> It should be hitting -next, not Linus's tree. This is why we have an
> integration tree, no?

Yes, it will hit -next, and the merge issue can be resolved there.

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Andy Lutomirski @ 2014-11-21 18:09 UTC (permalink / raw)
  To: Pádraig Brady, Linux FS Devel; +Cc: Linux API
In-Reply-To: <546F4981.8080907-V8g9lnOeT5ydJdNcDFJN0w@public.gmane.org>

On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
> can result in both being removed as mv needs to emulate the
> move with an unlink of the source file.  This can only be done
> without races in the kernel and so mv was recently changed
> to not allow this operation at all.  mv could safely reintroduce
> this feature by leveraging a new flag for renameat() for which
> an illustrative/untested patch is attached.

ISTM the issue is that rename(2) does nothing if the source and dest
are hardlinks to each other.  Is that intentional?  I don't see that
behavior as required in the POSIX rename docs.

If we indeed need to keep that behavior around for legacy reasons,
then can we at least give RENAME_REMOVE a better name?

--Andy

>
> thanks,
> Pádraig.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Pádraig Brady @ 2014-11-21 18:39 UTC (permalink / raw)
  To: Andy Lutomirski, Linux FS Devel; +Cc: Linux API
In-Reply-To: <CALCETrXvVxvG+s39v+NMdvfkeb8YjbYjb6UXgDFg5ifYOjeKsA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 21/11/14 18:09, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
>> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
>> can result in both being removed as mv needs to emulate the
>> move with an unlink of the source file.  This can only be done
>> without races in the kernel and so mv was recently changed
>> to not allow this operation at all.  mv could safely reintroduce
>> this feature by leveraging a new flag for renameat() for which
>> an illustrative/untested patch is attached.
> 
> ISTM the issue is that rename(2) does nothing if the source and dest
> are hardlinks to each other.  Is that intentional?  I don't see that
> behavior as required in the POSIX rename docs.

It's surprising and annoying that existing systems do this,
but they're conforming to the wording of POSIX I thinkg as
it says that rename() does nothing when the source and dest
_file_ is the same. What POSIX really meant I suppose was _file name_.
Eric, perhaps an adjustment could be proposed to POSIX, as I can't
see anything relying on the current behavior?

> If we indeed need to keep that behavior around for legacy reasons,
> then can we at least give RENAME_REMOVE a better name?

Definitely not attached to the name :)
Let's see if we can change it without a flag, though I guess
that would mean that mv on older systems would
silently do nothing in this case.

thanks,
Pádraig.

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Andy Lutomirski @ 2014-11-21 19:50 UTC (permalink / raw)
  To: Greg Kroah-Hartman, arnd-r2nGTMty4D4,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io, teg-B22kvLQNl6c,
	jkosina-AlSwsSmVLrQ, luto-kltTT9wpgjJwATOyAt5JVQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	tixxdz-Umm1ozX2/EEdnm+yROfE0A
In-Reply-To: <1416546149-24799-8-git-send-email-gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

On 11/20/2014 09:02 PM, Greg Kroah-Hartman wrote:
> From: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> 
> A connection chooses which metadata it wants to have attached to each
> message it receives with kdbus_cmd_hello.attach_flags. The metadata
> will be attached as items to the messages. All metadata refers to
> information about the sending task at sending time, unless otherwise
> stated. Also, the metadata is copied, not referenced, so even if the
> sending task doesn't exist anymore at the time the message is received,
> the information is still preserved.

Namespace comments below.

> +
> +static int kdbus_meta_append_cred(struct kdbus_meta *meta,
> +				  const struct kdbus_domain *domain)
> +{
> +	struct kdbus_creds creds = {
> +		.uid = from_kuid_munged(domain->user_namespace, current_uid()),
> +		.gid = from_kgid_munged(domain->user_namespace, current_gid()),
> +		.pid = task_pid_nr_ns(current, domain->pid_namespace),
> +		.tid = task_tgid_nr_ns(current, domain->pid_namespace),

This is better than before -- at least it gets translation right part of
the way.  But it's still wrong if the receiver's namespace doesn't match
the domain.

Also, please move pid and tgid into their own item.  They suck for
reasons that have been beaten to death.  Let's make it possible to
deprecate them separately in the future.

> +		.starttime = current->start_time,

I'm repeating myself here, but... why?

> +static int kdbus_meta_append_auxgroups(struct kdbus_meta *meta,
> +				       const struct kdbus_domain *domain)
> +{
> +	struct group_info *info;
> +	struct kdbus_item *item;
> +	int i, ret = 0;
> +	u64 *gid;
> +
> +	info = get_current_groups();
> +	item = kdbus_meta_append_item(meta, KDBUS_ITEM_AUXGROUPS,
> +				      info->ngroups * sizeof(*gid));
> +	if (IS_ERR(item)) {
> +		ret = PTR_ERR(item);
> +		goto exit_put_groups;
> +	}
> +
> +	gid = (u64 *) item->data;
> +
> +	for (i = 0; i < info->ngroups; i++)
> +		gid[i] = from_kgid_munged(domain->user_namespace,
> +					  GROUP_AT(info, i));

Ditto.

> +static int kdbus_meta_append_exe(struct kdbus_meta *meta)

NAK.

> +{
> +	struct mm_struct *mm = get_task_mm(current);
> +	struct path *exe_path = NULL;
> +	char *pathname;
> +	int ret = 0;
> +	size_t len;
> +	char *tmp;
> +
> +	if (!mm)
> +		return -EFAULT;
> +
> +	down_read(&mm->mmap_sem);
> +	if (mm->exe_file) {
> +		path_get(&mm->exe_file->f_path);
> +		exe_path = &mm->exe_file->f_path;
> +	}
> +	up_read(&mm->mmap_sem);
> +
> +	if (!exe_path)
> +		goto exit_mmput;
> +
> +	tmp = (char *)__get_free_page(GFP_TEMPORARY | __GFP_ZERO);
> +	if (!tmp) {
> +		ret = -ENOMEM;
> +		goto exit_path_put;
> +	}
> +
> +	pathname = d_path(exe_path, tmp, PAGE_SIZE);

My NAK notwithstanding, the namespacing here is completely bogus.

> +static int kdbus_meta_append_cmdline(struct kdbus_meta *meta)

NAK

> +static int kdbus_meta_append_caps(struct kdbus_meta *meta)
> +{
> +	struct caps {
> +		u32 last_cap;
> +		struct {
> +			u32 caps[_KERNEL_CAPABILITY_U32S];
> +		} set[4];
> +	} caps;
> +	unsigned int i;
> +	const struct cred *cred = current_cred();
> +
> +	caps.last_cap = CAP_LAST_CAP;
> +
> +	for (i = 0; i < _KERNEL_CAPABILITY_U32S; i++) {
> +		caps.set[0].caps[i] = cred->cap_inheritable.cap[i];
> +		caps.set[1].caps[i] = cred->cap_permitted.cap[i];
> +		caps.set[2].caps[i] = cred->cap_effective.cap[i];
> +		caps.set[3].caps[i] = cred->cap_bset.cap[i];
> +	}

Please leave this in so that I can root every single kdbus-using system.
 It'll be lots of fun.

Snark aside, the correct fix is IMO to delete this function entirely.
Even if you could find a way to implement it safely (which will be
distinctly nontrivial), it seems like a bad idea to begin with.

> +#ifdef CONFIG_CGROUPS
> +static int kdbus_meta_append_cgroup(struct kdbus_meta *meta)
> +{
> +	char *buf, *path;
> +	int ret;
> +
> +	buf = (char *)__get_free_page(GFP_TEMPORARY | __GFP_ZERO);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	path = task_cgroup_path(current, buf, PAGE_SIZE);

This may have strange interactions with cgroupns.  It's fixable, though,
but only once you implement translation at receive time, and I think
you'll have to do that to get any of this to work right.

> +
> +	if (path)
> +		ret = kdbus_meta_append_str(meta, KDBUS_ITEM_CGROUP, path);
> +	else
> +		ret = -ENAMETOOLONG;
> +
> +	free_page((unsigned long) buf);
> +
> +	return ret;
> +}
> +#endif
> +
> +#ifdef CONFIG_AUDITSYSCALL
> +static int kdbus_meta_append_audit(struct kdbus_meta *meta,
> +				   const struct kdbus_domain *domain)
> +{
> +	struct kdbus_audit audit;
> +
> +	audit.loginuid = from_kuid_munged(domain->user_namespace,
> +					  audit_get_loginuid(current));
> +	audit.sessionid = audit_get_sessionid(current);
> +
> +	return kdbus_meta_append_data(meta, KDBUS_ITEM_AUDIT,
> +				      &audit, sizeof(audit));

So *that's* what audit means.  Please document this and consider
renaming it to something like AUDIT_LOGINUID_AND_SESSIONID.

> +#ifdef CONFIG_SECURITY
> +static int kdbus_meta_append_seclabel(struct kdbus_meta *meta)
> +{
> +	u32 len, sid;
> +	char *label;
> +	int ret;
> +
> +	security_task_getsecid(current, &sid);
> +	ret = security_secid_to_secctx(sid, &label, &len);
> +	if (ret == -EOPNOTSUPP)
> +		return 0;
> +	if (ret < 0)
> +		return ret;
> +
> +	if (label && len > 0)
> +		ret = kdbus_meta_append_data(meta, KDBUS_ITEM_SECLABEL,
> +					     label, len);

This thing needs a clear, valid use case.  I think that the use case
should document how non-enforcing mode is supposed to work, too.

Also, there should be a justification for why the LSM hooks by
themselves aren't good enough to remove the need for this.

> +
> +/**
> + * kdbus_meta_size() - calculate the size of an excerpt of a metadata db
> + * @meta:	The database object containing the metadata

What is a "database object containing the metadata"?

Anyway, this is unreviewable because not only is the context in which
this function called not specified in this patch, but the function has
no callers in this patch.  However...

> + * @conn_dst:	The connection that is about to receive the data

...this suggests that this is called in the recipient's context, so...

> + * @mask:	Pointer to KDBUS_ATTACH_* bitmask to calculate the size for.
> + *		Callers *must* use the same mask for calls to
> + *		kdbus_meta_write().
> + *
> + * Return: the size in bytes the masked data will consume. Data that should
> + * not received by @conn_dst will be filtered out.
> + */
> +size_t kdbus_meta_size(const struct kdbus_meta *meta,
> +		       const struct kdbus_conn *conn_dst,
> +		       u64 *mask)
> +{
> +	struct kdbus_domain *domain = conn_dst->ep->bus->domain;
> +	const struct kdbus_item *item;
> +	size_t size = 0;
> +
> +	/*
> +	 * We currently don't have a way to translate capability flags between
> +	 * user namespaces, so let's drop these items in such cases.
> +	 */
> +	if (domain->user_namespace != current_user_ns())
> +		*mask &= ~KDBUS_ATTACH_CAPS;

...this is still completely wrong, and I'll be able to root kdbus systems :)

> +
> +	/*
> +	 * If the domain was created with hide_pid enabled, drop all items
> +	 * except for such not revealing anything about the task.
> +	 */
> +	if (domain->pid_namespace->hide_pid)
> +		*mask &= KDBUS_ATTACH_TIMESTAMP | KDBUS_ATTACH_NAMES |
> +			 KDBUS_ATTACH_CONN_DESCRIPTION;

Huh?  This looks wrong.

I realize that some of the systemd people seem to think that hide_pid is
unusable right now, but it really does seem to be usable.  Doing this,
however, will indeed make it unusable.

What are you trying to do here?  I think that the correct fix is to
remove support for all of the questionable metadata items and get rid of
this check.

--Andy

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Andy Lutomirski @ 2014-11-21 19:55 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: Linux FS Devel, Linux API
In-Reply-To: <546F86F5.6070305@draigBrady.com>

On Fri, Nov 21, 2014 at 10:39 AM, Pádraig Brady <P@draigbrady.com> wrote:
> On 21/11/14 18:09, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
>>> can result in both being removed as mv needs to emulate the
>>> move with an unlink of the source file.  This can only be done
>>> without races in the kernel and so mv was recently changed
>>> to not allow this operation at all.  mv could safely reintroduce
>>> this feature by leveraging a new flag for renameat() for which
>>> an illustrative/untested patch is attached.
>>
>> ISTM the issue is that rename(2) does nothing if the source and dest
>> are hardlinks to each other.  Is that intentional?  I don't see that
>> behavior as required in the POSIX rename docs.
>
> It's surprising and annoying that existing systems do this,
> but they're conforming to the wording of POSIX I thinkg as
> it says that rename() does nothing when the source and dest
> _file_ is the same. What POSIX really meant I suppose was _file name_.
> Eric, perhaps an adjustment could be proposed to POSIX, as I can't
> see anything relying on the current behavior?

Which Eric?

--Andy

>
>> If we indeed need to keep that behavior around for legacy reasons,
>> then can we at least give RENAME_REMOVE a better name?
>
> Definitely not attached to the name :)
> Let's see if we can change it without a flag, though I guess
> that would mean that mv on older systems would
> silently do nothing in this case.
>
> thanks,
> Pádraig.
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv9 4/5] sparc: Hook up execveat system call.
From: David Miller @ 2014-11-21 20:08 UTC (permalink / raw)
  To: drysdale
  Cc: sfr, ebiederm, luto, viro, meredydd, linux-kernel, akpm, tglx,
	oleg, mtk.manpages, mingo, hpa, keescook, arnd, dalias, hch, x86,
	linux-arch, linux-api, sparclinux
In-Reply-To: <CAHse=S--yNmP5sTr2kg_7kb2V8Lh2FioTSH_o23MstVrdt4Wmw@mail.gmail.com>

From: David Drysdale <drysdale@google.com>
Date: Thu, 20 Nov 2014 11:28:22 +0000

> On Wed, Nov 19, 2014 at 11:42 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>> Hi David,
>>
>> On Wed, 19 Nov 2014 17:27:51 +0000 David Drysdale <drysdale@google.com> wrote:
>>>
>>> diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
>>> index 580cde9370c9..15069cb35dac 100644
>>> --- a/arch/sparc/kernel/systbls_64.S
>>> +++ b/arch/sparc/kernel/systbls_64.S
>>> @@ -88,6 +88,7 @@ sys_call_table32:
>>>       .word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
>>>  /*340*/      .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
>>>       .word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
>>> +/*350*/      .word sys_execveat
>>
>> Shouldn't this be compat_sys_execveat?
> 
> Yeah, that would make sense -- thanks for spotting.
> 
> Looking at this more closely, I also wonder if we need a pair of wrappers
> analogous to sys(32|64)_execve -- they include a (pipeline-executed)
> flushw instruction after the jmpl to [compat_]sys_execve, which I guess
> might be needed.  So I could add something like the following to
> arch/sparc/kernel/syscalls.S:
> 
>   sys64_execveat:
>       set sys_execveat, %g1
>       jmpl %g1, %g0
>       flushw
>   #ifdef CONFIG_COMPAT
>   sys32_execveat:
>       set compat_sys_execveat, %g1
>       jmpl %g1, %g0
>       flushw
>   #endif
> 
> However, I don't speak SPARC and can't run the code -- any thoughts
> out there from someone who does & can?
> 
> Or maybe I should stop trying to be helpful with speculative patches, and
> just leave wiring up the sparc stuff to the experts...

You need wrappers for execve() like system calls yes, in order to flush all
of the user register windows to the stack before the kill the address space.

I think your code snippet above is fine, except that you should indent
the flushw by one space more because it is in the delay slot of a
control transfer instruction.

Thanks.

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Pádraig Brady @ 2014-11-21 20:48 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linux FS Devel, Linux API, Eric Blake
In-Reply-To: <CALCETrVf_7g_njnFGQx=xFwn=FyY851nnp0vmKFSmV=TusSOkQ@mail.gmail.com>

On 21/11/14 19:55, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 10:39 AM, Pádraig Brady <P@draigbrady.com> wrote:
>> On 21/11/14 18:09, Andy Lutomirski wrote:
>>> On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>>> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
>>>> can result in both being removed as mv needs to emulate the
>>>> move with an unlink of the source file.  This can only be done
>>>> without races in the kernel and so mv was recently changed
>>>> to not allow this operation at all.  mv could safely reintroduce
>>>> this feature by leveraging a new flag for renameat() for which
>>>> an illustrative/untested patch is attached.
>>>
>>> ISTM the issue is that rename(2) does nothing if the source and dest
>>> are hardlinks to each other.  Is that intentional?  I don't see that
>>> behavior as required in the POSIX rename docs.
>>
>> It's surprising and annoying that existing systems do this,
>> but they're conforming to the wording of POSIX I thinkg as
>> it says that rename() does nothing when the source and dest
>> _file_ is the same. What POSIX really meant I suppose was _file name_.
>> Eric, perhaps an adjustment could be proposed to POSIX, as I can't
>> see anything relying on the current behavior?
> 
> Which Eric?

The one I forgot to CC :/

thanks,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Eric Blake @ 2014-11-21 21:18 UTC (permalink / raw)
  To: Pádraig Brady, Andy Lutomirski; +Cc: Linux FS Devel, Linux API
In-Reply-To: <546FA51F.40503@draigBrady.com>

[-- Attachment #1: Type: text/plain, Size: 3129 bytes --]

On 11/21/2014 01:48 PM, Pádraig Brady wrote:
> On 21/11/14 19:55, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 10:39 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>> On 21/11/14 18:09, Andy Lutomirski wrote:
>>>> On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>>>> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
>>>>> can result in both being removed as mv needs to emulate the
>>>>> move with an unlink of the source file.  This can only be done
>>>>> without races in the kernel and so mv was recently changed
>>>>> to not allow this operation at all.  mv could safely reintroduce
>>>>> this feature by leveraging a new flag for renameat() for which
>>>>> an illustrative/untested patch is attached.
>>>>
>>>> ISTM the issue is that rename(2) does nothing if the source and dest
>>>> are hardlinks to each other.  Is that intentional?  I don't see that
>>>> behavior as required in the POSIX rename docs.

Yes, POSIX unfortunately requires that behavior:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html

"If the old argument and the new argument resolve to either the same
existing directory entry or different directory entries for the same
existing file, rename() shall return successfully and perform no other
action."

>>>
>>> It's surprising and annoying that existing systems do this,
>>> but they're conforming to the wording of POSIX I thinkg as
>>> it says that rename() does nothing when the source and dest
>>> _file_ is the same. What POSIX really meant I suppose was _file name_.
>>> Eric, perhaps an adjustment could be proposed to POSIX, as I can't
>>> see anything relying on the current behavior?

The POSIX wording describes existing practice of all implementations,
and as annoying as it is, requiring all implementations to change would
be too invasive, and going against that wording without the use of an
extension to renameat would catch software off-guard that is expecting
the baked-in POSIX semantics.

That said, would you still like me to take a stab at a proposal to the
POSIX folks that would relax the requirements to allow
implementation-defined behavior when the two arguments to rename
describe the same file but via different directory entries?  Note that I
am NOT proposing a change to rename("a", "a") which must always succeed
(true even when two spellings are different but still resolve to the
same directory entry, as in rename("a", "./a")).  Rather, this is only a
question about the behavior of rename("a", "b") when they are two
different hardlinks resolving to the same file.

At any rate, regardless of what POSIX says, I'm totally in favor of a
renameat flag that gives us fine-tune control over the behavior; we can
implement it now as an extension, and once it hits mainline kernel, I
would have no problems proposing that flag for POSIX standardization
(the POSIX folks have a thing for preferring existing implementations,
after all)

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Andy Lutomirski @ 2014-11-21 21:29 UTC (permalink / raw)
  To: Eric Blake; +Cc: Pádraig Brady, Linux FS Devel, Linux API
In-Reply-To: <546FAC18.5020200-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Nov 21, 2014 at 1:18 PM, Eric Blake <eblake-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 11/21/2014 01:48 PM, Pádraig Brady wrote:
>> On 21/11/14 19:55, Andy Lutomirski wrote:
>>> On Fri, Nov 21, 2014 at 10:39 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>>> On 21/11/14 18:09, Andy Lutomirski wrote:
>>>>> On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>>>>> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
>>>>>> can result in both being removed as mv needs to emulate the
>>>>>> move with an unlink of the source file.  This can only be done
>>>>>> without races in the kernel and so mv was recently changed
>>>>>> to not allow this operation at all.  mv could safely reintroduce
>>>>>> this feature by leveraging a new flag for renameat() for which
>>>>>> an illustrative/untested patch is attached.
>>>>>
>>>>> ISTM the issue is that rename(2) does nothing if the source and dest
>>>>> are hardlinks to each other.  Is that intentional?  I don't see that
>>>>> behavior as required in the POSIX rename docs.
>
> Yes, POSIX unfortunately requires that behavior:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html
>
> "If the old argument and the new argument resolve to either the same
> existing directory entry or different directory entries for the same
> existing file, rename() shall return successfully and perform no other
> action."
>
>>>>
>>>> It's surprising and annoying that existing systems do this,
>>>> but they're conforming to the wording of POSIX I thinkg as
>>>> it says that rename() does nothing when the source and dest
>>>> _file_ is the same. What POSIX really meant I suppose was _file name_.
>>>> Eric, perhaps an adjustment could be proposed to POSIX, as I can't
>>>> see anything relying on the current behavior?
>
> The POSIX wording describes existing practice of all implementations,
> and as annoying as it is, requiring all implementations to change would
> be too invasive, and going against that wording without the use of an
> extension to renameat would catch software off-guard that is expecting
> the baked-in POSIX semantics.
>
> That said, would you still like me to take a stab at a proposal to the
> POSIX folks that would relax the requirements to allow
> implementation-defined behavior when the two arguments to rename
> describe the same file but via different directory entries?  Note that I
> am NOT proposing a change to rename("a", "a") which must always succeed
> (true even when two spellings are different but still resolve to the
> same directory entry, as in rename("a", "./a")).  Rather, this is only a
> question about the behavior of rename("a", "b") when they are two
> different hardlinks resolving to the same file.
>
> At any rate, regardless of what POSIX says, I'm totally in favor of a
> renameat flag that gives us fine-tune control over the behavior; we can
> implement it now as an extension, and once it hits mainline kernel, I
> would have no problems proposing that flag for POSIX standardization
> (the POSIX folks have a thing for preferring existing implementations,
> after all)

renameat has no flags -- this would be renameat2.  Standardizing that
might be quite nice (at least the RENAME_NOREPLACE part).

My two cents: the new flag could be RENAME_HARDLINKS.

--Andy

>
> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Pádraig Brady @ 2014-11-21 22:20 UTC (permalink / raw)
  To: Andy Lutomirski, Eric Blake; +Cc: Linux FS Devel, Linux API
In-Reply-To: <CALCETrUByDgug68FP=cnj-iwSXvbEEHp=S4a=WhQPFmuKc2pNw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 21/11/14 21:29, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 1:18 PM, Eric Blake <eblake-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 11/21/2014 01:48 PM, Pádraig Brady wrote:
>>> On 21/11/14 19:55, Andy Lutomirski wrote:
>>>> On Fri, Nov 21, 2014 at 10:39 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>>>> On 21/11/14 18:09, Andy Lutomirski wrote:
>>>>>> On Fri, Nov 21, 2014 at 6:17 AM, Pádraig Brady <P@draigbrady.com> wrote:
>>>>>>> If 'a' and 'b' are hardlinks, then the command `mv a b & mv b a`
>>>>>>> can result in both being removed as mv needs to emulate the
>>>>>>> move with an unlink of the source file.  This can only be done
>>>>>>> without races in the kernel and so mv was recently changed
>>>>>>> to not allow this operation at all.  mv could safely reintroduce
>>>>>>> this feature by leveraging a new flag for renameat() for which
>>>>>>> an illustrative/untested patch is attached.
>>>>>>
>>>>>> ISTM the issue is that rename(2) does nothing if the source and dest
>>>>>> are hardlinks to each other.  Is that intentional?  I don't see that
>>>>>> behavior as required in the POSIX rename docs.
>>
>> Yes, POSIX unfortunately requires that behavior:
>>
>> http://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html
>>
>> "If the old argument and the new argument resolve to either the same
>> existing directory entry or different directory entries for the same
>> existing file, rename() shall return successfully and perform no other
>> action."
>>
>>>>>
>>>>> It's surprising and annoying that existing systems do this,
>>>>> but they're conforming to the wording of POSIX I thinkg as
>>>>> it says that rename() does nothing when the source and dest
>>>>> _file_ is the same. What POSIX really meant I suppose was _file name_.
>>>>> Eric, perhaps an adjustment could be proposed to POSIX, as I can't
>>>>> see anything relying on the current behavior?
>>
>> The POSIX wording describes existing practice of all implementations,
>> and as annoying as it is, requiring all implementations to change would
>> be too invasive, and going against that wording without the use of an
>> extension to renameat would catch software off-guard that is expecting
>> the baked-in POSIX semantics.
>>
>> That said, would you still like me to take a stab at a proposal to the
>> POSIX folks that would relax the requirements to allow
>> implementation-defined behavior when the two arguments to rename
>> describe the same file but via different directory entries?

I guess there is no point discussing in POSIX and adding extra
implementation options if no implementations do/will act accordingly.

Linux can decide to do that independently, if appropriate.
This is one of those borderline cases where we balance
accretion of cruft vs incompatibility.
On consideration, I'm OK with keeping the existing
rename() behavior for compat and adding the new flag.
That said I still can't think of anything depending
rename() doing nothing with hardlinked source and dest.

>> Note that I
>> am NOT proposing a change to rename("a", "a") which must always succeed
>> (true even when two spellings are different but still resolve to the
>> same directory entry, as in rename("a", "./a")).  Rather, this is only a
>> question about the behavior of rename("a", "b") when they are two
>> different hardlinks resolving to the same file.
>>
>> At any rate, regardless of what POSIX says, I'm totally in favor of a
>> renameat flag that gives us fine-tune control over the behavior; we can
>> implement it now as an extension, and once it hits mainline kernel, I
>> would have no problems proposing that flag for POSIX standardization
>> (the POSIX folks have a thing for preferring existing implementations,
>> after all)

+1

> renameat has no flags -- this would be renameat2.  Standardizing that
> might be quite nice (at least the RENAME_NOREPLACE part).
> 
> My two cents: the new flag could be RENAME_HARDLINKS.

+1

thanks guys!
Pádraig.

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Al Viro @ 2014-11-21 22:30 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: Andy Lutomirski, Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <546FBAC6.6020407-V8g9lnOeT5ydJdNcDFJN0w@public.gmane.org>

On Fri, Nov 21, 2014 at 10:20:54PM +0000, Pádraig Brady wrote:

> >> That said, would you still like me to take a stab at a proposal to the
> >> POSIX folks that would relax the requirements to allow
> >> implementation-defined behavior when the two arguments to rename
> >> describe the same file but via different directory entries?
> 
> I guess there is no point discussing in POSIX and adding extra
> implementation options if no implementations do/will act accordingly.
> 
> Linux can decide to do that independently, if appropriate.
> This is one of those borderline cases where we balance
> accretion of cruft vs incompatibility.
> On consideration, I'm OK with keeping the existing
> rename() behavior for compat and adding the new flag.
> That said I still can't think of anything depending
> rename() doing nothing with hardlinked source and dest.

You do realize that it opens a very nasty can of worms for filesystems that
are e.g. case-insensitive to some extent?  How do you tell links from
alternative equivalent spellings of the name?

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Andy Lutomirski @ 2014-11-21 22:40 UTC (permalink / raw)
  To: Al Viro; +Cc: Pádraig Brady, Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <20141121223018.GV7996@ZenIV.linux.org.uk>

On Fri, Nov 21, 2014 at 2:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 21, 2014 at 10:20:54PM +0000, Pádraig Brady wrote:
>
>> >> That said, would you still like me to take a stab at a proposal to the
>> >> POSIX folks that would relax the requirements to allow
>> >> implementation-defined behavior when the two arguments to rename
>> >> describe the same file but via different directory entries?
>>
>> I guess there is no point discussing in POSIX and adding extra
>> implementation options if no implementations do/will act accordingly.
>>
>> Linux can decide to do that independently, if appropriate.
>> This is one of those borderline cases where we balance
>> accretion of cruft vs incompatibility.
>> On consideration, I'm OK with keeping the existing
>> rename() behavior for compat and adding the new flag.
>> That said I still can't think of anything depending
>> rename() doing nothing with hardlinked source and dest.
>
> You do realize that it opens a very nasty can of worms for filesystems that
> are e.g. case-insensitive to some extent?  How do you tell links from
> alternative equivalent spellings of the name?

I assume that VFS can handle this correctly if it wants to.

OTOH, if someone does rename("foo", "Foo"), and foo, Foo, fOO, etc.
are all valid spellings, then presumably they don't actually expect
"foo" to go away.  They may, however, want the name shown in readdir
to change, so maybe RENAME_HARDLINK should do that, too.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Pádraig Brady @ 2014-11-22  1:29 UTC (permalink / raw)
  To: Andy Lutomirski, Al Viro; +Cc: Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <CALCETrW0c1UzkVQgEE_je5BtVhdWRmNaYk4HCQk+t6AWvpC-FA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 21/11/14 22:40, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 2:30 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>> On Fri, Nov 21, 2014 at 10:20:54PM +0000, Pádraig Brady wrote:
>>
>>>>> That said, would you still like me to take a stab at a proposal to the
>>>>> POSIX folks that would relax the requirements to allow
>>>>> implementation-defined behavior when the two arguments to rename
>>>>> describe the same file but via different directory entries?
>>>
>>> I guess there is no point discussing in POSIX and adding extra
>>> implementation options if no implementations do/will act accordingly.
>>>
>>> Linux can decide to do that independently, if appropriate.
>>> This is one of those borderline cases where we balance
>>> accretion of cruft vs incompatibility.
>>> On consideration, I'm OK with keeping the existing
>>> rename() behavior for compat and adding the new flag.
>>> That said I still can't think of anything depending
>>> rename() doing nothing with hardlinked source and dest.
>>
>> You do realize that it opens a very nasty can of worms for filesystems that
>> are e.g. case-insensitive to some extent?  How do you tell links from
>> alternative equivalent spellings of the name?
> 
> I assume that VFS can handle this correctly if it wants to.

I was assuming there was a way to distinguish directory entries,
and that's what should be checked first, which is what my
psuedo code patch attempted to show.

> OTOH, if someone does rename("foo", "Foo"), and foo, Foo, fOO, etc.
> are all valid spellings, then presumably they don't actually expect
> "foo" to go away.  They may, however, want the name shown in readdir
> to change, so maybe RENAME_HARDLINK should do that, too.

That's a separate case, though also useful for case retentive file systems.
If you did want to support that, i.e. when src and dst are the same directory entry,
though spelled differently, then you might have another flag.
Or you could combine both functions to a RENAME_ALIAS flag?

thanks,
Pádraig.

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Andy Lutomirski @ 2014-11-22  1:31 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: Al Viro, Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <546FE6E1.7040703-V8g9lnOeT5ydJdNcDFJN0w@public.gmane.org>

On Fri, Nov 21, 2014 at 5:29 PM, Pádraig Brady <P@draigbrady.com> wrote:
> On 21/11/14 22:40, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 2:30 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>>> On Fri, Nov 21, 2014 at 10:20:54PM +0000, Pádraig Brady wrote:
>>>
>>>>>> That said, would you still like me to take a stab at a proposal to the
>>>>>> POSIX folks that would relax the requirements to allow
>>>>>> implementation-defined behavior when the two arguments to rename
>>>>>> describe the same file but via different directory entries?
>>>>
>>>> I guess there is no point discussing in POSIX and adding extra
>>>> implementation options if no implementations do/will act accordingly.
>>>>
>>>> Linux can decide to do that independently, if appropriate.
>>>> This is one of those borderline cases where we balance
>>>> accretion of cruft vs incompatibility.
>>>> On consideration, I'm OK with keeping the existing
>>>> rename() behavior for compat and adding the new flag.
>>>> That said I still can't think of anything depending
>>>> rename() doing nothing with hardlinked source and dest.
>>>
>>> You do realize that it opens a very nasty can of worms for filesystems that
>>> are e.g. case-insensitive to some extent?  How do you tell links from
>>> alternative equivalent spellings of the name?
>>
>> I assume that VFS can handle this correctly if it wants to.
>
> I was assuming there was a way to distinguish directory entries,
> and that's what should be checked first, which is what my
> psuedo code patch attempted to show.
>
>> OTOH, if someone does rename("foo", "Foo"), and foo, Foo, fOO, etc.
>> are all valid spellings, then presumably they don't actually expect
>> "foo" to go away.  They may, however, want the name shown in readdir
>> to change, so maybe RENAME_HARDLINK should do that, too.
>
> That's a separate case, though also useful for case retentive file systems.
> If you did want to support that, i.e. when src and dst are the same directory entry,
> though spelled differently, then you might have another flag.
> Or you could combine both functions to a RENAME_ALIAS flag?

I like that option.

>
> thanks,
> Pádraig.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Al Viro @ 2014-11-22  1:50 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: Andy Lutomirski, Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <546FE6E1.7040703-V8g9lnOeT5ydJdNcDFJN0w@public.gmane.org>

On Sat, Nov 22, 2014 at 01:29:05AM +0000, Pádraig Brady wrote:

> > I assume that VFS can handle this correctly if it wants to.
> 
> I was assuming there was a way to distinguish directory entries,
> and that's what should be checked first, which is what my
> psuedo code patch attempted to show.

There isn't, in general.  Sure, if you get the same struct dentry * from
both lookups, it's the same one.  But it's not guaranteed to be true on
every fs out there if those are non-directories (and for directories there's
no multiple hardlinks in the first place).

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Andy Lutomirski @ 2014-11-22  1:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Pádraig Brady, Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <20141122015010.GW7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

On Fri, Nov 21, 2014 at 5:50 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> On Sat, Nov 22, 2014 at 01:29:05AM +0000, Pádraig Brady wrote:
>
>> > I assume that VFS can handle this correctly if it wants to.
>>
>> I was assuming there was a way to distinguish directory entries,
>> and that's what should be checked first, which is what my
>> psuedo code patch attempted to show.
>
> There isn't, in general.  Sure, if you get the same struct dentry * from
> both lookups, it's the same one.  But it's not guaranteed to be true on
> every fs out there if those are non-directories (and for directories there's
> no multiple hardlinks in the first place).

Does that mean that the current behavior is inconsistent between filesystems.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: RFC: renameat(): Add a RENAME_REMOVE flag to unlink hardlinks
From: Al Viro @ 2014-11-22  1:57 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Pádraig Brady, Eric Blake, Linux FS Devel, Linux API
In-Reply-To: <CALCETrXBFysW3qnOUuhE8=SeiDVtgAkms7mDuHJTG06mK3=knw@mail.gmail.com>

On Fri, Nov 21, 2014 at 05:51:23PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 5:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Sat, Nov 22, 2014 at 01:29:05AM +0000, Pádraig Brady wrote:
> >
> >> > I assume that VFS can handle this correctly if it wants to.
> >>
> >> I was assuming there was a way to distinguish directory entries,
> >> and that's what should be checked first, which is what my
> >> psuedo code patch attempted to show.
> >
> > There isn't, in general.  Sure, if you get the same struct dentry * from
> > both lookups, it's the same one.  But it's not guaranteed to be true on
> > every fs out there if those are non-directories (and for directories there's
> > no multiple hardlinks in the first place).
> 
> Does that mean that the current behavior is inconsistent between filesystems.

No.  You can always check that they point to the same inode.  Which is
precisely what "links to the same file" is about, and which is why rename(2)
had that semantics since way back.  _That_ is easy condition to check.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [v7 0/5] ext4: add project quota support
From: Li Xi @ 2014-11-22  5:24 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A

The following patches propose an implementation of project quota
support for ext4. A project is an aggregate of unrelated inodes
which might scatter in different directories. Inodes that belong
to the same project possess an identical identification i.e.
'project ID', just like every inode has its user/group
identification. The following patches add project quota as
supplement to the former uer/group quota types.

The semantics of ext4 project quota is consistent with XFS. Each
directory can have EXT4_INODE_PROJINHERIT flag set. When the
EXT4_INODE_PROJINHERIT flag of a parent directory is not set, a
newly created inode under that directory will have a default project
ID (i.e. 0). And its EXT4_INODE_PROJINHERIT flag is not set either.
When this flag is set on a directory, following rules will be kept:

1) The newly created inode under that directory will inherit both
the EXT4_INODE_PROJINHERIT flag and the project ID from its parent
directory.

2) Hard-linking a inode with different project ID into that directory
will fail with errno EXDEV.

3) Renaming a inode with different project ID into that directory
will fail with errno EXDEV. However, 'mv' command will detect this
failure and copy the renamed inode to a new inode in the directory.
Thus, this new inode will inherit both the project ID and
EXT4_INODE_PROJINHERIT flag.

4) If the project quota of that ID is being enforced, statfs() on
that directory will take the quotas as another upper limits along
with the capacity of the file system, i.e. the total block/inode
number will be the minimum of the quota limits and file system
capacity.

Changelog:
* v7 <- v6:
 - Map ext4 inode flags to xflags of struct fsxattr;
 - Add patch to cleanup ext4 inode flag definitions
* v6 <- v5:
 - Add project ID check for cross rename;
 - Remove patch of EXT4_IOC_GETPROJECT/EXT4_IOC_SETPROJECT ioctl
* v5 <- v4:
 - Check project feature when set/get project ID;
 - Do not check project feature for project quota;
 - Add support of FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR.
* v4 <- v3:
 - Do not check project feature when set/get project ID;
 - Use EXT4_MAXQUOTAS instead of MAXQUOTAS in ext4 patches;
 - Remove unnecessary change of fs/quota/dquot.c;
 - Remove CONFIG_QUOTA_PROJECT.
* v3 <- v2:
 - Add EXT4_INODE_PROJINHERIT semantics.
* v2 <- v1:
 - Add ioctl interface for setting/getting project;
 - Add EXT4_FEATURE_RO_COMPAT_PROJECT;
 - Add get_projid() method in struct dquot_operations;
 - Add error check of ext4_inode_projid_set/get().

v6: http://www.spinics.net/lists/linux-fsdevel/msg80022.html
v5: http://www.spinics.net/lists/linux-api/msg04840.html
v4: http://lwn.net/Articles/612972/
v3: http://www.spinics.net/lists/linux-ext4/msg45184.html
v2: http://www.spinics.net/lists/linux-ext4/msg44695.html
v1: http://article.gmane.org/gmane.comp.file-systems.ext4/45153

Any comments or feedbacks are appreciated.

Regards,
                                         - Li Xi

Li Xi (5):
  vfs: adds general codes to enforces project quota limits
  ext4: adds project ID support
  ext4: adds project quota support
  ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
  ext4: cleanup inode flag definitions

 fs/ext4/ext4.h             |  190 +++++++++++++++++++++----
 fs/ext4/ialloc.c           |    6 +
 fs/ext4/inode.c            |   29 ++++
 fs/ext4/ioctl.c            |  330 +++++++++++++++++++++++++++++++-------------
 fs/ext4/namei.c            |   17 +++
 fs/ext4/super.c            |   96 +++++++++++--
 fs/quota/dquot.c           |   35 ++++-
 fs/quota/quota.c           |    8 +-
 fs/quota/quotaio_v2.h      |    6 +-
 fs/xfs/xfs_fs.h            |   47 +++----
 include/linux/quota.h      |    2 +
 include/uapi/linux/fs.h    |   59 ++++++++
 include/uapi/linux/quota.h |    6 +-
 13 files changed, 650 insertions(+), 181 deletions(-)

^ permalink raw reply

* [v7 1/5] vfs: adds general codes to enforces project quota limits
From: Li Xi @ 2014-11-22  5:24 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1416633895-4004-1-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

This patch adds support for a new quota type PRJQUOTA for project quota
enforcement. Also a new method get_projid() is added into dquot_operations
structure.

Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
Signed-off-by: Dmitry Monakhov <dmonakhov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/quota/dquot.c           |   35 ++++++++++++++++++++++++++++++-----
 fs/quota/quota.c           |    8 ++++++--
 fs/quota/quotaio_v2.h      |    6 ++++--
 include/linux/quota.h      |    2 ++
 include/uapi/linux/quota.h |    6 ++++--
 5 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 8b663b2..84f9bb1 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1154,8 +1154,8 @@ static int need_print_warning(struct dquot_warn *warn)
 			return uid_eq(current_fsuid(), warn->w_dq_id.uid);
 		case GRPQUOTA:
 			return in_group_p(warn->w_dq_id.gid);
-		case PRJQUOTA:	/* Never taken... Just make gcc happy */
-			return 0;
+		case PRJQUOTA:
+			return 1;
 	}
 	return 0;
 }
@@ -1394,6 +1394,9 @@ static void __dquot_initialize(struct inode *inode, int type)
 	/* First get references to structures we might need. */
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		struct kqid qid;
+		kprojid_t projid;
+		int rc;
+
 		got[cnt] = NULL;
 		if (type != -1 && cnt != type)
 			continue;
@@ -1404,6 +1407,10 @@ static void __dquot_initialize(struct inode *inode, int type)
 		 */
 		if (inode->i_dquot[cnt])
 			continue;
+
+		if (!sb_has_quota_active(sb, cnt))
+			continue;
+
 		init_needed = 1;
 
 		switch (cnt) {
@@ -1413,6 +1420,12 @@ static void __dquot_initialize(struct inode *inode, int type)
 		case GRPQUOTA:
 			qid = make_kqid_gid(inode->i_gid);
 			break;
+		case PRJQUOTA:
+			rc = inode->i_sb->dq_op->get_projid(inode, &projid);
+			if (rc)
+				continue;
+			qid = make_kqid_projid(projid);
+			break;
 		}
 		got[cnt] = dqget(sb, qid);
 	}
@@ -2156,7 +2169,8 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
 		error = -EROFS;
 		goto out_fmt;
 	}
-	if (!sb->s_op->quota_write || !sb->s_op->quota_read) {
+	if (!sb->s_op->quota_write || !sb->s_op->quota_read ||
+	    (type == PRJQUOTA && sb->dq_op->get_projid == NULL)) {
 		error = -EINVAL;
 		goto out_fmt;
 	}
@@ -2397,8 +2411,19 @@ static void do_get_dqblk(struct dquot *dquot, struct fs_disk_quota *di)
 
 	memset(di, 0, sizeof(*di));
 	di->d_version = FS_DQUOT_VERSION;
-	di->d_flags = dquot->dq_id.type == USRQUOTA ?
-			FS_USER_QUOTA : FS_GROUP_QUOTA;
+	switch (dquot->dq_id.type) {
+	case USRQUOTA:
+		di->d_flags = FS_USER_QUOTA;
+		break;
+	case GRPQUOTA:
+		di->d_flags = FS_GROUP_QUOTA;
+		break;
+	case PRJQUOTA:
+		di->d_flags = FS_PROJ_QUOTA;
+		break;
+	default:
+		BUG();
+	}
 	di->d_id = from_kqid_munged(current_user_ns(), dquot->dq_id);
 
 	spin_lock(&dq_data_lock);
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 7562164..795d694 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -30,11 +30,15 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
 	case Q_XGETQSTATV:
 	case Q_XQUOTASYNC:
 		break;
-	/* allow to query information for dquots we "own" */
+	/*
+	 * allow to query information for dquots we "own"
+	 * always allow querying project quota
+	 */
 	case Q_GETQUOTA:
 	case Q_XGETQUOTA:
 		if ((type == USRQUOTA && uid_eq(current_euid(), make_kuid(current_user_ns(), id))) ||
-		    (type == GRPQUOTA && in_egroup_p(make_kgid(current_user_ns(), id))))
+		    (type == GRPQUOTA && in_egroup_p(make_kgid(current_user_ns(), id))) ||
+		    (type == PRJQUOTA))
 			break;
 		/*FALLTHROUGH*/
 	default:
diff --git a/fs/quota/quotaio_v2.h b/fs/quota/quotaio_v2.h
index f1966b4..4e95430 100644
--- a/fs/quota/quotaio_v2.h
+++ b/fs/quota/quotaio_v2.h
@@ -13,12 +13,14 @@
  */
 #define V2_INITQMAGICS {\
 	0xd9c01f11,	/* USRQUOTA */\
-	0xd9c01927	/* GRPQUOTA */\
+	0xd9c01927,	/* GRPQUOTA */\
+	0xd9c03f14,	/* PRJQUOTA */\
 }
 
 #define V2_INITQVERSIONS {\
 	1,		/* USRQUOTA */\
-	1		/* GRPQUOTA */\
+	1,		/* GRPQUOTA */\
+	1,		/* PRJQUOTA */\
 }
 
 /* First generic header */
diff --git a/include/linux/quota.h b/include/linux/quota.h
index 80d345a..f1b25f8 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -50,6 +50,7 @@
 
 #undef USRQUOTA
 #undef GRPQUOTA
+#undef PRJQUOTA
 enum quota_type {
 	USRQUOTA = 0,		/* element used for user quotas */
 	GRPQUOTA = 1,		/* element used for group quotas */
@@ -312,6 +313,7 @@ struct dquot_operations {
 	/* get reserved quota for delayed alloc, value returned is managed by
 	 * quota code only */
 	qsize_t *(*get_reserved_space) (struct inode *);
+	int (*get_projid) (struct inode *, kprojid_t *);/* Get project ID */
 };
 
 struct path;
diff --git a/include/uapi/linux/quota.h b/include/uapi/linux/quota.h
index 3b6cfbe..b2d9486 100644
--- a/include/uapi/linux/quota.h
+++ b/include/uapi/linux/quota.h
@@ -36,11 +36,12 @@
 #include <linux/errno.h>
 #include <linux/types.h>
 
-#define __DQUOT_VERSION__	"dquot_6.5.2"
+#define __DQUOT_VERSION__	"dquot_6.6.0"
 
-#define MAXQUOTAS 2
+#define MAXQUOTAS 3
 #define USRQUOTA  0		/* element used for user quotas */
 #define GRPQUOTA  1		/* element used for group quotas */
+#define PRJQUOTA  2		/* element used for project quotas */
 
 /*
  * Definitions for the default names of the quotas files.
@@ -48,6 +49,7 @@
 #define INITQFNAMES { \
 	"user",    /* USRQUOTA */ \
 	"group",   /* GRPQUOTA */ \
+	"project", /* PRJQUOTA */ \
 	"undefined", \
 };
 
-- 
1.7.1

^ permalink raw reply related

* [v7 2/5] ext4: adds project ID support
From: Li Xi @ 2014-11-22  5:24 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1416633895-4004-1-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

This patch adds a new internal field of ext4 inode to save project
identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
inheriting project ID from parent directory.

Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/ext4/ext4.h          |   21 +++++++++++++++++----
 fs/ext4/ialloc.c        |    6 ++++++
 fs/ext4/inode.c         |   29 +++++++++++++++++++++++++++++
 fs/ext4/namei.c         |   17 +++++++++++++++++
 fs/ext4/super.c         |    1 +
 include/uapi/linux/fs.h |    1 +
 6 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c55a1fa..505b9fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -386,16 +386,18 @@ struct flex_groups {
 #define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
 #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
 #define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
+#define EXT4_PROJINHERIT_FL		FS_PROJINHERIT_FL /* Create with parents projid */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */
 
-#define EXT4_FL_USER_VISIBLE		0x004BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE		0x004380FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
 
 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
 			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
 			   EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
-			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
+			   EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
+			   EXT4_PROJINHERIT_FL)
 
 /* Flags that are appropriate for regular files (all but dir-specific ones). */
 #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
@@ -443,6 +445,7 @@ enum {
 	EXT4_INODE_EA_INODE	= 21,	/* Inode used for large EA */
 	EXT4_INODE_EOFBLOCKS	= 22,	/* Blocks allocated beyond EOF */
 	EXT4_INODE_INLINE_DATA	= 28,	/* Data in inode. */
+	EXT4_INODE_PROJINHERIT	= 29,	/* Create with parents projid */
 	EXT4_INODE_RESERVED	= 31,	/* reserved for ext4 lib */
 };
 
@@ -694,6 +697,7 @@ struct ext4_inode {
 	__le32  i_crtime;       /* File Creation time */
 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
+	__le32  i_projid;	/* Project ID */
 };
 
 struct move_extent {
@@ -943,6 +947,7 @@ struct ext4_inode_info {
 
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+	kprojid_t i_projid;
 };
 
 /*
@@ -1526,6 +1531,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
  * GDT_CSUM bits are mutually exclusive.
  */
 #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM	0x0400
+#define EXT4_FEATURE_RO_COMPAT_PROJECT		0x1000 /* Project quota */
 
 #define EXT4_FEATURE_INCOMPAT_COMPRESSION	0x0001
 #define EXT4_FEATURE_INCOMPAT_FILETYPE		0x0002
@@ -1575,7 +1581,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
 					 EXT4_FEATURE_RO_COMPAT_BIGALLOC |\
 					 EXT4_FEATURE_RO_COMPAT_METADATA_CSUM|\
-					 EXT4_FEATURE_RO_COMPAT_QUOTA)
+					 EXT4_FEATURE_RO_COMPAT_QUOTA |\
+					 EXT4_FEATURE_RO_COMPAT_PROJECT)
 
 /*
  * Default values for user and/or group using reserved blocks
@@ -1583,6 +1590,11 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 #define	EXT4_DEF_RESUID		0
 #define	EXT4_DEF_RESGID		0
 
+/*
+ * Default project ID
+ */
+#define	EXT4_DEF_PROJID		0
+
 #define EXT4_DEF_INODE_READAHEAD_BLKS	32
 
 /*
@@ -2135,6 +2147,7 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
+extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
 extern void ext4_da_update_reserve_space(struct inode *inode,
 					int used, int quota_claim);
 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ac644c3..fefb948 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -756,6 +756,12 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		inode->i_gid = dir->i_gid;
 	} else
 		inode_init_owner(inode, dir, mode);
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT) &&
+	    ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) {
+		ei->i_projid = EXT4_I(dir)->i_projid;
+	} else {
+		ei->i_projid = make_kprojid(&init_user_ns, EXT4_DEF_PROJID);
+	}
 	dquot_initialize(inode);
 
 	if (!goal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..1c440be 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3886,6 +3886,14 @@ static inline void ext4_iget_extra_inode(struct inode *inode,
 		EXT4_I(inode)->i_inline_off = 0;
 }
 
+int ext4_get_projid(struct inode *inode, kprojid_t *projid)
+{
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+		return -EOPNOTSUPP;
+	*projid = EXT4_I(inode)->i_projid;
+	return 0;
+}
+
 struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 {
 	struct ext4_iloc iloc;
@@ -3897,6 +3905,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	int block;
 	uid_t i_uid;
 	gid_t i_gid;
+	projid_t i_projid;
 
 	inode = iget_locked(sb, ino);
 	if (!inode)
@@ -3946,12 +3955,18 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
 	i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
 	i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
+		i_projid = (projid_t)le32_to_cpu(raw_inode->i_projid);
+	else
+		i_projid = EXT4_DEF_PROJID;
+
 	if (!(test_opt(inode->i_sb, NO_UID32))) {
 		i_uid |= le16_to_cpu(raw_inode->i_uid_high) << 16;
 		i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
 	}
 	i_uid_write(inode, i_uid);
 	i_gid_write(inode, i_gid);
+	ei->i_projid = make_kprojid(&init_user_ns, i_projid);;
 	set_nlink(inode, le16_to_cpu(raw_inode->i_links_count));
 
 	ext4_clear_state_flags(ei);	/* Only relevant on 32-bit archs */
@@ -4181,6 +4196,7 @@ static int ext4_do_update_inode(handle_t *handle,
 	int need_datasync = 0, set_large_file = 0;
 	uid_t i_uid;
 	gid_t i_gid;
+	projid_t i_projid;
 
 	spin_lock(&ei->i_raw_lock);
 
@@ -4193,6 +4209,7 @@ static int ext4_do_update_inode(handle_t *handle,
 	raw_inode->i_mode = cpu_to_le16(inode->i_mode);
 	i_uid = i_uid_read(inode);
 	i_gid = i_gid_read(inode);
+	i_projid = from_kprojid(&init_user_ns, ei->i_projid);
 	if (!(test_opt(inode->i_sb, NO_UID32))) {
 		raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid));
 		raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid));
@@ -4272,6 +4289,18 @@ static int ext4_do_update_inode(handle_t *handle,
 		}
 	}
 
+	BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+			EXT4_FEATURE_RO_COMPAT_PROJECT) &&
+	       i_projid != EXT4_DEF_PROJID);
+	if (i_projid != EXT4_DEF_PROJID &&
+	    (EXT4_INODE_SIZE(inode->i_sb) <= EXT4_GOOD_OLD_INODE_SIZE ||
+	     (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)))) {
+		spin_unlock(&ei->i_raw_lock);
+		err = -EFBIG;
+		goto out_brelse;
+	}
+	raw_inode->i_projid = cpu_to_le32(i_projid);
+
 	ext4_inode_csum_set(inode, raw_inode, ei);
 
 	spin_unlock(&ei->i_raw_lock);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4262118..160a743 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2939,6 +2939,11 @@ static int ext4_link(struct dentry *old_dentry,
 	if (inode->i_nlink >= EXT4_LINK_MAX)
 		return -EMLINK;
 
+	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
+	    (!projid_eq(EXT4_I(dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
 	dquot_initialize(dir);
 
 retry:
@@ -3218,6 +3223,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 	int credits;
 	u8 old_file_type;
 
+	if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) &&
+	    (!projid_eq(EXT4_I(new_dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
 	dquot_initialize(old.dir);
 	dquot_initialize(new.dir);
 
@@ -3396,6 +3406,13 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
 	u8 new_file_type;
 	int retval;
 
+	if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) &&
+	    ((!projid_eq(EXT4_I(new_dir)->i_projid,
+			 EXT4_I(old_dentry->d_inode)->i_projid)) ||
+	     (!projid_eq(EXT4_I(old_dir)->i_projid,
+			 EXT4_I(new_dentry->d_inode)->i_projid))))
+		return -EXDEV;
+
 	dquot_initialize(old.dir);
 	dquot_initialize(new.dir);
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2c9e686..d8a348d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1077,6 +1077,7 @@ static const struct dquot_operations ext4_quota_operations = {
 	.write_info	= ext4_write_info,
 	.alloc_dquot	= dquot_alloc,
 	.destroy_dquot	= dquot_destroy,
+	.get_projid	= ext4_get_projid,
 };
 
 static const struct quotactl_ops ext4_qctl_operations = {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..fcbf647 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -195,6 +195,7 @@ struct inodes_stat_t {
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */
 #define FS_NOCOW_FL			0x00800000 /* Do not cow file */
+#define FS_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
-- 
1.7.1

^ permalink raw reply related

* [v7 3/5] ext4: adds project quota support
From: Li Xi @ 2014-11-22  5:24 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack, viro,
	hch, dmonakhov
In-Reply-To: <1416633895-4004-1-git-send-email-lixi@ddn.com>

This patch adds mount options for enabling/disabling project quota
accounting and enforcement. A new specific inode is also used for
project quota accounting.

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  |    8 +++-
 fs/ext4/super.c |   95 ++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 90 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 505b9fb..67d4879 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -217,6 +217,7 @@ struct ext4_io_submit {
 #define EXT4_UNDEL_DIR_INO	 6	/* Undelete directory inode */
 #define EXT4_RESIZE_INO		 7	/* Reserved group descriptors inode */
 #define EXT4_JOURNAL_INO	 8	/* Journal inode */
+#define EXT4_PRJ_QUOTA_INO	 9	/* Project quota inode */
 
 /* First non-reserved inode for old ext4 filesystems */
 #define EXT4_GOOD_OLD_FIRST_INO	11
@@ -991,6 +992,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DIOREAD_NOLOCK	0x400000 /* Enable support for dio read nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
+#define EXT4_MOUNT_PRJQUOTA		0x2000000 /* Project quota support */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
@@ -1166,7 +1168,8 @@ struct ext4_super_block {
 	__le32	s_grp_quota_inum;	/* inode for tracking group quota */
 	__le32	s_overhead_clusters;	/* overhead blocks/clusters in fs */
 	__le32	s_backup_bgs[2];	/* groups with sparse_super2 SBs */
-	__le32	s_reserved[106];	/* Padding to the end of the block */
+	__le32	s_prj_quota_inum;	/* inode for tracking project quota */
+	__le32	s_reserved[105];	/* Padding to the end of the block */
 	__le32	s_checksum;		/* crc32c(superblock) */
 };
 
@@ -1181,7 +1184,7 @@ struct ext4_super_block {
 #define EXT4_MF_FS_ABORTED	0x0002	/* Fatal error detected */
 
 /* Number of quota types we support */
-#define EXT4_MAXQUOTAS 2
+#define EXT4_MAXQUOTAS 3
 
 /*
  * fourth extended-fs super-block data in memory
@@ -1372,6 +1375,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 		ino == EXT4_BOOT_LOADER_INO ||
 		ino == EXT4_JOURNAL_INO ||
 		ino == EXT4_RESIZE_INO ||
+		ino == EXT4_PRJ_QUOTA_INO ||
 		(ino >= EXT4_FIRST_INO(sb) &&
 		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d8a348d..564fbd7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1045,8 +1045,8 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 }
 
 #ifdef CONFIG_QUOTA
-#define QTYPE2NAME(t) ((t) == USRQUOTA ? "user" : "group")
-#define QTYPE2MOPT(on, t) ((t) == USRQUOTA?((on)##USRJQUOTA):((on)##GRPJQUOTA))
+static char *quotatypes[] = INITQFNAMES;
+#define QTYPE2NAME(t) (quotatypes[t])
 
 static int ext4_write_dquot(struct dquot *dquot);
 static int ext4_acquire_dquot(struct dquot *dquot);
@@ -1138,10 +1138,11 @@ enum {
 	Opt_journal_path, Opt_journal_checksum, Opt_journal_async_commit,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
 	Opt_data_err_abort, Opt_data_err_ignore,
-	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
+	Opt_usrjquota, Opt_grpjquota, Opt_prjjquota,
+	Opt_offusrjquota, Opt_offgrpjquota, Opt_offprjjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1192,6 +1193,8 @@ static const match_table_t tokens = {
 	{Opt_usrjquota, "usrjquota=%s"},
 	{Opt_offgrpjquota, "grpjquota="},
 	{Opt_grpjquota, "grpjquota=%s"},
+	{Opt_offprjjquota, "prjjquota="},
+	{Opt_prjjquota, "prjjquota=%s"},
 	{Opt_jqfmt_vfsold, "jqfmt=vfsold"},
 	{Opt_jqfmt_vfsv0, "jqfmt=vfsv0"},
 	{Opt_jqfmt_vfsv1, "jqfmt=vfsv1"},
@@ -1199,6 +1202,7 @@ static const match_table_t tokens = {
 	{Opt_noquota, "noquota"},
 	{Opt_quota, "quota"},
 	{Opt_usrquota, "usrquota"},
+	{Opt_prjquota, "prjquota"},
 	{Opt_barrier, "barrier=%u"},
 	{Opt_barrier, "barrier"},
 	{Opt_nobarrier, "nobarrier"},
@@ -1411,12 +1415,17 @@ static const struct mount_opts {
 							MOPT_SET | MOPT_Q},
 	{Opt_grpquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_GRPQUOTA,
 							MOPT_SET | MOPT_Q},
+	{Opt_prjquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_PRJQUOTA,
+							MOPT_SET | MOPT_Q},
 	{Opt_noquota, (EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
-		       EXT4_MOUNT_GRPQUOTA), MOPT_CLEAR | MOPT_Q},
+		       EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA),
+							MOPT_CLEAR | MOPT_Q},
 	{Opt_usrjquota, 0, MOPT_Q},
 	{Opt_grpjquota, 0, MOPT_Q},
+	{Opt_prjjquota, 0, MOPT_Q},
 	{Opt_offusrjquota, 0, MOPT_Q},
 	{Opt_offgrpjquota, 0, MOPT_Q},
+	{Opt_offprjjquota, 0, MOPT_Q},
 	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
@@ -1439,10 +1448,14 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		return set_qf_name(sb, USRQUOTA, &args[0]);
 	else if (token == Opt_grpjquota)
 		return set_qf_name(sb, GRPQUOTA, &args[0]);
+	else if (token == Opt_prjjquota)
+		return set_qf_name(sb, PRJQUOTA, &args[0]);
 	else if (token == Opt_offusrjquota)
 		return clear_qf_name(sb, USRQUOTA);
 	else if (token == Opt_offgrpjquota)
 		return clear_qf_name(sb, GRPQUOTA);
+	else if (token == Opt_offprjjquota)
+		return clear_qf_name(sb, PRJQUOTA);
 #endif
 	switch (token) {
 	case Opt_noacl:
@@ -1668,19 +1681,28 @@ static int parse_options(char *options, struct super_block *sb,
 	}
 #ifdef CONFIG_QUOTA
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_QUOTA) &&
-	    (test_opt(sb, USRQUOTA) || test_opt(sb, GRPQUOTA))) {
+	    (test_opt(sb, USRQUOTA) ||
+	     test_opt(sb, GRPQUOTA) ||
+	     test_opt(sb, PRJQUOTA))) {
 		ext4_msg(sb, KERN_ERR, "Cannot set quota options when QUOTA "
 			 "feature is enabled");
 		return 0;
 	}
-	if (sbi->s_qf_names[USRQUOTA] || sbi->s_qf_names[GRPQUOTA]) {
+	if (sbi->s_qf_names[USRQUOTA] ||
+	    sbi->s_qf_names[GRPQUOTA] ||
+	    sbi->s_qf_names[PRJQUOTA]) {
 		if (test_opt(sb, USRQUOTA) && sbi->s_qf_names[USRQUOTA])
 			clear_opt(sb, USRQUOTA);
 
 		if (test_opt(sb, GRPQUOTA) && sbi->s_qf_names[GRPQUOTA])
 			clear_opt(sb, GRPQUOTA);
 
-		if (test_opt(sb, GRPQUOTA) || test_opt(sb, USRQUOTA)) {
+		if (test_opt(sb, PRJQUOTA) && sbi->s_qf_names[PRJQUOTA])
+			clear_opt(sb, PRJQUOTA);
+
+		if (test_opt(sb, GRPQUOTA) ||
+		    test_opt(sb, USRQUOTA) ||
+		    test_opt(sb, PRJQUOTA)) {
 			ext4_msg(sb, KERN_ERR, "old and new quota "
 					"format mixing");
 			return 0;
@@ -1734,6 +1756,9 @@ static inline void ext4_show_quota_options(struct seq_file *seq,
 
 	if (sbi->s_qf_names[GRPQUOTA])
 		seq_printf(seq, ",grpjquota=%s", sbi->s_qf_names[GRPQUOTA]);
+
+	if (sbi->s_qf_names[PRJQUOTA])
+		seq_printf(seq, ",prjjquota=%s", sbi->s_qf_names[PRJQUOTA]);
 #endif
 }
 
@@ -5030,6 +5055,46 @@ restore_opts:
 	return err;
 }
 
+static int ext4_statfs_project(struct super_block *sb,
+			       kprojid_t projid, struct kstatfs *buf)
+{
+	struct kqid qid;
+	struct dquot *dquot;
+	u64 limit;
+	u64 curblock;
+
+	qid = make_kqid_projid(projid);
+	dquot = dqget(sb, qid);
+	if (!dquot)
+		return -ESRCH;
+	spin_lock(&dq_data_lock);
+
+	limit = dquot->dq_dqb.dqb_bsoftlimit ?
+		dquot->dq_dqb.dqb_bsoftlimit :
+		dquot->dq_dqb.dqb_bhardlimit;
+	if (limit && buf->f_blocks * buf->f_bsize > limit) {
+		curblock = dquot->dq_dqb.dqb_curspace / buf->f_bsize;
+		buf->f_blocks = limit / buf->f_bsize;
+		buf->f_bfree = buf->f_bavail =
+			(buf->f_blocks > curblock) ?
+			 (buf->f_blocks - curblock) : 0;
+	}
+
+	limit = dquot->dq_dqb.dqb_isoftlimit ?
+		dquot->dq_dqb.dqb_isoftlimit :
+		dquot->dq_dqb.dqb_ihardlimit;
+	if (limit && buf->f_files > limit) {
+		buf->f_files = limit;
+		buf->f_ffree =
+			(buf->f_files > dquot->dq_dqb.dqb_curinodes) ?
+			 (buf->f_files - dquot->dq_dqb.dqb_curinodes) : 0;
+	}
+
+	spin_unlock(&dq_data_lock);
+	dqput(dquot);
+	return 0;
+}
+
 static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct super_block *sb = dentry->d_sb;
@@ -5038,6 +5103,7 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	ext4_fsblk_t overhead = 0, resv_blocks;
 	u64 fsid;
 	s64 bfree;
+	struct inode *inode = dentry->d_inode;
 	resv_blocks = EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters));
 
 	if (!test_opt(sb, MINIX_DF))
@@ -5062,6 +5128,9 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
 	buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;
 
+	if (ext4_test_inode_flag(inode, EXT4_INODE_PROJINHERIT) &&
+	    sb_has_quota_limits_enabled(sb, PRJQUOTA))
+		ext4_statfs_project(sb, EXT4_I(inode)->i_projid, buf);
 	return 0;
 }
 
@@ -5142,7 +5211,9 @@ static int ext4_mark_dquot_dirty(struct dquot *dquot)
 
 	/* Are we journaling quotas? */
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_QUOTA) ||
-	    sbi->s_qf_names[USRQUOTA] || sbi->s_qf_names[GRPQUOTA]) {
+	    sbi->s_qf_names[USRQUOTA] ||
+	    sbi->s_qf_names[GRPQUOTA] ||
+	    sbi->s_qf_names[PRJQUOTA]) {
 		dquot_mark_dquot_dirty(dquot);
 		return ext4_write_dquot(dquot);
 	} else {
@@ -5226,7 +5297,8 @@ static int ext4_quota_enable(struct super_block *sb, int type, int format_id,
 	struct inode *qf_inode;
 	unsigned long qf_inums[EXT4_MAXQUOTAS] = {
 		le32_to_cpu(EXT4_SB(sb)->s_es->s_usr_quota_inum),
-		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum)
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum),
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_prj_quota_inum)
 	};
 
 	BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_QUOTA));
@@ -5254,7 +5326,8 @@ static int ext4_enable_quotas(struct super_block *sb)
 	int type, err = 0;
 	unsigned long qf_inums[EXT4_MAXQUOTAS] = {
 		le32_to_cpu(EXT4_SB(sb)->s_es->s_usr_quota_inum),
-		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum)
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_grp_quota_inum),
+		le32_to_cpu(EXT4_SB(sb)->s_es->s_prj_quota_inum)
 	};
 
 	sb_dqopt(sb)->flags |= DQUOT_QUOTA_SYS_FILE;
-- 
1.7.1


^ permalink raw reply related

* [v7 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Li Xi @ 2014-11-22  5:24 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack, viro,
	hch, dmonakhov
In-Reply-To: <1416633895-4004-1-git-send-email-lixi@ddn.com>

This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
support for ext4. The interface is kept consistent with
XFS_IOC_FSGETXATTR/XFS_IOC_FSGETXATTR.

Signed-off-by: Li Xi <lixi@ddn.com>
---
 fs/ext4/ext4.h          |  111 ++++++++++++++++
 fs/ext4/ioctl.c         |  330 +++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_fs.h         |   47 +++-----
 include/uapi/linux/fs.h |   58 ++++++++
 4 files changed, 418 insertions(+), 128 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 67d4879..542b388 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -393,6 +393,115 @@ struct flex_groups {
 #define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
 #define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
 
+/* Transfer internal flags to xflags */
+static inline __u32 ext4_iflags_to_xflags(unsigned long iflags)
+{
+	__u32 xflags = 0;
+
+	if (iflags & EXT4_SECRM_FL)
+		xflags |= FS_XFLAG_SECRM;
+	if (iflags & EXT4_UNRM_FL)
+		xflags |= FS_XFLAG_UNRM;
+	if (iflags & EXT4_COMPR_FL)
+		xflags |= FS_XFLAG_COMPR;
+	if (iflags & EXT4_SYNC_FL)
+		xflags |= FS_XFLAG_SYNC;
+	if (iflags & EXT4_IMMUTABLE_FL)
+		xflags |= FS_XFLAG_IMMUTABLE;
+	if (iflags & EXT4_APPEND_FL)
+		xflags |= FS_XFLAG_APPEND;
+	if (iflags & EXT4_NODUMP_FL)
+		xflags |= FS_XFLAG_NODUMP;
+	if (iflags & EXT4_NOATIME_FL)
+		xflags |= FS_XFLAG_NOATIME;
+	if (iflags & EXT4_COMPRBLK_FL)
+		xflags |= FS_XFLAG_COMPRBLK;
+	if (iflags & EXT4_NOCOMPR_FL)
+		xflags |= FS_XFLAG_NOCOMPR;
+	if (iflags & EXT4_ECOMPR_FL)
+		xflags |= FS_XFLAG_ECOMPR;
+	if (iflags & EXT4_INDEX_FL)
+		xflags |= FS_XFLAG_INDEX;
+	if (iflags & EXT4_IMAGIC_FL)
+		xflags |= FS_XFLAG_IMAGIC;
+	if (iflags & EXT4_JOURNAL_DATA_FL)
+		xflags |= FS_XFLAG_JOURNAL_DATA;
+	if (iflags & EXT4_NOTAIL_FL)
+		xflags |= FS_XFLAG_NOTAIL;
+	if (iflags & EXT4_DIRSYNC_FL)
+		xflags |= FS_XFLAG_DIRSYNC;
+	if (iflags & EXT4_TOPDIR_FL)
+		xflags |= FS_XFLAG_TOPDIR;
+	if (iflags & EXT4_HUGE_FILE_FL)
+		xflags |= FS_XFLAG_HUGE_FILE;
+	if (iflags & EXT4_EXTENTS_FL)
+		xflags |= FS_XFLAG_EXTENTS;
+	if (iflags & EXT4_EA_INODE_FL)
+		xflags |= FS_XFLAG_EA_INODE;
+	if (iflags & EXT4_EOFBLOCKS_FL)
+		xflags |= FS_XFLAG_EOFBLOCKS;
+	if (iflags & EXT4_INLINE_DATA_FL)
+		xflags |= FS_XFLAG_INLINE_DATA;
+	if (iflags & EXT4_PROJINHERIT_FL)
+		xflags |= FS_XFLAG_PROJINHERIT;
+	return xflags;
+}
+
+/* Transfer xflags flags to internal */
+static inline unsigned long ext4_xflags_to_iflags(__u32 xflags)
+{
+	unsigned long iflags = 0;
+
+	if (xflags & FS_XFLAG_SECRM)
+		iflags |= EXT4_SECRM_FL;
+	if (xflags & FS_XFLAG_UNRM)
+		iflags |= EXT4_UNRM_FL;
+	if (xflags & FS_XFLAG_COMPR)
+		iflags |= EXT4_COMPR_FL;
+	if (xflags & FS_XFLAG_SYNC)
+		iflags |= EXT4_SYNC_FL;
+	if (xflags & FS_XFLAG_IMMUTABLE)
+		iflags |= EXT4_IMMUTABLE_FL;
+	if (xflags & FS_XFLAG_APPEND)
+		iflags |= EXT4_APPEND_FL;
+	if (xflags & FS_XFLAG_NODUMP)
+		iflags |= EXT4_NODUMP_FL;
+	if (xflags & FS_XFLAG_NOATIME)
+		iflags |= EXT4_NOATIME_FL;
+	if (xflags & FS_XFLAG_COMPRBLK)
+		iflags |= EXT4_COMPRBLK_FL;
+	if (xflags & FS_XFLAG_NOCOMPR)
+		iflags |= EXT4_NOCOMPR_FL;
+	if (xflags & FS_XFLAG_ECOMPR)
+		iflags |= EXT4_ECOMPR_FL;
+	if (xflags & FS_XFLAG_INDEX)
+		iflags |= EXT4_INDEX_FL;
+	if (xflags & FS_XFLAG_IMAGIC)
+		iflags |= EXT4_IMAGIC_FL;
+	if (xflags & FS_XFLAG_JOURNAL_DATA)
+		iflags |= EXT4_JOURNAL_DATA_FL;
+	if (xflags & FS_XFLAG_IMAGIC)
+		iflags |= FS_XFLAG_NOTAIL;
+	if (xflags & FS_XFLAG_DIRSYNC)
+		iflags |= EXT4_DIRSYNC_FL;
+	if (xflags & FS_XFLAG_TOPDIR)
+		iflags |= EXT4_TOPDIR_FL;
+	if (xflags & FS_XFLAG_HUGE_FILE)
+		iflags |= EXT4_HUGE_FILE_FL;
+	if (xflags & FS_XFLAG_EXTENTS)
+		iflags |= EXT4_EXTENTS_FL;
+	if (xflags & FS_XFLAG_EA_INODE)
+		iflags |= EXT4_EA_INODE_FL;
+	if (xflags & FS_XFLAG_EOFBLOCKS)
+		iflags |= EXT4_EOFBLOCKS_FL;
+	if (xflags & FS_XFLAG_INLINE_DATA)
+		iflags |= EXT4_INLINE_DATA_FL;
+	if (xflags & FS_XFLAG_PROJINHERIT)
+		iflags |= EXT4_PROJINHERIT_FL;
+
+	return iflags;
+}
+
 /* Flags that should be inherited by new inodes from their parent. */
 #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
 			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
@@ -617,6 +726,8 @@ enum {
 #define EXT4_IOC_RESIZE_FS		_IOW('f', 16, __u64)
 #define EXT4_IOC_SWAP_BOOT		_IO('f', 17)
 #define EXT4_IOC_PRECACHE_EXTENTS	_IO('f', 18)
+#define EXT4_IOC_FSGETXATTR		FS_IOC_FSGETXATTR
+#define EXT4_IOC_FSSETXATTR		FS_IOC_FSSETXATTR
 
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index bfda18a..5df969f 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -14,6 +14,8 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/file.h>
+#include <linux/quotaops.h>
+#include <linux/quota.h>
 #include <asm/uaccess.h>
 #include "ext4_jbd2.h"
 #include "ext4.h"
@@ -198,126 +200,220 @@ journal_err_out:
 	return err;
 }
 
-long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+static int ext4_ioctl_setflags(struct file *filp, unsigned int flags)
 {
 	struct inode *inode = file_inode(filp);
-	struct super_block *sb = inode->i_sb;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int flags;
+	handle_t *handle = NULL;
+	int err, migrate = 0;
+	struct ext4_iloc iloc;
+	unsigned int oldflags, mask, i;
+	unsigned int jflag;
 
-	ext4_debug("cmd = %u, arg = %lu\n", cmd, arg);
+	if (!inode_owner_or_capable(inode))
+		return -EACCES;
 
-	switch (cmd) {
-	case EXT4_IOC_GETFLAGS:
-		ext4_get_inode_flags(ei);
-		flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
-		return put_user(flags, (int __user *) arg);
-	case EXT4_IOC_SETFLAGS: {
-		handle_t *handle = NULL;
-		int err, migrate = 0;
-		struct ext4_iloc iloc;
-		unsigned int oldflags, mask, i;
-		unsigned int jflag;
+	err = mnt_want_write_file(filp);
+	if (err)
+		return err;
 
-		if (!inode_owner_or_capable(inode))
-			return -EACCES;
+	flags = ext4_mask_flags(inode->i_mode, flags);
 
-		if (get_user(flags, (int __user *) arg))
-			return -EFAULT;
+	err = -EPERM;
+	mutex_lock(&inode->i_mutex);
+	/* Is it quota file? Do not allow user to mess with it */
+	if (IS_NOQUOTA(inode))
+		goto flags_out;
 
-		err = mnt_want_write_file(filp);
-		if (err)
-			return err;
+	oldflags = ei->i_flags;
 
-		flags = ext4_mask_flags(inode->i_mode, flags);
+	/* The JOURNAL_DATA flag is modifiable only by root */
+	jflag = flags & EXT4_JOURNAL_DATA_FL;
 
-		err = -EPERM;
-		mutex_lock(&inode->i_mutex);
-		/* Is it quota file? Do not allow user to mess with it */
-		if (IS_NOQUOTA(inode))
+	/*
+	 * The IMMUTABLE and APPEND_ONLY flags can only be changed by
+	 * the relevant capability.
+	 *
+	 * This test looks nicer. Thanks to Pauline Middelink
+	 */
+	if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
+		if (!capable(CAP_LINUX_IMMUTABLE))
 			goto flags_out;
+	}
 
-		oldflags = ei->i_flags;
-
-		/* The JOURNAL_DATA flag is modifiable only by root */
-		jflag = flags & EXT4_JOURNAL_DATA_FL;
-
-		/*
-		 * The IMMUTABLE and APPEND_ONLY flags can only be changed by
-		 * the relevant capability.
-		 *
-		 * This test looks nicer. Thanks to Pauline Middelink
-		 */
-		if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-			if (!capable(CAP_LINUX_IMMUTABLE))
-				goto flags_out;
-		}
-
-		/*
-		 * The JOURNAL_DATA flag can only be changed by
-		 * the relevant capability.
-		 */
-		if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
-			if (!capable(CAP_SYS_RESOURCE))
-				goto flags_out;
-		}
-		if ((flags ^ oldflags) & EXT4_EXTENTS_FL)
-			migrate = 1;
-
+	/*
+	 * The JOURNAL_DATA flag can only be changed by
+	 * the relevant capability.
+	 */
+	if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
+		if (!capable(CAP_SYS_RESOURCE))
+			goto flags_out;
+	}
+	if ((flags ^ oldflags) & EXT4_EXTENTS_FL)
+		migrate = 1;
 		if (flags & EXT4_EOFBLOCKS_FL) {
-			/* we don't support adding EOFBLOCKS flag */
-			if (!(oldflags & EXT4_EOFBLOCKS_FL)) {
-				err = -EOPNOTSUPP;
-				goto flags_out;
-			}
-		} else if (oldflags & EXT4_EOFBLOCKS_FL)
-			ext4_truncate(inode);
-
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
-		if (IS_ERR(handle)) {
-			err = PTR_ERR(handle);
+		/* we don't support adding EOFBLOCKS flag */
+		if (!(oldflags & EXT4_EOFBLOCKS_FL)) {
+			err = -EOPNOTSUPP;
 			goto flags_out;
 		}
-		if (IS_SYNC(inode))
-			ext4_handle_sync(handle);
-		err = ext4_reserve_inode_write(handle, inode, &iloc);
-		if (err)
-			goto flags_err;
-
-		for (i = 0, mask = 1; i < 32; i++, mask <<= 1) {
-			if (!(mask & EXT4_FL_USER_MODIFIABLE))
-				continue;
-			if (mask & flags)
-				ext4_set_inode_flag(inode, i);
-			else
-				ext4_clear_inode_flag(inode, i);
-		}
+	} else if (oldflags & EXT4_EOFBLOCKS_FL)
+		ext4_truncate(inode);
 
-		ext4_set_inode_flags(inode);
-		inode->i_ctime = ext4_current_time(inode);
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		goto flags_out;
+	}
+	if (IS_SYNC(inode))
+		ext4_handle_sync(handle);
+	err = ext4_reserve_inode_write(handle, inode, &iloc);
+	if (err)
+		goto flags_err;
+
+	for (i = 0, mask = 1; i < 32; i++, mask <<= 1) {
+		if (!(mask & EXT4_FL_USER_MODIFIABLE))
+			continue;
+		if (mask & flags)
+			ext4_set_inode_flag(inode, i);
+		else
+			ext4_clear_inode_flag(inode, i);
+	}
 
-		err = ext4_mark_iloc_dirty(handle, inode, &iloc);
-flags_err:
-		ext4_journal_stop(handle);
-		if (err)
-			goto flags_out;
+	ext4_set_inode_flags(inode);
+	inode->i_ctime = ext4_current_time(inode);
 
-		if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL))
-			err = ext4_change_inode_journal_flag(inode, jflag);
-		if (err)
-			goto flags_out;
-		if (migrate) {
-			if (flags & EXT4_EXTENTS_FL)
-				err = ext4_ext_migrate(inode);
-			else
-				err = ext4_ind_migrate(inode);
-		}
+	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+flags_err:
+	ext4_journal_stop(handle);
+	if (err)
+		goto flags_out;
+
+	if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL))
+		err = ext4_change_inode_journal_flag(inode, jflag);
+	if (err)
+		goto flags_out;
+	if (migrate) {
+		if (flags & EXT4_EXTENTS_FL)
+			err = ext4_ext_migrate(inode);
+		else
+			err = ext4_ind_migrate(inode);
+	}
 
 flags_out:
-		mutex_unlock(&inode->i_mutex);
-		mnt_drop_write_file(filp);
+	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(filp);
+	return err;
+}
+
+static int ext4_ioctl_setproject(struct file *filp, __u32 projid)
+{
+	struct inode *inode = file_inode(filp);
+	struct super_block *sb = inode->i_sb;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int err;
+	handle_t *handle;
+	kprojid_t kprojid;
+	struct ext4_iloc iloc;
+	struct ext4_inode *raw_inode;
+
+	struct dquot *transfer_to[EXT4_MAXQUOTAS] = { };
+
+	/* Make sure caller can change project. */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (projid != EXT4_DEF_PROJID
+	    && !EXT4_HAS_RO_COMPAT_FEATURE(sb,
+			EXT4_FEATURE_RO_COMPAT_PROJECT))
+		return -EOPNOTSUPP;
+
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+			EXT4_FEATURE_RO_COMPAT_PROJECT)) {
+		BUG_ON(__kprojid_val(EXT4_I(inode)->i_projid)
+		       != EXT4_DEF_PROJID);
+		if (projid != EXT4_DEF_PROJID)
+			return -EOPNOTSUPP;
+		else
+			return 0;
+	}
+
+	kprojid = make_kprojid(&init_user_ns, (projid_t)projid);
+
+	if (projid_eq(kprojid, EXT4_I(inode)->i_projid))
+		return 0;
+
+	err = mnt_want_write_file(filp);
+	if (err)
 		return err;
+
+	err = -EPERM;
+	mutex_lock(&inode->i_mutex);
+	/* Is it quota file? Do not allow user to mess with it */
+	if (IS_NOQUOTA(inode))
+		goto project_out;
+
+	dquot_initialize(inode);
+
+	handle = ext4_journal_start(inode, EXT4_HT_QUOTA,
+		EXT4_QUOTA_INIT_BLOCKS(sb) +
+		EXT4_QUOTA_DEL_BLOCKS(sb) + 3);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		goto project_out;
+	}
+
+	err = ext4_reserve_inode_write(handle, inode, &iloc);
+	if (err)
+		goto project_stop;
+
+	raw_inode = ext4_raw_inode(&iloc);
+	if ((EXT4_INODE_SIZE(sb) <=
+	     EXT4_GOOD_OLD_INODE_SIZE) ||
+	    (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid))) {
+	    	err = -EFBIG;
+	    	goto project_stop;
 	}
+
+	transfer_to[PRJQUOTA] = dqget(sb, make_kqid_projid(kprojid));
+	if (!transfer_to[PRJQUOTA])
+		goto project_set;
+
+	err = __dquot_transfer(inode, transfer_to);
+	dqput(transfer_to[PRJQUOTA]);
+	if (err)
+		goto project_stop;
+
+project_set:
+	EXT4_I(inode)->i_projid = kprojid;
+	inode->i_ctime = ext4_current_time(inode);
+	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+project_stop:
+	ext4_journal_stop(handle);
+project_out:
+	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(filp);
+	return err;
+}
+
+long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	struct super_block *sb = inode->i_sb;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int flags;
+
+	ext4_debug("cmd = %u, arg = %lu\n", cmd, arg);
+
+	switch (cmd) {
+	case EXT4_IOC_GETFLAGS:
+		ext4_get_inode_flags(ei);
+		flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
+		return put_user(flags, (int __user *) arg);
+	case EXT4_IOC_SETFLAGS:
+		if (get_user(flags, (int __user *) arg))
+			return -EFAULT;
+		return ext4_ioctl_setflags(filp, flags);
 	case EXT4_IOC_GETVERSION:
 	case EXT4_IOC_GETVERSION_OLD:
 		return put_user(inode->i_generation, (int __user *) arg);
@@ -617,7 +713,45 @@ resizefs_out:
 	}
 	case EXT4_IOC_PRECACHE_EXTENTS:
 		return ext4_ext_precache(inode);
+	case EXT4_IOC_FSGETXATTR:
+	{
+		struct fsxattr fa;
+
+		memset(&fa, 0, sizeof(struct fsxattr));
+
+		ext4_get_inode_flags(ei);
+		fa.fsx_xflags = ext4_iflags_to_xflags(ei->i_flags & EXT4_FL_USER_VISIBLE);
+
+		if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+				EXT4_FEATURE_RO_COMPAT_PROJECT)) {
+			fa.fsx_projid = (__u32)from_kprojid(&init_user_ns,
+				EXT4_I(inode)->i_projid);
+		}
+
+		if (copy_to_user((struct fsxattr __user *)arg,
+				 &fa, sizeof(fa)))
+			return -EFAULT;
+		return 0;
+	}
+	case EXT4_IOC_FSSETXATTR:
+	{
+		struct fsxattr fa;
+		int err;
+
+		if (copy_from_user(&fa, (struct fsxattr __user *)arg,
+				   sizeof(fa)))
+			return -EFAULT;
 
+		err = ext4_ioctl_setflags(filp, ext4_xflags_to_iflags(fa.fsx_xflags));
+		if (err)
+			return err;
+
+		err = ext4_ioctl_setproject(filp, fa.fsx_projid);
+		if (err)
+			return err;
+
+		return 0;
+	}
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index 18dc721..64c7ae6 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -36,38 +36,25 @@ struct dioattr {
 #endif
 
 /*
- * Structure for XFS_IOC_FSGETXATTR[A] and XFS_IOC_FSSETXATTR.
- */
-#ifndef HAVE_FSXATTR
-struct fsxattr {
-	__u32		fsx_xflags;	/* xflags field value (get/set) */
-	__u32		fsx_extsize;	/* extsize field value (get/set)*/
-	__u32		fsx_nextents;	/* nextents field value (get)	*/
-	__u32		fsx_projid;	/* project identifier (get/set) */
-	unsigned char	fsx_pad[12];
-};
-#endif
-
-/*
  * Flags for the bs_xflags/fsx_xflags field
  * There should be a one-to-one correspondence between these flags and the
  * XFS_DIFLAG_s.
  */
-#define XFS_XFLAG_REALTIME	0x00000001	/* data in realtime volume */
-#define XFS_XFLAG_PREALLOC	0x00000002	/* preallocated file extents */
-#define XFS_XFLAG_IMMUTABLE	0x00000008	/* file cannot be modified */
-#define XFS_XFLAG_APPEND	0x00000010	/* all writes append */
-#define XFS_XFLAG_SYNC		0x00000020	/* all writes synchronous */
-#define XFS_XFLAG_NOATIME	0x00000040	/* do not update access time */
-#define XFS_XFLAG_NODUMP	0x00000080	/* do not include in backups */
-#define XFS_XFLAG_RTINHERIT	0x00000100	/* create with rt bit set */
-#define XFS_XFLAG_PROJINHERIT	0x00000200	/* create with parents projid */
-#define XFS_XFLAG_NOSYMLINKS	0x00000400	/* disallow symlink creation */
-#define XFS_XFLAG_EXTSIZE	0x00000800	/* extent size allocator hint */
-#define XFS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
-#define XFS_XFLAG_NODEFRAG	0x00002000  	/* do not defragment */
-#define XFS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
-#define XFS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
+#define XFS_XFLAG_REALTIME	FS_XFLAG_REALTIME	/* data in realtime volume */
+#define XFS_XFLAG_PREALLOC	FS_XFLAG_PREALLOC	/* preallocated file extents */
+#define XFS_XFLAG_IMMUTABLE	FS_XFLAG_IMMUTABLE	/* file cannot be modified */
+#define XFS_XFLAG_APPEND	FS_XFLAG_APPEND		/* all writes append */
+#define XFS_XFLAG_SYNC		FS_XFLAG_SYNC		/* all writes synchronous */
+#define XFS_XFLAG_NOATIME	FS_XFLAG_NOATIME	/* do not update access time */
+#define XFS_XFLAG_NODUMP	FS_XFLAG_NODUMP		/* do not include in backups */
+#define XFS_XFLAG_RTINHERIT	FS_XFLAG_RTINHERIT	/* create with rt bit set */
+#define XFS_XFLAG_PROJINHERIT	FS_XFLAG_PROJINHERIT	/* create with parents projid */
+#define XFS_XFLAG_NOSYMLINKS	FS_XFLAG_NOSYMLINKS	/* disallow symlink creation */
+#define XFS_XFLAG_EXTSIZE	FS_XFLAG_EXTSIZE	/* extent size allocator hint */
+#define XFS_XFLAG_EXTSZINHERIT	FS_XFLAG_EXTSZINHERIT	/* inherit inode extent size */
+#define XFS_XFLAG_NODEFRAG	FS_XFLAG_NODEFRAG  	/* do not defragment */
+#define XFS_XFLAG_FILESTREAM	FS_XFLAG_FILESTREAM	/* use filestream allocator */
+#define XFS_XFLAG_HASATTR	FS_XFLAG_HASATTR	/* no DIFLAG for this	*/
 
 /*
  * Structure for XFS_IOC_GETBMAP.
@@ -503,8 +490,8 @@ typedef struct xfs_swapext
 #define XFS_IOC_ALLOCSP		_IOW ('X', 10, struct xfs_flock64)
 #define XFS_IOC_FREESP		_IOW ('X', 11, struct xfs_flock64)
 #define XFS_IOC_DIOINFO		_IOR ('X', 30, struct dioattr)
-#define XFS_IOC_FSGETXATTR	_IOR ('X', 31, struct fsxattr)
-#define XFS_IOC_FSSETXATTR	_IOW ('X', 32, struct fsxattr)
+#define XFS_IOC_FSGETXATTR	FS_IOC_FSGETXATTR
+#define XFS_IOC_FSSETXATTR	FS_IOC_FSSETXATTR
 #define XFS_IOC_ALLOCSP64	_IOW ('X', 36, struct xfs_flock64)
 #define XFS_IOC_FREESP64	_IOW ('X', 37, struct xfs_flock64)
 #define XFS_IOC_GETBMAP		_IOWR('X', 38, struct getbmap)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index fcbf647..872fed5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -58,6 +58,62 @@ struct inodes_stat_t {
 	long dummy[5];		/* padding for sysctl ABI compatibility */
 };
 
+/*
+ * Extend attribute flags. These should be or-ed together to figure out what
+ * is valid.
+ */
+#define FSX_XFLAGS	(1 << 0)
+#define FSX_EXTSIZE	(1 << 1)
+#define FSX_NEXTENTS	(1 << 2)
+#define FSX_PROJID	(1 << 3)
+
+/*
+ * Structure for FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.
+ */
+struct fsxattr {
+	__u32		fsx_xflags;	/* xflags field value (get/set) */
+	__u32		fsx_extsize;	/* extsize field value (get/set)*/
+	__u32		fsx_nextents;	/* nextents field value (get)	*/
+	__u32		fsx_projid;	/* project identifier (get/set) */
+	unsigned char	fsx_pad[12];
+};
+
+/*
+ * Flags for the fsx_xflags field
+ */
+#define FS_XFLAG_REALTIME	0x00000001	/* data in realtime volume */
+#define FS_XFLAG_PREALLOC	0x00000002	/* preallocated file extents */
+#define FS_XFLAG_SECRM		0x00000004	/* secure deletion */
+#define FS_XFLAG_IMMUTABLE	0x00000008	/* file cannot be modified */
+#define FS_XFLAG_APPEND		0x00000010	/* all writes append */
+#define FS_XFLAG_SYNC		0x00000020	/* all writes synchronous */
+#define FS_XFLAG_NOATIME	0x00000040	/* do not update access time */
+#define FS_XFLAG_NODUMP		0x00000080	/* do not include in backups */
+#define FS_XFLAG_RTINHERIT	0x00000100	/* create with rt bit set */
+#define FS_XFLAG_PROJINHERIT	0x00000200	/* create with parents projid */
+#define FS_XFLAG_NOSYMLINKS	0x00000400	/* disallow symlink creation */
+#define FS_XFLAG_EXTSIZE	0x00000800	/* extent size allocator hint */
+#define FS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
+#define FS_XFLAG_NODEFRAG	0x00002000  	/* do not defragment */
+#define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
+#define FS_XFLAG_UNRM		0x00008000	/* undelete */
+#define FS_XFLAG_COMPR		0x00010000	/* compress file */
+#define FS_XFLAG_COMPRBLK	0x00020000	/* one or more compressed clusters */
+#define FS_XFLAG_NOCOMPR	0x00040000	/* don't compress */
+#define FS_XFLAG_ECOMPR		0x00080000	/* compression error */
+#define FS_XFLAG_INDEX		0x00100000	/* hash-indexed directory */
+#define FS_XFLAG_IMAGIC		0x00200000	/* AFS directory */
+#define FS_XFLAG_JOURNAL_DATA	0x00400000	/* file data should be journaled */
+#define FS_XFLAG_NOTAIL		0x00800000	/* file tail should not be merged */
+#define FS_XFLAG_DIRSYNC	0x01000000	/* dirsync behaviour (directories only) */
+#define FS_XFLAG_TOPDIR		0x02000000	/* top of directory hierarchies*/
+#define FS_XFLAG_HUGE_FILE	0x04000000	/* set to each huge file */
+#define FS_XFLAG_EXTENTS	0x08000000	/* inode uses extents */
+#define FS_XFLAG_EA_INODE	0x10000000	/* inode used for large EA */
+#define FS_XFLAG_EOFBLOCKS	0x20000000	/* blocks allocated beyond EOF */
+#define FS_XFLAG_INLINE_DATA	0x40000000	/* inode has inline data. */
+#define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this */
+
 
 #define NR_FILE  8192	/* this can well be larger on a larger system */
 
@@ -163,6 +219,8 @@ struct inodes_stat_t {
 #define	FS_IOC_GETVERSION		_IOR('v', 1, long)
 #define	FS_IOC_SETVERSION		_IOW('v', 2, long)
 #define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
+#define FS_IOC_FSGETXATTR		_IOR('f', 31, struct fsxattr)
+#define FS_IOC_FSSETXATTR		_IOW('f', 32, struct fsxattr)
 #define FS_IOC32_GETFLAGS		_IOR('f', 1, int)
 #define FS_IOC32_SETFLAGS		_IOW('f', 2, int)
 #define FS_IOC32_GETVERSION		_IOR('v', 1, int)
-- 
1.7.1


^ permalink raw reply related

* [v7 5/5] ext4: cleanup inode flag definitions
From: Li Xi @ 2014-11-22  5:24 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1416633895-4004-1-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

The inode flags defined in uapi/linux/fs.h were migrated from
ext4.h. This patch changes the inode flag definitions in ext4.h
to VFS definitions to make the gaps between them clearer.

Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
---
 fs/ext4/ext4.h |   50 +++++++++++++++++++++++++-------------------------
 1 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 542b388..a3083d5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -362,33 +362,33 @@ struct flex_groups {
 /*
  * Inode flags
  */
-#define	EXT4_SECRM_FL			0x00000001 /* Secure deletion */
-#define	EXT4_UNRM_FL			0x00000002 /* Undelete */
-#define	EXT4_COMPR_FL			0x00000004 /* Compress file */
-#define EXT4_SYNC_FL			0x00000008 /* Synchronous updates */
-#define EXT4_IMMUTABLE_FL		0x00000010 /* Immutable file */
-#define EXT4_APPEND_FL			0x00000020 /* writes to file may only append */
-#define EXT4_NODUMP_FL			0x00000040 /* do not dump file */
-#define EXT4_NOATIME_FL			0x00000080 /* do not update atime */
+#define	EXT4_SECRM_FL			FS_SECRM_FL        /* Secure deletion */
+#define	EXT4_UNRM_FL			FS_UNRM_FL         /* Undelete */
+#define	EXT4_COMPR_FL			FS_COMPR_FL        /* Compress file */
+#define EXT4_SYNC_FL			FS_SYNC_FL         /* Synchronous updates */
+#define EXT4_IMMUTABLE_FL		FS_IMMUTABLE_FL    /* Immutable file */
+#define EXT4_APPEND_FL			FS_APPEND_FL       /* writes to file may only append */
+#define EXT4_NODUMP_FL			FS_NODUMP_FL       /* do not dump file */
+#define EXT4_NOATIME_FL			FS_NOATIME_FL      /* do not update atime */
 /* Reserved for compression usage... */
-#define EXT4_DIRTY_FL			0x00000100
-#define EXT4_COMPRBLK_FL		0x00000200 /* One or more compressed clusters */
-#define EXT4_NOCOMPR_FL			0x00000400 /* Don't compress */
-#define EXT4_ECOMPR_FL			0x00000800 /* Compression error */
+#define EXT4_DIRTY_FL			FS_DIRTY_FL
+#define EXT4_COMPRBLK_FL		FS_COMPRBLK_FL     /* One or more compressed clusters */
+#define EXT4_NOCOMPR_FL			FS_NOCOMP_FL       /* Don't compress */
+#define EXT4_ECOMPR_FL			FS_ECOMPR_FL       /* Compression error */
 /* End compression flags --- maybe not all used */
-#define EXT4_INDEX_FL			0x00001000 /* hash-indexed directory */
-#define EXT4_IMAGIC_FL			0x00002000 /* AFS directory */
-#define EXT4_JOURNAL_DATA_FL		0x00004000 /* file data should be journaled */
-#define EXT4_NOTAIL_FL			0x00008000 /* file tail should not be merged */
-#define EXT4_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
-#define EXT4_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
-#define EXT4_HUGE_FILE_FL               0x00040000 /* Set to each huge file */
-#define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
-#define EXT4_EA_INODE_FL	        0x00200000 /* Inode used for large EA */
-#define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
-#define EXT4_INLINE_DATA_FL		0x10000000 /* Inode has inline data. */
-#define EXT4_PROJINHERIT_FL		FS_PROJINHERIT_FL /* Create with parents projid */
-#define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */
+#define EXT4_INDEX_FL			FS_INDEX_FL        /* hash-indexed directory */
+#define EXT4_IMAGIC_FL			FS_IMAGIC_FL       /* AFS directory */
+#define EXT4_JOURNAL_DATA_FL		FS_JOURNAL_DATA_FL /* file data should be journaled */
+#define EXT4_NOTAIL_FL			FS_NOTAIL_FL       /* file tail should not be merged */
+#define EXT4_DIRSYNC_FL			FS_DIRSYNC_FL      /* dirsync behaviour (directories only) */
+#define EXT4_TOPDIR_FL			FS_TOPDIR_FL       /* Top of directory hierarchies*/
+#define EXT4_HUGE_FILE_FL               0x00040000         /* Set to each huge file */
+#define EXT4_EXTENTS_FL			FS_EXTENT_FL       /* Inode uses extents */
+#define EXT4_EA_INODE_FL	        0x00200000         /* Inode used for large EA */
+#define EXT4_EOFBLOCKS_FL		0x00400000         /* Blocks allocated beyond EOF */
+#define EXT4_INLINE_DATA_FL		0x10000000         /* Inode has inline data. */
+#define EXT4_PROJINHERIT_FL		FS_PROJINHERIT_FL  /* Create with parents projid */
+#define EXT4_RESERVED_FL		FS_RESERVED_FL     /* reserved for ext4 lib */
 
 #define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
 #define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
-- 
1.7.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox