Dave Chinner <david@fromorbit.com> writes:

> On Fri, Feb 19, 2010 at 01:16:47PM +0300, Dmitry Monakhov wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> 
>> > On Thu, Feb 18, 2010 at 07:45:24PM +0300, Dmitry Monakhov wrote:
>> >> This is a new attempt to add an extended inode identifier.
>> >> In previous posts it was called tree_id, subtree_id, project_id,
>> >> but none of these was good enough. I rejected project_id
>> >> because it is a well-known XFS feature.
>> >
>> > Admins, users and developers of management tools are all going to
>> > hate us if we introduce subtly different "project/directory quota
>> > like" accounting to different filesystems with different
>> > administration mechanisms.
>> It seems you are right here.
>> >
>> > The fact that project quotas are already implemented in XFS is not a
>> > valid reason for creating a new, slightly less functional,
>> > incompatible implementation of the same feature in other
>> > filesystems.
>> >
>> >> And my implementation is
>> >> slightly different from it, especially from a user-space point of view.
>> >
>> > This is exactly my point - if a user has an ext4 filesystem and an
>> > xfs filesystem then your proposal will result in them needing two
>> > different mechanisms to manage the project/directory quotas on their
>> > filesystems.  This result is not desirable from a system design
>> > perspective.  Management of such a feature needs to be consistent
>> > across all filesystem types - just like it is for user and group
>> > quotas - and we already have a widely used and well tested
>> > management interface that can be used to implement exactly what you
>> > need.
>> Not exactly. XFS allows only a subtree-like structure
>
> Not true at all.  XFS allows an arbitrary distribution of files in a
> given project - they are not restricted to subtrees. This isn't
> widely used because it requires manually setting the project ID
> after the file is created. e.g. create a backup tarball of a project
> hierarchy in an external non-controlled directory, then change the
> project ID of the tarball to the correct project ID so that the
> backup is also accounted to the correct project...
>
> For example, I'll create a new project (testproj) and subtree
> (/mnt/xfs/foo) associated with the project, create a 25MB file
> inside the subtree, show it being accounted, then copy it outside
> the subtree, show it isn't accounted, then change the project ID
> of the outside copy to testproj and show that it is accounted to
> the testproj even though it is outside the subtree:
>
> # mkfs.xfs -f /dev/ubd/1
> [.....]
> # mount -o prjquota /dev/ubd/1 /mnt/xfs
> # mkdir /mnt/xfs/foo
> #
> #
> # echo testproj:42 >> /etc/projid
> # echo 42:/mnt/xfs/foo >> /etc/projects
> # xfs_quota -x -c 'project -s testproj' /mnt/xfs
> Setting up project testproj (path /mnt/xfs/foo)...
> Processed 1 /etc/projects paths for project testproj
> #
> #
> #
> # xfs_quota -x -c 'limit -p bhard=1g testproj' /mnt/xfs
> # xfs_quota -x -c print /mnt/xfs
> Filesystem          Pathname
> /mnt/xfs            /dev/ubd/1 (pquota)
> /mnt/xfs/foo        /dev/ubd/1 (project 42, testproj)
> # xfs_quota -x -c report /mnt/xfs
> Project quota on /mnt/xfs (/dev/ubd/1)
>                                Blocks
> Project ID       Used       Soft       Hard    Warn/Grace
> ---------- --------------------------------------------------
> testproj            0          0    1048576     00 [--------]
>
> #
> #
> #
> # dd if=/dev/zero of=foo/testfile bs=1024k count=25
> 25+0 records in
> 25+0 records out
> 26214400 bytes (26 MB) copied, 0.116102 s, 226 MB/s
> # sudo xfs_quota -x -c report /mnt/xfs
> Project quota on /mnt/xfs (/dev/ubd/1)
>                                Blocks
> Project ID       Used       Soft       Hard    Warn/Grace
> ---------- --------------------------------------------------
> testproj        25600          0    1048576     00 [--------]
>
> #
> #
> #
> # cp foo/testfile .
> # sync
> # xfs_quota -x -c report /mnt/xfs
> Project quota on /mnt/xfs (/dev/ubd/1)
>                                Blocks
> Project ID       Used       Soft       Hard    Warn/Grace
> ---------- --------------------------------------------------
> testproj        25600          0    1048576     00 [--------]
>
> #
> #
> #
> # xfs_io -f -c "chproj 42" testfile
> # xfs_quota -x -c report /mnt/xfs
> Project quota on /mnt/xfs (/dev/ubd/1)
>                                Blocks
> Project ID       Used       Soft       Hard    Warn/Grace
> ---------- --------------------------------------------------
> testproj        51200          0    1048576     00 [--------]
>
> #
>
>
>> (link, rename are restricted).
>
> The EXDEV on rename behaviour is purely an implementation detail -
> it makes quota accounting in XFS simple. i.e. rename returns EXDEV
> so that a mv(1) will fall back to create/copy/unlink and that
> automatically gets the quota accounting correct. That is, it didn't
> require a complex extension of dquot handling in the rename
> transaction to implement.  This one could be fixed, and a couple of
> ppl have actually asked recently if it could be done because moving
> a few TB of data between projects is time consuming.
>
> However, hard links are a different matter. If you can clearly
> determine how to hard link a file into multiple different projects
> (dquots), then track and account for all the space used in a sane
> manner, work out how to account for new or removed files in such a
> hardlinked directory, etc, then you can allow hard links between
> different subtrees.
Yes, I do understand that. In fact I initially specified the
rename/link rules myself, and later discovered that XFS
implemented this a long time ago in exactly the same way.
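
To make the rule concrete, here is a minimal sketch in C. The names
(struct prj_inode, prj_check_isolated) are made up for illustration and
this is a simplified model, not the XFS or ext4 code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for per-inode state; not a real kernel struct. */
struct prj_inode {
    uint32_t i_prjid;      /* project the inode is accounted to */
    unsigned int i_nlink;  /* number of hard links */
};

/*
 * Isolation rule: an inode may only be linked or renamed into a
 * directory that carries the same project ID. Returning -EXDEV lets
 * mv(1) fall back to copy + unlink, so the copy is charged to the
 * target project without any special dquot handling in rename.
 */
static int prj_check_isolated(const struct prj_inode *target_dir,
                              const struct prj_inode *inode)
{
    assert(target_dir != NULL && inode != NULL);
    if (target_dir->i_prjid != inode->i_prjid)
        return -EXDEV;
    return 0;
}
```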

BTW: rename is not so simple either, because renaming a file which
has more than one hard link results in the same madness.

But as Al Viro pointed out, these semantics are already implemented
by bind mounts (IMHO the only bad thing is that a bind mount is not
a persistent structure).
We just give the user a rope, and it is his decision whether to shoot
himself or not.
Otherwise, we may try to convince Al Viro to like the hard link
isolation idea, maybe by restricting this tiny rule under a
CONFIG_PROJECT_ID_ISOLATED config option.
>
> For example, if you add a new file into such a hard linked
> directory, who does it get accounted to? What happens if you then
> move a multiple-hard linked file to a different subtree?
By assumption an inode may belong to only one project, the one which is
stored in private_inode->i_prjid. It will be accounted to
that quota.
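
Roughly like this (a sketch only; struct prj_usage, prj_charge and
prj_add_link are hypothetical names, and the fixed-size table just
stands in for real dquots):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PROJECTS 64  /* arbitrary bound for the sketch */

/* Hypothetical per-project usage table; stands in for real dquots. */
struct prj_usage {
    uint64_t blocks[MAX_PROJECTS];
};

struct prj_inode {
    uint32_t i_prjid;      /* the single owning project */
    unsigned int i_nlink;  /* number of hard links */
    uint64_t i_blocks;     /* blocks currently charged */
};

/* Charge newly allocated blocks to the inode's own project only. */
static void prj_charge(struct prj_usage *u, struct prj_inode *inode,
                       uint64_t nblocks)
{
    assert(inode->i_prjid < MAX_PROJECTS);
    u->blocks[inode->i_prjid] += nblocks;
    inode->i_blocks += nblocks;
}

/* Adding a hard link changes the link count but touches no quota. */
static void prj_add_link(struct prj_inode *inode)
{
    inode->i_nlink++;
}
```

So however many directories link to the inode, every transaction
updates exactly one dquot.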
> If the
> inode is accounted to all projects, then each of these filesystem
> transactions requires updating an arbitrary (unbound) number of
> dquots - this alone makes journal reservations for transactions a
> nightmare to calculate and greatly increases the complexity of such
> transactions.
>
> Disallowing hard links between directories in different projects
> makes these cans of worms go away - it is a very practical design
> choice to make. However, it in no way results in XFS project quotas
> being restricted to subtrees - it is a *change of project quota*
> that triggers these behaviours.
>
>> Personally I think that is the right restriction, but someone may
>> want a non-subtree-like hierarchy. So this patch doesn't introduce
>> any link/rename rules.
>
> The link/rename behaviour of XFS does not prevent this type of usage
> at all.
>
>> If a user wants to restrict his tree he will use a
>> bind mount. IMHO it is more intuitive than what XFS does.
>
> XFS is not trying to implement bind mount -like restrictions. The
> behaviour was carefully designed to allow project quotas to be
> sanely implemented.
>
>> But again you are definitely right about the feature_names/interfaces
>> ambiguity. If we can create a common interface it would be great.
>> See later in the mail.
>> >
>> >> In order to avoid ambiguity I've settled on the "metagroup" term.
>> >> I hope it is the final name for the feature.
>> >
>> > I think "metagroup" is too abstract and will likely be confused with
>> > group quotas by those that don't understand what it is. i.e it does
>> > not convey any information about the bounds of the quota container
>> > (unlike user, group, directory or project).
>> Ok. Since we want a common interface we should use the well-known
>> "project_id" term.
>> 
>> I think we can try to unify it in the following way:
>> *User interface*
>> As far as I understand, XFS manages projid via xfs_ioctl_setattr and
>> struct fsxattr. IMHO it is not a good idea to make this interface common
>> for all filesystems. Let's use the standard i_op->setxattr/getxattr for
>> this purpose. Let's name this xattr "system.project_id".
>
> That's fine by me. I'd much prefer that we used the xattr interface
> for inode attributes instead of poking bits through fcntl or ioctls...
>
>> And xfs may easily catch the corresponding setxattr/getxattr and translate
>> it to its ioctl interface, so both interfaces will be equivalent.
>> At least the xattr interface is already supported by various utils (tar,
>> rsync, etc).
>
> Well, the point of the way XFS implements project quotas is that
> utilities such as cp, mv, tar, rsync, etc do not need to know
> anything about them - just like user/group quotas.
>
> If we go down the xattr route, then these utilities can't be allowed
> to copy these xattrs to new files; the filesystem has to create them
> atomically with the new inodes so that they are accounted correctly.
Exactly. It is like how init_acl and init_security work on inode creation
(see fs/ext4/ialloc.c). I accidentally missed that in the posted
version of the ext4-add-metagroup-support patch.
In fact I have to confess that the ext4-metagroup part was not in a working
state at posting time. Currently it seems to work, see the attached patch.

BTW the project_id changing procedure looks really ugly, because we have
to perform two things in a row: quota_transfer, then the proj_id update.
If we are unable to update project_id then we have to roll back the
quota transfer. And in fact this may result in -EDQUOT, because currently
quota has no *force* charge flag :)
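
A sketch of what I mean, in plain C. Everything here is hypothetical
(struct prj_dquot, dq_charge, prj_change, the set_prjid callback); in
particular the 'force' argument is exactly the flag that quota does
not have today:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical dquot model: usage and hard limit in blocks (0 = no limit). */
struct prj_dquot {
    uint64_t used;
    uint64_t hard;
};

/* Charge blocks; unless 'force' is set, refuse to exceed the hard limit. */
static int dq_charge(struct prj_dquot *dq, uint64_t n, int force)
{
    if (!force && dq->hard && dq->used + n > dq->hard)
        return -EDQUOT;
    dq->used += n;
    return 0;
}

static void dq_uncharge(struct prj_dquot *dq, uint64_t n)
{
    assert(dq->used >= n);
    dq->used -= n;
}

/* Stubs standing in for the on-disk proj_id update. */
static int set_prjid_ok(void)   { return 0; }
static int set_prjid_fail(void) { return -EIO; }

/*
 * Step 1: move the charge from 'from' to 'to' (the quota transfer).
 * Step 2: store the new ID via 'set_prjid'.
 * If step 2 fails, undo step 1. The undo uses a forced charge so that
 * the rollback cannot itself fail with -EDQUOT.
 */
static int prj_change(struct prj_dquot *from, struct prj_dquot *to,
                      uint64_t blocks, int (*set_prjid)(void))
{
    int err = dq_charge(to, blocks, 0);
    if (err)
        return err;
    dq_uncharge(from, blocks);

    err = set_prjid();
    if (err) {
        dq_uncharge(to, blocks);
        (void)dq_charge(from, blocks, 1); /* forced: must succeed */
    }
    return err;
}
```

Without the forced charge, the rollback could hit the old project's
hard limit if somebody else consumed the freed space in between.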
> If they are created non-atomically and the system crashes between
> creating the file and applying the quota xattr, then you have an
> inconsistency that only a quotacheck will pick up....
This is just how it works for now.
Each tar-like application works like this:
1) open
2) write
3) chown
4) chmod
So if something happens before (3), the file will be accounted to the
current user_id.
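
In code the sequence looks roughly like this (an illustrative sketch
using plain POSIX calls; restore_file is a made-up helper, not taken
from tar or rsync):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int restore_file(const char *path, const void *data, size_t len,
                        uid_t uid, gid_t gid, mode_t mode)
{
    assert(path != NULL);

    /* 1) open: the new inode is owned by the calling user */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* 2) write: blocks are charged to the calling user's quota */
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {           /* error or no progress */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* 3) chown: only now does the usage move to the final owner */
    if (fchown(fd, uid, gid) < 0) {
        close(fd);
        return -1;
    }

    /* 4) chmod */
    if (fchmod(fd, mode) < 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

A crash anywhere between steps 1 and 3 leaves the file charged to
whoever ran the restore.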

>
>> *Link/Rename behavior*
>>  Let's introduce two modes:
>>  1) SHARED project hierarchy: without restrictions for link/renames
>
> See above - I don't think "without restrictions" can be easily
> implemented because of the complexity hard links introduce.
>
>>  2) ISOLATED project hierarchy: Well known XFS (subtrees like)
>>     link/rename rules
>>  And support these two modes like this:
>>  generic_fs)
>>        SHARED: by default 
>>        ISOLATED: via bindmount
>>  XFS)
>
> This is a change of behaviour from the existing XFS project quota
> configurations as they do not require bind mounts at all.
>
> I'm interested to know how you see this working when you have
> multiple subtrees with the same project ID? Renaming and linking
Yep, good catch.
> between those subtrees is currently possible with XFS project IDs,
> but adding bind mounts would cause EXDEV to be returned for these
> operations. i.e. It seems to me that these subtrees are "shared" by
> your definition, but the addition of bind mounts makes them
> "isolated".
>
> Or you want a part of a subtree to be moved to a different project
> ID because it needs to be accounted separately?  e.g. a group gets
> moved in the organisation hierarchy, so the bean counters want to
> change the project ID on all their files so their space usage can be
> billed to the new department. If bind mounts are involved, this
> quickly becomes complex and unmaintainable. It's not something that
> users can easily manage, especially compared to the current 'xfs_io
> -c "chproj -R <projid>" /path/to/subtree' method of doing this.
It seems we have to work on a "get vfs people to like isolated subtrees" plan.
>
> ----
>
> IMO focusing on link/rename restrictions as the deciding factor in
> defining the user interface is wrong. I started out by saying that
> having different user interfaces for different filesystems is not
> desirable. You've ended up trying to encode the differences you
> assume exist into a new user interface instead.
>
> I'll rephrase the question - what part of the existing XFS project
> quota administration interface (i.e. /etc/projects, /etc/projid,  a
> quota command to set up the initial tree, etc) is not sufficient for
> your purposes of defining and managing subtrees?  If it is not
> sufficient, what simple extensions can we add that will make it
> sufficient? Once we've got the high level management interface
> defined, everything else is just details. ;)
>
The XFS interface is enough. IMHO it is kinda rich, but all the necessary
things are already there.
>>        ISOLATED: by default, because this is expected semantics (no
>>                  changes required)
>>        SHARED: xfs may add "shared_project" mount feature to disable
>>                isolation semantics. At least this gives user more
>>                flexibility than before.
>>  We have to document this difference in order to avoid misbehavior.
>
>
>> *VFS interface to project_id*
>>  In order to make use of project_id we have to make it visible to
>>  the vfs layer, and let quota and nfsd (any other users?) exploit this.
>>  Let's use the proposed per-sb aux_attributes table for this purpose.
>
> Why go to that complexity? Just add a 32 bit proj_id identifier to
> the struct inode. If it's supposed to be generic, then simply
> implement it like user and group quotas are.
Of course this is the best solution. But when I added the i_rsv_space
field to vfs_inode to support quota allocation for delayed allocation,
many people were fairly against the idea of bloating vfs_inode.
So I came up with the idea of designing an aux_inode_table, and allowing
everybody to put their crap into that table without big discussions.

But the project_id case is better because it can be hidden under a
CONFIG_PROJECT_ID option, so no space is wasted when it is disabled.
In the next round I'll embed it into vfs_inode, and if people really
object to this, we will go back to the aux_inode_table approach.
I plan to post the next version next Monday.