Re: [LSF/MM TOPIC] [ATTEND] Container disk quota and lseek(2) upon shared extents


From: Jan Kara <jack@suse.cz>
To: Jeff Liu <jeff.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	lsf-pc@lists.linux-foundation.org,
	Jim Meyering <jim@meyering.net>
Subject: Re: [LSF/MM TOPIC] [ATTEND] Container disk quota and lseek(2) upon shared extents
Date: Tue, 29 Jan 2013 20:19:28 +0100
Message-ID: <20130129191928.GH32246@quack.suse.cz>
In-Reply-To: <5107FAB4.1010204@oracle.com>

  Hi Jeff,

On Wed 30-01-13 00:37:08, Jeff Liu wrote:
> On 01/29/2013 11:14 PM, Jan Kara wrote:
> >   Hello,
> > 
> > On Tue 29-01-13 22:44:24, Jeff Liu wrote:
> >> I'd like to discuss the following problems at LSF:
> >>
> >> - Container UID/GID quota support
> >> More than half a year ago, I posted a patch set adding support for UID/GID
> >> quota inside containers:
> >> http://www.spinics.net/lists/linux-containers/msg25393.html
> >>
> >> However, I had to put it on ice at the time since this feature depends on
> >> user namespaces.  Now I think it's time to bring it up again because user_ns was
> >> basically done in 3.8-rcX.
> >>
> >> Combined with user_ns, there are a couple of issues that need to be solved first:
> >> 1) UID/GID mapping between the global and the container quota files.
> >> In my previous implementation, the quotas were cached in memory, which is truly
> >> unacceptable.  I'll try to make it work as usual with journalled quota support.
> >>
> >> 2) To avoid modifying the quota tools, maybe we have to keep quotas enabled all the
> >> time inside containers, so that the end user only decides whether to set up quota
> >> limits or not.
> >>
> >> 3) Embed the container quota accounting logic into the corresponding VFS quota
> >> routines and make it transparent to the file systems underneath.
> >   So now looking into your old submission, your main aim was to make
> > quota-tools work properly when run from inside a container, right?
> Right. 
> > Because quota enforcement works properly once user namespaces are in place. In fact
> > quota calls such as Q_GETQUOTA or Q_SETQUOTA work correctly as well with
> > user namespaces. UID/GID translation from namespace id space to the
> > global space and back is already happening. So what functionality are you
> > missing?
> So it looks like there is no need to revisit it. :(
> Previously I found that we cannot turn quota off inside containers without modifying
> the quota tools.  I am not sure whether this makes sense, or whether it is a fair user
> requirement.  Anyway, I'll play with user namespaces and the quota tools to
> investigate further.
  So turning quotas on/off is a filesystem-global action. As such it's hard
to make it work from containers when you don't have an fs-per-container
setup... Implementing something like per-namespace quota enforcement (i.e.
only processes from a particular namespace will not be allowed to exceed
quota) might be reasonably possible though - you would just need to tweak
the sb_has_quota_limits_enabled() function to also take the current
namespace into account.

> >> - Introduce a new whence to lseek(2) to fetch the reflinked/shared extents
> >>
> >> We have some user requests about showing the real disk footprint of OCFS2 reflinked
> >> or Btrfs cloned files.  I wrote a shared-du utility based on du(1) for OCFS2, as
> >> it was the only file system with reflink support at that time:
> >> https://oss.oracle.com/pipermail/ocfs2-devel/2010-September/007293.html
> >   But this is a tough problem, isn't it? You have to minimally cache some
> > info about *every* file du(1) was called on so that you can check whether
> > two files share some extents or not. I'm not saying it isn't a useful
> > functionality, just I'd like to verify we are on the same page.
> Yes, in user land I have to cache the shared extents info, and
> iterate over the cached items to check whether the next one to be cached
> already exists.  If it exists, increase its count and check the
> next one... otherwise, cache it, and repeat this step again and again
> until all the files residing on the target partition/directories have been
> checked.
  Yes, that's what I'd imagine.
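
The caching step described above could be sketched in user space roughly
as follows. This is only an illustration, not code from shared-du: the
struct phys_extent type and the unique_bytes() helper are hypothetical
names, and instead of keeping a per-extent reference count it simply
sorts the collected physical ranges and merges overlaps so that bytes
referenced by several files are counted once.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one physical extent collected from any file. */
struct phys_extent {
    uint64_t start;   /* physical byte offset on the device */
    uint64_t length;  /* length in bytes */
};

/* qsort() comparator: order extents by physical start offset. */
static int cmp_extent(const void *a, const void *b)
{
    const struct phys_extent *x = a, *y = b;

    if (x->start < y->start)
        return -1;
    return x->start > y->start;
}

/*
 * Return the number of distinct physical bytes covered by ext[0..n-1].
 * Ranges that overlap (i.e. are shared between files) count only once.
 * The array is sorted in place.
 */
uint64_t unique_bytes(struct phys_extent *ext, size_t n)
{
    uint64_t total = 0, cur_start, cur_end;
    size_t i;

    if (n == 0)
        return 0;

    qsort(ext, n, sizeof(*ext), cmp_extent);
    cur_start = ext[0].start;
    cur_end = ext[0].start + ext[0].length;

    for (i = 1; i < n; i++) {
        uint64_t end = ext[i].start + ext[i].length;

        if (ext[i].start > cur_end) {
            /* Disjoint from the current run: account it, start a new one. */
            total += cur_end - cur_start;
            cur_start = ext[i].start;
            cur_end = end;
        } else if (end > cur_end) {
            /* Overlaps or touches the current run: extend it. */
            cur_end = end;
        }
    }
    return total + (cur_end - cur_start);
}
```

A du-like tool would append one record per extent while walking the
files and call unique_bytes() once at the end; the memory cost of
holding every extent is exactly the concern raised above.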

> >> It is based on the FIEMAP ioctl(2) in user space, and OCFS2 uses the
> >> FIEMAP_EXTENT_SHARED flag to indicate that an extent is reflinked/COW when the
> >> internal OCFS2_EXT_REFCOUNTED flag is detected.
> >>
> >> Recently, I have started to implement this feature on Btrfs with a similar approach.
> >> Once it is complete, the next step is to teach upstream du(1) to work with both file
> >> systems via a new command option.
> >>
> >> This still sounds like nothing new because we have FIEMAP... :( But considering how
> >> awkward and error-prone that interface was when I improved cp(1) with it for sparse
> >> files, this would extend the ugly tentacles of FIEMAP into du(1) again, which the
> >> coreutils maintainer (Jim, CC-ed) doesn't like at all, and which I also want to avoid
> >> if possible...
> >>
> >> How about adding a new whence type to lseek(2) for this purpose?  lseek has a very
> >> clear interface and works very well for SEEK_DATA/SEEK_HOLE; it could most likely
> >> work fine for shared extents as well, IMHO.
> >   Well, I can hardly imagine how such an lseek(2) interface would have to
> > look to be useful for identifying shared extents among different files.
> > Do you have something particular in mind?
> lseek(2) would not be used for identifying shared extents among files.  It
> would be extended so that, called with a particular whence, it finds and
> returns a desired extent which is reflinked or cloned; the underlying
> file system would have to be improved accordingly.
> 
> Take Btrfs: if we perform btrfs_ioctl_clone from source file A to
> target B and run du(1) against both files, it shows double the space
> although only half of it is really used/reserved thanks to COW.
> 
> If we can mark the cloned extents of a file with a special flag (say
> EXTENT_MAP_CLONED), then calling lseek(fd, offset, SEEK_CLONE or ?)
> would return the offset of a cloned extent which is at or beyond the
> given offset, so we can find all the cloned extents of a file, which
> would be used for disk space accounting in user space tools.
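
For reference, the existing SEEK_DATA/SEEK_HOLE walk that the proposal is
modelled on looks roughly like the sketch below. SEEK_CLONE does not exist
in any kernel; the comment only notes where such a hypothetical whence
would slot in. data_bytes() is an illustrative name, not an existing API.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes SEEK_DATA / SEEK_HOLE on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA              /* Linux values, in case the headers hide them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Sum the sizes of all data regions of the file at path by alternating
 * SEEK_DATA and SEEK_HOLE.  Returns the byte count, or -1 on error.
 * A hypothetical SEEK_CLONE whence would be iterated with the same loop
 * shape, replacing the SEEK_DATA call.
 */
off_t data_bytes(const char *path)
{
    off_t pos = 0, total = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    for (;;) {
        off_t data, hole;

        data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            /* ENXIO means there is no more data at or after pos. */
            if (errno != ENXIO)
                total = -1;
            break;
        }
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0) {
            total = -1;
            break;
        }
        total += hole - data;
        pos = hole;
    }
    close(fd);
    return total;
}
```

On file systems without native hole tracking the kernel reports the whole
file as one data region, so the result then equals the file size.
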
  OK, but then you have to call FIEMAP anyway to find which blocks are
underlying the extent so that you can match that with cloned extents from
different files. Ah, and the advantage would be that you don't have to
cache *all* the extents but only those that are reported as reflinked. OK,
now I see.
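
The FIEMAP side of this can be sketched as below: walk a file's extent
map and look only at the extents flagged FIEMAP_EXTENT_SHARED. The helper
name shared_bytes() and the batch size are assumptions made for the
example; a real tool would record fe_physical and fe_length of each
shared extent for cross-file matching rather than just summing lengths.

```c
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>          /* FS_IOC_FIEMAP */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BATCH 64               /* extents fetched per ioctl; arbitrary */

/*
 * Sum the lengths of all extents of the file at path that the file
 * system reports as FIEMAP_EXTENT_SHARED.  Returns 0 and stores the sum
 * in *shared on success, -1 on failure (for example when the file
 * system does not implement FIEMAP).
 */
int shared_bytes(const char *path, uint64_t *shared)
{
    size_t sz = sizeof(struct fiemap) + BATCH * sizeof(struct fiemap_extent);
    struct fiemap *fm = malloc(sz);
    uint64_t start = 0;
    int fd, last = 0, ret = -1;

    *shared = 0;
    if (!fm)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        goto out;

    while (!last) {
        memset(fm, 0, sz);
        fm->fm_start = start;
        fm->fm_length = FIEMAP_MAX_OFFSET - start;
        fm->fm_flags = FIEMAP_FLAG_SYNC;   /* flush delalloc before mapping */
        fm->fm_extent_count = BATCH;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
            goto out_close;
        if (fm->fm_mapped_extents == 0)
            break;                         /* nothing mapped past start */

        for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
            struct fiemap_extent *fe = &fm->fm_extents[i];

            if (fe->fe_flags & FIEMAP_EXTENT_SHARED)
                *shared += fe->fe_length;  /* only these need caching */
            if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                last = 1;
            start = fe->fe_logical + fe->fe_length;
        }
    }
    ret = 0;
out_close:
    close(fd);
out:
    free(fm);
    return ret;
}
```

Extents without the shared flag never enter the cache, which is the
saving noted above; whether a file system sets the flag at all is up to
its FIEMAP implementation.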

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
