linux-fsdevel.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: Andreas Dilger <adilger@sun.com>
Cc: jim owens <jowens@hp.com>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Mark Fasheh <mfasheh@suse.com>, Andreas Dilger <adilger@shaw.ca>,
	Kalpak Shah <Kalpak.Shah@sun.com>,
	Eric Sandeen <sandeen@redhat.com>,
	Josef Bacik <jbacik@redhat.com>
Subject: Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
Date: Fri, 30 May 2008 09:37:58 -0400
Message-ID: <200805300937.58674.chris.mason@oracle.com>
In-Reply-To: <20080529220153.GI2985@webber.adilger.int>

On Thursday 29 May 2008, Andreas Dilger wrote:
> On May 28, 2008  12:33 -0400, Chris Mason wrote:
> > On Wednesday 28 May 2008, Andreas Dilger wrote:
> > > For Lustre, it is completely inefficient to return data in
> > > non-LUN_ORDER, because it is doing RAID-0 striping of the file data
> > > across data servers. A 100MB 2-stripe file with 1MB stripes would have
> > > to return 100 extents, even if the file data is allocated contiguously
> > > on disk in the backing filesystems in two 50MB chunks.  With LUN_ORDER
> > > it will return 2 extents and the user can see much more clearly that
> > > the file is laid out well.
> >
> > Ah, so lustre doesn't have a logical address layer at all?  In my case
> > the files contain pointers to contiguous logical extents and the lower
> > layers of the FS figure out whether that is raid0/1/10 or whatever future
> > crud I toss in.
> >
> > If the logical extents are contiguous it is safe to assume the lower end
> > is also contiguous.
>
> Well, Lustre has a logical address layer on a per-file basis, but the
> layout maps from the file offsets to multiple object offsets.  There is
> no "flat" logical device in the background which file allocations are
> coming from, because the API provided to the client is based only on
> objects and offsets, and there may be multiple objects that map into a
> single file via some striping.  That is currently RAID-0 across objects,
> but it might be RAID-1/5/6 or something else in the future.  With the
> RAID-0 layout, the logical file offsets round-robin across the multiple
> objects with a certain stripe size (default 1MB).
>
> It sounds like you actually have the same setup with btrfs (if it is at
> all like ZFS) that file blocks map onto multiple disks, and there may
> be multiple copies of the data (RAID-1/10).

In my case, all pointers to extents (both metadata blocks and file data)
reference a logical address space.  So, even for raid10 (or raid5/6, if I ever
code it), there is a central place that does translation from
logical->physical block(s).
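
To illustrate the idea of a single translation point, here is a minimal
Python sketch.  It is not the Btrfs on-disk format or code: the Chunk and
ChunkMap names, the fields, and the mirror-style layout are all hypothetical,
chosen only to show one lookup table turning a logical address into one or
more (device, physical offset) pairs.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Chunk(NamedTuple):
    """Hypothetical mapping record: one logical range and its copies."""
    logical_start: int              # first logical byte covered
    length: int                     # number of bytes covered
    copies: List[Tuple[int, int]]   # (device id, physical start) per copy


class ChunkMap:
    """Central logical -> physical translation (illustrative only)."""

    def __init__(self, chunks):
        self._chunks = sorted(chunks, key=lambda c: c.logical_start)
        self._starts = [c.logical_start for c in self._chunks]

    def translate(self, logical):
        """Return every (device, physical offset) holding this logical byte."""
        i = bisect_right(self._starts, logical) - 1
        if i < 0:
            raise KeyError(f"logical address {logical} is not mapped")
        chunk = self._chunks[i]
        offset = logical - chunk.logical_start
        if offset >= chunk.length:
            raise KeyError(f"logical address {logical} is not mapped")
        return [(dev, phys + offset) for dev, phys in chunk.copies]
```

Everything above this table (file extents, metadata pointers) only ever sees
logical addresses, so the redundancy scheme can change without touching them.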

The disk format supports multiple (2^64) such namespaces but that isn't being 
used yet.

>
>
> What a user/administrator really cares about in the end is whether
> the files are allocated contiguously within the objects on the server
> filesystems.  If we were to run filefrag (with FIEMAP support) on a
> Lustre file without LUN_ORDER, or maybe a RAID-5 btrfs file, it would
> return a list of extents, each broken up at smaller boundaries, and it
> will convey the wrong idea of how the file is laid out physically.
>

For Btrfs, it'll always return the logical extents, and because the storage is
grouped in relatively large chunks (~1GB or more), this is sufficient for
measuring fragmentation.
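
For comparison, the striping arithmetic from the quoted Lustre example (a
100MB file, 2 stripes, 1MB stripe size) can be sketched in Python.  This is
only an illustration of round-robin RAID-0 mapping, not Lustre code: the
function names are made up, and object offsets stand in for physical
placement on the assumption that each object is allocated contiguously on
its backing filesystem.

```python
MIB = 1 << 20


def stripe_location(file_offset, stripe_size, stripe_count):
    """Map a file offset to (object index, offset within that object)."""
    stripe_no = file_offset // stripe_size
    obj = stripe_no % stripe_count
    obj_offset = (stripe_no // stripe_count) * stripe_size \
        + file_offset % stripe_size
    return obj, obj_offset


def extents_in_file_order(file_size, stripe_size, stripe_count):
    """One extent per stripe piece: (file_offset, object, obj_offset, length)."""
    extents = []
    off = 0
    while off < file_size:
        length = min(stripe_size - off % stripe_size, file_size - off)
        obj, obj_off = stripe_location(off, stripe_size, stripe_count)
        extents.append((off, obj, obj_off, length))
        off += length
    return extents


def extents_in_lun_order(file_size, stripe_size, stripe_count):
    """Group by object, merging pieces that are adjacent within an object."""
    pieces = sorted(
        (obj, obj_off, length)
        for _, obj, obj_off, length
        in extents_in_file_order(file_size, stripe_size, stripe_count)
    )
    merged = []
    for obj, obj_off, length in pieces:
        if merged and merged[-1][0] == obj \
                and merged[-1][1] + merged[-1][2] == obj_off:
            merged[-1] = (obj, merged[-1][1], merged[-1][2] + length)
        else:
            merged.append((obj, obj_off, length))
    return merged
```

Under those assumptions, walking the file in logical order yields one extent
per 1MB stripe piece, while grouping by object collapses each object's pieces
into a single run.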

But if lustre doesn't have this kind of logical backing store, I think that is
reason enough to keep the lun interface.  I know lots of people are against
adding interfaces to the kernel for out-of-tree projects, but the per-file
logical mapping you describe is a very reasonable way to design things, and
we might as well leave it in for future use.

> Dropping lun/device support, and removing all of the flexibility of the
> FIEMAP interface design, is IMHO killing the whole reason I proposed
> FIEMAP in the first place.

My goal isn't to remove the flexibility from the interface design; it is just
to ask whether all of this functionality needs to be in one ioctl.  At least
the device number / lun bit makes sense now (Mark, if you keep it, please
don't make this a dev_t).  Thanks for the extra details.

-chris
