From: Josef Bacik <josef@redhat.com>
To: Mike Fedyk <mfedyk@mikefedyk.com>
Cc: Josef Bacik <josef@redhat.com>,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
chris.mason@oracle.com, hch@lst.de, ssorce@redhat.com,
bfields@redhat.com
Subject: Re: What to do about subvolumes?
Date: Mon, 6 Dec 2010 09:27:44 -0500
Message-ID: <20101206142743.GA2556@localhost.localdomain>
In-Reply-To: <AANLkTimG7Qp_VCe71y7M4t5Y+x8kO=AAJWO1NWDN0HJO@mail.gmail.com>
On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <josef@redhat.com> wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> >> Hello,
> >>
> >> Various people have complained about how BTRFS deals with subvolumes recently,
> >> specifically the fact that they all have the same inode number, and there's no
> >> discrete separation from one subvolume to another.  Christoph asked that I lay
> >> out a basic design document of how we want subvolumes to work so we can hash
> >> everything out now, fix what is broken, and then move forward with a design that
> >> everybody is more or less happy with.  I apologize in advance for how freaking
> >> long this email is going to be.  I assume that most people are generally
> >> familiar with how BTRFS works, so I'm not going to bother explaining in great
> >> detail some stuff.
> >>
> >> === What are subvolumes? ===
> >>
> >> They are just another tree.  In BTRFS we have various b-trees to describe the
> >> filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
> >> tree, root tree etc.  The trees that hold the actual filesystem data, that is
> >> inodes and such, are kept in their own b-tree.  This is how subvolumes and
> >> snapshots appear on disk, they are simply new b-trees with all of the file data
> >> contained within them.
> >>
> >> === What do subvolumes look like? ===
> >>
> >> All the user sees are directories.  They act like any other directory acts, with
> >> a few exceptions
> >>
> >> 1) You cannot hardlink between subvolumes.  This is because subvolumes have
> >> their own inode numbers and such, think of them as separate mounts in this case,
> >> you cannot hardlink between two mounts because the link needs to point to the
> >> same on disk inode, which is impossible between two different filesystems.  The
> >> same is true for subvolumes, they have their own trees with their own inodes and
> >> inode numbers, so it's impossible to hardlink between them.
> >>
> >> 1a) In case it wasn't clear from above, each subvolume has their own inode
> >> numbers, so you can have the same inode numbers used between two different
> >> subvolumes, since they are two different trees.
> >>
> >> 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
> >> extra metadata to keep track of them, so you have to use one of our ioctls to
> >> delete subvolumes/snapshots.
> >>
> >> But permissions and everything else they are the same.
> >>
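A minimal sketch of what exception 1 means for a userspace tool, assuming a
link across a subvolume boundary fails the same way a link across two mounts
does, with EXDEV.  The helper name link_or_copy is hypothetical; os.link,
errno.EXDEV and shutil.copy2 are standard Python.

```python
import errno
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst; fall back to a copy if the link crosses devices.

    Returns "linked" or "copied".  Assumption: linking across a subvolume
    boundary fails with EXDEV, as linking across two mounts does.
    """
    try:
        os.link(src, dst)
        return "linked"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # any other failure is a real error, not a boundary
        # The two paths live in different inode namespaces, so they cannot
        # share one on-disk inode.  Duplicate the data and metadata instead.
        shutil.copy2(src, dst)
        return "copied"
```

The fallback loses the shared-inode property, so callers that depend on link
counts need to check the return value.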
> >> There is one tricky thing.  When you create a subvolume, the directory inode
> >> that is created in the parent subvolume has the inode number of 256.  So if you
> >> have a bunch of subvolumes in the same parent subvolume, you are going to have a
> >> bunch of directories with the inode number of 256.  This is so when users cd
> >> into a subvolume we can know it's a subvolume and do all the normal voodoo to
> >> start looking in the subvolume's tree instead of the parent subvolume's tree.
> >>
> >> This is where things go a bit sideways.  We had serious problems with NFS, but
> >> thankfully NFS gives us a bunch of hooks to get around these problems.
> >> CIFS/Samba do not, so we will have problems there, not to mention any other
> >> userspace application that looks at inode numbers.
> >>
> >> === How do we want subvolumes to work from a user perspective? ===
> >>
> >> 1) Users need to be able to create their own subvolumes.  The permission
> >> semantics will be absolutely the same as creating directories, so I don't think
> >> this is too tricky.  We want this because you can only take snapshots of
> >> subvolumes, and so it is important that users be able to create their own
> >> discrete snapshottable targets.
> >>
> >> 2) Users need to be able to snapshot their subvolumes.  This is basically the
> >> same as #1, but it bears repeating.
> >>
> >> 3) Subvolumes shouldn't need to be specifically mounted.  This is also
> >> important, we don't want users to have to go around mounting their subvolumes up
> >> manually one-by-one.  Today users just cd into subvolumes and it works, just
> >> like cd'ing into a directory.
> >>
> >> === Quotas ===
> >>
> >> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> >> an idea of what we wanted to do with it, so I'm putting it here.  There are
> >> really 2 things here
> >>
> >> 1) Limiting the size of subvolumes.  This is really easy for us, just create a
> >> subvolume and at creation time set a maximum size it can grow to and not let it
> >> go farther than that.  Nice, simple and straightforward.
> >>
> >> 2) Normal quotas, via the quota tools.  This just comes down to how do we want
> >> to charge users, do we want to do it per subvolume, or per filesystem.  My vote
> >> is per filesystem.  Obviously this will make it tricky with snapshots, but I
> >> think if we're just charging the diffs between the original volume and the
> >> snapshot to the user then that will be the easiest for people to understand,
> >> rather than making a snapshot all of a sudden count the user's currently used
> >> quota * 2.
> >>
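The difference between the two charging schemes in point 2 can be shown with
a toy model.  This is only an illustration under stated assumptions: extents
are represented by made-up ids with made-up sizes, each subvolume is a set of
extent ids, and the function names are hypothetical.  It says nothing about
how BTRFS would actually track the accounting.

```python
def naive_charge(subvol_extents, extent_sizes):
    """Charge every subvolume in full, so a fresh snapshot doubles usage."""
    return sum(extent_sizes[e] for extents in subvol_extents for e in extents)


def diff_charge(subvol_extents, extent_sizes):
    """Charge each extent once, however many subvolumes share it."""
    unique = set().union(*subvol_extents)
    return sum(extent_sizes[e] for e in unique)


# Hypothetical extents: ids and sizes are invented for the example.
sizes = {"a": 100, "b": 50, "c": 10}
original = {"a", "b"}
fresh_snapshot = {"a", "b"}      # shares everything with the original
edited_snapshot = {"a", "c"}     # extent "b" was rewritten as "c"

print(naive_charge([original, fresh_snapshot], sizes))   # 300
print(diff_charge([original, fresh_snapshot], sizes))    # 150
print(diff_charge([original, edited_snapshot], sizes))   # 160
```

Under the diff scheme a fresh snapshot costs the user nothing extra, and only
the rewritten extent adds to the charge afterwards.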
> >> === What do we do? ===
> >>
> >> This is where I expect to see the most discussion.  Here is what I want to do
> >>
> >> 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the inode
> >> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> >> that way.  This unfortunately will be an incompatible format change, but the
> >> sooner we get this addressed the easier it will be in the long run.  Obviously
> >> when I say format change I mean via the incompat bits we have, so old fs's won't
> >> be broken and such.
> >>
> >> 2) Do something like NFS's referral mounts when we cd into a subvolume.  Now we
> >> just do dentry trickery, but that doesn't make the boundary between subvolumes
> >> clear, so it will confuse people (and samba) when they walk into a subvolume and
> >> all of a sudden the inode numbers are the same as in the directory behind them.
> >> With doing the referral mount thing, each subvolume appears to be its own mount
> >> and that way things like NFS and samba will work properly.
> >>
> >> I feel like I'm forgetting something here, hopefully somebody will point it out.
> >>
> >> === Conclusion ===
> >>
> >> There are definitely some wonky things with subvolumes, but I don't think they
> >> are things that cannot be fixed now.  Some of these changes will require
> >> incompat format changes, but it's either we fix it now, or later on down the
> >> road when BTRFS starts getting used in production really find out how many
> >> things our current scheme breaks and then have to do the changes then.  Thanks,
> >>
> >
> > So now that I've actually looked at everything, it looks like the semantics are
> > all right for subvolumes
> >
> > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > 3) dev_t - we setup an anon super for all volumes, so they all get their own
> > dev_t, which is set properly for all of their children, see below
> >
> > [root@test1244 btrfs-test]# stat .
> >  File: `.'
> >  Size: 20              Blocks: 8          IO Block: 4096   directory
> > Device: 15h/21d Inode: 256         Links: 1
> > Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2010-12-03 15:35:41.931679393 -0500
> > Modify: 2010-12-03 15:35:20.405679493 -0500
> > Change: 2010-12-03 15:35:20.405679493 -0500
> >
> > [root@test1244 btrfs-test]# stat foo
> >  File: `foo'
> >  Size: 12              Blocks: 0          IO Block: 4096   directory
> > Device: 19h/25d Inode: 256         Links: 1
> > Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2010-12-03 15:35:17.501679393 -0500
> > Modify: 2010-12-03 15:35:59.150680051 -0500
> > Change: 2010-12-03 15:35:59.150680051 -0500
> >
> > [root@test1244 btrfs-test]# stat foo/foobar
> >  File: `foo/foobar'
> >  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> > Device: 19h/25d Inode: 257         Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2010-12-03 15:35:59.150680051 -0500
> > Modify: 2010-12-03 15:35:59.150680051 -0500
> > Change: 2010-12-03 15:35:59.150680051 -0500
> >
> > So as far as the user is concerned, everything should come out right.  Obviously
> > we had to do the NFS trickery still because as far as VFS is concerned the
> > subvolumes are all on the same mount.  So the question is this (and really this
> > is directed at Christoph and Bruce and anybody else who may care), is this good
> > enough, or do we want to have a separate vfsmount for each subvolume?  Thanks,
> >
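The stat and dev_t behaviour described in the quoted message can be checked
from userspace.  The following is a sketch, not a definitive test: it assumes
the behaviour quoted above (a subvolume root reports inode 256 and gets its
own dev_t), and the names Stat, file_id, crosses_boundary,
looks_like_subvol_root and check_path are hypothetical.  Only os.lstat and the
stat module are real Python APIs.

```python
import os
import stat
from collections import namedtuple

# Assumption taken from the quoted message: a subvolume root reports inode 256.
SUBVOL_ROOT_INO = 256

# Only the stat fields the checks need; os.stat_result uses the same names.
Stat = namedtuple("Stat", "st_dev st_ino st_mode")


def file_id(st):
    """Inode numbers repeat across subvolumes, so identity is (st_dev, st_ino)."""
    return (st.st_dev, st.st_ino)


def crosses_boundary(parent_st, child_st):
    """True when the child lives on a different dev_t than its parent."""
    return parent_st.st_dev != child_st.st_dev


def looks_like_subvol_root(parent_st, child_st):
    """Heuristic: a directory with inode 256 on a different dev_t than its parent."""
    return (stat.S_ISDIR(child_st.st_mode)
            and child_st.st_ino == SUBVOL_ROOT_INO
            and crosses_boundary(parent_st, child_st))


def check_path(path):
    """Apply the heuristic to a real path, using lstat on it and on its parent."""
    parent = os.path.dirname(os.path.abspath(path))
    return looks_like_subvol_root(os.lstat(parent), os.lstat(path))
```

Fed the values from the transcript (device 21 for `.`, device 25 for `foo`,
both with inode 256), the heuristic flags `foo` and not `foo/foobar`, and
file_id keeps `.` and `foo` distinct even though their inode numbers match.
It cannot tell a subvolume from a separately mounted filesystem whose root
also happens to report inode 256.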
>
> What are the drawbacks of having a vfsmount for each subvolume?
>
> Why (besides having to code it up) are you trying to avoid doing it that way?

It's the having to code it up that way thing, I'm nothing if not lazy.

Josef