* What is needed to build an AFS fileserver on top of BTRFS?
From: David Howells @ 2013-12-17 16:53 UTC (permalink / raw)
To: Simon Wilkinson, jaltman
Cc: dhowells, openafs-devel@openafs.org, linux-btrfs, clm
It has occurred to me and others that something like BTRFS could be a good fit
to build an AFS fileserver directly on top of. The question is what facilities
would be needed from BTRFS to make this work?
So I thought I'd kick off a shopping list;-)
(1) 64-bit data version numbers that increase monotonically with each write.
Yes, this is likely to cause some performance degradation as it introduces
an ordering over data writes and metadata writes to a file. Maybe writes
can be batched to improve performance?
(2) Storage for ACLs and AFS UIDs. Having shareable ACLs might also be useful.
Xattrs would likely do for this.
(3) The ability to snapshot a filesystem to make backups and for pushing to
read-only volume servers.
(4) A 32-bit vnode number and 32-bit vnode uniquifier/generation number.
These don't necessarily have to be stored by BTRFS directly but could
instead be in a separate database file that gets snapshotted also.
(5) The ability to set the vnode number, vnode uniquifier and data version
number to specific values. Necessary to clone volumes and restore
volume dumps.
David
* Re: What is needed to build an AFS fileserver on top of BTRFS?
From: Chris Mason @ 2013-12-17 17:07 UTC (permalink / raw)
To: dhowells@redhat.com
Cc: linux-btrfs@vger.kernel.org, simonxwilkinson@gmail.com,
jaltman@your-file-system.com, openafs-devel@openafs.org
On Tue, 2013-12-17 at 16:53 +0000, David Howells wrote:
> It has occurred to me and others that something like BTRFS could be a good fit
> to build an AFS fileserver directly on top of. The question is what facilities
> would be needed from BTRFS to make this work?
>
> So I thought I'd kick off a shopping list;-)
>
> (1) 64-bit data version numbers that increase monotonically with each write.
>
> Yes, this is likely to cause some performance degradation as it introduces
> an ordering over data writes and metadata writes to a file. Maybe writes
> can be batched to improve performance?
>
> (2) Storage for ACLs and AFS UIDs. Having shareable ACLs might also be useful.
>
> Xattrs would likely do for this.
>
> (3) The ability to snapshot a filesystem to make backups and for pushing to
> read-only volume servers.
>
> (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation number.
>
> These don't necessarily have to be stored by BTRFS directly but could
> instead be in a separate database file that gets snapshotted also.
>
> (5) The ability to set the vnode number, vnode uniquifier and data version
> number to specific values. Necessary to clone volumes and restore
> volume dumps.
Hmmm, what exactly are vnodes? Could we put them in xattrs?
-chris
* Re: What is needed to build an AFS fileserver on top of BTRFS?
From: Hugo Mills @ 2013-12-17 17:20 UTC (permalink / raw)
To: David Howells
Cc: Simon Wilkinson, jaltman, openafs-devel@openafs.org, linux-btrfs,
clm
On Tue, Dec 17, 2013 at 04:53:16PM +0000, David Howells wrote:
> It has occurred to me and others that something like BTRFS could be
> a good fit to build an AFS fileserver directly on top of. The
> question is what facilities would be needed from BTRFS to make this
> work? So I thought I'd kick off a shopping list;-)
> (1) 64-bit data version numbers that increase monotonically with
> each write. Yes, this is likely to cause some performance
> degradation as it introduces an ordering over data writes and
> metadata writes to a file. Maybe writes can be batched to improve
> performance?
Do these have to be per-file? If not, then you might be able to get
away with using the transid, which is a filesystem-global
monotonically-increasing number.
btrfs batches disk writes already, and uses the transid to
differentiate these -- the writes come at 30 second intervals (by
default, although there's an option to change the period). There may
be multiple distinct changes to a single file within that transaction
(although obviously, only the state of the file after the last one
gets written to disk). I don't know exactly what you need it for, so
this may or may not be appropriate here.
Ceph uses transids for [something, mumble, wavy-hand] -- I don't
know if the use-case for Ceph is equivalent to the use-case for AFS.
> (2) Storage for ACLs and AFS UIDs. Having shareable ACLs might also
> be useful. Xattrs would likely do for this.
This would seem like a reasonable place to put them, given that
that's what POSIX ACLs do, and we have POSIX ACL support already.
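To make the xattr idea a little more concrete, here is a minimal sketch that packs a list of (AFS ID, rights bitmask) pairs into a byte string and stores it in a user-namespace xattr. The attribute name `user.afs.acl` and the binary layout are assumptions invented for this example; neither AFS nor btrfs defines them. `os.setxattr` and `os.getxattr` are the Linux-only Python wrappers around the xattr system calls.

```python
import os
import struct

# Hypothetical xattr name and layout, chosen only for illustration.
ACL_XATTR = "user.afs.acl"
ENTRY = struct.Struct("<iI")  # signed 32-bit ID, unsigned 32-bit rights bitmask


def encode_acl(entries):
    """Pack [(afs_id, rights), ...] into bytes: an entry count, then the entries."""
    blob = struct.pack("<I", len(entries))
    for afs_id, rights in entries:
        blob += ENTRY.pack(afs_id, rights)
    return blob


def decode_acl(blob):
    """Inverse of encode_acl; returns a list of (afs_id, rights) tuples."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return [ENTRY.unpack_from(blob, 4 + i * ENTRY.size) for i in range(count)]


def store_acl(path, entries):
    """Attach the packed ACL to a file (Linux only; the fs must support xattrs)."""
    os.setxattr(path, ACL_XATTR, encode_acl(entries))


def load_acl(path):
    """Read the packed ACL back from a file."""
    return decode_acl(os.getxattr(path, ACL_XATTR))
```

Sharing one ACL between many files would need something beyond this, since each file carries its own copy of the xattr here.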
> (3) The ability to snapshot a filesystem to make backups and for
> pushing to read-only volume servers.
We have snapshots of subvolumes, but not the filesystem as a whole.
> (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation
> number. These don't necessarily have to be stored by BTRFS directly
> but could instead be in a separate database file that gets
> snapshotted also.
>
> (5) The ability to set the vnode number, vnode uniquifier and data
> version number to specific values. Necessary to clone volumes
> and restore volume dumps.
What's a vnode meant to represent? I'm not familiar with the
terminology.
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- "Are you the man who rules the Universe?" "Well, I ---
try not to."
* Re: What is needed to build an AFS fileserver on top of BTRFS?
From: David Howells @ 2013-12-17 17:40 UTC (permalink / raw)
To: Chris Mason
Cc: dhowells, linux-btrfs@vger.kernel.org, simonxwilkinson@gmail.com,
jaltman@your-file-system.com, openafs-devel@openafs.org
Chris Mason <clm@fb.com> wrote:
> Hmmm, what exactly are vnodes? Could we put them in xattrs?
vnode numbers are AFS's equivalent of inode numbers. Since they're one per
file, they could be the object filename.
Probably there would have to be a table of {vnode,latest_uniquifier} as the
uniquifier must still go up even if the vnode is unused for a while, so there
could also be a table of {vnode,btrfs_file}.
David
* Re: What is needed to build an AFS fileserver on top of BTRFS?
From: David Howells @ 2013-12-17 17:47 UTC (permalink / raw)
To: Hugo Mills
Cc: dhowells, Simon Wilkinson, jaltman, openafs-devel@openafs.org,
linux-btrfs, clm
Hugo Mills <hugo@carfax.org.uk> wrote:
> > (1) 64-bit data version numbers that increase monotonically with
> > each write. Yes, this is likely to cause some performance
> > degradation as it introduces an ordering over data writes and
> > metadata writes to a file. Maybe writes can be batched to improve
> > performance?
>
> Do these have to be per-file? If not, then you might be able to get
> away with using the transid, which is a filesystem-global
> monotonically-increasing number.
Yes. If you send a write RPC op to the server, you get back the new version
number. If the new version number is not the old version number + 1 you know
there was a collision with a write from another client and you have to flush
your cache for that file and request a new "callback" (i.e. a promise to notify
you if someone else changes the file).
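A minimal sketch of that client-side check, assuming a hypothetical `CachedFile` record on the client; the field and function names are made up for illustration and are not taken from any AFS client:

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    """Hypothetical client-side cache entry for one file."""
    data_version: int
    data_valid: bool = True
    has_callback: bool = True


def after_store(entry, new_dv):
    """Update a cache entry from the data version returned by a write RPC.

    Returns True if our write was the only change since the cached version.
    """
    expected = entry.data_version + 1
    entry.data_version = new_dv
    if new_dv == expected:
        return True
    # Another client wrote in between: the cached data is stale and a
    # fresh callback promise has to be requested from the server.
    entry.data_valid = False
    entry.has_callback = False
    return False
```

The check only works if the counter is per-file and steps by exactly one per write, which is why a filesystem-global transid would not do.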
> > (3) The ability to snapshot a filesystem to make backups and for
> > pushing to read-only volume servers.
>
> We have snapshots of subvolumes, but not the filesystem as a whole.
By "filesystem" I meant the current state of an AFS volume. Very likely this
would be represented by a BTRFS subvolume, if I understand it correctly. You
might have several AFS volumes represented within a BTRFS filesystem. They
would be manipulated independently.
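As a rough sketch of that mapping, snapshotting one AFS volume would come down to snapshotting its subvolume with the `btrfs subvolume snapshot` command from btrfs-progs (`-r` makes the snapshot read-only). The helper names and any paths passed in are hypothetical:

```python
import subprocess


def snapshot_cmd(volume_path, snapshot_path, read_only=True):
    """Build the btrfs command line that snapshots one subvolume."""
    cmd = ["btrfs", "subvolume", "snapshot"]
    if read_only:
        cmd.append("-r")
    cmd += [volume_path, snapshot_path]
    return cmd


def snapshot_volume(volume_path, snapshot_path, read_only=True):
    """Run the snapshot; needs btrfs-progs installed and suitable privileges."""
    subprocess.run(snapshot_cmd(volume_path, snapshot_path, read_only), check=True)
```

Each AFS volume gets its own call, so volumes sharing one BTRFS filesystem can be snapshotted on independent schedules.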
> > (5) The ability to set the vnode number, vnode uniquifier and data
> > version number to specific values. Necessary to clone volumes
> > and restore volume dumps.
>
> What's a vnode meant to represent? I'm not familiar with the
> terminology.
AFS's equivalent of an inode with a 32-bit number representing it. See my
reply to Chris's question about the same thing.
David
* Re: [OpenAFS-devel] Re: What is needed to build an AFS fileserver on top of BTRFS?
From: Jeffrey Hutzelman @ 2013-12-17 18:42 UTC (permalink / raw)
To: David Howells
Cc: jhutz, Chris Mason, linux-btrfs@vger.kernel.org,
simonxwilkinson@gmail.com, jaltman@your-file-system.com,
openafs-devel@openafs.org
On Tue, 2013-12-17 at 17:40 +0000, David Howells wrote:
> Chris Mason <clm@fb.com> wrote:
>
> > Hmmm, what exactly are vnodes? Could we put them in xattrs?
>
> vnode numbers are AFS's equivalent of inode numbers. Since they're one per
> file, they could be the object filename.
Yes, in fact, the volume, vnode number, uniquifier, and DV are
effectively the "name" the fileserver uses for the underlying inode.
Note that if the fileserver is maintaining the vnode indices, then you
don't actually _need_ to store a uniquifier for normal operation, because
at any given time, a volume can contain at most one vnode with a
particular vnode number, and that vnode's uniquifier is stored in the
index. The uniquifier is used on-the-wire to distinguish different files
that existed at different points in time with the same vnode number.
> Probably there would have to be a table of {vnode,latest_uniquifier} as the
> uniquifier must still go up even if the vnode is unused for a while, so there
> could also be a table of {vnode,btrfs_file}.
No, you don't actually have to do this. The OpenAFS fileserver
maintains a single uniquifier for an entire volume, and simply increments
it every time a vnode is created.
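A toy model of that scheme, as a sketch only: the class and its free-list reuse policy are invented for illustration and do not mirror the actual OpenAFS data structures.

```python
class Volume:
    """Toy vnode index with one volume-wide uniquifier counter."""

    def __init__(self):
        self.next_uniquifier = 1
        self.next_vnode = 1
        self.index = {}   # vnode number -> uniquifier of the file living there now
        self.free = []    # vnode numbers available for reuse

    def create(self):
        """Allocate a vnode, reusing a freed number if one is available."""
        if self.free:
            vnode = self.free.pop()
        else:
            vnode = self.next_vnode
            self.next_vnode += 1
        uniq = self.next_uniquifier
        self.next_uniquifier += 1  # bumped on every creation, volume-wide
        self.index[vnode] = uniq
        return vnode, uniq

    def remove(self, vnode):
        """Delete the file and make its vnode number reusable."""
        del self.index[vnode]
        self.free.append(vnode)

    def lookup(self, vnode, uniq):
        """True only if (vnode, uniq) names the file that exists right now."""
        return self.index.get(vnode) == uniq
```

Because the counter never goes backwards, a reused vnode number always gets a larger uniquifier than any file that held it before, so no per-vnode history table is needed.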
-- Jeff
* Re: [OpenAFS-devel] Re: What is needed to build an AFS fileserver on top of BTRFS?
From: Jeffrey Hutzelman @ 2013-12-17 18:45 UTC (permalink / raw)
To: David Howells
Cc: jhutz, Hugo Mills, Simon Wilkinson, jaltman,
openafs-devel@openafs.org, linux-btrfs, clm
On Tue, 2013-12-17 at 17:47 +0000, David Howells wrote:
> Hugo Mills <hugo@carfax.org.uk> wrote:
>
> > > (1) 64-bit data version numbers that increase monotonically with
> > > each write. Yes, this is likely to cause some performance
> > > > degradation as it introduces an ordering over data writes and
> > > metadata writes to a file. Maybe writes can be batched to improve
> > > performance?
> >
> > Do these have to be per-file? If not, then you might be able to get
> > away with using the transid, which is a filesystem-global
> > monotonically-increasing number.
>
> Yes. If you send a write RPC op to the server, you get back the new version
> number. If the new version number is not the old version number + 1 you know
> there was a collision with a write from another client and you have to flush
> your cache for that file and request a new "callback" (i.e. a promise to notify
> you if someone else changes the file).
Right. So, the DV must increment by exactly one for each successful
StoreData (and not for other changes). This is important because
clients cache data and metadata independently, and cached data is
labeled with the file's DV. This means that even if metadata for a file
has to be refetched for some reason (for example, an expired callback),
the _data_ doesn't have to be refetched unless it has actually changed,
or been evicted from the client's cache due to cache pressure.
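To illustrate, here is a minimal sketch of a client cache that keeps metadata and data separately and labels the data with its DV. The class, the `fetch_status`/`fetch_data` callables and the `"dv"` key are all hypothetical stand-ins for the real RPCs and structures:

```python
class ClientCache:
    """Toy client cache keeping file metadata and file data separately."""

    def __init__(self):
        self.status = {}  # file id -> metadata dict (dropped when a callback expires)
        self.data = {}    # file id -> (data version, contents)

    def expire_callback(self, fid):
        """Forget the metadata only; cached data stays, labelled with its DV."""
        self.status.pop(fid, None)

    def read(self, fid, fetch_status, fetch_data):
        """Return file contents, refetching data only if the DV has moved."""
        if fid not in self.status:
            self.status[fid] = fetch_status(fid)
        dv = self.status[fid]["dv"]
        cached = self.data.get(fid)
        if cached is None or cached[0] != dv:
            self.data[fid] = (dv, fetch_data(fid))
        return self.data[fid][1]
```

After an expired callback, `read` pays for one status fetch and reuses the cached data as long as the DV in the fresh status still matches the label.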
-- Jeff