* Re: Versioning file system
From: Kyle Moffett @ 2007-06-19 3:10 UTC
To: Bryan Henderson
Cc: Jack Stone, Andrew Morton, alan, H. Peter Anvin, linux-fsdevel,
LKML Kernel, Al Viro, git
On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
>> The question that remains is where to implement versioning:
>> directly in individual filesystems or in the VFS code so all
>> filesystems can use it?
>
> Or not in the kernel at all. I've been doing versioning of the
> types I described for years with user space code and I don't
> remember feeling that I compromised in order not to involve the
> kernel.
>
> Of course, if you want to do it with snapshots and COW, you'll have
> to ask where in the kernel to put that, but that's not a file
> versioning question; it's the larger snapshot question.
What I think would be particularly interesting in this domain is
something similar in concept to GIT, except in a file-system:
1) Redundancy is easy; you just ensure that you have at least "N"
distributed copies of each object, where "N" is some function of the
object itself.
2) Network replication is easy; you look up objects based on the
SHA-1 stored in the parent directory entry and cache them where
needed (i.e., make the "N" function above dynamic based on frequency
of access on a given computer).
3) Snapshots are easy and cheap; an RO snapshot is a tag and an RW
snapshot is a branch. Each can easily be converted to the other.
4) Compression is easy; you can compress objects based on any
arbitrary configurable criteria and the filesystem will record
whether or not an object is compressed. You can also compress
differently when archiving objects to secondary storage.
5) Streaming fsck-like verification is easy; ensure the hash name
field matches the actual hash of the object.
6) Fsck is easy since rollback is trivial; you can always revert
to an older tree to boot and start up services before attempting
resurrection of lost objects and trees in the background.
7) Multiple-drive or multiple-host storage pools are easy: think
the git "alternates" file.
8) Network filesystem load-balancing is easy; SHA-1s are
essentially random, so you can just assign SHA-1 prefixes to
different systems for data storage and your data is automatically
split up (see the sketch after this list).
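To make (8) concrete, here's a rough user-space sketch in Python (the
node pool and the four-way split are invented purely for illustration,
not part of any real design):

    import hashlib

    # Hypothetical four-node pool: the first hex digit of an object's
    # SHA-1 picks its home node.  SHA-1 output is effectively uniform,
    # so each node receives roughly a quarter of the objects with no
    # explicit balancing step.
    NODES = ["node-a", "node-b", "node-c", "node-d"]

    def object_id(data):
        """Content address: the SHA-1 of the object's bytes."""
        return hashlib.sha1(data).hexdigest()

    def node_for(sha1_hex):
        """Route by SHA-1 prefix: hex 0-3 -> node-a, 4-7 -> node-b, ..."""
        return NODES[int(sha1_hex[0], 16) // 4]

    oid = object_id(b"some file contents")
    print(oid, "->", node_for(oid))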
Other issues:
Q. How do you deal with block allocation?
A. Same way other filesystems deal with block allocation. Reference-
counting gets tricky, especially across a network, but it's easy to
play it safe with simple cross-network refcount-journalling. Since
the _only_ things that need journalling are the refcounts and the
free-block data, you need at most a megabyte or two of journal. If in
doubt, keep an extra refcount around for an in-the-background
consistency check later on. When networked-
gitfs systems crash, you just assume they still have all the
refcounts they had at the moment they died, and compare notes when
they start back up again. If a node has a cached copy of data on its
local disk then it can just nonatomically increment the refcount for
that object in its own RAM (ordered with respect to disk-flushes, of
course) and tell its peers at some point. A node should probably
cache most of its working set on local disk for efficiency; it's
trivially verified against updates from other nodes and provides an
easy way to keep refcounts for such data. If a node increments the
refcount on such data and dies before getting that info out to its
peers, then when it starts up again its peers will just be told about
a "new" object with insufficient replication, and they will clone it
out again properly. For networked refcount-increments you can do
one of two things: (1) tell at least X peers and wait for them to
sync the update out to disk, or (2) get the object from any peer (at
least one of whom hopefully has it in RAM) and save it to local disk
with an increased refcount.
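To pin down the "journal first, reconcile later" idea, here is a
minimal user-space sketch in Python (the class name, the record
format, and the JSON journal file are all invented for illustration;
a real gitfs would do this below the block layer):

    import json, os

    class RefcountJournal:
        """Tiny write-ahead journal for object refcounts.

        Following the play-it-safe rule above: increments are journalled
        (and flushed) *before* they are applied, so replay after a crash
        can only leave counts too high -- safe, because no live object
        is ever freed -- and the background consistency check reconciles
        the excess later.  A real implementation would batch the fsyncs.
        """

        def __init__(self, path):
            self.path = path
            self.counts = {}

        def incref(self, sha1_hex):
            with open(self.path, "a") as j:          # journal first...
                j.write(json.dumps({"op": "inc", "obj": sha1_hex}) + "\n")
                j.flush()
                os.fsync(j.fileno())
            self.counts[sha1_hex] = self.counts.get(sha1_hex, 0) + 1

        def replay(self):
            """After a crash: rebuild counts, erring on the high side."""
            self.counts.clear()
            try:
                with open(self.path) as j:
                    for line in j:
                        rec = json.loads(line)
                        if rec["op"] == "inc":
                            self.counts[rec["obj"]] = \
                                self.counts.get(rec["obj"], 0) + 1
            except FileNotFoundError:
                pass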
Q. How do you actually delete things?
A. Just replace all the to-be-erased tree and commit objects before a
specified point with "History erased" objects with their SHA-1s
magically set to those of the erased objects. If you want, you may
delete only the "tree" objects and leave the commits intact. If you
delete a whole linear segment of history then you can just use a
single "History erased" commit object with its parent pointed to the
object before the erased segment. Probably needs some form of back-
reference storage to make it efficient; not sure how expensive that
would be. This would allow making a bunch of snapshots and purging
them logarithmically based on the passage of time. For instance, you
might keep a snapshot every 5 minutes for the last hour, every 30
minutes for the last day, every 4 hours for the last week, every day
for the last month, once per week for the last year, once per month
for the last 5 years, and once per year beyond that.
That's pretty impressive data-recovery resolution, and it amounts to
only about 200 unique commits after it's been running for 10 years.
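In code, that purge schedule might look something like this rough
Python sketch (the bucket boundaries are the ones from the paragraph
above; the bucketing policy itself is just one plausible way to do
it, nothing prescriptive):

    from datetime import timedelta

    # (age limit, keep interval) pairs from the schedule above.
    SCHEDULE = [
        (timedelta(hours=1),      timedelta(minutes=5)),
        (timedelta(days=1),       timedelta(minutes=30)),
        (timedelta(weeks=1),      timedelta(hours=4)),
        (timedelta(days=30),      timedelta(days=1)),
        (timedelta(days=365),     timedelta(weeks=1)),
        (timedelta(days=5 * 365), timedelta(days=30)),
    ]

    def keep_interval(age):
        """Map a snapshot's age to the schedule's keep-interval."""
        for limit, interval in SCHEDULE:
            if age < limit:
                return interval
        return timedelta(days=365)          # yearly beyond five years

    def prune(snapshot_times, now):
        """Keep one snapshot per (interval, bucket) slot; drop the rest."""
        kept, seen = [], set()
        for ts in sorted(snapshot_times, reverse=True):   # newest first
            age = now - ts
            interval = keep_interval(age)
            slot = (interval, int(age / interval))        # which bucket
            if slot not in seen:
                seen.add(slot)
                kept.append(ts)
        return kept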
Q. How do you archive data?
A. Same as deleting, except instead of a "History erased" object you
would use a "History archived" object with a little bit of string
data to indicate which volume it's stored on (and where on the
volume). When you stick that volume into the system you could easily
tell the kernel to use it as an alternate for the given storage group.
Q. What enforces data integrity?
A. Ensure that a new tree object and its associated sub-objects are
on disk before you delete the old one. Doesn't need any actual full
syncs at all, just barriers. If you replace the tree object before
write-out is complete then just skip writing the old one and write
the new one in its place.
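That ordering rule is the familiar write-new-then-flip pattern. A
user-space approximation in Python (fsync() standing in for a
barrier; the object directory and ROOT pointer layout are invented
for the sketch):

    import os

    def publish_tree(objdir, new_tree_sha1, new_objects):
        """Write all new objects, barrier, then atomically flip the root.

        Nothing here syncs the whole filesystem: only the new objects
        and the root pointer are ordered, which is the point above
        ("no full syncs, just barriers").
        """
        for sha1_hex, data in new_objects.items():
            path = os.path.join(objdir, sha1_hex)
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())    # objects durable before the flip

        tmp = os.path.join(objdir, "ROOT.tmp")
        with open(tmp, "w") as f:
            f.write(new_tree_sha1 + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, os.path.join(objdir, "ROOT"))  # atomic pointer swap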
Q. What constitutes a "commit"?
A. Anything the administrator wants to define it as. Useful
algorithms include: "Once per x Mbyte of page dirtying", "Once per 5
min", "Only when sync() or fsync() are called", "Only when gitfs-
commit is called". You could even combine them: "Every x Mbyte of
page dirtying or every 5 minutes, whichever is shorter (or longer,
depending on admin requirements)". There would also be syscalls to
trigger git-like behavior on demand. Network-
accessible gitfs would want to have mechanisms to trigger commits
based on activity on other systems (needs more thought).
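Those policies compose naturally. A hypothetical policy object
(names and default numbers invented) for "x MB of page dirtying or 5
minutes, whichever comes first":

    import time

    class CommitPolicy:
        """Fire a commit on N bytes dirtied or T seconds, whichever first."""

        def __init__(self, max_dirty_bytes=16 * 1024 * 1024, max_age_s=300):
            self.max_dirty_bytes = max_dirty_bytes
            self.max_age_s = max_age_s
            self.dirty = 0
            self.last_commit = time.monotonic()

        def note_dirty(self, nbytes):
            self.dirty += nbytes

        def should_commit(self):
            return (self.dirty >= self.max_dirty_bytes or
                    time.monotonic() - self.last_commit >= self.max_age_s)

        def committed(self):
            """Reset both triggers after a commit actually happens."""
            self.dirty = 0
            self.last_commit = time.monotonic()

Swapping "or" for "and" in should_commit() gives the "whichever is
longer" variant mentioned above.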
Q. How do you access old versions?
A. Mount another instance of the filesystem with an SHA-1 ID, a tag-
name, or a branch-name in a special mount option. Should be user
accessible with some restrictions (needs more thought).
Q. How do you deal with conflicts on networked filesystems?
A. Once again, however the administrator wants to deal with them.
Options (a sketch follows the list):
1) Forcibly create a new branch for the conflicted tree.
2) Attempt to merge changes using the standard git-merge semantics
3) Merge independent changes to different files and pick one for
changes to the same file
4) Your Algorithm Here(TM). GIT makes it easy to extend
conflict-resolution.
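In code, the pluggable part is just a three-way merge hook, much like
git's own merge drivers. A toy sketch (all names invented; the
trivial cases are the standard three-way merge rules):

    def branch_on_conflict(base, ours, theirs):
        """Option 1: give up and keep both sides; caller makes a branch."""
        return None

    def take_ours(base, ours, theirs):
        """Option 3 fallback: same-file conflict resolved by picking ours."""
        return ours

    RESOLVERS = {"branch": branch_on_conflict, "pick-ours": take_ours}

    def resolve(policy, base, ours, theirs):
        if ours == theirs:        # both sides made the same change
            return ours
        if ours == base:          # only the other side changed
            return theirs
        if theirs == base:        # only we changed
            return ours
        return RESOLVERS[policy](base, ours, theirs)   # real conflict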
Q. How do you deal with little scattered changes in big (or sparse)
files?
A. Two questions, two answers: For sparse files, git would need
extending to understand (and hash) the nature of the sparseness.
For big files, you should be able to introduce a "compound-file"
datatype and configure git to deal with specific X-Mbyte chunks of it
independently. This might not be a bad idea for native git as well.
Would need system-specific configuration.
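A sketch of the "compound-file" idea: hash a big file in fixed-size
chunks so a small scattered write only dirties the chunks it touches
(the 4MB figure and both helper names are illustrative only):

    import hashlib

    CHUNK = 4 * 1024 * 1024        # illustrative X-Mbyte chunk size

    def chunk_hashes(path):
        """Per-chunk SHA-1 list for a large file."""
        hashes = []
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                hashes.append(hashlib.sha1(block).hexdigest())
        return hashes

    def dirty_chunks(old, new):
        """Which chunk indices changed between two hash lists."""
        changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
        # A grown or truncated file dirties the tail chunks too.
        changed += list(range(min(len(old), len(new)),
                              max(len(old), len(new))))
        return changed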
Q. How do you prevent massive data consumption by spurious tiny changes?
A. You have a few options:
1) Configure your commit algorithm as above to not commit so often
2) Configure a stepped commit-discard algorithm as described
above in the "How do you delete things" question
3) Archive unused data to secondary storage more often
Q. What about all the unanswered questions?
A. These are all the ones I could think of off the top of my head but
there are at least a hundred more. I'm pretty sure these are some of
the most significant ones.
Q. That's a great idea and I'll implement it right away!
A. Yay! (but that's not a question :-D) Good luck and happy hacking.
Q. That's a stupid idea and would never ever work!
A. Thanks for your useful input! (but that's not a question either)
I'm sure anybody who takes up a project like this will consider such
opinions.
Q. *flamage*
A. I'm glad you have such strong opinions; feel free to continue
to spam my /dev/null device (and that's also not a question).
All opinions and comments welcomed.
Cheers,
Kyle Moffett
* Re: Versioning file system
From: Jack Stone @ 2007-06-19 7:49 UTC
To: Kyle Moffett
Cc: Bryan Henderson, akpm, alan, hpa, linux-fsdevel, linux-kernel,
viro, git
Kyle Moffett wrote:
> What I think would be particularly interesting in this domain is
> something similar in concept to GIT, except in a file-system:
> [...snip...]
It sounds brilliant and I'd love to have a go at implementing it, but
I don't know enough (yet :-D) about how git works; a little research
is called for, I think.
Jack
* Re: Versioning file system
From: Bron Gondwana @ 2007-06-19 7:58 UTC
To: Kyle Moffett
Cc: Bryan Henderson, Jack Stone, Andrew Morton, alan, H. Peter Anvin,
linux-fsdevel, LKML Kernel, Al Viro, git
On Mon, Jun 18, 2007 at 11:10:42PM -0400, Kyle Moffett wrote:
> On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
>>> The question that remains is where to implement versioning: directly in
>>> individual filesystems or in the VFS code so all filesystems can use it?
>>
>> Or not in the kernel at all. I've been doing versioning of the types I
>> described for years with user space code and I don't remember feeling that
>> I compromised in order not to involve the kernel.
>
> What I think would be particularly interesting in this domain is something
> similar in concept to GIT, except in a file-system:
I've written a couple of user-space things very much like this - one
being a purely database-backed (blobs in a database, yeah I know)
system for managing medical data, where signatures and auditability
were the most important parts of the system. Performance really
wasn't a consideration.
The other one is my current job, FastMail - we have a virtual
filesystem which uses files stored by sha1 on ordinary filesystems
for data storage and a database for metadata (filename-to-sha1
mappings, mtime, mimetype, directory structure, etc.).
Multiple machine distribution is handled by a daemon on each machine
which can be asked to make sure the file gets sent out to every machine
that matches the prefix and will only return success once it's written
to at least one other machine. Database replication is a different
beast.
It can work, but there's one big pain at the file level: no mmap.
If you don't want to support mmap it can work reasonably happily, though
you may want to keep your sha1 (or other digest) state as well as the
final digest so you can cheaply calculate the digest for a small append
without walking the entire file. You may also want to keep state
checkpoints every so often along a big file so that truncates don't cost
too much to recalculate.
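In Python terms the digest-state trick looks roughly like this
(hashlib can copy() a hash object in memory but cannot persist its
state to disk, so an on-disk version would need a digest
implementation with serialisable state; the class and the checkpoint
spacing are invented for the sketch):

    import hashlib

    class AppendOnlyDigest:
        """Cheap digest maintenance for append-mostly files.

        Keeps the running SHA-1 object plus periodic in-memory copies
        ("checkpoints"), so an append only hashes the new bytes, and a
        truncate can re-hash from the nearest checkpoint at or before
        the cut instead of from byte zero.
        """

        CHECKPOINT_EVERY = 8 * 1024 * 1024   # bytes between checkpoints

        def __init__(self):
            self.h = hashlib.sha1()
            self.size = 0
            self.checkpoints = []            # [(offset, copied sha1 object)]

        def append(self, data):
            self.h.update(data)
            self.size += len(data)
            if (not self.checkpoints or
                    self.size - self.checkpoints[-1][0] >=
                    self.CHECKPOINT_EVERY):
                self.checkpoints.append((self.size, self.h.copy()))

        def digest_hex(self):
            return self.h.hexdigest()        # no full-file walk needed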
Luckily, in a userspace VFS that's only accessed via FTP and DAV we
can support a limited set of operations (basically create, append,
read, delete). You don't get that luxury for a general-purpose
filesystem, and that's the problem. There will always be particular
usage patterns (especially something that mmaps or seeks and touches
all over the place, like a loopback-mounted filesystem or a database
file) that just don't work for file-level sha1s.
It does have some lovely properties though. I'd enjoy working in an
environment that didn't look much like POSIX but had the strong
guarantees and auditability that addressing by sha1 buys you.
Bron.
* Re: Versioning file system
From: Martin Langhoff @ 2007-06-19 9:09 UTC
To: Kyle Moffett
Cc: Bryan Henderson, Jack Stone, Andrew Morton, alan, H. Peter Anvin,
linux-fsdevel, LKML Kernel, Al Viro, git
On 6/19/07, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> What I think would be particularly interesting in this domain is
> something similar in concept to GIT, except in a file-system:
Perhaps stating the blindingly obvious, but there was an early
implementation of a FUSE-based gitfs --
http://www.sfgoth.com/~mitch/linux/gitfs/
cheers,
martin
* Re: Versioning file system
From: Jakub Narebski @ 2007-06-19 16:52 UTC
To: linux-kernel; +Cc: git, linux-fsdevel
Kyle Moffett wrote:
> On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
>>> The question that remains is where to implement versioning:
>>> directly in individual filesystems or in the VFS code so all
>>> filesystems can use it?
>>
>> Or not in the kernel at all. I've been doing versioning of the
>> types I described for years with user space code and I don't
>> remember feeling that I compromised in order not to involve the
>> kernel.
>>
>> Of course, if you want to do it with snapshots and COW, you'll have
>> to ask where in the kernel to put that, but that's not a file
>> versioning question; it's the larger snapshot question.
>
> What I think would be particularly interesting in this domain is
> something similar in concept to GIT, except in a file-system
[cut]
How does this relate to the ext3cow versioning (snapshotting)
filesystem, for example? ext3cow assumes linear history, which
simplifies things a bit.
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: Versioning file system
From: Kyle Moffett @ 2007-06-20 2:43 UTC
To: Bron Gondwana
Cc: Bryan Henderson, Jack Stone, Andrew Morton, alan, H. Peter Anvin,
linux-fsdevel, LKML Kernel, Al Viro, git
On Jun 19, 2007, at 03:58:57, Bron Gondwana wrote:
> On Mon, Jun 18, 2007 at 11:10:42PM -0400, Kyle Moffett wrote:
>> On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
>>>> The question that remains is where to implement versioning:
>>>> directly in individual filesystems or in the VFS code so all
>>>> filesystems can use it?
>>>
>>> Or not in the kernel at all. I've been doing versioning of the
>>> types I described for years with user space code and I don't
>>> remember feeling that I compromised in order not to involve the
>>> kernel.
>>
>> What I think would be particularly interesting in this domain is
>> something similar in concept to GIT, except in a file-system:
>
> [...snip...]
>
> It can work, but there's one big pain at the file level: no mmap.
IMHO it's actually not that bad. The "gitfs" would divide larger
files up into manageable chunks (say 4MB) which could be quickly
SHA-1ed. When a file is mmapped and partially modified, the SHA-1
would be marked as locally invalid, but since mmap() loses most
consistency guarantees that's OK. A time- or writeout-based "commit"
scheme might still freeze, SHA-1, and write out the pages at regular
intervals without the program's knowledge, but since you only have to
SHA-1 the relatively small 4MB chunk (which is about to hit disk
anyway), it's not a significant time penalty. Even under memory
pressure, when swapping data out to disk, you don't have to update
the SHA-1 and create a new commit as long as you keep a reference to
the object stored in the volume header somewhere and maintain the
"SHA-1 out-of-date" bit.
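As a sketch of that bookkeeping (a toy in-memory Python model, every
name invented; in reality this would live in the page-writeout path):

    import hashlib

    CHUNK = 4 * 1024 * 1024

    class MmapChunkTable:
        """Track which 4MB chunks of an mmapped file have a stale SHA-1."""

        def __init__(self, nchunks):
            self.sha1 = [None] * nchunks
            self.stale = [True] * nchunks    # the "SHA-1 out-of-date" bits

        def mark_dirty(self, offset, length):
            if length <= 0:
                return
            first = offset // CHUNK
            last = (offset + length - 1) // CHUNK
            for i in range(first, last + 1):
                self.stale[i] = True

        def commit(self, read_chunk):
            """Freeze point: re-hash only the stale chunks.  Cheap,
            because that data was about to hit disk anyway."""
            for i, stale in enumerate(self.stale):
                if stale:
                    self.sha1[i] = hashlib.sha1(read_chunk(i)).hexdigest()
                    self.stale[i] = False
            return self.sha1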
A program which carefully uses msync() would be fine, of course (with
proper configuration), as that would create a new commit as
appropriate.
Since mmap() is poorly defined on network filesystems in the absence
of msync(), I don't see that such behavior would be a problem. And
it certainly would be fine on local filesystems as there you can just
stuff the "SHA-1 out-of-date" bit and a reference to the parent
commit and path in the object itself. Then you just need to keep a
useful reference to that object in a table somewhere in the volume
and you're set.
> If you don't want to support mmap it can work reasonably happily,
> though you may want to keep your sha1 (or other digest) state as
> well as the final digest so you can cheaply calculate the digest
> for a small append without walking the entire file. You may also
> want to keep state checkpoints every so often along a big file so
> that truncates don't cost too much to recalculate.
That may be worth it even if the file is divided into 4MB chunks (or
another configurable size), but it would need benchmarking.
> Luckily, in a userspace VFS that's only accessed via FTP and DAV we
> can support a limited set of operations (basically create, append,
> read, delete). You don't get that luxury for a general-purpose
> filesystem, and that's the problem. There will always be
> particular usage patterns (especially something that mmaps or seeks
> and touches all over the place, like a loopback-mounted filesystem
> or a database file) that just don't work for file-level sha1s.
I'd think that loopback-mounted filesystems wouldn't be that
difficult (a sketch follows this list):
1) Set the SHA-1 block size appropriately to divide the big file
into a bunch of little manageable files. Could conceivably be multi-
layered like directories, depending on the size of the file.
2) Mark the file as exempt from normal commits (i.e., without
special syscalls or fsync()/msync() on the file itself, it is never
updated in the tree objects).
3) Set up the loopback device to call the gitfs commit code when
it receives barriers or flushes from the parent filesystem.
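Here's point 3 as a toy shim (Python, everything invented; a real
implementation would hook the loop driver's flush path):

    class LoopbackShim:
        """Pass writes through; turn barrier/flush requests into commits."""

        def __init__(self, backing_file, gitfs_commit):
            self.f = backing_file
            self.gitfs_commit = gitfs_commit   # callback into the gitfs

        def write(self, offset, data):
            self.f.seek(offset)
            self.f.write(data)                 # exempt from periodic commits

        def flush_barrier(self):
            self.f.flush()
            self.gitfs_commit()                # commit exactly at barriers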
And database files aren't a big issue. I have yet to see a networked
filesystem on which you could stick a MySQL database from one node
and expect to get useful/recent read results from other nodes. If
you really wanted something like that for such a "gitfs", you could
just add code to MySQL to create a gitfs commit every N transactions
and not otherwise. The best part is: that would make online MySQL
backups from another node trivial! Just pick any arbitrary
appropriate commit object and mount that object, then "cp -a
mysql_db_dir mysql_backup_dir". That's not to say it wouldn't have a
performance penalty, but for some people the performance penalty
might be worth it.
Oh, and for those programs which want multi-master replication, this
makes it ten times easier:
1) Put each master-server on a different gitfs branch
2) Write your program as gitfs-aware. Make it create gitfs
commits at appropriate times (so the data is accessible from other
nodes).
3) Come up with a useful non-interactive database-file merge
algorithm. Useful examples of different kinds of merge engines may
be found in the git project. This should take $BASE_VERSION,
$NEWVERSION1, $NEWVERSION2, and produce a $MERGEDVERSION. A good
algorithm should probably pick a safe default and save a "conflict"
entry in the face of conflicting changes.
4) Hook your merge algorithm into the gitfs mechanics using some
to-be-defined API.
5) Whenever your software does a database-file commit it sends
out a little notification to the other nodes (maybe using a gitfs API?)
6) Run a periodic (as defined by the admin yet again) thread on
each node which does branch merging. When two or more branches have
different SHA-1 sums the servers will rotate the merging task between
them. The thus-selected server will merge changes from the other
server(s) into its current working copy. With two servers this means
that the maximum delay between one server making a change and the
other server seeing it will be twice the merge interval.
7) For small pools of servers a simple rotated-merge-master
algorithm (sketched below) would work. For larger pools you would
need to come up with some logarithmic rotating-merge-node algorithm
to evenly divide the work of propagating changes across all nodes.
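The rotation in steps 6 and 7 is simple to pin down. A hypothetical
selection rule (node list and epoch length invented): every node
derives whose turn it is from the shared clock, so no election
traffic is needed.

    import time

    NODES = ["node-a", "node-b", "node-c"]   # hypothetical pool
    MERGE_INTERVAL_S = 60                     # admin-defined epoch length

    def merge_master(now_s=None, nodes=NODES):
        """Deterministically pick the merging node for this epoch.

        Every node computes the same answer from shared inputs; with
        two servers this gives the twice-the-merge-interval
        propagation bound mentioned above.
        """
        if now_s is None:
            now_s = time.time()
        epoch = int(now_s // MERGE_INTERVAL_S)
        return nodes[epoch % len(nodes)]      # simple rotation

    print(merge_master())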
> It does have some lovely properties though. I'd enjoy working in
> an environment that didn't look much like POSIX but had the strong
> guarantees and auditability that addressing by sha1 buys you.
I'd like to think we can have our cake and eat it too :-D. POSIX
requirements should be doable on the local system and can be mimicked
well enough on networked filesystems (albeit with update latency)
that most programs won't care. If you're the only person modifying
files on gitfs, regardless of what node they are stored on, it should
have the same behavior as local files (since with gitfs caching they
would *become* local files too :-D). The few programs that do care
about POSIX atomicity across networked filesystems (which is already
mostly implementation defined) could probably be updated to map gitfs
commits and merges into their own internal transactions and do just
fine.
Cheers,
Kyle Moffett