git.vger.kernel.org archive mirror
From: Jonathan Nieder <jrnieder@gmail.com>
To: Noah Silverman <noah@smartmediacorp.com>
Cc: git@vger.kernel.org, Avery Pennarun <apenwarr@gmail.com>
Subject: Re: Advice on choosing git
Date: Wed, 12 May 2010 04:24:46 -0500
Message-ID: <20100512092446.GA17520@progeny.tock>
In-Reply-To: <4BEA4B46.6010009@smartmediacorp.com>

Hi,

Noah Silverman wrote:

> I'm looking for both a version control system and backup system.

I am fond of this question. :)

> I guess, that I need just keep some files backed up (and/or synced) as
> they're not "working projects".  I will add new documents and
> occasionally edit others, but no real need for versioning.

I suggest rsync or unison[1], and to use btrfs locally if you want
snapshots.  I don’t know a good tool for shared snapshots, but that is
probably my ignorance.

In my humble opinion, tools designed for tracking source code, like
git and bzr, are not appropriate for this task.  To illustrate this, I
have put some thoughts about how to cheat git into doing an okay job
in a footnote[4].

> Other files
> are working projects (possible with collaboration) and need active VCS. 

In very small projects, I believe any free DVCS will do.

What tools are you and your collaborators already comfortable with?
I hear it can be hard to unlearn habits from using Subversion when
getting started with Git.  Some other version control systems cater to
that transition better.

As projects scale in size, the speed differences between version
control systems start to matter.  I find myself making larger commits,
looking through history less, and checking email more often when using
certain systems.

> From what I have read, I will
> effectively have multiple copies of each item on my hard drive, thus
> eating up a lot of space (One of the "working file"and several in the
> .git directory.) If I have multiple changes to a file, then I have
> several full versions of it on my machine.

If your files are relatively compressible (or at least rsyncable) and
you pack the repository occasionally, this should not be a
problem.  The relevant page[2] of the Pro Git book probably tells you
more than you wanted to know about this.

Short summary: each file is initially stored in the .git directory as
a compressed file named after its content.  When asked to pack with
the "git gc"[3] command (or automatically if there are too many
unpacked objects around), git puts the data into a larger "pack file",
this time as a delta against some suitable similar blob.
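
To make "named after its content" concrete, here is a minimal Python
sketch of how a loose blob gets its name and on-disk form in a SHA-1
repository.  The function names are made up for illustration; only the
"blob <size>\0" header, the SHA-1 hash, and the zlib compression are
git's own.  Packing and deltas are not shown.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Object name git gives a blob with this content (SHA-1 repositories)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Where an unpacked object with this name lives under .git."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(content: bytes) -> bytes:
    """Bytes written to that path: the header plus content, zlib-compressed."""
    header = b"blob %d\0" % len(content)
    return zlib.compress(header + content)


if __name__ == "__main__":
    data = b"hello\n"
    oid = git_blob_id(data)
    print(oid)                      # same name "git hash-object" reports
    print(loose_object_path(oid))
    print(len(loose_object_bytes(data)), "bytes on disk before packing")
```

Two files with identical content get the same name, so they are stored
once no matter how many revisions or paths refer to them.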

For source code (which is already rather compressible), this tends to
work well.  My local git/.git object repository is about 2½ times the
size of the working copy.

> This could be a problem for
> a directory with 100GB or more, especially on a laptop with limited hard
> drive space.

Yes.  Actually, this point is why I replied.  Using a source code
management system as a backup system generally implies this weird
assumption that even the oldest revisions are always worth keeping.

With big, machine-generated files, that doesn’t make sense to me ---
it is better to be able to throw away some snapshots when you are
running low on space.

> 2) Sub-directory selection.  On my laptop, I only want a few
> sub-directories to be synced up.  I don't need my whole document tree,
> but just a few directories of things I work on.

It requires foresight, but you could use a separate filesystem for
this (possibly loop-mounted) if you want to keep snapshots.  With
some symlinks, this would not require changing the directory
structure.

> Any and all suggestions are welcome and appreciated.

Thanks for the food for thought.
Jonathan

[1] http://www.cis.upenn.edu/~bcpierce/unison/
[2] http://progit.org/book/ch9-4.html
[3] http://www.kernel.org/pub/software/scm/git/docs/git-gc.html
[4]
So, you want to use git as a general backup tool?

 . Files should be compressible.  Set appropriate attributes.  Use
   clean and smudge filters[5] to replace the weird working-copy
   representation with a simpler tracked form.  Use !delta[6] where
   appropriate so git knows not to waste its time.

 . Files should be conducive to de-duplication.  Cut large files
   into slices using rsync’s rolling checksum algorithm[7].

 . Backups should be fault-tolerant.  Use par2[8] or zfec[9] to
   protect pack files, maybe.

 . Sometimes metadata (file owners and modes) is important.  Track a
   "restore" script that sets the appropriate metadata, and update it
   before each commit[10].

 . Files should not change as git reads them (or it will error
   out).  Wait for a quiescent state to back up, or make a
   snapshot some other way and ask git to back up that.

 . Old revisions are not precious.  It would be nice to be able to
   decide when each backed-up tree can expire.  My best suggestion is
   to rely on reflogs[11] instead of the revision graph to represent
   your history so old versions can expire, but getting this to work
   nicely would take some work: there is no built-in mechanism to
   transfer reflogs and associated objects to another repository, for
   example.
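
The slicing idea in the de-duplication item above can be sketched in
Python as follows.  This is a simplified illustration of cutting at
content-defined boundaries with an rsync-style rolling checksum, not
the exact algorithm or parameters that bup[7] uses; WINDOW, MASK, and
the function names are assumptions chosen for the example.

```python
import hashlib
import random

WINDOW = 48            # assumed window size, in bytes
MASK = (1 << 11) - 1   # assumed: cut when the low 11 bits are all ones


def split_slices(data: bytes, window: int = WINDOW, mask: int = MASK):
    """Return slices of `data`, cut wherever the rolling checksum matches."""
    slices = []
    a = b = 0          # rsync-style sums over the last `window` bytes
    start = 0
    for i, new in enumerate(data):
        # Byte leaving the window (zero until the window has filled up).
        old = data[i - window] if i >= window else 0
        a = (a + new - old) & 0xFFFF           # plain sum of the window
        b = (b + a - window * old) & 0xFFFF    # position-weighted sum
        if (b & mask) == mask:
            slices.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        slices.append(data[start:])
    return slices


def slice_ids(data: bytes):
    """Name each slice after its content, as git does for blobs."""
    return [hashlib.sha1(s).hexdigest() for s in split_slices(data)]


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(200_000))
    edited = b"x" * 100 + original     # insert some bytes at the front
    before = slice_ids(original)
    after = set(slice_ids(edited))
    shared = sum(1 for name in before if name in after)
    print(f"{shared} of {len(before)} slices unchanged after the edit")
```

Because each cut depends only on the last WINDOW bytes, an insertion
near the start of a file shifts the later slices without changing their
content, so they hash to the same names and are stored only once.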

[5] http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html#_tt_filter_tt
[6] http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html#_tt_delta_tt
[7] http://github.com/apenwarr/bup
[8] http://parchive.sourceforge.net/
[9] http://allmydata.org/trac/zfec
[10] http://kitenet.net/~joey/code/etckeeper/
[11] http://www.kernel.org/pub/software/scm/git/docs/git-reflog.html
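
For the metadata item in footnote [4], a "restore" script could be
generated along these lines.  This is only a sketch under assumptions:
the function name and the ./-relative shell output are invented for
the example, it records just modes and numeric owners on a POSIX
system, and it is not the format etckeeper[10] actually uses.

```python
import os
import shlex
import stat
import tempfile


def restore_script(root: str) -> str:
    """Build a shell script that re-applies modes and owners under `root`."""
    lines = ["#!/bin/sh",
             "# Generated sketch: re-applies modes and numeric owners."]
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own directory and keep the output order stable.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode):
                continue            # git already records symlinks itself
            rel = shlex.quote("./" + os.path.relpath(path, root))
            lines.append(f"chmod {stat.S_IMODE(st.st_mode):04o} {rel}")
            lines.append(f"chown {st.st_uid}:{st.st_gid} {rel}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        secret = os.path.join(root, "secret.txt")
        with open(secret, "w") as f:
            f.write("example\n")
        os.chmod(secret, 0o600)
        print(restore_script(root), end="")
```

The output would be written to a tracked file before each commit and
run from the top of the working tree after a checkout.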
