git.vger.kernel.org archive mirror
From: Avery Pennarun <apenwarr@gmail.com>
To: sebastianspublicaddress@googlemail.com
Cc: git@vger.kernel.org
Subject: Re: How do you best store structured data in git repositories?
Date: Wed, 2 Dec 2009 16:17:10 -0500
Message-ID: <32541b130912021317y705d1d4cj28e230a3e727df2e@mail.gmail.com>
In-Reply-To: <1259788097.3590.29.camel@nord26-amd64>

On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
<sebastianspublicaddress@googlemail.com> wrote:
> Do you store everything in a single file and configure git to use
> special diff- and merge-tools?
> Do you use XML for this purpose?

XML is terrible for most data storage purposes.  Data exchange, maybe,
but IMHO the best thing you can do when you get XML data is to put it
in some other format ASAP.

As it happens, I've been doing a project where we store a bunch of
stuff in csv format in git, and it works fairly well.  We made a
special merge driver that can merge csv data (based on knowing which
columns should be treated as the "primary key").
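
To illustrate the idea (this is a minimal sketch, not the driver
mentioned above): a key-based three-way merge can be written in a few
lines of Python.  The names read_rows and merge_csv are hypothetical,
and the sketch assumes every file has a header row and a single
primary-key column.

```python
import csv
import io

def read_rows(text, key):
    """Parse CSV text into {primary-key value: row dict}."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

def merge_csv(base, ours, theirs, key):
    """Three-way merge keyed on `key`; returns (merged rows, conflicting keys)."""
    b, o, t = (read_rows(x, key) for x in (base, ours, theirs))
    merged, conflicts = {}, []
    for k in sorted(set(b) | set(o) | set(t)):
        rb, ro, rt = b.get(k), o.get(k), t.get(k)
        if ro == rt:        # both sides agree (including both deleting the row)
            result = ro
        elif ro == rb:      # only their side touched this row
            result = rt
        elif rt == rb:      # only our side touched this row
            result = ro
        else:               # both sides changed it differently
            conflicts.append(k)
            result = ro     # keep ours for now; the caller must resolve it
        if result is not None:
            merged[k] = result
    return merged, conflicts
```

A real driver would be registered through .gitattributes and the
merge.<name>.driver config setting, write its result over the "ours"
file, and exit non-zero when the conflict list is not empty.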

> Do you take care that the contents of your file is as stable as possible
> when it's saved or do you let your diff tools cope with issues like
> reordering, reassignment of identifiers (for example when identifiers
> are offsets in the file), ...?

A custom merge driver is better, by far, than the builtin ones (which
were designed for source code) if you have any kind of structured data
that you don't want to have to merge by hand.

That said, however, you should still try to make your files as stable
as possible, because:

- If your program outputs the data in random order, it's just being
sloppy anyway

- 'git diff' doesn't work usefully otherwise (for examining the data
and debugging)
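
One way to get stable output, sketched here under the assumption that
the data is a list of row dicts with a primary-key column (dump_stable
is a hypothetical helper name), is to sort by key and fix the column
order and line terminator before writing:

```python
import csv
import io

def dump_stable(rows, key, fieldnames):
    """Serialize rows to CSV text in a deterministic order."""
    out = io.StringIO()
    # Fixed column order and "\n" line endings keep the bytes identical
    # from run to run, regardless of platform or dict ordering.
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys are compared as strings here, so "10" sorts before "2";
    # any consistent order is enough to keep diffs small.
    for row in sorted(rows, key=lambda r: r[key]):
        writer.writerow(row)
    return out.getvalue()
```

With this, two saves of the same data produce identical files, and a
changed row shows up in 'git diff' as a one-line change.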

Of course, all bets are off if your file is actually binary; merging
and diffing are mostly impossible unless you use a totally custom
engine.  And if your file contains byte offsets, then it's a binary
file, no matter what it looks like in your text editor.  Adding a byte
in the middle would turn such a file into complete nonsense, which is
not an attribute of a text file.
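
A tiny made-up example of the byte-offset problem (the blob layout and
the offset table are invented purely for illustration):

```python
def read_record(blob, offset, length):
    """Fetch a record by absolute byte offset, as an offset-based format would."""
    return blob[offset:offset + length]

blob = b"HEADERalicebob"
table = {"alice": (6, 5), "bob": (11, 3)}   # name -> (offset, length)

intact = read_record(blob, *table["bob"])    # b'bob'

# Insert a single byte in the middle without rewriting the offset table.
edited = blob[:6] + b"!" + blob[6:]
broken = read_record(edited, *table["bob"])  # b'ebo'
```

Every record after the insertion point is now read from the wrong
place, which is why a line-based merge of such a file cannot work.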

> Do you store one object/record per file (with filename=id, for example
> with GUID-s) and hope that git will not mess them up when it merges
> them?
>
> Do you store records as directories, with very small files which contain
> single attributes (because records can be considered sets of
> key-value-pairs and the same applies to directories)? Do you configure
> git to do a scalar merge on non-text "attributes" (with special file
> extensions)?

In git, you have to balance between its different limitations.  If you
have a tonne of small files, it'll take you longer to retrieve a large
amount of data.  If you have one big huge file, git will suck a lot of
memory when repacking.  The best is to achieve a reasonable balance.

One trick that I've been using lately is to split large files
according to a rolling checksum:
http://alumnit.ca/~apenwarr/log/?m=200910#04

This generally keeps diffs useful, but keeps individual file sizes
down.  Obviously the implementation pointed to there is just a toy,
but the idea is sound.
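
For readers who don't want to follow the link, here is a rough sketch
of the general technique.  It uses a plain sum over a sliding window
as the "checksum", which is an assumption made for brevity and not
the implementation from the post; split_chunks, window and mask are
illustrative names and values.

```python
def split_chunks(data, window=16, mask=0x3F):
    """Cut `data` wherever the checksum of the last `window` bytes hits a pattern."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]     # drop the byte leaving the window
        # Cut when the low bits of the window sum are all ones, but never
        # emit a chunk shorter than one window.
        if i + 1 - start >= window and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])         # trailing remainder
    return chunks
```

Because each cut point depends on nearby content rather than on the
absolute position, an edit early in the file tends to change only the
chunks around it, and the rest can be stored unchanged.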

> When you don't store everything in a single, binary file: Do you use git
> hooks to update an index for efficient queries on your structured data?
> Do you update the whole index for every change? Or do you use git hashes
> to decide which segment of your index needs to be updated?

We keep a separate index file that's not part of git.  When the git
repo is updated, we note which rows have changed, then update the
index.
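
A sketch of that kind of incremental update, assuming the old and new
versions of the data are available as {key: row} dicts and the index
maps a field value to the set of keys that have it (changed_keys and
update_index are hypothetical names, not our actual code):

```python
def changed_keys(old_rows, new_rows):
    """Return (added-or-modified keys, deleted keys) between two {key: row} maps."""
    updated = {k for k, row in new_rows.items() if old_rows.get(k) != row}
    deleted = set(old_rows) - set(new_rows)
    return updated, deleted

def update_index(index, old_rows, new_rows, field):
    """Incrementally refresh a {field value: set of keys} lookup index."""
    updated, deleted = changed_keys(old_rows, new_rows)
    for k in updated | deleted:
        if k in old_rows:                   # forget the stale entry first
            index.get(old_rows[k][field], set()).discard(k)
    for k in updated:
        index.setdefault(new_rows[k][field], set()).add(k)
    return index
```

Only the rows that differ between the two versions are touched, so
the cost follows the size of the change rather than the size of the
data set.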

Avery

