* How do you best store structured data in git repositories?
@ 2009-12-02 21:08 Sebastian Setzer
2009-12-02 21:17 ` Avery Pennarun
0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Setzer @ 2009-12-02 21:08 UTC (permalink / raw)
To: git
Hi,
when you design a file format to store structured data, and you want to
manage these files with git, how do you do this best?
I'd like to hear about best practices, experiences, links to discussions
on this subject, ...
Here are some of my questions:
Do you store everything in a single file and configure git to use
special diff- and merge-tools?
Do you use XML for this purpose?
Do you take care that the contents of your file are as stable as possible
when it's saved or do you let your diff tools cope with issues like
reordering, reassignment of identifiers (for example when identifiers
are offsets in the file), ...?
Do you store one object/record per file (with filename=id, for example
with GUID-s) and hope that git will not mess them up when it merges
them?
Do you store records as directories, with very small files which contain
single attributes (because records can be considered sets of
key-value-pairs and the same applies to directories)? Do you configure
git to do a scalar merge on non-text "attributes" (with special file
extensions)?
When you don't store everything in a single, binary file: Do you use git
hooks to update an index for efficient queries on your structured data?
Do you update the whole index for every change? Or do you use git hashes
to decide which segment of your index needs to be updated?
greetings,
Sebastian
* Re: How do you best store structured data in git repositories?
2009-12-02 21:08 How do you best store structured data in git repositories? Sebastian Setzer
@ 2009-12-02 21:17 ` Avery Pennarun
2009-12-04 0:14 ` David Aguilar
0 siblings, 1 reply; 7+ messages in thread
From: Avery Pennarun @ 2009-12-02 21:17 UTC (permalink / raw)
To: sebastianspublicaddress; +Cc: git
On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
<sebastianspublicaddress@googlemail.com> wrote:
> Do you store everything in a single file and configure git to use
> special diff- and merge-tools?
> Do you use XML for this purpose?
XML is terrible for most data storage purposes. Data exchange, maybe,
but IMHO the best thing you can do when you get XML data is to put it
in some other format ASAP.
As it happens, I've been doing a project where we store a bunch of
stuff in csv format in git, and it works fairly well. We made a
special merge driver that can merge csv data (based on knowing which
columns should be treated as the "primary key").
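A rough sketch of what such a key-aware three-way merge could look like
in Python. This is not the driver described above: the function names,
the single-column key, and the assumption that all three versions share
the same columns are illustrative only. Git runs a custom merge driver
with the ancestor, current and other versions (%O, %A, %B in the
merge.<name>.driver setting) and expects the result in the current file,
so a real driver would wrap something like this in a small script.

```python
import csv
import io


def parse(text, key):
    """Map primary-key value -> row dict for one version of the file."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def merge_csv(base_text, ours_text, theirs_text, key):
    """Three-way merge of CSV rows keyed on `key` (hypothetical helper).

    Returns (merged_text, conflicts); conflicts lists the keys that
    both sides changed in different ways.
    """
    base = parse(base_text, key)
    ours = parse(ours_text, key)
    theirs = parse(theirs_text, key)
    # Assumption: every version has the same header as "ours".
    fieldnames = next(csv.reader(io.StringIO(ours_text)))

    merged, conflicts = {}, []
    for k in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(k), ours.get(k), theirs.get(k)
        if o == t:        # both sides agree (or both deleted the row)
            row = o
        elif o == b:      # only "theirs" touched this row
            row = t
        elif t == b:      # only "ours" touched this row
            row = o
        else:             # both changed it differently: keep ours, report
            conflicts.append(k)
            row = o
        if row is not None:
            merged[k] = row

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for k in sorted(merged):  # stable output order keeps diffs readable
        writer.writerow(merged[k])
    return out.getvalue(), sorted(conflicts)
```

Rows are written back sorted by key, so the merged file does not depend
on the order in which either side saved its rows.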
> Do you take care that the contents of your file are as stable as possible
> when it's saved or do you let your diff tools cope with issues like
> reordering, reassignment of identifiers (for example when identifiers
> are offsets in the file), ...?
A custom merge driver is better, by far, than the builtin ones (which
were designed for source code) if you have any kind of structured data
that you don't want to have to merge by hand.
That said, however, you should still try to make your files as stable
as possible, because:
- If your program outputs the data in random order, it's just being
sloppy anyway
- 'git diff' doesn't work usefully otherwise (for examining the data
and debugging)
Of course, all bets are off if your file is actually binary; merging
and diffing is mostly impossible unless you use a totally custom
engine. And if your file contains byte offsets, then it's a binary
file, no matter what it looks like in your text editor. Adding a byte
in the middle would make such a file entirely nonsense, which is not
an attribute of a text file.
> Do you store one object/record per file (with filename=id, for example
> with GUID-s) and hope that git will not mess them up when it merges
> them?
>
> Do you store records as directories, with very small files which contain
> single attributes (because records can be considered sets of
> key-value-pairs and the same applies to directories)? Do you configure
> git to do a scalar merge on non-text "attributes" (with special file
> extensions)?
In git, you have to balance between its different limitations. If you
have a tonne of small files, it'll take you longer to retrieve a large
amount of data. If you have one big huge file, git will suck a lot of
memory when repacking. The best is to achieve a reasonable balance.
One trick that I've been using lately is to split large files
according to a rolling checksum:
http://alumnit.ca/~apenwarr/log/?m=200910#04
This generally keeps diffs useful, but keeps individual file sizes
down. Obviously the implementation pointed to there is just a toy,
but the idea is sound.
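To make the idea concrete, here is a minimal sketch of content-defined
splitting in Python. It is not the implementation from the link above:
the plain byte sum used as a checksum, the window size and the mask are
all assumptions chosen for brevity.

```python
def split_chunks(data, window=16, mask=0x3F):
    """Split `data` (bytes) wherever a rolling sum hits a fixed pattern.

    The sum covers only the last `window` bytes, so each cut point
    depends on nearby content rather than on absolute file offsets.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # slide the window forward
        if (rolling & mask) == mask:      # low bits all set: cut here
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                 # trailing remainder
        chunks.append(data[start:])
    return chunks
```

Because each cut depends only on the preceding `window` bytes, an
insertion near the start of a file moves the nearby cut points but the
later chunks come out byte-for-byte identical. A real tool would use a
stronger checksum and probably enforce minimum and maximum chunk sizes.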
> When you don't store everything in a single, binary file: Do you use git
> hooks to update an index for efficient queries on your structured data?
> Do you update the whole index for every change? Or do you use git hashes
> to decide which segment of your index needs to be updated?
We keep a separate index file that's not part of git. When the git
repo is updated, we note which rows have changed, then update the
index.
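A sketch of that kind of incremental update, assuming the old and new
versions of the data have already been parsed into dicts keyed by
primary key, and that the index maps one column's value to the set of
keys that carry it. The names and the index layout are hypothetical,
not a description of the actual system.

```python
def changed_rows(old, new):
    """Return (removed_keys, upserted_rows) between two versions."""
    removed = [k for k in old if k not in new]
    upserted = {k: row for k, row in new.items() if old.get(k) != row}
    return removed, upserted


def update_index(index, old, new, column):
    """Refresh `index` (column value -> set of keys) for changed rows only."""
    removed, upserted = changed_rows(old, new)
    # Drop stale entries for rows that were deleted or modified.
    for k in removed + [k for k in upserted if k in old]:
        index.get(old[k][column], set()).discard(k)
    # Add entries for rows that are new or modified.
    for k, row in upserted.items():
        index.setdefault(row[column], set()).add(k)
    return index
```

Rows that did not change are never touched, so the cost of an update
follows the size of the change rather than the size of the data set.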
Avery
* Re: How do you best store structured data in git repositories?
2009-12-02 21:17 ` Avery Pennarun
@ 2009-12-04 0:14 ` David Aguilar
2009-12-04 1:45 ` Avery Pennarun
2009-12-07 21:20 ` Sebastian Setzer
0 siblings, 2 replies; 7+ messages in thread
From: David Aguilar @ 2009-12-04 0:14 UTC (permalink / raw)
To: Avery Pennarun; +Cc: sebastianspublicaddress, git
On Wed, Dec 02, 2009 at 04:17:10PM -0500, Avery Pennarun wrote:
> On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
> <sebastianspublicaddress@googlemail.com> wrote:
> > Do you store everything in a single file and configure git to use
> > special diff- and merge-tools?
> > Do you use XML for this purpose?
>
> XML is terrible for most data storage purposes. Data exchange, maybe,
> but IMHO the best thing you can do when you get XML data is to put it
> in some other format ASAP.
I agree 100%.
JSON's not too bad for data structures and is known to
be friendly to XML expats.
http://json.org/
> That said, however, you should still try to make your files as stable
> as possible, because:
>
> - If your program outputs the data in random order, it's just being
> sloppy anyway
>
> - 'git diff' doesn't work usefully otherwise (for examining the data
> and debugging)
If you were using Python + simplejson then using something
like the sort_keys=True flag would ensure that your data
is stable, as the dictionary keys will always appear in a
deterministic order.
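For illustration, a minimal sketch using the standard-library json
module, which accepts the same sort_keys flag as simplejson. The helper
name and the choice of indent=2 are assumptions, not part of either
library.

```python
import json


def dump_stable(data):
    """Serialize `data` so that equal content always yields equal text."""
    # sort_keys fixes the key order; indent puts each key on its own
    # line, which keeps line-based diffs and merges small.
    return json.dumps(data, sort_keys=True, indent=2) + "\n"
```

Two dictionaries with the same content but different insertion order
then produce identical files, so 'git diff' only shows real changes.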
Since I mentioned JSON and git in the same email then I might as
well also mention an old UGFWIINI candidate:
http://www.ordecon.com/2009/04/22/is-git-more-than-just-a-version-control-system/
Lastly, BERT might not be a good choice for storing inside
of a git repository, but it is a nice format for representing
data structures:
http://github.com/blog/531-introducing-bert-and-bert-rpc
We've been using git for tracking changes to a large set of
JSON files at $dayjob and it's worked out pretty well.
I'd suggest that you try to break your data up into multiple
files if possible. As someone else mentioned, it's often
easier to diff and merge stuff if you structure things in a
merge-friendly way.
One feature that we've implemented is file referencing
where data can "#include" another data file. That is
the kind of thing that can make things easier on you if
you foresee having a lot of common data that can be
shared amongst the various different files.
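One way such file referencing could work for JSON data, as a sketch
only. The {"#include": "name"} convention and the `load` callback are
invented for this example and are not the scheme described above.

```python
def resolve(node, load):
    """Recursively replace {"#include": name} nodes with loaded data.

    `load` is any callable that takes a name and returns parsed data,
    for example a function that opens and json.load()s a file.
    Note: this sketch does not detect include cycles.
    """
    if isinstance(node, dict):
        if set(node) == {"#include"}:
            return resolve(load(node["#include"]), load)
        return {k: resolve(v, load) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, load) for v in node]
    return node
```

Shared data then lives in one file, and a change to it appears as a
single diff instead of being repeated in every file that uses it.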
--
David
* Re: How do you best store structured data in git repositories?
2009-12-04 0:14 ` David Aguilar
@ 2009-12-04 1:45 ` Avery Pennarun
2009-12-04 8:00 ` jamesmikedupont
2009-12-07 21:20 ` Sebastian Setzer
1 sibling, 1 reply; 7+ messages in thread
From: Avery Pennarun @ 2009-12-04 1:45 UTC (permalink / raw)
To: David Aguilar; +Cc: sebastianspublicaddress, git
On Thu, Dec 3, 2009 at 7:14 PM, David Aguilar <davvid@gmail.com> wrote:
> JSON's not too bad for data structures and is known to
> be friendly to XML expats.
>
> http://json.org/
yaml is also really good for storing structured data, and its
line-by-line format lends itself to easy merging (if you don't feel
like writing a custom merge algorithm).
Have fun,
Avery
* Re: How do you best store structured data in git repositories?
2009-12-04 1:45 ` Avery Pennarun
@ 2009-12-04 8:00 ` jamesmikedupont
0 siblings, 0 replies; 7+ messages in thread
From: jamesmikedupont @ 2009-12-04 8:00 UTC (permalink / raw)
To: git
On Thu, Dec 3, 2009 at 7:14 PM, David Aguilar <davvid@gmail.com> wrote:
> JSON's not too bad for data structures and is known to
> be friendly to XML expats.
>
> http://json.org/
I am currently working on two projects in this direction:
1. mediawiki on git, using mediawiki markup files. I apologise that I
have not made progress on that lately, because I have had inspiration
on my older project
2. the gcc rdf introspector, storage of the files in rdf. It is
working now with a mysql database, using the librdf mysql driver, and
running on a catalyst framework using jquery/jstree on the front end.
None of those formats is perfect; the sizing of the files is
important. I am returning individual nodes in json on the catalyst
server and that works to deliver the AST nodes from the compiler to
the jstree front end. But these fetches to the front end should be
longer and contain direct components of the fetched node. I think that
a cluster of nodes should be pulled together to make a more optimal
system.
here is just my two cents:
if you are using a distributed git data repository as your central
repository, then think about a database page. Imagine that you would have
pages of data being retrieved and compared.
Would it not make sense to split your pages into something that would be swapped
into memory directly, or with very little parsing, and then used?
So, in effect, you would design the sizing of the pages and the page
contents around the usage model, since git is a low level storage system.
I don't know what would become possible if some database manager system like
mysql or postgres could be taught to store table pages in git.
just some ideas,
mike
* Re: How do you best store structured data in git repositories?
2009-12-04 0:14 ` David Aguilar
2009-12-04 1:45 ` Avery Pennarun
@ 2009-12-07 21:20 ` Sebastian Setzer
2009-12-08 7:14 ` David Aguilar
1 sibling, 1 reply; 7+ messages in thread
From: Sebastian Setzer @ 2009-12-07 21:20 UTC (permalink / raw)
To: git
On Thursday, Dec 03 2009 at 16:14 -0800, David Aguilar wrote:
> On Wed, Dec 02, 2009 at 04:17:10PM -0500, Avery Pennarun wrote:
> > On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
> > <sebastianspublicaddress@googlemail.com> wrote:
> > > Do you use XML for this purpose?
> >
> > XML is terrible for most data storage purposes.
>
> I agree 100%.
>
> JSON's not too bad for data structures and is known to
> be friendly to XML expats.
>
Sorry, I didn't want to start a flamewar against XML. I'm no big friend
of XML myself, but I don't know of an (open source) diff-/merge tool for
any general purpose file format other than XML or plain text.
When you mention other formats, I'd be interested in
- why this format is good for storage in git
- if there are merge tools available which ensure that, after a merge,
the structure (and maybe additional constraints) is still valid.
Thanks for your comments,
Sebastian
* Re: How do you best store structured data in git repositories?
2009-12-07 21:20 ` Sebastian Setzer
@ 2009-12-08 7:14 ` David Aguilar
0 siblings, 0 replies; 7+ messages in thread
From: David Aguilar @ 2009-12-08 7:14 UTC (permalink / raw)
To: Sebastian Setzer; +Cc: git
On Mon, Dec 07, 2009 at 10:20:21PM +0100, Sebastian Setzer wrote:
> On Thursday, Dec 03 2009 at 16:14 -0800, David Aguilar wrote:
> > On Wed, Dec 02, 2009 at 04:17:10PM -0500, Avery Pennarun wrote:
> > > On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
> > > <sebastianspublicaddress@googlemail.com> wrote:
> > > > Do you use XML for this purpose?
> > >
> > > XML is terrible for most data storage purposes.
> >
> > I agree 100%.
> >
> > JSON's not too bad for data structures and is known to
> > be friendly to XML expats.
> >
> Sorry, I didn't want to start a flamewar against XML. I'm no big friend
> of XML myself, but I don't know of an (open source) diff-/merge tool for
> any general purpose file format other than XML or plain text.
> When you mention other formats, I'd be interested in
> - why this format is good for storage in git
> - if there are merge tools available which ensure that, after a merge,
> the structure (and maybe additional constraints) is still valid.
>
> Thanks for your comments,
> Sebastian
Sorry, didn't mean to sound xml-flaming. The only reason for
mentioning json, yaml, etc. is that they're good data structure
formats. They're all plain text formats, so you can use existing
diff/merge tools.
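A line-based merge does not know about the format, though, so the
result can be syntactically broken even when no conflict was reported.
A minimal sketch of a post-merge check for JSON; the function name and
the error strings are made up for this example.

```python
import json


def check_merged_json(text):
    """Return None if `text` is well-formed JSON, else an error string."""
    # Leftover conflict markers mean the merge was never resolved.
    for marker in ("<<<<<<<", "=======", ">>>>>>>"):
        if any(line.startswith(marker) for line in text.splitlines()):
            return "unresolved conflict marker: " + marker
    try:
        json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return "invalid JSON: " + str(exc)
    return None
```

This only checks syntax. Any additional constraints on the data would
need their own checks on the parsed result.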
I guess none of this has much to do with git aside from being
able to write custom merge drivers to operate on them as data.
If there's a diff/merge tool for xml that works well then
hooking it up to git-{diff,merge}tool might be something
to try too.
--
David