* How do you best store structured data in git repositories?
From: Sebastian Setzer @ 2009-12-02 21:08 UTC
To: git

Hi,

when you design a file format to store structured data, and you want to
manage these files with git, how do you do this best?

I'd like to hear about best practices, experiences, links to discussions
on this subject, ...

Here are some of my questions:

Do you store everything in a single file and configure git to use
special diff- and merge-tools?
Do you use XML for this purpose?

Do you take care that the contents of your file is as stable as possible
when it's saved or do you let your diff tools cope with issues like
reordering, reassignment of identifiers (for example when identifiers
are offsets in the file), ...?

Do you store one object/record per file (with filename=id, for example
with GUID-s) and hope that git will not mess them up when it merges
them?

Do you store records as directories, with very small files which contain
single attributes (because records can be considered sets of
key-value-pairs and the same applies to directories)? Do you configure
git to do a scalar merge on non-text "attributes" (with special file
extensions)?

When you don't store everything in a single, binary file: Do you use git
hooks to update an index for efficient queries on your structured data?
Do you update the whole index for every change? Or do you use git hashes
to decide which segment of your index needs to be updated?

greetings,
Sebastian
* Re: How do you best store structured data in git repositories?
From: Avery Pennarun @ 2009-12-02 21:17 UTC
To: sebastianspublicaddress; +Cc: git

On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
<sebastianspublicaddress@googlemail.com> wrote:
> Do you store everything in a single file and configure git to use
> special diff- and merge-tools?
> Do you use XML for this purpose?

XML is terrible for most data storage purposes. Data exchange, maybe,
but IMHO the best thing you can do when you get XML data is to put it
in some other format ASAP.

As it happens, I've been doing a project where we store a bunch of
stuff in csv format in git, and it works fairly well. We made a
special merge driver that can merge csv data (based on knowing which
columns should be treated as the "primary key").

> Do you take care that the contents of your file is as stable as possible
> when it's saved or do you let your diff tools cope with issues like
> reordering, reassignment of identifiers (for example when identifiers
> are offsets in the file), ...?

A custom merge driver is better, by far, than the builtin ones (which
were designed for source code) if you have any kind of structured data
that you don't want to have to merge by hand.

That said, however, you should still try to make your files as stable
as possible, because:

- If your program outputs the data in random order, it's just being
  sloppy anyway

- 'git diff' doesn't work usefully otherwise (for examining the data
  and debugging)

Of course, all bets are off if your file is actually binary; merging
and diffing is mostly impossible unless you use a totally custom
engine. And if your file contains byte offsets, then it's a binary
file, no matter what it looks like in your text editor. Adding a byte
in the middle would make such a file entirely nonsense, which is not
an attribute of a text file.

> Do you store one object/record per file (with filename=id, for example
> with GUID-s) and hope that git will not mess them up when it merges
> them?
>
> Do you store records as directories, with very small files which contain
> single attributes (because records can be considered sets of
> key-value-pairs and the same applies to directories)? Do you configure
> git to do a scalar merge on non-text "attributes" (with special file
> extensions)?

In git, you have to balance between its different limitations. If you
have a tonne of small files, it'll take you longer to retrieve a large
amount of data. If you have one big huge file, git will suck a lot of
memory when repacking. The best is to achieve a reasonable balance.

One trick that I've been using lately is to split large files
according to a rolling checksum:
http://alumnit.ca/~apenwarr/log/?m=200910#04

This generally keeps diffs useful, but keeps individual file sizes
down. Obviously the implementation pointed to there is just a toy,
but the idea is sound.

> When you don't store everything in a single, binary file: Do you use git
> hooks to update an index for efficient queries on your structured data?
> Do you update the whole index for every change? Or do you use git hashes
> to decide which segment of your index needs to be updated?

We keep a separate index file that's not part of git. When the git
repo is updated, we note which rows have changed, then update the
index.

Avery
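
[Editor's sketch] The csv merge driver Avery mentions is not shown in the
thread. The following is only a minimal illustration of the idea of a
three-way merge keyed on a "primary key" column, not his implementation.
The names read_rows and merge_csv are hypothetical, and the sketch assumes
a single key column, identical headers in all three versions, and that
keeping "ours" on a conflict is acceptable.

```python
import csv
import io


def read_rows(text, key):
    """Parse CSV text into {primary-key value: row dict}, plus the column names."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {row[key]: row for row in reader}
    return rows, reader.fieldnames


def merge_csv(base, ours, theirs, key):
    """Three-way merge of CSV texts keyed on one column (hypothetical helper).

    Returns (merged_text, conflicts). conflicts lists the keys that both
    sides changed in different ways; for those rows "ours" is kept.
    """
    base_rows, fields = read_rows(base, key)
    our_rows, _ = read_rows(ours, key)
    their_rows, _ = read_rows(theirs, key)

    merged, conflicts = {}, []
    for k in sorted(set(base_rows) | set(our_rows) | set(their_rows)):
        b, o, t = base_rows.get(k), our_rows.get(k), their_rows.get(k)
        if o == t:        # both sides agree (this includes "both deleted")
            result = o
        elif o == b:      # only theirs changed, added, or deleted the row
            result = t
        elif t == b:      # only ours changed, added, or deleted the row
            result = o
        else:             # both sides changed the row differently
            conflicts.append(k)
            result = o
        if result is not None:
            merged[k] = result

    # Write rows sorted by key so the output order is stable between runs.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for k in sorted(merged):
        writer.writerow(merged[k])
    return out.getvalue(), conflicts
```

To use something like this from git, a wrapper script would read the three
files git passes to a custom merge driver (the %O, %A and %B placeholders
of merge.<name>.driver), write the result back to the %A file, and exit
non-zero when conflicts is not empty.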
* Re: How do you best store structured data in git repositories?
From: David Aguilar @ 2009-12-04 0:14 UTC
To: Avery Pennarun; +Cc: sebastianspublicaddress, git

On Wed, Dec 02, 2009 at 04:17:10PM -0500, Avery Pennarun wrote:
> On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
> <sebastianspublicaddress@googlemail.com> wrote:
> > Do you store everything in a single file and configure git to use
> > special diff- and merge-tools?
> > Do you use XML for this purpose?
>
> XML is terrible for most data storage purposes. Data exchange, maybe,
> but IMHO the best thing you can do when you get XML data is to put it
> in some other format ASAP.

I agree 100%.

JSON's not too bad for data structures and is known to
be friendly to XML expats.

http://json.org/

> That said, however, you should still try to make your files as stable
> as possible, because:
>
> - If your program outputs the data in random order, it's just being
> sloppy anyway
>
> - 'git diff' doesn't work usefully otherwise (for examining the data
> and debugging)

If you were using Python + simplejson then using something
like the sort_keys=True flag would ensure that your data is
stable, as the dictionary keys will always appear in a
deterministic order.

Since I mentioned JSON and git in the same email then I might
as well also mention an old UGFWIINI candidate:

http://www.ordecon.com/2009/04/22/is-git-more-than-just-a-version-control-system/

Lastly, BERT might not be a good choice for storing inside
of a git repository, but it is a nice format for representing
data structures:

http://github.com/blog/531-introducing-bert-and-bert-rpc

We've been using git for tracking changes to a large set of
JSON files at $dayjob and it's worked out pretty well.

I'd suggest that you try to break your data up into multiple
files if possible. As someone else mentioned, it's often
easier to diff and merge stuff if you structure things in a
merge-friendly way.

One feature that we've implemented is file referencing where
data can "#include" another data file. That is the kind of
thing that can make things easier on you if you foresee
having a lot of common data that can be shared amongst the
various different files.

--
David
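
[Editor's sketch] A small illustration of the sort_keys=True suggestion
above. It uses the standard-library json module, which accepts the same
flag as simplejson; the helper name dump_stable and the choice of
indent=2 are assumptions made for this example only.

```python
import json


def dump_stable(data):
    """Serialize data so that equal inputs always produce identical text."""
    # sort_keys=True fixes the order of dictionary keys; indent=2 puts each
    # key on its own line, so a line-based 'git diff' shows one change per key.
    return json.dumps(data, sort_keys=True, indent=2) + "\n"


# Two dictionaries with the same content but different insertion order
# serialize to the same text.
first = dump_stable({"name": "example", "id": 7})
second = dump_stable({"id": 7, "name": "example"})
```

Writing files through one such function means a save that changes nothing
also produces no diff.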
* Re: How do you best store structured data in git repositories?
From: Avery Pennarun @ 2009-12-04 1:45 UTC
To: David Aguilar; +Cc: sebastianspublicaddress, git

On Thu, Dec 3, 2009 at 7:14 PM, David Aguilar <davvid@gmail.com> wrote:
> JSON's not too bad for data structures and is known to
> be friendly to XML expats.
>
> http://json.org/

yaml is also really good for storing structured data, and its
line-by-line format lends itself to easy merging (if you don't feel
like writing a custom merge algorithm).

Have fun,

Avery
* Re: How do you best store structured data in git repositories?
From: jamesmikedupont @ 2009-12-04 8:00 UTC
To: git

On Thu, Dec 3, 2009 at 7:14 PM, David Aguilar <davvid@gmail.com> wrote:
> JSON's not too bad for data structures and is known to
> be friendly to XML expats.
>
> http://json.org/

I am currently working on two projects in this direction:

1. mediawiki on git, using mediawiki markup files. I apologise that I
have not made progress on that lately, because I have had inspiration
on my older project.

2. the gcc rdf introspector, storage of the files in rdf. It is
working now with a mysql database, using the librdf mysql driver, and
running on a catalyst framework using jquery/jstree on the front end.

None of those formats is perfect; the sizing of the files is important.

I am returning individual nodes in json on the catalyst server and
that works to deliver the AST nodes from the compiler to the jstree
front end. But these fetches to the front end should be longer and
contain direct components of the fetched node. I think that a cluster
of nodes should be pulled together to make a more optimal system.

Here is just my two cents:

If you are using a distributed git data repository as your central
repository, then think about a database page. Imagine that you would
have pages of data being retrieved and compared.

Would it not make sense to split your data into pages that could be
swapped into memory directly, or with very little parsing, and then
used?

So, in effect, you would design the sizing of the pages and the page
contents around the usage model, since git is a low level storage
system.

I don't know what would be available if some database manager system
like mysql or postgres could be taught to store table pages in git.

just some ideas,
mike
* Re: How do you best store structured data in git repositories?
From: Sebastian Setzer @ 2009-12-07 21:20 UTC
To: git

On Thursday, Dec 03 2009 at 16:14 -0800, David Aguilar wrote:
> On Wed, Dec 02, 2009 at 04:17:10PM -0500, Avery Pennarun wrote:
> > On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
> > <sebastianspublicaddress@googlemail.com> wrote:
> > > Do you use XML for this purpose?
> >
> > XML is terrible for most data storage purposes.
>
> I agree 100%.
>
> JSON's not too bad for data structures and is known to
> be friendly to XML expats.
>
Sorry, I didn't want to start a flamewar against XML. I'm no big friend
of XML myself, but I don't know of an (open source) diff-/merge tool for
any general purpose file format other than XML or plain text.
When you mention other formats, I'd be interested in
- why this format is good for storage in git
- if there are merge tools available which ensure that, after a merge,
  the structure (and maybe additional constraints) is still valid.

Thanks for your comments,
Sebastian
* Re: How do you best store structured data in git repositories?
From: David Aguilar @ 2009-12-08 7:14 UTC
To: Sebastian Setzer; +Cc: git

On Mon, Dec 07, 2009 at 10:20:21PM +0100, Sebastian Setzer wrote:
> On Thursday, Dec 03 2009 at 16:14 -0800, David Aguilar wrote:
> > On Wed, Dec 02, 2009 at 04:17:10PM -0500, Avery Pennarun wrote:
> > > On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer
> > > <sebastianspublicaddress@googlemail.com> wrote:
> > > > Do you use XML for this purpose?
> > >
> > > XML is terrible for most data storage purposes.
> >
> > I agree 100%.
> >
> > JSON's not too bad for data structures and is known to
> > be friendly to XML expats.
> >
> Sorry, I didn't want to start a flamewar against XML. I'm no big friend
> of XML myself, but I don't know of an (open source) diff-/merge tool for
> any general purpose file format other than XML or plain text.
> When you mention other formats, I'd be interested in
> - why this format is good for storage in git
> - if there are merge tools available which ensure that, after a merge,
> the structure (and maybe additional constraints) is still valid.
>
> Thanks for your comments,
> Sebastian

Sorry, didn't mean to sound xml-flaming.

The only reason for mentioning json, yaml, etc. is that
they're good data structure formats. They're all plain text
formats, so you can use existing diff/merge tools.

I guess none of this has much to do with git aside from
being able to write custom merge drivers to operate on
them as data.

If there's a diff/merge tool for xml that works well then
hooking it up to git-{diff,merge}tool might be something to
try too.

--
David
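
[Editor's sketch] Nobody in the thread names a tool that guarantees a
structurally valid result after a merge. One hedged possibility, when the
ordinary line-based merge is used on JSON files, is to check the merged
file afterwards. The function name check_merged_json is hypothetical, and
the sketch only detects leftover conflict markers and JSON syntax errors;
it does not verify any application-level constraints.

```python
import json

# Prefixes git writes at the start of a line around an unresolved conflict.
CONFLICT_MARKERS = ("<<<<<<< ", "=======", ">>>>>>> ")


def check_merged_json(text):
    """Return a list of problems found in a merged JSON document (empty if none)."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(CONFLICT_MARKERS):
            problems.append(f"line {number}: unresolved conflict marker")
    if not problems:
        # Only try to parse once the markers are gone; a file that still
        # contains markers is never valid JSON anyway.
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            problems.append(f"line {err.lineno}: {err.msg}")
    return problems
```

A check like this could run from a custom merge driver or a hook and
refuse the result whenever the returned list is not empty.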