From mboxrd@z Thu Jan 1 00:00:00 1970 From: Avery Pennarun Subject: Re: How do you best store structured data in git repositories? Date: Wed, 2 Dec 2009 16:17:10 -0500 Message-ID: <32541b130912021317y705d1d4cj28e230a3e727df2e@mail.gmail.com> References: <1259788097.3590.29.camel@nord26-amd64> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Cc: git@vger.kernel.org To: sebastianspublicaddress@googlemail.com X-From: git-owner@vger.kernel.org Wed Dec 02 22:17:41 2009 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1NFwZs-0001op-J6 for gcvg-git-2@lo.gmane.org; Wed, 02 Dec 2009 22:17:37 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754242AbZLBVRZ (ORCPT ); Wed, 2 Dec 2009 16:17:25 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753945AbZLBVRZ (ORCPT ); Wed, 2 Dec 2009 16:17:25 -0500 Received: from mail-yx0-f187.google.com ([209.85.210.187]:50864 "EHLO mail-yx0-f187.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753823AbZLBVRY (ORCPT ); Wed, 2 Dec 2009 16:17:24 -0500 Received: by yxe17 with SMTP id 17so548753yxe.33 for ; Wed, 02 Dec 2009 13:17:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :from:date:message-id:subject:to:cc:content-type; bh=Mi5eo1hHb+MxdM7gqUw6sBHxjEqt2Y8gBBrPxW9aIJk=; b=v6ks3D6lj5dxlaN6j6EaSV61Uq8XGDod9OjORdAKFrFkkIFus4XMsM1Z7XJqBR9XZ5 44Z0JXipkz1h4oQofxUs14s7xHq7Oe+mwcrwMeYJLf6yAtJl7kWWiJjN76F34yJR58sX ALckpgBditEI3IVQIB9E0AkgAFsrqNQTpQCJA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; b=fvK5h+YTpXvmdeIKjqrBcFWEt7YZzomfV36vvx08VFtXmaPu80zAfcykzZb0M4yPb+ TYLoN7pRCDVB4j7GCOoS1tBy1kKNHXp8ATBdbssHiK40wZ+iz/iwSjuV1XVWfYXfiqR5 HGDZ0e+37Oe/5gvJC43VkmkjlJLK4wn/ZQwu4= Received: by 10.150.120.8 with SMTP id s8mr1227295ybc.185.1259788650251; Wed, 02 Dec 2009 13:17:30 -0800 (PST) In-Reply-To: <1259788097.3590.29.camel@nord26-amd64> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Wed, Dec 2, 2009 at 4:08 PM, Sebastian Setzer wrote: > Do you store everything in a single file and configure git to use > special diff- and merge-tools? > Do you use XML for this purpose? XML is terrible for most data storage purposes. Data exchange, maybe, but IMHO the best thing you can do when you get XML data is to put it in some other format ASAP. As it happens, I've been doing a project where we store a bunch of stuff in csv format in git, and it works fairly well. We made a special merge driver that can merge csv data (based on knowing which columns should be treated as the "primary key"). > Do you take care that the contents of your file is as stable as possible > when it's saved or do you let your diff tools cope with issues like > reordering, reassignment of identifiers (for example when identifiers > are offsets in the file), ...? A custom merge driver is better, by far, than the builtin ones (which were designed for source code) if you have any kind of structured data that you don't want to have to merge by hand. That said, however, you should still try to make your files as stable as possible, because: - If your program outputs the data in random order, it's just being sloppy anyway - 'git diff' doesn't work usefully otherwise (for examining the data and debugging) Of course, all bets are off if your file is actually binary; merging and diffing is mostly impossible unless you use a totally custom engine. And if your file contains byte offsets, then it's a binary file, no matter that it looks like in your text editor. Adding a byte in the middle would make such a file entirely nonsense, which is not an attribute of a text file. > Do you store one object/record per file (with filename=id, for example > with GUID-s) and hope that git will not mess them up when it merges > them? > > Do you store records as directories, with very small files which contain > single attributes (because records can be considered sets of > key-value-pairs and the same applies to directories)? Do you configure > git to do a scalar merge on non-text "attributes" (with special file > extensions)? In git, you have to balance between its different limitations. If you have a tonne of small files, it'll take you longer to retrieve a large amount of data. If you have one big huge file, git will suck a lot of memory when repacking. The best is to achieve a reasonable balance. One trick that I've been using lately is to split large files according to a rolling checksum: http://alumnit.ca/~apenwarr/log/?m=200910#04 This generally keeps diffs useful, but keeps individual file sizes down. Obviously the implementation pointed to there is just a toy, but the idea is sound. > When you don't store everything in a single, binary file: Do you use git > hooks to update an index for efficient queries on your structured data? > Do you update the whole index for every change? Or do you use git hashes > to decide which segment of your index needs to be updated? We keep a separate index file that's not part of git. When the git repo is updated, we note which rows have changed, then update the index. Avery