From: Martin Waitz
Subject: Re: Submodule object store
Date: Tue, 27 Mar 2007 17:25:07 +0200
Message-ID: <20070327152507.GO22773@admingilde.org>
References: <20070326183523.GG22773@admingilde.org> <200703262133.44520.Josef.Weidendorfer@gmx.de> <20070326220302.GH22773@admingilde.org> <7vfy7rvct2.fsf@assigned-by-dhcp.cox.net> <20070326231637.GJ22773@admingilde.org> <7vy7ljtwir.fsf@assigned-by-dhcp.cox.net> <20070326233603.GL22773@admingilde.org> <20070326235527.GM22773@admingilde.org>
To: David Lang
Cc: Junio C Hamano, Josef Weidendorfer, Eric Lesh, Matthieu Moy, git@vger.kernel.org

hoi :)

It's really
funny that when I proposed one big object database everybody wanted it
separated and now that I propose a separate database everybody wants it
as one combined database.
I read this as a sign that people really try to think critically about
the design, which is a good thing and will hopefully lead to a good and
stable submodule implementation.

On Mon, Mar 26, 2007 at 03:40:15PM -0800, David Lang wrote:
> useing the same object store makes this work automaticaly (think of all the
> copies of COPYING that would end up being the same as a trivial example)

Yes, but I guess not much more than COPYING, INSTALL, some trivial
Makefiles and empty files will be shared between subprojects.
Except when you have the same subproject in your tree multiple times,
of course.

Yet this sharing is exactly why I started to do it that way, until
Linus stopped me.

> >If someone comes up with a nice way to handle everything in one big
> >object store I would happily use that! :-)
>
> what exactly are the problems with one big object store?

I think we really have to discuss this separation on several layers:
traversal, pack-files, and object database.

For the traversal, the point of separating it into a per-module traversal
is that only one module has to be loaded into RAM at a time.
This affects all operations which do a (potentially) recursive traversal:
push, pull, fsck, prune, repack.
However, a separated traversal will no longer be guaranteed to list
an object only once, so this has to be handled in some way.

Pack files should have better access patterns if they are per-module.
Most of the time you are only interested in one individual module, and
locality is important here.

Separating the entire object database is a way to improve unreachability
analysis, as it now can be done per module.
The other two separations are easier to implement with a separated
object database, but that's not too strong an argument.
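
To make the "list an object only once" problem concrete, here is a toy
Python sketch of a per-module traversal. Everything in it is an
assumption for illustration: the object ids, the graph, the module
boundaries and the function names are made up and do not correspond to
any actual git code. It shares one set of already-listed ids across the
module walks, which is only one possible way to handle the duplicates.

```python
# Toy object graph: each object id maps to the ids it references.
# All ids are hypothetical; real git objects are named by SHA-1.
OBJECTS = {
    "super-commit": ["super-tree"],
    "super-tree": ["COPYING", "sub-commit"],
    "sub-commit": ["sub-tree"],
    "sub-tree": ["COPYING", "main.c"],
    "COPYING": [],   # identical blob shared by both modules
    "main.c": [],
}

# Hypothetical module layout: root object and submodule boundaries.
MODULE_ROOTS = {"super": "super-commit", "sub": "sub-commit"}
MODULE_BOUNDARIES = {"super": {"sub-commit"}, "sub": set()}


def walk_module(root, boundaries, seen):
    """Yield objects reachable from root, one module at a time.

    Objects already in `seen` are skipped, and the walk does not
    descend into objects listed in `boundaries` (submodule roots).
    """
    stack = [root]
    while stack:
        oid = stack.pop()
        if oid in seen or oid in boundaries:
            continue
        seen.add(oid)
        yield oid
        stack.extend(OBJECTS[oid])


def list_all_objects():
    """Walk every module in turn, sharing only the set of seen ids."""
    seen = set()
    listed = []
    for name, root in MODULE_ROOTS.items():
        listed.extend(walk_module(root, MODULE_BOUNDARIES[name], seen))
    return listed
```

Without the shared `seen` set, COPYING would be listed once by each
module walk. The set holds only ids, not object contents, but it still
grows with the whole project.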

So if we can come up with a nice way to do unreachability analysis we
can indeed go on with the shared object database and tackle the
remaining scalability issues as they arise.
Per-module traversal and pack files could then be added later without
changing the on-disk format.

> ones that I can think of:
>
> 1. when you are doing a fsck you need to walk all the trees and find out
>    the list of objects that you know about.
>
>    done as a tree of binary values you can hold a LOT in memory before
>    running into swap.

Could you explain the algorithm you are thinking about in more detail?

> if it's enough larger then available ram then an option for fsck to use
> trees on disk is an option.

This could simplify some things.
There could be an on-disk index of all known objects, so that the sha1
sums do not have to be loaded into RAM all at once.

-- 
Martin Waitz