From mboxrd@z Thu Jan 1 00:00:00 1970 From: duchier@ps.uni-sb.de Subject: Re: [Gnu-arch-users] Re: [ANNOUNCEMENT] /Arch/ embraces `git' Date: Thu, 21 Apr 2005 13:46:45 +0200 Message-ID: <878y3ce4ii.fsf@star.lifl.fr> References: <200504201000.DAA04988@emf.net> <877jixfjxw.fsf@star.lifl.fr> <1114036849.5880.61.camel@perun.redhat.usu> <86d5soa42h.fsf@speedy.lifl.fr> <1114078877.5886.37.camel@perun.redhat.usu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: gnu-arch-dev@lists.seyza.com, talli@museatech.net, git@vger.kernel.org X-From: gnu-arch-dev-bounces@lists.seyza.com Thu Apr 21 13:56:24 2005 Return-path: Received: from [207.166.200.220] (helo=hera.museatech.net) by ciao.gmane.org with smtp (Exim 4.43) id 1DOaH4-0001yU-NU for gcvad-gnu-arch-dev@gmane.org; Thu, 21 Apr 2005 13:55:15 +0200 Received: (qmail 12065 invoked by uid 110); 21 Apr 2005 11:59:34 -0000 Received: from gnu-arch-dev-bounces@lists.seyza.com by hera by uid 102 with qmail-scanner-1.20 (clamscan: 0.60. spamassassin: 2.60. Clear:RC:1(127.0.0.1):. Processed in 0.204484 secs); 21 Apr 2005 11:59:34 -0000 Received: from unknown (HELO hera.museatech.net) (127.0.0.1) by hera.museatech.net with SMTP; 21 Apr 2005 11:59:32 -0000 Delivered-To: list-gnu-arch-dev@lists.seyza.com Received: (qmail 12034 invoked by uid 110); 21 Apr 2005 11:59:28 -0000 Received: from duchier@ps.uni-sb.de by hera by uid 102 with qmail-scanner-1.20 (clamscan: 0.60. spamassassin: 2.60. Clear:RC:1(207.166.200.201):. Processed in 0.087564 secs); 21 Apr 2005 11:59:28 -0000 Received: from unknown (HELO a1.mx.alal.us) (207.166.200.201) by hera.museatech.net with SMTP; 21 Apr 2005 11:59:28 -0000 Received: (qmail 4794 invoked from network); 21 Apr 2005 13:13:39 -0000 Received: from reserv6.univ-lille1.fr (193.49.225.20) by a1.mx.alal.us with SMTP; 21 Apr 2005 13:13:37 -0000 Received: from malonne.lifl.fr (malonne.lifl.fr [134.206.10.29]) by reserv6.univ-lille1.fr (8.13.3/jtpda-5.3.1) with ESMTP id j3LBkkk3010456 ; Thu, 21 Apr 2005 13:46:46 +0200 Received: from star.lifl.fr (star.lifl.fr [134.206.10.23]) by malonne.lifl.fr with ESMTP id j3LBkj624481 ; Thu, 21 Apr 2005 13:46:45 +0200 (MEST) To: Tomas Mraz In-Reply-To: <1114078877.5886.37.camel@perun.redhat.usu> (Tomas Mraz's message of "Thu, 21 Apr 2005 12:21:16 +0200") User-Agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.4 (gnu/linux) X-USTL-MailScanner-Information: Please contact the ISP for more information X-USTL-MailScanner: Found to be clean X-MailScanner-From: duchier@ps.uni-sb.de X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on dev.alal.com X-Spam-Level: X-Spam-Status: No, score=-1.3 required=9.0 tests=BAYES_00,NO_REAL_NAME autolearn=no version=3.0.2 X-BeenThere: gnu-arch-dev@lists.seyza.com X-Mailman-Version: 2.1.5 Precedence: list List-Id: a mailing list for GNU Arch developers List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: gnu-arch-dev-bounces@lists.seyza.com Errors-To: gnu-arch-dev-bounces@lists.seyza.com X-Broken-Reverse-DNS: no host name found for IP address 207.166.200.220 Tomas Mraz writes: >> Btw, if, as you indicate above, you do believe that a 1 level indexing should >> use [0:2], then it doesn't make much sense to me to also suggest that a 2 level >> indexing should use [0:1] as primary subkey :-) > > Why do you think so? IMHO we should always target a similar number of > files/subdirectories in a directories of the blob archive. So If I > always suppose that the archive would contain at most 16 millions of > files then the possible indexing schemes are either 1 level with key > length 3 (each directory would contain ~4096 files) or 2 level with key > length 2 (each directory would contain ~256 files). > Which one is better could be of course filesystem and hardware > dependent. First off, I have been using python slice notation, so when I write [0:2] I mean a key of length 2 (the second index is not included). I now realize that when you wrote the same you meant to include the second index. I believe that our disagreement comes from the fact that we are asking different questions. You consider the question of how to best index a fixed database and I consider the question of how to best index an ever increasing database. Now consider why we even want multiple indexing levels: presumably this is because certain operations become too costly when the size of a directory becomes too large. If that's not the case, then we might as well just have one big flat directory - perhaps that's even a viable option for some filesystems.[1] [1] there is the additional consideration that a hierarchical system implements a form of key compression by sharing key prefixes. I don't know at what point such an effect becomes beneficial, if ever. Now suppose we need at least one level of indexing. Under an assumption of uniform distribution of bits in keys, as more objects are added to the database, the lower levels are going to fill up uniformly. Therefore at those levels we are again faced with exactly the same indexing problem and thus should come up with exactly the same answer. This is why I believe that the scheme I proposed is best: when a bottom level directory fills up past a certain size, introduce under it an additional level, and reindex the keys. Since the "certain size" is fixed, this is a constant time operation. One could also entertain the idea of reindexing not just a bottom level directory but an entire subtree of the database (this would be closer to your idea of finding an optimal reindexing of just this part of the database). However this has the disadvantage that the operation's cost grows exponentially with the depth of the tree. Cheers, --Denys