From mboxrd@z Thu Jan 1 00:00:00 1970 From: "C. Scott Ananian" Subject: Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems) Date: Wed, 20 Apr 2005 11:28:20 -0400 (EDT) Message-ID: References: <200504191250.10286.mason@suse.com> <200504191708.23536.mason@suse.com> <200504192049.21947.mason@suse.com> <42660708.60109@zytor.com> <2cfc403205042005116484231c@mail.gmail.com> <20050420132446.GA10126@macavity> <20050420151902.GA13175@macavity> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Git Mailing List X-From: git-owner@vger.kernel.org Wed Apr 20 17:26:14 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DOH3z-0008Ih-F0 for gcvg-git@gmane.org; Wed, 20 Apr 2005 17:24:27 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261683AbVDTP2l (ORCPT ); Wed, 20 Apr 2005 11:28:41 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261687AbVDTP2l (ORCPT ); Wed, 20 Apr 2005 11:28:41 -0400 Received: from sincerity-forever.csail.mit.edu ([128.30.67.31]:28046 "EHLO sincerity-forever.csail.mit.edu") by vger.kernel.org with ESMTP id S261683AbVDTP2e (ORCPT ); Wed, 20 Apr 2005 11:28:34 -0400 Received: from catfish.lcs.mit.edu ([128.30.67.25] helo=cag.csail.mit.edu) by sincerity-forever.csail.mit.edu with esmtp (Exim 3.36 #1 (Debian)) id 1DOH7v-0000O9-00; Wed, 20 Apr 2005 11:28:31 -0400 To: Martin Uecker In-Reply-To: <20050420151902.GA13175@macavity> Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org On Wed, 20 Apr 2005, Martin Uecker wrote: >> You can (and my code demonstrates/will demonstrate) still use a whole-file >> hash to use chunking. With content prefixes, this takes O(N ln M) time >> (where N is the file size and M is the number of chunks) to compute all >> hashes; if subtrees can share the same prefix, then you can do this in >> O(N) time (ie, as fast as possible, modulo a constant factor, which is >> '2'). You don't *need* internal hashing functions. > > I don't understand this paragraph. What is an internal > hash function? Your code seems to do exactly what I want. > The hashes are computed recusively as in a hash tree > with O(N ln N). The only difference between your design > and a design based on a conventional (binary) hash tree > seems to be that data is stored in the intermediate nodes > too. A merkle-tree (which I think you initially pointed me at) makes the hash of the internal nodes be a hash of the chunk's hashes; ie not a straight content hash. This is roughly what my current implementation does, but I would like to identify each subtree with the hash of the *(expanded) contents of that subtree* (ie no explicit reference to subtree hashes). This makes it interoperable with non-chunked or differently-chunked representations, in that the top-level hash is *just the hash of the complete content*, not some hash-of-subtree-hashes. Does that make more sense? The code I posted doesn't demonstrate this very well, but now that Linus has abandoned the 'hash of compressed content' stuff, my next code posting should show this more clearly. > If I don't miss anything essential, you can validate > each treap piece at the moment you get it from the > network with its SHA1 hash and then proceed with > downloading the prefix and suffix tree (in parallel > if you have more than one peer a la bittorrent). Yes, I guess this is the detail I was going to abandon. =) I viewed the fact that the top-level hash was dependent on the exact chunk makeup a 'misfeature', because it doesn't allow easy interoperability with existing non-chunked repos. --scott WTO atomic operation Mossad Castro overthrow FSF fissionable HTAUTOMAT LCPANES MKDELTA Bush non-violent protest OVER THE HORIZON RADAR KUPALM ( http://cscott.net/ )