From mboxrd@z Thu Jan 1 00:00:00 1970 From: "C. Scott Ananian" Subject: Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems) Date: Wed, 20 Apr 2005 10:30:15 -0400 (EDT) Message-ID: References: <200504191250.10286.mason@suse.com> <200504191708.23536.mason@suse.com> <200504192049.21947.mason@suse.com> <42660708.60109@zytor.com> <2cfc403205042005116484231c@mail.gmail.com> <20050420132446.GA10126@macavity> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Git Mailing List X-From: git-owner@vger.kernel.org Wed Apr 20 16:27:26 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DOGA0-0005im-Jx for gcvg-git@gmane.org; Wed, 20 Apr 2005 16:26:36 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261637AbVDTOaf (ORCPT ); Wed, 20 Apr 2005 10:30:35 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261650AbVDTOaf (ORCPT ); Wed, 20 Apr 2005 10:30:35 -0400 Received: from sincerity-forever.csail.mit.edu ([128.30.67.31]:10889 "EHLO sincerity-forever.csail.mit.edu") by vger.kernel.org with ESMTP id S261637AbVDTOa1 (ORCPT ); Wed, 20 Apr 2005 10:30:27 -0400 Received: from catfish.lcs.mit.edu ([128.30.67.25] helo=cag.csail.mit.edu) by sincerity-forever.csail.mit.edu with esmtp (Exim 3.36 #1 (Debian)) id 1DOGDh-0008Qk-00; Wed, 20 Apr 2005 10:30:25 -0400 To: Martin Uecker In-Reply-To: <20050420132446.GA10126@macavity> Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org On Wed, 20 Apr 2005, Martin Uecker wrote: > The other thing I don't like is the use of a sha1 > for a complete file. Switching to some kind of hash > tree would allow to introduce chunks later. This has > two advantages: You can (and my code demonstrates/will demonstrate) still use a whole-file hash to use chunking. With content prefixes, this takes O(N ln M) time (where N is the file size and M is the number of chunks) to compute all hashes; if subtrees can share the same prefix, then you can do this in O(N) time (ie, as fast as possible, modulo a constant factor, which is '2'). You don't *need* internal hashing functions. > It would allow git to scale to repositories of large > binary files. And it would allow to build a very cool > content transport algorithm for those repositories. > This algorithm could combine all the advantages of > bittorrent and rsync (without the cpu load). Yes, the big benefit of internal hashing is that it lets you check validity of a chunk w/o having the entire file available. I'm not sure that's terribly useful in this case. [And, if it is, then it can obviously be done w/ other means.] > And it would allow trivial merging of patches which > apply to different chunks of a file in exact the same > way as merging changesets which apply to different > files in a tree. I'm not sure anyone should be looking at chunks. To me, at least, they are an object-store-implementation detail only. For merging, etc, we should be looking at whole files, or (better) the whole repository. The chunking algorithm is guaranteed not to respect semantic boundaries (for *some* semantics of *some* file). --scott explosion JMTRAX DC KUBARK biowarfare LCFLUTTER ESMERALDITE for Dummies Hager Nader Israel General ZRMETAL Castro cryptographic Indonesia ( http://cscott.net/ )