From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Holmsand Subject: Re: [PATCH] improved delta support for git Date: Wed, 18 May 2005 19:15:56 +0200 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-From: git-owner@vger.kernel.org Wed May 18 19:20:43 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DYSBp-0006nF-Lk for gcvg-git@gmane.org; Wed, 18 May 2005 19:18:38 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262169AbVERRTI (ORCPT ); Wed, 18 May 2005 13:19:08 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262168AbVERRTI (ORCPT ); Wed, 18 May 2005 13:19:08 -0400 Received: from main.gmane.org ([80.91.229.2]:55689 "EHLO ciao.gmane.org") by vger.kernel.org with ESMTP id S262169AbVERRS7 (ORCPT ); Wed, 18 May 2005 13:18:59 -0400 Received: from list by ciao.gmane.org with local (Exim 4.43) id 1DYS9o-0006T3-UC for git@vger.kernel.org; Wed, 18 May 2005 19:16:32 +0200 Received: from h65n2fls35o265.telia.com ([217.211.115.65]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 18 May 2005 19:16:32 +0200 Received: from holmsand by h65n2fls35o265.telia.com with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 18 May 2005 19:16:32 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: git@vger.kernel.org X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: h65n2fls35o265.telia.com User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050404) X-Accept-Language: en-us, en In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Linus Torvalds wrote: > Has anybody tried: > > 4) don't limit yourself to previous-history-objects > > One of the things I liked best about the delta patches was that it is > history-neutral, and can happily delta an object against any other random > object, in the same tree, in a future tree, or in a past tree. > > Even without any history at all, there should be a noticeable amount of > delta opportunities, as different architectures often end up sharing files > that are quite similar, but not exactly the same. > > Now, that's a very expensive thing to do, since it changes the question of > "which object should I delta against" from O(1) to O(n) (where "n" is tyhe > total number of objects), and thus the whole deltafication from O(n) to > O(n**2), but especially together with your "max 10%" rule, you should be > able to limit your choices very effectively: if you know that your delta > should be within 10% of the total size, you can limit your "let's try that > object" search to other objects that are also within 10% of your object. Yeah, that sounds very interresting *and* very expensive... Ideally, I'd like to find the set of objects that should *not* be deltafied (i.e. the ideal "keyframe" objects), but that would generate the maximum number of small, depth-one deltas with the least total size. But I can't really see how that could be done in a number of deltafications significantly less than the number of atoms in the universe. Let me think about that some more, though. I'd like to try a couple of other approaches anyway: a) Sort all objects by size. Start by biggest (or smallest), and try to get as many max-10%-deltas out of that as possible, stopping the search when objects get too small (big) according to some size difference limit. Cross already deltafied objects off the list, and continue. Might work, and might be fast enough with a sufficiently small size limit. b) Use the same history-based approach as before, and in addition try to deltafy any "new" objects against other new objects and previous ones (say one or two commits back) in a given size range. That should catch renames, copys of the same template, etc. That shouldn't really affect performance, as new files are added comparatively seldom. > Doing this experiment at least once should be interesting. It may turn out > that the incremental space savings aren't all that noticeable, and that > the pure history-based one already finds 90% of all savings, making the > expensive version not worth it. It would be nice to _know_, though. I definitely agree. And I also agree that the history-based deltafication seems less than pure, from a "git, the object store that doesn't really care" point of view. On the other hand, the history-based thing has its advantages. It takes advantage of people's hard work to make patches as small as possible. It's fast. And (perhaps more importantly), it's deterministic. The "ideal" approach could possibly require every single blob to be redeltafied when a new object is added, if we want to stay ideal. And it could be done at commit-time, thus keeping git's nice promise of immutable files, while still keeping size requirements down. And as my current method gives roughly an 80% size reduction over "plain git", that might (by boring, excessively practical people) be considered enough :-) /dan