From mboxrd@z Thu Jan 1 00:00:00 1970 From: Davide Libenzi Subject: Re: heads-up: git-index-pack in "next" is broken Date: Wed, 18 Oct 2006 14:21:58 -0700 (PDT) Message-ID: References: <7vy7rfsfqa.fsf@assigned-by-dhcp.cox.net> <7vslhnj58e.fsf@assigned-by-dhcp.cox.net> <7vbqoake1v.fsf@assigned-by-dhcp.cox.net> <20061017233630.72a0aae5.vsu@altlinux.ru> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Nicolas Pitre , Sergey Vlasov , Junio C Hamano , git@vger.kernel.org X-From: git-owner@vger.kernel.org Wed Oct 18 23:22:21 2006 Return-path: Envelope-to: gcvg-git@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by ciao.gmane.org with esmtp (Exim 4.43) id 1GaIrY-0006gF-CN for gcvg-git@gmane.org; Wed, 18 Oct 2006 23:22:08 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422981AbWJRVWE (ORCPT ); Wed, 18 Oct 2006 17:22:04 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1422986AbWJRVWE (ORCPT ); Wed, 18 Oct 2006 17:22:04 -0400 Received: from x35.xmailserver.org ([69.30.125.51]:20098 "EHLO x35.xmailserver.org") by vger.kernel.org with ESMTP id S1422981AbWJRVWB (ORCPT ); Wed, 18 Oct 2006 17:22:01 -0400 X-AuthUser: davidel@xmailserver.org Received: from alien.or.mcafeemobile.com by xmailserver.org with [XMail 1.23 ESMTP Server] id for from ; Wed, 18 Oct 2006 14:21:58 -0700 X-X-Sender: davide@alien.or.mcafeemobile.com To: Linus Torvalds In-Reply-To: X-GPG-FINGRPRINT: CFAE 5BEE FD36 F65E E640 56FE 0974 BF23 270F 474E X-GPG-PUBLIC_KEY: http://www.xmailserver.org/davidel.asc Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: On Wed, 18 Oct 2006, Linus Torvalds wrote: > On Wed, 18 Oct 2006, Davide Libenzi wrote: > > > > Speaking in general, seen at the hash function level, of course an interface > > should not give different result for different word sizes or word endianess. > > Considering the diff algorithm as interface, as I said, the output was > > unaffected by the 64 bits word size. It was just very slow. > > Well, even the output may actually be affected, in the case of _real_ hash > collisions (as opposed to just the hash _list_ collision that XDL_HASHLONG > caused). > > So I actually think it would be better to have "uint32_t" as the hash > value - because that would mean that all diffs (or, in the case of the > block-algorithm, the deltas) are guaranteed to give the same results > regardless of architecture. > > Right now, we actually generate a 64-bit hash value (BUT: for short lines, > it's likely only _interesting_ in the low bits, so the high bits tend to > have a very high likelihood of being zero). So hash collisions are > different: on a 32-bit architecture, two lines may have the same hash, > while on a 64-bit one, they are different. > > And together with some of the limiters we have (eg XDL_MAX_EQLIMIT) hash > collisions can sometimes affect the output. > > Admittedly, in _practice_ this is really unlikely to affect anything > (you'd get a valid diff in either case, they'd just possibly be subtly > different, and the input data must be _really_ strange to even see that > case), but I do think that the hash algorithm can matter. > > NOTE! I'm not talking about XDL_HASHLONG(), I'm talking about the > xdl_hash_record() hash, which returns differently-sized hash results on > 32-bit and 64-bit. And there are cases where we _only_ compare the hashes, > and don't actually double-check the contents. The hash value (hence the hash bucket index) simply directs you to the bucket where a real record-compare loop is performed. Collisions here means only performance loss (you don't spread buckets enough), nothing more. So the different behaviour does not apply to the diff algo. But, it would apply if the knowledge of the hash indexing would be "exported", for example inside an external file. Think about a trivial DB file where you store hash buckets on file an you want to lookup records based on the store hash layout. In that case, of course, if the hash function that generated the DB bucket layout is different from the one that you use to get the bucket index at lookup time, you're in trouble. IOW if the hash function result is not "exported" is some way, it doesn't really matter if it's an 'unsigned long' or a 'uint32_t'. In the same way you cannot export binary structures using 'int' or 'long', and expect to be compatible over different architectures. - Davide