From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Marko Kreen" Subject: Re: I'm a total push-over.. Date: Fri, 25 Jan 2008 22:52:06 +0200 Message-ID: References: <4796FBB6.9080609@op5.se> <20080123091558.GP14871@dpotapov.dyndns.org> <4797095F.9020602@op5.se> <4797518A.3040704@op5.se> <4798B633.8040606@op5.se> <37fcd2780801240828vac82e6ds4da5aecde56e8d2f@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: "Dmitry Potapov" , "Andreas Ericsson" , "Git Mailing List" , "Junio C Hamano" To: "Linus Torvalds" X-From: git-owner@vger.kernel.org Fri Jan 25 21:52:44 2008 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1JIVXX-0003Oe-Fs for gcvg-git-2@gmane.org; Fri, 25 Jan 2008 21:52:43 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754870AbYAYUwM (ORCPT ); Fri, 25 Jan 2008 15:52:12 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753636AbYAYUwL (ORCPT ); Fri, 25 Jan 2008 15:52:11 -0500 Received: from fk-out-0910.google.com ([209.85.128.188]:47410 "EHLO fk-out-0910.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752858AbYAYUwJ (ORCPT ); Fri, 25 Jan 2008 15:52:09 -0500 Received: by fk-out-0910.google.com with SMTP id z23so758768fkz.5 for ; Fri, 25 Jan 2008 12:52:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=ohoQAnofjLQu05UkBSoA68iX8KbUctBjsDyHsACafoA=; b=Oe6EPuV+QX+5JaGrpK1dOpI18brBuPHmhs+LAVXLhP2JLF/hFxnwvWBaNdzTPNO1ZAar+OjQMJiQd7mgRNLcQ2BcPCZ2woMc8o8ayzdErauokwb85j5JgQHRJobFDjigYFNjF1oSbIHYDuwpQxLM2PKnIUBuelxpvdEuXhHiUnM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=pMglNiHoY0q7omc/QQGn7B1xj/sNXcacCHalH19chUYaAyEHolPLZQltxToWTqMfCKWw9CKRlHYcVhP+NeFMiJ7K9/QFLOI1xvr98RnjMZatrvCFNB4zr3VcSNIK8+ivFL9+XVZNgVNHouaCwajE5mUyfTU1Nm2RbGGv715Qj1g= Received: by 10.78.190.10 with SMTP id n10mr3746949huf.37.1201294326692; Fri, 25 Jan 2008 12:52:06 -0800 (PST) Received: by 10.78.200.6 with HTTP; Fri, 25 Jan 2008 12:52:06 -0800 (PST) In-Reply-To: Content-Disposition: inline Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On 1/24/08, Linus Torvalds wrote: > You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all > source code projects in UTF-8, without *ever* doing any format changes at > all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie > we don't care about byte-order or size of integers or anything like that), > but that's the common case for most hashes that aren't used for BTree > lookup on disk or something like that. > > Here, let me show you: > > unsigned int name_hash(const char *name, int size) Well, although this is very clever approach, I suggest against it. You'll end up with complex code that gives out substandard results. I think its better to have separate case-folding function (or several), that copies string to temp buffer and then run proper optimized hash function on that buffer. That way you can use already tested building blocks and can optimize both sides separately. Eg. the folding-only function can aswell be optimized to load 4 or 8-byte at-a-time. This also isolates hashing from exact details how folding happens to access the input string which seem to be the weak point in your approach. (In both collision and complexity sense.) Such temp buffer happens to fits my lookup3_memcpy also better (heh). Its weak point is that on platforms that do not allow unaligned access, it degenerates to byte-by-byte loading. But if know you always have aligned buffer, you can notify gcc to do 4-byte fetch there too. It should be as simple as tagging data pointer as uint32_t *. Anyway, now you dont need to worry about folding when picking hash. > Basically, it's almost always a stupid thing to actually normalize a whole > string. You do those things character-by-character, and only lazily when > you actually need to! If your input strings are over kilobyte on average then I'd agree with you, but if you process 20-30 bytes on average, is the additional complexity worth it? -- marko