From mboxrd@z Thu Jan 1 00:00:00 1970 From: Linus Torvalds Subject: Re: Horrible re-packing? Date: Mon, 5 Jun 2006 17:35:14 -0700 (PDT) Message-ID: References: <20060605211436.GA58708@dspnet.fr.eu.org> <20060606001802.GB5401@tuatara.stupidest.org> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: Nicolas Pitre , Olivier Galibert , Junio C Hamano , Git Mailing List X-From: git-owner@vger.kernel.org Tue Jun 06 02:35:35 2006 Return-path: Envelope-to: gcvg-git@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by ciao.gmane.org with esmtp (Exim 4.43) id 1FnPXf-0007UZ-1Z for gcvg-git@gmane.org; Tue, 06 Jun 2006 02:35:31 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750812AbWFFAf2 (ORCPT ); Mon, 5 Jun 2006 20:35:28 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1750837AbWFFAf2 (ORCPT ); Mon, 5 Jun 2006 20:35:28 -0400 Received: from smtp.osdl.org ([65.172.181.4]:47054 "EHLO smtp.osdl.org") by vger.kernel.org with ESMTP id S1750812AbWFFAf1 (ORCPT ); Mon, 5 Jun 2006 20:35:27 -0400 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id k560ZG2g021046 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Mon, 5 Jun 2006 17:35:16 -0700 Received: from localhost (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with ESMTP id k560ZEfV030320; Mon, 5 Jun 2006 17:35:15 -0700 To: Chris Wedgwood In-Reply-To: <20060606001802.GB5401@tuatara.stupidest.org> X-Spam-Status: No, hits=0 required=5 tests= X-Spam-Checker-Version: SpamAssassin 2.63-osdl_revision__1.75__ X-MIMEDefang-Filter: osdl$Revision: 1.135 $ X-Scanned-By: MIMEDefang 2.36 Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: On Mon, 5 Jun 2006, Chris Wedgwood wrote: > > On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote: > > > Much more expensive for both memory usage and CPU cycles. > > On a modern CPU I'm not sure you would notice unless the dataset was > insanely large. The dataset really _is_ pretty large. For the kernel, we're talking about a quarter million strings, and it's not shrinking. Other (CVS imported from long histories) are in the several million object range. The real problem, btw, is not the CPU cost of sorting them (hey, O(nlogn) works ok), but the memory use. We have to keep those quarter million names in memory too. At a "pointer + average of ~30 bytes of full pathname allocation + malloc overhead", the strings would basically take about 40 bytes of memory each, and cause constant cache-misses. In contrast, the "hash" value is a 32-bit unsigned, with no pointer overhead. It's not the biggest part to keep track of (we need to obviously keep track of the 20-byte SHA1 and the pointers between objects), but if we had the full string info, it would be quite noticeable overhead, on the order of several tens of megabytes. Now, "tens of megabytes" is not a make-or-break issue any more (you definitely want lots of memory if you want to develop with large trees), but in a very real sense, the memory footprint of an app is often very closely correlated with its performance these days. Trying to keep things within the L2 cache can help a lot, and even if we expect 2MB and 4MB L2's to get more and more common, it means that we should _strive_ to keep datasets down. As it is, we have no _chance_ of staying in the L2 cache on the kernel, but for smaller projects we hopefully can still do so ;) Linus