From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matt Mackall Subject: Re: cleaner/better zlib sources? Date: Fri, 16 Mar 2007 11:42:41 -0500 Message-ID: <20070316164241.GC4892@waste.org> References: <45F9EED5.3070706@garzik.org> <45FAC75B.3030902@garzik.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linus Torvalds , Davide Libenzi , Git Mailing List , bcrl@kvack.org To: Jeff Garzik X-From: git-owner@vger.kernel.org Fri Mar 16 17:55:27 2007 Return-path: Envelope-to: gcvg-git@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1HSFiA-0008JE-UU for gcvg-git@gmane.org; Fri, 16 Mar 2007 17:55:27 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965635AbXCPQzY (ORCPT ); Fri, 16 Mar 2007 12:55:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S965640AbXCPQzY (ORCPT ); Fri, 16 Mar 2007 12:55:24 -0400 Received: from waste.org ([66.93.16.53]:36409 "EHLO waste.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965635AbXCPQzX (ORCPT ); Fri, 16 Mar 2007 12:55:23 -0400 Received: from waste.org (localhost [127.0.0.1]) by waste.org (8.13.4/8.13.4/Debian-3sarge3) with ESMTP id l2GGggnF004667 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Fri, 16 Mar 2007 11:42:42 -0500 Received: (from oxymoron@localhost) by waste.org (8.13.4/8.13.4/Submit) id l2GGgfWO004666; Fri, 16 Mar 2007 11:42:41 -0500 Content-Disposition: inline In-Reply-To: <45FAC75B.3030902@garzik.org> User-Agent: Mutt/1.5.9i X-Virus-Scanned: by amavisd-new Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: On Fri, Mar 16, 2007 at 12:35:39PM -0400, Jeff Garzik wrote: > Linus Torvalds wrote: > >The normal size for the performance-critical git objects are in the couple > >of *hundred* bytes. Not kilobytes, and not megabytes. > > > >The most performance-critical objects for uncompression are commits and > >trees. At least for the kernel, the average size of a tree object is 678 > >bytes. And that's ignoring the fact that most of them are then deltified, > >so about 80% of them are likely just a ~60-byte delta. > > > Ahhh. At least for me, that explains a lot. Rather than spending all > its time in inflate_fast(), git is dealing with lots of zlib > startup/shutdown overhead. > > Although it sounds like zlib could indeed be optimized to reduce its > startup and shutdown overhead, I wonder if switching compression > algorithms to a pure Huffman or even RLE compression (with associated > lower startup/shutdown costs) would perform better in the face of all > those small objects. Mercurial simply stores uncompressed objects below a threshold of 44 bytes, based on benchmarks I did in April 2005. I'd probably up that number if I redid my measurements today. There's just not a whole lot zlib can do at these small sizes. Given that a SHA hash is an uncompressible 20 bytes already, you're well into the domain of diminishing returns. > And another random thought, though it may be useless in this thread: I > bet using a pre-built (compiled into git) static zlib dictionary for git > commit and tree objects might improve things a bit. Ideally, you'd compress all deltas in a chain with the same context. You've got to decompress the delta base to do the delta calculation, so this should allow you to recover the context up to that point. Zlib isn't really set up for this sort of thing though. -- Mathematics is the supreme nostalgia of our time.