From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: [PATCH/RFC 0/3] faster inexact rename handling Date: Tue, 30 Oct 2007 00:21:19 -0400 Message-ID: <20071030042118.GA14729@sigill.intra.peff.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linus Torvalds , Andy C , Junio C Hamano To: git@vger.kernel.org X-From: git-owner@vger.kernel.org Tue Oct 30 05:21:37 2007 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1Imibg-0006ff-DS for gcvg-git-2@gmane.org; Tue, 30 Oct 2007 05:21:36 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750920AbXJ3EVX (ORCPT ); Tue, 30 Oct 2007 00:21:23 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751094AbXJ3EVW (ORCPT ); Tue, 30 Oct 2007 00:21:22 -0400 Received: from 66-23-211-5.clients.speedfactory.net ([66.23.211.5]:3976 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750890AbXJ3EVW (ORCPT ); Tue, 30 Oct 2007 00:21:22 -0400 Received: (qmail 29680 invoked by uid 111); 30 Oct 2007 04:21:20 -0000 Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7) by peff.net (qpsmtpd/0.32) with ESMTP; Tue, 30 Oct 2007 00:21:20 -0400 Received: (qmail 14778 invoked by uid 1000); 30 Oct 2007 04:21:19 -0000 Content-Disposition: inline Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: This is my first stab at faster rename handling based on Andy's code. The patches are on top of next (to get Linus' recent work on exact renames). Most of the interesting stuff is in 2/3. 1/3: extension of hash interface 2/3: similarity detection code 3/3: integrate similarity detection into diffcore-rename The implementation is pretty basic, so I think there is room for code optimization (50% of the time is spent in hash lookups, so we might be able to micro-optimize that) as well as algorithmic improvements (like the sampling Andy mentioned). With these patches, I can get my monster binary diff down from about 2 minutes to 17 seconds. And comparing all of linux-2.4 to all of linux-2.6 (similar to Andy's previous demo) takes about 10 seconds. There are a few downsides: - the current implementation tends to give lower similarity values compared to the old code (see discussion in 2/3), but this should be tweakable - on large datasets, it's more memory hungry than the old code because the hash grows very large. This can be helped by bumping up the binary chunk size (actually, the 17 seconds quoted above is using 256-byte chunks rather than 64-byte -- with 64-byte chunks, it's more like 24 seconds) as well as sampling. - no improvement on smaller datasets. Running "git-whatchanged -M --raw -l0" on the linux-2.6 repo takes about the same time with the old and new code (presumably the algorithmic savings of the new code are lost in a higher constant factor, so when n is small, it is a wash). -Peff