From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: [PATCH/RFC 0/3] faster inexact rename handling Date: Tue, 30 Oct 2007 09:43:55 -0400 Message-ID: <20071030134355.GA21342@coredump.intra.peff.net> References: <20071030042118.GA14729@sigill.intra.peff.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: git@vger.kernel.org, Andy C , Junio C Hamano To: Linus Torvalds X-From: git-owner@vger.kernel.org Tue Oct 30 14:44:17 2007 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1ImrO9-0001qT-Ni for gcvg-git-2@gmane.org; Tue, 30 Oct 2007 14:44:14 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752734AbXJ3NoA (ORCPT ); Tue, 30 Oct 2007 09:44:00 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752790AbXJ3Nn7 (ORCPT ); Tue, 30 Oct 2007 09:43:59 -0400 Received: from 66-23-211-5.clients.speedfactory.net ([66.23.211.5]:4667 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752209AbXJ3Nn7 (ORCPT ); Tue, 30 Oct 2007 09:43:59 -0400 Received: (qmail 709 invoked by uid 111); 30 Oct 2007 13:43:57 -0000 Received: from coredump.intra.peff.net (HELO coredump.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.32) with SMTP; Tue, 30 Oct 2007 09:43:57 -0400 Received: by coredump.intra.peff.net (sSMTP sendmail emulation); Tue, 30 Oct 2007 09:43:55 -0400 Content-Disposition: inline In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org On Mon, Oct 29, 2007 at 10:06:11PM -0700, Linus Torvalds wrote: > Have you compared the results? IOW, does it find the *same* renames? >>From my limited testing, it generally finds the same pairs. However, there are a number of renames that it _doesn't_ find, because they are composed of "uninteresting" lines, dropping them below the minimum score. Try (in git.git): git-show --raw -M -l0 :/'Big tool rename' with the old and new code. Pairs like Documentation/git-add-script.txt -> Documentation/git-add.txt are not found, because the file is composed almost entirely of boilerplate. Moving the size normalization into the similarity engine should probably fix that, and will let us compare old and new results more accurately. I'll try to work on that. > I'm a bit worried about the fact that you just pick a single (arbitrary) > src/dst per fingerprint. Yes, it should be limited, but that seems to be a > bit too *extremely* limited. But if it gives the same results in practice, > maybe nobody cares? Yes, I have not convinced myself yet that it's the right approach (but it seemed like a good place to try first, for simplicity and speed). As I noted, this approach seems to be a bit memory hungry on large, so I am a bit concerned about increasing the size of the fingerprint_entry structure. However, Andy's sampling approach might help fix that. The current code also doesn't bother marking overflow, so common lines get attributes to some random file (actually, worse than random: if a bunch of files have the same common lines, _all_ of the lines will go to the last file, which means we subtly favor renames from the end of the input list). So probably it should be tested as-is, with an "overflow, this line is too common to be interesting" bit, and with a small-ish limit (I had at one point tried 5, but the implementation was naive and too memory-hungry). -Peff