From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?UTF-8?B?TWljaGHFgg==?= Kiedrowicz Subject: Re: [PATCH 6/8] gitweb: Highlight interesting parts of diff Date: Mon, 13 Feb 2012 22:09:17 +0100 Message-ID: <20120213220917.4cf14eb1@gmail.com> References: <1328865494-24415-1-git-send-email-michal.kiedrowicz@gmail.com> <20120213074125.3e8793ad@mkiedrowicz.ivo.pl> <201202131944.50886.jnareb@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: git@vger.kernel.org To: Jakub Narebski X-From: git-owner@vger.kernel.org Mon Feb 13 22:09:32 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Rx39M-0004Oj-UE for gcvg-git-2@plane.gmane.org; Mon, 13 Feb 2012 22:09:29 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757834Ab2BMVJY convert rfc822-to-quoted-printable (ORCPT ); Mon, 13 Feb 2012 16:09:24 -0500 Received: from mail-ey0-f174.google.com ([209.85.215.174]:59663 "EHLO mail-ey0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754696Ab2BMVJX convert rfc822-to-8bit (ORCPT ); Mon, 13 Feb 2012 16:09:23 -0500 Received: by eaah12 with SMTP id h12so1921894eaa.19 for ; Mon, 13 Feb 2012 13:09:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:in-reply-to:references:x-mailer :mime-version:content-type:content-transfer-encoding; bh=AfAv7wjkBYooHKzgxwdOoDzVg0hlEdOsK9q9LHXx/WQ=; b=DNyKKne/HYk6ddq+wcSrwsGFewneTcNstzRt6oEHOwz7CxQN3bF3mF+F0XVy9mgdl6 u1Yld+SWQRkO7g+AHcJbNOrS0BVFiZ0UNy604HCqxmVQ3/jbupQOeDjKQSzTHg+DbPEw xQ9HPnWgEGy3bHlzKsIbdB96nMoYwegUf2JYw= Received: by 10.14.22.65 with SMTP id s41mr5677963ees.50.1329167361896; Mon, 13 Feb 2012 13:09:21 -0800 (PST) Received: from localhost (77-177-78-94.net.stream.pl. [94.78.177.77]) by mx.google.com with ESMTPS id o49sm64987607eeb.7.2012.02.13.13.09.21 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 13 Feb 2012 13:09:21 -0800 (PST) In-Reply-To: <201202131944.50886.jnareb@gmail.com> X-Mailer: Claws Mail 3.7.10 (GTK+ 2.24.8; x86_64-pc-linux-gnu) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Jakub Narebski wrote: > On Mon, 13 Feb 2012, Michal Kiedrowicz wrote: > > Jakub Narebski wrote: > > > Micha=C5=82 Kiedrowicz writes: > > I haven't found *examples* on GitHub and Trac sites, but what about > > these ones: > >=20 > > https://github.com/gitster/git/commit/8cad4744ee37ebec1d9491a1381ec= 1771a1ba795 > > http://trac.edgewall.org/changeset/10973 >=20 > Thanks. That is what I meant by "good examples". Perhaps they shoul= d > be put in the commit message? OK >=20 > BTW GitHub is closed source, but we can check what algorithm does Tra= c > use for diff refinement highlighting (highlighting changed portions o= f > diff). >=20 I think it's http://trac.edgewall.org/browser/trunk/trac/versioncontrol/diff.py (see markup intraline_changes()). > > > It doesn't implement LCS / diff > > > algorithm like e.g. tkdiff does for its diff refinement highlight= ing, > > > isn't it? > >=20 > > I doesn't. I share the Jeff's opinion that: > > a) Jeff's approach is "good enough" > > b) LCS on bytes could be very confusing if it marked few sets of > > characters. >=20 > I wonder if we can use --diff-words for diff refinement highlighting, > i.e. LCS on words. I think we can try it, but I worry about performance of running `git diff` on every diff chunk. >=20 > Anyway Jeff's approach is a bit limited, in that it would work only f= or > change that does not involve adding newlines, for example splitting > overly long line when changing something. >=20 > See for example line 1786 (in pre-image) in http://trac.edgewall.org/= changeset/10973 > =20 Yes, I'm aware of that. I was thinking about improving it later ("Let's start with a simple refinment highlightning and maybe later add more sophisticated algorithms"). > > > By completly different you mean that they do not have common pref= ix or > > > common suffix (at least one of them), isn't it? >=20 > BTW. is it "at least one of prefix or suffix are non-empty" or "both = prefix > and suffix are non-empty"? >=20 At least one. See: -a =3D 42; +b =3D 42; Here prefix is empty but suffix is not. > > I would also consider ignoring prefixes/suffixes with punctuation, = like: > >=20 > > - * I like you. > > + * Alice had a little lamb. >=20 > But this patch doesn't implement this feature yet, isn't it? No, but is a matter of adding -$prefix_is_space =3D 0 if ($r[$prefix] !~ /\s/); +$prefix_is_space =3D 0 if ($r[$prefix] !~ /\s|[[:punct:]]/); (and the same for suffix) > Well, here is another idea: do not highlight if sum of prefix and suf= fix > lengths are less than some threshold, e.g. 2 characters not including= =20 > whitespace, or some percentage with respect to total line length. >=20 That might be a good idea. > > > Eeeeeek! First you split into letters, in caller at that, then j= oin? > > > Why not pass striung ($str suggests string not array of character= s), > > > and use substr instead? > > >=20 > > > [Please disregard this and the next paragraph at first reading] > >=20 > > I will rename $str to something more self describing. >=20 > Please do. >=20 > BTW. don't you assume here that both common prefix and common suffix > are non-empty? Yes. The caller makes sure they are correct (at least=20 > > > > +sub format_rem_add_line { > > > > + my ($rem, $add) =3D @_; > > > > + my @r =3D split(//, $rem); > > > > + my @a =3D split(//, $add); >=20 > BTW the name of variable can be just @add and @rem. > I know they are different scopes but I don't like it. It makes the code more confusing IMO. But I won't insist. > > > Shouldn't > > > $prefix / $prefix_len start from 0, not from 1? > >=20 > > It starts from 1 because it skips first +/-. It should become obvio= us > > after reading the comment from last patch :). > >=20 > > + # In combined diff we must ignore two +/- characters. > > + $prefix =3D 2 if ($is_combined); >=20 > Anyway comment about that fact would be nice. Will do. > =20 > > > > + my ($prefix_is_space, $suffix_is_space) =3D (1, 1); > > > > + > > > > + while ($prefix < @r && $prefix < @a) { > > > > + last if ($r[$prefix] ne $a[$prefix]); > > > > + > > > > + $prefix_is_space =3D 0 if ($r[$prefix] !~ /\s/); > > > > + $prefix++; > > > > + } > > >=20 > > > Ah, I see that it is easier to find common prefix by treating str= ing > > > as array of characters. > > >=20 > > > Though I wonder if it wouldn't be easier to use trick of XOR-ing = two > > > strings (see "Bitwise String Operators" in perlop(1)): > > >=20 > > > my $xor =3D "$rem" ^ "$add"; > > >=20 > > > and finding starting sequence of "\0", which denote common prefix= =2E > > >=20 > > >=20 > > > Though this and the following is a nice implementation of > > > algorithm... as it would be implemented in C. Nevermind, it migh= t be > > > good enough... > >=20 > > The splitting and comparing by characters is taken from diff-highli= ght. > > I don't think it's worth changing here. >=20 > You are right. >=20 > I'll try to come with hacky algorithm using string bitwise xor and re= gexp, > and benchmark it comparing to your C-like solution, but it can be lef= t for > later (simple is better than clever, usually). If you have time :). > > > > # HTML-format diff context, removed and added lines. > > > > sub format_ctx_rem_add_lines { > > > > - my ($ctx, $rem, $add) =3D @_; > > > > + my ($ctx, $rem, $add, $is_combined) =3D @_; > > > > my (@new_ctx, @new_rem, @new_add); > > > > + my $num_add_lines =3D @$add; > > >=20 > > > Why is this temporary variable needed? If you are not sure that = =3D=3D > > > operator enforces scalar context on both arguments you can always > > > write > > >=20 > > > scalar @$add =3D=3D scalar @$rem > >=20 > > You read my mind. >=20 > BTW. the same comment applies to patch adding support for highlightin= g > changed part in combined diff. > OK =20