Git development

* make -d work in git-repack (without -a)
From: Alex Riesen @ 2006-03-13 22:26 UTC
  To: Junio C Hamano; +Cc: git

Signed-off-by: Alex Riesen <raa.lkml@gmail.com>

---

Junio C Hamano, Thu, Mar 09, 2006 19:50:43 +0100:
> I am inclined to say I prefer Alex' one.

I guess it has to be sent in formally...

 git-repack.sh |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

a594bee1d539f71970e321592f45a114ea648d92
diff --git a/git-repack.sh b/git-repack.sh
index bc90112..2f643b5 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -74,6 +74,8 @@ then
 			esac
 		  done
 		)
+	else
+		git-prune-packed
 	fi
 	git-prune-packed
 fi
-- 
1.2.4.g6ec337


* Re: What should I use instead of git show?
From: Mark Hollomon @ 2006-03-13 23:26 UTC
  To: git
In-Reply-To: <Pine.LNX.4.64.0603130830050.3618@g5.osdl.org>

Linus Torvalds wrote:
> 
> 
> 	git whatchanged -p -1 <sha1>
> 
> instead (actually, if your git is really old, you shouldn't use the modern 
> shorthand of "-1", you should use the longer "--max-count=1" instead).

I must be misunderstanding this:

	git whatchanged -p -1 HEAD

in the current git tree results in nothing. Only when I get to -5 does it show something.

Is this expected?

 > git version
git version 1.2.4.gea75

-- 
Mark Hollomon


* Re: What should I use instead of git show?
From: Junio C Hamano @ 2006-03-13 23:55 UTC
  To: Mark Hollomon; +Cc: git
In-Reply-To: <4415FFB8.3000001@comcast.net>

Mark Hollomon <markhollomon@comcast.net> writes:

> I must be misunderstanding this:
>
> 	git whatchanged -p -1 HEAD
>
> in the current git tree results in nothing. Only when I get to -5 does it show something.
>
> Is this expected?
>
>> git version
> git version 1.2.4.gea75

In this case what matters is not the version of your git but
what that HEAD is.  If it is a merge commit, whatchanged -p does
not show anything by default.


* Re: [PATCH 0/6] http-push updates
From: Nick Hengeveld @ 2006-03-14  0:28 UTC
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vek16udg6.fsf@assigned-by-dhcp.cox.net>

On Sun, Mar 12, 2006 at 09:21:45PM -0800, Junio C Hamano wrote:

> Repository maintenance tasks:
> 
>  - create a new repository
>  - remove an unneeded branch and tag
>  - running repack

In a DAV-only server environment, it seems like there are a few
options for supporting these tasks:

- extend http-push with additional args and/or local config settings.
  This approach would be more efficient wrt packs than separate
  push and repack steps since packs will all need to be created locally
  and then sent; a combined repack/push operation would mean that new
  objects will only be sent once as part of a pack.

- add DAV versions of git-init-db/git-branch/git-repack

- extend git-init-db/git-branch/git-repack to be DAV-aware

I like option #1.

>  - create new branch (and new tag) -- I think you can already do this

Right - you can create locally and then push that branch/tag or
--all/--tags.

>  - (perhaps) running update-server-info

http-push already updates info/refs if it existed before the push 
(perhaps that behavior should also be based on a local config setting.)
I would plan to add support for updating objects/info/packs along with
pack/repack support.  That should be all the server-info there is to
update, right?

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.


* [OT] Re: [PATCH] Use explicit pointers for execl...() sentinels.
From: Jeff King @ 2006-03-14  0:42 UTC
  To: git
In-Reply-To: <200603130412.k2D4CW1b011631@laptop11.inf.utfsm.cl>

On Mon, Mar 13, 2006 at 12:12:31AM -0400, Horst von Brand wrote:

> Very improbable, they'll be the same normally ("void *" is a way of getting
> rid of the overloading of the meaning of "char *" for this before ANSI C).
> Sure, sizeof(int *) might be 4, but I think that is pretty far off.

Let me clarify my position. The STANDARD doesn't guarantee such things.
In PRACTICE, for modern machines you can assume that all pointers are
the same size (and things like all-bits-zero is a null pointer) if it
makes your code cleaner. In other words, I agree with Linus: git should
follow what works in practice, but you should at least recognize that
you're violating the standard.

That being said, you appear to be making the argument that passing a
'foo *' to a variadic function expecting a 'bar *' doesn't violate the
standard. I believe it invokes undefined behavior.

> There are special rules for variadic functions, probably pointers would be
> cast to/from void * in such a case by the compiler.

The rules indicate that arguments matching the '...' follow "default
argument promotion".  See section 6.5.2.2, paragraph 7.  This default
promotion is the same as what would happen if there were no prototype
for the function, and is defined in paragraph 6:
  ...the integer promotions are performed on each argument, and arguments
  that have type float are promoted to double.
I don't see anything about promoting pointers to void.

Furthermore, when accessing the arguments using va_arg, the types must
match or the behavior is undefined, UNLESS (7.15.1.1, para 2):
  - one type is signed and the other is the matching unsigned type
  - one type is a pointer to void and the other is a pointer of
    character type
IOW, the standard does promise that void* and char* pointers are
represented the same, but nothing else.

> > If you remain unconvinced, I can try to find chapter and verse of the
> > standard.
> Please do.

See above.

-Peff


* Re: Fix up diffcore-rename scoring
From: Rutger Nijlunsing @ 2006-03-14  0:49 UTC
  To: Linus Torvalds; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org>

[-- Attachment #1: Type: text/plain, Size: 12313 bytes --]

From: Rutger Nijlunsing <rutger@nospam.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Junio C Hamano <junkio@cox.net>, git@vger.kernel.org
Bcc: 
Subject: Re: Fix up diffcore-rename scoring
Reply-To: git@wingding.demon.nl
In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org>
Organization: M38c

On Mon, Mar 13, 2006 at 07:38:53AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 13 Mar 2006, Junio C Hamano wrote:
> > 
> > By the way, the reason the diffcore-delta code in "next" does
> > not do every-eight-bytes hash on the source material is to
> > somewhat alleviate the problem that comes from not detecting
> > copying of consecutive byte ranges.
> 
> Yes. However, there are better ways to do that in practice.
> 
> The most effective way that is generally used is to not use a fixed 
> chunk-size, but use a terminating character, together with a 
> minimum/maximum chunksize.
> 
> There's a pretty natural terminating character that works well for 
> sources: '\n'.
> 
> So the natural way to do similarity detection when most of the code is 
> line-based is to do the hashing on chunks that follow the rule "minimum of 
> <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".
> 
> So if you don't see any '\n' at all (or the only such one is less than <n> 
> bytes into your current window), do the hash over a <2n>-byte chunk (this 
> takes care of binaries and/or long lines).
> 
> This - for source code - allows you to ignore trivial byte offset things, 
> because you have a character that is used for synchronization. So you 
> don't need to do hashing at every byte in both files - you end up doing 
> the hashing only at line boundaries in practice. And it still _works_ for 
> binary files, although you effectively need bigger identical chunk-sizes 
> to find similarities (for text-files, it finds similarities of size <n>, 
> for binaries the similarities need to effectively be of size 3*n, because 
> you chunk it up at ~2*n, and only generate the hash at certain offsets in 
> the source binary).
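
As an illustration of the chunking rule quoted above, here is a minimal
Ruby sketch. The names newline_chunks and chunk_hashes are invented for
this example, and CRC32 from Ruby's zlib library merely stands in for
whatever hash the real code would use; none of this is taken from git's
diffcore code.

```ruby
require 'zlib'

# Split data into chunks of at most 2*n bytes. A chunk ends at the first
# "\n" that makes it at least n bytes long; if the 2*n-byte window holds no
# such newline (binary data, very long lines), the whole window is one chunk.
# Only the final chunk of the input can be shorter than n bytes.
def newline_chunks(data, n)
  data = data.b                         # work on bytes, not characters
  chunks = []
  pos = 0
  while pos < data.length
    window = data[pos, 2 * n]
    # Search from offset n-1 so that the chunk, newline included, is >= n bytes
    newline_at = window.index("\n", n - 1)
    len = newline_at ? newline_at + 1 : window.length
    chunks << window[0, len]
    pos += len
  end
  chunks
end

# One hash per chunk (CRC32 is an arbitrary choice for this sketch).
def chunk_hashes(data, n)
  newline_chunks(data, n).map { |chunk| Zlib.crc32(chunk) }
end

old_text = "alpha line\nbeta line\ngamma line\n"
new_text = "new first line\n" + old_text
shared = chunk_hashes(old_text, 8) & chunk_hashes(new_text, 8)
puts "#{shared.size} of #{chunk_hashes(old_text, 8).size} old chunk hashes also occur in the new text"
```

Because chunk boundaries follow the newlines, inserting a line at the top
shifts every byte offset but leaves the later chunks, and therefore their
hashes, unchanged.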

This looks like something I did last year as an experiment in the
pre-git times. The idea was to generate a patch-with-renames from two
(large) source trees.

Algorithm:
  - determine md5sum for each file (same idea as git's SHA1 sum)
    if changed since last run
  - only look at md5sums which do not match
  - pool files into types, which might depend on extension and/or MIME type.
    This is an optimisation.
  - Only compare filepairs _within_ one pool.
  - The filepair order in one pool is determined by filename-similarity.
    So pair [include/asm-ppc/ioctl.h, include/asm-powerpc/ioctl.h]
    is inspected before pair
       [include/asm-ppc/ioctl.h, arch/arm/plat-omap/clock.h] .
  - For each file, create a hash from String line -> Integer occurrences.
    Similarities are calculated by comparing two hashes.
  - Keep as a rename-match all files which:
    - have at most 50% new lines;
    - have at most 25% lines deleted from them.
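
The line-counting comparison in the last steps can be sketched as
follows. This is a simplified illustration, not an excerpt from the
attached script: the method names are made up for the example, and the
50% and 25% defaults simply mirror the thresholds listed above.

```ruby
# Map each distinct line of a text to the number of times it occurs.
def line_counts(text)
  counts = Hash.new(0)
  text.each_line { |line| counts[line] += 1 }
  counts
end

# Returns [added_percent, removed_percent]: lines added relative to the size
# of the new text, lines removed relative to the size of the old text.
def added_removed_percent(old_text, new_text)
  old_counts = line_counts(old_text)
  new_counts = line_counts(new_text)
  added = 0
  removed = 0
  (old_counts.keys | new_counts.keys).each do |line|
    delta = new_counts[line] - old_counts[line]
    if delta > 0
      added += delta
    else
      removed -= delta
    end
  end
  old_size = old_counts.values.inject(0) { |sum, c| sum + c }
  new_size = new_counts.values.inject(0) { |sum, c| sum + c }
  [new_size.zero? ? 0 : added * 100 / new_size,
   old_size.zero? ? 0 : removed * 100 / old_size]
end

# Hypothetical helper applying the two thresholds from the list above.
def rename_candidate?(old_text, new_text, max_added = 50, max_removed = 25)
  added, removed = added_removed_percent(old_text, new_text)
  added <= max_added && removed <= max_removed
end

old_text = "a\nb\nc\nd\n"
new_text = "a\nb\nc\nx\ny\n"
p added_removed_percent(old_text, new_text)
p rename_candidate?(old_text, new_text)
p rename_candidate?(new_text, old_text)
```

Note that the measure is not symmetric: swapping the two texts swaps which
file the percentages are taken against, so the same pair can pass in one
direction and fail in the other.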

I ran the code against v2.6.12 and v2.6.14 to be able to compare it
with the current contenders. Hopefully some ideas are harvestable...

Algorithm differences:
  - '\n' is used as boundary, independent of line length.
    This is bad for binary files, and maybe even bad for text files.
    So don't harvest :)
  - don't look at the intersection percentage, but look at two values:
    - percentage of lines added (default: max. 50%)
    - percentage of lines removed (default: max. 25%)
    This assumes files get bigger during development (at most 50%), and
    not too much code is deleted (at most 25%).
    Disadvantages:
      - Two magic numbers instead of one.
      - It's non-symmetrical. Diff A->B will find different renames from
        diff B->A. This scares me, actually.
  - to speed up the detection:
    - don't start comparing files at random. Start comparing files which
      have the same 'names' in them. So when v2.6.12 has a file called
      arch/arm/mach-omap/clock.c, start comparing with files which have
      the most words in common. Currently, '-', '.', '_' and '/' are used
      as word separators.
      Advantage: don't match on the first match just above the
        match-threshold.
    (next heuristics are all optional:)
    - only compare files with the same extension. This splits up all files
      into groups, which makes it much faster.
      In general, there's no reason to compare a .h with a .c file.
    - only compare files with the same MIME type. Same as above, but also
      works for files without extensions (so don't compare README with
      Makefile)
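
A small sketch of the filename-ordering heuristic described above,
assuming the same four separator characters. The method names are
hypothetical, and unlike the attached script this version counts each
shared word only once.

```ruby
# Characters treated as word separators in a path.
WORD_SEPARATORS = %r{[-._/]+}

# Number of distinct words two paths have in common.
def shared_words(path1, path2)
  (path1.split(WORD_SEPARATORS) & path2.split(WORD_SEPARATORS)).size
end

# Order candidate destinations so that the most similar names come first.
def candidates_by_name(removed_path, added_paths)
  added_paths.sort_by { |added| -shared_words(removed_path, added) }
end

removed = "include/asm-ppc/ioctl.h"
added = ["arch/arm/plat-omap/clock.h", "include/asm-powerpc/ioctl.h"]
puts candidates_by_name(removed, added)
```

Trying the best-named candidates first makes it more likely that the first
pair to pass the content thresholds is also the intended rename, rather
than an unrelated file that happens to sit just above the threshold.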

Ok, the result:

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | wc -l
104   <-- That's bad. We're missing some renames here.

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | sort -k 1.10

+ 0% -23% arch/arm/configs/omnimeter_defconfig -> arch/arm/configs/collie_defconfig
+ 5% - 9% arch/arm/mach-omap/board-generic.c -> arch/arm/mach-omap1/board-generic.c
+ 0% - 8% arch/arm/mach-omap/board-h2.c -> arch/arm/mach-omap1/board-h2.c
+ 0% - 5% arch/arm/mach-omap/board-h3.c -> arch/arm/mach-omap1/board-h3.c
+ 0% - 3% arch/arm/mach-omap/board-innovator.c -> arch/arm/mach-omap1/board-innovator.c
+ 0% - 9% arch/arm/mach-omap/board-netstar.c -> arch/arm/mach-omap1/board-netstar.c
+ 9% -10% arch/arm/mach-omap/board-osk.c -> arch/arm/mach-omap1/board-osk.c
+ 0% - 6% arch/arm/mach-omap/board-perseus2.c -> arch/arm/mach-omap1/board-perseus2.c
+ 3% - 8% arch/arm/mach-omap/board-voiceblue.c -> arch/arm/mach-omap1/board-voiceblue.c
+ 7% - 4% arch/arm/mach-omap/clock.c -> arch/arm/plat-omap/clock.c
+ 0% - 0% arch/arm/mach-omap/clock.h -> arch/arm/plat-omap/clock.h
+ 0% - 5% arch/arm/mach-omap/common.h -> include/asm-arm/arch-omap/common.h
+ 2% - 1% arch/arm/mach-omap/dma.c -> arch/arm/plat-omap/dma.c
+ 0% - 1% arch/arm/mach-omap/fpga.c -> arch/arm/mach-omap1/fpga.c
+11% -11% arch/arm/mach-omap/gpio.c -> arch/arm/plat-omap/gpio.c
+ 2% - 2% arch/arm/mach-omap/irq.c -> arch/arm/mach-omap1/irq.c
+ 0% - 4% arch/arm/mach-omap/leds.c -> arch/arm/mach-omap1/leds.c
+ 0% - 0% arch/arm/mach-omap/leds-h2p2-debug.c -> arch/arm/mach-omap1/leds-h2p2-debug.c
+ 0% - 0% arch/arm/mach-omap/leds-innovator.c -> arch/arm/mach-omap1/leds-innovator.c
+ 0% - 4% arch/arm/mach-omap/leds-osk.c -> arch/arm/mach-omap1/leds-osk.c
+ 0% -25% arch/arm/mach-omap/Makefile.boot -> arch/arm/mach-omap1/Makefile.boot
+ 1% - 2% arch/arm/mach-omap/mcbsp.c -> arch/arm/plat-omap/mcbsp.c
+ 0% - 6% arch/arm/mach-omap/mux.c -> arch/arm/plat-omap/mux.c
+ 0% - 0% arch/arm/mach-omap/ocpi.c -> arch/arm/plat-omap/ocpi.c
+ 1% -18% arch/arm/mach-omap/pm.c -> arch/arm/plat-omap/pm.c
+ 0% -11% arch/arm/mach-omap/sleep.S -> arch/arm/plat-omap/sleep.S
+ 6% - 4% arch/arm/mach-omap/time.c -> arch/arm/mach-omap1/time.c
+ 0% - 1% arch/arm/mach-omap/usb.c -> arch/arm/plat-omap/usb.c
+ 2% - 1% arch/ia64/sn/include/pci/pcibr_provider.h -> include/asm-ia64/sn/pcibr_provider.h
+ 0% - 2% arch/ia64/sn/include/pci/pic.h -> include/asm-ia64/sn/pic.h
+ 0% - 0% arch/ia64/sn/include/pci/tiocp.h -> include/asm-ia64/sn/tiocp.h
+ 3% -23% arch/m68knommu/platform/68VZ328/de2/config.c -> arch/m68knommu/platform/68VZ328/config.c
+ 1% -18% arch/mips/configs/osprey_defconfig -> arch/mips/configs/qemu_defconfig
+ 0% -12% arch/mips/vr41xx/zao-capcella/setup.c -> arch/mips/vr41xx/common/type.c
+ 0% - 0% arch/ppc64/oprofile/op_impl.h -> include/asm-ppc64/oprofile_impl.h
+ 3% -23% arch/ppc/configs/ash_defconfig -> arch/ppc64/configs/bpa_defconfig
+ 2% -21% arch/ppc/configs/beech_defconfig -> arch/ppc/configs/ev64360_defconfig
+ 5% -20% arch/ppc/configs/cedar_defconfig -> arch/ppc/configs/mpc8548_cds_defconfig
+ 9% -17% arch/ppc/configs/k2_defconfig -> arch/ppc/configs/bamboo_defconfig
+ 3% -25% arch/ppc/configs/mcpn765_defconfig -> arch/xtensa/configs/common_defconfig
+ 2% -23% arch/ppc/configs/oak_defconfig -> arch/frv/defconfig
+ 3% -16% arch/ppc/configs/SM850_defconfig -> arch/ppc/configs/mpc86x_ads_defconfig
+ 3% -13% arch/ppc/configs/SPD823TS_defconfig -> arch/ppc/configs/mpc885ads_defconfig
+19% -15% arch/um/kernel/tempfile.c -> arch/um/os-Linux/mem.c
+ 0% - 5% arch/x86_64/kernel/semaphore.c -> lib/semaphore-sleepers.c
+ 0% - 6% drivers/i2c/chips/adm1021.c -> drivers/hwmon/adm1021.c
+ 0% - 4% drivers/i2c/chips/adm1025.c -> drivers/hwmon/adm1025.c
+ 0% -17% drivers/i2c/chips/adm1026.c -> drivers/hwmon/adm1026.c
+ 0% - 3% drivers/i2c/chips/adm1031.c -> drivers/hwmon/adm1031.c
+ 0% - 4% drivers/i2c/chips/asb100.c -> drivers/hwmon/asb100.c
+ 1% - 4% drivers/i2c/chips/ds1621.c -> drivers/hwmon/ds1621.c
+ 0% - 1% drivers/i2c/chips/fscher.c -> drivers/hwmon/fscher.c
+ 0% - 2% drivers/i2c/chips/fscpos.c -> drivers/hwmon/fscpos.c
+ 0% - 2% drivers/i2c/chips/gl518sm.c -> drivers/hwmon/gl518sm.c
+ 0% - 2% drivers/i2c/chips/gl520sm.c -> drivers/hwmon/gl520sm.c
+ 3% -19% drivers/i2c/chips/it87.c -> drivers/hwmon/it87.c
+ 4% -22% drivers/i2c/chips/lm63.c -> drivers/hwmon/lm63.c
+ 0% - 6% drivers/i2c/chips/lm75.c -> drivers/hwmon/lm75.c
+ 0% - 2% drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h
+ 0% - 3% drivers/i2c/chips/lm77.c -> drivers/hwmon/lm77.c
+ 2% - 5% drivers/i2c/chips/lm78.c -> drivers/hwmon/lm78.c
+ 0% - 3% drivers/i2c/chips/lm80.c -> drivers/hwmon/lm80.c
+ 2% -21% drivers/i2c/chips/lm83.c -> drivers/hwmon/lm83.c
+ 0% - 3% drivers/i2c/chips/lm85.c -> drivers/hwmon/lm85.c
+ 0% - 4% drivers/i2c/chips/lm87.c -> drivers/hwmon/lm87.c
+ 4% -20% drivers/i2c/chips/lm90.c -> drivers/hwmon/lm90.c
+ 0% - 3% drivers/i2c/chips/lm92.c -> drivers/hwmon/lm92.c
+ 0% - 3% drivers/i2c/chips/max1619.c -> drivers/hwmon/max1619.c
+ 0% - 7% drivers/i2c/chips/sis5595.c -> drivers/hwmon/sis5595.c
+ 0% -11% drivers/i2c/chips/smsc47b397.c -> drivers/hwmon/smsc47b397.c
+ 0% - 9% drivers/i2c/chips/smsc47m1.c -> drivers/hwmon/smsc47m1.c
+ 0% -23% drivers/i2c/chips/via686a.c -> drivers/hwmon/via686a.c
+ 0% - 4% drivers/i2c/chips/w83627hf.c -> drivers/hwmon/w83627hf.c
+ 1% - 5% drivers/i2c/chips/w83781d.c -> drivers/hwmon/w83781d.c
+ 1% - 3% drivers/i2c/chips/w83l785ts.c -> drivers/hwmon/w83l785ts.c
+14% -17% drivers/i2c/i2c-sensor-vid.c -> drivers/hwmon/hwmon-vid.c
+ 0% - 0% drivers/infiniband/include/ib_cache.h -> include/rdma/ib_cache.h
+ 0% - 3% drivers/infiniband/include/ib_fmr_pool.h -> include/rdma/ib_fmr_pool.h
+ 9% - 7% drivers/infiniband/include/ib_mad.h -> include/rdma/ib_mad.h
+ 0% - 0% drivers/infiniband/include/ib_pack.h -> include/rdma/ib_pack.h
+ 1% - 6% drivers/infiniband/include/ib_sa.h -> include/rdma/ib_sa.h
+ 0% -11% drivers/infiniband/include/ib_smi.h -> include/rdma/ib_smi.h
+ 3% - 6% drivers/infiniband/include/ib_user_mad.h -> include/rdma/ib_user_mad.h
+ 4% - 2% drivers/infiniband/include/ib_verbs.h -> include/rdma/ib_verbs.h
+ 0% -16% include/asm-ppc64/ioctl.h -> include/asm-powerpc/ioctl.h
+ 0% - 9% include/asm-ppc64/ioctls.h -> include/asm-powerpc/ioctls.h
+ 5% - 9% include/asm-ppc64/mc146818rtc.h -> include/asm-powerpc/mc146818rtc.h
+ 0% - 5% include/asm-ppc64/mman.h -> include/asm-powerpc/mman.h
+ 2% -25% include/asm-ppc64/sembuf.h -> include/asm-powerpc/sembuf.h
+ 3% -13% include/asm-ppc64/shmbuf.h -> include/asm-powerpc/shmbuf.h
+ 0% -15% include/asm-ppc64/sockios.h -> include/asm-powerpc/sockios.h
+ 1% - 5% include/asm-ppc64/topology.h -> include/asm-powerpc/topology.h
+ 0% -15% include/asm-ppc64/user.h -> include/asm-powerpc/user.h
+ 0% -21% include/asm-ppc/agp.h -> include/asm-powerpc/agp.h
+12% -16% include/asm-ppc/msgbuf.h -> include/asm-xtensa/msgbuf.h
+ 5% -25% include/asm-ppc/namei.h -> include/asm-powerpc/namei.h
+ 4% -18% include/asm-ppc/param.h -> include/asm-powerpc/param.h
+ 0% -13% include/asm-ppc/poll.h -> include/asm-powerpc/poll.h
+ 0% -24% include/asm-ppc/shmbuf.h -> include/asm-xtensa/shmbuf.h
+ 1% -17% include/asm-ppc/socket.h -> include/asm-powerpc/socket.h
+ 0% - 9% include/asm-ppc/string.h -> include/asm-powerpc/string.h
+ 1% -10% include/asm-ppc/termbits.h -> include/asm-powerpc/termbits.h
+ 0% - 3% include/asm-ppc/termios.h -> include/asm-powerpc/termios.h
+ 5% -22% include/asm-ppc/unaligned.h -> include/asm-powerpc/unaligned.h

Regards,
Rutger.

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

[-- Attachment #2: shpatch.rb --]
[-- Type: text/plain, Size: 14758 bytes --]

#!/usr/bin/env ruby

# Usage: shpatch.rb --help

require 'md5'
require 'ostruct'
require 'optparse'

$config = OpenStruct.new
$config.command = :PATCH
$config.same_base = false
$config.same_ext = true
$config.same_mime = false
$config.changed_content = true
$config.max_removed = 25	# 0 .. 100
$config.max_added = 50
$config.verbose = false

# Default dirglobs to ignore
ignore_globs = [
  "BitKeeper", "PENDING", "SCCS", "CVS", "*.state", "*.o", "*.a", "*.so",
  "*~", "#*#", "*.orig", "*.dll"
]

# Option parsing
$opts = OptionParser.new
$opts.banner = %Q{\
Generate a shellpatch file, or perform the patch in a shellpatch file.
A shellpatch file is a patch file which contains shell-commands
including 'mv' and 'patch'.

Determining the renames uses a lot of heuristics and a brute-force
approach; your mileage may vary. All trivial file renames are handled
by comparing the complete contents. All remaining files (the list of
added and removed files) are then searched through to find matching
pairs: this is quite costly.

A cache of md5 sums is kept at the root of the repositories to make
finding differences fast.

(c)2005 R. Nijlunsing <shpatch@tux.tmfweb.nl>
License: GPLv2

Usage: shpatch [options]

Default options are within [brackets].

}
$opts.separator("Diff options")
$opts.on("-d", "--diff PATH1,PATH2", Array,
  "Generate a shellpatch of the diff", "between two directories") {
  |paths|
  if paths.size != 2
    raise Exception.new("Need two directories for --diff")
  end
  $config.command = :DIFF
  $config.paths = paths
}
$opts.separator("Diff options: heuristics for finding renames with changed content")
$opts.on("--[no-]changed-content",
  "Find renames with changed content [#{$config.changed_content}]" ) { |cc|
  $config.changed_content = cc
}
$opts.on("--[no-]same-base",
  "Rename only to files with same basename [#{$config.same_base}]") { |sb|
  $config.same_base = sb
}
$opts.on("--[no-]same-ext",
  "Rename only to same extension [#{$config.same_ext}]") { |se|
  $config.same_ext = se
}
$opts.on("--[no-]same-mime",
	 "Rename only to same mimetype [#{$config.same_mime}]") { |sm|
  $config.same_mime = sm
}
$opts.on("--max-removed PERC", String,
  "Max. percentage of source file which may",
  "be removed while still being considered",
  "a rename [#{$config.max_removed}]"
) { |perc| $config.max_removed = perc.to_i }
$opts.on("--max-added PERC", String,
  "Max. percentage of destination file which may",
  "be added while still being considered",
  "a rename [#{$config.max_added}]"
) { |perc| $config.max_added = perc.to_i }
$opts.separator("Options to add to current patch")
$opts.on("--mv SOURCE DEST", String, String,
  "Adds a rename to the current patch", "and perform the rename") {
  |path1, path2|
  $config.command = :MV
  $config.paths = [path1, path2]
}
$opts.separator("General options")
$opts.on("--[no-]verbose", "-v", "Be more verbose") { |v| $config.verbose = v }
$opts.on("--help", "-h", "This usage") { puts $opts; exit 1 }
%Q{

Examples:
  shpatch.rb --diff linux-2.6.8,linux-2.6.9 --max-removed 10
    Generate a shellpatch with renames from directories
    linux-2.6.8 to linux-2.6.9 . At most 10% of a file may be removed
    between versions, otherwise they are considered different.
}.split("\n").each { |line| $opts.separator(line) }
begin
  $opts.parse!(ARGV)
rescue Exception
  puts "#{$opts}\n!!! #{$!}"
  exit 1
end

module Shell
  # Escape a string so that the shell parses it back to the string itself
  # E.g. Shell.escape("what's in a name") = "what\'s\ in\ a\ name"
  # Compare to Regexp.escape
  def Shell.escape(string)
    string.gsub(%r{([^-._0-9a-zA-Z/])}i, '\\\\\1')
  end
end

# One hunk in the patch
class RenameHunk
  attr_accessor :from, :to	# Strings: pathname from and to

  def initialize(from, to)
#    puts "# Found a rename: #{Shell.escape(from)} -> #{Shell.escape(to)}"
    @from = from; @to = to
  end
  def command; "mv"; end
  def to_s; "#{command} #{Shell.escape(@from)} #{Shell.escape(@to)}"; end
  def execute(repo)
    File.rename("#{repo.root}/#@from", "#{repo.root}/#@to")
  end
end

class DeleteHunk
  attr_accessor :pathname
  def initialize(pathname); @pathname = pathname; end
  def command; "rm"; end
  def to_s; "#{command} #{Shell.escape(@pathname)}"; end
  def execute(repo); File.delete("#{repo.root}/#@pathname"); end
end

class PatchHunk
  attr_accessor :from, :to, :contents
  def initialize(repo1, from, repo2, to)
    # Keep the repositories: to_s needs their roots to build full paths
    @repo1 = repo1; @repo2 = repo2
    @from = from; @to = to
  end
  def command; "patch"; end
  def to_s
    long_from = Shell.escape((from[0] == ?/ ? "" : @repo1.root + "/") + from)
    long_to = Shell.escape((to[0] == ?/ ? "" : @repo2.root + "/") + to)
    puts "# Diffing #{long_from} -> #{long_to}" if $config.verbose
    @contents = File.popen("diff --unified #{long_from} #{long_to}") { |io|
      io.read
    }

    mark = "_SHPATCHMARK_"
    # Make mark unique
    mark += rand(10).to_s while @contents.index(mark)
    "#{command} <<#{mark}\n#{@contents}#{mark}"
  end
end

# A filesystem as backing store
class FileSystem
  SHPATCHSTATE_FILE = ".shpatch.state"
  SHPATCHSTATE_VERSION_STRING = "shpatch.rb state version 20050418-2"

  attr_accessor :root
  attr_accessor :cache_file # String: filename with signatures
  attr_accessor :signature_cache # From Fixnum inode to Array [mtime, sig]
  attr_accessor :signature_cache_changed # Boolean

  # Reads the cache. When not readable in current directory, go
  # up a level ('..')
  def read_signatures
    @signature_cache = {}
    @signature_cache_changed = false
    @cache_file = File.expand_path("#@root/#{SHPATCHSTATE_FILE}")
    cache_file = @cache_file
    loop {
      if FileTest.readable?(cache_file)
	File.open(cache_file, "rb") do |file|
	  version_string = file.readline.chomp
	  if version_string == SHPATCHSTATE_VERSION_STRING
	    begin
	      @signature_cache = Marshal.load(file) 
	      puts "# Read signature cache with #{@signature_cache.size} signatures from #{cache_file.inspect}" if $config.verbose
	      @cache_file = cache_file
	      break
	    rescue ArgumentError, EOFError
	      puts "# (error reading state file: rebuilding file...)" if $config.verbose
	    end
	  end
	end
      end
      parent_cache_file = File.expand_path(
	File.dirname(cache_file) + "/../" + File.basename(cache_file)
      )
      break if parent_cache_file == cache_file
      cache_file = parent_cache_file
    }
  end

  def initialize(root)
    raise "#{root.inspect} does not exist" if not File.exists?(root)
    @root = root
    read_signatures
  end

  def save_signatures
    # Save all unsaved signature cache
    return if !@signature_cache_changed
    puts "# Saving #{@signature_cache.size} signatures..." if $config.verbose
    pf = @cache_file
    File.open("#{pf}.new", "wb+") do |file|
      file.puts SHPATCHSTATE_VERSION_STRING
      Marshal.dump(@signature_cache, file)
      File.rename("#{pf}.new", pf)
    end      
  end

  # Returns array of [mtime, one-line signature-string]
  def signature(stat, filename)
    signature = nil
    key = [stat.dev, stat.ino]
    cache = @signature_cache[key]
    if cache and (cache[0] == stat.mtime)
      signature = cache[1]
    else
      if $config.verbose
	why = (cache ? "#{(stat.mtime - cache[0]).to_i}s out of date" : "not indexed")
	puts "# Creating signature for #{filename.inspect} (#{why})" 
      end
      signature = MD5.new(File.read(filename)).digest
      @signature_cache[key] = [stat.mtime, signature]
      @signature_cache_changed = true
    end
    signature
  end

  def signature_from(prefix, res, from, ignoreRe)
    Dir.new("#{prefix}#{from}").entries.each { |elem|
      next if (elem == ".") or (elem == "..")
      fullname = "#{prefix}#{from}/#{elem}"
      if not fullname =~ ignoreRe
	stat = File.stat(fullname)
	if stat.directory?
	  signature_from(prefix, res, "#{from}/#{elem}", ignoreRe) 
	else
	  rel_filename = "#{from}/#{elem}"[1..-1]
	  res[rel_filename] = signature(stat, fullname)
	end
      end
    }
  end

  # Returns all filenames within this filesystem with all signatures
  def signatures(ignoreRe)
    res = {}
    prefix = File.expand_path(@root)
    signature_from(prefix, res, "", ignoreRe)
    save_signatures
    res
  end

  def mime_type(filename)
    path = @root + "/" + filename
    ($mime_cache ||= {})[path] ||=
      File.popen("file --mime #{Shell.escape(path)}") { |io| io.read }.
      gsub(%r{^.*:}, "").strip
  end

  # Read the contents of a file
  def read(filename); File.read(@root + "/" + filename); end
end

patch = []

dir1, dir2 = $config.paths
repo1 = FileSystem.new(dir1)
repo2 = FileSystem.new(dir2)

def re_from_globs(globs)
  Regexp.new(
    "(\\A|/)(" + globs.collect { |glob| 
       Regexp.escape(glob).gsub("\\*", "[^/]*")
    }.join("|") + ")$"
  )
end

ignore_globs += ["BitKeeper/etc/ignore", ".cvsignore"].collect { |a|
  ["#{dir1}/#{a}", "#{dir2}/#{a}"]
}.flatten.find_all { |f| File.exists?(f) }.collect { |f|
  File.readlines(f).collect { |line| line.chomp }
}.flatten
ignore_globs = ignore_globs.uniq.sort
ignoreRe = re_from_globs(ignore_globs)

puts "# Retrieving signatures of #{dir1.inspect}" if $config.verbose
file2sig1 = repo1.signatures(ignoreRe)
puts "# Retrieving signatures of #{dir2.inspect}" if $config.verbose
file2sig2 = repo2.signatures(ignoreRe)
files1 = file2sig1.keys.sort
files2 = file2sig2.keys.sort
common_files = files1 - (files1 - files2)

# Different hash, same filename: patch
common_files.each { |fname|
  if file2sig1[fname] != file2sig2[fname]
    patch << PatchHunk.new(repo1, fname, repo2, fname)
  end
  file2sig1.delete(fname)
  file2sig2.delete(fname)
}

# Same hash, different filename: rename
sig2file1 = file2sig1.invert
sig2file2 = file2sig2.invert
sigs1 = sig2file1.keys
sigs2 = sig2file2.keys
common_sigs = sigs1 - (sigs1 - sigs2)
common_sigs.each { |sig|
  from = sig2file1[sig]
  to = sig2file2[sig]
  patch << RenameHunk.new(from, to)
  sig2file1.delete(sig)
  sig2file2.delete(sig)
  file2sig1.delete(from)
  file2sig2.delete(to)
}

# statistics of contents of a file. Used for quick-compare
class FileContentStats
  attr_accessor :size		# Size of file in lines
  attr_accessor :lines		# Hash from String to Fixnum

  # Count the number of lines removed and added as a percentage
  # of the total file length. These are a measure for the degree
  # of matching between the files.
  def diff_match(other)
    added = 0
    removed = 0
    @lines.each_pair { |line, count|
      delta = other.lines[line] - count
      if delta > 0
	added += delta
      else
	removed += -delta
      end
    }
    other.lines.each_pair { |line, count|
      # @lines defaults to 0 (which is truthy), so test for the key itself
      added += count if not @lines.has_key?(line)
    }
    [added * 100 / other.size, removed * 100 / self.size]
  end

  def initialize(repo, path)
    @lines = Hash.new(0)
    size = 0
    repo.read(path).delete("\0").each_line { |line|
      @lines[line.intern] += 1
      size += 1
    }
    @size = size
  end

  def self.cached(repo, path)
    @@cache ||= {}
    @@cache[[repo, path]] ||= self.new(repo, path)
  end
end

# Categorize a file based on filename and/or contents
def pool_type(repo, path)
  res = []
  res << File.basename(path) if $config.same_base
  res << File.extname(path) if $config.same_ext
  res << repo.mime_type(path) if $config.same_mime
  res
end

# Determine how much a filename looks like another filename
# by splitting the filenames into words. Then count the
# words which are the same.
def path_correlation(path1, path2)
  comp1 = path1.split(%r{[-._/]})
  comp2 = path2.split(%r{[-._/]})
  (comp1 - (comp1 - comp2)).size
end

class Array
  # The inverse of an array is a hash from contents to index number.
  def inverse; res = {}; each_with_index { |e, idx| res[e] = idx }; res; end
end

if $config.changed_content
  files1 = file2sig1.keys.sort
  files2 = file2sig2.keys.sort
  all_added_files = files2 - files1
  all_removed_files = files1 - files2

  pools = {}			# Group files into 'pools'
  all_removed_files.each { |removed_file|
    (pools[pool_type(repo1, removed_file)] ||= [[], []])[0] << removed_file
  }
  all_added_files.each { |added_file|
    (pools[pool_type(repo2, added_file)] ||= [[], []])[1] << added_file
  }

  pools.each_pair { |key, pool|
    removed_files, added_files = *pool
    if $config.verbose and not removed_files.empty? and not added_files.empty?
      puts "# Comparing pool type #{key.inspect} with #{pool[0].size}x#{pool[1].size} filepairs" 
    end

    # Determine how 'special' or 'specific' a word is. We start with
    # filenames containing special words.
    words = {}			# Group files by 'words'
    removed_files.each { |removed_file|
      removed_file.split(%r{[-._/]+}).uniq.each { |word|
	words[word] ||= [[], []]
	words[word][0] << removed_file
      }
    }
    added_files.each { |added_file|
      added_file.split(%r{[-._/]+}).uniq.each { |word|
	words[word] ||= [[], []]
	words[word][1] << added_file
      }
    }
    word_importance = words.keys.find_all { |word|
      (words[word][0].size * words[word][1].size) > 0
    }.sort_by { |word|
      words[word][0].size * words[word][1].size
    }.reverse
#    p word_importance
    word_importance = word_importance.inverse
    word_importance.default = 0

    removed_files.sort_by { |removed_file|
      removed_file.split(%r{[-._/]+}).uniq.inject(0) { |s, e|
	[s, word_importance[e]].max
      }
    }.reverse.each { |removed_file|
#      puts removed_file
      removed_file_stats = FileContentStats.new(repo1, removed_file)
      added_files.sort_by { |f| -path_correlation(removed_file, f) }.
        each { |added_file|
	added_file_stats = FileContentStats.cached(repo2, added_file)
	removed_size = removed_file_stats.size
	added_size = added_file_stats.size
	min_added = (added_size - removed_size) * 100 / added_size
	next if min_added > $config.max_added
	min_removed = (removed_size - added_size) * 100 / removed_size
	next if min_removed > $config.max_removed
	
	# Calculate added & removed percentages
	added, removed = removed_file_stats.diff_match(added_file_stats)
	if (added <= $config.max_added) && (removed <= $config.max_removed)
	  # We found a rename-match!
	  puts "+%2i%% -%2i%% #{removed_file} -> #{added_file}" % [added, removed] #if $config.verbose
	  patch << RenameHunk.new(removed_file, added_file)
	  # Don't match again against this added file:
	  added_files -= [added_file]
	  all_added_files -= [added_file]
	  all_removed_files -= [removed_file]
	  patch << PatchHunk.new(repo1, removed_file, repo2, added_file)
	  break
	end
      }
    }
  }
end

all_added_files.each { |added_file|
  patch << PatchHunk.new(repo1, "/dev/null", repo2, added_file)
}
all_removed_files.each { |removed_file|
  patch << PatchHunk.new(repo1, removed_file, repo2, "/dev/null")
}

#patch.each { |hunk| puts hunk.to_s }

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Junio C Hamano @ 2006-03-14  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> There's a pretty natural terminating character that works well for 
> sources: '\n'.

Good to know that great minds think alike ;-).  There is a
version that did this line-oriented hashing, buried in the next
branch.  I'll see how well it performs within the context of the
current somewhat restructured code.

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Junio C Hamano @ 2006-03-14  2:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git
In-Reply-To: <7vwtezt202.fsf@assigned-by-dhcp.cox.net>

While we are hacking away with weird ideas...

This is still a WIP, but an insanely fast one (actually, it is
a modified version of what once was in next).  I haven't fully
verified the sanity of its output, but from a cursory look
what it finds looks sensible.  The same v2.6.12..v2.6.14 test on
my Duron 750:

        master  64.65user 0.17system 1:05.42elapsed
                0inputs+0outputs (0major+12511minor)

        next    40.69user 0.14system 0:40.98elapsed
                0inputs+0outputs (0major+19471minor)

        "this"  5.59user 0.09system 0:05.68elapsed
                0inputs+0outputs (0major+13015minor)

The hash used here is heavily optimized for handling text files
and nothing else.  In fact, it punts on any file that contains a
NUL byte.  The hash is computed by first skipping sequences of
whitespace characters (including LF); upon seeing a non-whitespace
character, we start hashing, still ignoring whitespace, until we
hit the next LF (or EOF).  Then we store the real number of bytes
along with the hash.

When we find the matching hash value in the destination, we say
that many bytes (including the whitespace we ignored while
hashing) were copied.
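
As a rough illustration of the scheme described above (not the patch
itself), here is a minimal Python sketch.  The function names mirror or
extend the C names for readability.  The 32-bit mask and the dict that
sums bytes per hash are assumptions made for the sketch; the patch keeps
a count and an averaged byte length per hash entry instead.

```python
MASK32 = 0xFFFFFFFF  # assumption: emulate a 32-bit unsigned long


def hash_extended_line(buf, pos):
    """Hash one extended line starting at pos.

    Returns (hashval, next_pos), or None when a NUL byte is seen
    (the input is treated as binary and we give up).
    """
    n = len(buf)
    hashval = 0
    # Skip leading whitespace (including LF) up to the first other byte.
    while pos < n:
        c = buf[pos]
        pos += 1
        if c == 0:
            return None
        if c > 0x20:
            hashval = c
            break
    # Hash the rest of the line, still ignoring whitespace.
    while pos < n:
        c = buf[pos]
        pos += 1
        if c == 0:
            return None
        if c == 0x0A:
            break
        if c > 0x20:
            hashval = (hashval * 11 + c) & MASK32
    return hashval, pos


def hash_lines(buf):
    """Map each extended-line hash to the total number of bytes it covers."""
    table = {}
    pos = 0
    while pos < len(buf):
        result = hash_extended_line(buf, pos)
        if result is None:
            return None
        hashval, end = result
        table[hashval] = table.get(hashval, 0) + (end - pos)
        pos = end
    return table


def copied_bytes(src, dst):
    """Estimate bytes copied from src to dst; None if either is binary."""
    s, d = hash_lines(src), hash_lines(dst)
    if s is None or d is None:
        return None
    return sum(min(nbytes, d.get(h, 0)) for h, nbytes in s.items())
```

With this sketch, re-indenting a line or inserting a blank line before it
leaves its hash unchanged, so the line still counts as copied.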

The patch should apply on top of the current "next".

---

diff --git a/diffcore-delta.c b/diffcore-delta.c
index 835d82c..0f4866e 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -3,25 +3,10 @@
 #include "diffcore.h"
 
 /*
- * Idea here is very simple.
- *
- * We have total of (sz-N+1) N-byte overlapping sequences in buf whose
- * size is sz.  If the same N-byte sequence appears in both source and
- * destination, we say the byte that starts that sequence is shared
- * between them (i.e. copied from source to destination).
- *
- * For each possible N-byte sequence, if the source buffer has more
- * instances of it than the destination buffer, that means the
- * difference are the number of bytes not copied from source to
- * destination.  If the counts are the same, everything was copied
- * from source to destination.  If the destination has more,
- * everything was copied, and destination added more.
- *
- * We are doing an approximation so we do not really have to waste
- * memory by actually storing the sequence.  We just hash them into
- * somewhere around 2^16 hashbuckets and count the occurrences.
- *
- * The length of the sequence is arbitrarily set to 8 for now.
+ * Record the hashes for "extended lines" in both source and destination,
+ * and compare how similar they are.  "Extended lines" hash is designed
+ * to work well on text files -- leading whitespaces and tabs, and consecutive
+ * LF characters are effectively ignored.
  */
 
 /* Wild guess at the initial hash size */
@@ -40,8 +25,9 @@
 #define HASHBASE 107927
 
 struct spanhash {
-	unsigned long hashval;
-	unsigned long cnt;
+	unsigned long hashval; /* hash for the line */
+	unsigned bytes; /* real number of bytes in such a line */
+	unsigned long cnt; /* occurrences */
 };
 struct spanhash_top {
 	int alloc_log2;
@@ -87,6 +73,7 @@ static struct spanhash_top *spanhash_reh
 			if (!h->cnt) {
 				h->hashval = o->hashval;
 				h->cnt = o->cnt;
+				h->bytes = o->bytes;
 				new->free--;
 				break;
 			}
@@ -99,7 +86,8 @@ static struct spanhash_top *spanhash_reh
 }
 
 static struct spanhash_top *add_spanhash(struct spanhash_top *top,
-					 unsigned long hashval)
+					 unsigned long hashval,
+					 unsigned bytes)
 {
 	int bucket, lim;
 	struct spanhash *h;
@@ -110,6 +98,7 @@ static struct spanhash_top *add_spanhash
 		h = &(top->data[bucket++]);
 		if (!h->cnt) {
 			h->hashval = hashval;
+			h->bytes = bytes;
 			h->cnt = 1;
 			top->free--;
 			if (top->free < 0)
@@ -117,6 +106,14 @@ static struct spanhash_top *add_spanhash
 			return top;
 		}
 		if (h->hashval == hashval) {
+			if (h->bytes != bytes) {
+				/* avg of h->cnt instances were h->bytes
+				 * now we are adding bytes
+				 */
+				h->bytes = ((h->cnt / 2 + bytes +
+					     h->cnt * h->bytes) /
+					    (h->cnt + 1));
+			}
 			h->cnt++;
 			return top;
 		}
@@ -125,11 +122,49 @@ static struct spanhash_top *add_spanhash
 	}
 }
 
-static struct spanhash_top *hash_chars(unsigned char *buf, unsigned long sz)
+static unsigned long hash_extended_line(const unsigned char **buf_p,
+					unsigned long left)
+{
+	/* An extended line is zero or more whitespace letters (including LF)
+	 * followed by one non whitespace letter followed by zero or more
+	 * non LF, and terminated by a LF (or EOF).
+	 */
+	const unsigned char *bol = *buf_p;
+	const unsigned char *buf = bol;
+	unsigned long hashval = 0;
+	while (left) {
+		unsigned c = *buf++;
+		if (!c)
+			goto binary;
+		left--;
+		if (' ' < c) {
+			hashval = c;
+			break;
+		}
+	}
+	while (left) {
+		unsigned c = *buf++;
+		if (!c)
+			goto binary;
+		left--;
+		if (c == '\n')
+			break;
+		if (' ' < c)
+			hashval = hashval * 11 + c;
+	}
+	*buf_p = buf;
+	return hashval;
+
+ binary:
+	*buf_p = NULL;
+	return 0;
+}
+
+static struct spanhash_top *hash_lines(const unsigned char *buf, unsigned long sz)
 {
 	int i;
-	unsigned long accum1, accum2, hashval;
 	struct spanhash_top *hash;
+	const unsigned char *eobuf = buf + sz;
 
 	i = INITIAL_HASH_SIZE;
 	hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i));
@@ -137,19 +172,14 @@ static struct spanhash_top *hash_chars(u
 	hash->free = INITIAL_FREE(i);
 	memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));
 
-	/* an 8-byte shift register made of accum1 and accum2.  New
-	 * bytes come at LSB of accum2, and shifted up to accum1
-	 */
-	for (i = accum1 = accum2 = 0; i < 7; i++, sz--) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-	}
-	while (sz) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
-		hash = add_spanhash(hash, hashval);
-		sz--;
+	while (buf < eobuf) {
+		const unsigned char *ptr = buf;
+		unsigned long hashval = hash_extended_line(&buf, eobuf-ptr);
+		if (!buf) {
+			free(hash);
+			return NULL;
+		}
+		hash = add_spanhash(hash, hashval, buf-ptr);
 	}
 	return hash;
 }
@@ -166,21 +196,18 @@ int diffcore_count_changes(void *src, un
 	struct spanhash_top *src_count, *dst_count;
 	unsigned long sc, la;
 
-	if (src_size < 8 || dst_size < 8)
-		return -1;
-
 	src_count = dst_count = NULL;
 	if (src_count_p)
 		src_count = *src_count_p;
 	if (!src_count) {
-		src_count = hash_chars(src, src_size);
+		src_count = hash_lines(src, src_size);
 		if (src_count_p)
 			*src_count_p = src_count;
 	}
 	if (dst_count_p)
 		dst_count = *dst_count_p;
 	if (!dst_count) {
-		dst_count = hash_chars(dst, dst_size);
+		dst_count = hash_lines(dst, dst_size);
 		if (dst_count_p)
 			*dst_count_p = dst_count;
 	}
@@ -193,9 +220,9 @@ int diffcore_count_changes(void *src, un
 		unsigned dst_cnt, src_cnt;
 		if (!s->cnt)
 			continue;
-		src_cnt = s->cnt;
 		d = spanhash_find(dst_count, s->hashval);
-		dst_cnt = d ? d->cnt : 0;
+		src_cnt = s->cnt * s->bytes;
+		dst_cnt = d ? (d->cnt * d->bytes) : 0;
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;

^ permalink raw reply related

* Re: What should I use instead of git show?
From: Junio C Hamano @ 2006-03-14  3:09 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: git
In-Reply-To: <20060313144747.GA81092@dspnet.fr.eu.org>

Olivier Galibert <galibert@pobox.com> writes:

> Up until now, when I wanted to send a patch to someone with the
> associated changelog, I just did a git log to find the changelog sha1
> then a git show to get the goods.  How am I supposed to do that now?

"git show" is fine and it is still there, but there is a command
designed specifically for that purpose: format-patch.

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-14  3:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <7vveuhohve.fsf@assigned-by-dhcp.cox.net>



On Mon, 13 Mar 2006, Junio C Hamano wrote:
> 
> This is still a WIP, but an insanely fast one (actually, it is
> a modified version of what once was in next).  I haven't fully
> verified the sanity of its output, but from a cursory look
> what it finds looks sensible.  The same v2.6.12..v2.6.14 test on
> my Duron 750:

Heh. I did something similar, except I wanted mine to work with binary 
data too. Not that I know how _well_ it works, but assuming you have 
_some_ '\n' characters to fix up offset mismatches, it might do something.

Mine is a bit less hacky than yours, I believe. It doesn't skip 
whitespace, instead it just maintains a rolling 64-bit number, where each 
character shifts it left by 7 and then adds in the new character value 
(overflow in 32 bits just ignored).

Then it uses your old hash function, except it hides the length in the top 
byte.

It breaks the hashing on '\n' or on hitting a 64-byte sequence, whichever 
comes first.

It's fast and stupid, but doesn't seem to do any worse than your old one. 
The speed comes from the fact that it only does the hash comparisons at 
the "block boundaries", not at every byte.
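
A minimal Python sketch of the block-boundary idea described above,
under stated assumptions: `block_hashes` is a hypothetical name, the
accumulator is a plain 64-bit integer split into two 32-bit halves, and
the mixing is simplified, so it is not a byte-for-byte match for the
patch below.  Only the boundary rule (LF or 64 bytes), the HASHBASE
modulus and the length stored in the top byte follow the description.

```python
HASHBASE = 107927            # same modulus as the existing spanhash code
MASK64 = (1 << 64) - 1


def block_hashes(buf, max_block=64):
    """Return one key per block; a block ends at LF or after max_block bytes.

    A trailing block that reaches neither boundary is not recorded.
    """
    keys = []
    acc = 0  # rolling 64-bit accumulator
    n = 0    # number of bytes in the current block
    for c in buf:
        acc = ((acc << 7) + c) & MASK64
        n += 1
        if n < max_block and c != 0x0A:
            continue
        hi, lo = acc >> 32, acc & 0xFFFFFFFF
        hashval = (hi + lo * 0x61) % HASHBASE
        keys.append(hashval | (n << 24))  # block length goes in the top byte
        acc = 0
        n = 0
    return keys
```

Because the hash is only computed at block boundaries, the work per byte
is a shift and an add, which is where the speed-up would come from.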

Anyway, I don't think something like this is really any good for rename 
detection, but it might be good for deciding whether to do a real delta.

		Linus

----
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 835d82c..4c6e512 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -127,7 +127,7 @@ static struct spanhash_top *add_spanhash
 
 static struct spanhash_top *hash_chars(unsigned char *buf, unsigned long sz)
 {
-	int i;
+	int i, n;
 	unsigned long accum1, accum2, hashval;
 	struct spanhash_top *hash;
 
@@ -137,19 +137,21 @@ static struct spanhash_top *hash_chars(u
 	hash->free = INITIAL_FREE(i);
 	memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));
 
-	/* an 8-byte shift register made of accum1 and accum2.  New
-	 * bytes come at LSB of accum2, and shifted up to accum1
-	 */
-	for (i = accum1 = accum2 = 0; i < 7; i++, sz--) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-	}
+	n = 0;
+	accum1 = accum2 = 0;
 	while (sz) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
+		unsigned long c = *buf++;
+		sz--;
+		accum1 = (accum1 << 7) | (accum2 >> 25);
+		accum2 = (accum2 << 7) | (accum1 >> 25);
+		accum1 += c;
+		if (++n < 64 && c != '\n')
+			continue;
 		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
+		hashval |= (n << 24);
 		hash = add_spanhash(hash, hashval);
-		sz--;
+		n = 0;
+		accum1 = accum2 = 0;
 	}
 	return hash;
 }
@@ -166,9 +168,6 @@ int diffcore_count_changes(void *src, un
 	struct spanhash_top *src_count, *dst_count;
 	unsigned long sc, la;
 
-	if (src_size < 8 || dst_size < 8)
-		return -1;
-
 	src_count = dst_count = NULL;
 	if (src_count_p)
 		src_count = *src_count_p;
@@ -196,6 +195,8 @@ int diffcore_count_changes(void *src, un
 		src_cnt = s->cnt;
 		d = spanhash_find(dst_count, s->hashval);
 		dst_cnt = d ? d->cnt : 0;
+		dst_cnt *= (s->hashval >> 24);
+		src_cnt *= (s->hashval >> 24);
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;

^ permalink raw reply related

* Re: git-diff-tree -M performance regression in 'next'
From: Junio C Hamano @ 2006-03-14 10:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603131941260.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Mine is a bit less hacky than yours, I believe. It doesn't skip 
> whitespace, instead it just maintains a rolling 64-bit number, where each 
> character shifts it left by 7 and then adds in the new character value 
> (overflow in 32 bits just ignored).

That rolling register is a good idea.  The "whitespace hack" was
done to recognize certain kinds of changes that commonly appear
in source code.  For example, it will still recognize content
copies after you re-indent your code, or add an "if (...) {" and
"} else { ... }" around an existing code block, or add extra
blank lines.

It is still an inadequate hack.  If you comment out a code block
by adding "#if 0" and "#endif" around it, it notices the
surviving lines, but if instead you comment out a block by
prefixing "//" in front of every line in the block, neither your
64-byte-or-EOL nor my extended line algorithm would notice the
content copy anymore.
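
To make the two cases concrete, here is a small Python sketch.
`line_key` is a hypothetical helper that hashes one line while skipping
whitespace, in the spirit of the extended-line hash; the multiplier 11
comes from the earlier patch, and the 32-bit mask is an assumption.

```python
def line_key(line):
    """Hash a single line, ignoring spaces, tabs and other bytes <= 0x20."""
    h = 0
    for c in line:
        if c > 0x20:
            h = (h * 11 + c) & 0xFFFFFFFF
    return h


original = b"return 0;"
reindented = b"\t\treturn 0;"
commented = b"// return 0;"

# Re-indentation is invisible to the hash ...
same_after_reindent = line_key(reindented) == line_key(original)
# ... but a "//" prefix changes every line's hash.
same_after_comment = line_key(commented) == line_key(original)
```

Here `same_after_reindent` is true and `same_after_comment` is false,
which is the limitation described above.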

Anyway, I did a bit of comparison and it appears that the
whitespace thing does not make much difference in practice.

> It's fast and stupid, but doesn't seem to do any worse than your old one. 

Comparing "next" with your 64-byte-or-EOL and "extended
line" on the v2.6.12..v2.6.14 test case shows:

                                64-or-EOL   extended line
renames identically detected    108         110
matched differently             2           2
finds what "next" misses        4           4
misses what "next" finds        23          21

What they find seems reasonable.  What they reject is sometimes
debatable.  For example, the similarity between these two files
is not noticed by either.

        v2.6.12/drivers/media/dvb/dibusb/dvb-dibusb-firmware.c
        v2.6.14/drivers/media/dvb/dvb-usb/dvb-usb-firmware.c

The "next" algorithm gives a 60% score while these two give 45%
or so to this pair.

But they both reject this bogus "rename" that the "next" algorithm
finds:

	v2.6.12/drivers/char/drm/gamma_drv.c
	v2.6.14/drivers/char/drm/via_verifier.h

("next" 51% vs 37-40% with these algorithms). 

> Anyway, I don't think something like this is really any good for rename 
> detection, but it might be good for deciding whether to do a real delta.

Both algorithms seem to have non-negligible false negative
rates, but their false positive rates are reasonably low.  So we
could use these as a pre-filter and use real delta on pairs that
these quick and dirty algorithms say are too different.
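
A minimal sketch of that pre-filter arrangement in Python.  Everything
here is hypothetical: `quick_score` and `exact_score` stand in for the
fast hash-based estimate and the real delta, and the 0.5 cut-off is an
assumed threshold, not a value taken from git.

```python
def estimate_similarity(src, dst, quick_score, exact_score, minimum=0.5):
    """Trust the quick estimate when it says "similar"; otherwise fall back.

    Returns (score, which) where which is "quick" or "exact".
    """
    q = quick_score(src, dst)
    if q >= minimum:
        # Low false-positive rate: a high quick score is taken at face value.
        return q, "quick"
    # Possible false negative: pay for the real delta on this pair only.
    return exact_score(src, dst), "exact"
```

The expensive computation then runs only for pairs the quick algorithm
rejects, such as the dvb-usb-firmware.c pair mentioned above.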

^ permalink raw reply

* Re: What should I use instead of git show?
From: Mark Hollomon @ 2006-03-14 11:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vmzftq4r4.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:
> Mark Hollomon <markhollomon@comcast.net> writes:
> 
>> I must be misunderstanding this:
>>
>> 	git whatchanged -p -1 HEAD
>>
>> in the current git tree results in nothing. only when I get to -5 does it show something.
>>
>> Is this expected?
>>
>>> git version
>> git version 1.2.4.gea75
> 
> In this case what matters is not the version of your git but
> what that HEAD is.  If it is a merge commit, whatchanged -p does
> not show anything by default.

Oh, I see. As a pass through to git-rev-list that makes sense. --max-count is really 
--max-commits-to-consider (or something like that).
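
That reading can be modelled with a toy Python sketch.  The history, the
commit ids and the `whatchanged` function are all made up for
illustration; this is only a model of "limit first, then skip merges",
not how git is implemented.

```python
def whatchanged(history, max_count):
    """Take the first max_count commits, then drop the merges among them."""
    considered = history[:max_count]  # what the revision walker hands over
    return [c["id"] for c in considered if not c["is_merge"]]


# Hypothetical history, newest first: four merges, then a regular commit.
history = [
    {"id": "a", "is_merge": True},
    {"id": "b", "is_merge": True},
    {"id": "c", "is_merge": True},
    {"id": "d", "is_merge": True},
    {"id": "e", "is_merge": False},
]
```

With this history, a limit of 1 yields nothing and a limit of 5 yields
one commit, which matches the behaviour reported above.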

Is there a --max-commits-to-show?

-- 
Mark Hollomon

^ permalink raw reply

* separate commits for objects already updated in index?
From: Paul Jakma @ 2006-03-14 16:37 UTC (permalink / raw)
  To: git list

Hi,

Dumb question, imagine you made changes to a few files, and ran 
update-index at various stages in between:

$ git status
#
# Updated but not checked in:
#   (will commit)
#
#       modified: foo/ChangeLog
#       modified: foo/whatever
#       modified: bar/ChangeLog
#       modified: bar/other

The changes in bar/ are unrelated to the changes in foo/ - how do you 
commit each separately? Git doesn't seem to want to let me:

   $ git commit -o bar
   Different in index and the last commit:
   M       bar/ChangeLog
   M       bar/other
   You might have meant to say 'git commit -i paths...', perhaps?

git commit on its own wants to commit all the above files.

what's the silly thing I've missed?

Thanks.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Never tell a lie unless it is absolutely convenient.

^ permalink raw reply

* Re: separate commits for objects already updated in index?
From: Linus Torvalds @ 2006-03-14 17:00 UTC (permalink / raw)
  To: Paul Jakma; +Cc: git list
In-Reply-To: <Pine.LNX.4.64.0603141634010.5276@sheen.jakma.org>

On Tue, 14 Mar 2006, Paul Jakma wrote:

> Hi,
> 
> Dumb question, imagine you made changes to a few files, and ran update-index
> at various stages in between:
> 
> $ git status
> #
> # Updated but not checked in:
> #   (will commit)
> #
> #       modified: foo/ChangeLog
> #       modified: foo/whatever
> #       modified: bar/ChangeLog
> #       modified: bar/other
> 
> The changes in bar/ are unrelated to the changes in foo/ - how do you commit
> each separately? Git doesn't seem to want to let me:
> 
>   $ git commit -o bar
>   Different in index and the last commit:
>   M       bar/ChangeLog
>   M       bar/other
>   You might have meant to say 'git commit -i paths...', perhaps?
> 
> git commit on its own wants to commit all the above files.
> 
> what's the silly thing I've missed?

You've already marked them all modified in the index (using 
git-update-index), so git commit thinks you are confused by naming them 
again and saying "only".

The simplest thing to do is to do

	git reset

to reset your index back to your HEAD (but obviously DON'T use the "-f" 
flag, which will also force the working tree!). That will make your index 
clean, and undo the fact that you've already marked things to be committed 
with "git-update-index".

Then you can just do

	git commit -o bar

and everything should be fine, because then git doesn't think you're doing 
something insane.

		Linus

^ permalink raw reply

* Re: separate commits for objects already updated in index?
From: Paul Jakma @ 2006-03-14 17:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git list
In-Reply-To: <Pine.LNX.4.64.0603140856120.3618@g5.osdl.org>

On Tue, 14 Mar 2006, Linus Torvalds wrote:

> The simplest thing to do is to do
>
> 	git reset
>
> to reset your index back to your HEAD (but obviously DON'T use the "-f"
> flag, which will also force the working tree!).

Ah, of course! (I knew I was being dumb ;) ).

> Then you can just do
>
> 	git commit -o bar
>
> and everything should be fine, because then git doesn't think you're doing
> something insane.

Yep, thank you!

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
The less a statesman amounts to, the more he loves the flag.
 		-- Kin Hubbard

^ permalink raw reply

* [PATCH] Use resolve in git-pull if NO_PYTHON
From: Mark Hollomon @ 2006-03-14 17:16 UTC (permalink / raw)
  To: git

git-pull is hardcoded to use the recursive merge strategy
for the twohead case. But if git has been built with NO_PYTHON,
that strategy is not available. Teach git-pull to use resolve
if built with NO_PYTHON.

Signed-off-by: Mark Hollomon <markhollomon@comcast.net>


---

 git-pull.sh |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

1eb3abec6f4811e3eeafa50445ed0f2ce5d85b08
diff --git a/git-pull.sh b/git-pull.sh
index 6caf1aa..ae9c346 100755
--- a/git-pull.sh
+++ b/git-pull.sh
@@ -8,6 +8,11 @@ USAGE='[-n | --no-summary] [--no-commit]
 LONG_USAGE='Fetch one or more remote refs and merge it/them into the current HEAD.'
 . git-sh-setup
 
+default_twohead_strategy='recursive'
+if test "@@NO_PYTHON@@"; then
+    default_twohead_strategy='resolve'
+fi
+
 strategy_args= no_summary= no_commit=
 while case "$#,$1" in 0) break ;; *,-*) ;; *) break ;; esac
 do
@@ -82,7 +87,7 @@ case "$merge_head" in
 	var=`git repo-config --get pull.twohead`
 	if test '' = "$var"
 	then
-		strategy_default_args='-s recursive'
+		strategy_default_args="-s $default_twohead_strategy"
 	else
 		strategy_default_args="-s $var"
 	fi
-- 
1.2.4.g967a

^ permalink raw reply related

* Re: separate commits for objects already updated in index?
From: Linus Torvalds @ 2006-03-14 17:20 UTC (permalink / raw)
  To: Paul Jakma; +Cc: git list
In-Reply-To: <Pine.LNX.4.64.0603141703080.5276@sheen.jakma.org>

On Tue, 14 Mar 2006, Paul Jakma wrote:

> On Tue, 14 Mar 2006, Linus Torvalds wrote:
> 
> > The simplest thing to do is to do
> > 
> > 	git reset
> > 
> > to reset your index back to your HEAD (but obviously DON'T use the "-f"
> > flag, which will also force the working tree!).
> 
> Ah, of course! (I knew I was being dumb ;) ).

Well, I actually think git is being somewhat of an ass, for no really good 
reason. It's true that you are doing something pretty strange by _both_ 
using "git-update-index" and "git commit -o" but the fact is, at least 
when adding files, that would be expected (ie you have to mark a file 
in the index to add it).

I also think that test is historical, from before Junio cleaned up how 
"git commit" worked - it _used_ to be that "git commit" would work in the 
current index, but these days it generates a new index to commit when you 
do "-o", so there's really no _technical_ reason to refuse the partial 
commit any more as far as I can see.

So I don't know. I don't think you were being dumb, I think git could have 
been friendlier to you.

		Linus

^ permalink raw reply

* Re: [PATCH] Use resolve in git-pull if NO_PYTHON
From: Johannes Schindelin @ 2006-03-14 17:26 UTC (permalink / raw)
  To: Mark Hollomon; +Cc: git
In-Reply-To: <1142356355-4772-markhollomon@comcast.net>

Hi,

On Tue, 14 Mar 2006, Mark Hollomon wrote:

> git-pull is hardcoded to use the recursive merge strategy
> for the twohead case. But if git has been built with NO_PYTHON,
> that strategy is not available. Teach git-pull to use resolve
> if built with NO_PYTHON.

D'oh. I forgot to send that patch when I was doing the NO_PYTHON stuff. 
But I did it differently: There is no good reason that git-pull should 
insist on its own default strategy when git-merge already has one.

Ciao,
Dscho

^ permalink raw reply

* Re: separate commits for objects already updated in index?
From: Paul Jakma @ 2006-03-14 17:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git list
In-Reply-To: <Pine.LNX.4.64.0603140915290.3618@g5.osdl.org>

On Tue, 14 Mar 2006, Linus Torvalds wrote:

> Well, I actually think git is being somewhat of an ass, for no 
> really good reason. It's true that you are doing something pretty 
> strange by _both_ using "git-update-index" and "git commit -o" but 
> the fact is, at least when adding files, that would be expected (ie 
> you have to mark a file in the index to add it).

Well, I tend to work on one thing, then notice something else 
unrelated (or in a support file), fix/tweak that, etc. I use the 
index for 'way-point' diffs, rather than commit things I haven't quite 
tested yet (or don't know whether they'll be useful yet).

> I also think that test is historical, from before Junio cleaned up 
> how "git commit" worked - it _used_ to be that "git commit" would 
> work in the current index, but these days it generates a new index 
> to commit when you do "-o", so there's really no _technical_ reason 
> to refuse the partial commit any more as far as I can see.

Aha. So that check possibly could just be removed?

> So I don't know. I don't think you were being dumb, I think git 
> could have been friendlier to you.

:)

git reset works just fine too.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
A day for firm decisions!!!!!  Or is it?

^ permalink raw reply

* git-cvsimport missed a commit
From: Paul Jakma @ 2006-03-14 17:34 UTC (permalink / raw)
  To: git list

git-cvsimport missed a commit (one involving 'renamed' files in CVS, 
added/deleted).

I'm wondering how best to fix this. My thinking was to just branch my 
'cvs_head' from the ancestor prior to the missed commit, rename the 
heads around, and try again.

Is there a better way? Given I actually have the missing commit in my 
'master' branch?

(I actually have a 2-way thing going with CVS. I export selected 
commits from master to CVS every now and then, and get my own CVS 
commits back again via a later import - seems to work).

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
So much food; so little time!

^ permalink raw reply

* [PATCH] Invoke git-repo-config directly.
From: Qingning Huo @ 2006-03-14 21:10 UTC (permalink / raw)
  To: git; +Cc: junkio

The system has GNU git installed at /usr/bin/git.  I installed git-core
to ~/opt/bin.  ~/opt/bin is in my PATH, but is after /usr/bin.  I have
set alias git="$HOME/opt/bin/git".

git-push and git-pull behave strangely, because they call "git
repo-config", which runs /usr/bin/git.  Using "git-repo-config" directly
fixes the problem.

Signed-off-by: Qingning Huo <qhuo@mayhq.co.uk>

---

 git-pull.sh     |    4 ++--
 git-sh-setup.sh |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

a0194fff002cb12ac58b202201d387f8ea55b225
diff --git a/git-pull.sh b/git-pull.sh
index 6caf1aa..e32e2b0 100755
--- a/git-pull.sh
+++ b/git-pull.sh
@@ -70,7 +70,7 @@ case "$merge_head" in
 	exit 0
 	;;
 ?*' '?*)
-	var=`git repo-config --get pull.octopus`
+	var=`git-repo-config --get pull.octopus`
 	if test '' = "$var"
 	then
 		strategy_default_args='-s octopus'
@@ -79,7 +79,7 @@ case "$merge_head" in
 	fi
 	;;
 *)
-	var=`git repo-config --get pull.twohead`
+	var=`git-repo-config --get pull.twohead`
 	if test '' = "$var"
 	then
 		strategy_default_args='-s recursive'
diff --git a/git-sh-setup.sh b/git-sh-setup.sh
index 025ef2d..12f5ede 100755
--- a/git-sh-setup.sh
+++ b/git-sh-setup.sh
@@ -41,7 +41,7 @@ then
 	: ${GIT_OBJECT_DIRECTORY="$GIT_DIR/objects"}
 
 	# Make sure we are in a valid repository of a vintage we understand.
-	GIT_DIR="$GIT_DIR" git repo-config --get core.nosuch >/dev/null
+	GIT_DIR="$GIT_DIR" git-repo-config --get core.nosuch >/dev/null
 	if test $? = 128
 	then
 	    exit
-- 
1.2.4.ga019-dirty

^ permalink raw reply related

* [PATCH] Invoke git-stripspace directly.
From: Qingning Huo @ 2006-03-14 21:11 UTC (permalink / raw)
  To: git; +Cc: junkio

Run "git-stripspace" instead of "git stripspace" to avoid calling
external git command.

Signed-off-by: Qingning Huo <qhuo@mayhq.co.uk>

---

 git-format-patch.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

aad41923a43b82713af05eaa26db688272091520
diff --git a/git-format-patch.sh b/git-format-patch.sh
index 2ebf7e8..486fb31 100755
--- a/git-format-patch.sh
+++ b/git-format-patch.sh
@@ -213,7 +213,7 @@ sub show_date {
 }
 
 print "From nobody Mon Sep 17 00:00:00 2001\n";
-open FH, "git stripspace <$commsg |" or die "open $commsg pipe";
+open FH, "git-stripspace <$commsg |" or die "open $commsg pipe";
 while (<FH>) {
     unless ($done_header) {
 	if (/^$/) {
-- 
1.2.4.ga019-dirty

^ permalink raw reply related

* Re: [PATCH] Invoke git-repo-config directly.
From: Johannes Schindelin @ 2006-03-14 21:20 UTC (permalink / raw)
  To: Qingning Huo; +Cc: git, junkio
In-Reply-To: <20060314211022.GA12498@localhost.localdomain>

Hi,

On Tue, 14 Mar 2006, Qingning Huo wrote:

> -	var=`git repo-config --get pull.octopus`
> +	var=`git-repo-config --get pull.octopus`

This is unlikely to be applied; there are plans to have a "libexec" path 
in which all git executables are stored, and just the "git" wrapper in the 
path. Your patch would break git in those setups.

Ciao,
Dscho

P.S.: BTW there are quite a few discussions of this in the mailing list 
archives...

^ permalink raw reply

* Re: [PATCH] Invoke git-repo-config directly.
From: Qingning Huo @ 2006-03-14 21:30 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, junkio
In-Reply-To: <Pine.LNX.4.63.0603142219040.23646@wbgn013.biozentrum.uni-wuerzburg.de>

On Tue, Mar 14, 2006 at 10:20:53PM +0100, Johannes Schindelin wrote:
> Hi,
> 
> On Tue, 14 Mar 2006, Qingning Huo wrote:
> 
> > -	var=`git repo-config --get pull.octopus`
> > +	var=`git-repo-config --get pull.octopus`
> 
> This is unlikely to be applied; there are plans to have a "libexec" path 
> in which all git executables are stored, and just the "git" wrapper in the 
> path. Your patch would break git in those setups.
> 

I do not mind whether this patch is applied.  What I want is for git to
call its own helper programs, instead of any random git program in my
PATH.  If the git programs are installed to a libexec path, how about
calling them with an absolute path?

Regards,
Qingning

^ permalink raw reply

* Re: [PATCH] Invoke git-repo-config directly.
From: Linus Torvalds @ 2006-03-14 21:58 UTC (permalink / raw)
  To: Qingning Huo; +Cc: git, junkio
In-Reply-To: <20060314211022.GA12498@localhost.localdomain>

On Tue, 14 Mar 2006, Qingning Huo wrote:
>
> The system has GNU git installed at /usr/bin/git.  I installed git-core
> to ~/opt/bin.  ~/opt/bin is in my PATH, but is after /usr/bin.  I have
> set alias git="$HOME/opt/bin/git".

This should not be a problem with the modern "git.c" wrapper. It 
_should_, if you call it with the full path, automatically prepend that 
path to the PATH when executing sub-commands. 

So if you run git as "$HOME/opt/bin/git", the PATH _should_ be 
 - first the "PREFIX/bin" path as defined by the build
 - second the "$HOME/opt/bin/" path as defined by the fact that you ran 
   git from that path
 - finally the normal $PATH.

To check this out, do this:

	ln -s /usr/bin/printenv ~/opt/bin/git-printenv
	git printenv

and you should see the proper PATH that git ends up using internally that 
way.
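
The search order described above can be sketched in Python as follows.
`exec_search_path` is a hypothetical helper and the directories are
example values; the sketch only models the ordering of the three
components, not the actual git.c code.

```python
import posixpath


def exec_search_path(prefix_bin, argv0, env_path):
    """Build the PATH the wrapper would use for sub-commands.

    prefix_bin: the "PREFIX/bin" directory fixed at build time
    argv0:      how git was invoked ("git" or "/some/dir/git")
    env_path:   the caller's normal $PATH
    """
    parts = [prefix_bin]
    invoked_dir = posixpath.dirname(argv0)
    if invoked_dir:  # only present when git was run via an explicit path
        parts.append(invoked_dir)
    if env_path:
        parts.append(env_path)
    return ":".join(parts)
```

For example, invoking "/home/me/opt/bin/git" would place /home/me/opt/bin
ahead of /usr/bin, whereas running a script such as git-pull directly
bypasses the wrapper and leaves $PATH untouched.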

So your problem seems to be that you do "git-pull", when you really should 
do "git pull" (where that wrapper will set up PATH for you). Since you 
don't use the wrapper, the scripts end up doing the wrong thing.

		Linus

^ permalink raw reply
