Git development


* RE: fatal: git-write-tree: not able to write tree
From: Brown, Len @ 2006-04-28  8:43 UTC
  To: Junio C Hamano; +Cc: git

>> git am --3way --interactive --signoff --utf8 --resolved

>Please say "--resolved" after you have actually resolved them,
>eh, meaning, (1) edit the working tree file into a desired
>shape, and (2) git-update-index drivers/acpi/thermal.c.

Thanks Junio, once again, for your help, we're up and running!

I'm okay with git being conservative and not doing the update-index
for me.  Perhaps the thing to do here is to make the failure message
more useful?

"fatal: git-write-tree: not able to write tree"

everything after "fatal" here is effectively a string
of random characters to the hapless user.

thanks,
-Len


* Re: fatal: git-write-tree: not able to write tree
From: Junio C Hamano @ 2006-04-28  8:32 UTC
  To: Len Brown; +Cc: git
In-Reply-To: <200604280430.33100.len.brown@intel.com>

Len Brown <len.brown@intel.com> writes:

> I'm trying to use git-am to apply a patch series in a mailbox.
>
> The first patch has a conflict, which I edit to fix, and then invoke
> git am --3way --interactive --signoff --utf8 --resolved
>
> but it bails out with this:
>
> drivers/acpi/thermal.c: unmerged (4829f067a3e7acfbeed8b230caac00b1ed4b8554)
> drivers/acpi/thermal.c: unmerged (528d198c28512af1627cce481575f37a599c0fe8)
> drivers/acpi/thermal.c: unmerged (db3bef1a3e51801128e7553f3e546c8272cc9ee1)
> fatal: git-write-tree: not able to write tree
>
> I've tried various incantations of git reset on the theory that there is some 
> old state hanging around someplace, but have not been able to check in this 
> file.
>
> clues?

Please say "--resolved" after you have actually resolved them,
eh, meaning, (1) edit the working tree file into a desired
shape, and (2) git-update-index drivers/acpi/thermal.c.

I've considered making --resolved do update-index for all
paths that are unmerged in the index, but that risks going
forward by mistake when you still have other paths to resolve,
so...


* fatal: git-write-tree: not able to write tree
From: Len Brown @ 2006-04-28  8:30 UTC
  To: git

I'm trying to use git-am to apply a patch series in a mailbox.

The first patch has a conflict, which I edit to fix, and then invoke
git am --3way --interactive --signoff --utf8 --resolved

but it bails out with this:

drivers/acpi/thermal.c: unmerged (4829f067a3e7acfbeed8b230caac00b1ed4b8554)
drivers/acpi/thermal.c: unmerged (528d198c28512af1627cce481575f37a599c0fe8)
drivers/acpi/thermal.c: unmerged (db3bef1a3e51801128e7553f3e546c8272cc9ee1)
fatal: git-write-tree: not able to write tree

I've tried various incantations of git reset on the theory that there is some 
old state hanging around someplace, but have not been able to check in this 
file.

clues?

thanks,
-Len


* Re: [PATCH] Add a test case for rerere
From: Uwe Zeisberger @ 2006-04-28  8:02 UTC
  To: git
In-Reply-To: <20060428075604.GA30714@digi.com>

Hello,

Uwe Zeisberger wrote:
> +echo "added in branch" >> file-common &&
> +git add file-branch file-common &&
> +git commit -m "branch1" -i file-base file-branch file-common &&
> +git branch branch1'
> +
> ...
> + 
> +test_expect_failure 'pull branch1' \
> +'git pull . branch1'

When typing the test I first tried to pull branch^, but this failed with
"no such remote ref refs/heads/branch^".  Is it intended that one can
only pull branches and not any rev?

Best regards
Uwe

PS: I added a double blank line in the file.  Sorry for that...

-- 
Uwe Zeisberger

http://www.google.com/search?q=Planck%27s+constant%3D


* [PATCH] Add a test case for rerere
From: Uwe Zeisberger @ 2006-04-28  7:56 UTC
  To: git

Currently this test fails because rerere is not able to record
resolutions for a file that doesn't exist in the merge base but
exists in both branches being merged.

Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>

---

 t/t8003-rerere.sh |   66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 66 insertions(+), 0 deletions(-)
 create mode 100644 t/t8003-rerere.sh

It's the last command that fails because rerere didn't record the
conflict between branch1:file-common and master:file-common.

Please feel free to change the filename as I don't know/see the naming
scheme of the tests.

Best regards
Uwe

ff012a80cafa3fe905de72d0db8b616ff76d0038
diff --git a/t/t8003-rerere.sh b/t/t8003-rerere.sh
new file mode 100644
index 0000000..1bb66ff
--- /dev/null
+++ b/t/t8003-rerere.sh
@@ -0,0 +1,66 @@
+#!/bin/sh
+
+test_description='git-rerere'
+. ./test-lib.sh
+
+
+test_expect_success 'prepare repository' \
+'mkdir .git/rr-cache &&
+echo "content" > file-base &&
+git add file-base &&
+git commit -m "Initial commit" &&
+git branch branch &&
+echo "added after branch" >> file-base &&
+echo "added after branch" >> file-common &&
+git add file-common &&
+git commit -m "master1" -i file-base file-common &&
+git checkout branch &&
+echo "added in branch" >> file-base &&
+echo "only in branch" > file-branch &&
+echo "added in branch" >> file-common &&
+git add file-branch file-common &&
+git commit -m "branch1" -i file-base file-branch file-common &&
+git branch branch1'
+
+test_expect_failure 'pull master' \
+'git pull . master'
+
+cat >> file-base-expect << EOF
+content
+<<<<<<< HEAD/file-base
+added in branch
+=======
+added after branch
+>>>>>>> `git rev-parse master`/file-base
+EOF
+
+test_expect_success 'merge result' \
+'cmp file-base file-base-expect &&
+git cat-file blob HEAD:file-common | cmp file-common~HEAD - &&
+git cat-file blob master:file-common | cmp file-common~`git rev-parse master` - &&
+git cat-file blob HEAD:file-branch | cmp file-branch -'
+
+test_expect_success 'record and resolve conflicts' \
+'git rerere &&
+echo "content
+added in branch
+added after branch" > file-base &&
+echo "added in branch
+added after branch" > file-common &&
+git rerere &&
+git-ls-files -o | xargs rm &&
+git commit -m "resolved conflicts" -i file-base file-common file-branch &&
+git-checkout master
+'
+ 
+test_expect_failure 'pull branch1' \
+'git pull . branch1'
+
+test_expect_success 'reuse recorded resolve' \
+'git rerere &&
+git cat-file blob branch:file-branch | cmp file-branch - &&
+git cat-file blob branch:file-base | cmp file-base - &&
+git cat-file blob branch:file-common | cmp file-common -'
+
+test_done
+
-- 
1.3.1.gac92


-- 
Uwe Zeisberger
FS Forth-Systeme GmbH, A Digi International Company
Kueferstrasse 8, D-79206 Breisach, Germany
Phone: +49 (7667) 908 0 Fax: +49 (7667) 908 200
Web: www.fsforth.de, www.digi.com


* Re: new gitk feature
From: Linus Torvalds @ 2006-04-28  5:11 UTC
  To: Paul Mackerras; +Cc: git
In-Reply-To: <17489.22838.502099.575465@cargo.ozlabs.ibm.com>



On Fri, 28 Apr 2006, Paul Mackerras wrote:
> Linus Torvalds writes:
> > Any possibility of something like that? I'd _love_ to be able to see the 
> > whole tree, but with things that touch certain files or things that are 
> > newer highlighted.
> 
> That should be quite doable.  How about I show the commits that are in
> the highlight view in bold?  That won't conflict with the existing
> yellow background for commits that match the find criteria.

Bold sounds good to me.

> > (Btw, the "revision information" also covers cool things like "--unpacked". I 
> > actually use "gitk --unpacked" every once in a while, just because it's 
> > such a cool way to say "show me everything I've added since I packed the 
> > repo last").
> 
> OK, I didn't know about --unpacked. :)  I plan to add stuff to the
> view definition window to allow you to select commits to
> include/exclude by reachability from given commits (by head/tag/ID)
> and when I do I can add a way to say --unpacked too.

It's more of a gimmick, but I find myself using it occasionally just to 
decide whether it's time to repack. It falls out automatically - not 
because I thought I'd ever want it, but because the --unpacked semantics 
for git-rev-list are what incremental packing needed.

(Of course, sane people probably just do "git count-objects" to decide to 
repack).

		Linus


* Re: PATCH: New diff-delta.c implementation (updated)
From: Junio C Hamano @ 2006-04-28  4:28 UTC
  To: Geert Bosch; +Cc: git
In-Reply-To: <7v1wvigzka.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano <junkio@cox.net> writes:

> In the kernel repository (the checkout is near the tip of the
> source tree), the largest files are fs/nls/nls_cp949.c (900kB,
> Korean character encoding), drivers/usb/misc/emi62_fw_s.h
> (800kB, Emagic firmware blob), and arch/m68k/ifpsp060/src/fpsp.S
> (750kB, floating point emulation?), which is nowhere near the
> size where your algorithm really should shine.
>
> We would probably want some internal logic that says "if we see
> that blobs larger than X MB are involved in the packing, we
> should use this version of diff-delta, otherwise the other one."

Third impression, a synthetic workload: a sequence of revisions
of a single-file project, where the file is a tarball of the
git.git tree (that is, "git-tar-tree vX.Y.Z >tarball"), 120
objects or so (1 commit per rev, 1 tree to hold 1 blob).  The
(uncompressed) sizes of the 40 blobs in the pack are between
2.06MB and 2.86MB (average 2.30MB).

(Nico)
Total 123, written 123 (delta 38), reused 0 (delta 0)
67.26user 1.03system 1:08.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+136066minor)pagefaults 0swaps

1822079 pack-nico-26989d516c62197592d0d52db24dfc6a58b633eb.pack

(Geert)
Total 123, written 123 (delta 38), reused 0 (delta 0)
67.23user 1.35system 1:09.25elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+164124minor)pagefaults 0swaps

1683139 pack-geert-26989d516c62197592d0d52db24dfc6a58b633eb.pack

That's a roughly 8% smaller pack (1683139 vs 1822079 bytes) in
the same time, which is quite impressive.  But I am _very_
unhappy about this particular synthetic workload.  I wonder if
there are projects with many large blobs that are updated often,
so that we can use them as a yardstick.  Maybe the Wine people
have icons, background images and sounds?  But I suspect you
would not update them that often.
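
As a quick check of the size comparison above, the relative saving can be computed directly from the two pack sizes. This is only an illustrative sketch; `percent_smaller` is a hypothetical helper, not something from git or from either diff-delta implementation.

```c
#include <assert.h>

/* Hypothetical helper (not part of git): by how many percent is
   new_size smaller than old_size?  A negative result means the new
   pack is larger. */
static double percent_smaller(unsigned long old_size, unsigned long new_size)
{
    assert(old_size > 0);
    return 100.0 * ((double)old_size - (double)new_size) / (double)old_size;
}
```

For the tarball workload, `percent_smaller(1822079, 1683139)` comes to about 7.6, which rounds to the quoted 8%. The same calculation on the git.git packs from the earlier message (6188418 vs 6099183 bytes) gives about 1.4%.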

Thinking about it, it does not make much sense, at least to me,
to store large tarballs or binary blobs or whatnot in an SCM (we
are _not_ in the archival business) and keep track of their
changes.  The tarball is out of the question -- it is not a
source (in the GPL sense of the word -- it is not the preferred
form for making modifications; you modify constituent files and
bundle up the result as a new tarball).  Graphics images, perhaps.


* Re: PATCH: New diff-delta.c implementation (updated)
From: Junio C Hamano @ 2006-04-28  3:16 UTC
  To: Geert Bosch; +Cc: git
In-Reply-To: <Pine.GSO.4.60.0604272132170.9650@nile.gnat.com>

Geert Bosch <bosch@gnat.com> writes:

> Even though the previous version did really well on large files
> with many changes, performance was lacking for the many small
> files with very few changes that are so common for a VCS.
>...
> The result has been only a slight increase in delta size for
> very large test cases (but with better performance), and
> both smaller deltas and faster execution speed for repacking
> git.git. I had trouble cloning the Linux kernel repository,
> but am now reasonably confident this will outperform the
> existing algorithm pretty consistently.

Interesting.

Initial impression, the same test as before (a full packing of
the git.git repository that does not have _any_ pack -- all 18k
objects are loose).

First, the incumbent, with the "reusing delta-index" patch applied.

Total 17724, written 17724 (delta 12002), reused 0 (delta 0)
34.02user 6.48system 0:42.87elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+434478minor)pagefaults 0swaps

 6188418 pack-nico-f1fac077a093ffdaf094aab2b7f11859ec0c18f1.pack

Then diff-delta.c replaced with your version.

Total 17724, written 17724 (delta 12012), reused 0 (delta 0)
44.87user 6.54system 0:54.01elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+441124minor)pagefaults 0swaps

 6099183 pack-geert-f1fac077a093ffdaf094aab2b7f11859ec0c18f1.pack

Second impression, in a recent kernel tree which is mostly
packed.  Packing 41k objects (v2.6.16..v2.6.17-rc3), with
"git-pack-objects --no-reuse-delta".

(Nico)
Total 41591, written 41591 (delta 29285), reused 8563 (delta 0)
169.08user 12.60system 3:27.68elapsed 87%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+1099928minor)pagefaults 0swaps

37363966 pack-nico-b9e4339c482cb7d787a2117e6da6eb2114053abc.pack

(Geert)
Total 41591, written 41591 (delta 29347), reused 8427 (delta 0)
243.71user 12.32system 4:28.11elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1077843minor)pagefaults 0swaps

37165890 pack-geert-b9e4339c482cb7d787a2117e6da6eb2114053abc.pack

Of course, the absolute numbers do not matter, but for the
record these are on my Duron 750, 760MB or so RAM and with
relatively slow disks.

In the kernel repository (the checkout is near the tip of the
source tree), the largest files are fs/nls/nls_cp949.c (900kB,
Korean character encoding), drivers/usb/misc/emi62_fw_s.h
(800kB, Emagic firmware blob), and arch/m68k/ifpsp060/src/fpsp.S
(750kB, floating point emulation?), which is nowhere near the
size where your algorithm really should shine.

We would probably want some internal logic that says "if we see
that blobs larger than X MB are involved in the packing, we
should use this version of diff-delta, otherwise the other one."
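
A minimal sketch of what such a size-based switch might look like. Everything in it is hypothetical: the 1 MB value stands in for the unspecified "X MB", and `choose_delta_impl`, `DELTA_IMPL_CURRENT` and `DELTA_IMPL_LARGE` are illustrative names rather than anything that exists in git.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector between two diff-delta implementations. */
enum delta_impl { DELTA_IMPL_CURRENT, DELTA_IMPL_LARGE };

/* Assumption: stands in for the "X MB" left open in the message. */
#define LARGE_BLOB_THRESHOLD (1024UL * 1024UL)

/* Pick the large-file implementation as soon as any blob to be
   packed exceeds the threshold; otherwise keep the current one. */
static enum delta_impl choose_delta_impl(const unsigned long *sizes, size_t n)
{
    size_t i;

    assert(sizes != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (sizes[i] > LARGE_BLOB_THRESHOLD)
            return DELTA_IMPL_LARGE;
    return DELTA_IMPL_CURRENT;
}
```

With this placeholder threshold, the three largest kernel files listed above (750kB to 900kB) would all stay on the current implementation, while blobs in the 2MB range would switch to the other one.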


* Re: PATCH: New diff-delta.c implementation (updated)
From: Geert Bosch @ 2006-04-28  2:07 UTC
  To: Git Mailing List
In-Reply-To: <Pine.GSO.4.60.0604272132170.9650@nile.gnat.com>

On Apr 27, 2006, at 21:59, Geert Bosch wrote:

> The result has been only a slight increase in delta size for
> very large test cases (but with better performance),

Just to clarify: this is compared to my initial implementation.
For very large test cases, both delta size and execution time
are much less than the current implementation.

   -Geert


* PATCH: New diff-delta.c implementation (updated)
From: Geert Bosch @ 2006-04-28  1:59 UTC
  To: git

Even though the previous version did really well on large files
with many changes, performance was lacking for the many small
files with very few changes that are so common for a VCS.

For example, it turns out that, for packing the 17005 objects in
my git.git repository, diff_delta processes 240 MB worth of target
data in about 12s on my powerbook. (There's even a little more
source data, and the 12s includes compression/decompression time.)

So the fancy fingerprint calculations really take too much time.
Fortunately, it turns out that of the 240M, 120M matches directly
at the start or the end of the source data. After this trivial
matching, most remaining matches are quite small. The overhead
of setting up buffers, computing longest runs of the same character
and computing 64-bit fingerprints becomes very noticeable and
can't be regained later.
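
The "trivial matching" at the start and end of the data can be pictured as a plain common-prefix and common-suffix scan done before any fingerprinting. The sketch below is an illustration under that assumption; `common_head` and `common_tail` are hypothetical names and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading bytes that source and target have in common. */
static size_t common_head(const unsigned char *src, size_t src_len,
                          const unsigned char *tgt, size_t tgt_len)
{
    size_t max = src_len < tgt_len ? src_len : tgt_len;
    size_t n = 0;

    assert(src != NULL && tgt != NULL);
    while (n < max && src[n] == tgt[n])
        n++;
    return n;
}

/* Number of trailing bytes in common, never reaching back into the
   first `head` bytes that were already matched as a prefix. */
static size_t common_tail(const unsigned char *src, size_t src_len,
                          const unsigned char *tgt, size_t tgt_len,
                          size_t head)
{
    size_t max, n = 0;

    assert(head <= src_len && head <= tgt_len);
    max = (src_len < tgt_len ? src_len : tgt_len) - head;
    while (n < max && src[src_len - 1 - n] == tgt[tgt_len - 1 - n])
        n++;
    return n;
}
```

Only the bytes between the matched head and tail would then need indexing and fingerprint matching, which fits the observation that about half of the target data is covered by these two scans.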

As a result I implemented special indexing and matching routines
for "small" files. Here a fixed hash table size and index step
are used. The fingerprint window has been reduced to be equal to
the step size, which essentially gets rid of computation for
characters leaving the window. Finally, the fingerprint size
has been reduced to 32 bits with a polynomial of degree 31.
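
To make the reduced fingerprint more concrete, here is a sketch of a degree-31 polynomial fingerprint over one block of bytes. The update step `((fp << 8) | byte) ^ table[fp >> 23]` and the polynomial constant are the ones that appear in the source below; building the table at run time, and the names `poly_mod`, `build_table` and `block_fingerprint`, are assumptions made for illustration (the patch ships the table as the precomputed array `T`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY   0xf3a03ce5u  /* RABIN_POLY in the patch */
#define DEGREE 31           /* RABIN_DEGREE */
#define SHIFT  23           /* RABIN_SHIFT, i.e. DEGREE - 8 */

static uint32_t table[256];

/* Reduce v modulo POLY, treating both as polynomials over GF(2). */
static uint32_t poly_mod(uint64_t v)
{
    int bit;

    for (bit = 63; bit >= DEGREE; bit--)
        if (v & ((uint64_t)1 << bit))
            v ^= (uint64_t)POLY << (bit - DEGREE);
    return (uint32_t)v;
}

/* table[k] is the correction for a fingerprint whose top byte is k:
   "fp << 8" moves that byte to bits 31..38, of which only bit 31
   survives in 32 bits, so the entry also carries bit 31 to clear it. */
static void build_table(void)
{
    unsigned k;

    for (k = 0; k < 256; k++) {
        table[k] = poly_mod((uint64_t)k << DEGREE);
        table[k] |= (uint32_t)(k & 1u) << DEGREE;
    }
}

/* Fingerprint of one block; with a window equal to the index step
   there is no byte leaving the window, so no second table is needed. */
static uint32_t block_fingerprint(const unsigned char *data, size_t len)
{
    uint32_t fp = 0;
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++)
        fp = ((fp << 8) | data[i]) ^ table[fp >> SHIFT];
    return fp;
}
```

After `build_table()`, the first entries come out as 0x00000000, 0xf3a03ce5, 0x14e0452f, 0xe74079ca, which agrees with the start of the `T` array in the source. The fingerprint always stays below 2^31, so `fp >> SHIFT` is a valid table index.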

The result has been only a slight increase in delta size for
very large test cases (but with better performance), and
both smaller deltas and faster execution speed for repacking
git.git. I had trouble cloning the Linux kernel repository,
but am now reasonably confident this will outperform the
existing algorithm pretty consistently.

On PPC, the trivial matching in head and tail, and for long
matching runs now shows up high in the profile. On x86,
byte operations are very fast, so I think things should
be at least equally good there.

Please play around with this and let me know of any results.

   -Geert

Signed-off-by: Geert Bosch <bosch@gnat.com>

#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <sys/types.h>

#undef assert
#define assert(x) do { } while (0)

/*
  * MIN_HTAB_SIZE is a fixed amount added to the size of the hash table
  * used for indexing, and must be a power of two. This allows small files
  * to have a sparse hash table, since in that case it's cheap.
  * Hash table sizes are rounded up to a power of two to avoid integer division.
  */
#define MIN_HTAB_SIZE 8192
#define MAX_HTAB_SIZE (1024*1024*1024)
#define SMALL_HTAB_SIZE 8192
#define SMALL_INDEX_STEP 16

/*
  * Diffing files of gigabyte range is impractical with the current
  * algorithm, so we're assuming 32-bit sizes everywhere.
  * Size leaves some room for expansion when diffing random files.
  */
#define MAX_SIZE (0x7eff0000)

/* For small files, indices are represented in 16 bits.
  * Since indices are always a multiple of the index_step, they
  * can be shifted right a few bits to accommodate files larger than 64K
  */
#define SMALL_SHIFT 4
#define MAX_SMALL_SIZE (0xff00<<SMALL_SHIFT)

/* Initial size of copies table, dynamically extended as needed. */
#define MAX_COPIES 512

/*
  * Matching is done using a sliding window for which a Rabin
  * polynomial is computed. The advantage of such polynomials is
  * that they can efficiently be updated at every position.
  * The tables needed for this are precomputed, as it is desirable
  * to use the same polynomial all the time for repeatable results.
  * The 16 byte window is convenient for indexing with index_step 16.
  * In that special case, the U table is not needed during indexing.
  * The 32-bit hash helps on register-starved 32-bit architectures.
  */

#define RABIN_POLY 0xf3a03ce5
#define RABIN_DEGREE 31
#define RABIN_SHIFT 23
#define RABIN_WINDOW_SIZE 16

unsigned T[256] =
{ 0x00000000, 0xf3a03ce5, 0x14e0452f,
   0xe74079ca, 0x29c08a5e, 0xda60b6bb, 0x3d20cf71, 0xce80f394, 0x538114bc,
   0xa0212859, 0x47615193, 0xb4c16d76, 0x7a419ee2, 0x89e1a207, 0x6ea1dbcd,
   0x9d01e728, 0x54a2159d, 0xa7022978, 0x404250b2, 0xb3e26c57, 0x7d629fc3,
   0x8ec2a326, 0x6982daec, 0x9a22e609, 0x07230121, 0xf4833dc4, 0x13c3440e,
   0xe06378eb, 0x2ee38b7f, 0xdd43b79a, 0x3a03ce50, 0xc9a3f2b5, 0x5ae417df,
   0xa9442b3a, 0x4e0452f0, 0xbda46e15, 0x73249d81, 0x8084a164, 0x67c4d8ae,
   0x9464e44b, 0x09650363, 0xfac53f86, 0x1d85464c, 0xee257aa9, 0x20a5893d,
   0xd305b5d8, 0x3445cc12, 0xc7e5f0f7, 0x0e460242, 0xfde63ea7, 0x1aa6476d,
   0xe9067b88, 0x2786881c, 0xd426b4f9, 0x3366cd33, 0xc0c6f1d6, 0x5dc716fe,
   0xae672a1b, 0x492753d1, 0xba876f34, 0x74079ca0, 0x87a7a045, 0x60e7d98f,
   0x9347e56a, 0x4668135b, 0xb5c82fbe, 0x52885674, 0xa1286a91, 0x6fa89905,
   0x9c08a5e0, 0x7b48dc2a, 0x88e8e0cf, 0x15e907e7, 0xe6493b02, 0x010942c8,
   0xf2a97e2d, 0x3c298db9, 0xcf89b15c, 0x28c9c896, 0xdb69f473, 0x12ca06c6,
   0xe16a3a23, 0x062a43e9, 0xf58a7f0c, 0x3b0a8c98, 0xc8aab07d, 0x2feac9b7,
   0xdc4af552, 0x414b127a, 0xb2eb2e9f, 0x55ab5755, 0xa60b6bb0, 0x688b9824,
   0x9b2ba4c1, 0x7c6bdd0b, 0x8fcbe1ee, 0x1c8c0484, 0xef2c3861, 0x086c41ab,
   0xfbcc7d4e, 0x354c8eda, 0xc6ecb23f, 0x21accbf5, 0xd20cf710, 0x4f0d1038,
   0xbcad2cdd, 0x5bed5517, 0xa84d69f2, 0x66cd9a66, 0x956da683, 0x722ddf49,
   0x818de3ac, 0x482e1119, 0xbb8e2dfc, 0x5cce5436, 0xaf6e68d3, 0x61ee9b47,
   0x924ea7a2, 0x750ede68, 0x86aee28d, 0x1baf05a5, 0xe80f3940, 0x0f4f408a,
   0xfcef7c6f, 0x326f8ffb, 0xc1cfb31e, 0x268fcad4, 0xd52ff631, 0x7f701a53,
   0x8cd026b6, 0x6b905f7c, 0x98306399, 0x56b0900d, 0xa510ace8, 0x4250d522,
   0xb1f0e9c7, 0x2cf10eef, 0xdf51320a, 0x38114bc0, 0xcbb17725, 0x053184b1,
   0xf691b854, 0x11d1c19e, 0xe271fd7b, 0x2bd20fce, 0xd872332b, 0x3f324ae1,
   0xcc927604, 0x02128590, 0xf1b2b975, 0x16f2c0bf, 0xe552fc5a, 0x78531b72,
   0x8bf32797, 0x6cb35e5d, 0x9f1362b8, 0x5193912c, 0xa233adc9, 0x4573d403,
   0xb6d3e8e6, 0x25940d8c, 0xd6343169, 0x317448a3, 0xc2d47446, 0x0c5487d2,
   0xfff4bb37, 0x18b4c2fd, 0xeb14fe18, 0x76151930, 0x85b525d5, 0x62f55c1f,
   0x915560fa, 0x5fd5936e, 0xac75af8b, 0x4b35d641, 0xb895eaa4, 0x71361811,
   0x829624f4, 0x65d65d3e, 0x967661db, 0x58f6924f, 0xab56aeaa, 0x4c16d760,
   0xbfb6eb85, 0x22b70cad, 0xd1173048, 0x36574982, 0xc5f77567, 0x0b7786f3,
   0xf8d7ba16, 0x1f97c3dc, 0xec37ff39, 0x39180908, 0xcab835ed, 0x2df84c27,
   0xde5870c2, 0x10d88356, 0xe378bfb3, 0x0438c679, 0xf798fa9c, 0x6a991db4,
   0x99392151, 0x7e79589b, 0x8dd9647e, 0x435997ea, 0xb0f9ab0f, 0x57b9d2c5,
   0xa419ee20, 0x6dba1c95, 0x9e1a2070, 0x795a59ba, 0x8afa655f, 0x447a96cb,
   0xb7daaa2e, 0x509ad3e4, 0xa33aef01, 0x3e3b0829, 0xcd9b34cc, 0x2adb4d06,
   0xd97b71e3, 0x17fb8277, 0xe45bbe92, 0x031bc758, 0xf0bbfbbd, 0x63fc1ed7,
   0x905c2232, 0x771c5bf8, 0x84bc671d, 0x4a3c9489, 0xb99ca86c, 0x5edcd1a6,
   0xad7ced43, 0x307d0a6b, 0xc3dd368e, 0x249d4f44, 0xd73d73a1, 0x19bd8035,
   0xea1dbcd0, 0x0d5dc51a, 0xfefdf9ff, 0x375e0b4a, 0xc4fe37af, 0x23be4e65,
   0xd01e7280, 0x1e9e8114, 0xed3ebdf1, 0x0a7ec43b, 0xf9def8de, 0x64df1ff6,
   0x977f2313, 0x703f5ad9, 0x839f663c, 0x4d1f95a8, 0xbebfa94d, 0x59ffd087,
   0xaa5fec62
};

unsigned U[256] =
{ 0x00000000, 0x302a7c89, 0x6054f912,
   0x507e859b, 0x3309cec1, 0x0323b248, 0x535d37d3, 0x63774b5a, 0x66139d82,
   0x5639e10b, 0x06476490, 0x366d1819, 0x551a5343, 0x65302fca, 0x354eaa51,
   0x0564d6d8, 0x3f8707e1, 0x0fad7b68, 0x5fd3fef3, 0x6ff9827a, 0x0c8ec920,
   0x3ca4b5a9, 0x6cda3032, 0x5cf04cbb, 0x59949a63, 0x69bee6ea, 0x39c06371,
   0x09ea1ff8, 0x6a9d54a2, 0x5ab7282b, 0x0ac9adb0, 0x3ae3d139, 0x7f0e0fc2,
   0x4f24734b, 0x1f5af6d0, 0x2f708a59, 0x4c07c103, 0x7c2dbd8a, 0x2c533811,
   0x1c794498, 0x191d9240, 0x2937eec9, 0x79496b52, 0x496317db, 0x2a145c81,
   0x1a3e2008, 0x4a40a593, 0x7a6ad91a, 0x40890823, 0x70a374aa, 0x20ddf131,
   0x10f78db8, 0x7380c6e2, 0x43aaba6b, 0x13d43ff0, 0x23fe4379, 0x269a95a1,
   0x16b0e928, 0x46ce6cb3, 0x76e4103a, 0x15935b60, 0x25b927e9, 0x75c7a272,
   0x45eddefb, 0x0dbc2361, 0x3d965fe8, 0x6de8da73, 0x5dc2a6fa, 0x3eb5eda0,
   0x0e9f9129, 0x5ee114b2, 0x6ecb683b, 0x6bafbee3, 0x5b85c26a, 0x0bfb47f1,
   0x3bd13b78, 0x58a67022, 0x688c0cab, 0x38f28930, 0x08d8f5b9, 0x323b2480,
   0x02115809, 0x526fdd92, 0x6245a11b, 0x0132ea41, 0x311896c8, 0x61661353,
   0x514c6fda, 0x5428b902, 0x6402c58b, 0x347c4010, 0x04563c99, 0x672177c3,
   0x570b0b4a, 0x07758ed1, 0x375ff258, 0x72b22ca3, 0x4298502a, 0x12e6d5b1,
   0x22cca938, 0x41bbe262, 0x71919eeb, 0x21ef1b70, 0x11c567f9, 0x14a1b121,
   0x248bcda8, 0x74f54833, 0x44df34ba, 0x27a87fe0, 0x17820369, 0x47fc86f2,
   0x77d6fa7b, 0x4d352b42, 0x7d1f57cb, 0x2d61d250, 0x1d4baed9, 0x7e3ce583,
   0x4e16990a, 0x1e681c91, 0x2e426018, 0x2b26b6c0, 0x1b0cca49, 0x4b724fd2,
   0x7b58335b, 0x182f7801, 0x28050488, 0x787b8113, 0x4851fd9a, 0x1b7846c2,
   0x2b523a4b, 0x7b2cbfd0, 0x4b06c359, 0x28718803, 0x185bf48a, 0x48257111,
   0x780f0d98, 0x7d6bdb40, 0x4d41a7c9, 0x1d3f2252, 0x2d155edb, 0x4e621581,
   0x7e486908, 0x2e36ec93, 0x1e1c901a, 0x24ff4123, 0x14d53daa, 0x44abb831,
   0x7481c4b8, 0x17f68fe2, 0x27dcf36b, 0x77a276f0, 0x47880a79, 0x42ecdca1,
   0x72c6a028, 0x22b825b3, 0x1292593a, 0x71e51260, 0x41cf6ee9, 0x11b1eb72,
   0x219b97fb, 0x64764900, 0x545c3589, 0x0422b012, 0x3408cc9b, 0x577f87c1,
   0x6755fb48, 0x372b7ed3, 0x0701025a, 0x0265d482, 0x324fa80b, 0x62312d90,
   0x521b5119, 0x316c1a43, 0x014666ca, 0x5138e351, 0x61129fd8, 0x5bf14ee1,
   0x6bdb3268, 0x3ba5b7f3, 0x0b8fcb7a, 0x68f88020, 0x58d2fca9, 0x08ac7932,
   0x388605bb, 0x3de2d363, 0x0dc8afea, 0x5db62a71, 0x6d9c56f8, 0x0eeb1da2,
   0x3ec1612b, 0x6ebfe4b0, 0x5e959839, 0x16c465a3, 0x26ee192a, 0x76909cb1,
   0x46bae038, 0x25cdab62, 0x15e7d7eb, 0x45995270, 0x75b32ef9, 0x70d7f821,
   0x40fd84a8, 0x10830133, 0x20a97dba, 0x43de36e0, 0x73f44a69, 0x238acff2,
   0x13a0b37b, 0x29436242, 0x19691ecb, 0x49179b50, 0x793de7d9, 0x1a4aac83,
   0x2a60d00a, 0x7a1e5591, 0x4a342918, 0x4f50ffc0, 0x7f7a8349, 0x2f0406d2,
   0x1f2e7a5b, 0x7c593101, 0x4c734d88, 0x1c0dc813, 0x2c27b49a, 0x69ca6a61,
   0x59e016e8, 0x099e9373, 0x39b4effa, 0x5ac3a4a0, 0x6ae9d829, 0x3a975db2,
   0x0abd213b, 0x0fd9f7e3, 0x3ff38b6a, 0x6f8d0ef1, 0x5fa77278, 0x3cd03922,
   0x0cfa45ab, 0x5c84c030, 0x6caebcb9, 0x564d6d80, 0x66671109, 0x36199492,
   0x0633e81b, 0x6544a341, 0x556edfc8, 0x05105a53, 0x353a26da, 0x305ef002,
   0x00748c8b, 0x500a0910, 0x60207599, 0x03573ec3, 0x337d424a, 0x6303c7d1,
   0x5329bb58
};


static unsigned char rabin_window[RABIN_WINDOW_SIZE];
static unsigned rabin_pos = 0;

#ifndef MIN
#define MIN(x,y) ((y)<(x) ? (y) : (x))
#endif
#ifndef MAX
#define MAX(x,y) ((y)>(x) ? (y) : (x))
#endif

/*
  * The copies array is the central data structure for diff generation.
  * Data statements are implicit, for ranges not covered by any copy command.
  *
  * The sum of tgt and length for each entry must be monotonically increasing,
  * and data ranges must be non-overlapping. This is accomplished by not
  * extending matches backwards during initial matching.
  *
  * Copies may have zero length, to make it quick to delete copies during
  * optimization. However, the last copy in the list must always be a
  * non-trivial copy.
  *
  * Before committing copies, an important optimization is performed: during
  * a backward pass through the copies array, each entry is extended backwards,
  * and redundant copies are eliminated.
  *
  * If each match were extended backwards on insertion, the same data may be
  * matched an arbitrary number of times, resulting in potentially quadratic
  * time behavior.
  */

typedef struct copyinfo {
 	unsigned src;
 	unsigned tgt;
 	unsigned length;
} CopyInfo;

static CopyInfo *copies;
static int copy_count = 0;
static unsigned max_copies = 0; /* Dynamically increased */

static unsigned *idx;
static unsigned idx_size;
static unsigned char *idx_data;
static unsigned idx_data_len;

typedef unsigned poly_t;

static void rabin_reset(void)
{
 	memset(rabin_window, 0, sizeof(rabin_window));
}

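/* Slide the fingerprint window forward by one byte: cancel the
   contribution of the byte leaving the window (via U), then shift
   in the new byte m and reduce modulo the polynomial (via T). */
static poly_t rabin_slide (poly_t fp, unsigned char m)
{
 	unsigned char om;
 	if (++rabin_pos == RABIN_WINDOW_SIZE) rabin_pos = 0;
 	om = rabin_window[rabin_pos];	/* oldest byte in the circular buffer */
 	fp ^= U[om];
 	rabin_window[rabin_pos] = m;
 	fp = ((fp << 8) | m) ^ T[fp >> RABIN_SHIFT];
 	return fp;
}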

static int add_copy (unsigned src, unsigned tgt, unsigned length)
{
 	if (copy_count == max_copies) {
 		max_copies *= 2;

 		if (!max_copies) {
 			max_copies = MAX_COPIES;
 			copies = malloc (max_copies * sizeof (CopyInfo));
 		} else
 			copies = realloc(copies,
 			   max_copies * sizeof (CopyInfo));
 		if (!copies)
 			return 0;
 	}

 	copies[copy_count].src = src;
 	copies[copy_count].tgt = tgt;
 	copies[copy_count].length = length;
 	return ++copy_count;
}

static unsigned maxofs[256];
static unsigned maxlen[256];
static unsigned maxfp[256];

static const unsigned small_idx_size = SMALL_HTAB_SIZE;
static short unsigned small_idx[SMALL_HTAB_SIZE];

static void small_init_idx (unsigned char * data, unsigned len,
                      	    unsigned head, unsigned tail)
{
 	const unsigned index_step = SMALL_INDEX_STEP;
 	/* Start at the index_step boundary at or below head */
 	unsigned j = head - head % index_step;

 	if (len < index_step) return;

 	idx_data = data;
 	idx_data_len = len;
 	len -= MIN (len, tail + (index_step - 1));

 	memset (small_idx, 0, sizeof(small_idx));

 	while (j < len) {
 		poly_t fp = 0;
 		do
 			fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];
 		while (j % index_step);
 		small_idx[fp % small_idx_size] = j >> SMALL_SHIFT;
 	}
}

static void init_idx (unsigned char *data, unsigned len, int level,
 		      unsigned head, unsigned tail)
{
 	unsigned index_step
 	  = RABIN_WINDOW_SIZE / sizeof(unsigned) * sizeof(unsigned);
 	unsigned j, k;
 	unsigned char ch = 0;
 	unsigned runlen = 0;
 	poly_t fp = 0;

 	/* Special case small files at low optimization levels */
 	if (level <= 1 && len < MAX_SMALL_SIZE
 	  && len - head - tail < (SMALL_HTAB_SIZE * SMALL_INDEX_STEP)) {
 		small_init_idx(data, len, head, tail);
 		return;
 	}

 	assert (len <= MAX_SIZE);
 	assert (head < len);
 	assert (level >= 0 && level <= 9);
 	memset(maxofs, 0, sizeof(maxofs));
 	memset(maxlen, 0, sizeof(maxlen));
 	memset(maxfp, 0, sizeof(maxfp));

 	/* Smaller step size for higher optimization levels.
 	   The index_step must be a multiple of the word size */
 	if (level >= 1)
 		index_step = MIN(index_step, 4 * sizeof (unsigned));
 	if (level >= 3)
 		index_step = MIN (index_step, 3 * sizeof (unsigned));
 	if (level >= 4)
 		index_step = MIN (index_step, 2 * sizeof (unsigned));
 	if (level >= 6)
 		index_step = MIN (index_step, 1 * sizeof (unsigned));
 	assert (index_step && !(index_step % sizeof (unsigned)));

 	/* Add fixed amount to hash table size, as small files will benefit
 	   a lot without using significantly more memory or time. */
 	idx_size = (level + 1) * ((len - head - tail) / index_step) / 2;
 	idx_size = MIN (idx_size + MIN_HTAB_SIZE, MAX_HTAB_SIZE - 1);

 	/* Round up to next power of two, but limit to MAX_HTAB_SIZE. */
 	{
 		unsigned s = MIN_HTAB_SIZE;
 		while (s < idx_size) s += s;
 		idx_size = s;
 	}

 	idx_data = data;
 	idx_data_len = len;
 	idx = calloc(idx_size, sizeof(unsigned));

 	/* It is tempting to first index higher addresses, so hashes of lower
 	   addresses will get preference in the hash table. However, for
 	   repetitive patterns with a period that is a divisor of the
 	   fingerprint window, this may mean the match is not anchored at
 	   the end. Furthermore, even when using a window length that is
 	   prime, the benefits are small and the irregularity of the first
 	   matches being more important is not worth it. */

 	rabin_reset();

 	ch = 0;
 	runlen = 0;

 	if (head < RABIN_WINDOW_SIZE + index_step)
 		head = 0;
 	else {
 		head -= head % index_step;
 		for (j = head - RABIN_WINDOW_SIZE + 1; j < head; j++)
 			fp = rabin_slide (fp, data[j]);
 	}

 	for (j = head; j + index_step < len - tail; j += index_step) {
 		unsigned char pch = 0;
 		unsigned hash;

 		for (k = 0; k < index_step; k++) {
 			pch = ch;
 			ch = data[j + k];
 			if (ch != pch)
 				runlen = 0;
 			runlen++;
 			fp = rabin_slide(fp, ch);
 		}

 		/* See if there is a word-aligned window-sized run of
 		   equal characters */
 		if (runlen >= RABIN_WINDOW_SIZE + sizeof(unsigned) - 1) {
 			/* Skip ahead to end of run */
 			while (j + k < len && data[j + k] == ch) {
 				k++;
 				runlen++;
 			}

 			/* Although matches are usually anchored at the end,
 			   in the case of extended runs of equal characters
 			   it is better to anchor after the first
 			   RABIN_WINDOW_SIZE bytes. This allows for quick
 			   skip ahead while matching such runs, avoiding
 			   unneeded fingerprint calculations.
 			   Also, when anchoring at the end, matches will be
 			   generated after every word, because the fingerprint
 			   stays constant. Even though all matches would get
 			   combined during match optimization, it wastes time
 			   and space. */
 			if (runlen > maxlen[pch] + 4) {
 				unsigned ofs;
 				/* ofs points RABIN_WINDOW_SIZE bytes after
 				   the start of the run, rounded up to the
 				   next word */
 				ofs = j + k - runlen + RABIN_WINDOW_SIZE
 				   + (sizeof (unsigned) - 1);
 				ofs -= ofs % sizeof(unsigned);
 				maxofs[pch] = ofs;
 				maxlen [pch] = runlen;
 				assert(maxfp[pch] == 0
 				  || maxfp[pch] == (unsigned)fp);
 				maxfp[pch] = (unsigned)fp;
 			}
 			/* Keep input aligned as if no special run
 			   processing had taken place */
 			j += k - (k % index_step) - index_step;
 			k = index_step;
 		}

 		/* Testing showed that avoiding collisions using secondary
 		   hashing, or hash chaining had little effect and is not
 		   worth the time. */
 		hash = ((unsigned)fp) & (idx_size - 1);
 		idx[hash] = j + k;
 	}

 	/* Lastly, index the longest runs of equal characters found before.
 	   This ensures we always match the longest such runs available.  */
 	for (j = 0; j < 256; j++)
 		if (maxlen[j])
 			idx[maxfp[j] % idx_size] = maxofs[j];
}

/* Match data against the current index and record all possible copies */
static int small_find_copies(unsigned char *data, unsigned len, unsigned head)
{
 	unsigned j = head < RABIN_WINDOW_SIZE ? 0 : head - RABIN_WINDOW_SIZE;
 	poly_t fp = 0;

 	while (j < MAX (head, RABIN_WINDOW_SIZE) && j < len)
 		fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];

 	while (j < len) {
 		unsigned ofs, src, tgt, runlen, maxrun;

 		fp ^= U[data[j - RABIN_WINDOW_SIZE]];
 		fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];

 		ofs = small_idx[fp & (small_idx_size - 1)] << SMALL_SHIFT;

 		/* Invariant:
 		   data[0] .. data[j-1] has been processed
 		   fp is fingerprint of sliding window ending at j-1
 		   ofs is zero or points just past tentative match
 		   ofs is a multiple of index_step */

 		if (!ofs)
 			continue;

 		runlen = 0;
 		tgt = j - 4;
 		src = ofs - 4;
 		maxrun = MIN(idx_data_len - src, len - tgt);

 		/* Hot loop */
 		while (runlen < maxrun &&
 		       data[tgt + runlen] == idx_data[src + runlen])
 			runlen++;
 		if (runlen < 4)
 			continue;

 		if (!add_copy(src, tgt, runlen)) return 0;

 		/* For runs extending more than RABIN_WINDOW_SIZE bytes past j,
 		   skip ahead to prevent useless fingerprint computations. */
 		if (tgt + runlen > j + RABIN_WINDOW_SIZE)
 		{
 			fp = 0;
 			j = tgt + runlen - RABIN_WINDOW_SIZE;
 			while (j < tgt + runlen)
 				fp = ((fp << 8) | data[j++])
 				      ^ T[fp >> RABIN_SHIFT];
 		}

 		/* Quickly scan ahead without looking for matches
 		   until the end of this run */
 		while (j < tgt + runlen) {
 			fp ^= U[data[j - RABIN_WINDOW_SIZE]];
 			fp = ((fp << 8) | data[j++]) ^ T[fp >> RABIN_SHIFT];
 		}
 	}

 	return 1;
}

/* Match data against the current index and record all possible copies */
static int find_copies(unsigned char *data, unsigned len, unsigned head)
{
 	unsigned j = head < RABIN_WINDOW_SIZE ? 0 : head - RABIN_WINDOW_SIZE;
 	poly_t fp = 0;

 	assert (idx_data);

 	if (!idx) return small_find_copies (data, len, head);

 	rabin_reset();

 	while (j < head + RABIN_WINDOW_SIZE && j < len)
 		fp = rabin_slide(fp, data[j++]);

 	while (j < len) {
 		unsigned ofs, src, tgt, runlen, maxrun;

 		fp = rabin_slide(fp, data[j++]);
 		ofs = idx[fp & (idx_size - 1)];

 		/* Invariant:
 		   data[0] .. data[j-1] has been processed
 		   fp is fingerprint of sliding window ending at j-1
 		   ofs is zero or points just past tentative match
 		   ofs is a multiple of index_step */

 		if (!ofs)
 			continue;

 		runlen = 0;
 		tgt = j - 4;
 		src = ofs - 4;
 		maxrun = MIN(idx_data_len - src, len - tgt);

 		/* Hot loop */
 		while (runlen < maxrun &&
 		       data[tgt + runlen] == idx_data[src + runlen])
 			runlen++;
 		if (runlen < 4)
 			continue;

 		if (!add_copy(src, tgt, runlen)) return 0;

 		/* For runs extending more than RABIN_WINDOW_SIZE bytes past j,
 		   skip ahead to prevent useless fingerprint computations. */
 		if (tgt + runlen > j + RABIN_WINDOW_SIZE)
 			j = tgt + runlen - RABIN_WINDOW_SIZE;

 		/* Quickly scan ahead without looking for matches
 		   until the end of this run */
 		while (j < tgt + runlen)
 			fp = rabin_slide(fp, data[j++]);
 	}

 	return 1;
}

static unsigned header_length(unsigned srclen, unsigned tgtlen)
{
 	unsigned len = 0;
 	assert (srclen <= MAX_SIZE && tgtlen <= MAX_SIZE);

 	/* GIT headers start with the length of the source and target,
 	   with 7 bits per byte, least significant byte first, and
 	   the high bit indicating continuation. */
 	do { len++; srclen >>= 7; } while (srclen);
 	do { len++; tgtlen >>= 7; } while (tgtlen);

 	return len;
}

static unsigned char *
write_header(unsigned char *patch, unsigned srclen, unsigned tgtlen)
{
 	assert (srclen <= MAX_SIZE && tgtlen <= MAX_SIZE);

 	while (srclen >= 0x80) {
 		*patch++ = srclen | 0x80;
 		srclen >>= 7;
 	}
 	*patch++ = srclen;

 	while (tgtlen >= 0x80) {
 		*patch++ = tgtlen | 0x80;
 		tgtlen >>= 7;
 	}
 	*patch++ = tgtlen;

 	return patch;
}

static unsigned data_length(unsigned length)
{
 	/* Can only include 0x7f data bytes per command */
 	unsigned partial = length % 0x7f;
 	assert (length > 0 && length <= MAX_SIZE);
 	if (partial) partial++;
 	return partial + (length / 0x7f) * 0x80;
}

static unsigned char *
write_data(unsigned char *patch, unsigned char *data, unsigned size)
{
 	assert (size > 0 && size < MAX_SIZE);
 	/* The return value must be equal to patch + data_length(size).
 	   This correspondence is essential for calculating the patch size.  */

 	/* GIT has no data commands for large data, rest is same as GDIFF */
 	do {
 		unsigned s = size;
 		if (s > 0x7f)
 			s = 0x7f;
 		*patch++ = s;
 		memcpy(patch, data, s);
 		data += s;
 		patch += s;
 		size -= s;
 	} while (size);

 	return patch;
}

static unsigned copy_length (unsigned offset, unsigned length)
{
 	unsigned size = 0;

 	assert (offset < MAX_SIZE && length < MAX_SIZE);

 	/* For now we only copy a maximum of 0x10000 bytes per command.
 	   Longer copies are broken into pieces of that size. */
 	do {
 		signed s = length;
 		if (s > 0x10000)
 			s = 0x10000;
 		size += !!(s & 0xff) + !!(s & 0xff00);
 		size += !!(offset & 0xff) + !!(offset & 0xff00) +
 			!!(offset & 0xff0000) + !!(offset & 0xff000000);
 		size += 1;
 		offset += s;
 		length -= s;
 	} while (length);

 	return size;
}

static unsigned char *
write_copy(unsigned char *patch, unsigned offset, unsigned size)
{
 	/* The return value must be equal to patch + copy_length(offset, size).
 	   This correspondence is essential for calculating the patch size.  */

 	do {
 		unsigned char c = 0x80, *cmd = patch++;
 		unsigned v, s = size;
 		if (s > 0x10000)
 			s = 0x10000;

 		v = offset;
 		if (v & 0xff) c |= 0x01, *patch++ = v;
 		v >>= 8;
 		if (v & 0xff) c |= 0x02, *patch++ = v;
 		v >>= 8;
 		if (v & 0xff) c |= 0x04, *patch++ = v;
 		v >>= 8;
 		if (v & 0xff) c |= 0x08, *patch++ = v;

 		v = s;
 		if (v & 0xff) c |= 0x10, *patch++ = v;
 		v >>= 8;
 		if (v & 0xff) c |= 0x20, *patch++ = v;

 		*cmd = c;
 		offset += s;
 		size -= s;
 	} while (size);

 	return patch;
}

static unsigned
process_copies (unsigned char *data, unsigned length, unsigned maxlen)
{
 	int j;
 	unsigned ptr = length;
 	unsigned patch_bytes = header_length(idx_data_len, length);

 	/* Work through the copies backwards, extending each one backwards. */
 	for (j = copy_count - 1; j >= 0; j--) {
 		CopyInfo *copy = copies+j;
 		unsigned src = copy->src;
 		unsigned tgt = copy->tgt;
 		unsigned len = copy->length;
 		int data_follows;

 		if (tgt + len > ptr) {
 			/* Part of copy already covered by later one,
 			   so shorten copy. */
 			if (ptr < tgt) {
 				/* Copy completely disappeared, but guess
 				   that a backward extension might still be
 				   useful. This extension is non-contiguous,
 				   as it is irrelevant whether the skipped
 				   data would have matched or not. Be careful
 				   to not extend past the beginning of
 				   the source. */
 				unsigned adjust = tgt - ptr;

 				tgt = ptr;
 				src = (src < adjust) ? 0 : src - adjust;

 				copy->tgt = tgt;
 				copy->src = src;
 			}

 			len = ptr - tgt;
 		}

 		while (src && tgt && idx_data[src - 1] == data[tgt - 1]) {
 			src--;
 			tgt--;
 		}
 		len += copy->tgt - tgt;

 		data_follows = (tgt + len < ptr);

 		/* A short copy may cost as much as 6 bytes for the copy and
 		   5 as result of an extra data command. It's not worth
 		   having extra copies in order to just save a byte or two.
 		   Being too smart here may hurt later compression as well. */
 		if (len < (data_follows ? 16 : 10))
 			len = 0;

 		/* Some target data is not covered by the copies, account for
 		   the DATA command that will follow the copy. */
 		if (len && data_follows)
 			patch_bytes += data_length(ptr - (tgt + len));

 		/* Everything about the copy is known and will not change.
 		   Write back the new information and update the patch size
 		   with the size of the copy instruction. */
 		copy->length = len;
 		copy->src = src;
 		copy->tgt = tgt;

 		if (len) {
 			/* update patch size for copy command */
 			patch_bytes += copy_length (src, len);
 			ptr = tgt;
 		} else if (j == copy_count - 1) {
 			/* Remove empty copies at end of list. */
 			copy_count--;
 		}

 		if (patch_bytes > maxlen)
 			return 0;
 	}

 	/* Account for data before first copy */
 	if (ptr != 0)
 		patch_bytes += data_length(ptr);

 	if (patch_bytes > maxlen)
 		return 0;
 	return patch_bytes;
}

static void *
create_delta (unsigned char *data, unsigned len,
 	      unsigned char *delta, unsigned delta_size)
{
 	unsigned char *ptr = delta;
 	unsigned offset = 0;
 	int j;

 	ptr = write_header(ptr, idx_data_len, len);

 	for (j = 0; j < copy_count; j++) {
 		CopyInfo *copy = copies + j;
 		unsigned copylen = copy->length;

 		if (!copylen)
 			continue;

 		if (copy->tgt > offset) {
 			ptr = write_data(ptr, data + offset,
 			   copy->tgt - offset);
 		}

 		ptr = write_copy(ptr, copy->src, copylen);
 		offset = copy->tgt + copylen;
 	}

 	if (offset < len)
 		ptr = write_data(ptr, data + offset, len - offset);

 	assert(ptr - delta == delta_size);

 	return delta;
}

static void finalize_idx()
{
 	if (max_copies > 8 * MAX_COPIES) {
 		free(copies);
 		copies = 0;
 		max_copies = 0;
 	}
 	copy_count = 0;
 	if (idx) free(idx);
 	idx = 0;
 	idx_size = 0;
 	idx_data = 0;
 	idx_data_len = 0;
}

static unsigned
match_head (unsigned char *from, unsigned char *to, unsigned size)
{
 	unsigned head = 0;
 	while (head < size && from[head] == to[head]) head++;
 	return head;
}

static unsigned
match_tail (unsigned char *from, unsigned char *to, unsigned size)
{
 	unsigned tail = 0;
 	while (tail < size && *(from - tail) == *(to - tail)) tail++;
 	return tail;
}

void *diff_delta(void *from_buf, unsigned long from_size,
 		 void *to_buf, unsigned long to_size,
 		 unsigned long *delta_size, unsigned long max_size)
{
 	unsigned char *delta = 0;
 	unsigned dsize;
 	unsigned head = 0;
 	unsigned tail = 0;

 	assert (from_size <= MAX_SIZE && to_size <= MAX_SIZE);

 	/* The following actually takes care of about half of all target
 	   data. This is performance critical, and may need some work. */
 	head = match_head(from_buf, to_buf, MIN(from_size, to_size));
 	tail = match_tail(from_buf + (from_size - 1), to_buf + (to_size - 1),
 	                  MIN(from_size, to_size - head));

 	if (head <= RABIN_WINDOW_SIZE) head = 0;
 	if (tail <= RABIN_WINDOW_SIZE) tail = 0;

 	if (!max_size)
 		max_size = from_size;

 	init_idx (from_buf, from_size, 1, head, tail);

 	if (head) add_copy (0, 0, head);

 	if (head + tail + RABIN_WINDOW_SIZE < from_size) {
 		if (!find_copies(to_buf, to_size - tail, head))
 			return 0;
 	}
 	if (tail) add_copy (from_size - tail, to_size - tail, tail);

 	dsize = process_copies(to_buf, to_size, max_size);
 	if (dsize)
 	{
 		delta = malloc (dsize);
 		delta = create_delta (to_buf, to_size, delta, dsize);
 	}
 	finalize_idx ();
 	if (delta)
 		*delta_size = dsize;
 	return delta;
}
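
To illustrate the size encoding described in the header_length() comment
above, here is a minimal standalone sketch. The helper names encode_size()
and decode_size() are invented for this example and do not appear in the
patch; the encoder follows the same loop as write_header().

```c
#include <assert.h>
#include <stddef.h>

/* Encode v with 7 bits per byte, least significant group first.
   The high bit of each byte signals that another byte follows.
   Returns the number of bytes written. (Hypothetical helper.) */
static size_t encode_size(unsigned char *out, unsigned long v)
{
	size_t n = 0;

	assert(out);
	while (v >= 0x80) {
		out[n++] = (unsigned char)(v | 0x80);
		v >>= 7;
	}
	out[n++] = (unsigned char)v;
	return n;
}

/* Decode one size and report how many bytes were consumed.
   (Hypothetical helper, the inverse of encode_size().) */
static unsigned long decode_size(const unsigned char *in, size_t *used)
{
	unsigned long v = 0;
	unsigned shift = 0;
	size_t n = 0;
	unsigned char c;

	assert(in && used);
	do {
		c = in[n++];
		v |= (unsigned long)(c & 0x7f) << shift;
		shift += 7;
	} while (c & 0x80);
	*used = n;
	return v;
}
```

A size of 300 (0x12c) takes two bytes this way: 0xac followed by 0x02.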

^ permalink raw reply

* Re: [PATCH] use delta index data when finding best delta matches
From: Nicolas Pitre @ 2006-04-28  1:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vy7xqh5g6.fsf@assigned-by-dhcp.cox.net>

On Thu, 27 Apr 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > This patch allows for computing the delta index for each base object 
> > only once and reuse it when trying to find the best delta match.
> >
> > This should set the mark and pave the way for possibly better delta 
> > generator algorithms.
> >
> > Signed-off-by: Nicolas Pitre <nico@cam.org>
> 
> My understanding is that theoretically this should not make any
> difference to the result, and should run faster when the memory
> pressure does not cause the machine to thrash.  However,....
> 
> I am seeing some differences.  Even with the smallish "git.git"
> repository, packing is slightly slower, and the end result is
> smaller.

Well, I changed some heuristics a bit.

> Not that I am complaining that it produces better results with a
> small performance penalty.  I am curious because I do not
> understand where the differences are coming from, and I was
> reluctant to merge it in "next" until I understand what is going
> on.
> 
> But I think I know where the differences come from:
> 
> -	sizediff = oldsize > size ? oldsize - size : size - oldsize;
> +	sizediff = src_size < size ? size - src_size : 0;

Right.  The idea is that when the delta source index has to be computed 
each time, if the target buffer is really small then we spend more time 
computing that index than anything else.

But when the delta index is computed only once and already available 
anyway, we don't lose much attempting a delta with a small target buffer 
since the index computation is non-existent at that point and the actual 
delta generation will be quick if the target buffer is small.

> There is another "omit smaller than 50" difference but that
> should not trigger -- we do not have files that small.

Right.  And if such small files show up they won't waste window space.


Nicolas

^ permalink raw reply

* Re: [PATCH] use delta index data when finding best delta matches
From: Junio C Hamano @ 2006-04-28  1:08 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604262351221.18520@localhost.localdomain>

Nicolas Pitre <nico@cam.org> writes:

> This patch allows for computing the delta index for each base object 
> only once and reuse it when trying to find the best delta match.
>
> This should set the mark and pave the way for possibly better delta 
> generator algorithms.
>
> Signed-off-by: Nicolas Pitre <nico@cam.org>

My understanding is that theoretically this should not make any
difference to the result, and should run faster when the memory
pressure does not cause the machine to thrash.  However,....

I am seeing some differences.  Even with the smallish "git.git"
repository, packing is slightly slower, and the end result is
smaller.

Here are full packing experiments in a fully unpacked git.git
repository.

("next" version)
Total 17724, written 17724 (delta 11779), reused 0 (delta 0)
31.61user 6.24system 0:37.97elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+431995minor)pagefaults 0swaps

 6520520 pack-next-f1fac077a093ffdaf094aab2b7f11859ec0c18f1.pack

(with "use delta index" patch)
Total 17724, written 17724 (delta 12002), reused 0 (delta 0)
33.26user 6.00system 0:39.33elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+434451minor)pagefaults 0swaps

 6188418 pack-nico-f1fac077a093ffdaf094aab2b7f11859ec0c18f1.pack

Not that I am complaining that it produces better results with a
small performance penalty.  I am curious because I do not
understand where the differences are coming from, and I was
reluctant to merge it in "next" until I understand what is going
on.

But I think I know where the differences come from:

-	sizediff = oldsize > size ? oldsize - size : size - oldsize;
+	sizediff = src_size < size ? size - src_size : 0;

There is another "omit smaller than 50" difference but that
should not trigger -- we do not have files that small.

The size-diff change sort-of makes sense -- you are counting how
much the target grew, which you are likely to need to represent
as additions of literal data, and there is no reason to skip the
delta attempt when a size difference greater than max_size is in
the other direction (deletion).
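
The two heuristics can be compared side by side in a small standalone
sketch. The function names below are made up for illustration; the
expressions are taken from the two sizediff lines quoted above and from
the max_size computation visible in the try_delta() hunk.

```c
#include <assert.h>

/* Old heuristic: absolute size difference, growth and shrinkage alike. */
static unsigned long sizediff_old(unsigned long oldsize, unsigned long size)
{
	return oldsize > size ? oldsize - size : size - oldsize;
}

/* New heuristic: only count how much the target grew over the source. */
static unsigned long sizediff_new(unsigned long src_size, unsigned long size)
{
	return src_size < size ? size - src_size : 0;
}

/* max_size as computed in try_delta() for an entry that has no delta yet.
   Entries smaller than 50 bytes are skipped before this point, so the
   unsigned subtraction cannot wrap. */
static unsigned long delta_max_size(unsigned long size)
{
	assert(size >= 50);
	return size / 2 - 20;
}

/* Mirrors "if (sizediff >= max_size) return 0;": nonzero means the
   delta attempt still goes ahead. */
static int worth_trying(unsigned long sizediff, unsigned long max_size)
{
	return sizediff < max_size;
}
```

With a 10000-byte source and a 1000-byte target, max_size is 480. The old
difference is 9000, so the attempt is skipped; the new difference is 0,
so the delta is tried.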

So, I "backported" that part of the change on top of "next" and
tried the same experiment.

(without "use delta index" but the size heuristics part ported to "next")
Total 17724, written 17724 (delta 12002), reused 0 (delta 0)
36.92user 6.55system 0:43.75elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+431860minor)pagefaults 0swaps

 6188418 pack-size-f1fac077a093ffdaf094aab2b7f11859ec0c18f1.pack

And now the resulting pack is the same as what you produce.

So comparing 31.61 seconds vs 33.26 seconds and complaining you
made it slower is not fair.  You fixed the size heuristic logic
in the current code to produce a 5% smaller pack (which on its own
slows things down to 36.92 seconds -- that's roughly a 17%
slowdown), and then reusing delta-index brought that penalty
down to 5% or so.
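
The arithmetic behind these percentages can be reproduced with a one-line
helper. The name percent_change() is an assumption for this sketch; the
inputs are the user times and pack sizes reported above.

```c
#include <assert.h>

/* Percentage by which `value` differs from `base` (positive = larger). */
static double percent_change(double base, double value)
{
	assert(base > 0.0);
	return (value - base) / base * 100.0;
}
```

percent_change(31.61, 36.92) gives about 16.8, percent_change(31.61, 33.26)
about 5.2, and percent_change(6520520, 6188418) about -5.1.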

-- >8 --

This patch applies on top of "next" to match the size heuristics
used in the "reuse delta index" patch.

 pack-objects.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/pack-objects.c b/pack-objects.c
index c0acc46..6604338 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -1032,12 +1032,6 @@ static int try_delta(struct unpacked *cu
 		max_depth -= cur_entry->delta_limit;
 	}

-	size = cur_entry->size;
-	oldsize = old_entry->size;
-	sizediff = oldsize > size ? oldsize - size : size - oldsize;
-
-	if (size < 50)
-		return -1;
 	if (old_entry->depth >= max_depth)
 		return 0;

@@ -1048,9 +1042,12 @@ static int try_delta(struct unpacked *cu
 	 * more space-efficient (deletes don't have to say _what_ they
 	 * delete).
 	 */
+	size = cur_entry->size;
 	max_size = size / 2 - 20;
 	if (cur_entry->delta)
 		max_size = cur_entry->delta_size-1;
+	oldsize = old_entry->size;
+	sizediff = oldsize < size ? size - oldsize : 0;
 	if (sizediff >= max_size)
 		return 0;
 	delta_buf = diff_delta(old->data, oldsize,
@@ -1109,6 +1106,9 @@ static void find_deltas(struct object_en
 			 */
 			continue;

+		if (entry->size < 50)
+			continue;
+
 		free(n->data);
 		n->entry = entry;
 		n->data = read_sha1_file(entry->sha1, type, &size);

^ permalink raw reply related

* Re: [PATCH] send-email: Change from Mail::Sendmail to Net::SMTP
From: Martin Langhoff @ 2006-04-28  1:04 UTC (permalink / raw)
  To: Eric Wong; +Cc: Junio C Hamano, git, Ryan Anderson
In-Reply-To: <20060428002744.GB9146@hand.yhbt.net>

On 4/28/06, Eric Wong <normalperson@yhbt.net> wrote:
> You should be able to just open a pipe to:
>         /usr/sbin/sendmail @recipients
> and just write headers\nbody to that pipe.

Sounds reasonable. I just looked at what Mail::Sendmail does and it
isn't specially interesting. (There used to be a different Perl module
that did smart things, depending on what MTA it found, but I can't
find it now).

> Perhaps allow and detect --smtp-server=/path/to/sendmail ?

Oh, it should just work with sendmail if it's there and we don't
provide --smtp-server ;-)



m

^ permalink raw reply

* Re: [PATCH] send-email: Change from Mail::Sendmail to Net::SMTP
From: Eric Wong @ 2006-04-28  0:27 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Junio C Hamano, git, Ryan Anderson
In-Reply-To: <46a038f90604261324w76f272edp93941d7e8645be8@mail.gmail.com>

Martin Langhoff <martin.langhoff@gmail.com> wrote:
> On 4/27/06, Junio C Hamano <junkio@cox.net> wrote:
> > > system that we don't need an smtp daemon. Net::SMTP doesn't know how
> > > to use /usr/bin/sendmail
> 
> 
> > Wouldn't --smtp-server=that.smtp.server work for you?  Ah, that
> > would not work if your use is to send a local mail.  Hmph...
> 
> Well, the machine knows what the smtp server is (I mean, files in /etc
> have the right values in them), but I don't think often about it. Only
> when I am installing OSs or MTAs...
> 
> I know... I'm a whiner! ;-) I'll probably do something that does an
> eval and tries Mail::Sendmail and post it.

You should be able to just open a pipe to:
	/usr/sbin/sendmail @recipients
and just write headers\nbody to that pipe.

Perhaps allow and detect --smtp-server=/path/to/sendmail ?
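
The pipe approach could look roughly like the sketch below. It is written
in C only to show the shape of the idea (git-send-email itself is a Perl
script), and the function names are invented for this example. The command
string is expected to contain the sendmail path followed by the recipients.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Write header lines, a blank separator line, then the body.
   `headers` must consist of complete "Name: value\n" lines.
   Returns 0 on success, -1 on a write error. */
static int write_message(FILE *out, const char *headers, const char *body)
{
	assert(out && headers && body);
	if (fputs(headers, out) == EOF ||
	    fputs("\n", out) == EOF ||
	    fputs(body, out) == EOF)
		return -1;
	return fflush(out) == EOF ? -1 : 0;
}

/* Pipe one message to a local MTA. `sendmail_cmd` is the full command
   line, for example "/usr/sbin/sendmail someone@example.org".
   (Hypothetical helper; the path and address are placeholders.) */
static int send_via_sendmail(const char *sendmail_cmd,
			     const char *headers, const char *body)
{
	FILE *mta = popen(sendmail_cmd, "w");
	int err;

	if (!mta)
		return -1;
	err = write_message(mta, headers, body);
	if (pclose(mta) != 0)
		err = -1;
	return err;
}
```

Note that popen() runs the command through the shell, so recipient
addresses would need proper quoting before being placed in the command
string.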

-- 
Eric Wong

^ permalink raw reply

* Re: [PATCH] C version of git-count-objects
From: Junio C Hamano @ 2006-04-28  0:25 UTC (permalink / raw)
  To: Peter Hagervall; +Cc: git
In-Reply-To: <20060428001049.GA28347@brainysmurf.cs.umu.se>

Peter Hagervall <hager@cs.umu.se> writes:

> On Thu, Apr 27, 2006 at 03:07:37PM -0700, Junio C Hamano wrote:
>
> ...
>
>> +int cmd_count_objects(int ac, const char **av, char *ep)
>                                                        ^
> ...
>
>> +extern int cmd_count_objects(int argc, const char **argv, char **envp);
>                                                                   ^^
> Looks like we have a type mismatch here, no?

Interesting.  Lack of #include <builtin.h> was causing the
compiler not to notice X-<.
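
A minimal illustration of why the include matters. The extern line below
stands in for the declaration that would come from builtin.h, and the
function body is a placeholder, not the real command.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the prototype provided by the header. */
extern int cmd_count_objects(int argc, const char **argv, char **envp);

/* With the prototype visible, the definition has to match it. Writing
   "char *ep" as the third parameter here would make the compiler stop
   with a conflicting-types error instead of building a silent mismatch. */
int cmd_count_objects(int argc, const char **argv, char **envp)
{
	(void)envp;
	assert(argc >= 0);
	/* Placeholder body: fail only on an inconsistent argument vector. */
	return (argc > 0 && argv == NULL) ? 1 : 0;
}
```

Without the declaration in scope, the file compiles on its own and the
mismatch only exists between translation units, where the compiler cannot
see it.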

^ permalink raw reply

* Re: [PATCH] C version of git-count-objects
From: Peter Hagervall @ 2006-04-28  0:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vaca6k6za.fsf@assigned-by-dhcp.cox.net>

On Thu, Apr 27, 2006 at 03:07:37PM -0700, Junio C Hamano wrote:

...

> +int cmd_count_objects(int ac, const char **av, char *ep)
                                                       ^
...

> +extern int cmd_count_objects(int argc, const char **argv, char **envp);
                                                                  ^^
Looks like we have a type mismatch here, no?

	Peter

^ permalink raw reply

* Re: new gitk feature
From: Paul Mackerras @ 2006-04-27 23:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604260802050.3701@g5.osdl.org>

Linus Torvalds writes:

> Any possibility of something like that? I'd _love_ to be able to see the 
> whole tree, but with things that touch certain files or things that are 
> newer highlighted.

That should be quite doable.  How about I show the commits that are in
the highlight view in bold?  That won't conflict with the existing
yellow background for commits that match the find criteria.

> (Btw, the "revision information" is also cool things like "--unpacked". I 
> actually use "gitk --unpacked" every once in a while, just because it's 
> such a cool way to say "show me everything I've added since I packed the 
> repo last").

OK, I didn't know about --unpacked. :)  I plan to add stuff to the
view definition window to allow you to select commits to
include/exclude by reachability from given commits (by head/tag/ID)
and when I do I can add a way to say --unpacked too.

Paul.

^ permalink raw reply

* Re: bug: git-repack -a -d produces broken pack on NFS
From: Linus Torvalds @ 2006-04-27 23:54 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git
In-Reply-To: <20060427213207.GA6709@steel.home>

Ok, trying to think some more about this..

On Thu, 27 Apr 2006, Alex Riesen wrote:
> 
> $SRC/linux.git$ git repack -a -d
> Generating pack...
> Done counting 235947 objects.
> Deltifying 235947 objects.
>  100% (235947/235947) done
> Writing 235947 objects.
>  100% (235947/235947) done
> Total 235947, written 235947 (delta 182131), reused 235466 (delta 181650)
> Pack pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87 created.
> $SRC/linux.git$ git fsck-objects --full
> git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87.pack SHA1 mismatch with idx

This is interesting on so many levels.

First off, the index file or the pack-file is clearly somehow corrupt, 
because when you then try to do the "git clone" off the result later on 
(which won't actually check the SHA1's), it gets

> git-index-pack: fatal: packfile '/mnt/large/tmp/raa/tmp/.git/objects/pack/tmp-wcRvk5': bad object at offset 102601801: inflate returned -3

which means that either the offset was wrong, or the data at that offset 
was wrong.

That made me suspect the object re-use code - it might have been broken in 
the original pack, and then on re-use the broken data would have been just 
copied over.

HOWEVER - that doesn't actually fly as an explanation, because even if the 
data itself was broken, the repack would have re-generated the SHA1, so if 
the problem had been about copying an already broken pack over, you'd have 
gotten the "git clone" error, but you would _not_ have gotten the "pack 
SHA1 does not match index" error.

So in order for the SHA1 to not match, we literally must have corrupted 
things when we created the pack-file.

However, I've stared and stared at the sha1file writing code, and I don't 
see how you _could_ corrupt it. We use it with interruptible file 
descriptors all the time (sockets - the exact same code is used to 
transfer packs over the network), and that "intr" shouldn't matter one 
whit. We're doing very safe things, as far as I can tell.

The thing is, even if a wild pointer corrupts the write buffer for the 
sha1file writing code somehow, we actually always do the "calculate the 
SHA1" and "flush the buffer to the file" together. So even if somebody 
corrupted the buffer, we'd still generate the "right" SHA1 (of the 
corrupted buffer).

So the only thing that I can see that can generate bad SHA1 checksums is
 - actual problem in the SHA1 buffers themselves (ie a wild pointer 
   corrupting the "SHA1_CTX" thing itself)
 - real filesystem corruption. With NFS, the UDP checksums aren't all that 
   strong, but the ethernet CRC should catch things (there have been 
   reports of network cards that don't check the CRC well, but quite 
   frankly, I haven't seen one in a _loong_ time)
 - RAM corruption and/or kernel NFS bugs.
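
The "checksum and flush together" property can be sketched as below. This
is not the sha1file code: a toy 32-bit FNV-1a checksum stands in for SHA1,
and struct csumfile and its functions are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct csumfile {
	FILE *out;
	uint32_t sum;			/* toy stand-in for the SHA1 context */
	size_t used;
	unsigned char buf[8192];
};

static void csumfile_init(struct csumfile *f, FILE *out)
{
	assert(f && out);
	f->out = out;
	f->sum = 2166136261u;		/* FNV-1a offset basis */
	f->used = 0;
}

/* Checksum and write exactly the same buffered bytes in one step, so the
   checksum always describes what was handed to the file. */
static int csumfile_flush(struct csumfile *f)
{
	size_t i;

	for (i = 0; i < f->used; i++)
		f->sum = (f->sum ^ f->buf[i]) * 16777619u;
	if (fwrite(f->buf, 1, f->used, f->out) != f->used)
		return -1;
	f->used = 0;
	return 0;
}

/* Buffer data, flushing whenever the buffer fills up. */
static int csumfile_write(struct csumfile *f, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len) {
		size_t n = sizeof(f->buf) - f->used;

		if (n > len)
			n = len;
		memcpy(f->buf + f->used, p, n);
		f->used += n;
		p += n;
		len -= n;
		if (f->used == sizeof(f->buf) && csumfile_flush(f) < 0)
			return -1;
	}
	return 0;
}
```

If something scribbles over buf before a flush, the checksum still covers
the scribbled bytes that get written, which is why a mismatch points at
the checksum state itself or at something below the write() call.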

I'll continue to stare at the code, but I can't see anything even remotely 
suspicious in git itself so far.

		Linus

^ permalink raw reply

* Re: [PATCH] Fix cg-status with recent git versions
From: Nicolas Vilz @ 2006-04-27 23:46 UTC (permalink / raw)
  To: git
In-Reply-To: <20060427223826.10772.55883.stgit@dv.roinet.com>

On Thu, Apr 27, 2006 at 06:38:26PM -0400, Pavel Roskin wrote:
> From: Pavel Roskin <proski@gnu.org>
> 
> git-diff-index checks the arguments by lstat(), so an empty string would
> fail to be recognized as a file.  Use "--" to separate files from
> revisions, and also use "." instead of the empty string.

Thank you very much for recognizing this. I was tempted to report that
bug, but was not sure whether it had been fixed yet.

Sincerely
Nicolas

^ permalink raw reply

* Re: Two gitweb feature requests
From: Ben Clifford @ 2006-04-27 22:54 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Kay Sievers, git
In-Reply-To: <1146144425.11909.450.camel@pmac.infradead.org>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 596 bytes --]

On Thu, 27 Apr 2006, David Woodhouse wrote:

> It would be useful if I could get away with giving just one URL --
> probably the http:// one to gitweb. If gitweb were to have a mode in
> which it gave a referral to the git:// URL, and if the git tools would
> use that, then that would work well.

HTML has a <link> element which can be used to indicate alternate forms of 
a page. Gitweb already generates one to point people at the RSS 
feeds.

Kinda messy to make all the git tools learn how to read HTML, though...

-- 
Ben べン Бэн
http://www.hawaga.org.uk/ben/

^ permalink raw reply

* Re: bug: git-repack -a -d produces broken pack on NFS
From: Junio C Hamano @ 2006-04-27 22:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604271526140.3701@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Right now, if the pack-file is corrupt, it doesn't actually tell us so. It 
> says that it doesn't match the index file. Which is likely wrong - it 
> probably _does_ match the index file, but it's been corrupted.
>
> See the difference?

Makes perfect sense.  Thanks.

^ permalink raw reply

* [PATCH] Fix cg-status with recent git versions
From: Pavel Roskin @ 2006-04-27 22:38 UTC (permalink / raw)
  To: Petr Baudis, git

From: Pavel Roskin <proski@gnu.org>

git-diff-index checks the arguments by lstat(), so an empty string would
fail to be recognized as a file.  Use "--" to separate files from
revisions, and also use "." instead of the empty string.

Signed-off-by: Pavel Roskin <proski@gnu.org>
---

 cg-status |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/cg-status b/cg-status
index d11762e..0529ba2 100755
--- a/cg-status
+++ b/cg-status
@@ -233,7 +233,7 @@ if [ "$workstatus" ]; then
 		commitignore=
 		[ -s "$_git/commit-ignore" ] && commitignore=1
 
-		git-diff-index HEAD "$basepath" | cut -f5- -d' ' | 
+		git-diff-index HEAD -- "${basepath:-.}" | cut -f5- -d' ' | 
 		while IFS=$'\t' read -r mode file; do
 			if [ "$mode" = D ]; then
 				[ "$(git-diff-files "$file")" ] && mode=!

^ permalink raw reply related

* Re: bug: git-repack -a -d produces broken pack on NFS
From: Linus Torvalds @ 2006-04-27 22:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v4q0ek6i3.fsf@assigned-by-dhcp.cox.net>

On Thu, 27 Apr 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > That said, the pack-file should all be written with the "sha1write()" 
> > interface, which is very careful indeed.
> >
> > I wonder if the _pack-file_ itself might be ok, and the problem is an 
> > index file corruption. For some reason we check the index file first, 
> > which is insane. We should check that the pack-file matches its _own_ SHA1 
> > first, and check the index file second.
> 
> We need to check both, so I fail to see why the order matters.

It's insane to do any _cross_-file checking before you've even verified 
that the files themselves are valid.

Another way of saying the same thing: checking whether the SHA1 of the 
pack-file matches the index file is pointless before you've verified that 
the SHA1 itself is valid.

Basically, if the pack-file is corrupt, you want to know that. You don't 
want to know that its SHA1 doesn't match the index file - that's a 
"secondary" issue to the fact that the SHA1 wasn't correct in the first 
place.

Right now, if the pack-file is corrupt, it doesn't actually tell us so. It 
says that it doesn't match the index file. Which is likely wrong - it 
probably _does_ match the index file, but it's been corrupted.

See the difference?
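
The ordering argued for here can be sketched as follows. This is not git's
verify_pack(): a toy 32-bit FNV-1a checksum stands in for SHA1, and the
function name, enum, and parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum verify_result { PACK_OK, PACK_CORRUPT, PACK_IDX_MISMATCH };

/* Toy checksum standing in for SHA1. */
static uint32_t toy_sum(const unsigned char *p, size_t n)
{
	uint32_t h = 2166136261u;

	while (n--)
		h = (h ^ *p++) * 16777619u;
	return h;
}

/* pack_trailer: checksum stored at the end of the pack file.
   idx_copy:     copy of that checksum recorded in the index file. */
static enum verify_result
verify_pack_sketch(const unsigned char *payload, size_t len,
		   uint32_t pack_trailer, uint32_t idx_copy)
{
	assert(payload || len == 0);

	/* Step 1: is the pack file internally consistent? */
	if (toy_sum(payload, len) != pack_trailer)
		return PACK_CORRUPT;

	/* Step 2: only now is the cross-file comparison meaningful. */
	if (pack_trailer != idx_copy)
		return PACK_IDX_MISMATCH;

	return PACK_OK;
}
```

With this order, a damaged pack is reported as corrupt rather than as a
mismatch with the index.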

			Linus

^ permalink raw reply

* Re: bug: git-repack -a -d produces broken pack on NFS
From: Linus Torvalds @ 2006-04-27 22:18 UTC (permalink / raw)
  To: Alex Riesen, Junio C Hamano; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0604271500500.3701@g5.osdl.org>

On Thu, 27 Apr 2006, Linus Torvalds wrote:
> 
> I wonder if the _pack-file_ itself might be ok, and the problem is an 
> index file corruption.

Hmm. verify_pack() actually checks that the index file matches its own 
SHA1 earlier, so the index file will have passed (my suggested patch is 
still correct: the same way we check the index file internal integrity 
first, we should also check the pack-file internal integrity before we 
bother to cross-check them with each other).

Anyway, the index file SHA1 check means that it's unlikely that the index 
file was corrupt. But it would be interesting to hear if the pack-file was 
internally consistent or not.. (Something that git-pack-check didn't check 
in your case, because it checked the pack-file against the index file data 
first).

		Linus

^ permalink raw reply

* Re: bug: git-repack -a -d produces broken pack on NFS
From: Junio C Hamano @ 2006-04-27 22:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604271500500.3701@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> That said, the pack-file should all be written with the "sha1write()" 
> interface, which is very careful indeed.
>
> I wonder if the _pack-file_ itself might be ok, and the problem is an 
> index file corruption. For some reason we check the index file first, 
> which is insane. We should check that the pack-file matches its _own_ SHA1 
> first, and check the index file second.

We need to check both, so I fail to see why the order matters.

> If it's just the index file that is corrupt, you may even have a chance to 
> recover the data.
>
> The index file is also written with sha1write(), though, so I really don't 
> see where it would break. Unless you just simply literally have data 
> corruption on the server for some strange reason.

I haven't seen this, and was wondering why.

Independently, and probably unrelated, another person
reported failure while cloning, but from the log it appeared to have
trouble spawning the git-index-pack executable for some reason.

^ permalink raw reply
