Git development
* Re: parsecvs tool now creates git repositories
From: Keith Packard @ 2006-04-06 20:12 UTC
  To: Jim Radford; +Cc: keithp, Git Mailing List
In-Reply-To: <20060406181502.GA15741@blackbean.org>

[-- Attachment #1: Type: text/plain, Size: 560 bytes --]

On Thu, 2006-04-06 at 11:15 -0700, Jim Radford wrote:
> Hi Keith,
> 
> Here's one more build patch.  For some reason the Fedora lex doesn't
> want a space after the -o.

I probably shouldn't even use the -o flag; all it does is change the
#line directives in the output file to point at lex.c instead of
<stdout>. I'm sure it'll break something.

> Almost all of the errors I was seeing in the last version were fixed
> with your "branches that don't get merged back to the trunk" fix.

That's good news at least.

-- 
keith.packard@intel.com


* [PATCH] fix gitk with lots of tags
From: Jim Radford @ 2006-04-06 20:36 UTC
  To: Paul Mackerras; +Cc: Junio C Hamano, Git Mailing List

Hi Paul,

This fix allows gitk to be used on repositories with lots of tags.  It
bypasses git-rev-parse and passes its arguments to git-rev-list
directly to avoid the command line length restrictions.

Signed-Off-By: Jim Radford <radford@blackbean.org>

-Jim

---
diff --git a/gitk b/gitk
index 26fa79a..40672fb 100755
--- a/gitk
+++ b/gitk
@@ -17,19 +17,11 @@ proc gitdir {} {
 }
 
 proc parse_args {rargs} {
-    global parsed_args
-
-    if {[catch {
-	set parse_args [concat --default HEAD $rargs]
-	set parsed_args [split [eval exec git-rev-parse $parse_args] "\n"]
-    }]} {
-	# if git-rev-parse failed for some reason...
-	if {$rargs == {}} {
-	    set rargs HEAD
-	}
-	set parsed_args $rargs
+    if {$rargs == {}} {
+        return HEAD
+    } else {
+	return $rargs
     }
-    return $parsed_args
 }
 
 proc start_rev_list {rlargs} {


* Re: Cygwin can't handle huge packfiles?
From: linux @ 2006-04-06 20:57 UTC
  To: git, junkio; +Cc: linux

> Right now we LRU the pack files and evict older ones when we
> mmap too many, but the unit of eviction is the whole file, so it
> would not help the case like yours at all.  It might be possible
> to mmap only part of a packfile, but it would involve fairly
> major surgery to sha1_file.c.

The simplest solution seems to be to limit pack file size to a reasonable
fraction of a 32-bit address space.  Say, 0.5 G.

That should be a fairly straightforward hack to git-pack-objects.
It already emits two files; just make it emit more.

You can tweak the heuristics to try to find a good break point: start
thinking about splitting the pack when you get to one size, but don't
force a break until you hit a harder limit as long as the deltas are
working well.

This can all be adjustable with a command line and/or config file option
to allow for the eventual demise of 32-bit systems.
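
The two-threshold idea can be sketched in C as follows. This is only an
illustration under assumptions, not code from git-pack-objects: the function
name, the soft/hard parameters, and the use of "the next object is not a
delta" as a stand-in for "the deltas are no longer working well" are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether to close the current pack before writing the next object.
   All names here are hypothetical; nothing below is an existing git API.
     pack_size: bytes already written to the current pack
     obj_size:  bytes the next object would add
     is_delta:  next object is stored as a delta against this pack
     soft:      size at which we start looking for a break point
     hard:      size we never exceed (e.g. 0.5 G on a 32-bit system) */
bool should_split_pack(uint64_t pack_size, uint64_t obj_size, bool is_delta,
                       uint64_t soft, uint64_t hard)
{
    assert(soft <= hard);
    if (pack_size == 0)
        return false;               /* never emit an empty pack */
    if (pack_size + obj_size > hard)
        return true;                /* hard limit: always break */
    if (pack_size >= soft && !is_delta)
        return true;                /* soft limit: break at a non-delta object */
    return false;
}
```

Both limits are passed in rather than hard-coded so they could come from a
command line or config file option, as suggested above.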


* Re: Fix up diffcore-rename scoring
From: Geert Bosch @ 2006-04-06 21:01 UTC
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0603122316160.3618@g5.osdl.org>

[-- Attachment #1: Type: text/plain, Size: 5171 bytes --]


On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
> It might be that the fast delta thing is a good way to ask "is this even
> worth considering", to cut down the O(m*n) rename/copy detection to
> something much smaller, and then use xdelta() to actually figure out what
> is a good rename and what isn't from a much smaller set of potential
> targets.

Here's a possible way to do that first cut. Basically,
compute a short (256-bit) fingerprint for each file, such
that the Hamming distance between two fingerprints is a measure
for their similarity. I'll include a draft write up below.

My initial implementation seems reasonably fast, works
great for 4000 (decompressed) files (25M) randomly plucked
from an old git.git repository without packs. It works OK for
comparing tar archives for GCC releases, but then it becomes
clear that random walks aren't that random anymore and
become dominated by repeated information, such as tar headers.

Speed is about 10MB/sec on my PowerBook, but one could cache
fingerprints so they only need to be computed once.
The nice thing is that one can quickly find similar files
only using the fingerprint (and in practice file size),
no filenames: this seems to fit the git model well.

I'll attach my test implementation below; it uses
David Mazieres's rabinpoly code and D. Phillips's fls code.
Please don't mind my C coding, it's not my native language.
Also, this may have some Darwinisms, although it should
work on Linux too.

   -Geert

Estimating Similarity

For estimating similarity between strings A and B, let
SA and SB be the collection of all substrings with length
W of A and B. Similarity now is defined as the ratio of
the intersection and the union of SA and SB.
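
As a point of reference, the exact ratio defined above can be computed
directly for small inputs. The sketch below is not part of the attached
implementation; the function names are made up for illustration, and it
sorts pointers to every window, so it is only practical for short strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static size_t cmp_width;   /* window size W used by the qsort comparator */

static int cmp_window(const void *a, const void *b)
{
    return memcmp(*(const char *const *)a, *(const char *const *)b, cmp_width);
}

/* Build the sorted set of distinct length-w substrings of s.
   Returns a malloc'ed array of pointers into s; *count receives its size. */
static const char **window_set(const char *s, size_t len, size_t w, size_t *count)
{
    size_t n = len - w + 1, i, k = 0;
    const char **v = malloc(n * sizeof *v);
    assert(v);
    for (i = 0; i < n; i++) v[i] = s + i;
    cmp_width = w;
    qsort(v, n, sizeof *v, cmp_window);
    for (i = 0; i < n; i++)               /* drop duplicate windows */
        if (k == 0 || memcmp(v[k - 1], v[i], w) != 0) v[k++] = v[i];
    *count = k;
    return v;
}

/* |SA intersect SB| / |SA union SB|; both inputs must be at least w bytes. */
double substring_similarity(const char *a, size_t alen,
                            const char *b, size_t blen, size_t w)
{
    size_t na, nb, i = 0, j = 0, common = 0;
    const char **sa, **sb;
    assert(w > 0 && alen >= w && blen >= w);
    sa = window_set(a, alen, w, &na);
    sb = window_set(b, blen, w, &nb);
    while (i < na && j < nb) {            /* merge-style walk of two sorted sets */
        int c = memcmp(sa[i], sb[j], w);
        if (c == 0) { common++; i++; j++; }
        else if (c < 0) i++;
        else j++;
    }
    free(sa);
    free(sb);
    return (double)common / (double)(na + nb - common);
}
```

With W = 2, "abcd" and "bcde" share two of four distinct windows, giving a
similarity of 0.5. The fingerprint scheme described below approximates this
ratio without storing the sets.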

The length W of these substrings is the window size, and here is
chosen somewhat arbitrarily to be 48. The idea is to make them not
so short that all context is lost (like counting symbol frequencies),
but not so long that a few small changes can affect a large portion
of substrings.  Of course, a single symbol change may affect up to
48 substrings.

Let "&" be the string concatenation operator.
If A = S2 & S1 & S2 & S3 & S2, and B = S2 & S3 & S2 & S1 & S2,
then if the length of S2 is at least W - 1, the strings
will have the same set of substrings and be considered equal
for purpose of similarity checking.  This behavior is actually
welcome, since reordering sufficiently separated pieces of a
document does not make it substantially different.

Instead of computing the ratio of identical substrings directly,
compute a 1-bit hash for each substring and calculate the difference
between the number of zeroes and ones. If the hashes appear random,
this difference follows a binomial distribution. Two files are
considered "likely similar" if their differences have the same sign.

The assumption that the hashes are randomly distributed is not
true if there are many repeated substrings. For most applications,
it will be sufficient to ignore such repetitions (by using a small
cache of recently encountered hashes) as they do not convey much
actual information. For example, for purposes of finding small
deltas between strings, duplicating existing text will not significantly
increase the delta.
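
A small direct-mapped cache is one way to ignore such repetitions. The
sketch below is modelled on what the attached gsimm.c does with its
dup_cache array, but the type and function names are invented here for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DUP_CACHE_SIZE 256   /* power of two, small enough for L1 cache */

typedef struct {
    uint32_t slot[DUP_CACHE_SIZE];
} dup_cache;

void dup_cache_init(dup_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Returns 1 if this hash was seen recently (the caller should skip it),
   otherwise records it and returns 0.  The high 32 bits of the hash pick
   the slot and the low 32 bits are the tag stored there.  Note that a tag
   of 0 counts as "seen" in a freshly zeroed cache. */
int dup_cache_seen(dup_cache *c, uint64_t hash)
{
    uint32_t idx = (uint32_t)(hash >> 32) & (DUP_CACHE_SIZE - 1);
    uint32_t tag = (uint32_t)hash;
    assert(c);
    if (c->slot[idx] == tag)
        return 1;
    c->slot[idx] = tag;      /* evict whatever occupied the slot */
    return 0;
}
```

Because each slot holds a single tag, a different hash landing in the same
slot evicts the older entry, so only recent repetitions are filtered.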

For a string with N substrings, of which K changed, perform a random
walk of N steps in 1-dimensional space (see [1]): what is the probability
that the origin was crossed an odd number of times in the last K steps?
As the expected distance is Sqrt (2 * N / Pi), this probability
gets progressively smaller for larger N and a given ratio of N and K.
For larger files, the result should be quite stable.


In order to strengthen this similarity check and be able to
quantify the degree of similarity, many independent 1-bit hashes
are computed and counted for each string and assembled into
a bit vector of 256 bits, called the fingerprint. Each bit
of the fingerprint represents the result of an independent
statistical experiment. For similar strings, corresponding bits
are more likely to be the same than for random strings.

For efficiency, a 64-bit hash is computed using an irreducible
Rabin polynomial of degree 63. The algebraic properties
of these allow for efficient calculation over a sliding window
of the input. [2] As the cryptographic advantages of randomly
generated hash functions are not required, a fixed polynomial
has been chosen.

This 64-bit hash is expanded to 256 bits by using three bits
to select 32 of the 256 bits in the fingerprint to update.
So, for every 8-bit character the polynomial needs updating,
and 32 counters are incremented or decremented.
So, each of the 256 counters represents a random walk of about
N / 8 steps, for a string of length N.
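
The update step can be sketched as follows. This is a simplified
illustration, not the attached code: gsimm.c batches the updates in groups
of bits for speed, while this version touches each counter individually.
The names fp_update and fp_finish are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FP_BITS 256
#define GROUP   32                   /* counters touched per window hash */

/* Apply one 64-bit window hash to the 256 signed counters.  Three bits
   taken from the high half select one of 8 groups of 32 counters; each
   bit of the low half then moves one counter up or down by one. */
void fp_update(int32_t counter[FP_BITS], uint64_t hash)
{
    unsigned base = (unsigned)((hash >> 32) % (FP_BITS / GROUP)) * GROUP;
    uint32_t bits = (uint32_t)hash;  /* 32 one-bit hashes */
    int k;
    for (k = 0; k < GROUP; k++)
        counter[base + k] += ((bits >> k) & 1) ? 1 : -1;
}

/* Collapse the counters into a 32-byte fingerprint: a bit is set when
   the corresponding random walk ended above zero (MSB first per byte). */
void fp_finish(const int32_t counter[FP_BITS], unsigned char fp[FP_BITS / 8])
{
    int j;
    assert(counter && fp);
    memset(fp, 0, FP_BITS / 8);
    for (j = 0; j < FP_BITS; j++)
        if (counter[j] > 0)
            fp[j / 8] |= (unsigned char)(0x80 >> (j % 8));
}
```

Keeping the group selector in the high half and the one-bit hashes in the
low half keeps the choice of counters independent of the values added to
them.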

The similarity of A and B can now be expressed as the Hamming
distance between the two bit vectors, divided by the expected
distance between two random vectors. This similarity score is
a number between 0 and 2, where smaller values mean the strings
are more similar, and values of 1 or more mean they are dissimilar.
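
A minimal sketch of that score, assuming 256-bit fingerprints stored as 32
bytes. The function names are illustrative; the attached gsimm.c has its
own dist() routine and reports a percentage instead.

```c
#include <assert.h>
#include <stddef.h>

#define FP_BYTES 32   /* 256-bit fingerprint */

/* Hamming distance: number of differing bits between two fingerprints. */
int fp_distance(const unsigned char *a, const unsigned char *b)
{
    int d = 0;
    size_t j;
    assert(a && b);
    for (j = 0; j < FP_BYTES; j++) {
        unsigned x = a[j] ^ b[j];
        while (x) {          /* clear the lowest set bit until none remain */
            x &= x - 1;
            d++;
        }
    }
    return d;
}

/* Distance divided by the expected distance between two random
   fingerprints, which is half the bits (128).  Range is 0 to 2. */
double fp_score(const unsigned char *a, const unsigned char *b)
{
    return (double)fp_distance(a, b) / (FP_BYTES * 8 / 2);
}
```

Identical fingerprints score 0, fingerprints differing in half their bits
score 1, and bitwise complements score 2.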

One of the unique properties of this fingerprint is the
ability to compare files in different locations by only
transmitting their fingerprint.



[-- Attachment #2: gsimm.c --]
[-- Type: application/octet-stream, Size: 10801 bytes --]

#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <assert.h>
#include <math.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#include "rabinpoly.h"

/* Length of file message digest (MD) in bytes. Longer MD's are
   better, but increase processing time for diminishing returns.
   Must be multiple of NUM_HASHES_PER_CHAR / 8, and at least 24
   for good results 
*/
#define MD_LENGTH 32
#define MD_BITS (MD_LENGTH * 8)

/* Has to be power of two. Since the Rabin hash only has 63
   usable bits, the number of hashes is limited to 32.
   Lower powers of two could be used for speeding up processing
   of very large files.  */
#define NUM_HASHES_PER_CHAR 32


/* For the final counting, do not count each bit individually, but
   group them. Must be power of two, at most NUM_HASHES_PER_CHAR.
   However, larger sizes result in higher cache usage. Use 8 bits
   per group for efficient processing of large files on fast machines
   with decent caches, or 4 bits for faster processing of small files
   and for machines with small caches.  */
#define GROUP_BITS 4
#define GROUP_COUNTERS (1<<GROUP_BITS)


/* The RABIN_WINDOW_SIZE is the size of fingerprint window used by 
   Rabin algorithm. This is not a modifiable parameter.

   The first RABIN_WINDOW_SIZE - 1 bytes are skipped, in order to ensure
   fingerprints are good hashes. This does somewhat reduce the
   influence of the first few bytes in the file (they're part of
   fewer windows, like the last few bytes), but that actually isn't
   so bad as files often start with fixed content that may bias comparisons.
*/

/* The MIN_FILE_SIZE indicates the absolute minimal file size that
   can be processed. As indicated above, the first and last 
   RABIN_WINDOW_SIZE - 1 bytes are skipped. 
   In order to get at least an average of 12 samples
   per bit in the final message digest, require at least 3 * MD_LENGTH
   complete windows in the file.  */
#define MIN_FILE_SIZE (3 * MD_LENGTH + 2 * (RABIN_WINDOW_SIZE - 1))

/* Limit matching algorithm to files less than 256 MB, so we can use
   32 bit integers everywhere without fear of overflow. For larger
   files we should add logic to mmap the file by piece and accumulate
   the frequency counts. */
#define MAX_FILE_SIZE (256*1024*1024 - 1)

/* Size of cache used to eliminate duplicate substrings.
   Make small enough to comfortably fit in L1 cache.  */
#define DUP_CACHE_SIZE 256

#define MIN(x,y) ((y)<(x) ? (y) : (x))
#define MAX(x,y) ((y)>(x) ? (y) : (x))

typedef struct fileinfo
{ char		*name;
  size_t	length;
  u_char	md[MD_LENGTH];
  int		match;
} File;

int flag_verbose = 0;
int flag_debug = 0;
int flag_warning = 0;
char *flag_relative = 0;

char cmd[12] = "        ...";
char md_strbuf[MD_LENGTH * 2 + 1];
u_char relative_md [MD_LENGTH];

File *file;
int    file_count;
size_t file_bytes;

FILE *msgout;

char hex[17] = "0123456789abcdef";
double pi = 3.14159265358979323846;

int freq[MD_BITS];
u_int64_t freq_dups = 0;

void usage()
{  fprintf (stderr, "usage: %s [-dhvw] [-r fingerprint] file ...\n", cmd);
   fprintf (stderr, " -d\tdebug output, repeat for more verbosity\n");
   fprintf (stderr, " -h\tshow this usage information\n");
   fprintf (stderr, " -r\tshow distance relative to fingerprint "
                    "(%u hex digits)\n", MD_LENGTH * 2);
   fprintf (stderr, " -v\tverbose output, repeat for even more verbosity\n");
   fprintf (stderr, " -w\tenable warnings for suspect statistics\n");
   exit (1);
}

int dist (u_char *l, u_char *r)
{ int j, k;
  int d = 0;

  for (j = 0; j < MD_LENGTH; j++)
  { u_char ch = l[j] ^ r[j];

    for (k = 0; k < 8; k++) d += ((ch & (1<<k)) > 0);
  } 

  return d;
}

char *md_to_str(u_char *md)
{ int j;

  for (j = 0; j < MD_LENGTH; j++)
  { u_char ch = md[j];

    md_strbuf[j*2] = hex[ch >> 4];
    md_strbuf[j*2+1] = hex[ch & 0xF];
  }

  md_strbuf[j*2] = 0;
  return md_strbuf;
}

u_char *str_to_md(char *str, u_char *md)
{ int j;

  if (!md || !str) return 0;

  bzero (md, MD_LENGTH);
  
  for (j = 0; j < MD_LENGTH * 2; j++)
  { char ch = str[j];

    if (ch >= '0' && ch <= '9')
    { md [j/2] = (md [j/2] << 4) + (ch - '0'); 
    }
    else
    { ch |= 32;

      if (ch < 'a' || ch > 'f') break;
      md [j/2] = (md[j/2] << 4) + (ch - 'a' + 10);
  } } 

  return (j != MD_LENGTH * 2 || str[j] != 0) ? 0 : md;
}
    
void freq_to_md(u_char *md)
{ int j, k;
  int num = MD_BITS;

  for (j = 0; j < MD_LENGTH; j++)
  { u_char ch = 0;

    for (k = 0; k < 8; k++) ch = 2*ch + (freq[8*j+k] > 0);
    md[j] = ch;
  }

  if (flag_debug)
  { for (j = 0; j < num; j++)
    { if (j % 8 == 0) printf ("\n%3u: ", j);
      printf ("%7i ", freq[j]);
    }
    printf ("\n");
  }
  bzero (freq, sizeof(freq));
  freq_dups = 0;
}

void process_data (char *name, u_char *data, unsigned len, u_char *md)
{ size_t j = 0;
  u_int32_t ofs;
  u_int32_t dup_cache[DUP_CACHE_SIZE];
  u_int32_t count [MD_BITS * (GROUP_COUNTERS/GROUP_BITS)];
  bzero (dup_cache, DUP_CACHE_SIZE * sizeof (u_int32_t));
  bzero (count, (MD_BITS * (GROUP_COUNTERS/GROUP_BITS) * sizeof (u_int32_t)));

  /* Ignore incomplete substrings */
  while (j < len && j < RABIN_WINDOW_SIZE) rabin_slide8 (data[j++]);

  while (j < len)
  { u_int64_t hash;
    u_int32_t ofs, sum;
    u_char idx;
    int k;

    hash = rabin_slide8 (data[j++]);

    /* In order to update a much larger frequency table
       with only 32 bits of checksum, randomly select a
       part of the table to update. The selection should
       only depend on the content of the represented data,
       and be independent of the bits used for the update.
       
       Instead of updating 32 individual counters, process
       the checksum in MD_BITS / GROUP_BITS groups of 
       GROUP_BITS bits, and count the frequency of each bit pattern.
    */

    idx = (hash >> 32);
    sum = (u_int32_t) hash;
    ofs = idx % (MD_BITS / NUM_HASHES_PER_CHAR) * NUM_HASHES_PER_CHAR;
    idx %= DUP_CACHE_SIZE;
    if (dup_cache[idx] == sum)
    { freq_dups++; 
    }
    else
    { dup_cache[idx] = sum; 
      for (k = 0; k < NUM_HASHES_PER_CHAR / GROUP_BITS; k++)
      { count[ofs * GROUP_COUNTERS / GROUP_BITS + (sum % GROUP_COUNTERS)]++;
        ofs += GROUP_BITS;
        sum >>= GROUP_BITS;
  } } }

  /* Distribute the occurrences of each bit group over the frequency table. */
  for (ofs = 0; ofs < MD_BITS; ofs += GROUP_BITS)
  { int j;
    for (j = 0; j < GROUP_COUNTERS; j++)
    { int k;
      for (k = 0; k < GROUP_BITS; k++)
      { freq[ofs + k] += ((1<<k) & j) 
          ? count[ofs * GROUP_COUNTERS / GROUP_BITS + j]
          : -count[ofs * GROUP_COUNTERS / GROUP_BITS + j];
  } } }
      
  { int j;
    int num = MD_BITS;
    int stat_warn = 0;
    double sum = 0.0;
    double sumsqr = 0.0;
    double average, variance, stddev, bits, exp_average, max_average;

    assert (num >= 2);

    sum = 0;

    for (j = 0; j < num; j++)
    { double f = fabs ((double) freq[j]);
      sum += f;
      sumsqr += f*f;
    }

    variance = (sumsqr - (sum * sum / num)) / (num - 1);
    average = sum / num;
    stddev = sqrt (variance);
    bits = (NUM_HASHES_PER_CHAR * (file[file_count].length - freq_dups)) 
             / (8 * MD_LENGTH);
    /* Random files, or short files with few repetitions should have
       average very close to the expected average. Large deviations
       show there is too much redundancy, or there is another problem
       with the statistical fundamentals of the algorithm. */
    exp_average = sqrt (2 * bits / pi);
    max_average = 2.0 * pow (2 * bits / pi, 0.6);

    stat_warn = flag_warning
      && (average < exp_average * 0.5 || average > max_average);
    if (stat_warn)
    { fprintf (stdout, "%s: warning: "
               "too much redundancy, fingerprint may not be accurate\n",
               file[file_count].name);
      
    }

    if (flag_verbose > 1 || (flag_verbose && stat_warn))
    { printf 
        ("%i frequencies, average %5.1f, std dev %5.1f, %2.1f %% duplicates, "
         "\"%s\"\n",
         num, average, stddev,
         100.0 * freq_dups / (double) file[file_count].length,
         file[file_count].name);
      printf
        ("%1.0f expected bits per frequency, "
         "expected average %1.1f, max average %1.1f\n",
         bits, exp_average, max_average);
  } }

  if (md)
  { rabin_reset();
    freq_to_md (md);
    if (flag_relative)
    { int d = dist (md, relative_md);
      double sim = 1.0 - MIN (1.0, (double) (d) / (MD_LENGTH * 4 - 1));
      fprintf (stdout, "%s %llu %u %s %u %3.1f\n", 
               md_to_str (md), (long long unsigned) 0, len, name, 
               d, 100.0 * sim);
    }
    else
    {
      fprintf (stdout, "%s %llu %u %s\n", 
               md_to_str (md), (long long unsigned) 0, len, name);
} } }

void process_file (char *name)
{ int fd;
  struct stat fs;
  u_char *data;
  File *fi = file+file_count;

  fd = open (name, O_RDONLY, 0);
  if (fd < 0) 
  { perror (name);
    exit (2);
  }

  if (fstat (fd, &fs))
  { perror (name);
    exit (2);
  }

  if (fs.st_size >= MIN_FILE_SIZE
      && fs.st_size <= MAX_FILE_SIZE)
  { fi->length = fs.st_size;
    fi->name = name;

    data = (u_char *) mmap (0, fs.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    if (data == (u_char *) -1)
    { perror (name);
      exit (2);
    }

    process_data (name, data, fs.st_size, fi->md);
    munmap (data, fs.st_size);
    file_bytes += fs.st_size;
    file_count++;
  } else if (flag_verbose) 
  { fprintf (stdout, "skipping %s (size %llu)\n", name,
             (unsigned long long) fs.st_size); }

  close (fd);
}

int main (int argc, char *argv[])
{ int ch, j;

  strncpy (cmd, basename (argv[0]), 8);
  msgout = stdout;

  while ((ch = getopt(argc, argv, "dhr:vw")) != -1)
  { switch (ch) 
    { case 'd': flag_debug++;
		break;
      case 'r': if (!optarg)
                { fprintf (stderr, "%s: missing argument for -r\n", cmd);
                  return 1;
                }
                if (str_to_md (optarg, relative_md)) flag_relative = optarg;
                else
                { fprintf (stderr, "%s: not a valid fingerprint\n", optarg);
                  return 1;
                }
                break;
      case 'v': flag_verbose++;
                break;
      case 'w': flag_warning++;
                break;
      default : usage();
                return (ch != 'h');
  } }

  argc -= optind;
  argv += optind;

  if (argc == 0) usage();

  rabin_reset ();
  if (flag_verbose && flag_relative)
  { fprintf (stdout, "distances are relative to %s\n", flag_relative);
  }

  file = (File *) calloc (argc, sizeof (File));

  for (j = 0; j < argc; j++) process_file (argv[j]);

  if (flag_verbose) 
  { fprintf (stdout, "%llu bytes in %i files\n",
             (unsigned long long) file_bytes, file_count);
  }

  return 0;
}

[-- Attachment #3: rabinpoly.c --]
[-- Type: application/octet-stream, Size: 3648 bytes --]

/*
 *
 * Copyright (C) 1999 David Mazieres (dm@uun.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 *
 */

  /* Faster generic_fls */
  /* (c) 2002, D.Phillips and Sistina Software */

#include "rabinpoly.h"
#define MSB64 0x8000000000000000ULL

static inline unsigned fls8(unsigned n)
{
       return n & 0xf0?
           n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
           n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
}

static inline unsigned fls16(unsigned n)
{
       return n & 0xff00? fls8(n >> 8) + 8: fls8(n);
}

static inline unsigned fls32(unsigned n)
{
       return n & 0xffff0000? fls16(n >> 16) + 16: fls16(n);
}

static inline unsigned fls64(unsigned long long n) /* should be u64 */
{
       return n & 0xffffffff00000000ULL? fls32(n >> 32) + 32: fls32(n);
}


static u_int64_t polymod (u_int64_t nh, u_int64_t nl, u_int64_t d);
static void      polymult (u_int64_t *php, u_int64_t *plp,
                           u_int64_t x, u_int64_t y);
static u_int64_t polymmult (u_int64_t x, u_int64_t y, u_int64_t d);

static u_int64_t poly = 0xb15e234bd3792f63ull;	// Actual polynomial
static u_int64_t T[256];			// Lookup table for mod
static int shift;

u_int64_t append8 (u_int64_t p, u_char m) 
{ return ((p << 8) | m) ^ T[p >> shift]; 
}

static u_int64_t
polymod (u_int64_t nh, u_int64_t nl, u_int64_t d)
{ assert (d);
  int i;
  int k = fls64 (d) - 1;
  d <<= 63 - k;

  if (nh) {
    if (nh & MSB64)
      nh ^= d;
    for (i = 62; i >= 0; i--)
      if (nh & 1ULL << i) {
	nh ^= d >> (63 - i);
	nl ^= d << (i + 1);
      }
  }
  for (i = 63; i >= k; i--)
    if (nl & 1ULL << i)
      nl ^= d >> (63 - i);
  return nl;
}

static void
polymult (u_int64_t *php, u_int64_t *plp, u_int64_t x, u_int64_t y)
{ int i;
  u_int64_t ph = 0, pl = 0;
  if (x & 1)
    pl = y;
  for (i = 1; i < 64; i++)
    if (x & (1ULL << i)) {
      ph ^= y >> (64 - i);
      pl ^= y << i;
    }
  if (php)
    *php = ph;
  if (plp)
    *plp = pl;
}

static u_int64_t
polymmult (u_int64_t x, u_int64_t y, u_int64_t d)
{
  u_int64_t h, l;
  polymult (&h, &l, x, y);
  return polymod (h, l, d);
}

static int size = RABIN_WINDOW_SIZE;
static u_int64_t fingerprint = 0;
static int bufpos = -1;
static u_int64_t U[256];
static u_char buf[RABIN_WINDOW_SIZE];

void rabin_init ()
{ assert (poly >= 0x100);
  u_int64_t sizeshift = 1;
  int xshift = fls64 (poly) - 1;
  int i, j;
  shift = xshift - 8;
  u_int64_t T1 = polymod (0, 1ULL << xshift, poly);
  for (j = 0; j < 256; j++)
    T[j] = polymmult (j, T1, poly) | ((u_int64_t) j << xshift);

  for (i = 1; i < size; i++)
    sizeshift = append8 (sizeshift, 0);
  for (i = 0; i < 256; i++)
    U[i] = polymmult (i, sizeshift, poly);
  bzero (buf, sizeof (buf));
}

void
rabin_reset ()
{ rabin_init();
  fingerprint = 0; 
  bzero (buf, sizeof (buf));
}

u_int64_t
rabin_slide8 (u_char m)
{ u_char om;
  if (++bufpos >= size) bufpos = 0;

  om = buf[bufpos];
  buf[bufpos] = m;
  fingerprint = append8 (fingerprint ^ U[om], m);

  return fingerprint;
}
  

[-- Attachment #4: rabinpoly.h --]
[-- Type: application/octet-stream, Size: 1015 bytes --]

/*
 *
 * Copyright (C) 2000 David Mazieres (dm@uun.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 *
 * Translated to C and simplified by Geert Bosch (bosch@gnat.com)
 */

#include <assert.h>
#include <strings.h>
#include <sys/types.h>

#ifndef RABIN_WINDOW_SIZE
#define RABIN_WINDOW_SIZE 48
#endif
void rabin_reset(); 
u_int64_t rabin_slide8(u_char c); 


* Re: parsecvs tool now creates git repositories
From: Martin Langhoff @ 2006-04-06 21:51 UTC
  To: Keith Packard; +Cc: Jim Radford, Git Mailing List
In-Reply-To: <1144354356.2303.270.camel@neko.keithp.com>

On 4/7/06, Keith Packard <keithp@keithp.com> wrote:
> > Almost all of the errors I was seeing in the last version were fixed
> > with your "branches that don't get merged back to the trunk" fix.
>
> That's good news at least.

I'm re-running my import of Moodle's cvs (20K commits) with the newer
parsecvs. The previous attempt looked very good except that

 - file additions were recorded with one-commit-per-file. I am not
sure how rcs is recording these, but the user does enter a common
message at "commit" time. Perhaps the file addition action could be
ignored then?

 - some tags made on a branch show up in HEAD. This may be due to
partial-tree branches, but I am not sure.

cheers


m


* Re: git-clone and cg-clone
From: Nicolas Vilz 'niv' @ 2006-04-06 22:14 UTC
  Cc: git
In-Reply-To: <44355978.3080205@itaapy.com>

Belmar-Letelier wrote:
> Since 0.17 to take benefit of cg-switch
> 
> I use:
> 
> $ git-clone  xxx
> $ cg-branch-add origin xxx
> 
> instead of
> 
> $ cg-clone xxx
> 
> because cg-clone did not fetch all the heads.
> 
> Is there a better way to do this ?
> 

Well, at first I was also using cg-clone, but I also realized that only
one branch is pulled from the repository.

If you use git clone, all tags and branches are pulled, so every time I
start with a fresh repository and pull its origin, I use git clone
instead of cg-clone.

I also use git checkout instead of cg-switch. I don't think I have had a
use for the effects of cg-switch; I always wanted git checkout, and
wondered about the files that were missing in the index of the new
branch.

I think that's the difference between porcelain and plumbing...

Sincerely
Nicolas


* Re: parsecvs tool now creates git repositories
From: Keith Packard @ 2006-04-06 22:19 UTC
  To: Martin Langhoff; +Cc: keithp, Jim Radford, Git Mailing List
In-Reply-To: <46a038f90604061451m4522e3f3qceae2331751a307c@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 844 bytes --]

On Fri, 2006-04-07 at 09:51 +1200, Martin Langhoff wrote:

>  - file additions were recorded with one-commit-per-file. I am not
> sure how rcs is recording these, but the user does enter a common
> message at "commit" time. Perhaps the file addition action could be
> ignored then?

If the log message is identical, and the dates are in-range, parsecvs
"should" put the adds in the same commit. 

>  - some tags made on a branch show up in HEAD. This may be due to
> partial-tree branches, but I am not sure.

Finding branch points is not perfect; it's complicated by bizarre
behaviour when adding files and casual CVS changes which make precise
branch points hard to detect. Can I get at this repository to play with?
I'd like to see if we can't get the branch point detection more
accurate.

-- 
keith.packard@intel.com


* Re: parsecvs tool now creates git repositories
From: Martin Langhoff @ 2006-04-06 23:22 UTC
  To: Keith Packard; +Cc: Jim Radford, Git Mailing List
In-Reply-To: <1144361968.2303.288.camel@neko.keithp.com>

On 4/7/06, Keith Packard <keithp@keithp.com> wrote:
> On Fri, 2006-04-07 at 09:51 +1200, Martin Langhoff wrote:
>
> >  - file additions were recorded with one-commit-per-file. I am not
> > sure how rcs is recording these, but the user does enter a common
> > message at "commit" time. Perhaps the file addition action could be
> > ignored then?
>
> If the log message is identical, and the dates are in-range, parsecvs
> "should" put the adds in the same commit.

parsecvs is committing them with the "added file foo.x" message, not
the actual commit message.

> >  - some tags made on a branch show up in HEAD. This may be due to
> > partial-tree branches, but I am not sure.
>
> Finding branch points is not perfect; it's complicated by bizarre
> behaviour when adding files and casual CVS changes which make precise
> branch points hard to detect. Can I get at this repository to play with?

I fetch it with something along the lines of...

while ( true ) ; do
     wget -qc http://cvs.sourceforge.net/cvstarballs/moodle-cvsroot.tar.bz2 &&
          break
     sleep 5
done

and then import the "moodle" module.

cheers,


m


* Re: Cygwin can't handle huge packfiles?
From: Junio C Hamano @ 2006-04-06 23:53 UTC
  To: linux; +Cc: git
In-Reply-To: <20060406205724.12216.qmail@science.horizon.com>

linux@horizon.com writes:

>> Right now we LRU the pack files and evict older ones when we
>> mmap too many, but the unit of eviction is the whole file, so it
>> would not help the case like yours at all.  It might be possible
>> to mmap only part of a packfile, but it would involve fairly
>> major surgery to sha1_file.c.
>
> The simplest solution seems to be to limit pack file size to a reasonable
> fraction of a 32-bit address space.  Say, 0.5 G.

I do not think that would help the original poster's situation
where only 5 revs result in a 1.5G pack.  I would _almost_ say
"do not pack such a repository", but there is the initial
cloning over git-aware transports which always results in a
repository with a single pack.


* [PATCH] rev-list: honor --abbrev=<n> when doing --pretty=oneline
From: Eric Wong @ 2006-04-07  0:44 UTC
  To: Junio C Hamano, git

This should make --pretty=oneline a whole lot more readable for
people using 80-column terminals.

Note that --abbrev=DEFAULT_ABBREV was on by default before (but
it only affected the printing of the Merge: header).  Let me
know if anybody doesn't want the default behavior to change.
Also note that --abbrev without arguments is not supported by
rev-list, but --no-abbrev is supported if you want the old
behavior.

Originally I made abbrev affect the commit sha1 output
regardless of the pretty setting, but that broke some tests and
I figured it's most/only useful for --pretty=oneline (at least
that's why *I* want it :)

Signed-off-by: Eric Wong <normalperson@yhbt.net>

---

 rev-list.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

c4da073e8256499950e25e2c20ea0b3ec4c29b46
diff --git a/rev-list.c b/rev-list.c
index 22141e2..392209d 100644
--- a/rev-list.c
+++ b/rev-list.c
@@ -52,7 +52,10 @@ static void show_commit(struct commit *c
 		fputs(commit_prefix, stdout);
 	if (commit->object.flags & BOUNDARY)
 		putchar('-');
-	fputs(sha1_to_hex(commit->object.sha1), stdout);
+	if (abbrev && commit_format == CMIT_FMT_ONELINE)
+		fputs(find_unique_abbrev(commit->object.sha1, abbrev), stdout);
+	else
+		fputs(sha1_to_hex(commit->object.sha1), stdout);
 	if (revs.parents) {
 		struct commit_list *parents = commit->parents;
 		while (parents) {
-- 
1.3.0.rc2.g454a-dirty


* Re: [PATCH] rev-list: honor --abbrev=<n> when doing --pretty=oneline
From: Junio C Hamano @ 2006-04-07  1:29 UTC
  To: Eric Wong; +Cc: git
In-Reply-To: <20060407004455.GF15743@hand.yhbt.net>

Eric Wong <normalperson@yhbt.net> writes:

> Note that --abbrev=DEFAULT_ABBREV was on by default before (but
> it only affected the printing of the Merge: header).  Let me
> know if anybody doesn't want the default behavior to change.

I've never felt the need for abbreviating commit object names, so
I only had the abbrev variable to determine how the merge parents
are shown.  If you want to abbreviate the commit object names as
well, you _could_ do independent precision for parents and
commits, but that would be overkill.  So I'd rather see a switch
to turn abbreviation for commits on, perhaps like this:

        $ git-rev-list --pretty=oneline --abbrev-commit -n 3 master
        454a35b Add documentation for git-imap-send.
        ba3c937 blame.c: fix completely broken ancestry traversal.
        6cbd5d7 Tweaks to make asciidoc play nice.

        $ git-rev-list --pretty=oneline --abbrev=4 --abbrev-commit -n 3 master
        454a Add documentation for git-imap-send.
        ba3c9 blame.c: fix completely broken ancestry traversal.
        6cbd5 Tweaks to make asciidoc play nice.

Otherwise you might break Porcelains and people's scripts that
read from --pretty or --header output.
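
To illustrate why --abbrev=4 can still print five digits in the
example above: the requested length is only a minimum, and the name is
extended until it is unambiguous.  A minimal Python sketch of that
idea, assuming a plain list of hex object names stands in for the
object database.  The names are made up, unique_abbrev() is a
hypothetical helper, and git's real find_unique_abbrev() may differ in
detail.

```python
def unique_abbrev(name, all_names, min_len):
    """Return the shortest prefix of `name`, at least `min_len` hex
    digits long, that no other object name in `all_names` shares."""
    others = [n for n in all_names if n != name]
    for length in range(min_len, len(name) + 1):
        prefix = name[:length]
        if not any(o.startswith(prefix) for o in others):
            return prefix
    return name  # only reached when min_len exceeds the name length


# Hypothetical 40-digit object names, invented for illustration only.
names = [
    "454a" + "0" * 36,
    "ba3c9" + "1" * 35,
    "ba3c7" + "2" * 35,
]

for n in names:
    print(unique_abbrev(n, names, 4))
```

The first name is already unique at four digits, while the other two
share "ba3c" and therefore need a fifth digit.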

-- >8 --
diff --git a/rev-list.c b/rev-list.c
index 22141e2..1301502 100644
--- a/rev-list.c
+++ b/rev-list.c
@@ -30,6 +30,7 @@ static const char rev_list_usage[] =
 "    --unpacked\n"
 "    --header | --pretty\n"
 "    --abbrev=nr | --no-abbrev\n"
+"    --abbrev-commit\n"
 "  special purpose:\n"
 "    --bisect"
 ;
@@ -39,6 +40,7 @@ struct rev_info revs;
 static int bisect_list = 0;
 static int verbose_header = 0;
 static int abbrev = DEFAULT_ABBREV;
+static int abbrev_commit = 0;
 static int show_timestamp = 0;
 static int hdr_termination = 0;
 static const char *commit_prefix = "";
@@ -52,7 +54,10 @@ static void show_commit(struct commit *c
 		fputs(commit_prefix, stdout);
 	if (commit->object.flags & BOUNDARY)
 		putchar('-');
-	fputs(sha1_to_hex(commit->object.sha1), stdout);
+	if (abbrev_commit && abbrev)
+		fputs(find_unique_abbrev(commit->object.sha1, abbrev), stdout);
+	else
+		fputs(sha1_to_hex(commit->object.sha1), stdout);
 	if (revs.parents) {
 		struct commit_list *parents = commit->parents;
 		while (parents) {
@@ -317,6 +322,14 @@ int main(int argc, const char **argv)
 		}
 		if (!strcmp(arg, "--no-abbrev")) {
 			abbrev = 0;
+			continue;
+		}
+		if (!strcmp(arg, "--abbrev")) {
+			abbrev = DEFAULT_ABBREV;
+			continue;
+		}
+		if (!strcmp(arg, "--abbrev-commit")) {
+			abbrev_commit = 1;
 			continue;
 		}
 		if (!strncmp(arg, "--abbrev=", 9)) {

^ permalink raw reply related

* Re: Cygwin can't handle huge packfiles?
From: linux @ 2006-04-07  3:05 UTC (permalink / raw)
  To: junkio, linux; +Cc: git
In-Reply-To: <7vk6a2uupy.fsf@assigned-by-dhcp.cox.net>

> I do not think that would help the original poster's situation
> where only 5 revs result in a 1.5G pack.  I would _almost_ say
> "do not pack such a repository", but there is the initial
> cloning over git-aware transports which always results in a
> repository with a single pack.

Huh?  Why not?  That repository has a lot of files.  For compression,
you want all versions of a file in one pack, and with few versions that
makes it easier to split up, not harder.

As for network transport of packs, I haven't studied the details,
but if you allow "thin packs" that have deltas relative to
objects not in the pack, then breaking up the pack anywhere
should be legal.

Or, if necessary, you can stuff an arbitrarily large file through
git-unpack-objects, which reads a stream from stdin without
attempting to mmap it.


(Speaking of unpack-objects.c, what's that "static unsigned long eof"
variable in there?  It never seems to be set to a non-zero value.)

^ permalink raw reply

* Re: [PATCH] rev-list: honor --abbrev=<n> when doing --pretty=oneline
From: Eric Wong @ 2006-04-07  3:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v64lmuqa5.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano <junkio@cox.net> wrote:
> Eric Wong <normalperson@yhbt.net> writes:
> 
> > Note that --abbrev=DEFAULT_ABBREV was on by default before (but
> > it only affected the printing of the Merge: header).  Let me
> > know if anybody doesn't want the default behavior to change.
> 
> I've never felt need for abbreviating commit object names, so I
> only had the abbrev variable to determine how the merge parents
> are shown.  If you want to abbreviate the commit object names as
> well, you _could_ do independent precision for parents and
> commits, but that would be overkill.  So I'd rather see a switch
> to turn abbreviation for commits on, perhaps like this:
> 
>         $ git-rev-list --pretty=oneline --abbrev-commit -n 3 master
>         454a35b Add documentation for git-imap-send.
>         ba3c937 blame.c: fix completely broken ancestry traversal.
>         6cbd5d7 Tweaks to make asciidoc play nice.
> 
>         $ git-rev-list --pretty=oneline --abbrev=4 --abbrev-commit -n 3 master
>         454a Add documentation for git-imap-send.
>         ba3c9 blame.c: fix completely broken ancestry traversal.
>         6cbd5 Tweaks to make asciidoc play nice.
> 
> Otherwise you might break Porcelains and people's scripts that
> read from --pretty or --header output.
> 
> -- >8 --

Sounds good, I like your patch.  I'm not thrilled with the length of the
'--abbrev-commit' switch, but I guess that's what aliases are for :>

-- 
Eric Wong

^ permalink raw reply

* [PATCH] git-svnimport: Don't assume that copied files haven't changed
From: Karl  Hasselström @ 2006-04-07  6:06 UTC (permalink / raw)
  To: Git Mailing List

Don't assume that a file that SVN claims was copied from somewhere
else is bit-for-bit identical with its parent, since SVN allows
changes to copied files before they are committed.

Without this fix, such copy-modify-commit operations cause the
imported file to lack the "modify" part -- that is, we get subtle data
corruption.

Signed-off-by: Karl Hasselström <kha@treskal.com>

---

 git-svnimport.perl |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/git-svnimport.perl b/git-svnimport.perl
index 114784f..4d5371c 100755
--- a/git-svnimport.perl
+++ b/git-svnimport.perl
@@ -616,9 +616,7 @@ sub commit {
 			}
 			if(($action->[0] eq "A") || ($action->[0] eq "R")) {
 				my $node_kind = node_kind($branch,$path,$revision);
-				if($action->[1]) {
-					copy_path($revision,$branch,$path,$action->[1],$action->[2],$node_kind,\@new,\@parents);
-				} elsif ($node_kind eq $SVN::Node::file) {
+				if ($node_kind eq $SVN::Node::file) {
 					my $f = get_file($revision,$branch,$path);
 					if ($f) {
 						push(@new,$f) if $f;
@@ -627,8 +625,15 @@ sub commit {
 						print STDERR "$revision: $branch: could not fetch '$opath'\n";
 					}
 				} elsif ($node_kind eq $SVN::Node::dir) {
-					get_ignore(\@new, \@old, $revision,
-						   $branch,$path);
+					if($action->[1]) {
+						copy_path($revision, $branch,
+							  $path, $action->[1],
+							  $action->[2], $node_kind,
+							  \@new, \@parents);
+					} else {
+						get_ignore(\@new, \@old, $revision,
+							   $branch, $path);
+					}
 				}
 			} elsif ($action->[0] eq "D") {
 				push(@old,$path);

^ permalink raw reply related

* Re: parsecvs tool now creates git repositories
From: Keith Packard @ 2006-04-07  7:24 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: keithp, Jim Radford, Git Mailing List
In-Reply-To: <46a038f90604061622s5a7bee4eq6666d9b3796f70f6@mail.gmail.com>

On Fri, 2006-04-07 at 11:22 +1200, Martin Langhoff wrote:

> parsecvs is committing them with the "added file foo.x" message, not
> the actual commit message.

heh. my cvs repositories are all so kludged that no files have ever been
added, it appears. I'll fix this when I've got a copy of the moodle
repository. sf.net is as useful as always.

I suspect the change is as simple as checking the format of the log
message and the timestamps of the commits and then just dropping the
1.1 revision from the tree entirely.

-- 
keith.packard@intel.com

^ permalink raw reply

* Re: Cygwin can't handle huge packfiles?
From: Junio C Hamano @ 2006-04-07  8:15 UTC (permalink / raw)
  To: git; +Cc: Kees-Jan Dijkzeul, Linus Torvalds
In-Reply-To: <Pine.LNX.4.64.0604030734440.3781@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>> 
>> That said, I think git _does_ have problems with large pack-files. We have 
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> the packfile data structure does. The index has 32-bit offsets into 
> individual pack-files. 
>
> That's not hugely fundamental,...

Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.

 * pack-*.pack file has the following format:

   - The header appears at the beginning and consists of the following:

     4-byte signature
     4-byte version number (network byte order)
     4-byte number of objects contained in the pack (network byte order)

     Observation: we cannot have more than 4G versions ;-) or
     more than 4G objects in a pack.

   - The header is followed by that number of object entries, each
     of which looks like this:

     (undeltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     compressed data

     (deltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     20-byte base object name
     compressed delta data

     Observation: length of each object is encoded in a variable
     length format and is not constrained to 32-bit or anything.

  - The trailer records 20-byte SHA1 checksum of all of the above.

 * pack-*.idx file has the following format:

  - The header consists of 256 4-byte network byte order
    integers.  The N-th entry of this table records the number of
    objects in the corresponding pack whose object name has a
    first byte less than or equal to N.

    Observation: we would need to extend this to an array of
    8-byte integers to go beyond 4G objects per pack, but it is
    not strictly necessary.

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

    4-byte network byte order integer, recording where the
    object is stored in the packfile as the offset from the
    beginning.

    20-byte object name.

    Observation: we would definitely need to extend this to
    8-byte integer plus 20-byte object name to handle a packfile
    that is larger than 4GB.

  - The file is concluded with a trailer:

    A copy of the 20-byte SHA1 checksum at the end of
    corresponding packfile.

    20-byte SHA1-checksum of all of the above.
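
As an illustration of how the fanout table narrows a lookup, here is a
minimal Python sketch.  It assumes the layout just described (256
4-byte counts followed by sorted entries of a 4-byte offset plus a
20-byte name) and reads fanout[N] the way the diagram further down
shows it, i.e. fanout[0] counts the objects whose name starts with 00.
build_idx_body() and lookup() are hypothetical helpers, the object
names are made up, and the two trailing checksums are left out.

```python
import struct

ENTRY = struct.Struct(">I20s")   # 4-byte offset + 20-byte object name
FANOUT_BYTES = 256 * 4


def build_idx_body(objects):
    """Build fanout + sorted entry table from {20-byte name: offset}."""
    names = sorted(objects)
    # fanout[b] = number of objects whose first name byte is <= b
    fanout = [sum(1 for n in names if n[0] <= b) for b in range(256)]
    table = b"".join(ENTRY.pack(objects[n], n) for n in names)
    return struct.pack(">256I", *fanout) + table


def lookup(idx, name):
    """Return the pack offset recorded for `name`, or None."""
    fanout = struct.unpack_from(">256I", idx, 0)
    first = name[0]
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    # Binary search, limited to entries sharing the same first byte.
    while lo < hi:
        mid = (lo + hi) // 2
        offset, entry_name = ENTRY.unpack_from(
            idx, FANOUT_BYTES + mid * ENTRY.size)
        if entry_name == name:
            return offset
        if entry_name < name:
            lo = mid + 1
        else:
            hi = mid
    return None


# Made-up object names and offsets, for illustration only.
objects = {
    bytes([0x00]) + b"a" * 19: 12,
    bytes([0x00]) + b"b" * 19: 345,
    bytes([0x01]) + b"c" * 19: 6789,
}
idx = build_idx_body(objects)
print(struct.unpack_from(">256I", idx, 0)[:3])
print(lookup(idx, bytes([0x01]) + b"c" * 19))
```

With two names starting with 00 and one starting with 01, the first
fanout entries are 2, 3, 3, so a lookup for a 01 name only has to
search the single entry between positions 2 and 3.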

This is not fundamental, in that the pack idx file is something we
can regenerate from a packfile.  The push/fetch transfer over
git native protocols does not even transfer the pack idx file;
instead, the recipient uses git-index-pack to generate it.
git-index-pack would need to be taught to write the offset
fields as 8-byte integers, without breaking existing packfiles.
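
Any tool that regenerates the idx from a packfile would presumably
start by reading the 12-byte pack header described earlier, to learn
how many entries to expect.  A minimal Python sketch of that first
step follows.  The message only says "4-byte signature"; the bytes
b"PACK" are what git writes as far as I know, and the version number
in the sample is just an example value.

```python
import struct

PACK_HEADER = struct.Struct(">4sII")  # signature, version, object count


def parse_pack_header(data):
    """Return (version, object_count) from the first 12 bytes of a pack."""
    if len(data) < PACK_HEADER.size:
        raise ValueError("pack header is 12 bytes; got %d" % len(data))
    signature, version, count = PACK_HEADER.unpack_from(data, 0)
    if signature != b"PACK":  # assumed signature, see the note above
        raise ValueError("bad pack signature: %r" % signature)
    return version, count


# A hand-built header: version 2, five objects.
sample = b"PACK" + struct.pack(">II", 2, 5)
print(parse_pack_header(sample))
```

Both integers are unsigned 32-bit in network byte order, which is
where the "4G objects in a pack" limit comes from.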

The code to read the idx file currently has sanity check logic to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header matches the number
of objects recorded in the pack).  So we could reliably tell
the current 24-byte version and the 28-byte "beyond 4GB" version
apart, and support both formats at the same time.
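
The size check could look roughly like this.  It is a Python sketch
under the assumption that a "beyond 4GB" idx keeps the same 256-entry
fanout and the same 40-byte trailer and only widens each entry from 24
to 28 bytes; that format is only a proposal in this message, and
idx_entry_size() and fake_idx() are hypothetical helpers.

```python
import struct

FANOUT_BYTES = 256 * 4
TRAILER_BYTES = 20 + 20  # packfile checksum + idx checksum


def idx_entry_size(idx):
    """Guess the per-object entry size (24 or 28) from the file length."""
    if len(idx) < FANOUT_BYTES + TRAILER_BYTES:
        raise ValueError("too short to be a pack idx file")
    # The last fanout entry is the total number of objects.
    (nr_objects,) = struct.unpack_from(">I", idx, 255 * 4)
    body = len(idx) - FANOUT_BYTES - TRAILER_BYTES
    for entry_size in (24, 28):
        if body == nr_objects * entry_size:
            return entry_size
    raise ValueError("idx size does not match its object count")


def fake_idx(nr_objects, entry_size):
    """Build a zero-filled idx of the right shape, for demonstration."""
    fanout = struct.pack(">256I", *([nr_objects] * 256))
    return fanout + bytes(nr_objects * entry_size) + bytes(TRAILER_BYTES)


print(idx_entry_size(fake_idx(3, 24)))
print(idx_entry_size(fake_idx(3, 28)))
```

An empty pack (zero objects) is the one case where the two layouts
cannot be told apart by size; the sketch reports 24 there.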

Even after we start supporting the 28-byte "beyond 4GB" format,
we can and we should continue writing the current 24-byte
version of pack idx file when the packfile offset can be
expressed with 32-bit.

Having said that, I have to warn that this is not for the faint of
heart.  The necessary changes would be somewhat involved.


----------------------------------------------------------------

Pack idx file

	idx
	    +--------------------------------+
	    | fanout[0] = 2                  |-.
	    +--------------------------------+ |
	    | fanout[1]                      | |
	    +--------------------------------+ |
	    | fanout[2]                      | |
	    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
	    | fanout[255]                    | |
	    +--------------------------------+ |
main	    | offset                         | |
index	    | object name 00XXXXXXXXXXXXXXXX | |
table	    +--------------------------------+ | 
	    | offset                         | |
	    | object name 00XXXXXXXXXXXXXXXX | |
	    +--------------------------------+ |
	  .-| offset                         |<+
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | +--------------------------------+
	  | | offset                         |
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	  | | offset                         |
	  | | object name FFXXXXXXXXXXXXXXXX |
	  | +--------------------------------+
trailer	  | | packfile checksum              |
	  | +--------------------------------+
	  | | idxfile checksum               |
	  | +--------------------------------+
          .-------.      
                  |
Pack file entry: <+

     packed object header:
	1-byte type (bit 4-6)
	       size0 (bit 0-3)
               end-of-length (bit 7)
        n-byte sizeN (as long as MSB is set, each 7-bit)
		size0..sizeN form 4+7+7+..+7 bit integer, size0
		is the least significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
		is the size before compression).
	If it is DELTA, then
	  20-byte base object name SHA1 (the size above is the
	  	size of the delta data that follows).
          delta data, deflated.
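
A minimal Python sketch of decoding the packed object header above.
Note that it treats size0 as the least significant bits and shifts
each following 7-bit group further left, which is how I understand
git's unpacking code to work.  parse_object_header() and
encode_object_header() are hypothetical helpers, and the type is kept
as a bare number.

```python
def parse_object_header(buf, pos=0):
    """Decode the variable-length type/size header of one pack entry.

    Returns (type, size, next_pos), where next_pos is the index of the
    first byte after the header."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 4-6
    size = byte & 0x0F              # bits 0-3: size0
    shift = 4
    while byte & 0x80:              # bit 7 set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def encode_object_header(obj_type, size):
    """Inverse of parse_object_header, used here to build sample input."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)     # mark that more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


header = encode_object_header(3, 1000)
print(header.hex())
print(parse_object_header(header))
```

Because Python integers are unbounded, the same loop handles sizes
beyond 32 bits, which matches the observation that the per-object
length is not constrained to 32-bit.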

^ permalink raw reply

* Re: Cygwin can't handle huge packfiles?
From: Jakub Narebski @ 2006-04-07  8:27 UTC (permalink / raw)
  To: git
In-Reply-To: <7vhd55ls24.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:

>  * pack-*.pack file has the following format:
[...]
>  * pack-*.idx file has the following format:
[...]
Could you please put the information in the parent post somewhere in
Documentation, for example Documentation/technical/pack-format.txt
(perhaps together with putting the description of the packing heuristics
from http://marc.theaimsgroup.com/?l=git&m=114134881923320 by Jon Loeliger
in Documentation/technical/pack-heuristics.txt, even if it doesn't conform
to "serious documentation" standards)?

Thanks in advance
-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply

* blame now knows -S
From: Junio C Hamano @ 2006-04-07  9:28 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git, Fredrik Kuivinen

I've made a few changes to "git blame" myself:

 - fix breakage caused by recent revision walker reorganization;
 - use built-in xdiff instead of spawning GNU diff;
 - implement -S <ancestry-file> like annotate does.

Depending on the density of changes, it now appears that blame
is 10%-30% faster than annotate.  I thought the CVS emulator folks
might be interested in giving it a whirl.

^ permalink raw reply

* Re: blame now knows -S
From: Junio C Hamano @ 2006-04-07  9:32 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git, Fredrik Kuivinen
In-Reply-To: <7v1ww9loon.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano <junkio@cox.net> writes:

> I've made a few changes to "git blame" myself:
>
>  - fix breakage caused by recent revision walker reorganization;
>  - use built-in xdiff instead of spawning GNU diff;
>  - implement -S <ancestry-file> like annotate does.
>
> Depending on the density of changes, it now appears that blame
> is 10%-30% faster than annotate.  I thought the CVS emulator folks
> might be interested in giving it a whirl.

Sorry, forgot to mention... The updated blame will be in "next",
not in "master" yet.

^ permalink raw reply

* Re: Cygwin can't handle huge packfiles?
From: Nicolas Pitre @ 2006-04-07 14:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kees-Jan Dijkzeul, Linus Torvalds
In-Reply-To: <7vhd55ls24.fsf@assigned-by-dhcp.cox.net>

On Fri, 7 Apr 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
> >> 
> >> That said, I think git _does_ have problems with large pack-files. We have 
> >> some 32-bit issues etc
> >
> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> > the packfile data structure does. The index has 32-bit offsets into 
> > individual pack-files. 
> >
> > That's not hugely fundamental,...
> 
> Linus _does_ understand what he means, but let me clarify and
> outline a possible future direction.
> 
[...]

For the record, the delta code also has 32-bit limitations of its own 
presently.  It cannot encode a delta against a buffer which is larger 
than 4GB.

I however made sure the byte 0 could be used as a prefix for future 
encoding extensions, like 64-bit file offsets for example.


Nicolas

^ permalink raw reply

* Git is one year old today
From: Luck, Tony @ 2006-04-07 16:16 UTC (permalink / raw)
  To: git

Happy birthday to git ... one year old today.  Counting
the "birth" as the point at which Linus made the first commit
of the git sources into git:

 commit e83c5163316f89bfbde7d9ab23ca2e25604af290
 Author: Linus Torvalds <torvalds@ppc970.osdl.org>
 Date:   Thu Apr 7 15:13:13 2005 -0700

    Initial revision of "git", the information manager from hell

-Tony

^ permalink raw reply

* Re: Cygwin can't handle huge packfiles?
From: Junio C Hamano @ 2006-04-07 18:31 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0604071002530.2215@localhost.localdomain>

Nicolas Pitre <nico@cam.org> writes:

> On Fri, 7 Apr 2006, Junio C Hamano wrote:
>
>> Linus Torvalds <torvalds@osdl.org> writes:
>> 
>> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
>> >> 
>> >> That said, I think git _does_ have problems with large pack-files. We have 
>> >> some 32-bit issues etc
>> >
>> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
>> > the packfile data structure does. The index has 32-bit offsets into 
>> > individual pack-files. 
>> >
>> > That's not hugely fundamental,...
>> 
>> Linus _does_ understand what he means, but let me clarify and
>> outline a possible future direction.
>
> For the record, the delta code also has 32-bit limitations of its own 
> presently.  It cannot encode a delta against a buffer which is larger 
> than 4GB.
>
> I however made sure the byte 0 could be used as a prefix for future 
> encoding extensions, like 64-bit file offsets for example.

True, the delta data representation, not just the "delta code",
has that limitation, but I do not think you issue the "insert 0-byte
literal data" command from the deltifier side right now, so we
should be OK.

Maybe we would want to check (cmd == 0) case to detect delta
extension that we do not handle right now?
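
A rough Python sketch of what such a check could look like inside a
delta-applying loop.  The opcode layout used here (high bit set means
"copy from base" with up to four offset bytes and up to three size
bytes, 1..127 means "insert that many literal bytes") is how I
understand the format from later git documentation; it may not match
the current patch-delta code exactly, and read_varint() and
apply_delta() are hypothetical names.

```python
def read_varint(delta, pos):
    """Read one little-endian base-128 size from the delta header."""
    value = shift = 0
    while True:
        byte = delta[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base, delta):
    src_size, pos = read_varint(delta, 0)
    dst_size, pos = read_varint(delta, pos)
    if src_size != len(base):
        raise ValueError("delta does not match base size")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            # Copy from base; low bits say which offset/size bytes follow.
            offset = size = 0
            for bit in range(4):            # at most 4 offset bytes (32-bit)
                if cmd & (1 << bit):
                    offset |= delta[pos] << (8 * bit)
                    pos += 1
            for bit in range(3):
                if cmd & (0x10 << bit):
                    size |= delta[pos] << (8 * bit)
                    pos += 1
            if size == 0:
                size = 0x10000
            out += base[offset:offset + size]
        elif cmd:
            # Insert `cmd` literal bytes taken from the delta itself.
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            # cmd == 0 is reserved for future extensions: refuse it.
            raise ValueError("unexpected delta opcode 0")
    if len(out) != dst_size:
        raise ValueError("delta produced the wrong result size")
    return bytes(out)


delta = bytes([11, 9, 0x90, 5, 4]) + b" git"
print(apply_delta(b"hello world", delta))
```

The sample delta copies five bytes from offset 0 of the base and then
inserts four literal bytes; a delta containing a 0 opcode is rejected
instead of being silently treated as an empty insert.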

^ permalink raw reply

* Can't export whole repo as patches
From: Peter Baumann @ 2006-04-07 18:47 UTC (permalink / raw)
  To: git

I'd like to export the whole history of a project of mine via patches
but I can't get the initial commit.

How can I get the initial commit as a patch?

That's what I tried:

  git --version
  git version 1.2.4				# debian sarge

  mkdir /tmp/testrepo && cd /tmp/testrepo
  git-init-db
  echo a > a_file.txt
  git-add a_file.txt
  git-commit -a -m "a_file added"
  echo b >> a_file.txt
  git-commit -a -m "a_file modified"
  git-format-patch master~1
  0001-a_file-modified.txt
  cat 0001-a_file-modified.txt
  From nobody Mon Sep 17 00:00:00 2001
  From: Peter Baumann <peter.baumann@gmail.com>
  Date: Fri Apr 7 12:20:54 2006 +0200
  Subject: [PATCH] a_file modified

  ---

   a_file.txt |    1 +
   1 files changed, 1 insertions(+), 0 deletions(-)

  d8ceeed82a29004c066a98e0d390818e65fa9da7
  diff --git a/a_file.txt b/a_file.txt
  index 7898192..422c2b7 100644
  --- a/a_file.txt
  +++ b/a_file.txt
  @@ -1 +1,2 @@
   a
  +b
  --
  1.2.4


As you can see, there is only a patch of the second commit. But it seems that
this behaviour is correct, because I asked for the diff between master^..master

Obviously, I wanted a way to get the diff of master~2..master.

Trying harder:

  git-format-patch master~2
  Not a valid rev master~2 (master~2..HEAD)

Any hint to the correct way is appreciated.

</me thinking loudly>
The best would be if git had an implicit tag or branch called "init"
(name doesn't really matter) which is the root of an empty repository. In that case
one can do git-format-patch root..master and it would do the right thing.

Greetings,
  Peter Baumann

^ permalink raw reply

* Re: Cygwin can't handle huge packfiles?
From: Nicolas Pitre @ 2006-04-07 18:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vhd55jkz0.fsf@assigned-by-dhcp.cox.net>

On Fri, 7 Apr 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Fri, 7 Apr 2006, Junio C Hamano wrote:
> >
> >> Linus Torvalds <torvalds@osdl.org> writes:
> >> 
> >> > On Mon, 3 Apr 2006, Linus Torvalds wrote:
> >> >> 
> >> >> That said, I think git _does_ have problems with large pack-files. We have 
> >> >> some 32-bit issues etc
> >> >
> >> > I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> >> > the packfile data structure does. The index has 32-bit offsets into 
> >> > individual pack-files. 
> >> >
> >> > That's not hugely fundamental,...
> >> 
> >> Linus _does_ understand what he means, but let me clarify and
> >> outline a possible future direction.
> >
> > For the record, the delta code also has 32-bit limitations of its own 
> > presently.  It cannot encode a delta against a buffer which is larger 
> > than 4GB.
> >
> > I however made sure the byte 0 could be used as a prefix for future 
> > encoding extensions, like 64-bit file offsets for example.
> 
> True, the delta data representation, not just the "delta code",
> has that limitation, but I do not think you issue the "insert 0-byte
> literal data" command from the deltifier side right now, so we
> should be OK.
> 
> Maybe we would want to check (cmd == 0) case to detect delta
> extension that we do not handle right now?

Good idea.  Will send you a patch.


Nicolas

^ permalink raw reply

* Re: Can't export whole repo as patches
From: Junio C Hamano @ 2006-04-07 19:18 UTC (permalink / raw)
  To: Peter Baumann; +Cc: git
In-Reply-To: <20060407184701.GA6686@xp.machine.de>

Peter Baumann <peter.baumann@gmail.com> writes:

> How can I get the initial commit as a patch?

format-patch is designed to get a patch to send to upstream, and
does not handle the root commit.  In your two-revision
repository, you could do something like this:

	$ git diff-tree -p --root master~1

Or more in general:

	$ git rev-list master |
          git diff-tree --stdin --root --pretty=fuller -p

BTW, I've been meaning to add --pretty=patch to give
format-patch compatible output to diff-tree, but haven't got
around to actually doing it.  Another thing I've been meaning to do
is "git log --diff" which is more or less "git whatchanged".
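
For anyone who wants to drive the rev-list | diff-tree pipeline above
from a script, here is a small Python sketch.  It only wraps the two
commands shown in this message; export_commands() and export_history()
are hypothetical helper names, and git is assumed to be on PATH.

```python
import subprocess


def export_commands(branch="master"):
    """Return the two argv lists for the rev-list | diff-tree pipeline."""
    rev_list = ["git", "rev-list", branch]
    diff_tree = ["git", "diff-tree", "--stdin", "--root",
                 "--pretty=fuller", "-p"]
    return rev_list, diff_tree


def export_history(branch="master", cwd=None):
    """Run the pipeline in `cwd` and return the combined patch text."""
    rev_list, diff_tree = export_commands(branch)
    # First command lists the commits; its output feeds the second one.
    revs = subprocess.run(rev_list, cwd=cwd, check=True,
                          capture_output=True).stdout
    result = subprocess.run(diff_tree, cwd=cwd, input=revs, check=True,
                            capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling export_history("master", cwd="/tmp/testrepo") in the test
repository from the original question should return the patches for
both commits, including the root one.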

^ permalink raw reply

