Git development
 help / color / mirror / Atom feed
* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
From: Linus Torvalds @ 2009-01-09 18:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Adeodato Simó, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML
In-Reply-To: <alpine.DEB.1.00.0901091405460.30769@pacific.mpi-cbg.de>



On Fri, 9 Jan 2009, Johannes Schindelin wrote:
> 
> -- snipsnap --
> [TOY PATCH] Add diff option '--collapse-non-alnums'

I really don't think it should be about "alnum".

Think about languages that use "begin" and "end" instead of "{" "}".

I think we'd be better off just looking at the size, but _this_ is a 
really good area where "uniqueness" matters.

Don't combine unique lines, but lines that have the same hash as a 
thousand other lines? Go right ahead.

		Linus

^ permalink raw reply

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
From: Linus Torvalds @ 2009-01-09 18:13 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Adeodato Simó, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML
In-Reply-To: <alpine.LFD.2.00.0901091001260.6528@localhost.localdomain>



On Fri, 9 Jan 2009, Linus Torvalds wrote:
>
> Don't combine unique lines, but lines that have the same hash as a 
> thousand other lines? Go right ahead.

Oh, and even then, don't ever combine fragments unless it's a single line, 
and unless you can really stop with just one or two combines. Because we 
don't want to combine 100 one-liner changes (on every other line) into one 
300-line change.

		Linus

^ permalink raw reply

* Re: Curious about details of optimization of object database...
From: Nicolas Pitre @ 2009-01-09 18:34 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: chris, git
In-Reply-To: <vpqzli01hzl.fsf@bauges.imag.fr>

On Fri, 9 Jan 2009, Matthieu Moy wrote:

> chris@seberino.org writes:
> 
> > I'm told a commit is *not* a patch (diff), but, rather a copy of the entire
> > tree.
> 
> Conceptually, yes. But obviously, the storage format (pack) does what
> people usually call "delta-compression", which is basically storing
> only the diff against another, similar object.

Also, since objects representing files and directories are named after 
their actual content, having two commits with identical files and 
directories will of course share the same blob and tree objects for 
those identical parts.


Nicolas

^ permalink raw reply

* (unknown)
From: nathan.panike @ 2009-01-09 19:02 UTC (permalink / raw)


>From 65c4fed27fe9752ffd0e3b7cb6807561a4dd4601 Mon Sep 17 00: 00:00 2001
From: Nathan W. Panike <nathan.panike@gmail.com>
Date: Fri, 9 Jan 2009 11:53:43 -0600
Subject: [PATCH] Get format-patch to show first commit after root commit

Currently, the command

git format-patch -1 e83c5163316f89bfbde

in the git repository creates an empty file.  Instead, one is
forced to do

git format-patch -1 --root e83c5163316f89bfbde

This seems arbitrary.  This patch fixes this case, so that

git format-patch -1 e83c5163316f89bfbde

will produce an actual patch.
---
 builtin-log.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/builtin-log.c b/builtin-log.c
index 4a02ee9..5e7b61f 100644
--- a/builtin-log.c
+++ b/builtin-log.c
@@ -977,6 +977,8 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 		list[nr - 1] = commit;
 	}
 	total = nr;
+	if (total == 1 && !list[0]->parents)
+		rev.show_root_diff=1;
 	if (!keep_subject && auto_number && total > 1)
 		numbered = 1;
 	if (numbered)
-- 
1.6.1.76.gc123b.dirty

^ permalink raw reply related

* [PATCH/RFC v5 0/1] git checkout: optimise away lots of lstat() calls
From: Kjetil Barvik @ 2009-01-09 19:05 UTC (permalink / raw)
  To: git; +Cc: Pete Harlan, Linus Torvalds, Junio C Hamano, Kjetil Barvik

Changes since version 4:

- After looking at the changes in create_directories() inside entry.c
  for some more time, I got the idea to just use the lstat_cache()
  instead of almost use a separate cache implementation of its own.

- To be able to implement that, I had to be able to tell the cache to
  test the full length of input 'name'.  I also had to be able to tell
  the cache to use the stat() function instead of the lstat() function
  for a given prefix length.

- Fixed a missed cache optimisation where you is not able to cache a
  symlink or a none existing directory, but still wants to cache the
  real directory part.  Fixed a second missed cache optimisation where
  the 'name' argument is a (complete) substring of the cache on a path
  component basis.

- The cache can now also return LSTAT_REG for a regular file, but
  please note that it can only cache the directory part of that file.

- Since the changes made to the create_directories() function now is
  small and simple (just had to change an if-test, remove 2 unused
  variables and update the comments), patch 2/2 is gone and included
  into patch 1/1.

- More updates to the comments in the source code.

- Thanks to Pete Harlan for spelling fixes to the commit message!

I think that the patch starts to be in a good shape now, but please
comment on it, even if it is just a rename of a variable/function name
to increase the readability or a spelling fix!

For the fun of it, I did a test with 'git chekcout-index -q --all
--prefix=<a path containing 29 components>' from the my-v2.6.27 Linux
branch, and the savings was really huge: the numbers of lstat() and
stat() calls dropped from 778 145 to 26 674, and the real time also
dropped from 52 to 32 seconds on my laptop!  :-)

-----

I have just started to clone some interesting Linux git trees to watch
the development more closely, and therefore also started to use git. I
noticed that 'git checkout' takes some time, and especially that the
'git checkout' command does lots and lots of lstat() calls.

After some more investigation and thinking, I have made 1 patch and
been able to optimise away over 40% of all lstat() calls in some cases
for the 'git checkout' command.  Also, if you use a large path to the
'--prefix' argument to the 'git checkout-index' command, and you have
lots of files, the savings can be really huge!

The patch is against git master, and the git 'make test' test suite
still passes after the patch.  To document the improvement, below is
some numbers, which compares before and after the patch. To reproduce
the numbers:

- git clone the Linux git tree to be able to get the Linux tags
  'v2.6.25' and 'v2.6.27'.
- git checkout -b my-v2.6.27 v2.6.27
- git checkout -b my-v2.6.25 v2.6.25

Then, when the current branch is 'my-v2.6.25', do:

  strace -o strace_to27 -T git checkout -q my-v2.6.27

And then pretty print the 'strace_to27' file.  Below is the numbers
from the current git version (before the patch).  Notice that we do an
lstat() call on the "arch" directory over 6000 times!

TOTAL      185151 100.000% OK:165544 NOT: 19607  11.136001 sec   60 usec/call
lstat64    120954  65.327% OK:107013 NOT: 13941   5.388727 sec   45 usec/call
  strings  120954 tot  30163 uniq   4.010 /uniq   5.388727 sec   45 usec/call
  files     61491 tot  28712 uniq   2.142 /uniq   2.740520 sec   45 usec/call
  dirs      45522 tot   1436 uniq  31.701 /uniq   1.994448 sec   44 usec/call
  errors    13941 tot   5189 uniq   2.687 /uniq   0.653759 sec   47 usec/call
             6297   5.206% OK:  6297 NOT:     0  "arch"
             4544   3.757% OK:  4544 NOT:     0  "drivers"
             1816   1.501% OK:  1816 NOT:     0  "arch/arm"
             1499   1.239% OK:  1499 NOT:     0  "include"
              912   0.754% OK:   912 NOT:     0  "arch/powerpc"
              764   0.632% OK:   764 NOT:     0  "fs"
              746   0.617% OK:   746 NOT:     0  "drivers/net"
              662   0.547% OK:   662 NOT:     0  "net"
              652   0.539% OK:   325 NOT:   327  "arch/sparc/include"
              636   0.526% OK:   636 NOT:     0  "drivers/media"
              606   0.501% OK:   606 NOT:     0  "include/linux"
              533   0.441% OK:   533 NOT:     0  "arch/sh"
              522   0.432% OK:   260 NOT:   262  "arch/powerpc/include"
              488   0.403% OK:   243 NOT:   245  "arch/sh/include"
              413   0.341% OK:   413 NOT:     0  "arch/sparc"
              390   0.322% OK:   390 NOT:     0  "arch/x86"
              383   0.317% OK:   383 NOT:     0  "Documentation"
              370   0.306% OK:   184 NOT:   186  "arch/ia64/include"
              366   0.303% OK:   366 NOT:     0  "drivers/media/video"
              348   0.288% OK:   173 NOT:   175  "arch/arm/include"

Here is the numbers after applying the patch.  Notice how nice the top
20 entries list now looks!

TOTAL      134155 100.000% OK:122102 NOT: 12053  11.069389 sec   83 usec/call
lstat64     69876  52.086% OK: 63491 NOT:  6385   3.410007 sec   49 usec/call
  strings   69876 tot  30163 uniq   2.317 /uniq   3.410007 sec   49 usec/call
  files     61491 tot  28712 uniq   2.142 /uniq   3.023238 sec   49 usec/call
  dirs       2000 tot   1436 uniq   1.393 /uniq   0.085953 sec   43 usec/call
  errors     6385 tot   5189 uniq   1.230 /uniq   0.300816 sec   47 usec/call
                4   0.006% OK:     4 NOT:     0  ".gitignore"
                4   0.006% OK:     4 NOT:     0  ".mailmap"
                4   0.006% OK:     4 NOT:     0  "CREDITS"
                4   0.006% OK:     4 NOT:     0  "Documentation/00-INDEX"
                4   0.006% OK:     4 NOT:     0  "Documentation/ABI/testing/sysfs-block"
                4   0.006% OK:     4 NOT:     0  "Documentation/ABI/testing/sysfs-firmware-acpi"
                4   0.006% OK:     4 NOT:     0  "Documentation/CodingStyle"
                4   0.006% OK:     4 NOT:     0  "Documentation/DMA-API.txt"
                4   0.006% OK:     4 NOT:     0  "Documentation/DMA-mapping.txt"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/Makefile"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/gadget.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/kernel-api.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/kernel-locking.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/procfs-guide.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/procfs_example.c"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/rapidio.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/s390-drivers.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/uio-howto.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/videobook.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/writing_usb_driver.tmpl"

Comments?


Kjetil Barvik (1):
  more cache effective symlink/directory detection

 builtin-add.c          |    1 +
 builtin-apply.c        |    1 +
 builtin-update-index.c |    1 +
 cache.h                |   24 +++++++-
 diff-lib.c             |    1 +
 entry.c                |   39 +++++-------
 symlinks.c             |  158 +++++++++++++++++++++++++++++++++++-------------
 unpack-trees.c         |    6 +-
 8 files changed, 162 insertions(+), 69 deletions(-)

^ permalink raw reply

* [PATCH/RFC v5 1/1] more cache effective symlink/directory detection
From: Kjetil Barvik @ 2009-01-09 19:05 UTC (permalink / raw)
  To: git; +Cc: Pete Harlan, Linus Torvalds, Junio C Hamano, Kjetil Barvik
In-Reply-To: <1231527954-868-1-git-send-email-barvik@broadpark.no>

Changes includes the following:

- The cache functionality is more effective.  Previously when A/B/C/D
  was in the cache and A/B/C/E/file.c was called for, there was no
  match at all from the cache.  Now we use the fact that the paths
  "A", "A/B" and "A/B/C" are already tested, and we only need to do an
  lstat() call on "A/B/C/E".

- We only cache/store the last path regardless of its type.  Since the
  cache functionality is always used with alphabetically sorted names
  (at least it seems so for me), there is no need to store both the
  last symlink-leading path and the last real-directory path.  Note
  that if the cache is not called with (mostly) alphabetically sorted
  names, neither the old, nor this new one, would be very effective.

- We also can cache the fact that a directory does not exist.
  Previously we could end up doing lots of lstat() calls for a removed
  directory which previously contained lots of files.  Since we
  already have simplified the cache functionality and only store the
  last path (see above), this new functionality was easy to add.

- Previously, when symlink A/B/C/S was cached/stored in the
  symlink-leading path, and A/B/C/file.c was called for, it was not
  easy to use the fact that we already knew that the paths "A", "A/B"
  and "A/B/C" are real directories.  Since we now only store one
  single path (the last one), we also get similar logic for free
  regarding the new "non-existing-directory-cache".

- Avoid copying the first path components of the name 2 zillion times
  when we test new path components.  Since we always cache/store the
  last path, we can copy each component as we test those directly into
  the cache.  Previously we ended up doing a memcpy() for the full
  path/name right before each lstat() call, and when updating the
  cache for each time we have tested a new path component.

- We also use less memory, that is, PATH_MAX bytes less memory on the
  stack and PATH_MAX bytes less memory on the heap.

- Introduce a 3rd argument, 'unsigned int track_flags', to the
  cache-test function, check_lstat_cache().  This new argument can be
  used to tell the cache functionality which types of directories
  should be cached.  This argument can also be used to tell the cache
  to always test the complete length of 'name' (add 'LSTAT_FULLPATH').

- Introduce a 4th argument, 'prefix_len_stat_func', which tells the
  length of a prefix, where the cache should use the stat() function
  instead of the lstat() function to test each path component.  This
  can for instance be useful at some places in the source code to
  handle the --prefix argument to the 'git checkout-index' command.

- Also introduce a 'void clear_lstat_cache(void)' function, which
  should be used to clean the cache before usage.  If for instance,
  you have changed the types of directories which should be cached,
  the cache could contain a path which was not wanted.

We also start to use the lstat_cache() function inside the
'create_directories()' function inside entry.c, and we save really
huge amounts of calls to the stat()/lstat() functions in some cases.

Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
---
:100644 100644 719de8b... 870961e... M	builtin-add.c
:100644 100644 a8f75ed... d3d001a... M	builtin-apply.c
:100644 100644 5604977... 8907219... M	builtin-update-index.c
:100644 100644 231c06d... 0bf31d3... M	cache.h
:100644 100644 ae96c64... c9caa0e... M	diff-lib.c
:100644 100644 aa2ee46... 293400c... M	entry.c
:100644 100644 5a5e781... 314ba4f... M	symlinks.c
:100644 100644 54f301d... c3d1429... M	unpack-trees.c
 builtin-add.c          |    1 +
 builtin-apply.c        |    1 +
 builtin-update-index.c |    1 +
 cache.h                |   24 +++++++-
 diff-lib.c             |    1 +
 entry.c                |   39 +++++-------
 symlinks.c             |  158 +++++++++++++++++++++++++++++++++++-------------
 unpack-trees.c         |    6 +-
 8 files changed, 162 insertions(+), 69 deletions(-)

diff --git a/builtin-add.c b/builtin-add.c
index 719de8b0f2d2d831f326d948aa18700e5c474950..870961e8ca4e3d6f9333020083d0a232bccd542c 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -225,6 +225,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 
 	argc = parse_options(argc, argv, builtin_add_options,
 			  builtin_add_usage, 0);
+	clear_symlink_cache();
 	if (patch_interactive)
 		add_interactive = 1;
 	if (add_interactive)
diff --git a/builtin-apply.c b/builtin-apply.c
index a8f75ed3ed411d8cf7a3ec9dfefef7407c50f447..d3d001a96be6e502d6338af4467f7c313370d78e 100644
--- a/builtin-apply.c
+++ b/builtin-apply.c
@@ -3154,6 +3154,7 @@ int cmd_apply(int argc, const char **argv, const char *unused_prefix)
 	if (apply_default_whitespace)
 		parse_whitespace_option(apply_default_whitespace);
 
+	clear_symlink_cache();
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
 		char *end;
diff --git a/builtin-update-index.c b/builtin-update-index.c
index 560497750586ec61be4e34de6dedd9c307129817..8907219fb9cb438113e29ee17854edb5dd4baa4d 100644
--- a/builtin-update-index.c
+++ b/builtin-update-index.c
@@ -581,6 +581,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 	if (entries < 0)
 		die("cache corrupted");
 
+	clear_symlink_cache();
 	for (i = 1 ; i < argc; i++) {
 		const char *path = argv[i];
 		const char *p;
diff --git a/cache.h b/cache.h
index 231c06d7726b575f6e522d5b0c0fe43557e8c651..0bf31d3e8903a658927938f6da45c77dd289c94a 100644
--- a/cache.h
+++ b/cache.h
@@ -719,7 +719,29 @@ struct checkout {
 };
 
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
-extern int has_symlink_leading_path(int len, const char *name);
+
+#define LSTAT_REG      (1u << 0)
+#define LSTAT_DIR      (1u << 1)
+#define LSTAT_NOENT    (1u << 2)
+#define LSTAT_SYMLINK  (1u << 3)
+#define LSTAT_LSTATERR (1u << 4)
+#define LSTAT_ERR      (1u << 5)
+#define LSTAT_FULLPATH (1u << 6)
+extern unsigned int lstat_cache(int len, const char *name,
+				unsigned int track_flags, int prefix_len_stat_func);
+extern void clear_lstat_cache(void);
+static inline unsigned int has_symlink_leading_path(int len, const char *name)
+{
+	return lstat_cache(len, name, LSTAT_SYMLINK|LSTAT_DIR, -1) &
+		LSTAT_SYMLINK;
+}
+#define clear_symlink_cache() clear_lstat_cache()
+static inline unsigned int has_symlink_or_noent_leading_path(int len, const char *name)
+{
+	return lstat_cache(len, name, LSTAT_SYMLINK|LSTAT_NOENT|LSTAT_DIR, -1) &
+		(LSTAT_SYMLINK|LSTAT_NOENT);
+}
+#define clear_symlink_or_noent_cache() clear_lstat_cache()
 
 extern struct alternate_object_database {
 	struct alternate_object_database *next;
diff --git a/diff-lib.c b/diff-lib.c
index ae96c64ca209f4df9008198e8a04b160bed618c7..c9caa0e6ef0f4a8ee8b850869ef6d0f52b712385 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -69,6 +69,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		diff_unmerged_stage = 2;
 	entries = active_nr;
 	symcache[0] = '\0';
+	clear_symlink_cache();
 	for (i = 0; i < entries; i++) {
 		struct stat st;
 		unsigned int oldmode, newmode;
diff --git a/entry.c b/entry.c
index aa2ee46a84033585d8e07a585610c5a697af82c2..293400cf5be63fd66b797a68e17bf953c600fe99 100644
--- a/entry.c
+++ b/entry.c
@@ -8,35 +8,28 @@ static void create_directories(const char *path, const struct checkout *state)
 	const char *slash = path;
 
 	while ((slash = strchr(slash+1, '/')) != NULL) {
-		struct stat st;
-		int stat_status;
-
 		len = slash - path;
 		memcpy(buf, path, len);
 		buf[len] = 0;
 
-		if (len <= state->base_dir_len)
-			/*
-			 * checkout-index --prefix=<dir>; <dir> is
-			 * allowed to be a symlink to an existing
-			 * directory.
-			 */
-			stat_status = stat(buf, &st);
-		else
-			/*
-			 * if there currently is a symlink, we would
-			 * want to replace it with a real directory.
-			 */
-			stat_status = lstat(buf, &st);
-
-		if (!stat_status && S_ISDIR(st.st_mode))
+		/* For 'checkout-index --prefix=<dir>', <dir> is
+		 * allowed to be a symlink to an existing directory,
+		 * therefore we must give 'state->base_dir_len' to the
+		 * cache, such that we test path components of the
+		 * prefix with stat() instead of lstat()
+		 *
+		 * We must also tell the cache to test the complete
+		 * length of the buffer (the '|LSTAT_FULLPATH' part).
+		 */
+		if (lstat_cache(len, buf, LSTAT_DIR|LSTAT_FULLPATH,
+				state->base_dir_len) &
+		    LSTAT_DIR)
 			continue; /* ok, it is already a directory. */
 
-		/*
-		 * We know stat_status == 0 means something exists
-		 * there and this mkdir would fail, but that is an
-		 * error codepath; we do not care, as we unlink and
-		 * mkdir again in such a case.
+		/* If this mkdir() would fail, it could be that there
+		 * is already a symlink or something else exists
+		 * there, therefore we then try to unlink it and try
+		 * one more time to create the directory.
 		 */
 		if (mkdir(buf, 0777)) {
 			if (errno == EEXIST && state->force &&
diff --git a/symlinks.c b/symlinks.c
index 5a5e781a15d7d9cb60797958433eca896b31ec85..314ba4f273f9f776a130b5e5b48c5bda0ca9beed 100644
--- a/symlinks.c
+++ b/symlinks.c
@@ -1,64 +1,136 @@
 #include "cache.h"
 
-struct pathname {
-	int len;
-	char path[PATH_MAX];
-};
+static char         cache_path[PATH_MAX + 2];
+static int          cache_len   = 0;
+static unsigned int cache_flags = 0;
 
-/* Return matching pathname prefix length, or zero if not matching */
-static inline int match_pathname(int len, const char *name, struct pathname *match)
+static inline int greatest_match_lstat_cache(int len, const char *name)
 {
-	int match_len = match->len;
-	return (len > match_len &&
-		name[match_len] == '/' &&
-		!memcmp(name, match->path, match_len)) ? match_len : 0;
-}
+	int max_len, match_len = 0, i = 0;
 
-static inline void set_pathname(int len, const char *name, struct pathname *match)
-{
-	if (len < PATH_MAX) {
-		match->len = len;
-		memcpy(match->path, name, len);
-		match->path[len] = 0;
+	max_len = len < cache_len ? len : cache_len;
+	while (i < max_len && name[i] == cache_path[i]) {
+		if (name[i] == '/') match_len = i;
+		i++;
 	}
+	if (i == cache_len && len > cache_len && name[cache_len] == '/')
+		match_len = cache_len;
+	else if (i == len && len < cache_len && cache_path[len] == '/')
+		match_len = len;
+	else if (i == len && i == cache_len)
+		match_len = len;
+	return match_len;
 }
 
-int has_symlink_leading_path(int len, const char *name)
+/*
+ * Check if name 'name' of length 'len' has a symlink leading
+ * component, or if the directory exists and is real, or not.
+ *
+ * To speed up the check, some information is allowed to be cached.
+ * This is can be indicated by the 'track_flags' argument, which also
+ * can be used to indicate that we should always check the full path.
+ *
+ * The 'prefix_len_stat_func' parameter can be used to set the length
+ * of the prefix, where the cache should use the stat() function
+ * instead of the lstat() function to test each path component.
+ */
+unsigned int lstat_cache(int len, const char *name,
+			 unsigned int track_flags, int prefix_len_stat_func)
 {
-	static struct pathname link, nonlink;
-	char path[PATH_MAX];
+	int match_len, last_slash, last_slash_dir, max_len, ret;
+	unsigned int match_flags, ret_flags, save_flags;
 	struct stat st;
-	char *sp;
-	int known_dir;
 
-	/*
-	 * See if the last known symlink cache matches.
+	/* Check if match from the cache for 2 "excluding" path types.
+	 */
+	match_len = last_slash = greatest_match_lstat_cache(len, name);
+	match_flags = cache_flags & track_flags & (LSTAT_NOENT|LSTAT_SYMLINK);
+	if (match_flags && match_len == cache_len)
+		return match_flags;
+
+	/* If 'name' is a substring of the cache on a path component
+	 * basis, and a directory is cached, we return immediately.
 	 */
-	if (match_pathname(len, name, &link))
-		return 1;
+	match_flags = cache_flags & track_flags & LSTAT_DIR;
+	if (match_flags && match_len == len)
+		return match_flags;
 
-	/*
-	 * Get rid of the last known directory part
+	/* Okay, no match from the cache so far, so now we have to
+	 * check the rest of the path components.
 	 */
-	known_dir = match_pathname(len, name, &nonlink);
+	ret_flags = LSTAT_DIR;
+	last_slash_dir = last_slash;
+	max_len = len < PATH_MAX ? len : PATH_MAX;
+	while (match_len < max_len) {
+		do {
+			cache_path[match_len] = name[match_len];
+			match_len++;
+		} while (match_len < max_len && name[match_len] != '/');
+		if (match_len >= max_len && !(track_flags & LSTAT_FULLPATH))
+			break;
+		last_slash = match_len;
+		cache_path[last_slash] = '\0';
 
-	while ((sp = strchr(name + known_dir + 1, '/')) != NULL) {
-		int thislen = sp - name ;
-		memcpy(path, name, thislen);
-		path[thislen] = 0;
+		if (last_slash <= prefix_len_stat_func)
+			ret = stat(cache_path, &st);
+		else
+			ret = lstat(cache_path, &st);
 
-		if (lstat(path, &st))
-			return 0;
-		if (S_ISDIR(st.st_mode)) {
-			set_pathname(thislen, path, &nonlink);
-			known_dir = thislen;
+		if (ret) {
+			ret_flags = LSTAT_LSTATERR;
+			if (errno == ENOENT)
+				ret_flags |= LSTAT_NOENT;
+		} else if (S_ISDIR(st.st_mode)) {
+			last_slash_dir = last_slash;
 			continue;
-		}
-		if (S_ISLNK(st.st_mode)) {
-			set_pathname(thislen, path, &link);
-			return 1;
+		} else if (S_ISLNK(st.st_mode)) {
+			ret_flags = LSTAT_SYMLINK;
+		} else if (S_ISREG(st.st_mode)) {
+			ret_flags = LSTAT_REG;
+		} else {
+			ret_flags = LSTAT_ERR;
 		}
 		break;
 	}
-	return 0;
+
+	/* At the end update the cache.  Note that max 3 different
+	 * path types, LSTAT_NOENT, LSTAT_SYMLINK and LSTAT_DIR, can
+	 * be cached for the moment!
+	 */
+	save_flags = ret_flags & track_flags & (LSTAT_NOENT|LSTAT_SYMLINK);
+	if (save_flags && last_slash > 0 && last_slash <= PATH_MAX) {
+		cache_path[last_slash] = '\0';
+		cache_len   = last_slash;
+		cache_flags = save_flags;
+	} else if (track_flags & LSTAT_DIR &&
+		   last_slash_dir > 0 && last_slash_dir <= PATH_MAX) {
+		/* We have separate test for the directory case, since
+		 * it could be that we have found a symlink or none
+		 * existing directory, and the track_flags says that
+		 * we can not cache this fact, and the cache would
+		 * then have been empty in this case.
+		 *
+		 * But, if we is allowed to track real directories, we
+		 * can still cache the path components before the last
+		 * one (the found symlink or none existing component).
+		 */
+		cache_path[last_slash_dir] = '\0';
+		cache_len   = last_slash_dir;
+		cache_flags = LSTAT_DIR;
+	} else {
+		clear_lstat_cache();
+	}
+	return ret_flags;
+}
+
+/*
+ * Before usage of the check_lstat_cache() function one should call
+ * clear_lstat_cache() (at an appropriate place) to make sure that the
+ * cache is clean.
+ */
+void clear_lstat_cache(void)
+{
+	cache_path[0] = '\0';
+	cache_len     = 0;
+	cache_flags   = 0;
 }
diff --git a/unpack-trees.c b/unpack-trees.c
index 54f301da67be879c80426bc21776427fdd38c02e..c3d14294aaaefb9e40c99f174b4f5f41e5fad635 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -61,7 +61,7 @@ static void unlink_entry(struct cache_entry *ce)
 	char *cp, *prev;
 	char *name = ce->name;
 
-	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
+	if (has_symlink_or_noent_leading_path(ce_namelen(ce), ce->name))
 		return;
 	if (unlink(name))
 		return;
@@ -105,6 +105,7 @@ static int check_updates(struct unpack_trees_options *o)
 		cnt = 0;
 	}
 
+	clear_symlink_or_noent_cache();
 	for (i = 0; i < index->cache_nr; i++) {
 		struct cache_entry *ce = index->cache[i];
 
@@ -118,6 +119,7 @@ static int check_updates(struct unpack_trees_options *o)
 		}
 	}
 
+	clear_lstat_cache();
 	for (i = 0; i < index->cache_nr; i++) {
 		struct cache_entry *ce = index->cache[i];
 
@@ -584,7 +586,7 @@ static int verify_absent(struct cache_entry *ce, const char *action,
 	if (o->index_only || o->reset || !o->update)
 		return 0;
 
-	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
+	if (has_symlink_or_noent_leading_path(ce_namelen(ce), ce->name))
 		return 0;
 
 	if (!lstat(ce->name, &st)) {
-- 
1.6.1.rc1.49.g7f705

^ permalink raw reply related

* Re: Curious about details of optimization of object database...
From: Boyd Stephen Smith Jr. @ 2009-01-09 19:07 UTC (permalink / raw)
  To: chris; +Cc: git
In-Reply-To: <20090109174623.GC12552@seberino.org>

[-- Attachment #1: Type: text/plain, Size: 1519 bytes --]

On Friday 2009 January 09 11:46:23 chris@seberino.org wrote:
>I'm told a commit is *not* a patch (diff), but, rather a copy of the entire
>tree.

It's even more than that.  A commit object contains its message, the SHA of 
the tree, and zero or more SHAs for its parents.

>Can anyone say, in a few sentences, how git avoids needing to keep multiple
>slightly different copies of entire files without just storing lots of
>patches/diffs?

Loose objects can have large swaths of duplicated data.  However, git also 
supports storing objects in a packed format, which uses delta compression to 
reduce the duplication to close to nothing.

Some examples:
Sizes are from "du -sh .git ."; The .git directory stores all the objects as 
well as the repository configuration, refs, reflogs, etc.  The . directory 
has .git and a clean checkout of master.

The LinuxPMI (http://linuxpmi.org/) tree:
41M     .git
83M     .
(So, the storage is actually a bit smaller than the checkout; 984 objects; 140 
commits)

A small project between me an my flatmates:
309K    .git
3.6M    .
(Here, the storage is significantly smaller than the checkout; 786 objects; 
155 commits)

My repository that tracks my dotfiles:
124K    .git
176K    .
(113 objects; 28 commits)
-- 
Boyd Stephen Smith Jr.                     ,= ,-_-. =. 
bss@iguanasuicide.net                     ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy           `-'(. .)`-' 
http://iguanasuicide.net/                      \_/     

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: [PATCH] filter-branch: add git_commit_non_empty_tree and --prune-empty.
From: Jay Soffian @ 2009-01-09 19:29 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Junio C Hamano, git, pasky, srabbelier
In-Reply-To: <20081103151826.GJ13930@artemis.corp>

On Mon, Nov 3, 2008 at 10:18 AM, Pierre Habouzit <madcoder@debian.org> wrote:
> On Mon, Nov 03, 2008 at 09:27:29AM +0000, Pierre Habouzit wrote:
>> On Mon, Nov 03, 2008 at 04:58:44AM +0000, Junio C Hamano wrote:

Bump, http://thread.gmane.org/gmane.comp.version-control.git/99440/

(I'd like to see this included. Having a bunch of empty commits after
using filter-branch to remove unwanted files from history is, er,
sub-optimal, so seems like it might even be default behavior?)

j.

^ permalink raw reply

* [PATCH] Get format-patch to show first commit after root commit
From: Nathan W. Panike @ 2009-01-09 19:35 UTC (permalink / raw)
  To: git; +Cc: Nathan W. Panike

Currently, the command

git format-patch -1 e83c5163316f89bfbde

in the git repository creates an empty file.  Instead, one is
forced to do

git format-patch -1 --root e83c5163316f89bfbde

This seems arbitrary.  This patch fixes this case, so that

git format-patch -1 e83c5163316f89bfbde

will produce an actual patch.
---
 builtin-log.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/builtin-log.c b/builtin-log.c
index 4a02ee9..5e7b61f 100644
--- a/builtin-log.c
+++ b/builtin-log.c
@@ -977,6 +977,8 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 		list[nr - 1] = commit;
 	}
 	total = nr;
+	if (total == 1 && !list[0]->parents)
+		rev.show_root_diff=1;
 	if (!keep_subject && auto_number && total > 1)
 		numbered = 1;
 	if (numbered)
-- 
1.6.1.76.gc123b.dirty

^ permalink raw reply related

* Assertion Failed when importing svn into Git
From: 4jxDQ6FQee2H @ 2009-01-09 19:32 UTC (permalink / raw)
  To: git

Hi,


I'm trying to import my SVN repository into GIT and its throwing an
Assertion Failed error message in text_delta.c  Can someone help me?

The assertion failed error occurs when running 'svn diff -r 288:289'
which contains only one change - a change made to a binary file. That
binary file is the Eclipse dataModel file, both revs (288, 289) have
the file marked as application/octet-stream using the svn:mime-type
property


subversion - 1.5.4
git        - 1.6.0.6

The SVN repository was started using 1.4.6, if that matters

Here's the exact error message:

-------
$ svn diff -r 288:289
svn: subversion/libsvn_delta/text_delta.c:609: apply_window: Assertion
`window->sview_len == 0 || (window->sview_offset >= ab->sbuf_offset &&
(window->sview_offset + window->sview_len >= ab->sbuf_offset +
ab->sbuf_len))' failed.
Aborted
-------


Thanks!

^ permalink raw reply

* Re: Assertion Failed when importing svn into Git
From: 4jxDQ6FQee2H @ 2009-01-09 19:43 UTC (permalink / raw)
  To: 4jxDQ6FQee2H; +Cc: git
In-Reply-To: <20090109133234.345dacb9@family.dyweni.com>

Hi,

For more testing, I setup a brand new SVN repository (using v1.5.4).

In the old SVN Repo, I checked out the problematic file
(.cache/.dataModel) from rev 288 and rev 289.


In the new SVN Repo, I added the problematic file (from rev 288) and
committed, which became Rev 4.

Then I copied the problematic file (rev rev 289) ontop of that file and
commited, which became Rev 5.

I'm successfully able to do a 'svn diff -r 4:5'.

Does this imply that I have corruption in my old SVN Repository?

Or is something being lost when checking out the file from Rev 289 that
does break SVN in my test?





> Hi,
> 
> 
> I'm trying to import my SVN repository into GIT and its throwing an
> Assertion Failed error message in text_delta.c  Can someone help me?
> 
> The assertion failed error occurs when running 'svn diff -r 288:289'
> which contains only one change - a change made to a binary file. That
> binary file is the Eclipse dataModel file, both revs (288, 289) have
> the file marked as application/octet-stream using the svn:mime-type
> property
> 
> 
> subversion - 1.5.4
> git        - 1.6.0.6
> 
> The SVN repository was started using 1.4.6, if that matters
> 
> Here's the exact error message:
> 
> -------
> $ svn diff -r 288:289
> svn: subversion/libsvn_delta/text_delta.c:609: apply_window: Assertion
> `window->sview_len == 0 || (window->sview_offset >= ab->sbuf_offset &&
> (window->sview_offset + window->sview_len >= ab->sbuf_offset +
> ab->sbuf_len))' failed.
> Aborted
> -------
> 
> 
> Thanks!

^ permalink raw reply

* Re: Assertion Failed when importing svn into Git
From: 4jxDQ6FQee2H @ 2009-01-09 20:05 UTC (permalink / raw)
  To: 4jxDQ6FQee2H; +Cc: git
In-Reply-To: <20090109134307.435ed56f@family.dyweni.com>

Hi,

I have resolved this issue.

There was either corruption in the repository or an incompatibility
between the 1.4.6 and the 1.5.4 formats.

A dump / reload of the repository has resolved the issue.

I can run 'svn diff -r 288:289' successfully on the new repository.

Thanks!



> Hi,
> 
> For more testing, I setup a brand new SVN repository (using v1.5.4).
> 
> In the old SVN Repo, I checked out the problematic file
> (.cache/.dataModel) from rev 288 and rev 289.
> 
> 
> In the new SVN Repo, I added the problematic file (from rev 288) and
> committed, which became Rev 4.
> 
> Then I copied the problematic file (rev rev 289) ontop of that file
> and commited, which became Rev 5.
> 
> I'm successfully able to do a 'svn diff -r 4:5'.
> 
> Does this imply that I have corruption in my old SVN Repository?
> 
> Or is something being lost when checking out the file from Rev 289
> that does break SVN in my test?
> 
> 
> 
> 
> 
> > Hi,
> > 
> > 
> > I'm trying to import my SVN repository into GIT and its throwing an
> > Assertion Failed error message in text_delta.c  Can someone help me?
> > 
> > The assertion failed error occurs when running 'svn diff -r 288:289'
> > which contains only one change - a change made to a binary file.
> > That binary file is the Eclipse dataModel file, both revs (288,
> > 289) have the file marked as application/octet-stream using the
> > svn:mime-type property
> > 
> > 
> > subversion - 1.5.4
> > git        - 1.6.0.6
> > 
> > The SVN repository was started using 1.4.6, if that matters
> > 
> > Here's the exact error message:
> > 
> > -------
> > $ svn diff -r 288:289
> > svn: subversion/libsvn_delta/text_delta.c:609: apply_window:
> > Assertion `window->sview_len == 0 || (window->sview_offset >=
> > ab->sbuf_offset && (window->sview_offset + window->sview_len >=
> > ab->sbuf_offset + ab->sbuf_len))' failed.
> > Aborted
> > -------
> > 
> > 
> > Thanks!  

^ permalink raw reply

* Re: fetch branch blacklist
From: Jakub Narebski @ 2009-01-09 20:23 UTC (permalink / raw)
  To: jidanni; +Cc: git
In-Reply-To: <87r63ek6cw.fsf@jidanni.org>

jidanni@jidanni.org writes:

> If one wants to always fetch all except one remote branch, one cannot
> just blacklist it, but must instead whitelist all the rest.
> $ git branch -rd origin/man origin/html
> Deleted remote branch origin/man.
> Deleted remote branch origin/html.
> Plus I edited them out of FETCH_HEAD. Nonetheless, back from the dead:
> $ git pull
> From git://git.kernel.org/pub/scm/git/git
>  * [new branch]      html       -> origin/html
>  * [new branch]      man        -> origin/man
> The only solution is to change .git/config:
> [remote "origin"]
> 	url = git://git.kernel.org/pub/scm/git/git.git
> #	fetch = +refs/heads/*:refs/remotes/origin/*
> 	fetch = +refs/heads/maint:refs/remotes/origin/maint
> 	fetch = +refs/heads/master:refs/remotes/origin/master
> 	fetch = +refs/heads/next:refs/remotes/origin/next
> 	fetch = +refs/heads/pu:refs/remotes/origin/pu
> 	fetch = +refs/heads/todo:refs/remotes/origin/todo

Well, you can always use hooks for that (see for example
underdocumented contrib/hooks/update-paranoid)... or you can try to
scratch that itch yourself.  gitignore supports inverse (negated)
patterns (!<pattern>), so there is some code dealing with
"blacklisting".  I would propose using the same '!' character, or
perhaps one of forbidden characters (see git-check-ref-format(1)),
i.e.

 [remote "origin"]
 	url = git://git.kernel.org/pub/scm/git/git.git
 	fetch = +refs/heads/*:refs/remotes/origin/*
        fetch = !refs/heads/html
        fetch = !refs/heads/man

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply

* Re: [PATCH] Get format-patch to show first commit after root commit
From: Alexander Potashev @ 2009-01-09 20:29 UTC (permalink / raw)
  To: Nathan W. Panike; +Cc: git
In-Reply-To: <1231529725-19767-1-git-send-email-nathan.panike@gmail.com>

Hello!

I experienced this problem today while preparing a simple patch for
reply in "[PATCH 2/2] Use is_pseudo_dir_name everywhere" thread.
I used a workaround: add a file, commit, remove it, commit, add it once
again, commit and after all format-patch.

On 13:35 Fri 09 Jan     , Nathan W. Panike wrote:
> Currently, the command
> 
> git format-patch -1 e83c5163316f89bfbde
> 
> in the git repository creates an empty file.  Instead, one is
> forced to do
> 
> git format-patch -1 --root e83c5163316f89bfbde
> 
> This seems arbitrary.  This patch fixes this case, so that
> 
> git format-patch -1 e83c5163316f89bfbde

Your patch doesn't solve the problem if there are more than one commit
(say, 2 commits) and you run 'git format-patch -2'. Even with your patch
format-patch writes an empty patch file corresponding to the root commit
(actually, it creates 2 patches, but the first is empty).

Please, correct me if I'm wrong.

> 
> will produce an actual patch.
> ---
>  builtin-log.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/builtin-log.c b/builtin-log.c
> index 4a02ee9..5e7b61f 100644
> --- a/builtin-log.c
> +++ b/builtin-log.c
> @@ -977,6 +977,8 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
>  		list[nr - 1] = commit;
>  	}
>  	total = nr;
> +	if (total == 1 && !list[0]->parents)
> +		rev.show_root_diff=1;
>  	if (!keep_subject && auto_number && total > 1)
>  		numbered = 1;
>  	if (numbered)
> -- 
> 1.6.1.76.gc123b.dirty

^ permalink raw reply

* Re: [PATCH 0/3] Teach Git about the patience diff algorithm
From: Junio C Hamano @ 2009-01-09 20:53 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Adeodato Simó, Linus Torvalds, Clemens Buchacher,
	Pierre Habouzit, davidel, Francis Galiegue, Git ML
In-Reply-To: <alpine.DEB.1.00.0901091405460.30769@pacific.mpi-cbg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Thu, 8 Jan 2009, Junio C Hamano wrote:
>
>> If we find the "common" context lines that have only blank and 
>> punctuation letters in Dscho output, turn each of them into "-" and "+", 
>> and rearrange them so that all "-" are together followed by "+", it will 
>> match Bzr output.
>
> So we'd need something like this (I still think we should treat curly 
> brackets the same as punctuation, and for good measure I just handled 
> everything that is not alphanumerical the same):

I meant by punctuation to include curlies (my wording may have been wrong
but from the example with " }" line it should have been obvious).

But I agree with both points Linus raised.  The criteria to pick what to
pretend unmatching should be "small insignificant lines" (small goes for
both size and also number of consecutive "insignificant" lines), and the
coallescing should be done to join a block of consecutive changed lines of
a significant size (so you do not join two 1 or 2-line "changed line"
blocks by pretending that a 1-line unchanged insignificant line in between
them is unmatching).

^ permalink raw reply

* Re: 1.5.6.5 fails to clone git.kernel.org/[...]/rostedt/linux-2.6-rt
From: Tim Shepard @ 2009-01-09 21:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Johannes Schindelin, Daniel Barkalow, Miklos Vajna
In-Reply-To: <7vpriw26uo.fsf@gitster.siamese.dyndns.org>



Junio,

Thank you for your good explanation.

Also thanks to Miklos Vajna who also replied to suggest using git:// transport.


(Over 3 years ago I used git glone to get a copy of torvalds/linux-2.6.git
 using rsync transport and I copied the recipe I used then.  It has been
 working for "git pull" updates, and for one from-scratch re-clone
 I did sometime last year. I had no reason to suspect it was broken.)

This morning I re-started the clone using git:// transport and it worked OK.


			-Tim Shepard
			 shep@alum.mit.edu

^ permalink raw reply

* [PATCH] Get format-patch to show first commit after root commit
From: Nathan W. Panike @ 2009-01-09 21:33 UTC (permalink / raw)
  To: git; +Cc: Nathan W. Panike

Rework this patch to try to handle the case where one does

git format-patch -n ...

and n is a number larger than 1.  Currently, the command

git format-patch -1 e83c5163316f89bfbde

in the git repository creates an empty file.  Instead, one is
forced to do

git format-patch -1 --root e83c5163316f89bfbde

This seems arbitrary.  This patch fixes this case, so that

git format-patch -1 e83c5163316f89bfbde

will produce an actual patch.

Signed-off-by: Nathan W. Panike <nathan.panike@gmail.com>
---
 builtin-log.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/builtin-log.c b/builtin-log.c
index 4a02ee9..0eca15f 100644
--- a/builtin-log.c
+++ b/builtin-log.c
@@ -975,6 +975,9 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 		nr++;
 		list = xrealloc(list, nr * sizeof(list[0]));
 		list[nr - 1] = commit;
+		if(!commit->parents){
+			rev.show_root_diff=1;
+		}
 	}
 	total = nr;
 	if (!keep_subject && auto_number && total > 1)
-- 
1.6.1.76.gc123b.dirty

^ permalink raw reply related

* git submodule merge madness
From: Ask Bjørn Hansen @ 2009-01-09 21:50 UTC (permalink / raw)
  To: git

Hi,

We've (again) replaced a few directories with submodules.  Man, it's  
madness!

The typical problem is that we get an error trying to merge a "pre- 
submodule" branch into master:

	fatal: cannot read object 894c77319a18c4d48119c2985a9275c9f5883584  
'some/sub/dir': It is a submodule!
Mark Levedahl wrote an example in July, but I don't think he got any  
replies:  http://marc.info/?l=git&m=121587851313303
Any ideas?   Is there something we can do?    I see a strong  
correlation between adding a new submodule and the number of "git  
sucks" messages on our internal IRC server.


  - ask

-- 
http://develooper.com/ - http://askask.com/

^ permalink raw reply

* [PATCH] t7700: demonstrate misbehavior of 'repack -a' when local packs exist
From: Brandon Casey @ 2009-01-09 22:14 UTC (permalink / raw)
  To: git

The ability to "...fatten [the] local repository by packing everything that
is needed by the local ref into a single new pack, including things that are
borrowed from alternates"[1] is supposed to be provided by the '-a' or '-A'
options to repack when '-l' is not used, but there is a flaw.  For each
pack in the local repository without a .keep file, repack supplies a
--unpacked=<pack> argument to pack-objects.

The --unpacked option to pack-objects, with or without an argument, causes
pack-objects to ignore any object which is packed in a pack not mentioned
in an argument to --unpacked=.  So, if there are local packs, and
'repack -a' is called, then any objects which reside in packs accessible
through alternates will _not_ be packed.  If there are no local packs, then
no --unpacked argument will be supplied, and repack will behave as expected.

[1] http://mid.gmane.org/7v8wrwidi3.fsf@gitster.siamese.dyndns.org

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
---


FYI: I won't be looking in to a fix for this immediately. So if someone else
     has time and the inclination, please be my guest. Also, you can thank a
     transformer blow, and lots of disk loss for the discovery of this bug.

-brandon


 t/t7700-repack.sh |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 3f602ea..f5682d6 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -69,5 +69,24 @@ test_expect_success 'packed obs in alt ODB are repacked even when local repo is
 	done
 '
 
+test_expect_failure 'packed obs in alt ODB are repacked when local repo has packs' '
+	rm -f .git/objects/pack/* &&
+	echo new_content >> file1 &&
+	git add file1 &&
+	git commit -m more_content &&
+	git repack &&
+	git repack -a -d &&
+	myidx=$(ls -1 .git/objects/pack/*.idx) &&
+	test -f "$myidx" &&
+	for p in alt_objects/pack/*.idx; do
+		git verify-pack -v $p | sed -n -e "/^[0-9a-f]\{40\}/p"
+	done | while read sha1 rest; do
+		if ! ( git verify-pack -v $myidx | grep "^$sha1" ); then
+			echo "Missing object in local pack: $sha1"
+			return 1
+		fi
+	done
+'
+
 test_done
 
-- 
1.6.1.76.gc123b

^ permalink raw reply related

* [PATCH 1/2] grep -w: forward to next possible position after rejected match
From: René Scharfe @ 2009-01-09 23:08 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano

grep -w accepts matches between non-word characters, only.  If a match
from regexec() doesn't meet this criteria, grep continues its search
after the first character of that match.

We can be a bit smarter here and skip all positions that follow a word
character first, as they can't match our criteria.  This way we can
consume characters quite cheaply and don't need to special-case the
handling of the beginning of a line.

Here's a contrived example command on msysgit (best of five runs):

	$ time git grep -w ...... v1.6.1 >/dev/null

	real    0m1.611s
	user    0m0.000s
	sys     0m0.015s

With the patch it's quite a bit faster:

	$ time git grep -w ...... v1.6.1 >/dev/null

	real    0m1.179s
	user    0m0.000s
	sys     0m0.015s

More common search patterns will gain a lot less, but it's a nice clean
up anyway.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 grep.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index 49e9319..394703b 100644
--- a/grep.c
+++ b/grep.c
@@ -294,7 +294,6 @@ static struct {
 static int match_one_pattern(struct grep_opt *opt, struct grep_pat *p, char *bol, char *eol, enum grep_context ctx)
 {
 	int hit = 0;
-	int at_true_bol = 1;
 	int saved_ch = 0;
 	regmatch_t pmatch[10];
 
@@ -337,7 +336,7 @@ static int match_one_pattern(struct grep_opt *opt, struct grep_pat *p, char *bol
 		 * either end of the line, or at word boundary
 		 * (i.e. the next char must not be a word char).
 		 */
-		if ( ((pmatch[0].rm_so == 0 && at_true_bol) ||
+		if ( ((pmatch[0].rm_so == 0) ||
 		      !word_char(bol[pmatch[0].rm_so-1])) &&
 		     ((pmatch[0].rm_eo == (eol-bol)) ||
 		      !word_char(bol[pmatch[0].rm_eo])) )
@@ -349,10 +348,14 @@ static int match_one_pattern(struct grep_opt *opt, struct grep_pat *p, char *bol
 			/* There could be more than one match on the
 			 * line, and the first match might not be
 			 * strict word match.  But later ones could be!
+			 * Forward to the next possible start, i.e. the
+			 * next position following a non-word char.
 			 */
 			bol = pmatch[0].rm_so + bol + 1;
-			at_true_bol = 0;
-			goto again;
+			while (word_char(bol[-1]) && bol < eol)
+				bol++;
+			if (bol < eol)
+				goto again;
 		}
 	}
 	if (p->token == GREP_PATTERN_HEAD && saved_ch)
-- 
1.6.1

^ permalink raw reply related

* [PATCH 2/2] grep: don't call regexec() for fixed strings
From: René Scharfe @ 2009-01-09 23:18 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano
In-Reply-To: <4967D8F8.9070508@lsrfire.ath.cx>

Add the new flag "fixed" to struct grep_pat and set it if the pattern
is doesn't contain any regex control characters in addition to if the
flag -F/--fixed-strings was specified.

This gives a nice speed up on msysgit, where regexec() seems to be
extra slow.  Before (best of five runs):

	$ time git grep grep v1.6.1 >/dev/null

	real    0m0.552s
	user    0m0.000s
	sys     0m0.000s

	$ time git grep -F grep v1.6.1 >/dev/null

	real    0m0.170s
	user    0m0.000s
	sys     0m0.015s

With the patch:

	$ time git grep grep v1.6.1 >/dev/null

	real    0m0.173s
	user    0m0.000s
	sys     0m0.000s

The difference is much smaller on Linux, but still measurable.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 grep.c |   29 +++++++++++++++++++++++++----
 grep.h |    1 +
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index 394703b..a1092df 100644
--- a/grep.c
+++ b/grep.c
@@ -28,9 +28,31 @@ void append_grep_pattern(struct grep_opt *opt, const char *pat,
 	p->next = NULL;
 }
 
+static int isregexspecial(int c)
+{
+	return isspecial(c) || c == '$' || c == '(' || c == ')' || c == '+' ||
+			       c == '.' || c == '^' || c == '{' || c == '|';
+}
+
+static int is_fixed(const char *s)
+{
+	while (!isregexspecial(*s))
+		s++;
+	return !*s;
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-	int err = regcomp(&p->regexp, p->pattern, opt->regflags);
+	int err;
+
+	if (opt->fixed || is_fixed(p->pattern))
+		p->fixed = 1;
+	if (opt->regflags & REG_ICASE)
+		p->fixed = 0;
+	if (p->fixed)
+		return;
+
+	err = regcomp(&p->regexp, p->pattern, opt->regflags);
 	if (err) {
 		char errbuf[1024];
 		char where[1024];
@@ -159,8 +181,7 @@ void compile_grep_patterns(struct grep_opt *opt)
 		case GREP_PATTERN: /* atom */
 		case GREP_PATTERN_HEAD:
 		case GREP_PATTERN_BODY:
-			if (!opt->fixed)
-				compile_regexp(p, opt);
+			compile_regexp(p, opt);
 			break;
 		default:
 			opt->extended = 1;
@@ -314,7 +335,7 @@ static int match_one_pattern(struct grep_opt *opt, struct grep_pat *p, char *bol
 	}
 
  again:
-	if (!opt->fixed) {
+	if (!p->fixed) {
 		regex_t *exp = &p->regexp;
 		hit = !regexec(exp, bol, ARRAY_SIZE(pmatch),
 			       pmatch, 0);
diff --git a/grep.h b/grep.h
index 45a222d..5102ce3 100644
--- a/grep.h
+++ b/grep.h
@@ -30,6 +30,7 @@ struct grep_pat {
 	const char *pattern;
 	enum grep_header_field field;
 	regex_t regexp;
+	unsigned fixed:1;
 };
 
 enum grep_expr_node {
-- 
1.6.1

^ permalink raw reply related

* Re: [PATCH] gitweb: support the rel=vcs microformat
From: Jakub Narebski @ 2009-01-09 23:49 UTC (permalink / raw)
  To: Joey Hess; +Cc: git
In-Reply-To: <20090107042518.GB24735@gnu.kitenet.net>

Joey Hess <joey@kitenet.net> writes:

> The rel=vcs microformat allows a web page to indicate the locations of
> repositories related to it in a machine-parseable manner.
> (See http://kitenet.net/~joey/rfc/rel-vcs/)

Let me put here an example from avove mentioned page:

  <head>
  <link rel="vcs-git" href="git://example.org/foo.git" 
        title="foo git repository" />
  </head>

  <a rel="vcs-git" href="git://example.org/foo.git" 
     title="git repository">git://example.org/foo.git</a>
  <a rel="vcs-git" href="git://example.org/foo.git">git repository</a>

There is one problem that is not solved in above microformat, but it
is problem only for git hosting sites like repo.or.cz or GitHub,
namely it does not allow to distinguish between fetch (read) link, and
push (write, publish) link.  This is not a problem for standard
(unmodified) gitweb as it shows only read-only git repositories links.

We also have to decide what to put in the 'title' attribute; I think
the simplest would be to put "$project git repository" or something
(for example "git/git.git git repository").

One thing I worry about is that those links (or at least some of those
links) are not meant for the browser to open; also SCP/SSH-like syntax
for SSH protocol in the form of 'user@host:/path/to/repo.git/' which
does not follow URL rules.

> 
> Make gitweb use the microformat in the header of pages it generates,
> if it has been configured with project url information in any of the usual
> ways.

There are two bit separate issues here: marking existing and future
URLs (current project fetch URLs which IIRC are not hyperlinked now;
planned/future 'git' links in project list page; perhaps also links in
OPML and RSS/Atom feeds) with 'rel="vcs-git"', and adding <link .../>
elements to page header.

> 
> Since getting the urls can require hitting disk, I avoided putting the
> microformat on *every* page gitweb generates. Just put it on the project
> summary page, the project list page, and the forks list page.
>
> The first of these already looks up the urls, so adding the microformat was
> free. 

I assume that this patch is only about adding <link ... /> elements to
head?  I think in the case of 'summary' view for a project it is an
excellent idea (similar to having 'prev' and 'next' link elements in
chaptered on-line book in HTML), and would allow for automation using
gitweb as a kind of service announcement.

> There is a small overhead in including the microformat on the latter
> two pages [projects list and list of forks], but getting the project
> descriptions for those pages already incurs a similar overhead, and
> the ability to get every repo url in one place seems worthwhile.

There is also OPML, which might be worth checking.

By the way, for 'projects_list' action and 'forks' actions we have to
decide whether to show _all_ links for each project (there can be more
than one), or whether we show only some main git link (like in the
case of proposed 'git' link).  And whether we trust @git_base_url_list
or do we take it as default and examine per-repository configuration
(more costly).

What is more important: 'project_list' page is already overly large
when hosting very large number of repositories (there were some
patches adding pagination for 'project_list', and perhaps they would
be resend).  Adding <link .../> elements would only add to its size;
and if will be divided into pages we would have also to take it into
account.

> 
> This changes git_get_project_description() to not check wantarray, and only
> return in list context -- the only way it is used AFAICS.

Errr... what? Why do you change git_get_project_description()
subroutine? I don't think it would be good source for 'title'
attribute; perhaps for 'desc' attribute, and only aftre sanitizing
"Unnamed repository; edit this file to name it for gitweb."

Errata: ah, it is git_get_project_url_list() subroutine...

> 
> Signed-off-by: Joey Hess <joey@gnu.kitenet.net>
> ---
>  gitweb/gitweb.perl |   38 ++++++++++++++++++++++++++------------
>  1 files changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 99f71b4..3f8a228 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -789,6 +789,9 @@ $git_dir = "$projectroot/$project" if $project;
>  our @snapshot_fmts = gitweb_get_feature('snapshot');
>  @snapshot_fmts = filter_snapshot_fmts(@snapshot_fmts);
>  
> +# populated later with git urls for the project
> +our @git_url_list;
> +

I'm not sure why this have to be global, but I assume that you want to
avoid recalculationg it in git_header_html

>  # dispatch
>  if (!defined $action) {
>  	if (defined $hash) {
> @@ -2100,17 +2103,22 @@ sub git_show_project_tagcloud {
>  }
>  
>  sub git_get_project_url_list {
> +	# use per project git URL list in $projectroot/$path/cloneurl
> +	# or make project git URL from git base URL and project name

I'd rather use separate subroutine for the second, I think.

>  	my $path = shift;
>  
> +	my @ret;
> +
>  	$git_dir = "$projectroot/$path";
> -	open my $fd, "$git_dir/cloneurl"
> -		or return wantarray ?
> -		@{ config_to_multi(git_get_project_config('url')) } :
> -		   config_to_multi(git_get_project_config('url'));
> -	my @git_project_url_list = map { chomp; $_ } <$fd>;
> -	close $fd;
> +	if (open my $fd, "$git_dir/cloneurl") {
> +		@ret = map { chomp; $_ } <$fd>;
> +		close $fd;
> +	}
> +	else {

Style: "} else {"

> +	       @ret = @{ config_to_multi(git_get_project_config('url')) };
> +	}
>  
> -	return wantarray ? @git_project_url_list : \@git_project_url_list;
> +	return @ret ? @ret : map { "$_/$project" } @git_base_url_list;
>  }

Hmmm... currently gitweb does it at caller:

	my @url_list = git_get_project_url_list($project);
	@url_list = map { "$_/$project" } @git_base_url_list unless @url_list;

Why do you want to put this in git_get_project_url_list()? Please
explain (here and in the commit message too; it has to be mentioned in
commit message that you cnage semantics a bit, and explain why you did
so).

>  
>  sub git_get_projects_list {
> @@ -2953,6 +2961,10 @@ EOF

Sidenote: this should be

  @@ -2953,6 +2961,10 @@ sub git_header_html {

but I'm not sure if it would be possible to automate...

>  		print qq(<link rel="shortcut icon" href="$favicon" type="image/png" />\n);
>  	}
>  
> +	foreach my $url (@git_url_list) {
> +		print qq{<link rel="vcs" type="git" href="$url" />\n};
> +	}
> +

Errr... in mentioned http://kitenet.net/~joey/rel-vcs/ it is

  <link rel="vcs-git" href="$url" title="$project git repository" />

and not

  <link rel="vcs" type="git" href="$url" />

Besides, 'type' attribute for A and LINK elements is about advisory
conent-type of the document pointed by link:

 type = content-type [CI]
    This attribute gives an advisory hint as to the content type of
    the content available at the link target address. It allows user
    agents to opt to use a fallback mechanism rather than fetch the
    content if they are advised that they will get content in a
    content type they do not support.  Authors who use this attribute
    take responsibility to manage the risk that it may become
    inconsistent with the content available at the link target
    address.  
    For the current list of registered content types, please consult
    [MIMETYPES].

>  	print "</head>\n" .
>  	      "<body>\n";
>  
> @@ -4380,6 +4392,8 @@ sub git_project_list {
>  		die_error(404, "No projects found");
>  	}
>  
> +	@git_url_list = map { git_get_project_url_list($_->{path}) } @list;
> +
>  	git_header_html();
>  	if (-f $home_text) {
>  		print "<div class=\"index_include\">\n";
> @@ -4400,6 +4414,8 @@ sub git_forks {
>  	if (defined $order && $order !~ m/none|project|descr|owner|age/) {
>  		die_error(400, "Unknown order parameter");
>  	}
> +	
> +	@git_url_list = map { git_get_project_url_list($_->{path}) } @list;
>  
>  	my @list = git_get_projects_list($project);
>  	if (!@list) {

Those two are pretty straightforward, but please note that
'project_list' view (action) might be _already_ too large...

> @@ -4457,6 +4473,8 @@ sub git_summary {
>  		@forklist = git_get_projects_list($project);
>  	}
>  
> +	@git_url_list = git_get_project_url_list($project);
> +
>  	git_header_html();
>  	git_print_page_nav('summary','', $head);
>  
> @@ -4468,12 +4486,8 @@ sub git_summary {
>  		print "<tr id=\"metadata_lchange\"><td>last change</td><td>$cd{'rfc2822'}</td></tr>\n";
>  	}
>  
> -	# use per project git URL list in $projectroot/$project/cloneurl
> -	# or make project git URL from git base URL and project name
>  	my $url_tag = "URL";
> -	my @url_list = git_get_project_url_list($project);
> -	@url_list = map { "$_/$project" } @git_base_url_list unless @url_list;
> -	foreach my $git_url (@url_list) {
> +	foreach my $git_url (@git_url_list) {
>  		next unless $git_url;
>  		print "<tr class=\"metadata_url\"><td>$url_tag</td><td>$git_url</td></tr>\n";
>  		$url_tag = "";
> -- 
> 1.5.6.5

This is also pretty straightforward: it moves calculation earlier for
results to be shared with git_header_html (and uses global variable).

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply

* Re: [PATCH] gitweb: support the rel=vcs microformat
From: Jakub Narebski @ 2009-01-09 23:56 UTC (permalink / raw)
  To: Giuseppe Bilotta; +Cc: git
In-Reply-To: <gk2794$djn$1@ger.gmane.org>

Giuseppe Bilotta <giuseppe.bilotta@gmail.com> writes:

> On Wednesday 07 January 2009 05:25, Joey Hess wrote:
> 
> > The rel=vcs microformat allows a web page to indicate the locations of
> > repositories related to it in a machine-parseable manner.
> > (See http://kitenet.net/~joey/rfc/rel-vcs/)
> 
> Interesting idea, I like it. However, I see a problem in the proposed
> implementation versus the spec. According to the spec:
> 
> """
> The "title" is optional, but recommended if there are multiple, different
> repositories linked to on one page. It is a human-readable description of the
> repository.
> [...]
> If there are multiple repositories listed, without titles, tools
> should assume they are different repositories.
> """

Good catch.

> 
> In this patch you do NOT add titles to the rel=vcs links, which means that
> everything works fine only if there is a single URL for each project. If a
> project has different URLs, it's going to appear multiple times as _different_
> projects to a spec-compliant reader.
> 
> A possible solution would be to make @git_url_list into a map keyed by the
> project name and having the description and repo URL(s) as values.
> 
> Since there is the possibility of different projects having the same
> description (e.g. the default one), the link title could be composed of
> "$project - $description" rather than simply $description.
> 
> Note that both in summary and in project list view you already retrieve the
> description, so there are no additional disk hits.

Wouldn't "$project git repository" (i.e. do not use description at
all) be a simpler, faster and also _better_ solution?
 

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply

* Re: [PATCH] gitweb: support the rel=vcs microformat
From: Jakub Narebski @ 2009-01-10  0:01 UTC (permalink / raw)
  To: Joey Hess; +Cc: Giuseppe Bilotta, git
In-Reply-To: <20090107184113.GA31795@gnu.kitenet.net>

Joey Hess <joey@kitenet.net> writes:
> Giuseppe Bilotta wrote:
> > Joey Hess <joey@kitenet.net> writes:

> > > Thanks for the feedback. There are some changes happening to the
> > > microformat that should make gitweb's job slightly easier, I'll respin
> > > the patch soon.
> > 
> > Let me know about this too, I very much like the idea of this microformat.
> 
> FYI, I've updated the microformat's page with the changes. The
> significant one for gitweb is that it can now be applied to <a> links.
> So on the project page, the display of the git URL could be converted to
> a link using the microformat, and there's no need to get the info
> earlier to put it in the header. Unfortunatly, the same can't be done to
> the project list page, unless it's changed to have "git" links as seen
> on vger.kernel.org's gitweb.

I'm not sure if making repository URLs to be hyperlinks is a good
idea.  You cannot (should not) click on those in ordinary web browser;
they are to be used by git (that is also additional reason why I am
not so sure about 'git' link on projects_list page idea).

Besides LINK elements in page HEAD are meant mainly for machine; I
think it might be more important to add them for machine there, even
if they are as A elements (links) or just plain text URLs somewhere
else.  For example we have LINK elements with alternate versions,
among others OPML for projectless pages, and RSS/Atom for project
pages, aven though those links are also in page body.

So I'd rather have them LINKs...
-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply

* Re: [PATCH] gitweb: support the rel=vcs microformat
From: Jakub Narebski @ 2009-01-10  0:03 UTC (permalink / raw)
  To: Joey Hess; +Cc: Giuseppe Bilotta, git
In-Reply-To: <20090107190238.GA3909@gnu.kitenet.net>

Joey Hess <joey@kitenet.net> writes:
> Joey Hess wrote:

> > Another approach would be to just memoize git_get_project_description
> > and git_get_project_url_list.
> 
> Especially since git_get_project_description is already called more than
> once for some pages.

Hmmm... this is an idea worth checking.

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox