Git development
 help / color / mirror / Atom feed
* Re: BUG?? INSTALL MAKEFILE
From: Ted Pavlic @ 2009-01-06 19:44 UTC (permalink / raw)
  To: git
In-Reply-To: <49638625.3090109@tedpavlic.com>

> 	git clean -f

DOH. I meant to do

	git clean -fx

(or just a git ls-files). That config.mak.autogen is certainly not 
checked into the repo, and a quick test confirms that "make install" 
certainly does set prefix to the home directory.

That being said, I'm sure that in the recent past I had to use configure 
to install git into home directories.

--Ted


-- 
Ted Pavlic <ted@tedpavlic.com>

^ permalink raw reply

* Re: Quick command to count commits
From: Ted Pavlic @ 2009-01-06 19:55 UTC (permalink / raw)
  To: Henk; +Cc: git
In-Reply-To: <1231267896595-2118851.post@n2.nabble.com>

I doubt it's going to be much faster, but it's easier to type

	git shortlog -s|numsum

I *think* that should give you the same number, and it forces git to do 
more of the counting (rather than wc).

Strangely, on one repo that I test that on, I get slightly different 
numbers from that command and yours. I'm not quite sure why (I guess 
shortlog doesn't count all commits?).

--Ted

On 1/6/09 1:51 PM, Henk wrote:
>
> Hi,
>
> For GitExtensions (windows git ui) I need a command to count all commits. I
> now use this command:
> git.cmd rev-list --all --abbrev-commit | wc -l
>
> This works perfect but its very slow in big repositories. Is there a faster
> way to count the commits?
>
> Henk

-- 
Ted Pavlic <ted@tedpavlic.com>

^ permalink raw reply

* Re: JGit vs. Git
From: Stephen Bannasch @ 2009-01-06 19:55 UTC (permalink / raw)
  To: Robin Rosenberg, Vagmi Mudumbai; +Cc: git
In-Reply-To: <200901061212.07240.robin.rosenberg.lists@dewire.com>

At 12:12 PM +0100 1/6/09, Robin Rosenberg wrote:
>tisdag 06 januari 2009 07:50:11 skrev Vagmi Mudumbai:
>  > I am working on Windows with msysGit behind a HTTP Proxy. (Life cant
>>  get worse, I guess.) . I planned on using grit via JRuby but grit uses
>>  fork which is not available on funny platforms like windows. And JRuby
>>  guys do not have any plan on supporting fork even on platforms on
>>  which for is supported. If JGit is a pure Java based implementation of
>fork isn't really supported on Windows. Cygwin goes to great lengths to
>emulate it. Trying to do that within the context of an arbitrary JVM seems
>like a daunting task. Consider submitting patches to make grit not use fork...
>just kidding.., please help us improve JGit instead :)

Or think about extending the Ruby gem grit to also use JGit.  Which 
would certainly improve grit and  probably help improve JGit also.

I've thought about this a bit -- but it hasn't gotten to the top of 
my list yet ...

There are some examples of Ruby Gems which use Java libraries when 
run in JRuby and native C libraries when used from MRI.

   hpricot:   http://github.com/why/hpricot/tree
   redcloth:  http://github.com/jgarber/redcloth/tree/master

^ permalink raw reply

* Re: [PATCH] fast-export: print usage when no options specified
From: Miklos Vajna @ 2009-01-06 20:11 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, jidanni, git
In-Reply-To: <alpine.DEB.1.00.0901062023100.30769@pacific.mpi-cbg.de>

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

On Tue, Jan 06, 2009 at 08:28:59PM +0100, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Maybe this should be part of the commit message?
> 
> -- snip --
> Some people find it surprising that fast-export does not output a usage 
> when called without parameters, as rev-list does.
> 
> This assumes that a user usually does not want to export HEAD by default.
> -- snap --

OK, if this is the only problem, I can resend it with this included. ;-)

> However, I have to say that I would find exporting HEAD a rather sensible 
> default.
> 
> But I am not _that_ strongly opposed to the patch.  Just would like to 
> hear some opinions first.

According to man git, fast-export is not a plumbing, though I think most
user won't type it manually multiple times, they'll write a script, that
checks if there is something to convert, then run git fast-export | foo
fast-import, so it's really like plumbing. That's why I found the
rev-list-like behaviour more logical.

However, I think defaulting to HEAD is still better than the current
situation, so if that's the consensus, I'm fine with that as well.

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* [PATCH/RFC v2 0/4] git checkout: optimise away lots of lstat() calls
From: Kjetil Barvik @ 2009-01-06 20:36 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Kjetil Barvik

I have just started to clone some interesting Linux git trees to watch
the development more closely, and therefore also started to use git. I
noticed that 'git checkout' takes some time, and especially that the
'git checkout' command does lots and lots of lstat() calls.

After some more investigation and thinking, I have made 4 patches and
been able to optimise away over 42% of all lstat() calls in some cases
for the 'git checkout' command.  I have not tested other git porcelain
commands for reduced lstat() calls, but I would guess that the more
effective 'lstat_cache()' compared to 'has_leading_symlink_cache()',
should also give better numbers in other cases.

All the 4 patches is against git master, and the git 'make test'
test suite still passes after each patch.

To document the improvement, below is some numbers, which compares
before and after all 4 patches. To reproduce the numbers:

- git clone the Linux git tree to be able to get the Linux tags
  'v2.6.25' and 'v2.6.27'.
- git checkout -b my-v2.6.27 v2.6.27
- git checkout -b my-v2.6.25 v2.6.25

Then, when the current branch is 'my-v2.6.25' do:

  strace -o strace_to27 -T git checkout -q my-v2.6.27

And then you pretty print and collect stats from the 'strace_to27'
file.  If someone wants a copy of the strace_stat.pl script, which I
made/used to do the pretty printing, then give me a hint.

Below is the stats/numbers from the current git version (before the 4
patches).  Notice that we do an lstat() call on the "arch" directory
over 6000 times!

TOTAL      185151 100.000% OK:165544 NOT: 19607  11.136001 sec   60 usec/call
lstat64    120954  65.327% OK:107013 NOT: 13941   5.388727 sec   45 usec/call
  strings  120954 tot  30163 uniq   4.010 /uniq   5.388727 sec   45 usec/call
  files     61491 tot  28712 uniq   2.142 /uniq   2.740520 sec   45 usec/call
  dirs      45522 tot   1436 uniq  31.701 /uniq   1.994448 sec   44 usec/call
  errors    13941 tot   5189 uniq   2.687 /uniq   0.653759 sec   47 usec/call
             6297   5.206% OK:  6297 NOT:     0  "arch"
             4544   3.757% OK:  4544 NOT:     0  "drivers"
             1816   1.501% OK:  1816 NOT:     0  "arch/arm"
             1499   1.239% OK:  1499 NOT:     0  "include"
              912   0.754% OK:   912 NOT:     0  "arch/powerpc"
              764   0.632% OK:   764 NOT:     0  "fs"
              746   0.617% OK:   746 NOT:     0  "drivers/net"
              662   0.547% OK:   662 NOT:     0  "net"
              652   0.539% OK:   325 NOT:   327  "arch/sparc/include"
              636   0.526% OK:   636 NOT:     0  "drivers/media"
              606   0.501% OK:   606 NOT:     0  "include/linux"
              533   0.441% OK:   533 NOT:     0  "arch/sh"
              522   0.432% OK:   260 NOT:   262  "arch/powerpc/include"
              488   0.403% OK:   243 NOT:   245  "arch/sh/include"
              413   0.341% OK:   413 NOT:     0  "arch/sparc"
              390   0.322% OK:   390 NOT:     0  "arch/x86"
              383   0.317% OK:   383 NOT:     0  "Documentation"
              370   0.306% OK:   184 NOT:   186  "arch/ia64/include"
              366   0.303% OK:   366 NOT:     0  "drivers/media/video"
              348   0.288% OK:   173 NOT:   175  "arch/arm/include"

Here is the stats/numbers after applying the 4 patches.  Notice how
nice the top 20 entries list now looks!

TOTAL      133655 100.000% OK:121615 NOT: 12040  10.429999 sec   78 usec/call
lstat64     69603  52.077% OK: 63218 NOT:  6385   3.419920 sec   49 usec/call
  strings   69603 tot  30163 uniq   2.308 /uniq   3.419920 sec   49 usec/call
  files     61491 tot  28712 uniq   2.142 /uniq   3.034869 sec   49 usec/call
  dirs       1727 tot   1164 uniq   1.484 /uniq   0.075681 sec   44 usec/call
  errors     6385 tot   5189 uniq   1.230 /uniq   0.309370 sec   48 usec/call
                4   0.006% OK:     4 NOT:     0  ".gitignore"
                4   0.006% OK:     4 NOT:     0  ".mailmap"
                4   0.006% OK:     4 NOT:     0  "CREDITS"
                4   0.006% OK:     4 NOT:     0  "Documentation/00-INDEX"
                4   0.006% OK:     4 NOT:     0  "Documentation/ABI/testing/sysfs-block"
                4   0.006% OK:     4 NOT:     0  "Documentation/ABI/testing/sysfs-firmware-acpi"
                4   0.006% OK:     4 NOT:     0  "Documentation/CodingStyle"
                4   0.006% OK:     4 NOT:     0  "Documentation/DMA-API.txt"
                4   0.006% OK:     4 NOT:     0  "Documentation/DMA-mapping.txt"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/Makefile"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/gadget.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/kernel-api.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/kernel-locking.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/procfs-guide.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/procfs_example.c"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/rapidio.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/s390-drivers.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/uio-howto.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/videobook.tmpl"
                4   0.006% OK:     4 NOT:     0  "Documentation/DocBook/writing_usb_driver.tmpl"

Note that the overall used system time as recorded from 'strace -T',
does not drop so much that the reduced lstat() time should indicate
for _this_ particular test run.  This is because now each unlink()
call takes much more time, at least for me on an slow ide disk (using
ext3) on a laptop.

A simple test gives me an overall improvement of 2.937 seconds: real
time drops from 28.195s (best of 5 runs with 'time git ...'), to
25.381s (best of 5 runs).

I have also noticed that inside the unlink_entry() function in file
unpack-trees.c, one could often end up calling rmdir() lots and lots
of times on none-empty directories.  Maybe one should schedule each
directory for removal by an appropriate function, and then at the end
call a new function to clean all the directories at once?

Comments?

----
  Changes since v1:
  = inside patch 1/4
    - always null terminate the cache_path array
    - added a paragraph to the commit message for this patch
    - small cleanup on 2 comments, and a small line indentation change
      [[Following is updates after comments from Junio C Hamano - Thanks!]]
    - removed the 'static inline update_path_cache()' function
    - replaced the else-part of the above inline function with a call
      to the 'clear_lstat_cache()' function.
    - deleted the '|| errno == ENOTDIR' part inside the big while-loop
      inside check_lstat_cache(), and updated the named BIT-fields
      accordingly
  = inside patch 2/4
    - moved a paragraph out from the commit message for this patch and
      into this cover-letter
      [[Following is updates after comments from Junio C Hamano - Thanks!]]
    - Removed the '|LSTAT_NOTDIR' part from the call to lstat_cache()
      inside function 'check_removed()' inside file diff-lib.c


Kjetil Barvik (4):
  Optimised, faster, more effective symlink/directory detection
  Use 'lstat_cache()' instead of 'has_symlink_leading_path()'
  create_directories() inside entry.c: only check each directory once!
  remove the old 'has_symlink_leading_path()' function

 Makefile               |    2 +-
 builtin-add.c          |    5 ++-
 builtin-apply.c        |    5 ++-
 builtin-update-index.c |    5 ++-
 cache.h                |   17 ++++++++-
 diff-lib.c             |    5 ++-
 dir.c                  |    4 +-
 entry.c                |   86 +++++++++++++++++++++++++++++++++---------
 lstat_cache.c          |   99 ++++++++++++++++++++++++++++++++++++++++++++++++
 symlinks.c             |   64 -------------------------------
 unpack-trees.c         |   10 ++++-
 11 files changed, 211 insertions(+), 91 deletions(-)
 create mode 100644 lstat_cache.c
 delete mode 100644 symlinks.c

^ permalink raw reply

* [PATCH/RFC v2 1/4] Optimised, faster, more effective symlink/directory detection
From: Kjetil Barvik @ 2009-01-06 20:36 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Kjetil Barvik
In-Reply-To: <1231274192-30478-1-git-send-email-barvik@broadpark.no>

This patch is work based on the following 2 commits:

Linus Torvalds: c40641b77b0274186fd1b327d5dc3246f814aaaf
Junio C Hamano: f859c846e90b385c7ef873df22403529208ade50

Changes includes the following:

- The cache functionality is more effective.  Previously when A/B/C/D
  was in the cache and A/B/C/E/file.c was called for, there was no
  match at all from the cache.  Now we use the fact that the paths
  "A", "A/B" and "A/B/C" is already tested, and we only need to do an
  lstat() call on "A/B/C/E".

- We only cache/store the last path regardless of it's type.  Since the
  cache functionality is always used with alphabetically sorted names
  (at least it seams so for me), there is no need to store both the
  last symlink-leading path and the last real-directory path.  Note
  that if the cache is not called with (mostly) alphabetically sorted
  names, neither the old, nor this new one, would be very effective.

- We also can cache the fact that a directory does not exist.
  Previously we could end up doing lots of lstat() calls for a removed
  directory which previously contained lots of files.  Since we
  already have simplified the cache functionality and only store the
  last path (see above), this new functionality was easy to add.

- Previously, when symlink A/B/C/S was cached/stored in the
  symlink-leading path, and A/B/C/file.c was called for, it was not
  easy to use the fact that we already known that the paths "A", "A/B"
  and "A/B/C" is real directories.  Since we now only store one single
  path (the last one), we also get similar logic for free regarding
  the new "non-exsisting-directory-cache".

- Avoid copying the first path components of the name 2 zillions times
  when we tests new path components.  Since we always cache/store the
  last path, we can copy each component as we test those directly into
  the cache.  Previously we ended up doing a memcpy() for the full
  path/name right before each lstat() call, and when updating the
  cache for each time we have tested an new path component.

- We also use less memory, that is PATH_MAX bytes less memory on the
  stack and PATH_MAX bytes less memory on the heap.

- Introduce a 3rd argument, 'unsigned int track_flags', to the
  cache-test function, check_lstat_cache().  This new argument can be
  used to tell the cache functionality which types of directories
  should be cached.

- Also introduce a 'void clear_lstat_cache(void)' function, which
  should be used to clean the cache before usage.  If for instance,
  you have changed the types of directories which should be cached,
  the cache could contain a path which was not wanted.

Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
---
:100644 100644 aabf013... 7449b10... M	Makefile
:100644 100644 231c06d... 9adc4a2... M	cache.h
:000000 100644 0000000... 2f025c8... A	lstat_cache.c
 Makefile      |    1 +
 cache.h       |   15 +++++++++
 lstat_cache.c |   99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 0 deletions(-)
 create mode 100644 lstat_cache.c

diff --git a/Makefile b/Makefile
index aabf0130b99bee5204c8e668ba8f40caea77dae2..7449b105b03e862d53244d50ed035b4ddabef028 100644
--- a/Makefile
+++ b/Makefile
@@ -446,6 +446,7 @@ LIB_OBJS += list-objects.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
+LIB_OBJS += lstat_cache.o
 LIB_OBJS += mailmap.o
 LIB_OBJS += match-trees.o
 LIB_OBJS += merge-file.o
diff --git a/cache.h b/cache.h
index 231c06d7726b575f6e522d5b0c0fe43557e8c651..9adc4a224648bb17a50e6792274225eeb6144451 100644
--- a/cache.h
+++ b/cache.h
@@ -721,6 +721,21 @@ struct checkout {
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
 extern int has_symlink_leading_path(int len, const char *name);
 
+#define LSTAT_DIR       (1u << 0)
+#define LSTAT_NOENT     (1u << 1)
+#define LSTAT_SYMLINK   (1u << 2)
+#define LSTAT_LSTATERR  (1u << 3)
+#define LSTAT_ERR       (1u << 4)
+extern unsigned int check_lstat_cache(int len, const char *name,
+				      unsigned int track_flags);
+static inline unsigned int lstat_cache(int len, const char *name,
+				       unsigned int track_flags,
+				       unsigned int mask_flags)
+{
+	return check_lstat_cache(len, name, track_flags) & mask_flags;
+}
+extern void clear_lstat_cache(void);
+
 extern struct alternate_object_database {
 	struct alternate_object_database *next;
 	char *name;
diff --git a/lstat_cache.c b/lstat_cache.c
new file mode 100644
index 0000000000000000000000000000000000000000..2f025c87b0be0e713af7b1400e71284a22c25e9f
--- /dev/null
+++ b/lstat_cache.c
@@ -0,0 +1,99 @@
+#include "cache.h"
+
+static char         cache_path[PATH_MAX];
+static int          cache_len   = 0;
+static unsigned int cache_flags = 0;
+
+static inline int
+greatest_common_path_cache_prefix(int len, const char *name)
+{
+	int max_len, match_len = 0, i = 0;
+
+	max_len = len < cache_len ? len : cache_len;
+	while (i < max_len && name[i] == cache_path[i]) {
+		if (name[i] == '/') match_len = i;
+		i++;
+	}
+	if (i == cache_len && len > cache_len && name[cache_len] == '/')
+		match_len = cache_len;
+	return match_len;
+}
+
+/*
+ * Check if name 'name' of length 'len' has a symlink leading
+ * component, or if the directory exists and is real, or not.
+ *
+ * To speed up the check, some information is allowed to be cached.
+ * This is indicated by the 'track_flags' argument.
+ */
+unsigned int
+check_lstat_cache(int len, const char *name, unsigned int track_flags)
+{
+	int match_len, last_slash, max_len;
+	unsigned int match_flags, ret_flags, save_flags;
+	struct stat st;
+
+	/* Check if match from the cache for 2 "excluding" path types.
+	 */
+	match_len = last_slash =
+		greatest_common_path_cache_prefix(len, name);
+	match_flags =
+		cache_flags & track_flags & (LSTAT_NOENT|LSTAT_SYMLINK);
+	if (match_flags && match_len == cache_len)
+		return match_flags;
+
+	/* Okay, no match from the cache so far, so now we have to
+	 * check the rest of the path components.
+	 */
+	ret_flags = LSTAT_DIR;
+	max_len = len < PATH_MAX ? len : PATH_MAX;
+	while (match_len < max_len) {
+		do {
+			cache_path[match_len] = name[match_len];
+			match_len++;
+		} while (match_len < max_len && name[match_len] != '/');
+		if (match_len >= max_len)
+			break;
+		last_slash = match_len;
+		cache_path[last_slash] = '\0';
+
+		if (lstat(cache_path, &st)) {
+			ret_flags = LSTAT_LSTATERR;
+			if (errno == ENOENT)
+				ret_flags |= LSTAT_NOENT;
+		} else if (S_ISDIR(st.st_mode)) {
+			continue;
+		} else if (S_ISLNK(st.st_mode)) {
+			ret_flags = LSTAT_SYMLINK;
+		} else {
+			ret_flags = LSTAT_ERR;
+		}
+		break;
+	}
+
+	/* At the end update the cache.  Note that max 3 different
+	 * path types can be cached for the moment!
+	 */
+	save_flags = ret_flags & track_flags &
+		(LSTAT_NOENT|LSTAT_SYMLINK|LSTAT_DIR);
+	if (save_flags && last_slash > 0 && last_slash < PATH_MAX) {
+		cache_path[last_slash] = '\0';
+		cache_len   = last_slash;
+		cache_flags = save_flags;
+	} else {
+		clear_lstat_cache();
+	}
+	return ret_flags;
+}
+
+/*
+ * Before usage of the check_lstat_cache() function one should call
+ * clear_lstat_cache() (at an appropriate place) to make sure that the
+ * cache is clean.
+ */
+void clear_lstat_cache(void)
+{
+	cache_path[0] = '\0';
+	cache_len     = 0;
+	cache_flags   = 0;
+}
-- 
1.6.1.rc1.49.g7f705

^ permalink raw reply related

* [PATCH/RFC v2 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()'
From: Kjetil Barvik @ 2009-01-06 20:36 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Kjetil Barvik
In-Reply-To: <1231274192-30478-1-git-send-email-barvik@broadpark.no>

Start using the optimised, faster and more effective symlink/directory
cache.  The previously used call:

   has_symlink_leading_path(len, name);

should be identically with the following call to lstat_cache():

   lstat_cache(len, name,
               LSTAT_SYMLINK|LSTAT_DIR,
               LSTAT_SYMLINK);

The primary reason for the new name of the function (instead of using
the old name and add 2 extra arguments), is that it is now more
general, for instance, it now also can cache the fact that a directory
does not exists.

Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
---
:100644 100644 719de8b... 152c52c... M	builtin-add.c
:100644 100644 a8f75ed... 0eb2b21... M	builtin-apply.c
:100644 100644 5604977... 2844ddd... M	builtin-update-index.c
:100644 100644 ae96c64... b67ee06... M	diff-lib.c
:100644 100644 0131983... 9f2a1b1... M	dir.c
:100644 100644 54f301d... 0938b86... M	unpack-trees.c
 builtin-add.c          |    5 ++++-
 builtin-apply.c        |    5 ++++-
 builtin-update-index.c |    5 ++++-
 diff-lib.c             |    5 ++++-
 dir.c                  |    4 +++-
 unpack-trees.c         |    9 +++++++--
 6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/builtin-add.c b/builtin-add.c
index 719de8b0f2d2d831f326d948aa18700e5c474950..152c52c0f22b3e71931b2a5629e1472602817785 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -121,7 +121,9 @@ static const char **validate_pathspec(int argc, const char **argv, const char *p
 	if (pathspec) {
 		const char **p;
 		for (p = pathspec; *p; p++) {
-			if (has_symlink_leading_path(strlen(*p), *p)) {
+			if (lstat_cache(strlen(*p), *p,
+					LSTAT_SYMLINK|LSTAT_DIR,
+					LSTAT_SYMLINK)) {
 				int len = prefix ? strlen(prefix) : 0;
 				die("'%s' is beyond a symbolic link", *p + len);
 			}
@@ -225,6 +227,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 
 	argc = parse_options(argc, argv, builtin_add_options,
 			  builtin_add_usage, 0);
+	clear_lstat_cache();
 	if (patch_interactive)
 		add_interactive = 1;
 	if (add_interactive)
diff --git a/builtin-apply.c b/builtin-apply.c
index a8f75ed3ed411d8cf7a3ec9dfefef7407c50f447..0eb2b21ee245919189c780a64cc494ca35f7934a 100644
--- a/builtin-apply.c
+++ b/builtin-apply.c
@@ -2354,7 +2354,9 @@ static int check_to_create_blob(const char *new_name, int ok_if_exists)
 		 * In such a case, path "new_name" does not exist as
 		 * far as git is concerned.
 		 */
-		if (has_symlink_leading_path(strlen(new_name), new_name))
+		if (lstat_cache(strlen(new_name), new_name,
+				LSTAT_SYMLINK|LSTAT_DIR,
+				LSTAT_SYMLINK))
 			return 0;
 
 		return error("%s: already exists in working directory", new_name);
@@ -3154,6 +3156,7 @@ int cmd_apply(int argc, const char **argv, const char *unused_prefix)
 	if (apply_default_whitespace)
 		parse_whitespace_option(apply_default_whitespace);
 
+	clear_lstat_cache();
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
 		char *end;
diff --git a/builtin-update-index.c b/builtin-update-index.c
index 560497750586ec61be4e34de6dedd9c307129817..2844ddd95d208b35d47c1cc2ca40af4b68bd94ab 100644
--- a/builtin-update-index.c
+++ b/builtin-update-index.c
@@ -195,7 +195,9 @@ static int process_path(const char *path)
 	struct stat st;
 
 	len = strlen(path);
-	if (has_symlink_leading_path(len, path))
+	if (lstat_cache(len, path,
+			LSTAT_SYMLINK|LSTAT_DIR,
+			LSTAT_SYMLINK))
 		return error("'%s' is beyond a symbolic link", path);
 
 	/*
@@ -581,6 +583,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 	if (entries < 0)
 		die("cache corrupted");
 
+	clear_lstat_cache();
 	for (i = 1 ; i < argc; i++) {
 		const char *path = argv[i];
 		const char *p;
diff --git a/diff-lib.c b/diff-lib.c
index ae96c64ca209f4df9008198e8a04b160bed618c7..b67ee068e377e5aa1c7ed4abd88c0e307ba8fb4f 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -31,7 +31,9 @@ static int check_removed(const struct cache_entry *ce, struct stat *st)
 			return -1;
 		return 1;
 	}
-	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
+	if (lstat_cache(ce_namelen(ce), ce->name,
+			LSTAT_SYMLINK|LSTAT_DIR,
+			LSTAT_SYMLINK))
 		return 1;
 	if (S_ISDIR(st->st_mode)) {
 		unsigned char sub[20];
@@ -69,6 +71,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		diff_unmerged_stage = 2;
 	entries = active_nr;
 	symcache[0] = '\0';
+	clear_lstat_cache();
 	for (i = 0; i < entries; i++) {
 		struct stat st;
 		unsigned int oldmode, newmode;
diff --git a/dir.c b/dir.c
index 0131983dfbc143ce5dae77e067663bb2e7d5f126..9f2a1b1f245c3f1d4f12d54d45bb193c53fa15b5 100644
--- a/dir.c
+++ b/dir.c
@@ -719,7 +719,9 @@ int read_directory(struct dir_struct *dir, const char *path, const char *base, i
 {
 	struct path_simplify *simplify;
 
-	if (has_symlink_leading_path(strlen(path), path))
+	if (lstat_cache(strlen(path), path,
+			LSTAT_SYMLINK|LSTAT_DIR,
+			LSTAT_SYMLINK))
 		return dir->nr;
 
 	simplify = create_simplify(pathspec);
diff --git a/unpack-trees.c b/unpack-trees.c
index 54f301da67be879c80426bc21776427fdd38c02e..0938b86278f9b6aa14ef32fe15392f34b694d8cc 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -61,7 +61,9 @@ static void unlink_entry(struct cache_entry *ce)
 	char *cp, *prev;
 	char *name = ce->name;
 
-	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
+	if (lstat_cache(ce_namelen(ce), ce->name,
+			LSTAT_SYMLINK|LSTAT_NOENT|LSTAT_DIR,
+			LSTAT_SYMLINK|LSTAT_NOENT))
 		return;
 	if (unlink(name))
 		return;
@@ -105,6 +107,7 @@ static int check_updates(struct unpack_trees_options *o)
 		cnt = 0;
 	}
 
+	clear_lstat_cache();
 	for (i = 0; i < index->cache_nr; i++) {
 		struct cache_entry *ce = index->cache[i];
 
@@ -584,7 +587,9 @@ static int verify_absent(struct cache_entry *ce, const char *action,
 	if (o->index_only || o->reset || !o->update)
 		return 0;
 
-	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
+	if (lstat_cache(ce_namelen(ce), ce->name,
+			LSTAT_SYMLINK|LSTAT_NOENT|LSTAT_DIR,
+			LSTAT_SYMLINK|LSTAT_NOENT))
 		return 0;
 
 	if (!lstat(ce->name, &st)) {
-- 
1.6.1.rc1.49.g7f705

^ permalink raw reply related

* [PATCH/RFC v2 3/4] create_directories() inside entry.c: only check each directory once!
From: Kjetil Barvik @ 2009-01-06 20:36 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Kjetil Barvik
In-Reply-To: <1231274192-30478-1-git-send-email-barvik@broadpark.no>

When we do an 'git checkout' after some time we end up in the
'checkout_entry()' function inside entry.c, and from here we call the
'create_directories()' function to make sure the all the directories
exists for the possible new file or entry.

The 'create_directories()' function happily started to check that all
path component exists.  This resulted in tons and tons of calls to
lstat() or stat() when we checkout files nested deep inside a
directory.

We try to avoid this by remembering the last checked and possible
newly created directory.

Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
---
:100644 100644 9adc4a2... 4c56f3f... M	cache.h
:100644 100644 aa2ee46... 666a8ce... M	entry.c
:100644 100644 0938b86... 037678d... M	unpack-trees.c
 cache.h        |    1 +
 entry.c        |   86 ++++++++++++++++++++++++++++++++++++++++++++------------
 unpack-trees.c |    1 +
 3 files changed, 70 insertions(+), 18 deletions(-)

diff --git a/cache.h b/cache.h
index 9adc4a224648bb17a50e6792274225eeb6144451..4c56f3faa0b87398e27346358e40af964a1828bb 100644
--- a/cache.h
+++ b/cache.h
@@ -718,6 +718,7 @@ struct checkout {
 		 refresh_cache:1;
 };
 
+extern void clear_created_dirs_cache(void);
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
 extern int has_symlink_leading_path(int len, const char *name);
 
diff --git a/entry.c b/entry.c
index aa2ee46a84033585d8e07a585610c5a697af82c2..666a8ce3a132e85a45b0828521f3c2119c77833e 100644
--- a/entry.c
+++ b/entry.c
@@ -1,33 +1,76 @@
 #include "cache.h"
 #include "blob.h"
 
-static void create_directories(const char *path, const struct checkout *state)
+static char dirs_path[PATH_MAX];
+static int  dirs_len = 0;
+
+static inline int
+greatest_common_created_dirs_prefix(int len, const char *name)
 {
-	int len = strlen(path);
-	char *buf = xmalloc(len + 1);
-	const char *slash = path;
+	int max_len, match_len = 0, i = 0;
 
-	while ((slash = strchr(slash+1, '/')) != NULL) {
-		struct stat st;
-		int stat_status;
+	max_len = len < dirs_len ? len : dirs_len;
+	while (i < max_len && name[i] == dirs_path[i]) {
+		if (name[i] == '/') match_len = i;
+		i++;
+	}
+	if (i == dirs_len && len > dirs_len && name[dirs_len] == '/')
+		match_len = dirs_len;
+	return match_len;
+}
+
+static inline void
+update_created_dirs_cache(int last_slash)
+{
+	if (last_slash > 0 && last_slash < PATH_MAX) {
+		dirs_len = last_slash;
+	} else {
+		dirs_len = 0;
+	}
+}
 
-		len = slash - path;
-		memcpy(buf, path, len);
-		buf[len] = 0;
+void clear_created_dirs_cache(void)
+{
+	dirs_len = 0;
+}
+
+static void
+create_directories(int len, const char *path, const struct checkout *state)
+{
+	int i, max_len, last_slash, stat_status;
+	struct stat st;
+
+	/* Check the cache for previously checked or created
+	 * directories (and components) within this function.  There
+	 * is no need to check or re-create directory components more
+	 * than once!
+	 */
+	max_len = len < PATH_MAX ? len : PATH_MAX;
+	i = last_slash = greatest_common_created_dirs_prefix(max_len, path);
 
-		if (len <= state->base_dir_len)
+	while (i < max_len) {
+		do {
+			dirs_path[i] = path[i];
+			i++;
+		} while (i < max_len && path[i] != '/');
+		if (i >= max_len)
+			break;
+		last_slash = i;
+		dirs_path[last_slash] = '\0';
+
+		if (last_slash <= state->base_dir_len)
 			/*
 			 * checkout-index --prefix=<dir>; <dir> is
 			 * allowed to be a symlink to an existing
 			 * directory.
 			 */
-			stat_status = stat(buf, &st);
+			stat_status = stat(dirs_path, &st);
 		else
 			/*
 			 * if there currently is a symlink, we would
 			 * want to replace it with a real directory.
 			 */
-			stat_status = lstat(buf, &st);
+			stat_status = lstat(dirs_path, &st);
 
 		if (!stat_status && S_ISDIR(st.st_mode))
 			continue; /* ok, it is already a directory. */
@@ -38,14 +81,14 @@ static void create_directories(const char *path, const struct checkout *state)
 		 * error codepath; we do not care, as we unlink and
 		 * mkdir again in such a case.
 		 */
-		if (mkdir(buf, 0777)) {
+		if (mkdir(dirs_path, 0777)) {
 			if (errno == EEXIST && state->force &&
-			    !unlink(buf) && !mkdir(buf, 0777))
+			    !unlink(dirs_path) && !mkdir(dirs_path, 0777))
 				continue;
-			die("cannot create directory at %s", buf);
+			die("cannot create directory at %s", dirs_path);
 		}
 	}
-	free(buf);
+	update_created_dirs_cache(last_slash);
 }
 
 static void remove_subtree(const char *path)
@@ -55,6 +98,11 @@ static void remove_subtree(const char *path)
 	char pathbuf[PATH_MAX];
 	char *name;
 
+	/* To be utterly safe we invalidate the cache of the
+	 * previously created directories.
+	 */
+	clear_created_dirs_cache();
+
 	if (!dir)
 		die("cannot opendir %s (%s)", path, strerror(errno));
 	strcpy(pathbuf, path);
@@ -195,12 +243,14 @@ int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *t
 	static char path[PATH_MAX + 1];
 	struct stat st;
 	int len = state->base_dir_len;
+	int path_len;
 
 	if (topath)
 		return write_entry(ce, topath, state, 1);
 
 	memcpy(path, state->base_dir, len);
 	strcpy(path + len, ce->name);
+	path_len = len + ce_namelen(ce);
 
 	if (!lstat(path, &st)) {
 		unsigned changed = ce_match_stat(ce, &st, CE_MATCH_IGNORE_VALID);
@@ -229,6 +279,6 @@ int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *t
 			return error("unable to unlink old '%s' (%s)", path, strerror(errno));
 	} else if (state->not_new)
 		return 0;
-	create_directories(path, state);
+	create_directories(path_len, path, state);
 	return write_entry(ce, path, state, 0);
 }
diff --git a/unpack-trees.c b/unpack-trees.c
index 0938b86278f9b6aa14ef32fe15392f34b694d8cc..037678dfd75b7ef8235105e072fb276fcc6c36a8 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -121,6 +121,7 @@ static int check_updates(struct unpack_trees_options *o)
 		}
 	}
 
+	clear_created_dirs_cache();
 	for (i = 0; i < index->cache_nr; i++) {
 		struct cache_entry *ce = index->cache[i];
 
-- 
1.6.1.rc1.49.g7f705

^ permalink raw reply related

* [PATCH/RFC v2 4/4] remove the old 'has_symlink_leading_path()' function
From: Kjetil Barvik @ 2009-01-06 20:36 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Kjetil Barvik
In-Reply-To: <1231274192-30478-1-git-send-email-barvik@broadpark.no>

It has been replaced by the more cache effective 'lstat_cache()'
function.  see lstat_cache.c

Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
---
:100644 100644 7449b10... edd4798... M	Makefile
:100644 100644 4c56f3f... 8b8da85... M	cache.h
:100644 000000 5a5e781... 0000000... D	symlinks.c
 Makefile   |    1 -
 cache.h    |    1 -
 symlinks.c |   64 ------------------------------------------------------------
 3 files changed, 0 insertions(+), 66 deletions(-)
 delete mode 100644 symlinks.c

diff --git a/Makefile b/Makefile
index 7449b105b03e862d53244d50ed035b4ddabef028..edd4798429ad3828f5b9dcff20f73c6a7269520e 100644
--- a/Makefile
+++ b/Makefile
@@ -483,7 +483,6 @@ LIB_OBJS += sha1_name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += strbuf.o
-LIB_OBJS += symlinks.o
 LIB_OBJS += tag.o
 LIB_OBJS += trace.o
 LIB_OBJS += transport.o
diff --git a/cache.h b/cache.h
index 4c56f3faa0b87398e27346358e40af964a1828bb..8b8da85f32600050dddfcdf6b0c0e6ecb8f27072 100644
--- a/cache.h
+++ b/cache.h
@@ -720,7 +720,6 @@ struct checkout {
 
 extern void clear_created_dirs_cache(void);
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
-extern int has_symlink_leading_path(int len, const char *name);
 
 #define LSTAT_DIR       (1u << 0)
 #define LSTAT_NOENT     (1u << 1)
diff --git a/symlinks.c b/symlinks.c
deleted file mode 100644
index 5a5e781a15d7d9cb60797958433eca896b31ec85..0000000000000000000000000000000000000000
--- a/symlinks.c
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "cache.h"
-
-struct pathname {
-	int len;
-	char path[PATH_MAX];
-};
-
-/* Return matching pathname prefix length, or zero if not matching */
-static inline int match_pathname(int len, const char *name, struct pathname *match)
-{
-	int match_len = match->len;
-	return (len > match_len &&
-		name[match_len] == '/' &&
-		!memcmp(name, match->path, match_len)) ? match_len : 0;
-}
-
-static inline void set_pathname(int len, const char *name, struct pathname *match)
-{
-	if (len < PATH_MAX) {
-		match->len = len;
-		memcpy(match->path, name, len);
-		match->path[len] = 0;
-	}
-}
-
-int has_symlink_leading_path(int len, const char *name)
-{
-	static struct pathname link, nonlink;
-	char path[PATH_MAX];
-	struct stat st;
-	char *sp;
-	int known_dir;
-
-	/*
-	 * See if the last known symlink cache matches.
-	 */
-	if (match_pathname(len, name, &link))
-		return 1;
-
-	/*
-	 * Get rid of the last known directory part
-	 */
-	known_dir = match_pathname(len, name, &nonlink);
-
-	while ((sp = strchr(name + known_dir + 1, '/')) != NULL) {
-		int thislen = sp - name ;
-		memcpy(path, name, thislen);
-		path[thislen] = 0;
-
-		if (lstat(path, &st))
-			return 0;
-		if (S_ISDIR(st.st_mode)) {
-			set_pathname(thislen, path, &nonlink);
-			known_dir = thislen;
-			continue;
-		}
-		if (S_ISLNK(st.st_mode)) {
-			set_pathname(thislen, path, &link);
-			return 1;
-		}
-		break;
-	}
-	return 0;
-}
-- 
1.6.1.rc1.49.g7f705

^ permalink raw reply related

* Re: [ANNOUNCE] Git homepage change
From: Miklos Vajna @ 2009-01-06 20:40 UTC (permalink / raw)
  To: Boyd Lynn Gerber; +Cc: Scott Chacon, David Bryson, Mike Hommey, git
In-Reply-To: <alpine.LNX.2.00.0901060907350.9096@suse104.zenez.com>

[-- Attachment #1: Type: text/plain, Size: 337 bytes --]

On Tue, Jan 06, 2009 at 09:12:30AM -0700, Boyd Lynn Gerber <gerberb@zenez.com> wrote:
> archive.  The only problem is I have not figured out how to download all 
> 303 articles, quickly/easily.  You would think one of the majordomo 
> commands to get articles would work.  They do not.

I would use:

http://gmane.org/export.php

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* [PATCH] strbuf: instate cleanup rule in case of non-memory errors
From: René Scharfe @ 2009-01-06 20:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Pierre Habouzit, Linus Torvalds, git
In-Reply-To: <20090104122108.GC29325@artemis.corp>

Make all strbuf functions that can fail free() their memory on error if
they have allocated it.  They don't shrink buffers that have been grown,
though.

This allows for easier error handling, as callers only need to call
strbuf_release() if A) the command succeeded or B) if they would have had
to do so anyway because they added something to the strbuf themselves.

Bonus hunk: document strbuf_readlink.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 Documentation/technical/api-strbuf.txt |   11 +++++++++--
 strbuf.c                               |   17 +++++++++++++----
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/api-strbuf.txt b/Documentation/technical/api-strbuf.txt
index a8ee2fe..9a4e3ea 100644
--- a/Documentation/technical/api-strbuf.txt
+++ b/Documentation/technical/api-strbuf.txt
@@ -133,8 +133,10 @@ Functions
 
 * Adding data to the buffer
 
-NOTE: All of these functions in this section will grow the buffer as
-      necessary.
+NOTE: All of the functions in this section will grow the buffer as necessary.
+If they fail for some reason other than memory shortage and the buffer hadn't
+been allocated before (i.e. the `struct strbuf` was set to `STRBUF_INIT`),
+then they will free() it.
 
 `strbuf_addch`::
 
@@ -235,6 +237,11 @@ same behaviour as well.
 	Read the contents of a file, specified by its path. The third argument
 	can be used to give a hint about the file size, to avoid reallocs.
 
+`strbuf_readlink`::
+
+	Read the target of a symbolic link, specified by its path.  The third
+	argument can be used to give a hint about the size, to avoid reallocs.
+
 `strbuf_getline`::
 
 	Read a line from a FILE* pointer. The second argument specifies the line
diff --git a/strbuf.c b/strbuf.c
index bdf4954..6ed0684 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -256,18 +256,21 @@ size_t strbuf_expand_dict_cb(struct strbuf *sb, const char *placeholder,
 size_t strbuf_fread(struct strbuf *sb, size_t size, FILE *f)
 {
 	size_t res;
+	size_t oldalloc = sb->alloc;
 
 	strbuf_grow(sb, size);
 	res = fread(sb->buf + sb->len, 1, size, f);
-	if (res > 0) {
+	if (res > 0)
 		strbuf_setlen(sb, sb->len + res);
-	}
+	else if (res < 0 && oldalloc == 0)
+		strbuf_release(sb);
 	return res;
 }
 
 ssize_t strbuf_read(struct strbuf *sb, int fd, size_t hint)
 {
 	size_t oldlen = sb->len;
+	size_t oldalloc = sb->alloc;
 
 	strbuf_grow(sb, hint ? hint : 8192);
 	for (;;) {
@@ -275,7 +278,10 @@ ssize_t strbuf_read(struct strbuf *sb, int fd, size_t hint)
 
 		cnt = xread(fd, sb->buf + sb->len, sb->alloc - sb->len - 1);
 		if (cnt < 0) {
-			strbuf_setlen(sb, oldlen);
+			if (oldalloc == 0)
+				strbuf_release(sb);
+			else
+				strbuf_setlen(sb, oldlen);
 			return -1;
 		}
 		if (!cnt)
@@ -292,6 +298,8 @@ ssize_t strbuf_read(struct strbuf *sb, int fd, size_t hint)
 
 int strbuf_readlink(struct strbuf *sb, const char *path, size_t hint)
 {
+	size_t oldalloc = sb->alloc;
+
 	if (hint < 32)
 		hint = 32;
 
@@ -311,7 +319,8 @@ int strbuf_readlink(struct strbuf *sb, const char *path, size_t hint)
 		/* .. the buffer was too small - try again */
 		hint *= 2;
 	}
-	strbuf_release(sb);
+	if (oldalloc == 0)
+		strbuf_release(sb);
 	return -1;
 }
 
-- 
1.6.1

^ permalink raw reply related

* [PATCH 4/3] shortlog: handle multi-line subjects like log --pretty=oneline et. al. do
From: René Scharfe @ 2009-01-06 20:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: markus.heidelberg, git
In-Reply-To: <7vocynz8y6.fsf@gitster.siamese.dyndns.org>

The commit message parser of git shortlog used to treat only the first
non-empty line of the commit message as the subject.  Other log commands
(e.g. --pretty=oneline) show the whole first paragraph instead (unwrapped
into a single line).

For consistency, this patch borrows format_subject() from pretty.c to
make shortlog do the same.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 builtin-shortlog.c |    9 ++++++---
 pretty.c           |    4 ++--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/builtin-shortlog.c b/builtin-shortlog.c
index d03f14f..e492906 100644
--- a/builtin-shortlog.c
+++ b/builtin-shortlog.c
@@ -29,6 +29,9 @@ static int compare_by_number(const void *a1, const void *a2)
 		return -1;
 }
 
+const char *format_subject(struct strbuf *sb, const char *msg,
+			   const char *line_separator);
+
 static void insert_one_record(struct shortlog *log,
 			      const char *author,
 			      const char *oneline)
@@ -41,6 +44,7 @@ static void insert_one_record(struct shortlog *log,
 	size_t len;
 	const char *eol;
 	const char *boemail, *eoemail;
+	struct strbuf subject = STRBUF_INIT;
 
 	boemail = strchr(author, '<');
 	if (!boemail)
@@ -89,9 +93,8 @@ static void insert_one_record(struct shortlog *log,
 	while (*oneline && isspace(*oneline) && *oneline != '\n')
 		oneline++;
 	len = eol - oneline;
-	while (len && isspace(oneline[len-1]))
-		len--;
-	buffer = xmemdupz(oneline, len);
+	format_subject(&subject, oneline, " ");
+	buffer = strbuf_detach(&subject, NULL);
 
 	if (dot3) {
 		int dot3len = strlen(dot3);
diff --git a/pretty.c b/pretty.c
index 343dca5..421d9c5 100644
--- a/pretty.c
+++ b/pretty.c
@@ -486,8 +486,8 @@ static void parse_commit_header(struct format_commit_context *context)
 	context->commit_header_parsed = 1;
 }
 
-static const char *format_subject(struct strbuf *sb, const char *msg,
-				  const char *line_separator)
+const char *format_subject(struct strbuf *sb, const char *msg,
+			   const char *line_separator)
 {
 	int first = 1;
 
-- 
1.6.1

^ permalink raw reply related

* Re: [JGIT PATCH] Improve the sideband progress scaper to be more accurate
From: Nicolas Pitre @ 2009-01-06 20:55 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Robin Rosenberg, git
In-Reply-To: <20090106192747.GD24578@spearce.org>

On Tue, 6 Jan 2009, Shawn O. Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> > There may indeed be line fragments sent over the sideband channel, as 
> > well as the opposite which is multiple lines sent at once in a single 
> > packet.  If you look at sideband.c you'll find about all those cases.
> 
> Thanks, that's what I thought...
>  
> > In general, what you have to do is:
> > 
> >  - for each packet:
> >    - split into multiple chunks on line breaks ('\r' or '\n')
> >    - for each chunk:
> >      - if last chunk didn't end with a line break, or if current 
> >        chunk is empty or only contains a line break, then skip printing 
> >        the "remote:" prefix.  Otherwise print it.
> >      - print the current chunk up to any line break
> >      - if current chunk contains a line break and other characters then
> >        print a sequence to clear the remaining of the screen line
> >      - print the line break if any
> 
> Hmm.  I should note that C Git still screws this up sometimes.  I've
> seen 1.6.1 git fetch mess up the output from repo.or.cz's sideband.
> I'm sure Pasky isn't running JGit's daemon, its too damn fast. :-)
> 
> I don't have a spew of it handy, but late last week I saw it screw
> up while doing a clone off repo.or.cz.

Next time you see such things please send me a snapshot.

What is still possible and IMHO not worth the effort to fix is some 
interaction between local and remote displays which, if intermixed in 
some unlucky way, do screw up the final appearance.  For example, if the 
remote has sent a partial line, and before the remaining of it is 
received/printed you actually have some local process also displaying a 
line of its own.  If that was to become a real issue, then the fix would 
involve caching any partial line sent from the remote until the 
associated line break is received.


Nicolas

^ permalink raw reply

* Re: [PATCH/RFC v2 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()'
From: Linus Torvalds @ 2009-01-06 21:08 UTC (permalink / raw)
  To: Kjetil Barvik; +Cc: git, Junio C Hamano
In-Reply-To: <1231274192-30478-3-git-send-email-barvik@broadpark.no>



On Tue, 6 Jan 2009, Kjetil Barvik wrote:
?
> Start using the optimised, faster and more effective symlink/directory
> cache.  The previously used call:
> 
>    has_symlink_leading_path(len, name);
> 
> should be identically with the following call to lstat_cache():
> 
>    lstat_cache(len, name,
>                LSTAT_SYMLINK|LSTAT_DIR,
>                LSTAT_SYMLINK);

I think the new interface looks worse.

Why don't you just do a new inline function that says

	static inline int has_symlink_leading_path(int len, const char *name)
	{
		return lstat_cache(len, name,
			LSTAT_SYMLINK|LSTAT_DIR,
			LSTAT_SYMLINK);
	}

and now you don't need this big patch, and people who don't care about 
those magic flags don't need to have them. End result: more readable code.

Then, the new users that want _new_ semantics can use the extended 
version.

This is how git has done pretty much all "generalized" versions. See the 
whole ce_modified() vs ie_modified() thing: they're the same function, 
it's just that 'ce_modified()' is the traditional simpler interface that 
works on the default index, while ie_modified() is the "full" version that 
takes all the details that most uses don't even want to know about.

				Linus

^ permalink raw reply

* Re: JGit vs. Git
From: Johannes Schindelin @ 2009-01-06 21:41 UTC (permalink / raw)
  To: Vagmi Mudumbai; +Cc: git
In-Reply-To: <a55cfe9d0901052250k2be203dfvb0b437a523f2cecc@mail.gmail.com>

Hi,

On Tue, 6 Jan 2009, Vagmi Mudumbai wrote:

> I am working on Windows with msysGit behind a HTTP Proxy. (Life cant
> get worse, I guess.) .

FWIW I think all should work well if you use a proxy, such as 
http://www.meadowy.org/~gotoh/projects/connect

Hth,
Dscho

^ permalink raw reply

* Problems getting rid of large files using git-filter-branch
From: Øyvind Harboe @ 2009-01-06 21:59 UTC (permalink / raw)
  To: git

I'm trying to get rid of some large objects in my .git repository
using git-filter-branch. These are remnants from conversion from
CVS.


Q1: How can I figure out what it is in .git that takes so much space?

Q2: Where can I read more about what to do after running git-filter-branch to
removing the offending objects?



1. I ran this command to get rid of the offending files and that appears to
have worked. I can't find any traces of them anymore...

git filter-branch --tree-filter 'find . -regex ".*toolchain\..*" -exec
rm -f {} \;' HEAD

2. Running "git gc" takes a few seconds. The repository is still
huge(it should be
perhaps 10-20mByte).

du -skh .git/
187M    .git/

3. I tried "git reflog expire --all" + lots of other tricks in the
link below, but no luck.



I tried the tricks I could find in this thread, but no luck:

http://article.gmane.org/gmane.comp.version-control.git/60219/match=trying+use+git+filter+branch+compress

-- 
Øyvind Harboe
http://www.zylin.com/zy1000.html
ARM7 ARM9 XScale Cortex
JTAG debugger and flash programmer

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Johannes Schindelin @ 2009-01-06 22:20 UTC (permalink / raw)
  To: Øyvind Harboe; +Cc: git
In-Reply-To: <c09652430901061359q7a02291fk656ab23e54b19f5e@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 324 bytes --]

Hi,

On Tue, 6 Jan 2009, Øyvind Harboe wrote:

> Q1: How can I figure out what it is in .git that takes so much space?

If it is a pack that is taking so much space:

$ git verify-pack -v $PACK | grep -v "^chain " | sort -n -k 4

and then for the last few lines do a

$ git rev-list --all --objects | grep $SHA1

Hth,
Dscho

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Nicolas Pitre @ 2009-01-06 22:31 UTC (permalink / raw)
  To: Øyvind Harboe; +Cc: git
In-Reply-To: <c09652430901061359q7a02291fk656ab23e54b19f5e@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1106 bytes --]

On Tue, 6 Jan 2009, Øyvind Harboe wrote:

> Q1: How can I figure out what it is in .git that takes so much space?
> 
> Q2: Where can I read more about what to do after running git-filter-branch to
> removing the offending objects?
> 
> 
> 
> 1. I ran this command to get rid of the offending files and that appears to
> have worked. I can't find any traces of them anymore...
> 
> git filter-branch --tree-filter 'find . -regex ".*toolchain\..*" -exec
> rm -f {} \;' HEAD
> 
> 2. Running "git gc" takes a few seconds. The repository is still
> huge(it should be
> perhaps 10-20mByte).
> 
> du -skh .git/
> 187M    .git/
> 
> 3. I tried "git reflog expire --all" + lots of other tricks in the
> link below, but no luck.

OK, try this:

	cd ..
	mv my_repo my_repo.orig
	mkdir my_repo
	cd my_repo
	git init
	git pull file://$(pwd)/../my_repo.orig

This is the easiest way to ensure you have only the necessary objects in 
the new repo, without all the extra stuff tied to reflogs, etc.

Then, if your repo is still seemingly too big, you can get a bit dirty 
with the sequence Johannes just posted.


Nicolas

^ permalink raw reply

* Comments on Presentation Notes Request.
From: Tim Visher @ 2009-01-06 22:33 UTC (permalink / raw)
  To: git

Hello Everyone,

I'm putting together a little 15 minute presentation for my company
regarding SCMSes in an attempt to convince them to at the very least
use a Distributed SCMS and at best to use git.  I put together all my
notes, although I didn't put together the actual presentation yet.  I
figured I'd post them here and maybe get some feedback about it.  Let
me know what you think.

Thanks in advance!

Notes
---------

SCM: Distributed, Centralized, and Everything in Between.

* What is SCM and Why is it Useful?

** Definition

SCM is the practice of committing the revisions of your source code to
a system that can faithfully reproduce historical snap shots of your
source code.

** Advantages of SCM

*** One Source to Rule Them All.

Instead of having a bunch of source files spread across multiple
developers machines with multiple versions on each machine that may or
may not be labeled correctly, you have one repository containing every
artifact that your project includes.

*** Unlimited Undo/Redo.

Not only is it unlimited, but it's random access.  If you changed a
function a week ago, continued to work, and then decide that you want
the function back the way it was, it's as simple as pulling the
function back out of the SCMS.

*** Safe Concurrent Editing.

Many people can edit the same code base at the same time and know,
without a doubt, that when they pull all those changes together, the
system will merge the content intelligently or inform you of the
conflict and let you merge it.  You don't need to lock files.
Obviously, if there is bad coordination then the possibilities of
conflicts rise, but this should not happen regularly.

*** Diff Debugging

You can find where a bug was introduced by learning how to reproduce
the bug and then doing a binary chop search back through the History
to come to the exact commit that introduced the bug.

* SCM Best Practices

** Commit Early, Commit Often

The more you commit, the more fine grained control you have over the
undo feature of SCM.  Most documents that I have read suggested a TDD
approach wherein you commit whenever you have written just enough code
for your test to pass. But...

** Don't Commit Broken Code (To the Public Tree)

Of primary concern is the fact that your central HEAD should _always_
build.  This is why practices like Continuous Integration and TDD are
so important.  TDD gives you the freedom to be sure that a change you
made hasn't broken anything you weren't expecting it to break.
Continuous Integration allows you to be sure that your whole system
will build every time.  Thus, you should _never_ commit broken code to
the (public) tree.

Of course, in a centralized system, committing is intrinsically
public.  Even on branches, every time you commit any sort of change,
everyone is able to see it and so you could be breaking the build for
someone (even if it's just yourself and the build system).  One of the
nice features of a distributed system is that your public/private
ontology is much richer and thus allows you to have broken code in
your SCMS.

** Whole Hog

You should put everything necessary to build your system into SCM.
This includes user documentation, requirements documentation, software
tools, build tools, etc.  The only artifacts that don't need to be
managed are auto-generated artifacts such as javadocs, jar files, exe
files, etc.  This is so that you can reproduce entire releases using
only a simple checkout.

** Perform Updates and Commits on the Whole Tree

Updates and Commits should always be done on the whole tree so that
you're sure you have the latest source.  Never assume that nothing has
changed elsewhere.

** Allow and Encourage Customer Participation

Most shops seem to attempt to funnel customer participation through
the developers.  This is a cache miss for many operations such as
developing the user manual by a design team external to the
development team.  Basic operations such as commit and update are
fairly simple to grasp and can even be simplified further through
scripts and other such tools that non-developers can quickly be taught
to use.

Of note is the Tortoise family of tools which integrate directly into
Windows Explorer.  This makes it fairly easy for anyone who is
familiar with Windows Explorer to get into using any of the tools that
there is a Tortoise implementation for.

* The Centralized Model

** We Know About This One

This is traditional, plain vanilla, ubiquitous SCM.

The great majority of the SCMSes out there are centralized.

Closely resembles the Client/Server system model.

** Work Flow

<http://whygitisbetterthanx.com/#any-workflow>

*** 2 basic models: 'Lock, Modify, Unlock' and 'Copy, Modify, Merge'.

Older systems were primarily Lock, Modify, Unlock implementations.
You would checkout a file that you intended to work on, and no one
else would be able to check it out until you unlocked it, signaling
that you were done editing it.  This is inherently inefficient as on a
team of developers, the chances that two are working on the exact same
part of a system without knowing it and coordinating are fairly low.
Also, any disparate features that still touch the same files in the
system cannot be worked on simultaneously.

The answer to this is Copy, Modify, Merge.  In this system, every
developer gets a complete copy of the HEAD.  Everyone changes the HEAD
concurrently.  When commits happen, the system attempts to
intelligently merge them.  If it fails (usually doesn't happen unless
there is bad coordination), then it asks you to merge them.  This has
been proven to work well.

** Key Properties

*** Only One Repository

In centralized systems, there is only one global, public repository.
This has certain significant effects, such as an intrinsically global
name space for branches and tags, a restrictive public/private concept
(no such thing as committed but private), need for a backup process
aware of the possibility of in-progress commits, etc.

Since the repository only exists in a single location, the developers
only have copies of a specific revision and any uncommitted changes
they've made to that copy.

*** All Committed Changes Are Public

This includes regular commits (what we'd typically think of commits),
branches, and tags.

As previously mentioned, in centralized systems, all committed changes
are public.  Even if you are working on a private branch (which you
typically wouldn't be because branches are expensive in centralized
systems), the changes you are making are still visible publicly
because your branch exists in the global, public repository.

*** Intrinsically Uses the Network.

Because you must have a single repository that all developers are
accessing, you must use the network for many common operations.
Commits must be made to the central repository, Logs live centrally,
branches live centrally, diffing between revisions is a network
operation, blaming is a network operation, etc.

*** Backup Becomes A Separate Process

Because there is only a single repository, you need a back-up strategy
or else you are exposing yourself to a single point of failure.
Unfortunately, this is not as simple as it sounds.  The global, public
nature of the repository makes the chances of creating a corrupt back
up very high.  Because of this, tools have grown up around and in many
centralized systems that automate the process of backing it up while
remaining aware of the problems that can arise.  However, the point
remains that there is no intrinsic back up of a centralized system.

*** Need A Repository Admin.

Because the system is centralized, you need a repository
administrator.  This is true in most modern centralized systems where
new repositories are created on a per project basis (as in, not VSS).
In other words, when you want a new repository, you need to go through
some sort of admin interface or through the administrator of the
repository server to make it happen.

* The Distributed Model

** This Ones New

At least new as in unfamiliar.  The concept is over a decade old.

There are a few different popular distributed SCMSes (Git, Mercurial
(hg), Bazaar (bzr), Bitkeeper)

Very closely resembles a peer-to-peer network and the organic
relationships that evolve in that space.

In a distributed system, there is no one point where all development
comes together to for any reason other than policy.  Everyone who is
working on a system intrinsically has their own copy of the entire
repository.  All of the history, all of the source code, all of the
public branches, all of the public tags, etc.  Because of this,
developers can also have private branches, private tags, private
commits, private history.  The distinction between public and private
is very important in this context.  This has several distinct features
which I'll go into now.

** Work Flow (Pick Your Poison)

<http://whygitisbetterthanx.com/#any-workflow>

** Key Properties

*** Private/Public Concept

Distributed SCMSes Private/Public ontology is __much__ richer.
Whereas in a central system, private means only what you have yet to
commit or what you are leaving untracked, in a distributed system,
private means anything that you have not yet _chosen_ to make public.
In other words, you can have private branches, private tags, private
committed changes to your copy of the head, etc.  Anything that you do
not specifically publish to a location that others can access is
intrinsically private.

In other words, you can finally SCM your sandbox!  You can commit as
many broken things as you want to a private repository, giving you the
ability to have a nearly infinite set of undoable and recoverable
changes, without breaking anyone else's build.  Or, you can just as
easily ignore TDD, never commit anything for 3 weeks and then do a
big, massive commit and as long as your final product is tested and
merges with the rest of the tree, you're good to go and no one cares.

Because you have a rich ontology for private/public data, you can also
do crazy things like rewriting your local history before anyone else
sees it.  Because your repository is the only one that has to know
about the history as long as you're dealing with private data, this is
a completely safe (although policy debatable) operation.  Of course,
once data has been published, you really shouldn't mess with its
history anymore.

*** Network(less)

In distributed systems, networks are optional for almost every
operation (and indeed, every operation prior to publishing).  Of
course, you could put your repository on a network drive and then
you'd be doing everything over the network like you would in a
centralized system, but if you put your repository clone on your local
system, then everything you do in that repository is local.  Viewing
your history, committing, branching, merging, everything.

Once you've published, however, not much changes.  Almost everything
except updating and publishing (_not_ committing) remains local.
Remember that committing no longer means publicly publishing.  You can
commit many revisions, even to the master HEAD and nothing at all has
been published until you push those changes to your public HEAD.

*** Natural Backup

Because every developer has a copy of the repository, every developer
you add adds an extra failure point.  The more developers you have,
the more backups you have of the repository.

*** Must Learn New Work Flows.

In order to fully experience the advantages of distributed systems,
new work flows must be learned.  In other words, it's possible to use
distributed systems nearly the exact same way as you use a centralized
system (you just need to learn new commands), but you don't get many
of the benefits except the speed improvements.  The real game change
happens when you realize that you can keep things private until their
finished.  Once you realize that, new branching patterns emerge, new
work flows happen, you commit more often, and have the ability to
become much looser and freer in your development process.

*** Impossible To Completely Enforce A Single, Canonical
Representation of the Code Base.

By nature, a distributed system cannot enforce a single canonical
representation of the code base except by policy, and policies can
always be broken.  Also, any intentionally private data is not backed
up because it is not shared.  However, backup becomes much simpler
because you know that no one else is committing to your repository.

This bears some explanation.  Within a distributed system, you can
have a single official release point that everyone has blessed (or the
company has blessed, or the original developer has blessed, or
whatever).  However, you cannot _stop_ someone else from making a
release point because their repository is just as valid as yours.  You
cannot _stop_ developers from sharing code between themselves without
going out to the official central location.  All you can do is ask
them not to.

* Why Git is the Best Choice

** Fast

Git's implementation just happens to be wickedly fast.  It's faster
than mercurial, it's faster than bazaar, etc.  Everything, committing,
merging, viewing history, branching, and even updating and and pushing
are all faster.

** Tracks Content, not Files

Git tracks content, not files, and it's the only SCMS at the moment
that does this.  This has many effects internally, but the most
apparent effect I know of is that for the first time Git can easily
tell you the history of even a function in a file because Git can tell
you which files that function existed (or does exist) in over the
course of development.

** Extremely Efficient.

Because Git tracks content, it can also be extremely efficient
spacewise, simplifying the files to be nothing but pointers to a set
of objects in Git's internal file system.  Thus, if you have
duplicated hunks, git uses a single object to represent them.  Git has
been proven to be more efficient space wise than any other system out
there.

** (Un)Staged Changes

Git employs the concept of the Index or Cache or Commit Stage.  This
is also unique to Git, and it's pretty strange for developers coming
from a system without it.

Basically, There are 4 states that any content can be in under Git.

1. Untracked: This is content that Git is completely unaware of.
2. Tracked but Unstaged: This is content that has changed that Git is
aware of but will not commit on the next commit command.
3. Tracked and Staged: This is the same as unstaged except that this
content will be committed on the next commit.
4. Tracked and Committed:  This is content that has not changed since
the previous commit that Git is aware of.

This is very powerful yet somewhat awkward to grasp.  Basically, the
upshot of this feature is that you can manually build commits if you
want to.  Say you were working on feature foo and then made some other
changes because you came across feature bar and thought it would be
quick to do.  In any other system, the only way you could commit parts
of what you'd changed is if you were lucky enough for the disparate
changes to be in different files.  In that case, you could commit only
the files that you wanted to change for the different features.
However, if you made disparate changes to the same file, you were
stuck.  In Git, you can stage only parts of the files to an extreme
degree.  This allows you to create as many commits as you want out of
a single change set until the whole change set is committed.

I've found this to be particularly useful when working with an
existing code base that was not properly formatted.  Often, I'll come
to a file that has a bunch of wonky white space choices and improperly
indented logical constructs and I'll just quickly run through it
correcting that stuff before continuing with the feature I was working
on.  Afterwords, I'll stage the formatting and commit it, and then
stage the feature I was working on and commit that.  You may not want
that kind of control (and if you don't, you don't need to use it), but
I like it.

** Excellent Merge algorithms

Git has excellent merge algorithms.  This is widely attributed and
doesn't require much explanation.  It was one of Git's original design
goals, and it has been proven by Git's implementation.  Merging in Git
is _much_ less painful than in other systems.

** Has powerful 'maintainer tools'

Beyond the basics of committing, updating, pushing, viewing logs, etc.
Git is known to have very powerful tools maintainer level tools.  You
can modify your history, you can automatically perform binary searches
to locate errors, you can communicate via patches, it's highly
customizable, has the concept of submodules (projects within
projects), etc.  It gets complicated, but at this level of SCM it is
complicated.

** Cryptographically Guarantees Content

One of the most surprising things I learned as I was researching this
was that most SCMSes do not guarantee that your content does not get
corrupted.  In other words, if the repository's disk doesn't fail but
instead just gets corrupted, you'll never know unless you actually
notice the corruption in the files.  If you have memory corruption
locally and commit your changes, you just won't know.

Git guarantees absolutely that if corruption happens, you will know
about it.  It does this by creating SHA-1 hashes of your content and
then checking to make sure that the SHA-1 hash does not change for an
object.  The details of this aren't as important as the fact that Git
is one of the very few systems that do this and it's obviously
desirable.

* References

- <http://git-scm.com/> - The Git homepage
- <http://whygitisbetterthanx.com/> - An excellent resource explaining
the benefits of using git in relation to other common SCMSes
- <http://www.youtube.com/watch?v=4XpnKHJAok8> - Linus Torvalds's
Google Talk on Git.  Covers mainly what Git is Not and Why
Distribution is the model that works.
- <http://www.youtube.com/watch?v=8dhZ9BXQgc4> - Randal Schwartz's
Google Talk on Git.  Covers what Git is, including some implementation
details, some use case scenarios, and the like.
- <http://book.git-scm.com/> - A community written book on Git with
video tutorials about many of Git's features.
- <http://subversion.tigris.org/> - Subversion's homepage.  An
extremely popular open source centralized system.
- <http://svnbook.red-bean.com/> - Rolling publish book on Subversion.
 Chapter 1 is a good introduction to general centralized SCM concepts
and principles.
- <http://www.perforce.com/perforce/bestpractices.html> - An excellent
set of best practices from the Perforce team.  Some of it (especially
the branches) has a distinct centralized lean, but most of it is quite
good.
- <http://www.bobev.com/PresentationsAndPapers/Common%20SCM%20Patterns.pdf>
- Interesting presentation by Pretzel Logic from 2001 attempting to
outline some common SCM best practices as Patterns.

---------------
End Notes

-- 

In Christ,

Timmy V.

http://burningones.com/
http://five.sentenc.es/ - Spend less time on e-mail

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Øyvind Harboe @ 2009-01-06 22:36 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <alpine.DEB.1.00.0901062319070.30769@pacific.mpi-cbg.de>

On Tue, Jan 6, 2009 at 11:20 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Tue, 6 Jan 2009, Øyvind Harboe wrote:
>
>> Q1: How can I figure out what it is in .git that takes so much space?
>
> If it is a pack that is taking so much space:

it is.

>
> $ git verify-pack -v $PACK | grep -v "^chain " | sort -n -k 4

I have never used the git verify-pack command, but I'm pretty sure the
"Terminated" string isn't the normal output :-)

$ git verify-pack -v
.git/objects/pack/pack-1e039b82d8ae53ef5ec3614a3021466663cc70a4
Terminated

This is running git version 1.6.1. on CentOS on a virtual machine. I'm not quite
sure how to debug this. I'm sure I've done something wrong when I installed git.
I'm just a humble user of git trying to convert from cvs/svn.


> and then for the last few lines do a
>
> $ git rev-list --all --objects | grep $SHA1

I was able to run this procedure on a different machine than the
server and I can
then tell which objects take up all the space.

However, I'm unnerved by git verify-pack "Terminated"'ing on me above
and I'll have
to sort that out before I can think about using git in production.


Thanks for the pointers though! They definitely answered my questions!

>
> Hth,
> Dscho
>



-- 
Øyvind Harboe
http://www.zylin.com/zy1000.html
ARM7 ARM9 XScale Cortex
JTAG debugger and flash programmer

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Øyvind Harboe @ 2009-01-06 22:41 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <alpine.LFD.2.00.0901061709510.26118@xanadu.home>

>> 3. I tried "git reflog expire --all" + lots of other tricks in the
>> link below, but no luck.
>
> OK, try this:
>
>        cd ..
>        mv my_repo my_repo.orig
>        mkdir my_repo
>        cd my_repo
>        git init
>        git pull file://$(pwd)/../my_repo.orig
>
> This is the easiest way to ensure you have only the necessary objects in
> the new repo, without all the extra stuff tied to reflogs, etc.

Super!

That worked!

> Then, if your repo is still seemingly too big, you can get a bit dirty
> with the sequence Johannes just posted.

Johannes procedure had the unexpected side effect of showing that
my server setup is flaky somehow though... :-) I'll need his
tricks for other situations soon enough.


-- 
Øyvind Harboe
http://www.zylin.com/zy1000.html
ARM7 ARM9 XScale Cortex
JTAG debugger and flash programmer

^ permalink raw reply

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: R. Tyler Ballance @ 2009-01-06 22:52 UTC (permalink / raw)
  To: Jan Krüger; +Cc: Git ML
In-Reply-To: <20081209093627.77039a1f@perceptron>

[-- Attachment #1: Type: text/plain, Size: 3858 bytes --]

On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> For fixing a corrupted repository by using backup copies of individual
> files, allow write_sha1_file() to write loose files even if the object
> already exists in a pack file, but only if the existing entry is marked
> as corrupted.

I figured I'd reply to this again, since the issue cropped up again.

We started experiencing *large* numbers of corruptions like the ones
that started the thread (one developer was receiving them once or twice
a day) with v1.6.0.4

We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
(below) and the issues /seem/ to have resolved themselves. I'm not
certain whether Jan's patch was really responsible, or if there was
another issue that caused this to correct itself in v1.6.1. 

As it stands, I think it's safe to assume that given the frequency of
the occurances that they were not tied to a memory or disk error (or
other levels of the machine's stack would be suffering as well). The
only thing I can think of is that /some/ developers who've experienced
the issue are using Samba mount points and changing files in Mac OS X,
but using Git on the mounted share (i.e. TextMate changes a file hosted
on Samba, changes are committed in an SSH session on that machine), but
that doesn't account for everything.

If there was something else included in the v1.6.1 release please let me
know so I can back Jan's patch out.


Cheers


> 
> Signed-off-by: Jan Krüger <jk@jk.gs>
> ---
> 
> On IRC I talked to rtyler who had a corrupted pack file and plenty of
> object backups by way of cloned repositories. We decided to try
> extracting the corrupted objects from the other object database and
> injecting them into the broken repo as loose objects, but this failed
> because sha1_write_file() refuses to write loose objects that are
> already present in a pack file.
> 
> This patch expands the check to see if the pack entry has been marked
> as corrupted and, if so, allows writing a loose object with the same
> ID. Unfortunately, when Tyler tried a merge while using this patch,
> something we didn't manage to track down happened and now git doesn't
> consider the object corrupted anymore. I'm not sure enough that it
> wasn't caused by the patch to submit this patch without hesitation.
> 
> Apart from that, I think the change is not all too great since it makes
> write_sha1_file() walk the list of pack entries twice. That's a bit of
> a waste.
> 
> So those are the reasons why I wanted a few opinions first. Another
> reason is that there might be a way smarter method to fix this kind of
> problem, in which case I'd love hearing about it for future reference.
> 
>  sha1_file.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/sha1_file.c b/sha1_file.c
> index 6c0e251..17085cc 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2373,14 +2373,17 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
>  	char hdr[32];
>  	int hdrlen;
>  
> -	/* Normally if we have it in the pack then we do not bother writing
> -	 * it out into .git/objects/??/?{38} file.
> -	 */
>  	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
>  	if (returnsha1)
>  		hashcpy(returnsha1, sha1);
> -	if (has_sha1_file(sha1))
> -		return 0;
> +	/* Normally if we have it in the pack then we do not bother writing
> +	 * it out into .git/objects/??/?{38} file. We do, though, if there
> +	 * is no chance that we have an uncorrupted version of the object.
> +	 */
> +	if (has_sha1_file(sha1)) {
> +		if (has_loose_object(sha1) || !has_packed_and_bad(sha1))
> +			return 0;
> +	}
>  	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
>  }
>  
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: Announce: TortoiseGit 0.1 preview version
From: Jakub Narebski @ 2009-01-06 23:05 UTC (permalink / raw)
  To: Frank Li; +Cc: git
In-Reply-To: <1976ea660901060645r641d73e6ob4e03747f1860b6a@mail.gmail.com>

On Tue, 6 Jan 2009, Frank Li wrote:

> TortoiseGit 0.2 released.

Nice. Thanks a lot for your work.

> Can you help update Git Wiki page?

Well, my stupid ISP blocks git.or.cz (and gimp.org), supposedly in
malformed attempt to reduce SPAM using MAPS (Mail Abuse Prevention
System; SPAM backwards) lists and null routing. So I have to use
proxy to edit wiki.

> It seem anyone can update Wiki page. Is it true?

It is the nature of Wiki that anyone can edit Wiki pages. It is
recommended but not necessary to register; please do not forget
to put comment ("commit message") as well.

> 
> Summary (feature matrix)
[...]

I am not involved with GUI feature Matrix; I only added simple info
about TortoiseGit. I'm not sure who is...

> > I have added information about TortoiseGit to git wiki at

-- 
Jakub Narebski
Poland

^ permalink raw reply

* Re: [PATCH/RFC v2 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()'
From: Junio C Hamano @ 2009-01-06 23:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kjetil Barvik, git
In-Reply-To: <alpine.LFD.2.00.0901061304280.3057@localhost.localdomain>

Linus Torvalds <torvalds@linux-foundation.org> writes:

>> ...  The previously used call:
>> 
>>    has_symlink_leading_path(len, name);
>> 
>> should be identically with the following call to lstat_cache():
>> 
>>    lstat_cache(len, name,
>>                LSTAT_SYMLINK|LSTAT_DIR,
>>                LSTAT_SYMLINK);
>
> I think the new interface looks worse.
>
> Why don't you just do a new inline function that says
>
> 	static inline int has_symlink_leading_path(int len, const char *name)
> 	{
> 		return lstat_cache(len, name,
> 			LSTAT_SYMLINK|LSTAT_DIR,
> 			LSTAT_SYMLINK);
> 	}
>
> and now you don't need this big patch, and people who don't care about 
> those magic flags don't need to have them. End result: more readable code.

Excellent.

Not that I did not think a backward compatible macro is much easier to
read; after all, ce/ie you mention is a refactorizaton I did myself.

What I didn't think of was that posing the above question is a much better
way to extract a clear explanation why some of these lstat_cache() calls
have LSTAT_NOENT and some of them don't from the author.  It is much
better way than my earlier attempt to do so.

> This is how git has done pretty much all "generalized" versions. See the 
> whole ce_modified() vs ie_modified() thing: they're the same function, 
> it's just that 'ce_modified()' is the traditional simpler interface that 
> works on the default index, while ie_modified() is the "full" version that 
> takes all the details that most uses don't even want to know about.

Yup, thanks for a praise ;-)

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Stephen R. van den Berg @ 2009-01-06 23:17 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: ?yvind Harboe, git
In-Reply-To: <alpine.LFD.2.00.0901061709510.26118@xanadu.home>

Nicolas Pitre wrote:
>On Tue, 6 Jan 2009, ?yvind Harboe wrote:
>OK, try this:

>	git pull file://$(pwd)/../my_repo.orig

Alternately, try:

rm -rf .git/ORIG_HEAD .git/FETCH_HEAD .git/index .git/logs .git/info/refs \
  .git/objects/pack/pack-*.keep .git/refs/original .git/refs/patches \
  .git/patches .git/gitk.cache &&
 git prune --expire now &&
 git repack -a -d --window=200 &&
 git gc

-- 
Sincerely,
           Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox