[RFC Patch] Preventing corrupt objects from entering the repository

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC Patch] Preventing corrupt objects from entering the repository
@ 2008-02-10 17:58 Martin Koegler
  2008-02-11  0:00 ` Junio C Hamano
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-10 17:58 UTC (permalink / raw)
  To: git

Some weeks ago, I ask myself, what unexpected things could be
done by user via git-shell/git-daemon. I created logical corrupt 
objects (such can be uploaded via receive pack without any problem)
and got git to segfaults very quickly.

In most cases they are cause by one of the two things:

* lookup/parse_XXX returns a NULL pointer, which is not handled 
  in many cases.
* The return values of functions are not checked.

Fixing all these things would be a lot of work and maybe even slow
down git. Even if all this things would be fixed, a corrupt 
object in the history would mean, that git-upload-pack will die 
with an error message and that a admin must fix the repository.

So I tried to fix the problem in another way: Instead of coping with 
corrupt objects, stop them from entering the repository.

These patch series does the following things:
* add a type aware object chain walker for all child objects (patch 2)

  I have not found a similar function in the git core, so I wrote
  one. I'm open for better alternatives.

* move fsck to object chain walker (and get rid of track objects) (patch 3,4,5,12)

  I'm not sure, if this is advocated by others. At least it fixes 
  some crashes in the case of NULL pointers.

* Factor out reuseable code from builtin-fsck.c to fsck.c (patch 6+7)
* add --strict option to unpack-objects (patch 1,8,9)

  The --strict option changes writing order slightly. All non blob objects
  are not written out immediatly. After the last object of the packfile
  is read, unpack-objects starts looking for non written objects. If the object
  is not corrupt and all its links are available, the object is written.

  This way only dangling objects can be created. 

  On shallow repositories, this check can fail, if the list of shallow commits
  is changed. For the intended use in receive-pack, this is not a problem, as
  it does not work on such repositories.

* add --strict option to index-pack (patch 8,10)

  Same as for unpack-objects, but without writting objects.

* add config option for receive pack to enable checking (patch 11)

These patch serie is a first prototype to get some feedback.

mfg Martin Kögler

Patch 1: unpack-object: cache for non written objects
Patch 2: add generic, type aware object chain walker
Patch 3: fsck: move mark-reachable to fsck_walk
Patch 4: fsck: move reachable object check to fsck_walk
Patch 5: fsck: disable track objects
Patch 6: create a common object checker code out of fsck
Patch 7: fsck: use common object checker
Patch 8: add common fsck error printing function
Patch 9: unpack-objects: prevent writing of inconsistent objects
Patch 10: index-pack: introduce checking mode
Patch 11: receive-pack: use strict mode for unpacking objects
Patch 12: Remove unused object-ref code


 Documentation/config.txt             |    6 +
 Documentation/git-index-pack.txt     |    3 +
 Documentation/git-unpack-objects.txt |    3 +
 Makefile                             |    6 +-
 builtin-fetch-pack.c                 |    1 -
 builtin-fsck.c                       |  329 +++++++++-------------------------
 builtin-pack-objects.c               |    1 -
 builtin-rev-list.c                   |    1 -
 builtin-unpack-objects.c             |  131 +++++++++++++-
 commit.c                             |   11 --
 fsck.c                               |  314 ++++++++++++++++++++++++++++++++
 fsck.h                               |   16 ++
 index-pack.c                         |   78 ++++++++-
 object-refs.c                        |   87 ---------
 object.h                             |    8 -
 receive-pack.c                       |   36 +++--
 tag.c                                |    6 -
 tree.c                               |   48 -----
 upload-pack.c                        |    1 -
 walker.c                             |    1 -
 20 files changed, 653 insertions(+), 434 deletions(-)

------------------------------------------------------------------------------

From 76e86fe55345e633c910d6b8fe166e27c23c5aaf Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 08:51:38 +0100
Subject: [PATCH 01/12] unpack-object: cache for non written objects

Preventing objects with broken links entering the repository
means, that write of some objects must be delayed.

This patch adds a cache to keep the object data in memory. The delta
resolving code must also search in the cache.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 builtin-unpack-objects.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index 1e51865..f18c7e8 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -8,6 +8,7 @@
 #include "tag.h"
 #include "tree.h"
 #include "progress.h"
+#include "decorate.h"
 
 static int dry_run, quiet, recover, has_errors;
 static const char unpack_usage[] = "git-unpack-objects [-n] [-q] [-r] < pack-file";
@@ -18,6 +19,28 @@ static unsigned int offset, len;
 static off_t consumed_bytes;
 static SHA_CTX ctx;
 
+struct obj_buffer {
+	char *buffer;
+	unsigned long size;
+};
+
+static struct decoration obj_decorate;
+
+static struct obj_buffer *lookup_object_buffer(struct object *base)
+{
+	return lookup_decoration(&obj_decorate, base);
+}
+
+static void add_object_buffer(struct object *object, char *buffer, unsigned long size)
+{
+	struct obj_buffer *obj;
+	obj = xcalloc(1, sizeof(struct obj_buffer));
+	obj->buffer = buffer;
+	obj->size = size;
+	if (add_decoration(&obj_decorate, object, obj))
+		die("object %s tried to add buffer twice!", sha1_to_hex(object->sha1));
+}
+
 /*
  * Make sure at least "min" bytes are available in the buffer, and
  * return the pointer to the buffer.
@@ -252,6 +275,15 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
 		}
 	}
 
+	struct object* obj = lookup_object(base_sha1);
+	if (obj) {
+		struct obj_buffer *obj_buf = lookup_object_buffer (obj);
+		if (obj_buf) {
+			resolve_delta(nr, obj->type, obj_buf->buffer, obj_buf->size, delta_data, delta_size);
+			return;
+		}
+	}
+
 	base = read_sha1_file(base_sha1, &type, &base_size);
 	if (!base) {
 		error("failed to read delta-pack base object %s",
-- 
1.5.4.g074be


From 9fb732d8939ea2108440a0ab468aab15c648dde5 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:01:01 +0100
Subject: [PATCH 02/12] add generic, type aware object chain walker

The requirements are:
* it may not crash on NULL pointers
* a callback function is needed, as index-pack/unpack-objects
  need to do different things
* the type information is needed to check the expected <-> real type
  and print better error messages

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 Makefile |    4 +-
 fsck.c   |   84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fsck.h   |   10 +++++++
 3 files changed, 96 insertions(+), 2 deletions(-)
 create mode 100644 fsck.c
 create mode 100644 fsck.h

diff --git a/Makefile b/Makefile
index 728cfc2..0921484 100644
--- a/Makefile
+++ b/Makefile
@@ -301,7 +301,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h parse-options.h transport.h diffcore.h hash.h
+	mailmap.h remote.h parse-options.h transport.h diffcore.h hash.h fsck.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -324,7 +324,7 @@ LIB_OBJS = \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
 	color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o \
 	convert.o attr.o decorate.o progress.o mailmap.o symlinks.o remote.o \
-	transport.o bundle.o walker.o parse-options.o ws.o archive.o
+	transport.o bundle.o walker.o parse-options.o ws.o archive.o fsck.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/fsck.c b/fsck.c
new file mode 100644
index 0000000..089f775
--- /dev/null
+++ b/fsck.c
@@ -0,0 +1,84 @@
+#include "cache.h"
+#include "object.h"
+#include "blob.h"
+#include "tree.h"
+#include "tree-walk.h"
+#include "commit.h"
+#include "tag.h"
+#include "fsck.h"
+
+static int fsck_walk_tree(struct tree* tree, fsck_walk_func walk, void* data)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if(parse_tree(tree))
+		return -1;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while(tree_entry(&desc, &entry)) {
+		int result;
+		
+		if (S_ISGITLINK(entry.mode))
+			continue;
+		if (S_ISDIR(entry.mode))
+			result = walk(&lookup_tree(entry.sha1)->object, OBJ_TREE, data);
+		else if (S_ISREG(entry.mode) || S_ISLNK(entry.mode))
+			result = walk(&lookup_blob(entry.sha1)->object, OBJ_BLOB, data);
+		else {
+			error("in tree %s: entry %s has bad mode %.6o\n",
+			      sha1_to_hex(tree->object.sha1), entry.path, entry.mode);
+			result = -1;
+		}
+		if (result)
+			return result;
+	}
+	return 0;
+}
+
+static int fsck_walk_commit(struct commit* commit, fsck_walk_func walk, void* data)
+{
+	struct commit_list *parents = commit->parents;
+	int result;
+
+	if(parse_commit(commit))
+		return -1;
+
+	result = walk((struct object*)commit->tree, OBJ_TREE, data);
+	if (result)
+		return result;
+
+	while (parents) {
+		result = walk((struct object*)parents->item, OBJ_COMMIT, data);
+		if (result)
+			return result;
+		parents = parents->next;
+	}
+	return 0;
+}
+
+static int fsck_walk_tag(struct tag* tag, fsck_walk_func walk, void* data)
+{
+	if(parse_tag(tag))
+		return -1;
+	return walk(tag->tagged, OBJ_ANY, data);
+}
+
+int fsck_walk(struct object* obj, fsck_walk_func walk, void* data)
+{
+	if (!obj)
+		return -1;
+	switch(obj->type) {
+	case OBJ_BLOB:
+		return 0;
+	case OBJ_TREE:
+		return fsck_walk_tree((struct tree*)obj, walk, data);
+	case OBJ_COMMIT:
+		return fsck_walk_commit((struct commit*)obj, walk, data);
+	case OBJ_TAG:
+		return fsck_walk_tag((struct tag*)obj, walk, data);
+	default:
+		error("Unknown object type for %s", sha1_to_hex(obj->sha1));
+		return -1;
+	}
+}
diff --git a/fsck.h b/fsck.h
new file mode 100644
index 0000000..caeb2c9
--- /dev/null
+++ b/fsck.h
@@ -0,0 +1,10 @@
+#ifndef GIT_FSCK_H
+#define GIT_FSCK_H
+
+#define OBJ_ANY OBJ_BAD
+
+typedef int (*fsck_walk_func)(struct object* obj, int type, void* data);
+
+int fsck_walk(struct object* obj, fsck_walk_func walk, void* data);
+
+#endif
-- 
1.5.4.g074be


From 80b22c3f2c3e13c207790a49646020c55b34bba7 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:01:50 +0100
Subject: [PATCH 03/12] fsck: move mark-reachable to fsck_walk

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 builtin-fsck.c |   34 ++++++++++++++++++++++++----------
 1 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/builtin-fsck.c b/builtin-fsck.c
index cc7524b..49e96ff 100644
--- a/builtin-fsck.c
+++ b/builtin-fsck.c
@@ -8,6 +8,7 @@
 #include "pack.h"
 #include "cache-tree.h"
 #include "tree-walk.h"
+#include "fsck.h"
 #include "parse-options.h"
 
 #define REACHABLE 0x0001
@@ -35,6 +36,23 @@ static int verbose;
 #define DIRENT_SORT_HINT(de) ((de)->d_ino)
 #endif
 
+static int mark_object(struct object* obj, int type, void* data)
+{
+	if (!obj)
+		return 0;
+	if (obj->flags & REACHABLE)
+		return 0;
+	obj->flags |= REACHABLE;
+	if (!obj->parsed)
+		return 0;
+	return fsck_walk(obj, mark_object, data);
+}
+
+static void mark_object_reachable(struct object* obj)
+{
+	mark_object(obj, 0, 0);
+}
+
 static void objreport(struct object *obj, const char *severity,
                       const char *err, va_list params)
 {
@@ -326,8 +344,6 @@ static int fsck_tree(struct tree *item)
 		o_name = name;
 		o_sha1 = sha1;
 	}
-	free(item->buffer);
-	item->buffer = NULL;
 
 	retval = 0;
 	if (has_full_path) {
@@ -375,8 +391,6 @@ static int fsck_commit(struct commit *commit)
 	}
 	if (memcmp(buffer, "author ", 7))
 		return objerror(&commit->object, "invalid format - expected 'author' line");
-	free(commit->buffer);
-	commit->buffer = NULL;
 	if (!commit->tree)
 		return objerror(&commit->object, "could not load commit's tree %s", tree_sha1);
 	if (!commit->parents && show_root)
@@ -538,13 +552,13 @@ static int fsck_handle_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 		obj = lookup_object(osha1);
 		if (obj) {
 			obj->used = 1;
-			mark_reachable(obj, REACHABLE);
+			mark_object_reachable(obj);
 		}
 	}
 	obj = lookup_object(nsha1);
 	if (obj) {
 		obj->used = 1;
-		mark_reachable(obj, REACHABLE);
+		mark_object_reachable(obj);
 	}
 	return 0;
 }
@@ -574,7 +588,7 @@ static int fsck_handle_ref(const char *refname, const unsigned char *sha1, int f
 		error("%s: not a commit", refname);
 	default_refs++;
 	obj->used = 1;
-	mark_reachable(obj, REACHABLE);
+	mark_object_reachable(obj);
 
 	return 0;
 }
@@ -660,7 +674,7 @@ static int fsck_cache_tree(struct cache_tree *it)
 			      sha1_to_hex(it->sha1));
 			return 1;
 		}
-		mark_reachable(obj, REACHABLE);
+		mark_object_reachable(obj);
 		obj->used = 1;
 		if (obj->type != OBJ_TREE)
 			err |= objerror(obj, "non-tree in cache-tree");
@@ -741,7 +755,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 				continue;
 
 			obj->used = 1;
-			mark_reachable(obj, REACHABLE);
+			mark_object_reachable(obj);
 			heads++;
 			continue;
 		}
@@ -773,7 +787,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 				continue;
 			obj = &blob->object;
 			obj->used = 1;
-			mark_reachable(obj, REACHABLE);
+			mark_object_reachable(obj);
 		}
 		if (active_cache_tree)
 			fsck_cache_tree(active_cache_tree);
-- 
1.5.4.g074be


From ce43251ef71962ff64fe138f1295c405ef6aaf65 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:04:08 +0100
Subject: [PATCH 04/12] fsck: move reachable object check to fsck_walk

It handles NULL pointers in object references without crashing.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 builtin-fsck.c |   49 +++++++++++++++++++++++++++++--------------------
 1 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/builtin-fsck.c b/builtin-fsck.c
index 49e96ff..2c1e10f 100644
--- a/builtin-fsck.c
+++ b/builtin-fsck.c
@@ -81,13 +81,39 @@ static int objwarning(struct object *obj, const char *err, ...)
 	return -1;
 }
 
+static int check_reachable_object_childs(struct object *obj, int type, void *data)
+{
+	struct object *parent = data;
+	if (!obj) {
+		printf("broken link from %7s %s\n",
+			   typename(parent->type), sha1_to_hex(parent->sha1));
+		printf("broken link from %7s %s\n",
+			   (type==OBJ_ANY?"unknown":typename(type)), "unknown");
+		errors_found |= ERROR_REACHABLE;
+		return 0;
+	}
+
+	if (type != OBJ_ANY && obj->type != type) {
+		objerror(parent, "wrong object type in link");
+		return 0;
+	}
+
+	if (obj->parsed || (has_sha1_file(obj->sha1)))
+		return 0;
+				
+	printf("broken link from %7s %s\n",
+	       typename(parent->type), sha1_to_hex(parent->sha1));
+	printf("              to %7s %s\n",
+	       typename(obj->type), sha1_to_hex(obj->sha1));
+	errors_found |= ERROR_REACHABLE;
+	return 0;
+}
+
 /*
  * Check a single reachable object
  */
 static void check_reachable_object(struct object *obj)
 {
-	const struct object_refs *refs;
-
 	/*
 	 * We obviously want the object to be parsed,
 	 * except if it was in a pack-file and we didn't
@@ -101,24 +127,7 @@ static void check_reachable_object(struct object *obj)
 		return;
 	}
 
-	/*
-	 * Check that everything that we try to reference is also good.
-	 */
-	refs = lookup_object_refs(obj);
-	if (refs) {
-		unsigned j;
-		for (j = 0; j < refs->count; j++) {
-			struct object *ref = refs->ref[j];
-			if (ref->parsed ||
-			    (has_sha1_file(ref->sha1)))
-				continue;
-			printf("broken link from %7s %s\n",
-			       typename(obj->type), sha1_to_hex(obj->sha1));
-			printf("              to %7s %s\n",
-			       typename(ref->type), sha1_to_hex(ref->sha1));
-			errors_found |= ERROR_REACHABLE;
-		}
-	}
+	fsck_walk(obj, check_reachable_object_childs, obj);
 }
 
 /*
-- 
1.5.4.g074be


From ba836cfe0202f97bd5901e1155abfd22b698020c Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:06:02 +0100
Subject: [PATCH 05/12] fsck: disable track objects

Mark the usage of object with fsck_walk. As this is the last user
of object_refs, it is disabled.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 builtin-fsck.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/builtin-fsck.c b/builtin-fsck.c
index 2c1e10f..b0ca3ad 100644
--- a/builtin-fsck.c
+++ b/builtin-fsck.c
@@ -53,6 +53,14 @@ static void mark_object_reachable(struct object* obj)
 	mark_object(obj, 0, 0);
 }
 
+static int mark_used(struct object* obj, int type, void* data)
+{
+	if (!obj)
+		return 0;
+	obj->used = 1;
+	return 0;
+}
+
 static void objreport(struct object *obj, const char *severity,
                       const char *err, va_list params)
 {
@@ -437,6 +445,7 @@ static int fsck_sha1(const unsigned char *sha1)
 	if (obj->flags & SEEN)
 		return 0;
 	obj->flags |= SEEN;
+	fsck_walk (obj, mark_used, 0);
 	if (obj->type == OBJ_BLOB)
 		return 0;
 	if (obj->type == OBJ_TREE)
@@ -716,7 +725,6 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 {
 	int i, heads;
 
-	track_object_refs = 1;
 	errors_found = 0;
 
 	argc = parse_options(argc, argv, fsck_opts, fsck_usage, 0);
-- 
1.5.4.g074be


From ee11f771be1ef1c29725cb56ab3eb8dfe61ca25a Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:07:33 +0100
Subject: [PATCH 06/12] create a common object checker code out of fsck

The function provides a callback for reporting errors.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 fsck.c |  200 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fsck.h |    4 +
 2 files changed, 204 insertions(+), 0 deletions(-)

diff --git a/fsck.c b/fsck.c
index 089f775..ac253ae 100644
--- a/fsck.c
+++ b/fsck.c
@@ -82,3 +82,203 @@ int fsck_walk(struct object* obj, fsck_walk_func walk, void* data)
 		return -1;
 	}
 }
+
+/*
+ * The entries in a tree are ordered in the _path_ order,
+ * which means that a directory entry is ordered by adding
+ * a slash to the end of it.
+ *
+ * So a directory called "a" is ordered _after_ a file
+ * called "a.c", because "a/" sorts after "a.c".
+ */
+#define TREE_UNORDERED (-1)
+#define TREE_HAS_DUPS  (-2)
+
+static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, const char *name2)
+{
+	int len1 = strlen(name1);
+	int len2 = strlen(name2);
+	int len = len1 < len2 ? len1 : len2;
+	unsigned char c1, c2;
+	int cmp;
+
+	cmp = memcmp(name1, name2, len);
+	if (cmp < 0)
+		return 0;
+	if (cmp > 0)
+		return TREE_UNORDERED;
+
+	/*
+	 * Ok, the first <len> characters are the same.
+	 * Now we need to order the next one, but turn
+	 * a '\0' into a '/' for a directory entry.
+	 */
+	c1 = name1[len];
+	c2 = name2[len];
+	if (!c1 && !c2)
+		/*
+		 * git-write-tree used to write out a nonsense tree that has
+		 * entries with the same name, one blob and one tree.  Make
+		 * sure we do not have duplicate entries.
+		 */
+		return TREE_HAS_DUPS;
+	if (!c1 && S_ISDIR(mode1))
+		c1 = '/';
+	if (!c2 && S_ISDIR(mode2))
+		c2 = '/';
+	return c1 < c2 ? 0 : TREE_UNORDERED;
+}
+
+
+static int fsck_tree(struct tree *item, int strict, fsck_error error_func)
+{
+	int retval;
+	int has_full_path = 0;
+	int has_empty_name = 0;
+	int has_zero_pad = 0;
+	int has_bad_modes = 0;
+	int has_dup_entries = 0;
+	int not_properly_sorted = 0;
+	struct tree_desc desc;
+	unsigned o_mode;
+	const char *o_name;
+	const unsigned char *o_sha1;
+
+	init_tree_desc(&desc, item->buffer, item->size);
+
+	o_mode = 0;
+	o_name = NULL;
+	o_sha1 = NULL;
+	if (!desc.size)
+		return error_func (&item->object, FSCK_ERROR, "empty tree");
+
+	while (desc.size) {
+		unsigned mode;
+		const char *name;
+		const unsigned char *sha1;
+
+		sha1 = tree_entry_extract(&desc, &name, &mode);
+
+		if (strchr(name, '/'))
+			has_full_path = 1;
+		if (!*name)
+			has_empty_name = 1;
+		has_zero_pad |= *(char *)desc.buffer == '0';
+		update_tree_entry(&desc);
+
+		switch (mode) {
+		/*
+		 * Standard modes..
+		 */
+		case S_IFREG | 0755:
+		case S_IFREG | 0644:
+		case S_IFLNK:
+		case S_IFDIR:
+		case S_IFGITLINK:
+			break;
+		/*
+		 * This is nonstandard, but we had a few of these
+		 * early on when we honored the full set of mode
+		 * bits..
+		 */
+		case S_IFREG | 0664:
+			if (!strict)
+				break;
+		default:
+			has_bad_modes = 1;
+		}
+
+		if (o_name) {
+			switch (verify_ordered(o_mode, o_name, mode, name)) {
+			case TREE_UNORDERED:
+				not_properly_sorted = 1;
+				break;
+			case TREE_HAS_DUPS:
+				has_dup_entries = 1;
+				break;
+			default:
+				break;
+			}
+		}
+
+		o_mode = mode;
+		o_name = name;
+		o_sha1 = sha1;
+	}
+
+	retval = 0;
+	if (has_full_path) {
+		retval += error_func(&item->object, FSCK_WARN, "contains full pathnames");
+	}
+	if (has_empty_name) {
+		retval += error_func(&item->object, FSCK_WARN, "contains empty pathname");
+	}
+	if (has_zero_pad) {
+		retval += error_func(&item->object, FSCK_WARN, "contains zero-padded file modes");
+	}
+	if (has_bad_modes) {
+		retval += error_func(&item->object, FSCK_WARN, "contains bad file modes");
+	}
+	if (has_dup_entries) {
+		retval += error_func(&item->object, FSCK_ERROR, "contains duplicate file entries");
+	}
+	if (not_properly_sorted) {
+		retval += error_func(&item->object, FSCK_ERROR, "not properly sorted");
+	}
+	return retval;
+}
+
+static int fsck_commit(struct commit *commit, fsck_error error_func)
+{
+	char *buffer = commit->buffer;
+	unsigned char tree_sha1[20], sha1[20];
+
+	if (!commit->date)
+		return error_func(&commit->object, FSCK_ERROR, "invalid author/committer line in %s",
+				  sha1_to_hex(commit->object.sha1));
+
+	if (memcmp(buffer, "tree ", 5))
+		return error_func(&commit->object, FSCK_ERROR, "invalid format - expected 'tree' line");
+	if (get_sha1_hex(buffer+5, tree_sha1) || buffer[45] != '\n')
+		return error_func(&commit->object, FSCK_ERROR, "invalid 'tree' line format - bad sha1");
+	buffer += 46;
+	while (!memcmp(buffer, "parent ", 7)) {
+		if (get_sha1_hex(buffer+7, sha1) || buffer[47] != '\n')
+			return error_func(&commit->object, FSCK_ERROR, "invalid 'parent' line format - bad sha1");
+		buffer += 48;
+	}
+	if (memcmp(buffer, "author ", 7))
+		return error_func(&commit->object, FSCK_ERROR, "invalid format - expected 'author' line");
+	if (!commit->tree)
+		return error_func(&commit->object, FSCK_ERROR, "could not load commit's tree %s", sha1_to_hex(tree_sha1));
+
+	return 0;
+}
+
+static int fsck_tag(struct tag *tag, fsck_error error_func)
+{
+	struct object *tagged = tag->tagged;
+
+	if (!tagged) {
+		return error_func(&tag->object, FSCK_ERROR, "could not load tagged object");
+	}
+	return 0;
+}
+
+int fsck_object(struct object *obj, int strict, fsck_error error_func)
+{
+	if (!obj)
+		return error_func(obj, FSCK_ERROR, "no object given");
+
+	if (obj->type == OBJ_BLOB)
+		return 0;
+	if (obj->type == OBJ_TREE)
+		return fsck_tree((struct tree *) obj, strict, error_func);
+	if (obj->type == OBJ_COMMIT)
+		return fsck_commit((struct commit *) obj, error_func);
+	if (obj->type == OBJ_TAG)
+		return fsck_tag((struct tag *) obj, error_func);
+
+	return error_func(obj, FSCK_ERROR, "unknown type '%d' (internal fsck error)",
+			  obj->type);
+}
diff --git a/fsck.h b/fsck.h
index caeb2c9..78819d3 100644
--- a/fsck.h
+++ b/fsck.h
@@ -2,9 +2,13 @@
 #define GIT_FSCK_H
 
 #define OBJ_ANY OBJ_BAD
+#define FSCK_ERROR 1
+#define FSCK_WARN 2
 
 typedef int (*fsck_walk_func)(struct object* obj, int type, void* data);
+typedef int (*fsck_error)(struct object *obj, int type, const char *err, ...);
 
 int fsck_walk(struct object* obj, fsck_walk_func walk, void* data);
+int fsck_object(struct object *obj, int strict, fsck_error error_func);
 
 #endif
-- 
1.5.4.g074be


From 22c621718c90d88c6295bd8035001451f9b7bf9d Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:07:59 +0100
Subject: [PATCH 07/12] fsck: use common object checker

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 builtin-fsck.c |  236 ++++++-------------------------------------------------
 1 files changed, 26 insertions(+), 210 deletions(-)

diff --git a/builtin-fsck.c b/builtin-fsck.c
index b0ca3ad..f69dc99 100644
--- a/builtin-fsck.c
+++ b/builtin-fsck.c
@@ -80,13 +80,13 @@ static int objerror(struct object *obj, const char *err, ...)
 	return -1;
 }
 
-static int objwarning(struct object *obj, const char *err, ...)
+static int fsck_error_func(struct object *obj, int type, const char *err, ...)
 {
 	va_list params;
 	va_start(params, err);
-	objreport(obj, "warning", err, params);
+	objreport(obj, (type == FSCK_WARN)?"warning":"error", err, params);
 	va_end(params);
-	return -1;
+	return (type == FSCK_WARN)?0:1;
 }
 
 static int check_reachable_object_childs(struct object *obj, int type, void *data)
@@ -239,201 +239,6 @@ static void check_connectivity(void)
 	}
 }
 
-/*
- * The entries in a tree are ordered in the _path_ order,
- * which means that a directory entry is ordered by adding
- * a slash to the end of it.
- *
- * So a directory called "a" is ordered _after_ a file
- * called "a.c", because "a/" sorts after "a.c".
- */
-#define TREE_UNORDERED (-1)
-#define TREE_HAS_DUPS  (-2)
-
-static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, const char *name2)
-{
-	int len1 = strlen(name1);
-	int len2 = strlen(name2);
-	int len = len1 < len2 ? len1 : len2;
-	unsigned char c1, c2;
-	int cmp;
-
-	cmp = memcmp(name1, name2, len);
-	if (cmp < 0)
-		return 0;
-	if (cmp > 0)
-		return TREE_UNORDERED;
-
-	/*
-	 * Ok, the first <len> characters are the same.
-	 * Now we need to order the next one, but turn
-	 * a '\0' into a '/' for a directory entry.
-	 */
-	c1 = name1[len];
-	c2 = name2[len];
-	if (!c1 && !c2)
-		/*
-		 * git-write-tree used to write out a nonsense tree that has
-		 * entries with the same name, one blob and one tree.  Make
-		 * sure we do not have duplicate entries.
-		 */
-		return TREE_HAS_DUPS;
-	if (!c1 && S_ISDIR(mode1))
-		c1 = '/';
-	if (!c2 && S_ISDIR(mode2))
-		c2 = '/';
-	return c1 < c2 ? 0 : TREE_UNORDERED;
-}
-
-static int fsck_tree(struct tree *item)
-{
-	int retval;
-	int has_full_path = 0;
-	int has_empty_name = 0;
-	int has_zero_pad = 0;
-	int has_bad_modes = 0;
-	int has_dup_entries = 0;
-	int not_properly_sorted = 0;
-	struct tree_desc desc;
-	unsigned o_mode;
-	const char *o_name;
-	const unsigned char *o_sha1;
-
-	if (verbose)
-		fprintf(stderr, "Checking tree %s\n",
-				sha1_to_hex(item->object.sha1));
-
-	init_tree_desc(&desc, item->buffer, item->size);
-
-	o_mode = 0;
-	o_name = NULL;
-	o_sha1 = NULL;
-	while (desc.size) {
-		unsigned mode;
-		const char *name;
-		const unsigned char *sha1;
-
-		sha1 = tree_entry_extract(&desc, &name, &mode);
-
-		if (strchr(name, '/'))
-			has_full_path = 1;
-		if (!*name)
-			has_empty_name = 1;
-		has_zero_pad |= *(char *)desc.buffer == '0';
-		update_tree_entry(&desc);
-
-		switch (mode) {
-		/*
-		 * Standard modes..
-		 */
-		case S_IFREG | 0755:
-		case S_IFREG | 0644:
-		case S_IFLNK:
-		case S_IFDIR:
-		case S_IFGITLINK:
-			break;
-		/*
-		 * This is nonstandard, but we had a few of these
-		 * early on when we honored the full set of mode
-		 * bits..
-		 */
-		case S_IFREG | 0664:
-			if (!check_strict)
-				break;
-		default:
-			has_bad_modes = 1;
-		}
-
-		if (o_name) {
-			switch (verify_ordered(o_mode, o_name, mode, name)) {
-			case TREE_UNORDERED:
-				not_properly_sorted = 1;
-				break;
-			case TREE_HAS_DUPS:
-				has_dup_entries = 1;
-				break;
-			default:
-				break;
-			}
-		}
-
-		o_mode = mode;
-		o_name = name;
-		o_sha1 = sha1;
-	}
-
-	retval = 0;
-	if (has_full_path) {
-		objwarning(&item->object, "contains full pathnames");
-	}
-	if (has_empty_name) {
-		objwarning(&item->object, "contains empty pathname");
-	}
-	if (has_zero_pad) {
-		objwarning(&item->object, "contains zero-padded file modes");
-	}
-	if (has_bad_modes) {
-		objwarning(&item->object, "contains bad file modes");
-	}
-	if (has_dup_entries) {
-		retval = objerror(&item->object, "contains duplicate file entries");
-	}
-	if (not_properly_sorted) {
-		retval = objerror(&item->object, "not properly sorted");
-	}
-	return retval;
-}
-
-static int fsck_commit(struct commit *commit)
-{
-	char *buffer = commit->buffer;
-	unsigned char tree_sha1[20], sha1[20];
-
-	if (verbose)
-		fprintf(stderr, "Checking commit %s\n",
-			sha1_to_hex(commit->object.sha1));
-
-	if (!commit->date)
-		return objerror(&commit->object, "invalid author/committer line");
-
-	if (memcmp(buffer, "tree ", 5))
-		return objerror(&commit->object, "invalid format - expected 'tree' line");
-	if (get_sha1_hex(buffer+5, tree_sha1) || buffer[45] != '\n')
-		return objerror(&commit->object, "invalid 'tree' line format - bad sha1");
-	buffer += 46;
-	while (!memcmp(buffer, "parent ", 7)) {
-		if (get_sha1_hex(buffer+7, sha1) || buffer[47] != '\n')
-			return objerror(&commit->object, "invalid 'parent' line format - bad sha1");
-		buffer += 48;
-	}
-	if (memcmp(buffer, "author ", 7))
-		return objerror(&commit->object, "invalid format - expected 'author' line");
-	if (!commit->tree)
-		return objerror(&commit->object, "could not load commit's tree %s", tree_sha1);
-	if (!commit->parents && show_root)
-		printf("root %s\n", sha1_to_hex(commit->object.sha1));
-	return 0;
-}
-
-static int fsck_tag(struct tag *tag)
-{
-	struct object *tagged = tag->tagged;
-
-	if (verbose)
-		fprintf(stderr, "Checking tag %s\n",
-			sha1_to_hex(tag->object.sha1));
-
-	if (!tagged) {
-		return objerror(&tag->object, "could not load tagged object");
-	}
-	if (!show_tags)
-		return 0;
-
-	printf("tagged %s %s", typename(tagged->type), sha1_to_hex(tagged->sha1));
-	printf(" (%s) in %s\n", tag->tag, sha1_to_hex(tag->object.sha1));
-	return 0;
-}
-
 static int fsck_sha1(const unsigned char *sha1)
 {
 	struct object *obj = parse_object(sha1);
@@ -445,19 +250,30 @@ static int fsck_sha1(const unsigned char *sha1)
 	if (obj->flags & SEEN)
 		return 0;
 	obj->flags |= SEEN;
+
+	if (verbose)
+		fprintf(stderr, "Checking %s %s\n",
+			typename(obj->type), sha1_to_hex(obj->sha1));
+
 	fsck_walk (obj, mark_used, 0);
-	if (obj->type == OBJ_BLOB)
-		return 0;
-	if (obj->type == OBJ_TREE)
-		return fsck_tree((struct tree *) obj);
-	if (obj->type == OBJ_COMMIT)
-		return fsck_commit((struct commit *) obj);
-	if (obj->type == OBJ_TAG)
-		return fsck_tag((struct tag *) obj);
-
-	/* By now, parse_object() would've returned NULL instead. */
-	return objerror(obj, "unknown type '%d' (internal fsck error)",
-			obj->type);
+	if (fsck_object (obj, check_strict, fsck_error_func))
+		return -1;
+
+	if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit*) obj;
+		if (!commit->parents && show_root)
+			printf("root %s\n", sha1_to_hex(commit->object.sha1));
+	}
+
+	if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag*) obj;
+		if (show_tags  && tag->tagged) {
+			printf("tagged %s %s", typename(tag->tagged->type), sha1_to_hex(tag->tagged->sha1));
+			printf(" (%s) in %s\n", tag->tag, sha1_to_hex(tag->object.sha1));
+		}
+	}
+
+	return 0;
 }
 
 /*
-- 
1.5.4.g074be


From 09938f28e230a236dc48014ace6c8ed02ad33084 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:08:45 +0100
Subject: [PATCH 08/12] add common fsck error printing function

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 fsck.c |   30 ++++++++++++++++++++++++++++++
 fsck.h |    2 ++
 2 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/fsck.c b/fsck.c
index ac253ae..07e48a8 100644
--- a/fsck.c
+++ b/fsck.c
@@ -282,3 +282,33 @@ int fsck_object(struct object *obj, int strict, fsck_error error_func)
 	return error_func(obj, FSCK_ERROR, "unknown type '%d' (internal fsck error)",
 			  obj->type);
 }
+
+int fsck_error_function(struct object *obj, int type, const char* fmt, ...)
+{
+	va_list ap;
+	int len;
+	struct strbuf sb;
+	
+	strbuf_init(&sb, 0);
+	strbuf_addf(&sb, "object %s:", sha1_to_hex(obj->sha1));
+	
+	va_start(ap, fmt);
+	len = vsnprintf(sb.buf + sb.len, strbuf_avail(&sb), fmt, ap);
+	va_end(ap);
+	
+	if (len < 0)
+		len = 0;
+	if (len >= strbuf_avail(&sb)) {
+		strbuf_grow(&sb, len + 2);
+		va_start(ap, fmt);
+		len = vsnprintf(sb.buf + sb.len, strbuf_avail(&sb), fmt, ap);
+		va_end(ap);
+		if (len >= strbuf_avail(&sb)) {
+			die("this should not happen, your snprintf is broken");
+		}
+	}
+	
+	error(sb.buf);
+	strbuf_release(&sb);
+	return 1;
+}
diff --git a/fsck.h b/fsck.h
index 78819d3..f13631c 100644
--- a/fsck.h
+++ b/fsck.h
@@ -8,6 +8,8 @@
 typedef int (*fsck_walk_func)(struct object* obj, int type, void* data);
 typedef int (*fsck_error)(struct object *obj, int type, const char *err, ...);
 
+int fsck_error_function(struct object *obj, int type, const char* fmt, ...);
+
 int fsck_walk(struct object* obj, fsck_walk_func walk, void* data);
 int fsck_object(struct object *obj, int strict, fsck_error error_func);
 
-- 
1.5.4.g074be


From a8db4e754e717bac0b2462333d4145eac3452099 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:14:14 +0100
Subject: [PATCH 09/12] unpack-objects: prevent writing of inconsistent objects

This patch introduces a strict mode, which ensures that:
- no malformed object will be written
- no object with broken links will be written

The patch ensures this by delaying the write of all non blob object.
These object are written, after all objects they link to are written.

An error can only result in unreferenced objects.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 Documentation/git-unpack-objects.txt |    3 +
 builtin-unpack-objects.c             |   99 +++++++++++++++++++++++++++++++---
 2 files changed, 95 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-unpack-objects.txt b/Documentation/git-unpack-objects.txt
index b79be3f..3697896 100644
--- a/Documentation/git-unpack-objects.txt
+++ b/Documentation/git-unpack-objects.txt
@@ -40,6 +40,9 @@ OPTIONS
 	and make the best effort to recover as many objects as
 	possible.
 
+--strict::
+	Don't write objects with broken content or links.
+
 
 Author
 ------
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index f18c7e8..3e906e4 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -7,11 +7,13 @@
 #include "commit.h"
 #include "tag.h"
 #include "tree.h"
+#include "tree-walk.h"
 #include "progress.h"
 #include "decorate.h"
+#include "fsck.h"
 
-static int dry_run, quiet, recover, has_errors;
-static const char unpack_usage[] = "git-unpack-objects [-n] [-q] [-r] < pack-file";
+static int dry_run, quiet, recover, has_errors, strict;
+static const char unpack_usage[] = "git-unpack-objects [-n] [-q] [-r] [--strict] < pack-file";
 
 /* We always read in 4kB chunks. */
 static unsigned char buffer[4096];
@@ -144,9 +146,58 @@ static void add_delta_to_list(unsigned nr, unsigned const char *base_sha1,
 struct obj_info {
 	off_t offset;
 	unsigned char sha1[20];
+	struct object * obj;
 };
 
+#define FLAG_OPEN (1u<<20)
+#define FLAG_WRITTEN (1u<<21)
+
 static struct obj_info *obj_list;
+unsigned nr_objects;
+
+static void write_cached_object(struct object* obj)
+{
+	unsigned char sha1[20];
+	struct obj_buffer *obj_buf = lookup_object_buffer(obj);
+	if (write_sha1_file(obj_buf->buffer, obj_buf->size, typename(obj->type), sha1) < 0)
+		die("failed to write object %s", sha1_to_hex(obj->sha1));
+	obj->flags |= FLAG_WRITTEN;
+}
+
+static int check_object(struct object* obj, int type, void *data)
+{
+	if (!obj)
+		return 0;
+	
+	if (obj->flags & FLAG_WRITTEN)
+		return 1;
+	
+	if (type != OBJ_ANY && obj->type != type)
+		die("object type mismatch");
+	
+	if (!(obj->flags & FLAG_OPEN)) {
+		unsigned long size;
+		int type = sha1_object_info (obj->sha1, &size);
+		if (type != obj->type || type <= 0)
+			die("object of unexpected type");
+		obj->flags |= FLAG_WRITTEN;
+		return 1;
+	}
+	
+	if (fsck_object(obj, 1, fsck_error_function))
+	  die("Error in object");
+	if(!fsck_walk(obj, check_object, 0))
+		die("Error on reachable objects of %s", sha1_to_hex (obj->sha1));
+	write_cached_object(obj);
+	return 1;
+}
+
+static void write_rest()
+{
+	unsigned i;
+	for (i = 0; i < nr_objects; i++)
+		check_object(obj_list[i].obj, OBJ_ANY, 0);
+}
 
 static void added_object(unsigned nr, enum object_type type,
 			 void *data, unsigned long size);
@@ -154,9 +205,35 @@ static void added_object(unsigned nr, enum object_type type,
 static void write_object(unsigned nr, enum object_type type,
 			 void *buf, unsigned long size)
 {
-	if (write_sha1_file(buf, size, typename(type), obj_list[nr].sha1) < 0)
-		die("failed to write object");
 	added_object(nr, type, buf, size);
+	if (!strict) {
+		if (write_sha1_file(buf, size, typename(type), obj_list[nr].sha1) < 0)
+			die("failed to write object");
+		free(buf);
+		obj_list[nr].obj = 0;
+	} else if (type == OBJ_BLOB) {
+		struct blob * blob;
+		if (write_sha1_file(buf, size, typename(type), obj_list[nr].sha1) < 0)
+			die("failed to write object");
+		free(buf);
+
+		blob = lookup_blob (obj_list[nr].sha1);
+		if (blob)
+			blob->object.flags |= FLAG_WRITTEN;
+		else
+			die("invalid blob object");
+		obj_list[nr].obj = 0;
+	} else {
+		struct object * obj;
+		int eaten;
+		hash_sha1_file(buf, size, typename(type), obj_list[nr].sha1);
+		obj = parse_object_buffer(obj_list[nr].sha1, type, size, buf, &eaten);
+		if (!obj)
+			die ("invalid %s", typename(type));
+		add_object_buffer (obj, buf, size);
+		obj->flags |= FLAG_OPEN;
+		obj_list[nr].obj = obj;
+	}
 }
 
 static void resolve_delta(unsigned nr, enum object_type type,
@@ -173,7 +250,6 @@ static void resolve_delta(unsigned nr, enum object_type type,
 		die("failed to apply delta");
 	free(delta);
 	write_object(nr, type, result, result_size);
-	free(result);
 }
 
 static void added_object(unsigned nr, enum object_type type,
@@ -203,7 +279,8 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 
 	if (!dry_run && buf)
 		write_object(nr, type, buf, size);
-	free(buf);
+	else if (buf)
+		free(buf);
 }
 
 static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
@@ -345,7 +422,8 @@ static void unpack_all(void)
 	int i;
 	struct progress *progress = NULL;
 	struct pack_header *hdr = fill(sizeof(struct pack_header));
-	unsigned nr_objects = ntohl(hdr->hdr_entries);
+
+	nr_objects = ntohl(hdr->hdr_entries);
 
 	if (ntohl(hdr->hdr_signature) != PACK_SIGNATURE)
 		die("bad pack file");
@@ -356,6 +434,7 @@ static void unpack_all(void)
 	if (!quiet)
 		progress = start_progress("Unpacking objects", nr_objects);
 	obj_list = xmalloc(nr_objects * sizeof(*obj_list));
+	memset(obj_list, 0, nr_objects * sizeof(*obj_list));
 	for (i = 0; i < nr_objects; i++) {
 		unpack_one(i);
 		display_progress(progress, i + 1);
@@ -391,6 +470,10 @@ int cmd_unpack_objects(int argc, const char **argv, const char *prefix)
 				recover = 1;
 				continue;
 			}
+			if (!strcmp(arg, "--strict")) {
+				strict = 1;
+				continue;
+			}
 			if (!prefixcmp(arg, "--pack_header=")) {
 				struct pack_header *hdr;
 				char *c;
@@ -416,6 +499,8 @@ int cmd_unpack_objects(int argc, const char **argv, const char *prefix)
 	unpack_all();
 	SHA1_Update(&ctx, buffer, offset);
 	SHA1_Final(sha1, &ctx);
+	if (strict)
+		write_rest();
 	if (hashcmp(fill(20), sha1))
 		die("final sha1 did not match");
 	use(20);
-- 
1.5.4.g074be


From e2c4a69caeb7650a2655db864827879a0eea701f Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Fri, 8 Feb 2008 09:16:05 +0100
Subject: [PATCH 10/12] index-pack: introduce checking mode

Adds strict option, which bails out if the pack would
introduces broken object or links in the repository.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 Documentation/git-index-pack.txt |    3 +
 index-pack.c                     |   78 +++++++++++++++++++++++++++++++++++--
 2 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 72b5d00..a7825b6 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -75,6 +75,9 @@ OPTIONS
 	to force the version for the generated pack index, and to force
 	64-bit index entries on objects located above the given offset.
 
+--strict::
+	Die, if the pack contains broken objects or links.
+
 
 Note
 ----
diff --git a/index-pack.c b/index-pack.c
index 9fd6982..bde0249 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -7,9 +7,10 @@
 #include "tag.h"
 #include "tree.h"
 #include "progress.h"
+#include "fsck.h"
 
 static const char index_pack_usage[] =
-"git-index-pack [-v] [-o <index-file>] [{ ---keep | --keep=<msg> }] { <pack-file> | --stdin [--fix-thin] [<pack-file>] }";
+"git-index-pack [-v] [-o <index-file>] [{ ---keep | --keep=<msg> }] [--strict] { <pack-file> | --stdin [--fix-thin] [<pack-file>] }";
 
 struct object_entry
 {
@@ -18,6 +19,7 @@ struct object_entry
 	unsigned int hdr_size;
 	enum object_type type;
 	enum object_type real_type;
+	struct object *obj;
 };
 
 union delta_base {
@@ -31,6 +33,9 @@ union delta_base {
  */
 #define UNION_BASE_SZ	20
 
+#define FLAG_OPEN (1u<<20)
+#define FLAG_WRITTEN (1u<<21)
+
 struct delta_entry
 {
 	union delta_base base;
@@ -44,6 +49,7 @@ static int nr_deltas;
 static int nr_resolved_deltas;
 
 static int from_stdin;
+static int strict;
 static int verbose;
 
 static struct progress *progress;
@@ -56,6 +62,41 @@ static SHA_CTX input_ctx;
 static uint32_t input_crc32;
 static int input_fd, output_fd, pack_fd;
 
+static int check_object(struct object* obj, int type, void *data)
+{
+	if (!obj)
+		return 0;
+	
+	if (obj->flags & FLAG_WRITTEN)
+		return 1;
+	
+	if (type != OBJ_ANY && obj->type != type)
+		die("object type mismatch");
+	
+	if (!(obj->flags & FLAG_OPEN)) {
+		unsigned long size;
+		int type = sha1_object_info (obj->sha1, &size);
+		if (type != obj->type || type <= 0)
+			die("object of unexpected type");
+		obj->flags |= FLAG_WRITTEN;
+		return 1;
+	}
+	
+	if (fsck_object(obj, 1, fsck_error_function))
+	  die("Error in object");
+	if(!fsck_walk(obj, check_object, 0))
+		die("Error on reachable objects of %s", sha1_to_hex (obj->sha1));
+	return 1;
+}
+
+static void check_objects()
+{
+	unsigned i;
+	for (i = 0; i < nr_objects; i++)
+		check_object(objects[i].obj, OBJ_ANY, 0);
+}
+
+
 /* Discard current buffer used content. */
 static void flush(void)
 {
@@ -325,8 +366,8 @@ static int find_delta_children(const union delta_base *base,
 	return 0;
 }
 
-static void sha1_object(const void *data, unsigned long size,
-			enum object_type type, unsigned char *sha1)
+static struct object* sha1_object(const void *data, unsigned long size,
+				  enum object_type type, unsigned char *sha1)
 {
 	hash_sha1_file(data, size, typename(type), sha1);
 	if (has_sha1_file(sha1)) {
@@ -341,6 +382,29 @@ static void sha1_object(const void *data, unsigned long size,
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
 		free(has_data);
 	}
+	if (strict) {
+		if (type == OBJ_BLOB) {
+			struct blob * blob = lookup_blob(sha1);
+			if (blob)
+				blob->object.flags |= FLAG_WRITTEN;
+			else
+				die("invalid blob object %s", sha1_to_hex(sha1));
+			return (struct object*)blob;
+		} else {
+			struct object * obj;
+			int eaten;
+			void *buf = xmemdupz(data, size);
+			
+			obj = parse_object_buffer(sha1, type, size, buf, &eaten);
+			if (!obj)
+				die ("invalid %s", typename(type));
+			if(!eaten)
+				free(buf);
+			obj->flags |= FLAG_OPEN;
+			return obj;
+		}
+	}
+	return NULL;
 }
 
 static void resolve_delta(struct object_entry *delta_obj, void *base_data,
@@ -361,7 +425,7 @@ static void resolve_delta(struct object_entry *delta_obj, void *base_data,
 	free(delta_data);
 	if (!result)
 		bad_object(delta_obj->idx.offset, "failed to apply delta");
-	sha1_object(result, result_size, type, delta_obj->idx.sha1);
+	delta_obj->obj = sha1_object(result, result_size, type, delta_obj->idx.sha1);
 	nr_resolved_deltas++;
 
 	hashcpy(delta_base.sha1, delta_obj->idx.sha1);
@@ -420,7 +484,7 @@ static void parse_pack_objects(unsigned char *sha1)
 			delta->obj_no = i;
 			delta++;
 		} else
-			sha1_object(data, obj->size, obj->type, obj->idx.sha1);
+			obj->obj = sha1_object(data, obj->size, obj->type, obj->idx.sha1);
 		free(data);
 		display_progress(progress, i+1);
 	}
@@ -714,6 +778,8 @@ int main(int argc, char **argv)
 				from_stdin = 1;
 			} else if (!strcmp(arg, "--fix-thin")) {
 				fix_thin_pack = 1;
+			} else if (!strcmp(arg, "--strict")) {
+				strict = 1;
 			} else if (!strcmp(arg, "--keep")) {
 				keep_msg = "";
 			} else if (!prefixcmp(arg, "--keep=")) {
@@ -812,6 +878,8 @@ int main(int argc, char **argv)
 			    nr_deltas - nr_resolved_deltas);
 	}
 	free(deltas);
+	if (strict)
+		check_objects();
 
 	idx_objects = xmalloc((nr_objects) * sizeof(struct pack_idx_entry *));
 	for (i = 0; i < nr_objects; i++)
-- 
1.5.4.g074be


From a992fadf4014830315d7b5053dc633c2c566ae44 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Sat, 9 Feb 2008 18:17:56 +0100
Subject: [PATCH 11/12] receive-pack: use strict mode for unpacking objects

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 Documentation/config.txt |    6 ++++++
 receive-pack.c           |   36 +++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 530a6a8..d5b7698 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -909,6 +909,12 @@ imap::
 	The configuration variables in the 'imap' section are described
 	in linkgit:git-imap-send[1].
 
+receive.fsckObjects::
+	If it is set to true, git-receive-pack will check all received 
+	objects. It will abort in the case of a malformed object or a 
+	broken link. The result of an abort are only dangling objects.
+	The default value is false.
+
 receive.unpackLimit::
 	If the number of objects received in a push is below this
 	limit then the objects will be unpacked into loose object
diff --git a/receive-pack.c b/receive-pack.c
index 3267495..7136f35 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -10,6 +10,7 @@
 static const char receive_pack_usage[] = "git-receive-pack <git-dir>";
 
 static int deny_non_fast_forwards = 0;
+static int receive_fsck_objects = 0;
 static int receive_unpack_limit = -1;
 static int transfer_unpack_limit = -1;
 static int unpack_limit = 100;
@@ -35,6 +36,11 @@ static int receive_pack_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (strcmp(var, "receive.fsckobjects") == 0) {
+		receive_fsck_objects = git_config_bool(var, value);
+		return 0;
+	}
+
 	return git_default_config(var, value);
 }
 
@@ -367,11 +373,13 @@ static const char *unpack(void)
 			ntohl(hdr.hdr_version), ntohl(hdr.hdr_entries));
 
 	if (ntohl(hdr.hdr_entries) < unpack_limit) {
-		int code;
-		const char *unpacker[3];
-		unpacker[0] = "unpack-objects";
-		unpacker[1] = hdr_arg;
-		unpacker[2] = NULL;
+		int code, i = 0;
+		const char *unpacker[4];
+		unpacker[i++] = "unpack-objects";
+		if (receive_fsck_objects)
+			unpacker[i++] = "--strict";
+		unpacker[i++] = hdr_arg;
+		unpacker[i++] = NULL;
 		code = run_command_v_opt(unpacker, RUN_GIT_CMD);
 		switch (code) {
 		case 0:
@@ -392,8 +400,8 @@ static const char *unpack(void)
 			return "unpacker exited with error code";
 		}
 	} else {
-		const char *keeper[6];
-		int s, status;
+		const char *keeper[7];
+		int s, status, i = 0;
 		char keep_arg[256];
 		struct child_process ip;
 
@@ -401,12 +409,14 @@ static const char *unpack(void)
 		if (gethostname(keep_arg + s, sizeof(keep_arg) - s))
 			strcpy(keep_arg + s, "localhost");
 
-		keeper[0] = "index-pack";
-		keeper[1] = "--stdin";
-		keeper[2] = "--fix-thin";
-		keeper[3] = hdr_arg;
-		keeper[4] = keep_arg;
-		keeper[5] = NULL;
+		keeper[i++] = "index-pack";
+		keeper[i++] = "--stdin";
+		if (receive_fsck_objects)
+			keeper[i++] = "--strict";
+		keeper[i++] = "--fix-thin";
+		keeper[i++] = hdr_arg;
+		keeper[i++] = keep_arg;
+		keeper[i++] = NULL;
 		memset(&ip, 0, sizeof(ip));
 		ip.argv = keeper;
 		ip.out = -1;
-- 
1.5.4.g074be


From c39a307f2f6f551d967cbc9fb502de7cad8beae6 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Sat, 9 Feb 2008 18:18:41 +0100
Subject: [PATCH 12/12] Remove unused object-ref code

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 Makefile               |    2 +-
 builtin-fetch-pack.c   |    1 -
 builtin-pack-objects.c |    1 -
 builtin-rev-list.c     |    1 -
 commit.c               |   11 ------
 object-refs.c          |   87 ------------------------------------------------
 object.h               |    8 ----
 tag.c                  |    6 ---
 tree.c                 |   48 --------------------------
 upload-pack.c          |    1 -
 walker.c               |    1 -
 11 files changed, 1 insertions(+), 166 deletions(-)
 delete mode 100644 object-refs.c

diff --git a/Makefile b/Makefile
index 0921484..268b8dd 100644
--- a/Makefile
+++ b/Makefile
@@ -316,7 +316,7 @@ LIB_OBJS = \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
 	sideband.o reachable.o reflog-walk.o \
-	quote.o read-cache.o refs.o run-command.o dir.o object-refs.o \
+	quote.o read-cache.o refs.o run-command.o dir.o \
 	server-info.o setup.o sha1_file.o sha1_name.o strbuf.o \
 	tag.o tree.o usage.o config.o environment.o ctype.o copy.o \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
diff --git a/builtin-fetch-pack.c b/builtin-fetch-pack.c
index f401352..25f1915 100644
--- a/builtin-fetch-pack.c
+++ b/builtin-fetch-pack.c
@@ -386,7 +386,6 @@ static int everything_local(struct ref **refs, int nr_match, char **match)
 	int retval;
 	unsigned long cutoff = 0;
 
-	track_object_refs = 0;
 	save_commit_buffer = 0;
 
 	for (ref = *refs; ref; ref = ref->next) {
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index d3efeff..6f8f388 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -2009,7 +2009,6 @@ static void get_object_list(int ac, const char **av)
 
 	init_revisions(&revs, NULL);
 	save_commit_buffer = 0;
-	track_object_refs = 0;
 	setup_revisions(ac, av, &revs, NULL);
 
 	while (fgets(line, sizeof(line), stdin) != NULL) {
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index de80158..14e86ce 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -605,7 +605,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 		usage(rev_list_usage);
 
 	save_commit_buffer = revs.verbose_header || revs.grep_filter;
-	track_object_refs = 0;
 	if (bisect_list)
 		revs.limited = 1;
 
diff --git a/commit.c b/commit.c
index 8b8fb04..df4d331 100644
--- a/commit.c
+++ b/commit.c
@@ -290,17 +290,6 @@ int parse_commit_buffer(struct commit *item, void *buffer, unsigned long size)
 	}
 	item->date = parse_commit_date(bufptr, tail);
 
-	if (track_object_refs) {
-		unsigned i = 0;
-		struct commit_list *p;
-		struct object_refs *refs = alloc_object_refs(n_refs);
-		if (item->tree)
-			refs->ref[i++] = &item->tree->object;
-		for (p = item->parents; p; p = p->next)
-			refs->ref[i++] = &p->item->object;
-		set_object_refs(&item->object, refs);
-	}
-
 	return 0;
 }
 
diff --git a/object-refs.c b/object-refs.c
deleted file mode 100644
index 5345671..0000000
--- a/object-refs.c
+++ /dev/null
@@ -1,87 +0,0 @@
-#include "cache.h"
-#include "object.h"
-#include "decorate.h"
-
-int track_object_refs = 0;
-
-static struct decoration ref_decorate;
-
-struct object_refs *lookup_object_refs(struct object *base)
-{
-	return lookup_decoration(&ref_decorate, base);
-}
-
-static void add_object_refs(struct object *obj, struct object_refs *refs)
-{
-	if (add_decoration(&ref_decorate, obj, refs))
-		die("object %s tried to add refs twice!", sha1_to_hex(obj->sha1));
-}
-
-struct object_refs *alloc_object_refs(unsigned count)
-{
-	struct object_refs *refs;
-	size_t size = sizeof(*refs) + count*sizeof(struct object *);
-
-	refs = xcalloc(1, size);
-	refs->count = count;
-	return refs;
-}
-
-static int compare_object_pointers(const void *a, const void *b)
-{
-	const struct object * const *pa = a;
-	const struct object * const *pb = b;
-	if (*pa == *pb)
-		return 0;
-	else if (*pa < *pb)
-		return -1;
-	else
-		return 1;
-}
-
-void set_object_refs(struct object *obj, struct object_refs *refs)
-{
-	unsigned int i, j;
-
-	/* Do not install empty list of references */
-	if (refs->count < 1) {
-		free(refs);
-		return;
-	}
-
-	/* Sort the list and filter out duplicates */
-	qsort(refs->ref, refs->count, sizeof(refs->ref[0]),
-	      compare_object_pointers);
-	for (i = j = 1; i < refs->count; i++) {
-		if (refs->ref[i] != refs->ref[i - 1])
-			refs->ref[j++] = refs->ref[i];
-	}
-	if (j < refs->count) {
-		/* Duplicates were found - reallocate list */
-		size_t size = sizeof(*refs) + j*sizeof(struct object *);
-		refs->count = j;
-		refs = xrealloc(refs, size);
-	}
-
-	for (i = 0; i < refs->count; i++)
-		refs->ref[i]->used = 1;
-	add_object_refs(obj, refs);
-}
-
-void mark_reachable(struct object *obj, unsigned int mask)
-{
-	const struct object_refs *refs;
-
-	if (!track_object_refs)
-		die("cannot do reachability with object refs turned off");
-	/* If we've been here already, don't bother */
-	if (obj->flags & mask)
-		return;
-	obj->flags |= mask;
-	refs = lookup_object_refs(obj);
-	if (refs) {
-		unsigned i;
-		for (i = 0; i < refs->count; i++)
-			mark_reachable(refs->ref[i], mask);
-	}
-}
diff --git a/object.h b/object.h
index 397bbfa..036bd66 100644
--- a/object.h
+++ b/object.h
@@ -35,14 +35,11 @@ struct object {
 	unsigned char sha1[20];
 };
 
-extern int track_object_refs;
-
 extern const char *typename(unsigned int type);
 extern int type_from_string(const char *str);
 
 extern unsigned int get_max_object_index(void);
 extern struct object *get_indexed_object(unsigned int);
-extern struct object_refs *lookup_object_refs(struct object *);
 
 /** Internal only **/
 struct object *lookup_object(const unsigned char *sha1);
@@ -61,11 +58,6 @@ struct object *parse_object_buffer(const unsigned char *sha1, enum object_type t
 /** Returns the object, with potentially excess memory allocated. **/
 struct object *lookup_unknown_object(const unsigned  char *sha1);
 
-struct object_refs *alloc_object_refs(unsigned count);
-void set_object_refs(struct object *obj, struct object_refs *refs);
-
-void mark_reachable(struct object *obj, unsigned int mask);
-
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p);
 
diff --git a/tag.c b/tag.c
index 38bf913..d19f56d 100644
--- a/tag.c
+++ b/tag.c
@@ -84,12 +84,6 @@ int parse_tag_buffer(struct tag *item, void *data, unsigned long size)
 		item->tagged = NULL;
 	}
 
-	if (item->tagged && track_object_refs) {
-		struct object_refs *refs = alloc_object_refs(1);
-		refs->ref[0] = item->tagged;
-		set_object_refs(&item->object, refs);
-	}
-
 	return 0;
 }
 
diff --git a/tree.c b/tree.c
index 87708ef..4b1825c 100644
--- a/tree.c
+++ b/tree.c
@@ -202,52 +202,6 @@ struct tree *lookup_tree(const unsigned char *sha1)
 	return (struct tree *) obj;
 }
 
-/*
- * NOTE! Tree refs to external git repositories
- * (ie gitlinks) do not count as real references.
- *
- * You don't have to have those repositories
- * available at all, much less have the objects
- * accessible from the current repository.
- */
-static void track_tree_refs(struct tree *item)
-{
-	int n_refs = 0, i;
-	struct object_refs *refs;
-	struct tree_desc desc;
-	struct name_entry entry;
-
-	/* Count how many entries there are.. */
-	init_tree_desc(&desc, item->buffer, item->size);
-	while (tree_entry(&desc, &entry)) {
-		if (S_ISGITLINK(entry.mode))
-			continue;
-		n_refs++;
-	}
-
-	/* Allocate object refs and walk it again.. */
-	i = 0;
-	refs = alloc_object_refs(n_refs);
-	init_tree_desc(&desc, item->buffer, item->size);
-	while (tree_entry(&desc, &entry)) {
-		struct object *obj;
-
-		if (S_ISGITLINK(entry.mode))
-			continue;
-		if (S_ISDIR(entry.mode))
-			obj = &lookup_tree(entry.sha1)->object;
-		else if (S_ISREG(entry.mode) || S_ISLNK(entry.mode))
-			obj = &lookup_blob(entry.sha1)->object;
-		else {
-			warning("in tree %s: entry %s has bad mode %.6o\n",
-			     sha1_to_hex(item->object.sha1), entry.path, entry.mode);
-			obj = lookup_unknown_object(entry.sha1);
-		}
-		refs->ref[i++] = obj;
-	}
-	set_object_refs(&item->object, refs);
-}
-
 int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size)
 {
 	if (item->object.parsed)
@@ -256,8 +210,6 @@ int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size)
 	item->buffer = buffer;
 	item->size = size;
 
-	if (track_object_refs)
-		track_tree_refs(item);
 	return 0;
 }
 
diff --git a/upload-pack.c b/upload-pack.c
index 7e04311..b2828ab 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -392,7 +392,6 @@ static int get_common_commits(void)
 	char hex[41], last_hex[41];
 	int len;
 
-	track_object_refs = 0;
 	save_commit_buffer = 0;
 
 	for(;;) {
diff --git a/walker.c b/walker.c
index adc3e80..c10eca8 100644
--- a/walker.c
+++ b/walker.c
@@ -256,7 +256,6 @@ int walker_fetch(struct walker *walker, int targets, char **target,
 	int i;
 
 	save_commit_buffer = 0;
-	track_object_refs = 0;
 
 	for (i = 0; i < targets; i++) {
 		if (!write_ref || !write_ref[i])
-- 
1.5.4.g074be

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-10 17:58 [RFC Patch] Preventing corrupt objects from entering the repository Martin Koegler
@ 2008-02-11  0:00 ` Junio C Hamano
  2008-02-11  0:33   ` Nicolas Pitre
  2008-02-12  7:20   ` Martin Koegler
  0 siblings, 2 replies; 23+ messages in thread
From: Junio C Hamano @ 2008-02-11  0:00 UTC (permalink / raw)
  To: Martin Koegler; +Cc: git

mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:

> So I tried to fix the problem in another way: Instead of coping with 
> corrupt objects, stop them from entering the repository.

Good intention.  Very nice.

> * add --strict option to unpack-objects (patch 1,8,9)
> * add --strict option to index-pack (patch 8,10)
>
>   Same as for unpack-objects, but without writting objects.
>
> * add config option for receive pack to enable checking (patch 11)

If this patch is any good, I strongly suspect it should not be
just the default but should always be on.  IOW no config is
necessary.  That would make the series a bit simpler, I guess.

> From 76e86fe55345e633c910d6b8fe166e27c23c5aaf Mon Sep 17 00:00:00 2001
> From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> Date: Fri, 8 Feb 2008 08:51:38 +0100
> Subject: [PATCH 01/12] unpack-object: cache for non written objects
>
> Preventing objects with broken links entering the repository
> means, that write of some objects must be delayed.
>
> This patch adds a cache to keep the object data in memory. The delta
> resolving code must also search in the cache.

I have to wonder what the memory pressure in real-life usage
will be like.

When an object is proven to be good, we should be able to free
its buffer after writing it out, but would that be a good enough
optimization we can make later on this code to keep its memory
consumption manageable?

> diff --git a/fsck.c b/fsck.c
> new file mode 100644
> index 0000000..089f775
> --- /dev/null
> +++ b/fsck.c
> @@ -0,0 +1,84 @@
> +#include "cache.h"
> +#include "object.h"
> +#include "blob.h"
> +#include "tree.h"
> +#include "tree-walk.h"
> +#include "commit.h"
> +#include "tag.h"
> +#include "fsck.h"
> +
> +static int fsck_walk_tree(struct tree* tree, fsck_walk_func walk, void* data)
> +{
> +	struct tree_desc desc;
> +	struct name_entry entry;
> +
> +	if(parse_tree(tree))
> +		return -1;

I noticed many coding style issues but I won't be mentioning
them in this response.

It's a bit hard to see how these new set of functions relate to
the original code in this patch series, because you add the new
things that are initially not used anywhere independently, start
referring to them in a separate patch and then remove the old
related functions that are now unused.  This style makes
reviewing easier and harder at the same time...

> From 80b22c3f2c3e13c207790a49646020c55b34bba7 Mon Sep 17 00:00:00 2001
> From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> Date: Fri, 8 Feb 2008 09:01:50 +0100
> Subject: [PATCH 03/12] fsck: move mark-reachable to fsck_walk
>
> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> ---
>  builtin-fsck.c |   34 ++++++++++++++++++++++++----------
>  1 files changed, 24 insertions(+), 10 deletions(-)
>  ...
> +static int mark_object(struct object* obj, int type, void* data)
> +{
> +	if (!obj)
> +		return 0;
> +	if (obj->flags & REACHABLE)
> +		return 0;
> +	obj->flags |= REACHABLE;
> +	if (!obj->parsed)
> +		return 0;
> +	return fsck_walk(obj, mark_object, data);
> +}

Hmm.  The return value 0 means Ok and negative is error?  The
reason we can say success if obj is NULL or it is not parsed yet
is because...?

> @@ -326,8 +344,6 @@ static int fsck_tree(struct tree *item)
>  		o_name = name;
>  		o_sha1 = sha1;
>  	}
> -	free(item->buffer);
> -	item->buffer = NULL;

Hmm.  The reason you still need the buffer after you checked the
contents of the tree in the loop is because you haven't actually
checked the referents are Ok.  But I do not see a corresponding
free that releases this memory after you are actually done with
the verification with fsck_walk() yet, so we leak this in the
meantime?

> @@ -375,8 +391,6 @@ static int fsck_commit(struct commit *commit)
>  	}
>  	if (memcmp(buffer, "author ", 7))
>  		return objerror(&commit->object, "invalid format - expected 'author' line");
> -	free(commit->buffer);
> -	commit->buffer = NULL;
>  	if (!commit->tree)
>  		return objerror(&commit->object, "could not load commit's tree %s", tree_sha1);
>  	if (!commit->parents && show_root)

Likewise.

> From ce43251ef71962ff64fe138f1295c405ef6aaf65 Mon Sep 17 00:00:00 2001
> From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> Date: Fri, 8 Feb 2008 09:04:08 +0100
> Subject: [PATCH 04/12] fsck: move reachable object check to fsck_walk
>
> It handles NULL pointers in object references without crashing.
>
> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> ---
>  builtin-fsck.c |   49 +++++++++++++++++++++++++++++--------------------
>  1 files changed, 29 insertions(+), 20 deletions(-)
>
> diff --git a/builtin-fsck.c b/builtin-fsck.c
> index 49e96ff..2c1e10f 100644
> --- a/builtin-fsck.c
> +++ b/builtin-fsck.c
> @@ -81,13 +81,39 @@ static int objwarning(struct object *obj, const char *err, ...)
>  	return -1;
>  }
>  
> +static int check_reachable_object_childs(struct object *obj, int type, void *data)
> +{
> +	struct object *parent = data;
> +	if (!obj) {
> +		printf("broken link from %7s %s\n",
> +			   typename(parent->type), sha1_to_hex(parent->sha1));
> +		printf("broken link from %7s %s\n",
> +			   (type==OBJ_ANY?"unknown":typename(type)), "unknown");

Hmm?  I am not sure what this part is reporting...

> From ee11f771be1ef1c29725cb56ab3eb8dfe61ca25a Mon Sep 17 00:00:00 2001
> From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> Date: Fri, 8 Feb 2008 09:07:33 +0100
> Subject: [PATCH 06/12] create a common object checker code out of fsck
>
> The function provides a callback for reporting errors.

The same "add unused new stuff independently, later use it and
then finally remove now unused old stuff" pattern is here.  I am
neutral to that patch style but it is a bit harder to see what
is going on.

Most of the changes seem to be straight and sane copy-and-paste though.

> From a8db4e754e717bac0b2462333d4145eac3452099 Mon Sep 17 00:00:00 2001
> From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> Date: Fri, 8 Feb 2008 09:14:14 +0100
> Subject: [PATCH 09/12] unpack-objects: prevent writing of inconsistent objects
>
> This patch introduces a strict mode, which ensures that:
> - no malformed object will be written
> - no object with broken links will be written
>
> The patch ensures this by delaying the write of all non blob object.
> These object are written, after all objects they link to are written.
>
> An error can only result in unreferenced objects.

> diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
> index f18c7e8..3e906e4 100644
> --- a/builtin-unpack-objects.c
> +++ b/builtin-unpack-objects.c
> @@ -173,7 +250,6 @@ static void resolve_delta(unsigned nr, enum object_type type,
>  		die("failed to apply delta");
>  	free(delta);
>  	write_object(nr, type, result, result_size);
> -	free(result);
>  }

And this is freed later elsewhere?

> @@ -203,7 +279,8 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
>  
>  	if (!dry_run && buf)
>  		write_object(nr, type, buf, size);
> -	free(buf);
> +	else if (buf)
> +		free(buf);
>  }

You can always free NULL without checking.

> @@ -356,6 +434,7 @@ static void unpack_all(void)
>  	if (!quiet)
>  		progress = start_progress("Unpacking objects", nr_objects);
>  	obj_list = xmalloc(nr_objects * sizeof(*obj_list));
> +	memset(obj_list, 0, nr_objects * sizeof(*obj_list));

Hmm, is this a fix to the 'master' independent from all the rest
of your patches, or a new requirement?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-11  0:00 ` Junio C Hamano
@ 2008-02-11  0:33   ` Nicolas Pitre
  2008-02-11 19:56     ` Martin Koegler
  2008-02-12  7:20   ` Martin Koegler
  1 sibling, 1 reply; 23+ messages in thread
From: Nicolas Pitre @ 2008-02-11  0:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Koegler, git

On Sun, 10 Feb 2008, Junio C Hamano wrote:

> mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:
> 
> > This patch adds a cache to keep the object data in memory. The delta
> > resolving code must also search in the cache.
> 
> I have to wonder what the memory pressure in real-life usage
> will be like.

FWIW, I don't like this idea.

I'm struggling to find ways to improve performances of 
pack-objects/index-pack with those large repositories that are becoming 
more common (i.e. GCC, OOO, Mozilla, etc.)  Anything that increase 
memory usage isn't very welcome IMHO.


Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-11  0:33   ` Nicolas Pitre
@ 2008-02-11 19:56     ` Martin Koegler
  2008-02-11 20:41       ` Nicolas Pitre
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-11 19:56 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Sun, Feb 10, 2008 at 07:33:34PM -0500, Nicolas Pitre wrote:
> On Sun, 10 Feb 2008, Junio C Hamano wrote:
> 
> > mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:
> > 
> > > This patch adds a cache to keep the object data in memory. The delta
> > > resolving code must also search in the cache.
> > 
> > I have to wonder what the memory pressure in real-life usage
> > will be like.

> FWIW, I don't like this idea.
>
> I'm struggling to find ways to improve performances of 
> pack-objects/index-pack with those large repositories that are becoming 
> more common (i.e. GCC, OOO, Mozilla, etc.)  Anything that increase 
> memory usage isn't very welcome IMHO.

Maybe I have missed something, but all repack problems reported on the
git mailing list happen durring the deltifing phase. The problematic
files are mostly bigger blobs. I'm aware of these problems, so my
patch does not keep any blobs in memory.

As we are talking about memory, let's ignore unpack-objects, which is
used for small packs. Lets compare the memory usage of index-pack to
pack-objects:

If it is disabled (no --strict passed), only a (unused) pointer for
each object in the received pack file is additionally allocated.

On i386, struct object_entry is 84 bytes in pack-objects, but only 52
in index-pack. Both programs keep a struct object_entry for each
object during the runtime in memory. So in this case, index-pack uses
less memory than pack-objects

If the --strict option is passed, more memory is used:

* Again, we add one pointer to struct object_entry. object_entry is
  still smaller.(52<84 bytes).

* index-pack allocates a struct blob/tree/commit/tag for each object in the pack.

  pack-objects also allocates only struct object in the best case
  (reading from pack file), otherwise a struct
  blob/tree/commit/tag. This objects are kept during the runtime of
  pack-objects in memory.

  So depending of the parameters of pack-objects, index-pack uses
  additionally up to 24 bytes per object, but struct object_entry is 32
  bytes smaller.

* index-pack allocates a struct blob/tree/commit/tag for each link to a object outside the pack.

  I don't know the code of pack-objects enough to say something to
  this point.

* index-pack keeps the data for each tag/tree/commit in the pack in memory

  In the next version, I don't need to keep the tag/commit data in
  memory. Tree data could be reconstructed from the written pack,
  but I'm not sure, if the additional code (resolving deltas again),
  would justify the additional memory usage.

So my conclusion is, that the memory usage of index-pack with --strict
should not be too worse compared to pack-objects.

Please remember, that --strict is used for pushing data.

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-11 19:56     ` Martin Koegler
@ 2008-02-11 20:41       ` Nicolas Pitre
  2008-02-11 21:58         ` Martin Koegler
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolas Pitre @ 2008-02-11 20:41 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Junio C Hamano, git

On Mon, 11 Feb 2008, Martin Koegler wrote:

> On Sun, Feb 10, 2008 at 07:33:34PM -0500, Nicolas Pitre wrote:
> > On Sun, 10 Feb 2008, Junio C Hamano wrote:
> > 
> > > mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:
> > > 
> > > > This patch adds a cache to keep the object data in memory. The delta
> > > > resolving code must also search in the cache.
> > > 
> > > I have to wonder what the memory pressure in real-life usage
> > > will be like.
> 
> > FWIW, I don't like this idea.
> >
> > I'm struggling to find ways to improve performances of 
> > pack-objects/index-pack with those large repositories that are becoming 
> > more common (i.e. GCC, OOO, Mozilla, etc.)  Anything that increase 
> > memory usage isn't very welcome IMHO.
> 
> Maybe I have missed something, but all repack problems reported on the
> git mailing list happen durring the deltifing phase. The problematic
> files are mostly bigger blobs. I'm aware of these problems, so my
> patch does not keep any blobs in memory.

This is not only repack, which is something that few people should run 
anyway.

It is also index-pack which takes an awful amount of time on the OOO 
repo (much less than the repack but still), and that's something that is 
visible to anyone cloning it.

So the comparison with pack-objects is useless.  In absolute terms, 
index-pack has to inprove, not regress.

> So my conclusion is, that the memory usage of index-pack with --strict
> should not be too worse compared to pack-objects.

Well, the comparison is flawed anyway.  Agressively repacking the OOO 
repo is reported to take around 14 GB of RAM if not bounded, whereas 
index-pack remained around 300MB.  But the buffer cache managed by the 
kernel is also a big performance factor and anything that requires more 
memory in user space will affect that cache.

Then remember that this repo has 2411764 objects.

> Please remember, that --strict is used for pushing data.

And what exactly does it provide in terms of safetiness that index-pack 
doesn't already do?  I'm not sure I fully understood the goal here.

Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-11 20:41       ` Nicolas Pitre
@ 2008-02-11 21:58         ` Martin Koegler
  2008-02-12 16:02           ` Nicolas Pitre
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-11 21:58 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Mon, Feb 11, 2008 at 03:41:06PM -0500, Nicolas Pitre wrote:
> On Mon, 11 Feb 2008, Martin Koegler wrote:
> > Please remember, that --strict is used for pushing data.
> 
> And what exactly does it provide in terms of safetiness that index-pack 
> doesn't already do?  I'm not sure I fully understood the goal here.

The patch is about verifing the content of objects. My intented use is
for running a (restricted) server, which allows pushing via git-daemon
(or git-shell) from (untrusted) users. It can be also used to catch
corrupt objects, before they leave your repository, but this is for me
only a side effect.

receive-pack accepts any crap (anything, which you can pipe into
git-hash-object with -t parameter of your choice), if the pack file
format is correct.

But lots of code in git assums that the object content is welformd.

Having such objects somewhere reachable in your repository will
disturb cloning (In the best case you get a error messages, in the
worst a child process of upload-pack segfaults), gitweb, ... . To fix
it, you will need a person with native access to the repository in the
worst case. 

> > Maybe I have missed something, but all repack problems reported on the
> > git mailing list happen durring the deltifing phase. The problematic
> > files are mostly bigger blobs. I'm aware of these problems, so my
> > patch does not keep any blobs in memory.
> 
> This is not only repack, which is something that few people should run 
> anyway.
> 
> It is also index-pack which takes an awful amount of time on the OOO 
> repo (much less than the repack but still), and that's something that is 
> visible to anyone cloning it.
> 
> So the comparison with pack-objects is useless.  In absolute terms, 
> index-pack has to inprove, not regress.

The patch does not change the cloning/fetching behaviour:
@@ -18,6 +19,7 @@ struct object_entry
        unsigned int hdr_size;
        enum object_type type;
        enum object_type real_type;
+       struct object *obj;
 };

 union delta_base {
@@ -44,6 +49,7 @@ static int nr_deltas;
 static int nr_resolved_deltas;

 static int from_stdin;
+static int strict;
 static int verbose;

 static struct progress *progress;
@@ -325,8 +366,8 @@ static int find_delta_children(const union delta_base *base,
        return 0;
 }

-static void sha1_object(const void *data, unsigned long size,
-                       enum object_type type, unsigned char *sha1)
+static struct object* sha1_object(const void *data, unsigned long size,
+                                 enum object_type type, unsigned char *sha1)
 {
        hash_sha1_file(data, size, typename(type), sha1);
        if (has_sha1_file(sha1)) {
@@ -341,6 +382,29 @@ static void sha1_object(const void *data, unsigned long size,
                        die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
                free(has_data);
        }
+       if (strict) {
[..]
+       return NULL;
 }

 static void resolve_delta(struct object_entry *delta_obj, void *base_data,
@@ -361,7 +425,7 @@ static void resolve_delta(struct object_entry *delta_obj, void *base_data,
        free(delta_data);
        if (!result)
                bad_object(delta_obj->idx.offset, "failed to apply delta");
-       sha1_object(result, result_size, type, delta_obj->idx.sha1);
+       delta_obj->obj = sha1_object(result, result_size, type, delta_obj->idx.sha1);
        nr_resolved_deltas++;

        hashcpy(delta_base.sha1, delta_obj->idx.sha1);
@@ -420,7 +484,7 @@ static void parse_pack_objects(unsigned char *sha1)
                        delta->obj_no = i;
                        delta++;
                } else
-                       sha1_object(data, obj->size, obj->type, obj->idx.sha1);
+                       obj->obj = sha1_object(data, obj->size, obj->type, obj->idx.sha1);
                free(data);
                display_progress(progress, i+1);
        }
@@ -714,6 +778,8 @@ int main(int argc, char **argv)
                                from_stdin = 1;
                        } else if (!strcmp(arg, "--fix-thin")) {
                                fix_thin_pack = 1;
+                       } else if (!strcmp(arg, "--strict")) {
+                               strict = 1;
                        } else if (!strcmp(arg, "--keep")) {
                                keep_msg = "";
                        } else if (!prefixcmp(arg, "--keep=")) {
@@ -812,6 +878,8 @@ int main(int argc, char **argv)
                            nr_deltas - nr_resolved_deltas);
        }
        free(deltas);
+       if (strict)
+               check_objects();

        idx_objects = xmalloc((nr_objects) * sizeof(struct pack_idx_entry *));
        for (i = 0; i < nr_objects; i++)

Do you see any problem, if int strict is zero?

For pushing, you must distinguish two case:

* pushing some changes/new branches
  You will have a maximum of a few thousands of objects.

* pushing the whole content into an empty repository

  This is not an every day operation. 

  I implemented a config option to disable the checking. It can be
  used safe time/memory in such cases.
 
  To make this working for OOO repository, my patch will probable need
  more tuning.

> > So my conclusion is, that the memory usage of index-pack with --strict
> > should not be too worse compared to pack-objects.
> 
> Well, the comparison is flawed anyway.  Agressively repacking the OOO 
> repo is reported to take around 14 GB of RAM if not bounded, whereas 
> index-pack remained around 300MB.  But the buffer cache managed by the 
> kernel is also a big performance factor and anything that requires more 
> memory in user space will affect that cache.

The 14 GB is caused by the memory used in the deltifing phase. I
compared it to the counting object phase.

> Then remember that this repo has 2411764 objects.

How many blobs, tags, trees and commits? What is the uncompressed size
of all tags? What is the uncompressed size of all trees? What is the
uncompressed size of all commits?

This information would give me a clue, what bad effects my approach
would have.

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-11  0:00 ` Junio C Hamano
  2008-02-11  0:33   ` Nicolas Pitre
@ 2008-02-12  7:20   ` Martin Koegler
  1 sibling, 0 replies; 23+ messages in thread
From: Martin Koegler @ 2008-02-12  7:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, Feb 10, 2008 at 04:00:44PM -0800, Junio C Hamano wrote:
> mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:
> > * add --strict option to unpack-objects (patch 1,8,9)
> > * add --strict option to index-pack (patch 8,10)
> >
> >   Same as for unpack-objects, but without writting objects.
> >
> > * add config option for receive pack to enable checking (patch 11)
> 
> If this patch is any good, I strongly suspect it should not be
> just the default but should always be on.  IOW no config is
> necessary.  That would make the series a bit simpler, I guess.

Not very much. In patch 11, we would not need to parse a config
variable and could pass --strict option unconditional to
index-pack/unpack-objects.

For the moment, I would like to keep it as an option (enabled by
default). If somebody has a repository with a totally broken history,
he would not be able to push any more, if there is no way to disable
it.

> > From 76e86fe55345e633c910d6b8fe166e27c23c5aaf Mon Sep 17 00:00:00 2001
> > From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > Date: Fri, 8 Feb 2008 08:51:38 +0100
> > Subject: [PATCH 01/12] unpack-object: cache for non written objects
> >
> > Preventing objects with broken links entering the repository
> > means, that write of some objects must be delayed.
> >
> > This patch adds a cache to keep the object data in memory. The delta
> > resolving code must also search in the cache.
> 
> I have to wonder what the memory pressure in real-life usage
> will be like.
> 
> When an object is proven to be good, we should be able to free
> its buffer after writing it out, but would that be a good enough
> optimization we can make later on this code to keep its memory
> consumption manageable?

This code is only used by unpack-objects, which is used for small
packs. It only caches metadata (tree,commit,tag), no blobs. So the
memory usage should not be a problem.

> > diff --git a/fsck.c b/fsck.c
> > new file mode 100644
> > index 0000000..089f775
> > --- /dev/null
> > +++ b/fsck.c
> > @@ -0,0 +1,84 @@
> > +#include "cache.h"
> > +#include "object.h"
> > +#include "blob.h"
> > +#include "tree.h"
> > +#include "tree-walk.h"
> > +#include "commit.h"
> > +#include "tag.h"
> > +#include "fsck.h"
> > +
> > +static int fsck_walk_tree(struct tree* tree, fsck_walk_func walk, void* data)
> > +{
> > +	struct tree_desc desc;
> > +	struct name_entry entry;
> > +
> > +	if(parse_tree(tree))
> > +		return -1;
> 
> It's a bit hard to see how these new set of functions relate to
> the original code in this patch series, because you add the new
> things that are initially not used anywhere independently, start
> referring to them in a separate patch and then remove the old
> related functions that are now unused.  This style makes
> reviewing easier and harder at the same time...

Will try to restructure the patches.

> > From 80b22c3f2c3e13c207790a49646020c55b34bba7 Mon Sep 17 00:00:00 2001
> > From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > Date: Fri, 8 Feb 2008 09:01:50 +0100
> > Subject: [PATCH 03/12] fsck: move mark-reachable to fsck_walk
> >
> > Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > ---
> >  builtin-fsck.c |   34 ++++++++++++++++++++++++----------
> >  1 files changed, 24 insertions(+), 10 deletions(-)
> >  ...
> > +static int mark_object(struct object* obj, int type, void* data)
> > +{
> > +	if (!obj)
> > +		return 0;
> > +	if (obj->flags & REACHABLE)
> > +		return 0;
> > +	obj->flags |= REACHABLE;
> > +	if (!obj->parsed)
> > +		return 0;
> > +	return fsck_walk(obj, mark_object, data);
> > +}
> 
> Hmm.  The return value 0 means Ok and negative is error?  The
> reason we can say success if obj is NULL or it is not parsed yet
> is because...?

In mark_objects, I don't care for the results. Its should only
mark all reachable objects (without crashing).

> > @@ -326,8 +344,6 @@ static int fsck_tree(struct tree *item)
> >  		o_name = name;
> >  		o_sha1 = sha1;
> >  	}
> > -	free(item->buffer);
> > -	item->buffer = NULL;
> 
> Hmm.  The reason you still need the buffer after you checked the
> contents of the tree in the loop is because you haven't actually
> checked the referents are Ok.  But I do not see a corresponding
> free that releases this memory after you are actually done with
> the verification with fsck_walk() yet, so we leak this in the
> meantime?

The tree is traversed multiple times:
* to mark reachable objects
* to mark it's childs used
* to check for broken links

The last use of the tree content is, after all objects are already in
memory.

> > @@ -375,8 +391,6 @@ static int fsck_commit(struct commit *commit)
> >  	}
> >  	if (memcmp(buffer, "author ", 7))
> >  		return objerror(&commit->object, "invalid format - expected 'author' line");
> > -	free(commit->buffer);
> > -	commit->buffer = NULL;
> >  	if (!commit->tree)
> >  		return objerror(&commit->object, "could not load commit's tree %s", tree_sha1);
> >  	if (!commit->parents && show_root)
> 
> Likewise.

Will change.

> > From ce43251ef71962ff64fe138f1295c405ef6aaf65 Mon Sep 17 00:00:00 2001
> > From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > Date: Fri, 8 Feb 2008 09:04:08 +0100
> > Subject: [PATCH 04/12] fsck: move reachable object check to fsck_walk
> >
> > It handles NULL pointers in object references without crashing.
> >
> > Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > ---
> >  builtin-fsck.c |   49 +++++++++++++++++++++++++++++--------------------
> >  1 files changed, 29 insertions(+), 20 deletions(-)
> >
> > diff --git a/builtin-fsck.c b/builtin-fsck.c
> > index 49e96ff..2c1e10f 100644
> > --- a/builtin-fsck.c
> > +++ b/builtin-fsck.c
> > @@ -81,13 +81,39 @@ static int objwarning(struct object *obj, const char *err, ...)
> >  	return -1;
> >  }
> >  
> > +static int check_reachable_object_childs(struct object *obj, int type, void *data)
> > +{
> > +	struct object *parent = data;
> > +	if (!obj) {
> > +		printf("broken link from %7s %s\n",
> > +			   typename(parent->type), sha1_to_hex(parent->sha1));
> > +		printf("broken link from %7s %s\n",
> > +			   (type==OBJ_ANY?"unknown":typename(type)), "unknown");
> 
> Hmm?  I am not sure what this part is reporting...

If eg. the sha1 of a commit is stored as tree in a commit, you will
find a null pointer in struct commit->tree. Such things will hit this
check.

> > From ee11f771be1ef1c29725cb56ab3eb8dfe61ca25a Mon Sep 17 00:00:00 2001
> > From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > Date: Fri, 8 Feb 2008 09:07:33 +0100
> > Subject: [PATCH 06/12] create a common object checker code out of fsck
> >
> > The function provides a callback for reporting errors.
> 
> The same "add unused new stuff independently, later use it and
> then finally remove now unused old stuff" pattern is here.  I am
> neutral to that patch style but it is a bit harder to see what
> is going on.
> 
> Most of the changes seem to be straight and sane copy-and-paste though.

I'll merge this patch with the next patch.

> > From a8db4e754e717bac0b2462333d4145eac3452099 Mon Sep 17 00:00:00 2001
> > From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> > Date: Fri, 8 Feb 2008 09:14:14 +0100
> > Subject: [PATCH 09/12] unpack-objects: prevent writing of inconsistent objects
> >
> > This patch introduces a strict mode, which ensures that:
> > - no malformed object will be written
> > - no object with broken links will be written
> >
> > The patch ensures this by delaying the write of all non blob object.
> > These object are written, after all objects they link to are written.
> >
> > An error can only result in unreferenced objects.
> 
> > diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
> > index f18c7e8..3e906e4 100644
> > --- a/builtin-unpack-objects.c
> > +++ b/builtin-unpack-objects.c
> > @@ -173,7 +250,6 @@ static void resolve_delta(unsigned nr, enum object_type type,
> >  		die("failed to apply delta");
> >  	free(delta);
> >  	write_object(nr, type, result, result_size);
> > -	free(result);
> >  }
> 
> And this is freed later elsewhere?

The free is the task of write_object. If (!strict), then it is freeed
immediatly. Similar for blobs. Other objects are put in the cache (see
patch 1).

> > @@ -203,7 +279,8 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
> >  
> >  	if (!dry_run && buf)
> >  		write_object(nr, type, buf, size);
> > -	free(buf);
> > +	else if (buf)
> > +		free(buf);
> >  }
> 
> You can always free NULL without checking.

Will fix.

> > @@ -356,6 +434,7 @@ static void unpack_all(void)
> >  	if (!quiet)
> >  		progress = start_progress("Unpacking objects", nr_objects);
> >  	obj_list = xmalloc(nr_objects * sizeof(*obj_list));
> > +	memset(obj_list, 0, nr_objects * sizeof(*obj_list));
> 
> Hmm, is this a fix to the 'master' independent from all the rest
> of your patches, or a new requirement?

This is, because the new field in obj_list must be initialized with
zero.

mfg martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-11 21:58         ` Martin Koegler
@ 2008-02-12 16:02           ` Nicolas Pitre
  2008-02-12 19:04             ` Martin Koegler
  2008-02-13  7:42             ` [RFC Patch] Preventing corrupt objects from entering the repository Shawn O. Pearce
  0 siblings, 2 replies; 23+ messages in thread
From: Nicolas Pitre @ 2008-02-12 16:02 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Junio C Hamano, git

On Mon, 11 Feb 2008, Martin Koegler wrote:

> On Mon, Feb 11, 2008 at 03:41:06PM -0500, Nicolas Pitre wrote:
> > On Mon, 11 Feb 2008, Martin Koegler wrote:
> > > Please remember, that --strict is used for pushing data.
> > 
> > And what exactly does it provide in terms of safetiness that index-pack 
> > doesn't already do?  I'm not sure I fully understood the goal here.
> 
> The patch is about verifing the content of objects. My intented use is
> for running a (restricted) server, which allows pushing via git-daemon
> (or git-shell) from (untrusted) users. It can be also used to catch
> corrupt objects, before they leave your repository, but this is for me
> only a side effect.
> 
> receive-pack accepts any crap (anything, which you can pipe into
> git-hash-object with -t parameter of your choice), if the pack file
> format is correct.
> 
> But lots of code in git assums that the object content is welformd.
> 
> Having such objects somewhere reachable in your repository will
> disturb cloning (In the best case you get a error messages, in the
> worst a child process of upload-pack segfaults), gitweb, ... . To fix
> it, you will need a person with native access to the repository in the
> worst case. 

OK that makes sense.

I think this is a good idea to always have some sanity checks on any 
incoming objects so to make sure they're well formed and valid before 
giving them a SHA1 value, and bail out as soon as any error is found.  
From my understanding that's what your patch is doing, right? (sorry I 
can't find them in my mailbox anymore).  This can be done as objects are 
coming in just fine and requires no extra memory, and I would say this 
should be done unconditionally all the time.  After all, the Git 
coherency model is based on the SHA1 checksuming, and therefore it is a 
good idea to never validate any malformed objects with a SHA1.  So I'm 
all in favor of such validation always performed in index-pack and 
unpack-objects.

As to making sure those objects are well connected... well this is a 
technically different issue entirely, and I wonder if a special mode to 
fsck might not be a better solution.  For example, fsck could be made to 
validate object connectivity, starting from the new ref(s), and stopping 
object walking as soon as a reference to an object not included in the 
newly received pack is encountered.  This could be run from some hook to 
decide whether or not to update the new refs, and to delete the pack 
otherwise.

Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-12 16:02           ` Nicolas Pitre
@ 2008-02-12 19:04             ` Martin Koegler
  2008-02-12 20:22               ` Nicolas Pitre
  2008-02-13  7:42             ` [RFC Patch] Preventing corrupt objects from entering the repository Shawn O. Pearce
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-12 19:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Tue, Feb 12, 2008 at 11:02:06AM -0500, Nicolas Pitre wrote:
> I think this is a good idea to always have some sanity checks on any 
> incoming objects so to make sure they're well formed and valid before 
> giving them a SHA1 value, and bail out as soon as any error is found.  
> From my understanding that's what your patch is doing, right? (sorry I 
> can't find them in my mailbox anymore). 

Yes. (=>http://marc.info/?l=git&m=120266631524947&w=2)

>  This can be done as objects are 
> coming in just fine and requires no extra memory, and I would say this 
> should be done unconditionally all the time.  After all, the Git 
> coherency model is based on the SHA1 checksuming, and therefore it is a 
> good idea to never validate any malformed objects with a SHA1.  So I'm 
> all in favor of such validation always performed in index-pack and 
> unpack-objects.

We will need some additional memory for struct blob/tree/tag/commit
even for this check.

I'll start reworking my patches.

> As to making sure those objects are well connected... well this is a 
> technically different issue entirely, and I wonder if a special mode to 
> fsck might not be a better solution.  For example, fsck could be made to 
> validate object connectivity, starting from the new ref(s), and stopping 
> object walking as soon as a reference to an object not included in the 
> newly received pack is encountered.  This could be run from some hook to 
> decide whether or not to update the new refs, and to delete the pack 
> otherwise.

Do you really think, that this will need less memory? fsck loads first
all objects and then verifies their connections.

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-12 19:04             ` Martin Koegler
@ 2008-02-12 20:22               ` Nicolas Pitre
  2008-02-12 21:38                 ` Martin Koegler
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolas Pitre @ 2008-02-12 20:22 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Junio C Hamano, git

On Tue, 12 Feb 2008, Martin Koegler wrote:

> On Tue, Feb 12, 2008 at 11:02:06AM -0500, Nicolas Pitre wrote:
> > I think this is a good idea to always have some sanity checks on any 
> > incoming objects so to make sure they're well formed and valid before 
> > giving them a SHA1 value, and bail out as soon as any error is found.  
> > From my understanding that's what your patch is doing, right? (sorry I 
> > can't find them in my mailbox anymore). 
> 
> Yes. (=>http://marc.info/?l=git&m=120266631524947&w=2)
> 
> >  This can be done as objects are 
> > coming in just fine and requires no extra memory, and I would say this 
> > should be done unconditionally all the time.  After all, the Git 
> > coherency model is based on the SHA1 checksuming, and therefore it is a 
> > good idea to never validate any malformed objects with a SHA1.  So I'm 
> > all in favor of such validation always performed in index-pack and 
> > unpack-objects.
> 
> We will need some additional memory for struct blob/tree/tag/commit
> even for this check.

Why so?

Each received object is stored in memory when received, so why can't you 
simply validate it in place?  No need to keep trace of them afterward.

> > As to making sure those objects are well connected... well this is a 
> > technically different issue entirely, and I wonder if a special mode to 
> > fsck might not be a better solution.  For example, fsck could be made to 
> > validate object connectivity, starting from the new ref(s), and stopping 
> > object walking as soon as a reference to an object not included in the 
> > newly received pack is encountered.  This could be run from some hook to 
> > decide whether or not to update the new refs, and to delete the pack 
> > otherwise.
> 
> Do you really think, that this will need less memory? fsck loads first
> all objects and then verifies their connections.

Not all objects otherwise I wouldn't even be able to run it.

My point is that you can have fsck load only objects contained in the 
received pack (you can use the pack index to load them) and assume 
connectivity is good whenever an object in the pack reference an 
existing object outside of the pack.  At least this doesn't need to 
happen in parallel with pack indexing.


Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-12 20:22               ` Nicolas Pitre
@ 2008-02-12 21:38                 ` Martin Koegler
  2008-02-12 21:51                   ` Nicolas Pitre
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-12 21:38 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Tue, Feb 12, 2008 at 03:22:17PM -0500, Nicolas Pitre wrote:
> On Tue, 12 Feb 2008, Martin Koegler wrote:
> > We will need some additional memory for struct blob/tree/tag/commit
> > even for this check.
> 
> Why so?
> 
> Each received object is stored in memory when received, so why can't you 
> simply validate it in place?  No need to keep trace of them afterward.

Freeing the data is not my problem.

Many validations are in parse_XXX_buffer, which are also used by
fsck. This returns a struct commit/tree/tag/blob.

I have not found any code in git to free them. 

Same for pack-objects, e.g. add_objects_in_unpacked_packs allocates
many struct object via lookup_unknown_object. As far as I understand
the code, they are never freed, even if they are not needed later.

> > > As to making sure those objects are well connected... well this is a 
> > > technically different issue entirely, and I wonder if a special mode to 
> > > fsck might not be a better solution.  For example, fsck could be made to 
> > > validate object connectivity, starting from the new ref(s), and stopping 
> > > object walking as soon as a reference to an object not included in the 
> > > newly received pack is encountered.  This could be run from some hook to 
> > > decide whether or not to update the new refs, and to delete the pack 
> > > otherwise.
> > 
> > Do you really think, that this will need less memory? fsck loads first
> > all objects and then verifies their connections.
> 
> Not all objects otherwise I wouldn't even be able to run it.

It only loads all objects, it checks (eg. no objects from the pack file
by default). After loading a object, it frees the content of object,
but keeps the parsed information in memory for the connectivity check.

> My point is that you can have fsck load only objects contained in the 
> received pack (you can use the pack index to load them) and assume 
> connectivity is good whenever an object in the pack reference an 
> existing object outside of the pack.  At least this doesn't need to 
> happen in parallel with pack indexing.

So you propose the following:
1) run index-pack (write pack file + index in repository)
2) fsck pack file
3) update refs

This looks to be very prone to race conditions, if multiple
pushes run concurrently.

To be on the safe side, the checks must be finished, before a pack/object
becomes part of the repository. We must check the reachability before 
index-pack writes the index.

In my opinion, separating the code for the reachability check from
index-pack would complicate things. It would not safe memory, as the
memory would be used by two processes instead of one.

You don't want to increase the resource usage of index-pack. If
--strict is not passed, I implemented this. 

For --strict, it is not avoidable, that index-pack will use more
memory and cpu time.

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-12 21:38                 ` Martin Koegler
@ 2008-02-12 21:51                   ` Nicolas Pitre
  2008-02-13  6:20                     ` Shawn O. Pearce
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolas Pitre @ 2008-02-12 21:51 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Junio C Hamano, git

On Tue, 12 Feb 2008, Martin Koegler wrote:

> On Tue, Feb 12, 2008 at 03:22:17PM -0500, Nicolas Pitre wrote:
> > On Tue, 12 Feb 2008, Martin Koegler wrote:
> > > We will need some additional memory for struct blob/tree/tag/commit
> > > even for this check.
> > 
> > Why so?
> > 
> > Each received object is stored in memory when received, so why can't you 
> > simply validate it in place?  No need to keep trace of them afterward.
> 
> Freeing the data is not my problem.

Well, I would say it is now.  ;-)

> Many validations are in parse_XXX_buffer, which are also used by
> fsck. This returns a struct commit/tree/tag/blob.
> 
> I have not found any code in git to free them. 

Maybe we should add it then.  Especially in the usual case where we want 
incoming objects to be validated.

> Same for pack-objects, e.g. add_objects_in_unpacked_packs allocates
> many struct object via lookup_unknown_object. As far as I understand
> the code, they are never freed, even if they are not needed later.

Hmmm, that's bad.


Nicolas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-12 21:51                   ` Nicolas Pitre
@ 2008-02-13  6:20                     ` Shawn O. Pearce
  2008-02-13  7:39                       ` Martin Koegler
  0 siblings, 1 reply; 23+ messages in thread
From: Shawn O. Pearce @ 2008-02-13  6:20 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Martin Koegler, Junio C Hamano, git

Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 12 Feb 2008, Martin Koegler wrote:
> 
> > Many validations are in parse_XXX_buffer, which are also used by
> > fsck. This returns a struct commit/tree/tag/blob.
> > 
> > I have not found any code in git to free them. 

Yea, there isn't a way to free the struct commit/tree/tag/blob
currently.  Once allocated its inside the hash, cannot be removed,
and cannot be recycled.

A while back I had worked up a patch that allowed us to reset the
entire hash and recycle the all of the objects, but that isn't
really what we need here.

> Maybe we should add it then.  Especially in the usual case where we want 
> incoming objects to be validated.

Releasing a single object back shouldn't be too difficult, so long as
you release everything after each parse.  Remember parsing a commit
loads the parent list with struct commit*s.  Those are in the object
hash and a later parse_commit_buffer() may attempt to reuse those
if the same commit was a parent for more than one descendent.  :-\

Basically I think you want a smaller case of what I had done earlier
in my attempt to fix an error in git-fetch; release everything in
the object pool after each object has been parsed and validated.
We should only have a couple of objects allocated in the blocks
(one blob, one tree, one tag, handful of commits) and just reset
the blocks and hash between each.

> > Same for pack-objects, e.g. add_objects_in_unpacked_packs allocates
> > many struct object via lookup_unknown_object. As far as I understand
> > the code, they are never freed, even if they are not needed later.
> 
> Hmmm, that's bad.

You can't free them.  We're using their o->flags field to store
OBJECT_ADDED state so we don't pack the same object twice in our
output, even if it appears in multiple input packfiles.

This isn't a leak.  Its behaving as designed.  It is however
overallocating the object, because we are allocating an unknown
object rather than a specific object type.  We don't know what the
heck that SHA-1 is just from looking at the index.  If it is an
OBJ_{REF,OFS}_DELTA its too expensive to acquire the exact type
during this method.

Which makes me wonder, would it make sense to store the object type
code inside of the .idx?

There's a few places in Git (4 lookup_unknown_objects, 19 possible
sha1_object_infos) where we may be able to benefit from having fast
access to the base type for an object, without needing to walk its
delta chain to acquire that information.  It certainly would help
in this spot of pack-objects that Martin has pointed out.

But I really don't think it is worth the disk storage space
necessary.  We only need 3 bits, but there isn't a place to steal
them from in the index v2 file format.  So we're adding a full byte
per object.  Further we only make this lookup_unknown_object call
in 3 locations:

  http-push.c/walker.c:

    We're fetching objects, but we don't know what we are getting.
    I think these are only triggered during the loose object
    upload/download, which should be only a fraction of the total
    objects transferred, and are only for cases where the object
    isn't local so we can't check its type before allocation.

    No big deal.

  tree.c:

    This is inside track_tree_refs and its a file mode we do
    not recognize.  We spit out a warning, but try to lookup
    the unknown object anyway, even though we're looking at what
    should be a bogus tree.  This occurs only in parse_tree_buffer,
    only if track_objects_refs is on, and only if we see a tree
    that is actually not understood by this version of Git.  Hmm,
    shouldn't happen unless the tree itself is corrupt.

    This code looked fishy enough to me to dig up the blame history
    for it:

      commit e2ac7cb5fbcf1407003aa07cdcd14141527ea2e3
      Author: Sam Vilain <sam.vilain@catalyst.net.nz>
      Date: Wed Jun  6 06:25:17 2007

      Don't assume tree entries that are not dirs are blobs

      When scanning the trees in track_tree_refs() there is a "lazy" test
      that assumes that entries are either directories or files.  Don't do
      that.

    Sounds like it was around the time of S_ISGITLINK being
    introduced.  Looking at this code again I have to wonder, why
    the hell are we looking up and tracking an object inside of a
    tree when we don't understand the mode?

    Lets say a new form of S_ISGITLINK gets introduced in the future.
    By this I mean the mode says "hey, this SHA-1 isn't really in my
    object pool".  We're going to cram a dummy object into the refs
    table for this tree.  Why?  We don't do this for S_ISGITLINK.

    Lets say its a new object type (not tree/tag/commit/blob) but
    it is in our storage pool.  If this Git doesn't know the mode,
    it sure as heck won't know what the hell the loose object header
    (or pack header!) says about that object.

    One of the key places where you might expect tracking an unknown
    (but referenced by a tree) SHA-1 type would be in reachability,
    rev-list --objects, packfile generation.  But the process_tree()
    function in reachable.c doesn't have Sam's change above, so it
    will assume anything new looking is a blob.

    Oh, and what a rabbit hole I just fell down.  The only caller
    that seems to set "track_object_refs = 1" (and thus get into this
    odd lookup_unknown_object() call) is fsck.  Everyone else sets
    it to 0, including its default declaration.

    So we've got this nice baggage, and differing implementation,
    so fsck can be happy how?  We've also got a whole lot of apps
    setting "track_object_refs = 0", which is what it defaults to,
    unless you managed to run fsck first.  Hmmph.

    Is it just me or is track_object_refs perhaps outlived its
    usefulness?

  pack-objects:

    Here we are packing everything in an existing packfile, as we
    are doing something like a "git repack -A" and need to carry
    it all with us.

    We probably could be smarter than allocating an unknown object,
    but I'm not sure its worth it.  If you really are paying a
    high price here it means you are carrying a *huge* amount of
    unreachable garbage with you from packfile to packfile.

    Try running "git gc --prune" every few months.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-13  6:20                     ` Shawn O. Pearce
@ 2008-02-13  7:39                       ` Martin Koegler
  2008-02-14  9:00                         ` [RFC PATCH] Remove object-refs from fsck Shawn O. Pearce
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-13  7:39 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nicolas Pitre, Junio C Hamano, git

On Wed, Feb 13, 2008 at 01:20:15AM -0500, Shawn O. Pearce wrote:
>   tree.c:
> 
>     This is inside track_tree_refs and its a file mode we do
>     not recognize.  We spit out a warning, but try to lookup
>     the unknown object anyway, even though we're looking at what
>     should be a bogus tree.  This occurs only in parse_tree_buffer,
>     only if track_objects_refs is on, and only if we see a tree
>     that is actually not understood by this version of Git.  Hmm,
>     shouldn't happen unless the tree itself is corrupt.
> 
>     This code looked fishy enough to me to dig up the blame history
>     for it:
> 
>       commit e2ac7cb5fbcf1407003aa07cdcd14141527ea2e3
>       Author: Sam Vilain <sam.vilain@catalyst.net.nz>
>       Date: Wed Jun  6 06:25:17 2007
> 
>       Don't assume tree entries that are not dirs are blobs
> 
>       When scanning the trees in track_tree_refs() there is a "lazy" test
>       that assumes that entries are either directories or files.  Don't do
>       that.
> 
>     Sounds like it was around the time of S_ISGITLINK being
>     introduced.  Looking at this code again I have to wonder, why
>     the hell are we looking up and tracking an object inside of a
>     tree when we don't understand the mode?
> 
>     Lets say a new form of S_ISGITLINK gets introduced in the future.
>     By this I mean the mode says "hey, this SHA-1 isn't really in my
>     object pool".  We're going to cram a dummy object into the refs
>     table for this tree.  Why?  We don't do this for S_ISGITLINK.
> 
>     Lets say its a new object type (not tree/tag/commit/blob) but
>     it is in our storage pool.  If this Git doesn't know the mode,
>     it sure as heck won't know what the hell the loose object header
>     (or pack header!) says about that object.
> 
>     One of the key places where you might expect tracking an unknown
>     (but referenced by a tree) SHA-1 type would be in reachability,
>     rev-list --objects, packfile generation.  But the process_tree()
>     function in reachable.c doesn't have Sam's change above, so it
>     will assume anything new looking is a blob.
>     
>     Oh, and what a rabbit hole I just fell down.  The only caller
>     that seems to set "track_object_refs = 1" (and thus get into this
>     odd lookup_unknown_object() call) is fsck.  Everyone else sets
>     it to 0, including its default declaration.
> 
>     So we've got this nice baggage, and differing implementation,
>     so fsck can be happy how?  We've also got a whole lot of apps
>     setting "track_object_refs = 0", which is what it defaults to,
>     unless you managed to run fsck first.  Hmmph.
> 
>     Is it just me or is track_object_refs perhaps outlived its
>     usefulness?

See the patch at the start of this topic. I'm working on replacing
track-object-refs.

It is missing some frees and I copied the lookup_unknown_object call
(This should be replaced with an error message in my patch, as the the
verification of the tree content would reject such an invalid mode).

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-12 16:02           ` Nicolas Pitre
  2008-02-12 19:04             ` Martin Koegler
@ 2008-02-13  7:42             ` Shawn O. Pearce
  2008-02-13  8:11               ` Martin Koegler
  1 sibling, 1 reply; 23+ messages in thread
From: Shawn O. Pearce @ 2008-02-13  7:42 UTC (permalink / raw)
  To: Nicolas Pitre, Martin Koegler; +Cc: Junio C Hamano, git

Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 11 Feb 2008, Martin Koegler wrote:
> > 
> > But lots of code in git assums that the object content is welformd.
> > 
> > Having such objects somewhere reachable in your repository will
> > disturb cloning (In the best case you get a error messages, in the
> > worst a child process of upload-pack segfaults), gitweb, ... . To fix
> > it, you will need a person with native access to the repository in the
> > worst case. 

We have *waaaay* overthought this.  The solution is a lot simpler
than Martin's patches make it out to be, and we can do it without
changing our current memory footprint.

> I think this is a good idea to always have some sanity checks on any 
> incoming objects so to make sure they're well formed and valid before 
> giving them a SHA1 value, and bail out as soon as any error is found.  

When we get the raw data for an object so we can compute its SHA-1
and/or write its loose object to disk we should first verify its
content is sane, then do the SHA-1/store loose.

Blobs - always assume sane.  Remember that these make up the largest
percentage in terms of raw bytes of any project's packfiles and we
don't need to do any additional processing to them.  Yay us.

Tags - stack allocate a temporary struct tag and pass it's address
into parse_tag_buffer().  The only thing it tries to allocate is
the item->tagged.  Throw an extra argument into parse_tag_buffer()
to bypass this lookup block.  Now there's no extra memory used for
tags, just a bit more CPU.  Tags aren't frequent.

Commits - do like tags, but use parse_commit_buffer().  Avoid
looking up the tree object and the parents, and allocating the
parent references via a new argument.  Also check !item->date like
fsck does.  Other than that I think the validation code in fsck is
redundant with what parse_commit_buffer() is doing already.

Trees - just run init_tree_desc() and tree_entry() loop like in
track_tree_refs() (and a ton of other places) to iterate over the
buffer.  All we care about is sane entry names (no '/'), sane modes,
and correct sorting.

In short, we ignore reachability, but get fsck, and we can completely
avoid additional malloc activity as well as any sort of heap increase
in index-pack and unpack-objects.  Its not hard.

Its a small amount of refactoring for the parse functions we
already have, and maybe expose a function or two from builtin-fsck
(in particular a bulk of fsck_tree should be reused).

Yea, its a tad more CPU time, but it shouldn't be that costly here.
The huge cost in fsck is redoing the SHA-1 checksums, and inflating
the deltas.  We've already got the delta inflated, and we're about
to compute the SHA-1 checksum for the first time.  So those major
costs of fsck drop to 0.

> As to making sure those objects are well connected... well this is a 
> technically different issue entirely, and I wonder if a special mode to 
> fsck might not be a better solution.

Nah, just do what quickfetch does in builtin-fetch.c, but run it
in receive-pack, between unpack() and execute_commands():

	rev-list --quiet --objects $new... --not --all

If it aborts, reachability testing failed and the push is rejected
without updating any refs.  Yes your repository now has objects
that are missing things, but none of those are considered to be
reachable, so this isn't a big deal.  They will get cleaned up on
the next `gc --prune`, whenever that is.

In this configuration (--quiet) rev-list tries to be pretty low
on its memory usage, it doesn't save buffers, etc.  Further since
everything that is already considered reachable is not interesting,
we are only doing a walk over the objects that we just received,
not our entire ODB.  Its also after index-pack exited, so we just
freed up a good chunk of memory.

Rememeber we are talking about receive-pack here.  The cost on
the to perform the rev-list is lower than the cost will be to pack
these objects for distribution back to just one client.  Since this
is a server of some sorts (otherwise why did you push here?), odds
are its going to be doing a lot of packing requests for clients to
receive these newly uploaded objects by the native git protocol.
This new rev-list is nothing compared to that already existing load.
And if your OS is any good the just created .idx and .pack is still
in OS buffer cache, so there shouldn't be any additional disk IO.

Yes, we could make this optional in receive-pack, but really I don't
see a reason to.  Just run it.  The client shouldn't be giving us
unreachable crap.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-13  7:42             ` [RFC Patch] Preventing corrupt objects from entering the repository Shawn O. Pearce
@ 2008-02-13  8:11               ` Martin Koegler
  2008-02-13 12:01                 ` Johannes Schindelin
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-13  8:11 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nicolas Pitre, Junio C Hamano, git

On Wed, Feb 13, 2008 at 02:42:09AM -0500, Shawn O. Pearce wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > I think this is a good idea to always have some sanity checks on any 
> > incoming objects so to make sure they're well formed and valid before 
> > giving them a SHA1 value, and bail out as soon as any error is found.  
> 
> When we get the raw data for an object so we can compute its SHA-1
> and/or write its loose object to disk we should first verify its
> content is sane, then do the SHA-1/store loose.
> 
[...]
> > As to making sure those objects are well connected... well this is a 
> > technically different issue entirely, and I wonder if a special mode to 
> > fsck might not be a better solution.
> 
> Nah, just do what quickfetch does in builtin-fetch.c, but run it
> in receive-pack, between unpack() and execute_commands():
> 
> 	rev-list --quiet --objects $new... --not --all
> 
> If it aborts, reachability testing failed and the push is rejected
> without updating any refs.  Yes your repository now has objects
> that are missing things, but none of those are considered to be
> reachable, so this isn't a big deal.  They will get cleaned up on
> the next `gc --prune`, whenever that is.

This would mean, that we must make git-rev-list and git-pack-objects
not segfault on incorrect links between objects.

> In this configuration (--quiet) rev-list tries to be pretty low
> on its memory usage, it doesn't save buffers, etc.  Further since
> everything that is already considered reachable is not interesting,
> we are only doing a walk over the objects that we just received,
> not our entire ODB.  Its also after index-pack exited, so we just
> freed up a good chunk of memory.
> 
> Rememeber we are talking about receive-pack here.  The cost on
> the to perform the rev-list is lower than the cost will be to pack
> these objects for distribution back to just one client.  Since this
> is a server of some sorts (otherwise why did you push here?), odds
> are its going to be doing a lot of packing requests for clients to
> receive these newly uploaded objects by the native git protocol.
> This new rev-list is nothing compared to that already existing load.
> And if your OS is any good the just created .idx and .pack is still
> in OS buffer cache, so there shouldn't be any additional disk IO.
> 
> Yes, we could make this optional in receive-pack, but really I don't
> see a reason to.  Just run it.  The client shouldn't be giving us
> unreachable crap.

Looks sane to me.

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-13  8:11               ` Martin Koegler
@ 2008-02-13 12:01                 ` Johannes Schindelin
  2008-02-14  6:16                   ` Shawn O. Pearce
  2008-02-14 19:04                   ` Martin Koegler
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Schindelin @ 2008-02-13 12:01 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Shawn O. Pearce, Nicolas Pitre, Junio C Hamano, git

Hi,

On Wed, 13 Feb 2008, Martin Koegler wrote:

> This would mean, that we must make git-rev-list and git-pack-objects not 
> segfault on incorrect links between objects.

We should do that anyway.  It may error out, but segfaulting is no option.

So if you have a test case, please make it public so we can fix the 
breakage.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-13 12:01                 ` Johannes Schindelin
@ 2008-02-14  6:16                   ` Shawn O. Pearce
  2008-02-14 19:04                   ` Martin Koegler
  1 sibling, 0 replies; 23+ messages in thread
From: Shawn O. Pearce @ 2008-02-14  6:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Koegler, Nicolas Pitre, Junio C Hamano, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 13 Feb 2008, Martin Koegler wrote:
> 
> > This would mean, that we must make git-rev-list and git-pack-objects not 
> > segfault on incorrect links between objects.

My proposal suggested that malformed objects not be allowed
into the repository, so rev-list wouldn't see them in the
first place.
 
> We should do that anyway.  It may error out, but segfaulting is no option.
> 
> So if you have a test case, please make it public so we can fix the 
> breakage.

But Dscho has a really good point here.  rev-list should be failing
with a proper exit code, not SIGSEGV.  :)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH] Remove object-refs from fsck
  2008-02-13  7:39                       ` Martin Koegler
@ 2008-02-14  9:00                         ` Shawn O. Pearce
  2008-02-14 19:07                           ` Martin Koegler
  0 siblings, 1 reply; 23+ messages in thread
From: Shawn O. Pearce @ 2008-02-14  9:00 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Nicolas Pitre, Junio C Hamano, git

Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> On Wed, Feb 13, 2008 at 01:20:15AM -0500, Shawn O. Pearce wrote:
> >     Is it just me or is track_object_refs perhaps outlived its
> >     usefulness?
> 
> See the patch at the start of this topic. I'm working on replacing
> track-object-refs.
> 
> It is missing some frees and I copied the lookup_unknown_object call
> (This should be replaced with an error message in my patch, as the the
> verification of the tree content would reject such an invalid mode).

Here's my first attempt at removing track_object_refs.  At least
the diffstat suggets its in the correct direction.  Note!  It is
only very lightly tested, which is why this is not a properly
formatted patch.

I'm trying to reuse the code in reachable.c, which is what prune
is based on.  We now parse trees twice.  Once to read the content
and verify it is sane and later to perform the reachability walk.
We use less memory however as we no longer have object_refs.

Missing objects discovered during fsck get flagged on the referrer
by the new WANTING flag.  We go back later and reparse the referrer
to find out what links he had that were broken.  This 3rd parse of
a tree shouldn't be an issue in a perfectly valid repository.

In other news, fsck --full is a dog.

I think we can make it run faster for packed data by fsck'ing
according to the delta base, as index-pack does.  We also could get
better running time if we avoided the per-object unpack/checksum
that occurs in verify_pack() and instead do it as part of the fsck
delta base ordered loop.

Today `fsck --full` is actually SHA-1 checksumming every object in
a packfile twice: once in verify_pack() and again in parse_object().
Cute.


--8<--
 Makefile               |    2 +-
 builtin-fetch-pack.c   |    1 -
 builtin-fsck.c         |  199 ++++++++++++++++++++++++++++++------------------
 builtin-pack-objects.c |    1 -
 builtin-prune.c        |    2 +-
 builtin-reflog.c       |    2 +-
 builtin-rev-list.c     |    1 -
 commit.c               |   11 ---
 object-refs.c          |   87 ---------------------
 object.h               |   13 ---
 reachable.c            |   43 ++++++++--
 reachable.h            |    7 ++-
 tag.c                  |    6 --
 tree.c                 |   48 ------------
 upload-pack.c          |    1 -
 walker.c               |    1 -
 16 files changed, 168 insertions(+), 257 deletions(-)

diff --git a/Makefile b/Makefile
index 92341c4..8f930ed 100644
--- a/Makefile
+++ b/Makefile
@@ -312,7 +312,7 @@ LIB_OBJS = \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
 	sideband.o reachable.o reflog-walk.o \
-	quote.o read-cache.o refs.o run-command.o dir.o object-refs.o \
+	quote.o read-cache.o refs.o run-command.o dir.o \
 	server-info.o setup.o sha1_file.o sha1_name.o strbuf.o \
 	tag.o tree.o usage.o config.o environment.o ctype.o copy.o \
 	revision.o pager.o tree-walk.o xdiff-interface.o \
diff --git a/builtin-fetch-pack.c b/builtin-fetch-pack.c
index f401352..25f1915 100644
--- a/builtin-fetch-pack.c
+++ b/builtin-fetch-pack.c
@@ -386,7 +386,6 @@ static int everything_local(struct ref **refs, int nr_match, char **match)
 	int retval;
 	unsigned long cutoff = 0;
 
-	track_object_refs = 0;
 	save_commit_buffer = 0;
 
 	for (ref = *refs; ref; ref = ref->next) {
diff --git a/builtin-fsck.c b/builtin-fsck.c
index cc7524b..50c358b 100644
--- a/builtin-fsck.c
+++ b/builtin-fsck.c
@@ -9,9 +9,12 @@
 #include "cache-tree.h"
 #include "tree-walk.h"
 #include "parse-options.h"
+#include "diff.h"
+#include "revision.h"
+#include "reachable.h"
 
-#define REACHABLE 0x0001
-#define SEEN      0x0002
+#define FSCKED    (1u<<17)
+#define WANTING   (1u<<18)
 
 static int show_root;
 static int show_tags;
@@ -63,13 +66,81 @@ static int objwarning(struct object *obj, const char *err, ...)
 	return -1;
 }
 
+static int object_exists(const unsigned char *sha1)
+{
+	struct object *obj = lookup_object(sha1);
+	if (obj && obj->flags & FSCKED)
+		return 1;
+	return has_sha1_file(sha1);
+}
+
+static void broken_link(struct object *obj,
+		enum object_type ref_type,
+		const unsigned char *ref_sha1)
+{
+	printf("broken link from %7s %s\n",
+	       typename(obj->type), sha1_to_hex(obj->sha1));
+	printf("              to %7s %s\n",
+	       typename(ref_type), sha1_to_hex(ref_sha1));
+	errors_found |= ERROR_REACHABLE;
+}
+
+static void broken_tree(struct tree *item)
+{
+	struct tree_desc desc;
+
+	item->object.parsed = 0;
+	if (parse_tree(item) < 0)
+		return; /* error already displayed */
+
+	init_tree_desc(&desc, item->buffer, item->size);
+	while (desc.size) {
+		unsigned mode;
+		const char *name;
+		const unsigned char *sha1;
+
+		sha1 = tree_entry_extract(&desc, &name, &mode);
+		update_tree_entry(&desc);
+		if (S_ISGITLINK(mode) || object_exists(sha1))
+			continue;
+		if (S_ISDIR(mode))
+			broken_link(&item->object, OBJ_TREE, sha1);
+		else
+			broken_link(&item->object, OBJ_BLOB, sha1);
+	}
+	free(item->buffer);
+	item->buffer = NULL;
+}
+
+static void broken_commit(struct commit *commit)
+{
+	struct tree *tree = commit->tree;
+	struct commit_list *pp;
+
+	if (tree && !object_exists(tree->object.sha1))
+		broken_link(&commit->object, OBJ_TREE, tree->object.sha1);
+
+	pp = commit->parents;
+	while (pp) {
+		struct commit *p = pp->item;
+		if (p && !object_exists(p->object.sha1))
+			broken_link(&commit->object, OBJ_COMMIT, p->object.sha1);
+		pp = pp->next;
+	}
+}
+
+static void broken_tag(struct tag *tag)
+{
+	struct object *ref = tag->tagged;
+	if (ref && !object_exists(ref->sha1))
+		broken_link(&tag->object, ref->type, ref->sha1);
+}
+
 /*
  * Check a single reachable object
  */
 static void check_reachable_object(struct object *obj)
 {
-	const struct object_refs *refs;
-
 	/*
 	 * We obviously want the object to be parsed,
 	 * except if it was in a pack-file and we didn't
@@ -84,21 +155,28 @@ static void check_reachable_object(struct object *obj)
 	}
 
 	/*
-	 * Check that everything that we try to reference is also good.
+	 * If the fsck routines found this object WANTING
+	 * then we need to re-evaluate what it wanted and
+	 * report on those broken links.
 	 */
-	refs = lookup_object_refs(obj);
-	if (refs) {
-		unsigned j;
-		for (j = 0; j < refs->count; j++) {
-			struct object *ref = refs->ref[j];
-			if (ref->parsed ||
-			    (has_sha1_file(ref->sha1)))
-				continue;
-			printf("broken link from %7s %s\n",
-			       typename(obj->type), sha1_to_hex(obj->sha1));
-			printf("              to %7s %s\n",
-			       typename(ref->type), sha1_to_hex(ref->sha1));
-			errors_found |= ERROR_REACHABLE;
+	if (obj->flags & WANTING) {
+		switch (obj->type) {
+		case OBJ_TREE:
+			broken_tree((struct tree *) obj);
+			break;
+
+		case OBJ_COMMIT:
+			broken_commit((struct commit *) obj);
+			break;
+
+		case OBJ_TAG:
+			broken_tag((struct tag *) obj);
+			break;
+
+		default:
+			objerror(obj,
+				"type '%d' has broken link (internal fsck error)",
+				obj->type);
 		}
 	}
 }
@@ -181,7 +259,7 @@ static void check_object(struct object *obj)
 	if (verbose)
 		fprintf(stderr, "Checking %s\n", sha1_to_hex(obj->sha1));
 
-	if (obj->flags & REACHABLE)
+	if (obj->flags & SEEN)
 		check_reachable_object(obj);
 	else
 		check_unreachable_object(obj);
@@ -309,6 +387,15 @@ static int fsck_tree(struct tree *item)
 			has_bad_modes = 1;
 		}
 
+		/*
+		 * If the SHA-1 should exist in this repository but
+		 * we are missing it flag this tree as WANTING.  We
+		 * will revisit this error later once we determine
+		 * this tree is SEEN.
+		 */
+		if (!S_ISGITLINK(mode) && !object_exists(sha1))
+			item->object.flags |= WANTING;
+
 		if (o_name) {
 			switch (verify_ordered(o_mode, o_name, mode, name)) {
 			case TREE_UNORDERED:
@@ -367,10 +454,15 @@ static int fsck_commit(struct commit *commit)
 		return objerror(&commit->object, "invalid format - expected 'tree' line");
 	if (get_sha1_hex(buffer+5, tree_sha1) || buffer[45] != '\n')
 		return objerror(&commit->object, "invalid 'tree' line format - bad sha1");
+	if (!object_exists(tree_sha1))
+		commit->object.flags |= WANTING;
+
 	buffer += 46;
 	while (!memcmp(buffer, "parent ", 7)) {
 		if (get_sha1_hex(buffer+7, sha1) || buffer[47] != '\n')
 			return objerror(&commit->object, "invalid 'parent' line format - bad sha1");
+		if (!object_exists(sha1))
+			commit->object.flags |= WANTING;
 		buffer += 48;
 	}
 	if (memcmp(buffer, "author ", 7))
@@ -395,6 +487,9 @@ static int fsck_tag(struct tag *tag)
 	if (!tagged) {
 		return objerror(&tag->object, "could not load tagged object");
 	}
+	if (!object_exists(tagged->sha1))
+		tag->object.flags |= WANTING;
+
 	if (!show_tags)
 		return 0;
 
@@ -411,9 +506,9 @@ static int fsck_sha1(const unsigned char *sha1)
 		return error("%s: object corrupt or missing",
 			     sha1_to_hex(sha1));
 	}
-	if (obj->flags & SEEN)
+	if (obj->flags & FSCKED)
 		return 0;
-	obj->flags |= SEEN;
+	obj->flags |= FSCKED;
 	if (obj->type == OBJ_BLOB)
 		return 0;
 	if (obj->type == OBJ_TREE)
@@ -524,37 +619,6 @@ static void fsck_dir(int i, char *path)
 
 static int default_refs;
 
-static int fsck_handle_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
-		const char *email, unsigned long timestamp, int tz,
-		const char *message, void *cb_data)
-{
-	struct object *obj;
-
-	if (verbose)
-		fprintf(stderr, "Checking reflog %s->%s\n",
-			sha1_to_hex(osha1), sha1_to_hex(nsha1));
-
-	if (!is_null_sha1(osha1)) {
-		obj = lookup_object(osha1);
-		if (obj) {
-			obj->used = 1;
-			mark_reachable(obj, REACHABLE);
-		}
-	}
-	obj = lookup_object(nsha1);
-	if (obj) {
-		obj->used = 1;
-		mark_reachable(obj, REACHABLE);
-	}
-	return 0;
-}
-
-static int fsck_handle_reflog(const char *logname, const unsigned char *sha1, int flag, void *cb_data)
-{
-	for_each_reflog_ent(logname, fsck_handle_reflog_ent, NULL);
-	return 0;
-}
-
 static int is_branch(const char *refname)
 {
 	return !strcmp(refname, "HEAD") || !prefixcmp(refname, "refs/heads/");
@@ -574,7 +638,6 @@ static int fsck_handle_ref(const char *refname, const unsigned char *sha1, int f
 		error("%s: not a commit", refname);
 	default_refs++;
 	obj->used = 1;
-	mark_reachable(obj, REACHABLE);
 
 	return 0;
 }
@@ -582,8 +645,6 @@ static int fsck_handle_ref(const char *refname, const unsigned char *sha1, int f
 static void get_default_heads(void)
 {
 	for_each_ref(fsck_handle_ref, NULL);
-	if (include_reflogs)
-		for_each_reflog(fsck_handle_reflog, NULL);
 
 	/*
 	 * Not having any default heads isn't really fatal, but
@@ -660,7 +721,6 @@ static int fsck_cache_tree(struct cache_tree *it)
 			      sha1_to_hex(it->sha1));
 			return 1;
 		}
-		mark_reachable(obj, REACHABLE);
 		obj->used = 1;
 		if (obj->type != OBJ_TREE)
 			err |= objerror(obj, "non-tree in cache-tree");
@@ -692,8 +752,8 @@ static struct option fsck_opts[] = {
 int cmd_fsck(int argc, const char **argv, const char *prefix)
 {
 	int i, heads;
+	struct rev_info revs;
 
-	track_object_refs = 1;
 	errors_found = 0;
 
 	argc = parse_options(argc, argv, fsck_opts, fsck_usage, 0);
@@ -731,6 +791,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	}
 
 	heads = 0;
+	init_revisions(&revs, "");
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
 		if (!get_sha1(arg, head_sha1)) {
@@ -740,8 +801,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			if (!obj)
 				continue;
 
-			obj->used = 1;
-			mark_reachable(obj, REACHABLE);
+			add_pending_object(&revs, obj, arg);
 			heads++;
 			continue;
 		}
@@ -760,25 +820,16 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
 	if (keep_cache_objects) {
 		read_cache();
-		for (i = 0; i < active_nr; i++) {
-			unsigned int mode;
-			struct blob *blob;
-			struct object *obj;
-
-			mode = active_cache[i]->ce_mode;
-			if (S_ISGITLINK(mode))
-				continue;
-			blob = lookup_blob(active_cache[i]->sha1);
-			if (!blob)
-				continue;
-			obj = &blob->object;
-			obj->used = 1;
-			mark_reachable(obj, REACHABLE);
-		}
 		if (active_cache_tree)
 			fsck_cache_tree(active_cache_tree);
 	}
 
+	ignore_reachability_errors = 1;
+	mark_reachable_objects(
+		&revs,
+		!heads,
+		!heads && include_reflogs,
+		keep_cache_objects);
 	check_connectivity();
 	return errors_found;
 }
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 692a761..2aaeb1a 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -2013,7 +2013,6 @@ static void get_object_list(int ac, const char **av)
 
 	init_revisions(&revs, NULL);
 	save_commit_buffer = 0;
-	track_object_refs = 0;
 	setup_revisions(ac, av, &revs, NULL);
 
 	while (fgets(line, sizeof(line), stdin) != NULL) {
diff --git a/builtin-prune.c b/builtin-prune.c
index bb8ead9..760e41d 100644
--- a/builtin-prune.c
+++ b/builtin-prune.c
@@ -147,7 +147,7 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 
 	save_commit_buffer = 0;
 	init_revisions(&revs, prefix);
-	mark_reachable_objects(&revs, 1);
+	mark_reachable_objects(&revs, 1, 1, 1);
 
 	prune_object_dir(get_object_directory());
 
diff --git a/builtin-reflog.c b/builtin-reflog.c
index 4836ec9..108ca7a 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -374,7 +374,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		init_revisions(&cb.revs, prefix);
 		if (cb.verbose)
 			printf("Marking reachable objects...");
-		mark_reachable_objects(&cb.revs, 0);
+		mark_reachable_objects(&cb.revs, 1, 0, 1);
 		if (cb.verbose)
 			putchar('\n');
 	}
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index de80158..14e86ce 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -605,7 +605,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 		usage(rev_list_usage);
 
 	save_commit_buffer = revs.verbose_header || revs.grep_filter;
-	track_object_refs = 0;
 	if (bisect_list)
 		revs.limited = 1;
 
diff --git a/commit.c b/commit.c
index 8b8fb04..df4d331 100644
--- a/commit.c
+++ b/commit.c
@@ -290,17 +290,6 @@ int parse_commit_buffer(struct commit *item, void *buffer, unsigned long size)
 	}
 	item->date = parse_commit_date(bufptr, tail);
 
-	if (track_object_refs) {
-		unsigned i = 0;
-		struct commit_list *p;
-		struct object_refs *refs = alloc_object_refs(n_refs);
-		if (item->tree)
-			refs->ref[i++] = &item->tree->object;
-		for (p = item->parents; p; p = p->next)
-			refs->ref[i++] = &p->item->object;
-		set_object_refs(&item->object, refs);
-	}
-
 	return 0;
 }
 
diff --git a/object-refs.c b/object-refs.c
deleted file mode 100644
index 5345671..0000000
--- a/object-refs.c
+++ /dev/null
@@ -1,87 +0,0 @@
-#include "cache.h"
-#include "object.h"
-#include "decorate.h"
-
-int track_object_refs = 0;
-
-static struct decoration ref_decorate;
-
-struct object_refs *lookup_object_refs(struct object *base)
-{
-	return lookup_decoration(&ref_decorate, base);
-}
-
-static void add_object_refs(struct object *obj, struct object_refs *refs)
-{
-	if (add_decoration(&ref_decorate, obj, refs))
-		die("object %s tried to add refs twice!", sha1_to_hex(obj->sha1));
-}
-
-struct object_refs *alloc_object_refs(unsigned count)
-{
-	struct object_refs *refs;
-	size_t size = sizeof(*refs) + count*sizeof(struct object *);
-
-	refs = xcalloc(1, size);
-	refs->count = count;
-	return refs;
-}
-
-static int compare_object_pointers(const void *a, const void *b)
-{
-	const struct object * const *pa = a;
-	const struct object * const *pb = b;
-	if (*pa == *pb)
-		return 0;
-	else if (*pa < *pb)
-		return -1;
-	else
-		return 1;
-}
-
-void set_object_refs(struct object *obj, struct object_refs *refs)
-{
-	unsigned int i, j;
-
-	/* Do not install empty list of references */
-	if (refs->count < 1) {
-		free(refs);
-		return;
-	}
-
-	/* Sort the list and filter out duplicates */
-	qsort(refs->ref, refs->count, sizeof(refs->ref[0]),
-	      compare_object_pointers);
-	for (i = j = 1; i < refs->count; i++) {
-		if (refs->ref[i] != refs->ref[i - 1])
-			refs->ref[j++] = refs->ref[i];
-	}
-	if (j < refs->count) {
-		/* Duplicates were found - reallocate list */
-		size_t size = sizeof(*refs) + j*sizeof(struct object *);
-		refs->count = j;
-		refs = xrealloc(refs, size);
-	}
-
-	for (i = 0; i < refs->count; i++)
-		refs->ref[i]->used = 1;
-	add_object_refs(obj, refs);
-}
-
-void mark_reachable(struct object *obj, unsigned int mask)
-{
-	const struct object_refs *refs;
-
-	if (!track_object_refs)
-		die("cannot do reachability with object refs turned off");
-	/* If we've been here already, don't bother */
-	if (obj->flags & mask)
-		return;
-	obj->flags |= mask;
-	refs = lookup_object_refs(obj);
-	if (refs) {
-		unsigned i;
-		for (i = 0; i < refs->count; i++)
-			mark_reachable(refs->ref[i], mask);
-	}
-}
diff --git a/object.h b/object.h
index 397bbfa..20f560f 100644
--- a/object.h
+++ b/object.h
@@ -6,11 +6,6 @@ struct object_list {
 	struct object_list *next;
 };
 
-struct object_refs {
-	unsigned count;
-	struct object *ref[FLEX_ARRAY]; /* more */
-};
-
 struct object_array {
 	unsigned int nr;
 	unsigned int alloc;
@@ -35,14 +30,11 @@ struct object {
 	unsigned char sha1[20];
 };
 
-extern int track_object_refs;
-
 extern const char *typename(unsigned int type);
 extern int type_from_string(const char *str);
 
 extern unsigned int get_max_object_index(void);
 extern struct object *get_indexed_object(unsigned int);
-extern struct object_refs *lookup_object_refs(struct object *);
 
 /** Internal only **/
 struct object *lookup_object(const unsigned char *sha1);
@@ -61,11 +53,6 @@ struct object *parse_object_buffer(const unsigned char *sha1, enum object_type t
 /** Returns the object, with potentially excess memory allocated. **/
 struct object *lookup_unknown_object(const unsigned  char *sha1);
 
-struct object_refs *alloc_object_refs(unsigned count);
-void set_object_refs(struct object *obj, struct object_refs *refs);
-
-void mark_reachable(struct object *obj, unsigned int mask);
-
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p);
 
diff --git a/reachable.c b/reachable.c
index 00f289f..0ded91c 100644
--- a/reachable.c
+++ b/reachable.c
@@ -8,6 +8,8 @@
 #include "reachable.h"
 #include "cache-tree.h"
 
+int ignore_reachability_errors;
+
 static void process_blob(struct blob *blob,
 			 struct object_array *p,
 			 struct name_path *path,
@@ -42,8 +44,20 @@ static void process_tree(struct tree *tree,
 	if (obj->flags & SEEN)
 		return;
 	obj->flags |= SEEN;
-	if (parse_tree(tree) < 0)
-		die("bad tree object %s", sha1_to_hex(obj->sha1));
+	if (!obj->parsed || !tree->buffer) {
+		enum object_type type;
+		unsigned long size;
+		tree->buffer = read_sha1_file(obj->sha1, &type, &size);
+		if (!tree->buffer || type != OBJ_TREE) {
+			if (!ignore_reachability_errors)
+				die("bad tree object %s", sha1_to_hex(obj->sha1));
+			free(tree->buffer);
+			tree->buffer = NULL;
+			return;
+		}
+		tree->size = size;
+		obj->parsed = 1;
+	}
 	name = xstrdup(name);
 	add_object(obj, p, path, name);
 	me.up = path;
@@ -134,9 +148,10 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
 	struct object *object = parse_object(sha1);
 	struct rev_info *revs = (struct rev_info *)cb_data;
 
-	if (!object)
+	if (object)
+		add_pending_object(revs, object, "");
+	else if (!ignore_reachability_errors)
 		die("bad object ref: %s:%s", path, sha1_to_hex(sha1));
-	add_pending_object(revs, object, "");
 
 	return 0;
 }
@@ -166,6 +181,7 @@ static void add_cache_tree(struct cache_tree *it, struct rev_info *revs)
 static void add_cache_refs(struct rev_info *revs)
 {
 	int i;
+	struct blob *obj;
 
 	read_cache();
 	for (i = 0; i < active_nr; i++) {
@@ -179,7 +195,10 @@ static void add_cache_refs(struct rev_info *revs)
 		if (S_ISGITLINK(active_cache[i]->ce_mode))
 			continue;
 
-		lookup_blob(active_cache[i]->sha1);
+		obj = lookup_blob(active_cache[i]->sha1);
+		if (obj)
+			obj->object.flags |= SEEN;
+
 		/*
 		 * We could add the blobs to the pending list, but quite
 		 * frankly, we don't care. Once we've looked them up, and
@@ -191,7 +210,11 @@ static void add_cache_refs(struct rev_info *revs)
 		add_cache_tree(active_cache_tree, revs);
 }
 
-void mark_reachable_objects(struct rev_info *revs, int mark_reflog)
+void mark_reachable_objects(
+		struct rev_info *revs,
+		int mark_refs,
+		int mark_reflog,
+		int mark_cache)
 {
 	/*
 	 * Set up revision parsing, and mark us as being interested
@@ -202,13 +225,15 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog)
 	revs->tree_objects = 1;
 
 	/* Add all refs from the index file */
-	add_cache_refs(revs);
+	if (mark_cache)
+		add_cache_refs(revs);
 
 	/* Add all external refs */
-	for_each_ref(add_one_ref, revs);
+	if (mark_refs)
+		for_each_ref(add_one_ref, revs);
 
 	/* Add all reflog info */
-	if (mark_reflog)
+	if (mark_refs && mark_reflog)
 		for_each_reflog(add_one_reflog, revs);
 
 	/*
diff --git a/reachable.h b/reachable.h
index 4075181..802d464 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,6 +1,11 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
-extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog);
+extern int ignore_reachability_errors;
+extern void mark_reachable_objects(
+		struct rev_info *revs,
+		int mark_refs,
+		int mark_reflog,
+		int mark_cache);
 
 #endif
diff --git a/tag.c b/tag.c
index 38bf913..d19f56d 100644
--- a/tag.c
+++ b/tag.c
@@ -84,12 +84,6 @@ int parse_tag_buffer(struct tag *item, void *data, unsigned long size)
 		item->tagged = NULL;
 	}
 
-	if (item->tagged && track_object_refs) {
-		struct object_refs *refs = alloc_object_refs(1);
-		refs->ref[0] = item->tagged;
-		set_object_refs(&item->object, refs);
-	}
-
 	return 0;
 }
 
diff --git a/tree.c b/tree.c
index 87708ef..4b1825c 100644
--- a/tree.c
+++ b/tree.c
@@ -202,52 +202,6 @@ struct tree *lookup_tree(const unsigned char *sha1)
 	return (struct tree *) obj;
 }
 
-/*
- * NOTE! Tree refs to external git repositories
- * (ie gitlinks) do not count as real references.
- *
- * You don't have to have those repositories
- * available at all, much less have the objects
- * accessible from the current repository.
- */
-static void track_tree_refs(struct tree *item)
-{
-	int n_refs = 0, i;
-	struct object_refs *refs;
-	struct tree_desc desc;
-	struct name_entry entry;
-
-	/* Count how many entries there are.. */
-	init_tree_desc(&desc, item->buffer, item->size);
-	while (tree_entry(&desc, &entry)) {
-		if (S_ISGITLINK(entry.mode))
-			continue;
-		n_refs++;
-	}
-
-	/* Allocate object refs and walk it again.. */
-	i = 0;
-	refs = alloc_object_refs(n_refs);
-	init_tree_desc(&desc, item->buffer, item->size);
-	while (tree_entry(&desc, &entry)) {
-		struct object *obj;
-
-		if (S_ISGITLINK(entry.mode))
-			continue;
-		if (S_ISDIR(entry.mode))
-			obj = &lookup_tree(entry.sha1)->object;
-		else if (S_ISREG(entry.mode) || S_ISLNK(entry.mode))
-			obj = &lookup_blob(entry.sha1)->object;
-		else {
-			warning("in tree %s: entry %s has bad mode %.6o\n",
-			     sha1_to_hex(item->object.sha1), entry.path, entry.mode);
-			obj = lookup_unknown_object(entry.sha1);
-		}
-		refs->ref[i++] = obj;
-	}
-	set_object_refs(&item->object, refs);
-}
-
 int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size)
 {
 	if (item->object.parsed)
@@ -256,8 +210,6 @@ int parse_tree_buffer(struct tree *item, void *buffer, unsigned long size)
 	item->buffer = buffer;
 	item->size = size;
 
-	if (track_object_refs)
-		track_tree_refs(item);
 	return 0;
 }
 
diff --git a/upload-pack.c b/upload-pack.c
index 51e3ec4..7c2f5e5 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -392,7 +392,6 @@ static int get_common_commits(void)
 	char hex[41], last_hex[41];
 	int len;
 
-	track_object_refs = 0;
 	save_commit_buffer = 0;
 
 	for(;;) {
diff --git a/walker.c b/walker.c
index adc3e80..c10eca8 100644
--- a/walker.c
+++ b/walker.c
@@ -256,7 +256,6 @@ int walker_fetch(struct walker *walker, int targets, char **target,
 	int i;
 
 	save_commit_buffer = 0;
-	track_object_refs = 0;
 
 	for (i = 0; i < targets; i++) {
 		if (!write_ref || !write_ref[i])

-- 
Shawn.

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-13 12:01                 ` Johannes Schindelin
  2008-02-14  6:16                   ` Shawn O. Pearce
@ 2008-02-14 19:04                   ` Martin Koegler
  2008-02-15  0:06                     ` Johannes Schindelin
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Koegler @ 2008-02-14 19:04 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn O. Pearce, Nicolas Pitre, Junio C Hamano, git

On Wed, Feb 13, 2008 at 12:01:43PM +0000, Johannes Schindelin wrote:
> On Wed, 13 Feb 2008, Martin Koegler wrote:
> > This would mean, that we must make git-rev-list and git-pack-objects not 
> > segfault on incorrect links between objects.
> 
> We should do that anyway.  It may error out, but segfaulting is no option.
> 
> So if you have a test case, please make it public so we can fix the 
> breakage.

OK, here is the first part of missing checks. 

Its still WIP:
* It should be checked, if reactions to errors are correct.
* Patch should be splitted and turned into submitable pieces

If anybody wants to do this, please go on.

>From 4a6908ff51059a5b2bcc8d8c41fd41de53ee9151 Mon Sep 17 00:00:00 2001
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Date: Thu, 14 Feb 2008 19:34:39 +0100
Subject: [PATCH] Missing checks (1)

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 commit.c    |   16 +++++++++++-----
 revision.c  |   12 ++++++++++++
 sha1_file.c |    3 ++-
 sha1_name.c |   11 ++++++++++-
 tag.c       |    5 ++++-
 5 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/commit.c b/commit.c
index 8b8fb04..53f0042 100644
--- a/commit.c
+++ b/commit.c
@@ -311,6 +311,8 @@ int parse_commit(struct commit *item)
 	unsigned long size;
 	int ret;
 
+	if (!item)
+		return -1;
 	if (item->object.parsed)
 		return 0;
 	buffer = read_sha1_file(item->object.sha1, &type, &size);
@@ -385,8 +387,7 @@ struct commit *pop_most_recent_commit(struct commit_list **list,
 
 	while (parents) {
 		struct commit *commit = parents->item;
-		parse_commit(commit);
-		if (!(commit->object.flags & mark)) {
+		if (!parse_commit(commit) && !(commit->object.flags & mark)) {
 			commit->object.flags |= mark;
 			insert_by_date(commit, list);
 		}
@@ -552,8 +553,10 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
 		 */
 		return commit_list_insert(one, &result);
 
-	parse_commit(one);
-	parse_commit(two);
+	if (parse_commit(one))
+		die("invalid commit");
+	if (parse_commit(two))
+		die("invalid commit");
 
 	one->object.flags |= PARENT1;
 	two->object.flags |= PARENT2;
@@ -584,9 +587,12 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
 		while (parents) {
 			struct commit *p = parents->item;
 			parents = parents->next;
+			if (!p)
+				die("invalid commit");
 			if ((p->object.flags & flags) == flags)
 				continue;
-			parse_commit(p);
+			if(parse_commit(p))
+				die("invalid commit");
 			p->object.flags |= flags;
 			insert_by_date(p, &list);
 		}
diff --git a/revision.c b/revision.c
index 6e85aaa..9f8723d 100644
--- a/revision.c
+++ b/revision.c
@@ -46,6 +46,8 @@ void add_object(struct object *obj,
 
 static void mark_blob_uninteresting(struct blob *blob)
 {
+	if (!blob)
+		return;
 	if (blob->object.flags & UNINTERESTING)
 		return;
 	blob->object.flags |= UNINTERESTING;
@@ -57,6 +59,8 @@ void mark_tree_uninteresting(struct tree *tree)
 	struct name_entry entry;
 	struct object *obj = &tree->object;
 
+	if (!obj)
+		return;
 	if (obj->flags & UNINTERESTING)
 		return;
 	obj->flags |= UNINTERESTING;
@@ -94,6 +98,8 @@ void mark_parents_uninteresting(struct commit *commit)
 
 	while (parents) {
 		struct commit *commit = parents->item;
+		if (!commit)
+			continue;
 		if (!(commit->object.flags & UNINTERESTING)) {
 			commit->object.flags |= UNINTERESTING;
 
@@ -173,6 +179,8 @@ static struct commit *handle_commit(struct rev_info *revs, struct object *object
 		struct tag *tag = (struct tag *) object;
 		if (revs->tag_objects && !(flags & UNINTERESTING))
 			add_pending_object(revs, object, tag->tag);
+		if (!tag->tagged)
+			die("bad tag");
 		object = parse_object(tag->tagged->sha1);
 		if (!object)
 			die("bad object %s", sha1_to_hex(tag->tagged->sha1));
@@ -685,12 +693,16 @@ static int add_parents_only(struct rev_info *revs, const char *arg, int flags)
 		it = get_reference(revs, arg, sha1, 0);
 		if (it->type != OBJ_TAG)
 			break;
+		if (!((struct tag*)it)->tagged)
+			return 0;
 		hashcpy(sha1, ((struct tag*)it)->tagged->sha1);
 	}
 	if (it->type != OBJ_COMMIT)
 		return 0;
 	commit = (struct commit *)it;
 	for (parents = commit->parents; parents; parents = parents->next) {
+		if (!parents->item)
+			continue;
 		it = &parents->item->object;
 		it->flags |= flags;
 		add_pending_object(revs, it, arg);
diff --git a/sha1_file.c b/sha1_file.c
index 4179949..c25ce64 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1943,7 +1943,8 @@ void *read_object_with_reference(const unsigned char *sha1,
 		}
 		ref_length = strlen(ref_type);
 
-		if (memcmp(buffer, ref_type, ref_length) ||
+		if (ref_length + 40 >= size ||
+		    memcmp(buffer, ref_type, ref_length) ||
 		    get_sha1_hex((char *) buffer + ref_length, actual_sha1)) {
 			free(buffer);
 			return NULL;
diff --git a/sha1_name.c b/sha1_name.c
index be8489e..bd0bce6 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -396,6 +396,8 @@ static int get_parent(const char *name, int len,
 	p = commit->parents;
 	while (p) {
 		if (!--idx) {
+			if (!p->item)
+				return -1;
 			hashcpy(result, p->item->object.sha1);
 			return 0;
 		}
@@ -417,6 +419,8 @@ static int get_nth_ancestor(const char *name, int len,
 
 		if (!commit || parse_commit(commit) || !commit->parents)
 			return -1;
+		if (!commit->parents->item)
+			return -1;
 		hashcpy(sha1, commit->parents->item->object.sha1);
 	}
 	hashcpy(result, sha1);
@@ -494,6 +498,8 @@ static int peel_onion(const char *name, int len, unsigned char *sha1)
 				return error("%.*s: expected %s type, but the object dereferences to %s type",
 					     len, name, typename(expected_type),
 					     typename(o->type));
+			if (!o)
+				return -1;
 			if (!o->parsed)
 				parse_object(o->sha1);
 		}
@@ -580,6 +586,8 @@ static int handle_one_ref(const char *path,
 		return 0;
 	if (object->type == OBJ_TAG)
 		object = deref_tag(object, path, strlen(path));
+	if (!object)
+		return 0;
 	if (object->type != OBJ_COMMIT)
 		return 0;
 	insert_by_date((struct commit *)object, list);
@@ -617,7 +625,8 @@ static int get_sha1_oneline(const char *prefix, unsigned char *sha1)
 		unsigned long size;
 
 		commit = pop_most_recent_commit(&list, ONELINE_SEEN);
-		parse_object(commit->object.sha1);
+		if(!parse_object(commit->object.sha1))
+			continue;
 		if (temp_commit_buffer)
 			free(temp_commit_buffer);
 		if (commit->buffer)
diff --git a/tag.c b/tag.c
index 38bf913..990134f 100644
--- a/tag.c
+++ b/tag.c
@@ -9,7 +9,10 @@ const char *tag_type = "tag";
 struct object *deref_tag(struct object *o, const char *warn, int warnlen)
 {
 	while (o && o->type == OBJ_TAG)
-		o = parse_object(((struct tag *)o)->tagged->sha1);
+		if (((struct tag *)o)->tagged)
+			o = parse_object(((struct tag *)o)->tagged->sha1);
+		else
+			o = NULL;
 	if (!o && warn) {
 		if (!warnlen)
 			warnlen = strlen(warn);
-- 
1.5.4.1.gb43a

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH] Remove object-refs from fsck
  2008-02-14  9:00                         ` [RFC PATCH] Remove object-refs from fsck Shawn O. Pearce
@ 2008-02-14 19:07                           ` Martin Koegler
  0 siblings, 0 replies; 23+ messages in thread
From: Martin Koegler @ 2008-02-14 19:07 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nicolas Pitre, Junio C Hamano, git

On Thu, Feb 14, 2008 at 04:00:13AM -0500, Shawn O. Pearce wrote:
>  
> +static int object_exists(const unsigned char *sha1)
> +{
> +	struct object *obj = lookup_object(sha1);
> +	if (obj && obj->flags & FSCKED)
> +		return 1;
> +	return has_sha1_file(sha1);
> +}
> +
> +static void broken_link(struct object *obj,
> +		enum object_type ref_type,
> +		const unsigned char *ref_sha1)
> +{
> +	printf("broken link from %7s %s\n",
> +	       typename(obj->type), sha1_to_hex(obj->sha1));
> +	printf("              to %7s %s\n",
> +	       typename(ref_type), sha1_to_hex(ref_sha1));
> +	errors_found |= ERROR_REACHABLE;
> +}
> +
> +static void broken_tree(struct tree *item)
> +{
> +	struct tree_desc desc;
> +
> +	item->object.parsed = 0;
> +	if (parse_tree(item) < 0)
> +		return; /* error already displayed */
> +
> +	init_tree_desc(&desc, item->buffer, item->size);
> +	while (desc.size) {
> +		unsigned mode;
> +		const char *name;
> +		const unsigned char *sha1;
> +
> +		sha1 = tree_entry_extract(&desc, &name, &mode);
> +		update_tree_entry(&desc);
> +		if (S_ISGITLINK(mode) || object_exists(sha1))
> +			continue;
> +		if (S_ISDIR(mode))
> +			broken_link(&item->object, OBJ_TREE, sha1);
> +		else
> +			broken_link(&item->object, OBJ_BLOB, sha1);
> +	}
> +	free(item->buffer);
> +	item->buffer = NULL;
> +}
> +
> +static void broken_commit(struct commit *commit)
> +{
> +	struct tree *tree = commit->tree;
> +	struct commit_list *pp;
> +
> +	if (tree && !object_exists(tree->object.sha1))
> +		broken_link(&commit->object, OBJ_TREE, tree->object.sha1);
> +
> +	pp = commit->parents;
> +	while (pp) {
> +		struct commit *p = pp->item;
> +		if (p && !object_exists(p->object.sha1))
> +			broken_link(&commit->object, OBJ_COMMIT, p->object.sha1);
> +		pp = pp->next;
> +	}
> +}
> +
> +static void broken_tag(struct tag *tag)
> +{
> +	struct object *ref = tag->tagged;
> +	if (ref && !object_exists(ref->sha1))
> +		broken_link(&tag->object, ref->type, ref->sha1);
> +}
> +
>  /*
>   * Check a single reachable object
>   */
>  static void check_reachable_object(struct object *obj)
>  {
> -	const struct object_refs *refs;
> -
>  	/*
>  	 * We obviously want the object to be parsed,
>  	 * except if it was in a pack-file and we didn't
> @@ -84,21 +155,28 @@ static void check_reachable_object(struct object *obj)
>  	}
>  
>  	/*
> -	 * Check that everything that we try to reference is also good.
> +	 * If the fsck routines found this object WANTING
> +	 * then we need to re-evaluate what it wanted and
> +	 * report on those broken links.
>  	 */
> -	refs = lookup_object_refs(obj);
> -	if (refs) {
> -		unsigned j;
> -		for (j = 0; j < refs->count; j++) {
> -			struct object *ref = refs->ref[j];
> -			if (ref->parsed ||
> -			    (has_sha1_file(ref->sha1)))
> -				continue;
> -			printf("broken link from %7s %s\n",
> -			       typename(obj->type), sha1_to_hex(obj->sha1));
> -			printf("              to %7s %s\n",
> -			       typename(ref->type), sha1_to_hex(ref->sha1));
> -			errors_found |= ERROR_REACHABLE;
> +	if (obj->flags & WANTING) {
> +		switch (obj->type) {
> +		case OBJ_TREE:
> +			broken_tree((struct tree *) obj);
> +			break;
> +
> +		case OBJ_COMMIT:
> +			broken_commit((struct commit *) obj);
> +			break;
> +
> +		case OBJ_TAG:
> +			broken_tag((struct tag *) obj);
> +			break;
> +
> +		default:
> +			objerror(obj,
> +				"type '%d' has broken link (internal fsck error)",
> +				obj->type);
>  		}
>  	}
>  }
> @@ -181,7 +259,7 @@ static void check_object(struct object *obj)
>  	if (verbose)
>  		fprintf(stderr, "Checking %s\n", sha1_to_hex(obj->sha1));
>  
> -	if (obj->flags & REACHABLE)
> +	if (obj->flags & SEEN)
>  		check_reachable_object(obj);
>  	else
>  		check_unreachable_object(obj);
> @@ -309,6 +387,15 @@ static int fsck_tree(struct tree *item)
>  			has_bad_modes = 1;
>  		}
>  
> +		/*
> +		 * If the SHA-1 should exist in this repository but
> +		 * we are missing it flag this tree as WANTING.  We
> +		 * will revisit this error later once we determine
> +		 * this tree is SEEN.
> +		 */
> +		if (!S_ISGITLINK(mode) && !object_exists(sha1))
> +			item->object.flags |= WANTING;
> +
>  		if (o_name) {
>  			switch (verify_ordered(o_mode, o_name, mode, name)) {
>  			case TREE_UNORDERED:
> @@ -367,10 +454,15 @@ static int fsck_commit(struct commit *commit)
>  		return objerror(&commit->object, "invalid format - expected 'tree' line");
>  	if (get_sha1_hex(buffer+5, tree_sha1) || buffer[45] != '\n')
>  		return objerror(&commit->object, "invalid 'tree' line format - bad sha1");
> +	if (!object_exists(tree_sha1))
> +		commit->object.flags |= WANTING;
> +

How do we verify, that tree_sha1 is a tree?


>  	buffer += 46;
>  	while (!memcmp(buffer, "parent ", 7)) {
>  		if (get_sha1_hex(buffer+7, sha1) || buffer[47] != '\n')
>  			return objerror(&commit->object, "invalid 'parent' line format - bad sha1");
> +		if (!object_exists(sha1))
> +			commit->object.flags |= WANTING;

How do we verify, that sha1 is a commit?

>  		buffer += 48;
>  	}
>  	if (memcmp(buffer, "author ", 7))
> @@ -395,6 +487,9 @@ static int fsck_tag(struct tag *tag)
>  	if (!tagged) {
>  		return objerror(&tag->object, "could not load tagged object");
>  	}
> +	if (!object_exists(tagged->sha1))
> +		tag->object.flags |= WANTING;
> +

Same here. Where do we check the type?

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-14 19:04                   ` Martin Koegler
@ 2008-02-15  0:06                     ` Johannes Schindelin
  2008-02-15  7:18                       ` Martin Koegler
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Schindelin @ 2008-02-15  0:06 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Shawn O. Pearce, Nicolas Pitre, Junio C Hamano, git

Hi,

On Thu, 14 Feb 2008, Martin Koegler wrote:

> diff --git a/commit.c b/commit.c
> index 8b8fb04..53f0042 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -311,6 +311,8 @@ int parse_commit(struct commit *item)
>  	unsigned long size;
>  	int ret;
>  
> +	if (!item)
> +		return -1;

Yeah, I just grepped for users of parse_commit(), and there are just too 
many, so this change makes sense.

> @@ -385,8 +387,7 @@ struct commit *pop_most_recent_commit(struct commit_list **list,
>  
>  	while (parents) {
>  		struct commit *commit = parents->item;
> -		parse_commit(commit);
> -		if (!(commit->object.flags & mark)) {
> +		if (!parse_commit(commit) && !(commit->object.flags & mark)) {

This makes sense, too.

> @@ -552,8 +553,10 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
>  		 */
>  		return commit_list_insert(one, &result);
>  
> -	parse_commit(one);
> -	parse_commit(two);
> +	if (parse_commit(one))
> +		die("invalid commit");
> +	if (parse_commit(two))
> +		die("invalid commit");

I'd rather have this return NULL after displaying an error.  After all, 
merge_bases() is sort of "libified".

BTW a simple "git grep parse_commit\( | grep -vwe if -e gitweb" shows 
this:

builtin-blame.c:                        parse_commit(commit);
builtin-checkout.c:     parse_commit(commit);
builtin-checkout.c:                     parse_commit(old->commit);
builtin-checkout.c:             parse_commit(new->commit);
builtin-checkout.c:                     parse_commit(new.commit);
builtin-describe.c:                     parse_commit(p);
builtin-describe.c:                     parse_commit(p);
builtin-fast-export.c:  parse_commit(commit);
builtin-fast-export.c:          parse_commit(commit->parents->item);
builtin-fetch-pack.c:                   parse_commit(commit);
builtin-fetch-pack.c:                           parse_commit(commit);
builtin-fetch-pack.c:                   parse_commit(commit);
builtin-name-rev.c:             parse_commit(commit);
builtin-show-branch.c:                          parse_commit(p);
builtin-show-branch.c:          parse_commit(commit);
commit.c:int parse_commit(struct commit *item)
commit.c:               parse_commit(commit);
commit.c:       parse_commit(one);
commit.c:       parse_commit(two);
commit.c:                       parse_commit(p);
commit.h:int parse_commit(struct commit *item);
contrib/gitview/gitview:                self.parse_commit(commit_lines)
contrib/gitview/gitview:        def parse_commit(self, commit_lines):
shallow.c:              parse_commit(commit);
upload-pack.c:                          parse_commit((struct commit *)object);

A few of those need checking, too, I think.

> @@ -584,9 +587,12 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
>  		while (parents) {
>  			struct commit *p = parents->item;
>  			parents = parents->next;
> +			if (!p)
> +				die("invalid commit");
>  			if ((p->object.flags & flags) == flags)
>  				continue;
> -			parse_commit(p);
> +			if(parse_commit(p))
> +				die("invalid commit");
>  			p->object.flags |= flags;
>  			insert_by_date(p, &list);
>  		}

Again, please do not die() in merge_bases().

> diff --git a/revision.c b/revision.c
> index 6e85aaa..9f8723d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -46,6 +46,8 @@ void add_object(struct object *obj,
>  
>  static void mark_blob_uninteresting(struct blob *blob)
>  {
> +	if (!blob)
> +		return;

IMHO not needed [*1*].  The first user of this (static) function calls it 
with lookup_blob(), which assumes that the blob has been read already.

The second uses a checked object.

> @@ -57,6 +59,8 @@ void mark_tree_uninteresting(struct tree *tree)
>  	struct name_entry entry;
>  	struct object *obj = &tree->object;
>  
> +	if (!obj)
> +		return;

Same here.

> @@ -94,6 +98,8 @@ void mark_parents_uninteresting(struct commit *commit)
>  
>  	while (parents) {
>  		struct commit *commit = parents->item;
> +		if (!commit)
> +			continue;

Can parents->item really be NULL?  I think not.  And indeed, a little 
search reveals that parse_commit_buffer() does this when it constructs the 
parents list, and encounters an invalid SHA-1:

return error("bad parents in commit %s", sha1_to_hex(item->object.sha1));

In case it is a valid SHA-1, but not a commit, it is ignored.

> @@ -173,6 +179,8 @@ static struct commit *handle_commit(struct rev_info *revs, struct object *object
>  		struct tag *tag = (struct tag *) object;
>  		if (revs->tag_objects && !(flags & UNINTERESTING))
>  			add_pending_object(revs, object, tag->tag);
> +		if (!tag->tagged)
> +			die("bad tag");

I haven't looked yet, but I suspect that this is as impossible as the 
invalid parent.

> @@ -685,12 +693,16 @@ static int add_parents_only(struct rev_info *revs, const char *arg, int flags)
>  		it = get_reference(revs, arg, sha1, 0);
>  		if (it->type != OBJ_TAG)
>  			break;
> +		if (!((struct tag*)it)->tagged)
> +			return 0;

See above.

>  		hashcpy(sha1, ((struct tag*)it)->tagged->sha1);
>  	}
>  	if (it->type != OBJ_COMMIT)
>  		return 0;
>  	commit = (struct commit *)it;
>  	for (parents = commit->parents; parents; parents = parents->next) {
> +		if (!parents->item)
> +			continue;

See above.

> diff --git a/sha1_file.c b/sha1_file.c
> index 4179949..c25ce64 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1943,7 +1943,8 @@ void *read_object_with_reference(const unsigned char *sha1,
>  		}
>  		ref_length = strlen(ref_type);
>  
> -		if (memcmp(buffer, ref_type, ref_length) ||
> +		if (ref_length + 40 >= size ||
> +		    memcmp(buffer, ref_type, ref_length) ||

Makes sense.

> diff --git a/sha1_name.c b/sha1_name.c
> index be8489e..bd0bce6 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -396,6 +396,8 @@ static int get_parent(const char *name, int len,
>  	p = commit->parents;
>  	while (p) {
>  		if (!--idx) {
> +			if (!p->item)
> +				return -1;

See above.

> @@ -417,6 +419,8 @@ static int get_nth_ancestor(const char *name, int len,
>  
>  		if (!commit || parse_commit(commit) || !commit->parents)
>  			return -1;
> +		if (!commit->parents->item)
> +			return -1;

See above.

> @@ -494,6 +498,8 @@ static int peel_onion(const char *name, int len, unsigned char *sha1)
>  				return error("%.*s: expected %s type, but the object dereferences to %s type",
>  					     len, name, typename(expected_type),
>  					     typename(o->type));
> +			if (!o)
> +				return -1;

You probably want to guard for ((tag *)o)->tagged == NULL; Okay, now I am 
curious. *megoesandlooks*  Indeed, if the tag refers to an unknown type, 
tagged = NULL.

But I think in that case, it should return an error().  And maybe only for 
type == OBJ_TAG.

> @@ -580,6 +586,8 @@ static int handle_one_ref(const char *path,
>  		return 0;
>  	if (object->type == OBJ_TAG)
>  		object = deref_tag(object, path, strlen(path));
> +	if (!object)
> +		return 0;

As above, it looks strange to see that object->something is checked, and 
_then_ object.  I'd put this into curly brackets, together with the 
deref_tag(), just to help the reader a bit, should she be as weak in mind 
as yours truly.

> @@ -617,7 +625,8 @@ static int get_sha1_oneline(const char *prefix, unsigned char *sha1)
>  		unsigned long size;
>  
>  		commit = pop_most_recent_commit(&list, ONELINE_SEEN);
> -		parse_object(commit->object.sha1);
> +		if(!parse_object(commit->object.sha1))
> +			continue;

Makes sense, but please add a space after the "if".

> diff --git a/tag.c b/tag.c
> index 38bf913..990134f 100644
> --- a/tag.c
> +++ b/tag.c
> @@ -9,7 +9,10 @@ const char *tag_type = "tag";
>  struct object *deref_tag(struct object *o, const char *warn, int warnlen)
>  {
>  	while (o && o->type == OBJ_TAG)
> -		o = parse_object(((struct tag *)o)->tagged->sha1);
> +		if (((struct tag *)o)->tagged)
> +			o = parse_object(((struct tag *)o)->tagged->sha1);
> +		else
> +			o = NULL;

Knowing that tagged _can_ be NULL, this makes tons of sense.  Except that 
again, I'd call error() to tell the user, before setting o = NULL.

Ciao,
Dscho

[*1*] There might be a few people arguing that defensive programming is 
never wrong.

Alas, I saw my share of defensive programming, and more often than not, 
the real bugs were hard to find in between all those checks.  And some 
checks were actively wrong -- again hard to see...

Remember: You can make code so simple that there are obviously no bugs. 
And the other way is to make it so complicated that there are no obvious 
bugs.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC Patch] Preventing corrupt objects from entering the repository
  2008-02-15  0:06                     ` Johannes Schindelin
@ 2008-02-15  7:18                       ` Martin Koegler
  0 siblings, 0 replies; 23+ messages in thread
From: Martin Koegler @ 2008-02-15  7:18 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn O. Pearce, Nicolas Pitre, Junio C Hamano, git

On Fri, Feb 15, 2008 at 12:06:49AM +0000, Johannes Schindelin wrote:
> On Thu, 14 Feb 2008, Martin Koegler wrote:
> > @@ -552,8 +553,10 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
> >  		 */
> >  		return commit_list_insert(one, &result);
> >  
> > -	parse_commit(one);
> > -	parse_commit(two);
> > +	if (parse_commit(one))
> > +		die("invalid commit");
> > +	if (parse_commit(two))
> > +		die("invalid commit");
> 
> I'd rather have this return NULL after displaying an error.  After all, 
> merge_bases() is sort of "libified".

Would all caller of get_merge_bases treat NULL as an error?

> BTW a simple "git grep parse_commit\( | grep -vwe if -e gitweb" shows 
> this:
> 
> builtin-blame.c:                        parse_commit(commit);
> builtin-checkout.c:     parse_commit(commit);
> builtin-checkout.c:                     parse_commit(old->commit);
> builtin-checkout.c:             parse_commit(new->commit);
> builtin-checkout.c:                     parse_commit(new.commit);
> builtin-describe.c:                     parse_commit(p);
> builtin-describe.c:                     parse_commit(p);
> builtin-fast-export.c:  parse_commit(commit);
> builtin-fast-export.c:          parse_commit(commit->parents->item);
> builtin-fetch-pack.c:                   parse_commit(commit);
> builtin-fetch-pack.c:                           parse_commit(commit);
> builtin-fetch-pack.c:                   parse_commit(commit);
> builtin-name-rev.c:             parse_commit(commit);
> builtin-show-branch.c:                          parse_commit(p);
> builtin-show-branch.c:          parse_commit(commit);
> commit.c:int parse_commit(struct commit *item)
> commit.c:               parse_commit(commit);
> commit.c:       parse_commit(one);
> commit.c:       parse_commit(two);
> commit.c:                       parse_commit(p);
> commit.h:int parse_commit(struct commit *item);
> contrib/gitview/gitview:                self.parse_commit(commit_lines)
> contrib/gitview/gitview:        def parse_commit(self, commit_lines):
> shallow.c:              parse_commit(commit);
> upload-pack.c:                          parse_commit((struct commit *)object);
> 
> A few of those need checking, too, I think.

Same for prepare_revision_walk. It can fail because an error in parse_XXX,
but no caller checks the result.

 
> > diff --git a/revision.c b/revision.c
> > index 6e85aaa..9f8723d 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -46,6 +46,8 @@ void add_object(struct object *obj,
> >  
> >  static void mark_blob_uninteresting(struct blob *blob)
> >  {
> > +	if (!blob)
> > +		return;
> 
> IMHO not needed [*1*].  The first user of this (static) function calls it 
> with lookup_blob(), which assumes that the blob has been read already.

What about calling handle_revision_arg with "^<tree-sha1>", where a
non blob object is stored under a blob mode?

handle_revision_arg will parse only the tree and put it on the pending
list. prepare_revision_walk will call mark_tree_uninteresting on it.
For any object with blob mode in the tree, it calls
mark_blob_uninteresting(lookup_blob(entry.sha1)); If entry.sha1 is the
sha1 of an already loaded non blob object, we need this check.

> [*1*] There might be a few people arguing that defensive programming is 
> never wrong.
> 
> Alas, I saw my share of defensive programming, and more often than not, 
> the real bugs were hard to find in between all those checks.  And some 
> checks were actively wrong -- again hard to see...
> 
> Remember: You can make code so simple that there are obviously no bugs. 
> And the other way is to make it so complicated that there are no obvious 
> bugs.

I would put rev_info in the second category. What can not be done with
it?

> > @@ -57,6 +59,8 @@ void mark_tree_uninteresting(struct tree *tree)
> >  	struct name_entry entry;
> >  	struct object *obj = &tree->object;
> >  
> > +	if (!obj)
> > +		return;
> 
> Same here.

Same argument (s/blob/tree/g).

> > @@ -94,6 +98,8 @@ void mark_parents_uninteresting(struct commit *commit)
> >  
> >  	while (parents) {
> >  		struct commit *commit = parents->item;
> > +		if (!commit)
> > +			continue;
> 
> Can parents->item really be NULL?  I think not.  And indeed, a little 
> search reveals that parse_commit_buffer() does this when it constructs the 
> parents list, and encounters an invalid SHA-1:
> 
> return error("bad parents in commit %s", sha1_to_hex(item->object.sha1));
> 
> In case it is a valid SHA-1, but not a commit, it is ignored.

Its unnecessary, I missed this check.

> > @@ -173,6 +179,8 @@ static struct commit *handle_commit(struct rev_info *revs, struct object *object
> >  		struct tag *tag = (struct tag *) object;
> >  		if (revs->tag_objects && !(flags & UNINTERESTING))
> >  			add_pending_object(revs, object, tag->tag);
> > +		if (!tag->tagged)
> > +			die("bad tag");
> 
> I haven't looked yet, but I suspect that this is as impossible as the 
> invalid parent.

Eg. tag pointing to an blob, but when calling parse_tag_buffer, a non
blob object is loaded under blob sha1.

But wouldn't it be simpler, if we would add
	if (!item->tagged)
		return -1;
in parse_tag_buffer?

> > diff --git a/sha1_file.c b/sha1_file.c
> > index 4179949..c25ce64 100644
> > --- a/sha1_file.c
> > +++ b/sha1_file.c
> > @@ -1943,7 +1943,8 @@ void *read_object_with_reference(const unsigned char *sha1,
> >  		}
> >  		ref_length = strlen(ref_type);
> >  
> > -		if (memcmp(buffer, ref_type, ref_length) ||
> > +		if (ref_length + 40 >= size ||
> > +		    memcmp(buffer, ref_type, ref_length) ||
> 
> Makes sense.
> 
> > @@ -494,6 +498,8 @@ static int peel_onion(const char *name, int len, unsigned char *sha1)
> >  				return error("%.*s: expected %s type, but the object dereferences to %s type",
> >  					     len, name, typename(expected_type),
> >  					     typename(o->type));
> > +			if (!o)
> > +				return -1;
> 
> You probably want to guard for ((tag *)o)->tagged == NULL; Okay, now I am 
> curious. *megoesandlooks*  Indeed, if the tag refers to an unknown type, 
> tagged = NULL.
> 
> But I think in that case, it should return an error().  And maybe only for 
> type == OBJ_TAG.
> 
> > @@ -580,6 +586,8 @@ static int handle_one_ref(const char *path,
> >  		return 0;
> >  	if (object->type == OBJ_TAG)
> >  		object = deref_tag(object, path, strlen(path));
> > +	if (!object)
> > +		return 0;
> 
> As above, it looks strange to see that object->something is checked, and 
> _then_ object.  I'd put this into curly brackets, together with the 
> deref_tag(), just to help the reader a bit, should she be as weak in mind 
> as yours truly.
> 
> > @@ -617,7 +625,8 @@ static int get_sha1_oneline(const char *prefix, unsigned char *sha1)
> >  		unsigned long size;
> >  
> >  		commit = pop_most_recent_commit(&list, ONELINE_SEEN);
> > -		parse_object(commit->object.sha1);
> > +		if(!parse_object(commit->object.sha1))
> > +			continue;
> 
> Makes sense, but please add a space after the "if".
> 
> > diff --git a/tag.c b/tag.c
> > index 38bf913..990134f 100644
> > --- a/tag.c
> > +++ b/tag.c
> > @@ -9,7 +9,10 @@ const char *tag_type = "tag";
> >  struct object *deref_tag(struct object *o, const char *warn, int warnlen)
> >  {
> >  	while (o && o->type == OBJ_TAG)
> > -		o = parse_object(((struct tag *)o)->tagged->sha1);
> > +		if (((struct tag *)o)->tagged)
> > +			o = parse_object(((struct tag *)o)->tagged->sha1);
> > +		else
> > +			o = NULL;
> 
> Knowing that tagged _can_ be NULL, this makes tons of sense.  Except that 
> again, I'd call error() to tell the user, before setting o = NULL.

mfg Martin Kögler

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2008-02-15  7:18 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-10 17:58 [RFC Patch] Preventing corrupt objects from entering the repository Martin Koegler
2008-02-11  0:00 ` Junio C Hamano
2008-02-11  0:33   ` Nicolas Pitre
2008-02-11 19:56     ` Martin Koegler
2008-02-11 20:41       ` Nicolas Pitre
2008-02-11 21:58         ` Martin Koegler
2008-02-12 16:02           ` Nicolas Pitre
2008-02-12 19:04             ` Martin Koegler
2008-02-12 20:22               ` Nicolas Pitre
2008-02-12 21:38                 ` Martin Koegler
2008-02-12 21:51                   ` Nicolas Pitre
2008-02-13  6:20                     ` Shawn O. Pearce
2008-02-13  7:39                       ` Martin Koegler
2008-02-14  9:00                         ` [RFC PATCH] Remove object-refs from fsck Shawn O. Pearce
2008-02-14 19:07                           ` Martin Koegler
2008-02-13  7:42             ` [RFC Patch] Preventing corrupt objects from entering the repository Shawn O. Pearce
2008-02-13  8:11               ` Martin Koegler
2008-02-13 12:01                 ` Johannes Schindelin
2008-02-14  6:16                   ` Shawn O. Pearce
2008-02-14 19:04                   ` Martin Koegler
2008-02-15  0:06                     ` Johannes Schindelin
2008-02-15  7:18                       ` Martin Koegler
2008-02-12  7:20   ` Martin Koegler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).