Git development


* Re: [JGIT RFC] How read versions of a specific object
From: Shawn O. Pearce @ 2009-01-07  4:04 UTC (permalink / raw)
  To: Imran M Yousuf; +Cc: Git Mailing List
In-Reply-To: <7bfdc29a0901061944x454a9t1d01e6744f08cf78@mail.gmail.com>

Imran M Yousuf <imyousuf@gmail.com> wrote:
> I am trying to read all versions, or the n-th version, of an object.
> Currently I do this with the following piece of code, which has to walk
> every commit present and build a set of the object's IDs from there.
> That is definitely expensive if the commit history is huge. Is there a
> faster/better way to achieve it?

Not really. You can more efficiently use JGit and reduce some of
the overheads, but that's about it.

> for (int i = 0; i < App.OBJECT_COUNT;
>             ++i) {
>             ObjectWalk objectWalk = new ObjectWalk(repo);

Don't use ObjectWalk, use a RevWalk.  You don't need it to keep
track of tree or blob identities.  The ObjectWalk code has more
overhead to do that bookkeeping.

>                     Commit revision = repo.mapCommit(revObject.getId());
>                     Tree versionTree = repo.mapTree(revision.getTreeId());
>                     if (versionTree.existsBlob(isbn)) {
>                         revisions.add(versionTree.findBlobMember(isbn).getId());

Use a TreeWalk to do this.  It's quicker because it doesn't
have to parse as much data to come up with the same result.

More specifically there's a static factory method that sets up for
a path limited walk and returns the TreeWalk pointing at that entry.

You can use the fact that RevWalk.next() returns a RevCommit to get
you the RevTree, which is the tree you need to give to the TreeWalk
constructor (it's the root level tree of the commit).

But if App.OBJECT_COUNT is quite large and covers most of your
objects, you are probably better off using a loop over the commits
and diff'ing against the ancestor:

	final HashMap<String, Set<ObjectId>> versions = ...;
	final RevWalk rw = new RevWalk(repo);
	final TreeWalk tw = new TreeWalk(repo);
	rw.markStart(rw.parseCommit(repo.resolve(Constants.HEAD)));
	tw.setFilter(TreeFilter.ANY_DIFF);

	RevCommit c;
	while ((c = rw.next()) != null) {
		final ObjectId[] p = new ObjectId[c.getParentCount() + 1];
		for (int i = 0; i < c.getParentCount(); i++) {
			rw.parse(c.getParent(i));
			p[i] = c.getParent(i).getTree();
		}
		final int me = p.length -1;
		p[me] = c.getTree();
		tw.reset(p);
		while (tw.next()) {
			if (tw.getFileMode(me).getObjectType() == Constants.OBJ_BLOB) {
				// This path was modified relative to the ancestor(s).
				//
				String s = tw.getPathString();
				Set<ObjectId> i = versions.get(s);
				if (i == null)
					versions.put(s, i = new HashSet<ObjectId>());
				i.add(tw.getObjectId(me));
			}

			if (tw.isSubtree()) {
				// make sure we recurse into modified directories
				tw.enterSubtree();
			}
		}
	}
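
For a quick cross-check of what a loop like the one above collects for a
single path, the command-line tools can list the same kind of blob IDs.
This is only a sketch under stated assumptions: the scratch repository,
the file names (book.txt, other.txt) and the commit messages are made up
for illustration, and it relies on rev-list's path limiting rather than
on JGit.

```shell
#!/bin/sh
set -e
# Hypothetical scratch repository, used only for this illustration.
cd "$(mktemp -d)"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git config commit.gpgsign false

# Three commits; only the first and the last touch book.txt.
echo one > book.txt
git add book.txt
git commit -q -m "v1"
echo unrelated > other.txt
git add other.txt
git commit -q -m "unrelated"
echo two > book.txt
git commit -q -a -m "v2"

# rev-list limited to a path yields the commits that changed it;
# "<commit>:<path>" names the blob stored at that path in that commit.
versions=$(git rev-list HEAD -- book.txt |
	while read -r c; do git rev-parse "$c:book.txt"; done | sort -u)
printf '%s\n' "$versions"
```

The "unrelated" commit is skipped by the path-limited walk, so the set
holds one blob ID per distinct content of book.txt.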

-- 
Shawn.

^ permalink raw reply

* [JGIT RFC] How read versions of a specific object
From: Imran M Yousuf @ 2009-01-07  3:44 UTC (permalink / raw)
  To: Git Mailing List

Hi,

I am trying to read all versions, or the n-th version, of an object.
Currently I do this with the following piece of code, which has to walk
every commit present and build a set of the object's IDs from there.
That is definitely expensive if the commit history is huge. Is there a
faster/better way to achieve it?

for (int i = 0; i < App.OBJECT_COUNT;
            ++i) {
            System.out.println("INDEX: " + i);
            String isbn =
                String.valueOf(Integer.parseInt(App.INIT_ID) + i);
            System.out.println("ISBN: " + isbn);
            ObjectWalk objectWalk = new ObjectWalk(repo);
            /*
             * Checks whether the Commit has the tree or not. It does not
             * check whether it has changed or not.
             */
            objectWalk.setTreeFilter(PathFilter.create(isbn));
            RevObject revObject = null;
            objectWalk.markStart(objectWalk.parseCommit(repo.resolve(
                Constants.HEAD)));
            Set<ObjectId> revisions =
                new HashSet<ObjectId>();
            do {
                if (revObject != null) {
                    Commit revision = repo.mapCommit(revObject.getId());
                    Tree versionTree = repo.mapTree(revision.getTreeId());
                    if (versionTree.existsBlob(isbn)) {
                        revisions.add(versionTree.findBlobMember(isbn).getId());
                    }
                }
                revObject = objectWalk.next();
            }
            while (revObject != null);
            System.out.println("Revisions: " + revisions);
        }

The detailed source code of the project is available @
http://github.com/imyousuf/jgit-usage/tree/master

Thank you,

-- 
Imran M Yousuf
Entrepreneur & Software Engineer
Smart IT Engineering
Dhaka, Bangladesh
Email: imran@smartitengineering.com
Blog: http://imyousuf-tech.blogs.smartitengineering.com/
Mobile: +880-1711402557

^ permalink raw reply

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: Nicolas Pitre @ 2009-01-07  3:21 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Jan Krüger, Git ML
In-Reply-To: <1231296475.8870.89.camel@starfruit>

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2009-01-06 at 21:09 -0500, Nicolas Pitre wrote:
> > > I've tarred one of the repositories that had it in a reproducible
> > > state
> > 
> > That is wonderful.
> > 
> > > so I can create a build and extract the tar and run against that to
> > > verify any patches anybody might have, but unfortunately at 7GB of
> > > company code and assets, I can't exactly share ;)
> > 
> > First step is to understand what is going on.  Only then could reliable 
> > patches be made.
> 
> If you want to point me in the right direction, I have a few hours to
> kill this evening and fscking around with gdb(1) and printf() just might
> be some of my favorite things</sarcasm> ;)

Heh.  ;-)

To start with, a simple log of what you need to do to reproduce the 
issue would be nice.  Just do

	script /tmp/foo

then reproduce the issue and exit, after which I'd be interested in the 
content of /tmp/foo.

Nicolas

^ permalink raw reply

* Re: [RFC PATCH] diff --no-index: test for pager after option parsing
From: Miklos Vajna @ 2009-01-07  3:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Rast, git
In-Reply-To: <7vfxjwf041.fsf@gitster.siamese.dyndns.org>

[-- Attachment #1: Type: text/plain, Size: 352 bytes --]

On Tue, Jan 06, 2009 at 04:09:18PM -0800, Junio C Hamano <gitster@pobox.com> wrote:
> But I wonder if it still makes a difference in real life.  Didn't we stop
> reporting the exit status from the pager some time ago?

I just wanted to write this, I think that code could be just removed
since ea27a18 (spawn pager via run_command interface, 2008-07-22).

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* [PATCH v2] parse-opt: migrate builtin-ls-files.
From: Miklos Vajna @ 2009-01-07  3:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Pierre Habouzit, git
In-Reply-To: <20090106102202.GA30766@artemis.corp>

Signed-off-by: Miklos Vajna <vmiklos@frugalware.org>
---

On Tue, Jan 06, 2009 at 11:22:02AM +0100, Pierre Habouzit <madcoder@debian.org> wrote:
> > +static int option_parse_no_empty(const struct option *opt,
> > +                            const char *arg, int unset)
> > +{
> > +   struct dir_struct *dir = opt->value;
> > +
> > +   dir->hide_empty_directories = 1;
> > +
> > +   return 0;
> > +}
>
> Should be option_parse_empty and deal with "unset" to know if `no-` was
> prefixed to it or not.
>
>
> > +           { OPTION_CALLBACK, 0, "no-empty-directory", &dir, NULL,
> > +                   "do not list empty directories",
>
> This should be "empty-directory" and "list empty directories as well"

Ah, sure.

> I've not checked if you could also check more of the "unsets" things in
> your callbacks as well btw, but it looks like it could.

Right, added to option_parse_ignored() and option_parse_directory() as
well.
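
From the user's side the option should still be spelled the same way,
since parse-options derives the negated form from the "no-" prefix and
hands it to the callback as "unset". A small sketch of the expected
behaviour, run against an installed git rather than this patch; the
scratch directory and the names empty/ and full/ are made up for
illustration:

```shell
#!/bin/sh
set -e
# Hypothetical scratch repository with one empty and one non-empty
# untracked directory.
cd "$(mktemp -d)"
git init -q
mkdir empty full
echo data > full/file.txt

# With --directory, untracked directories are collapsed to their names,
# empty ones included.
with_empty=$(git ls-files --others --directory)
# The negated form hides directories that contain nothing.
without_empty=$(git ls-files --others --directory --no-empty-directory)
printf 'default:\n%s\nnegated:\n%s\n' "$with_empty" "$without_empty"
```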

Interdiff: b3b6ad0..a44941c in git://repo.or.cz/git/vmiklos.git.

 builtin-ls-files.c |  303 +++++++++++++++++++++++++++++-----------------------
 1 files changed, 168 insertions(+), 135 deletions(-)

diff --git a/builtin-ls-files.c b/builtin-ls-files.c
index f72eb85..8a946ef 100644
--- a/builtin-ls-files.c
+++ b/builtin-ls-files.c
@@ -10,6 +10,7 @@
 #include "dir.h"
 #include "builtin.h"
 #include "tree.h"
+#include "parse-options.h"
 
 static int abbrev;
 static int show_deleted;
@@ -28,6 +29,7 @@ static const char **pathspec;
 static int error_unmatch;
 static char *ps_matched;
 static const char *with_tree;
+static int exc_given;
 
 static const char *tag_cached = "";
 static const char *tag_unmerged = "";
@@ -395,156 +397,187 @@ int report_path_error(const char *ps_matched, const char **pathspec, int prefix_
 	return errors;
 }
 
-static const char ls_files_usage[] =
-	"git ls-files [-z] [-t] [-v] (--[cached|deleted|others|stage|unmerged|killed|modified])* "
-	"[ --ignored ] [--exclude=<pattern>] [--exclude-from=<file>] "
-	"[ --exclude-per-directory=<filename> ] [--exclude-standard] "
-	"[--full-name] [--abbrev] [--] [<file>]*";
+static const char * const ls_files_usage[] = {
+	"git ls-files [options] [<file>]*",
+	NULL
+};
+
+static int option_parse_z(const struct option *opt,
+			  const char *arg, int unset)
+{
+	if (unset)
+		line_terminator = '\n';
+	else
+		line_terminator = 0;
+	return 0;
+}
+
+static int option_parse_exclude(const struct option *opt,
+				const char *arg, int unset)
+{
+	struct dir_struct *dir = opt->value;
+
+	exc_given = 1;
+	add_exclude(arg, "", 0, &dir->exclude_list[EXC_CMDL]);
+
+	return 0;
+}
+
+static int option_parse_exclude_from(const struct option *opt,
+				     const char *arg, int unset)
+{
+	struct dir_struct *dir = opt->value;
+
+	exc_given = 1;
+	add_excludes_from_file(dir, arg);
+
+	return 0;
+}
+
+static int option_parse_exclude_standard(const struct option *opt,
+					 const char *arg, int unset)
+{
+	struct dir_struct *dir = opt->value;
+
+	exc_given = 1;
+	setup_standard_excludes(dir);
+
+	return 0;
+}
+
+static int option_parse_ignored(const struct option *opt,
+				const char *arg, int unset)
+{
+	struct dir_struct *dir = opt->value;
+
+	if (unset)
+		dir->show_ignored = 0;
+	else
+		dir->show_ignored = 1;
+
+	return 0;
+}
+
+static int option_parse_directory(const struct option *opt,
+				  const char *arg, int unset)
+{
+	struct dir_struct *dir = opt->value;
+
+	if (unset)
+		dir->show_other_directories = 0;
+	else
+		dir->show_other_directories = 1;
+
+	return 0;
+}
+
+static int option_parse_empty(const struct option *opt,
+				 const char *arg, int unset)
+{
+	struct dir_struct *dir = opt->value;
+
+	if (unset)
+		dir->hide_empty_directories = 1;
+	else
+		dir->hide_empty_directories = 0;
+
+	return 0;
+}
+
+static int option_parse_full_name(const struct option *opt,
+				  const char *arg, int unset)
+{
+	prefix_offset = 0;
+
+	return 0;
+}
 
 int cmd_ls_files(int argc, const char **argv, const char *prefix)
 {
-	int i;
-	int exc_given = 0, require_work_tree = 0;
+	int require_work_tree = 0, show_tag = 0;
 	struct dir_struct dir;
+	struct option builtin_ls_files_options[] = {
+		{ OPTION_CALLBACK, 'z', NULL, NULL, NULL,
+			"paths are separated with NUL character",
+			PARSE_OPT_NOARG, option_parse_z },
+		OPT_BOOLEAN('t', NULL, &show_tag,
+			"identify the file status with tags"),
+		OPT_BOOLEAN('v', NULL, &show_valid_bit,
+			"use lowercase letters for 'assume unchanged' files"),
+		OPT_BOOLEAN('c', "cached", &show_cached,
+				"show cached files in the output (default)"),
+		OPT_BOOLEAN('d', "deleted", &show_deleted,
+				"show deleted files in the output"),
+		OPT_BOOLEAN('m', "modified", &show_modified,
+				"show modified files in the output"),
+		OPT_BOOLEAN('o', "others", &show_others,
+				"show other files in the output"),
+		{ OPTION_CALLBACK, 'i', "ignored", &dir, NULL,
+			"show ignored files in the output",
+			PARSE_OPT_NOARG, option_parse_ignored },
+		OPT_BOOLEAN('s', "stage", &show_stage,
+			"show staged contents' object name in the output"),
+		OPT_BOOLEAN('k', "killed", &show_killed,
+			"show files on the filesystem that need to be removed"),
+		{ OPTION_CALLBACK, 0, "directory", &dir, NULL,
+			"show 'other' directories' name only",
+			PARSE_OPT_NOARG, option_parse_directory },
+		{ OPTION_CALLBACK, 0, "empty-directory", &dir, NULL,
+			"list empty directories",
+			PARSE_OPT_NOARG, option_parse_empty },
+		OPT_BOOLEAN('u', "unmerged", &show_unmerged,
+			"show unmerged files in the output"),
+		{ OPTION_CALLBACK, 'x', "exclude", &dir, "pattern",
+			"skip files matching pattern",
+			0, option_parse_exclude },
+		{ OPTION_CALLBACK, 'X', "exclude-from", &dir, "file",
+			"exclude patterns are read from <file>",
+			0, option_parse_exclude_from },
+		OPT_STRING(0, "exclude-per-directory", &dir.exclude_per_dir, "file",
+			"read additional per-directory exclude patterns in <file>"),
+		{ OPTION_CALLBACK, 0, "exclude-standard", &dir, NULL,
+			"add the standard git exclusions",
+			PARSE_OPT_NOARG, option_parse_exclude_standard },
+		{ OPTION_CALLBACK, 0, "full-name", NULL, NULL,
+			"make the output relative to the project top directory",
+			PARSE_OPT_NOARG, option_parse_full_name },
+		OPT_BOOLEAN(0, "error-unmatch", &error_unmatch,
+			"if any <file> is not in the index, treat this as an error"),
+		OPT_STRING(0, "with-tree", &with_tree, "tree-ish",
+			"pretend that paths removed since <tree-ish> are still present"),
+		OPT__ABBREV(&abbrev),
+		OPT_END()
+	};
 
 	memset(&dir, 0, sizeof(dir));
 	if (prefix)
 		prefix_offset = strlen(prefix);
 	git_config(git_default_config, NULL);
 
-	for (i = 1; i < argc; i++) {
-		const char *arg = argv[i];
-
-		if (!strcmp(arg, "--")) {
-			i++;
-			break;
-		}
-		if (!strcmp(arg, "-z")) {
-			line_terminator = 0;
-			continue;
-		}
-		if (!strcmp(arg, "-t") || !strcmp(arg, "-v")) {
-			tag_cached = "H ";
-			tag_unmerged = "M ";
-			tag_removed = "R ";
-			tag_modified = "C ";
-			tag_other = "? ";
-			tag_killed = "K ";
-			if (arg[1] == 'v')
-				show_valid_bit = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-c") || !strcmp(arg, "--cached")) {
-			show_cached = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-d") || !strcmp(arg, "--deleted")) {
-			show_deleted = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-m") || !strcmp(arg, "--modified")) {
-			show_modified = 1;
-			require_work_tree = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-o") || !strcmp(arg, "--others")) {
-			show_others = 1;
-			require_work_tree = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-i") || !strcmp(arg, "--ignored")) {
-			dir.show_ignored = 1;
-			require_work_tree = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-s") || !strcmp(arg, "--stage")) {
-			show_stage = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-k") || !strcmp(arg, "--killed")) {
-			show_killed = 1;
-			require_work_tree = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--directory")) {
-			dir.show_other_directories = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--no-empty-directory")) {
-			dir.hide_empty_directories = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-u") || !strcmp(arg, "--unmerged")) {
-			/* There's no point in showing unmerged unless
-			 * you also show the stage information.
-			 */
-			show_stage = 1;
-			show_unmerged = 1;
-			continue;
-		}
-		if (!strcmp(arg, "-x") && i+1 < argc) {
-			exc_given = 1;
-			add_exclude(argv[++i], "", 0, &dir.exclude_list[EXC_CMDL]);
-			continue;
-		}
-		if (!prefixcmp(arg, "--exclude=")) {
-			exc_given = 1;
-			add_exclude(arg+10, "", 0, &dir.exclude_list[EXC_CMDL]);
-			continue;
-		}
-		if (!strcmp(arg, "-X") && i+1 < argc) {
-			exc_given = 1;
-			add_excludes_from_file(&dir, argv[++i]);
-			continue;
-		}
-		if (!prefixcmp(arg, "--exclude-from=")) {
-			exc_given = 1;
-			add_excludes_from_file(&dir, arg+15);
-			continue;
-		}
-		if (!prefixcmp(arg, "--exclude-per-directory=")) {
-			exc_given = 1;
-			dir.exclude_per_dir = arg + 24;
-			continue;
-		}
-		if (!strcmp(arg, "--exclude-standard")) {
-			exc_given = 1;
-			setup_standard_excludes(&dir);
-			continue;
-		}
-		if (!strcmp(arg, "--full-name")) {
-			prefix_offset = 0;
-			continue;
-		}
-		if (!strcmp(arg, "--error-unmatch")) {
-			error_unmatch = 1;
-			continue;
-		}
-		if (!prefixcmp(arg, "--with-tree=")) {
-			with_tree = arg + 12;
-			continue;
-		}
-		if (!prefixcmp(arg, "--abbrev=")) {
-			abbrev = strtoul(arg+9, NULL, 10);
-			if (abbrev && abbrev < MINIMUM_ABBREV)
-				abbrev = MINIMUM_ABBREV;
-			else if (abbrev > 40)
-				abbrev = 40;
-			continue;
-		}
-		if (!strcmp(arg, "--abbrev")) {
-			abbrev = DEFAULT_ABBREV;
-			continue;
-		}
-		if (*arg == '-')
-			usage(ls_files_usage);
-		break;
+	argc = parse_options(argc, argv, builtin_ls_files_options,
+			ls_files_usage, 0);
+	if (show_tag || show_valid_bit) {
+		tag_cached = "H ";
+		tag_unmerged = "M ";
+		tag_removed = "R ";
+		tag_modified = "C ";
+		tag_other = "? ";
+		tag_killed = "K ";
 	}
+	if (show_modified || show_others || dir.show_ignored || show_killed)
+		require_work_tree = 1;
+	if (show_unmerged)
+		/* There's no point in showing unmerged unless
+		 * you also show the stage information.
+		 */
+		show_stage = 1;
+	if (dir.exclude_per_dir)
+		exc_given = 1;
 
 	if (require_work_tree && !is_inside_work_tree())
 		setup_work_tree();
 
-	pathspec = get_pathspec(prefix, argv + i);
+	pathspec = get_pathspec(prefix, argv);
 
 	/* Verify that the pathspec matches the prefix */
 	if (pathspec)
-- 
1.6.1

^ permalink raw reply related

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: R. Tyler Ballance @ 2009-01-07  2:47 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Krüger, Git ML
In-Reply-To: <alpine.LFD.2.00.0901062059230.26118@xanadu.home>

[-- Attachment #1: Type: text/plain, Size: 751 bytes --]

On Tue, 2009-01-06 at 21:09 -0500, Nicolas Pitre wrote:
> > I've tarred one of the repositories that had it in a reproducible
> > state
> 
> That is wonderful.
> 
> > so I can create a build and extract the tar and run against that to
> > verify any patches anybody might have, but unfortunately at 7GB of
> > company code and assets, I can't exactly share ;)
> 
> First step is to understand what is going on.  Only then could reliable 
> patches be made.

If you want to point me in the right direction, I have a few hours to
kill this evening and fscking around with gdb(1) and printf() just might
be some of my favorite things</sarcasm> ;)

Looking forward to killing this issue


Cheers

-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: Nicolas Pitre @ 2009-01-07  2:09 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Jan Krüger, Git ML
In-Reply-To: <1231292360.8870.61.camel@starfruit>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2165 bytes --]

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2009-01-06 at 20:25 -0500, Nicolas Pitre wrote:
> > On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> > 
> > > On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> > > > For fixing a corrupted repository by using backup copies of individual
> > > > files, allow write_sha1_file() to write loose files even if the object
> > > > already exists in a pack file, but only if the existing entry is marked
> > > > as corrupted.
> > > 
> > > I figured I'd reply to this again, since the issue cropped up again.
> > > 
> > > We started experiencing *large* numbers of corruptions like the ones
> > > that started the thread (one developer was receiving them once or twice
> > > a day) with v1.6.0.4
> > > 
> > > We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
> > > (below) and the issues /seem/ to have resolved themselves. I'm not
> > > certain whether Jan's patch was really responsible, or if there was
> > > another issue that caused this to correct itself in v1.6.1. 
> 
> I'll back the patch out and redeploy, it's worth mentioning that a
> coworker of mine just got the issue as well (on 1.6.1). He was able to
> `git pull` and the error went away, but I doubt that it "magically fixed
> itself"

Please describe the "issue", ideally with transcripts of error messages, 
etc.  Normally a simple pull operation should not provide any "fix" for 
corruptions.

> I highly doubt this, I've got the issue appearing on at least 7
> different development boxes (not workstations, 2U quad-core ECC RAM, etc
> machines), while that doesn't mean that they all don't have issues, the
> probability of them *all* having disk issues, and it somehow only
> manifesting itself with Git usage, is low ;)

Agreed.

> I've tarred one of the repositories that had it in a reproducible state

That is wonderful.

> so I can create a build and extract the tar and run against that to
> verify any patches anybody might have, but unfortunately at 7GB of
> company code and assets, I can't exactly share ;)

First step is to understand what is going on.  Only then could reliable 
patches be made.


Nicolas

^ permalink raw reply

* Re: [PATCH v2] Add -ftabstop=WIDTH
From: Junio C Hamano @ 2009-01-07  1:52 UTC (permalink / raw)
  To: Christopher Li; +Cc: Alexey Zaytsev, git
In-Reply-To: <70318cbf0901061637l29837d14nfaa8a3106652b7e5@mail.gmail.com>

"Christopher Li" <sparse@chrisli.org> writes:

> So here is my understanding of what you described. The 'pu' branch is
> for highly experimental changes. The 'pu' branch can rewind and rewrite
> the history. Once a patch is merged to 'next', the history will not change
> any more.  All updates will stay as incremental changes.
>
> One question: do users suffer from conflicts when they pull from the 'pu'
> branch?

[jc: I think this is going off on a tangent for the "sparse" list;
 redirecting to git@vger.kernel.org]

I think they will, if they "pull", but:

 (1) They are upfront strongly discouraged from doing so by the way 'pu'
     is advertised.  "It is a collection of not yet even testable series,
     and any patch in it can be dropped and replaced".

 (2) They can instead 'fetch + rebase' the changes they made on top of
     previous round of 'pu', instead of 'pull' (= 'fetch + merge') to
     mitigate the pain.

Suppose I have two un-ready topics A and B in pu, and you base your work
X, Y, and Z on what was done by A (in other words, you are not interested
in topic B at all).  Then suppose one of A or B is replaced by wildly
different versions, and 'pu' is rebuilt:

                    X---Y---Z (private changes)
                   /
             A----B pu (old)
            /
           /              A'---B' pu (new)
          /              /  
     ----o----o----o----o

        Fig. 1

If you pull, even if A was not the one that was replaced, the merge will
have severe conflicts from the changes involved in the other series
(i.e. B).

But if A and A' did not change drastically in the meantime, rebasing X, Y,
Z on top of the updated pu (i.e. B') would not conflict:

                    X---Y---Z (private changes)
                   /
             A----B pu (old)     X'--Y'--Z' (private changes rebased)
            /                   / 
           /              A'---B' pu (new)
          /              /  
     ----o----o----o----o

        Fig. 2

In either case, if A (i.e. the work X, Y, Z were made on top of) was
rewritten drastically to become A', neither rebase nor merge will be of
help anyway, and it would not help if the new A' were recorded as an
incremental change from A without rebasing/rewinding 'pu' itself, either.

But at least 'fetch + rebase' would avoid the issue when it is only the
other topics in 'pu' that you are not interested in that were replaced or
rewritten drastically.
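
The 'fetch + rebase' case of Fig. 2 can be sketched in a scratch
repository. Everything here is made up for illustration: the branch names
pu-old and pu-new stand in for the previous and the rebuilt 'pu', 'mywork'
holds X, Y and Z, and each commit just adds one file.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
git config commit.gpgsign false

# Hypothetical helper: one commit that adds one file named after its subject.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit base
git checkout -q -b pu-old master      # previous round of 'pu'
commit A; commit B
git checkout -q -b mywork pu-old      # private changes on top of it
commit X; commit Y; commit Z
git checkout -q -b pu-new master      # 'pu' rebuilt: A kept, B replaced
commit A; commit B2

# Replay only pu-old..mywork (X, Y, Z) on top of the new 'pu'.
git rebase -q --onto pu-new pu-old mywork
git log --format=%s master..mywork
```

Against a real remote the same idea would probably look like "git fetch"
followed by "git rebase --onto origin/pu origin/pu@{1} mywork", assuming
the remote-tracking branch has a reflog entry for its previous value.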

By the way, I drew A and B as if they are single patches made _directly_
on pu, only for simplicity's sake.  In reality, all topics fork from more
stable branches (maint or master), and the only commits you see on 'next'
or 'pu' are merges.

Which means, even if we assume that you never rewind 'pu':

> Here is an idea, I am just thinking it out loud.
>
> Given 'pu' branch like this, (each [ ] is a commit, A1 is a follow up
> change for A0).
>
> 'pu' branch: [A0] - [B0] - [A1] - [C0] -[B1] -[A2]

... the history of 'pu' won't look like this.

It would be more like this:

           .-----[B0]----[B1]  ...  topic branch for B
          /         \       \
         /  ...--*---*---*---* ...  pu
        /       /       /
       /       [A0]---[A1]     ...  topic branch for A
      /       /
     o-------o----o master

        Fig. 3

    Side note: my 'next' never rewinds except for once every major
    release, so the above "repeated merge from topics into the branch"
    depicts how 'next' works pretty closely.

Or, if you rebuild 'pu' every day, it would be more like
this one day, and

           .-----[B0]          ...  topic branch for B
          /         \        
         /  ...--*---*         ...  pu
        /       /        
       /       [A0]            ...  topic branch for A
      /       /
     o-------o----o master

        Fig. 4

the next day it would look like this:

           .-----[B0]----[B1]  ...  topic branch for B
          /                 \
         /          ...--*---* ...  pu
        /               /
       /       [A0]---[A1]     ...  topic branch for A
      /       /
     o-------o----o master

        Fig. 5

In either case, unless a topic began with so many early issues and
mistakes that it requires a wholesale replacement, you can expect the
accumulation of A0,A1,...,An to end up in a good shape eventually and then
you have a good incremental history you would want to preserve.

At that point, you can merge the tip of the branch (i.e. An) to master and
declare victory.  'pu' or 'next' may have a messy history that would make
anybody who looks at gitk output barf, but that is Ok.

> We can have a temporary clean up branch fork from 'pu' looks like this:
>
> 'tmp_clean' branch: [A0 + A1 + A2] - [B0 + B1] - [C0]
>
>  'tmp_clean' and 'pu' will generate exactly the same tree. The
> only difference is the history path it takes to get there.
>
> Then we can have 'pu' merge from 'tmp_clean', with zero text
> changes. The only change is the change log and we tell git
> that the merge is for history clean up. So when we launch
> "git log", by default it will follow the "tmp_clean" path rather
> than the original "pu" path.
>
> So it just provides an "alternative" view of the history without
> introducing real changes. When a user pulls from 'pu', they automatically
> get the cleaned-up version of the history without introducing conflicts.
>
> It seems it can have the best of both worlds. I am not sure whether
> it is doable or worthwhile to do though.

I do not think it is worth it, for two reasons:

 (1) That won't help the case where others based their work on un-ready
     changes in 'pu', as I described earlier, anyway.

 (2) If you do not have any work on top of the un-ready 'pu', in other
     words, if you are just following along, then "git checkout origin/pu"
     won't care if yesterday's pu and today's pu are not fast-forward
     anyway.

If you rebuild 'pu' from scratch every day, without keeping the many
repeated merges made so far, "gitk master..pu" will be a more pleasant
read than 'next', which never rewinds and whose "gitk master..next"
output is a disaster ;-).

There is one trick my experienced users use, knowing how 'pu' is managed.

If today's 'pu' looked like Fig. 4 above, and you are interested in the
topic A, you can find the tip of that topic by looking at:

        git log --first-parent master..pu

It is what was merged to the merge that is at the second from the tip of
'pu' branch, i.e. "pu^^2 == A0".

And you fork your own enhancement to that topic by forking from A0,
creating "my-A" branch.  Your own commits go to that branch.

Next day you will find a history that is depicted in Fig. 5 and find the
tip of topic A the same way.  It is at A1.

Then you rebase "my-A" on top of A1 (or merge A1 to "my-A" branch).  You
really do not care about other uncooked garbage in 'pu', and you can
ignore them this way.
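
The whole trick can be replayed in a scratch repository. This is a sketch
under stated assumptions: the branch names topic-A and topic-B, the
rebuild_pu helper and the one-file commits are made up for illustration,
and 'pu' is rebuilt by merging the topics in the same order each day so
that topic A stays at "pu^^2".

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
git config commit.gpgsign false

# Hypothetical helper: one commit that adds one file named after its subject.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

# Hypothetical helper: (re)build 'pu' from master, merging the topics
# in a fixed order.
rebuild_pu() {
	git checkout -q -B pu master
	git merge -q --no-ff -m "Merge topic-A" topic-A
	git merge -q --no-ff -m "Merge topic-B" topic-B
}

commit base
git checkout -q -b topic-A master; commit A0
git checkout -q -b topic-B master; commit B0
rebuild_pu                                  # the shape of Fig. 4

git log --first-parent --format=%s master..pu
day1_tip=$(git rev-parse 'pu^^2')           # what the merge below the tip merged
git checkout -q -b my-A "$day1_tip"
commit mine                                 # your own enhancement

git checkout -q topic-A; commit A1
git checkout -q topic-B; commit B1
rebuild_pu                                  # the shape of Fig. 5

day2_tip=$(git rev-parse 'pu^^2')
git rebase -q "$day2_tip" my-A              # carry my-A forward onto A1
git log --format=%s master..my-A
```

Nothing from topic B ever ends up on my-A; only the topic you follow and
your own commits do.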

If you are working on more than one such "topics started by others", you
will have many my-A, my-B, ... branches.  You treat your 'master' branch
as if it is my 'next', i.e. fork from the last major release, merging all
of my-X branches, and employ the aggregated result for your own use.

^ permalink raw reply

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: R. Tyler Ballance @ 2009-01-07  1:39 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Krüger, Git ML
In-Reply-To: <alpine.LFD.2.00.0901062005290.26118@xanadu.home>

[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]

On Tue, 2009-01-06 at 20:25 -0500, Nicolas Pitre wrote:
> On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> 
> > On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> > > For fixing a corrupted repository by using backup copies of individual
> > > files, allow write_sha1_file() to write loose files even if the object
> > > already exists in a pack file, but only if the existing entry is marked
> > > as corrupted.
> > 
> > I figured I'd reply to this again, since the issue cropped up again.
> > 
> > We started experiencing *large* numbers of corruptions like the ones
> > that started the thread (one developer was receiving them once or twice
> > a day) with v1.6.0.4
> > 
> > We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
> > (below) and the issues /seem/ to have resolved themselves. I'm not
> > certain whether Jan's patch was really responsible, or if there was
> > another issue that caused this to correct itself in v1.6.1. 

I'll back the patch out and redeploy, it's worth mentioning that a
coworker of mine just got the issue as well (on 1.6.1). He was able to
`git pull` and the error went away, but I doubt that it "magically fixed
itself"


> Please back it out.  As it stands, that patch is a no op because of the 
> way git is used, and even if the patch was to work as intended, its 
> purpose is not to magically fix corruptions without special action from 
> your part.  If you have corruption problems coming back only because of 
> the removal of this patch then something is really really fishy and I 
> would really like to know about it.
> 
> There were indeed many changes between v1.6.0.4 and v1.6.1: the exact 
> number is 1029.  A couple of them are especially addressing increased 
> robustness against some kind of pack corruptions.  But in any case you 
> still should see error messages appearing about them.
> 
> And don't underestimate the power of disk corruptions.  I started to 
> work on git corruption resilience simply because I ended up with a 
> corrupted pack at some point.  Then a while later I got another 
> corrupted pack.  Then another while later I lost my filesystem entirely 
> and had to reinstall my system (after buying a new disk).  Turns out 
> that my old disk is silently corrupting data without signaling any 
> errors to the host.

I highly doubt this, I've got the issue appearing on at least 7
different development boxes (not workstations, 2U quad-core ECC RAM, etc
machines), while that doesn't mean that they all don't have issues, the
probability of them *all* having disk issues, and it somehow only
manifesting itself with Git usage, is low ;)

I've tarred one of the repositories that had it in a reproducible state
so I can create a build and extract the tar and run against that to
verify any patches anybody might have, but unfortunately at 7GB of
company code and assets, I can't exactly share ;)


Cheers


-- 
-R. Tyler Ballance
Slide, Inc.

^ permalink raw reply

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: Nicolas Pitre @ 2009-01-07  1:25 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Jan Krüger, Git ML
In-Reply-To: <1231282320.8870.52.camel@starfruit>

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> > For fixing a corrupted repository by using backup copies of individual
> > files, allow write_sha1_file() to write loose files even if the object
> > already exists in a pack file, but only if the existing entry is marked
> > as corrupted.
> 
> I figured I'd reply to this again, since the issue cropped up again.
> 
> We started experiencing *large* numbers of corruptions like the ones
> that started the thread (one developer was receiving them once or twice
> a day) with v1.6.0.4
> 
> We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
> (below) and the issues /seem/ to have resolved themselves. I'm not
> certain whether Jan's patch was really responsible, or if there was
> another issue that caused this to correct itself in v1.6.1. 
> 
> As it stands, I think it's safe to assume that given the frequency of
> the occurrences that they were not tied to a memory or disk error (or
> other levels of the machine's stack would be suffering as well). The
> only thing I can think of is that /some/ developers who've experienced
> the issue are using Samba mount points and changing files in Mac OS X,
> but using Git on the mounted share (i.e. TextMate changes a file hosted
> on Samba, changes are committed in an SSH session on that machine), but
> that doesn't account for everything.
> 
> If there was something else included in the v1.6.1 release please let me
> know so I can back Jan's patch out.

Please back it out.  As it stands, that patch is a no op because of the 
way git is used, and even if the patch was to work as intended, its 
purpose is not to magically fix corruptions without special action from 
your part.  If you have corruption problems coming back only because of 
the removal of this patch then something is really really fishy and I 
would really like to know about it.

There were indeed many changes between v1.6.0.4 and v1.6.1: the exact 
number is 1029.  A couple of them are especially addressing increased 
robustness against some kind of pack corruptions.  But in any case you 
still should see error messages appearing about them.

And don't underestimate the power of disk corruptions.  I started to 
work on git corruption resilience simply because I ended up with a 
corrupted pack at some point.  Then a while later I got another 
corrupted pack.  Then another while later I lost my filesystem entirely 
and had to reinstall my system (after buying a new disk).  Turns out 
that my old disk is silently corrupting data without signaling any 
errors to the host.

Nicolas

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Nicolas Pitre @ 2009-01-07  0:56 UTC (permalink / raw)
  To: Stephen R. van den Berg; +Cc: Øyvind Harboe, git
In-Reply-To: <20090106231726.GB13379@cuci.nl>

On Wed, 7 Jan 2009, Stephen R. van den Berg wrote:

> Nicolas Pitre wrote:
> >On Tue, 6 Jan 2009, Øyvind Harboe wrote:
> >OK, try this:
> 
> >	git pull file://$(pwd)/../my_repo.orig
> 
> Alternately, try:
> 
> rm -rf .git/ORIG_HEAD .git/FETCH_HEAD .git/index .git/logs .git/info/refs \
>   .git/objects/pack/pack-*.keep .git/refs/original .git/refs/patches \
>   .git/patches .git/gitk.cache &&
>  git prune --expire now &&
>  git repack -a -d --window=200 &&
>  git gc

This might not be sufficient.  Or at least you better run 'git prune' at 
the very end, and possibly add -f to 'git repack'.  And if you somehow 
delete something you shouldn't have deleted then you're really screwed, 
whereas the pull method in another repository doesn't alter the original 
repository in case you need to go back to it and try something 
different.

Nicolas

^ permalink raw reply

* Re: [RFC PATCH] diff --no-index: test for pager after option parsing
From: Junio C Hamano @ 2009-01-07  0:09 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git
In-Reply-To: <1231286163-9422-1-git-send-email-trast@student.ethz.ch>

Thomas Rast <trast@student.ethz.ch> writes:

> I noticed this while working on the earlier patch for diff --no-index.
> It seems like the right thing to do (and passes tests), but I don't
> have a clue about git's normal setup sequences, so I'm flagging it
> RFC.

I think the patch itself makes sense from the logic flow point of view.

But I wonder if it still makes a difference in real life.  Didn't we stop
reporting the exit status from the pager some time ago?

^ permalink raw reply

* Re: [RFC PATCH] diff --no-index: test for pager after option parsing
From: Thomas Rast @ 2009-01-07  0:09 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano
In-Reply-To: <1231286163-9422-1-git-send-email-trast@student.ethz.ch>

Speaking of diff --no-index code, there's also this bit:

	/* diff-no-index.c:206 */
	for (i = 1; i < argc - 2; ) {
		int j;
		if (!strcmp(argv[i], "--no-index"))
			i++;
		else if (!strcmp(argv[1], "-q"))
			options |= DIFF_SILENT_ON_REMOVED;

Note the argv[i] vs. argv[1].  The entire block is from 0569e9b ("git
diff": do not ignore index without --no-index, 2008-05-23).

While it seems obvious that this should be argv[i], I'm rather
confused by the option itself.  It is not documented in my version of
git-diff(1).  Furthermore, I can't see what being silent about removed
paths (which relates to the index?) has to do with a diff --no-index
(which takes two paths that must exist).

Or perhaps I should take the "no mails/patches after midnight" rule a
tad bit more seriously...

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply

* [RFC PATCH] diff --no-index: test for pager after option parsing
From: Thomas Rast @ 2009-01-06 23:56 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

We need to parse options before we can see if --exit-code was
provided.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>

---

I noticed this while working on the earlier patch for diff --no-index.
It seems like the right thing to do (and passes tests), but I don't
have a clue about git's normal setup sequences, so I'm flagging it
RFC.


 diff-no-index.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/diff-no-index.c b/diff-no-index.c
index b60d345..f655f64 100644
--- a/diff-no-index.c
+++ b/diff-no-index.c
@@ -198,13 +198,6 @@ void diff_no_index(struct rev_info *revs,
 		die("git diff %s takes two paths",
 		    no_index ? "--no-index" : "[--no-index]");
 
-	/*
-	 * If the user asked for our exit code then don't start a
-	 * pager or we would end up reporting its exit code instead.
-	 */
-	if (!DIFF_OPT_TST(&revs->diffopt, EXIT_WITH_STATUS))
-		setup_pager();
-
 	diff_setup(&revs->diffopt);
 	if (!revs->diffopt.output_format)
 		revs->diffopt.output_format = DIFF_FORMAT_PATCH;
@@ -222,6 +215,13 @@ void diff_no_index(struct rev_info *revs,
 		}
 	}
 
+	/*
+	 * If the user asked for our exit code then don't start a
+	 * pager or we would end up reporting its exit code instead.
+	 */
+	if (!DIFF_OPT_TST(&revs->diffopt, EXIT_WITH_STATUS))
+		setup_pager();
+
 	if (prefix) {
 		int len = strlen(prefix);
 
-- 
tg: (e9b8523..) t/diff-no-index-status (depends on: origin/master)

^ permalink raw reply related

* Re: Problems getting rid of large files using git-filter-branch
From: Stephen R. van den Berg @ 2009-01-06 23:17 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Øyvind Harboe, git
In-Reply-To: <alpine.LFD.2.00.0901061709510.26118@xanadu.home>

Nicolas Pitre wrote:
>On Tue, 6 Jan 2009, Øyvind Harboe wrote:
>OK, try this:

>	git pull file://$(pwd)/../my_repo.orig

Alternately, try:

rm -rf .git/ORIG_HEAD .git/FETCH_HEAD .git/index .git/logs .git/info/refs \
  .git/objects/pack/pack-*.keep .git/refs/original .git/refs/patches \
  .git/patches .git/gitk.cache &&
 git prune --expire now &&
 git repack -a -d --window=200 &&
 git gc

-- 
Sincerely,
           Stephen R. van den Berg.

"Very funny, Mr. Scott. Now beam down my clothes!"

^ permalink raw reply

* Re: [PATCH/RFC v2 2/4] Use 'lstat_cache()' instead of 'has_symlink_leading_path()'
From: Junio C Hamano @ 2009-01-06 23:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kjetil Barvik, git
In-Reply-To: <alpine.LFD.2.00.0901061304280.3057@localhost.localdomain>

Linus Torvalds <torvalds@linux-foundation.org> writes:

>> ...  The previously used call:
>> 
>>    has_symlink_leading_path(len, name);
>> 
>> should be identically with the following call to lstat_cache():
>> 
>>    lstat_cache(len, name,
>>                LSTAT_SYMLINK|LSTAT_DIR,
>>                LSTAT_SYMLINK);
>
> I think the new interface looks worse.
>
> Why don't you just do a new inline function that says
>
> 	static inline int has_symlink_leading_path(int len, const char *name)
> 	{
> 		return lstat_cache(len, name,
> 			LSTAT_SYMLINK|LSTAT_DIR,
> 			LSTAT_SYMLINK);
> 	}
>
> and now you don't need this big patch, and people who don't care about 
> those magic flags don't need to have them. End result: more readable code.

Excellent.

Not that I did not think a backward compatible macro is much easier to
read; after all, ce/ie you mention is a refactorization I did myself.

What I didn't think of was that posing the above question is a much better
way to extract a clear explanation why some of these lstat_cache() calls
have LSTAT_NOENT and some of them don't from the author.  It is a much
better way than my earlier attempt to do so.

> This is how git has done pretty much all "generalized" versions. See the 
> whole ce_modified() vs ie_modified() thing: they're the same function, 
> it's just that 'ce_modified()' is the traditional simpler interface that 
> works on the default index, while ie_modified() is the "full" version that 
> takes all the details that most uses don't even want to know about.

Yup, thanks for a praise ;-)

^ permalink raw reply

* Re: Announce: TortoiseGit 0.1 preview version
From: Jakub Narebski @ 2009-01-06 23:05 UTC (permalink / raw)
  To: Frank Li; +Cc: git
In-Reply-To: <1976ea660901060645r641d73e6ob4e03747f1860b6a@mail.gmail.com>

On Tue, 6 Jan 2009, Frank Li wrote:

> TortoiseGit 0.2 released.

Nice. Thanks a lot for your work.

> Can you help update Git Wiki page?

Well, my stupid ISP blocks git.or.cz (and gimp.org), supposedly in
malformed attempt to reduce SPAM using MAPS (Mail Abuse Prevention
System; SPAM backwards) lists and null routing. So I have to use
proxy to edit wiki.

> It seem anyone can update Wiki page. Is it true?

It is the nature of Wiki that anyone can edit Wiki pages. It is
recommended but not necessary to register; please do not forget
to put comment ("commit message") as well.

> 
> Summary (feature matrix)
[...]

I am not involved with GUI feature Matrix; I only added simple info
about TortoiseGit. I'm not sure who is...

> > I have added information about TortoiseGit to git wiki at

-- 
Jakub Narebski
Poland

^ permalink raw reply

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
From: R. Tyler Ballance @ 2009-01-06 22:52 UTC (permalink / raw)
  To: Jan Krüger; +Cc: Git ML
In-Reply-To: <20081209093627.77039a1f@perceptron>

On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> For fixing a corrupted repository by using backup copies of individual
> files, allow write_sha1_file() to write loose files even if the object
> already exists in a pack file, but only if the existing entry is marked
> as corrupted.

I figured I'd reply to this again, since the issue cropped up again.

We started experiencing *large* numbers of corruptions like the ones
that started the thread (one developer was receiving them once or twice
a day) with v1.6.0.4

We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
(below) and the issues /seem/ to have resolved themselves. I'm not
certain whether Jan's patch was really responsible, or if there was
another issue that caused this to correct itself in v1.6.1. 

As it stands, I think it's safe to assume that given the frequency of
the occurrences that they were not tied to a memory or disk error (or
other levels of the machine's stack would be suffering as well). The
only thing I can think of is that /some/ developers who've experienced
the issue are using Samba mount points and changing files in Mac OS X,
but using Git on the mounted share (i.e. TextMate changes a file hosted
on Samba, changes are committed in an SSH session on that machine), but
that doesn't account for everything.

If there was something else included in the v1.6.1 release please let me
know so I can back Jan's patch out.


Cheers


> 
> Signed-off-by: Jan Krüger <jk@jk.gs>
> ---
> 
> On IRC I talked to rtyler who had a corrupted pack file and plenty of
> object backups by way of cloned repositories. We decided to try
> extracting the corrupted objects from the other object database and
> injecting them into the broken repo as loose objects, but this failed
> because sha1_write_file() refuses to write loose objects that are
> already present in a pack file.
> 
> This patch expands the check to see if the pack entry has been marked
> as corrupted and, if so, allows writing a loose object with the same
> ID. Unfortunately, when Tyler tried a merge while using this patch,
> something we didn't manage to track down happened and now git doesn't
> consider the object corrupted anymore. I'm not sure enough that it
> wasn't caused by the patch to submit this patch without hesitation.
> 
> Apart from that, I think the change is not all too great since it makes
> write_sha1_file() walk the list of pack entries twice. That's a bit of
> a waste.
> 
> So those are the reasons why I wanted a few opinions first. Another
> reason is that there might be a way smarter method to fix this kind of
> problem, in which case I'd love hearing about it for future reference.
> 
>  sha1_file.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/sha1_file.c b/sha1_file.c
> index 6c0e251..17085cc 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2373,14 +2373,17 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
>  	char hdr[32];
>  	int hdrlen;
>  
> -	/* Normally if we have it in the pack then we do not bother writing
> -	 * it out into .git/objects/??/?{38} file.
> -	 */
>  	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
>  	if (returnsha1)
>  		hashcpy(returnsha1, sha1);
> -	if (has_sha1_file(sha1))
> -		return 0;
> +	/* Normally if we have it in the pack then we do not bother writing
> +	 * it out into .git/objects/??/?{38} file. We do, though, if there
> +	 * is no chance that we have an uncorrupted version of the object.
> +	 */
> +	if (has_sha1_file(sha1)) {
> +		if (has_loose_object(sha1) || !has_packed_and_bad(sha1))
> +			return 0;
> +	}
>  	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
>  }
>  
-- 
-R. Tyler Ballance
Slide, Inc.

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Øyvind Harboe @ 2009-01-06 22:41 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <alpine.LFD.2.00.0901061709510.26118@xanadu.home>

>> 3. I tried "git reflog expire --all" + lots of other tricks in the
>> link below, but no luck.
>
> OK, try this:
>
>        cd ..
>        mv my_repo my_repo.orig
>        mkdir my_repo
>        cd my_repo
>        git init
>        git pull file://$(pwd)/../my_repo.orig
>
> This is the easiest way to ensure you have only the necessary objects in
> the new repo, without all the extra stuff tied to reflogs, etc.

Super!

That worked!

> Then, if your repo is still seemingly too big, you can get a bit dirty
> with the sequence Johannes just posted.

Johannes' procedure had the unexpected side effect of showing that
my server setup is flaky somehow though... :-) I'll need his
tricks for other situations soon enough.


-- 
Øyvind Harboe
http://www.zylin.com/zy1000.html
ARM7 ARM9 XScale Cortex
JTAG debugger and flash programmer

^ permalink raw reply

* Re: Problems getting rid of large files using git-filter-branch
From: Øyvind Harboe @ 2009-01-06 22:36 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <alpine.DEB.1.00.0901062319070.30769@pacific.mpi-cbg.de>

On Tue, Jan 6, 2009 at 11:20 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Tue, 6 Jan 2009, Øyvind Harboe wrote:
>
>> Q1: How can I figure out what it is in .git that takes so much space?
>
> If it is a pack that is taking so much space:

it is.

>
> $ git verify-pack -v $PACK | grep -v "^chain " | sort -n -k 4

I have never used the git verify-pack command, but I'm pretty sure the
"Terminated" string isn't the normal output :-)

$ git verify-pack -v
.git/objects/pack/pack-1e039b82d8ae53ef5ec3614a3021466663cc70a4
Terminated

This is running git version 1.6.1 on CentOS on a virtual machine. I'm not quite
sure how to debug this. I'm sure I've done something wrong when I installed git.
I'm just a humble user of git trying to convert from cvs/svn.

> and then for the last few lines do a
>
> $ git rev-list --all --objects | grep $SHA1

I was able to run this procedure on a different machine than the
server and I can
then tell which objects take up all the space.

However, I'm unnerved by git verify-pack "Terminated"'ing on me above
and I'll have
to sort that out before I can think about using git in production.

Thanks for the pointers though! They definitely answered my questions!
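
For reference, the "which objects are biggest" step of the procedure
above can also be sketched in a few lines of Python.  This is only an
illustration, not part of Johannes' recipe: it assumes the text being
parsed is the output of `git verify-pack -v`, where each object line
starts with a 40-character SHA-1 followed by type, size and size in
the pack (the fourth column, the one `sort -k 4` keys on).  The sample
text, the object IDs in it and the function name are made up.

```python
import re

# An object line starts with a 40-hex-digit object name; summary lines
# such as "chain length = ..." or "<pack>: ok" do not.
OBJECT_LINE = re.compile(r"^[0-9a-f]{40} ")


def largest_objects(verify_pack_output, count=3):
    """Return up to `count` (size_in_pack, sha1, type, size) tuples, largest first."""
    rows = []
    for line in verify_pack_output.splitlines():
        if not OBJECT_LINE.match(line):
            continue  # skip summary lines
        fields = line.split()
        sha1, obj_type = fields[0], fields[1]
        size, size_in_pack = int(fields[2]), int(fields[3])
        rows.append((size_in_pack, sha1, obj_type, size))
    rows.sort(reverse=True)
    return rows[:count]


# Made-up sample in the shape of `git verify-pack -v` output.
SAMPLE = """\
1111111111111111111111111111111111111111 commit 230 155 12
2222222222222222222222222222222222222222 blob   52428800 52011234 167
3333333333333333333333333333333333333333 tree   68 77 52011401
4444444444444444444444444444444444444444 blob   1200 640 52011478 1 2222222222222222222222222222222222222222
chain length = 1: 1 object
.git/objects/pack/pack-example.pack: ok
"""

top = largest_objects(SAMPLE, 2)
for size_in_pack, sha1, obj_type, size in top:
    print(sha1, obj_type, size_in_pack)
```

The object names this reports are what you would then feed to
`git rev-list --all --objects | grep $SHA1` to find the matching paths.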

>
> Hth,
> Dscho
>

-- 
Øyvind Harboe
http://www.zylin.com/zy1000.html
ARM7 ARM9 XScale Cortex
JTAG debugger and flash programmer

^ permalink raw reply

* Comments on Presentation Notes Request.
From: Tim Visher @ 2009-01-06 22:33 UTC (permalink / raw)
  To: git

Hello Everyone,

I'm putting together a little 15 minute presentation for my company
regarding SCMSes in an attempt to convince them to at the very least
use a Distributed SCMS and at best to use git.  I put together all my
notes, although I didn't put together the actual presentation yet.  I
figured I'd post them here and maybe get some feedback about it.  Let
me know what you think.

Thanks in advance!

Notes
---------

SCM: Distributed, Centralized, and Everything in Between.

* What is SCM and Why is it Useful?

** Definition

SCM is the practice of committing the revisions of your source code to
a system that can faithfully reproduce historical snapshots of your
source code.

** Advantages of SCM

*** One Source to Rule Them All.

Instead of having a bunch of source files spread across multiple
developers' machines with multiple versions on each machine that may or
may not be labeled correctly, you have one repository containing every
artifact that your project includes.

*** Unlimited Undo/Redo.

Not only is it unlimited, but it's random access.  If you changed a
function a week ago, continued to work, and then decide that you want
the function back the way it was, it's as simple as pulling the
function back out of the SCMS.

*** Safe Concurrent Editing.

Many people can edit the same code base at the same time and know,
without a doubt, that when they pull all those changes together, the
system will merge the content intelligently or inform you of the
conflict and let you merge it.  You don't need to lock files.
Obviously, if there is bad coordination then the likelihood of
conflicts rises, but this should not happen regularly.

*** Diff Debugging

You can find where a bug was introduced by learning how to reproduce
the bug and then doing a binary chop search back through the History
to come to the exact commit that introduced the bug.
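
A minimal sketch of that binary chop, in Python.  It assumes the
history is a simple list ordered oldest to newest, that the oldest
commit is known good and the newest known bad, and that the bug stays
once introduced.  The commit names and the is_bad() test are
hypothetical stand-ins for "check out that revision and try to
reproduce the bug"; Git automates the same idea with `git bisect`.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # bug already present: look earlier
        else:
            lo = mid  # still fine: look later
    return commits[hi]


# Hypothetical history of 16 commits; the bug appears in "c11".
history = ["c%02d" % n for n in range(16)]
tested = []


def is_bad(commit):
    tested.append(commit)  # record how many revisions we had to test
    return commit >= "c11"


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", len(tested), "commits")
```

The point for the presentation is the cost: the number of revisions to
test grows with the logarithm of the history length, not with the
length itself.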

* SCM Best Practices

** Commit Early, Commit Often

The more you commit, the more fine grained control you have over the
undo feature of SCM.  Most documents that I have read suggested a TDD
approach wherein you commit whenever you have written just enough code
for your test to pass. But...

** Don't Commit Broken Code (To the Public Tree)

Of primary concern is the fact that your central HEAD should _always_
build.  This is why practices like Continuous Integration and TDD are
so important.  TDD gives you the freedom to be sure that a change you
made hasn't broken anything you weren't expecting it to break.
Continuous Integration allows you to be sure that your whole system
will build every time.  Thus, you should _never_ commit broken code to
the (public) tree.

Of course, in a centralized system, committing is intrinsically
public.  Even on branches, every time you commit any sort of change,
everyone is able to see it and so you could be breaking the build for
someone (even if it's just yourself and the build system).  One of the
nice features of a distributed system is that your public/private
ontology is much richer and thus allows you to have broken code in
your SCMS.

** Whole Hog

You should put everything necessary to build your system into SCM.
This includes user documentation, requirements documentation, software
tools, build tools, etc.  The only artifacts that don't need to be
managed are auto-generated artifacts such as javadocs, jar files, exe
files, etc.  This is so that you can reproduce entire releases using
only a simple checkout.

** Perform Updates and Commits on the Whole Tree

Updates and Commits should always be done on the whole tree so that
you're sure you have the latest source.  Never assume that nothing has
changed elsewhere.

** Allow and Encourage Customer Participation

Most shops seem to attempt to funnel customer participation through
the developers.  This is a cache miss for many operations such as
developing the user manual by a design team external to the
development team.  Basic operations such as commit and update are
fairly simple to grasp and can even be simplified further through
scripts and other such tools that non-developers can quickly be taught
to use.

Of note is the Tortoise family of tools which integrate directly into
Windows Explorer.  This makes it fairly easy for anyone who is
familiar with Windows Explorer to get into using any of the tools that
there is a Tortoise implementation for.

* The Centralized Model

** We Know About This One

This is traditional, plain vanilla, ubiquitous SCM.

The great majority of the SCMSes out there are centralized.

Closely resembles the Client/Server system model.

** Work Flow

<http://whygitisbetterthanx.com/#any-workflow>

*** 2 basic models: 'Lock, Modify, Unlock' and 'Copy, Modify, Merge'.

Older systems were primarily Lock, Modify, Unlock implementations.
You would checkout a file that you intended to work on, and no one
else would be able to check it out until you unlocked it, signaling
that you were done editing it.  This is inherently inefficient as on a
team of developers, the chances that two are working on the exact same
part of a system without knowing it and coordinating are fairly low.
Also, any disparate features that still touch the same files in the
system cannot be worked on simultaneously.

The answer to this is Copy, Modify, Merge.  In this system, every
developer gets a complete copy of the HEAD.  Everyone changes the HEAD
concurrently.  When commits happen, the system attempts to
intelligently merge them.  If it fails (usually doesn't happen unless
there is bad coordination), then it asks you to merge them.  This has
been proven to work well.
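
The merge decision can be sketched as a three-way comparison against
the common starting point.  The Python below is a deliberately
simplified illustration: it works on whole files (path -> content)
rather than on lines within a file as real tools do, and the function
name and sample data are made up.

```python
def merge_three_way(base, ours, theirs):
    """Merge two sets of edits that started from the same base snapshot.

    Each argument maps path -> content.  Returns (merged, conflicts).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree (or neither touched it)
        elif o == b:
            result = t  # only they changed it: take theirs
        elif t == b:
            result = o  # only we changed it: keep ours
        else:
            conflicts.append(path)  # both changed it differently
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a.c": "v1", "b.c": "v1", "c.c": "v1"}
ours = {"a.c": "v2", "b.c": "v1", "c.c": "our edit"}
theirs = {"a.c": "v1", "b.c": "v3", "c.c": "their edit"}

merged, conflicts = merge_three_way(base, ours, theirs)
print(merged, conflicts)
```

Only c.c, which both developers changed in different ways, is handed
back to a human; the other two files merge without anyone noticing.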

** Key Properties

*** Only One Repository

In centralized systems, there is only one global, public repository.
This has certain significant effects, such as an intrinsically global
name space for branches and tags, a restrictive public/private concept
(no such thing as committed but private), need for a backup process
aware of the possibility of in-progress commits, etc.

Since the repository only exists in a single location, the developers
only have copies of a specific revision and any uncommitted changes
they've made to that copy.

*** All Committed Changes Are Public

This includes regular commits (what we'd typically think of as
commits), branches, and tags.

As previously mentioned, in centralized systems, all committed changes
are public.  Even if you are working on a private branch (which you
typically wouldn't be because branches are expensive in centralized
systems), the changes you are making are still visible publicly
because your branch exists in the global, public repository.

*** Intrinsically Uses the Network.

Because you must have a single repository that all developers are
accessing, you must use the network for many common operations.
Commits must be made to the central repository, Logs live centrally,
branches live centrally, diffing between revisions is a network
operation, blaming is a network operation, etc.

*** Backup Becomes A Separate Process

Because there is only a single repository, you need a back-up strategy
or else you are exposing yourself to a single point of failure.
Unfortunately, this is not as simple as it sounds.  The global, public
nature of the repository makes the chances of creating a corrupt back
up very high.  Because of this, tools have grown up around and in many
centralized systems that automate the process of backing it up while
remaining aware of the problems that can arise.  However, the point
remains that there is no intrinsic back up of a centralized system.

*** Need A Repository Admin.

Because the system is centralized, you need a repository
administrator.  This is true in most modern centralized systems where
new repositories are created on a per project basis (as in, not VSS).
In other words, when you want a new repository, you need to go through
some sort of admin interface or through the administrator of the
repository server to make it happen.

* The Distributed Model

** This One's New

At least new as in unfamiliar.  The concept is over a decade old.

There are a few different popular distributed SCMSes (Git, Mercurial
(hg), Bazaar (bzr), Bitkeeper)

Very closely resembles a peer-to-peer network and the organic
relationships that evolve in that space.

In a distributed system, there is no one point where all development
comes together for any reason other than policy.  Everyone who is
working on a system intrinsically has their own copy of the entire
repository.  All of the history, all of the source code, all of the
public branches, all of the public tags, etc.  Because of this,
developers can also have private branches, private tags, private
commits, private history.  The distinction between public and private
is very important in this context.  This has several distinct features
which I'll go into now.

** Work Flow (Pick Your Poison)

<http://whygitisbetterthanx.com/#any-workflow>

** Key Properties

*** Private/Public Concept

A distributed SCMS's private/public ontology is _much_ richer.
Whereas in a central system, private means only what you have yet to
commit or what you are leaving untracked, in a distributed system,
private means anything that you have not yet _chosen_ to make public.
In other words, you can have private branches, private tags, private
committed changes to your copy of the head, etc.  Anything that you do
not specifically publish to a location that others can access is
intrinsically private.

In other words, you can finally SCM your sandbox!  You can commit as
many broken things as you want to a private repository, giving you the
ability to have a nearly infinite set of undoable and recoverable
changes, without breaking anyone else's build.  Or, you can just as
easily ignore TDD, never commit anything for 3 weeks and then do a
big, massive commit and as long as your final product is tested and
merges with the rest of the tree, you're good to go and no one cares.

Because you have a rich ontology for private/public data, you can also
do crazy things like rewriting your local history before anyone else
sees it.  Because your repository is the only one that has to know
about the history as long as you're dealing with private data, this is
a completely safe (although policy debatable) operation.  Of course,
once data has been published, you really shouldn't mess with its
history anymore.

*** Network(less)

In distributed systems, networks are optional for almost every
operation (and indeed, every operation prior to publishing).  Of
course, you could put your repository on a network drive and then
you'd be doing everything over the network like you would in a
centralized system, but if you put your repository clone on your local
system, then everything you do in that repository is local.  Viewing
your history, committing, branching, merging, everything.

Once you've published, however, not much changes.  Almost everything
except updating and publishing (_not_ committing) remains local.
Remember that committing no longer means publicly publishing.  You can
commit many revisions, even to the master HEAD and nothing at all has
been published until you push those changes to your public HEAD.

*** Natural Backup

Because every developer has a copy of the repository, every developer
you add is one more backup rather than a dependency on a single point
of failure.  The more developers you have, the more backups you have
of the repository.

*** Must Learn New Work Flows.

In order to fully experience the advantages of distributed systems,
new work flows must be learned.  In other words, it's possible to use
distributed systems nearly the exact same way as you use a centralized
system (you just need to learn new commands), but you don't get many
of the benefits except the speed improvements.  The real game change
happens when you realize that you can keep things private until they're
finished.  Once you realize that, new branching patterns emerge, new
work flows happen, you commit more often, and have the ability to
become much looser and freer in your development process.

*** Impossible To Completely Enforce A Single, Canonical Representation of the Code Base.

By nature, a distributed system cannot enforce a single canonical
representation of the code base except by policy, and policies can
always be broken.  Also, any intentionally private data is not backed
up because it is not shared.  However, backup becomes much simpler
because you know that no one else is committing to your repository.

This bears some explanation.  Within a distributed system, you can
have a single official release point that everyone has blessed (or the
company has blessed, or the original developer has blessed, or
whatever).  However, you cannot _stop_ someone else from making a
release point because their repository is just as valid as yours.  You
cannot _stop_ developers from sharing code between themselves without
going out to the official central location.  All you can do is ask
them not to.

* Why Git is the Best Choice

** Fast

Git's implementation just happens to be wickedly fast.  It's faster
than Mercurial, it's faster than Bazaar, etc.  Everything (committing,
merging, viewing history, branching, and even updating and pushing)
is faster.

** Tracks Content, not Files

Git tracks content, not files, and it's the only SCMS at the moment
that does this.  This has many effects internally, but the most
apparent effect I know of is that for the first time Git can easily
tell you the history of even a function in a file because Git can tell
you which files that function existed (or does exist) in over the
course of development.

** Extremely Efficient

Because Git tracks content, it can also be extremely efficient
space-wise, simplifying the files to be nothing but pointers to a set
of objects in Git's internal file system.  Thus, if you have files
with identical contents, Git uses a single object to represent them.
Git has been proven to be more efficient space-wise than any other
system out there.

** (Un)Staged Changes

Git employs the concept of the Index or Cache or Commit Stage.  This
is also unique to Git, and it's pretty strange for developers coming
from a system without it.

Basically, there are four states that any content can be in under Git.

1. Untracked: This is content that Git is completely unaware of.
2. Tracked but Unstaged: This is content that has changed that Git is
aware of but will not commit on the next commit command.
3. Tracked and Staged: This is the same as unstaged except that this
content will be committed on the next commit.
4. Tracked and Committed:  This is content that has not changed since
the previous commit that Git is aware of.

This is very powerful yet somewhat awkward to grasp.  Basically, the
upshot of this feature is that you can manually build commits if you
want to.  Say you were working on feature foo and then made some other
changes because you came across feature bar and thought it would be
quick to do.  In any other system, the only way you could commit parts
of what you'd changed is if you were lucky enough for the disparate
changes to be in different files.  In that case, you could commit only
the files that you wanted to change for the different features.
However, if you made disparate changes to the same file, you were
stuck.  In Git, you can stage only parts of the files to an extreme
degree.  This allows you to create as many commits as you want out of
a single change set until the whole change set is committed.

I've found this to be particularly useful when working with an
existing code base that was not properly formatted.  Often, I'll come
to a file that has a bunch of wonky white space choices and improperly
indented logical constructs and I'll just quickly run through it
correcting that stuff before continuing with the feature I was working
on.  Afterwards, I'll stage the formatting and commit it, and then
stage the feature I was working on and commit that.  You may not want
that kind of control (and if you don't, you don't need to use it), but
I like it.
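
The four states listed earlier can be observed directly.  What follows
is a minimal sketch, assuming a POSIX shell with Git on the PATH; the
throwaway repository and every file name in it are made up for
illustration.

```shell
# Throwaway repository; every file name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

echo "v1" > committed.txt
echo "v1" > edited.txt
git add committed.txt edited.txt
git commit -q -m "initial commit"   # both files: tracked and committed

echo "v2" >> edited.txt             # tracked but unstaged
echo "v1" > staged.txt
git add staged.txt                  # tracked and staged
echo "v1" > untracked.txt           # untracked

# Two status columns per file: the first describes the index (what is
# staged), the second describes the working tree.
status=$(git status --porcelain)
echo "$status"
```

In the output, staged.txt is marked "A " (staged), edited.txt is marked
" M" (changed but not staged), untracked.txt is marked "??", and
committed.txt does not appear at all because nothing about it has
changed.  To stage only some of the changes within a single file,
"git add -p" walks through them interactively.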

** Excellent Merge algorithms

Git has excellent merge algorithms.  This is widely acknowledged and
doesn't require much explanation.  It was one of Git's original design
goals, and it has been proven by Git's implementation.  Merging in Git
is _much_ less painful than in other systems.

** Has powerful 'maintainer tools'

Beyond the basics of committing, updating, pushing, viewing logs,
etc., Git is known to have very powerful maintainer-level tools.  You
can modify your history, automatically perform binary searches to
locate errors, and communicate via patches; it's highly customizable
and has the concept of submodules (projects within projects), etc.  It
gets complicated, but at this level of SCM it is complicated.
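
The automatic binary search mentioned above is "git bisect".  Here is a
rough sketch of it, assuming a POSIX shell and a reasonably recent Git;
the eight-commit history, the file notes.txt, and the "BUG" marker are
all invented for the example.

```shell
# Throwaway repository with a made-up history.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

# Eight commits; the hypothetical defect is introduced in commit 5.
for i in 1 2 3 4 5 6 7 8; do
    echo "line $i" >> notes.txt
    if [ "$i" -eq 5 ]; then echo "BUG" >> notes.txt; fi
    git add notes.txt
    git commit -q -m "commit $i"
done

first=$(git rev-list --max-parents=0 HEAD)    # the root commit, known good
git bisect start HEAD "$first" >/dev/null     # HEAD is known bad
# The test command must exit 0 on a good commit and non-zero on a bad one.
git bisect run sh -c '! grep -q BUG notes.txt' >/dev/null
culprit=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "$culprit"
```

Git checks out commits between the known-good and known-bad points,
runs the test command on each, and narrows the range until it reaches
the first bad commit, here "commit 5".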

** Cryptographically Guarantees Content

One of the most surprising things I learned as I was researching this
was that most SCMSes do not guarantee that your content does not get
corrupted.  In other words, if the repository's disk doesn't fail but
instead just gets corrupted, you'll never know unless you actually
notice the corruption in the files.  If you have memory corruption
locally and commit your changes, you just won't know.

Git guarantees absolutely that if corruption happens, you will know
about it.  It does this by creating SHA-1 hashes of your content and
then checking to make sure that the SHA-1 hash does not change for an
object.  The details of this aren't as important as the fact that Git
is one of the very few systems that do this and it's obviously
desirable.
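
To make the idea concrete, here is a minimal sketch, assuming a POSIX
shell with sha1sum (GNU coreutils) available.  Git names a file's
content (a "blob") by the SHA-1 of a short header followed by the
bytes themselves, so changing any byte changes the name.  The sample
strings are made up.

```shell
# A blob's name is the SHA-1 of: "blob <size in bytes>" NUL <content>
content='hello'
size=$(( ${#content} + 1 ))    # +1 for the newline that printf appends below

blob_id=$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)
echo "$blob_id"

# Simulate corruption: one character differs, the length stays the same.
corrupt_id=$(printf 'blob %s\0%s\n' "$size" "hellp" | sha1sum | cut -d' ' -f1)
echo "$corrupt_id"
```

The first id printed is ce013625030ba8dba906f756967f9e9ca394464a, the
same value "git hash-object" reports for a file containing "hello"
followed by a newline.  The second id differs, which is how a
corrupted object is detected: its content no longer hashes to its
name.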

* References

- <http://git-scm.com/> - The Git homepage
- <http://whygitisbetterthanx.com/> - An excellent resource explaining
the benefits of using git in relation to other common SCMSes
- <http://www.youtube.com/watch?v=4XpnKHJAok8> - Linus Torvalds's
Google Talk on Git.  Covers mainly what Git is Not and Why
Distribution is the model that works.
- <http://www.youtube.com/watch?v=8dhZ9BXQgc4> - Randal Schwartz's
Google Talk on Git.  Covers what Git is, including some implementation
details, some use case scenarios, and the like.
- <http://book.git-scm.com/> - A community written book on Git with
video tutorials about many of Git's features.
- <http://subversion.tigris.org/> - Subversion's homepage.  An
extremely popular open source centralized system.
- <http://svnbook.red-bean.com/> - Continuously published book on
Subversion.  Chapter 1 is a good introduction to general centralized
SCM concepts and principles.
- <http://www.perforce.com/perforce/bestpractices.html> - An excellent
set of best practices from the Perforce team.  Some of it (especially
the branches) has a distinct centralized lean, but most of it is quite
good.
- <http://www.bobev.com/PresentationsAndPapers/Common%20SCM%20Patterns.pdf>
- Interesting presentation by Pretzel Logic from 2001 attempting to
outline some common SCM best practices as Patterns.

---------------
End Notes

-- 

In Christ,

Timmy V.

http://burningones.com/
http://five.sentenc.es/ - Spend less time on e-mail

* Re: Problems getting rid of large files using git-filter-branch
From: Nicolas Pitre @ 2009-01-06 22:31 UTC
  To: Øyvind Harboe; +Cc: git
In-Reply-To: <c09652430901061359q7a02291fk656ab23e54b19f5e@mail.gmail.com>

On Tue, 6 Jan 2009, Øyvind Harboe wrote:

> Q1: How can I figure out what it is in .git that takes so much space?
> 
> Q2: Where can I read more about what to do after running git-filter-branch to
> remove the offending objects?
> 
> 1. I ran this command to get rid of the offending files and that appears to
> have worked. I can't find any traces of them anymore...
> 
> git filter-branch --tree-filter 'find . -regex ".*toolchain\..*" -exec rm -f {} \;' HEAD
> 
> 2. Running "git gc" takes a few seconds. The repository is still
> huge (it should be perhaps 10-20 MB).
> 
> du -skh .git/
> 187M    .git/
> 
> 3. I tried "git reflog expire --all" + lots of other tricks in the
> link below, but no luck.

OK, try this:

	cd ..
	mv my_repo my_repo.orig
	mkdir my_repo
	cd my_repo
	git init
	git pull file://$(pwd)/../my_repo.orig

This is the easiest way to ensure you have only the necessary objects in 
the new repo, without all the extra stuff tied to reflogs, etc.

Then, if your repo is still seemingly too big, you can get a bit dirty 
with the sequence Johannes just posted.


Nicolas

* Re: Problems getting rid of large files using git-filter-branch
From: Johannes Schindelin @ 2009-01-06 22:20 UTC
  To: Øyvind Harboe; +Cc: git
In-Reply-To: <c09652430901061359q7a02291fk656ab23e54b19f5e@mail.gmail.com>

Hi,

On Tue, 6 Jan 2009, Øyvind Harboe wrote:

> Q1: How can I figure out what it is in .git that takes so much space?

If it is a pack that is taking so much space:

$ git verify-pack -v $PACK | grep -v "^chain " | sort -n -k 4

and then for the last few lines do a

$ git rev-list --all --objects | grep $SHA1

Hth,
Dscho

* Problems getting rid of large files using git-filter-branch
From: Øyvind Harboe @ 2009-01-06 21:59 UTC
  To: git

I'm trying to get rid of some large objects in my .git repository
using git-filter-branch. These are remnants from conversion from
CVS.

Q1: How can I figure out what it is in .git that takes so much space?

Q2: Where can I read more about what to do after running git-filter-branch to
remove the offending objects?

1. I ran this command to get rid of the offending files and that appears to
have worked. I can't find any traces of them anymore...

git filter-branch --tree-filter 'find . -regex ".*toolchain\..*" -exec rm -f {} \;' HEAD

2. Running "git gc" takes a few seconds. The repository is still
huge (it should be perhaps 10-20 MB).

du -skh .git/
187M    .git/

3. I tried "git reflog expire --all" plus lots of other tricks I could
find in this thread, but no luck:

http://article.gmane.org/gmane.comp.version-control.git/60219/match=trying+use+git+filter+branch+compress

-- 
Øyvind Harboe
http://www.zylin.com/zy1000.html
ARM7 ARM9 XScale Cortex
JTAG debugger and flash programmer

* Re: JGit vs. Git
From: Johannes Schindelin @ 2009-01-06 21:41 UTC
  To: Vagmi Mudumbai; +Cc: git
In-Reply-To: <a55cfe9d0901052250k2be203dfvb0b437a523f2cecc@mail.gmail.com>

Hi,

On Tue, 6 Jan 2009, Vagmi Mudumbai wrote:

> I am working on Windows with msysGit behind a HTTP Proxy. (Life cant
> get worse, I guess.) .

FWIW I think all should work well if you use a proxy, such as 
http://www.meadowy.org/~gotoh/projects/connect

Hth,
Dscho
