* [PATCH] git-revover-tags-script
@ 2005-07-16 20:20 Eric W. Biederman
2005-07-17 0:51 ` Junio C Hamano
2005-07-20 0:20 ` [RFD] server-info to help clients Junio C Hamano
0 siblings, 2 replies; 12+ messages in thread
From: Eric W. Biederman @ 2005-07-16 20:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
First pass at a script to dig through .git/objects and find dangling
tags. It likely has a lot of weird limitations, I don't know if it
will work with packs, and the policy it implements is pretty stupid,
but it is a sane start and should keep people from needing to
rsync anything except the .git/objects part of the tree.
The current policy is if a tag's gpg signature can be verified
and if the tag name does not conflict with an existing tag
place it in .git/refs/tags/. So far this only works with
dangling tags so I don't know if these tags will even be pulled
with the pack methods. But since we aren't quite going at
full speed on those yet we should be good.
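To illustrate the intended workflow (the URL below is only an
example, and the commands are only a sketch):
$ cd linux-2.6
$ rsync -a \
      rsync://rsync.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/objects/ \
      .git/objects/
$ git-recover-tags-script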
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
Makefile | 3 ++-
git-recover-tags-script | 27 +++++++++++++++++++++++++++
2 files changed, 29 insertions(+), 1 deletions(-)
create mode 100755 git-recover-tags-script
4b171e71fd6b5de56dd4a93ea203e49115c2caee
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -36,7 +36,8 @@ SCRIPTS=git git-apply-patch-script git-m
git-reset-script git-add-script git-checkout-script git-clone-script \
gitk git-cherry git-rebase-script git-relink-script git-repack-script \
git-format-patch-script git-sh-setup-script git-push-script \
- git-branch-script git-parse-remote git-verify-tag-script
+ git-branch-script git-parse-remote git-verify-tag-script \
+ git-recover-tags-script
PROG= git-update-cache git-diff-files git-init-db git-write-tree \
git-read-tree git-commit-tree git-cat-file git-fsck-cache \
diff --git a/git-recover-tags-script b/git-recover-tags-script
new file mode 100755
--- /dev/null
+++ b/git-recover-tags-script
@@ -0,0 +1,27 @@
+#!/bin/sh
+# Copyright (c) 2005 Eric Biederman
+
+. git-sh-setup-script || die "Not a git archive"
+
+TMP_TAG=".tmp-tag.$$"
+git-fsck-cache |
+while read status type sha1 rest ; do
+	if [ "$status" = "dangling" ] && [ "$type" = "tag" ] ; then
+		if ! git-verify-tag-script "$sha1" ; then
+			echo "Could not verify tag $sha1"
+		else
+			tag=$(git-cat-file tag "$sha1" | sed -ne 's/^tag //p')
+			tagger=$(git-cat-file tag "$sha1" | sed -ne 's/^tagger //p')
+			if [ ! -e "$GIT_DIR/refs/tags/$tag" ]; then
+				echo "installing tag $tag tagger $tagger"
+				mkdir -p "$GIT_DIR/refs/tags"
+				echo "$sha1" > "$GIT_DIR/refs/tags/$tag"
+			fi
+		fi
+	else
+		if [ "$status" != "dangling" ] ; then
+			echo "$status $type $sha1 $rest"
+		fi
+	fi
+done
+rm -f $TMP_TAG
* Re: [PATCH] git-revover-tags-script
2005-07-16 20:20 [PATCH] git-revover-tags-script Eric W. Biederman
@ 2005-07-17 0:51 ` Junio C Hamano
2005-07-17 8:40 ` Eric W. Biederman
2005-07-20 0:20 ` [RFD] server-info to help clients Junio C Hamano
1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2005-07-17 0:51 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: git
ebiederm@xmission.com (Eric W. Biederman) writes:
> First pass at a script to dig through .git/objects and find dangling
> tags. It likely has a lot of weird limitations, I don't know if it
> will work with packs, and the policy it implments is pretty stupid,
> but it is a sane start and should keep people from needing to
> rsync anything except the .git/objects part of the tree.
Also in an earlier message:
> Do we want to put some porcelain around, git-fsck-cache --tags?
> So we can discover the tag objects in the archive and place
> them someplace usable. Jeff Garzik in his howto is still recommending:
>
>> git-pull-script only downloads sha1-indexed object data, and the requested remote head.
>> This misses updates to the .git/refs/tags/ and .git/refs/heads/ directories. It is
>> advisable to update your kernel .git directories periodically with a full rsync command, to
>> make sure you got everything:
>>$ cd linux-2.6
>>$ rsync -a --verbose --stats --progress \
>> rsync://rsync.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/ \
>> .git/
>
> Which feels like something is missing. Given that tags are
> sha1-indexed objects we should be pulling them. And I believe you can
> have a tag as a parent of a commit, so even with the pack optimized
> clients we should be pulling them now.
You cannot have a tag as a parent of a commit. commit-tree.c
explicitly checks for "commit" objects, and I think it is the
right thing to do [*1*]. You will also notice that at the end
of git-fetch-script, a tag is written in the .git/refs/tags/<name>
file as fetched, but the .git/FETCH_HEAD file records the commit
SHA1 if a tag is fetched. So, no, unless you are using rsync
transport to pull everything in sight, I do not think you will
pull tags you do not explicitly request to be pulled as part of
the commit chain (be it done by the old fashioned commit walker,
or the on-the-fly pack transfer). I do not think "finding a
dangling tag using git-fsck-cache" is something we particularly
want to have a special wrapper around [*2*], because the
user should not need to do it.
I do think we need a way to discover remote tags, an equivalent
to "wget $remote_repo/refs/tags/" (non recursive kind, just the
names). When to fetch them from remote, and where to store them
locally, however, are a different matter, I think.
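For the rsync transport, even something as crude as this sketch
would give you just the names (purely illustrative):
    $ rsync rsync://rsync.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/refs/tags/ |
      while read mode size date time name
      do
          # rsync lists a "." entry for the directory itself; skip it
          [ "$name" = "." ] || echo "$name"
      done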
Given that tags, especially the signed kind, are almost always
only made by the project lead and percolate down the patch
foodchain in practice, copying _all_ tags from the remote
repository like Jeff suggests makes sense in many situations,
but in general I think the namespace under the .git/refs
directory should be controlled by the local user [*3*]. As
Linus said before, you can choose to pull a tag from him only
because he told you about it. After learning about that tag,
deciding to pull the tag "v2.6.13-rc3" from his repository, and
storing it in the same ".git/refs/tags/v2.6.13-rc3" path locally
is your choice, not his [*4*].
I think the same can be said about the remote branch heads; an
obvious case is ".git/refs/heads/master".
"git-fetch-script" is very conservative. Only when you tell it
to fetch the tag <name>, it stores it in .git/refs/tags/<name>
locally. When you tell it to fetch the head via the short-hand
merchanism by having .git/branch/linus file that records the URL
of his repository, the head is stored in .git/ref/heads/linus.
Otherwise it does not touch .git/refs at all, and I think that
is the right thing to do.
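To put it concretely (a rough sketch; the URL is only an example):
    $ mkdir -p .git/branches
    $ echo rsync://rsync.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git \
          >.git/branches/linus
    $ git-fetch-script linus                    # head goes to .git/refs/heads/linus
    $ git-fetch-script linus tag v2.6.13-rc3    # tag goes to .git/refs/tags/v2.6.13-rc3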
Maybe we want to have "git-list-remote URL --tags --heads" for
discovery, and perhaps "--all-tags" flag to "git-fetch-script",
to cause it to fetch all remote tags.
[Footnote]
*1* I think I once sent a patch to break this, but luckily Linus
had a much better sense than me and dropped it. It is very nice
to have adult supervision ;-).
*2* I noticed you have already sent a patch about it.
*3* I am not saying what Jeff suggests is wrong. In his
suggestion, the user is making a conscious decision to accept
and use all tags Linus has in his repository as they are; and
that is one valid usage pattern.
*4* The tag discovery mechanism is one way for the remote
repository owner to tell you about the tags.
* Re: [PATCH] git-revover-tags-script
2005-07-17 0:51 ` Junio C Hamano
@ 2005-07-17 8:40 ` Eric W. Biederman
2005-07-17 18:53 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2005-07-17 8:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano <junkio@cox.net> writes:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> First pass at a script to dig through .git/objects and find dangling
>> tags. It likely has a lot of weird limitations, I don't know if it
>> will work with packs, and the policy it implments is pretty stupid,
>> but it is a sane start and should keep people from needing to
>> rsync anything except the .git/objects part of the tree.
>
> Also in an earlier message:
>
>> Do we want to put some porcelain around, git-fsck-cache --tags?
>> So we can discover the tag objects in the archive and place
>> them someplace usable. Jeff Garzik in his howto is still recommending:
>>
>>> git-pull-script only downloads sha1-indexed object data, and the
>>> requested remote head. This misses updates to the .git/refs/tags/
>>> and .git/refs/heads/ directories. It is advisable to update your
>>> kernel .git directories periodically with a full rsync command, to
>>> make sure you got everything:
>>>$ cd linux-2.6
>>>$ rsync -a --verbose --stats --progress \
>>>    rsync://rsync.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/ \
>>>    .git/
>>
>> Which feels like something is missing. Given that tags are
>> sha1-indexed objects we should be pulling them. And I believe you can
>> have a tag as a parent of a commit, so even with the pack optimized
>> clients we should be pulling them now.
>
> You cannot have a tag as a parent of a commit. commit-tree.c
> explicitly checks for "commit" objects, and I think it is the
> right thing to do [*1*]. You will also notice that at the end
> of git-fetch-script, a tag is written in the .git/tag/<name>
> file as fetched, but the .git/FETCH_HEAD file records the commit
> SHA1 if a tag is fetched. So, no, unless you are using rsync
> transport to pull everything in sight, I do not think you will
> pull tags you do not explicitly request to be pulled as part of
> the commit chain (be it done by the old fashioned commit walker,
> or the on-the-fly pack transfer). I do not think "finding a
> dangling tag using git-fsck-cache" is something we particularly
> want to have a special wrapper around for [*2*], because the
> user should not be needing to do it.
Sounds fine. I totally agree that a better method for finding
the tags is preferable. So far this is all I have, and currently
it works. However, since commits cannot have tags as their parents,
all non-local tags will show up as dangling tags
in git-fsck-cache. So it is a good general technique. I would
certainly prefer for us to process tags when they come in so we
don't need to play with git-fsck-cache.
> I do think we need a way to discover remote tags, an equivalent
> to "wget $remote_repo/refs/tags/" (non recursive kind, just the
> names). When to fetch them from remote, and where to store them
> locally, however, are different matter, I think.
What we care about are the tag objects, those are the only kind
that are verifiable and usable remotely.
Now that I know we do not pull tags currently with any of the
optimized transports, I would suggest taking the list of commit
objects we are transporting and for each commit look in the
remote repo/refs/tags and transferring every tag object we can find
that refers to that commit.
The implementation would likely be different from the description
above: probably simply finding the list of all remote tags (on the
remote end, for packs), indexing that list by commit sha1, and then
merging that list with the list of commits we are fetching.
Maybe we should create a reverse index like
repo/refs/tag_objects/<object_sha1>/<tag_sha1>s to make finding and
processing tag objects easier?
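Locally, building such an index would be something along these
lines (the layout is purely illustrative, not an existing git
convention):
    # assumes everything under refs/tags/ points at a tag object
    for t in "$GIT_DIR"/refs/tags/*
    do
        [ -f "$t" ] || continue
        tag_sha1=$(cat "$t")
        object_sha1=$(git-cat-file tag "$tag_sha1" | sed -ne 's/^object //p')
        mkdir -p "$GIT_DIR/refs/tag_objects/$object_sha1"
        echo "$tag_sha1" >"$GIT_DIR/refs/tag_objects/$object_sha1/$tag_sha1"
    done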
> Given that tags, especially the signed kind, are almost always
> only made by the project lead and percolate down the patch
> foodchain in practice, copying _all_ tags from the remote
> repository like Jeff suggests makes sense in many situations,
> but in general I think the namespace under the .git/refs
> directory should be controlled by the local user [*3*]. As
> Linus said before, you can choose to pull a tag from him only
> because he told you about it. After learning about that tag,
> deciding to pull the tag "v2.6.13-rc3" from his repository, and
> storing it in the same ".git/refs/tags/v2.6.13-rc3" path locally
> is your choice, not his [*4*].
>
> I think the same can be said about the remote branch heads; an
> obvious case is ".git/refs/heads/master".
Agreed. As my script was a first pass I was not handling tag name
conflicts or renaming; I was simply detecting the conflicts. As for tags,
everyone who maintains a public tree for consumption makes tags.
On kernel.org we recognize several trees and their associated tags.
2.6.x (linus's tags)
2.6.12.x (the sucker tree)
2.4.x (Marcelo's tags)
2.2.x
2.0.x
2.6.?-acx (Alan's tree)
2.6.?-mmx (Andrew's tree)
There are several additional trees by prominent kernel developers,
and then we have the distro vendors' trees.
So ultimately there are a lot of tags from a lot of different people
that I would care about. Having to grab the different branches separately
is sane, but having to grab each individual tag along the way starts
becoming a major pain.
Linus's comment about knowing a tag exists when he announces it
is silly even for Linus, because he doesn't always announce his
tags.
The way I envision using tags is that I have a local repository
that is maintained with automated scripts, basically pulling
the branches that I think I will care about. This mirroring needs
the option of happening preemptively, because publicly accessible
content disappears from time to time. So if you don't get it
when it is published you may never get it.
Eventually I will get a report about something in reference to a
particular tag that I care about. I will then find that tag,
perform a checkout, and look at the code.
> "git-fetch-script" is very conservative. Only when you tell it
> to fetch the tag <name>, it stores it in .git/refs/tags/<name>
> locally. When you tell it to fetch the head via the short-hand
> merchanism by having .git/branch/linus file that records the URL
> of his repository, the head is stored in .git/ref/heads/linus.
> Otherwise it does not touch .git/refs at all, and I think that
> is the right thing to do.
I agree that not updating the local .git/refs/heads or .git/refs/tags
automatically is a good thing. However, none of that precludes
automatically fetching or storing tag objects.
What happens in the rsync case (simply fetching the tag objects)
is perfectly serviceable, and a simple script is all that it
takes to pull out the tags and make them useful. The only usability
problem is that git-fsck-cache is slow, so having an index
is desirable.
> Maybe we want to have "git-list-remote URL --tags --heads" for
> discovery, and perhaps "--all-tags" flag to "git-fetch-script",
> to cause it to fetch all remote tags.
Something like that. After we get the tag objects with
the generic git mechanisms it becomes a question for porcelain what
to do with them. But porcelain can't do anything with tags if
we can't fetch tags or find tags.
> [Footnote]
>
> *1* I think I once sent a patch to break this, but luckily Linus
> had a much better sense than me and dropped it. It is very nice
> to have adult supervision ;-).
:) Until you replied I wasn't certain where we stood, and it
seemed faster to ask and start the needed conversation than to read
through the code.
> *2* I noticed you have already sent a patch about it.
What better way to start a conversation.
> *3* I am not saying what Jeff suggests is wrong. In his
> suggestion, the user is making a conscious decision to accept
> and use all tags Linus has in his repository as they are; and
> that is one valid usage pattern.
Agreed. A matter for the user or for the porcelain to decide.
> *4* The tag discovery mechanism is one way for the remote
> repository owner to tell you about the tags.
And tag discovery is what we don't have, and that is what
I am picking on.
Eric
* Re: [PATCH] git-revover-tags-script
2005-07-17 8:40 ` Eric W. Biederman
@ 2005-07-17 18:53 ` Junio C Hamano
2005-07-18 0:06 ` Eric W. Biederman
2005-07-18 0:19 ` Eric W. Biederman
0 siblings, 2 replies; 12+ messages in thread
From: Junio C Hamano @ 2005-07-17 18:53 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: git
ebiederm@xmission.com (Eric W. Biederman) writes:
> What we care about are the tag objects, those are the only kind
> that are verifiable and usable remotely.
>
> Now that I know we do not pull tags currently with any of the
> optimized transports, I would suggest taking the list of commit
> objects we are transporting and for each commit look in the
> remote repo/refs/tags and transferring every tag object we can find
> that refers to that commit.
I do not think it is particularly a good idea to fetch a tag
that refers to a commit when the user asks only for that commit
(e.g. the user said "the head of this remote branch I am
tracking", and the head happened to have been tagged). Yes, it
may be convenient, but retrieving the commit chain and
retrieving tags are conceptually separate issues. A tag does
not necessarily refer to a commit, so your reverse index does
not make sense for a tag pointing at a blob, for example.
I think if we have a discovery mechanism for remote tags/heads, we
do not need anything else. You _could_ say something like:
$ git-list-remote --tags linux-2.6
9e734775f7c22d2f89943ad6c745571f1930105f v2.6.12-rc2
26791a8bcf0e6d33f43aef7682bdb555236d56de v2.6.12
...
a339981ec18d304f9efeb9ccf01b1f04302edf32 v2.6.13-rc3
$ git-list-remote --tags linux-2.6 |
while read sha1 tag;
do
git fetch linux-2.6 tag $tag
done
and you are done. We did not use the reverse index, nor did we use
the --all-tags flag to git-fetch-script. You do not even need
git-list-remote if you are willing to wget a=summary output from
gitweb and parse the bottom of the page ;-).
The above may not exactly work for the linux-2.6 repository because
I think the "tag" form of git-fetch-script may expect to find a
tag that resolves to a commit object, and there is the oddball
v2.6.11-tree tag, but you got the general idea.
* Re: [PATCH] git-revover-tags-script
2005-07-17 18:53 ` Junio C Hamano
@ 2005-07-18 0:06 ` Eric W. Biederman
2005-07-18 1:13 ` Junio C Hamano
2005-07-18 0:19 ` Eric W. Biederman
1 sibling, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2005-07-18 0:06 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano <junkio@cox.net> writes:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> What we care about are the tag objects, those are the only kind
>> that are verifiable and usable remotely.
>>
>> Now that I know we do not pull tags currently with any of the
>> optimized transports, I would suggest taking the list of commit
>> objects we are transporting and for each commit look in the
>> remote repo/refs/tags and transferring every tag object we can find
>> that refers to that commit.
>
> I do not think it is particularly a good idea to fetch a tag
> that refers to a commit when the user asks only for that commit
> (e.g. the user said "the head of this remote branch I am
> tracking", and the head happened to have been tagged). Yes, it
> may be convenient, but retrieving the commit chain and
> retrieving tags are conceptually separate issues. A tag does
> not necessarily refer to a commit, so your reverse index does
> not make sense for a tag pointing at a blob, for example.
After thinking it through I have to agree but not for your reasons.
The killer argument for me is that tags can be made at any time,
which means that any incremental scheme that links pulling of tags
to the pulling of the objects they refer to will fail when
the tag is made after you have pulled the object.
So at the very least the computation of which tags to pull needs
to be separate from the computation of which object to pull.
> I think if we have discovery mechanism of remote tags/heads, we
> do not need anything else. You _could_ say something like:
>
> $ git-list-remote --tags linux-2.6
> 9e734775f7c22d2f89943ad6c745571f1930105f v2.6.12-rc2
> 26791a8bcf0e6d33f43aef7682bdb555236d56de v2.6.12
> ...
> a339981ec18d304f9efeb9ccf01b1f04302edf32 v2.6.13-rc3
> $ git-list-remote --tags linux-2.6 |
> while read sha1 tag;
> do
> git fetch linux-2.6 tag $tag
> done
>
> and you are done. We did not use the reverse index, nor we used
> the --all-tags flag to git-fetch-script. You do not even need
> git-list-remote if you are willing to wget a=summary output from
> gitweb and parse the bottom of the page ;-).
I agree that anything we do will need to look roughly like the
above. Beyond a simple index of what tags are present
in the objects directory I can't think of anything that would
be a cost savings, except possibly ordering the tags by creation
date.
There are a couple of pieces of your example that disturb me:
- The tag names are forced to be the same between trees.
- You don't verify the tags before installing them.
- I view tags as history, and by having tag fetching totally
  separate it becomes easy to lose that history.
I do like the fact that when you fetch a tag you are certain
to fetch all of the objects it refers to.
I don't know what the solution is but we seem to be getting closer.
Eric
* Re: [PATCH] git-revover-tags-script
2005-07-17 18:53 ` Junio C Hamano
2005-07-18 0:06 ` Eric W. Biederman
@ 2005-07-18 0:19 ` Eric W. Biederman
1 sibling, 0 replies; 12+ messages in thread
From: Eric W. Biederman @ 2005-07-18 0:19 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano <junkio@cox.net> writes:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> What we care about are the tag objects, those are the only kind
>> that are verifiable and usable remotely.
>>
>> Now that I know we do not pull tags currently with any of the
>> optimized transports, I would suggest taking the list of commit
>> objects we are transporting and for each commit look in the
>> remote repo/refs/tags and transferring every tag object we can find
>> that refers to that commit.
>
> I think if we have discovery mechanism of remote tags/heads, we
> do not need anything else. You _could_ say something like:
>
> $ git-list-remote --tags linux-2.6
> 9e734775f7c22d2f89943ad6c745571f1930105f v2.6.12-rc2
> 26791a8bcf0e6d33f43aef7682bdb555236d56de v2.6.12
> ...
> a339981ec18d304f9efeb9ccf01b1f04302edf32 v2.6.13-rc3
> $ git-list-remote --tags linux-2.6 |
> while read sha1 tag;
> do
> git fetch linux-2.6 tag $tag
> done
Actually, looking a little deeper, unless I have misread
the code, git-fetch-pack at least will only ask for commit
objects, so git fetch will never return a tag object.
I have yet to find where git-fetch-pack actually prints
objects out, so I may still be missing something.
Eric
* Re: [PATCH] git-revover-tags-script
2005-07-18 0:06 ` Eric W. Biederman
@ 2005-07-18 1:13 ` Junio C Hamano
2005-07-18 5:40 ` Eric W. Biederman
0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2005-07-18 1:13 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: git
ebiederm@xmission.com (Eric W. Biederman) writes:
> There are a couple pieces of your example that disturb me.
Did you actually think I suggested you make that into a
script that cannot be configured? No, it was Junio acquiring a
habit from Linus of giving a rough outline in code form in his
e-mail client.
In another message, you said:
> Actually looking a little deeper unless I have misread
> the code git-fetch-pack at least will only ask for commit
> objects so git fetch will never return a tag object.
I thought so too, but then I tried it, and it actually does seem
to work as expected (well, it is Linus's code so it has to be
perfect ;-). I created an empty directory and ran the following
script. It creates two commits, tags the later commit as
".git/refs/tags/one", and shows the list of objects the
upload-pack (the peer git-fetch-pack talks to) decides to pack
and send to a puller that has the first commit only. The
first git-rev-list shows one extra object compared to the second
one; the difference is the named tag that is being asked for.
------------
#!/bin/sh
rm -fr .git
git-init-db
zero_tree=$(git-write-tree)
echo "base tree $zero_tree"
zero_commit=$(
echo Empty tree as the base |
git-commit-tree $zero_tree
)
echo "base commit $zero_commit"
echo >a
git-update-cache --add a
one_tree=$(git-write-tree)
echo "one tree $one_tree"
one_commit=$(
echo Add one file |
git-commit-tree $one_tree -p $zero_commit
)
echo "one commit $one_commit"
tagger=$(git-var GIT_COMMITTER_IDENT)
echo "object $one_commit
type commit
tag tag-one
tagger $tagger

just a tag." | git-mktag >.git/refs/tags/one
echo "one tag `cat .git/refs/tags/one`"
echo "*** reachable from one tag but not from zero"
git-rev-list --objects tags/one ^$zero_commit
echo "*** reachable from one commit but not from zero"
git-rev-list --objects $one_commit ^$zero_commit
* Re: [PATCH] git-revover-tags-script
2005-07-18 1:13 ` Junio C Hamano
@ 2005-07-18 5:40 ` Eric W. Biederman
2005-07-18 6:36 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2005-07-18 5:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano <junkio@cox.net> writes:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> Actually looking a little deeper unless I have misread
>> the code git-fetch-pack at least will only ask for commit
>> objects so git fetch will never return a tag object.
>
> I thought so but then I tried it and actually it does seem to
> work as expected (well, it is Linus code so it has to be perfect
> ;-).
Yep. I confused the want and have cases when I was reading
the code.
A generalization of git-fetch-pack that can handle multiple
heads looks like it would handle the transfer part of the
problem with tags. git-clone-pack already does. Then
all that is needed is a sane way to list the heads that
are read back, and some post-processing to install everything.
The big question is in what format we should return the heads:
just a space-separated list of sha1's, or a directory hierarchy
like git-clone-pack uses?
Eric
* Re: [PATCH] git-revover-tags-script
2005-07-18 5:40 ` Eric W. Biederman
@ 2005-07-18 6:36 ` Junio C Hamano
0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2005-07-18 6:36 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: git
ebiederm@xmission.com (Eric W. Biederman) writes:
> The big question is in what format should we return the heads?
> Just a space separated list of sha1's or a directory hierarchy
> like git-clone-pack uses.
My knee-jerk reaction is something like this:
$ git-list-remote jg-libata
9956d54ace3c64512d0c5498e0137180741e5d04 heads/adma
433e7832818faf93c0f366fea3e14773cdcf3811 heads/adma-mwi
...
80ebd62e0cca50869da2d5159fa4d6b723f0c014 heads/sil24
9e734775f7c22d2f89943ad6c745571f1930105f tags/v2.6.12-rc2
26791a8bcf0e6d33f43aef7682bdb555236d56de tags/v2.6.12
...
a339981ec18d304f9efeb9ccf01b1f04302edf32 tags/v2.6.13-rc3
That is, SHA1 and path relative to .git/refs separated with a
TAB, and terminated with LF.
I do not care too much about the protocol level, but since we
are not talking about hundreds of heads and tags, probably
the simplest would be to use the same format on the wire, or use
SP instead of TAB there to match the upload-pack protocol.
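Consumers would then do something like this with it (illustrative
only; git-list-remote does not exist yet):
    $ git-list-remote jg-libata |
      while read sha1 ref
      do
          case "$ref" in
          tags/*)  echo "tag  ${ref#tags/}  is at $sha1" ;;
          heads/*) echo "head ${ref#heads/} is at $sha1" ;;
          esac
      done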
I think the bigger question is how to help the user manage and
store this information in his .git/refs/tags hierarchy.
The mechanism to store the URL and head in branches/<name>, and
copy the head value in the corresponding refs/heads/<name> was
borrowed from Cogito, and I think it covers the refs/heads side
quite well. The user gives a name to the branch of a foreign
repository he is interested in, the fetched head from there is
stored under the same <name>, and so the namespace under refs/heads
and branches stays totally under the user's control.
If somebody cares about automated fetching of all the tags from
a remote repository, probably the easiest way would be to create
a subdirectory that corresponds to the short-hand name and use
that directory to store all tags slurped from there. But I am
not convinced myself that this is all that useful.
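The layout I have in mind would be something like this (purely
illustrative):
    $ mkdir -p .git/refs/tags/linus
    $ echo a339981ec18d304f9efeb9ccf01b1f04302edf32 \
          >.git/refs/tags/linus/v2.6.13-rc3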
* [RFD] server-info to help clients
2005-07-16 20:20 [PATCH] git-revover-tags-script Eric W. Biederman
2005-07-17 0:51 ` Junio C Hamano
@ 2005-07-20 0:20 ` Junio C Hamano
2005-07-20 0:35 ` David Lang
1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2005-07-20 0:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git, Eric W. Biederman
While things are quiet (I envy everybody having fun at OLS),
I've been cooking something to help clients pull from dumb
servers.
I assume that:
- The object database is packed, following the recommendations
in the "Working with Others" section of the tutorial.
- The repository owner _may_ further create throw-away
incremental packs. There can be the following in one object
database:
- one baseline pack.
- permanent incremental packs #1 .. #N
- one throw-away incremental pack.
- unpacked files under objects/??/.
Baseline and permanent incremental packs are built by "git
repack", just like Linus recommended from the beginning. The
throwaway pack is built periodically (say every hour) to
collect all objects that are in neither the baseline nor the
permanent incrementals. Building such a throw-away pack
involves:
- unpacking and removal of the current throw-away pack.
- running "git repack".
- running "git prune-packed".
- The server could be truly dumb and can even refuse to serve
dirindex; parsing autogenerated index.html is a pain anyway.
First, a somewhat related change I did was to write a script
called "git ls-remote". It is used this way:
$ git ls-remote origin
17c0bd743c1c8113cd0ed72b7ca1776d13c27e01 HEAD
17c0bd743c1c8113cd0ed72b7ca1776d13c27e01 refs/heads/master
f0b32737ad5a35cc047db47353a75faccfe5939e refs/heads/linus
4d9ae497491fd838dafd7fcbd11c4aa678a726f1 refs/heads/pu
d6602ec5194c87b0fc87103ca4d67251c76f233a refs/tags/v0.99
f25a265a342aed6041ab0cc484224d9ca54b6f41 refs/tags/v0.99.1
It slurps the set of refs from a remote repository (the same
short-hand we stole from Cogito using .git/branches/ can be used
here) and optionally it can be told to store tags under local
refs/.
This is produced by connecting directly to the git-daemon
running on the remote side and talking upload-pack protocol with
it. A new helper program "git-peek-remote" is used to do this
when we use git:// URL. From an rsync URL, everything under its
refs/ is copied to a temporary directory to produce the same
information.
To support the same on a dumb transport, I gave the server side
a new command, "git update-server-info", which prepares this
information in "$repo/info/refs", so writing http support for
"git ls-remote" using curl is trivial. I arranged things so
that update-server-info is run whenever you push into the
repository via "git push". You can of course run it by hand
from the command line.
The other file that update-server-info produces is to help dumb
pullers. It is stored in "$repo/objects/info/pack", and looks
like this:
P pack-c60dc6f7486e34043bd6861d6b2c0d21756dde76.pack
P pack-e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135.pack
D 0 1
D 1
T 0 9fb1759a3102c26cd8f64254a7c3e532782c2bb8 commit
T 0 a339981ec18d304f9efeb9ccf01b1f04302edf32 tag
T 1 0397236d43e48e821cce5bbe6a80a1a56bb7cc3a tag
T 1 043d051615aa5da09a7e44f1edbb69798458e067 commit
T 1 06f6d9e2f140466eeb41e494e14167f90210f89d tag
T 1 26791a8bcf0e6d33f43aef7682bdb555236d56de tag
T 1 5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c tag
T 1 701d7ecec3e0c6b4ab9bb824fd2b34be4da63b7e tag
T 1 733ad933f62e82ebc92fed988c7f0795e64dea62 tag
T 1 9e734775f7c22d2f89943ad6c745571f1930105f tag
T 1 c521cb0f10ef2bf28a18e1cc8adf378ccbbe5a19 tag
T 1 ebb5573ea8beaf000d4833735f3e53acb9af844c tag
The lines that start with a 'P' list all the packs available in
this object database (relative to $repo/objects/pack). These
packs are implicitly numbered starting at 0 in the order they
appear in the file; in the above, the pack c60dc6... is pack #0
and e3117b... is pack #1.
The lines that start with a 'D' list the dependencies. "D 0 1"
says, pack #0 is not complete and refers to objects found in
pack #1 (e.g. a commit object in pack #0 has a subtree that is
the same one found in pack #1 hence pack #0 does not contain
that tree). "D 1" shows that the pack #1 is self sufficient and
does not depend on anything (it is the linux-2.6 baseline pack).
Of course, you could have a pack that depends on more than one
pack, in which case you would see something like "D 4 1 2 3" to
mean pack #4 depending on packs #1, #2 and #3.
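Chasing the D lines is a tiny recursive walk over a fetched copy
of the file; a rough sketch:
    closure () {
        # print pack $1 and, recursively, every pack it depends on,
        # reading a locally fetched copy of $repo/objects/info/pack
        # (a pack may be printed more than once; sort -u if you care)
        echo "$1"
        for d in $(sed -ne "s/^D $1 //p" objects/info/pack)
        do
            closure "$d"
        done
    }
    closure 0    # for the example above this prints 0 and then 1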
If the repository follows the "baseline, permanent incrementals,
and one throw-away" scheme I outlined above, the baseline would
be self sufficient, most likely incremental #i would depend on
the baseline and all the incrementals #j (j < i), and the
throw-away would depend on everybody else.
The lines that start with a 'T' list objects in a pack that are
not referenced by anything else in the same pack (they are
typically branch heads and tags). We can see that pack #0 has
one head commit and a tag in the above example.
This file always resides at a known location. A client can do
something like this to slurp from a dumb server:
(1) Fetch $repo/objects/info/pack file for the above
information.
(2) Look at T lines. If you have all the objects listed there
for a pack, and if your repository is not incomplete to begin
with, you are not interested in that pack. By definition, all
things that are in that pack are reachable from one of those
objects listed on the T lines, and you already have them.
Otherwise, you _may_ be interested in that pack.
(3) Download corresponding .idx files for the packs you are
interested in. Run "git show-index" to see if the heads/tags
you are interested in appear in one of them (you found out
about the heads/tags using "git ls-remote" earlier). If you
find a pack that contains objects you are interested in, look
at D lines to make sure you have all the head objects from
packs that this pack depends on; otherwise you need to slurp
those depended-upon packs as well (needless to say, this goes
recursive).
(4) Download the packs you decided to pick in the previous
step. It is up to you if you unpack those packs, but if
the upstream has it statically packed I would recommend
against unpacking. Next time around you can just look at
the name of the pack and decide you already have that pack.
On the other hand, keeping a throw-away pack packed may not make
much sense. You can unpack the throw-away and then run
"git prune-packed" in your repository the next time you fetch
the pack info file and notice that the pack is already gone
from the remote repository.
(5) Fill the rest using the commit walker.
The initial client implementation which is _really_ dumb could
even skip steps (2) and (3) and choose to always download/sync
all available packs from the dumb server, and directly go to
step (5) to fall back on the commit walker.
I haven't written the client side, but all the rest that is
necessary to support the above will be sent to the list as
separate patches.
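Step (1) over http, for example, would be hardly more than this
(illustrative only; as I said, I have not written the client side):
    repo=http://www.example.com/project.git   # illustrative URL
    curl -s "$repo/info/refs"                 # what ls-remote would show
    curl -s "$repo/objects/info/pack" |
    while read key rest
    do
        case "$key" in
        P) echo "available pack: $rest" ;;
        D) echo "dependencies:   $rest" ;;
        T) echo "tip object:     $rest" ;;
        esac
    done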
* Re: [RFD] server-info to help clients
2005-07-20 0:20 ` [RFD] server-info to help clients Junio C Hamano
@ 2005-07-20 0:35 ` David Lang
2005-07-20 1:53 ` Junio C Hamano
0 siblings, 1 reply; 12+ messages in thread
From: David Lang @ 2005-07-20 0:35 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, git, ebiederm
I wonder how much benefit there is to the throw-away packs.
If you do permanent incremental packs every day (or every few days) is
there really enough activity to make it worth the added complexities
(specifically including detecting that it is a throw-away pack on the client
side and therefore you probably don't want to keep it) for the slight
performance increase you may get?
Remember that since deltas only work within a pack, the throw-away pack
will only be noticeably smaller once you start having one file modified
multiple times before a new incremental pack is created, so you aren't
likely to save much on space; all you are likely to save is the overhead
of fetching multiple objects compared to one object.
Going forward it may be worth having a smarter packing program to support
HPA's goal of a central object storage, one that can make decisions like:
'object A is part of this 40% of the trees, while object B is part of that
other 40% (disjoint set), so it's probably a good idea to put them into
separate packs'.
But that can be done much further down the road without having to change
the clients at all.
David Lang
--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare
* Re: [RFD] server-info to help clients
2005-07-20 0:35 ` David Lang
@ 2005-07-20 1:53 ` Junio C Hamano
0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2005-07-20 1:53 UTC (permalink / raw)
To: David Lang; +Cc: Linus Torvalds, git, ebiederm
The management of multiple packs and strategy of deciding when
to create the next incremental (be it throw-away or permanent)
is something I am not particularly interested in at this moment,
and as you correctly pointed out, the "single throw-away pack"
is an example of _bad_ strategy [*1*]. I am more interested in
designing a concise way to express what the server side has
(after applying such packing/repacking strategy) to help clients
coming over a dumb transport.
One thing that I forgot to mention is that there is another piece
of per-repository information, "$repo/info/revinfo". This lists all
the commit ancestry reachable from "$repo/refs", and is needed
for clients to find, starting from the very tips of the branches
(which are likely not packed yet), the closest commit that appears
as a head in "$repo/objects/info/pack", and go from there.
[Footnote]
*1* As I said, I am not interested in thinking about this at
this moment, but I suspect a scheme that employs the base pack,
permanent incrementals, and N new throw-aways every day for the
N-th day of the month may work reasonably well.
On the N-th day of the month, you create incrementals relative
to what existed on the (N-1)th, (N-2)th, ..., 1st of the month. At
the end of the day, create N+1 new throw-aways for the (N+1)th day
of the month (you can garbage collect older days' throw-away
incrementals whenever you like). At the end of the month, you
mark the throw-away incremental that is relative to the
beginning of the month as the latest permanent incremental.
Bootstrappers can slurp the base, the permanent incrementals, and
today's throw-away that is relative to the last permanent
incremental. Updaters can pick the one relative to the day they
updated last time.
Thread overview: 12+ messages
2005-07-16 20:20 [PATCH] git-revover-tags-script Eric W. Biederman
2005-07-17 0:51 ` Junio C Hamano
2005-07-17 8:40 ` Eric W. Biederman
2005-07-17 18:53 ` Junio C Hamano
2005-07-18 0:06 ` Eric W. Biederman
2005-07-18 1:13 ` Junio C Hamano
2005-07-18 5:40 ` Eric W. Biederman
2005-07-18 6:36 ` Junio C Hamano
2005-07-18 0:19 ` Eric W. Biederman
2005-07-20 0:20 ` [RFD] server-info to help clients Junio C Hamano
2005-07-20 0:35 ` David Lang
2005-07-20 1:53 ` Junio C Hamano