* Re: Does GIT has vc keywords like CVS/Subversion?
From: Johannes Schindelin @ 2006-10-09 22:48 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Linus Torvalds, Liu Yubao, Dongsheng Song, git
Hi,
On Tue, 10 Oct 2006, Martin Langhoff wrote:
> On 10/10/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> > - outside of the SCM, keyword substitution can make sense, but doing it
> > should be in helper scripts or something that can easily tailor it for
> > the actual need of that particular project.
... like a pre-commit hook.
> If we have a tool that I can pass a file or a directory tree and will
> find the (perfectly|closely) matching trees and related commits.
>
> For the single file case, searching for an exact SHA1 match is easy,
> as is by path.
If you have the path, you can reuse the whole algorithm for finding the
best delta base.
However, if you do not have the path, you might as well just give up (if
there is no perfect match for the SHA1), since the SHA1 is _not_ similar
for similar contents. IOW, you'd literally have to search _all_ objects in
the repository, which usually takes a long, long time.
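A minimal sketch of the exact-match case described above, using only stock git plumbing (file names are made up for illustration): a blob's SHA1 is a pure content hash, so a pathless file can be looked up directly, but only on a byte-for-byte match.

```shell
# Sketch: hash the pathless file's content and ask the object
# database whether that exact blob already exists.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
echo 'some tracked content' > tracked.txt
git add tracked.txt
git -c user.name=t -c user.email=t@example.com commit -qm 'add file'

echo 'some tracked content' > mystery   # same bytes, no path known
id=$(git hash-object mystery)           # content hash; the path plays no part
if git cat-file -e "$id" 2>/dev/null; then
    echo "exact match: $id"
fi
```

If the mystery file differs by even one byte, `cat-file -e` fails and, as noted, there is no cheap fallback.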
Ciao,
Dscho
* Re: Does GIT has vc keywords like CVS/Subversion?
From: Martin Langhoff @ 2006-10-09 22:57 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Linus Torvalds, Liu Yubao, Dongsheng Song, git
On 10/10/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> If you have the path, you can reuse the whole algorithm for finding
> the best delta base.
Can I do that from Perl/bash? (how?)
> However, if you do not have the path, you might as well just give up (if
> there is no perfect match for the SHA1), since the SHA1 is _not_ similar
> for similar contents. IOW, you'd literally have to search _all_ objects in
> the repository, which usually takes a long, long time.
So the delta base algorithm doesn't work without a path. I thought we
had a quick way to find blobs of similar size. If the user can't even
give us a filename (that we can use to try and build a likely path)
then they have bigger problems than the delta ;-) -- at some point we
have to provide git-paddedcell for the remaining <ahem> users.
cheers,
Martin
* Re: Does GIT has vc keywords like CVS/Subversion?
From: Junio C Hamano @ 2006-10-09 22:55 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
"Martin Langhoff" <martin.langhoff@gmail.com> writes:
> is there a way to scan the object store for blobs of around a given
> size (as the packing code does) from Perl?
For objects in packs, verify-pack -v comes to mind (show-index
might show the same information). Loose objects need help from
git-cat-file -s or git-cat-file -t, or both.
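A sketch of how those commands fit together (repository contents are made up; verify-pack's -v columns are SHA1, type, uncompressed size, size in pack, and offset, plus depth and base for deltified entries):

```shell
# Packed objects via verify-pack -v, loose objects via cat-file -s/-t.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
seq 1 50 > f
git add f
git -c user.name=t -c user.email=t@example.com commit -qm init
git repack -ad -q                      # pack everything up

# One line per packed object: SHA1, type, size, size-in-pack, offset.
git verify-pack -v .git/objects/pack/pack-*.idx

# Loose objects have to be asked about one at a time.
id=$(echo 'a loose blob' | git hash-object -w --stdin)
git cat-file -s "$id"                  # decompressed size in bytes
git cat-file -t "$id"                  # object type
```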
* Re: Does GIT has vc keywords like CVS/Subversion?
From: Rene Scharfe @ 2006-10-10 7:37 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Linus Torvalds, Liu Yubao, Dongsheng Song, git
Martin Langhoff schrieb:
> For the outside of the SCM case, keyword subst is useful indeed if
> someone has a $version_unknown tarball, unpacks it and hacks away. It
> is a pretty broken scenario, and less likely to happen nowadays with
> easy access to SCM tools.
If you still have the tar file, and if it has been created using
git-archive or git-tar-tree it may contain the commit ID in an archive
comment. You can use git-get-tar-commit-id to extract it in that case.
This won't work with official git project tarballs btw., as commit ID
embedding has been turned off. The reason is that older tar versions
extracted the comment as a regular file, which confused users.
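For reference, a sketch of the round trip René describes (file names hypothetical; works when the archive was made from a commit, so the ID gets embedded as a tar comment):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
echo content > file
git add file
git -c user.name=t -c user.email=t@example.com commit -qm init

git archive --format=tar HEAD > snapshot.tar   # embeds commit ID as comment
embedded=$(git get-tar-commit-id < snapshot.tar)
echo "tarball was made from commit $embedded"
```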
René
* Re: Does GIT has vc keywords like CVS/Subversion?
From: Shawn Pearce @ 2006-10-10 16:49 UTC (permalink / raw)
To: Martin Langhoff; +Cc: git
Martin Langhoff <martin.langhoff@gmail.com> wrote:
> However, I don't think that scenario is hard to support and Git can
> have a much better story to tell than keyword substituting SCMs.
>
> If we have a tool that I can pass a file or a directory tree and will
> find the (perfectly|closely) matching trees and related commits.
>
> For the single file case, searching for an exact SHA1 match is easy,
> as is by path. If we get a file without a path it gets a bit harder --
> is there a way to scan the object store for blobs of around a given
> size (as the packing code does) from Perl? Actually, if we find a
> relatively close match, it'd be useful to ask git if it's deltified
> and ask for other members of the delta chain.
git-verify-pack -v will print every SHA1, its type and its
decompressed size. It also prints what its delta base is. It's also
not very fast. However, if you run that on a pack file once and
cache the result, then you have much of the data you are looking for.
You can find objects within a margin of error of the blob size,
then find all objects in those delta chains. Then start fetching
those objects and comparing contents. But this is brutal and will
take a long time due to the sheer number of objects that probably
would fall into that size bucket.
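The cache-then-bucket approach could look roughly like this (a sketch; file names and the 20% size window are arbitrary):

```shell
# Cache verify-pack -v output once, then pick blobs near a target size.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
seq 1 100 > big
echo hi > small
git add . && git -c user.name=t -c user.email=t@example.com commit -qm init
git repack -ad -q
git verify-pack -v .git/objects/pack/pack-*.idx > pack.cache   # do this once

target=$(git cat-file -s "$(git hash-object big)")
# Column 2 is the object type, column 3 the uncompressed size.
hits=$(awk -v t="$target" \
    '$2 == "blob" && $3 >= t * 0.8 && $3 <= t * 1.2 { print $1 }' pack.cache)
echo "candidates: $hits"
```

From here you would still have to fetch each candidate (and its delta chain neighbours) and compare contents, which is the slow part Shawn warns about.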
The single file case without a path is not an easy problem. Even if
you have an exact SHA1 match (an unmodified file), it's difficult
to find which commits used that SHA1 somewhere within their trees.
You need to unpack every tree in every commit and test every
entry for a match. That's going to take a while on any decent
sized repository.
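The brute-force tree scan just described, as a sketch (walk every commit, flatten its tree, grep for the blob ID; repository contents are made up):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
mkdir sub
echo payload > sub/needle.txt
git add . && git -c user.name=t -c user.email=t@example.com commit -qm one
echo other > unrelated.txt
git add . && git -c user.name=t -c user.email=t@example.com commit -qm two

blob=$(git hash-object sub/needle.txt)
for c in $(git rev-list --all); do
    # ls-tree -r flattens the commit's whole tree: mode, type, SHA1, path.
    if git ls-tree -r "$c" | grep -q "$blob"; then
        echo "$c contains $blob"
    fi
done
```

This is O(commits x tree entries), which is exactly why it doesn't scale to a decent-sized repository.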
Most maintainers would just toss the modified file back at the sender
and say "Uh, where did this file come from?!" And rightly so.
A maintainer familiar with that section of the repository might
recognize some of the file contents and be able to guess the
filename. So in short I don't think the single file case without a
filename is doable, and I don't think it's very useful either.
> For the directory tree case, the ideal thing would be to build a
> temporary index without getting the blobs in the object store, and
> then do a first pass trying to match tree SHA1s. If the user has
> modified a few files in a large project, it'll be trivial to find a
> good candidate commit for delta. OTOH, if the user has indulged in
> wide ranging search and replace... it will be well deserved pain ;-)
You have a chance in the tree case. If you have the entire tree
as a working directory and the modifications made are limited to
a handful of paths then you can load that working directory into a
set of tree objects and perform a match process by walking backwards
through the commit chains looking for trees which have a high number
of paths in common with the working directory.
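A sketch of that tree-matching process (the temporary index name and repository contents are made up): write the snapshot into a throwaway index, turn it into a tree object, then score each commit by how many paths differ.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
printf 'a\nb\nc\n' > f1
echo v1 > f2
git add . && git -c user.name=t -c user.email=t@example.com commit -qm c1
echo v2 > f2
git add . && git -c user.name=t -c user.email=t@example.com commit -qm c2

# Pretend the worktree plus one local edit is the mystery snapshot.
echo extra >> f1

# Load the snapshot into a temporary index, write it as a tree object.
GIT_INDEX_FILE=.git/tmp-index git add -A .
tree=$(GIT_INDEX_FILE=.git/tmp-index git write-tree)

best=''; bestscore=999999
for c in $(git rev-list --all); do
    # diff-tree -r prints one line per differing path.
    score=$(git diff-tree -r "$c" "$tree" | wc -l | tr -d ' ')
    if [ "$score" -lt "$bestscore" ]; then
        bestscore=$score
        best=$c
    fi
done
echo "closest commit: $best ($bestscore paths differ)"
```

This scans every commit; walking backwards and stopping early, as suggested, would be the obvious refinement.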
Unfortunately this also has limited use (but I have one myself!).
If you got the entire working directory from a submitter, then that
implies they took your entire distribution, unpacked it, hacked away,
repacked it and sent you the tar/zip file. That's significantly
larger than a simple patch file produced by diff -R. As a maintainer
you probably should be kicking that back at the user and saying
"Uh, please submit a patch instead, thanks."
I actually have a scenario where I'm using Git to track another
(much, much crappier) file revision storage tool that would probably
benefit from this, but the benefit is relatively low.
I'm completely unable to read that tool's version data. The only
thing I can get from that tool is a snapshot of files as they exist
at the point in time that I am running the snapshot. The snapshots
aren't always consistent with themselves. Worse they take upwards
of 30 minutes to run, can only run on a Windows desktop, and consume
100% of the CPU while running. So we cannot get them very often.
I have several users working on those files in Git through a common
shared repository. We send changes to that file revision storage
tool on a frequent basis, say up to 3-5 times per day. Each such
change is basically a squashed merge commit in Git terminology,
so the fine grained commits in Git aren't being preserved by that
storage tool, despite being in our shared Git repository.
Many days later most of the changes the users put into the storage
tool suddenly appear on the next snapshot we obtain from it. I say
most because sometimes the powers that be either don't permit a
change to show up in the snapshot and delay it for a while, or
because they actually wanted to include a change but someone fat
fingered the storage controls and the change got omitted. Yet the
powers that be *believe* the change is included, right up to the
point where testing accuses development of not fixing the bug,
despite the fix being there in the file revision storage tool.
Now I'd like to take these snapshots every so often, load them
into Git on a special branch just for the snapshots, then generate
a merge commit on that branch which merges the real commit that
corresponds as closely as possible to this snapshot into the
snapshot branch. Part of the reason for doing this is to look
for unexpected differences between what Git has and what the file
revision storage tool has.
But doing that is nearly impossible, so I don't.
--
Shawn.
* Re: Does GIT has vc keywords like CVS/Subversion?
From: Linus Torvalds @ 2006-10-10 17:14 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Martin Langhoff, git
On Tue, 10 Oct 2006, Shawn Pearce wrote:
>
> Now I'd like to take these snapshots every so often, load them
> into Git on a special branch just for the snapshots, then generate
> a merge commit on that branch which merges the real commit that
> corresponds as closely as possible to to this snapshot into the
> snapshot branch. Part of the reason for doing this is to look
> for unexpected differences between what Git has and what the file
> revision storage tool has.
>
> But doing that is nearly impossible, so I don't.
Well, it probably wouldn't be too nasty to try to have a "find nearest
commit" kind of thing. It's not quite as simple as bisection, but you
could probably use a bisection-like algorithm to do something like a
binary search to try to guess which tree is the closest.
In other words, if you just give git a "range" of commits to look at, and
let a bisection-like thing pick a mid-way point, you can then compare the
mid-way point and the end-points (more than two) against the target tree,
and then pick the range that looks "closer".
I wouldn't guarantee that it finds the best candidate (since the "closer"
choice will inevitably not guarantee a monotonic sequence), but I think
you could find something reasonably close most of the time.
If you do a lot of branching, you'd have to be a lot smarter about it
(since you'd not have _one_ commit for beginning/end), but in a
straight-line tree it should be really trivial, and in a branchy one I
think it should still be quite doable.
I dunno. It might be useful even if it's just a heuristic, in a "try to
find a commit in the range X..Y that generates the smallest diff when
compared against this tree". If it finds something sucky, you can try to
look at the history of one of the files that generates a big diff, and try
to give a better range - the automation should hopefully have given you
_some_ clues.
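A crude linear version of that heuristic (no bisection; it scores every commit in the range by diff size against the current tree; repository contents are made up) might look like:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
for i in 1 2 3; do
    seq 1 "$((i * 10))" > data
    git add data
    git -c user.name=t -c user.email=t@example.com commit -qm "rev $i"
done

# The target tree: revision 2's content plus a small local tweak.
git checkout -q HEAD~1 -- data
echo tweak >> data

# Score each commit by the size of its diff against the working tree;
# the smallest diff wins.
for c in $(git rev-list HEAD); do
    printf '%s %s\n' "$(git diff "$c" -- data | wc -c | tr -d ' ')" "$c"
done | sort -n | head -n1
```

A real bisection would replace the full `rev-list` walk with repeated range-halving on these scores, trading accuracy for speed as described above.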
Linus
* Re: Does GIT has vc keywords like CVS/Subversion?
From: Junio C Hamano @ 2006-10-10 17:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git, Shawn Pearce, Martin Langhoff
Linus Torvalds <torvalds@osdl.org> writes:
> Well, it probably wouldn't be too nasty to try to have a "find nearest
> commit" kind of thing. It's not quite as simple as bisection, but you
> could probably use a bisection-like algorithm to do something like a
> binary search to try to guess which tree is the closest.
I had to do something like that in my day job once. A customer
installation was made from a tarball of unknown vintage, and
then field patched with later fixes.
I ended up slurping the thing back and populated my index with
it. Luckily I could guess a good initial point to find the
commit that gives minimum "git diff" output. Then from the
remaining patches it was reasonably easy to find out which
changes were cherry-picked by hand with "git log master --
$paths".