Git development


* Re: Can't find the revelant commit with git-log
From: Francis Moreau @ 2011-01-29 13:57 UTC
  To: René Scharfe; +Cc: git, Johannes Sixt
In-Reply-To: <4D440FED.2010203@lsrfire.ath.cx>

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> On 29.01.2011 13:52, Francis Moreau wrote:
>> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
>> 
>>> On 26.01.2011 19:11, René Scharfe wrote:
>
>>>> - Make git grep report non-matching path specs (new feature).
>>>
>>> This is a bit complicated because grep can work on files, index entries
>>> as well as versioned objects and supports wildcards,
>>> so it's not that easy to tell if a path spec matches something or is
>>> rather a typo.  But it's not impossible either, of course.
>> 
>> I don't understand this for the following use case:
>> 
>>    $ cd ~/linux-2.6/drivers/pci/
>>    $ git grep blacklist v2.6.27 -- drivers/pci/intel-iommu.c
>> 
>> From what you said, it sounds like git grep is actually searching for
>> the string 'somewhere'. But where?
>
> All files in the directory are looked at and checked if they match the
> given path spec first.  Since none of them do, no actual text search has
> to take place.

And in this case, it is complicated to tell that the given path spec
matches nothing, right?

-- 
Francis

* Re: Can't find the revelant commit with git-log
From: René Scharfe @ 2011-01-29 13:02 UTC
  To: Francis Moreau; +Cc: git, Johannes Sixt
In-Reply-To: <m2sjwb6feo.fsf@gmail.com>

On 29.01.2011 13:52, Francis Moreau wrote:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
> 
>> On 26.01.2011 19:11, René Scharfe wrote:
>>> - Make git grep report non-matching path specs (new feature).
>>
>> This is a bit complicated because grep can work on files, index entries
>> as well as versioned objects and supports wildcards,
>> so it's not that easy to tell if a path spec matches something or is
>> rather a typo.  But it's not impossible either, of course.
> 
> I don't understand this for the following use case:
> 
>    $ cd ~/linux-2.6/drivers/pci/
>    $ git grep blacklist v2.6.27 -- drivers/pci/intel-iommu.c
> 
> From what you said, it sounds like git grep is actually searching for
> the string 'somewhere'. But where?

All files in the directory are looked at and checked if they match the
given path spec first.  Since none of them do, no actual text search has
to take place.

René

* Re: Can't find the revelant commit with git-log
From: Francis Moreau @ 2011-01-29 12:52 UTC
  To: René Scharfe; +Cc: git, Johannes Sixt
In-Reply-To: <4D433CA7.9060200@lsrfire.ath.cx>

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> On 26.01.2011 19:11, René Scharfe wrote:
>> - Make git grep report non-matching path specs (new feature).
>
> This is a bit complicated because grep can work on files, index entries
> as well as versioned objects and supports wildcards,
> so it's not that easy to tell if a path spec matches something or is
> rather a typo.  But it's not impossible either, of course.

I don't understand this for the following use case:

   $ cd ~/linux-2.6/drivers/pci/
   $ git grep blacklist v2.6.27 -- drivers/pci/intel-iommu.c

From what you said, it sounds like git grep is actually searching for
the string 'somewhere'. But where?

Thanks
-- 
Francis

* Re: Re: Updating a submodule with a compatible version from another submodule version using the parent meta-repository
From: Heiko Voigt @ 2011-01-29 11:08 UTC
  To: Junio C Hamano; +Cc: Julian Ibarz, Jens Lehmann, git
In-Reply-To: <7v1v3zjp6w.fsf@alter.siamese.dyndns.org>

Hi,

On Wed, Jan 26, 2011 at 02:05:43PM -0800, Junio C Hamano wrote:
> If that version of submodule B is explicitly bound to a commit in the
> superproject A, you know which version of A and C were recorded, and the
> problem is solved.
> 
[...]
> 
> If you are confident that you didn't introduce a different kind of
> dependency on other submodules while developing your "old_feature" branch
> in submodule B, one strategy may be to find an ancestor, preferably the
> fork point, of your "old_feature" branch that is bound to the superproject
> A.  Then at that point at least you know whoever made that commit in A
> tested the combination of what was recorded in that commit, together with
> the version of B and C, and you can go forward from there, replaying the
> changes you made to the "old_feature" branch in submodule B.

Let's extend your explanation a little further and maybe demonstrate the problem
Julian is having a little more. I think what Julian is looking for is a tool in
git that does the lookup for you, which AFAIK is not that easy currently. It
seems to be quite a useful feature. Here is what I understand Julian wants:

1. Find the most recent superproject commit X'' in A that records a submodule
   commit X' in B which contains the commit X in B you are searching for.

   For this we would need to use something similar to git describe --contains
   but instead of using the list of existing tags in B it should use the list
   of commits in B which are recorded in A.

   Here is a drawing to explain (linear history for simplicity):

   superproject A:

      O---O---X''---O
               \
   submodule B: \
                 \
      O---X---O---X'---O---O

2. Look up the commit of C which is recorded in X'' of A and check it
   out.

Step 2 is easy, but for step 1 the lookup of X' is missing from the command line.
Is there already anything that implements git describe --contains for a defined
list of commits instead of refs?
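
One possible shape for such a lookup, as a minimal sketch only. The history
of B and the B commits recorded in A are modelled as plain Python
dictionaries; every name below is made up for illustration and none of it is
an existing git command or API. The sketch lists every superproject commit
whose recorded submodule commit contains X, so the caller can pick the oldest
or the most recent one.

```python
def ancestors(parents, tip):
    """Return every commit reachable from tip, including tip itself."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def superproject_commits_containing(a_history, recorded_b, b_parents, target):
    """List the commits of A whose recorded B commit contains target.

    a_history  -- commits of A, oldest first (hypothetical input)
    recorded_b -- maps each commit of A to the B commit it records
    b_parents  -- maps each commit of B to its parent commits
    """
    return [a for a in a_history
            if target in ancestors(b_parents, recorded_b[a])]


# Toy data mirroring the drawing: X is the commit we search for, Xp stands
# for X' and Xpp for X''.
b_parents = {"b1": [], "X": ["b1"], "b3": ["X"], "Xp": ["b3"],
             "b5": ["Xp"], "b6": ["b5"]}
a_history = ["a1", "a2", "Xpp", "a4"]
recorded_b = {"a1": "b1", "a2": "b1", "Xpp": "Xp", "a4": "b5"}

matches = superproject_commits_containing(a_history, recorded_b, b_parents, "X")
print(matches)
```

Recomputing the ancestor set for every commit of A is wasteful on a real
repository; the sketch only shows which question has to be answered.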

Cheers Heiko

* Features from GitSurvey 2010
From: Dmitry S. Kravtsov @ 2011-01-29 10:01 UTC
  To: git

Hello,

I want to dedicate my coursework at university to implementing some
useful git feature. So I'm interested in some kind of list of the
development status of these features:
https://git.wiki.kernel.org/index.php/GitSurvey2010#17._Which_of_the_following_features_would_you_like_to_see_implemented_in_git.3F

Or I'll be glad to know what features are now 'free' and what are
currently in active development.

Best Regards
-- 
Dmitry S. Kravtsov

* Re: Can't find the revelant commit with git-log
From: Junio C Hamano @ 2011-01-29  5:47 UTC
  To: René Scharfe; +Cc: Francis Moreau, git, Johannes Sixt
In-Reply-To: <4D437CA0.1070006@lsrfire.ath.cx>

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> Perhaps we should check my underlying assumption first: is it reasonable
> to expect a git log command to show the same commits with and without a
> path spec that covers all changed files?

The simplest case would be "git log ." vs "git log" from the root level of
the repository, right?  Traditionally, the former is "please show _one_
simplest history that can explain how the current commit came to be"
(i.e. with merge simplification), while the latter is "please list
everything that is behind the current commit" (i.e. without), I think.

It feels unintuitive, but my understanding of the rationale behind the
design is that, the expectation Linus had when he first did the pathspec
limited traversal was that most of the time "git log $path" is used to get
an explanation.  It follows that having to say "git log --simplify $path"
would have been a nuisance, so "with pathspec, we simplify" was thought to
be a reasonable default.

* Re: [RFC] Add --create-cache to repack
From: Shawn Pearce @ 2011-01-29  4:35 UTC
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
In-Reply-To: <alpine.LFD.2.00.1101282055190.8580@xanadu.home>

On Fri, Jan 28, 2011 at 20:08, Nicolas Pitre <nico@fluxnic.net> wrote:
>> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.  I point
>> out this data because improvements made to JGit may show similar
>> improvements to CGit given how close they are in running time.
>
> What are those improvements?

None right now.  JGit is similar to CGit algorithm-wise.  (Actually it
looks like JGit has a faster diff implementation, but that's a
different email.)

If you are asking about why JGit created a slightly smaller pack
file... it splits the delta window during threaded delta search
differently than CGit does, and we align our blocks slightly
differently when comparing two objects to generate a delta sequence
for them.  These two variations mean JGit produces different deltas
than CGit does.  Sometimes we are smaller, sometimes we are larger.
But it's a small difference, on the order of 1-4 MiB for something like
linux-2.6.  I don't think it's worthwhile trying to analyze the
specific differences in implementations and retrofit those differences
into the other one.

What I was trying to say was, _if_ we made a change to JGit and it
dropped the running time, that same change in CGit should have _at
least_ the same running time improvement, if not better.  I was
pointing out that this cached-pack change dropped the running time by
1 minute, so CGit should also see a similar improvement (if not
better).  I would prefer to test against CGit for this sort of thing,
but it's been too long since I last poked pack-objects.c and the
revision code in CGit, while the JGit equivalents are really fresh in
my head.

> Now, the fact that JGit is so close to CGit must be because the actual
> cost is outside of them such as within zlib, otherwise the C code should
> normally always be faster, right?

Yup, I mostly agree with this statement.  CGit does a lot of
malloc/free activity when reading objects in.  JGit does too, but we
often fit into the young generation for the GC, which sometimes can be
faster to clean and recycle memory in.  We're not too far off from C
code.

But yes... our profile looks like this too:

> Looking at the profile for "git rev-list --objects --all > /dev/null"
> for the object enumeration phase, we have:
>
> # Samples: 1814637
> #
> # Overhead          Command  Shared Object  Symbol
> # ........  ...............  .............  ......
> #
>    28.81%              git  /home/nico/bin/git  [.] lookup_object
>    12.21%              git  /lib64/libz.so.1.2.3  [.] inflate
>    10.49%              git  /lib64/libz.so.1.2.3  [.] inflate_fast
>     7.47%              git  /lib64/libz.so.1.2.3  [.] inflate_table
>     6.66%              git  /lib64/libc-2.11.2.so  [.] __GI_memcpy
>     5.66%              git  /home/nico/bin/git  [.] find_pack_entry_one
>     2.98%              git  /home/nico/bin/git  [.] decode_tree_entry
> [...]
>
> So we've got lookup_object() clearly at the top.

Isn't this the hash table lookup inside the revision pool, to see if
the object has already been visited?  That seems horrible, 28% of the
CPU is going to probing that table.

>  I suspect the
> hashcmp() in there, which probably gets inlined, is responsible for most
> cycles.

Probably true.  I know our hashcmp() is inlined; it's actually written
by hand as 5 word compares, and is marked final, so the JIT is rather
likely to inline it.
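
For illustration only, comparing a 20-byte SHA-1 as five 32-bit words could
look like the Python sketch below. It is not JGit's or CGit's code; the
function name and the big-endian unpacking are assumptions.

```python
import hashlib
import struct


def hash_equal(a: bytes, b: bytes) -> bool:
    """Compare two 20-byte SHA-1 values as five 32-bit words."""
    a1, a2, a3, a4, a5 = struct.unpack(">5I", a)
    b1, b2, b3, b4, b5 = struct.unpack(">5I", b)
    # Five explicit word compares instead of a byte-by-byte loop.
    return a1 == b1 and a2 == b2 and a3 == b3 and a4 == b4 and a5 == b5


x = hashlib.sha1(b"blob one").digest()
y = hashlib.sha1(b"blob two").digest()
print(hash_equal(x, x), hash_equal(x, y))
```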

>  There is certainly a better way here, and probably in JGit you
> rely on some optimized facility provided by the language/library to
> perform that lookup.  So there are probably some easy improvements that
> can be made here.

Nope.  Actually we have to bend over backwards and work against the
language to get anything even reasonably sane for performance.  Our
"solution" in JGit has actually been used by Rob Pike to promote his
Go programming language and why Java sucks as a language.  It's a great
quote of mine that someone dragged up off the git@vger mailing list
and started using to promote Go.

At least once a week I envy how easy it is to use hashcmp() and
hashcpy() inside of CGit.  JGit's management of hashes is sh*t because
we have to bend so hard around the language.

> Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent
> directly in the zlib code, making it the biggest cost.

Yea, that's what we have too, about 33% inside of zlib code... which
is the same implementation that CGit uses.

>  This is rather
> unavoidable unless the data structure is changed.

We already knew this from our pack v4 experiments years ago.

>  And pack v4 would
> probably move things such as find_pack_entry_one, decode_tree_entry,
> process_tree and tree_entry off the radar as well.

This is hard to do inside of CGit if I recall... but yes, changing the
way trees are handled would really improve things.

> The object writeout phase should pretty much be network bound.

Yes.

>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.
>
> Yeah... this sucks.

Very much.  :-(

But this is a fundamental issue with our incremental fetch support
anyway.  In this exact case if the client was at that 1 month old
commit, and fetched current master, he would pull 25 MiB of data.. but
only needed about 4-6 MiB worth of deltas if it was properly delta
compressed against the content we know he already has.  Our server
side optimization of only pushing the immediate "have" list of the
client into the delta search window limits how much we can compress
the data we are sending.  If we were willing to push more in on the
server side, we could shrink the incremental fetch more.  But that's a
CPU problem on the server.

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
From: Nicolas Pitre @ 2011-01-29  4:08 UTC
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
In-Reply-To: <AANLkTi=U7qRRij=BQXC1Goqa9toDFfaVKT=+-8zYxCcc@mail.gmail.com>

On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> >> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >> >
> >> >> This started because I was looking for a way to speed up clones coming
> >> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> 
> Well, scratch the idea in this thread.  I think.
> 
> I retested JGit vs. CGit on an identical linux-2.6 repository.  The
> repository was fully packed, but had two pack files.  362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get the most recent month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
> 
> The two implementations take the same amount of time to generate the
> clone.  3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.  I point
> out this data because improvements made to JGit may show similar
> improvements to CGit given how close they are in running time.

What are those improvements?

Now, the fact that JGit is so close to CGit must be because the actual 
cost is outside of them such as within zlib, otherwise the C code should 
normally always be faster, right?

Looking at the profile for "git rev-list --objects --all > /dev/null" 
for the object enumeration phase, we have:

# Samples: 1814637
#
# Overhead          Command  Shared Object  Symbol
# ........  ...............  .............  ......
#
    28.81%              git  /home/nico/bin/git  [.] lookup_object
    12.21%              git  /lib64/libz.so.1.2.3  [.] inflate
    10.49%              git  /lib64/libz.so.1.2.3  [.] inflate_fast
     7.47%              git  /lib64/libz.so.1.2.3  [.] inflate_table
     6.66%              git  /lib64/libc-2.11.2.so  [.] __GI_memcpy
     5.66%              git  /home/nico/bin/git  [.] find_pack_entry_one
     2.98%              git  /home/nico/bin/git  [.] decode_tree_entry
     2.73%              git  /lib64/libc-2.11.2.so  [.] _int_malloc
     2.71%              git  /lib64/libz.so.1.2.3  [.] adler32
     2.63%              git  /home/nico/bin/git  [.] process_tree
     1.58%              git  [kernel]       [k] 0xffffffff8112fc0c
     1.44%              git  /lib64/libc-2.11.2.so  [.] __strlen_sse2
     1.31%              git  /home/nico/bin/git  [.] tree_entry
     1.10%              git  /lib64/libc-2.11.2.so  [.] _int_free
     0.96%              git  /home/nico/bin/git  [.] patch_delta
     0.92%              git  /lib64/libc-2.11.2.so  [.] malloc_consolidate
     0.86%              git  /lib64/libc-2.11.2.so  [.] __GI_vfprintf
     0.80%              git  /home/nico/bin/git  [.] create_object
     0.80%              git  /home/nico/bin/git  [.] lookup_blob
     0.63%              git  /home/nico/bin/git  [.] update_tree_entry
[...]

So we've got lookup_object() clearly at the top.  I suspect the 
hashcmp() in there, which probably gets inlined, is responsible for most 
cycles.  There is certainly a better way here, and probably in JGit you 
rely on some optimized facility provided by the language/library to 
perform that lookup.  So there are probably some easy improvements that
can be made here.

Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent 
directly in the zlib code, making it the biggest cost.  This is rather 
unavoidable unless the data structure is changed.  And pack v4 would 
probably move things such as find_pack_entry_one, decode_tree_entry, 
process_tree and tree_entry off the radar as well.
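
The sums in this paragraph can be re-derived from the profile with a few
lines of Python. This is only a bookkeeping sketch: the rows are copied from
the listing above, and the grouping into "zlib" and "tree and pack lookup" is
my reading of the paragraph, not something perf reports by itself.

```python
# (overhead in percent, shared object, symbol), copied from the profile above
PROFILE = [
    (28.81, "git", "lookup_object"),
    (12.21, "libz", "inflate"),
    (10.49, "libz", "inflate_fast"),
    (7.47, "libz", "inflate_table"),
    (6.66, "libc", "__GI_memcpy"),
    (5.66, "git", "find_pack_entry_one"),
    (2.98, "git", "decode_tree_entry"),
    (2.73, "libc", "_int_malloc"),
    (2.71, "libz", "adler32"),
    (2.63, "git", "process_tree"),
    (1.31, "git", "tree_entry"),
]

# Everything that runs inside libz.so
zlib_total = round(sum(pct for pct, obj, _ in PROFILE if obj == "libz"), 2)

# The symbols named as candidates for pack v4 to take off the radar
PACK_V4_CANDIDATES = {"find_pack_entry_one", "decode_tree_entry",
                      "process_tree", "tree_entry"}
tree_total = round(sum(pct for pct, _, sym in PROFILE
                       if sym in PACK_V4_CANDIDATES), 2)

print(f"zlib: {zlib_total}%  tree and pack lookup: {tree_total}%")
```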

The object writeout phase should pretty much be network bound.

> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.  The issue is the thin pack cannot delta
> against all of the history, it's only delta compressing against the tip
> of the cached pack.  So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack.  (It's also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.)  If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.

Yeah... this sucks.

> I'm not sure I like it so much anymore.  :-)
> 
> The idea was half-baked, and came at the end of a long day, and after
> putting my cranky infant son down to sleep way past his normal bed
> time.  I claim I was a sleep deprived new parent who wasn't thinking
> things through enough before writing an email to git@vger.

Well, this is still valuable information to archive.

And I wish I had been able to still write such quality emails when I was 
a new parent.  ;-)

> >> sendfile() call for the bulk of the content.  I think we can just hand
> >> off the major streaming to the kernel.
> >
> > While this might look like a good idea in theory, did you actually
> > profile it to see if that would make a noticeable difference?  The
> > pkt-line framing allows for asynchronous messages to be sent over a
> > sideband,
> 
> No, of course not.  The pkt-line framing is pretty low overhead, but
> copying kernel buffer to userspace back to kernel buffer sort of sucks
> for 400 MiB of data.  sendfile() on 400 MiB to a network socket is
> much easier when it's all kernel space.

Of course.  But still... If you save 0.5 second by avoiding the copy to 
and from user space of that 400 MiB (based on my machine which can do 
1670MB/s) that's pretty much insignificant compared to the total time 
for the clone, and therefore the wrong thing to optimize given the 
required protocol changes.
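
The half-second estimate can be reproduced as follows. This is a worked
sketch under two assumptions of mine: "1670MB/s" means decimal megabytes, and
the data is copied twice (kernel to user space, then back to the kernel). The
3m23s figure is the C Git clone time quoted earlier in the thread.

```python
MIB = 1024 * 1024
MB = 1000 * 1000

pack_bytes = 400 * MIB        # size of the pack being streamed
copy_rate = 1670 * MB         # bytes per second, assuming decimal MB
copies = 2                    # kernel -> user space, user space -> kernel

seconds_per_copy = pack_bytes / copy_rate
total_seconds = copies * seconds_per_copy
clone_seconds = 3 * 60 + 23   # 3m23s for the whole clone

print(f"per copy: {seconds_per_copy:.2f} s, total: {total_seconds:.2f} s")
print(f"share of the clone: {100 * total_seconds / clone_seconds:.2f} %")
```

Under these assumptions the copying accounts for well under one percent of
the clone time, which is the point being made above.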

> I figured, if it all worked
> out already to just dump the pack to the wire as-is, then we probably
> should also try to go for broke and reduce the userspace copying.  It
> might not matter to your desktop, but ask John Hawley (CC'd) about
> kernel.org and the git traffic volume he is serving.  They are doing
> more than 1 million git:// requests per day now.

Impressive.  However I suspect that the vast majority of those requests 
are from clients making a connection just to realize they're up to date 
already.  I don't think the user space copying is really a problem.

Of course, if we could have used sendfile() freely in, say, 
copy_pack_data() then we would have done so long ago.  But we are 
checksumming the data we create on the fly with the data we reuse from
disk so this is not necessarily a gain.

> >> Plus we can safely do byte range requests for resumable clone within
> >> the cached pack part of the stream.
> >
> > That part I'm not sure of.  We are still facing the same old issues
> > here, as some mirrors might have the same commit edges for a cache pack
> > but not necessarily the same packing result, etc.  So I'd keep that out
> > of the picture for now.
> 
> I don't think it's that hard.  If we modify the transfer protocol to
> allow the server to denote boundaries between packs, the server can
> send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
> to the client.  A client asking for resume of a cached pack presents
> its original want list, these two SHA-1s, and the byte offset he wants
> to restart from.  The server validates the want set is still
> reachable, that the cached pack exists, and that the cached pack tips
> are reachable from current refs.  If all of that is true, it validates
> the trailing SHA-1 in the pack matches what the client gave it.  If
> that matches, it should be OK to resume transfer from where the client
> asked for.

This is still a half solution.  If your network connection drops after
the first 52 MiB of transfer given the scenario you provided then you're 
still screwed.
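
For reference, the server-side checks from the quoted proposal can be written
down as a small Python sketch. Nothing here is an existing protocol or API:
the `server` dictionary, its keys and `can_resume` are hypothetical, and the
final bounds check on the offset is my own addition.

```python
def can_resume(server, wants, pack_name, pack_trailer, offset):
    """Decide whether a cached-pack transfer may resume at offset."""
    # 1. The original want set must still be reachable.
    if not all(w in server["reachable"] for w in wants):
        return False
    # 2. The cached pack must still exist.
    pack = server["cached_packs"].get(pack_name)
    if pack is None:
        return False
    # 3. The cached pack tips must be reachable from the current refs.
    if not all(tip in server["reachable"] for tip in pack["tips"]):
        return False
    # 4. The trailing SHA-1 must match what the client was given earlier.
    if pack["trailer"] != pack_trailer:
        return False
    # 5. Assumption: the restart offset has to lie inside the pack.
    return 0 <= offset < pack["size"]


server = {
    "reachable": {"c1", "c2", "c3"},
    "cached_packs": {
        "pack-aaaa": {"tips": ["c2"], "trailer": "t-1234", "size": 1000},
    },
}

print(can_resume(server, ["c3"], "pack-aaaa", "t-1234", 512))
print(can_resume(server, ["c3"], "pack-aaaa", "t-9999", 512))
```

The sketch only covers the cached pack part of the stream, so it does not
answer the objection about a connection that drops before that part starts.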


Nicolas

* Re: [PATCH 09/21] tree_entry_interesting(): support depth limit
From: Nguyen Thai Ngoc Duy @ 2011-01-29  3:13 UTC
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vd3ngdaoa.fsf@alter.siamese.dyndns.org>

2011/1/29 Junio C Hamano <gitster@pobox.com>:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>>  static const char *get_mode(const char *str, unsigned int *modep)
>> @@ -557,8 +558,13 @@ int tree_entry_interesting(const struct name_entry *entry,
>>       int pathlen, baselen = base->len;
>>       int never_interesting = -1;
>>
>> -     if (!ps || !ps->nr)
>> -             return 1;
>> +     if (!ps->nr) {
>> +             if (!ps->recursive || ps->max_depth == -1)
>> +                     return 1;
>> +             return !!within_depth(base->buf, baselen,
>> +                                   !!S_ISDIR(entry->mode),
>> +                                   ps->max_depth);
>> +     }
>
> Back in 1d848f6 (tree_entry_interesting(): allow it to say "everything is
> interesting", 2007-03-21), a new return value "2" was introduced to allow
> this function to tell the caller that all the remaining entries in the
> tree object the caller is feeding the entries to this function _will_
> match.  This was to optimize away expensive pathspec matching done by this
> function.
>
> In that version, "no pathspec" case wasn't changed to return 2 but still
> returned 1 ("I tell you that this one matches; call me again with the
> next entry").  We could have changed it to return 2, but the overhead was
> only a call to a function that checks the number of pathspecs and was not
> so bad.
>
> But shouldn't we start returning 2 by now?  It is not that returning 1 was
> a more correct thing to do to begin with.
>
> When depth check is in effect, the result depends on the mode of the
> entry, so we cannot short-circuit by returning 2, but at least we should
> do so when (max_depth == -1), no?

Yes, should be 2.
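
To make sure we mean the same thing, here is the "no pathspec" branch as a
small Python model rather than C. It is a sketch only: the constant names are
invented, and within_depth() is my guess at the helper (count the slashes in
the base path, starting from 1 for a directory entry), since its body is not
shown in this thread.

```python
ALL_INTERESTING = 2   # this entry and all remaining entries match
INTERESTING = 1       # this entry matches; ask again for the next one
NOT_INTERESTING = 0   # this entry does not match


def within_depth(base: str, depth: int, max_depth: int) -> bool:
    """Assumed model: each '/' in base adds one level to depth."""
    for ch in base:
        if ch != "/":
            continue
        depth += 1
        if depth > max_depth:
            return False
    return True


def no_pathspec_result(base, is_dir, recursive, max_depth):
    """Result for an entry when no pathspec was given."""
    if not recursive or max_depth == -1:
        # No depth limit applies, so the answer cannot change for later
        # entries of the same tree and the caller may stop asking.
        return ALL_INTERESTING
    # With a depth limit the answer depends on the entry's mode.
    if within_depth(base, int(is_dir), max_depth):
        return INTERESTING
    return NOT_INTERESTING


print(no_pathspec_result("a/b/", True, True, -1))
print(no_pathspec_result("a/", False, True, 1))
print(no_pathspec_result("a/b/", False, True, 1))
```
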
-- 
Duy

* Re: [PATCH] git-p4: Corrected typo.
From: Vitor Antunes @ 2011-01-29  2:41 UTC
  To: Thomas Berg; +Cc: git
In-Reply-To: <AANLkTikeB724f_vE6qvu1h1o5JG150mcmaHVBjLkOEWP@mail.gmail.com>

Hi Thomas,

First of all I'd like to thank you for your feedback. It's my first try
at creating and submitting a patch, so having someone's guidance helps a
lot :)

I'll rebase my patches against the head of the tree and squash the fix
to avoid multiple commits. While I do that I'll also review my commit
message and sign-off the patches according to what you said. Hopefully
I will be able to do this during this weekend.

From git-diff-tree man page:

"""
-M[<n>]
    Detect renames. If n is specified, it is a threshold on the
    similarity index (i.e. amount of addition/deletions compared to the
    file’s size). For example, -M90% means git should consider a
    delete/add pair to be a rename if more than 90% of the file hasn’t
    changed.
"""

But from my latest tests I think that this option is ignored in
diff-tree (I think it's only used in git log). With this in mind I'll
need to add some code to implement the check of the score value of
diff-tree output string. Again from its man page:

"""
Status letters C and R are always followed by a score (denoting the
percentage of similarity between the source and target of the move or
copy), and are the only ones to be so.
"""

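A minimal sketch of what that check could look like, assuming the raw
diff-tree line format from the man page (metadata fields, then tab-separated
paths, with the status letter and score as the last metadata field). The
function names and the sample line are made up for illustration and are not
part of git-p4.

```python
def parse_raw_status(line):
    """Split one raw diff-tree line into (status letter, score, paths)."""
    meta, *paths = line.rstrip("\n").split("\t")
    status = meta.split()[-1]          # e.g. "R86", "C75" or plain "M"
    letter, score = status[0], status[1:]
    return letter, (int(score) if score else None), paths


def is_rename(line, threshold):
    """True if the line is a rename whose score reaches threshold."""
    letter, score, _ = parse_raw_status(line)
    return letter == "R" and score is not None and score >= threshold


sample = ":100644 100644 bcd1234... 0123456... R86\tfile1\tfile3"
print(parse_raw_status(sample))
print(is_rename(sample, 90), is_rename(sample, 80))
```
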
Thanks,
Vitor

On Fri, Jan 28, 2011 at 3:19 PM, Thomas Berg <merlin66b@gmail.com> wrote:
> Hi,
>
> On Fri, Jan 28, 2011 at 12:35 AM, Vitor Antunes <vitor.hda@gmail.com> wrote:
>> Hi everyone,
>>
>> Could anyone comment on the 3 patches I sent (this being the last one)?
>>
> [...]
>> On Thu, Nov 25, 2010 at 1:26 AM, Vitor Antunes <vitor.hda@gmail.com> wrote:
>>> ---
>>>  contrib/fast-import/git-p4 |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/contrib/fast-import/git-p4 b/contrib/fast-import/git-p4
>>> index 0ea3a44..a466847 100755
>>> --- a/contrib/fast-import/git-p4
>>> +++ b/contrib/fast-import/git-p4
>>> @@ -618,7 +618,7 @@ class P4Submit(Command):
>>>         if len(detectRenames) > 0:
>>>             diffOpts = "-M%s" % detectRenames
>>>         else:
>>> -            diffOpts = ("", "-M")[self.detectRenames]
>>> +            diffOpts = ("", "-M")[self.detectRename]
>>>
>
> This appears to me to be a bugfix for one of the other patches you
> sent, is that right?
>
> If so, maybe you could squash it with the previous patch and re-send
> it all to the list?
>
> My other comments for now are:
> - you have forgotten to sign off on the patches
> - commit messages are normally in imperative rather than past tense
> (see Documentation/SubmittingPatches in git)
>
> - In your first patch you wrote:
>> The detectRenames option should be set to the desired threshold value.
> I'm not sure what threshold value you refer to here, and what values
> you can set it to. Am I missing something?
> (I'm not very familiar with git rename detection options)
>
> I'm a git-p4 user, so I can test your changes and look a bit more at
> your code. Someone verifying it could help getting the patches
> applied.
>
> Thanks for improving git-p4!
>
> Cheers,
> Thomas Berg
>

-- 
Vitor Antunes

* Re: [RFC] Add --create-cache to repack
From: Shawn Pearce @ 2011-01-29  2:34 UTC
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
In-Reply-To: <AANLkTi=U7qRRij=BQXC1Goqa9toDFfaVKT=+-8zYxCcc@mail.gmail.com>

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> Well, scratch the idea in this thread.  I think.
>
> I retested JGit vs. CGit on an identical linux-2.6 repository.  The
> repository was fully packed, but had two pack files.  362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get the most recent month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
>
> The two implementations take the same amount of time to generate the
> clone.  3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.

I just tried caching only the object list of what is reachable from a
particular commit.  The file is a small 20 byte header:

  4 byte magic
  4 byte version
  4 byte number of commits (C)
  4 byte number of trees (T)
  4 byte number of blobs (B)

Then C commit SHA-1s, followed by T tree SHA-1 + 4 byte path_hash,
followed by B blob SHA-1 + 4 byte path_hash.  For any project the size
is basically on par with the .idx file for the pack v1 format, so ~41
MB for linux-2.6.  The file is stored as
$GIT_OBJECT_DIRECTORY/cache/$COMMIT_SHA1.list, and is completely
pack-independent.
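
To make the layout concrete, here is a minimal Python sketch that writes and
reads such a list. The magic value, the big-endian byte order and the
function names are assumptions for illustration; only the field sizes and
their order come from the description above.

```python
import struct

MAGIC = b"OLST"                     # hypothetical 4 byte magic
HEADER = struct.Struct(">4sIIII")   # magic, version, C, T, B: 20 bytes
ENTRY = struct.Struct(">20sI")      # SHA-1 + 4 byte path_hash: 24 bytes


def encode(commits, trees, blobs, version=1):
    """Serialize C commit SHA-1s, then T tree and B blob entries."""
    parts = [HEADER.pack(MAGIC, version, len(commits), len(trees), len(blobs))]
    parts.extend(commits)           # commits carry no path_hash
    parts.extend(ENTRY.pack(sha, path_hash) for sha, path_hash in trees)
    parts.extend(ENTRY.pack(sha, path_hash) for sha, path_hash in blobs)
    return b"".join(parts)


def decode(data):
    """Parse the header, then slice out commits, trees and blobs."""
    magic, version, c, t, b = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not an object list file")
    pos = HEADER.size
    commits = [data[pos + 20 * i: pos + 20 * (i + 1)] for i in range(c)]
    pos += 20 * c
    entries = [ENTRY.unpack_from(data, pos + ENTRY.size * i)
               for i in range(t + b)]
    return version, commits, entries[:t], entries[t:]


commits = [bytes([1]) * 20]
trees = [(bytes([2]) * 20, 7)]
blobs = [(bytes([3]) * 20, 9), (bytes([4]) * 20, 11)]
data = encode(commits, trees, blobs)
print(len(data), decode(data) == (1, commits, trees, blobs))
```

The object type is implied by position in the file, so enumeration is a
single linear scan with no object lookups or inflation.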

Using this for object enumeration shaves almost 1 minute off server
packing time; the clone dropped from 3m28s to 2m29s.  That is close to
what I was getting with the cached pack idea, but the network transfer
stayed the small 376 MiB.  I think this supports your pack v4 work...
if we can speed up object enumeration to be this simple (scan down a
list of objects with their types declared inline, or implied by
location), we can cut a full minute of CPU time off the server side.

-- 
Shawn.

* Re: Can't find the revelant commit with git-log
From: René Scharfe @ 2011-01-29  2:34 UTC
  To: Junio C Hamano; +Cc: Francis Moreau, git, Johannes Sixt
In-Reply-To: <7v1v3wd1al.fsf@alter.siamese.dyndns.org>

On 29.01.2011 01:02, Junio C Hamano wrote:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
> 
>> Subject: pickaxe: don't simplify history too much
>>
>> If pickaxe is used, turn off history simplification and make sure to keep
>> merges with at least one interesting parent.
>>
>> If path specs are used, merges that have at least one parent whose files
>> match those in the specified subset are edited out.  This is good in
>> general, but leads to unexpectedly few results if used together with
>> pickaxe.  Merges that also have an interesting parent (in terms of -S or
>> -G) are dropped, too.
>>
>> This change makes sure pickaxe takes precedence over history
>> simplification.
> 
> Hmmm, I understand the _motivation_ behind the change in the second hunk,
> in that you _might_ want to dig the side branch that did not contribute
> anything to the end result when looking for a needle with either -S or -G,
> but doesn't the same logic apply to things like --grep?

Yes, that's true.  I have to admit that I'm mostly reacting to the
unintuitive output given in the specific case ("test driven") and
probably don't fully understand the underlying problem and all its
implications.

> I do not think it is a good idea to unconditionally disable simplification
> for -S/G without a way for the user to countermand (even though I could be
> persuaded to say that the flipping the default for -S/-G/--grep might have
> been a better alternative in hindsight).

Currently there is no way to turn simplification off, which causes
certain commits to become invisible when using e.g. -S in combination
with path specs.

> The user can control this behaviour by giving or not giving --simplify
> from the command line anyway, no?

Yes, but that goes only so far (see the examples in the parent post
which use --full-history; --simplify-merges gives 3 more results with
-m, but still not the full 160).

And as a user I don't want to have to add another option in order to use
pickaxe with path specs.  My expectation is that my search has
precedence over any history simplification, which is a nice and
necessary optimization, but shouldn't hide any search results that I
would have got if I had used no path specs.

However, it definitely looks like a corner case and I still don't know
what happened with all these merges.

> As to the first hunk, I have no idea why this is a good change.

I didn't see any other way to fix the example given in the commit message.

> It feels as if you are fighting against what this part of the code does in
> try_to_simplify_commit():
> 
> 	switch (rev_compare_tree(revs, p, commit)) {
> 	case REV_TREE_SAME:
> 		tree_same = 1;
> 		if (!revs->simplify_history || (p->object.flags & UNINTERESTING)) {
> 			/* ... */
> 			pp = &parent->next;
> 			continue;
> 		}
> 		parent->next = NULL;
> 		commit->parents = parent;
> 		commit->object.flags |= TREESAME;
> 		return;
> 	...
> 
> When we see a _single_ parent that has the same tree, we set tree_same and
> also cull the parent list to contain only that parent.  When we are not
> simplifying the history, we do not cull the parent list and will inspect
> other parents as well, but still we set tree_same to 1 here.  When some
> other parent is found to be different, we set tree_changed to 1.  So we
> have four states (same = (0, 1) x changed = (0, 1)).
> 
> The code before your addition in the first hunk says that we keep the
> commit if there is no parent with the same contents (i.e. !tree_same) and
> there is at least one parent with different contents (i.e. tree_changed).
> I suspect that this logic may not be working well when we do not simplify
> the merge.
> 
> Let's look at the original code before your patch again.
> 
>  1. If all the parents of a commit are the same, we will see (tree_same &&
>     !tree_changed), so we get TREESAME.
> 
>  2. If some but not all of the parents are the same, we will see (tree_same
>     && tree_changed), and we end up getting TREESAME.
> 
>  3. If none of the parents is the same, (!tree_same && tree_changed) holds
>     true, and we do not get TREESAME.

For completeness, there is a fourth case (!tree_same && !tree_changed),
which would be triggered by commits whose parents are all classified as
REV_TREE_NEW.  That's another corner case for sure, but the old code
would mark it TREESAME and your patch changes that.

> Perhaps the second condition needs to be tweaked for the "do not simplify
> merges" case?  That is, we split 2. into two cases:
> 
>  2a. When simplifying the merges, if any of the parents is the same as the
>      commit, we say TREESAME (the same as before);
> 
>  2b. When not simplifying, we say TREESAME only when all the parents are
>      the same as the commit.  Otherwise the merge commit itself is worth
>      showing, i.e. !TREESAME.
> 
> But I probably am missing some corner cases you saw in your analysis...
> 
> diff --git a/revision.c b/revision.c
> index 7b9eaef..0147124 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -439,7 +439,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
>  		}
>  		die("bad tree compare for commit %s", sha1_to_hex(commit->object.sha1));
>  	}
> -	if (tree_changed && !tree_same)
> +	if ((!revs->simplify_history && tree_changed) || !tree_same)
>  		return;
>  	commit->object.flags |= TREESAME;
>  }

The patch lists the right commits in the test case, but requires the
option --full-history to be given.  Without it no output is given if the
full file name is specified, as in master.
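
To make the four (tree_same, tree_changed) states easier to compare, here
is a small standalone sketch of the final condition before and after the
patch quoted above.  The function names treesame_old() and treesame_new()
are made up for illustration, and the sketch ignores the early return that
try_to_simplify_commit() takes on REV_TREE_SAME when simplification is on;
it only models the last "if" of the function.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: the condition before the patch.  The commit is
 * kept (not marked TREESAME) only when some parent differs and no
 * parent is identical.
 */
static bool treesame_old(bool tree_same, bool tree_changed)
{
	if (tree_changed && !tree_same)
		return false;
	return true;
}

/*
 * Hypothetical helper: the condition from the quoted patch.  Without
 * history simplification, any differing parent keeps the commit; a
 * commit with no identical parent is always kept.
 */
static bool treesame_new(bool simplify_history, bool tree_same,
			 bool tree_changed)
{
	if ((!simplify_history && tree_changed) || !tree_same)
		return false;
	return true;
}
```

Under these assumptions the two versions differ in exactly two places:
case 2 (tree_same && tree_changed) stops being TREESAME when
simplify_history is off, and the fourth case (!tree_same && !tree_changed)
stops being TREESAME regardless of simplify_history.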

Perhaps we should check my underlying assumption first: is it reasonable
to expect a git log command to show the same commits with and without a
path spec that covers all changed files?

René

^ permalink raw reply

* Re: [RFC] Add --create-cache to repack
From: Shawn Pearce @ 2011-01-29  1:32 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
In-Reply-To: <alpine.LFD.2.00.1101281502170.8580@xanadu.home>

On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
>> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
>> >
>> >> This started because I was looking for a way to speed up clones coming
>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,

Well, scratch the idea in this thread.  I think.

I retested JGit vs. C Git on an identical linux-2.6 repository.  The
repository was fully packed, but had two pack files, 362M and 57M.  It
was created by packing a 1 month old master, marking it .keep, and
then repacking -a -d to get the most recent month into another pack.
This results in some files that should be delta compressed together
being stored whole in the two packs (obviously).

The two implementations take the same amount of time to generate the
clone: 3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit-created
pack is actually smaller, 376.30 MiB vs. C Git's 380.59 MiB.  I point
out this data because improvements made to JGit may show similar
improvements to C Git given how close they are in running time.

I fully implemented the reuse of a cached pack behind a thin pack idea
I was trying to describe in this thread.  It saved 1m7s off the JGit
running time, but increased the data transfer by 25 MiB.  I didn't
expect this much of an increase; I honestly expected the thin pack
portion to be, well, thinner.  The issue is the thin pack cannot delta
against all of the history; it's only delta compressing against the tip
of the cached pack.  So long-lived side branches that forked off an
older part of the history aren't delta compressing well, or at all,
and that is significantly bloating the thin pack.  (It's also why that
"newer" pack is 57M, but should be 14M if correctly combined with the
cached pack.)  If I were to consider all of the objects in the cached
pack as potential delta base candidates for the thin pack, the entire
benefit of the cached pack disappears.

Which leaves me with dropping this idea.  I started it because I was
actually looking for a way to speed up JGit.  But we're already
roughly on par with C Git performance.  Dropping 1m7s on a clone is
great, but not at the expense of 6.5% larger network transfer.  For
most clients, 25 MiB of additional data transfer may cost much more
time than the 1m7s saved doing server-side computation.
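
As a rough back-of-the-envelope check of that trade-off, the sketch below
computes the link speed at which 25 MiB of extra transfer costs exactly the
67 seconds saved on the server.  The helper names are invented for this
example, and it assumes the saved server time and the extra transfer time
simply add up on the client's clock, which is a simplification.

```c
#include <stdint.h>

/*
 * Hypothetical helper: link speed in bits per second at which sending
 * extra_bytes takes exactly seconds_saved.  Clients on slower links
 * spend more time on the extra data than the server saved.
 */
static double break_even_bps(uint64_t extra_bytes, double seconds_saved)
{
	return (double)extra_bytes * 8.0 / seconds_saved;
}

/* Hypothetical helper: relative growth of the transfer, in percent. */
static double growth_percent(double extra_mib, double base_mib)
{
	return 100.0 * extra_mib / base_mib;
}
```

With the numbers above, break_even_bps(25 MiB, 67 s) comes out a little
over 3.1 Mbit/s, so under these assumptions any client with less
throughput than that ends up waiting longer with the cached pack than
without it.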

>> That's what I also liked about my --create-cache flag.
>
> I do agree on that point.   And I like it too.

I'm not sure I like it so much anymore.  :-)

The idea was half-baked, and came at the end of a long day, and after
putting my cranky infant son down to sleep way past his normal bed
time.  I claim I was a sleep deprived new parent who wasn't thinking
things through enough before writing an email to git@vger.

>> sendfile() call for the bulk of the content.  I think we can just hand
>> off the major streaming to the kernel.
>
> While this might look like a good idea in theory, did you actually
> profile it to see if that would make a noticeable difference?  The
> pkt-line framing allows for asynchronous messages to be sent over a
> sideband,

No, of course not.  The pkt-line framing is pretty low overhead, but
copying kernel buffer to userspace back to kernel buffer sort of sucks
for 400 MiB of data.  sendfile() on 400 MiB to a network socket is
much easier when it's all kernel space.  I figured, if it all worked
out already to just dump the pack to the wire as-is, then we probably
should also try to go for broke and reduce the userspace copying.  It
might not matter to your desktop, but ask John Hawley (CC'd) about
kernel.org and the git traffic volume he is serving.  They are doing
more than 1 million git:// requests per day now.

>> Plus we can safely do byte range requests for resumable clone within
>> the cached pack part of the stream.
>
> That part I'm not sure of.  We are still facing the same old issues
> here, as some mirrors might have the same commit edges for a cache pack
> but not necessarily the same packing result, etc.  So I'd keep that out
> of the picture for now.

I don't think it's that hard.  If we modify the transfer protocol to
allow the server to denote boundaries between packs, the server can
send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
to the client.  A client asking for resume of a cached pack presents
its original want list, these two SHA-1s, and the byte offset he wants
to restart from.  The server validates the want set is still
reachable, that the cached pack exists, and that the cached pack tips
are reachable from current refs.  If all of that is true, it validates
the trailing SHA-1 in the pack matches what the client gave it.  If
that matches, it should be OK to resume transfer from where the client
asked for.
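
The order of those server-side checks can be sketched roughly as follows.
Everything here is hypothetical: the struct layouts, the field names and
can_resume() do not exist in git, and the reachability and pack lookups
are assumed to have been done already and stored as flags.  The final
bounds check on the offset is an addition of this sketch, not something
described above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SHA1_RAWSZ 20

/* What a resuming client presents (hypothetical layout). */
struct resume_request {
	unsigned char pack_name[SHA1_RAWSZ];    /* $name in pack-$name.pack */
	unsigned char pack_trailer[SHA1_RAWSZ]; /* SHA-1 trailer of that pack */
	uint64_t offset;                        /* byte offset to restart from */
};

/* What the server found out for pack_name (hypothetical layout). */
struct cached_pack_state {
	bool wants_reachable; /* original want list is still reachable */
	bool pack_exists;     /* the cached pack is still on disk */
	bool tips_reachable;  /* cached pack tips reachable from current refs */
	unsigned char trailer[SHA1_RAWSZ];
	uint64_t size;        /* size of the cached pack in bytes */
};

static bool can_resume(const struct resume_request *req,
		       const struct cached_pack_state *st)
{
	if (!st->wants_reachable || !st->pack_exists || !st->tips_reachable)
		return false;
	/* The trailer must match what the client saw the first time. */
	if (memcmp(req->pack_trailer, st->trailer, SHA1_RAWSZ) != 0)
		return false;
	/* Assumption of this sketch: never resume past the end of the pack. */
	return req->offset <= st->size;
}
```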

Then it's up to the server administrators of a round-robin serving
cluster to ensure that the same cached pack is available on all nodes,
so that a resuming client is likely to have his request succeed.  This
isn't impossible.  If the server operator cares they can keep the
prior cached pack for several weeks after creating a newer cached
pack, giving clients plenty of time to resume a broken clone.  Disk is
fairly inexpensive these days.

But it's perhaps pointless, see above.  :-)

-- 
Shawn.

^ permalink raw reply

* Re: Can't find the revelant commit with git-log
From: Junio C Hamano @ 2011-01-29  0:02 UTC (permalink / raw)
  To: René Scharfe; +Cc: Francis Moreau, git, Johannes Sixt
In-Reply-To: <4D432735.8000208@lsrfire.ath.cx>

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> Subject: pickaxe: don't simplify history too much
>
> If pickaxe is used, turn off history simplification and make sure to keep
> merges with at least one interesting parent.
>
> If path specs are used, merges that have at least one parent whose files
> match those in the specified subset are edited out.  This is good in
> general, but leads to unexpectedly few results if used together with
> pickaxe.  Merges that also have an interesting parent (in terms of -S or
> -G) are dropped, too.
>
> This change makes sure pickaxe takes precedence over history
> simplification.

Hmmm, I understand the _motivation_ behind the change in the second hunk,
in that you _might_ want to dig the side branch that did not contribute
anything to the end result when looking for a needle with either -S or -G,
but doesn't the same logic apply to things like --grep?

I do not think it is a good idea to unconditionally disable simplification
for -S/-G without a way for the user to countermand (even though I could be
persuaded to say that flipping the default for -S/-G/--grep might have
been a better alternative in hindsight).

The user can control this behaviour by giving or not giving --simplify
from the command line anyway, no?

As to the first hunk, I have no idea why this is a good change.

It feels as if you are fighting against what this part of the code does in
try_to_simplify_commit():

	switch (rev_compare_tree(revs, p, commit)) {
	case REV_TREE_SAME:
		tree_same = 1;
		if (!revs->simplify_history || (p->object.flags & UNINTERESTING)) {
			/* ... */
			pp = &parent->next;
			continue;
		}
		parent->next = NULL;
		commit->parents = parent;
		commit->object.flags |= TREESAME;
		return;
	...

When we see a _single_ parent that has the same tree, we set tree_same and
also cull the parent list to contain only that parent.  When we are not
simplifying the history, we do not cull the parent list and will inspect
other parents as well, but still we set tree_same to 1 here.  When some
other parent is found to be different, we set tree_changed to 1.  So we
have four states (same = (0, 1) x changed = (0, 1)).

The code before your addition in the first hunk says that we keep the
commit if there is no parent with the same contents (i.e. !tree_same) and
there is at least one parent with different contents (i.e. tree_changed).
I suspect that this logic may not be working well when we do not simplify
the merge.

Let's look at the original code before your patch again.

 1. If all the parents of a commit are the same, we will see (tree_same &&
    !tree_changed), so we get TREESAME.

 2. If some but not all of the parents are the same, we will see (tree_same
    && tree_changed), and we end up getting TREESAME.

 3. If none of the parents is the same, (!tree_same && tree_changed) holds
    true, and we do not get TREESAME.

Perhaps the second condition needs to be tweaked for the "do not simplify
merges" case?  That is, we split 2. into two cases:

 2a. When simplifying the merges, if any of the parents is the same as the
     commit, we say TREESAME (the same as before);

 2b. When not simplifying, we say TREESAME only when all the parents are
     the same as the commit.  Otherwise the merge commit itself is worth
     showing, i.e. !TREESAME.

But I probably am missing some corner cases you saw in your analysis...

diff --git a/revision.c b/revision.c
index 7b9eaef..0147124 100644
--- a/revision.c
+++ b/revision.c
@@ -439,7 +439,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
 		}
 		die("bad tree compare for commit %s", sha1_to_hex(commit->object.sha1));
 	}
-	if (tree_changed && !tree_same)
+	if ((!revs->simplify_history && tree_changed) || !tree_same)
 		return;
 	commit->object.flags |= TREESAME;
 }

^ permalink raw reply related

* Re: Keeping the file modification date with git
From: Jeff King @ 2011-01-28 22:25 UTC (permalink / raw)
  To: Ronan Keryell; +Cc: git
In-Reply-To: <87bp30vmek.fsf@an-dro.info.enstb.org>

On Fri, Jan 28, 2011 at 08:49:39PM +0100, Ronan Keryell wrote:

> After heavily using git for code development, we plan to use it for
> administrative storage and I need to keep the modification date
> of the files.
> [...]
> So I'm envisioning different solutions:
> 
> - it is already done. I have missed this. :-) But would be great. :-)

Nope. The tree format does not have a date field.

> - giving up. Not an option :-)

Right. :)

> - it is added to git core functions because it is quite useful for some
>   people. Too time-consuming for me since I'm not a git developer... But
>   someone else could do this...

Nope. That would require changing one of the fundamental data structures
of git (the tree object) and is not likely to happen for a variety of
reasons (mostly it becomes a performance hit and a compatibility issue
when the majority of people don't even care about this issue).

> - add this concept aside. For example, just as there are .gitignore or
>   .gitattributes files, we could have a .gitdates that would store in a
>   human-readable manner the modification time of the files in its
>   directory.

Yep, this is a good solution. Check out metastore:

  http://david.hardeman.nu/software.php

-Peff

^ permalink raw reply

* Re: Can't find the revelant commit with git-log
From: René Scharfe @ 2011-01-28 22:01 UTC (permalink / raw)
  To: Francis Moreau; +Cc: git, Johannes Sixt
In-Reply-To: <4D4063EC.7090509@lsrfire.ath.cx>

Am 26.01.2011 19:11, schrieb René Scharfe:
> - Make git grep report non-matching path specs (new feature).

This is a bit complicated because grep can work on files, index entries
as well as versioned objects and supports wildcards, so it's not that
easy to tell if a path spec matches something or is rather a typo.  But
it's not impossible either, of course.

What you can do until someone implements it is to simply omit the double
dash.  Path specs are then looked up as revs and files and you'll get an
error if they can't be found:

	# In the Linux kernel repo; we enter the wrong directory:
	$ cd drivers
	$ git grep blacklist_iommu v2.6.27 intel-iommu.c
	fatal: ambiguous argument 'intel-iommu.c': unknown revision or path not in the working tree.
	Use '--' to separate paths from revisions

	# Now we enter the right one and try again:
	$ cd pci
	$ git grep blacklist_iommu v2.6.27 intel-iommu.c
	v2.6.27:intel-iommu.c:static int blacklist_iommu(const struct dmi_system_id *id)
	v2.6.27:intel-iommu.c:		.callback = blacklist_iommu,

This won't work in bare repos or with wildcards, but it's better than
nothing.  And it saves you a few keystrokes.

René

^ permalink raw reply

* Re: Project- vs. Package-Level Branching in Git
From: Eugene Sajine @ 2011-01-28 21:43 UTC (permalink / raw)
  To: Thomas Hauk; +Cc: Ævar Arnfjörð Bjarmason, git
In-Reply-To: <15B7CA2E-C584-4563-B9E3-D80861CD9565@shaggyfrog.com>

> But how often do you have a project that has no external or internal dependencies on any other packages or libraries? Any project I've ever done, big or small, has relied on some existing codebase. Imagine a project that uses liba and libb, which both reference libc. To use Git, I'd have to have copies of libc existing in three repositories, and copies of liba and libb in two repositories each. What a nightmare... and that's only a trivial hypothetical example.
>
...
> I'm really trying to get on the Git bandwagon, here.
>
> --
> Thomas Hauk
> Shaggy Frog Software
> www.shaggyfrog.com
>

For example, at my shop we have a very "component oriented" approach
(Java). Each project is a separate git repository that produces
one artifact (.jar, .war etc.) and has an Ivy dependency descriptor. We
are using Hudson CI to perform integration builds of the projects
themselves upon push, and of their downstream projects as well. In this
particular example there is no need to keep any of the copies you're
talking about. For this Git works perfectly!!

OTOH there is a part of development that uses C++. The whole
infrastructure is built around static linking and depends so heavily on
CVS's ability to expand keywords that there is no way (at least so
far I could not find an acceptable solution) to migrate the C++
development to Git.

My question is: why do you think you should have the copies? Can it be
that the inability to use Git for your projects is related to the way
you do things? Maybe you just have to be ready for the paradigm
shift first?

Thanks,
Eugene

^ permalink raw reply

* Re: [PATCH] merge: default to @{upstream}
From: Martin von Zweigbergk @ 2011-01-28 21:41 UTC (permalink / raw)
  To: Bert Wesarg; +Cc: Felipe Contreras, git, Jonathan Nieder
In-Reply-To: <AANLkTimc92giAAJnzjv5Bq4f853xqEfLrgB=j+iRXPaf@mail.gmail.com>

On Fri, 28 Jan 2011, Bert Wesarg wrote:

> On Fri, Jan 28, 2011 at 17:17, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > So 'git merge' is 'git merge @{upstream}' instead of 'git merge -h';
> > it's better to do something useful.
> 
> Nice idea. Could you have a look into git rebase, I think this could
> be applied there too.

I submitted an RFC patch for that a while ago [1]. I will soon send a
re-roll of some rebase refactoring patches I have been working on (I
have been busy at work and also waiting for 1.7.4 to be finished). I
will then send an updated "default upstream" patch again on top of the
refactoring patches.

And thanks for taking care of the merge case, Felipe. I'm still
struggling with the part of Git written in C, so I'm glad you took
that part.

> 
> Anyway, I think some high level sanity check won't harm. Ie. check if
> there is an upstream configured.

Will be done in the case of rebase at least (stolen from the
implementation in git pull).

[1] http://thread.gmane.org/gmane.comp.version-control.git/161382/

^ permalink raw reply

* Re: [RFC] Add --create-cache to repack
From: Nicolas Pitre @ 2011-01-28 21:09 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley
In-Reply-To: <AANLkTikPcp5CUTWfhy6FYbCEkNG6epGBAMNT5vTfSbvy@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5392 bytes --]

On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> This started because I was looking for a way to speed up clones coming
> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> ...
> >> Later I realized, we can get rid of that cached list of objects and
> >> just use the pack itself.
> ...
> > Playing my old record again... I know.  But pack v4 should solve a big
> > part of this enumeration cost.
> 
> I've said the same thing for years myself.  As much as it would be
> nice to fix some of the decompression costs with pack v2/v3, v2/v3 is
> very common in the wild, and a new pack encoding is going to be a
> fairly complex thing to get added to C Git.  And pack v4 doesn't
> eliminate the enumeration, it just makes it faster.

Well, you don't necessarily need pack v4 to be widely deployed for 
people to benefit from it.  If it is available on servers such as 
git.kernel.org then everybody will see their clone requests go faster.  
Same principle as for the cache packs.

And yes it doesn't eliminate the enumeration, but you can't eliminate it 
entirely either as many other operations do require object enumeration 
too, and those would be sped up as well.

But this is in fact orthogonal to the cache pack concept.

> That's what I also liked about my --create-cache flag.  Its keeping
> the same data we already have, in the same format we already have it
> in.  We're just making a more explicit statement that everything in
> some pack is about as tightly compressed as it ever would be for a
> client, and it isn't going to change anytime soon.  Thus we might as
> well tag it with .keep to prevent repack from mucking with it, and we
> can take advantage of this to serve the pack to clients very fast.

I do agree on that point.   And I like it too.  However I'd prefer if 
the whole thing wasn't created "automatically".  It's probably best if 
the repository administrator decides explicitly what should go in such 
cached packs according to actual purpose and usage for good commit 
thresholds and branches.  Only a human can make that decision.

I'd also recommend _not_ using the ref namespace for that.  Let's not 
mix up branching/tagging with what is effectively a storage 
implementation issue. Linking the ref namespace with the actual packs 
they refer to would be highly inelegant if the SHA1 of the pack has to 
be part of the ref name.  Instead, I'd suggest simply listing all the
commit tips a cache pack contains directly in the .keep file.
That would make it much easier to use with the object alternates too as 
the alternate mechanism points to the object store of a foreign repo and 
not to its refs.

> Over breakfast this morning I made the point to Junio that with the
> cached pack and a slight network protocol change (enabled by a
> capability of course) we could stop using pkt-line framing when
> sending the cached pack part of the stream, and just send the pack
> directly down the socket.  That changes the clone of a 400 MB project
> like linux-2.6 from being a lot of user space stuff, to just being a
> sendfile() call for the bulk of the content.  I think we can just hand
> off the major streaming to the kernel. 

While this might look like a good idea in theory, did you actually 
profile it to see if that would make a noticeable difference?  The 
pkt-line framing allows for asynchronous messages to be sent over a 
sideband, which you wouldn't be able to do anymore until the full 400 MB 
is received by the remote side.  Without concrete performance numbers 
I'm not convinced it is worth the maintenance cost for creating a 
deviation in the protocol like this.
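
For readers less familiar with the framing being discussed, here is a
minimal sketch of how a pkt-line and a side-band packet are laid out: four
hex digits giving the total length (including those four bytes), then the
payload, where a side-band payload starts with one band byte (1 = pack
data, 2 = progress, 3 = error).  The function names are made up for this
example and are not git's own; the 65520-byte limit is the maximum total
pkt-line length.

```c
#include <stdio.h>
#include <string.h>

#define PKT_HEADER_LEN 4
#define PKT_MAX_LEN 65520 /* maximum total length, header included */

/* Hypothetical helper: write the 4 hex digit length prefix. */
static int put_header(char *out, size_t cap, size_t total)
{
	char header[PKT_HEADER_LEN + 1];

	if (total > PKT_MAX_LEN || total > cap)
		return -1;
	snprintf(header, sizeof(header), "%04x", (unsigned int)total);
	memcpy(out, header, PKT_HEADER_LEN);
	return 0;
}

/* Frame a plain payload; returns bytes written or -1 if it does not fit. */
static int pkt_line_format(char *out, size_t cap,
			   const void *payload, size_t len)
{
	size_t total = len + PKT_HEADER_LEN;

	if (put_header(out, cap, total) < 0)
		return -1;
	memcpy(out + PKT_HEADER_LEN, payload, len);
	return (int)total;
}

/* Frame side-band data: the band number is the first payload byte. */
static int sideband_format(char *out, size_t cap, int band,
			   const void *data, size_t len)
{
	size_t total = len + PKT_HEADER_LEN + 1;

	if (band < 1 || band > 3 || put_header(out, cap, total) < 0)
		return -1;
	out[PKT_HEADER_LEN] = (char)band;
	memcpy(out + PKT_HEADER_LEN + 1, data, len);
	return (int)total;
}
```

The per-packet cost is five bytes, which is why the framing itself is
cheap; the argument above is about the copy through user space and about
keeping the sideband available, not about the header bytes.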

> (Part of the protocol change
> is we would need to use multiple SHA-1 checksums in the stream, so we
> don't have to re-checksum the existing cached pack.)

?? I don't follow you here.

> I love the idea of some of the concepts in pack v4.  I really do.  But
> this sounds a lot simpler to implement, and it lets us completely
> eliminate a massive amount of server processing (even under pack v4
> you still have object enumeration), in exchange for what might be a
> few extra MBs on the wire to the client due to slightly less good
> deltas and the use of REF_DELTA in the thin pack used for the most
> recent objects.

I agree.  And what I personally like the most is the fact that this can 
be made transparent to clients using the existing network protocol 
unchanged.

> Plus we can safely do byte range requests for resumable clone within
> the cached pack part of the stream.

That part I'm not sure of.  We are still facing the same old issues 
here, as some mirrors might have the same commit edges for a cache pack 
but not necessarily the same packing result, etc.  So I'd keep that out 
of the picture for now.  The idea of being able to resume the transfer 
of a cache pack is good, however I'd make it into a totally separate 
service outside git-upload-pack where the issue of validating and 
updating content on both sides can be done efficiently without impacting 
the upload-pack protocol.  There would be more than just the cache pack 
in play during a typical clone.

> And when pack v4 comes along, we
> can use this same strategy for an equally large pack v4 pack.

Absolutely.

Nicolas

^ permalink raw reply

* Re: [PATCH 09/21] tree_entry_interesting(): support depth limit
From: Junio C Hamano @ 2011-01-28 20:40 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git
In-Reply-To: <1292425376-14550-10-git-send-email-pclouds@gmail.com>

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

>  static const char *get_mode(const char *str, unsigned int *modep)
> @@ -557,8 +558,13 @@ int tree_entry_interesting(const struct name_entry *entry,
>  	int pathlen, baselen = base->len;
>  	int never_interesting = -1;
>  
> -	if (!ps || !ps->nr)
> -		return 1;
> +	if (!ps->nr) {
> +		if (!ps->recursive || ps->max_depth == -1)
> +			return 1;
> +		return !!within_depth(base->buf, baselen,
> +				      !!S_ISDIR(entry->mode),
> +				      ps->max_depth);
> +	}

Back in 1d848f6 (tree_entry_interesting(): allow it to say "everything is
interesting", 2007-03-21), a new return value "2" was introduced to allow
this function to tell the caller that all the remaining entries in the
tree object the caller is feeding the entries to this function _will_
match.  This was to optimize away expensive pathspec matching done by this
function.

In that version, the "no pathspec" case wasn't changed to return 2 but still
returned 1 ("I tell you that this one matches; call me with the next
entry").  We could have changed it to return 2, but the overhead was only
a call to a function that checks the number of pathspecs and was not so
bad.

But shouldn't we start returning 2 by now?  It is not that returning 1 was
a more correct thing to do to begin with.

When depth check is in effect, the result depends on the mode of the
entry, so we cannot short-circuit by returning 2, but at least we should
do so when (max_depth == -1), no?
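
A minimal sketch of what that could look like for the "no pathspec"
branch, using simplified stand-in types.  struct pathspec_sketch,
depth_ok() and no_pathspec_result() are names invented here; depth_ok() is
only a rough approximation of what within_depth() is used for in the hunk
quoted above and may not match its exact semantics.

```c
#include <stdbool.h>

enum interesting {
	ENTRY_NOT_INTERESTING = 0,
	ENTRY_INTERESTING = 1,
	ALL_ENTRIES_INTERESTING = 2
};

/* Stand-in for the pathspec fields used in the quoted hunk. */
struct pathspec_sketch {
	int nr;         /* number of pathspecs */
	bool recursive;
	int max_depth;  /* -1 means no depth limit */
};

/*
 * Hypothetical depth check: count the '/' in base, plus one more level
 * if the entry itself is a directory.
 */
static bool depth_ok(const char *base, bool is_dir, int max_depth)
{
	int depth = is_dir ? 1 : 0;

	for (; *base; base++)
		if (*base == '/')
			depth++;
	return depth <= max_depth;
}

/* Result for the case where no pathspec was given (ps->nr == 0). */
static int no_pathspec_result(const struct pathspec_sketch *ps,
			      const char *base, bool is_dir)
{
	/* No depth limit: nothing later in this tree can fail to match. */
	if (!ps->recursive || ps->max_depth == -1)
		return ALL_ENTRIES_INTERESTING;
	/* With a depth limit the answer depends on each entry's mode. */
	return depth_ok(base, is_dir, ps->max_depth) ?
		ENTRY_INTERESTING : ENTRY_NOT_INTERESTING;
}
```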

^ permalink raw reply

* Re: Can't find the revelant commit with git-log
From: René Scharfe @ 2011-01-28 20:29 UTC (permalink / raw)
  To: Francis Moreau; +Cc: git, Johannes Sixt, Junio C Hamano
In-Reply-To: <4D4063EC.7090509@lsrfire.ath.cx>

Am 26.01.2011 19:11, schrieb René Scharfe:
> So far we have two action items, I think:
> 
> - Make git grep report non-matching path specs (new feature).
> 
> - Find out why removing the last path component made a difference in
> the case above (looks like a bug, but I don't understand what's going
> on).

OK, regarding the second point:

Merges that have at least one parent without changes in the selected
subset of files won't be displayed, not even with --full-history.  That
explains why removing the last path component made a difference: all the
merges ended up with a version of the file that matched one of their
parents, but there were other changes in the directory.

This is a feature: since the version of the file picked by the merge
must have been introduced by an earlier commit (a regular one,
presumably), you'll find it there anyway.

And this history simplification takes precedence over pickaxe (-S).

The patch below turns down its aggressiveness when the pickaxe is swung
at the same time.  Here's what it does to your use case:

	$ revs="v2.6.26..v2.6.29"
	$ opts="-Sblacklist_iommu --oneline -m --full-history"

	# This takes quite a while...
	$ git log $opts $revs | wc -l
	160

	# Without the patch:
	$ git log $opts $revs -- drivers/pci/intel-iommu.c | wc -l
	2

	# With the patch (really just its first hunk):
	$ git log $opts $revs -- drivers/pci/intel-iommu.c | wc -l
	160

	$ opts="-Sblacklist_iommu --oneline -m"

	# This takes quite a while...
	$ git log $opts $revs | wc -l
	160

	# Without the patch:
	$ git log $opts $revs -- drivers/pci/intel-iommu.c | wc -l
	0

	# With the patch:
	$ git log $opts $revs -- drivers/pci/intel-iommu.c | wc -l
	160

	$ opts="-Sblacklist_iommu --oneline"

	# This takes a bit, but not too long.
	$ git log $opts $revs | wc -l
	1

	# Without the patch:
	$ git log $opts $revs -- drivers/pci/intel-iommu.c | wc -l
	0

	# With the patch:
	$ git log $opts $revs -- drivers/pci/intel-iommu.c | wc -l
	1

The full output matches exactly if the number of lines match.  That's to
be expected, as the string "blacklist_iommu" only ever appears in the
file drivers/pci/intel-iommu.c.

It wasn't mentioned before v2.6.26 or after v2.6.29.

There is only one regular commit, namely the initial one that introduced
the function.  Some merges are reported more than once, once for every
parent where -S hit.  135 unique commits are reported.

-- >8 --
Subject: pickaxe: don't simplify history too much

If pickaxe is used, turn off history simplification and make sure to keep
merges with at least one interesting parent.

If path specs are used, merges that have at least one parent whose files
match those in the specified subset are edited out.  This is good in
general, but leads to unexpectedly few results if used together with
pickaxe.  Merges that also have an interesting parent (in terms of -S or
-G) are dropped, too.

This change makes sure pickaxe takes precedence over history
simplification.  This means path specs won't change the results as long
as they contain all the files that pickaxe turns up.  E.g. these two
commands now report the same single commit that added the function
blacklist_iommu to the specified file in the Linux kernel repo:

   $ git log -Sblacklist_iommu v2.6.26..v2.6.29 --
   $ git log -Sblacklist_iommu v2.6.26..v2.6.29 -- drivers/pci/intel-iommu.c

Previously the second one came up empty.

Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 revision.c                   |    5 +++++
 t/t6012-rev-list-simplify.sh |    2 ++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/revision.c b/revision.c
index 7b9eaef..cacf60c 100644
--- a/revision.c
+++ b/revision.c
@@ -441,6 +441,8 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
 	}
 	if (tree_changed && !tree_same)
 		return;
+	if (tree_changed && revs->diffopt.pickaxe)
+		return;
 	commit->object.flags |= TREESAME;
 }

@@ -1647,6 +1649,9 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	    revs->diffopt.filter ||
 	    DIFF_OPT_TST(&revs->diffopt, FOLLOW_RENAMES))
 		revs->diff = 1;
+
+	if (revs->diffopt.pickaxe)
+		revs->simplify_history = 0;

 	if (revs->topo_order)
 		revs->limited = 1;
diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
index af34a1e..b4fb8d0 100755
--- a/t/t6012-rev-list-simplify.sh
+++ b/t/t6012-rev-list-simplify.sh
@@ -86,5 +86,7 @@ check_result 'I H E C B A' --full-history --date-order -- file
 check_result 'I E C B A' --simplify-merges -- file
 check_result 'I B A' -- file
 check_result 'I B A' --topo-order -- file
+check_result 'I C B' -SHello
+check_result 'I C B' -SHello -- file

 test_done
-- 
1.7.3.4

^ permalink raw reply related

* Keeping the file modification date with git
From: Ronan Keryell @ 2011-01-28 19:49 UTC (permalink / raw)
  To: git

After heavily using git for code development, we plan to use it for
administrative storage and I need to keep the modification date
of the files.

Since I'm fond of git, I don't want to go back to the other tools I
used previously, even though they keep the modification date... :-)

So I'm envisioning different solutions:

- it is already done and I have missed it. :-) That would be great. :-)

- giving up. Not an option :-)

- it is added to git core functions because it is quite useful for some
  people. Too time-consuming for me since I'm not a git developer... But
  someone else could do this...

- add this concept aside. For example, just as there are .gitignore or
  .gitattributes files, we could have a .gitdates that would store in a
  human-readable manner the modification time of the files in its
  directory.

  A nice side effect is that if we have conflicts on modification times
  during merge, we could just resolve conflict in the date file. :-) Of
  course, having merge tools aware of this could help the user to deal
  with this.

  By the way, we could consider modification time as a special attribute
  in .gitattributes to avoid yet another .git file? But I guess that the
  date information may be huge for a big directory, so it may be better
  to keep it in a dedicated sorted file, to be processed efficiently
  without cluttering .gitattributes.

  The implementation could be done with helper functions called from
  various hooks such as:

    - updating the .gitdates could be done in the pre-commit hook so
      that they are added to the commit when needed;

    - when a checkout is done, the date of some files may be updated. I
      cannot see any hook for this...

  Well, there are many things that should be triggered in the git
  plumbing, or the user would have to explicitly launch some helper
  functions (for example to commit only a date change, since from git's
  point of view nothing has changed and the commit is not done...).

  I have no idea about the execution time we would have on a big
  repository with many files, with a naive implementation to test the
  concept...
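
  The .gitdates idea above can be sketched as a pair of helper
  functions, one to be called before a commit and one after a checkout.
  This is only a minimal sketch under assumptions: the file format (one
  "mtime<TAB>name" line per file, sorted by name), the function names
  save_dates and restore_dates, and the per-directory layout are all
  hypothetical, and nothing here is part of git.

```python
import os
from pathlib import Path

DATES_FILE = ".gitdates"  # hypothetical file name taken from the proposal


def save_dates(directory):
    """Record the mtime of each regular file in `directory`, sorted by name.

    Assumed format: "<mtime in nanoseconds>\t<file name>" per line.
    File names containing newlines are not handled by this sketch.
    """
    directory = Path(directory)
    lines = []
    for entry in sorted(directory.iterdir()):
        if entry.is_file() and entry.name != DATES_FILE:
            lines.append(f"{entry.stat().st_mtime_ns}\t{entry.name}\n")
    (directory / DATES_FILE).write_text("".join(lines), encoding="utf-8")


def restore_dates(directory):
    """Reapply the recorded mtimes; entries for missing files are skipped."""
    directory = Path(directory)
    dates = directory / DATES_FILE
    if not dates.exists():
        return
    for line in dates.read_text(encoding="utf-8").splitlines():
        mtime_ns, name = line.split("\t", 1)
        target = directory / name
        if target.is_file():
            # Keep the access time as it is and change only the mtime.
            atime_ns = target.stat().st_atime_ns
            os.utime(target, ns=(atime_ns, int(mtime_ns)))
```

  Because the file is sorted and line-oriented, two branches that touch
  different files would usually merge cleanly, and a real conflict shows
  up as an ordinary textual conflict on one line.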

Any comments on the different approaches?

Thanks,
-- 
  Ronan KERYELL                      |\/  GSM:    (+33|0) 6 13 14 37 66
  HPC Project                        |/)  Fax:    (+33|0) 1 46 01 05 46
  9 Route du Colonel Marcel Moraine  K    E-mail: rk@hpc-project.com
  92360 Meudon La Forêt              |\   skype:keryell
  FRANCE                             | \  http://hpc-project.com

^ permalink raw reply

* Re: [PATCH] merge: default to @{upstream}
From: Bert Wesarg @ 2011-01-28 19:53 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: git, Jonathan Nieder
In-Reply-To: <1296231457-18780-1-git-send-email-felipe.contreras@gmail.com>

On Fri, Jan 28, 2011 at 17:17, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
> So 'git merge' is 'git merge @{upstream}' instead of 'git merge -h';
> it's better to do something useful.

Nice idea. Could you have a look at git rebase? I think this could
be applied there too.

Anyway, I think a high-level sanity check wouldn't hurt, i.e. check
whether there is an upstream configured.

Thanks.
Bert

^ permalink raw reply

* Re: [RFC] Add --create-cache to repack
From: Shawn Pearce @ 2011-01-28 19:19 UTC (permalink / raw)
  To: Jay Soffian
  Cc: Johannes Sixt, git, Junio C Hamano, Nicolas Pitre, John Hawley
In-Reply-To: <AANLkTi=f34Q2VUrzA0dEG0KCFcHcd_Yq=UN6RSDPVS+p@mail.gmail.com>

On Fri, Jan 28, 2011 at 11:15, Jay Soffian <jaysoffian@gmail.com> wrote:
> On Fri, Jan 28, 2011 at 10:33 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
>> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
>> tips. A cache pack would be generated for each ref found in that
>> hierarchy. Then these commits are under user control even on github,
>> because you can just push the refs. Junio would perhaps choose a release
>> tag, and corresponding commits in the man and html histories. The choice
>> would not be completely automatic, though.
>
> This is just for bare repos, right? Why not just use HEAD?

Even on a bare repository a user might rewind his/her HEAD frequently.
Caching from today's HEAD might not be ideal if you are about to
rewrite the last 10 commits and push those again to the repository.
That's actually where the "1.month.ago" guess came from in the patch.
If we go back a little in history, the odds of a rewrite are reduced,
and we're more likely to be able to reuse this pack.

HEAD - X commits/X days might be a good approximation if there are no
refs/cache-pack *and* gc --auto notices there is "enough" content to
suggest creating a cached pack.  But I do like Johannes Sixt's
refs/cache-pack ref hierarchy as a way to configure this explicitly.
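
The selection rule described above could be sketched roughly as
follows. This is not how git or the patch implements it: the function
name choose_cache_tips, the plain dict/list inputs, and the age
threshold are all assumptions made for illustration, and the "enough
content" check done by gc --auto is left out.

```python
def choose_cache_tips(refs, head_history, now, min_age_seconds):
    """Pick the commits to build cached packs from (hypothetical helper).

    refs:         dict mapping ref name -> commit id
    head_history: list of (commit id, commit time in seconds), newest first
    Explicit refs under refs/cache-pack/ win; otherwise fall back to the
    newest commit on HEAD that is at least `min_age_seconds` old.
    """
    explicit = sorted({commit for name, commit in refs.items()
                       if name.startswith("refs/cache-pack/")})
    if explicit:
        return explicit
    for commit, commit_time in head_history:
        if now - commit_time >= min_age_seconds:
            return [commit]
    return []  # history is too young; do not create a cached pack
```

Going back in history for the fallback reflects the point above: an
older commit is less likely to be rewritten, so the pack built from it
is more likely to stay reusable.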

-- 
Shawn.

^ permalink raw reply

* Re: [RFC] Add --create-cache to repack
From: Jay Soffian @ 2011-01-28 19:15 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Shawn Pearce, git, Junio C Hamano, Nicolas Pitre, John Hawley
In-Reply-To: <4D42E1E3.4060808@viscovery.net>

On Fri, Jan 28, 2011 at 10:33 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
> tips. A cache pack would be generated for each ref found in that
> hierarchy. Then these commits are under user control even on github,
> because you can just push the refs. Junio would perhaps choose a release
> tag, and corresponding commits in the man and html histories. The choice
> would not be completely automatic, though.

This is just for bare repos, right? Why not just use HEAD?

j.

^ permalink raw reply
