Git development
 help / color / mirror / Atom feed
* Re: Handling large files with GIT
From: Linus Torvalds @ 2006-02-13  5:05 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Martin Langhoff, Florian Weimer, git
In-Reply-To: <Pine.LNX.4.64.0602122049010.3691@g5.osdl.org>



On Sun, 12 Feb 2006, Linus Torvalds wrote:
> 
> This is a large part of why git performs well on the kernel. Most merges 
> don't actually touch all - or even a very big percentage - of the over 
> thousand subdirectories in the kernel. Git can quickly see and ignore the 
> whole subdirectory when that happens - the SHA1 is exactly the same, so 
> git knows that every file under that subdirectory (and every recursive 
> directory) is the same.

Final note: this means, for example, that git is relatively bad at 
tracking a "hashed" nested file directory (like the one git itself uses), 
because new files will end up randomly appearing in every directory, and 
no directory is ever "stable".

In contrast, if the directory structure is - for example - something where 
you index files by date, and subdirectories with older dates are thus much 
more naturally likely to be quiescent, the "this tree is the same" 
optimizations work very well.

Basically, a lot of the git speed optimizations depend on "on average, 
things stay the same". We may have 18,000+ files in the kernel, but most 
patches will change maybe five of them. There's a lot of fairly static 
content and the changes have a certain level of "locality". It's normally 
a hundred-line patch to one file, not a hundred files that had one-liners. 
And when 20 files are changed, most of them tend to be in the same 
subdirectory, etc etc.

Taking advantage of those kinds of things is what makes git good at 
handling software projects. But it wouldn't necessarily be how you lay out 
a mail directory, for example. An automated file store might want to 
spread out the changes on purpose.

		Linus

^ permalink raw reply

* Re: Handling large files with GIT
From: Linus Torvalds @ 2006-02-13  4:57 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Martin Langhoff, Florian Weimer, git
In-Reply-To: <Pine.LNX.4.64.0602121939070.3691@g5.osdl.org>



On Sun, 12 Feb 2006, Linus Torvalds wrote:
> 
> If it takes an hour per merge, it _is_ unusable. I consider 15 _seconds_ 
> to be pretty unusable.

Btw, one thing to realize is that git is inherently a lot better at 
handling lots of files in _subdirectories_, especially if those 
subdirectories don't change.

I've never used maildir layout, but if it is a couple of large _flat_ 
subdirectories, git will potentially handle that a lot worse than if you 
have a hierarchy of directories.

I say "potentially", because if the directories are all mutable and 
change, then the flat approach is better. But if they tend to have some 
kind of stability, a lot of git operations (diffing and merging in 
particular) are able to see that two subdirectories are 100% equal, and 
will entirely skip them.

This is a large part of why git performs well on the kernel. Most merges 
don't actually touch all - or even a very big percentage - of the over 
thousand subdirectories in the kernel. Git can quickly see and ignore the 
whole subdirectory when that happens - the SHA1 is exactly the same, so 
git knows that every file under that subdirectory (and every recursive 
directory) is the same.

In contrast, if you have a million files in one directory, and 10 of them 
changed, git will still have to check the SHA1's for matches for the other 
999,990 files. Which is going to be slow.

That said, I suspect there's space for optimization. 

		Linus

^ permalink raw reply

* Re: [ANNOUNCE] pg - A patch porcelain for GIT
From: Sam Vilain @ 2006-02-13  4:40 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Git Mailing List
In-Reply-To: <20060213032903.GA32121@spearce.org>

Shawn Pearce wrote:
>>>I just posted the first public version of pg, a GIT porcelain for
>>>managing patches.  Think StGIT, but better in some ways:
>>>Feature Summary:
>>>- Maximum compatibility with other GIT porcelains.
>>>- Simplified command line user interface.
>>How do I edit the description of an existing patch using pg?  Perhaps an
>>option to pg-push ?
> There isn't any description associated with a patch beyond its name
> (which can be changed with pg-rename).  Unlike StGIT pg currently
> doesn't store a description with each patch.
> This is partly because I want pg to extract the comments given to
> pg-ci to make the description of the patch during an export with
> pg-export - but I haven't written the code to walk back along the
> related commits and extract each comment.  On the other hand this
> might not be the best description for a patch.  :-)

ok.  Well, perhaps a nice solution might be just to aggregate the
comments as each new commit is made.  ie, the previous comment is
prepended to the new comment unless you use the editor or a special
-M (or whatever) option that replaces the running comment.

I tried importing a patchset into pg, and made some changes to it to see
the patch revisioning going on.  However, I can't see this happening.
Can you perhaps include this information in your tutorial?

As far as other, more general critiques of the software goes:  What
about merging?  stgit has a very nice way of merging; I specify how to
merge using a config file, and when I rebase my patches with "stg pull",
it fires up my custom editor.  All I really want is a way to specify how
to handle merges, with the ancestor/left/right files on hand.  I want to
use something as simple as this script:

#!/bin/sh

branch1="$1"
branch2="$2"
ancestor="$3"
output="$4"

echo "Merging:"
echo
echo "   $branch1"
echo " - $ancestor"
echo
echo " with:"
echo
echo "   $branch2"
echo " - $ancestor"
echo
echo " to: $output"
echo ""
echo -n "Trying diff3..."

if diff3 -L local -L older -L remote -m -E "$branch1" "$ancestor" \
    "$branch2" > "$output"
then
     echo "OK"
else
     echo "failed"
     echo "falling back to ediff-merge"
     emacs --eval "(ediff-merge-files-with-ancestor \"${branch1}\"
                    \"${branch2}\" \"${ancestor}\" nil \"${output}\")"
fi

Those commands I got from the default .stgitrc config.

That's all the features I'm really after.

Sam.

^ permalink raw reply

* Re: Handling large files with GIT
From: Martin Langhoff @ 2006-02-13  4:40 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Florian Weimer, git
In-Reply-To: <Pine.OSX.4.64.0602131416530.25089@piva.hawaga.org.uk>

On 2/13/06, Ben Clifford <benc@hawaga.org.uk> wrote:
> Any advice/thoughts/suggestions-that-this-is-a-stupid-thing-to-do would be
> greatly appreciated.

How many files are you talking about?


m

^ permalink raw reply

* Re: Handling large files with GIT
From: Linus Torvalds @ 2006-02-13  3:42 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Martin Langhoff, Florian Weimer, git
In-Reply-To: <Pine.OSX.4.64.0602131416530.25089@piva.hawaga.org.uk>



On Mon, 13 Feb 2006, Ben Clifford wrote:
> 
> I've been keeping maildir in git for a few months, with mail being delivered
> into a git repo on one (permanently connected) host and me merging that branch
> into a repo on my laptop for reading (the intention being that I should be
> able to sync it back to the permanently connected host as I sometimes read
> mail there.
> 
> Alas, the merge part of this absolutely sucks -- as time goes by, its getting
> slower and slower (its taking an hour or so to do the merge, which has got to
> the point of being barely usable -- if it wasn't for the neat hack-value, I'd
> have given up on this by now).

If it takes an hour per merge, it _is_ unusable. I consider 15 _seconds_ 
to be pretty unusable.

Can you do a

	git-ls-tree -r -t HEAD
	git-ls-tree -r -t HEAD^1
	git-ls-tree -r -t HEAD^2

after a merge, and put the three resulting files up somewhere public (I 
assume the filenames aren't going to be anything private, I don't know how 
maildir organizes stuff) so that people can get an idea of what ends up 
being involved there..

		Linus

^ permalink raw reply

* Re: Make "git clone" less of a deathly quiet experience
From: Junio C Hamano @ 2006-02-13  3:36 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Keith Packard, Andreas Ericsson, Linus Torvalds, Git Mailing List,
	Petr Baudis
In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>

Martin Langhoff <martin.langhoff@gmail.com> writes:

> +1... there should be an easy-to-compute threshold trigger to say --
> hey, let's quit being smart and send this client the packs we got and
> get it over with. Or perhaps a client flag so large projects can
> recommend that uses do their initial clone with --gimme-all-packs?

What upload-pack does boils down to:

    * find out the latest of what client has and what client asked.

    * run "rev-list --objects ^client ours" to make a list of
      objects client needs.  The actual command line has multiple
      "clients" to exclude what is unneeded to be sent, and
      multiple "ours" to include refs asked.  When you are doing
      a full clone, ^client is empty and ours is essentially
      --all.

    * feed that output to "pack-objects --stdout" and send out
      the result.

If you run this command:

	$ git-rev-list --objects --all |
          git-pack-objects --stdout >/dev/null 

It would say some things.  The phases of operations are:

	Generating pack...
	Counting objects XXXX...
        Done counting XXXX objects.
        Packing XXXXX objects.....

Phase (1).  Between the time it says "Generating pack..." upto
"Done counting XXXX objects.", the time is spent by rev-list to
list up all the objects to be sent out.

Phase (2). After that, it tries to make decision what object to
delta against what other object, while twenty or so dots are
printed after "Packing XXXXX objects." (see #git irc log a
couple of days ago; Linus describes how pack building works).

Phase (3). After the dot stops, the program becomes silent.
That is where it actually does delta compression and writeout.

You would notice that quite a lot of time is spent in all
phases.

There is an internal hook to create full repository pack inside
upload-pack (which is what runs on the other end when you run
fetch-pack or clone-pack), but it works slightly differently
from what you are suggesting, in that it still tries to do the
"correct" thing.  It still runs "rev-list --objects --all", so
"dangling objects" are never sent out.

We could cheat in all phases to speed things up, at the expense
of ending up sending excess objects.  So let's pretend we
decided to treat everything in .git/objects/packs/pack-* (and
the ones found in alternates as well) have interesting objects
for the cloner.

(1) This part unfortunately cannot be totally eliminated.  By
    assume all packs are interesting, we could use the object
    names from the pack index, which is a lot cheaper than
    rev-list object traversal.  We still need to run rev-list
    --objects --all --unpacked to pick up loose objects we would
    not be able to tell by looking at the pack index to cover
    the rest.

    This however needs to be done in conjunction with the second
    phase change.  pack-objects depends on the hint rev-list
    --objects output gives it to group the blobs and trees with
    the same pathnames together, and that greatly affects the
    packing efficiency.  Unfortunately pack index does not have
    that information -- it does not know type, nor pathnames.
    Type is relatively cheap to obtain but pathnames for blob
    objects are inherently unavailable.

(2) This part can be mostly eliminated for already packed
    objects, because we have already decided to cheat by sending
    everything, so we can just reuse how objects are deltified
    in existing packs.  It still needs to be done for loose
    objects we collected to fill the gap in (1).

(3) This also can be sped up by reusing what are already in
    packs.  Pack index records starting (but not end) offset of
    each object in the pack, so we can sort by offset to find
    out which part of the existing pack corresponds to what
    object, to reorder the objects in the final pack.  This
    needs to be done somewhat carefully to preserve the locality
    of objects (again, see #git log).  The deltifying and
    compressing for loose objects cannot be avoided.

    While we are writing things out in (3), we need to keep
    track of running SHA1 sum of what we write out so that we
    can fill out the correct checksum at the end, but I am
    guessing that is relatively cheap compared to the
    deltification and compression cost we are currently paying
    in this phase.

NB. In the #git log, Linus made it sound like I am clueless
about how pack is generated, but if you check commit 9d5ab96,
the "recency of delta is inherited from base", one of the tricks
that have a big performance impact, was done by me ;-).

^ permalink raw reply

* Re: [ANNOUNCE] pg - A patch porcelain for GIT
From: Shawn Pearce @ 2006-02-13  3:29 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Git Mailing List
In-Reply-To: <43EFF3D0.4090701@vilain.net>

Sam Vilain <sam@vilain.net> wrote:
> Shawn Pearce wrote:
> >I just posted the first public version of pg, a GIT porcelain for
> >managing patches.  Think StGIT, but better in some ways:
> >
> >Feature Summary:
> >- Maximum compatibility with other GIT porcelains.
> >- Simplified command line user interface.
> 
> How do I edit the description of an existing patch using pg?  Perhaps an
> option to pg-push ?

There isn't any description associated with a patch beyond its name
(which can be changed with pg-rename).  Unlike StGIT pg currently
doesn't store a description with each patch.

This is partly because I want pg to extract the comments given to
pg-ci to make the description of the patch during an export with
pg-export - but I haven't written the code to walk back along the
related commits and extract each comment.  On the other hand this
might not be the best description for a patch.  :-)

-- 
Shawn.

^ permalink raw reply

* Re: [ANNOUNCE] pg - A patch porcelain for GIT
From: Sam Vilain @ 2006-02-13  2:49 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Git Mailing List
In-Reply-To: <20060210195914.GA1350@spearce.org>

Shawn Pearce wrote:
> I just posted the first public version of pg, a GIT porcelain for
> managing patches.  Think StGIT, but better in some ways:
> 
> Feature Summary:
> - Maximum compatibility with other GIT porcelains.
> - Simplified command line user interface.

How do I edit the description of an existing patch using pg?  Perhaps an
option to pg-push ?

Sam.

^ permalink raw reply

* Re: git-cvsserver & push/commit atomically
From: Randal L. Schwartz @ 2006-02-13  2:25 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Git Mailing List, martyn
In-Reply-To: <Pine.LNX.4.63.0602130110590.21465@wbgn013.biozentrum.uni-wuerzburg.de>

>>>>> "Johannes" == Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

Johannes> Correct. The filesystem way to lock is to create a lock file and
Johannes> fail if it already exists.

Unless NFS has been fixed, that's not an atomic operation over NFS.  Might not
matter here, but important to note if this is your first time seeing this lock
strategy.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply

* Re: Make "git clone" less of a deathly quiet experience
From: Martin Langhoff @ 2006-02-13  2:06 UTC (permalink / raw)
  To: Keith Packard
  Cc: Andreas Ericsson, Linus Torvalds, Junio C Hamano,
	Git Mailing List, Petr Baudis
In-Reply-To: <1139717510.4183.34.camel@evo.keithp.com>

On 2/12/06, Keith Packard <keithp@keithp.com> wrote:
> On Sun, 2006-02-12 at 04:43 +0100, Andreas Ericsson wrote:
>
> > A weird oddity; Cloning is faster over rsync, day-to-day pulling is not.
>
> Precisely. If the protocol could deliver existing packs instead of
> unpacking and repacking them, then git would be as fast as rsync and I
> wouldn't have to worry about supporting two protocols.

+1... there should be an easy-to-compute threshold trigger to say --
hey, let's quit being smart and send this client the packs we got and
get it over with. Or perhaps a client flag so large projects can
recommend that uses do their initial clone with --gimme-all-packs?

My workaround for large repos is to clone over http, and s/http:/git:/
on the origin file once it's done ;-)


martin

^ permalink raw reply

* Fake linear history in a deterministic manner.
From: Martin Langhoff @ 2006-02-13  1:46 UTC (permalink / raw)
  To: Git Mailing List, martyn

To emulate `cvs log somepath` I need to munge history to look linear.
I am working on the theory that I will tell the cvs client about *one*
linear history, and show merges from parallel histories as a merge
commit, "flattened" so to speak, and with a commit message where I'll
list the hash and first line of each commit that it involves.

Now, we are keeping a sqlite db (bad dependency, I know) with a list
of the lies we tell to the clients, so we can at least be consistent
(and fast). The fact that true git clients will be able to commit to
the repo means that actual parallel development will happen in the
repo.

Now, I can't think of an approach to drawing the linearized history
that is deterministic. I can't chose any branch with any confidence,
because the repo always has a very limited view. A client could come
in any time and push onto the repo a series of commits based on an
ancient commit, with ancient dates, and a merge to todays head.

We've been talking about updating the sqlite db with a post update
hook, which means that in that context we never have to choose, the
commits that get to the repo first win because they now drive the
linearized history.

But when creating a new lies database, we have no
"pushed-to-this-repo" timestamp in the commits so we'll have to pick.
At the moment, I suspect I'll pick the one with the earliest
"following" commit.

I thought briefly about delaying the decision until I see the merge,
and pick the leftmost, or rightmost, if there is some bias in
git-merge or cg-merge on putting whatever origin has on a particular
side. It'd mean running backwards through history and that the very
last merge can flip the decision entirely. Hmmm... any strategy I can
come up with means that each new merge throws the dice again entirely.

Ideas?

(As a result of this, the git-cvsserver we're drafting is of limited
usefulness to projects that really do use git to do what it does best:
DSCM. Projects with a mostly linearized history -- using git-rebase
liberally to avoid uninteresting merges -- do get a reasonable cvs
history. Or will get. Sometime. Soon.)



martin

^ permalink raw reply

* stg refresh/conflict resolution helptext/reality inconsistency
From: Ben Clifford @ 2006-02-13  0:14 UTC (permalink / raw)
  To: git, catalin.marinas


The following happens to me. The help text about using "refresh" doesn't 
seem to match up what I actually did. Am I doing something wrong?


$ stg push
Pushing patch "strcmp-ordering"...Error: three-way merge tool failed for 
file "imap/src/osdep/unix/maildir.c"
The merge failed during "push". Use "refresh" after fixing the conflicts
stg push: git-merge-index failed (possible conflicts)

[edit file to get rid of the <<< === >>> stuff]

$ stg refresh
stg refresh: Unsolved conflicts. Please resolve them first

$ rm .git/conflicts

$ stg refresh
Refreshing patch "strcmp-ordering"... done



This is the stg version I'm using:

$ stg --version
Stacked GIT 0.8
git version 1.1.6.gd19e
Python version 2.3.5 (#1, Mar 20 2005, 20:38:20)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)]

$ cd ~/src/stgit 
$ cg status | grep master
    >master      0a01a6d0eaf0c6a70819da2f73fc659e7a83b569

^ permalink raw reply

* Re: Handling large files with GIT
From: Ben Clifford @ 2006-02-13  1:26 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Florian Weimer, git
In-Reply-To: <46a038f90602081435x49e53a1cgdc56040a19768adb@mail.gmail.com>


On Thu, 9 Feb 2006, Martin Langhoff wrote:

> I did suggest maildir,  where GIT is bound to do well as the content
> of the emails doesn't change but they just move around a lot. Though
> yes, trees are going to be nasty.

I've been keeping maildir in git for a few months, with mail being 
delivered into a git repo on one (permanently connected) host and me 
merging that branch into a repo on my laptop for reading (the intention 
being that I should be able to sync it back to the permanently connected 
host as I sometimes read mail there.

Alas, the merge part of this absolutely sucks -- as time goes by, its 
getting slower and slower (its taking an hour or so to do the merge, which 
has got to the point of being barely usable -- if it wasn't for the neat 
hack-value, I'd have given up on this by now).

I haven't really probed whats happening when I'm doing the merges in any 
depth, but I see a lot of index manipulation happening (git update-index, 
I think) to add and remove files where each invocation of that seems to be 
taking almost a whole second.

I wonder if the present merge algorithms perform especially badly in the 
case of a large number of files with lots of renames (and so lots of 
adds/removes) but no content changes? The merge should be able to happen 
entirely in the index, I think.

Perhaps one way to proceed would be for me to write a move-optimised merge 
strategy where I flip the index round and instead of saying "how has the 
content inside this filename changed?" I instead say "how has the filename 
associated with this content <hash> changed?"

A special-case on top of a move-optimised merge might be some 
maildir-aware filename handling that knows how to resolve conflicts when a 
particular content-hash has been renamed to two different names (eg. when 
one flag is added to a message in one repo and a different flag is added 
to a message in another repo).

Any advice/thoughts/suggestions-that-this-is-a-stupid-thing-to-do would be 
greatly appreciated.

-- 

^ permalink raw reply

* Re: Fix object re-hashing
From: Johannes Schindelin @ 2006-02-13  0:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602121614340.3691@g5.osdl.org>

Hi,

On Sun, 12 Feb 2006, Linus Torvalds wrote:

> On Mon, 13 Feb 2006, Johannes Schindelin wrote:
> > 
> > Make hashtable resizing more robust AKA do not resize in-place
> 
> You forgot to release the old array afterwards.

D'oh! I am going to bed now.

> Anyway, I think the in-place version is fine now, even if it has a few 
> subtleties. So this isn't needed, but keep it in mind if we find another 
> bug, or if somebody wants to shrink the hash table less aggressively than 
> with doubling it every time.

Sounds fine to me.

Good night,
Dscho

^ permalink raw reply

* Re: git-cvsserver & push/commit atomically
From: Johannes Schindelin @ 2006-02-13  0:17 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List, martyn
In-Reply-To: <46a038f90602121550v4f487edfs788885a78c1b167@mail.gmail.com>

Hi,

On Mon, 13 Feb 2006, Martin Langhoff wrote:

>  - precondition: all the relevant objects are already in the repo
>  - reads old sha1
>  - creates repo.git/refs/<headname>.lock file with new sha1
>  - compares old and current head
>  - runs update hooks
>  - renames repo.git/refs/<headname>.lock into repo.git/refs/<headname>
> 
> So it is mostly race-safe, except for the window while we are running
> update hooks. Only a misbehaving implementation that doesn't fail on
> the creation of repo.git/refs/<headname>.lock  would affect us there,
> though. Would locking repo.git/refs/<headname> make sense?

Correct. The filesystem way to lock is to create a lock file and fail if 
it already exists. Since you are doing something akin to git-commit, I 
think you *must* lock repo.git/refs/<headname>.

> git-cvsserver commits are likely to be slow (not only likely, they
> _are_ slow right now), so we need a way to block other clients for a
> relatively long time.

I don't think there is a way around that. Of course, you could write the 
objects first without locking. But then you probably have loose objects in 
the repository if the commit fails due to another commit in the meantime.

<becomessweatyhands>
	BTW when are we going to see git-cvsserver?
</becomessweatyhands>

Ciao,
Dscho

^ permalink raw reply

* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-13  0:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.63.0602130007320.9507@wbgn013.biozentrum.uni-wuerzburg.de>



On Mon, 13 Feb 2006, Johannes Schindelin wrote:
> 
> Make hashtable resizing more robust AKA do not resize in-place

You forgot to release the old array afterwards.

Anyway, I think the in-place version is fine now, even if it has a few 
subtleties. So this isn't needed, but keep it in mind if we find another 
bug, or if somebody wants to shrink the hash table less aggressively than 
with doubling it every time.

		Linus

^ permalink raw reply

* Re: [ANNOUNCE] GIT 1.2.0
From: Johannes Schindelin @ 2006-02-13  0:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, linux-kernel
In-Reply-To: <7vaccw8bsc.fsf@assigned-by-dhcp.cox.net>

Hi,

On Sun, 12 Feb 2006, Junio C Hamano wrote:

> What's new in 1.2.0
> 
> The new release 1.2.0 took about one month since 1.1.0 was released.
> It has literally _tons_ of updates since the last maintenance release
> 1.1.6.

Wow. It did not seem like soooo much reading the mailing list (... and I 
thought git development would slow down a bit after 1.0 ;-)). Thank you 
very much, Junio. You are doing a very good job!

>   optimize objects bookkeeping for 5x speedup (Johannes)

Note: it is only 2x here. Also, thank you very much for this wonderful 
word, which contains three consecutive pairs of letters!

Ciao,
Dscho

^ permalink raw reply

* Re: Fix object re-hashing
From: Johannes Schindelin @ 2006-02-12 23:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602121101040.3691@g5.osdl.org>

Hi,

On Sun, 12 Feb 2006, Linus Torvalds wrote:

> [something about the overflow in another mail]

Thank you for thinking it through! I was soooo stuck with my original 
idea: Ideally (i.e. if there are no collisions), if the hashtable is 
doubled in size, then each offset should either stay the same, or be just 
incremented by the original size (since the index is the hash modulo the 
hashtable size).

So I wanted to be clever about resizing, and just increment the offset if 
necessary. As it turns out, it's more complicated than that. You have to 
make sure that those entries which collided  with another entry, but do no 
longer, are adjusted appropriately.

And the overflow problem eluded my attention entirely. (I feel quite silly 
about it, because I fixed so many buffer-overflow problems myself, and 
the cause of the problem is the same there.)

> On Sun, 12 Feb 2006, Linus Torvalds wrote:
> > 
> >  - the "overflow of the overflow": when the linear probing itself 
> >    overflows the size of the hash queue, it will "change direction" by 
> >    overflowing back to index zero.
> > 
> >    Happily, the re-hashing does not need to care about this case, because 
> >    the new hash is bigger: the rule we have when doing the re-hashing is 
> >    that as we re-hash, the "i" entries we have already re-hashed are all 
> >    valid in the new hash, so even if overflow occurs, it will occur the 
> >    right way (and if it overflows all the way past the current "i", we'll 
> >    re-hash the already re-hashed entry anyway).
> 
> Btw, this is only always true if the new hash is at least twice the size 
> of the old hash, I think. Otherwise a re-hash can fill up the new entries 
> and overflow entirely before we've actually even re-hashed all the old 
> entries, and then we'd need to re-hash even the overflowed entries (which 
> are now below "i").

After thinking long and hard about it, I tend to agree.

Note: I chose the factor 2 because hashtables tend to have *awful* 
performance when space becomes scarce. So, 2 is not only a wise choice for 
rehashing, but for the operation in general.

> Anyway, if all this makes you nervous, the conceptually much simpler way 
> to do the re-sizing is to not do the in-place re-hashing. Instead of doing 
> the xrealloc(), just do a "xmalloc()" of the new area, do the re-hashing 
> (which now _must_ re-hash in just the "0..oldcount-1" old area) into the 
> new area, and then free the old area after rehashing.
> 
> That would make things more obviously correct, and perhaps simpler.
> 
> Johannes, do you want to try that?

I do not particularly like it, since doubling the hashtable size is not 
particularly space efficient, and this makes it worse. Anyway, see below.

> Btw, as it currently stands, I worry a tiny tiny bit about the
> 
> 	obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
> 
> thing, because I think that second "32" needs to be a "64" to be really 
> safe (ie guarantee that the new obj_allocs value is always at least twice 
> the old one).

As Junio already pointed out: obj_allocs is initially set to 0. But you're 
right, it is conceptually wrong.

> Anyway, I'm pretty sure people smarter than me have already codified 
> exactly what needs to be done for a in-place rehash of a linear probe hash 
> overflow algorithm. This must all be in some "hashing 101" book. I had to 
> think it through from first principles rather than "knowing" what the 
> right answer was (which probably means that I slept through some 
> fundamental algorithms class in University ;)

Well, it seems like a long time, doesn't it? But I always liked the 
Fibonacci numbers, and therefore the Fibonacci heap.

---

Make hashtable resizing more robust AKA do not resize in-place

diff --git a/object.c b/object.c
index c9ca481..94f0f5d 100644
--- a/object.c
+++ b/object.c
@@ -56,18 +56,14 @@ void created_object(const unsigned char 
 
 	if (obj_allocs - 1 <= nr_objs * 2) {
 		int i, count = obj_allocs;
-		obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs);
-		objs = xrealloc(objs, obj_allocs * sizeof(struct object *));
-		memset(objs + count, 0, (obj_allocs - count)
-				* sizeof(struct object *));
-		for (i = 0; i < obj_allocs; i++)
-			if (objs[i]) {
-				int j = find_object(objs[i]->sha1);
-				if (j != i) {
-					j = -1 - j;
-					objs[j] = objs[i];
-					objs[i] = NULL;
-				}
+		struct object** old_objs = objs;
+		obj_allocs = (obj_allocs < 32 ? 64 : 2 * obj_allocs);
+		objs = xcalloc(obj_allocs, sizeof(struct object *));
+		for (i = 0; i < count; i++)
+			if (old_objs[i]) {
+				/* it is guaranteed to be new */
+				int j = -1 - find_object(old_objs[i]->sha1);
+				objs[j] = old_objs[i];
 			}
 	}
 

^ permalink raw reply related

* git-cvsserver & push/commit atomically
From: Martin Langhoff @ 2006-02-12 23:50 UTC (permalink / raw)
  To: Git Mailing List, martyn

I am trying to figure out a safe way to lock the git repo to prevent
clients pushing to it while someone else (via git-cvsserver.pl) is
manipulating it. Push and commit are *mostly* transactional it seems,
but I'm not 100% clear on how the head update gets updated safely.

Ahhhh. Hmmm. Reading git-receive-pack.c update(), it seems that we
have to mimic that behaviour in Perl. The semantics are a bit weird
for me -- not used to safe C open()s. From what I read, it looks like
it

 - precondition: all the relevant objects are already in the repo
 - reads old sha1
 - creates repo.git/refs/<headname>.lock file with new sha1
 - compares old and current head
 - runs update hooks
 - renames repo.git/refs/<headname>.lock into repo.git/refs/<headname>

So it is mostly race-safe, except for the window while we are running
update hooks. Only a misbehaving implementation that doesn't fail on
the creation of repo.git/refs/<headname>.lock  would affect us there,
though. Would locking repo.git/refs/<headname> make sense?

git-cvsserver commits are likely to be slow (not only likely, they
_are_ slow right now), so we need a way to block other clients for a
relatively long time.

cheers,


martin

^ permalink raw reply

* Re: [ANNOUNCE] GIT 1.2.0
From: Petr Baudis @ 2006-02-12 22:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, linux-kernel
In-Reply-To: <7vaccw8bsc.fsf@assigned-by-dhcp.cox.net>

Dear diary, on Sun, Feb 12, 2006 at 11:14:43PM CET, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> For the full list of changes above, see:
> 
> 	$ git log ^v1.1.6 v1.1.0..v1.2.0

Actually, you can do just

 	$ git log v1.1.6..v1.2.0

since v1.1.6 is a superset of v1.1.0. It is like "list all new patches I
will get when I update from v1.1.6 to v1.2.0".

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Of the 3 great composers Mozart tells us what it's like to be human,
Beethoven tells us what it's like to be Beethoven and Bach tells us
what it's like to be the universe.  -- Douglas Adams

^ permalink raw reply

* Re: [ANNOUNCE] GIT 1.2.0
From: Junio C Hamano @ 2006-02-12 22:29 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git
In-Reply-To: <20060212220406.GY31278@pasky.or.cz>

Petr Baudis <pasky@suse.cz> writes:

> Dear diary, on Sun, Feb 12, 2006 at 10:53:13PM CET, I got a letter
> where Junio C Hamano <junkio@cox.net> said that...
>> By the way, did I screw up the original announce?  I do not see
>> the message come back to me from vger...
>
> I didn't receive it either, and neither my git-bisect patch.

Hmph, maybe there is a hiccup at the majordomo?  I did not hear
much about that from HPA either but hopefully mine I now know
how to work it around.  Earlier I had git on To: and kernel on CC:
but this time I felt the announcement was important enough that
I had both on To: which I did not see come back.  A later resend
in my usual way made it through, so I'll stick to it as a
workaround for now.

Your patch e-mail, I _did_ get two copies, one direct and one
via vger, so it might be something else.

Anyhow, I am done for the day (it turned out to be a loooong
one), and I'll reopen "master" and "maint" after I get some
rest.

    X-From-Line: git-owner@vger.kernel.org Sun Feb 12 16:27:05 2006
    Received: from pop.west.cox.net (68.6.19.2) by osiris with POP3 for
      <postmaster@osiris>; 12 Feb 2006 16:27:05 -0000
    Return-Path: <git-owner@vger.kernel.org>
    Received: from fed1rmimpi03.cox.net ([70.169.32.70])
              by fed1rmmtai06.cox.net
              (InterMail vM.6.01.05.02 201-2131-123-102-20050715) with ESMTP
              id <20060212160519.RZSC2846.fed1rmmtai06.cox.net@fed1rmimpi03.cox.net>;
              Sun, 12 Feb 2006 11:05:19 -0500
    Received: from vger.kernel.org ([209.132.176.167])
            by fed1rmimpi03.cox.net with IMP
            id wg5C1T01W3d4pq90000000
            for nrsolis@cox.net; Sun, 12 Feb 2006 11:05:17 -0500
    Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
            id S1751138AbWBLQFj (ORCPT <rfc822;junkio@cox.net> + 2 others);
            Sun, 12 Feb 2006 11:05:39 -0500
    Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751139AbWBLQFj
            (ORCPT <rfc822;git-outgoing>); Sun, 12 Feb 2006 11:05:39 -0500
    Received: from w241.dkm.cz ([62.24.88.241]:40140 "EHLO machine.or.cz")
            by vger.kernel.org with ESMTP id S1751138AbWBLQFi (ORCPT
            <rfc822;git@vger.kernel.org>); Sun, 12 Feb 2006 11:05:38 -0500
    Received: (qmail 512 invoked by uid 2001); 12 Feb 2006 17:06:14 +0100
    Date:	Sun, 12 Feb 2006 17:06:14 +0100
    From:	Petr Baudis <pasky@suse.cz>
    To:	junkio@cox.net
    Cc:	git@vger.kernel.org
    Subject: [PATCH] Properly git-bisect reset after bisecting from non-master head
    Message-ID: <20060212160614.GV31278@pasky.or.cz>

^ permalink raw reply

* [ANNOUNCE] GIT 1.2.0
From: Junio C Hamano @ 2006-02-12 22:14 UTC (permalink / raw)
  To: git; +Cc: linux-kernel
In-Reply-To: <7vzmkw8d5b.fsf@assigned-by-dhcp.cox.net>

A resend.  I am suspecting vger does not want two lists hosted
there both named on To: line...

-- >8 --
The latest feature release GIT 1.2.0 is available at the
usual places:

	http://www.kernel.org/pub/software/scm/git/

	git-1.2.0.tar.{gz,bz2}			(tarball)
	RPMS/$arch/git-*-1.2.0-1.$arch.rpm	(RPM)

Right now binary RPM is available only for x86_64, because I do
not have an access to RPM capable i386 box.


What's new in 1.2.0

The new release 1.2.0 took about one month since 1.1.0 was released.
It has literally _tons_ of updates since the last maintenance release
1.1.6.  While people who have been following the master branch are
probably familiar with most of them, I expect others would be quite
surprised by the amount of changes.  Hopefully the surprise is of a
pleasant kind.

Here is the list of all the good stuff (not counting what are
already in the 1.1.X maintenance series).

* Major features

 - Checkout -m option.  This allows you to switch branches when
   you have local changes to paths that are different in the
   current branch and new branch.  The local changes are merged to
   the version from the branch you are switching to and the merge
   result is left in your working tree (you may see conflicts).

 - Add -c and --cc to diff-tree and diff-files.  They give the
   "combined diff" output that shows merges more human readably,
   and the "git diff" wrapper uses --cc by default.  This release
   also comes with updated gitk (Paul Mackerras) that takes advantage
   of this feature.

 - git-commit updates.

   - Allow git-commit from a subdirectory.

   - Aborted "git-commit -a" leaves the index as it was.

   - "git commit --only paths..." checks in changes to only
     named paths.

     This will become the default, instead of the traditional
     --include semantics, during development track towards
     1.3.0, so if you are used to the --include semantics,
     please start training your fingers to explicitly say
     "git commit --include paths...".

   - Add --author='A U Thor <author@example.com>' command
     line option.

   - "git commit -v" seeds the end of the commit log message editor
     with the patch to be committed.

 - git-status updates.  Now it takes the exactly same set of parameters
   as git-commit takes, and serves as a way to preview what you would
   be committing.  The -v option to get diff output is also available.

* Portability

  FreeBSD support in the main Makefile (Alecs King, Alex Riesen)
  Run GIT-VERSION-GEN with $(SHELL), not sh (Jason Riedy)
  compat/unsetenv.c (Jason Riedy)
  stat() to work around nonportable EEXIST (Jason Riedy)
  Asciidoc futureproofing (Pavel)
  Disable USE_SYMLINK_HEAD by default (Pavel)
  Various fixes for Cygwin around dirent and sockaddr_storage.

* Foreign SCM interface

  cvs-import updates (Andreas Ericsson and Martin)
  cvs-exportcommit fixes and docs (Martin)
  send-email various cleanups (Ryan)
  svnimport internal arglist fix (Sasha Khapyorsky)
  svnimport relative path fix (Christian Biesinger)

* Samples and Docs

  New tutorial (J. Bruce Fields)
  Update hook example (Andreas Ericsson)
  Topic-branch howto updates (Tony Luck)
  Other misc. cleanups (Florian Weimer, J. Bruce Fields, Petr Baudis, Jon Loeliger)
  Document git-diff-tree --always (Petr Baudis)
  Basic documentation for git-show (Petr Baudis)

* Cleanups

  Use tree_desc tree parser in tar-tree (Daniel, Linus)
  Use sha1_file.c's mkdir-like routine in apply.c (Jason Riedy)
  Use adler32() from zlib instead of defining our own (Peter Eriksen)
  Assorted cleanups (Uwe Zeisberger)
  Make --abbrev more consistent with diff-tree
  Omit duplicated parents from rev-list --parents output
  Clean up rev-parse error checking (Linus)

* Usability improvements

  git-apply -pN (Daniel)
  git-rev-list -n1 and -1 (Eric Wong)
  diff-tree --always flag (Linus)
  git-fetch -k (Tom Prince)
  git-describe: default to HEAD (Andreas Ericsson and me)
  git-clone --bare
  git-show and git-whatchanged (Linus and me)
  git-daemon user-relative paths updates (Mark Wooding and me)
  git-rerere reuse recorded resolve.
  git-fmt-merge-msg updates.
  git-format-patch: always --mbox and show sane Date (Eric W. Biederman et al)
  git-show-branch assorted updates
  git-commit seeds the message with list of conflicted files (Linus)
  git-clone and git-fetch shows progress information (Linus and me)
  git-repo-config allows type specifiers, does not dump core if bool (Pasky)

* Fixes

  git-format-patch -s (Eric W. Biederman)
  git-whatchanged: exit out early on errors (Linus)
  Various http-fetch fixes (Mark Wooding, Nick Hengeveld)
  Local push/pull env cleanup (Matt Draisey)
  Exec git programs without using PATH (Michal Ostrowski)
  Pass --upload-pack to fetch-pack and friends (Michal Ostrowski)
  Allow diff and index commands to be interrupted (Petr Baudis)
  Fix "git-push --tags".
  Parse sha1^12 for 12th parent correctly.
  git-diff-tree --stdin uses given parents by git-rev-list --parents.
  Show grafted parents in rev-list --parents.
  git-ls-files -o reads from --exclude-per-dir from upper directories.
  do not allow empty author/committer name/email.
  call setup_git_directory() before git_config().
  futureproof and optimize delta unpacker (Nicolas Pitre)

* Performance

  merge-recursive (Fredrik)
  git-read-tree --aggressive
  optimize objects bookkeeping for 5x speedup (Johannes)

For the full list of changes above, see:

	$ git log ^v1.1.6 v1.1.0..v1.2.0

^ permalink raw reply

* Re: [ANNOUNCE] GIT 1.2.0
From: Petr Baudis @ 2006-02-12 22:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vpsls8cs6.fsf@assigned-by-dhcp.cox.net>

Dear diary, on Sun, Feb 12, 2006 at 10:53:13PM CET, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> By the way, did I screw up the original announce?  I do not see
> the message come back to me from vger...

I didn't receive it either, and neither my git-bisect patch.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Of the 3 great composers Mozart tells us what it's like to be human,
Beethoven tells us what it's like to be Beethoven and Bach tells us
what it's like to be the universe.  -- Douglas Adams

^ permalink raw reply

* Re: [ANNOUNCE] GIT 1.2.0
From: Junio C Hamano @ 2006-02-12 21:53 UTC (permalink / raw)
  To: git
In-Reply-To: <43EFADD9.7010909@zytor.com>

"H. Peter Anvin" <hpa@zytor.com> writes:

> Junio C Hamano wrote:
>> The latest feature release GIT 1.2.0 is available at the
>> usual places:
>> 	http://www.kernel.org/pub/software/scm/git/
>> 	git-1.2.0.tar.{gz,bz2}			(tarball)
>> 	RPMS/$arch/git-*-1.2.0-1.$arch.rpm	(RPM)
>> Right now binary RPM is available only for x86_64, because I do
>> not have an access to RPM capable i386 box.
>>
>
> You can build the i386 binary rpms on hera as such:
>
> rpmbuild --rebuild --target i386 git-1.2.0-1.src.rpm
>
> I had to install openssl-devel from the i386 distribution, which for
> some reason wasn't part of the x86-64 distribution, but that's now
> taken care of.

OK.  Maybe I will when I find time, but not right now.

By the way, did I screw up the original announce?  I do not see
the message come back to me from vger...

^ permalink raw reply

* Re: [ANNOUNCE] GIT 1.2.0
From: H. Peter Anvin @ 2006-02-12 21:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, linux-kernel
In-Reply-To: <7vzmkw8d5b.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:
> The latest feature release GIT 1.2.0 is available at the
> usual places:
> 
> 	http://www.kernel.org/pub/software/scm/git/
> 
> 	git-1.2.0.tar.{gz,bz2}			(tarball)
> 	RPMS/$arch/git-*-1.2.0-1.$arch.rpm	(RPM)
> 
> Right now binary RPM is available only for x86_64, because I do
> not have an access to RPM capable i386 box.
> 

You can build the i386 binary rpms on hera as such:

rpmbuild --rebuild --target i386 git-1.2.0-1.src.rpm

I had to install openssl-devel from the i386 distribution, which for 
some reason wasn't part of the x86-64 distribution, but that's now taken 
care of.

	-hpa

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox