git.vger.kernel.org archive mirror
* Question about your git habits
@ 2008-02-23  0:37 Chase Venters
  2008-02-23  1:26 ` Tommy Thorn
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Chase Venters @ 2008-02-23  0:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: git

I've been making myself more familiar with git lately and I'm curious what 
habits others have adopted. (I know there are a few documents in circulation 
that deal with using git to work on the kernel but I don't think this has 
been specifically covered).

My question is: If you're working on multiple things at once, do you tend to 
clone the entire repository repeatedly into a series of separate working 
directories and do your work there, then pull that work (possibly comprising 
a series of "temporary" commits) back into a separate local master 
repository with --squash, either into "master" or into a branch containing
the new feature?

Or perhaps you create a temporary topical branch for each thing you are 
working on, and commit arbitrary changes then checkout another branch when 
you need to change gears, finally --squashing the intermediate commits when a 
particular piece of work is done?

I'm using git to manage my project and I'm trying to determine the best
workflow I can. I figure that I'm going to have an "official" master
repository for the project, and I want to keep the revision history clean in
that repository (i.e., no messy intermediate commits that don't compile or only
implement a feature halfway).

On older projects I was using a centralized revision control system like
*cough* Subversion *cough* and I'd create separate branches which I'd check 
out into their own working trees.

It seems to me that having multiple working trees (effectively, cloning 
the "master" repository every time I need to make anything but a trivial 
change) would be most effective under git as well, since it doesn't require
creating messy intermediate commits in the first place (but still allows for
them if they are wanted). But I wonder how that approach would scale with a project
whose git repo weighed hundreds of megs or more. (With a centralized rcs, of 
course, you don't have to lug around a copy of the whole project history in 
each working tree.)

Insight appreciated, and I apologize if I've failed to RTFM somewhere.

Thanks,
Chase


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
@ 2008-02-23  1:26 ` Tommy Thorn
  2008-02-23  1:28 ` Steven Walter
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Tommy Thorn @ 2008-02-23  1:26 UTC (permalink / raw)
  To: Chase Venters; +Cc: git

Chase Venters wrote:
> My question is: If you're working on multiple things at once, do you tend to 
> clone the entire repository repeatedly into a series of separate working 
> directories and do your work there, then pull that work (possibly comprising 
> a series of "temporary" commits) back into a separate local master 
> repository with --squash, either into "master" or into a branch containing
> the new feature?
>   

IMO, that approach scales poorly and involves a lot of overhead.

> Or perhaps you create a temporary topical branch for each thing you are 
> working on, and commit arbitrary changes then checkout another branch when 
> you need to change gears, finally --squashing the intermediate commits when a 
> particular piece of work is done?
>   

Spot on.


Distribution pruned for relevance.

Tommy


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
  2008-02-23  1:26 ` Tommy Thorn
@ 2008-02-23  1:28 ` Steven Walter
  2008-02-23  1:37 ` Jan Engelhardt
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Steven Walter @ 2008-02-23  1:28 UTC (permalink / raw)
  To: Chase Venters; +Cc: git

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> My question is: If you're working on multiple things at once, do you tend to 
> clone the entire repository repeatedly into a series of separate working 
> directories and do your work there, then pull that work (possibly comprising 
> a series of "temporary" commits) back into a separate local master 
> repository with --squash, either into "master" or into a branch containing
> the new feature?
> 
> Or perhaps you create a temporary topical branch for each thing you are 
> working on, and commit arbitrary changes then checkout another branch when 
> you need to change gears, finally --squashing the intermediate commits when a 
> particular piece of work is done?

I favor the second approach: single working copy, multiple branches.  My
feeling is that wanting multiple workspaces is a holdover from using
Subversion.  For me, it is much faster to "git commit -a -m wip"
and then switch branches, than it would be to clone a whole new
repository and manage the inter-repository relationships.

Don't get so down on the "intermediate commits," either.  For one,
whenever I switch back to a branch with a "wip" commit, I usually do a
"git reset HEAD^" to remove it and get my working tree back where it
was.  There are also nifty tools like interactive rebase that assist
you in rewriting history to produce a set of clean, atomic commits.
It's not imperative in git that your first draft be perfect.
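
For instance, a minimal sketch of that cycle (the branch names here are
made up):

    $ git checkout topic-a         # ...hack, get interrupted...
    $ git commit -a -m wip         # park the half-done work
    $ git checkout topic-b         # change gears
    $ git checkout topic-a         # come back later
    $ git reset HEAD^              # drop the wip commit, keep the changes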

[...]

> Insight appreciated, and I apologize if I've failed to RTFM somewhere.

No worries, I remember being in your situation once.  git opens up
a host of opportunities with its flexibility, and when getting started I
was consistently stumped by which of the many paths I should choose.
-- 
-Steven Walter <stevenrwalter@gmail.com>
Freedom is the freedom to say that 2 + 2 = 4
B2F1 0ECC E605 7321 E818  7A65 FC81 9777 DC28 9E8F 


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
  2008-02-23  1:26 ` Tommy Thorn
  2008-02-23  1:28 ` Steven Walter
@ 2008-02-23  1:37 ` Jan Engelhardt
  2008-02-23  1:44   ` Al Viro
  2008-02-23  1:42 ` Junio C Hamano
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Jan Engelhardt @ 2008-02-23  1:37 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git


On Feb 22 2008 18:37, Chase Venters wrote:
>
>I've been making myself more familiar with git lately and I'm curious what 
>habits others have adopted. (I know there are a few documents in circulation 
>that deal with using git to work on the kernel but I don't think this has 
>been specifically covered).
>
>My question is: If you're working on multiple things at once,

Impossible; humans only have one core with only seven registers --
according to CodingStyle chapter 6, paragraph 4.

>do you tend to clone the entire repository repeatedly into a series
>of separate working directories

Too time consuming on consumer drives with projects the size of Linux.

>and do your work there, then pull
>that work (possibly comprising a series of "temporary" commits) back
>into a separate local master repository with --squash, either into
>"master" or into a branch containing the new feature?

No, just commit the current unfinished work to a new branch and deal
with it later (cherry-pick, rebase, reset --soft, commit --amend -i,
you name it). Or if all else fails, use git-stash.

You do not have to push these temporary branches at all, so it is
much nicer than svn. (Once all the work is done and cleanly in
master, you can kill off all branches without having a record
of their previous existence.)

>Or perhaps you create a temporary topical branch for each thing you
>are working on, and commit arbitrary changes then checkout another
>branch when you need to change gears, finally --squashing the
>intermediate commits when a particular piece of work is done?

If I don't collect arbitrary changes, I don't need squashing
(see reset --soft/--amend above).
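
A rough sketch of that, with hypothetical branch names:

    $ git checkout -b wip-frob         # commit unfinished work to a new branch
    $ git commit -a -m 'WIP: frob'
    $ git checkout master              # ...deal with something else...
    $ git checkout wip-frob            # resume later
    $ git reset --soft HEAD^           # undo the WIP commit, keep it staged
    $ git commit -m 'frob: implement'  # one clean commit, nothing to squash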


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (2 preceding siblings ...)
  2008-02-23  1:37 ` Jan Engelhardt
@ 2008-02-23  1:42 ` Junio C Hamano
  2008-02-23 10:39   ` Samuel Tardieu
       [not found] ` <998d0e4a0802221736q4e4c3a28l101522912f7d3caf@mail.gmail.com>
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-02-23  1:42 UTC (permalink / raw)
  To: Chase Venters; +Cc: git

Chase Venters <chase.venters@clientec.com> writes:

[jc: kernel-list removed from CC: as this does not have anything
to do with them]

> My question is: If you're working on multiple things at once,
> do you tend to clone the entire repository repeatedly into a
> series of separate working directories and do your work there,
> then pull that work (possibly comprising a series of
> "temporary" commits) back into a separate local master
> repository with --squash, either into "master" or into a
> branch containing the new feature?
>
> Or perhaps you create a temporary topical branch for each
> thing you are working on, and commit arbitrary changes then
> checkout another branch when you need to change gears, finally
> --squashing the intermediate commits when a particular piece
> of work is done?

It is a matter of taste, but in any case, you should not have to
squash that often.  If you find you are always squashing because
you work on one thing and then switch to another thing before
you are done with the former, something is wrong.

	Clarification: I am not saying squashing is wrong.  I am
	just saying you should not have to.

If you want to park what you were working on before switching to
do something else, you can (and probably should) commit; it is a
very valid thing to do (an alternative is "git stash").

When resuming, if that parked commit was half-baked and
something you do not want to go back to later, then the next
commit (be it another commit that merely "parks" before getting
distracted to do something else, or a commit that finally gets
everything "finito") can be made with "commit --amend".  That
way, your sequences of commits will consist of only logically
separate units, without half-baked ones you had to create only
because you switched branches.
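
As a sketch of that pattern (branch names and commit messages are only
illustrative):

    $ git commit -a -m 'park: half-baked'   # before switching away
    $ git checkout other-topic               # ...do the other thing...
    $ git checkout this-topic                # resume
    $ git commit -a --amend                  # fold the new work into the
                                             # parked commit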

Some people prefer to use multiple simultaneous work trees.  You
certainly can use "clone" to achieve this.  And local clone is
very cheap as it shares the object database from the origin by
default.

Many people prefer to use topic branches, and working in a
single repository with multiple branches and switching branches
without ever cd'ing around is certainly a possible and very
valid way to work.  As long as your build infrastructure is sane
(e.g. your project does not have a central header file that any
little subsystem change needs to modify and that is included by
everybody, which tends to screw up make quite badly), switching
branches would not incur too much recompilation either, and it
obviously will save disk space not to have multiple checkouts
lying around.

You can also work with a single repository, multiple branches
and have multiple simultaneous work trees attached to that
single repository, by using contrib/workdir/git-new-workdir
script.
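
Its usage is roughly:

    $ git-new-workdir /path/to/project /path/to/project-topic topic
    # creates a second work tree whose .git metadata points back at the
    # same repository, checked out on branch 'topic' (paths made up)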


* Re: Question about your git habits
  2008-02-23  1:37 ` Jan Engelhardt
@ 2008-02-23  1:44   ` Al Viro
  2008-02-23  1:51     ` Junio C Hamano
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2008-02-23  1:44 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Chase Venters, linux-kernel, git

On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:

> >do you tend to clone the entire repository repeatedly into a series
> >of separate working directories
> 
> Too time consuming on consumer drives with projects the size of Linux.

git clone -l -s

is not particularly slow...
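
For reference, -l hardlinks the object files where possible and -s shares
the source's object store via .git/objects/info/alternates instead of
copying it; a made-up example:

    git clone -l -s linux-2.6 linux-work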


* Re: Question about your git habits
  2008-02-23  1:44   ` Al Viro
@ 2008-02-23  1:51     ` Junio C Hamano
  2008-02-23  2:09       ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-02-23  1:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Jan Engelhardt, Chase Venters, linux-kernel, git

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
>
>> >do you tend to clone the entire repository repeatedly into a series
>> >of separate working directories
>> 
>> Too time consuming on consumer drives with projects the size of Linux.
>
> git clone -l -s
>
> is not particularly slow...

How big is a checkout of a single revision of kernel these days,
compared to a well-packed history since v2.6.12-rc2?

The cost of writing out the work tree files isn't ignorable and
probably more than writing out the repository data (which -s
saves for you).


* Re: Question about your git habits
  2008-02-23  1:51     ` Junio C Hamano
@ 2008-02-23  2:09       ` Al Viro
       [not found]         ` <998d0e4a0802221823h3ba53097gf64fcc2ea826302b@mail.gmail.com>
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2008-02-23  2:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jan Engelhardt, Chase Venters, linux-kernel, git

On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
> >
> >> >do you tend to clone the entire repository repeatedly into a series
> >> >of separate working directories
> >> 
> >> Too time consuming on consumer drives with projects the size of Linux.
> >
> > git clone -l -s
> >
> > is not particularly slow...
> 
> How big is a checkout of a single revision of kernel these days,
> compared to a well-packed history since v2.6.12-rc2?
> 
> The cost of writing out the work tree files isn't ignorable and
> probably more than writing out the repository data (which -s
> saves for you).

Depends...  I'm using ext2 for that and noatime everywhere, so that might
change the picture, but IME it's fast enough...  As for the size, it gets
to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).


* Re: Question about your git habits
       [not found] ` <998d0e4a0802221736q4e4c3a28l101522912f7d3caf@mail.gmail.com>
@ 2008-02-23  2:46   ` J.C. Pizarro
  0 siblings, 0 replies; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23  2:46 UTC (permalink / raw)
  To: git

2008/2/23, Chase Venters <chase.venters@clientec.com> wrote:
 >
 > ... blablabla
 >
 >  My question is: If you're working on multiple things at once, do you tend to
 >  clone the entire repository repeatedly into a series of separate working
 >  directories and do your work there, then pull that work (possibly comprising
 >  a series of "temporary" commits) back into a separate local master
 >  repository with --squash, either into "master" or into a branch containing
 >  the new feature?
 >
 > ... blablabla
 >
 >  I'm using git to manage my project and I'm trying to determine the best
 >  workflow I can. I figure that I'm going to have an "official" master
 >  repository for the project, and I want to keep the revision history clean in
 >  that repository (i.e., no messy intermediate commits that don't compile or
 >  only implement a feature halfway).


I recommend these complementary tools:

   1. google: gitk screenshots  ( e.g. http://lwn.net/Articles/140350/ )

   2. google: "git-gui" screenshots
         ( e.g. http://www.spearce.org/2007/01/git-gui-screenshots.html )

   3. google: gitweb color meld

   ;)


* Re: Question about your git habits
       [not found]         ` <998d0e4a0802221823h3ba53097gf64fcc2ea826302b@mail.gmail.com>
@ 2008-02-23  2:47           ` J.C. Pizarro
  2008-02-23 11:39             ` Charles Bailey
  2008-02-23 14:08             ` Mike Hommey
  0 siblings, 2 replies; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23  2:47 UTC (permalink / raw)
  To: git

On 2008/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
 > On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
 >  > Al Viro <viro@ZenIV.linux.org.uk> writes:
 >  >
 >  > > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
 >  > >
 >  > >> >do you tend to clone the entire repository repeatedly into a series
 >  > >> >of separate working directories
 >  > >>
 >  > >> Too time consuming on consumer drives with projects the size of Linux.
 >  > >
 >  > > git clone -l -s
 >  > >
 >  > > is not particularly slow...
 >  >
 >  > How big is a checkout of a single revision of kernel these days,
 >  > compared to a well-packed history since v2.6.12-rc2?
 >  >
 >  > The cost of writing out the work tree files isn't ignorable and
 >  > probably more than writing out the repository data (which -s
 >  > saves for you).
 >
 >
 > Depends...  I'm using ext2 for that and noatime everywhere, so that might
 >  change the picture, but IME it's fast enough...  As for the size, it gets
 >  to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).


Yesterday I git cloned git://foo.com/bar.git (777 MiB).
Today I git cloned git://foo.com/bar.git again (779 MiB).

The two repos are different at the binary level, and I used
777 MiB + 779 MiB = 1556 MiB of bandwidth in two days. That's a lot!

Why don't we implement a "binary delta between the old git repo and the
recent git repo", with a SHA1 verifier for the rebuilt repo?

Suppose the size cost of this binary delta is e.g. around 52 MiB instead
of 2 MiB, due to numerous mismatches between binary parts; then the
bandwidth over the two days would be 777 MiB + 52 MiB = 829 MiB instead
of 1556 MiB.

Unfortunately, this "binary delta of repos" is not implemented yet :|


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (4 preceding siblings ...)
       [not found] ` <998d0e4a0802221736q4e4c3a28l101522912f7d3caf@mail.gmail.com>
@ 2008-02-23  4:10 ` Daniel Barkalow
  2008-02-23  5:03   ` Jeff Garzik
  2008-02-23  9:18   ` Mike Hommey
  2008-02-23  4:39 ` Rene Herman
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 29+ messages in thread
From: Daniel Barkalow @ 2008-02-23  4:10 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On Fri, 22 Feb 2008, Chase Venters wrote:

> I've been making myself more familiar with git lately and I'm curious what 
> habits others have adopted. (I know there are a few documents in circulation 
> that deal with using git to work on the kernel but I don't think this has 
> been specifically covered).
> 
> My question is: If you're working on multiple things at once, do you tend to 
> clone the entire repository repeatedly into a series of separate working 
> directories and do your work there, then pull that work (possibly comprising 
> a series of "temporary" commits) back into a separate local master 
> repository with --squash, either into "master" or into a branch containing
> the new feature?
> 
> Or perhaps you create a temporary topical branch for each thing you are 
> working on, and commit arbitrary changes then checkout another branch when 
> you need to change gears, finally --squashing the intermediate commits when a 
> particular piece of work is done?

I find that the sequence of changes I make is pretty much unrelated to the 
sequence of changes that end up in the project's history, because my 
changes as I make them involve writing a lot of stubs (so I can build) and 
then filling them out. It's beneficial to have version control on this so 
that, if I screw up filling out a stub, I can get back to where I was.

Having made a complete series, I then generate a new series of commits,
each of which does one thing, without any of the bugs that I've since
resolved, such that the net result is the same as the end of the messy
history, except with any debugging or useless stuff skipped. It's this
series that gets merged into the project history, and I discard the other
history.

The real trick is that the early patches in a lot of series often refactor 
existing code in ways that are generally good and necessary for your 
eventual outcome, but which you'd never think of until you've written more 
of the series. Generating a new commit sequence is necessary to end up 
with a history where it looks from the start like you know where you're 
going and have everything done that needs to be done when you get to the 
point of needing it. Furthermore, you want to be able to test these 
commits in isolation, without the distraction of the changes that actually 
prompted them, which means that you want to have your working tree in a
state that you never actually had it in as you were developing the end 
result.

This means that you'll usually want to rewrite commits for any series that 
isn't a single obvious patch, so it's not a big deal to commit any time 
you want to work on some different branch.
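
One way to do that rewrite, as a sketch (branch names made up, and
assuming the series was built on master):

    $ git checkout -b for-upstream messy-topic
    $ git rebase -i master       # reorder, squash, and edit the messy
                                 # commits into clean, atomic ones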

	-Daniel
*This .sig left intentionally blank*


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (5 preceding siblings ...)
  2008-02-23  4:10 ` Daniel Barkalow
@ 2008-02-23  4:39 ` Rene Herman
  2008-02-23  8:56 ` Willy Tarreau
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Rene Herman @ 2008-02-23  4:39 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On 23-02-08 01:37, Chase Venters wrote:

> Or perhaps you create a temporary topical branch for each thing you are 
> working on, and commit arbitrary changes then checkout another branch
> when you need to change gears, finally --squashing the intermediate
> commits when a particular piece of work is done?

No very specific advice to give, but this is what I do, and I then pull all
(compilable) topic branches into a "local" branch for compilation. Just
wanted to remark that a definite downside is that switching branches a lot
also touches the tree a lot, and hence tends to trigger quite unwelcome
amounts of recompilation. Using ccache would probably be effective in this
situation, but I keep neglecting to check it out...

Rene


* Re: Question about your git habits
  2008-02-23  4:10 ` Daniel Barkalow
@ 2008-02-23  5:03   ` Jeff Garzik
  2008-02-23  9:18   ` Mike Hommey
  1 sibling, 0 replies; 29+ messages in thread
From: Jeff Garzik @ 2008-02-23  5:03 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Chase Venters, linux-kernel, git

Daniel Barkalow wrote:
> I find that the sequence of changes I make is pretty much unrelated to the 
> sequence of changes that end up in the project's history, because my 
> changes as I make them involve writing a lot of stubs (so I can build) and 
> then filling them out. It's beneficial to have version control on this so 
> that, if I screw up filling out a stub, I can get back to where I was.
> 
> Having made a complete series, I then generate a new series of commits,
> each of which does one thing, without any of the bugs that I've since
> resolved, such that the net result is the same as the end of the messy
> history, except with any debugging or useless stuff skipped. It's this
> series that gets merged into the project history, and I discard the other
> history.
> 
> The real trick is that the early patches in a lot of series often refactor 
> existing code in ways that are generally good and necessary for your 
> eventual outcome, but which you'd never think of until you've written more 
> of the series.

That summarizes well how I do original development, too.  Whether it's a
branch of an existing repo or a newly cloned repo, when working on new
code I will do a first pass, committing as I go to provide useful
checkpoints.

Once I reach a satisfactory state, I'll refactor the patches so that 
they make sense for upstream submission.

	Jeff


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (6 preceding siblings ...)
  2008-02-23  4:39 ` Rene Herman
@ 2008-02-23  8:56 ` Willy Tarreau
  2008-02-23  9:10 ` Sam Ravnborg
  2008-02-23 13:07 ` Jakub Narebski
  9 siblings, 0 replies; 29+ messages in thread
From: Willy Tarreau @ 2008-02-23  8:56 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> It seems to me that having multiple working trees (effectively, cloning 
> the "master" repository every time I need to make anything but a trivial 
> change) would be most effective under git as well, since it doesn't require
> creating messy intermediate commits in the first place (but still allows for
> them if they are wanted). But I wonder how that approach would scale with a project
> whose git repo weighed hundreds of megs or more. (With a centralized rcs, of 
> course, you don't have to lug around a copy of the whole project history in 
> each working tree.)

Take a look at git-new-workdir in git's contrib directory. I'm using it a
lot now. It makes it possible to set up as many workdirs as you want, sharing
the same repo. It's very dangerous if you're not rigorous, but it saves a lot
of time when you work on several branches at a time, which is even more true
for a project's documentation. The main thing to be careful about is not to
have the same branch checked out in several places.

Regards,
Willy


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (7 preceding siblings ...)
  2008-02-23  8:56 ` Willy Tarreau
@ 2008-02-23  9:10 ` Sam Ravnborg
  2008-02-23 13:07 ` Jakub Narebski
  9 siblings, 0 replies; 29+ messages in thread
From: Sam Ravnborg @ 2008-02-23  9:10 UTC (permalink / raw)
  To: Chase Venters; +Cc: linux-kernel, git

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> I've been making myself more familiar with git lately and I'm curious what 
> habits others have adopted. (I know there are a few documents in circulation 
> that deal with using git to work on the kernel but I don't think this has 
> been specifically covered).
> 
> My question is: If you're working on multiple things at once, do you tend to 
> clone the entire repository repeatedly into a series of separate working 
> directories and do your work there, then pull that work (possibly comprising 
> a series of "temporary" commits) back into a separate local master 
> repository with --squash, either into "master" or into a branch containing
> the new feature?

The simple (for me) workflow I use is to create a clone of the
kernel for each 'topic' I work on.
So at the same time I may have one or maybe up to five clones of the
kernel.

When I want to combine things I use git format-patch and git am.
Often there is some amount of editing done before combining stuff,
especially for larger changes where the first patches in the series are
often preparatory work that was identified in random order while I did
the initial work.
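
As a sketch, with made-up paths:

    $ cd ~/linux-topic-a
    $ git format-patch -o /tmp/topic-a origin/master  # one file per commit
    $ cd ~/linux-merge
    $ git am /tmp/topic-a/*.patch                     # apply the series here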

	Sam


* Re: Question about your git habits
  2008-02-23  4:10 ` Daniel Barkalow
  2008-02-23  5:03   ` Jeff Garzik
@ 2008-02-23  9:18   ` Mike Hommey
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Hommey @ 2008-02-23  9:18 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Chase Venters, linux-kernel, git

On Fri, Feb 22, 2008 at 11:10:48PM -0500, Daniel Barkalow wrote:
> I find that the sequence of changes I make is pretty much unrelated to the 
> sequence of changes that end up in the project's history, because my 
> changes as I make them involve writing a lot of stubs (so I can build) and 
> then filling them out. It's beneficial to have version control on this so 
> that, if I screw up filling out a stub, I can get back to where I was.
> 
> Having made a complete series, I then generate a new series of commits,
> each of which does one thing, without any of the bugs that I've since
> resolved, such that the net result is the same as the end of the messy
> history, except with any debugging or useless stuff skipped. It's this
> series that gets merged into the project history, and I discard the other
> history.
> 
> The real trick is that the early patches in a lot of series often refactor 
> existing code in ways that are generally good and necessary for your 
> eventual outcome, but which you'd never think of until you've written more 
> of the series. Generating a new commit sequence is necessary to end up 
> with a history where it looks from the start like you know where you're 
> going and have everything done that needs to be done when you get to the 
> point of needing it. Furthermore, you want to be able to test these 
> commits in isolation, without the distraction of the changes that actually 
> prompted them, which means that you want to have your working tree in a
> state that you never actually had it in as you were developing the end 
> result.
> 
> This means that you'll usually want to rewrite commits for any series that 
> isn't a single obvious patch, so it's not a big deal to commit any time 
> you want to work on some different branch.

I do that so much that I have this alias:
        reorder = !sh -c 'git rebase -i --onto $0 $0 $1'

... and actually pass it only one argument most of the time.
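
With that alias in the [alias] section of ~/.gitconfig, a hypothetical
use is:

    $ git reorder origin/master
    # expands to: git rebase -i --onto origin/master origin/master
    # i.e. interactively rewrite every commit made since origin/master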

Mike


* Re: Question about your git habits
  2008-02-23  1:42 ` Junio C Hamano
@ 2008-02-23 10:39   ` Samuel Tardieu
  0 siblings, 0 replies; 29+ messages in thread
From: Samuel Tardieu @ 2008-02-23 10:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chase Venters, git

>>>>> "Junio" == Junio C Hamano <gitster@pobox.com> writes:

Junio> Many people prefer to use topic branches, and working in a
Junio> single repository with multiple branches and switching branches
Junio> without ever cd'ing around is certainly a possible and very
Junio> valid way to work.  As long as your build infrastructure is
Junio> sane (e.g. your project does not have a central header file
Junio> that any little subsystem change needs to modify and that is
Junio> included by everybody, which tends to screw up make quite
Junio> badly), switching branches would not incur too much
Junio> recompilation either, and it obviously will save disk space
Junio> not to have multiple checkouts lying around.

And even in this case (central header file), ccache will greatly
decrease compilation time in the case of a C/C++ project.

  Sam
-- 
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/


* Re: Question about your git habits
  2008-02-23  2:47           ` J.C. Pizarro
@ 2008-02-23 11:39             ` Charles Bailey
  2008-02-23 13:08               ` J.C. Pizarro
  2008-02-23 14:08             ` Mike Hommey
  1 sibling, 1 reply; 29+ messages in thread
From: Charles Bailey @ 2008-02-23 11:39 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: git

On Sat, Feb 23, 2008 at 03:47:07AM +0100, J.C. Pizarro wrote:
> 
> Yesterday I git cloned git://foo.com/bar.git (777 MiB).
>  Today I git cloned git://foo.com/bar.git again (779 MiB).
> 
>  The two repos are different at the binary level, and I used
>  777 MiB + 779 MiB = 1556 MiB of bandwidth in two days. That's a lot!
> 
>  Why don't we implement a "binary delta between the old git repo and the
>  recent git repo", with a SHA1 verifier for the rebuilt repo?
> 
>  Suppose the size cost of this binary delta is e.g. around 52 MiB instead
>  of 2 MiB, due to numerous mismatches between binary parts; then the
>  bandwidth over the two days would be 777 MiB + 52 MiB = 829 MiB instead
>  of 1556 MiB.
> 
>  Unfortunately, this "binary delta of repos" is not implemented yet :|

It sounds like what concerns you is the bandwidth to git://foo.com. If
you are making the second clone somewhere where the first clone is
accessible and bandwidth between the clones is not an issue, then you
should be able to use the --reference parameter to git clone to just
fetch the missing ~2 MiB from foo.com.
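
Something like this, reusing yesterday's clone (paths hypothetical):

    $ git clone --reference /path/to/yesterdays/bar git://foo.com/bar.git
    # objects already present in the reference repository are not
    # transferred again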

A "binary delta of repos" should just be an 'incremental' pack file
and the git protocol should support generating an appropriate one. I'm
not quite sure what "not implemented yet" feature you are looking for.


* Re: Question about your git habits
  2008-02-23  0:37 Question about your git habits Chase Venters
                   ` (8 preceding siblings ...)
  2008-02-23  9:10 ` Sam Ravnborg
@ 2008-02-23 13:07 ` Jakub Narebski
  9 siblings, 0 replies; 29+ messages in thread
From: Jakub Narebski @ 2008-02-23 13:07 UTC (permalink / raw)
  To: Chase Venters; +Cc: git

[removed linux-kernel list from Cc]

Chase Venters <chase.venters@clientec.com> writes:

> My question is: If you're working on multiple things at once, do you
> tend to clone the entire repository repeatedly into a series of
> separate working directories and do your work there, then pull that
> work (possibly comprising a series of "temporary" commits) back into
> a separate local master repository with --squash, either into
> "master" or into a branch containing the new feature?

An alternate solution is to use multiple working trees (multiple working
directories) with a single repository, although it is still a bit
fragile; you should take care not to check out the same branch multiple
times. IIRC, when discussing ".git" as a file representing a symlink,
there was some discussion on how to improve the multiple-workspaces
workflow.
 
> Or perhaps you create a temporary topical branch for each thing you
> are working on, and commit arbitrary changes then checkout another
> branch when you need to change gears, finally --squashing the
> intermediate commits when a particular piece of work is done?

I personally prefer this workflow, but I do not work as a main
contributor to nor maintainer of a large project.

As to intermediate commits: if you feel the need to interrupt work
which is not quite ready for a final commit, you can either use the
"git stash" command, or commit it as a WIP commit and then, when
going back, just "git commit --amend" it.

Moreover, when working on some larger topic which needs to be split
into individual commits for better history clarity, and for better
bisectability, you usually rewrite history before submitting
(publishing) your changes. You usually have to reorder commits (for
example moving improvements to infrastructure before the commits
introducing a new feature), split commits (separating a just-noticed
bugfix from a feature commit), squash commits (joining a feature
commit and its bugfix), etc. You can use "git rebase --interactive"
for that, or one of the Quilt-like patch management interfaces for
git: StGit (which I personally use) or Guilt (an idea based on mq,
the Mercurial queues extension).

[...]

> It seems to me that having multiple working trees (effectively, cloning 
> the "master" repository every time I need to make anything but a trivial 
> change) would be most effective under git as well, since it doesn't require
> creating messy intermediate commits in the first place (but still allows for
> them if they are wanted). But I wonder how that approach would scale with a project
> whose git repo weighed hundreds of megs or more. (With a centralized rcs, of 
> course, you don't have to lug around a copy of the whole project history in 
> each working tree.)

You can always clone using the --shared option to set up alternates;
this way only new objects (new commits) are stored in the clone. This
of course requires the clone and the source to be on the same
filesystem.

By default git-clone on a local filesystem uses hardlinks, so it also
should not be too hard on disk space.
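
For example (paths made up):

    $ git clone -s /src/project /src/project-topic   # alternates: only new
                                                     # objects stored here
    $ git clone /src/project /src/project-copy       # plain local clone:
                                                     # objects hardlinked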

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Question about your git habits
  2008-02-23 11:39             ` Charles Bailey
@ 2008-02-23 13:08               ` J.C. Pizarro
  2008-02-23 13:17                 ` Charles Bailey
  0 siblings, 1 reply; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23 13:08 UTC (permalink / raw)
  To: Charles Bailey, LKML, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> On Sat, Feb 23, 2008 at 03:47:07AM +0100, J.C. Pizarro wrote:
>  >
>  > Yesterday I git cloned git://foo.com/bar.git (777 MiB).
>  >  Today I git cloned git://foo.com/bar.git again (779 MiB).
>  >
>  >  The two repos are different at the binary level, and I used
>  >  777 MiB + 779 MiB = 1556 MiB of bandwidth in two days. That's a lot!
>  >
>  >  Why don't we implement a "binary delta between the old git repo and
>  >  the recent git repo", with a SHA1 verifier for the rebuilt repo?
>  >
>  >  Suppose the size cost of this binary delta is e.g. around 52 MiB
>  >  instead of 2 MiB, due to numerous mismatches between binary parts;
>  >  then the bandwidth over the two days would be 777 MiB + 52 MiB =
>  >  829 MiB instead of 1556 MiB.
>  >
>  >  Unfortunately, this "binary delta of repos" is not implemented yet :|
>
>
> It sounds like what concerns you is the bandwidth to git://foo.com. If
>  you are making the second clone somewhere where the first clone is
>  accessible and bandwidth between the clones is not an issue, then you
>  should be able to use the --reference parameter to git clone to just
>  fetch the missing ~2 MiB from foo.com.
>
>  A "binary delta of repos" should just be an 'incremental' pack file
>  and the git protocol should support generating an appropriate one. I'm
>  not quite sure what "not implemented yet" feature you are looking for.

But if the repos are aggressively repacked, then the bit-to-bit
differences are not ~2 MiB.


* Re: Question about your git habits
  2008-02-23 13:08               ` J.C. Pizarro
@ 2008-02-23 13:17                 ` Charles Bailey
  2008-02-23 13:36                   ` J.C. Pizarro
  0 siblings, 1 reply; 29+ messages in thread
From: Charles Bailey @ 2008-02-23 13:17 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: LKML, git

On Sat, Feb 23, 2008 at 02:08:35PM +0100, J.C. Pizarro wrote:
> 
> But if the repos are aggressively repacked, then the bit-to-bit
> differences are not ~2 MiB.

It shouldn't matter how aggressively the repositories are packed or what
the binary differences between the pack files are. git clone
should (with the --reference option) generate a new pack for you with
only the missing objects. If these objects come to ~52 MiB then a lot has
been committed to the repository, and you're not going to be able to
get around a big download any other way.


* Re: Question about your git habits
  2008-02-23 13:17                 ` Charles Bailey
@ 2008-02-23 13:36                   ` J.C. Pizarro
  2008-02-23 14:01                     ` Charles Bailey
  0 siblings, 1 reply; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23 13:36 UTC (permalink / raw)
  To: Charles Bailey, LKML, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> On Sat, Feb 23, 2008 at 02:08:35PM +0100, J.C. Pizarro wrote:
>  >
>  > But if the repos are aggressively repacked, then the bit-to-bit
>  > differences are not ~2 MiB.
>
>
> It shouldn't matter how aggressively the repositories are packed or what
>  the binary differences between the pack files are. git clone
>  should (with the --reference option) generate a new pack for you with
>  only the missing objects. If these objects come to ~52 MiB then a lot has
>  been committed to the repository, and you're not going to be able to
>  get around a big download any other way.

You're wrong; nothing like ~52 MiB has to be committed to the repository.

I'm not talking about commits. I'm saying:

"Assume A and B are binary git repos and delta_B-A is another binary
file. I want to build B' = A + delta_B-A, where SHA1(B') = SHA1(B) is
verified to avoid corruption."

Assume B is the more highly repacked version of "A + the minor commits
of the day", as if B had spent 24 more hours optimizing the minimum
spanning tree. Wow!!!

* Re: Question about your git habits
  2008-02-23 13:36                   ` J.C. Pizarro
@ 2008-02-23 14:01                     ` Charles Bailey
  2008-02-23 17:10                       ` J.C. Pizarro
  0 siblings, 1 reply; 29+ messages in thread
From: Charles Bailey @ 2008-02-23 14:01 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: LKML, git

On Sat, Feb 23, 2008 at 02:36:59PM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> >
> > It shouldn't matter how aggressively the repositories are packed or what
> >  the binary differences between the pack files are. git clone
> >  should (with the --reference option) generate a new pack for you with
> >  only the missing objects. If these objects come to ~52 MiB then a lot has
> >  been committed to the repository, and you're not going to be able to
> >  get around a big download any other way.
> 
> You're wrong; nothing like ~52 MiB has to be committed to the repository.
> 
> I'm not talking about commits. I'm saying:
> 
> "Assume A and B are binary git repos and delta_B-A is another binary
> file. I want to build B' = A + delta_B-A, where SHA1(B') = SHA1(B) is
> verified to avoid corruption."
> 
> Assume B is the more highly repacked version of "A + the minor commits
> of the day", as if B had spent 24 more hours optimizing the minimum
> spanning tree. Wow!!!
> 

I'm not sure that I understand where you are going with this.
Originally, you stated that if you clone a 777 MiB repository on day
one, and then you clone it again on day two when it was 779 MiB, then
you currently have to download 777 + 779 MiB of data, whereas you
could download a 52 MiB binary diff. I have no idea where that value
of 52 MiB comes from, and I've no idea how many objects were committed
between day one and day two. If we're going to talk about details,
then you need to provide more details about your scenario.

Having said that, here is my original point in some more detail. git
repositories are not binary blobs, they are object databases. Better
than this, they are databases of immutable objects. This means that to
get the difference between one database and another, you only need to
add the objects that are missing from the other database. If the two
databases are actually a database and the same database at a short time
interval later, then almost all the objects are going to be common and
the difference will be a small set of objects. Using git:// this set
of objects can be efficiently transferred as a pack file. You may have
a corner case scenario where this isn't true, but in my
experience an incremental pack file will be a more compact
representation of this difference than a binary difference of two
aggressively repacked git repositories as generated by a generic
binary difference engine.

I'm sorry if I've misunderstood your last point. Perhaps you could
expand on the exact issue you are having if I have, as I'm not sure
that I've really answered your last message.


* Re: Question about your git habits
  2008-02-23  2:47           ` J.C. Pizarro
  2008-02-23 11:39             ` Charles Bailey
@ 2008-02-23 14:08             ` Mike Hommey
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Hommey @ 2008-02-23 14:08 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: git

On Sat, Feb 23, 2008 at 03:47:07AM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
>  > On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
>  >  > Al Viro <viro@ZenIV.linux.org.uk> writes:
>  >  >
>  >  > > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
>  >  > >
>  >  > >> >do you tend to clone the entire repository repeatedly into a series
>  >  > >> >of separate working directories
>  >  > >>
>  >  > >> Too time consuming on consumer drives with projects the size of Linux.
>  >  > >
>  >  > > git clone -l -s
>  >  > >
>  >  > > is not particularly slow...
>  >  >
>  >  > How big is a checkout of a single revision of kernel these days,
>  >  > compared to a well-packed history since v2.6.12-rc2?
>  >  >
>  >  > The cost of writing out the work tree files isn't ignorable and
>  >  > probably more than writing out the repository data (which -s
>  >  > saves for you).
>  >
>  >
>  > Depends...  I'm using ext2 for that and noatime everywhere, so that might
>  >  change the picture, but IME it's fast enough...  As for the size, it gets
>  >  to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).
> 
> 
> Yesterday I git cloned git://foo.com/bar.git (777 MiB).
>  Today I git cloned git://foo.com/bar.git again (779 MiB).

Why do you need to clone it again? Just git fetch from it.
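
For example, in yesterday's clone:

    $ cd bar              # the clone you already have
    $ git fetch origin    # transfers only the objects added since then,
                          # as a single incremental pack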

Mike


* Re: Question about your git habits
  2008-02-23 14:01                     ` Charles Bailey
@ 2008-02-23 17:10                       ` J.C. Pizarro
  2008-02-23 18:16                         ` Charles Bailey
  2008-02-23 18:19                         ` J.C. Pizarro
  0 siblings, 2 replies; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23 17:10 UTC (permalink / raw)
  To: Charles Bailey, LKML, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> On Sat, Feb 23, 2008 at 02:36:59PM +0100, J.C. Pizarro wrote:
>  > On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
>  > >
>
> > > It shouldn't matter how aggressively the repositories are packed or what
>  > >  the binary differences between the pack files are. git clone
>  > >  should (with the --reference option) generate a new pack for you with
>  > >  only the missing objects. If these objects come to ~52 MiB then a lot has
>  > >  been committed to the repository, and you're not going to be able to
>  > >  get around a big download any other way.
>  >
>  > You're wrong; nothing like ~52 MiB has to be committed to the repository.
>  >
>  > I'm not talking about commits. I'm saying:
>  >
>  > "Assume A and B are binary git repos and delta_B-A is another binary
>  > file. I want to build B' = A + delta_B-A, where SHA1(B') = SHA1(B) is
>  > verified to avoid corruption."
>  >
>  > Assume B is the more highly repacked version of "A + the minor commits
>  > of the day", as if B had spent 24 more hours optimizing the minimum
>  > spanning tree. Wow!!!
>  >
>
>
> I'm not sure that I understand where you are going with this.
>  Originally, you stated that if you clone a 777 MiB repository on day
>  one, and then you clone it again on day two when it was 779 MiB, then
>  you currently have to download 777 + 779 MiB of data, whereas you
>  could download a 52 MiB binary diff. I have no idea where that value
>  of 52 MiB comes from, and I've no idea how many objects were committed
>  between day one and day two. If we're going to talk about details,
>  then you need to provide more details about your scenario.

I didn't say that the A and B git repos are binary files; I said that
delta_B-A is a binary file.

I said ~15 hours ago: "Suppose the size cost of this binary delta is
e.g. around 52 MiB instead of 2 MiB, due to numerous mismatches between
binary parts ..."

A binary delta is different from the textual delta (between lines of
text) used in the git scheme (commits or changesets use textual deltas).
A textual delta can be compressed, resulting in a smaller binary object.
A git repository is a collection of such binary objects, plus some more.
You can't apply a textual delta to a git repository, only a binary
delta. You could apply a binary delta between two git-repacked
repositories if there were a program that generates binary deltas
between directories, but that is not implemented yet.
The SHA1 verifier is useful to avoid corruption of the generated
repository (if it is corrupted, then the delta, or the whole repo, has
to be cloned again until it is not corrupted).
A SHA1 covering a whole directory can be implemented as the SHA1 of the
sorted SHA1s of contents, filenames and properties. Anything altered,
added or removed from them implies a different SHA1.

Don't you understand what I'm saying? I will give you a practical example.
1. zip -r -8 foo1.zip foo1   # foo1 holds tons of information, as from a
   git repo
2. mv foo1 foo2 ; cp bar.txt foo2/
3. zip -r -9 foo2.zip foo2   # a little more optimized (= more highly
   repacked)
4. Apply a binary delta between foo1.zip and foo2.zip with a supposed
   delta program and you get delta_foo1_foo2.bin. The size of
   delta_foo1_foo2.bin is not nearly ~( size(foo2.zip) - size(foo1.zip) ).
5. Apply a hexadecimal diff and you will understand why it gives the
   exemplary ~52 MiB instead of the ~2 MiB that I mentioned.
6. You will see some identical parts in both foo1.zip and foo2.zip.
   Identical parts are good for smaller binary deltas. It's possible to
   get still smaller binary deltas when the identical parts are at random
   offsets or random locations, depending on how advanced the delta
   program is.
7. Same as above, but apply the binary delta to the two directories
   instead of the two files.

>  Having said that, here is my original point in some more detail. git
>  repositories are not binary blobs, they are object databases. Better
>  than this, they are databases of immutable objects. This means that to
>  get the difference between one database and another, you only need to
>  add the objects that are missing from the other database.

Databases of immutable objects <--- you're wrong, because you're
confusing things. There are mutable objects, such as the better deltas
of the minimum spanning tree.

The missing objects are not only the missing sources that you're
thinking of; they can be anything (blob, tree, commit, tag, etc.). The
deltas of the minimum spanning tree are also objects of the database
that can be erased or added when the spanning tree is altered (because
the altered spanning tree is smaller than the previous one) for a
better repack. Optimal repacking is still an NP problem, and solving
this bigger NP problem each day means 24/365 (eternal) computing.

The git database is the top-level ".git/" directory, but it holds
repacked binary information and always has some size, normally measured
in the MiBs that I was citing above.

>  If the two
>  databases are actually a database and the same database at a short time
>  interval later, then almost all the objects are going to be common and
>  the difference will be a small set of objects. Using git:// this set
>  of objects can be efficiently transferred as a pack file.

You're saying: repacked(A) + new objects, with the bandwidth cost of
the new objects.
But I'm saying: rerepacked(A + new objects), with the bandwidth cost of
a binary delta, where delta = repacked(A) - rerepacked(A + new objects),
and rerepacked(X) means spending more time repacking X again.

>  You may have
>  a corner case scenario where this isn't true, but in my
>  experience an incremental pack file will be a more compact
>  representation of this difference than a binary difference of two
>  aggressively repacked git repositories as generated by a generic
>  binary difference engine.

Yes, it's simpler and more compact, but the eternal 24/365 repacking
can make it e.g. 30% smaller after a few weeks, where the incremental
pack has gained nothing.

It would be a good idea for the weekly user to pick the binary delta
and for the daily developer to pick the incremental pack. Put both
modes to work in the git server.

>  I'm sorry if I've misunderstood your last point. Perhaps you could
>  expand on the exact issue you are having if I have, as I'm not sure
>  that I've really answered your last message.

   The misunderstanding can disappear ;)


* Re: Question about your git habits
  2008-02-23 17:10                       ` J.C. Pizarro
@ 2008-02-23 18:16                         ` Charles Bailey
  2008-02-23 18:47                           ` J.C. Pizarro
  2008-02-23 18:19                         ` J.C. Pizarro
  1 sibling, 1 reply; 29+ messages in thread
From: Charles Bailey @ 2008-02-23 18:16 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: git

I've cut the Cc list down to just the git mailing list as this isn't a
linux kernel issue.

On Sat, Feb 23, 2008 at 06:10:58PM +0100, J.C. Pizarro wrote:
> Don't you understand what I'm saying? I will give you a practical example.
> 1. zip -r -8 foo1.zip foo1   # foo1 holds tons of information, as from a
>    git repo
> 2. mv foo1 foo2 ; cp bar.txt foo2/
> 3. zip -r -9 foo2.zip foo2   # a little more optimized (= more highly
>    repacked)
> 4. Apply a binary delta between foo1.zip and foo2.zip with a supposed
>    delta program and you get delta_foo1_foo2.bin. The size of
>    delta_foo1_foo2.bin is not nearly ~( size(foo2.zip) - size(foo1.zip) ).
> 5. Apply a hexadecimal diff and you will understand why it gives the
>    exemplary ~52 MiB instead of the ~2 MiB that I mentioned.
> 6. You will see some identical parts in both foo1.zip and foo2.zip.
>    Identical parts are good for smaller binary deltas. It's possible to
>    get still smaller binary deltas when the identical parts are at random
>    offsets or random locations, depending on how advanced the delta
>    program is.
> 7. Same as above, but apply the binary delta to the two directories
>    instead of the two files.

I totally understand what you are saying here with your zip example.
In fact this supports my original interpretation of what you were
saying. The size of the difference between the 777 MiB repository
and the 779 MiB repository is 52 MiB, not because there is 52 MiB of
new data in the latter repository but because of the difficulty of
generating a minimal binary delta between the two.

This is why I suggest that an incremental pack file will probably make
a better method of supplying a 'diff' between the two.

> Databases of immutable objects <--- you're wrong, because you're
> confusing things. There are mutable objects, such as the better deltas
> of the minimum spanning tree.
> 
> The missing objects are not only the missing sources that you're
> thinking of; they can be anything (blob, tree, commit, tag, etc.). The
> deltas of the minimum spanning tree are also objects of the database
> that can be erased or added when the spanning tree is altered (because
> the altered spanning tree is smaller than the previous one) for a
> better repack. Optimal repacking is still an NP problem, and solving
> this bigger NP problem each day means 24/365 (eternal) computing.
> 
> The git database is the top-level ".git/" directory, but it holds
> repacked binary information and always has some size, normally measured
> in the MiBs that I was citing above.

You're confusing two things here. Conceptually, the git
database is a database of immutable objects. How it is stored is a
lower-level implementation detail (albeit a very important one in
practice). The delta chains in the pack files have nothing to do with
git objects.

> >  If the two
> >  databases are actually a database and the same database at a short time
> >  interval later, then almost all the objects are going to be common and
> >  the difference will be a small set of objects. Using git:// this set
> >  of objects can be efficiently transferred as a pack file.
> 
> You're saying: repacked(A) + new objects, with the bandwidth cost of
> the new objects.
> But I'm saying: rerepacked(A + new objects), with the bandwidth cost of
> a binary delta, where delta = repacked(A) - rerepacked(A + new objects),
> and rerepacked(X) means spending more time repacking X again.

You seem to be comparing something that I've said with something that
you said. Originally I thought that you were making a bandwidth
argument; now you seem to be making a repacking-time argument. Is X
supposed to represent the second cloned repository?

If you git clone with --reference, or git fetch, from a non-dumb source
repository, then the remote end will generate a pack file of just the
objects that you need to update the local repository. If the remote
side is fully packed then A can reuse the delta information it already
has to generate this pack efficiently. On the local side, there is no
need to unpack these objects at all; the pack can simply be placed in
the repository and used as-is.
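
For example (a sketch; the paths and URL are hypothetical):

    # reuse objects from an existing local clone; only the objects
    # missing from it travel over the network, as one pack
    git clone --reference /path/to/existing/clone \
        git://example.org/project.git project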

> >                                                                                     You may have
> >  a corner case scenario where the following isn't true, but in my
> >  experience an incremental pack file will be a more compact
> >  representation of this difference than a binary difference of two
> >  aggressively repacked git repositories as generated by a generic
> >  binary difference engine.
> 
> Yes, it's simpler and more compact, but the eternal 24/365 repacking can
> make it e.g. 30% smaller after a few weeks, while the incremental packs
> will have gained nothing.

What do you mean by 'the eternal repacking 24/365'? What is it trying
to achieve?

> It's a good idea for the weekly user to pick the binary delta and the
> daily developer to pick the incremental pack. Put both modes to work in
> the git server.

What is the weekly user? Why would the 'binary delta' be better than
an incremental pack in this case?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Question about your git habits
  2008-02-23 17:10                       ` J.C. Pizarro
  2008-02-23 18:16                         ` Charles Bailey
@ 2008-02-23 18:19                         ` J.C. Pizarro
  1 sibling, 0 replies; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23 18:19 UTC (permalink / raw)
  To: LKML, git

Google's Gmail made a mess of my last message: it wrapped a message of
X lines into (X+o) lines, breaking the original line layout of the
message.

    I don't see why Google should mangle the original lines of the
    messages I have sent.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Question about your git habits
  2008-02-23 18:16                         ` Charles Bailey
@ 2008-02-23 18:47                           ` J.C. Pizarro
  2008-02-23 19:28                             ` Charles Bailey
  0 siblings, 1 reply; 29+ messages in thread
From: J.C. Pizarro @ 2008-02-23 18:47 UTC (permalink / raw)
  To: Charles Bailey, git

On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> You're conflating two things here. Conceptually, the git database is a
>  database of immutable objects. How it is stored is a lower-level
>  implementation detail (albeit a very important one in practice). The
>  delta chains in the pack files have nothing to do with git objects.

Documentation/git-repack.txt says:

git-repack is used to combine all objects that do not currently
reside in a "pack", into a pack. It can also be used to re-organize
existing packs into a single, more efficient pack.

A pack is a collection of objects, individually compressed, with
delta compression applied, stored in a single file, with an
associated index file.

### Can you explain to me how the delta chains in the pack files have
 nothing to do with git objects? ###

Packs are used to reduce the load on mirror systems, backup engines,
disk storage, etc.
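
The corresponding command is git repack; for example (a sketch, with
arbitrary option values):

    # fold everything into one pack, recomputing deltas with a larger
    # search window for a potentially tighter result
    git repack -a -d -f --window=100 --depth=100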

> You seem to be comparing something that I've said with something that
>  you said. Originally I thought that you were making a bandwidth
>  argument; now you seem to be making a repacking-time argument. Is X
>  supposed to represent the second cloned repository?

Yes, X is like the 2nd cloned repository, but highly repacked; it is not
the same size.

>
>  If you git clone with --reference, or git fetch, from a non-dumb source
>  repository, then the remote end will generate a pack file of just the
>  objects that you need to update the local repository. If the remote
>  side is fully packed then A can reuse the delta information it already
>  has to generate this pack efficiently. On the local side, there is no
>  need to unpack these objects at all; the pack can simply be placed in
>  the repository and used as-is.

Isn't it redundant to keep loose git objects and pack files in the same repo?
1. Either erase the unnecessary pack files, because the loose git objects
     are there.
2. Or erase some loose git objects, because the delta chains in the pack
     files can regenerate the git objects erased previously.
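
Incidentally, git already ships a command for the second option (a sketch):

    # drop loose objects that are already contained in a pack
    git prune-packed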

> What do you mean by 'the eternal repacking 24/365'? What is it trying
>  to achieve?

It's an uninterrupted computation that generates a sequence of spanning
trees converging towards smaller packs.
   Each time a smaller spanning tree is found, the pack file is updated.
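
In git terms, the nearest approximation would be re-running an aggressive
repack with ever larger search parameters (a sketch; the loop and the
values are made up):

    # each pass may find a slightly smaller delta spanning tree
    for w in 50 100 250; do
        git repack -a -d -f --window=$w --depth=250
        du -sh .git/objects/pack
    done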

> What is the weekly user? Why would the 'binary delta' be better than
>  an incremental pack in this case?

Because the user who clones weekly wants 240 MiB in the 1st week, 220 MiB
in the 2nd week, 205 MiB in the 3rd week, ... a 100 MiB repo! in the Nth
week, instead of 240+1+1+1+1 MiB of incremental packs.

What is better for the user in the Nth week, a 100 MiB repo or a 244 MiB
repo?

   ;)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Question about your git habits
  2008-02-23 18:47                           ` J.C. Pizarro
@ 2008-02-23 19:28                             ` Charles Bailey
  0 siblings, 0 replies; 29+ messages in thread
From: Charles Bailey @ 2008-02-23 19:28 UTC (permalink / raw)
  To: J.C. Pizarro; +Cc: git

On Sat, Feb 23, 2008 at 07:47:13PM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Charles Bailey <charles@hashpling.org> wrote:
> > You're conflating two things here. Conceptually, the git database is a
> >  database of immutable objects. How it is stored is a lower-level
> >  implementation detail (albeit a very important one in practice). The
> >  delta chains in the pack files have nothing to do with git objects.
> 
> Documentation/git-repack.txt says:
> 
> git-repack is used to combine all objects that do not currently
> reside in a "pack", into a pack. It can also be used to re-organize
> existing packs into a single, more efficient pack.
> 
> A pack is a collection of objects, individually compressed, with
> delta compression applied, stored in a single file, with an
> associated index file.
> 
> ### Can you explain to me how the delta chains in the pack files have
>  nothing to do with git objects? ###

It's an abstraction thing. Perhaps I should have said that git objects
have nothing to do with pack files to indicate the direction of the
dependency.

> Isn't it redundant to keep loose git objects and pack files in the same repo?
> 1. Either erase the unnecessary pack files, because the loose git objects
>      are there.
> 2. Or erase some loose git objects, because the delta chains in the pack
>      files can regenerate the git objects erased previously.

Only if they overlap, but usually they don't.
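
You can check the overlap yourself (a sketch):

    # "count" is the number of loose objects, "in-pack" the packed ones;
    # right after a git gc the loose count is normally near zero
    git count-objects -v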

> > What is the weekly user? Why would the 'binary delta' be better than
> >  an incremental pack in this case?
> 
> Because the user who clones weekly wants 240 MiB in the 1st week, 220 MiB
> in the 2nd week, 205 MiB in the 3rd week, ... a 100 MiB repo! in the Nth
> week, instead of 240+1+1+1+1 MiB of incremental packs.
> 
> What is better for the user in the Nth week, a 100 MiB repo or a 244 MiB
> repo?
> 

That depends, doesn't it? If the everyday workflow is quicker and
easier, a 244 MiB clone could well be acceptable, and if it's not, there
is always the option of a repack. I don't buy the premise that people
want to be continually repacking to find the ultimate pack file; I don't
think the gain over a one-shot repack is ever going to be worth it.
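
A one-shot repack is also cheap enough to run whenever it matters (a
sketch):

    # a single aggressive repack, done once
    git gc --aggressive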

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-02-23 19:29 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-23  0:37 Question about your git habits Chase Venters
2008-02-23  1:26 ` Tommy Thorn
2008-02-23  1:28 ` Steven Walter
2008-02-23  1:37 ` Jan Engelhardt
2008-02-23  1:44   ` Al Viro
2008-02-23  1:51     ` Junio C Hamano
2008-02-23  2:09       ` Al Viro
     [not found]         ` <998d0e4a0802221823h3ba53097gf64fcc2ea826302b@mail.gmail.com>
2008-02-23  2:47           ` J.C. Pizarro
2008-02-23 11:39             ` Charles Bailey
2008-02-23 13:08               ` J.C. Pizarro
2008-02-23 13:17                 ` Charles Bailey
2008-02-23 13:36                   ` J.C. Pizarro
2008-02-23 14:01                     ` Charles Bailey
2008-02-23 17:10                       ` J.C. Pizarro
2008-02-23 18:16                         ` Charles Bailey
2008-02-23 18:47                           ` J.C. Pizarro
2008-02-23 19:28                             ` Charles Bailey
2008-02-23 18:19                         ` J.C. Pizarro
2008-02-23 14:08             ` Mike Hommey
2008-02-23  1:42 ` Junio C Hamano
2008-02-23 10:39   ` Samuel Tardieu
     [not found] ` <998d0e4a0802221736q4e4c3a28l101522912f7d3caf@mail.gmail.com>
2008-02-23  2:46   ` J.C. Pizarro
2008-02-23  4:10 ` Daniel Barkalow
2008-02-23  5:03   ` Jeff Garzik
2008-02-23  9:18   ` Mike Hommey
2008-02-23  4:39 ` Rene Herman
2008-02-23  8:56 ` Willy Tarreau
2008-02-23  9:10 ` Sam Ravnborg
2008-02-23 13:07 ` Jakub Narebski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).