* If I were redoing git from scratch...
From: Junio C Hamano @ 2006-11-04 11:34 UTC
  To: git

I've been thinking about these for a while in the back of my
mind, and thought it might be better to start writing them down.

A lot of these issues involve the UI, which means they will not
materialize without breaking existing uses, but if we know in
advance what we are aiming for, maybe we can find a smoother path
to get there.

* Core data structure

I consider the on-disk data structures and the on-wire protocol
we currently use sane, and there is not much to fix.  There are
certainly things to be enhanced (64-bit .idx offsets, for
example), but I do not think anything is fundamentally broken
and needs to be reworked.

I have the same feeling about the in-core data structures in
general, except for a few issues.

The biggest one is that we use too many static (worse, function
scope static) variables that live for the life of the process,
which makes many things very nice and easy ("run-once and let
exit clean up the mess" mentality), but because of this it
becomes awkward to do certain things.  Examples are:

 - Multiple invocations of the merge-base computation (each needs
   to clear the marks left on commit objects by the earlier
   traversal),

 - Creating a new pack and immediately starting to use it inside
   the same process (prepare_packed_git() is call-once, and we
   have hacks in many places to cause it to re-read the packs),

 - Visiting more than one repository within one process (many
   per-repository variables in sha1_file.c are static, and there
   is no "struct repository" that we can re-initialize in one go
   -- see the sketch after this list),

 - The object layer holds onto all parsed objects
   indefinitely.  Because the object store philosophically
   represents the global commit ancestry DAG, there is no
   inherent reason to have more than one instance of
   object.c::obj_hash even if we visit more than one
   repository in a process; but if the two repositories are
   unrelated, the objects from the repository we were looking
   at earlier only waste memory after we switch to the other
   repository,

 - The diffcore is not run-once but it is run-one-at-a-time.
   This is easy to fix if needed, though.
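
To illustrate (nothing below exists in the current code, and all
the names are made up), such a "struct repository" could gather
the per-repository statics so that they can be dropped or
re-initialized in one go:

	struct repository {
		char *gitdir;			/* instead of static GIT_DIR handling */
		char *objectdir;		/* instead of sha1_file.c path statics */
		struct packed_git *packs;	/* instead of the static pack list */
		struct object **obj_hash;	/* instead of object.c statics */
		int obj_hash_size;
	};

	int  repo_init(struct repository *repo, const char *gitdir);
	void repo_clear(struct repository *repo);	/* frees parsed objects */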

There are some other minor details but they are not as
fundamental.  Examples are:

 - The revision traversal is nicely done, but one gripe I have is
   that it is focused on painting commits into two (and only
   two) classes: interesting and uninteresting.  If we allowed
   more than one kind of "interesting" (especially an arbitrary
   number of kinds), answering questions like "which branches
   does this commit belong to?  which tagged versions already
   include this commit?" would become easier and more efficient.
   show-branch has machinery to do this for a handful of refs,
   but it could be unified with the revision.c traversal
   machinery (see the sketch after this list).

 - We have at least three independent implementations of
   pathspec match logic, with two different semantics (one is
   component-prefix match, the other is shell glob), and they
   should be unified.  You can say "git grep foo -- 't/t5*'" but
   not "git diff otherbranch -- 't/t5*'".


* Fetch/Push/Pull/Merge confusion

Everybody hates the fact that the inverse of push is fetch, not
pull, and that merge is not a usual Porcelain (while it _is_
usable as a regular UI command, it was originally done as a
lower-layer helper to the "pull" Porcelain, and it has a strange
parameter order with a seemingly useless HEAD parameter in the
middle).

If I were doing git from scratch, I would probably avoid any of
the above words that have loaded meanings from other SCMs.
Perhaps...

 - "git download" would download changes made in the other end
   since we contacted them the last time and would not touch our
   branches nor working tree (associate the word with getting
   tarballs -- people would not expect the act of downloading a
   tarball would touch their working tree nor local history.
   untarring it does).  It is a different story if the end-user
   should be required to explicitly say "download"; I am leaning
   towards making it more or less transparent.

 - "git upload" to upload our changes to the other end -- that
   is what "git push" currently does.

 - "git join" to merge another branch into the current branch,
   with the "per branch configuration" frills to decide what the
   default for "another branch" is based on what the current
   branch is, etc.
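
The "per branch configuration" might look something like this in
the config file (the "join" key is hypothetical, of course):

	[branch "next"]
		join = master		; "git join" on next merges master
	[branch "master"]
		join = origin/master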

* Less visible "remoteness" of remote branches

If I were doing git from scratch, I would probably have made
separate remotes _the_ only layout, except that I might have
opted to make "remotes" even less visible, treating it as merely
a cache of "the branch tips and tags we saw the last time we
connected over the network to look at them".

So "git branch --list $remote" might contact the remote over the
network or use cached version.  When you think about, it it is
not all that different from always contacting the remote end --
the remote end may have mirror propagation delays, and your
local instance of git caching and not contacting the remote all
the time introduces a similar delay on your end which is (1) not
a big deal, and (2) unlike the remote mirror delay, controllable
on your end.  For example, you could force it to update the
cache by "git download $remote; git branch --list $remote".

* Unified "fetch" and "push" across backends.

I was rediscovering git-cvsimport today and wished I could just
have said (syntax aside):

	URL: cvs;/my.re.po/.cvsroot
	Pull: HEAD:remotes/cvs/master
	Pull: experiment:remotes/cvs/experiment

to cause "git fetch" to run git-cvsimport to update the remotes/cvs/
branches (and "git pull" to merge CVS changes to my branches).
The same thing should be possible for SVN and other foreign SCM
backends.

Also, it should be possible to use git-cvsexportcommit as a
backend for "git push" into the CVS repository.

That's it for tonight...


* Re: If I were redoing git from scratch...
From: Jakub Narebski @ 2006-11-04 12:21 UTC
  To: git

Junio C Hamano wrote:

> * Core data structure
[...]
> The biggest one is that we use too many static (worse, function
> scope static) variables that live for the life of the process,
> which makes many things very nice and easy ("run-once and let
> exit clean up the mess" mentality), but because of this it
> becomes awkward to do certain things.  Examples are:

One example of this, which has only recently been fixed, is
for_each_ref forcing callers to use static variables to store
the gathered data, instead of passing the data through one of
the arguments.
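
The fixed interface threads the caller's state through a cb_data
pointer, roughly like this (a sketch, modulo the exact
signatures):

	struct ref_count {
		int heads, tags;
	};

	static int count_ref(const char *refname, const unsigned char *sha1,
			     int flags, void *cb_data)
	{
		struct ref_count *count = cb_data;	/* no statics needed */

		if (!strncmp(refname, "refs/heads/", 11))
			count->heads++;
		else if (!strncmp(refname, "refs/tags/", 10))
			count->tags++;
		return 0;
	}

	/*
	 * struct ref_count count = { 0, 0 };
	 * for_each_ref(count_ref, &count);
	 */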

> * Fetch/Push/Pull/Merge confusion
> 
> Everybody hates the fact that the inverse of push is fetch, not
> pull, and that merge is not a usual Porcelain (while it _is_
> usable as a regular UI command, it was originally done as a
> lower-layer helper to the "pull" Porcelain, and it has a strange
> parameter order with a seemingly useless HEAD parameter in the
> middle).
> 
> If I were doing git from scratch, I would probably avoid any of
> the above words that have loaded meanings from other SCMs.

I'm a bit used to "push", "fetch" and "pull".  I consider "pull"
a bit of an artifact from the times of the one-branch-per-
repository layout -- witness the fact that "pull" fetches _all_
the branches but [usually] merges only one into the _current_
branch (unless you configure it otherwise).

I'll leave "push" as is, leave "fetch" as is, and make "pull" to be
"fetch" by default unless you use "--merge[=<branch>]" option.
I'd rename "merge" to "merge-driver" and make new "merge" thanks
to new users wouldn't have to learn to use "git pull . branchA"
to merge current branch with branchA. 
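
Day-to-day use would then look like this (hypothetical, of
course):

	$ git pull                    # fetch only, the new default
	$ git pull --merge=branchA    # fetch, then merge branchA
	$ git merge branchA           # instead of "git pull . branchA"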

Perhaps I would also make it possible to specify a remote branch
a la cogito, <URL>#<branch>, to pull a remote branch without a
tracking branch, and for symmetry have a "--pull[=<repo>]" or
"--fetch[=<repo>]" option.

> * Unified "fetch" and "push" across backends.
>

Very nice idea, but one must remember the limitations of the
import/export tools and of course the limitations of the other
SCMs... well, and also the limitations of Git, if there are any
compared to other SCMs ;-)

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git



* Re: If I were redoing git from scratch...
From: Linus Torvalds @ 2006-11-04 16:44 UTC
  To: Junio C Hamano; +Cc: git



On Sat, 4 Nov 2006, Junio C Hamano wrote:
> 
> The biggest one is that we use too many static (worse, function
> scope static) variables that live for the life of the process,
> which makes many things very nice and easy ("run-once and let
> exit clean up the mess" mentality), but because of this it
> becomes awkward to do certain things.  Examples are:
> 
>  - Multiple invocations of the merge-base computation (each needs
>    to clear the marks left on commit objects by the earlier
>    traversal),

Well, quite frankly, I dare anybody to do it differently, yet have good 
performance with millions of objects.

The fact is, I don't think it _can_ be done. I would seriously suggest 
revisiting this in five years, just because CPUs and memory will by then 
hopefully have gotten an order of magnitude faster/bigger.

The thing is, the object database, when we read it in, really needs to be 
pretty compact, and we need to remember objects we've seen earlier 
(exactly _because_ we depend on the flags). So there are exactly two 
alternatives:
 - global life-time allocations of objects like we do now
 - magic memory management with unknown lifetimes and keeping track of all 
   pointers.

And I'd like to point out that doing real memory management is simply not 
realistic right now:

 - it's too damn hard. A simple garbage collector based on the approach we 
   have now would simply not be able to do anything, since all objects are 
   _by_definition_ reachable from the hash chains, so there's nothing to 
   collect. The lifetime of an object fundamentally _is_ the whole process 
   lifetime, exactly because we expect the objects (and the object flags 
   in particular) to be meaningful.

 - pretty much all garbage collection schemes tend to have a memory 
   footprint that is about twice what a static footprint is under any 
   normal load. Think about what we already do with "git pack-objects" for 
   something like the mozilla repository: I worked quite a lot on getting 
   the memory footprint down, and it's _still_ several hundred MB. 

In other words, I can pretty much guarantee that some kind of "smarter" 
memory management would be a huge step backwards. Yes, we now have to do 
some things explicitly, but exactly because we do them explicitly we can 
_afford_ to have the stupid and simple and VERY EFFICIENT memory 
management ("lack of memory management") that we have now.

The memory use of git had a very real correlation with performance when I 
was doing the memory shrinking a few months back (in June). I realize 
that it's perhaps awkward, but I really want people to realize that it's 
a huge performance issue. It was a clear performance issue for me (and I 
use machines with 2GB of RAM, so I was never swapping); it would be an 
even bigger one for anybody where the size meant that you needed to start 
paging.

So I would seriously ask you not to even consider changing the object 
model. Maybe add a few more helper routines to clear all object flags or 
something, but the "every object is global and will never be de-allocated" 
is really a major deal.
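
Such a helper could be as dumb as walking obj_hash once (a sketch; 
no such routine exists today, and since obj_hash/obj_hash_size are 
object.c file statics, it would have to live there):

	void clear_object_flags(unsigned flags)
	{
		int i;

		for (i = 0; i < obj_hash_size; i++) {
			struct object *obj = obj_hash[i];
			if (obj)
				obj->flags &= ~flags;
		}
	}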

Five years from now, or for somebody who re-implements git in Java (where 
performance isn't going to be the major issue anyway, and where you 
probably do "small" things like "commit" and "diff", and never 
full-database things like "git repack"), _then_ you can happily look at 
having something fancier. Right now, it's too easy to look at the 
cumbersome interfaces and forget that those interfaces are sometimes what 
allows us to practically do some things in the first place.



* Re: If I were redoing git from scratch...
From: Shawn Pearce @ 2006-11-04 19:16 UTC
  To: Linus Torvalds; +Cc: Junio C Hamano, git

Linus Torvalds <torvalds@osdl.org> wrote:
> or for somebody who re-implements git in Java (where performance isn't 
> going to be the major issue anyway, and where you probably do "small" 
> things like "commit" and "diff", and never full-database things like 
> "git repack"), _then_ you can happily look at having something fancier. 
> Right now, it's too easy to look at the cumbersome interfaces and forget 
> that those interfaces are sometimes what allows us to practically do 
> some things in the first place.

Yes and no.  :-)

As the only person here who has hacked on some of Git and also
reimplemented the core on-disk data structures in Java, I can say
I mostly agree with Linus.

Abstractions like the repository (to allow different GIT_DIRs to
be used in the same process) aren't really a big deal and do not
have a large impact on performance.  They could be implemented in
the current C core.

But trying to represent an object abstractly in Java the same
way that it is represented in Git costs a huge amount of memory.
Java imposes at least 16 bytes of overhead per object, before you
get to store anything in it.  Translation: Linus is right; doing
a real implementation of "git repack" in Java is nuts.  It would
barely be able to handle git.git, let alone linux.git or
mozilla.git.


* Re: If I were redoing git from scratch...
From: Robin Rosenberg @ 2006-11-04 22:29 UTC
  To: Shawn Pearce; +Cc: Linus Torvalds, Junio C Hamano, git

On Saturday, 4 November 2006 20:16, Shawn Pearce wrote:
> But trying to abstractly represent an object in Java the same
> way that it is represented in Git costs a huge amount of memory.
> Java is at least 16 bytes of overhead per object, before you get to
> store anything in it.  

The overhead is eight bytes per object on a 32-bit platform and 16 on a
64-bit platform for Sun's JDK.  The allocation granularity is eight bytes
in both cases.



* Re: If I were redoing git from scratch...
From: Linus Torvalds @ 2006-11-04 22:44 UTC
  To: Robin Rosenberg; +Cc: Shawn Pearce, Junio C Hamano, git



On Sat, 4 Nov 2006, Robin Rosenberg wrote:
> 
> The overhead is eight bytes per object on a 32-bit platform and 16 on a
> 64-bit platform for Sun's JDK.  The allocation granularity is eight bytes
> in both cases.

Note that the object structure itself is just 24 bytes (regardless of 
whether we're on a 32-bit or 64-bit architecture).

In addition to that, we need one pointer per hash entry, and in order to 
keep the hash list size down we need that hash array to be about 25% free, 
so say 1.5 pointers per object: ~6 bytes or ~12 bytes depending on whether 
it's a 32- or 64-bit architecture.

So 8 or 16 bytes overhead per object is roughly a 25% or 50% pure 
overhead. In contrast, right now we have pretty much _zero_ overhead (we 
do end up having some malloc cost, but since we batch them, it's on the 
order of a few bytes per _thousand_ objects, so we're talking fractions of 
a percent).
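
To spell that arithmetic out: each object costs us the 24-byte 
struct plus ~6 or ~12 bytes of hash table, i.e. ~30 or ~36 bytes 
total, so a JVM-style 8- or 16-byte object header on top of that 
is 8/30 or 16/36 -- roughly the 25% or 50% above.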

Of course, the memory footprint isn't _just_ the objects, but it's a big 
part of it, for some important apps (not just repack, but git-send-pack 
obviously also has this). So on a git server, keeping the memory use down 
is likely the #1 concern - even if the server might have gigabytes of RAM.



* Re: If I were redoing git from scratch...
From: Linus Torvalds @ 2006-11-04 23:15 UTC
  To: Robin Rosenberg; +Cc: Shawn Pearce, Junio C Hamano, git



On Sat, 4 Nov 2006, Linus Torvalds wrote:
> 
> In addition to that, we need one pointer per hash entry, and in order to 
> keep the hash list size down we need that hash array to be about 25% free, 
> so say 1.5 pointers per object: ~6 bytes or ~12 bytes depending on whether 
> it's a 32- or 64-bit architecture.

Btw, one of the things I considered (but rejected as being _too_ far out 
for now) during the memory shrinking work was to make both 32-bit and 
64-bit builds use a 32-bit hash table entry.

The way to do that would be, instead of using a pointer, to use a 32-bit 
integer where the low ~10 bits are an index into the allocation buffer 
(since we batch allocations), and the rest of the bits say which 
batch-buffer it is in.

That's exactly because 8 bytes per hash entry is right now a big part of 
the object memory allocation overhead on 64-bit architectures, and 
cutting it down to just 4 bytes would help save memory.
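
In code, the lookup would have been something like this (a sketch; 
the names and the exact bit split are made up to match the 
description above):

	#define BATCH_BITS 10
	#define BATCH_SIZE (1 << BATCH_BITS)	/* objects per allocation batch */

	static struct object **batch_base;	/* base of each batch buffer */

	static inline struct object *obj_from_handle(unsigned int handle)
	{
		/* high bits pick the batch buffer, low ~10 bits the slot */
		return batch_base[handle >> BATCH_BITS] +
			(handle & (BATCH_SIZE - 1));
	}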

I never got around to it, if only because I actually just compile my 
user-land git stuff as 32-bit, even on my ppc64 system. And partly because 
I had shrunk the object allocations enough that I just felt pretty happy 
with it anyway, and the change would have been pretty painful. But on 
64-bit architectures, the hash table right now is about a third of the 
whole memory overhead of the object database, and cutting it down by half 
would actually be noticeable.


