* If I were redoing git from scratch...
@ 2006-11-04 11:34 Junio C Hamano
From: Junio C Hamano @ 2006-11-04 11:34 UTC (permalink / raw)
To: git
I've been thinking about these in the back of my mind for a
while, and thought it might be better to start writing them down.
A lot of the issues involve UI, which means they will not
materialize without breaking existing uses, but if we know in
advance what we will be aiming for, maybe we will find a smoother
path to get there.
* Core data structure
I consider the on-disk data structures and the on-wire protocol
we currently use sane, and there is not much to fix. There are
certainly things to be enhanced (64-bit .idx offsets, for
example), but I do not think there is anything fundamentally
broken that needs to be reworked.
I have the same feeling about the in-core data structures in
general, except for a few issues.
The biggest one is that we use too many static (worse,
function-scope static) variables that live for the life of the
process. That makes many things very nice and easy ("run once
and let exit clean up the mess"), but because of it certain
things become awkward to do. Examples are:
- Multiple invocations of merge-base (each needs to clear the
marks left on commit objects by the earlier traversal),
- Creating a new pack and immediately starting to use it inside
the process itself (prepare_packed_git() is call-once, and we
have hacks in many places to cause it to re-read the packs),
- Visiting more than one repository within one process
(many per-repository variables in sha1_file.c are static
variables, and there is no "struct repository" that we can
re-initialize in one go),
- The object layer holds onto all parsed objects
indefinitely. Because the object store philosophically
represents the global commit ancestry DAG, there is no
inherent reason to have more than one instance of
object.c::obj_hash even if we visit more than one
repository in a process; but if the two repositories are
unrelated, the objects from the repository we were looking
at earlier only waste memory after we switch to a different
repository.
- The diffcore is not run-once, but it is run-one-at-a-time.
This is easy to fix if needed, though. (A sketch of what
fixing the per-repository statics could look like follows
this list.)
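As an illustration of the direction the list above implies, here is a
minimal sketch of a "struct repository" that gathers the per-repository
statics so they can be re-initialized more than once per process. No
such structure exists in the current code; the field names are
assumptions, not sha1_file.c's actual interface.

/*
 * Sketch only: bundle the per-repository statics into one structure
 * that can be set up and torn down repeatedly within a process.
 */
struct packed_git;	/* one per pack file, as in sha1_file.c */
struct object;		/* parsed objects, as in object.c */

struct repository {
	char *gitdir;			/* the $GIT_DIR being visited */
	struct packed_git *packs;	/* replaces the static pack list */
	struct object **obj_hash;	/* replaces object.c::obj_hash */
	int obj_hash_size;
	unsigned packs_loaded : 1;	/* replaces prepare_packed_git()'s
					 * call-once flag; clearing it lets
					 * a freshly created pack be seen */
};

void repo_init(struct repository *r, const char *gitdir);
void repo_clear(struct repository *r);	/* frees objects, closes packs */

With something like this, visiting a second repository or picking up a
just-written pack becomes a matter of clearing or re-initializing one
structure instead of poking at scattered statics.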
There are some other minor details but they are not as
fundamental. Examples are:
- The revision traversal is nicely done, but one gripe I have is
that it is focused on painting commits into two (and only
two) classes: interesting and uninteresting. If we allowed
more than two (especially, an arbitrary number of) kinds of
"interesting", answering questions like "which branches does
this commit belong to?" and "which tagged versions already
include this commit?" would become easier and more efficient.
show-branch has machinery to do that for a handful of heads,
but it could be unified with the revision.c traversal
machinery (a sketch follows this list).
- We have at least three independent implementations of
pathspec-matching logic, with two different semantics (one is
component-prefix match, the other is shell glob), and they
should be unified. You can say "git grep foo -- 't/t5*'" but
not "git diff otherbranch -- 't/t5*'" (the sketch after this
list contrasts the two semantics).
* Fetch/Push/Pull/Merge confusion
Everybody hates the fact that the inverse of push is fetch, not
pull, and that merge is not a usual Porcelain (while it _is_
usable as a regular UI command, it was originally done as a
lower-layer helper for the "pull" Porcelain and has a strange
parameter order, with a seemingly useless HEAD parameter in the
middle).
If I were doing git from scratch, I would probably avoid any of
the above words that have loaded meanings from other SCMs.
Perhaps...
- "git download" would download changes made in the other end
since we contacted them the last time and would not touch our
branches nor working tree (associate the word with getting
tarballs -- people would not expect the act of downloading a
tarball would touch their working tree nor local history.
untarring it does). It is a different story if the end-user
should be required to explicitly say "download"; I am leaning
towards making it more or less transparent.
- "git upload" to upload our changes to the other end -- that
is what "git push" currently does.
- "git join" to merge another branch into the current branch,
with the "per branch configuration" frills to decide what the
default for "another branch" is based on what the current
branch is, etc.
* Less visible "remoteness" of remote branches
If I were doing git from scratch, I would probably have made
separate remotes _the_ only layout, except that I might have
opted to make "remotes" even less visible, treating it as merely
a cache of "the branch tips and tags we saw the last time we
connected over the network to look at them".
So "git branch --list $remote" might contact the remote over the
network or use cached version. When you think about, it it is
not all that different from always contacting the remote end --
the remote end may have mirror propagation delays, and your
local instance of git caching and not contacting the remote all
the time introduces a similar delay on your end which is (1) not
a big deal, and (2) unlike the remote mirror delay, controllable
on your end. For example, you could force it to update the
cache by "git download $remote; git branch --list $remote".
* Unified "fetch" and "push" across backends.
I was rediscovering git-cvsimport today and wished I could just
have said (syntax aside):
URL: cvs;/my.re.po/.cvsroot
Pull: HEAD:remotes/cvs/master
Pull: experiment:remotes/cvs/experiment
to cause "git fetch" to run git-cvsimport to update the
remotes/cvs/ branches (and "git pull" to merge the CVS changes
into my branches).
The same thing should be possible for SVN and other foreign SCM
backends.
Also, it should be possible to use git-cvsexportcommit as a
backend for "git push" into the CVS repository (a hypothetical
dispatch is sketched below).
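To make the idea concrete, here is a hypothetical sketch of the
dispatch. None of these helpers exist; the table and the "scheme;path"
parsing merely mirror the "cvs;/my.re.po/.cvsroot" syntax used above.

#include <stddef.h>
#include <string.h>

struct backend {
	const char *scheme;	/* "cvs", "svn", ... */
	const char *fetch_tool;	/* run to update remotes/<name>/... */
	const char *push_tool;	/* run to export our commits back */
};

static const struct backend backends[] = {
	{ "cvs", "git-cvsimport", "git-cvsexportcommit" },
	{ "svn", "git-svnimport", NULL },	/* no exporter assumed */
};

static const struct backend *lookup_backend(const char *url)
{
	size_t i;

	for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
		size_t n = strlen(backends[i].scheme);
		if (!strncmp(url, backends[i].scheme, n) && url[n] == ';')
			return &backends[i];
	}
	return NULL;	/* no match: use the native git transport */
}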
That's it for tonight...
* Re: If I were redoing git from scratch...
From: Jakub Narebski @ 2006-11-04 12:21 UTC (permalink / raw)
To: git
Junio C Hamano wrote:
> * Core data structure
[...]
> The biggest one is that we use too many static (worse,
> function-scope static) variables that live for the life of the
> process. That makes many things very nice and easy ("run once
> and let exit clean up the mess"), but because of it certain
> things become awkward to do. Examples are:
One example that has been fixed only recently is for_each_ref
forcing callers to use static variables to store the gathered
data, instead of passing the data through one of the arguments
(see the sketch below).
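For reference, the difference looks roughly like this; the signatures
are simplified from memory, so treat the exact parameter lists as
assumptions rather than the precise refs.h interface.

/* without callback data: results can only travel through statics */
static int ref_count;

static int count_ref(const char *refname, const unsigned char *sha1)
{
	ref_count++;
	return 0;
}

/* with callback data: caller state travels through an opaque pointer */
static int count_ref_cb(const char *refname, const unsigned char *sha1,
			void *cb_data)
{
	(*(int *)cb_data)++;
	return 0;
}

A caller can then keep its state on the stack, e.g. "int n = 0;
for_each_ref(count_ref_cb, &n);", instead of in a file-scope static.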
> * Fetch/Push/Pull/Merge confusion
>
> Everybody hates the fact that the inverse of push is fetch, not
> pull, and that merge is not a usual Porcelain (while it _is_
> usable as a regular UI command, it was originally done as a
> lower-layer helper for the "pull" Porcelain and has a strange
> parameter order, with a seemingly useless HEAD parameter in the
> middle).
>
> If I were doing git from scratch, I would probably avoid any of
> the above words that have loaded meanings from other SCMs.
I'm fairly used to "push", "fetch" and "pull". I consider "pull"
a bit of an artifact from the times of the one-branch-per-repository
layout -- witness the fact that "pull" fetches _all_ the branches
but [usually] merges only one into the _current_ branch (unless
you configure it otherwise).
I'd leave "push" as is, leave "fetch" as is, and make "pull"
behave like "fetch" by default unless you use the
"--merge[=<branch>]" option. I'd rename "merge" to "merge-driver"
and make a new "merge", so that new users wouldn't have to learn
to use "git pull . branchA" to merge branchA into the current
branch. Perhaps I would also make it possible to specify a remote
branch a la Cogito, <URL>#<branch>, to pull a remote branch
without a tracking branch, and for symmetry have a
"--pull[=<repo>]" or "--fetch[=<repo>]" option.
> * Unified "fetch" and "push" across backends.
>
Very nice idea, but one must remember the limitations of the
import/export tools and of course the limitations of the other
SCMs... well, and also the limitations of Git, if there are any
compared to other SCMs ;-)
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: If I were redoing git from scratch...
From: Linus Torvalds @ 2006-11-04 16:44 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sat, 4 Nov 2006, Junio C Hamano wrote:
>
> The biggest one is that we use too many static (worse,
> function-scope static) variables that live for the life of the
> process. That makes many things very nice and easy ("run once
> and let exit clean up the mess"), but because of it certain
> things become awkward to do. Examples are:
>
> - Multiple invocations of merge-base (each needs to clear the
> marks left on commit objects by the earlier traversal),
Well, quite frankly, I dare anybody to do it differently and
still have good performance with millions of objects.
The fact is, I don't think it _can_ be done. I would seriously
suggest re-visiting this in five years, just because CPUs and
memory will hopefully have gotten an order of magnitude
faster/bigger by then.
The thing is, the object database really needs to be pretty
compact-sized when we read it in, and we need to remember the
objects we've seen earlier (exactly _because_ we depend on the
flags). So there are exactly two alternatives:
- global-lifetime allocations of objects, like we do now
- magic memory management with unknown lifetimes, keeping track
of all the pointers.
And I'd like to point out that smarter memory management is
simply not realistic right now:
- it's too damn hard. A simple garbage collector based on the
approach we have now would simply not be able to do anything,
since all objects are _by_definition_ reachable from the hash
chains, so there's nothing to collect. The lifetime of an object
fundamentally _is_ the whole process lifetime, exactly because
we expect the objects (and the object flags in particular) to
stay meaningful.
- pretty much all garbage collection schemes tend to have a memory
footprint that is about twice what a static footprint is under any
normal load. Think about what we already do with "git pack-objects" for
something like the mozilla repository: I worked quite a lot on getting
the memory footprint down, and it's _still_ several hundred MB.
In other words, I can pretty much guarantee that some kind of "smarter"
memory management would be a huge step backwards. Yes, we now have to do
some things explicitly, but exactly because we do them explicitly we can
_afford_ to have the stupid and simple and VERY EFFICIENT memory
management ("lack of memory management") that we have now.
The memory use of git had a very real correlation with
performance when I was doing the memory shrinking a few months
back (in June). I realize that it's perhaps awkward, but I really
want people to realize that it's a huge performance issue. It was
a clear performance issue for me (and I use machines with 2GB of
RAM, so I was never swapping); it would be an even bigger one for
anybody whose working set meant they needed to start paging.
So I would seriously ask you not to even consider changing the
object model. Maybe add a few more helper routines to clear all
object flags or something (a sketch follows below), but the
"every object is global and will never be de-allocated" model is
really a major deal.
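A sketch of what that kind of helper routine could be: walk the global
object hash once and clear a set of flag bits, so a traversal can be
re-run without freeing anything. The externs are assumptions for
illustration -- object.c actually keeps its obj_hash static.

struct object {
	unsigned flags;
	/* ... type, sha1, and so on, as in object.h ... */
};

extern struct object **obj_hash;	/* assumed visible, as in object.c */
extern int obj_hash_size;

void clear_object_flags(unsigned to_clear)
{
	int i;

	for (i = 0; i < obj_hash_size; i++) {
		struct object *obj = obj_hash[i];
		if (obj)
			obj->flags &= ~to_clear;
	}
}

This keeps the "no de-allocation" model intact; only the per-traversal
marks are reset.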
Five years from now, or for somebody who re-implements git in
Java (where performance isn't going to be the major issue anyway,
and you probably do "small" things like "commit" and "diff", and
never full-database things like "git repack"), _then_ you can
happily look at having something fancier. Right now, it's too easy
to look at the cumbersome interfaces and forget that those
interfaces are sometimes what allows us to do some things
practically in the first place.
* Re: If I were redoing git from scratch...
From: Shawn Pearce @ 2006-11-04 19:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git
Linus Torvalds <torvalds@osdl.org> wrote:
> or for somebody who re-implements git in Java (where
> performance isn't going to be the major issue anyway, and you
> probably do "small" things like "commit" and "diff", and never
> full-database things like "git repack"), _then_ you can happily
> look at having something fancier. Right now, it's too easy to
> look at the cumbersome interfaces and forget that those
> interfaces are sometimes what allows us to do some things
> practically in the first place.
Yes and no. :-)
As the only person here who has hacked on some of Git and also
reimplemented the core on-disk data structures in Java, I can say
I mostly agree with Linus.
Abstractions like the repository (to allow different GIT_DIRs to
be used in the same process) aren't really a big deal and don't
have a large impact on performance. They could be implemented in
the current C core.
But trying to represent an object abstractly in Java, the same
way it is represented in Git, costs a huge amount of memory.
Java has at least 16 bytes of overhead per object, before you get
to store anything in it. Translation: Linus is right; doing a
real implementation of "git repack" in Java is nuts. It would
barely be able to handle git.git, let alone linux.git or
mozilla.git.
* Re: If I were redoing git from scratch...
From: Robin Rosenberg @ 2006-11-04 22:29 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Linus Torvalds, Junio C Hamano, git
On Saturday, 04 November 2006 20:16, Shawn Pearce wrote:
> But trying to represent an object abstractly in Java, the same
> way it is represented in Git, costs a huge amount of memory.
> Java has at least 16 bytes of overhead per object, before you
> get to store anything in it.
The overhead is eight bytes per object on a 32-bit platform and
16 on a 64-bit platform for Sun's JDK. The granularity is eight
bytes in both cases.
* Re: If I were redoing git from scratch...
From: Linus Torvalds @ 2006-11-04 22:44 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Shawn Pearce, Junio C Hamano, git
On Sat, 4 Nov 2006, Robin Rosenberg wrote:
>
> The overhead is eight bytes per object on a 32-bit platform and
> 16 on a 64-bit platform for Sun's JDK. The granularity is eight
> bytes in both cases.
Note that the object structure itself is just 24 bytes (regardless of
whether we're on a 32-bit or 64-bit architecture).
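For reference, this is roughly the layout behind that 24-byte figure,
sketched from memory of object.h of this era (treat the exact bit
widths as assumptions). Note that there are no pointers inside, which
is why the size is the same on 32-bit and 64-bit machines.

struct object {
	unsigned parsed : 1;
	unsigned used : 1;
	unsigned type : 3;
	unsigned flags : 27;	/* 1+1+3+27 = 32 bits of flag data ... */
	unsigned char sha1[20];	/* ... plus 20 bytes of SHA-1 = 24 bytes */
};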
In addition to that, we need one pointer per hash entry, and in order to
keep the hash list size down we need that hash array to be about 25% free,
so say 1.5 pointers per object: ~6 bytes or ~12 bytes depending on whether
it's a 32- or 64-bit architecture.
So 8 or 16 bytes overhead per object is roughly a 25% or 50% pure
overhead. In contrast, right now we have pretty much _zero_ overhead (we
do end up having some malloc cost, but since we batch them, it's on the
order of a few bytes per _thousand_ objects, so we're talking fractions of
a percent).
Of course, the memory footprint isn't _just_ the objects, but it's a big
part of it, for some important apps (not just repack, but git-send-pack
obviously also has this). So on a git server, keeping the memory use down
is likely the #1 concern - even if the server might have gigabytes of RAM.
* Re: If I were redoing git from scratch...
From: Linus Torvalds @ 2006-11-04 23:15 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Shawn Pearce, Junio C Hamano, git
On Sat, 4 Nov 2006, Linus Torvalds wrote:
>
> In addition to that, we need one pointer per hash entry, and in order to
> keep the hash list size down we need that hash array to be about 25% free,
> so say 1.5 pointers per object: ~6 bytes or ~12 bytes depending on whether
> it's a 32- or 64-bit architecture.
Btw, one of the things I considered (but rejected as being _too_
far out for now) during the memory-shrinking work was to make
both 32-bit and 64-bit builds use a 32-bit hash table entry.
The way to do that would be to use, instead of a pointer, a
32-bit integer where the low ~10 bits are an index into the
allocation buffer (since we batch allocations) and the rest of
the bits are an index saying which batch buffer it is.
That is exactly because the 8 bytes per hash entry is right now a
big part of the object memory-allocation overhead on 64-bit
architectures, and cutting it down to just 4 bytes would help
save memory (a sketch of the encoding follows).
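As a sketch of that encoding -- the 10/22 bit split and the names are
assumptions built from the description above, not anything that was
ever in git:

#define SLOT_BITS 10	/* the "low ~10 bits" mentioned above */
#define SLOT_MASK ((1u << SLOT_BITS) - 1)

struct object { unsigned flags; unsigned char sha1[20]; };	/* stand-in */

extern struct object *batch_buffer[];	/* one per batched allocation */

static inline unsigned obj_to_handle(unsigned batch, unsigned slot)
{
	return (batch << SLOT_BITS) | slot;
}

static inline struct object *handle_to_obj(unsigned handle)
{
	return batch_buffer[handle >> SLOT_BITS] + (handle & SLOT_MASK);
}

The hash table then stores 4-byte handles instead of 8-byte pointers,
halving its size on 64-bit machines.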
I never got around to it, partly because I actually just compile
my user-land git stuff as 32-bit, even on my ppc64 system, and
partly because I had shrunk the object allocations enough that I
felt pretty happy with them anyway, and the change would have
been pretty painful. But on 64-bit architectures the hash table
right now is about a third of the whole memory overhead of the
object database, and cutting it in half would actually be
noticeable.