A shortcoming of the git repo format

* A shortcoming of the git repo format
@ 2005-04-27  5:43 H. Peter Anvin
  2005-04-27 15:00 ` C. Scott Ananian
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-27  5:43 UTC (permalink / raw)
  To: Git Mailing List

Most of git's files are starting to converge toward an RFC822-like 
header with (tag, data) and a free-form section.  This is a good thing.
However, there is one problem with this, and that is that without
knowing every possible tag, a program reading the git repository cannot 
safely tell what is a link to another git object and what is not.  When 
I did my repository conversion tools, I simply assumed any string of 20 
hexadecimal digits was a pointer, but this is probably a bad idea in the 
long run.

Additionally, there is the question of the handling of strings that may 
contain \n or even \0 (which may be necessary for some applications).

One solution to all of this would be to define a quoting standard for 
strings, and simply require that all free-format strings (like the 
author fields) or at least strings that match [0-9a-f]{20}, are always 
quoted.

I propose the following:

- Any string containing control characters or \ must be quoted;
- \xXX produces control characters; other characters following \ are 
verbatim.

Thus,

link 0123456789abcdef0123

... is a link to an object, whereas ...

string \0123456789abcdef0123

... is a string.

string1  This string begins with a space
string2 This string has an embedded newline ("\x0a")

... are both valid strings; the first contains a leading space and the 
second an embedded newline.
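
A minimal sketch of the quoting rule proposed above, assuming that
"control characters" means code points below 0x20 plus 0x7f, and that a
literal backslash is written as two backslashes. The function names
quote and unquote are hypothetical and are not part of git.

```python
import re

_HEX2 = re.compile(r"[0-9a-fA-F]{2}")

def quote(s: str) -> str:
    """Escape control characters as \\xXX and a backslash as \\\\."""
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("\\x%02x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

def unquote(s: str) -> str:
    """Undo quote(): \\xXX yields a character, any other \\c yields c verbatim."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch != "\\":
            out.append(ch)
            i += 1
        elif s[i + 1:i + 2] == "x" and _HEX2.fullmatch(s[i + 2:i + 4]):
            out.append(chr(int(s[i + 2:i + 4], 16)))
            i += 4
        elif i + 1 < len(s):
            out.append(s[i + 1])  # e.g. "\0123..." is the plain string "0123..."
            i += 2
        else:
            raise ValueError("dangling backslash at end of string")
    return "".join(out)
```

Under these assumptions a value that merely looks like an object name
can be marked as a string by a leading backslash, exactly as in the
"string \0123456789abcdef0123" example.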

I'll implement this and integrate it tomorrow.

	-hpa

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: A shortcoming of the git repo format
  2005-04-27  5:43 A shortcoming of the git repo format H. Peter Anvin
@ 2005-04-27 15:00 ` C. Scott Ananian
  2005-04-27 15:22 ` Linus Torvalds
  2005-04-27 20:58 ` Gerhard Schrenk
  2 siblings, 0 replies; 29+ messages in thread
From: C. Scott Ananian @ 2005-04-27 15:00 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Git Mailing List

On Tue, 26 Apr 2005, H. Peter Anvin wrote:

> Additionally, there is the question of the handling of strings that may 
> contain \n or even \0 (which may be necessary for some applications).

While we're at it, I'll just mention that '\0' is a rather bad delimiter 
for zlib-compressed files; it usually ends up enlarging the file by three 
or more bytes compared to using any whitespace character.  The reason is 
obvious: \0 isn't actually used anywhere else in the compressed contents, 
so it tends to pollute zlib's dictionary.

It's probably too late to do anything about this, but hey.
  --scott

Soviet  STANDEL Yakima JMTRAX Hussein Ft. Meade algorithm JMBLUG CIA 
SEQUIN Bejing Morwenstow Boston nuclear Sigint Ft. Bragg ZRBRIEF Peking
                          ( http://cscott.net/ )


* Re: A shortcoming of the git repo format
  2005-04-27  5:43 A shortcoming of the git repo format H. Peter Anvin
  2005-04-27 15:00 ` C. Scott Ananian
@ 2005-04-27 15:22 ` Linus Torvalds
  2005-04-27 18:03   ` H. Peter Anvin
  2005-04-27 20:58 ` Gerhard Schrenk
  2 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2005-04-27 15:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Git Mailing List



On Tue, 26 Apr 2005, H. Peter Anvin wrote:
> 
> One solution to all of this would be to define a quoting standard for 
> strings, and simply require that all free-format strings (like the 
> author fields) or at least strings that match [0-9a-f]{20}, are always 
> quoted.

git uses more of the ".newsrc" format, in that it just knows which 
characters are legal or not.

To find the email address, look for the first '<'. To find the date, look 
for the first '>'. Those characters are not allowed in the name or the 
email, so they act as well-defined delimiters.
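
The delimiter rule can be sketched as follows. This is an illustration
only: split_ident is a hypothetical helper, and the sample identity used
in the usage line is made up. The date part is returned as an opaque
string because its format is not described in this message.

```python
def split_ident(line: str):
    """Split 'Name <email> date' using the first '<' and the first '>'."""
    lt = line.index("<")      # raises ValueError if there is no '<'
    gt = line.index(">", lt)  # first '>' after the '<'
    name = line[:lt].rstrip()
    email = line[lt + 1:gt]
    date = line[gt + 1:].lstrip()
    return name, email, date

name, email, date = split_ident("A U Thor <author@example.com> some date")
```

Because '<' and '>' may not appear in the name or the email, no escaping
or lookahead is needed.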

		Linus


* Re: A shortcoming of the git repo format
  2005-04-27 15:22 ` Linus Torvalds
@ 2005-04-27 18:03   ` H. Peter Anvin
  2005-04-27 18:32     ` Dave Jones
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-27 18:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> On Tue, 26 Apr 2005, H. Peter Anvin wrote:
> 
>>One solution to all of this would be to define a quoting standard for 
>>strings, and simply require that all free-format strings (like the 
>>author fields) or at least strings that match [0-9a-f]{20}, are always 
>>quoted.
> 
> 
> git uses more of the ".newsrc" format, in that it just knows which 
> characters are legal or not.
> 
> To find the email address, look for the first '<'. To find the date, look 
> for the first '>'. Those characters are not allowed in the name or the 
> email, so they act as well-defined delimiters.
> 

That's true for email addresses, but the point was to distinguish links 
to other git objects from any other kind of text.  Currently there is no 
such delimiter for that.  Another solution than the one I posted would 
be to define such a delimiter, for example '<' + 20 hex character + '>' 
(which would be distinguished from email addresses by the lack of an @ 
sign.)  That would be a repo change, though.

Given no prior constraints, I would probably argue for a format which 
makes the data type known as a matter of syntax, using "..." quoted 
strings for *ALL* arbitrary strings, a different syntax for numbers and 
links, and leaving the door open for new data types like lists in the 
future.

	-hpa


* Re: A shortcoming of the git repo format
  2005-04-27 18:03   ` H. Peter Anvin
@ 2005-04-27 18:32     ` Dave Jones
  2005-04-27 18:47       ` H. Peter Anvin
                         ` (2 more replies)
  2005-04-27 19:11     ` Linus Torvalds
  2005-04-28 13:39     ` David Woodhouse
  2 siblings, 3 replies; 29+ messages in thread
From: Dave Jones @ 2005-04-27 18:32 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, Git Mailing List

On Wed, Apr 27, 2005 at 11:03:26AM -0700, H. Peter Anvin wrote:
 > Linus Torvalds wrote:
 > >
 > >On Tue, 26 Apr 2005, H. Peter Anvin wrote:
 > >
 > >>One solution to all of this would be to define a quoting standard for 
 > >>strings, and simply require that all free-format strings (like the 
 > >>author fields) or at least strings that match [0-9a-f]{20}, are always 
 > >>quoted.
 > >
 > >
 > >git uses more of the ".newsrc" format, in that it just knows which 
 > >characters are legal or not.
 > >
 > >To find the email address, look for the first '<'. To find the date, look 
 > >for the first '>'. Those characters are not allowed in the name or the 
 > >email, so they act as well-defined delimiters.
 > >
 > 
 > That's true for email addresses, but the point was to distinguish links 
 > to other git objects from any other kind of text.  Currently there is no 
 > such delimiter for that.

That actually broke one of my first git scripts when one of the
changelog texts started a line with 'tree '.  I hacked around it
by making my script only grep in the 'head -n4' lines, but this
seems somewhat fragile having to make assumptions that the field
I want to see is in the first 4 lines.

		Dave



* Re: A shortcoming of the git repo format
  2005-04-27 18:32     ` Dave Jones
@ 2005-04-27 18:47       ` H. Peter Anvin
  2005-04-27 22:51         ` Jon Seymour
  2005-04-27 19:15       ` Linus Torvalds
  2005-04-27 19:39       ` Petr Baudis
  2 siblings, 1 reply; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-27 18:47 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Git Mailing List

Dave Jones wrote:
> 
> That actually broke one of my first git scripts when one of the
> changelog texts started a line with 'tree '.  I hacked around it
> by making my script only grep in the 'head -n4' lines, but this
> seems somewhat fragile having to make assumptions that the field
> I want to see is in the first 4 lines.
> 

You have the delimiter for that; there is an empty line between the 
header and the free-form body, as in RFC822.
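
A short sketch of using that empty line, assuming the commit text has
already been obtained as a string (for example from cat-file). The
helpers split_commit and header_values are hypothetical names.

```python
def split_commit(text: str):
    """Return (header_lines, body), split at the first empty line."""
    header, sep, body = text.partition("\n\n")
    if not sep:
        raise ValueError("no empty line between header and body")
    return header.split("\n"), body

def header_values(header_lines, tag: str):
    """All values for one tag, looking only at header lines."""
    prefix = tag + " "
    return [l[len(prefix):] for l in header_lines if l.startswith(prefix)]

text = "tree " + "0" * 40 + "\nauthor A <a@example.com> d\n\ntree walking fix\n"
lines, body = split_commit(text)
trees = header_values(lines, "tree")  # the body line starting with "tree " is ignored
```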

	-hpa


* Re: A shortcoming of the git repo format
  2005-04-27 18:03   ` H. Peter Anvin
  2005-04-27 18:32     ` Dave Jones
@ 2005-04-27 19:11     ` Linus Torvalds
  2005-04-27 19:47       ` The " Brian O'Mahoney
  2005-04-27 20:40       ` A shortcoming of the " H. Peter Anvin
  2005-04-28 13:39     ` David Woodhouse
  2 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2005-04-27 19:11 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Git Mailing List

On Wed, 27 Apr 2005, H. Peter Anvin wrote:
> 
> That's true for email addresses, but the point was to distinguish links 
> to other git objects from any other kind of text.

No, that's definitely _not_ the point.

I repeat: git does not do any free-form parsing AT ALL. The links are in
well-defined places, and you do not ever search for them. And that's 
really very very important.

> Currently there is no  such delimiter for that.

There absolutely is.

For a "commit", the format is

 - first line is exactly 46 bytes: five bytes of "tree ", 40 bytes of hex 
   sha1, and one byte of "\n".

   NOTHING ELSE. Not extra spaces at the end, not extra spaces at the 
   beginning or the middle. It's ASCII, but it's not free-format ASCII.

 - the next <n> (where 'n' can be 0 or more) lines are _exactly_ 48 bytes
   each:  seven bytes of "parent ", 40 bytes of hex sha1, and one byte of 
   "\n".

   NOTHING ELSE.

 - the next lines are "author " and "committer ". They have well-defined
   delimiters for their fields, and no sha1's. The fields cannot contain
   '<', '>' or newlines, since those are the field/line delimiters.

There is no free-format text _anywhere_ that git parses. No room for 
guesses, no room for mistakes, no room for anything half-way questionable.

And fsck actually enforces this. We do _not_ just use "gets()" to read one 
line at a time. We literally verify that the lines are 46/48 bytes long, 
and have the delimiters in the expected places.
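
The fixed-width rules above can be checked mechanically. The following
is a sketch in Python of that kind of check, not the fsck code itself;
check_commit_header is a hypothetical name, and the sketch stops once it
has confirmed that an "author " line follows the parents.

```python
import re

HEX40 = re.compile(rb"[0-9a-f]{40}")

def check_commit_header(buf: bytes):
    """Validate the tree/parent lines; return (tree, parents, rest)."""
    # First line: exactly 46 bytes, "tree " + 40 hex + "\n".
    if buf[:5] != b"tree " or not HEX40.fullmatch(buf[5:45]) or buf[45:46] != b"\n":
        raise ValueError("malformed tree line")
    tree = buf[5:45].decode("ascii")
    pos = 46
    parents = []
    # Zero or more parent lines: exactly 48 bytes each.
    while buf[pos:pos + 7] == b"parent ":
        sha = buf[pos + 7:pos + 47]
        if not HEX40.fullmatch(sha) or buf[pos + 47:pos + 48] != b"\n":
            raise ValueError("malformed parent line")
        parents.append(sha.decode("ascii"))
        pos += 48
    if buf[pos:pos + 7] != b"author ":
        raise ValueError("expected author line")
    return tree, parents, buf[pos:]
```

Note that nothing is searched for: every field is read from a fixed
offset, and anything unexpected is rejected rather than skipped.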

Same goes for "tree" and "tag" objects. They all have fixed-format stuff. 
A "tree" entry is always

	"%o <space> %s" \0 [ 20 bytes of sha1 ]

with "%o" being "mode", and "%s" being "path". We don't guess. 
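
As an illustration of the tree entry layout just described, here is a
sketch of a reader for an already inflated tree body (without the object
header). parse_tree is a hypothetical name; the layout is taken from the
format string above.

```python
def parse_tree(buf: bytes):
    """Return a list of (mode, path, sha1_hex) from a raw tree body."""
    entries = []
    pos = 0
    while pos < len(buf):
        sp = buf.index(b" ", pos)    # end of the octal mode
        nul = buf.index(b"\0", sp)   # end of the path
        mode = int(buf[pos:sp], 8)
        path = buf[sp + 1:nul]
        sha = buf[nul + 1:nul + 21]  # 20 raw bytes, not hex
        if len(sha) != 20:
            raise ValueError("truncated sha1")
        entries.append((mode, path, sha.hex()))
        pos = nul + 21
    return entries
```

The path is terminated by the NUL byte and the sha1 has a fixed length,
so the binary sha1 never needs quoting even if it contains a space or a
NUL.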

And this really is _important_. Exactly because we name things by the SHA1
hash of the contents, we MUST NOT have flexible formats. Having a format
which allows non-canonical representations (extra spaces etc) would mean
that two trees that were identical would depend on how you happened to
format them.

So there's really two issues:
 - we don't guess or parse contents. We have strict rules, and that makes 
   git more reliable. There are no gray areas. There's "right" and there 
   is "wrong", and the right one works, and the wrong one gets flagged as 
   being wrong and the tools refuse to touch it.
 - there is only _one_ right way to do things, and that means that the
   content is well-defined, and thus the SHA1 of the content is
   well-defined.

For example, another rule is that a "tree" object is always sorted by 
the bytes in the filename (not by entry, btw: a directory called "foo" 
will sort as "foo/", even though the _entry_ only shows "foo"). That rule 
not only makes a lot of operations faster, but again, it means that there 
is only _one_ way to represent a tree validly.
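
A small sketch of that ordering rule, assuming entries are given as
(mode, name) pairs with the name as bytes. tree_sort_key is a
hypothetical helper name.

```python
import stat

def tree_sort_key(entry):
    """Sort key: the name bytes, with '/' appended for directories."""
    mode, name = entry
    return name + b"/" if stat.S_ISDIR(mode) else name

entries = [
    (0o40000, b"foo"),       # a directory
    (0o100644, b"foo.c"),
    (0o100644, b"foo-bar"),
]
ordered = sorted(entries, key=tree_sort_key)
# names in order: foo-bar, foo.c, foo  ('-' < '.' < '/' as bytes)
```

Sorting by the bare entry name would instead put "foo" first, which is
why the implied trailing slash matters for a single canonical order.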

IOW, you _cannot_ represent a tree any other way (and I've been too lazy
to check this in fsck, but it's always been my plan), and that is exactly
why we can just compare the hashes of the results - because there is no 
random component of "layout" in the contents.

This really is important. It means that if you get to the same two tree
contents in totally unrelated ways (you unpack a tar-file and encode it in
git, or you have 5 years of git history and check it out), the "tree" will
match _exactly_. There's no history. There's no "optional" stuff. Since
the contents of the trees are the same, the SHA1 of the two trees will be
the same. Exactly because git refuses to touch any free-format stuff.
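
To illustrate why a canonical layout makes the hash depend only on the
contents, here is a sketch that serializes a flat list of entries in
sorted order and hashes the result. tree_id is a hypothetical name, and
the "tree <size>\0" prefix is an assumption based on the object header
described elsewhere in this thread.

```python
import hashlib
import stat

def tree_id(entries):
    """entries: iterable of (mode, name_bytes, sha1_bytes). Returns hex id."""
    def key(e):
        mode, name, _ = e
        return name + b"/" if stat.S_ISDIR(mode) else name
    body = b"".join(
        format(mode, "o").encode("ascii") + b" " + name + b"\0" + sha
        for mode, name, sha in sorted(entries, key=key)
    )
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

a = [(0o100644, b"b.txt", bytes(20)), (0o100644, b"a.txt", bytes(20))]
same = tree_id(a) == tree_id(list(reversed(a)))  # True: insertion order is irrelevant
```

If the format allowed optional whitespace or arbitrary ordering, two
writers could produce different bytes, and therefore different names,
for the same tree.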

		Linus


* Re: A shortcoming of the git repo format
  2005-04-27 18:32     ` Dave Jones
  2005-04-27 18:47       ` H. Peter Anvin
@ 2005-04-27 19:15       ` Linus Torvalds
  2005-04-27 19:39       ` Petr Baudis
  2 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2005-04-27 19:15 UTC (permalink / raw)
  To: Dave Jones; +Cc: H. Peter Anvin, Git Mailing List



On Wed, 27 Apr 2005, Dave Jones wrote:
> 
> That actually broke one of my first git scripts when one of the
> changelog texts started a line with 'tree '.  I hacked around it
> by making my script only grep in the 'head -n4' lines, but this
> seems somewhat fragile having to make assumptions that the field
> I want to see is in the first 4 lines.

It's not an assumption.

IT'S THE LAW.

The speed of light is not "an assumption". It is.

The tree is in the first line of a commit. You don't even need to parse 
it, you do

	tree=$(cat-file commit $head | sed 's/tree //;q')

and that's it. No parsing.

Git doesn't guess. Git knows.

		Linus


* Re: A shortcoming of the git repo format
  2005-04-27 18:32     ` Dave Jones
  2005-04-27 18:47       ` H. Peter Anvin
  2005-04-27 19:15       ` Linus Torvalds
@ 2005-04-27 19:39       ` Petr Baudis
  2 siblings, 0 replies; 29+ messages in thread
From: Petr Baudis @ 2005-04-27 19:39 UTC (permalink / raw)
  To: Dave Jones; +Cc: H. Peter Anvin, Linus Torvalds, Git Mailing List

Dear diary, on Wed, Apr 27, 2005 at 08:32:40PM CEST, I got a letter
where Dave Jones <davej@redhat.com> told me that...
> That actually broke one of my first git scripts when one of the
> changelog texts started a line with 'tree '.  I hacked around it
> by making my script only grep in the 'head -n4' lines, but this
> seems somewhat fragile having to make assumptions that the field
> I want to see is in the first 4 lines.

The tree field is now always at the first line, but generally the header
part is variable-sized; you have multiple parent lines in case of
merges.

Just stop reading at the first empty line.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor


* The git repo format
  2005-04-27 19:11     ` Linus Torvalds
@ 2005-04-27 19:47       ` Brian O'Mahoney
  2005-04-27 20:40       ` A shortcoming of the " H. Peter Anvin
  1 sibling, 0 replies; 29+ messages in thread
From: Brian O'Mahoney @ 2005-04-27 19:47 UTC (permalink / raw)
  Cc: Git Mailing List

In understanding how to work with 'git' I had a number of initial
difficulties which are mostly covered by the e-mail from Linus below.

Most of these are already covered in the README:

for objects, ie blob, commit, tag, tree: inflate, then
<type>\s<size>\0<data>

where <data> is in the form, described by Linus below
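
A sketch of that envelope in Python, assuming "inflate" means plain zlib
decompression and that the object name is the SHA1 of the uncompressed
"<type> <size>\0<data>" bytes. The function names are hypothetical.

```python
import hashlib
import zlib

def encode_object(obj_type: str, data: bytes) -> bytes:
    """Build '<type> <size>\\0<data>'."""
    return obj_type.encode("ascii") + b" " + str(len(data)).encode("ascii") + b"\0" + data

def object_name(obj_type: str, data: bytes) -> str:
    """Hex SHA1 of the uncompressed object (assumed naming rule)."""
    return hashlib.sha1(encode_object(obj_type, data)).hexdigest()

def parse_object(compressed: bytes):
    """Inflate a stored object and return (type, data), checking the size."""
    raw = zlib.decompress(compressed)
    header, sep, data = raw.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL after header")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(data):
        raise ValueError("size field does not match data length")
    return obj_type.decode("ascii"), data

stored = zlib.compress(encode_object("blob", b"hello\n"))
kind, data = parse_object(stored)
```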

when you look at them closely, all the formats are simple,
un-ambiguous, and very easy to parse.

The index is also easy to parse, but there is a detail,
after the 3-int header the records are padded to a multiple
of 8 bytes. The detail is in cache.h.
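
The padding rule can be sketched as below. This is an assumption-laden
illustration: the rounding expression is my recollection of the macro in
cache.h and should be checked there, and fixed_len (the size of the
fixed fields that precede the name) is left as a parameter rather than
stated as fact.

```python
def padded_entry_size(fixed_len: int, name_len: int) -> int:
    """Round fixed_len + name_len up to a multiple of 8, keeping at
    least one padding byte after the name (assumed rule, see cache.h)."""
    return (fixed_len + name_len + 8) & ~7

# With a hypothetical 62-byte fixed part:
sizes = [padded_entry_size(62, n) for n in (1, 2, 10)]
```

A reader that walks the index therefore advances by the padded size, not
by the raw length of the name.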

Maybe the README needs to re-inforce this.

Brian

> I repeat: git does not do any free-form parsing AT ALL.
========================================================

 The links are in well-defined places, and you do not ever search for them.

And that's really very very important.


> For a "commit", the format is
> 
>  - first line is exactly 46 bytes: five bytes of "tree ", 40 bytes of hex 
>    sha1, and one byte of "\n".
> 
>    NOTHING ELSE. Not extra spaces at the end, not extra spaces at the 
>    beginning or the middle. It's ASCII, but it's not free-format ASCII.
> 
>  - the next <n> (where 'n' can be 0 or more) lines are _exactly_ 48 bytes
>    each:  seven bytes of "parent ", 40 bytes of hex sha1, and one byte of 
>    "\n".
> 
>    NOTHING ELSE.
> 
>  - the next lines are "author " and "committer ". They have well-defined
>    delimiters for their fields, and no sha1's. The fields cannot contain
>    '<', '>' or newlines, since those are the field/line delimiters.
> 
> There is no free-format text _anywhere_ that git parses. No room for 
> guesses, no room for mistakes, no room for anything half-way questionable.
> 
> And fsck actually enforces this. We do _not_ just use "gets()" to read one 
> line at a time. We literally verify that the lines are 46/48 bytes long, 
> and have the delimiters in the expected places.
> 
> Same goes for "tree" and "tag" objects. They all have fixed-format stuff. 
> A "tree" entry is always
> 
> 	"%o <space> %s" \0 [ 20 bytes of sha1 ]
> 
> with "%o" being "mode", and "%s" being "path". We don't guess. 
> 
> And this really is _important_. Exactly because we name things by the SHA1
> hash of the contents, we MUST NOT have flexible formats. Having a format
> which allows non-canonical representations (extra spaces etc) would mean
> that two trees that were identical would depend on how you happened to
> format them.
> 
> So there's really two issues:
>  - we don't guess or parse contents. We have strict rules, and that makes 
>    git more reliable. There are no gray areas. There's "right" and there 
>    is "wrong", and the right one works, and the wrong one gets flagged as 
>    being wrong and the tools refuse to touch it.
>  - there is only _one_ right way to do things, and that means that the
>    content is well-defined, and thus the SHA1 of the content is
>    well-defined.
> 
> For example, another rule is that a "tree" object is always sorted by 
> the bytes in the filename (not by entry, btw: a directory called "foo" 
> will sort as "foo/", even though the _entry_ only shows "foo"). That rule 
> not only makes a lot of operations faster, but again, it means that there 
> is only _one_ way to represent a tree validly.
> 
> IOW, you _cannot_ represent a tree any other way (and I've been too lazy
> to check this in fsck, but it's always been my plan), and that is exactly
> why we can just compare the hashes of the results - because there is no 
> random component of "layout" in the contents.
> 
> This really is important. It means that if you get to the same two tree
> contents in totally unrelated ways (you unpack a tar-file and encode it in
> git, or you have 5 years of git history and check it out), the "tree" will
> match _exactly_. There's no history. There's no "optional" stuff. Since
> the contents of the trees are the same, the SHA1 of the two trees will be
> the same. Exactly because git refuses to touch any free-format stuff.



* Re: A shortcoming of the git repo format
  2005-04-27 19:11     ` Linus Torvalds
  2005-04-27 19:47       ` The " Brian O'Mahoney
@ 2005-04-27 20:40       ` H. Peter Anvin
  2005-04-27 20:49         ` Tom Lord
                           ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-27 20:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> No, that's definitely _not_ the point.
> 
> I repeat: git does not do any free-form parsing AT ALL. The links are in
> well-defined places, and you do not ever search for them. And that's 
> really very very important.
> 

I know that.  However, is that going to be true for all versions of the 
repository format over all time?  If so, the repository format is brittle.

 > > Currently there is no  such delimiter for that.
 >
 > There absolutely is.
 >
 > For a "commit", the format is...

My point was that with a syntactic delimiter, one can write a tool that 
doesn't necessarily know everything about every tag, including future 
tags which may not have been invented when the tool was written.

One can simply say "we don't do that"; finding an unknown tag is always 
a fatal error.  That means the format is more brittle, but brittle does 
mean it breaks as opposed to getting deformed in some potentially
undesirable way.

	-hpa


* Re: A shortcoming of the git repo format
  2005-04-27 20:40       ` A shortcoming of the " H. Peter Anvin
@ 2005-04-27 20:49         ` Tom Lord
  2005-04-27 20:59           ` H. Peter Anvin
  2005-04-28  0:57           ` Linus Torvalds
  2005-04-27 20:56         ` Linus Torvalds
  2005-04-27 23:50         ` Daniel Barkalow
  2 siblings, 2 replies; 29+ messages in thread
From: Tom Lord @ 2005-04-27 20:49 UTC (permalink / raw)
  To: hpa; +Cc: git

   From: "H. Peter Anvin" <hpa@zytor.com>

   Linus Torvalds wrote:
   > 
   > No, that's definitely _not_ the point.
   > 
   > I repeat: git does not do any free-form parsing AT ALL. The links are in
   > well-defined places, and you do not ever search for them. And that's 
   > really very very important.
   > 

   I know that.  However, is that going to be true for all versions of the 
   repository format over all time?  If so, the repository format is brittle.

I think one has to understand Linus' posts as coming from the
"head-down, steaming ahead for *MY* project cause you all suck"
perspective and impose corresponding filters on his declarations of
"LAW".  At least that's the only way *I* can make sense of his latest
contributions.

If you get git, just do the right thing -- Linus be damned.

-t


* Re: A shortcoming of the git repo format
  2005-04-27 20:40       ` A shortcoming of the " H. Peter Anvin
  2005-04-27 20:49         ` Tom Lord
@ 2005-04-27 20:56         ` Linus Torvalds
  2005-04-28  0:45           ` David A. Wheeler
  2005-04-27 23:50         ` Daniel Barkalow
  2 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2005-04-27 20:56 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Git Mailing List

On Wed, 27 Apr 2005, H. Peter Anvin wrote:
> 
> I know that.  However, is that going to be true for all versions of the 
> repository format over all time?  If so, the repository format is brittle.

I agree, it's brittle by design, exactly because I think it's very 
important not to allow any variations.

HOWEVER, that's where "convert-cache" comes in. Any one particular format 
may be brittle, but if we accept that, and just say "we can upgrade by 
converting the cache", then we should be ok. IOW, we can change from one 
brittle format with 160-bit SHA1 names to _another_ brittle format with 
256-bit SHA1 (or other) names.
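
What such a conversion amounts to can be sketched for the simplest case,
a single flat tree of blobs. This is not how convert-cache is
implemented (the thread does not show that); it only illustrates
rewriting every link with a name computed by a different hash. The
names name_of and convert_tree, and the choice of SHA-256, are
assumptions made for the example.

```python
import hashlib

def name_of(kind: bytes, body: bytes, algo: str) -> bytes:
    """Raw object name under the given hash algorithm."""
    h = hashlib.new(algo)
    h.update(kind + b" " + str(len(body)).encode("ascii") + b"\0" + body)
    return h.digest()

def convert_tree(entries, blobs, algo="sha256"):
    """entries: list of (mode_bytes, path_bytes, old_name).
    blobs: dict mapping old_name -> blob contents.
    Returns (new_tree_name, new_entries) with every link rewritten."""
    new_entries = [
        (mode, path, name_of(b"blob", blobs[old], algo))
        for mode, path, old in entries
    ]
    body = b"".join(mode + b" " + path + b"\0" + name
                    for mode, path, name in new_entries)
    return name_of(b"tree", body, algo), new_entries
```

Because every object that links to another embeds its name, the whole
graph has to be rewritten bottom-up, which is why the conversion is done
by a dedicated tool rather than by tolerant readers.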

> My point was that with a syntactic delimiter, one can write a tool that 
> doesn't necessarily know everything about every tag, including future 
> tags which may not have been invented when the tool was written.

Now, I kind of agree with that, but not on a "object level".

But exactly because the object level is "brittle by design", and because I
think the way to fix that is convert-cache (which may do _big_ changes to the
format), I really don't think that the objects should ever be looked at 
except with very precise tools.

But when it comes to "higher-level information", I agree with you 100%.

For example, this _is_ actually why I wanted pasky to change the format of 
"git log" (now cg-log). Exactly so that the output of that isn't brittle, 
it now prepends spaces to the free-form part.
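
A sketch of why the indentation helps, assuming a four-space indent and
made-up field names (the exact cg-log output is not shown in this
thread). format_entry and header_lines are hypothetical helpers.

```python
def format_entry(headers, message: str) -> str:
    """headers: list of (tag, value). The message is indented by 4 spaces."""
    lines = ["%s %s" % (tag, value) for tag, value in headers]
    lines.append("")
    lines.extend("    " + l for l in message.split("\n"))
    return "\n".join(lines) + "\n"

def header_lines(output: str):
    """Header lines are the non-empty lines that do not start with a space."""
    return [l for l in output.split("\n") if l and not l.startswith(" ")]

out = format_entry(
    [("commit", "c" * 40), ("tree", "d" * 40)],
    "fix\ntree walking was broken",
)
heads = header_lines(out)  # the message line starting with "tree " is not included
```

A script grepping for lines that begin with "tree " can then never be
confused by changelog text, whatever the message contains.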

		Linus


* Re: A shortcoming of the git repo format
  2005-04-27  5:43 A shortcoming of the git repo format H. Peter Anvin
  2005-04-27 15:00 ` C. Scott Ananian
  2005-04-27 15:22 ` Linus Torvalds
@ 2005-04-27 20:58 ` Gerhard Schrenk
  2 siblings, 0 replies; 29+ messages in thread
From: Gerhard Schrenk @ 2005-04-27 20:58 UTC (permalink / raw)
  To: git

* H. Peter Anvin <hpa@zytor.com> [2005-04-27 07:43]:
> Most of git's files are starting to converge toward an RFC822-like 
> header with (tag, data) and a free-form section.  This is a good
> thing.

I really hate RFC822-like data structures. Why? Lazy, straightforward
people (who have written too many mails) tend to break the relational
data model and don't realize what they lose. Usually they introduce
non-atomic tags like

Tag: value1, value2

and game over. You have just broken first normal form (1NF). In the
end, the point of relational normalization is simply not to break the
functional dependencies of your data. It's worth it.

I'm reacting like Pavlov's dog and really don't know what I'm talking
about (namely git). But please don't make the same mistake and just
equate relational = SQL = crap. The shell's stream-of-operators paradigm
fits the relational model very well. It's certainly closer to
relational algebra than SQL...

Take care
Gerhard


* Re: A shortcoming of the git repo format
  2005-04-27 20:49         ` Tom Lord
@ 2005-04-27 20:59           ` H. Peter Anvin
  2005-04-28  0:57           ` Linus Torvalds
  1 sibling, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-27 20:59 UTC (permalink / raw)
  To: Tom Lord; +Cc: git

Tom Lord wrote:
> 
> I think one has to understand Linus' posts as coming from the
> "head-down, steaming ahead for *MY* project cause you all suck"
> perspective and impose corresponding filters on his declarations of
> "LAW".  At least that's the only way *I* can make sense of his latest
> contributions.
> 
> If you get git, just do the right thing -- Linus be damned.
> 

It's fair for Linus to want to make things behave a certain way in a 
project.  There are design decisions which have tradeoffs both ways -- 
robust (but subject to partial information issues) versus brittle (but 
safe.)

That's part of why I prefer to ask first.

	-hpa


* Re: A shortcoming of the git repo format
  2005-04-27 18:47       ` H. Peter Anvin
@ 2005-04-27 22:51         ` Jon Seymour
  0 siblings, 0 replies; 29+ messages in thread
From: Jon Seymour @ 2005-04-27 22:51 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Dave Jones, Linus Torvalds, Git Mailing List

On 4/28/05, H. Peter Anvin <hpa@zytor.com> wrote:
> Dave Jones wrote:
> >
> > That actually broke one of my first git scripts when one of the
> > changelog texts started a line with 'tree '.  I hacked around it
> > by making my script only grep in the 'head -n4' lines, but this
> > seems somewhat fragile having to make assumptions that the field
> > I want to see is in the first 4 lines.
> >
> 
> You have the delimiter for that; there is an empty line between the
> header and the free-form body, as in RFC822.
> 

...and a relatively simple way to use that rule to extract just the
header lines:

      sed -n "1,/^\$/p"                     # with the separator line

or, either one of these to remove the separator line as well:

      sed -n "1,/^\$/s/^\(..*\)/\1/p"  
      sed -n "1,/^\$/p" | tr -s \\012

jon
-- 
homepage: http://www.zeta.org.au/~jon/
blog: http://orwelliantremors.blogspot.com/


* Re: A shortcoming of the git repo format
  2005-04-27 20:40       ` A shortcoming of the " H. Peter Anvin
  2005-04-27 20:49         ` Tom Lord
  2005-04-27 20:56         ` Linus Torvalds
@ 2005-04-27 23:50         ` Daniel Barkalow
  2005-04-27 23:56           ` H. Peter Anvin
  2 siblings, 1 reply; 29+ messages in thread
From: Daniel Barkalow @ 2005-04-27 23:50 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, Git Mailing List

On Wed, 27 Apr 2005, H. Peter Anvin wrote:

> One can simply say "we don't do that"; finding an unknown tag is always 
> a fatal error.  That means the format is more brittle, but brittle does 
> mean it breaks as opposed to getting deformed in some potentially
> undesirable way.

If you find an object with an unknown tag, you can't do much with it
anyway, even if it has a format that matches generic rules. Sure, you
could trace reachability through it, but that's only helpful for a couple
of generic programs (fsck and pull), and those programs ought to
additionally have some clue about what's going on if they're going to act
appropriately.

On the other hand, it is probably true that programs should be able to
deal abstractly with new tags if built with a libgit that supports them,
but that's something that we can arrange a bit later.

	-Daniel
*This .sig left intentionally blank*


* Re: A shortcoming of the git repo format
  2005-04-27 23:50         ` Daniel Barkalow
@ 2005-04-27 23:56           ` H. Peter Anvin
  2005-04-28  1:51             ` Daniel Barkalow
  0 siblings, 1 reply; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-27 23:56 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, Git Mailing List

Daniel Barkalow wrote:
> 
> If you find an object with an unknown tag, you can't do much with it
> anyway, even if it has a format that matches generic rules. Sure, you
> could trace reachability through it, but that's only helpful for a couple
> of generic programs (fsck and pull), and those programs ought to
> additionally have some clue about what's going on if they're going to act
> appropriately.
> 
> On the other hand, it is probably true that programs should be able to
> deal abstractly with new tags if built with a libgit that supports them,
> but that's something that we can arrange a bit later.
> 
> 	-Daniel

There are a fair number of tools one may want that deal with reachability.

	-hpa



* Re: A shortcoming of the git repo format
  2005-04-27 20:56         ` Linus Torvalds
@ 2005-04-28  0:45           ` David A. Wheeler
  2005-04-28  0:46             ` David Lang
  0 siblings, 1 reply; 29+ messages in thread
From: David A. Wheeler @ 2005-04-28  0:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, Git Mailing List

Linus Torvalds wrote:
> 
> On Wed, 27 Apr 2005, H. Peter Anvin wrote:
> 
>>I know that.  However, is that going to be true for all versions of the 
>>repository format over all time?  If so, the repository format is brittle.
> 
> I agree, it's brittle by design, exactly because I think it's very 
> important not to allow any variations.

In the short term, not allowing any variations is probably a
good thing, it'll winnow out mistakes.  Creating a format that
COULD change in the future is, however, a very good way of avoiding
getting boxed into a corner if it turns out a mistake has been made.

> HOWEVER, that's where "convert-cache" comes in. Any one particular format 
> may be brittle, but if we accept that, and just say "we can upgrade by 
> converting the cache", then we should be ok. IOW, we can change from one 
> brittle format with 160-bit SHA1 names to _another_ brittle format with 
> 256-bit SHA1 (or other) names.

There's a disadvantage to that, unfortunately: invalidating signatures.
Yes, you can get people to re-sign their stuff... assuming you can
find them & convince them to do it (ha!).  More than likely,
you'll lose signatures that way.  Probably not your TOP priority,
but there are advantages to being able to go back & years later
SHOW that someone really did sign something.

In the long run, I'd really like to see (at least) signed commits,
and that those signatures would "stick around" cleanly into the future.
"Breaks" can be handled other ways, but it is DEFINITELY a pain,
and an avoidable one.

--- David A. Wheeler

* Re: A shortcoming of the git repo format
  2005-04-28  0:45           ` David A. Wheeler
@ 2005-04-28  0:46             ` David Lang
  0 siblings, 0 replies; 29+ messages in thread
From: David Lang @ 2005-04-28  0:46 UTC (permalink / raw)
  To: David A. Wheeler; +Cc: Linus Torvalds, H. Peter Anvin, Git Mailing List

On Wed, 27 Apr 2005, David A. Wheeler wrote:

> Linus Torvalds wrote:
>> 
>> On Wed, 27 Apr 2005, H. Peter Anvin wrote:
>> 
>>> I know that.  However, is that going to be true for all versions of the 
>>> repository format over all time?  If so, the repository format is brittle.
<<SNIP>> 
>> HOWEVER, that's where "convert-cache" comes in. Any one particular format 
>> may be brittle, but if we accept that, and just say "we can upgrade by 
>> converting the cache", then we should be ok. IOW, we can change from one 
>> brittle format with 160-bit SHA1 names to _another_ brittle format with 
>> 256-bit SHA1 (or other) names.
>
> There's a disadvantage to that, unfortunately: invalidating signatures.
> Yes, you can get people to re-sign their stuff... assuming you can
> find them & convince them to do it (ha!).  More than likely,
> you'll lose signatures that way.  Probably not your TOP priority,
> but there are advantages to being able to go back & years later
> SHOW that someone really did sign something.

all you have to do is to make sure that convert-cache doesn't lose any 
data and you can always convert back (through as many steps as needed) to 
check signatures.

no matter what you do, if you change the thing that's being signed the 
signature is worthless, it doesn't matter if you change it in a flexible 
or a brittle way, it's different. the brittle approach actually makes it 
easier to go backwards as you KNOW exactly what it needs to be, there's no 
possibility that a later tag was there, but being ignored (except for the 
signature)
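
A minimal sketch of that round trip, under loud assumptions: HMAC-SHA1
stands in for a real public-key signature, and the "format 2" prefix is a
made-up format change, not anything convert-cache actually does. The point
is only that a signature made over the old bytes fails on the converted
bytes and passes again once a lossless conversion is reversed.

```python
import hashlib
import hmac

KEY = b"demo-key"            # stand-in for a signing key (assumption)
V2_PREFIX = b"format 2\n"    # hypothetical marker for the "new" format


def sign(data: bytes) -> str:
    """Toy signature: HMAC-SHA1 over the exact bytes being signed."""
    return hmac.new(KEY, data, hashlib.sha1).hexdigest()


def verify(data: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign(data), signature)


def convert_forward(old: bytes) -> bytes:
    """Hypothetical lossless upgrade: every old byte is preserved."""
    return V2_PREFIX + old


def convert_back(new: bytes) -> bytes:
    """Exact inverse of convert_forward."""
    if not new.startswith(V2_PREFIX):
        raise ValueError("not a format-2 object")
    return new[len(V2_PREFIX):]


old_commit = b"tree t1\nauthor A U Thor\n\ninitial commit\n"
signature = sign(old_commit)
new_commit = convert_forward(old_commit)

print(verify(new_commit, signature))                # converted bytes differ
print(verify(convert_back(new_commit), signature))  # original bytes restored
```

If the conversion dropped or reordered anything, convert_back could not
reproduce the signed bytes and the second check would fail as well.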

> In the long run, I'd really like to see (at least) signed commits,
> and that those signatures would "stick around" cleanly into the future.
> "Breaks" can be handled other ways, but it is DEFINITELY a pain,
> and an avoidable one.
>
> --- David A. Wheeler

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

* Re: A shortcoming of the git repo format
  2005-04-27 20:49         ` Tom Lord
  2005-04-27 20:59           ` H. Peter Anvin
@ 2005-04-28  0:57           ` Linus Torvalds
  2005-04-28  1:34             ` Paul Jackson
                               ` (4 more replies)
  1 sibling, 5 replies; 29+ messages in thread
From: Linus Torvalds @ 2005-04-28  0:57 UTC (permalink / raw)
  To: Tom Lord; +Cc: hpa, git

On Wed, 27 Apr 2005, Tom Lord wrote:
> 
> I think one has to understand Linus' posts as coming from the
> "head-down, steaming ahead for *MY* project cause you all suck"
> perspective and impose corresponding filters on his declarations of
> "LAW".

I'm really being very head-strong on these things, and much more so than I
normally am, because quite frankly, I see "git" as a very different
project from Linux.

(Which is not to say that I'm not opinionated even normally, but I'm 
normally a bit more open to listen to other people ;)

There's two huge differences between git and Linux, and I'm really sorry
if they make me act as an asshole, but they are important to me:

 - with Linux, lots of people know what the "right thing" is, because the 
   UNIX mindset has really been a kind of "social background" that has 
   been around for long enough that it has institutionalized knowledge
   about what an OS is supposed to do.

   This means that in 99% of all technical discussions about the kernel, 
   people are already coming at the problem roughly from the same 
   stand-point. It's not _universally_ true, but I really think that the 
   institutionalized (but not always conscious) philosophy of UNIX is what 
   has made it a lot easier to talk about almost all kernel issues,
   because people have generally the same expectations of what is "good".

   Doing development is a lot about communication. Writing code in many 
   ways is secondary - it's much more important to try to make sure that 
   everybody knows what the goals are, because the _real_ pain in 
   development ends up being not the coding, but the much more fundamental 
   disagreements that happen when people really have totally different 
   expectations of what the end result is going to be.

   SCM's don't have this. Quite the reverse. I see 30 years of "CVS" being 
   the common language for a lot of people, and the fact is, most of the
   people on this mailing list probably never _really_ used BK, and do not
   really understand very deeply about how the distributed model actually 
   ends up working in _practice_.

   I think a lot of people understand it intellectually, but I really do 
   think that we're lacking the kind of "institutionalized" knowledge
   where people understand things at a much more visceral level.

 - With Linux, I never had something I needed to get _done_. Even when I 
   started, it was just for fun, and by the time others joined in, the 
   system already did much more than I initially envisioned, so everything 
   was really "gravy".

   With git, this isn't the case. The _only_ reason I started git in the 
   first place is that I knew better than pretty much anybody else what my
   needs were, and I was forced to act on them because nothing out there 
   really solved the problem for me.

In other words: I _know_ that I've been unpleasant. I'm sorry about that, 
but I am trying to explain _why_ I'm being an asshole about things, more 
so than I usually am.

I'm not actually all that interested in SCM's. I'd have been much happier
if I never had to start doing git in the first place. But circumstances
not only forced me to do my own, it also so happens that I don't believe
that there are many people around that have ever really _seen_ what my
kind of development requirements are.

What does that boil down to? It means, for example, that to me it doesn't
matter one _whit_ if you've been doing SCM's for the last thirty years,
and you can do xdelta algorithms in your sleep.

Quite the reverse: such a person "knows" a lot of things, but I'm pretty
damn sure that such a person has _never_ actually worked on a system that
works the way the kernel development does, which means that most of the
things that person "knows" are things that may need to be un-learnt.

And because I don't actually _care_ about SCM's, and only care about
getting to the point where I (once more) don't have to even think about
the SCM that I use for the kernel, I also don't have much incentive to
worry about CM models that may well be very valid outside of kernel work.

See? When it comes to my Linux work, I'm very inclusive. Linux already
does everything _I_ need it to do, so in many ways, all that really
motivates me to improve it are really about other people's needs, and as
such, I'm really really interested in what _other_ people want. I still
say "no, that's not how we do things", but that's much less contentious.

In contrast, with git, I'm totally uninterested in anything that doesn't
make my kernel work go faster or more smoothly, and does so _today_. Which 
makes me a cantankerous old bastard, and bite the heads off anybody who 
isn't focused on that one thing.

And I really _am_ sorry. I don't actually _like_ being nasty about these 
things. But when it comes to git, I have one motivation, and one 
motivation only, and being nice about it isn't going to help. 

The good news? I actually think my needs are very basic. Once git gets to 
the point where it does what I need it to do, I don't really have any 
motivation to say "this is how we do it" any more. And I think we're 
actually getting to that point fairly soon. That's not saying git is 
"done", any less than Linux was "done" in 1992. It's just that at that 
point I don't have any reason to be a nasty control freak any more.

In fact, I don't see myself even maintaining the project, especially since
there seem to be others that are more motivated to do so than I am. Then
I'll just go back into my dark kernel cave, and hopefully I don't have to
come out again for a while.

But for now, the _only_ point of git is as a kernel maintenance tool. 
There are tons of other SCM systems that are probably better for other 
projects, so if git is "just another SCM project", then git is totally 
pointless. So for now, the absolutely _only_ thing that matters for git 
design (as far as I'm concerned) is "how well does it suit Linus".

			Linus

* Re: A shortcoming of the git repo format
  2005-04-28  0:57           ` Linus Torvalds
@ 2005-04-28  1:34             ` Paul Jackson
  2005-04-28  2:14             ` Tom Lord
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Paul Jackson @ 2005-04-28  1:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: lord, hpa, git

Dang ... don't apologize too much ... it's fun watching Linus be a
cranky git.

This is turning into something neat, something different and special,
and no way we'd have gotten here using the usual ways or means.

And we're all pretty damn confident that you won't be playing SCM
dictator for long - tools are obviously not your first love.

Every China Shop needs a good Bull now and then.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

* Re: A shortcoming of the git repo format
  2005-04-27 23:56           ` H. Peter Anvin
@ 2005-04-28  1:51             ` Daniel Barkalow
  2005-04-28  1:56               ` H. Peter Anvin
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Barkalow @ 2005-04-28  1:51 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, Git Mailing List

On Wed, 27 Apr 2005, H. Peter Anvin wrote:

> There are a fair number of tools one may want that deal with reachability.

Do you agree that installing a new libgit.so when you want to apply such a
tool to a new tag is sufficient? If the library is shared, and everything
for parsing the objects (to the point of getting struct object filled
out) is in the library, and you want to have some tool able to validate or
use any new tag that you want reachability-only tools to process, not
having a standard header proto-format for future tags isn't a problem,
since you'll get upgrades to the parser portion of all of your tools
together.
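
A rough sketch of that arrangement, with everything in it hypothetical:
LINK_TAGS and parse_links() stand in for the shared parser, the object
names are toy strings rather than SHA1s, and every object is treated as
"header, blank line, body" for simplicity. The reachability tool itself
never looks at tag names, so teaching the parser a new link tag updates
every tool built on it at once.

```python
from typing import Dict, List, Set

# Stand-in for the shared parser: the single place that knows which
# header tags hold links to other objects (assumed set, for illustration).
LINK_TAGS = {"tree", "parent"}


def parse_links(obj: bytes) -> List[str]:
    """Return the object names referenced from an object's header."""
    header, _, _body = obj.partition(b"\n\n")
    links = []
    for line in header.decode("utf-8", errors="replace").splitlines():
        tag, _, value = line.partition(" ")
        if tag in LINK_TAGS:
            links.append(value)
    return links


def reachable(store: Dict[str, bytes], start: str) -> Set[str]:
    """Reachability-only tool: walks links without knowing any tag names."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen or name not in store:
            continue
        seen.add(name)
        stack.extend(parse_links(store[name]))
    return seen


store = {
    "c2": b"tree t1\nparent c1\nauthor X\n\nsecond\n",
    "c1": b"tree t0\nauthor X\n\nfirst\n",
    "t1": b"",
    "t0": b"",
    "orphan": b"",
}
print(sorted(reachable(store, "c2")))
```

Adding a new link-bearing tag here means one change to LINK_TAGS; a tool
written in another language would need its own copy of that knowledge.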

	-Daniel
*This .sig left intentionally blank*

* Re: A shortcoming of the git repo format
  2005-04-28  1:51             ` Daniel Barkalow
@ 2005-04-28  1:56               ` H. Peter Anvin
  0 siblings, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2005-04-28  1:56 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, Git Mailing List

Daniel Barkalow wrote:
> On Wed, 27 Apr 2005, H. Peter Anvin wrote:
>  
>>There are a fair number of tools one may want that deal with reachability.
>  
> Do you agree that installing a new libgit.so when you want to apply such a
> tool to a new tag is sufficient? If the library is shared, and everything
> for parsing the objects (to the point of getting struct object filled
> out) is in the library, and you want to have some tool able to validate or
> use any new tag that you want reachability-only tools to process, not
> having a standard header proto-format for future tags isn't a problem,
> since you'll get upgrades to the parser portion of all of your tools
> together.
> 

Only if language bindings are created for this library.

	-hpa

* Re: A shortcoming of the git repo format
  2005-04-28  0:57           ` Linus Torvalds
  2005-04-28  1:34             ` Paul Jackson
@ 2005-04-28  2:14             ` Tom Lord
  2005-04-28  3:37             ` Ryan Anderson
                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Tom Lord @ 2005-04-28  2:14 UTC (permalink / raw)
  To: torvalds; +Cc: hpa, git

   > I think a lot of people understand it intellectually, but I really do 
   > think that we're lacking the kind of "institutionalized" knowledge
   > where people understand things at a much more visceral level.

I know that Arch and its progeny, as they stand, don't seduce you
but you should be made aware that the Arch community is one where
good SCM sense that you would agree with (although you might not
recognize it at once) is well on the path to being institutionalized.
It's gratifying/amazing/inspiring to see a bunch of folk catch up 
on the topic.

One thing there's still a shortage of in my world is folks steeped
in both perspectives: "unix" /and/ SCM.  Thus, I get folks who have
pretty decent SCM ideas in the abstract -- plus utterly terrible 
ideas about how to make them real.

There is a higher-level bug I think you'll eventually viscerally 
feel yourself, related to:

   > I think a lot of people understand it intellectually, but I really do 
   > think that we're lacking the kind of "institutionalized" knowledge
   > where people understand things at a much more visceral level.

Once you get to the BK or Arch level of SCM, beyond that there are
many possible paths.  Many of those are false paths -- imaginary
(unrealizable) ideals about how things like merging can work and
be good.   Some people seem to get stuck on those paths.

   > With git, this isn't the case. The _only_ reason I started git in the 
   > first place is that I knew better than pretty much anybody else what my
   > needs were, and I was forced to act on them because nothing out there 
   > really solved the problem for me.

That's debatable but neither here nor there.  Supposing that Arch
were /perfect/ for your needs today (which I don't claim) -- `git'
would still have been the better route to take (though my reasons
probably aren't the same as yours).

   > I'm not actually all that interested in SCM's.

In a certain way: same here, oddly enough.  Go figure.

   > Quite the reverse: such a person "knows" a lot of things, but I'm pretty
   > damn sure that such a person has _never_ actually worked on a system that
   > works the way the kernel development does

I've been avoiding the topic of how kernel development works ever since
I realized that, with each additional detail you reveal, I have little
but yellow and red cards to raise.   Doesn't seem productive to have that
fight when the option of simply improving the situation is open.

   > And I really _am_ sorry. I don't actually _like_ being nasty about these 
   > things.

It's healthy enough that you are, for your sanity and others.  Just 
be tolerant of others pointing that out.

   > The good news? I actually think my needs are very basic.

So it would seem.  This is partly because the process you advertise
yourself as doing is, sorry, garbage.  It's understandable why it
happens to work for now, but it's garbage nonetheless.  Not your fault --
you haven't been afforded the degrees of freedom to do better, afaict.

   > But for now, the _only_ point of git is as a kernel maintenance tool. 

Math is math.  You don't get to say what it means.

-t

* Re: A shortcoming of the git repo format
  2005-04-28  0:57           ` Linus Torvalds
  2005-04-28  1:34             ` Paul Jackson
  2005-04-28  2:14             ` Tom Lord
@ 2005-04-28  3:37             ` Ryan Anderson
  2005-04-28  8:31             ` Morgan Schweers
  2005-04-28 15:08             ` Barry Silverman
  4 siblings, 0 replies; 29+ messages in thread
From: Ryan Anderson @ 2005-04-28  3:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tom Lord, hpa, git

On Wed, Apr 27, 2005 at 05:57:07PM -0700, Linus Torvalds wrote:
> On Wed, 27 Apr 2005, Tom Lord wrote:
> 
> I'm not actually all that interested in SCM's. I'd have been much happier
> if I never had to start doing git in the first place. But circumstances
> not only forced me to do my own, it also so happens that I don't believe
> that there are many people around that have ever really _seen_ what my
> kind of development requirements are.

Oddly, I was trying to answer "Why distributed?" in a discussion on the
"Joel On Software" forum.

The particular thread I posted on, well, was kind of stupid, but in case
anyone is curious: http://discuss.joelonsoftware.com/default.asp?joel.3.115346.51

What I said might help give an overview of how Linux development works,
from my point of view.  I only occasionally poke at interesting things
on the periphery on whims, but I poke at the SCMy aspects of it, so
maybe it's relevant. 

 Here's an overview of how the distributed world of Linux works:

 1. Linus has his personal tree.  He pushes it out on a regular basis to
 rsync.kernel.org (well, kinda - that's where it ends up at).

 2. "Trusted lieutenants" have their own trees.  Some keep these on
 *.kernel.org, some don't.

 3. Lots of other people have personal trees.  These can be pretty much
 anywhere.

 These trees are in a variety of formats today, some are in "git", some
 are still in BitKeeper, some are from a tarball, some are tarball +
 patches, some are git + patches.

 There are a variety of merging methods:

 a.  Provide a publicly accessible repository (formerly BK, now "git")
 that Linus, or a maintainer (i.e., "trusted lieutenant"), can grab the
 changes from.  In the email where this location is given, the patch is
 usually included, at least in a summary format.

 b.  Provide a series of emails, with a description per email followed,
 inline, with a patch.

 These merging methods can be done directly with Linus, or with anyone
 else who is interested.  (Generally, merging with Linus is for arch and
 subsystem maintainers, or random small things that are either obviously
 correct, useful, or just don't fit elsewhere.)

 So, that's the merge process, for the most part.

 Now, most patches these days are going through Andrew Morton - even if
 he's not actually submitting them personally, he's probably putting
 them into his tree for testing purposes.  (Networking changes go direct
 to Linus, but Andrew keeps an up to date version of them in his -mm
 series of kernels.)

 If code isn't accepted, well, one of a couple things happens:
 1. The patch is silently ignored.  (This is less of a problem these
 days.)

 2. The patch is commented on and someone says, "No".  (Generally, this
 happens a few times for "new" code, as people try to get the concept to
 fit into the kernel in the cleanest way.  There are a lot of style nits
 at this point, but also discussions of "Is this the right way to do
 this?" and "Do we need a more general method to do this instead of this
 hack?")

 Verifying that testing has occurred is less important than you might
 think.  This is basically because small patches either come with a
 description of the bug they fix and an expert in that area will ACK the
 patch, they touch an area that few people use and so the submitter is
 probably the best qualified person to provide a patch and they'll only
 hurt themselves if they haven't tested it, or, via the history of your
 submissions to the kernel, you are known to not submit bad code, so
 there's an expectation of quality.

 Furthermore, an incredible amount of testing occurs in the major public
 trees (Linus/-mm) between releases, so most absolutely major bugs are
 spotted fairly quickly, and if the problem is systemic in a change,
 that change can be reverted until the code improves.

 On the topic of checking into private branches - it's not so much a
 matter of "the parent never sees the changes" as "the parent doesn't
 see them right now".

 FWIW, at my place of employment, we switched from CVS to BitKeeper last
 summer, and it is significantly more pleasant to work with, in all
 aspects.

 Currently our entire development staff is working from home.  This
 still works well, as we can all check in locally, and submit changes to
 the master repository when changes are ready.  Between having a partner
 company in Japan working on our code, and our development staff working
 from home offices, we would have a horrific time getting any
 centralized SCM product to perform well.  With purely local
 repositories, local branching, and submissions via email or ssh, the
 process still works well and is *fast*.  CVS over slow network links is
 certainly not *fast*, and I'd be very surprised if Perforce is
 significantly better in that regard.

 I'll just say this, in closing - working with a decentralized SCM tool
 changes the way you work.  There is a Linux Kernel developer that I am
 aware of that keeps 27 or so separate branches on his machine, so he
 can keep all the logically unrelated changes separate from each other.
 He builds kernels off an additional branch that merges all the others
 together, and submits changes to Linus via 2 or 3 "rollup" trees he
 maintains.

 You just don't work like that in a centralized SCM, because branching
 isn't painless, in the same way.

-- 

Ryan Anderson
  sometimes Pug Majere

* Re: A shortcoming of the git repo format
  2005-04-28  0:57           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2005-04-28  3:37             ` Ryan Anderson
@ 2005-04-28  8:31             ` Morgan Schweers
  2005-04-28 15:08             ` Barry Silverman
  4 siblings, 0 replies; 29+ messages in thread
From: Morgan Schweers @ 2005-04-28  8:31 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

Greetings,

This is off topic, but this is a great paragraph, and an incredibly
concise and valuable lesson for pre-architect software developers.

On 4/27/05, Linus Torvalds <torvalds@osdl.org> wrote:

[...deletia...]

>    Doing development is a lot about communication. Writing code in many
>    ways is secondary - it's much more important to try to make sure that
>    everybody knows what the goals are, because the _real_ pain in
>    development ends up being not the coding, but the much more fundamental
>    disagreements that happen when people really have totally different
>    expectations of what the end result is going to be.

[...deletia...]

>                         Linus

--  Morgan Schweers

* Re: A shortcoming of the git repo format
  2005-04-27 18:03   ` H. Peter Anvin
  2005-04-27 18:32     ` Dave Jones
  2005-04-27 19:11     ` Linus Torvalds
@ 2005-04-28 13:39     ` David Woodhouse
  2 siblings, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2005-04-28 13:39 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, Git Mailing List

On Wed, 2005-04-27 at 11:03 -0700, H. Peter Anvin wrote:
> > To find the email address, look for the first '<'. To find the date, look 
> > for the first '>'. Those characters are not allowed in the name or the 
> > email, so they act as well-defined delimiters.
> > 
> 
> That's true for email addresses,

Not in general. You can have just about any character, including @, <
and >, in either a display-name or a local-part.

For git we actually _remove_ any instances of '<' and '>' from both
'AUTHOR_NAME' and 'AUTHOR_EMAIL', so what you say becomes true.

I still say these shouldn't be considered email addresses, any more than
the 'user@host.domain' you see when you connect to an IRC server is
considered an IP address.
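
To make the delimiter rule concrete, here is a small sketch. The function
names and the "name <email> date" layout are assumptions for illustration,
and the date string is an arbitrary placeholder; the only parts taken from
the discussion are that '<' and '>' are stripped from the name and email on
the way in, and that a reader then splits on the first '<' and first '>'.

```python
from typing import Tuple


def sanitize(field: str) -> str:
    """Drop the two delimiter characters from a name or email field."""
    return field.replace("<", "").replace(">", "")


def format_ident(name: str, email: str, date: str) -> str:
    """Build an identity string; sanitizing makes the delimiters unambiguous."""
    return f"{sanitize(name)} <{sanitize(email)}> {date}"


def parse_ident(line: str) -> Tuple[str, str, str]:
    """Split on the first '<' (start of email) and first '>' (start of date)."""
    lt = line.index("<")
    gt = line.index(">")
    name = line[:lt].strip()
    email = line[lt + 1:gt]
    date = line[gt + 1:].strip()
    return name, email, date


# A display name and local part that would be legal in mail but would
# confuse a naive parser if stored unmodified.
ident = format_ident("A <B> C", "a<b>@example.org", "1114609380 -0700")
print(ident)
print(parse_ident(ident))
```

The stored email is no longer the address that was typed in, which is the
sense in which these fields are identifiers rather than real addresses.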

-- 
dwmw2

* RE: A shortcoming of the git repo format
  2005-04-28  0:57           ` Linus Torvalds
                               ` (3 preceding siblings ...)
  2005-04-28  8:31             ` Morgan Schweers
@ 2005-04-28 15:08             ` Barry Silverman
  4 siblings, 0 replies; 29+ messages in thread
From: Barry Silverman @ 2005-04-28 15:08 UTC (permalink / raw)
  To: Linus Torvalds, Tom Lord; +Cc: hpa, git

>>In contrast, with git, I'm totally uninterested in anything that doesn't
>>make my kernel work go faster or more smoothly, and does so _today_. Which
>>makes me a cantankerous old bastard, and bite the heads off anybody who
>>isn't focused on that one thing.

Focus is the totally operative word here!

If you really want to feel good about the world, re-read the initial set of
git postings that Linus made on April 7th:
http://kerneltrap.org/node/4982

Contrast the picture today with the fact that three weeks ago:
April 7:
1) the kernel workflow was at a standstill
2) git was just a totally unproven concept in Linus' head, that could have
ended up as a band-aid while a REAL SCM (...sound of choking from the
wings...) was chosen
3) the performance issues in dealing with both the size of the kernel
project, and the velocity of the changes were completely up in the air

Today:
1) the kernel workflow has restarted, and has already made its first
milestone
2) git is solid in architecture, is maintained and updated by a proven set
of developers, and has been demonstrated to have all the performance
necessary going forward
3) the primary traffic on the mailing list is related to tactical issues -
not architecture, or strategy, or big-ticket item stuff - with the
occasional flame about "renames" ;-)

Are there any large strategic issues left to be resolved for git, or is it
just a matter of getting all the kernel developers over the learning curve,
and iterating the details of the workflow to make everyone maximally
productive?

How long do you think it will take for the kernel workflow to get back to
its height during the BK days?

The achievement of going from a complete standstill, to full velocity kernel
workflow production in a couple of months has got to be something everyone
involved should be intensely proud of.
Thanks, Linus, for being such a "cantankerous old bastard". I don't think it
could have happened if you were anything but....

Barry Silverman

