Git development
* RE: Mercurial 0.4b vs git patchbomb benchmark
From: Tom Lord @ 2005-04-29 20:26 UTC (permalink / raw)
  To: Andrew.Timberlake-Newell; +Cc: noel, seanlkml, git
In-Reply-To: <000e01c54cf7$f61ee4a0$9b11a8c0@allianceoneinc.com>



  > It looks to me like he did read carefully.

  > There were two different ideas:
  >   TL)  Passing tree & diff and trusting diff to create tree
  >   NM)  Passing tree and generating diff versus local tree for review

Well, I guess *you* didn't read carefully.  I also spoke about the
value of passing around triples: ancestry, diff, and tree.  The
question is about linking signatures to things that humans can
reasonably *intend* and be reasonably held accountable for, hence one
of the values of signed diffs.  (I cited other practical reasons to
value signed diffs and use them in specific ways, too.)

-t


* RE: Mercurial 0.4b vs git patchbomb benchmark
From: Andrew Timberlake-Newell @ 2005-04-29 20:13 UTC (permalink / raw)
  To: 'Tom Lord', noel; +Cc: seanlkml, git
In-Reply-To: <200504291954.MAA27561@emf.net>

Tom Lord responded to Noel Maddy: 
>   > Call me a naive git, but seems to me the "git way" is a little
>   > different. It's tree-based rather than diff-based, and doesn't involve
>   > passing diffs around, right?
> 
> Isn't that a significant part of what I said?  Go back and read more
> carefully, is my suggestion.

It looks to me like he did read carefully.

There were two different ideas:
   TL)  Passing tree & diff and trusting diff to create tree
   NM)  Passing tree and generating diff versus local tree for review

Maybe I'm reading them wrong, but that certainly looks like what each was
expressing and they don't look like the same thing.




* Re: Val Henson's critique of hash-based content storage systems
From: H. Peter Anvin @ 2005-04-29 20:14 UTC (permalink / raw)
  To: Rob Jellinghaus; +Cc: git
In-Reply-To: <loom.20050429T015434-928@post.gmane.org>

Rob Jellinghaus wrote:
> I assume most people here have read this, but just in case:
> 
> http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson.pdf
> 

I have to pull out the big flamethrower, especially against someone I 
consider a friend, but that paper is a classic example of how many 
people don't understand probability.

The *only* valid criticism in it is that we may not know enough about 
the future validity of cryptographic hash functions; however, she 
barely analyzes the failure scenarios applicable to those kinds of 
failures at all.

In the end, the whole paper centers around "this makes me feel nervous", 
without really justifying it in any reasonable way.

It is just one of many papers on cryptanalysis written by someone with 
no real background in the field.  It really saddens me to see someone 
like Val fall into that particular trap.

	-hpa


* Re: Val Henson's critique of hash-based content storage systems
From: C. Scott Ananian @ 2005-04-29 20:17 UTC (permalink / raw)
  To: Tom Lord; +Cc: git, robj
In-Reply-To: <200504291952.MAA27541@emf.net>

On Fri, 29 Apr 2005, Tom Lord wrote:

> I would expect someone to have on hand a small number of blobs that are
> different but have the same hash and, eventually, to drop said files
> into a blob-based infrastructure to wreak havoc.

This is just ridiculous.  The number of known collisions in SHA1 is 
*exactly zero* at this point in time --- not guaranteed to stay that way, 
of course, but generating collisions is likely to remain relatively 
expensive for some time.  The collisions are highly structured; they are 
not just arbitrary blobs.  After doing 2^69 work or so to generate a 
real honest-to-goodness SHA-1 collision, do you think an attacker would 
"DROP THEM IN A REPOSITORY TO CREATE HAVOC"?  You'd have to break into 
the repository, etc, and then you'd find that *NOTHING REFERENCED THEM* 
and so *ABSOLUTELY NOTHING WOULD HAPPEN*.

It's far more likely that SHA1 collisions will be used to generate forged 
X509 certificates, for a number of highly technical reasons.

Git's highly constrained and derided 'brittle' file formats also serve
to protect against the collision attacks against SHA-1 which are beginning 
to look possible.

> So: a way to locally mark a given checksum as "controversial" seems
> prudent, to me (hence, support for such in my blob-db code/spec).

Arguably that's what *upgrades* to the spec might be for -- git has a 
solid philosophy of not creating 'features' unless it is sure that they 
are needed/will be used, and I think this is always the wise route in 
software development.  Of much specification comes no code.

And, if you actually create a 'flexible' blob-db spec with 'room for 
expansion' -- congratulations, you've just made yourself more vulnerable 
to collision attacks.
  --scott

terrorist MI5 SKILLET hack AMLASH security KMPLEBE KUFIRE SCRANTON 
D5 SLBM LINCOLN KUDESK SMOTH Kojarena Moscow HTAUTOMAT WSBURNT Chechnya
                          ( http://cscott.net/ )


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Andrea Arcangeli @ 2005-04-29 20:30 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Linus Torvalds, linux-kernel, git
In-Reply-To: <20050429060157.GS21897@waste.org>

On Thu, Apr 28, 2005 at 11:01:57PM -0700, Matt Mackall wrote:
> change nodes so you've got to potentially traverse all the commits to
> reconstruct a file's history. That's gonna be O(top-level changes)
> seeks. This introduces a number of problems:
> 
> - no way to easily find previous revisions of a file
>   (being able to see when a particular change was introduced is a
>   pretty critical feature)
> - no way to do bandwidth-efficient delta transfer
> - no way to do efficient delta storage
> - no way to do merges based on the file's history[1]

And IMHO there is also no way to implement an efficient on-the-fly git
network protocol if tons of clients connect at the same time; it would be
DoS-able etc... At the very least such a system would require a huge
amount of RAM. So I see the only efficient way to design a network
protocol for git as not to use git, but to import the data into mercurial
and to implement the network protocol on top of mercurial.

The one downside is that git is sort of rock solid in the way it stores
data on disk, it makes rsync usage trivial too, the git fsck is reliable
and you can just sign the hash of the root of the tree and you sign
everything including file contents. And of course the checkin is
absolutely trivial and fast too.

With a more efficient diff-based storage like mercurial we'd be losing
those fsck properties etc.. but those reliability properties aren't worth
the network and disk space they take IMHO, and the checkin time
shouldn't be substantially different (still running in O(1) when
appending at the head). And we could always store the hash of the
changeset, to give it some basic self-checking.

I place extreme value on how efficiently an SCM can represent the
whole tree for both network downloads and backups too. Being able to
store the whole history of 2.5 in < 100M is a very valuable feature
IMHO, much more valuable than to be able to sign the root.

Also don't get me wrong, I'm _very_ happy about git too, but I just
happen to prefer mercurial storage (I would never use git for anything
but the kernel, just like I wasn't using arch for similar reasons).


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Matt Mackall @ 2005-04-29 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Sean, linux-kernel, git
In-Reply-To: <Pine.LNX.4.58.0504291248210.18901@ppc970.osdl.org>

On Fri, Apr 29, 2005 at 12:50:55PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 29 Apr 2005, Matt Mackall wrote:
> > 
> > Here's an excerpt from http://selenic.com/mercurial/notes.txt on how
> > the back-end works.
> 
> Any notes on how you maintain repository-level information?
> 
> For example, the expense in BK wasn't the single-file history, it was the
> _repository_ history, ie the "ChangeSet" file. Which grows quite slowly,
> but because it _always_ grows, it ends up being quite big and expensive to
> parse after three years.
> 
> Ie do you have the git kind of "independent trees/commits", or do you 
> create a revision history of those too?

The changeset log (and everything else) has an external index. The
index is basically an array of (base, offset, length, parent1-hash,
parent2-hash, my-hash). This has everything you need to reconstruct a
given file revision with one seek/read into the data stream itself,
and also everything you need for doing graph merging.

This is small enough (68 bytes, currently) that the index for a
million changesets can be read into memory in a couple seconds or so,
even in Python. It can also be mmapped and random accessed since the
index entries are fixed-sized. (And it's already stored big-endian.)

So you never have to read all the data. You also never need more than
a few indices in memory at once. And you never have to rewrite the
data (it's all append-only), except to do a bulk copy when you break a
hardlink.
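
A minimal sketch of the fixed-size index entries described above. The
message does not give the field widths, so the layout here is an
assumption: three 32-bit big-endian integers plus three 20-byte hashes,
which comes to 72 bytes per entry rather than the 68 quoted. The names
ENTRY, pack_entry, read_entry and read_revision_data are made up for
illustration and are not Mercurial's API.

```python
import struct

# Assumed layout (not taken from Mercurial): base, offset, length as
# 32-bit big-endian unsigned ints, then parent1, parent2 and own hash.
ENTRY = struct.Struct(">III20s20s20s")  # 72 bytes per entry in this sketch


def pack_entry(base, offset, length, p1, p2, node):
    """Serialize one fixed-size index entry."""
    return ENTRY.pack(base, offset, length, p1, p2, node)


def read_entry(index_bytes, rev):
    """Random-access entry number `rev` without parsing earlier entries."""
    return ENTRY.unpack_from(index_bytes, rev * ENTRY.size)


def read_revision_data(data_bytes, index_bytes, rev):
    """Return the span of the data stream that entry `rev` points at."""
    _base, offset, length, _p1, _p2, _node = read_entry(index_bytes, rev)
    return data_bytes[offset:offset + length]
```

Because every entry has the same size, entry N starts at byte
N * ENTRY.size, so the same read_entry call should also work on an
mmap object instead of a bytes string.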

-- 
Mathematics is the supreme nostalgia of our time.


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Noel Maddy @ 2005-04-29 20:21 UTC (permalink / raw)
  To: Tom Lord; +Cc: noel, seanlkml, git
In-Reply-To: <200504291954.MAA27561@emf.net>

On Fri, Apr 29, 2005 at 12:54:19PM -0700, Tom Lord wrote:
> 
> 
>   > Call me a naive git, but seems to me the "git way" is a little
>   > different. It's tree-based rather than diff-based, and doesn't involve
>   > passing diffs around, right?
> 
> Isn't that a significant part of what I said?  Go back and read more
> carefully, is my suggestion.

I'm trying to understand you. Please bear with me, and point out what
I'm missing.

Your example had Joe reviewing a signed diff, and then applying changes
from a tree that "supposedly" had the diff applied correctly, but may
have been corrupted. If the tree was not an accurate representation of
applying the diff, then the changes Joe applied to his tree will be
different than those that he reviewed.

My example had Joe downloading a remote signed tree, reviewing the changes
locally between his own trusted tree and the remote tree, and then
applying them locally. Since the diffs are generated locally between the
two trees, Joe is always reviewing the exact changes that will be
applied to his tree.

Doesn't this deal with the logical hole that you were pointing out in
your example? Or am I seeing a different "logical hole" than you are?


-- 
A man who fears nothing is a man who loves nothing.  And if you love
nothing, what joy is there in your life?
					 -- King Arthur, "First Knight"
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Noel Maddy <noel@zhtwn.com>


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Andrea Arcangeli @ 2005-04-29 20:19 UTC (permalink / raw)
  To: Sean; +Cc: Matt Mackall, Linus Torvalds, linux-kernel, git
In-Reply-To: <3817.10.10.10.24.1114756831.squirrel@linux1>

On Fri, Apr 29, 2005 at 02:40:31AM -0400, Sean wrote:
> There isn't anything preventing optimized transfer protocols for git. 

Such a system might fall apart under load; converting on the fly from
git to a network-optimized format sounds like quite an expensive
operation, even ignoring the initial decompression of the payload. If
anything, it should be pre-converted to mercurial, so you check out from
mercurial and you apply to local git.


* Re: More problems...
From: Thomas Glanzmann @ 2005-04-29 20:03 UTC (permalink / raw)
  To: git
In-Reply-To: <20050429195055.GE1233@mythryan2.michonline.com>

Hello,

> Why not just use "rsync" for both remote and local synchronization, and
> provide a "relink" command to scan two .git/objects/ repositories and
> hardlink matching files together?

That came to my mind, too. And it is actually the only thing that makes
sense. - In matters of KISS. :-)

	Thomas


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Tom Lord @ 2005-04-29 19:54 UTC (permalink / raw)
  To: noel; +Cc: seanlkml, git
In-Reply-To: <20050429194753.GA14222@uglybox.localnet>



  > Call me a naive git, but seems to me the "git way" is a little
  > different. It's tree-based rather than diff-based, and doesn't involve
  > passing diffs around, right?

Isn't that a significant part of what I said?  Go back and read more
carefully, is my suggestion.

  > Or am I missing something?

Very much so.


-t





* Re: Val Henson's critique of hash-based content storage systems
From: Tom Lord @ 2005-04-29 19:52 UTC (permalink / raw)
  To: git; +Cc: robj
In-Reply-To: <Pine.LNX.4.58.0504291221250.18901@ppc970.osdl.org>



I wouldn't expect outright successful attacks like forged replacements
for arbitrary files.

I would expect someone to have on hand a small number of blobs that are
different but have the same hash and, eventually, to drop said files
into a blob-based infrastructure to wreak havoc.

So: a way to locally mark a given checksum as "controversial" seems 
prudent, to me (hence, support for such in my blob-db code/spec).

-t


* Re: More problems...
From: Ryan Anderson @ 2005-04-29 19:50 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Russell King, git
In-Reply-To: <20050429182708.GB14202@pasky.ji.cz>

On Fri, Apr 29, 2005 at 08:27:08PM +0200, Petr Baudis wrote:
> Dear diary, on Fri, Apr 29, 2005 at 06:01:27PM CEST, I got a letter
> where Russell King <rmk@arm.linux.org.uk> told me that...
> > rmk@dyn-67:[linux-2.6-rmk]:<1049> cg-update origin
> > `../linux-2.6/.git/objects/00/78aeb85737197a84af1eeb0353dbef74427901' -> `.git/objects/00/78aeb85737197a84af1eeb0353dbef74427901'
> > cp: cannot create link `.git/objects/00/78aeb85737197a84af1eeb0353dbef74427901': File exists
> > 
> > By that time, the object files in the reference tree appear to have
> > a newer timestamp than the corresponding ones in my local tree, and
> > cp -lua fails.
> 
> I'm away now, unfortunately, and no immediate idea comes to mind on
> how to fix it. Ideas welcome - I need to hardlink missing entries from
> one tree to another; it would be enough to be able to just tell cp to
> ignore already present files.
> 
> Could you please try to give cp the -f flag?

Why not just use "rsync" for both remote and local synchronization, and
provide a "relink" command to scan two .git/objects/ repositories and
hardlink matching files together?

With the SHA1 hash, you can even have a --unsafe option that just
compares the hash names and does a link based purely off of that and the
stat(2) results of both files.  (I'd expect that a ... safer variant
would extract both files and compare them, but the --unsafe should be
sufficient, in practice, I would think.)
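
A rough sketch of what such a "relink" pass could look like. Nothing
like this exists in the thread; the function name relink, the unsafe
flag and the temporary-file suffix are all hypothetical. It assumes
both object directories live on the same filesystem (hardlinks cannot
cross filesystems), and the "safe" mode here compares the stored files
byte for byte rather than extracting them first.

```python
import filecmp
import os


def relink(src_objects, dst_objects, unsafe=False):
    """Hardlink files in dst_objects to same-named files in src_objects.

    Returns the number of files relinked.
    """
    relinked = 0
    for dirpath, _dirs, files in os.walk(src_objects):
        rel = os.path.relpath(dirpath, src_objects)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_objects, rel, name)
            if not os.path.isfile(dst):
                continue  # only objects present in both repositories
            s, d = os.stat(src), os.stat(dst)
            if (s.st_dev, s.st_ino) == (d.st_dev, d.st_ino):
                continue  # already the same file
            if s.st_size != d.st_size:
                continue  # same name, different size: leave it alone
            if not unsafe and not filecmp.cmp(src, dst, shallow=False):
                continue  # contents differ despite the matching name
            tmp = dst + ".relink-tmp"  # hypothetical temporary name
            os.link(src, tmp)          # create the new link next to dst
            os.replace(tmp, dst)       # then swap it into place
            relinked += 1
    return relinked
```

Linking to a temporary name and renaming it over the target avoids the
"File exists" failure from cp -l quoted above, and never leaves the
destination without the object.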

-- 

Ryan Anderson
  sometimes Pug Majere


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Noel Maddy @ 2005-04-29 19:47 UTC (permalink / raw)
  To: Tom Lord; +Cc: seanlkml, git
In-Reply-To: <200504291928.MAA27145@emf.net>

On Fri, Apr 29, 2005 at 12:28:41PM -0700, Tom Lord wrote:
> 
> Think of it this way:
> 
>   (a) Joe, the mainline maintainer, gets a trusted message containing
>       a diff.
> 
>   (b) Joe reads the diff, it makes great sense, he wants to merge.
> 
>   (c) Joe downloads a tree.  Supposedly that tree is the result of
>       applying this diff.   The tree, not the diff, is used for
>       merging.

Call me a naive git, but seems to me the "git way" is a little
different. It's tree-based rather than diff-based, and doesn't involve
passing diffs around, right?

This is the process I'd expect:

    (a)' Joe is notified of an update made to an external git tree

    (b)' Joe pulls tree from the external git tree (signed by external
         developer)

    (c)' Joe reviews the (git-generated) diffs from his current
         (trusted) tree to the new (signed) tree. If they pass
         review, he merges the new versions into his tree, commits,
         and signs his tree.

The logical hole that you point out is assuming that the diff is passed
separately from the tree rather than being directly generated from the
current maintainer tree and the signed remote tree.

If the diff is generated from the two signed trees, I don't see a hole.

Or am I missing something?


-- 
The world's largest Internet database in the country.
					      -- Trading Times radio ad
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Noel Maddy <noel@zhtwn.com>


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Linus Torvalds @ 2005-04-29 19:50 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Sean, linux-kernel, git
In-Reply-To: <20050429191207.GX21897@waste.org>



On Fri, 29 Apr 2005, Matt Mackall wrote:
> 
> Here's an excerpt from http://selenic.com/mercurial/notes.txt on how
> the back-end works.

Any notes on how you maintain repository-level information?

For example, the expense in BK wasn't the single-file history, it was the
_repository_ history, ie the "ChangeSet" file. Which grows quite slowly,
but because it _always_ grows, it ends up being quite big and expensive to
parse after three years.

Ie do you have the git kind of "independent trees/commits", or do you 
create a revision history of those too?

		Linus


* Re: Val Henson's critique of hash-based content storage systems
From: Linus Torvalds @ 2005-04-29 19:45 UTC (permalink / raw)
  To: Rob Jellinghaus; +Cc: Git Mailing List
In-Reply-To: <loom.20050429T015434-928@post.gmane.org>



On Fri, 29 Apr 2005, Rob Jellinghaus wrote:
> 
> If an attacker used an SHA-1 attack to create a blob that matched the hash of
> some well-known git object (say, the tree for Linux 2.7-rc1), and spammed public
> git repositories with it ahead of Linus's release, what would be the potential
> for mischief, and what would the recovery process be?

I really think people should not consider the sha1 the "security". 

The real security is in distribution. 

With the distributed setup, developers don't use public trees. They use 
their own _private_ trees, and the public ones are just staging areas for 
synchronization.

So in order to actually replace a blob, let's say that you can create an 
object with the right sha1 trivially. What then?

You now have to break into _every_ repository that has that object, and 
replace it silently. Because if you don't, the good one will still be 
around.

That's just not going to happen.

So let's say that you break into kernel.org, and replace one of the blobs
in my repository.  What happens?

First off, I'll never notice, because it's not actually my repository, so 
I won't even have the corrupt copy. So what _will_ happen?

What will happen is that people who download new stuff from kernel.org
will get the "evil" object. Not all of them, though - just the ones that
hadn't downloaded the proper one. So first off, in order to be really
_effective_, the attack really has to not just replace an object, it
really wants to replace a pretty _recent_ object, because replacing an old
one just doesn't do a whole lot.

So they get the evil object. What happens? NOTHING. Absolutely nada.  
Either they use that evil object, or they don't. Not using it might be
because it's not even top-of-tree any more, and you really just replaced
some old version of a file. Or it might be because it's an object for a
driver that you don't have, so you'd never see it.

So let's ignore that case, and say that the attacker has successfully
replaced an object that is (a) recent enough to matter and (b) actually
used.

What now? You'll get a compile error. Big deal. People will notice that
something is wrong, complain about it, we'll think they have disk
corruption for a while, and then we'll figure it out, and replace the
object. Done.

Why? Because even if you successfully find an object with the same SHA1, 
the likelihood that that object actually makes _sense_ in that context is 
pretty damn near zero. 

Think about it. We've had this before: people whose files got flipped
around due to driver bugs or just hardware problems, and even just a
single bit error most of the time results in real honest-to-God compiler
errors.

And because we found the bad one, and we have the good one somewhere else, 
who cares? The security industry will be all atwitter about somebody 
finding a matching SHA1 object, and it will be _huge_ news, but did it 
actually hurt the kernel integrity? No.

So let's say that somebody breaks in to _my_ personal machine. I'm behind 
a few firewalls and a NAT setup, and I don't accept even incoming ssh, but 
hey, they could crowbar my door and break in that way. 

ONLY A TOTAL IDIOT would then replace an object in my database with
something else. That would be _stupid_. He'd just guarantee that all the 
same problems as above were true, except now we'd have to find the 
good object in some _other_ database than mine.

So if you actually wanted to corrupt the kernel tree, you'd do it by just
fooling me into accepting a crap patch. Hey, it happens all the time.  
People send me buggy stuff. We figure out the bugs. What's so different
here?

In other words, the security isn't in the hash. The hash is an added level 
to make it much harder to fool, but it's not "the security". 

And if we are really really unlucky, and a meteorite hits us, and we get
an object collision that has the same sha1 for _real_, and actually makes
sense, then hey, shit happens. We can fix it by "poisoning" that sha1, and
modifying both files trivially so that they don't match any more, and then
we add a list of "illegal" sha1's to fsck, and we'll make that list be ten
entries long, just in case the meteorite strikes ten times, but the fact
is it's simply not going to happen.
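
The "list of illegal sha1's" idea can be sketched in a few lines. This
is only an illustration under assumptions: the POISONED set and the
fsck_object helper are hypothetical names, and the object name is taken
here as the SHA-1 of the raw bytes, which may not match how git actually
names its objects.

```python
import hashlib

# Hypothetical list of SHA-1 names declared "illegal" after a collision.
POISONED = set()


def object_name(data):
    """Name an object by the SHA-1 of its bytes (hashing details assumed)."""
    return hashlib.sha1(data).hexdigest()


def fsck_object(data, poisoned=POISONED):
    """Return (name, ok); ok is False when the name is on the poison list."""
    name = object_name(data)
    return name, name not in poisoned
```

A checker along these lines would refuse any object whose name appears
on the list, regardless of which of the colliding contents it holds.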

(It's going to be very very obvious, very very quickly, btw: the person
who actually created the object that happened to collide will not write
the new SHA1 out, because he already "had" the same object, so next time
somebody updates the tree, the file that matches will now have the "old
contents" from some other colliding file, and the new code simply won't do
what it was supposed to. So don't worry about it - collisions, even if
they happen, will be noticed as quite obvious _bugs_ in the end result,
the same way we find the common source of bugs - bad programming).

In other words: don't depend on hashes if you only have one copy of the
data. But if you have backups of old versions (which essentially the
distribution guarantees as long as we have "stupid" mirrors that just look
at the filename) having a hash collision doesn't mean that you lost any
real data.

So anybody who thinks that a hash collision is a fundamental problem just
hasn't thought things through. It's an _annoyance_, nothing more. But we
have tons of much more pressing annoyances, and pretty much all of them
are a hell of a lot more likely than a collision, whether intentional or
unintentional.

			Linus


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Tom Lord @ 2005-04-29 19:28 UTC (permalink / raw)
  To: seanlkml; +Cc: git
In-Reply-To: <2944.10.10.10.24.1114802002.squirrel@linux1>


Think of it this way:

  (a) Joe, the mainline maintainer, gets a trusted message containing
      a diff.

  (b) Joe reads the diff, it makes great sense, he wants to merge.

  (c) Joe downloads a tree.  Supposedly that tree is the result of
      applying this diff.   The tree, not the diff, is used for
      merging.

You can see the logical hole there... now the practical one:


   (d) Joe is repeating (a..c) at an unfathomably high rate.
       At a low rate, he could be double-checking enough that
       the diff-vs-tree problem isn't that serious.  But
       at the rate he operates, exploits appear all along the
       patch-flow pipeline because so much stuff goes unchecked.

       Joe may scan the changes he's merged before committing but,
       if his rate is high, that scan *must*, out of biological and
       physical necessity, be shallow.   Exploits can occur on the
       submitter machine, in the communication channel, and on Joe's 
       machine.   Social exploits can occur because of the separation
       between a submitter saying "this is what I'm doing" vs. the reality
       of what the submitter is doing.

-t



* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Tom Lord @ 2005-04-29 19:22 UTC (permalink / raw)
  To: seanlkml; +Cc: git
In-Reply-To: <2944.10.10.10.24.1114802002.squirrel@linux1>



  > Ahh, you don't believe in the development model that has produced Linux! 
  > Personally I do believe in it, so much so that I question the value of
  > signatures at the changeset level.  To me it doesn't matter where the code
  > came from just so long as it works.

To me, it doesn't matter where the code came from.  It's necessary
but not sufficient that it seems to work.  It's necessary that it's
well understood and has undergone only well understood changes.

On that last necessity, a *lot* of open source projects are quite
pathetic.  `git'-style use of signatures raises the bar, slightly, for
where exploits can happen.  They also lower the bar for repudiation of
bogus changes.

   > Signatures are just a way to
   > increase the comfort level that the code has passed through a number of
   > people who have shown themselves to be relatively good auditors.  That's
   > why I trust the code from my distribution of choice.  Everything is out in
   > the open anyway so it's much harder for a con man to do his thing.

Only if the audience is proactively skeptical.

-t



* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Sean @ 2005-04-29 19:13 UTC (permalink / raw)
  To: Tom Lord; +Cc: torvalds, mpm, linux-kernel, git
In-Reply-To: <200504291854.LAA26550@emf.net>

On Fri, April 29, 2005 2:54 pm, Tom Lord said:

> The process should not rely on the security of every developer's
> machine.  The process should not rely on simply trusting quality
> contributors by reputation (e.g., most cons begin by establishing
> trust and continue by relying inappropriately on
> trust-without-verification).  This relates to why Linus'
> self-advertised process should be raising yellow and red cards all
> over the place: either he is wasting a huge amount of his own time and
> should be largely replaced by an automated patch queue manager, or he
> is being trusted to do more than is humanly possible.
>

Ahh, you don't believe in the development model that has produced Linux! 
Personally I do believe in it, so much so that I question the value of
signatures at the changeset level.  To me it doesn't matter where the code
came from just so long as it works.   Signatures are just a way to
increase the comfort level that the code has passed through a number of
people who have shown themselves to be relatively good auditors.  That's
why I trust the code from my distribution of choice.  Everything is out in
the open anyway so it's much harder for a con man to do his thing.

Sean





* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Matt Mackall @ 2005-04-29 19:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Sean, linux-kernel, git
In-Reply-To: <Pine.LNX.4.58.0504291006450.18901@ppc970.osdl.org>

On Fri, Apr 29, 2005 at 10:09:38AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 29 Apr 2005, Matt Mackall wrote:
> > 
> > That's because no one paid attention until I posted performance
> > numbers comparing it to git! Mercurial's goals are:
> > 
> > - to scale to the kernel development process
> > - to do clone/pull style development
> > - to be efficient in CPU, memory, bandwidth, and disk space
> >   for all the common SCM operations
> > - to have strong repo integrity
> 
> Ok, sounds good. Have you looked at how it scales over time, ie what 
> happens with files that have a lot of deltas?

I've done things like 10000 commits of a pair of revisions to printk.c
and it maintains consistently high speed and compression throughout that
range. I've also done things like commit all 500 revisions of
linux/Makefile from bkcvs. This took a couple seconds and resulted in
an 88k repo file (bkcvs takes 250k).

I haven't tried the whole kernel history corpus yet, but I've
committed all the 2.6 releases without any difficulties popping up and
I've had handling >1M total file revisions in my head since I sat down
to work on it. I'll maybe take a stab at a full history import next
week, if vacation doesn't interfere too much.

One downside Mercurial has is that long-lived repos can get fragmented on
disk. Things get defragmented to some extent as you go by doing COW on
files that are shared between local branch clones. Also a complete
defrag is a simple cp -a or equivalent, so I think this is not a big
deal.

Here's an excerpt from http://selenic.com/mercurial/notes.txt on how
the back-end works.

---

Revlogs:

The fundamental storage type in Mercurial is a "revlog". A revlog is
the set of all revisions to a file. Each revision is either stored
compressed in its entirety or as a compressed binary delta against the
previous version. The decision of when to store a full version is made
based on how much data would be needed to reconstruct the file. This
lets us ensure that we never need to read huge amounts of data to
reconstruct a file, regardless of how many revisions of it we store.

In fact, we should always be able to do it with a single read,
provided we know when and where to read. This is where the index comes
in. Each revlog has an index containing a special hash (nodeid) of the
text, hashes for its parents, and where and how much of the revlog
data we need to read to reconstruct it. Thus, with one read of the
index and one read of the data, we can reconstruct any version in time
proportional to the file size.

Similarly, revlogs and their indices are append-only. This means that
adding a new version is also O(1) seeks.

Generally revlogs are used to represent revisions of files, but they
also are used to represent manifests and changesets.
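
The storage rule in the notes above can be illustrated with a toy
model. This is a sketch under assumptions, not Mercurial's code: the
delta format (shared prefix length, shared suffix length, new middle)
and the max_chain_factor threshold are invented here, and ToyRevlog,
make_delta and apply_delta are hypothetical names. It keeps the two
properties the notes describe: data is append-only, and any revision is
rebuilt from one contiguous span found through the index.

```python
import struct
import zlib


def make_delta(old, new):
    """Encode `new` as shared-prefix length, shared-suffix length, middle."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return struct.pack(">II", prefix, suffix) + new[prefix:len(new) - suffix]


def apply_delta(old, delta):
    """Rebuild the new text from the old text and a delta."""
    prefix, suffix = struct.unpack_from(">II", delta)
    return old[:prefix] + delta[8:] + old[len(old) - suffix:]


class ToyRevlog:
    def __init__(self, max_chain_factor=2):
        self.data = bytearray()   # append-only data stream
        self.index = []           # (base_rev, offset, length) per revision
        self.max_chain_factor = max_chain_factor
        self._last_text = b""

    def add(self, text):
        rev = len(self.index)
        chunk, base = zlib.compress(text), rev  # default: full snapshot
        if rev > 0:
            prev_base, prev_off, prev_len = self.index[-1]
            delta = zlib.compress(make_delta(self._last_text, text))
            chain_start = self.index[prev_base][1]
            # Bytes that would have to be read to rebuild this revision.
            chain_bytes = prev_off + prev_len - chain_start + len(delta)
            if chain_bytes <= self.max_chain_factor * len(text):
                chunk, base = delta, prev_base
        self.index.append((base, len(self.data), len(chunk)))
        self.data += chunk
        self._last_text = text
        return rev

    def read(self, rev):
        base, offset, length = self.index[rev]
        start = self.index[base][1]
        span = bytes(self.data[start:offset + length])  # one contiguous read
        text = None
        for r in range(base, rev + 1):
            _b, off, ln = self.index[r]
            piece = zlib.decompress(span[off - start:off - start + ln])
            text = piece if r == base else apply_delta(text, piece)
        return text
```

Once a delta chain grows past the threshold relative to the file size,
the next revision is stored in full, which bounds how much data a read
has to touch no matter how many revisions are stored.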

-- 
Mathematics is the supreme nostalgia of our time.


* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Tom Lord @ 2005-04-29 18:54 UTC (permalink / raw)
  To: seanlkml; +Cc: torvalds, mpm, linux-kernel, git
In-Reply-To: <2712.10.10.10.24.1114799620.squirrel@linux1>


   From: "Sean" <seanlkml@sympatico.ca>

   On Fri, April 29, 2005 2:08 pm, Tom Lord said:

   > The confusion here is that you are talking about computational complexity
   > while I am talking about complexity measured in hours of labor.
   >
   > You are assuming that the programmer generating the signature blindly
   > trusts the tool to generate the signed document accurately.   I am
   > saying that it should be tractable for human beings to read the documents
   > they are going to sign.


   Developers obviously _do_ read the changes they submit to a project or
   they would lose their trusted status.  That has absolutely nothing to do
   with signing, it's the exact same way things work today, without sigs.

Nobody that I know is endorsing "the way things work today" as especially
robust.  Lots of people endorse it as successful in the marketplace and as
having not failed horribly yet -- but that's not the same thing.


   It's not "blind trust" to expect a script to reproducibly sign documents
   you've decided to submit to a project.

It *is* blind trust to assume without further guarantees that the diff
someone sends you (signed or not) describes a tree accurately unless
the tree in question is created by a local application of that diff.

In essence, `git' (today) wants *me* to trust that *you* have
correctly applied that diff -- evidently in order to speed things up.
It makes remote users "patch servers", for no good reason.

Triple signatures, signing the name of the ancestor, the diff, and
the resulting tree, are the most robust because I can apply the
diff to the ancestor and then *verify* that the result matches the
signed tree.   But systems should neither ask users to sign something
too large to read nor rely on signatures of things too large to read.
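
The apply-then-verify step can be sketched in shell.  This is only an
illustration under stated assumptions: tree_digest and verify_triple are
hypothetical helper names (not part of git), patch(1), mktemp(1) and
sha1sum(1) are assumed to be installed, and the digest is a stand-in for
whatever tree identifier was actually signed.  It ignores file modes,
and checking the signatures themselves is left out.

```sh
#!/bin/sh
# Hypothetical helper: print one digest covering every file path and
# file content under directory $1 (a stand-in for a signed tree ID).
tree_digest() {
    (
        cd "$1" || exit 1
        find . -type f | LC_ALL=C sort | while IFS= read -r f; do
            printf '%s\n' "$f"
            cat "$f"
        done | sha1sum | cut -d' ' -f1
    )
}

# Hypothetical helper: apply DIFF to a scratch copy of ANCESTOR_DIR and
# succeed only if the result hashes to EXPECTED_DIGEST.
# usage: verify_triple ANCESTOR_DIR DIFF EXPECTED_DIGEST
verify_triple() {
    work=$(mktemp -d) || return 2
    status=1
    # The ancestor itself is never modified; only the copy is patched.
    if cp -R "$1"/. "$work"/ && patch -s -p1 -d "$work" < "$2"; then
        [ "$(tree_digest "$work")" = "$3" ] && status=0
    fi
    rm -rf "$work"
    return $status
}
```

A receiver would call something like
`verify_triple ancestor/ change.diff "$signed_digest"` and refuse the
change when it returns non-zero, so the sender's tree is never trusted
without a local application of the diff.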


   The signature is not a QUALITY
   guarantee in and of itself.

Which has nothing to do with any of this except indirectly.

   See?  Signing something does not change the quality guarantee one way or
   the other.  It does not put any additional demands on the developer, so
   it's fine to have an automated script do it.  It's just a way to avoid
   impersonations.

The process should not rely on the security of every developer's
machine.  The process should not rely on simply trusting quality
contributors by reputation (e.g., most cons begin by establishing
trust and continue by relying inappropriately on
trust-without-verification).  This relates to why Linus'
self-advertised process should be raising yellow and red cards all
over the place: either he is wasting a huge amount of his own time and
should be largely replaced by an automated patch queue manager, or he
is being trusted to do more than is humanly possible.

-t

^ permalink raw reply

* Re: [PATCh] jit-trackdown
From: Junio C Hamano @ 2005-04-29 18:47 UTC (permalink / raw)
  To: David Greaves; +Cc: GIT Mailing Lists
In-Reply-To: <42725AB8.5090501@dgreaves.com>

>>>>> "DG" == David Greaves <david@dgreaves.com> writes:

DG> Should really be cg-trackdown

Thanks for your kind words and the patch.

     head="$1"
    +if [ "$head" = "HEAD" ]; then
    +  head=$(cat .git/HEAD)
    +elif [ -f ".git/refs/tags/$head" ]; then
    +  head=$(cat ".git/refs/tags/$head")
    +elif [ -f ".git/refs/heads/$head" ]; then
    +  head=$(cat ".git/refs/heads/$head")
    +fi
    +

I have been primarily looking at the plumbing side and not the
toilet side, and I still have not grokked cg-* yet.  That's why
I did not do the right thing with this .git/refs/* stuff.  If
this were to become part of the cg-* suite, I would recommend
just using $(commit-id) there, which should be the only one that
needs to know the .git/* structure convention.

Have toilet side gitters reached a consensus (or semi-consensus)
on how things under .git/ should be organized?  Is there a
summary somewhere, something along the following lines?

    In subdirectories under $GIT_PROJECT_TOP/.git, you have
    files that have some special meaning to the Cogito layer.
    Each of these files is 41 bytes long: a 40-character hex
    SHA1 followed by a terminating newline.  What is stored in
    each location is as follows:

    .git/HEAD           	Head commit object of the
                                current tree. 

    .git/refs/heads/$ext	Head commit object of the
                                external tree $ext.  [*Q1*]

    .git/refs/tags/$tag		Named Tag object. [*Q2*]

    *Q1* What is the syntax and semantics rule for $ext, like
         "$ext matches '^[-A-Za-z0-9_]+$' and is one of the
         entries in .git/remotes"?

    *Q2* What is the syntax and semantics rule for $tag, like
         "$tag matches '^[-A-Za-z0-9_]+$' and can be anything, not
         just a commit"?
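
If the layout summarized above is roughly right, the lookup could live
in a single helper so that other scripts never touch .git/* directly.
The following is only a sketch under that assumption: resolve_name is a
hypothetical name, the allowed character set is the one guessed at in
the questions above, and any name that is neither HEAD, a tag nor a
head is assumed to already be a raw object ID.

```sh
#!/bin/sh
# Hypothetical helper: resolve NAME to the 40-hex SHA1 it refers to.
# usage: resolve_name NAME [GIT_DIR]
resolve_name() {
    name=$1
    git_dir=${2:-.git}

    # Reject empty names and anything outside the assumed character set.
    case $name in
        ''|*[!A-Za-z0-9_-]*)
            echo "resolve_name: bad name '$name'" >&2
            return 1 ;;
    esac

    if [ "$name" = HEAD ]; then
        file=$git_dir/HEAD
    elif [ -f "$git_dir/refs/tags/$name" ]; then
        file=$git_dir/refs/tags/$name
    elif [ -f "$git_dir/refs/heads/$name" ]; then
        file=$git_dir/refs/heads/$name
    else
        # Assumption: anything else is already a raw object ID.
        printf '%s\n' "$name"
        return 0
    fi

    # Each file should hold 40 hex digits plus a terminating newline.
    IFS= read -r sha < "$file" || return 1
    case $sha in
        *[!0-9a-f]*)
            echo "resolve_name: $file is not a SHA1" >&2
            return 1 ;;
    esac
    [ "${#sha}" -eq 40 ] || {
        echo "resolve_name: $file is not a SHA1" >&2
        return 1
    }
    printf '%s\n' "$sha"
}
```

A caller would then write `head=$(resolve_name "$1") || exit 1`
instead of repeating the if/elif chain in every script.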


^ permalink raw reply

* Re: Mercurial 0.4b vs git patchbomb benchmark
From: Sean @ 2005-04-29 18:33 UTC (permalink / raw)
  To: Tom Lord; +Cc: torvalds, mpm, linux-kernel, git
In-Reply-To: <200504291808.LAA25870@emf.net>

On Fri, April 29, 2005 2:08 pm, Tom Lord said:

> The confusion here is that you are talking about computational complexity
> while I am talking about complexity measured in hours of labor.
>
> You are assuming that the programmer generating the signature blindly
> trusts the tool to generate the signed document accurately.   I am
> saying that it should be tractable for human beings to read the documents
> they are going to sign.


Developers obviously _do_ read the changes they submit to a project or
they would lose their trusted status.  That has absolutely nothing to do
with signing, it's the exact same way things work today, without sigs.

It's not "blind trust" to expect a script to reproducibly sign documents
you've decided to submit to a project.  The signature is not a QUALITY
guarantee in and of itself.  It doesn't mean you have any additional
responsibility to remove all bugs before submitting.  Conversely, not
signing something doesn't mean you can submit crap.

See?  Signing something does not change the quality guarantee one way or
the other.  It does not put any additional demands on the developer, so
it's fine to have an automated script do it.  It's just a way to avoid
impersonations.

Sean


^ permalink raw reply

* [PATCH] GIT: Honour SHA1_FILE_DIRECTORY env var in git-pull-script
From: Rene Scharfe @ 2005-04-29 18:31 UTC (permalink / raw)
  To: Linux Torvalds; +Cc: git

If you set SHA1_FILE_DIRECTORY to something other than .git/objects,
git-pull-script will store the fetched files in a location the rest of
the tools do not expect.

git-prune-script also ignores this setting, but I think this is good,
because pruning a shared tree to fit a single project means throwing
away a lot of useful data. :-)

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>

---
commit 6fef2965444a6509d11a79bd33842125034dcec0
tree 63e9cdf5ff724bf462d9dc408b9c951985d4cecf
parent db413479f1bb0dabfc613b2b0017ca74aeb5a919
author Rene Scharfe <rene.scharfe@lsrfire.ath.cx> 1114799335 +0200
committer Rene Scharfe <rene.scharfe@lsrfire.ath.cx> 1114799335 +0200

Index: git-pull-script
===================================================================
--- 1e2168c7d554a4fbd25a09bb591ae0f82dac6513/git-pull-script  (mode:100755 sha1:5111da98e68f4c3eb44499d20a210966dd445212)
+++ 63e9cdf5ff724bf462d9dc408b9c951985d4cecf/git-pull-script  (mode:100755 sha1:0198c4805db7c2b78cd4424634873b0a86ee4107)
@@ -9,7 +9,7 @@
 cp .git/HEAD .git/ORIG_HEAD
 
 echo "Getting object database"
-rsync -avz --ignore-existing $merge_repo/objects/. .git/objects/.
+rsync -avz --ignore-existing $merge_repo/objects/. ${SHA1_FILE_DIRECTORY:-.git/objects}/.
 
 echo "Getting remote head"
 rsync -L $merge_repo/HEAD .git/MERGE_HEAD || exit 1
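
For reference, the fallback used in the patch is the standard
${VAR:-default} shell expansion.  A minimal sketch of its behaviour,
with object_dir as a hypothetical helper name that is not part of the
patch:

```sh
#!/bin/sh
# Hypothetical helper: print the object directory the tools should use.
# ${VAR:-default} falls back when the variable is unset OR empty.
object_dir() {
    printf '%s\n' "${SHA1_FILE_DIRECTORY:-.git/objects}"
}
```

With such a helper the rsync target could be written as
"$(object_dir)/.", which also keeps a directory name containing spaces
intact.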

^ permalink raw reply

* Re: More problems...
From: Petr Baudis @ 2005-04-29 18:27 UTC (permalink / raw)
  To: Russell King; +Cc: git
In-Reply-To: <20050429170127.A30010@flint.arm.linux.org.uk>

Dear diary, on Fri, Apr 29, 2005 at 06:01:27PM CEST, I got a letter
where Russell King <rmk@arm.linux.org.uk> told me that...
> rmk@dyn-67:[linux-2.6-rmk]:<1049> cg-update origin
> `../linux-2.6/.git/objects/00/78aeb85737197a84af1eeb0353dbef74427901' -> `.git/objects/00/78aeb85737197a84af1eeb0353dbef74427901'
> cp: cannot create link `.git/objects/00/78aeb85737197a84af1eeb0353dbef74427901': File exists
> 
> By that time, the object files in the reference tree appear to have
> a newer timestamp than the corresponding ones in my local tree, and
> cp -lua fails.

I'm away right now, unfortunately, and no immediate fix comes to mind.
Ideas are welcome: I need to hardlink missing entries from one tree to
another; it would be enough to be able to tell cp to ignore files that
are already present.

Could you please try to give cp the -f flag?
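
One possible approach that avoids cp entirely is to walk the source
tree and link only what is missing.  This is an untested-in-Cogito
sketch, not what cg-update does: link_missing is a hypothetical name,
and both trees are assumed to live on the same filesystem so that
hardlinks are possible.

```sh
#!/bin/sh
# Hypothetical helper: hardlink every file under SRC into DST unless a
# file of the same relative name already exists there.
# usage: link_missing SRC DST
link_missing() {
    src=$1
    dst=$2
    (cd "$src" && find . -type f) | (
        while IFS= read -r rel; do
            rel=${rel#./}
            # Already present: leave the local copy untouched.
            [ -e "$dst/$rel" ] && continue
            mkdir -p "$dst/$(dirname "$rel")" &&
                ln "$src/$rel" "$dst/$rel" || exit 1
        done
    )
}
```

For the case above that would be something like
`link_missing ../linux-2.6/.git/objects .git/objects`; timestamps play
no role, since only the existence of the target is checked.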

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply

* Re: Odd decision of git-pasky-0.7 to do a merge
From: Petr Baudis @ 2005-04-29 18:21 UTC (permalink / raw)
  To: Russell King; +Cc: git
In-Reply-To: <20050429103059.A3260@flint.arm.linux.org.uk>

Dear diary, on Fri, Apr 29, 2005 at 11:31:00AM CEST, I got a letter
where Russell King <rmk@arm.linux.org.uk> told me that...
> On Fri, Apr 29, 2005 at 10:07:17AM +0100, Russell King wrote:
> > rmk@hera:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git(0)$ SHA1_FILE_DIRECTORY=./objects merge-base c60c390620e0abb60d4ae8c43583714bda27763f bdceb6a0162274934386f19f3ea5a9d44feb0b20
> > bdceb6a0162274934386f19f3ea5a9d44feb0b20
> > 
> > $ merge-base c60c390620e0abb60d4ae8c43583714bda27763f bdceb6a0162274934386f19f3ea5a9d44feb0b20
> > e8108c98dd6d65613fa0ec9d2300f89c48d554bf
> 
> This is the problem.  It seems that merge-base in git-pasky-0.7
> doesn't work correctly.

The merge-base in git-pasky-0.7 may still be the old one.  We replaced
it with a date-based one since the old one *was* giving bad results.

So, what about cogito-0.8?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply

