on when to checksum

git.vger.kernel.org archive mirror

* on when to checksum
@ 2005-04-20 22:25 Tom Lord
  2005-04-20 22:41 ` Linus Torvalds
  2005-04-21 16:53 ` Andrew Timberlake-Newell
  0 siblings, 2 replies; 8+ messages in thread
From: Tom Lord @ 2005-04-20 22:25 UTC
  To: git; +Cc: torvalds

Linus, 

I think you have made a mistake by moving the sha1 checksum from the
zipped form to the inflated form.  Here is why:

What you have set in motion with `git' is an ad-hoc p2p network for
sharing filesystem trees -- a global distributed filesystem.  I
believe your starter here has a good chance of taking off to be much,
much larger than just a tool for the kernel.

A subset of your work, blobs and blob databases, has much wider application
than just sharing trees:  Those parts of `git' can form a very solid 
foundation for many other applications as well.   To the extent `git'
succeeds in the context of the kernel, it will be invested in and
extended and generalized --- and the kernel project will benefit.
So don't ignore those wider applications even though they are not your
focus today: they will generate investment that feeds back to your project.

Your `git' is silent on transports and mirroring of blob databases --
tasks for scripting, sure -- but those elements won't be far behind.

Eventually, slinging around blobs as atomic elements
of payloads will become very common.

The blob handle (aka "address")/payload model of a blob db is very
clean and simple.   In a network of nodes speaking to one another
by exchanging blobs, I foresee a prominent need for intermediate
nodes that process blobs "blindly" and as quickly as possible.

Blob compression is mostly goofy if regarded just as a way to 
save on (diminishingly cheap) disk space but it is mostly 
sane if regarded as a way to cut the cost of network bandwidth
roughly in half.

Must intermediate nodes inflate the payloads that pass through them,
or that they cache, just to validate them?   That's not a desirable outcome
for many obvious reasons.
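
The difference for an intermediate node can be sketched roughly as follows. This is a simplified illustration, not git's actual object format: the function names and the `id_over_compressed` flag are hypothetical, and the blob id is assumed to be a bare SHA-1 over either the compressed or the inflated bytes.

```python
import hashlib
import zlib


def blob_id_over_compressed(wire_bytes: bytes) -> str:
    """Blob id defined as SHA-1 of the compressed bytes as stored or sent."""
    return hashlib.sha1(wire_bytes).hexdigest()


def blob_id_over_inflated(wire_bytes: bytes) -> str:
    """Blob id defined as SHA-1 of the content after inflating it."""
    return hashlib.sha1(zlib.decompress(wire_bytes)).hexdigest()


def relay_accepts(claimed_id: str, wire_bytes: bytes, id_over_compressed: bool) -> bool:
    """Hypothetical check an intermediate node runs before forwarding a blob."""
    if id_over_compressed:
        # "Blind" validation: hash the payload exactly as received.
        return blob_id_over_compressed(wire_bytes) == claimed_id
    # Otherwise the node has to inflate the payload just to verify it.
    return blob_id_over_inflated(wire_bytes) == claimed_id
```

Under the first scheme the relay never touches zlib; under the second, every validation pays for a decompression.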

There *are* concerns about checksumming zips: it is necessary to nail
down the zip process and make sure it is absolutely and permanently
deterministic for this application.   But *that* is the problem to 
solve, not one to avoid by moving what the checksum refers to.
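
The determinism concern can be seen in a small sketch, assuming Python's standard `zlib` and `hashlib` modules and a made-up sample payload. The same content compressed at two different levels yields different compressed bytes, so an id taken over the zipped form depends on the compressor settings, while an id taken over the inflated form does not.

```python
import hashlib
import zlib

# Made-up sample content, for illustration only.
content = b"int main(void) { return 0; }\n" * 200

# Same content, compressed with two different zlib levels.
fast = zlib.compress(content, 1)
best = zlib.compress(content, 9)

# Ids over the compressed form depend on how the compressor was run.
ids_over_compressed = {
    hashlib.sha1(fast).hexdigest(),
    hashlib.sha1(best).hexdigest(),
}

# Ids over the inflated form depend only on the content itself.
ids_over_inflated = {
    hashlib.sha1(zlib.decompress(fast)).hexdigest(),
    hashlib.sha1(zlib.decompress(best)).hexdigest(),
}
```

Checksumming the zipped form would therefore require pinning the compression level, and more generally the exact compressor behavior, for every participant.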

Thanks,
-t


* Re: on when to checksum
  2005-04-20 22:25 on when to checksum Tom Lord
@ 2005-04-20 22:41 ` Linus Torvalds
  2005-04-20 22:52   ` Tom Lord
  2005-04-21 16:53 ` Andrew Timberlake-Newell
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2005-04-20 22:41 UTC
  To: Tom Lord; +Cc: git

On Wed, 20 Apr 2005, Tom Lord wrote:
> 
> I think you have made a mistake by moving the sha1 checksum from the
> zipped form to the inflated form.  Here is why:

I'd have agreed with you (and I did, violently) if it wasn't for the
performance issues. It makes a huge difference for write-tree, and to me,
clearly performance _does_ matter.

Fractions of seconds may not sound like a lot, but they add up. I work 
with 200-patch series myself all the time, so I'm very sensitive to a 0.3 
second difference in performance.
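
As a back-of-the-envelope sketch of how that adds up, using only the two figures quoted above (the function name is hypothetical, and the overhead is assumed to be a fixed cost per commit):

```python
def series_delay_seconds(patches: int, extra_seconds_per_commit: float) -> float:
    """Extra wall-clock time a fixed per-commit overhead adds to one batch."""
    return patches * extra_seconds_per_commit


# 200 patches at 0.3 s extra each: roughly a minute added to a single series.
extra = series_delay_seconds(200, 0.3)
```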

		Linus


* Re: on when to checksum
  2005-04-20 22:41 ` Linus Torvalds
@ 2005-04-20 22:52   ` Tom Lord
  2005-04-20 23:07     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Lord @ 2005-04-20 22:52 UTC
  To: torvalds; +Cc: git

   From: Linus Torvalds <torvalds@osdl.org>

   On Wed, 20 Apr 2005, Tom Lord wrote:
   > 
   > I think you have made a mistake by moving the sha1 checksum from the
   > zipped form to the inflated form.  Here is why:

   I'd have agreed with you (and I did, violently) if it wasn't for the
   performance issues. It makes a huge difference for write-tree, and to me,
   clearly performance _does_ matter.

   Fractions of seconds may not sound like a lot, but they add up. I work 
   with 200-patch series myself all the time, so I'm very sensitive to a 0.3 
   second difference in performance.

How many times per day do you invoke `write-tree' and why?

It takes a large multiple of `0.3s' to get me to take you seriously
on this point.

I have long harbored the suspicion that your perceived bandwidth
implies that you process a lot of patches unread or barely read --
implying that your day-to-day bitslinging could/should largely be
handled by an Arch-style patch-queue-manager (a script).

-t


* Re: on when to checksum
  2005-04-20 22:52   ` Tom Lord
@ 2005-04-20 23:07     ` Linus Torvalds
  2005-04-20 23:39       ` Tom Lord
  2005-05-02 19:21       ` Tom Lord
  0 siblings, 2 replies; 8+ messages in thread
From: Linus Torvalds @ 2005-04-20 23:07 UTC
  To: Tom Lord; +Cc: git

On Wed, 20 Apr 2005, Tom Lord wrote:
> 
> How many times per day do you invoke `write-tree' and why?

Every single commit does a write-tree, so when I merge with Andrew, it's 
usually a series of 100-250 of them in a row.

(Actually, _usually_ it's smaller series, but it's the big series that can
be painful enough to matter).

> It takes a large multiple of `0.3s' to get me to take you seriously
> on this point.

The thing is, I don't "trickle" things in. That would be horribly 
inefficient for me. So I go over the patches, make a mbox, and do them all 
in one go. And then they need to happen _fast_. If it takes 20 minutes, I 
go away for coffee or something, and then if something didn't apply 
half-way through, I will have lost my "context".

That's why I want things instant. Not because I have huge daily throughput 
issues, but I have huge _latency_ issues. 

I considered doing a "two-level" thing, where I first did the stuff in a
light-weight patch manager, and then batched things up in the background
for the real thing. But the fact is, I don't think it's needed. Not the
way git performs now. If I can apply a hundred patches in a minute or two,
I have not "lost the context" if it turns out that there is some silly
glitch with one of them.

		Linus


* Re: on when to checksum
  2005-04-20 23:07     ` Linus Torvalds
@ 2005-04-20 23:39       ` Tom Lord
  2005-05-02 19:21       ` Tom Lord
  1 sibling, 0 replies; 8+ messages in thread
From: Tom Lord @ 2005-04-20 23:39 UTC
  To: torvalds; +Cc: git


(I'll have to study/think about that for a while before a proper
reply.  Tomorrow, probably.)

Thanks,
-t



* Re: on when to checksum
  2005-04-20 23:07     ` Linus Torvalds
  2005-04-20 23:39       ` Tom Lord
@ 2005-05-02 19:21       ` Tom Lord
  2005-05-02 19:57         ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Tom Lord @ 2005-05-02 19:21 UTC
  To: torvalds; +Cc: git

  The thing is, I don't "trickle" things in. That would be horribly 
  inefficient for me. So I go over the patches, make a mbox, and do them all 
  in one go. And then they need to happen _fast_. If it takes 20 minutes, I 
  go away for coffee or something, and then if something didn't apply 
  half-way through, I will have lost my "context".

  That's why I want things instant. Not because I have huge daily throughput 
  issues, but I have huge _latency_ issues. 

I'm curious about what is the value of the "batch" nature of that
process?

Presumably most patches apply cleanly and most are orthogonal (order
independent).   I'm sure that there are frequently interesting exceptions
but am I generally right about "most" here?

So, if I understand, you review each change before stuffing it in a
mailbox, then you apply all the patches in that mailbox in batch.
In the majority of cases, the buffering of changes in the mailbox
adds nothing.

Why isn't that more automated: when you approve a change, it could be
applied at once, in the background.  If conflictless, it can be committed,
tested, whatever.  If conflicting, *then* the change can be buffered
up for you to look at.   Explicit declarations from programmers or 
text-based computations about dependencies among the patches can help
improve the queue management in more complicated cases.
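
A minimal sketch of that kind of queue, under stated assumptions: `PatchQueue` and the `try_apply` hook are hypothetical names, the hook is assumed to return True when a patch applies without conflicts, and the background worker and dependency handling are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PatchQueue:
    """Hypothetical queue: commit clean patches at once, hold back conflicts."""

    # Assumed hook: returns True if the patch applied cleanly.
    try_apply: Callable[[str], bool]
    committed: List[str] = field(default_factory=list)
    needs_attention: List[str] = field(default_factory=list)

    def approve(self, patch: str) -> None:
        # Called at the moment the maintainer approves a patch.
        if self.try_apply(patch):
            self.committed.append(patch)
        else:
            # Only conflicting patches are buffered for a human to look at.
            self.needs_attention.append(patch)
```

In this model the maintainer only ever revisits `needs_attention`; everything else is handled as soon as it is approved.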

In other words, a more asynchronous process might save you time *and*
pay off by reserving more of your attention for areas where it's 
really needed.

-t


* Re: on when to checksum
  2005-05-02 19:21       ` Tom Lord
@ 2005-05-02 19:57         ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2005-05-02 19:57 UTC
  To: Tom Lord; +Cc: git

On Mon, 2 May 2005, Tom Lord wrote:
> 
> I'm curious about what is the value of the "batch" nature of that
> process?

My time.

I don't know about other people, but I don't multitask. I do one thing, 
and that's it. I don't move my mouse around. I sit in my mail reader, and 
I read email. I don't read one email, switch to another window, apply it, 
switch back, read the next email etc etc.

In fact, I claim that anybody who works that way is going to have an IQ of 
about 15 points lower than somebody who batches things up. Just because 
you end up losing your context, and that effectively makes you stupid. 

Concentration is a wonderful thing, but it _requires_ that you do things 
in a concentrated manner.

> So, if I understand, you review each change before stuffing it in a
> mailbox, then you apply all the patches in that mailbox in batch.
> In the majority of cases, the buffering of changes in the mailbox
> adds nothing.

I read email, and while reading email I save the interesting ones off to
another mbox (I call mine "doit"). They get saved off for "later perusal".

I do a first-order review at that stage, and in fact, 95% of the time, 
what goes into the "doit" folder _will_ get applied. Not 100%, though, 
exactly because at this stage I just read email and work in a mail-reader: 
I don't usually even look at the actual kernel sources that a patch 
involves. In particular, sometimes it turns out that the patch wasn't 
against my version at all, but against a -mm tree, and I just don't even 
worry about technical details at that stage.

Stage #2 is going through the "doit" folder at some later date (maybe a 
couple of times a day), and going through it one more time. Maybe not that 
much more "carefully", but with a different intent - now I actually check 
sign-offs, add my own, and check out the actual problems in the source 
tree if needed.

Stage #3 is actually applying it.

_Each_ stage culls out bad things.

And I _really_ don't bounce between stages.
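
The staged workflow described above can be modeled loosely as a batch pipeline. This is only an illustrative sketch: `run_batched` and the `Stage` type are hypothetical, and each stage is reduced to a keep-or-cull decision per patch.

```python
from typing import Callable, Iterable, List

# A stage returns True to keep a patch and False to cull it.
Stage = Callable[[str], bool]


def run_batched(patches: Iterable[str], stages: List[Stage]) -> List[str]:
    """Run every patch through one stage before any patch enters the next."""
    batch = list(patches)
    for stage in stages:
        # The whole batch finishes this stage; nothing bounces between stages.
        batch = [p for p in batch if stage(p)]
    return batch
```

The contrast with a per-patch loop is the order of iteration: stages are the outer loop and patches the inner one, so attention stays on one kind of task at a time.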

> In other words, a more asynchronous process might save you time *and*
> pay off by reserving more of your attention for areas where it's 
> really needed.

It's not asynchronous. It's batched in different stages so that I can 
work better. And latency matters.

		Linus


* RE: on when to checksum
  2005-04-20 22:25 on when to checksum Tom Lord
  2005-04-20 22:41 ` Linus Torvalds
@ 2005-04-21 16:53 ` Andrew Timberlake-Newell
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Timberlake-Newell @ 2005-04-21 16:53 UTC
  To: 'Tom Lord'; +Cc: torvalds, git

Tom Lord graced us with:
> I think you have made a mistake by moving the sha1 checksum from the
> zipped form to the inflated form.  Here is why:
> 
> What you have set in motion with `git' is an ad-hoc p2p network for
> sharing filesystem trees -- a global distributed filesystem.  I
> believe your starter here has a good chance of taking off to be much,
> much larger than just a tool for the kernel.

This might rather be a call for a git derivative.

As Linus has already mentioned in this thread, git is optimized for his need
for local speed.  But while sacrificing local speed for network speed would
break git by stepping away from the git philosophy, a gitling with a
different philosophy but making use of gitish techniques could make that
change without being broken even though git itself can't.

