git.vger.kernel.org archive mirror
* [ANNOUNCEMENT] /Arch/ embraces `git'
@ 2005-04-20  9:58 Tom Lord
  0 siblings, 0 replies; 7+ messages in thread
From: Tom Lord @ 2005-04-20  9:58 UTC (permalink / raw)
  To: gnu-arch-users, gnu-arch-dev, git; +Cc: talli, torvalds


`git', by Linus Torvalds, contains some very good ideas and some
very entertaining source code -- recommended reading for hackers.

/GNU Arch/ will adopt `git':

From the /Arch/ perspective: `git' technology will form the
basis of a new archive/revlib/cache format and the basis
of new network transports.

From the `git' perspective, /Arch/ will replace the lame "directory
cache" component of `git' with a proper revision control system.

In my view, the core ideas in `git' are quite profound and deserve
an impeccable implementation.   This is practical because those ideas
are also pretty simple.

I started here:

   http://www.seyza.com/=clients/linus/tree/index.html

and for those interested in `git'-theory, a good place to start is

   http://www.seyza.com/=clients/linus/tree/src/liblob/index.html

(Linus is not literally a "client" of mine.  That's just the directory 
where this goes.)

-t


* [ANNOUNCEMENT] /Arch/ embraces `git'
@ 2005-04-20 10:00 Tom Lord
  2005-04-20 10:19 ` Miles Bader
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Tom Lord @ 2005-04-20 10:00 UTC (permalink / raw)
  To: gnu-arch-users, gnu-arch-dev, git; +Cc: talli, torvalds



* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
  2005-04-20 10:00 Tom Lord
@ 2005-04-20 10:19 ` Miles Bader
  2005-04-20 17:15 ` duchier
  2005-04-20 21:31 ` Petr Baudis
  2 siblings, 0 replies; 7+ messages in thread
From: Miles Bader @ 2005-04-20 10:19 UTC (permalink / raw)
  To: Tom Lord; +Cc: gnu-arch-users, gnu-arch-dev, talli, git, torvalds

Way to go.

-Miles
-- 
Do not taunt Happy Fun Ball.


* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
  2005-04-20 10:00 Tom Lord
  2005-04-20 10:19 ` Miles Bader
@ 2005-04-20 17:15 ` duchier
  2005-04-20 23:04   ` Tom Lord
  2005-04-20 21:31 ` Petr Baudis
  2 siblings, 1 reply; 7+ messages in thread
From: duchier @ 2005-04-20 17:15 UTC (permalink / raw)
  To: Tom Lord; +Cc: gnu-arch-users, gnu-arch-dev, talli, git, torvalds

Hi Tom,

just as a datapoint, here is an experiment I carried out.  I wanted to evaluate
how much overhead is incurred by using several levels of directories to
implement a discriminating index.  I used the key format you specified:

	SHA1,SIZE

As data, I used my /usr/src/linux which uses 301M and contains 20753 files and
1389 directories.  To compute the key for a directory, I considered that its
contents were a mapping from names to keys.

When constructing the indexed archive, I actually stored empty files instead of
blobs because I am only interested in overhead.

Using your suggested indexing method that uses [0:4] as the 1st level key and
[4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
where the top level contains 18665 1st level keys, the largest first level dir
contains 5 entries, and all 2nd level dirs contain exactly 1 entry.

Using Linus's suggested 1 level [0:2] indexing, I obtain an indexed archive that
occupies 1.8M, where the top level contains 256 1st level keys, and where the
largest 1st level dir contains 110 entries.

This experiment was performed on an ext3 file system.
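
A minimal Python sketch of the two layouts compared above. The function names, the hex digest, the decimal size, and keeping the full key as the file name are all assumptions made for illustration; the experiment's actual script is not shown in this message. The last helper estimates how many 1st-level directories end up non-empty if keys are uniformly distributed and every file and directory has a distinct key.

```python
import hashlib


def blob_key(data: bytes) -> str:
    """Build a SHA1,SIZE key (hex digest and decimal size are assumptions)."""
    return f"{hashlib.sha1(data).hexdigest()},{len(data)}"


def two_level_path(key: str) -> str:
    """key[0:4] names the 1st-level directory, key[4:8] the 2nd-level one."""
    return f"{key[0:4]}/{key[4:8]}/{key}"


def one_level_path(key: str) -> str:
    """key[0:2] names the single directory level (256 possible directories)."""
    return f"{key[0:2]}/{key}"


def expected_occupied(buckets: int, objects: int) -> float:
    """Expected number of non-empty directories for uniformly distributed keys."""
    return buckets * (1.0 - (1.0 - 1.0 / buckets) ** objects)


if __name__ == "__main__":
    key = blob_key(b"hello, world\n")
    print(two_level_path(key))
    print(one_level_path(key))
    # 20753 files + 1389 directories, assuming each one has a distinct key.
    objects = 20753 + 1389
    print(round(expected_occupied(16 ** 4, objects)))  # two-level, [0:4]
    print(round(expected_occupied(16 ** 2, objects)))  # one-level, [0:2]
```

Under those assumptions the estimate comes out close to 18,800 occupied 1st-level directories for the [0:4] layout and all 256 for the [0:2] layout, which is in the same range as the 18665 and 256 reported above.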

Cheers,

--Denys


* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
  2005-04-20 10:00 Tom Lord
  2005-04-20 10:19 ` Miles Bader
  2005-04-20 17:15 ` duchier
@ 2005-04-20 21:31 ` Petr Baudis
  2005-04-20 21:55   ` C. Scott Ananian
  2 siblings, 1 reply; 7+ messages in thread
From: Petr Baudis @ 2005-04-20 21:31 UTC (permalink / raw)
  To: Tom Lord; +Cc: gnu-arch-users, gnu-arch-dev, git, talli, torvalds

Dear diary, on Wed, Apr 20, 2005 at 12:00:36PM CEST, I got a letter
where Tom Lord <lord@emf.net> told me that...
> From the /Arch/ perspective: `git' technology will form the
> basis of a new archive/revlib/cache format and the basis
> of new network transports.

I think one thing git's objects database is not very well suited for is
network transports. You want to have something smart doing the
transports, comparing trees so that it can do some delta compression;
that could probably reduce the amount of data needed to be sent
significantly.
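
A rough Python sketch of the tree-comparison step, assuming a toy in-memory store in which a tree is a dict mapping names to keys and a blob is a bytes value. The store layout, the function names, and the short made-up keys are hypothetical, and this is not delta compression itself: it only works out which whole objects the receiver lacks, which is the input a smarter transport would start from.

```python
def reachable(store, root):
    """Return every key reachable from root (trees are dicts of name -> key)."""
    seen = set()
    stack = [root]
    while stack:
        key = stack.pop()
        if key in seen:
            continue
        seen.add(key)
        obj = store[key]
        if isinstance(obj, dict):  # a tree: descend into its entries
            stack.extend(obj.values())
    return seen


def objects_to_send(store, want_root, have_root):
    """Keys reachable from want_root that the receiver (at have_root) lacks."""
    have = reachable(store, have_root)
    needed = set()
    stack = [want_root]
    while stack:
        key = stack.pop()
        if key in have or key in needed:
            continue  # already on the receiver, so its whole subtree is skipped
        needed.add(key)
        obj = store[key]
        if isinstance(obj, dict):
            stack.extend(obj.values())
    return needed


# Toy store with made-up keys; real keys would be content hashes.
store = {
    "b1": b"int main(void) { return 0; }\n",
    "b2": b"readme, first version\n",
    "b3": b"readme, second version\n",
    "t_old": {"main.c": "b1", "README": "b2"},
    "t_new": {"main.c": "b1", "README": "b3"},
}

if __name__ == "__main__":
    print(sorted(objects_to_send(store, "t_new", "t_old")))
```

Only the new tree and the changed blob need to travel; the unchanged main.c blob is recognised by its key and skipped.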

> From the `git' perspective, /Arch/ will replace the lame "directory
> cache" component of `git' with a proper revision control system.

I'm not sure if you fully grasped git's philosophy yet. The
"directory cache" component is not by itself any revision control system
- it is merely a staging area for any revision system on top of it (IOW:
subordinate, not competitor).

> I started here:
> 
>    http://www.seyza.com/=clients/linus/tree/index.html
> 
> and for those interested in `git'-theory, a good place to start is
> 
>    http://www.seyza.com/=clients/linus/tree/src/liblob/index.html

These pages are surely very nice, unfortunately I have to enjoy them
only from the "HTML source" view. The HTML seems completely broken,
containing unterminated comments like "<!-- BEGIN  the main body>". :-(

You didn't go into the surely interesting details of what you will be
fixing regarding ancestry graphs.

Also, I have some concerns about your naming scheme. First, why do you
include the size in the filename? Second, with ..../..../ you are
_seriously_ worse off than with ../. The latter will put 1/256 of the project
files in each directory, whereas with the former you will have
1/4294967296 of the project files per directory. I think the point of a
directory is that it is a container grouping certain files in a certain
way; in the objects database it is done purely for performance (and
compatibility, to a degree) reasons, but done your way it will have worse
performance characteristics at least until the project accumulates
4294967296 files in the database.
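
The fan-out arithmetic behind those two figures, as a small Python sketch. It assumes each dot in ../ and ..../..../ stands for one hexadecimal digit of the key; the helper names and the object count of one million are purely illustrative.

```python
def leaf_directories(hex_digits: int) -> int:
    """Number of distinct leaf directories addressed by that many hex digits."""
    return 16 ** hex_digits


def average_files_per_leaf(objects: int, hex_digits: int) -> float:
    """Average number of files per leaf directory for uniformly spread keys."""
    return objects / leaf_directories(hex_digits)


if __name__ == "__main__":
    objects = 1_000_000  # illustrative object count, not a measured figure
    for label, digits in (("../", 2), ("..../..../", 8)):
        print(label, leaf_directories(digits),
              average_files_per_leaf(objects, digits))
```

Two hex digits give 256 leaf directories and eight give 4294967296, so at this object count the deeper layout still averages far less than one file per directory.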

Kind regards,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor


* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
  2005-04-20 21:31 ` Petr Baudis
@ 2005-04-20 21:55   ` C. Scott Ananian
  0 siblings, 0 replies; 7+ messages in thread
From: C. Scott Ananian @ 2005-04-20 21:55 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Tom Lord, gnu-arch-users, gnu-arch-dev, git, talli, torvalds

On Wed, 20 Apr 2005, Petr Baudis wrote:

> I think one thing git's objects database is not very well suited for are
> network transports. You want to have something smart doing the
> transports, comparing trees so that it can do some delta compression;
> that could probably reduce the amount of data needed to be sent
> significantly.

I'm hoping my 'chunking' patches will fix this.  This ought to reduce the 
size of the object store by (in effect) doing delta compression; rsync
will then Do The Right Thing and only transfer the needed deltas.
Running some benchmarks right now to see how well it lives up to this 
promise...
  --scott

terrorist AEROPLANE munitions PAPERCLIP MI5 Morwenstow WSHOOFS CABOUNCE 
colonel Yakima AES MI6 nuclear NSA Cocaine Columbia plastique LICOZY
                          ( http://cscott.net/ )


* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
  2005-04-20 17:15 ` duchier
@ 2005-04-20 23:04   ` Tom Lord
  0 siblings, 0 replies; 7+ messages in thread
From: Tom Lord @ 2005-04-20 23:04 UTC (permalink / raw)
  To: duchier; +Cc: gnu-arch-users, gnu-arch-dev, git




   From: duchier@ps.uni-sb.de

Thank you for your experiment.  I'm not surprised by the 
result but it is very nice to know that my expectations
are right.

I think that to a large extent you are seeing artifacts
of the questionable trade-offs that (reports tell me) the
ext* filesystems make.   With a different filesystem, the 
results would be very different.

I'm imagining a blob database containing many revisions of the linux
kernel.  It will contain millions of blobs.

It's fine that some filesystems and some blob operations work fine
on a directory with millions of files but what about other operations
on the database?   I pity the poor program that has to `readdir' through
millions of files.

That said: I may add an optional flat-directory format to my library,
just to avoid issues such as those you raise over the next couple 
years.

-t

