Re: Handling very large numbers of symbolic references?

From: Linus Torvalds <torvalds@osdl.org>
To: Nix <nix@esperi.org.uk>
Cc: git@vger.kernel.org
Subject: Re: Handling very large numbers of symbolic references?
Date: Tue, 25 Jul 2006 15:23:57 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0607251508540.29649@g5.osdl.org>
In-Reply-To: <87psfteb4l.fsf@hades.wkstn.nix>

On Tue, 25 Jul 2006, Nix wrote:
> 
> However, this causes a potential problem. There are tens of thousands of
> these bugs, and the .git/refs/heads directory gets *enormous* and thus
> the system gets terribly terribly slow (crappy old Solaris filesystem
> syndrome).

I would really suggest you use some lookup logic of your own to handle 
this, because having that many refs will slow down a lot of things.

That said, you can certainly use a hierarchy of refs, and just have them 
as

	.git/refs/heads/00/000-999
			01/000-999
			02/000-999
			...

if you want to avoid the dreaded filesystem meltdown.
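
A minimal sketch of that layout in Python, assuming bugs are identified by
non-negative integers and that each directory holds 1000 of them. The helper
name sharded_ref_path and the zero-padding widths are illustrative
assumptions, not anything git itself defines.

```python
def sharded_ref_path(bug_id: int) -> str:
    """Map a numeric bug id to a ref path such as refs/heads/01/234.

    Hypothetical helper: the directory is bug_id // 1000 and the file
    name is bug_id % 1000, both zero-padded so names sort naturally.
    """
    if bug_id < 0:
        raise ValueError("bug_id must be non-negative")
    directory, leaf = divmod(bug_id, 1000)
    return f"refs/heads/{directory:02d}/{leaf:03d}"
```

With this scheme no single directory ever holds more than 1000 entries,
which is what keeps the filesystem lookups cheap.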

I suspect it would suck, though. You'd still end up with tens of thousands 
of small files, with no good way to pack them together.

> It seems to me there are two ways to fix this:
> 
>  - restructure .git/refs/* in a similar way to .git/objects, i.e. as a
>    one- or two-level tree.

So this works already.

>  - the vast majority of these bugs are closed. They still need to be got
>    at now and again for branch merges, but they could be got out of
>    .refs/heads at delete_branch time, and pushed into a tree consisting
>    entirely of deleted branches, which would in turn be pointed at from
>    some new place under .refs; perhaps .refs/heads/heavy (by analogy to
>    non-lightweight tags). The problem here is that whenever we delete
>    a tag, we'll leak that tree (at least we will if it's in a pack), and
>    that leakage really could add up in the end.

Well, the problem to some degree is that a number of git routines will
look up all heads (e.g. things like "git pull" and "git ls-remote" and "git
push", not to mention all the visualizers that want to show all the heads).

So if you really end up doing them as individual heads, I'm afraid that
performance will suck big-time. And it wouldn't really help to put them
under .git/refs/heads/heavy; you'd still be in trouble.

> I'm not sure which way is preferable. Suggestions? Is the entire idea
> lunatic?

I think you _can_ use git in the way you propose, but it's going to be
fundamentally pretty inefficient. The disk space usage will be inefficient
(tens of thousands of files, all just 41 characters in size), but even
more importantly, as mentioned, you'll have things like cloning or pulling
a repository always having to get tens of thousands of references, and
that's just going to be very very slow.
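
To put a rough number on the disk space point: a loose ref file holds a
40-character hex SHA-1 plus a newline, which is where the 41 characters come
from. The sketch below assumes 50,000 refs and a filesystem that allocates
whole 4096-byte blocks per file; both figures are illustrative assumptions,
and a filesystem that stores tiny files inline would behave differently.

```python
SHA1_HEX_LEN = 40                  # hex digits in a SHA-1 object name
REF_FILE_SIZE = SHA1_HEX_LEN + 1   # plus the trailing newline = 41 bytes


def ref_storage(num_refs: int, block_size: int = 4096) -> tuple[int, int]:
    """Return (payload_bytes, allocated_bytes) for num_refs loose ref files.

    Assumes every file occupies a whole number of blocks; block_size is
    an assumed filesystem parameter, not something git controls.
    """
    blocks_per_file = -(-REF_FILE_SIZE // block_size)  # ceiling division
    payload = num_refs * REF_FILE_SIZE
    allocated = num_refs * blocks_per_file * block_size
    return payload, allocated


payload, allocated = ref_storage(50_000)
print(payload, allocated)  # prints: 2050000 204800000
```

Under those assumptions about 2 MB of real data ends up occupying roughly
200 MB on disk, before counting inodes and directory entries.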

So yes, I think it's a bit lunatic.

Git scales much better in _other_ ways. For example, one thing you could
do is to have each bug-report be described as a _file_ instead of as a
tag, and then have just one (or a few) branches, and you'd have nice
naming of bugs just because the filenames can be nice. That would allow
git to shine because it scales well in things git is good at, i.e. the
database itself.

You'd probably want to introduce the notion of a nice specialized "merge" 
for those files (assuming you really want to do _distributed_ reporting, 
and actually merge two different databases that have the same bugs), but 
git should actually be quite good at supporting something like that, even 
if you might have to do some infrastructure yourself.
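
As a sketch of what such a specialized merge could look like, assume each
bug file has already been parsed into a dictionary of string fields (status,
owner and so on). The function name merge_bug and the field-wise strategy
are hypothetical; this is one possible approach, not an existing git
facility.

```python
def merge_bug(base: dict, ours: dict, theirs: dict) -> tuple[dict, list]:
    """Field-wise three-way merge; returns (merged, conflicting_keys).

    A missing key is treated as "field absent", so deletions merge the
    same way as edits.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree (or both removed the field)
            value = o
        elif o == b:      # only their side changed it
            value = t
        elif t == b:      # only our side changed it
            value = o
        else:             # both sides changed it differently
            conflicts.append(key)
            value = o     # keep our value provisionally
        if value is not None:
            merged[key] = value
    return merged, conflicts
```

A caller would resolve the returned conflicting keys by hand (or by some
bug-tracker-specific rule) before writing the merged file back.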

OR, you could actually teach git about other ways of looking up names. So 
if you decide that you do want to have one branch per bug, you might want 
to teach git about a new "ref" file format that has multiple name/ref 
translations in the same file. That would solve the disk usage problem,
even if it would _not_ solve the inefficiency of tools that might be
slightly unhappy to see thousands and thousands of refs.
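
To make the idea concrete, here is a sketch of a reader for one such file,
assuming a made-up format with one "<40-hex-sha1> <refname>" pair per line.
The format, the '#' comment convention and the function name
parse_ref_table are all assumptions for illustration only.

```python
import re

# One entry per line: 40 lowercase hex digits, a space, then the ref name.
_ENTRY = re.compile(r"^([0-9a-f]{40}) (\S+)$")


def parse_ref_table(text: str) -> dict[str, str]:
    """Parse '<sha1> <refname>' lines into a {refname: sha1} mapping.

    Hypothetical format: blank lines and lines starting with '#' are
    skipped, and a malformed line raises ValueError.
    """
    refs = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: malformed ref entry")
        sha, name = match.groups()
        refs[name] = sha
    return refs
```

One file read then replaces tens of thousands of tiny-file lookups, though
anything that enumerates refs still has to walk every entry.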

Anyway, whatever approach you select, send patches to Junio. I'm sure that 
we can try to make git support even some rather strange models.

		Linus
