* Re: Handling very large numbers of symbolic references?
From: linux @ 2006-07-26 18:38 UTC
To: nix; +Cc: git, linux

Just to contribute a little brainstorming....

- Remember that git refs only point to one end of a commit chain.
  The origin is kind of implicit. If bug IDs correspond to *changes*,
  especially ones that you want to mix and match rebasing, is this a
  job for StGit or quilt or something else that tracks patches rather
  than states?

- If you do use core git to label bits of development history, are
  the labels supposed to be mutable heads or mostly frozen tags?

- Assuming they're tags, do you need them to be part of the root set
  for garbage collection purposes? Or do you assume they are already
  referenced by the development history, and the bug ID links are
  symlinks that might be broken if the patch isn't merged?

I really should look at StGit more, because from my current position
of ignorance, it looks like possibly a better match to the problem.
The main problems I see are that its patches are per-branch, not
global, and there's no fetch/push mechanism for sharing them.

Also, you might want to have a "patch" with a single name be a patch
SERIES, which I don't think StGit does.
* Handling very large numbers of symbolic references?
From: Nix @ 2006-07-25 19:29 UTC
To: git

I'm about to start writing my first git porcelain (to try to convert
my workplace from the world's oldest and cruftiest version control
system to something not based on the bastard offspring of SCCS and
VMS's CMS, with less power than either) and have run into a problem
that I'm not sure how to solve.

The biggest problem with git for totally naive users is that they get
scared by the sha1 IDs used as version numbers (assuming the index is
porcelained away: but that would confuse them, not scare them).
They're not pronounceable, not memorable, and so on.

So the porcelain I'm whipping up conceals them in large part by using
instead bug IDs, as the workflow of the place I'm doing this for is
driven entirely by Bugzilla bug numbers. I'm taking a leaf from the
`git for the ignorant' document and arranging that every fix that
fixes some Bugzilla bug is on a branch named after that bug, e.g.
#2243, #10155, whatever.

(I'm going to have to go further than that and track dependency
relationships between bugs, i.e. `if you merge bug #1404's branch, you
must merge #1306's and #1505's as well'. I could do that by adding a
new bug-dependency object, respected by a wrapper around git-merge,
but I'm not sure how kosher it is to add new types of objects only
used by porcelain. Hell, I'm not even sure if it's possible yet.)

However, this causes a potential problem. There are tens of thousands
of these bugs, and the .git/refs/heads directory gets *enormous* and
thus the system gets terribly terribly slow (crappy old Solaris
filesystem syndrome).

It seems to me there are two ways to fix this:

 - restructure .git/refs/* in a similar way to .git/objects, i.e. as a
   one- or two-level tree.

 - the vast majority of these bugs are closed. They still need to be
   got at now and again for branch merges, but they could be got out of
   .refs/heads at delete_branch time, and pushed into a tree consisting
   entirely of deleted branches, which would in turn be pointed at from
   some new place under .refs; perhaps .refs/heads/heavy (by analogy to
   non-lightweight tags). The problem here is that whenever we delete
   a tag, we'll leak that tree (at least we will if it's in a pack),
   and that leakage really could add up in the end.

(Deleting branches corresponding to closed bugs is good for other
reasons: e.g., it cleans up gitweb output. But certain tools *will*
need to get at those closed bug branches: I'm inclined to say that all
of them will sooner or later, because the users aren't going to
tolerate being told that they can't do anything to a closed bug.
Except for adding code to it: we can reasonably declare the addition
of commits to those branches over. Of course once we have the sha1 id,
it's all academic, really.)

I'm not sure which way is preferable. Suggestions? Is the entire idea
lunatic?

And, in case this hasn't been said enough: thank you for git, it's the
nicest version control system I've used in years, and the way it's
structured encourages everyone to play :)

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly
 see the need to define levels of inconceivability.' --- Rik Steenwinkel
* Re: Handling very large numbers of symbolic references?
From: Rene Scharfe @ 2006-07-25 21:29 UTC
To: Nix; +Cc: git

Nix wrote:
> However, this causes a potential problem. There are tens of thousands of
> these bugs, and the .git/refs/heads directory gets *enormous* and thus
> the system gets terribly terribly slow (crappy old Solaris filesystem
> syndrome).
>
> It seems to me there are two ways to fix this:
>
>  - restructure .git/refs/* in a similar way to .git/objects, i.e. as a
>    one- or two-level tree.

Branch names are allowed to contain slashes, thus your porcelain is
free to implement such a tree. Add a slash after every two bug ID
digits and your directories will never contain more than 100 objects.

René
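René's two-digits-per-directory scheme could be sketched as a small
shell helper for the porcelain. The function name `bug_branch`, the
`bug/` prefix and the zero padding to six digits are assumptions made
for illustration; none of them come from the thread. The padding is
there because git cannot store a ref named `10` next to a ref named
`10/15` (one would have to be a file and a directory at once), so every
ID is forced to the same depth.

```shell
# bug_branch ID: print a sharded branch name for a Bugzilla bug number.
# Hypothetical helper: assumes IDs are below 1,000,000 and uses an
# illustrative "bug/" prefix.
bug_branch() {
    id=${1#"#"}                     # accept "#2243" as well as "2243"
    case $id in
        ''|*[!0-9]*)
            echo "bug_branch: not a bug number: $1" >&2
            return 1 ;;
    esac
    [ "${#id}" -le 6 ] || {
        echo "bug_branch: ID too long: $1" >&2
        return 1
    }
    # Left-pad with zeros so every ref ends up at the same depth.
    while [ "${#id}" -lt 6 ]; do id=0$id; done
    # Insert a slash after every two digits, then drop the trailing one.
    printf 'bug/%s\n' "$(printf '%s\n' "$id" | sed 's|..|&/|g; s|/$||')"
}
```

With this layout `bug_branch 10155` names the branch `bug/01/01/55`, and
no directory under `refs/heads/bug` holds more than 100 entries.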
* Re: Handling very large numbers of symbolic references?
From: Nix @ 2006-07-25 21:52 UTC
To: Rene Scharfe; +Cc: git

On Tue, 25 Jul 2006, Rene Scharfe said:
> Nix wrote:
>> However, this causes a potential problem. There are tens of thousands of
>> these bugs, and the .git/refs/heads directory gets *enormous* and thus
>> the system gets terribly terribly slow (crappy old Solaris filesystem
>> syndrome).
>>
>> It seems to me there are two ways to fix this:
>>
>>  - restructure .git/refs/* in a similar way to .git/objects, i.e. as a
>>    one- or two-level tree.
>
> Branch names are allowed to contain slashes, thus your porcelain is free
> to implement such a tree. Add a slash after every two bug ID digits and
> your directories will never contain more than 100 objects.

Oh, lovely! I was *sure* I'd need to make git core changes for this,
but no, the precognitive powers of the git hackers had anticipated my
needs before I knew what they were!

(Now the only downside is gitweb's treatment of such heads: but looking
at the code, making it skip suitably formatted heads when displaying
the heads list is an utterly trivial one-liner.)

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly
 see the need to define levels of inconceivability.' --- Rik Steenwinkel
* Re: Handling very large numbers of symbolic references?
From: Linus Torvalds @ 2006-07-25 22:23 UTC
To: Nix; +Cc: git

On Tue, 25 Jul 2006, Nix wrote:
>
> However, this causes a potential problem. There are tens of thousands of
> these bugs, and the .git/refs/heads directory gets *enormous* and thus
> the system gets terribly terribly slow (crappy old Solaris filesystem
> syndrome).

I would really suggest you use some lookup logic of your own to handle
this, because having that many refs will slow down a lot of things.

That said, you can certainly use a hierarchy of refs, and just have
them as

    .git/refs/heads/00/000-999
                    01/000-999
                    02/000-999
                    ...

if you want to avoid the dreaded filesystem meltdown.

I suspect it would suck, though. You'd still end up with tens of
thousands of small files, with no good way to pack them together.

> It seems to me there are two ways to fix this:
>
>  - restructure .git/refs/* in a similar way to .git/objects, i.e. as a
>    one- or two-level tree.

So this works already.

>  - the vast majority of these bugs are closed. They still need to be got
>    at now and again for branch merges, but they could be got out of
>    .refs/heads at delete_branch time, and pushed into a tree consisting
>    entirely of deleted branches, which would in turn be pointed at from
>    some new place under .refs; perhaps .refs/heads/heavy (by analogy to
>    non-lightweight tags). The problem here is that whenever we delete
>    a tag, we'll leak that tree (at least we will if it's in a pack), and
>    that leakage really could add up in the end.

Well, the problem to some degree is that a number of git routines will
look up all heads (eg things like "git pull" and "git ls-remote" and
"git push", not to mention all the visualizers that want to show all
the heads).

So if you really end up doing them as individual heads, I'm afraid
that performance will suck big-time. And it wouldn't really help to put
them under .git/refs/heads/heavy, you'd still be in trouble.

> I'm not sure which way is preferable. Suggestions? Is the entire idea
> lunatic?

I think you _can_ use git in the way you propose, but it's going to be
fundamentally pretty inefficient. The diskspace usage will be
inefficient (tens of thousands of files, all just 41 characters in
size), but even more importantly, as mentioned, you'll have things like
cloning or pulling a repository always having to get tens of thousands
of references, and that's just going to be very very slow.

So yes, I think it's a bit lunatic.

Git scales much better in _other_ ways. For example, one thing you
could do is to have each bug-report be described as a _file_ instead of
as a tag, and then have just one (or a few) branches, and you'd have
nice naming of bugs just because the filenames can be nice. That would
allow git to shine because it scales well in things git is good at, ie
the database itself.

You'd probably want to introduce the notion of a nice specialized
"merge" for those files (assuming you really want to do _distributed_
reporting, and actually merge two different databases that have the
same bugs), but git should actually be quite good at supporting
something like that, even if you might have to do some infrastructure
yourself.

OR, you could actually teach git about other ways of looking up names.
So if you decide that you do want to have one branch per bug, you might
want to teach git about a new "ref" file format that has multiple
name/ref translations in the same file. That would solve the disk usage
problem, even if it would _not_ solve the inefficiency of tools that
might be slightly unhappy to see thousands and thousands of refs.

Anyway, whatever approach you select, send patches to Junio. I'm sure
that we can try to make git support even some rather strange models.

                Linus
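The "multiple name/ref translations in the same file" idea could look
roughly like the following on the porcelain side. This is only a sketch
under stated assumptions: the one-pair-per-line `<sha1> <name>` layout
and the function name `lookup_inactive_head` are invented here for
illustration, and the thread does not fix any file format.

```shell
# lookup_inactive_head FILE NAME: print the sha1 recorded for NAME in
# FILE and succeed, or print nothing and fail if NAME is not listed.
# Assumed (hypothetical) format: one "<sha1> <name>" pair per line.
lookup_inactive_head() {
    awk -v name="$2" '
        $2 == name { print $1; found = 1; exit }   # first match wins
        END        { exit !found }                 # non-zero if absent
    ' "$1"
}
```

A wrapper around merge could call this only after the ordinary lookup
under `refs/heads` has failed. The scan is linear in the number of
entries, but it replaces tens of thousands of 41-byte files with a
single file.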
* Re: Handling very large numbers of symbolic references?
From: Nix @ 2006-07-25 23:08 UTC
To: Linus Torvalds; +Cc: git

On Tue, 25 Jul 2006, Linus Torvalds noted:
> That said, you can certainly use a hierarchy of refs, and just have them
> as
>
>     .git/refs/heads/00/000-999
>                     01/000-999
>                     02/000-999
>                     ...
>
> if you want to avoid the dreaded filesystem meltdown.

That's what I was hoping would work, but...

> I suspect it would suck, though. You'd still end up with tens of thousands
> of small files, with no good way to pack them together.

... that is, indeed, the problem.

>>  - the vast majority of these bugs are closed. They still need to be got
>>    at now and again for branch merges, but they could be got out of
>>    .refs/heads at delete_branch time, and pushed into a tree consisting
>>    entirely of deleted branches, which would in turn be pointed at from
>>    some new place under .refs; perhaps .refs/heads/heavy (by analogy to
>>    non-lightweight tags). The problem here is that whenever we delete
>>    a tag, we'll leak that tree (at least we will if it's in a pack), and
>>    that leakage really could add up in the end.
>
> Well, the problem to some degree is that a number of git routines will
> look up all heads (eg things like "git pull" and "git ls-remote" and "git
> push", not to mention all the visualizers that want to show all the heads).

Ick. Yes, that would be a bit of a sod. git-ls-remote showing >30,000
heads is... not ideal. Not at all. (It's growing by ~50 a day...)

It's a sort of `hidden head' I meant. Hm. I think I see a way: see
below.

> So if you really end up doing them as individual heads, I'm afraid that
> performance will suck big-time. And it wouldn't really help to put them
> under .git/refs/heads/heavy, you'd still be in trouble.

OK, so it has to go somewhere else.

>> I'm not sure which way is preferable. Suggestions? Is the entire idea
>> lunatic?
>
> I think you _can_ use git in the way you propose, but it's going to be
> fundamentally pretty inefficient. The diskspace usage will be inefficient
> (tens of thousands of files, all just 41 characters in size), but even
> more importantly, as mentioned, you'll have things like cloning or pulling
> a repository always having to get tens of thousands of references, and
> that's just going to be very very slow.
>
> So yes, I think it's a bit lunatic.

It's perhaps unusual, but, well, the version control system we're
switching from takes over an *hour* just to check out some classes of
files! (SCCS's handling of large binary files is... inefficient if
naively kludged by uuencoding everything before committing it; we have
some s-files whose size is approaching a gigabyte as a result, being
accessed over very slow NFS. git, of course, doesn't need such crud,
although I may need to teach the deltifier about xdelta or something of
that nature to keep sizes down in the long run.)

> Git scales much better in _other_ ways. For example, one thing you could
> do is to have each bug-report be described as a _file_ instead of as a
> tag, and then have just one (or a few) branches, and you'd have nice
> naming of bugs just because the filenames can be nice. That would allow
> git to shine because it scales well in things git is good at, ie the
> database itself.
>
> You'd probably want to introduce the notion of a nice specialized "merge"
> for those files (assuming you really want to do _distributed_ reporting,
> and actually merge two different databases that have the same bugs), but

That's the sort of unlikely thing which is *certain* to happen :) but
in practice until those database merges actually take place I can't be
sure how the renumbering would be done :/ but no, the heaps-of-refs
seems like the only practical way, because in practice people treat
these bugs as little sets of changed files that they can merge all over
the place, and, well, that's a branch as far as I can see.

Of course, the difference between a branch and a `tree of commits which
has a ref-like thing pointing to it' is minimal: I'd have to teach
git-fsck-objects about it anyway to stop it ditching things as
unreachable when they weren't...

> git should actually be quite good at supporting something like that, even
> if you might have to do some infrastructure yourself.
>
> OR, you could actually teach git about other ways of looking up names. So

This is what I was thinking of doing.

> if you decide that you do want to have one branch per bug, you might want
> to teach git about a new "ref" file format that has multiple name/ref
> translations in the same file. That would solve the disk usage problem,
> even if it would _not_ solve the inefficiency of tools that might be
> slightly unhappy to see thousands and thousands of refs.

Well, actually I was considering trying a combination of two things:

 - a new type of multi-entry ref (as you suggested), perhaps in a file
   refs/inactive-heads, which is merged with the heads list by lookup
   operations only (so merge would see them, but ls-remote would not:
   `invisible heads' if you will); git-branch moves head refs there
   upon deletion; so even deleted head refs are referenceable by name
   forever.

   The merging for lookup would scale as O(n), of course, but that can
   probably be ignored until we have hundreds of thousands of them
   (whereupon the right thing to do is probably to change the
   inactive-heads file format and lookup code and keep the general
   idea).

   (This might mean rejigging code that assumes that looking up a ref
   is an open() away, but that shouldn't be all that terribly hard: one
   new tool, `git-lookup-ref', sort of like git-symbolic-ref only
   applying to refs that aren't symbolic.)

 - dependency information could be handled by rebasing the depending
   branch on the heads of the branches which it depends upon, but,
   well, that seems extremely icky to me, especially if those branches
   are still changing: we'd have to re-rebase all the time to stay up
   to date.

   I suspect that a new object type, or perhaps a new type of ref,
   would be right here. The idea is that you express a mapping from one
   branch ref to another set of branch refs (*not* sha1 id, because
   there is no fixed sha1 id that corresponds to a given branch in the
   presence of commit and git-rebase). A new object type seems ideal
   for this (sort of like a commit only with ref names instead of sha1
   ids), but I'm under the impression that adding new object types to
   git is quite tricky and introduces inter-repository
   incompatibilities, so I might just make it a refs/dependencies
   directory with one file per depending bug, containing many ref names
   for the bugs it depends on. (There will likely be many fewer
   dependencies than inactive bug branches, anyway.)

This should be fun!

> Anyway, whatever approach you select, send patches to Junio. I'm sure that
> we can try to make git support even some rather strange models.

Yeah, I'm planning to make this general enough that anyone can use it:
there'll be an outer layer of glaze around the porcelain which is
specifically to change the command-line syntax to be similar to the
tool that the poor sods at work are moving from, but I'll maintain that
in a branch that nobody sane will pull and that I won't push to anyone,
and keep it out of the tree meant for sane people. (I'm not sure if
`local branch' is really the right term for it: I mean, this is git,
*all* branches are local, or none are...)

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly
 see the need to define levels of inconceivability.' --- Rik Steenwinkel
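The refs/dependencies idea above (one file per depending bug, listing
the bugs it depends on) suggests that a merge wrapper would first have
to work out the full set of branches to bring in. A minimal sketch of
that step follows. The function name `deps_closure` and the layout of
one whitespace-separated name per entry are assumptions for
illustration, and the sketch only reads plain files, so it does not
touch git itself.

```shell
# deps_closure DIR BUG: print BUG followed by everything it depends on,
# directly or indirectly, one name per line and each name only once.
# DIR holds one file per depending bug (hypothetical layout); a bug
# with no file has no dependencies. The body runs in a subshell so the
# work variables and "set -f" do not leak into the caller.
deps_closure() (
    set -f                          # no globbing when splitting names
    dir=$1 todo=$2 seen=
    while :; do
        set -- $todo                # split the work list into words
        [ $# -gt 0 ] || break
        bug=$1; shift; todo=$*
        case " $seen " in
            *" $bug "*) continue ;; # already visited: cycles are safe
        esac
        seen="$seen $bug"
        printf '%s\n' "$bug"
        if [ -f "$dir/$bug" ]; then
            todo="$todo $(cat "$dir/$bug")"
        fi
    done
)
```

Taking the example from earlier in the thread, a file `1404` containing
`1306` and `1505` makes `deps_closure DIR 1404` list all three bugs, and
the wrapper would then merge each listed branch in turn.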
* Re: Handling very large numbers of symbolic references?
From: Linus Torvalds @ 2006-07-25 23:20 UTC
To: Nix; +Cc: git

On Wed, 26 Jul 2006, Nix wrote:
>
> Well, actually I was considering trying a combination of two things:
>
>  - a new type of multi-entry ref (as you suggested), perhaps in a file
>    refs/inactive-heads, which is merged with the heads list by lookup
>    operations only (so merge would see them, but ls-remote would not:
>    `invisible heads' if you will)

Yes, that should work. Make sure that you tell git-fsck-objects and
git-prune that those heads are reachable, though.

Of course, if you end up having one "master" head (that is the "merge"
of all branches), that would take care of the reachability issue too:
you don't actually need to create a _real_ merge, you can just make
sure that there is a commit that points to all new heads you create. It
could even have a totally dummy tree node, ie you could do

    oldhead=$(git rev-parse HEAD^0) || exit
    # Reuse the previous HEAD's tree; add new-bug-head as a second parent.
    newhead=$(git commit-tree "$oldhead^{tree}" -p "$oldhead" \
        -p new-bug-head < changelog) || exit
    # Move HEAD only if it still points at $oldhead.
    git update-ref HEAD "$newhead" "$oldhead"

which would just update the commit list with a fake "merge" commit
merging "new-bug-head" into the stream of top commits (using the same
tree as the previous "HEAD" commit had) so that it's always reachable.

Something like that, anyway. That way you can do a "git clone" and you
get all the bug commits through a single HEAD.

                Linus