* Efficient way to import snapshots?
@ 2007-07-30 18:07 Craig Boston
2007-07-30 18:56 ` Linus Torvalds
2007-07-30 21:54 ` David Kastrup
0 siblings, 2 replies; 27+ messages in thread
From: Craig Boston @ 2007-07-30 18:07 UTC (permalink / raw)
To: git
Hello, I'm seeking some input from git users/developers to try to solve
a problem I'm encountering. I'm new to git, having used quite a bit
of Subversion and SVK in the past, but I've heard good things and
decided to give it a try.
First, a little bit about what I'm trying to accomplish. There is a
large source tree -- FreeBSD to be specific -- which is maintained in
CVS. I want to have several local branches where I can develop specific
projects. Each of these branches should derive from either the HEAD of
the CVS tree, or from one of the release branches (known as the -STABLE
branch for a particular version of the OS).
That said, I really don't want to have to import the entire CVS
repository. It has many, many branches and tags that I'm not interested
in, and I really don't need the entire history of the project. I don't
even really need individual commit history. I can see that easily enough
on cvsweb. Repo size is a big factor as I need to replicate it between
several different work machines, some of which don't have unlimited disk
space.
What I'm currently doing (using SVK) is, nightly, taking a snapshot using
cvsup of each of the 3 branches that I care about, then using "svk
import" to pull the snapshot into a local branch that I treat as a
vendor branch. My projects are branched off those and I regularly
smerge changes over. It works pretty well for me; the only downside is
that svk isn't exactly fast, having to load lots of perl modules for
every command.
I'd like to do something similar with git -- I like the idea of it
having far fewer dependencies that must be installed, and it's supposed
to be quite a bit faster when dealing with working copies. From reading
about it I think it may also be easier to generate diffs of my branches
from their origins.
So far the main snag I've found is that AFAIK there's no equivalent to
"svk import" to load a big tree (~37000 files) into a branch and commit
the changes. Here's the procedure I've come up with:
cd /path/to/git/repo
git checkout vendor_branch_X
git rm -r .
cp -R /path/to/cvs/checkout_X/* ./
git add .
git commit -m"Import yyyymmdd snapshot"
However, this has quite a few disadvantages compared to svk. The
first is that I have to check out into a working directory and then copy
the files from the cvs checkout. It is also considerably slower than
svk import, about 7-8 times on average. When there are a lot of
changes I've seen the git process use upwards of 1GB of memory; it
actually died the first time I tried it because I didn't have any swap
configured.
Now, I don't have very much experience with git, so here's my question: Is
there a better way to do this? Either for importing a lot of files into
git, or a better solution for what I'm trying to do.
Any pointers would be much appreciated.
Thanks!
Craig
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-30 18:07 Efficient way to import snapshots? Craig Boston
@ 2007-07-30 18:56 ` Linus Torvalds
2007-07-30 19:29 ` Craig Boston
2007-07-30 21:22 ` Jakub Narebski
2007-07-30 21:54 ` David Kastrup
1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2007-07-30 18:56 UTC (permalink / raw)
To: Craig Boston; +Cc: git
On Mon, 30 Jul 2007, Craig Boston wrote:
>
> So far the main snag I've found is that AFAIK there's no equivalent to
> "svk import" to load a big tree (~37000 files) into a branch and commit
> the changes. Here's the procedure I've come up with:
>
> cd /path/to/git/repo
> git checkout vendor_branch_X
> git rm -r .
> cp -R /path/to/cvs/checkout_X/* ./
> git add .
> git commit -m"Import yyyymmdd snapshot"
Ouch.
What you want to do should fit git very well, but doing it that way is
quite expensive.
Might I suggest just doing the .git thing *directly* in the CVS checkout
instead?
It should literally be as easy as doing something like
cd /path/to/cvs/checkout_X
export GIT_DIR=/path/to/git/repo
git add .
git commit -m"Import yyyymmdd snapshot"
and you never copy anything around, and git will just notice on its own
what CVS has changed.
You'd have to make sure that you have the CVS directories ignored, of
course, and if you don't want to change the CVS directory at all (which is
a good idea!) you'd need to do that by using the "ignore" file in your
GIT_DIR, and just having the CVS entry there, instead of adding a
".gitignore" file to the working tree and checking it in.
The first time you do this it will be expensive, since it will re-compute
all the SHA1's in the .git/index file, but afterwards, it will be able to
use the index file to speed up operation, which is what is going to make
this all *much* cheaper than removing the old files and copying all the
files around anew.
(You can just force the re-indexing by doing a "git status" once, ie do
cd /path/to/cvs/checkout_X
export GIT_DIR=/path/to/git/repo
git status
which will do it for you, and now all the subsequent snapshot generation
should be trivial and very fast).
The above is totally untested, of course, but I think that's the easiest
way to do things like this. In general, it should be *trivial* to do
snapshots with git using just about _any_ legacy SCM, exactly because you
can keep the whole git setup away from the legacy SCM directories with
that "GIT_DIR=.." thing.
(Of course, you could just move the .git directory into the CVS checkout
too - and then CVS will just ignore it. But it may be a good idea to just
keep them explicitly separate).
Linus
* Re: Efficient way to import snapshots?
2007-07-30 18:56 ` Linus Torvalds
@ 2007-07-30 19:29 ` Craig Boston
2007-07-30 19:52 ` Linus Torvalds
2007-07-30 21:22 ` Jakub Narebski
1 sibling, 1 reply; 27+ messages in thread
From: Craig Boston @ 2007-07-30 19:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
On Mon, Jul 30, 2007 at 11:56:56AM -0700, Linus Torvalds wrote:
> It should literally be as easy as doing something like
>
> cd /path/to/cvs/checkout_X
> export GIT_DIR=/path/to/git/repo
> git add .
> git commit -m"Import yyyymmdd snapshot"
Aha! I didn't know that you could point to a repository with GIT_DIR
and do useful operations without a working directory. My "master" repo
that gets backed up and cloned everywhere is a bare repo anyway; I had
been cloning it with -s and then using 'git push' to get changes back
into it.
A couple questions on that:
1. Will it notice deleted files?
2. How can I tell it what branch to commit to?
> You'd have to make sure that you have the CVS directories ignored, of
> course, and if you don't want to change the CVS directory at all (which is
> a good idea!) you'd need to do that by using the "ignore" file in your
> GIT_DIR, and just having the CVS entry there, instead of adding a
> ".gitignore" file to the working tree and checking it in.
Not a problem, I'm using cvsup in checkout mode so there are no CVS
dirs. The checkout directory is an exact snapshot of "What The
Repository Should Look Like."
> The above is totally untested, of course, but I think that's the easiest
> way to do things like this. In general, it should be *trivial* to do
> snapshots with git using just about _any_ legacy SCM, exactly because you
> can keep the whole git setup away from the legacy SCM directories with
> that "GIT_DIR=.." thing.
I'll make a backup of my repo and give it a try.
Thanks!
Craig
* Re: Efficient way to import snapshots?
2007-07-30 19:29 ` Craig Boston
@ 2007-07-30 19:52 ` Linus Torvalds
2007-07-30 20:10 ` Craig Boston
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Linus Torvalds @ 2007-07-30 19:52 UTC (permalink / raw)
To: Craig Boston, Junio C Hamano; +Cc: Git Mailing List
[ Junio added, because I think I noticed a performance bug ]
On Mon, 30 Jul 2007, Craig Boston wrote:
>
> A couple questions on that:
>
> 1. Will it notice deleted files?
Yes, although I think you need to do "git commit -a" for that.
"git add ." could (and perhaps _should_) notice them and remove them from
the cache, but doesn't. Whether that's the right behaviour or not (it does
seem a bit strange that "git add" would actually remove files from the
index too) is up for debate.
But with "git commit -a", it will be noticed at commit time, at least.
That said, I just noticed something nasty: "git add ." is *horrible*. It
does the full SHA1 re-computation even though the index is up-to-date.
That's really nasty.
So right now, due to this performance bug, it's actually much better to do
something more complex, namely something like
git ls-files -o | git update-index --add --stdin
git commit -a
which is a lot more efficient than just doing "git add .".
Junio? I _thought_ we already took the index into account with "git add",
but we obviously don't.
> 2. How can I tell it what branch to commit to?
Whatever branch is checked out in the GIT_DIR will be the one that it
commits to.
Linus
* Re: Efficient way to import snapshots?
2007-07-30 19:52 ` Linus Torvalds
@ 2007-07-30 20:10 ` Craig Boston
2007-07-30 21:29 ` Junio C Hamano
2007-07-30 21:04 ` Junio C Hamano
` (2 subsequent siblings)
3 siblings, 1 reply; 27+ messages in thread
From: Craig Boston @ 2007-07-30 20:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On Mon, Jul 30, 2007 at 12:52:52PM -0700, Linus Torvalds wrote:
> On Mon, 30 Jul 2007, Craig Boston wrote:
> > 1. Will it notice deleted files?
>
> Yes, although I think you need to do "git commit -a" for that.
Ah, nice. I had underestimated how smart git is; that was the whole
reason I did the 'git rm -r .' dance at first :-)
> > 2. How can I tell it what branch to commit to?
>
> Whatever branch is checked out in the GIT_DIR will be the one that it
> commits to.
Hmm, ok. I tried it out and it was unhappy with GIT_DIR pointing at the
bare repository (no index file, I presume), so I'll need a minimum of one
clone. With clone -s the repository itself should take up hardly any
space. It sounds like my options are:
1) Have a separate repository clone for each branch that I want to
import to and leave that branch permanently checked out. I lose the
disk space for N working copies, but on the server I'm doing the import
on, it's not a huge issue, especially with ZFS compression ;-)
* This might not actually be so bad if I put the .git directory inside
of the CVS checkout directory and used it as my "working copy". I
just need to ensure that git doesn't create any additional files in
there, as cvsup is really picky about not deleting files that it
didn't create, even if they were removed from CVS.
2) Have one repository clone that gets re-used for each import, with the
"checked out" branch getting changed before the import. As far as I can
tell this means suffering the "git checkout" overhead for 30,000 files,
which is conceptually inefficient but in real time only a minute or so.
* Unless of course there's a way to forcibly change the state that the
repository thinks it's in without physically checking out the files.
I think it would still need to update the index, however.
I tried git reset --soft without success. If this is possible, it
also makes option 1 more attractive if I can safely delete the
working copy files that it won't be using anyway.
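For what it's worth, git can repoint HEAD at another branch without physically checking out files: git symbolic-ref moves HEAD, and git read-tree refreshes the index from the new branch. This is a sketch of that idea, not something confirmed in the thread; branch names are invented:

```shell
set -e
top=$(mktemp -d); cd "$top"
git init -q .
git config user.name tester
git config user.email tester@example.com
git commit -q --allow-empty -m 'base'
git branch vendor_A
git branch vendor_B

echo 'next snapshot' > file.txt   # pretend cvsup just updated the tree

# Repoint HEAD at vendor_B without checking anything out...
git symbolic-ref HEAD refs/heads/vendor_B
# ...and rebuild the index from that branch (working tree untouched).
git read-tree HEAD

git add .
git commit -q -m 'Import snapshot on vendor_B'
git symbolic-ref HEAD             # prints refs/heads/vendor_B
```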
Craig
* Re: Efficient way to import snapshots?
2007-07-30 19:52 ` Linus Torvalds
2007-07-30 20:10 ` Craig Boston
@ 2007-07-30 21:04 ` Junio C Hamano
2007-07-30 23:19 ` Linus Torvalds
2007-07-30 21:55 ` Junio C Hamano
2007-07-30 22:20 ` Craig Boston
3 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2007-07-30 21:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> That said, I just noticed something nasty: "git add ." is *horrible*. It
> does the full SHA1 re-computation even though the index is up-to-date.
> That's really nasty.
>
> So right now, due to this performance bug, it's actually much better to do
> something more complex, namely something like
>
> git ls-files -o | git update-index --add --stdin
> git commit -a
>
> which is a lot more efficient than just doing "git add .".
>
> Junio? I _thought_ we already took the index into account with "git add",
> but we obviously don't.
I do not know offhand.
By the way, the above "something more complex" may be a simple
"git add -u".
* Re: Efficient way to import snapshots?
2007-07-30 18:56 ` Linus Torvalds
2007-07-30 19:29 ` Craig Boston
@ 2007-07-30 21:22 ` Jakub Narebski
1 sibling, 0 replies; 27+ messages in thread
From: Jakub Narebski @ 2007-07-30 21:22 UTC (permalink / raw)
To: git
Linus Torvalds wrote:
> On Mon, 30 Jul 2007, Craig Boston wrote:
>>
>> So far the main snag I've found is that AFAIK there's no equivalent to
>> "svk import" to load a big tree (~37000 files) into a branch and commit
>> the changes. Here's the procedure I've come up with:
>>
>> cd /path/to/git/repo
>> git checkout vendor_branch_X
>> git rm -r .
>> cp -R /path/to/cvs/checkout_X/* ./
>> git add .
>> git commit -m"Import yyyymmdd snapshot"
>
> Ouch.
>
> What you want to do should fit git very well, but doing it that way is
> quite expensive.
>
> Might I suggest just doing the .git thing *directly* in the CVS checkout
> instead?
[...]
And you can try to use git-fast-import. Check out
contrib/fast-import/import-tars.perl script (adapting it to your purpose).
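The fast-import route bypasses the work tree and index entirely: whole snapshots are streamed to git-fast-import. A toy version of what import-tars.perl does, with a one-file snapshot and invented contents:

```shell
set -e
top=$(mktemp -d); cd "$top"
git init -q .

# One commit, one file, fed as a single stream; no checkout, no index work.
git fast-import --quiet --date-format=now <<'STREAM'
commit refs/heads/vendor_snapshot
committer Importer <import@example.com> now
data <<MSG
Import yyyymmdd snapshot
MSG
M 644 inline hello.c
data <<BLOB
upstream v1
BLOB

STREAM

git ls-tree --name-only vendor_snapshot   # prints hello.c
```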
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: Efficient way to import snapshots?
2007-07-30 20:10 ` Craig Boston
@ 2007-07-30 21:29 ` Junio C Hamano
2007-07-30 21:49 ` Craig Boston
0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2007-07-30 21:29 UTC (permalink / raw)
To: Craig Boston; +Cc: Linus Torvalds, Git Mailing List
Craig Boston <craig@olyun.gank.org> writes:
> 1) Have a separate repository clone for each branch that I want to
> import to and leave that branch permanently checked out. I lose the
> disk space for N working copies, but on the server I'm doing the import
> on, it's not a huge issue, especially with ZFS compression ;-)
>
> * This might not actually be so bad if I put the .git directory inside
> of the CVS checkout directory and used it as my "working copy". I
> just need to ensure that git doesn't create any additional files in
> there, as cvsup is really picky about not deleting files that it
> didn't create, even if they were removed from CVS.
With one of the projects hosted on CVS I have to interoperate
with, that is what I do. For historical reasons I do not use
git-cvsimport/exportcommit for this one, but basically:
- "cvs co" to prime the working tree;
- "echo CVS >.gitignore";
- "git init && git add . && git commit";
- "git checkout -b mine";
then I work in "mine" branch. When other people have something
to say, I do:
- "git checkout master";
- "cvs up";
- "git add <whatever was added with the above cvs up>";
- "git commit -a";
- "git rebase master mine";
When feeding my own changes back to CVS, I would:
- "git checkout master";
- "cvs up" to make sure other people do not have any stuff;
- "git show-branch master mine" to see what I have;
- "git cherry-pick <whatever the change I want to feed back>";
- "cvs commit";
- repeat the last two steps for all the changes I want;
- "git rebase master mine";
I only need to make sure not to commit on "master", and not to
run "cvs up" while on "mine".
This can be extended to more than one CVS branch by using
branches other than "master".
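A runnable rehearsal of the sync-and-rebase half of this workflow, with the CVS side faked by plain file edits so it is self-contained (cvs itself is not assumed installed; names are invented):

```shell
set -e
top=$(mktemp -d); cd "$top"
git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the branch name used below
git config user.name tester
git config user.email tester@example.com

# "cvs co" to prime the tree, then the initial import and a work branch
echo CVS > .gitignore
echo 'upstream v1' > hello.c
git add . && git commit -q -m 'initial import'
git checkout -q -b mine
echo 'my change' >> hello.c
git commit -q -a -m 'local work'

# Other people have something to say: back to master, fake a "cvs up"
git checkout -q master
echo 'upstream v2' > newfile.c            # a file the update brought in
git add newfile.c                         # "git add <whatever was added>"
git commit -q -a -m 'sync with CVS'

git rebase -q master mine                 # replay local work on the new base
```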
> 2) Have one repository clone that gets re-used for each import, with the
> "checked out" branch getting changed before the import. As far as I can
> tell this means suffering the "git checkout" overhead for 30,000 files,
> which is conceptually inefficient but in real time only a minute or so.
That should only be "conceptually" in fact, as switching between
branches should not touch paths that are the same between
branches.
* Re: Efficient way to import snapshots?
2007-07-30 21:29 ` Junio C Hamano
@ 2007-07-30 21:49 ` Craig Boston
0 siblings, 0 replies; 27+ messages in thread
From: Craig Boston @ 2007-07-30 21:49 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List
On Mon, Jul 30, 2007 at 02:29:02PM -0700, Junio C Hamano wrote:
> Craig Boston <craig@olyun.gank.org> writes:
> > 2) Have one repository clone that gets re-used for each import, with the
> > "checked out" branch getting changed before the import. As far as I can
> > tell this means suffering the "git checkout" overhead for 30,000 files,
> > which is conceptually inefficient but in real time only a minute or so.
>
> That should only be "conceptually" in fact, as switching between
> branches should not touch paths that are the same between
> branches.
I suspected as much, though in practice almost every file is different
between the branches that I'm tracking. RELENG_4 and RELENG_6 for
instance have years of development between them, with almost every major
subsystem and API reorganized in some way.
I might have to do a quick compare once I get things imported and see
exactly what the numbers are.
Craig
* Re: Efficient way to import snapshots?
2007-07-30 18:07 Efficient way to import snapshots? Craig Boston
2007-07-30 18:56 ` Linus Torvalds
@ 2007-07-30 21:54 ` David Kastrup
1 sibling, 0 replies; 27+ messages in thread
From: David Kastrup @ 2007-07-30 21:54 UTC (permalink / raw)
To: Craig Boston; +Cc: git
Craig Boston <craig@olyun.gank.org> writes:
> So far the main snag I've found is that AFAIK there's no equivalent to
> "svk import" to load a big tree (~37000 files) into a branch and commit
> the changes. Here's the procedure I've come up with:
>
> cd /path/to/git/repo
> git checkout vendor_branch_X
> git rm -r .
> cp -R /path/to/cvs/checkout_X/* ./
> git add .
> git commit -m"Import yyyymmdd snapshot"
I have not tried it, but shouldn't something like the following work?
cd /path/to/cvs/checkout_X
git --git-dir=/path/to/git/repo/.git reset vendor_branch_X
git --git-dir=/path/to/git/repo/.git add .
git --git-dir=/path/to/git/repo/.git commit -a -m "Import yyyymmdd snapshot"
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: Efficient way to import snapshots?
2007-07-30 19:52 ` Linus Torvalds
2007-07-30 20:10 ` Craig Boston
2007-07-30 21:04 ` Junio C Hamano
@ 2007-07-30 21:55 ` Junio C Hamano
2007-07-30 23:27 ` Linus Torvalds
2007-07-30 22:20 ` Craig Boston
3 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2007-07-30 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Junio? I _thought_ we already took the index into account with "git add",
> but we obviously don't.
I think 366bfcb6 "broke" it by moving the read_cache() call down,
because it wanted the directory walking code to grab paths that
are already in the index. The change serves its purpose, but
introduces this regression: the responsibility of avoiding
unnecessary reindexing by matching the cached stat information
now falls nowhere.
We would need to do something like this patch, perhaps? This
function has three callers, two in builtin-add and another in
builtin-mv.
---
read-cache.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/read-cache.c b/read-cache.c
index a363f31..c346d88 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -380,7 +380,7 @@ static int index_name_pos_also_unmerged(struct index_state *istate,
int add_file_to_index(struct index_state *istate, const char *path, int verbose)
{
- int size, namelen;
+ int size, namelen, pos;
struct stat st;
struct cache_entry *ce;
@@ -414,6 +414,13 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
ce->ce_mode = ce_mode_from_stat(ent, st.st_mode);
}
+ pos = index_name_pos(istate, ce->name, namelen);
+ if (0 <= pos && !ie_modified(istate, istate->cache[pos], &st, 1)) {
+ /* Nothing changed, really */
+ free(ce);
+ return 0;
+ }
+
if (index_path(ce->sha1, path, &st, 1))
die("unable to index file %s", path);
if (add_index_entry(istate, ce, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE))
* Re: Efficient way to import snapshots?
2007-07-30 19:52 ` Linus Torvalds
` (2 preceding siblings ...)
2007-07-30 21:55 ` Junio C Hamano
@ 2007-07-30 22:20 ` Craig Boston
2007-07-30 23:30 ` Linus Torvalds
3 siblings, 1 reply; 27+ messages in thread
From: Craig Boston @ 2007-07-30 22:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
>
> [ snip lots of helpful comments from various people ]
>
I just wanted to say thanks to Linus and Junio and everyone who
commented, I think I have a much more workable solution now. With my
brute-force remove and re-add everything script the times for import
looked like this:
Importing /compile/co/RELENG_4 (no changes):
svk import: 166.86 seconds
git: 455.82 seconds
Importing /compile/co/RELENG_6:
svk import: 203.69 seconds
git: 796.48 seconds
Importing /compile/co/HEAD:
svk import: 243.90 seconds
git: 837.13 seconds
Ok, so I remembered wrong, git was only 4x slower. Still, I knew it
could do better than that...
After transplanting the .git directory from 3 cloned repositories
checked out to the appropriate branch into the CVS checkout directories,
priming them with a 'git status', and using the git ls-files | git
update-index trick followed by commit -a, here are the revised times:
# On branch cvs_RELENG_4
nothing to commit (working directory clean)
git: 67.65 seconds
Created commit 106bc0b: Import 20070730 snapshot
7 files changed, 259 insertions(+), 75 deletions(-)
Git repository at /compile/co/RELENG_6/src updated
git: 62.02 seconds
Created commit 776031b: Import 20070730 snapshot
86 files changed, 10929 insertions(+), 587 deletions(-)
[snip lots of lines for added files]
Git repository at /compile/co/HEAD/src updated
git: 61.77 seconds
_MUCH_ better. I knew it could go faster than that :-)
Again, thanks for all the help. I look forward to seeing what else git
can do!
Craig
* Re: Efficient way to import snapshots?
2007-07-30 21:04 ` Junio C Hamano
@ 2007-07-30 23:19 ` Linus Torvalds
0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2007-07-30 23:19 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Craig Boston, Git Mailing List
On Mon, 30 Jul 2007, Junio C Hamano wrote:
>
> By the way, the above "something more complex" may be a simple
> "git add -u".
No, that doesn't add new files, only already tracked ones.
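A quick runnable check of that distinction (behavior assumed unchanged in current git): "git add -u" stages modifications and deletions of already tracked files, but skips new files entirely:

```shell
set -e
top=$(mktemp -d); cd "$top"
git init -q .
git config user.name tester
git config user.email tester@example.com

echo 'old' > tracked.txt
git add . && git commit -q -m 'base'

echo 'new'   > tracked.txt     # modification: add -u stages this
echo 'brand' > untracked.txt   # new file:     add -u skips this

git add -u
git status --porcelain         # M  tracked.txt / ?? untracked.txt
```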
Linus
* Re: Efficient way to import snapshots?
2007-07-30 21:55 ` Junio C Hamano
@ 2007-07-30 23:27 ` Linus Torvalds
2007-07-30 23:59 ` Junio C Hamano
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2007-07-30 23:27 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Craig Boston, Git Mailing List
On Mon, 30 Jul 2007, Junio C Hamano wrote:
>
> We would need to do something like this patch, perhaps? This
> function has three callers, two in builtin-add and another in
> builtin-mv.
I think you need to check that ce is in stage 0 too.
Linus
* Re: Efficient way to import snapshots?
2007-07-30 22:20 ` Craig Boston
@ 2007-07-30 23:30 ` Linus Torvalds
2007-07-31 1:17 ` Craig Boston
2007-07-31 6:23 ` David Kastrup
0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2007-07-30 23:30 UTC (permalink / raw)
To: Craig Boston; +Cc: Junio C Hamano, Git Mailing List
On Mon, 30 Jul 2007, Craig Boston wrote:
>
> # On branch cvs_RELENG_4
> nothing to commit (working directory clean)
> git: 67.65 seconds
So I _seriously_ hope that about 65 of those 67 seconds was the "cvs
update -d" or something like that.
Anything that takes a minute in git is way way *way* too slow. Any
half-way normal git operations should take less than a second.
Linus
* Re: Efficient way to import snapshots?
2007-07-30 23:27 ` Linus Torvalds
@ 2007-07-30 23:59 ` Junio C Hamano
2007-07-31 0:45 ` Linus Torvalds
0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2007-07-30 23:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 30 Jul 2007, Junio C Hamano wrote:
>>
>> We would need to do something like this patch, perhaps? This
>> function has three callers, two in builtin-add and another in
>> builtin-mv.
>
> I think you need to check that ce is in stage 0 too.
I guess so, but can higher stage entries have cached stat
information that is valid and matches the working tree?
---
read-cache.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/read-cache.c b/read-cache.c
index a363f31..9c00ccb 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -380,7 +380,7 @@ static int index_name_pos_also_unmerged(struct index_state *istate,
int add_file_to_index(struct index_state *istate, const char *path, int verbose)
{
- int size, namelen;
+ int size, namelen, pos;
struct stat st;
struct cache_entry *ce;
@@ -414,6 +414,15 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
ce->ce_mode = ce_mode_from_stat(ent, st.st_mode);
}
+ pos = index_name_pos(istate, ce->name, namelen);
+ if (0 <= pos &&
+ !ce_stage(istate->cache[pos]) &&
+ !ie_modified(istate, istate->cache[pos], &st, 1)) {
+ /* Nothing changed, really */
+ free(ce);
+ return 0;
+ }
+
if (index_path(ce->sha1, path, &st, 1))
die("unable to index file %s", path);
if (add_index_entry(istate, ce, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE))
* Re: Efficient way to import snapshots?
2007-07-30 23:59 ` Junio C Hamano
@ 2007-07-31 0:45 ` Linus Torvalds
2007-07-31 0:47 ` Junio C Hamano
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2007-07-31 0:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Craig Boston, Git Mailing List
On Mon, 30 Jul 2007, Junio C Hamano wrote:
>
> I guess so, but can higher stage entries have cached stat
> information that are valid and match the working tree?
Probably unlikely, but I could imagine that it's the case for things that
failed to merge entirely (i.e. binaries), where you end up just saying "pick
the old/base binary", and it ends up matching in stage 1.
Linus
* Re: Efficient way to import snapshots?
2007-07-31 0:45 ` Linus Torvalds
@ 2007-07-31 0:47 ` Junio C Hamano
0 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2007-07-31 0:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 30 Jul 2007, Junio C Hamano wrote:
>>
>> I guess so, but can higher stage entries have cached stat
>> information that are valid and match the working tree?
>
> Probably unlikely, but I could imagine that it's the case for things that
> failed to merge entirely (ie binaries), where you end up just saying "pick
> the old/base binary", and it ends up matching in stage1.
Probably. In any case, what I'll commit will have the stage #0
check, just to be safe.
Thanks for the sanity check. Really appreciate it.
* Re: Efficient way to import snapshots?
2007-07-30 23:30 ` Linus Torvalds
@ 2007-07-31 1:17 ` Craig Boston
2007-07-31 1:44 ` Linus Torvalds
2007-07-31 6:23 ` David Kastrup
1 sibling, 1 reply; 27+ messages in thread
From: Craig Boston @ 2007-07-31 1:17 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
On Mon, Jul 30, 2007 at 04:30:22PM -0700, Linus Torvalds wrote:
> > # On branch cvs_RELENG_4
> > nothing to commit (working directory clean)
> > git: 67.65 seconds
>
> So I _seriously_ hope that about 65 of those 67 seconds was the "cvs
> update -d" or something like that.
No, the only thing included in that is
git ls-files -o | git update-index --add --stdin
git commit -a -m "${COMMITMSG}"
> Anything that takes a minute in git is way way *way* too slow. Any
> half-way normal git operations should take less than a second.
That said, I don't think it's git's fault. I think most of the time is
spent calling stat() on all the files. The machine that took 60 seconds
isn't what I'd call top-of-the-line:
1st or maybe 2nd-gen Willamette CPU
512MB memory (stupid motherboard that won't accept more)
Slow disks in RAID-5 configuration
Running ZFS with less than half of the recommended minimum memory, to
the point where I had to reduce the number of vnodes that the kernel is
allowed to cache to avoid running out of KVA
A simple find(1) over the CVS checkout directory takes almost as long.
I don't think it has enough memory to cache the whole thing. Actually I
know it can't, since maxvnodes is set to 25,000 and there are 37,000 files
in the cvs checkout, so it will have to pull some directory entries from
disk regardless.
Just to be sure, I copied the cvs checkout directory and git repository
to a newer, faster dual-core machine with plenty of memory available for
caching.
The first run of 'git status' (cold cache):
git status 1.08s user 3.68s system 13% cpu 34.043 total
The second run:
git status 1.05s user 2.68s system 85% cpu 4.373 total
Based on that I'm fairly confident that most of the 60 seconds is being
spent waiting on data from the disks. On a tmpfs filesystem I can get
it even faster (1.897 seconds).
As it's a file server for which network is the usual bottleneck, and all
the git operations will be running out of cron, I'm not too worried
about it.
Craig
* Re: Efficient way to import snapshots?
2007-07-31 1:17 ` Craig Boston
@ 2007-07-31 1:44 ` Linus Torvalds
2007-07-31 4:23 ` Theodore Tso
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2007-07-31 1:44 UTC (permalink / raw)
To: Craig Boston; +Cc: Junio C Hamano, Git Mailing List
On Mon, 30 Jul 2007, Craig Boston wrote:
> >
> > So I _seriously_ hope that about 65 of those 67 seconds was the "cvs
> > update -d" or something like that.
>
> No, the only thing included in that is
>
> git ls-files -o | git update-index --add --stdin
> git commit -a -m "${COMMITMSG}"
Ouch.
> > Anything that takes a minute in git is way way *way* too slow. Any
> > half-way normal git operations should take less than a second.
>
> That said, I don't think it's git's fault. I think most of the time is
> spent calling stat() on all the files. The machine that took 60 seconds
> isn't what I'd call top-of-the-line:
>
> 1st or maybe 2nd-gen Willamette CPU
> 512MB memory (stupid motherboard that won't accept more)
> Slow disks in RAID-5 configuration
> Running ZFS with less than half of the recommended minimum memory, to
> the point where I had to reduce the number of vnodes that the kernel is
> allowed to cache to avoid running out of KVA
Oh, ok. Solaris.
With slow pathname lookup, and hard limits on the inode cache sizes.
Git really normally avoids reading the data, so even in 512M you should
_easily_ be able to cache the metadata (directory and inodes), which is
all you need. But yeah, Linux will probably do that a whole lot more
aggressively than Solaris does.
[ And to be honest, any CVS update would probably have blown the caches on
Linux too - I don't know what all CVS ends up doing, but from past
experience, I'll bet it's not good ]
But just for comparison, on a lowly 480MB mac-mini (running Linux, not OS
X, of course - and the 480MB is because the graphics is UMA and takes
part of the 512MB total), and the kernel archive (which is just 22k files,
not 37k), with a laptop drive:
- cold-cache "git status":
real 0m17.975s
user 0m1.098s
sys 0m0.539s
- rerunning it immediately afterwards:
real 0m1.079s
user 0m0.896s
sys 0m0.183s
so the target really _should_ generally be one second.
But yeah, in order to hit that target, you definitely do want to keep the
metadata cached, and I guess that means more than 512M on Solaris.
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-31 1:44 ` Linus Torvalds
@ 2007-07-31 4:23 ` Theodore Tso
2007-07-31 13:53 ` Craig Boston
0 siblings, 1 reply; 27+ messages in thread
From: Theodore Tso @ 2007-07-31 4:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Junio C Hamano, Git Mailing List
On Mon, Jul 30, 2007 at 06:44:13PM -0700, Linus Torvalds wrote:
> > 1st or maybe 2nd-gen Willamette CPU
> > 512MB memory (stupid motherboard that won't accept more)
> > Slow disks in RAID-5 configuration
> > Running ZFS with less than half of the recommended minimum memory, to
> > the point where I had to reduce the number of vnodes that the kernel is
> > allowed to cache to avoid running out of KVA
>
> Oh, ok. Solaris.
>
> With slow pathname lookup, and hard limits on the inode cache sizes.
>
> Git really normally avoids reading the data, so even in 512M you should
> _easily_ be able to cache the metadata (directory and inodes), which is
> all you need. But yeah, Linux will probably do that a whole lot more
> aggressively than Solaris does.
I also have a suspicion that ZFS's "never overwrite metadata" is
causing its inodes to be scattered all over the disk, so the lack
of caching is hurting even more than it would for other filesystems.
(Put another way, there's probably a really good reason for ZFS's
minimum memory recommendations.)
Craig, it might be interesting to see what sort of results you get if
you use UFS instead of ZFS in your low-memory constrained
environment...
- Ted
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-30 23:30 ` Linus Torvalds
2007-07-31 1:17 ` Craig Boston
@ 2007-07-31 6:23 ` David Kastrup
2007-07-31 7:54 ` Florian Weimer
1 sibling, 1 reply; 27+ messages in thread
From: David Kastrup @ 2007-07-31 6:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Junio C Hamano, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 30 Jul 2007, Craig Boston wrote:
>>
>> # On branch cvs_RELENG_4
>> nothing to commit (working directory clean)
>> git: 67.65 seconds
>
> So I _seriously_ hope that about 65 of those 67 seconds was the "cvs
> update -d" or something like that.
>
> Anything that takes a minute in git is way way *way* too slow. Any
> half-way normal git operations should take less than a second.
I tried a git-add . on a TeX live tree (lots of itsy files). About
80% of the processor time was wait.
I think that SHA1 is costly enough that the processor(s) could get
saturated when enough is done in parallel.
There is enough potential for multithreading here to make any Summer
of Code student weep with joy:
a) one thread for opendir/readdir at every directory level
b) one thread for stating the files in readdir order (more likely to
correspond to the disk layout than sorted order)
c) one thread on each directory level doing a mergesort on a different
link field (some merge passes could be parallelized even, but let's
not get overexcited)
d) asynch I/O requesting the data for all files to be submitted to
SHA1
e) one thread (but no more threads than there are CPUs) per
independent file/tree for doing SHA1
f) asynch I/O reading the index sequentially
g) one thread doing a merge pass of already sorted stuff (this can
start once the top level directory has been read and sorted
completely, possibly having to stop until a complete subdirectory
comes in).
h) asynch I/O writing out the results of the merge sequentially
In fact, git-ls-files|git-add --stdin is already exploiting a bit of
parallelism (and will probably profit from CFS by making much more use
of the buffering capacity of the pipe). It is counterintuitive that
hand-built chains work more efficiently than explicit git commands.
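The parallel hashing stage (e) could be sketched like this — a minimal
illustration only, not git's actual implementation, with the worker
count and the directory walk picked arbitrarily:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def sha1_of_file(path):
    """Hash one file's contents, reading in chunks to bound memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_tree(root, workers=4):
    """Walk a tree and SHA-1 every file with a small thread pool.

    The pool size is a guess; the right number depends on whether
    the workload is CPU-bound (SHA-1) or I/O-bound (reading), and
    on whether the hash implementation releases the interpreter
    lock while digesting large buffers.
    """
    paths = [os.path.join(d, name)
             for d, _, files in os.walk(root)
             for name in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sha1_of_file, paths))
```

Whether threads actually pay off here is exactly the open question in
the list above: on an I/O-starved box most of the wall time is seek
latency, not SHA-1.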
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-31 6:23 ` David Kastrup
@ 2007-07-31 7:54 ` Florian Weimer
2007-07-31 8:48 ` David Kastrup
0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2007-07-31 7:54 UTC (permalink / raw)
To: git
* David Kastrup:
> a) one thread for opendir/readdir at every directory level
> b) one thread for stating the files in readdir order (more likely to
> correspond to the disk layout than sorted order)
Not true for ext3. You need to sort by the d_ino field. This also
tends to benefit other file systems more than readdir order.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-31 7:54 ` Florian Weimer
@ 2007-07-31 8:48 ` David Kastrup
0 siblings, 0 replies; 27+ messages in thread
From: David Kastrup @ 2007-07-31 8:48 UTC (permalink / raw)
To: git
Florian Weimer <fw@deneb.enyo.de> writes:
> * David Kastrup:
>
>> a) one thread for opendir/readdir at every directory level
>> b) one thread for stating the files in readdir order (more likely to
>> correspond to the disk layout than sorted order)
>
> Not true for ext3. You need to sort by the d_ino field. This also
> tends to benefit other file systems more than readdir order.
Uh, yes, for stating it would (I actually thought of alphabetic sort
order here, and that would not likely help). So we just introduce another
thread a2) that sorts the partial list from a) as long as b) is still
busy stating... But I guess that a2) would be a thread that will
likely not cause much of a speedup.
--
David Kastrup
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-31 4:23 ` Theodore Tso
@ 2007-07-31 13:53 ` Craig Boston
2007-07-31 15:50 ` Linus Torvalds
0 siblings, 1 reply; 27+ messages in thread
From: Craig Boston @ 2007-07-31 13:53 UTC (permalink / raw)
To: Theodore Tso; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List
On Tue, Jul 31, 2007 at 12:23:47AM -0400, Theodore Tso wrote:
> On Mon, Jul 30, 2007 at 06:44:13PM -0700, Linus Torvalds wrote:
> >
> > Oh, ok. Solaris.
For reference, as I mentioned to Linus & Junio in an excessively
verbose message (probably uninteresting to most of the git-list
members) about the performance characteristics of ZFS, I'm actually
running FreeBSD-current with the experimental port of ZFS.
So, even less tested & tuned than it is on Solaris. Part of what I'm
doing is stress testing the filesystem on machines with less than the
recommended memory. Even if performance is suboptimal, it should at
least not break anything.
> Craig, it might be interesting to see what sort of results you get if
> you use UFS instead of ZFS in your low-memory constrained
> environment...
I just so happen to be rebuilding the zfs pool on that server this
morning in order to add more swap, so your wish(1tcl) is my rcmd(3).
Same machine, on a UFS filesystem (single disk, since zfs was doing the
RAID), with the cache tuning parameters reset back to defaults:
First 'git status' after a reboot:
git status 2.23s user 2.23s system 17% cpu 24.987 total
Second:
git status 1.81s user 1.34s system 98% cpu 3.188 total
Third:
git status 1.76s user 1.45s system 98% cpu 3.252 total
So I definitely think the problem is just that with its increased
overhead, ZFS simply can't keep all the metadata in the cache with the
available memory.
Craig
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-31 13:53 ` Craig Boston
@ 2007-07-31 15:50 ` Linus Torvalds
2007-07-31 16:15 ` Theodore Tso
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2007-07-31 15:50 UTC (permalink / raw)
To: Craig Boston; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List
On Tue, 31 Jul 2007, Craig Boston wrote:
>
> I just so happen to be rebuilding the zfs pool on that server this
> morning in order to add more swap, so your wish(1tcl) is my rcmd(3).
You, sir, are a total geek.
I'm not sure if that's a compliment or a curse.
> Same machine, on a UFS filesystem (single disk, since zfs was doing the
> RAID), with the cache tuning parameters reset back to defaults:
>
> First 'git status' after a reboot:
> git status 2.23s user 2.23s system 17% cpu 24.987 total
>
> Second:
> git status 1.81s user 1.34s system 98% cpu 3.188 total
>
> Third:
> git status 1.76s user 1.45s system 98% cpu 3.252 total
>
> So I definitely think the problem is just that with its increased
> overhead, ZFS simply can't keep all the metadata in the cache with the
> available memory.
Very interesting. And thanks. The whole "ZFS is great" internet meme seems
to be partly due to not a lot of people having used or compared it in real
life. I'm sure it's wonderful for some things, but it clearly does have a
lot of downsides too.
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient way to import snapshots?
2007-07-31 15:50 ` Linus Torvalds
@ 2007-07-31 16:15 ` Theodore Tso
0 siblings, 0 replies; 27+ messages in thread
From: Theodore Tso @ 2007-07-31 16:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Craig Boston, Junio C Hamano, Git Mailing List
On Tue, Jul 31, 2007 at 08:50:19AM -0700, Linus Torvalds wrote:
> Very interesting. And thanks. The whole "ZFS is great" internet meme seems
> to be partly due to not a lot of people having used or compared it in real
> life. I'm sure it's wonderful for some things, but it clearly does have a
> lot of downsides too.
I'm pretty sure Sun's marketing machine is also very much part of it;
I'm not convinced all of the blogs pitching ZFS as the most wonderful
thing since sliced bread were all, shall we say, unbiased or
uninfluenced by Sun. Add to that the fact that the Solaris 10 License
Agreement prohibits you from publishing benchmark numbers without
Sun's permission, and it's not at all surprising that most of what
people have heard of ZFS has all been the positive stuff. But with
more people playing with ZFS in the FreeBSD and OpenSolaris camps, I'm
sure a more balanced view that shows its advantages *and*
disadvantages will start showing up.
- Ted
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2007-07-31 16:16 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-30 18:07 Efficient way to import snapshots? Craig Boston
2007-07-30 18:56 ` Linus Torvalds
2007-07-30 19:29 ` Craig Boston
2007-07-30 19:52 ` Linus Torvalds
2007-07-30 20:10 ` Craig Boston
2007-07-30 21:29 ` Junio C Hamano
2007-07-30 21:49 ` Craig Boston
2007-07-30 21:04 ` Junio C Hamano
2007-07-30 23:19 ` Linus Torvalds
2007-07-30 21:55 ` Junio C Hamano
2007-07-30 23:27 ` Linus Torvalds
2007-07-30 23:59 ` Junio C Hamano
2007-07-31 0:45 ` Linus Torvalds
2007-07-31 0:47 ` Junio C Hamano
2007-07-30 22:20 ` Craig Boston
2007-07-30 23:30 ` Linus Torvalds
2007-07-31 1:17 ` Craig Boston
2007-07-31 1:44 ` Linus Torvalds
2007-07-31 4:23 ` Theodore Tso
2007-07-31 13:53 ` Craig Boston
2007-07-31 15:50 ` Linus Torvalds
2007-07-31 16:15 ` Theodore Tso
2007-07-31 6:23 ` David Kastrup
2007-07-31 7:54 ` Florian Weimer
2007-07-31 8:48 ` David Kastrup
2007-07-30 21:22 ` Jakub Narebski
2007-07-30 21:54 ` David Kastrup