Git development
* oprofile on svn import
From: Jon Smirl @ 2006-06-14  1:10 UTC (permalink / raw)
  To: git

I'm going back to cvsimport tomorrow. My svn import that had been
running for five days got killed this morning when the city decided to
move the telephone pole that provides my electricity.

Some oprofile data; this doesn't make a lot of sense to me. Why is it
in libcrypto so much?

 12632739 30.6077 /lib/libcrypto.so.0.9.8a
 11762639 28.4995 /home/good/vmlinux
  6310191 15.2889 /lib/libc-2.4.so
  2498812  6.0543 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
  2079975  5.0395 /usr/local/bin/git-update-index
  1103116  2.6727 /usr/lib/libz.so.1.2.3
   617395  1.4959 /usr/lib/libapr-1.so.0.2.2
   484625  1.1742 /usr/local/bin/git-read-tree

kernel breakdown

2035561  16.4450  copy_page_range
1110813   8.9741  get_page_from_freelist
851064    6.8756  check_poison_obj
759296    6.1342  unmap_vmas
670659    5.4181  release_pages
667657    5.3939  page_remove_rmap
595826    4.8136  page_fault
241962    1.9548  __copy_from_user_ll
185876    1.5017  do_wp_page
176506    1.4260  do_page_fault


I reset the statistics and took another snapshot half an hour later.

  2232310 44.3485 /home/good/vmlinux
   757114 15.0413 /lib/libcrypto.so.0.9.8a
   507282 10.0780 /lib/libc-2.4.so
   203440  4.0417 /usr/lib/libz.so.1.2.3
   179105  3.5582 /usr/lib/libapr-1.so.0.2.2
   169724  3.3718 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
   114384  2.2724 /usr/local/bin/git-update-index
   102350  2.0334 /usr/lib/libsvn_subr-1.so.0.0.0
    74673  1.4835 /usr/lib/libaprutil-1.so.0.2.2
    69987  1.3904 /usr/lib/libsvn_fs_fs-1.so.0.0.0

Kernel:

543264   21.2518  copy_page_range
243383    9.5208  check_poison_obj
227788    8.9108  unmap_vmas
161806    6.3296  page_remove_rmap
153201    5.9930  release_pages
119092    4.6587  page_fault
100116    3.9164  get_page_from_freelist
45014     1.7609  do_wp_page
42130     1.6481  vm_normal_page
34804     1.3615  poison_obj
28231     1.1044  do_page_fault
27403     1.0720  __handle_mm_fault
24558     0.9607  __copy_to_user_ll
20618     0.8066  flush_tlb_page


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: git-cvsimport doesn't quite work, wrt branches
From: Martin Langhoff @ 2006-06-14  1:56 UTC (permalink / raw)
  To: Keith Packard
  Cc: Linus Torvalds, Jim Meyering, Git Mailing List, Matthias Urlichs,
	Yann Dirson, Pavel Roskin
In-Reply-To: <1150241459.20536.98.camel@neko.keithp.com>

On 6/14/06, Keith Packard <keithp@keithp.com> wrote:
> On Wed, 2006-06-14 at 10:55 +1200, Martin Langhoff wrote:
>
> > In terms of history parsing, parsecvs and cvs2svn are similar. I like
> > cvs2svn "many passes" approach better, though the Python source is
> > really messy. A good thing about cvs2svn is that it is a lot more
> > conservative WRT memory use.
>
> I will try to fix parsecvs so it doesn't take so much memory. Of course,
> my goal was to import various X.org repositories which have horrible
> issues, but aren't all that huge. And, for them, it works just fine.

Would it be possible to have it parse the RCS histories from a remote repo?

I had forgotten, but that's something else that the cvsps +
git-cvsimport combo can do. In short, to replace cvsps+git-cvsimport
...

 + not memory bound -- or at least must be able to import large
repositories (mozilla, gentoo) with a reasonable amount of memory

 + must work local and remote (of course local can be faster)

 + must do incrementals reasonably well

> I'd like some help figuring out how to do incremental imports with
> parsecvs. As parsecvs already constructs the project history from the
> present into the past, it should be possible to "notice" when it hits
> existing bits in the repository and stop automatically. I think this
> will just take saving a bit of state in the git repository to mark where
> in CVS the tips of each branch come from.

Ok. Before starting to read the RCS files, I would look at all the
branch tips in the git repo, and read some metadata of the last commit
of each head into memory (author, commitmsg, timestamp, diffstat).

When parsing RCS files and building changesets to import, compare them
with the 'head' data. The timestamp granularity is seconds which is
pretty coarse -- you can ask for history after those timestamps, but
there's the risk of missing commits (this affects git-cvsimport today,
and I'm thinking how to fix it there). So borderline changesets should
be compared against the metadata you have.

There is the chance that your earlier import caught a commit partway
through, so you may end up putting in the 'rest' of the commit. That's
why diffstat can be useful.
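
The comparison described above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not what parsecvs or
git-cvsimport actually do: the CommitMeta record, the classify() function
and its return labels are hypothetical names, and the set of touched
paths stands in for the diffstat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitMeta:
    """Metadata remembered for the last commit on a git head (hypothetical)."""
    author: str
    message: str
    timestamp: int        # seconds, the granularity discussed above
    files: frozenset      # paths touched; a stand-in for the diffstat


def classify(changeset, head):
    """Decide what to do with a changeset parsed from the RCS files.

    Returns one of:
      "skip"      - older than the head, or already fully imported
      "import"    - strictly newer than the head
      "partial"   - same commit as the head, but the earlier import
                    missed some of its files
      "ambiguous" - same second as the head but a different commit;
                    the caller has to decide (assumption of this sketch)
    """
    if changeset.timestamp < head.timestamp:
        return "skip"
    if changeset.timestamp > head.timestamp:
        return "import"
    # Borderline case: same second as the last imported commit.
    same_commit = (changeset.author == head.author
                   and changeset.message == head.message)
    if not same_commit:
        return "ambiguous"
    if changeset.files <= head.files:
        return "skip"
    return "partial"
```

Only the last commit of each head is kept in memory, so the extra
comparison is paid for borderline changesets only.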

Is that useful?


cheers,



martin


* Re: oprofile on svn import
From: Eric Wong @ 2006-06-14  2:01 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git
In-Reply-To: <9e4733910606131810ya6aa585m5d2349f651b01492@mail.gmail.com>

Jon Smirl <jonsmirl@gmail.com> wrote:
> I'm going back to cvsimport tomorrow. My svn import that had been
> running for five days got killed this morning when the city decided to
> move the telephone pole that provides my electricity.
> 
> Some oprofile data; this doesn't make a lot of sense to me. Why is it
> in libcrypto so much?

The sha1 calculation is done in libcrypto, afaik.

Anybody want to see how my latest patches to git-svn (and using SVN perl
libraries) stacks up against the mozilla repo?  Speedwise, I don't
expect git-svn to be too different than git-svnimport, but it should use
much less memory (I'll probably port the hacks to git-svnimport, too).

I'll see about freeing up one of my machines to test the mozilla repo.
Unfortunately, all of my hardware is a few years old and not extremely
fast.

-- 
Eric Wong


* Re: [PATCH 6/8] Make git-update-ref a builtin
From: Shawn Pearce @ 2006-06-14  2:22 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <448F1E68.5090504@etek.chalmers.se>

Lukas Sandström <lukass@etek.chalmers.se> wrote:
> Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>
> ---
>  Makefile                             |    7 ++++---
>  update-ref.c => builtin-update-ref.c |    5 ++++-
>  builtin.h                            |    1 +
>  git.c                                |    3 ++-
>  4 files changed, 11 insertions(+), 5 deletions(-)

Thanks for doing this.  I know I had written this change and I was
pretty sure I had sent it to Junio a while ago but I guess it got
lost in the shuffle and I just failed to follow through with it
when it didn't show up in `next`.

-- 
Shawn.


* Re: oprofile on svn import
From: Jon Smirl @ 2006-06-14  2:32 UTC (permalink / raw)
  To: git
In-Reply-To: <9e4733910606131810ya6aa585m5d2349f651b01492@mail.gmail.com>

From the previous data it is obvious that I had slab debugging
enabled. I usually never notice having it turned on, but in this case it
makes a lot of difference.

New numbers without slab debug. Could forking off the git tasks be
causing all of this vm load?

[root@jonsmirl jonsmirl]# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 2  0      0  13504  91220 563280    0    0   299   232  244   426 23 18 53  6  0
 2  0      0  10900  91344 565128    0    0   169   464  481   737 26 23 48  2  0
 2  0      0  10804  91436 564832    0    0   196   650  478   780 25 24 49  3  0
 4  0      0  13516  91512 561696    0    0   166   612  474   790 26 23 49  2  0
 1  0      0  10928  91632 563548    0    0   124   471  464   789 24 25 48  2  0
 1  0      0  12312  91684 562000    0    0   179   688  472   783 26 23 48  3  0
 1  0      0  13232  91748 560712    0    0    51   198  445   794 25 26 48  1  0

  9951967 44.5102 /home/good/vmlinux
  3192131 14.2768 /lib/libcrypto.so.0.9.8a
  2207857  9.8747 /lib/libc-2.4.so
  1587518  7.1002 /usr/lib/libz.so.1.2.3
   663114  2.9658 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
   517463  2.3144 /lib/ld-2.4.so
   435100  1.9460 /usr/lib/libapr-1.so.0.2.2
   430292  1.9245 /usr/local/bin/git-update-index
   285157  1.2754 /usr/local/bin/git-read-tree

2331728  22.8834  copy_page_range
1076769  10.5673  unmap_vmas
667975    6.5555  page_remove_rmap
663844    6.5149  page_fault
654668    6.4249  release_pages
440547    4.3235  get_page_from_freelist
245142    2.4058  do_wp_page
174656    1.7141  vm_normal_page
155185    1.5230  __handle_mm_fault
133584    1.3110  do_page_fault
131456    1.2901  __d_lookup
94194     0.9244  __link_path_walk
92927     0.9120  flush_tlb_page
91775     0.9007  find_get_page
85927     0.8433  copy_process


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: oprofile on svn import
From: Jon Smirl @ 2006-06-14  2:39 UTC (permalink / raw)
  To: Eric Wong; +Cc: git
In-Reply-To: <20060614020108.GB12083@hand.yhbt.net>

On 6/13/06, Eric Wong <normalperson@yhbt.net> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > I'm going back to cvsimport tomorrow. My svn import that had been
> > running for five days got killed this morning when the city decided to
> > move the telephone pole that provides my electricity.
> >
> > Some oprofile data; this doesn't make a lot of sense to me. Why is it
> > in libcrypto so much?
>
> The sha1 calculation is done in libcrypto, afaik.

That makes sense, but it's eating up 14% of my CPU in a long sample.

> Anybody want to see how my latest patches to git-svn (and using SVN perl
> libraries) stacks up against the mozilla repo?  Speedwise, I don't
> expect git-svn to be too different than git-svnimport, but it should use
> much less memory (I'll probably port the hacks to git-svnimport, too).

Can svnimport be rewritten to avoid calling fork? If I am reading the
oprofiles correctly, fork is very expensive, especially when the
svnimport task grows to 600MB.

I have an import running but post your code when it is ready and I can
try it on the next run. They always seem to fail so there will
probably be another run.

> I'll see about freeing up one of my machines to test the mozilla repo.
> Unfortunately, all of my hardware is a few years old and not extremely
> fast.
>
> --
> Eric Wong
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: oprofile on svn import
From: Eric Wong @ 2006-06-14  3:02 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git, Matthias Urlichs, Linus Torvalds
In-Reply-To: <9e4733910606131939h35b2278bvaa296459ea061621@mail.gmail.com>

Linus: I hope I'm right on [1] (the stuff about fork).

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 6/13/06, Eric Wong <normalperson@yhbt.net> wrote:
> >Jon Smirl <jonsmirl@gmail.com> wrote:
> >> I'm going back to cvsimport tomorrow. My svn import that had been
> >> running for five days got killed this morning when the city decided to
> >> move the telephone pole that provides my electricity.
> >>
> >> Some oprofile data; this doesn't make a lot of sense to me. Why is it
> >> in libcrypto so much?
> >
> >The sha1 calculation is done in libcrypto, afaik.
> 
> That makes sense, but it's eating up 14% of my CPU in a long sample.
> 
> >Anybody want to see how my latest patches to git-svn (and using SVN perl
> >libraries) stacks up against the mozilla repo?  Speedwise, I don't
> >expect git-svn to be too different than git-svnimport, but it should use
> >much less memory (I'll probably port the hacks to git-svnimport, too).
> 
> Can svnimport be rewritten to avoid calling fork? If I am reading the
> oprofiles correctly, fork is very expensive, especially when the
> svnimport task grows to 600MB.

I think the problem is the process growing to 600MB, and not the fork :)
git-svn avoids process growth pretty well from my tests with the gcc
repo.

See the fetch_lib() function in this patch on how I avoid process
growth by _using_ fork():

Subject: [PATCH 12/13] git-svn: add support for Perl SVN::* libraries
	(<115022175180-git-send-email-normalperson@yhbt.net>)

Perl processes (at least on my machines: 5.8.x, Linux x86) don't like to
release memory back to the OS when they're done using it (although Perl
can reuse the memory within the process itself).  This is why SVN::Pool
isn't very effective in many cases.

fork() will only duplicate memory for the pages that are changed by the
child, not the entire process[1].  So I fork children that run temporarily
to avoid accumulating memory usage inside the process.

This technique should probably be added to git-svnimport as well.
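
A rough illustration of the idea, in Python rather than Perl: do the
memory-hungry work in a short-lived forked child and hand only a small
result back to the parent, so whatever the child allocated goes away
when it exits. This is not the fetch_lib() code from the patch; the
run_in_child() and fetch_revision() names are made up for the sketch,
and it assumes a POSIX system where os.fork() is available.

```python
import os
import pickle


def run_in_child(func, *args):
    """Run func(*args) in a forked child and return its (pickled) result.

    Memory allocated by func lives only in the child; the parent's
    footprint grows only by the size of the returned value.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute, send the result through the pipe, exit at once.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(func(*args), out)
            status = 0
        finally:
            os._exit(status)
    # Parent: read everything the child wrote, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as src:
        data = src.read()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("child process failed")
    return pickle.loads(data)


def fetch_revision(rev):
    """Hypothetical stand-in for a memory-hungry fetch of one revision."""
    scratch = [rev] * 100_000   # large temporary allocation
    return len(scratch)


if __name__ == "__main__":
    print(run_in_child(fetch_revision, 7))
```

The trade-off is one fork per unit of work, which stays cheap only as
long as the parent itself stays small.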

> I have an import running but post your code when it is ready and I can
> try it on the next run. They always seem to fail so there will
> probably be another run.

I've posted two series of patches in the past few days that have yet
to be merged by Junio:

Subject: [PATCH] git-svn: bug fixes (some resends)
	<11500094252972-git-send-email-normalperson@yhbt.net>
Subject: [PATCH 0/13] git-svn: better branch support, SVN:: lib usage, feature additions
	<11502217352245-git-send-email-normalperson@yhbt.net>

-- 
Eric Wong


* Re: oprofile on svn import
From: Martin Langhoff @ 2006-06-14  3:32 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git
In-Reply-To: <9e4733910606131810ya6aa585m5d2349f651b01492@mail.gmail.com>

On 6/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> I'm going back to cvsimport tomorrow. My svn import that had been

For best results, make sure you remove the -a from the git-repack
line. Once it's done, run git-repack -a -d manually.

cheers,


martin


* Re: oprofile on svn import
From: Ryan Anderson @ 2006-06-14  4:48 UTC (permalink / raw)
  To: Eric Wong; +Cc: Jon Smirl, git
In-Reply-To: <20060614020108.GB12083@hand.yhbt.net>

On Tue, Jun 13, 2006 at 07:01:08PM -0700, Eric Wong wrote:
> Anybody want to see how my latest patches to git-svn (and using SVN perl
> libraries) stacks up against the mozilla repo?  Speedwise, I don't
> expect git-svn to be too different than git-svnimport, but it should use
> much less memory (I'll probably port the hacks to git-svnimport, too).

I've got access to a pretty good machine to run this on - where can I
grab the svn repo from?
(I can just grab the CVS one and convert it, first, as well, just point
me at that, if that's got more bandwidth.)

-- 

Ryan Anderson
  sometimes Pug Majere


* Re: oprofile on svn import
From: Jon Smirl @ 2006-06-14  5:26 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: Eric Wong, git
In-Reply-To: <20060614044802.GE30825@h4x0r5.com>

On 6/14/06, Ryan Anderson <ryan@michonline.com> wrote:
> On Tue, Jun 13, 2006 at 07:01:08PM -0700, Eric Wong wrote:
> > Anybody want to see how my latest patches to git-svn (and using SVN perl
> > libraries) stacks up against the mozilla repo?  Speedwise, I don't
> > expect git-svn to be too different than git-svnimport, but it should use
> > much less memory (I'll probably port the hacks to git-svnimport, too).
>
> I've got access to a pretty good machine to run this on - where can I
> grab the svn repo from?
> (I can just grab the CVS one and convert it, first, as well, just point
> me at that, if that's got more bandwidth.)

rsync -az cvs-mirror.mozilla.org::mozilla ~/mozilla/cvs-mirror
It took about three days for my machine to convert that cvs to svn.

I have the converted repo local but it is 8.2GB and I have 256kb up.

There is no real purpose in converting mozilla cvs to svn to git other
than to test the tools. My last attempt at svn to git ran five days
before I lost power. Towards the end it was getting significantly slower,
implying some kind of n-squared problem in the import process. The
idea was to see if cvsimport and svnimport both end up with the same
output.

I am going to use git-cvsimport on the mozilla repo but that tool
needs 2GB+ of physical RAM to run. I ordered 2GB more and it will be
here tomorrow. I have just been playing with the svn conversion while
I wait five days for my 2nd day air package to show up.


>
> --
>
> Ryan Anderson
>   sometimes Pug Majere
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Porcelain specific metadata under .git?
From: Shawn Pearce @ 2006-06-14  6:22 UTC (permalink / raw)
  To: git

So I'm reaching a point with my Eclipse plugin[*1*] where it's
actually doing something with a GIT repository and I want to store a
ref (to a tree, not a commit) under .git/refs/eclipse-workspaces to
help the plugin cache state between workbench restarts.  But there
doesn't really seem to be any policy on which paths under .git are
available for Porcelain and which definitely should be off-limits.

I already assume/know that refs/heads and refs/tags are completely
off-limits as they are for user refs only.

I also think the core GIT tools already assume that anything
directly under .git which is strictly a file and which is named
entirely with uppercase letters (aside from "HEAD") is strictly a
temporary/short-lived state type item (e.g. COMMIT_MSG) used by a
Porcelain.

But is it safe to say that ".git/refs/eclipse-workspaces" can probably
be used for this purpose?  :-)


[*1*] The Eclipse plugin is getting close to something that is worth
releasing as an early alpha for other developers.  I think I finally
found the last bug in the pack reading code and am now working on the
basic operations (add/remove/commit/status).  I hope to have all of
that working within a few days, at which point I'll publish/announce
a public GIT repository with the complete source code and an Eclipse
update site for those brave souls who might want to just install it.

-- 
Shawn.


* Repacking many disconnected blobs
From: Keith Packard @ 2006-06-14  7:17 UTC (permalink / raw)
  To: Git Mailing List; +Cc: keithp


parsecvs scans every ,v file and creates a blob for every revision of
every file right up front. Once these are created, it discards the
actual file contents and deals solely with the hash values.

The problem is that while this is going on, the repository consists
solely of disconnected objects, and I can't make git-repack put those
into pack objects. This leaves the directories bloated, and operations
within the tree quite sluggish. I'm importing a project with 30000 files
and 30000 revisions (the CVS repository is about 700MB), and after
scanning the files, and constructing (in memory) a complete revision
history, the actual construction of the commits is happening at about 2
per second, and about 70% of that time is in the kernel, presumably
playing around in the repository.

I'm assuming that if I could get these disconnected blobs all neatly
tucked into a pack object, things might go a bit faster.
-- 
keith.packard@intel.com



* Re: Repacking many disconnected blobs
From: Shawn Pearce @ 2006-06-14  7:29 UTC (permalink / raw)
  To: Keith Packard; +Cc: Git Mailing List
In-Reply-To: <1150269478.20536.150.camel@neko.keithp.com>

Keith Packard <keithp@keithp.com> wrote:
> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
> 
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated, and operations
> within the tree quite sluggish. I'm importing a project with 30000 files
> and 30000 revisions (the CVS repository is about 700MB), and after
> scanning the files, and constructing (in memory) a complete revision
> history, the actual construction of the commits is happening at about 2
> per second, and about 70% of that time is in the kernel, presumably
> playing around in the repository.
> 
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

What about running git-update-index using .git/objects as the
current working directory and adding all files in ??/* into the
index, then git-write-tree that index and git-commit-tree the tree.

When you are done you have a bunch of orphan trees and a commit
but these shouldn't be very big and I'd guess would prune out with
a repack if you don't hold a ref to the orphan commit.

-- 
Shawn.


* 'sparse' clone idea
From: Jakub Narebski @ 2006-06-14  8:23 UTC (permalink / raw)
  To: git

I wonder if the 'sparse clone' idea described below would avoid the most
difficult part of the 'shallow clone' idea, namely the [sometimes] need to
un-cauterize history. See: (<7vac8lidwi.fsf@assigned-by-dhcp.cox.net>).

'sparse clone' begins like 'shallow clone': full history is copied down to
specified point of history (cut-off or cauterization point for shallow
clone), but instead of cauterizing the history from that point downwards,
the history is simplified using grafts.

In the sparse part we need:
 * all commits pointed by tags (if we clone/copy tags) 
   and other refs (if we clone/copy those refs)
 * merge bases for all commits in full, and in the sparse part,
   _including_ merge bases themselves
 * all roots

Commits in the sparse part would be connected as in the original history,
only skipping "uninteresting" commits.
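
One way to picture the simplification, as a sketch only: given the real
parent links and the set of commits to keep, each kept commit gets as
its grafted parents the nearest kept ancestors reachable through skipped
commits. The sparse_grafts() name and the dictionary representation are
assumptions made for illustration; each resulting entry would correspond
to one "commit parent..." line in a grafts file.

```python
def sparse_grafts(parents, keep):
    """Compute simplified parent lists for the kept commits.

    parents: dict mapping commit -> list of its real parents
    keep:    set of commits that stay in the sparse part
    Returns a dict mapping each kept commit to the nearest kept
    ancestors found by walking through skipped commits.
    """
    grafts = {}
    for commit in keep:
        new_parents = []
        seen = set()
        # Depth-first walk, visiting the first parent first.
        stack = list(reversed(parents.get(commit, [])))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in keep:
                new_parents.append(node)   # stop here: nearest kept ancestor
            else:
                stack.extend(reversed(parents.get(node, [])))
        grafts[commit] = new_parents
    return grafts


if __name__ == "__main__":
    history = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"],
               "E": ["B"], "F": ["D", "E"]}
    print(sparse_grafts(history, {"A", "B", "F"}))
```

The sketch does not remove a grafted parent that is itself an ancestor
of another grafted parent, so the result can contain redundant edges.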


Thoughts? Comments?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git



* Re: Repacking many disconnected blobs
From: Johannes Schindelin @ 2006-06-14  9:07 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Keith Packard, Git Mailing List
In-Reply-To: <20060614072923.GB13886@spearce.org>

Hi,

On Wed, 14 Jun 2006, Shawn Pearce wrote:

> Keith Packard <keithp@keithp.com> wrote:
> > parsecvs scans every ,v file and creates a blob for every revision of
> > every file right up front. Once these are created, it discards the
> > actual file contents and deals solely with the hash values.
> > 
> > The problem is that while this is going on, the repository consists
> > solely of disconnected objects, and I can't make git-repack put those
> > into pack objects. This leaves the directories bloated, and operations
> > within the tree quite sluggish. I'm importing a project with 30000 files
> > and 30000 revisions (the CVS repository is about 700MB), and after
> > scanning the files, and constructing (in memory) a complete revision
> > history, the actual construction of the commits is happening at about 2
> > per second, and about 70% of that time is in the kernel, presumably
> > playing around in the repository.
> > 
> > I'm assuming that if I could get these disconnected blobs all neatly
> > tucked into a pack object, things might go a bit faster.
> 
> What about running git-update-index using .git/objects as the
> current working directory and adding all files in ??/* into the
> index, then git-write-tree that index and git-commit-tree the tree.
> 
> When you are done you have a bunch of orphan trees and a commit
> but these shouldn't be very big and I'd guess would prune out with
> a repack if you don't hold a ref to the orphan commit.

Alternatively, you could construct fake trees like this:

README/1.1.1.1
README/1.2
README/1.3
...

i.e. every file becomes a directory -- containing all the versions of that 
file -- in the (virtual) tree, which you can point to by a temporary ref.
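
As a sketch of that layout: the function below turns (file, revision,
blob sha1) triples into lines of the shape "mode SP sha1 TAB path", which
is one of the formats git-update-index --index-info reads on stdin. The
index_info_lines() name and the fixed 100644 mode are assumptions made
for illustration; writing the tree and pointing a temporary ref at it
would still be done with the usual git tools.

```python
def index_info_lines(revisions, mode="100644"):
    """Build index entries that place every revision of a file under a
    directory named after that file, e.g. README/1.2.

    revisions: iterable of (path, rcs_revision, blob_sha1) triples
    Returns a list of "mode sha1<TAB>path/revision" strings.
    """
    lines = []
    for path, rev, sha1 in revisions:
        lines.append(f"{mode} {sha1}\t{path}/{rev}")
    return lines


if __name__ == "__main__":
    demo = [("README", "1.1.1.1", "a" * 40), ("README", "1.2", "b" * 40)]
    print("\n".join(index_info_lines(demo)))
```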

Ciao,
Dscho


* Re: 'sparse' clone idea
From: Johannes Schindelin @ 2006-06-14  9:20 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
In-Reply-To: <e6oh2g$ngh$1@sea.gmane.org>

Hi,

On Wed, 14 Jun 2006, Jakub Narebski wrote:

> I wonder if 'sparse clone' idea described below would avoid the most
> difficult part of 'shallow clone' idea, namely the [sometimes] need to
> un-cauterize history. See: (<7vac8lidwi.fsf@assigned-by-dhcp.cox.net>).

I do not think that is the hardest problem. The hardest thing is to tell 
the server in an efficient manner which objects we have.

Example:

A - B - C - D
    ^ cutoff
        ^ current HEAD

Suppose B is your fake root, C is your HEAD, you want to fetch D. Now, 
make it a difficult example: both A and D contain a certain blob Z, but 
neither B nor C do. You have to tell the server _in an efficient manner_ 
to send Z also.

And by efficient manner I mean: you may not bring the server down just 
because 5 people with shallow clones decide to fetch from it.
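
The example can be made concrete with a toy model in which each commit
is just the set of blobs its tree contains. The blobs_to_send() helper
and the blob names other than Z are hypothetical; the point is only that
a server assuming the client holds everything reachable from C would
wrongly skip Z for a shallow client.

```python
def blobs_to_send(trees, client_commits, wanted):
    """Blobs of `wanted` that none of the client's commits contain.

    trees:          dict mapping commit -> set of blob names in its tree
    client_commits: commits the client really has
    wanted:         the commit being fetched
    """
    have = set()
    for commit in client_commits:
        have |= trees[commit]
    return trees[wanted] - have


# A and D contain blob Z; B and C do not.
trees = {
    "A": {"Z", "x1"},
    "B": {"x1"},
    "C": {"x1", "y1"},
    "D": {"Z", "y1"},
}

# Full clone (A, B, C present): Z is already there, nothing to send.
full = blobs_to_send(trees, ["A", "B", "C"], "D")
# Shallow clone cut off at B: Z is missing and must be sent.
shallow = blobs_to_send(trees, ["B", "C"], "D")

if __name__ == "__main__":
    print(full, shallow)
```

Computing this exactly means the server has to know where each client's
history was cut, which is the cost being worried about here.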

> 'sparse clone' begins like 'shallow clone': full history is copied down to
> specified point of history (cut-off or cauterization point for shallow
> clone), but instead of cauterizing the history from that point downwards,
> the history is simplified using grafts.
> 
> In the sparse part we need:
>  * all commits pointed by tags (if we clone/copy tags) 
>    and other refs (if we clone/copy those tags)
>  * merge bases for all commits in full, and in the sparse part,
>    _including_ merge bases themselves

Hmmm. You cannot know _all_ merge bases beforehand, because you do not 
decide where other people fork off.

>  * all roots

Why?

> Commits in sparse part would be connected like in original history, only
> skipping "uninteresting" commits.

Interesting idea, though I do not think it solves the most pressing 
problems we have with shallow clones.

Ciao,
Dscho

P.S.: I think the problems of a lazy clone are much easier to solve...


* Re: Repacking many disconnected blobs
From: Sergey Vlasov @ 2006-06-14  9:37 UTC (permalink / raw)
  To: Keith Packard; +Cc: git
In-Reply-To: <1150269478.20536.150.camel@neko.keithp.com>


On Wed, 14 Jun 2006 00:17:58 -0700 Keith Packard wrote:

> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
> 
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects. This leaves the directories bloated, and operations
> within the tree quite sluggish. I'm importing a project with 30000 files
> and 30000 revisions (the CVS repository is about 700MB), and after
> scanning the files, and constructing (in memory) a complete revision
> history, the actual construction of the commits is happening at about 2
> per second, and about 70% of that time is in the kernel, presumably
> playing around in the repository.
> 
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.

git-repack.sh basically does:

  git-rev-list --objects --all | git-pack-objects .tmp-pack

When you have only disconnected blobs, obviously the first part does
not work - git-rev-list cannot find these blobs.  However, you can do
that part manually - e.g., when you add a blob, do:

  fprintf(list_file, "%s %s\n", sha1, path);

(path should be a relative path in the repo without ",v" or "Attic" -
it is used for delta packing optimization, so getting it wrong will
not cause any corruption, but the pack may become significantly
larger).  You may output some duplicate sha1 values, but
git-pack-objects should handle duplicates correctly.

Then just invoke "git-pack-objects --non-empty .tmp_pack <list_file";
it will output the resulting pack sha1 to stdout.  Then you need to
move the pack into place and call git-prune-packed (which does not
use object lists, so it should work even with unreachable objects).

You may even want to repack more than once during the import;
probably the simplest way to do it is to truncate list_file after
each repack and use "git-pack-objects --incremental".
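
For illustration, the path clean-up and the list line could look like
this in Python (parsecvs itself is C, so this is only a sketch of the
same idea). The repo_path() and list_line() names are hypothetical, and
the sketch assumes that any path component called "Attic" comes from
CVS, which would be wrong for a project with a real directory of that
name.

```python
def repo_path(rcs_path):
    """Map an RCS file path to the relative path used as the pack hint.

    Drops "Attic" directory components and the trailing ",v" suffix.
    """
    parts = [p for p in rcs_path.split("/") if p and p != "Attic"]
    if parts and parts[-1].endswith(",v"):
        parts[-1] = parts[-1][:-2]
    return "/".join(parts)


def list_line(sha1, rcs_path):
    """One "sha1 path" line for the object list fed to git-pack-objects."""
    return f"{sha1} {repo_path(rcs_path)}\n"


if __name__ == "__main__":
    print(list_line("c" * 40, "src/Attic/old.c,v"), end="")
```

Appending one such line per blob to list_file as the blobs are written
gives git-pack-objects the same kind of input git-rev-list would have
produced.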



* Re: git-cvsimport doesn't quite work, wrt branches
From: sf @ 2006-06-14  9:37 UTC (permalink / raw)
  To: git
In-Reply-To: <46a038f90606131555m7b1fa744g9770140c87598b7b@mail.gmail.com>

Martin Langhoff wrote:
...
> Yes, cvsps is relying on the wrong things. I am looking at parsecvs
> and the cvs2svn tool and wondering where to from here.
...
> I am starting to look at what I can do with cvs2svn to get the import
> into git. It seems to get very good patchsets, and it yields an easily
> readable DB. I'll either learn Python, or read the DB from Perl
> (probably from git-cvsimport).

SVN has a portable format called "dumpfile" (see
http://svn.collab.net/repos/svn/trunk/notes/fs_dumprestore.txt) which is
produced by "svnadmin dump ..." and "cvs2svn --dump-only ...".

Why not use it as input for importing into git?

Pros:
- "svnadmin dump" should be fast
- svn repositories can be tracked with "svnadmin dump" (just remember
the last imported revision and restart from there)
- cvs2svn seems to be very good at its job
- only one tool needed

Cons:
- Both svnadmin and cvs2svn only work on local repositories
- cvs2svn cannot be used for tracking

Regards
	Stephan


* Re: 'sparse' clone idea
From: Jakub Narebski @ 2006-06-14  9:44 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.63.0606141110001.15673@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin wrote:

> On Wed, 14 Jun 2006, Jakub Narebski wrote:
> 
>> I wonder if 'sparse clone' idea described below would avoid the most
>> difficult part of 'shallow clone' idea, namely the [sometimes] need to
>> un-cauterize history. See: (<7vac8lidwi.fsf@assigned-by-dhcp.cox.net>).
> 
> I do not think that is the hardest problem. The hardest thing is to tell 
> the server in an efficient manner which objects we have.
> 
> Example:
> 
> A - B - C - D
>     ^ cutoff
>         ^ current HEAD
> 
> Suppose B is your fake root, C is your HEAD, you want to fetch D. Now, 
> make it a difficult example: both A and D contain a certain blob Z, but 
> neither B nor C do. You have to tell the server _in an efficient manner_ 
> to send Z also.
> 
> And by efficient manner I mean: you may not bring the server down just 
> because 5 people with shallow clones decide to fetch from it.

Nah, I think that is solved. Check the post by Junio C Hamano mentioned
above, in the "Re: Figured out how to get Mozilla into git" thread:

 http://permalink.gmane.org/gmane.comp.version-control.git/21603

(although it would need an extension to the git protocol). Client and server
exchange grafts both ways, limiting the commit ancestry graph that both
ends walk to the intersection of the fake views of the ancestry graph that
both ends have. The server then uses those virtual grafts to calculate which
objects to send.

The rest is done (or should be done) by history grafting code.

>>  * merge bases for all commits in full, and in the sparse part,
>>    _including_ merge bases themselves
> 
> Hmmm. You cannot know _all_ merge bases beforehand, because you do not 
> decide where other people fork off.

By all merge bases I mean: merge bases for all commits in the full part;
merge bases for commits in the full part and commits pointed to by tags in
the sparse part; merge bases between those commits and the merge bases
already found in the sparse part; and so on, recursively.
 
>>  * all roots
> 
> Why?

Just in case, as the ultimate merge bases.
 
> P.S.: I think the problems of a lazy clone are much easier to solve...

I still think that the correct idea for the lazy clone is to have soft
grafts, so you have to solve at least part of the shallow clone/sparse
clone problems first.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply

* Re: Porcelain specific metadata under .git?
From: Andreas Ericsson @ 2006-06-14 11:11 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git
In-Reply-To: <20060614062240.GA13886@spearce.org>

Shawn Pearce wrote:
> 
> I already assume/know that refs/heads and refs/tags are completely
> off-limits as they are for user refs only.
> 
> I also think the core GIT tools already assume that anything
> directly under .git which is strictly a file and which is named
> entirely with uppercase letters (aside from "HEAD") is strictly a
> temporary/short-lived state type item (e.g. COMMIT_MSG) used by a
> Porcelain.
> 
> But is saying ".git/refs/eclipse-workspaces" is probably able to
> be used for this purpose safe?  :-)
> 

.git/eclipse/whatever-you-like

would probably be better. Heads can be stored directly under .git/refs 
too. Most likely, nothing will ever be stored under .git/eclipse by 
either core git or the current (other) porcelains though.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply

* Re: Porcelain specific metadata under .git?
From: Jakub Narebski @ 2006-06-14 11:32 UTC (permalink / raw)
  To: git
In-Reply-To: <448FEED7.30701@op5.se>

Andreas Ericsson wrote:

> Shawn Pearce wrote:
>> 
>> I already assume/know that refs/heads and refs/tags are completely
>> off-limits as they are for user refs only.
>> 
>> I also think the core GIT tools already assume that anything
>> directly under .git which is strictly a file and which is named
>> entirely with uppercase letters (aside from "HEAD") is strictly a
>> temporary/short-lived state type item (e.g. COMMIT_MSG) used by a
>> Porcelain.
>> 
>> But is saying ".git/refs/eclipse-workspaces" is probably able to
>> be used for this purpose safe?  :-)
>> 
> 
> .git/eclipse/whatever-you-like
> 
> would probably be better. Heads can be stored directly under .git/refs 
> too. Most likely, nothing will ever be stored under .git/eclipse by 
> either core git or the current (other) porcelains though.

I think if it is a ref, which one wants to be visible to git-fsck (and
git-prune), it should be under .git/refs.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply

* Re: Repacking many disconnected blobs
From: Junio C Hamano @ 2006-06-14 12:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <Pine.LNX.4.63.0606141104050.15578@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Alternatively, you could construct fake trees like this:
>
> README/1.1.1.1
> README/1.2
> README/1.3
> ...
>
> i.e. every file becomes a directory -- containing all the versions of that 
> file -- in the (virtual) tree, which you can point to by a temporary ref.

That would not play well with the packing heuristics, I suspect.
If you reverse it to use rev/file-id, then the same files from
different revs would sort closer, though.
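
To illustrate the point about path layout: the sketch below assumes,
purely for illustration, that the packer groups delta candidates by the
trailing path component. That is a simplification, not a description of
the actual pack-objects heuristics, and the helper names are hypothetical.
Under that assumption, sorting shows which entries end up adjacent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the part of `path` after the last '/', or the whole path. */
static const char *last_component(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Order by trailing component first, then by full path as a tie-break. */
static int by_last_component(const void *a, const void *b)
{
	const char *pa = *(const char *const *)a;
	const char *pb = *(const char *const *)b;
	int cmp = strcmp(last_component(pa), last_component(pb));

	return cmp ? cmp : strcmp(pa, pb);
}

/* Hypothetical stand-in for "how the packer would see these names". */
static void sort_by_trailing_name(const char **paths, size_t n)
{
	assert(paths != NULL);
	qsort(paths, n, sizeof(*paths), by_last_component);
}
```

Sorting "README/1.1", "Makefile/1.1", "README/1.2", "Makefile/1.2" this
way puts Makefile/1.1 next to README/1.1, that is, unrelated files of the
same revision together. Sorting "1.1/README", "1.1/Makefile",
"1.2/README", "1.2/Makefile" puts 1.1/README next to 1.2/README, so
versions of the same file become neighbours, which is what delta
compression wants.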

^ permalink raw reply

* [PATCH] fix git alias
From: Junio C Hamano @ 2006-06-14 13:01 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

When extra command line arguments are given to a command that
was alias-expanded, the code generated a wrong argument list,
leaving the original alias in the result, and forgetting to
terminate the new argv list.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * This would make "git l -n 4" work when you have "alias.l =
   log -M" in your configuration.  The original code generated
   an equivalent of "git log -M l -n 4".

   There is another, more serious problem I seem to be hitting but
   haven't figured out (and will probably not figure out while
   away); I'd appreciate it if you could track it down.  With
   "alias.wh = whatchanged --patch-with-stat", "git wh HEAD --
   mailinfo.c" segfaults at fclose() in git_config_from_file()
   when it reads the configuration for the second time (the
   first time being to look up the alias).  The second call comes
   via init_revisions() calling setup_git_directory().  Oddly,
   I do not seem to be able to reproduce this segfault on amd64.

diff --git a/git.c b/git.c
index 9469d44..329ebec 100644
--- a/git.c
+++ b/git.c
@@ -122,9 +122,9 @@ static int handle_alias(int *argcp, cons
 			/* insert after command name */
 			if (*argcp > 1) {
 				new_argv = realloc(new_argv, sizeof(char*) *
-						(count + *argcp - 1));
-				memcpy(new_argv + count, *argv, sizeof(char*) *
-						(*argcp - 1));
+						   (count + *argcp));
+				memcpy(new_argv + count, *argv + 1,
+				       sizeof(char*) * *argcp);
 			}
 
 			*argv = new_argv;
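
The arithmetic in the hunk above can be checked in isolation. The sketch
below is a standalone illustration, not the code in git.c: splice_alias
and its parameters are hypothetical names, and it allocates a fresh array
instead of calling realloc. It shows why the new list needs
count + argc slots and why copying argc pointers from argv + 1 both drops
the alias name and carries over the terminating NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build a NULL-terminated argument list made of
 * the alias expansion followed by the caller's extra arguments,
 * dropping argv[0] (the alias name itself).  The caller frees the
 * returned array; the strings are shared, not copied.
 */
static const char **splice_alias(const char **expansion, int count,
				 const char **argv, int argc)
{
	const char **new_argv;

	assert(count >= 1 && argc >= 1 && argv[argc] == NULL);

	/* count expansion words + (argc - 1) extra args + 1 NULL slot */
	new_argv = malloc(sizeof(char *) * (count + argc));
	if (!new_argv)
		return NULL;

	memcpy(new_argv, expansion, sizeof(char *) * count);
	/*
	 * argv + 1 skips the alias name; copying argc pointers from
	 * there includes argv[argc], the terminating NULL.
	 */
	memcpy(new_argv + count, argv + 1, sizeof(char *) * argc);
	return new_argv;
}
```

With the expansion { "log", "-M" } and the original arguments
{ "l", "-n", "4", NULL }, the result is { "log", "-M", "-n", "4", NULL },
matching the "git l -n 4" case described in the note above.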

^ permalink raw reply related

* Re: Porcelain specific metadata under .git?
From: Andreas Ericsson @ 2006-06-14 13:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git
In-Reply-To: <e6os3v$r5g$1@sea.gmane.org>

Jakub Narebski wrote:
> Andreas Ericsson wrote:
> 
> 
>>Shawn Pearce wrote:
>>
>>>I already assume/know that refs/heads and refs/tags are completely
>>>off-limits as they are for user refs only.
>>>
>>>I also think the core GIT tools already assume that anything
>>>directly under .git which is strictly a file and which is named
>>>entirely with uppercase letters (aside from "HEAD") is strictly a
>>>temporary/short-lived state type item (e.g. COMMIT_MSG) used by a
>>>Porcelain.
>>>
>>>But is saying ".git/refs/eclipse-workspaces" is probably able to
>>>be used for this purpose safe?  :-)
>>>
>>
>>.git/eclipse/whatever-you-like
>>
>>would probably be better. Heads can be stored directly under .git/refs 
>>too. Most likely, nothing will ever be stored under .git/eclipse by 
>>either core git or the current (other) porcelains though.
> 
> 
> I think if it is a ref, which one wants to be visible to git-fsck (and
> git-prune), it should be under .git/refs.
> 

Yes, but I understood him to mean "it's a tree-sha" instead of a 
branch/head thing, which would mean it doesn't fit the .git/refs 
definition of ref.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply

