git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: git@vger.kernel.org, admin@repo.or.cz, John Hawley <warthog9@kernel.org>
Subject: Re: [PATCH/RFC] gitweb: Restructure projects list generation
Date: Tue, 1 Mar 2011 02:01:55 +0100	[thread overview]
Message-ID: <201103010202.01408.jnareb@gmail.com> (raw)
In-Reply-To: <201102271603.13902.jnareb@gmail.com>

On Sun, 27 Feb 2011, Jakub Narebski wrote:
> On Sun, 27 Feb 2011, Jonathan Nieder wrote:
> > Jakub Narebski wrote:

> > > +# this assumes that forks follow immediately forked projects:
> > > +# [ ..., 'repo.git', 'repo/fork.git', ...]; if not, set second
> > > +# parameter to true value to sort projects by path first.
> > > +sub filter_forks_from_projects_list {
> > > +	my ($projects, $need_sorting) = @_;
> > > +
> > > +	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
> > > +		if ($need_sorting);
> > 
> > What happens in this scenario?
> > 
> > 	git.git
> > 	git.related.project.git
> > 	git/platforms.git
> 
> Thanks for catching this.
> 
> It looks like I have oversimplified the algorithm by requiring that 
> forked project directly precedes any of its forks (if they exists).
> I'd have to redo this part of patch.
> 
> > [...]
> > > @@ -4747,23 +4784,36 @@ sub fill_project_list_info {
> > >  		if (!defined $pr->{'owner'}) {
> > >  			$pr->{'owner'} = git_get_project_owner("$pr->{'path'}") || "";
> > >  		}
> > > -		if ($check_forks) {
> > > -			my $pname = $pr->{'path'};
> > > -			if (($pname =~ s/\.git$//) &&
> > > -			    ($pname !~ /\/$/) &&
> > > -			    (-d "$projectroot/$pname")) {
> > > -				$pr->{'forks'} = "-d $projectroot/$pname";

Let's examine what's we had, what this commit does, and what we can have.

Before this commit there were two implementations of detecting and 
filtering-out forks: one based only on directory structure in
fill_project_list_info (it detected forks, but didn't filter them out),
and one in git_get_projects_list, which was based on pathnames and was
run only for $projects_list being [index] file (it filtered-out forks).

This commit replaced both by filter_forks_from_projects_list, which
mainly detected forks based on path, with directory based detecting
if there _can_ be forks... but this solution is based on overly strict
and not true assumption that forks always directly follow forked 
project, which doesn't need to be true e.g. for scanning $projects_list
directory.


Hmmm... it looks for me like filtering out forks worked only for 
$projects_list being file; the code in fill_project_list_info() detected
which projects were forked, but didn't detect which projects were forks
and should be filtered.  The code inside $check_forks conditional that
used to be in git_projects_list_body() was about filtering out for second
time projects that are not forks of current project, for 'forks' action.

Now naive implementation of detecting forks, and also implementation used
to be used in git_get_projects_list() for $projects_list being file is
O(n^2) in number of projects.  With a bit of simplification it can be
O(<projects>*<forked projects>) + O(<projects>)*stat (cost of "-d <dir>").
Withe git.kernel.org having around 1,000 projects, and repo.or.cz having
around 4,000 projects not counting forks, O(n^2) is out of question.

Of those 4,000 projects around 300 (or 7%) has forks, according to 
repo.or.cz projects list page.  But assuming that percentage of forked
repositories remains constant with growth of number of repositories,
this is still quadratic cost... but perhaps acceptable.


We can use more advanced algorithm to find prefixes (to find forks);
for example using trie[1][2] to find forks costs O(<projects>*<depth>).
For git.kernel.org which has quite deep hierarchy of projects number
of path components (depth) is no more than 6.

Synthetic benchmark with 2000 repositories, 5% of which are forked,
with 5 forks per forked repository shows that trie-based solution is
around 25 times faster than original solution from git_get_projects_list
with %paths.

The disadvantage of this solution is that even seriously stripped-down
trie construction and lookup is around 2 pages of code... or using non-core
Tree::Trie module from CPAN.

[1]: http://en.wikipedia.org/wiki/Trie
[2]: http://p3rl.org/Tree::Trie
-- 
Jakub Narebski
Poland

  reply	other threads:[~2011-03-01  1:02 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-26 22:06 [PATCH/RFC] gitweb: Restructure projects list generation Jakub Narebski
2011-02-27 10:08 ` Jonathan Nieder
2011-02-27 15:03   ` Jakub Narebski
2011-03-01  1:01     ` Jakub Narebski [this message]
2011-03-02 16:11       ` [PATCHv2/RFC] " Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201103010202.01408.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=admin@repo.or.cz \
    --cc=git@vger.kernel.org \
    --cc=jrnieder@gmail.com \
    --cc=warthog9@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).