All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: John 'Warthog9' Hawley <warthog9@eaglescrag.net>,
	John 'Warthog9' Hawley <warthog9@kernel.org>,
	Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>,
	Jonathan Nieder <jrnieder@gmail.com>, Petr Baudis <pasky@suse.cz>,
	Sebastien Cevey <seb@cine7.net>
Subject: Re: [PATCH 1/6 (v2)] gitweb: Restructure projects list generation
Date: Sat, 7 May 2011 20:39:22 +0200	[thread overview]
Message-ID: <201105072039.24548.jnareb@gmail.com> (raw)
In-Reply-To: <1304099521-27617-2-git-send-email-jnareb@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6494 bytes --]

On Fri, 29 Apr 2011, Jakub Narebski wrote:

> Extract filtering out forks (which is done if 'forks' feature is
> enabled) into filter_forks_from_projects_list subroutine, and
> searching projects (via projects search form, or via content tags)
> into search_projects_list subroutine.

If it was all this patch did, it would be probably easier to review.
Should I then split this patch into smaller, though not entirely
standalone patches: pure refactoring, improving finding forks, and
adding tests?

> Sorting projects got also refactored in a very straight way (just
> moving code) into sort_projects_list subroutine.

This also could be refactored in a separate patch... though I rather
have all refactorings together, so one can see the result: clean code.
 
> The interaction between forks, content tags and searching is now made
> more explicit: searching whether by tag, or via search form turns off
> fork filtering (gitweb searches also forks, and will show all
> results).  If 'ctags' feature is disabled, then searching by tag is
> too.

This is functional change that makes code more clear, and I think it
makes better behavior... but it changes how gitweb behaves, so it is
not pure refactoring.  Should I put it into a bit artificial separate
patch?
 
> While at it we now detect that there are no projects and respond with
> "404 No projects found" also for 'project_index' and 'opml' actions.

This "while at it" is here because we can now say "No projects found"
for 'projects_list' action, because filtering / searching is done
upfront, so we know if we found anything.

But again, this perhaps would be better put in a separate patch.
 
[...]
> filter_forks_from_projects_list() uses trie[1] (prefix tree) to find
> which repositories are forks of other projects; directories are used
> as "letters" in trie.  Algorithm to create trie and find prefix match
> was created with some help of code of Tree::Tree[2] Perl module from
> CPAN.
> 
> [1]: http://en.wikipedia.org/wiki/Trie
> [2]: http://p3rl.org/Tree::Trie

This change in algorithm should perhaps be put in a separate patch...

> Note that before this commit filtering out forks worked only for
> $projects_list being a file, it assumed that forks are always after
> forked project (not very unreasonable), and used algorithm which is
> O(<projects>^2).
> 
> Current code filters out projects only where it should, regardless
> whether $projects_list is directory or a file, and trie-based
> algorithm is O(<projects>) + O(<projects forked>*<max depth>).
> Current code also consistently finds shortest prefix (important in
> counting forks in case of forks of forks).

Jakub Narebski wrote in
"Re: What's cooking in git.git (May 2011, #02; Wed, 4)"
http://thread.gmane.org/gmane.comp.version-control.git/172789/focus=172820

> Junio C Hamano <gitster@pobox.com> writes:
[...]
>> * jn/ctags (2011-04-29) 6 commits
>>  - gitweb: Optional grouping of projects by category
>>  - gitweb: Modularized git_get_project_description to be more generic
>>  - gitweb: Split git_project_list_body in two functions
>>  - gitweb: Mark matched 'ctag' / contents tag (?by_tag=foo)
>>  - gitweb: Change the way "content tags" ('ctags') are handled
>>  - gitweb: Restructure projects list generation
>> 
>> Waiting for comments.
> 
> Should I do and post benchmarks for
> 
>    - gitweb: Restructure projects list generation
> 
> change (when 'forks' feature is used)?

It is not easy to compare old and new code, because those algorithm
have different behavior.

Best I can come up with is comparing pure fork detection (not the time
it takes to generate page) - I can easy mock that - for sizes comparable
with http://repo.or.cz which uses 'forks' feature.  repo.or.cz has
around 4,000 projects, around 300 of those have forks (note: forks are
not included in number of projects).  From non-representative sample
most forked project have but single fork; note however that there are
repositories such as git.git, which has 55 forks, and some of those have
forks on its own (forks of forks).

 plan: nprojects=4000; nprojects_forked=300; nforks_per_forked_project=1
 number of projects: 4000

 Benchmark: running old, new for at least 600 CPU seconds...
        old: 1231.38 wallclock secs (620.23 usr +  1.30 sys = 621.53 CPU) @  0.03/s (n=21)
        new:  847.6  wallclock secs (632.34 usr +  0.80 sys = 633.14 CPU) @  5.18/s (n=3277)
     s/iter    old    new
 old   29.6     --   -99%
 new  0.193 15219%     --

 Dumbbench: running old, new for at least 50 iterations...
 old: Ran 106 iterations (25 outliers).
 old: Rounded run time per iteration: 3.316e+01  +/- 1.5e-01 (0.5%)
 new: Ran 66 iterations (15 outliers).
 new: Rounded run time per iteration: 2.0213e-01 +/- 2.1e-04 (0.1%)

Below there is table of time versus number of projects, used to generate
attached and linked images.  Data was generated using Dumbbench module
from CPAN.

 ########################################################
 # nforks  = 5 (forks per project forked)
 # nforked = 5.00%
 # 1:projects 2:nforked  3:old 4:old_err  5:new 5:new_err
 #  0   0  8.959e-06  2.2e-08  1.0137e-05 2.0e-08
   50   3  3.211e-03  1.6e-05  2.0524e-03 5.0e-06
  103   5  1.4003e-02 6.7e-05  4.477e-03  2.2e-05
  204  10  5.647e-02  2.3e-04  9.464e-03  4.6e-05
  402  20  2.1939e-01 3.2e-04  1.9943e-02 9.6e-05
  500  25  3.666e-01  3.6e-03  2.706e-02  2.5e-04
  800  40  9.124e-01  1.7e-03  4.175e-02  2.1e-04
 1000  50  1.575e+00  1.3e-02  5.896e-02  5.3e-04
 1600  80  3.738e+00  1.2e-02  9.347e-02  4.3e-04
 2004 100  6.820e+00  6.7e-02  1.1216e-01 3.6e-04
 3203 160  1.8189e+01 8.5e-02  1.8144e-01 4.5e-04
 4003 200  2.927e+01  2.9e-01  2.2939e-01 5.7e-04

Script used to generate this data is attached.

From those data we can generate a plot to show how time depends on size
of problem (on total number of projects).  Assuming that both "old" ane
"new" are P i.e. are O(n^k), where 'n' is number of projects, this
dependency is best examined on loglog scale.

      y = x^k
  log y = log (x^k) = k * log x
      v = k * u,  where u = log x, v = log y

This plot, together with fit of cubic polynomial and linear function
to the data, is shown in

  https://git.wiki.kernel.org/index.php/File:Benchmark_find_forks-old_vs_new.png

(I'm sorry for abusing git wiki for sharing images...).
-- 
Jakub Narebski
Poland

[-- Attachment #2: benchmark_find_forks.perl --]
[-- Type: text/plain, Size: 6721 bytes --]

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw(first);
use POSIX qw(ceil);

use Acme::MetaSyntactic qw(any);
use Benchmark qw(:all :hireswallclock);
use Dumbbench;

use Data::Dumper::Concise; # for debugging


# ----------------------------------------------------------------------

# old: version used to be in git_get_projects_list for $projects_list file
sub filter__git_get_projects_list {
	my ($projects, $need_sorting) = @_;

	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
		if ($need_sorting);

	my (@filtered, %paths);
 PROJECT:
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};

		# below is v1.7.5.1:gitweb/gitweb.perl lines 2725 and later
	PATH:
		foreach my $filter (keys %paths) {
			# looking for forks
			my $pfx = substr($path, 0, length($filter));
			if ($pfx ne $filter) {
				next PATH;
			}
			my $sfx = substr($path, length($filter));
			if ($sfx !~ /^\/.*\.git$/) {
				next PATH;
			}
			# is a fork, don't include it in
			# the list
			next PROJECT;
		}

		# [...]
		# below is v1.7.5.1:gitweb/gitweb.perl line 2746 and later
		push @filtered, $pr;
		(my $forks_path = $path) =~ s/\.git$//;
		$paths{$forks_path}++;
	}

	return @filtered;
}


# new: using 'path' only, using trie; code borrowed from Tree::Trie (Perl Artistic License)
sub filter_forks_from_projects_list {
	my ($projects, $need_sorting) = @_;

	@$projects = sort { $a->{'path'} cmp $b->{'path'} } @$projects
		if ($need_sorting);

	my %trie; # prefix tree of directories (path components)
	# generate trie out of those directories that might contain forks
	foreach my $pr (@$projects) {
		my $path = $pr->{'path'};
		$path =~ s/\.git$//;      # forks of 'repo.git' are in 'repo/' directory
		next if ($path =~ m!/$!); # skip non-bare repositories, e.g. 'repo/.git'
		next unless ($path);      # skip '.git' repository: tests, git-instaweb
		#next unless (-d $path);  # containing directory exists (NOT IN TESTS!!!)
		$pr->{'forks'} = [];      # there can be 0 or more forks of project

		# add to trie
		my @dirs = split('/', $path);
		# walk the trie, until either runs out of components or out of trie
		my $ref = \%trie;
		while (scalar @dirs &&
		       exists($ref->{$dirs[0]})) {
			$ref = $ref->{shift @dirs};
		}
		# create rest of trie structure from rest of components
		foreach my $dir (@dirs) {
			$ref = $ref->{$dir} = {};
		}
		# create end marker, store $pr as a data
		$ref->{''} = $pr if (!exists $ref->{''});
	}

	# filter out forks, by finding shortest prefix match for paths
	my @filtered;
 PROJECT:
	foreach my $pr (@$projects) {
		# trie lookup
		my $ref = \%trie;
	DIR:
		foreach my $dir (split('/', $pr->{'path'})) {
			if (exists $ref->{''}) {
				# found [shortest] prefix, is a fork - skip it
				push @{$ref->{''}{'forks'}}, $pr;
				next PROJECT;
			}
			if (!exists $ref->{$dir}) {
				# not in trie, cannot have prefix, not a fork
				push @filtered, $pr;
				next PROJECT;
			}
			# If the dir is there, we just walk one step down the trie.
			$ref = $ref->{$dir};
		}
		# we ran out of trie
		# (shouldn't happen: it's either no match, or end marker)
		push @filtered, $pr;
	}

	return @filtered;
}

# ----------------------------------------------------------------------

sub prepare_projects {
	my ($nprojects, $nforked, $nforks_per_project, $exclude_forks) = @_;
	my @projects;

	while ($nprojects > 0) {
		push @projects, metaname().".git";
		$nprojects--;
		last unless $nprojects;

		if (rand() < (1.0*$nforked/$nprojects)) {
			(my $base = $projects[-1]) =~ s/\.git$//;
			push @projects, map { "$base/$_.git" } metaname($nforks_per_project);
			$nprojects -= $nforks_per_project unless $exclude_forks;
			$nforked--;
		}
	}

	#print Dumper(\@projects);
	@projects = map { { 'path' => $_ } } @projects;

	return @projects;
}

# ######################################################################

my (@projects, @filtered);

# repo.or.cz
# * 4038 projects (without forks?), based on http://repo.or.cz/w?a=project_index
# * 4024 projects, based on "grep -c -e '<tr class="'" on http://repo.or.cz/w?a=project_list
# * 3428 active projects, based on "grep -c -e '"age'" on http://repo.or.cz/w?a=project_list
# *  308 projects have forks (7.6-10%), based on grep -c -e '/forks">+</a>' project_list.html
# * nonrepresentative sample of nforks: 6, 2, 1, 1, 0, 0, 1, 0, 1, 2, 3, 3, 3, 55+, 1, 1, 1

my $nprojects = 100;
my $nforked_frac = 0.05;
my $nforked   = ceil($nforked_frac*$nprojects); # 5%
my $nforks    = 5;

$nprojects = 4000;
$nforked   = 300;
$nforks    = 1;

print "plan: nprojects=$nprojects; nforked=$nforked; nforks=$nforks\n";
@projects = prepare_projects($nprojects, $nforked, $nforks);
print "number of projects: ".(scalar @projects)."\n";

print "\n";
my $results = timethese(-600, {
	'old' => sub { filter__git_get_projects_list(\@projects) },
	'new' => sub { filter_forks_from_projects_list(\@projects) },
});
#print Dumper($results);
cmpthese($results);
print "\n";

print "Dumbbench: old, new\n";
my $bench = Dumbbench->new(
	target_rel_precision => 0.005, # seek ~0.5%
	initial_runs         => 50,    # the higher the more reliable
);
$bench->add_instances(
	Dumbbench::Instance::PerlSub->new(
		name => 'old',
		code => sub { filter__git_get_projects_list(\@projects) },
	),
	Dumbbench::Instance::PerlSub->new(
		name => 'new',
		code => sub { filter_forks_from_projects_list(\@projects) },
	),
);
$bench->run();
$bench->report();

print "\n";
#print Dumper($bench->instances());
__END__

print "#####################################################################\n".
      "# nforks  = $nforks (forks per project forked)\n".
      "# nforked = ".(sprintf "%.2f%%", $nforked_frac*100.0)."\n".
      "# 1:projects 2:nforked  3:old 4:old_err  5:new 5:new_err\n";
for (my $i = 50; $i <= 4000; $i *= 2) {
	$nforked  = ceil($nforked_frac*$i);
	@projects = prepare_projects($i, $nforked, $nforks);

	$bench = Dumbbench->new(
		target_rel_precision => 0.005, # seek ~0.5%
		initial_runs         => 50,    # the higher the more reliable
	);
	$bench->add_instances(
		Dumbbench::Instance::PerlSub->new(
			name => 'old',
			code => sub { filter__git_get_projects_list(\@projects) },
		),
		Dumbbench::Instance::PerlSub->new(
			name => 'new',
			code => sub { filter_forks_from_projects_list(\@projects) },
		),
	);
	$bench->run();

	my %instances = map { $_->name() => $_ } $bench->instances();

	printf "%4d %3d ", (scalar @projects), $nforked;
	foreach my $name (qw(old new)) {
		my $result = $instances{$name}->result();
		print " ".$result->number();
		print " ".$result->error()->[0];
	}
	print "\n";
}
#print Dumper($bench->instances());
#print Dumper($_->result()) foreach ($bench->instances());


print "\n";


1;
__END__

  reply	other threads:[~2011-05-07 19:05 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-29 17:51 [PATCH 0/6] gitweb: Improve ctags, introduce categories Jakub Narebski
2011-04-29 17:51 ` [PATCH 1/6 (v2)] gitweb: Restructure projects list generation Jakub Narebski
2011-05-07 18:39   ` Jakub Narebski [this message]
2011-04-29 17:51 ` [PATCH 2/6] gitweb: Change the way "content tags" ('ctags') are handled Jakub Narebski
2011-04-29 17:51 ` [PATCH 3/6] gitweb: Mark matched 'ctag' / contents tag (?by_tag=foo) Jakub Narebski
2011-04-29 17:51 ` [PATCH 4/6] gitweb: Split git_project_list_body in two functions Jakub Narebski
2011-04-29 17:52 ` [PATCH 5/6] gitweb: Modularized git_get_project_description to be more generic Jakub Narebski
2011-04-29 17:52 ` [PATCH 6/6] gitweb: Optional grouping of projects by category Jakub Narebski
2011-04-29 21:31 ` [PATCH 0/6] gitweb: Improve ctags, introduce categories Junio C Hamano
2011-04-29 23:53   ` Jakub Narebski
2011-04-30 20:36   ` Øyvind A. Holm
2011-05-03 14:02     ` git-send-email and non 7bit clean message (was: [PATCH 0/6] gitweb: Improve ctags, introduce categories) Jakub Narebski
2011-05-04 13:50       ` [PATCH/RFC] git-send-email: Do not encode Subject if not required (was: git-send-email and non 7bit clean message) Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201105072039.24548.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jrnieder@gmail.com \
    --cc=pasky@suse.cz \
    --cc=seb@cine7.net \
    --cc=u.kleine-koenig@pengutronix.de \
    --cc=warthog9@eaglescrag.net \
    --cc=warthog9@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.