[PATCH] gitweb: Support caching projects list


* [PATCH] gitweb: Support caching projects list
@ 2008-03-13 23:14 Petr Baudis
  2008-03-14  0:07 ` Jay Soffian
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-13 23:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On repo.or.cz (permanently I/O overloaded and hosting 1050 projects +
forks), the projects list (the default gitweb page) can take more than
a minute to generate. This naive patch adds simple support for caching
the projects list data structure so that all the projects do not need
to get rescanned at every page access.

The $projlist_cache_lifetime gitweb configuration variable is introduced,
by default set to zero. If set to non-zero, it describes the number of
minutes for which the cache remains valid. Only a single project root
per system can use the cache. Any script running with the same uid
as gitweb can change the cache trivially - this is for secure installations
only.

The cache itself is stored in /tmp/gitweb.index.cache as a Data::Dumper
dump of the perl data structure with the list of project details. When
reusing the cache, the file is simply eval'd back into @projects. For
clarity, project scanning and @projects population are separated into
git_get_projects_details().

To prevent contention when multiple accesses coincide with cache
expiration, the timeout is postponed to time()+120 when we start
refreshing. When showing the cached version, a disclaimer is shown at the
top of the projects list.

Signed-off-by: Petr Baudis <pasky@suse.cz>
---

 gitweb/gitweb.css  |    6 +++++
 gitweb/gitweb.perl |   59 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/gitweb/gitweb.css b/gitweb/gitweb.css
index 8e2bf3d..673077a 100644
--- a/gitweb/gitweb.css
+++ b/gitweb/gitweb.css
@@ -85,6 +85,12 @@ div.title, a.title {
 	color: #000000;
 }
 
+div.stale_info {
+	display: block;
+	text-align: right;
+	font-style: italic;
+}
+
 div.readme {
 	padding: 8px;
 }
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index bcb6193..0eee195 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -122,6 +122,15 @@ our $fallback_encoding = 'latin1';
 # - one might want to include '-B' option, e.g. '-B', '-M'
 our @diff_opts = ('-M'); # taken from git_commit
 
+# projects list cache for busy sites with many projects;
+# if you set this to non-zero, it will be used as the cached
+# index lifetime in minutes
+# the cached list version is stored in /tmp and can be tweaked
+# by other scripts running with the same uid as gitweb - use this
+# only at secure installations; only single gitweb project root per
+# system is supported!
+our $projlist_cache_lifetime = 0;
+
 # information about snapshot formats that gitweb is capable of serving
 our %known_snapshot_formats = (
 	# name => {
@@ -3507,10 +3516,8 @@ sub git_patchset_body {
 
 # . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 
-sub git_project_list_body {
-	my ($projlist, $order, $from, $to, $extra, $no_header) = @_;
-
-	my ($check_forks) = gitweb_check_feature('forks');
+sub git_get_projects_details {
+	my ($projlist, $check_forks) = @_;
 
 	my @projects;
 	foreach my $pr (@$projlist) {
@@ -3540,11 +3547,53 @@ sub git_project_list_body {
 		}
 		push @projects, $pr;
 	}
+	return @projects;
+}
+
+sub git_project_list_body {
+	my ($projlist, $order, $from, $to, $extra, $no_header, $cache_lifetime) = @_;
+
+	my ($check_forks) = gitweb_check_feature('forks');
+
+	my $cache_file = '/tmp/gitweb.index.cache';
+	use File::stat;
+
+	my @projects;
+	my $stale = 0;
+	if ($cache_lifetime and -f $cache_file
+	    and stat($cache_file)->mtime + $cache_lifetime * 60 > time()
+	    and open (my $fd, $cache_file)) {
+		$stale = time() - stat($cache_file)->mtime;
+		my @dump = <$fd>;
+		close $fd;
+		# Hack zone start
+		my $VAR1;
+		eval join("\n", @dump);
+		@projects = @$VAR1;
+		# Hack zone end
+	} else {
+		if ($cache_lifetime and -f $cache_file) {
+			# Postpone timeout by two minutes so that we get
+			# enough time to do our job.
+			my $time = time() - $cache_lifetime * 60 + 120;
+			utime $time, $time, $cache_file;
+		}
+		@projects = git_get_projects_details($projlist, $check_forks);
+		if ($cache_lifetime and open (my $fd, '>'.$cache_file)) {
+			use Data::Dumper;
+			print $fd Dumper(\@projects);
+			close $fd;
+		}
+	}
 
 	$order ||= $default_projects_order;
 	$from = 0 unless defined $from;
 	$to = $#projects if (!defined $to || $#projects < $to);
 
+	if ($cache_lifetime and $stale) {
+		print "<div class=\"stale_info\">Cached version (${stale}s old)</div>\n";
+	}
+
 	print "<table class=\"project_list\">\n";
 	unless ($no_header) {
 		print "<tr>\n";
@@ -3927,7 +3976,7 @@ sub git_project_list {
 		close $fd;
 		print "</div>\n";
 	}
-	git_project_list_body(\@list, $order);
+	git_project_list_body(\@list, $order, undef, undef, undef, undef, $projlist_cache_lifetime);
 	git_footer_html();
 }
 


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-13 23:14 [PATCH] gitweb: Support caching projects list Petr Baudis
@ 2008-03-14  0:07 ` Jay Soffian
  2008-03-14  0:22   ` Petr Baudis
  2008-03-14 15:29   ` [PATCH] gitweb: Support caching projects list Jakub Narebski
  2008-03-14  0:19 ` Junio C Hamano
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: Jay Soffian @ 2008-03-14  0:07 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, git

On Thu, Mar 13, 2008 at 7:14 PM, Petr Baudis <pasky@suse.cz> wrote:
>  diff --git a/gitweb/gitweb.css b/gitweb/gitweb.css
>  index 8e2bf3d..673077a 100644
>  --- a/gitweb/gitweb.css
>  +++ b/gitweb/gitweb.css
>  @@ -85,6 +85,12 @@ div.title, a.title {
>         color: #000000;
>   }
>
>  +div.stale_info {
>  +       display: block;
>  +       text-align: right;
>  +       font-style: italic;
>  +}
>  +
>   div.readme {
>         padding: 8px;
>   }

What does this have to do with it?

>  diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>  index bcb6193..0eee195 100755
>  --- a/gitweb/gitweb.perl
>  +++ b/gitweb/gitweb.perl
>  @@ -122,6 +122,15 @@ our $fallback_encoding = 'latin1';

...

>  +               if ($cache_lifetime and -f $cache_file) {
>  +                       # Postpone timeout by two minutes so that we get
>  +                       # enough time to do our job.
>  +                       my $time = time() - $cache_lifetime + 120;
>  +                       utime $time, $time, $cache_file;
>  +               }

Race condition. I don't see any locking. Nothing keeps multiple instances from
regenerating the cache concurrently...

>  +               @projects = git_get_projects_details($projlist, $check_forks);
>  +               if ($cache_lifetime and open (my $fd, '>'.$cache_file)) {

...and then clobbering each other here. You have two choices:

1) Use a lock file for the critical section.

2) Assume the race condition is rare enough, but you still need to account for
it. In that case, you want to write to a temporary file and then rename to the
cache file name. The rename is atomic, so though N instances of gitweb may
regenerate the cache (at some CPU/IO overhead), at least the cache file won't
get corrupted.
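
A minimal sketch of option 1, assuming a flock()-based lock on a separate
lock file. refresh_with_lock() and the lock file path are hypothetical names,
not part of the patch. An instance that cannot get the lock skips the refresh
rather than waiting for it:

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Run $refresh->() only if an exclusive, non-blocking lock on $lock_file
# can be taken. Returns 1 if we refreshed, 0 if someone else holds the lock.
sub refresh_with_lock {
    my ($lock_file, $refresh) = @_;

    open(my $lock_fh, '>>', $lock_file)
        or die "Cannot open $lock_file: $!";

    unless (flock($lock_fh, LOCK_EX | LOCK_NB)) {
        close $lock_fh;
        return 0;    # another instance is already regenerating
    }

    # Always release the lock, even if the refresh dies.
    my $ok  = eval { $refresh->(); 1 };
    my $err = $@;
    flock($lock_fh, LOCK_UN);
    close $lock_fh;
    die $err unless $ok;

    return 1;
}
```

With this shape at most one instance regenerates the cache at a time; what the
losers do (serve the old file, or scan on their own) is a separate decision.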

Out of curiosity, repo.or.cz isn't running this as a CGI, is it? If so, wouldn't
running it under FastCGI or mod_perl be a vast improvement?

j.


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14  0:07 ` Jay Soffian
@ 2008-03-14  0:22   ` Petr Baudis
  2008-03-14  0:27     ` Jay Soffian
  2008-03-14  0:36     ` J.H.
  2008-03-14 15:29   ` [PATCH] gitweb: Support caching projects list Jakub Narebski
  1 sibling, 2 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-14  0:22 UTC (permalink / raw)
  To: Jay Soffian; +Cc: Junio C Hamano, git

On Thu, Mar 13, 2008 at 08:07:09PM -0400, Jay Soffian wrote:
> On Thu, Mar 13, 2008 at 7:14 PM, Petr Baudis <pasky@suse.cz> wrote:
> >  diff --git a/gitweb/gitweb.css b/gitweb/gitweb.css
> >  index 8e2bf3d..673077a 100644
> >  --- a/gitweb/gitweb.css
> >  +++ b/gitweb/gitweb.css
> >  @@ -85,6 +85,12 @@ div.title, a.title {
> >         color: #000000;
> >   }
> >
> >  +div.stale_info {
> >  +       display: block;
> >  +       text-align: right;
> >  +       font-style: italic;
> >  +}
> >  +
> >   div.readme {
> >         padding: 8px;
> >   }
> 
> What does this have to do with it?

The box shows that cached information is being shown.

> >  diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> >  index bcb6193..0eee195 100755
> >  --- a/gitweb/gitweb.perl
> >  +++ b/gitweb/gitweb.perl
> >  @@ -122,6 +122,15 @@ our $fallback_encoding = 'latin1';
> 
> ...
> 
> >  +               if ($cache_lifetime and -f $cache_file) {
> >  +                       # Postpone timeout by two minutes so that we get
> >  +                       # enough time to do our job.
> >  +                       my $time = time() - $cache_lifetime + 120;
> >  +                       utime $time, $time, $cache_file;
> >  +               }
> 
> Race condition. I don't see any locking. Nothing keeps multiple instances from
> regenerating the cache concurrently...
> 
> >  +               @projects = git_get_projects_details($projlist, $check_forks);
> >  +               if ($cache_lifetime and open (my $fd, '>'.$cache_file)) {
> 
> ...and then clobbering each other here. You have two choices:
> 
> 1) Use a lock file for the critical section.
> 
> 2) Assume the race condition is rare enough, but you still need to account for
> it. In that case, you want to write to a temporary file and then rename to the
> cache file name. The rename is atomic, so though N instances of gitweb may
> regenerate the cache (at some CPU/IO overhead), at least the cache file won't
> get corrupted.

You are of course right - I wanted to do the rename, but forgot to write
it in the actual code. :-)

There is a more conceptual problem, though - for such big sites,
it really makes more sense to explicitly regenerate the cache
periodically instead of making random clients wait it out.
We could add a 'force_update' parameter, accepted from localhost only,
that always regenerates the cache, but that feels rather kludgy -
can anyone think of a more elegant solution? (I don't think taking the
@projects generating code out of gitweb and then having to worry during
gitweb upgrades is any better.)

> Out of curiosity, repo.or.cz isn't running this as a CGI is it? If so, wouldn't
> running it as a FastCGI or modperl be a vast improvement?

Unlikely. Currently the machine is mostly IO-bound and only a small
portion of CPU usage comes from gitweb itself.

-- 
				Petr "Pasky" Baudis
Whatever you can do, or dream you can, begin it.
Boldness has genius, power, and magic in it.	-- J. W. von Goethe


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14  0:22   ` Petr Baudis
@ 2008-03-14  0:27     ` Jay Soffian
  2008-03-14  0:30       ` J.H.
  2008-03-14  0:36     ` J.H.
  1 sibling, 1 reply; 30+ messages in thread
From: Jay Soffian @ 2008-03-14  0:27 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, git

On Thu, Mar 13, 2008 at 8:22 PM, Petr Baudis <pasky@suse.cz> wrote:

>  There is a more conceptual problem though - in case of such big sites,
>  it really makes more sense to explicitly regenerate the cache
>  periodically instead of making random clients to have to wait it out.

Fork off a child to update the cache?

>  Unlikely. Currently the machine is mostly IO-bound and only small
>  portion of CPU usage comes from gitweb itself.

Except that if it were FastCGI or mod_perl you could just keep the cache in
memory.

j.


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14  0:27     ` Jay Soffian
@ 2008-03-14  0:30       ` J.H.
  2008-03-14 12:17         ` Jakub Narebski
  0 siblings, 1 reply; 30+ messages in thread
From: J.H. @ 2008-03-14  0:30 UTC (permalink / raw)
  To: Jay Soffian; +Cc: Petr Baudis, Junio C Hamano, git

You would be better off using some of the logic I've got in the caching
version of gitweb to prevent the race condition.

- John 'Warthog9' Hawley

On Thu, 2008-03-13 at 20:27 -0400, Jay Soffian wrote:
> On Thu, Mar 13, 2008 at 8:22 PM, Petr Baudis <pasky@suse.cz> wrote:
> 
> >  There is a more conceptual problem though - in case of such big sites,
> >  it really makes more sense to explicitly regenerate the cache
> >  periodically instead of making random clients to have to wait it out.
> 
> Fork off a child to update the cache?
> 
> >  Unlikely. Currently the machine is mostly IO-bound and only small
> >  portion of CPU usage comes from gitweb itself.
> 
> Except that if it were FastCGI or mod_perl you could just keep the cache in
> memory.
> 
> j.


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14  0:30       ` J.H.
@ 2008-03-14 12:17         ` Jakub Narebski
  0 siblings, 0 replies; 30+ messages in thread
From: Jakub Narebski @ 2008-03-14 12:17 UTC (permalink / raw)
  To: J.H.; +Cc: Jay Soffian, Petr Baudis, Junio C Hamano, git

"J.H." <warthog19@eaglescrag.net> writes:

> You would be better off using some of the logic I've got in the caching
> version of gitweb to prevent the race condition.

Or use Cache::FileCache, which CGI::Cache uses...


By the way, J.H., would you have time for, and be interested in, becoming
the "Gitweb caching" project mentor for Google Summer of Code 2008:

  http://git.or.cz/gitwiki/SoC2008Ideas

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14  0:22   ` Petr Baudis
  2008-03-14  0:27     ` Jay Soffian
@ 2008-03-14  0:36     ` J.H.
  2008-03-17 17:49       ` repo.or.cz renovation Petr Baudis
  1 sibling, 1 reply; 30+ messages in thread
From: J.H. @ 2008-03-14  0:36 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Jay Soffian, Junio C Hamano, git


> You are of course right - I wanted to do the rename, but forgot to write
> it in the actual code. :-)
> 
> There is a more conceptual problem though - in case of such big sites,
> it really makes more sense to explicitly regenerate the cache
> periodically instead of making random clients to have to wait it out.
> We could add a 'force_update' parameter to accept from localhost only
> that will always regenerate the cache, but that feels rather kludgy -
> can anyone think of a more elegant solution? (I don't think taking the
> @projects generating code out of gitweb and then having to worry during
> gitweb upgrades is any better.)

You could do something similar to the gitweb caching I'm doing.
Basically, if a file isn't generated you make the user wait (no good way
around this really).  If a cache exists, show it to the user unless the
cache is older than $foo.  If a regeneration needs to happen, it happens
in the background, so the user who triggers the regeneration sees
something immediately vs. having to wait (at the cost of showing out of
date data).
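
The decision described above can be sketched as a small pure function. This is
only an illustration of the policy, not the code from the caching gitweb;
cache_action() and its return values are made-up names, and the background
regeneration itself (fork or similar) is left out:

```perl
use strict;
use warnings;

# Decide what to do for one request.
#   $cache_mtime - mtime of the cache file, or undef if there is no cache yet
#   $now         - current time in seconds
#   $max_age     - how old (in seconds) the cache may get before a refresh
sub cache_action {
    my ($cache_mtime, $now, $max_age) = @_;

    # Nothing cached yet: this user has to wait for generation.
    return 'generate_and_wait' unless defined $cache_mtime;

    # Cache is fresh enough: just show it.
    return 'serve_cached' if $now - $cache_mtime <= $max_age;

    # Cache is too old: show it anyway, refresh in the background.
    return 'serve_stale_and_refresh';
}
```

Only the very first visitor ever waits; everyone else gets an answer
immediately, possibly an out-of-date one.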

> > Out of curiosity, repo.or.cz isn't running this as a CGI is it? If so, wouldn't
> > running it as a FastCGI or modperl be a vast improvement?
> 
> Unlikely. Currently the machine is mostly IO-bound and only small
> portion of CPU usage comes from gitweb itself.

That's about the same as what I saw: it's disk bound vs. cpu/memory
bound.

- John 'Warthog9' Hawley


* repo.or.cz renovation
  2008-03-14  0:36     ` J.H.
@ 2008-03-17 17:49       ` Petr Baudis
  2008-03-17 18:11         ` Petr Baudis
  2008-03-17 18:44         ` J.H.
  0 siblings, 2 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-17 17:49 UTC (permalink / raw)
  To: J.H.; +Cc: Jay Soffian, Junio C Hamano, git

On Thu, Mar 13, 2008 at 05:36:39PM -0700, J.H. wrote:
> 
> > You are of course right - I wanted to do the rename, but forgot to write
> > it in the actual code. :-)
> > 
> > There is a more conceptual problem though - in case of such big sites,
> > it really makes more sense to explicitly regenerate the cache
> > periodically instead of making random clients to have to wait it out.
> > We could add a 'force_update' parameter to accept from localhost only
> > that will always regenerate the cache, but that feels rather kludgy -
> > can anyone think of a more elegant solution? (I don't think taking the
> > @projects generating code out of gitweb and then having to worry during
> > gitweb upgrades is any better.)
> 
> You could do something similar to the gitweb caching I'm doing,
> basically if a file isn't generated you make a user wait (no good way
> around this really).  If a cache exists show it to the user unless the
> cache is older than $foo.  If a re-generation needs to happen it happens
> in the background so the user who triggers the regeneration sees
> something immediately vs. having to wait (at the cost of showing out of
> date data)

By the way, the index page is so far really the only bottleneck I'm
seeing; other than that, even project pages for huge repositories are
shown pretty quickly. Did you ever try to just cache the index page on
kernel.org? What sort of impact did it have? What were the hotspots -
project pages for the main repositories or some less obvious pages?

Just caching the index would be a far less intrusive change than
introducing caching everywhere and it might help to bring kernel.org
gitweb back in sync with mainline. :-)

-- 
				Petr "Pasky" Baudis
Whatever you can do, or dream you can, begin it.
Boldness has genius, power, and magic in it.	-- J. W. von Goethe


* Re: repo.or.cz renovation
  2008-03-17 17:49       ` repo.or.cz renovation Petr Baudis
@ 2008-03-17 18:11         ` Petr Baudis
  2008-03-17 18:44         ` J.H.
  1 sibling, 0 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-17 18:11 UTC (permalink / raw)
  To: J.H.; +Cc: Jay Soffian, Junio C Hamano, git

  (Sorry, I messed up the subject here. See the other mail if you are
interested in the recent speedups of repo.or.cz. And if replying, it
would be nice to revert the subject back. :-)

				Petr "Pasky" Baudis


* Re: repo.or.cz renovation
  2008-03-17 17:49       ` repo.or.cz renovation Petr Baudis
  2008-03-17 18:11         ` Petr Baudis
@ 2008-03-17 18:44         ` J.H.
  2008-03-17 20:41           ` Jakub Narebski
  2008-03-17 21:09           ` Jakub Narebski
  1 sibling, 2 replies; 30+ messages in thread
From: J.H. @ 2008-03-17 18:44 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Jay Soffian, Junio C Hamano, git

On Mon, 2008-03-17 at 18:49 +0100, Petr Baudis wrote:
> On Thu, Mar 13, 2008 at 05:36:39PM -0700, J.H. wrote:
> > 
> > > You are of course right - I wanted to do the rename, but forgot to write
> > > it in the actual code. :-)
> > > 
> > > There is a more conceptual problem though - in case of such big sites,
> > > it really makes more sense to explicitly regenerate the cache
> > > periodically instead of making random clients to have to wait it out.
> > > We could add a 'force_update' parameter to accept from localhost only
> > > that will always regenerate the cache, but that feels rather kludgy -
> > > can anyone think of a more elegant solution? (I don't think taking the
> > > @projects generating code out of gitweb and then having to worry during
> > > gitweb upgrades is any better.)
> > 
> > You could do something similar to the gitweb caching I'm doing,
> > basically if a file isn't generated you make a user wait (no good way
> > around this really).  If a cache exists show it to the user unless the
> > cache is older than $foo.  If a re-generation needs to happen it happens
> > in the background so the user who triggers the regeneration sees
> > something immediately vs. having to wait (at the cost of showing out of
> > date data)
> 
> By the way, the index page is so far really the only bottleneck I'm
> seeing, other than that even project pages for huge repositories are
> shown pretty quickly. Did you ever try to just cache the index page on
> > kernel.org? What sort of impact did it have? What were the hotspots -
> project pages for the main repositories or some less obvious pages?
> 
> Just caching the index would be far less intrusive change than
> introducing caching everywhere and it might help to bring kernel.org
> gitweb back in sync with mainline. :-)

I think we are likely going to want to keep caching everything vs. just
the front page.  There are a few repos that get hit quite a bit and it
would be better to have those cached vs. not.  Really I would argue this
is just a step in the direction of integrating all of my caching changes
back into gitweb vs. us dropping what we've done so far.

BTW I'm about halfway through refactoring my tree from multiple files
back to one, which at that point means I can start bringing it back into
mainline and getting a patch series ready for submission.

- John


* Re: repo.or.cz renovation
  2008-03-17 18:44         ` J.H.
@ 2008-03-17 20:41           ` Jakub Narebski
  2008-03-17 21:09           ` Jakub Narebski
  1 sibling, 0 replies; 30+ messages in thread
From: Jakub Narebski @ 2008-03-17 20:41 UTC (permalink / raw)
  To: J.H.; +Cc: Petr Baudis, Jay Soffian, Junio C Hamano, git

"J.H." <warthog19@eaglescrag.net> writes:

> BTW I'm about halfway through refactoring my tree from multiple files
> back to one, which at that point means I can start bringing it back into
> mainline and getting a patch series ready for submission.

I plan on sending email with my ideas on gitweb caching somewhere
between now and tomorrow. I'll try to cover my ideas about how to
cache (support for cache validation for external cache, caching Perl
structures / data from git commands, caching final output: HTML, RSS,
etc.) and what solutions can be used.

I wanted first to send my enhancements to Petr Baudis's patch adding
caching support for the projects list, i.e. the third patch in the series,
and get comments (currently none) on *this* patch (idea).

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: repo.or.cz renovation
  2008-03-17 18:44         ` J.H.
  2008-03-17 20:41           ` Jakub Narebski
@ 2008-03-17 21:09           ` Jakub Narebski
  1 sibling, 0 replies; 30+ messages in thread
From: Jakub Narebski @ 2008-03-17 21:09 UTC (permalink / raw)
  To: J.H.; +Cc: Petr Baudis, Jay Soffian, Junio C Hamano, git

"J.H." <warthog19@eaglescrag.net> writes:

> BTW I'm about halfway through refactoring my tree from multiple files
> back to one, which at that point means I can start bringing it back into
> mainline and getting a patch series ready for submission.

BTW it would be nice to have a merge strategy (blame-based perhaps?)
which would make it easy to merge changes from the split project back
into the original single-file one...  But I guess you are not interested
in writing such a merge strategy just for this ;-)

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14  0:07 ` Jay Soffian
  2008-03-14  0:22   ` Petr Baudis
@ 2008-03-14 15:29   ` Jakub Narebski
  2008-03-14 21:11     ` Jay Soffian
  1 sibling, 1 reply; 30+ messages in thread
From: Jakub Narebski @ 2008-03-14 15:29 UTC (permalink / raw)
  To: Jay Soffian; +Cc: Petr Baudis, Junio C Hamano, git

"Jay Soffian" <jaysoffian@gmail.com> writes:

> On Thu, Mar 13, 2008 at 7:14 PM, Petr Baudis <pasky@suse.cz> wrote:
> 
> ...
> 
> >  +               if ($cache_lifetime and -f $cache_file) {
> >  +                       # Postpone timeout by two minutes so that we get
> >  +                       # enough time to do our job.
> >  +                       my $time = time() - $cache_lifetime + 120;
> >  +                       utime $time, $time, $cache_file;
> >  +               }
> 
> Race condition. I don't see any locking. Nothing keeps multiple
> instances from regenerating the cache concurrently...
> 
> >  +               @projects = git_get_projects_details($projlist, $check_forks);
> >  +               if ($cache_lifetime and open (my $fd, '>'.$cache_file)) {
> 
> ...and then clobbering each other here. You have two choices:
> 
> 1) Use a lock file for the critical section.
> 
> 2) Assume the race condition is rare enough, but you still need to
> account for it. In that case, you want to write to a temporary file
> and then rename to the cache file name. The rename is atomic, so
> though N instances of gitweb may regenerate the cache (at some
> CPU/IO overhead), at least the cache file won't get corrupted.

What should the code for this look like? Like below?

        use File::Temp;
        
        my ($fh, $temp_file) = tempfile();
        ...
        close $fh;
        rename $temp_file, $cache_file;


-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14 15:29   ` [PATCH] gitweb: Support caching projects list Jakub Narebski
@ 2008-03-14 21:11     ` Jay Soffian
  0 siblings, 0 replies; 30+ messages in thread
From: Jay Soffian @ 2008-03-14 21:11 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Petr Baudis, Junio C Hamano, git

On Fri, Mar 14, 2008 at 11:29 AM, Jakub Narebski <jnareb@gmail.com> wrote:
>  What should the code for this look like? Like below?
>
>         use File::Temp;
>
>         my ($fh, $temp_file) = tempfile();
>         ...
>         close $fh;
>         rename $temp_file, $cache_file;

I always use something like:

  my $temp_file = "$cache_file.tmp$$";
  open(my $fh, ">$temp_file");

to ensure that the temp file is on the same filesystem.
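
Spelled out a bit more, as a sketch only: write_cache_atomically() is a
hypothetical helper name, and the Data::Dumper format is simply the one the
patch already uses. The temp file lives next to the cache file, so rename()
never has to cross filesystems:

```perl
use strict;
use warnings;
use Data::Dumper;

# Dump $projects (an array ref) to a sibling temp file, then rename it over
# $cache_file, so readers never see a partially written cache.
# Returns 1 on success, 0 on any failure.
sub write_cache_atomically {
    my ($cache_file, $projects) = @_;

    # Same directory as the cache file, hence the same filesystem.
    my $temp_file = "$cache_file.tmp$$";

    open(my $fh, '>', $temp_file) or return 0;
    my $ok = print {$fh} Dumper($projects);
    $ok = close($fh) && $ok;    # close even if the print failed

    $ok = rename($temp_file, $cache_file) if $ok;
    unlink $temp_file unless $ok;    # do not leave debris behind

    return $ok ? 1 : 0;
}
```

Several instances may still regenerate at once, but each rename replaces the
cache with a complete file. The predictable temp name does nothing about the
/tmp concerns raised elsewhere in the thread.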

j.


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-13 23:14 [PATCH] gitweb: Support caching projects list Petr Baudis
  2008-03-14  0:07 ` Jay Soffian
@ 2008-03-14  0:19 ` Junio C Hamano
  2008-03-14  8:35 ` Frank Lichtenheld
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2008-03-14  0:19 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git

Petr Baudis <pasky@suse.cz> writes:

> To prevent contention when multiple accesses coincide with cache
> expiration, the timeout is postponed to time()+120 when we start
> refreshing. When showing cached version, a disclaimer is shown at the
> top of the projects list.

Isn't this still racy when two requests come at about the same time?
Perhaps you can avoid it by using a lockfile?


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-13 23:14 [PATCH] gitweb: Support caching projects list Petr Baudis
  2008-03-14  0:07 ` Jay Soffian
  2008-03-14  0:19 ` Junio C Hamano
@ 2008-03-14  8:35 ` Frank Lichtenheld
  2008-03-14 12:14 ` Jakub Narebski
  2008-03-15 21:44 ` Jakub Narebski
  4 siblings, 0 replies; 30+ messages in thread
From: Frank Lichtenheld @ 2008-03-14  8:35 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, git

On Fri, Mar 14, 2008 at 12:14:14AM +0100, Petr Baudis wrote:
> +# projects list cache for busy sites with many projects;
> +# if you set this to non-zero, it will be used as the cached
> +# index lifetime in minutes
> +# the cached list version is stored in /tmp and can be tweaked
> +# by other scripts running with the same uid as gitweb - use this
> +# only at secure installations; only single gitweb project root per
> +# system is supported!
> +our $projlist_cache_lifetime = 0;

I think that would be a situation where an uppercase disclaimer would be
appropriate ;)

In addition to the race condition problem mentioned in other mails it
also has a symlink vulnerability.
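
To illustrate the symlink point: a plain open('>'.$cache_file) follows
whatever already sits at that path in /tmp. One common way to refuse that,
sketched here under the assumption of a POSIX system, is to create the file
with O_CREAT|O_EXCL. open_new_file_exclusively() is a made-up name:

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Create $path for writing only if nothing exists there yet.
# With O_CREAT|O_EXCL the open fails if $path already exists, including
# the case where it is a symlink planted by someone else.
# Returns a filehandle on success, undef otherwise.
sub open_new_file_exclusively {
    my ($path) = @_;
    sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_EXCL, 0600)
        or return;
    return $fh;
}
```

This would fit a write-to-temp-then-rename scheme, where the file being
created is always a fresh temp file rather than the cache itself.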

I think one should seriously consider reusing an existing caching
solution instead of reinventing the wheel here.
There are a lot of CPAN modules to do that and at least apache also
has modules for that which you could use without any code changes
at all...

Gruesse,
-- 
Frank Lichtenheld <frank@lichtenheld.de>
www: http://www.djpig.de/


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-13 23:14 [PATCH] gitweb: Support caching projects list Petr Baudis
                   ` (2 preceding siblings ...)
  2008-03-14  8:35 ` Frank Lichtenheld
@ 2008-03-14 12:14 ` Jakub Narebski
  2008-03-17 17:40   ` Petr Baudis
  2008-03-15 21:44 ` Jakub Narebski
  4 siblings, 1 reply; 30+ messages in thread
From: Jakub Narebski @ 2008-03-14 12:14 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, git

Petr Baudis <pasky@suse.cz> writes:

> On repo.or.cz (permanently I/O overloaded and hosting 1050 project +
> forks), the projects list (the default gitweb page) can take more than
> a minute to generate. This naive patch adds simple support for caching
> the projects list data structure so that all the projects do not need
> to get rescanned at every page access.

Nice.

BTW adding caching to gitweb is one of the proposed ideas (projects) for
Google Summer of Code 2008: http://git.or.cz/gitwiki/SoC2008Ideas

> For clarity, projects scanning and @projects population is separated
> to git_get_projects_details().

Perhaps this could be submitted as separate patch?
I could do this if you are otherwise busy...


[...]
> +	if ($cache_lifetime and -f $cache_file
> +	    and stat($cache_file)->mtime + $cache_lifetime * 60 > time()
> +	    and open (my $fd, $cache_file)) {
> +		$stale = time() - stat($cache_file)->mtime;
> +		my @dump = <$fd>;
> +		close $fd;
> +		# Hack zone start
> +		my $VAR1;
> +		eval join("\n", @dump);
> +		@projects = @$VAR1;
> +		# Hack zone end

Why do you read line by line, only to join it, i.e.
  my @dump = <$fd>; ... join("\n", @dump);
instead of slurping the whole file in one go:
  local $/ = undef; my $dump = <$fd>; ... $dump;

Besides, why do you use Data::Dumper instead of Storable? Both are
distributed with Perl; well, at least both are in perl-5.8.6-24.
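
Roughly what I have in mind, as a sketch; save_projects(), load_projects() and
slurp_file() are illustrative names only, and the cache file should still be
treated as trusted input:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Serialize the list of project hashes to $cache_file.
sub save_projects {
    my ($cache_file, @projects) = @_;
    store(\@projects, $cache_file);
    return;
}

# Read the list back; no string eval and no $VAR1 involved.
sub load_projects {
    my ($cache_file) = @_;
    return @{ retrieve($cache_file) };
}

# Slurp a whole file in one read instead of line by line plus join.
sub slurp_file {
    my ($file) = @_;
    open(my $fd, '<', $file) or die "Cannot open $file: $!";
    local $/;    # undefined record separator: read everything at once
    my $content = <$fd>;
    close $fd;
    return $content;
}
```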

[...]
> -	git_project_list_body(\@list, $order);
> +	git_project_list_body(\@list, $order, undef, undef, undef, undef, $projlist_cache_lifetime);

This is ugly. Why not use hash for "named parameters", as it is done
in a few separate places in gitweb (search for '%opts')?
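
Something along these lines, as a sketch: project_list_args() is a
hypothetical stand-in for the argument handling of git_project_list_body(),
and the option names are only examples in the '-name' style:

```perl
use strict;
use warnings;

# One positional argument ($projlist); everything else is passed as
# named options and falls back to a default when omitted.
sub project_list_args {
    my ($projlist, %opts) = @_;

    my %args = (
        order          => 'project',
        from           => 0,
        to             => $#$projlist,
        no_header      => 0,
        cache_lifetime => 0,
    );
    for my $name (keys %args) {
        $args{$name} = $opts{"-$name"} if defined $opts{"-$name"};
    }
    return %args;
}

# The call site no longer needs a run of undefs:
my %args = project_list_args([ 'a.git', 'b.git' ],
    -order => 'age', -cache_lifetime => 5);
```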

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-14 12:14 ` Jakub Narebski
@ 2008-03-17 17:40   ` Petr Baudis
  0 siblings, 0 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-17 17:40 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Junio C Hamano, git

  Hi,

On Fri, Mar 14, 2008 at 05:14:51AM -0700, Jakub Narebski wrote:
> Petr Baudis <pasky@suse.cz> writes:
> [...]
> > +	if ($cache_lifetime and -f $cache_file
> > +	    and stat($cache_file)->mtime + $cache_lifetime * 60 > time()
> > +	    and open (my $fd, $cache_file)) {
> > +		$stale = time() - stat($cache_file)->mtime;
> > +		my @dump = <$fd>;
> > +		close $fd;
> > +		# Hack zone start
> > +		my $VAR1;
> > +		eval join("\n", @dump);
> > +		@projects = @$VAR1;
> > +		# Hack zone end
> 
> Why do you read line by line, only to join it, i.e.
>   my @dump = <$fd>; ... join("\n", @dump);
> instead of slurping the whole file in one go:
>   local $/ = undef; my $dump = <$fd>; ... $dump;
> 
> Besides, why do you use Data::Dumper instead of Storable? Both are
> distributed with Perl; well, at least both are in perl-5.8.6-24.

  No particular reason - I had simply never heard of Storable. I learned
Perl too long ago, it seems. ;-)

> [...]
> > -	git_project_list_body(\@list, $order);
> > +	git_project_list_body(\@list, $order, undef, undef, undef, undef, $projlist_cache_lifetime);
> 
> This is ugly. Why not use hash for "named parameters", as it is done
> in a few separate places in gitweb (search for '%opts')?

  I agree - I was simply too lazy to make another patch. :-)

-- 
				Petr "Pasky" Baudis
Whatever you can do, or dream you can, begin it.
Boldness has genius, power, and magic in it.	-- J. W. von Goethe


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-13 23:14 [PATCH] gitweb: Support caching projects list Petr Baudis
                   ` (3 preceding siblings ...)
  2008-03-14 12:14 ` Jakub Narebski
@ 2008-03-15 21:44 ` Jakub Narebski
  2008-03-16  0:56   ` Miklos Vajna
                     ` (2 more replies)
  4 siblings, 3 replies; 30+ messages in thread
From: Jakub Narebski @ 2008-03-15 21:44 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, git

Petr Baudis <pasky@suse.cz> writes:

> On repo.or.cz (permanently I/O overloaded and hosting 1050 project +
> forks), 

It looks like repo.or.cz is overwhelmed by its success. I hope that
now that there are other software hosting sites with git hosting
(Savannah, GitHub, Gitorious,...) the number of projects wouldn't grow
as rapidly.

> the projects list (the default gitweb page) can take more than
> a minute to generate. This naive patch adds simple support for caching
> the projects list data structure so that all the projects do not need
> to get rescanned at every page access.

Another solution would be to divide projects list page into pages,
perhaps adding search box for searching for a project (by name, by
description and by owner).

Nevertheless even with pagination, if we want to have "sort by last
update" we do need caching.

[...]
> +# projects list cache for busy sites with many projects;
> +# if you set this to non-zero, it will be used as the cached
> +# index lifetime in minutes
> +# the cached list version is stored in /tmp and can be tweaked
> +# by other scripts running with the same uid as gitweb - use this
> +# only at secure installations; only single gitweb project root per
> +# system is supported!
> +our $projlist_cache_lifetime = 0;

[...]
> +sub git_project_list_body {
[...]
> +	my $cache_file = '/tmp/gitweb.index.cache';
> +	use File::stat;
> +
> +	my @projects;
> +	my $stale = 0;
> +	if ($cache_lifetime and -f $cache_file
> +	    and stat($cache_file)->mtime + $cache_lifetime * 60 > time()
> +	    and open (my $fd, $cache_file)) {
> +		$stale = time() - stat($cache_file)->mtime;
> +		my @dump = <$fd>;
> +		close $fd;
> +		# Hack zone start
> +		my $VAR1;
> +		eval join("\n", @dump);
> +		@projects = @$VAR1;
> +		# Hack zone end
> +	} else {
> +		if ($cache_lifetime and -f $cache_file) {
> +			# Postpone timeout by two minutes so that we get
> +			# enough time to do our job.
> +			my $time = time() - $cache_lifetime + 120;
> +			utime $time, $time, $cache_file;
> +		}
> +		@projects = git_get_projects_details($projlist, $check_forks);
> +		if ($cache_lifetime and open (my $fd, '>'.$cache_file)) {
> +			use Data::Dumper;
> +			print $fd Dumper(\@projects);
> +			close $fd;
> +		}
> +	}

This could be much simplified with perl-cache (perl-Cache-Cache).
Unfortunately this is non-standard module, not distributed (yet?)
with Perl.

Warning: not tested in gitweb!

+	use Cache::FileCache;
+
+	my $cache;
+	my $projects;
+	
+	if ($cache_lifetime) {
+		$cache = Cache::FileCache->new(
+			{ namespace => 'gitweb',
+			  # Cache::Cache expiry times are given in seconds
+			  default_expires_in => $cache_lifetime * 60
+			});
+		$projects = $cache->get('projects_list');
+	}
+	if (!defined $projects) {
+		$projects = [ git_get_projects_details($projlist, $check_forks) ];
+		$cache->set('projects_list', $projects)
+			if defined $cache;
+	}
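
For reference, the hand-rolled freshness check and the "postpone the timeout while refreshing" trick from the quoted patch can be sketched with core Perl only. The helper names cache_is_fresh and postpone_expiry are hypothetical. This sketch converts the lifetime from minutes to seconds before subtracting it, whereas the quoted hunk subtracts $cache_lifetime directly.

```perl
use strict;
use warnings;

# $lifetime_min is in minutes, as in the patch; all arithmetic below
# is done in seconds. $now is passed in to keep the check testable.
sub cache_is_fresh {
	my ($file, $lifetime_min, $now) = @_;
	return 0 unless $lifetime_min and -f $file;
	my $mtime = (stat($file))[9];
	return ($mtime + $lifetime_min * 60 > $now) ? 1 : 0;
}

# Before regenerating, move the mtime so that the file looks fresh for
# another $grace seconds. Requests arriving in that window keep serving
# the old copy instead of all rescanning the projects at once.
sub postpone_expiry {
	my ($file, $lifetime_min, $now, $grace) = @_;
	my $t = $now - $lifetime_min * 60 + $grace;
	return utime($t, $t, $file);
}
```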

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-15 21:44 ` Jakub Narebski
@ 2008-03-16  0:56   ` Miklos Vajna
  2008-03-16 11:41   ` Frank Lichtenheld
  2008-03-17 18:10   ` repo.or.cz renovated Petr Baudis
  2 siblings, 0 replies; 30+ messages in thread
From: Miklos Vajna @ 2008-03-16  0:56 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Petr Baudis, Junio C Hamano, git


On Sat, Mar 15, 2008 at 02:44:42PM -0700, Jakub Narebski <jnareb@gmail.com> wrote:
> It looks like repo.or.cz is overwhelmed by its success. I hope that
> now that there are other software hosting sites with git hosting
> (Savannah, GitHub, Gitorious,...) the number of projects wouldn't grow
> as rapidly.

I think repo.or.cz is still the only one that offers mirroring of git
repos, and that is quite a handy feature.



* Re: [PATCH] gitweb: Support caching projects list
  2008-03-15 21:44 ` Jakub Narebski
  2008-03-16  0:56   ` Miklos Vajna
@ 2008-03-16 11:41   ` Frank Lichtenheld
  2008-03-16 16:52     ` J.H.
  2008-03-17 18:10   ` repo.or.cz renovated Petr Baudis
  2 siblings, 1 reply; 30+ messages in thread
From: Frank Lichtenheld @ 2008-03-16 11:41 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Petr Baudis, Junio C Hamano, git

On Sat, Mar 15, 2008 at 02:44:42PM -0700, Jakub Narebski wrote:
> Petr Baudis <pasky@suse.cz> writes:
> This could be much simplified with perl-cache (perl-Cache-Cache).
> Unfortunately this is non-standard module, not distributed (yet?)
> with Perl.

I think somebody who actually needs this can be bothered to install a
CPAN perl module. This should probably not be enabled by default anyway.

> Warning: not tested in gitweb!
> 
> +	use Cache::FileCache;
> +
> +	my $cache;
> +	my $projects;
> +	
> +	if ($cache_lifetime) {
> +		$cache = Cache::FileCache->new(
> +			{ namespace => 'gitweb',
> +			  # Cache::Cache expiry times are given in seconds
> +			  default_expires_in => $cache_lifetime * 60
> +			});
> +		$projects = $cache->get('projects_list');
> +	}
> +	if (!defined $projects) {
> +		$projects = [ git_get_projects_details($projlist, $check_forks) ];
> +		$cache->set('projects_list', $projects)
> +			if defined $cache;
> +	}

Gruesse,
-- 
Frank Lichtenheld <frank@lichtenheld.de>
www: http://www.djpig.de/


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-16 11:41   ` Frank Lichtenheld
@ 2008-03-16 16:52     ` J.H.
  2008-03-16 18:37       ` Jakub Narebski
  0 siblings, 1 reply; 30+ messages in thread
From: J.H. @ 2008-03-16 16:52 UTC (permalink / raw)
  To: Frank Lichtenheld; +Cc: Jakub Narebski, Petr Baudis, Junio C Hamano, git

On Sun, 2008-03-16 at 12:41 +0100, Frank Lichtenheld wrote:
> On Sat, Mar 15, 2008 at 02:44:42PM -0700, Jakub Narebski wrote:
> > Petr Baudis <pasky@suse.cz> writes:
> > This could be much simplified with perl-cache (perl-Cache-Cache).
> > Unfortunately this is non-standard module, not distributed (yet?)
> > with Perl.
> 
> I think somebody who actually needs this can be bothered to install a
> CPAN perl module. This should probably not be enabled by default anyway.

The people who need the caching are also likely those who are most
averse to using things that don't either come with their distribution or
aren't easily and readily available in something like an extras
repository or a very well trusted contrib repository.  I can at least
vouch for one large site that needs this that doesn't install things via
cpan for a lot of different reasons.


- John 'Warthog9' Hawley


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-16 16:52     ` J.H.
@ 2008-03-16 18:37       ` Jakub Narebski
  2008-03-16 22:37         ` J.H.
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Narebski @ 2008-03-16 18:37 UTC (permalink / raw)
  To: J.H.; +Cc: Frank Lichtenheld, Petr Baudis, Junio C Hamano, git

On Sun, 16 Mar 2008, J.H. wrote:
> On Sun, 2008-03-16 at 12:41 +0100, Frank Lichtenheld wrote:
>> On Sat, Mar 15, 2008 at 02:44:42PM -0700, Jakub Narebski wrote:
>>>  
>>> This could be much simplified with perl-cache (perl-Cache-Cache).
>>> Unfortunately this is non-standard module, not distributed (yet?)
>>> with Perl.
>> 
>> I think somebody who actually needs this can be bothered to install a
>> CPAN perl module. This should probably not be enabled by default anyway.
> 
> The people who need the caching are also likely those who are most
> averse to using things that don't either come with their distribution or
> aren't easily and readily available in something like an extras
> repository or a very well trusted contrib repository.  I can at least
> vouch for one large site that needs this that doesn't install things via
> cpan for a lot of different reasons.

Actually Cache::FileCache, which is part of the Cache-Cache distribution,
should be available in contrib or even extras repository. I have
installed it as perl-Cache-Cache RPM (1.05-1.fc4.rf) on my Aurox 11.1
(which is old Fedora Core 4 based distribution), from Dries RPM
repository (part of FreshRPM now, IIRC).

The problem is that, at least according to the documentation of other,
newer CPAN modules, Cache::FileCache is slow, as it always serializes
using Storable (Storable should be part of the Perl distribution).

We can always install local copy alongside gitweb...

P.S. When searching CPAN for existing modules for caching and CGI
caching I have found Cache::Adaptive::ByLoad which does what
caching-gitweb does, and some solutions in newer caching interfaces,
either CHI or Cache, which try to avoid the thundering herd problem.

P.P.S. Does kernel.org use memcached, or some kind of web cache
(reverse proxy cache) like Varnish or Squid?
-- 
Jakub Narebski
Poland


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-16 18:37       ` Jakub Narebski
@ 2008-03-16 22:37         ` J.H.
  2008-03-16 23:39           ` Jakub Narebski
  0 siblings, 1 reply; 30+ messages in thread
From: J.H. @ 2008-03-16 22:37 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Frank Lichtenheld, Petr Baudis, Junio C Hamano, git

On Sun, 2008-03-16 at 19:37 +0100, Jakub Narebski wrote:
> On Sun, 16 Mar 2008, J.H. wrote:
> > On Sun, 2008-03-16 at 12:41 +0100, Frank Lichtenheld wrote:
> >> On Sat, Mar 15, 2008 at 02:44:42PM -0700, Jakub Narebski wrote:
> >>>  
> >>> This could be much simplified with perl-cache (perl-Cache-Cache).
> >>> Unfortunately this is non-standard module, not distributed (yet?)
> >>> with Perl.
> >> 
> >> I think somebody who actually needs this can be bothered to install a
> >> CPAN perl module. This should probably not be enabled by default anyway.
> > 
> > The people who need the caching are also likely those who are most
> > averse to using things that don't either come with their distribution or
> > aren't easily and readily available in something like an extras
> > repository or a very well trusted contrib repository.  I can at least
> > vouch for one large site that needs this that doesn't install things via
> > cpan for a lot of different reasons.
> 
> Actually Cache::FileCache, which is part of the Cache-Cache distribution,
> should be available in contrib or even extras repository. I have
> installed it as perl-Cache-Cache RPM (1.05-1.fc4.rf) on my Aurox 11.1
> (which is old Fedora Core 4 based distribution), from Dries RPM
> repository (part of FreshRPM now, IIRC).

That would be fine, I don't think the larger sites would have issues
finding a copy then.

> 
> The problem is that, at least according to the documentation of other,
> newer CPAN modules, Cache::FileCache is slow, as it always serializes
> using Storable (Storable should be part of the Perl distribution).

That makes it much less interesting unfortunately.

> We can always install local copy alongside gitweb...
> 
> 
> P.S. When searching CPAN for existing modules for caching and CGI
> caching I have found Cache::Adaptive::ByLoad which does what
> caching-gitweb does, and some solutions in newer caching interfaces,
> either CHI or Cache, which try to avoid the thundering herd problem.

Interesting - may have to take a look at that.

> P.P.S. Does kernel.org use memcached, or some kind of web cache
> (reverse proxy cache) like Varnish or Squid?

No - doesn't buy us anything really unfortunately.  And since I'm doing
caching inside of gitweb itself having multiple layers of caching just
makes things more complicated, adds unnecessary latency to updates, etc.

- John


* Re: [PATCH] gitweb: Support caching projects list
  2008-03-16 22:37         ` J.H.
@ 2008-03-16 23:39           ` Jakub Narebski
  0 siblings, 0 replies; 30+ messages in thread
From: Jakub Narebski @ 2008-03-16 23:39 UTC (permalink / raw)
  To: J.H.; +Cc: Frank Lichtenheld, Petr Baudis, Junio C Hamano, git

J.H. wrote:
> Jakub Narebski wrote:
>>
>> P.S. When searching CPAN for existing modules for caching and CGI
>> caching I have found Cache::Adaptive::ByLoad which does what
>> caching-gitweb does,

I'm not sure about quality of this code, though. It uses Cache::Cache, 
by the way.

>> and some solutions in newer caching interfaces, 
>> either CHI or Cache, which try to avoid the thundering herd problem.
> 
> Interesting - may have to take a look at that.

CHI uses either 'busy_lock [DURATION]' (bump expiration time), or
'expires_variance [FLOAT]' for fuzzy expiration time matching.
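
To illustrate the fuzzy-expiration idea, here is a plain-Perl sketch. It is not CHI's implementation, and the function name fuzzy_expires_at is hypothetical. Each entry gets an effective expiry picked at random from the last ($variance * $lifetime) seconds before the nominal expiry, so entries that were stored together do not all go stale in the same instant.

```perl
use strict;
use warnings;

# $variance is a float between 0.0 (expire exactly on time) and
# 1.0 (may expire any time between creation and the nominal expiry).
# $rand can be supplied for deterministic tests; it defaults to rand().
sub fuzzy_expires_at {
	my ($created_at, $lifetime, $variance, $rand) = @_;
	$rand = rand() unless defined $rand;    # uniform in [0, 1)
	my $expires_at = $created_at + $lifetime;
	my $earliest   = $expires_at - $variance * $lifetime;
	return $earliest + $rand * ($expires_at - $earliest);
}
```

The busy_lock approach is the same trick as the patch's two-minute postponement: the first request that sees an expired entry pushes the expiry forward before it starts recomputing.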

Cache has LRU and FIFO removal strategies.

-- 
Jakub Narebski
Poland


* repo.or.cz renovated
  2008-03-15 21:44 ` Jakub Narebski
  2008-03-16  0:56   ` Miklos Vajna
  2008-03-16 11:41   ` Frank Lichtenheld
@ 2008-03-17 18:10   ` Petr Baudis
  2008-03-17 19:09     ` Junio C Hamano
  2008-03-17 19:34     ` Theodore Tso
  2 siblings, 2 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-17 18:10 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Junio C Hamano, git

On Sat, Mar 15, 2008 at 02:44:42PM -0700, Jakub Narebski wrote:
> Petr Baudis <pasky@suse.cz> writes:
> 
> > On repo.or.cz (permanently I/O overloaded and hosting 1050 project +
> > forks), 
> 
> It looks like repo.or.cz is overwhelmed by its success. I hope that
> now that there are other software hosting sites with git hosting
> (Savannah, GitHub, Gitorious,...) the number of projects wouldn't grow
> as rapidly.

Actually, it was overwhelmed not so much by its success as by lack of
good maintenance. ;-) I gave it some love again for the past week and
the improvement was, well, overwhelming. :-)

I finally fixed tons of failures and broken repositories, and most
importantly repacked some of the big repositories with object databases
in pretty horrid shape. The effect has been immense, having everything
in database of 1/3 the size and single big pack drastically reduced the
I/O load.

Scenario: Site with about 1100 repositories weighing 13GB, running a
fetch job for about 200 of them hourly. About two git-daemon requests
per minute and 10 gitweb requests per minute (the last two numbers are
taken quite sloppily over a small sample of the last ten minutes ;-).
Site is running on 2x1GHz P3 with 2G RAM, repository is on hw RAID5.
(We are currently preparing to migrate it to a more powerful machine.)

Before, the load on the server would normally be about 6 to 15 _all the
time_, and a bunch of git-related processes would be permanently eating
some CPU and crunching on the disk.

After introducing the index caching and repacking the repositories, the
load seems to be around 1 most of the time and hardly ever rises above 3;
all feels very snappy.

So for anyone running a hosting site, make sure your repositories are
nicely packed. It makes huge difference to the I/O load!

> Another solution would be to divide projects list page into pages,
> perhaps adding search box for searching for a project (by name, by
> description and by owner).
> 
> Nevertheless even with pagination, if we want to have "sort by last
> update" we do need caching.

Yes, I'm pondering about pagination, but because of web clients, not the
server load; it takes firefox on my notebook noticeable time to render
this list already, and it's rather big too. Ideas are welcome here.

My current plan is to have a [Search project] box at the front page,
together with direct link to 'show all'. Other than that, what makes
sense to display on the front page? I think recently added projects (age
< 1 week) for sure. I'm not so sure about recently changed projects -
maybe it is better to keep the front page cruft-free.

-- 
				Petr "Pasky" Baudis
Whatever you can do, or dream you can, begin it.
Boldness has genius, power, and magic in it.	-- J. W. von Goethe


* Re: repo.or.cz renovated
  2008-03-17 18:10   ` repo.or.cz renovated Petr Baudis
@ 2008-03-17 19:09     ` Junio C Hamano
  2008-03-17 19:25       ` Petr Baudis
  2008-03-17 19:34     ` Theodore Tso
  1 sibling, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2008-03-17 19:09 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Jakub Narebski, git

Petr Baudis <pasky@suse.cz> writes:

> My current plan is to have a [Search project] box at the front page,
> together with direct link to 'show all'. Other than that, what makes
> sense to display on the front page? I think recently added projects (age
> < 1 week) for sure. I'm not so sure about recently changed projects -
> maybe it is better to keep the front page cruft-free.

"More important projects"?

;-) Ducks...

How about asking project owners to categorize (tag) their own projects and
show them in different categories?  Maybe many of them will start in
"unsorted bin", but if you organize the top page in such a way that the
link to unsorted bin is much less prominent than nicely sorted ones, that
may give people incentive to put their project in a real category.
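
A small sketch of how such a categorized front page could be grouped, assuming each project record carries an optional 'category' field. The field name, the sub name group_by_category and the 'unsorted' label are all hypothetical.

```perl
use strict;
use warnings;

# Group project paths by category. Projects without a category land in
# 'unsorted', which is always listed last so that it is less prominent.
sub group_by_category {
	my (@projects) = @_;
	my %groups;
	for my $p (@projects) {
		my $cat = (defined $p->{category} && length $p->{category})
			? $p->{category} : 'unsorted';
		push @{ $groups{$cat} }, $p->{path};
	}
	my @order = sort {
		($a eq 'unsorted' ? 1 : 0) <=> ($b eq 'unsorted' ? 1 : 0)
			or $a cmp $b
	} keys %groups;
	# Returns a list of [ category, [ paths... ] ] pairs in display order.
	return map { [ $_, $groups{$_} ] } @order;
}
```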


* Re: repo.or.cz renovated
  2008-03-17 19:09     ` Junio C Hamano
@ 2008-03-17 19:25       ` Petr Baudis
  0 siblings, 0 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-17 19:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narebski, git

On Mon, Mar 17, 2008 at 12:09:24PM -0700, Junio C Hamano wrote:
> Petr Baudis <pasky@suse.cz> writes:
> 
> > My current plan is to have a [Search project] box at the front page,
> > together with direct link to 'show all'. Other than that, what makes
> > sense to display on the front page? I think recently added projects (age
> > < 1 week) for sure. I'm not so sure about recently changed projects -
> > maybe it is better to keep the front page cruft-free.
> 
> "More important projects"?
> 
> ;-) Ducks...

I'm all for that, get access to projects people are the most likely to
look for. Would it make sense to count accesses to index pages of each
project and then sort by that? Or sort by some activity index?

> How about asking project owners to categorize (tag) their own projects and
> show them in different categories?  Maybe many of them will start in
> "unsorted bin", but if you organize the top page in such a way that the
> link to unsorted bin is much less prominent than nicely sorted ones, that
> may give people incentive to put their project in a real category.

Actually, no reason to restrict tagging to project owners. This might be
interesting, but again the question is, does anyone really want to
browse the project list based on "show me all kernel-related projects"
or "show me all xorg-related projects", especially as we have forks and
it may not be reliable?

-- 
				Petr "Pasky" Baudis
Whatever you can do, or dream you can, begin it.
Boldness has genius, power, and magic in it.	-- J. W. von Goethe


* Re: repo.or.cz renovated
  2008-03-17 18:10   ` repo.or.cz renovated Petr Baudis
  2008-03-17 19:09     ` Junio C Hamano
@ 2008-03-17 19:34     ` Theodore Tso
  2008-03-17 19:54       ` Petr Baudis
  1 sibling, 1 reply; 30+ messages in thread
From: Theodore Tso @ 2008-03-17 19:34 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Jakub Narebski, Junio C Hamano, git

On Mon, Mar 17, 2008 at 07:10:15PM +0100, Petr Baudis wrote:
> Actually, it was overwhelmed not so much by its success as by lack of
> good maintenance. ;-) I gave it some love again for the past week and
> the improvement was, well, overwhelming. :-)
> 
> I finally fixed tons of failures and broken repositories, and most
> importantly repacked some of the big repositories with object databases
> in pretty horrid shape. The effect has been immense, having everything
> in database of 1/3 the size and single big pack drastically reduced the
> I/O load.

Are you making sure that repositories which are forks off of some
parent repository are using objects/info/alternates to share objects?
(If so you have to be careful when you prune not to drop objects, but
it can make a huge difference in disk utilization and I/O bandwidth).
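
As an illustration, a hosting script could check whether a fork borrows objects by reading that file. This is a sketch; the helper name read_alternates is hypothetical. The file lists one object-store path per line.

```perl
use strict;
use warnings;

# Return the list of alternate object stores recorded for a repository,
# or an empty list if it has no objects/info/alternates file.
sub read_alternates {
	my ($git_dir) = @_;
	my $file = "$git_dir/objects/info/alternates";
	open(my $fh, '<', $file) or return ();
	my @paths;
	while (my $line = <$fh>) {
		chomp $line;
		next if $line eq '';    # skip blank lines
		push @paths, $line;
	}
	close $fh;
	return @paths;
}
```

A repository that appears in some fork's alternates list is one whose objects must not be pruned carelessly.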

At least for master.kernel.org, and for those git repositories which I
own, I make a point of periodically logging in and running git gc,
copying over the object packs so I can do a prune operation safely,
etc.  --- and I suspect most of the master.kernel.org git users do
something similar.  On repo.or.cz we don't have shell access, so the
project administrators can't do that for you.

> So for anyone running a hosting site, make sure your repositories are
> nicely packed. It makes huge difference to the I/O load!

It seems that a Really Good Idea would be to do the packing and
pruning via cron scripts that run during the off hours...
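
A rough sketch of what such a cron job could do, written as two hypothetical helpers: find_repos walks a project root, and repack_command builds the command line for one repository. It assumes bare repositories are directories named *.git that contain HEAD and objects/. It only builds the command list and does not address the pruning caveat for forks mentioned above.

```perl
use strict;
use warnings;
use File::Find;

# Find bare repositories below $root (assumption: *.git directories
# containing a HEAD file and an objects/ directory).
sub find_repos {
	my ($root) = @_;
	my @repos;
	find({
		no_chdir => 1,    # $_ holds the full path inside wanted()
		wanted   => sub {
			return unless -d $_ && /\.git\z/
				&& -f "$_/HEAD" && -d "$_/objects";
			push @repos, $_;
			$File::Find::prune = 1;    # do not descend into the repo
		},
	}, $root);
	my @sorted = sort @repos;
	return @sorted;
}

# Build the repack command for one repository. -a -d consolidates into
# a single pack and removes redundant ones; -l leaves objects borrowed
# from alternates out of the new pack.
sub repack_command {
	my ($repo) = @_;
	return ('git', "--git-dir=$repo", 'repack', '-a', '-d', '-l', '-q');
}
```

A cron wrapper would then loop over find_repos($root) and pass each repack_command(...) list to system(), logging any non-zero exit status.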

> My current plan is to have a [Search project] box at the front page,
> together with direct link to 'show all'. Other than that, what makes
> sense to display on the front page? I think recently added projects (age
> < 1 week) for sure. I'm not so sure about recently changed projects -
> maybe it is better to keep the front page cruft-free.

There are plenty of ways that sites like freshmeat and sourceforge
have come up with to make it easy to browse a large number of software
projects.  One way that might make sense is Sourceforge's Software Map
(i.e., http://sourceforge.net/softwaremap/).

					- Ted


* Re: repo.or.cz renovated
  2008-03-17 19:34     ` Theodore Tso
@ 2008-03-17 19:54       ` Petr Baudis
  0 siblings, 0 replies; 30+ messages in thread
From: Petr Baudis @ 2008-03-17 19:54 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jakub Narebski, Junio C Hamano, git

On Mon, Mar 17, 2008 at 03:34:23PM -0400, Theodore Tso wrote:
> On Mon, Mar 17, 2008 at 07:10:15PM +0100, Petr Baudis wrote:
> > Actually, it was overwhelmed not so much by its success as by lack of
> > good maintenance. ;-) I gave it some love again for the past week and
> > the improvement was, well, overwhelming. :-)
> > 
> > I finally fixed tons of failures and broken repositories, and most
> > importantly repacked some of the big repositories with object databases
> > in pretty horrid shape. The effect has been immense, having everything
> > in database of 1/3 the size and single big pack drastically reduced the
> > I/O load.
> 
> Are you making sure that repositories which are forks off of some
> parent repository are using objects/info/alternates to share objects?
> (If so you have to be careful when you prune not to drop objects, but
> it can make a huge difference in disk utilization and I/O bandwidth).

Yes, I reuse objects from parent projects, that has always been so.

> At least for master.kernel.org, and for those git repositories which I
> own, I make a point of periodically logging in and running git gc,
> copying over the object packs so I can do a prune operation safely,
> etc.  --- and I suspect most of the master.kernel.org git users do
> something similar.  On repo.or.cz we don't have shell access, so the
> project administrators can't do that for you.
> 
> > So for anyone running a hosting site, make sure your repositories are
> > nicely packed. It makes huge difference to the I/O load!
> 
> It seems that a Really Good Idea would be to do the packing and
> pruning via cron scripts that run during the off hours...

Yes, this was done before too; however, repo.or.cz has been around for a
long time and historically the scripts weren't working very well,
especially since I had to be careful about the forks problem.

Since I am repacking on a live system, I think the current repacking
strategy is still not completely error-proof. However, I believe that I
have encountered no breakage because of pruned objects for at least the
last half a year or so that it has been running with the current setup
(all of the breakages I have encountered seem to be caused by a child
process of git-repack dying).  Besides, if some fork breaks, it should be
possible to fix that very easily (I do not back up the object stores at
all anyway - if the server burns down, you will have to re-push ;-).

> > My current plan is to have a [Search project] box at the front page,
> > together with direct link to 'show all'. Other than that, what makes
> > sense to display on the front page? I think recently added projects (age
> > < 1 week) for sure. I'm not so sure about recently changed projects -
> > maybe it is better to keep the front page cruft-free.
> 
> There are plenty of ways that sites like freshmeat and sourceforge
> have come up with to make it easy to browse a large number of software
> projects.  One way that might make sense is Sourceforge's Software Map
> (i.e., http://sourceforge.net/softwaremap/).

This all feels like real overkill; besides, my main doubt is whether
repo.or.cz needs something like this *at all* - but I think I will try
the tagging system and see how people like it.

-- 
				Petr "Pasky" Baudis
Whatever you can do, or dream you can, begin it.
Boldness has genius, power, and magic in it.	-- J. W. von Goethe


end of thread, other threads:[~2008-03-17 21:10 UTC | newest]

Thread overview: 30+ messages
2008-03-13 23:14 [PATCH] gitweb: Support caching projects list Petr Baudis
2008-03-14  0:07 ` Jay Soffian
2008-03-14  0:22   ` Petr Baudis
2008-03-14  0:27     ` Jay Soffian
2008-03-14  0:30       ` J.H.
2008-03-14 12:17         ` Jakub Narebski
2008-03-14  0:36     ` J.H.
2008-03-17 17:49       ` repo.or.cz renovation Petr Baudis
2008-03-17 18:11         ` Petr Baudis
2008-03-17 18:44         ` J.H.
2008-03-17 20:41           ` Jakub Narebski
2008-03-17 21:09           ` Jakub Narebski
2008-03-14 15:29   ` [PATCH] gitweb: Support caching projects list Jakub Narebski
2008-03-14 21:11     ` Jay Soffian
2008-03-14  0:19 ` Junio C Hamano
2008-03-14  8:35 ` Frank Lichtenheld
2008-03-14 12:14 ` Jakub Narebski
2008-03-17 17:40   ` Petr Baudis
2008-03-15 21:44 ` Jakub Narebski
2008-03-16  0:56   ` Miklos Vajna
2008-03-16 11:41   ` Frank Lichtenheld
2008-03-16 16:52     ` J.H.
2008-03-16 18:37       ` Jakub Narebski
2008-03-16 22:37         ` J.H.
2008-03-16 23:39           ` Jakub Narebski
2008-03-17 18:10   ` repo.or.cz renovated Petr Baudis
2008-03-17 19:09     ` Junio C Hamano
2008-03-17 19:25       ` Petr Baudis
2008-03-17 19:34     ` Theodore Tso
2008-03-17 19:54       ` Petr Baudis
