git.vger.kernel.org archive mirror
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
       [not found]                   ` <45785697.1060001@zytor.com>
@ 2006-12-07 19:05                     ` Linus Torvalds
  2006-12-07 19:16                       ` H. Peter Anvin
  2006-12-08  9:43                       ` Jakub Narebski
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-07 19:05 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Kernel Org Admin, Git Mailing List, Jakub Narebski



On Thu, 7 Dec 2006, H. Peter Anvin wrote:
> 
> That all being said, the lack of intrinsic caching in gitweb continues to be a
> major problem for us.  Under high load, it makes all the problems worse.

I really don't see what gitweb could do that would be somehow better than 
apache doing the caching in front of it... Is there some apache reason why 
that isn't sufficient (i.e., limitations on its cache size or timeouts)?

Maybe the cacheability hints from gitweb could be tweaked (a lot of it 
should be "infinitely cacheable", but the stuff that depends on refs and 
thus can change could be set to some fixed host-wide value - preferably 
one that depends on how old the ref is).

Having gitweb be potentially up to an hour out of date is better than 
causing mirroring problems due to excessive load.

For example, if the git "refs/heads/" (or tags) directory hasn't changed 
in the last two months, we should probably set any ref-relative gitweb 
pages to have a caching timeout of a day or two. In contrast, if it's 
changed in the last hour, maybe we should only cache it for five minutes.

Jakub: any way to make gitweb set the "expires" fields _much_ more 
aggressively? I think we should at least have the ability to set basic 
rules like

 - a _minimum_ of five minutes regardless of anything else

   We might even tweak this based on loadaverage, and it might be 
   worthwhile to add a randomization, to make sure that you don't get into 
   situations where every webpage needs to be recalculated at once.

 - if refs/ directories are old, raise the minimum by the age of the refs

   If it's more than an hour old, raise it to ten minutes. If it's more 
   than a day, raise it to an hour. If it's more than a month old, raise 
   it to a day. And if it's more than half a year, it's some historical 
   archive like linux-history, and should probably default to a week or 
   more.

 - infinite for stuff that isn't ref-related.

Hmm?
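The tiered rules above could be sketched as a small helper; this is a hypothetical Python illustration (not gitweb's Perl), with the function name made up and the thresholds and jitter range taken from the proposal:

```python
import random
import time

def cache_ttl(ref_mtime, now=None):
    """Map the age of the newest ref to a cache timeout in seconds,
    following the tiers proposed above: a five-minute floor, longer
    timeouts for older refs, and a little randomization so that not
    every page expires and gets recomputed at the same moment."""
    now = time.time() if now is None else now
    age = now - ref_mtime
    hour, day = 3600, 86400
    if age > 180 * day:      # more than half a year: historical archive
        ttl = 7 * day
    elif age > 30 * day:     # more than a month old
        ttl = day
    elif age > day:          # more than a day old
        ttl = hour
    elif age > hour:         # more than an hour old
        ttl = 10 * 60
    else:                    # recently active: five-minute minimum
        ttl = 5 * 60
    # +/-10% jitter to avoid synchronized recomputation
    return int(ttl * random.uniform(0.9, 1.1))
```

The jitter is the "randomization" point from the first rule: without it, every cached page of a busy repository would expire on the same tick and be recomputed at once.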


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:05                     ` kernel.org mirroring (Re: [GIT PULL] MMC update) Linus Torvalds
@ 2006-12-07 19:16                       ` H. Peter Anvin
  2006-12-07 19:30                         ` Olivier Galibert
  2006-12-07 19:30                         ` Linus Torvalds
  2006-12-08  9:43                       ` Jakub Narebski
  1 sibling, 2 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-07 19:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
> 
> On Thu, 7 Dec 2006, H. Peter Anvin wrote:
>> That all being said, the lack of intrinsic caching in gitweb continues to be a
>> major problem for us.  Under high load, it makes all the problems worse.
> 
> I really don't see what gitweb could do that would be somehow better than 
> apache doing the caching in front of it.. Is there some apache reason why 
> that isn't sufficient (ie limitations on its cache size or timeouts?)
> 

What it could do better is prevent multiple identical queries 
from being launched in parallel.  That's the real problem we see: under 
high load, Apache times out, so the git query never gets into the cache; 
but in the meantime, the common queries might easily have been launched 
20 times in parallel.  Unfortunately, the most common queries are also 
extremely expensive.
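The dog-pile described here is what "single-flight" request coalescing prevents: the first request for a key does the work, and concurrent duplicates wait for that one result. A minimal sketch of the idea, as a hypothetical in-process Python helper (the class and key names are made up, and error handling is omitted):

```python
import threading

class SingleFlight:
    """Coalesce concurrent identical requests: the first caller runs the
    expensive function; later callers for the same key block and receive
    the same result instead of recomputing it in parallel."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result box)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # Nobody is computing this key yet: become the leader.
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, box = entry
        if leader:
            try:
                box["value"] = fn()
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()
        return box["value"]
```

With a front end like this, twenty simultaneous hits on the same expensive summary page cost one gitweb invocation instead of twenty.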



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:16                       ` H. Peter Anvin
@ 2006-12-07 19:30                         ` Olivier Galibert
  2006-12-07 19:57                           ` H. Peter Anvin
  2006-12-07 19:30                         ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: Olivier Galibert @ 2006-12-07 19:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
> Unfortunately, the most common queries are also extremely expensive.

Do you have a top ten of queries?  Those would be the ones to optimize
for.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:16                       ` H. Peter Anvin
  2006-12-07 19:30                         ` Olivier Galibert
@ 2006-12-07 19:30                         ` Linus Torvalds
  2006-12-07 19:39                           ` Shawn Pearce
  2006-12-07 20:05                           ` Junio C Hamano
  1 sibling, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-07 19:30 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Kernel Org Admin, Git Mailing List, Jakub Narebski



On Thu, 7 Dec 2006, H. Peter Anvin wrote:
> 
> What it could do better is it could prevent multiple identical queries from
> being launched in parallel.  That's the real problem we see; under high load,
> Apache times out so the git query never gets into the cache; but in the
> meantime, the common queries might easily have been launched 20 times in
> parallel.  Unfortunately, the most common queries are also extremely
> expensive.

Ahh. I'd have expected that apache itself had some serialization facility 
that would kind of go hand-in-hand with any caching.

It really would make more sense to have anything that does caching 
serialize the address that gets cached (think "page cache" layer in the 
kernel: the _cache_ is also the serialization point, and is what 
guarantees that we don't do stupid multiple reads to the same address).

I'm surprised that Apache can't do that. Or maybe it can, and it just 
needs some configuration entry? I don't know apache... I realize that 
because Apache doesn't know beforehand whether something is cacheable or 
not, it probably must _default_ to running the CGI scripts for the same 
address in parallel, but it would be stupid not to have the option to 
serialize.

That said, from some of the other horrors I've heard about, "stupid" may 
be just scratching at the surface.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:30                         ` Linus Torvalds
@ 2006-12-07 19:39                           ` Shawn Pearce
  2006-12-07 19:58                             ` Linus Torvalds
  2006-12-07 19:58                             ` H. Peter Anvin
  2006-12-07 20:05                           ` Junio C Hamano
  1 sibling, 2 replies; 82+ messages in thread
From: Shawn Pearce @ 2006-12-07 19:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Linus Torvalds <torvalds@osdl.org> wrote:
> I'm surprised that Apache can't do that. Or maybe it can, and it just 
> needs some configuration entry? I don't know apache.. I realize that 
> because Apache doesn't know before-hand whether something is cacheable or 
> not, it must probably _default_ to running the CGI scripts to the same 
> address in parallel, but it would be stupid to not have the option to 
> serialize.

AFAIK it doesn't have such an option, for basically the reason
you describe.  I worked on a project whose queries were much more
difficult to answer than gitweb's, and which was also very popular.
Yes, the system died under any load, no matter how much money was
thrown at it.  :-)

> That said, from some of the other horrors I've heard about, "stupid" may 
> be just scratching at the surface.

It is.  :-)

-- 


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:30                         ` Olivier Galibert
@ 2006-12-07 19:57                           ` H. Peter Anvin
  2006-12-07 23:50                             ` Olivier Galibert
  2006-12-08 12:57                             ` Rogan Dawes
  0 siblings, 2 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-07 19:57 UTC (permalink / raw)
  To: Olivier Galibert
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Olivier Galibert wrote:
> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>> Unfortunately, the most common queries are also extremely expensive.
> 
> Do you have a top-ten of queries ?  That would be the ones to optimize
> for.

The front page, summary page of each project, and the RSS feed for each 
project.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:39                           ` Shawn Pearce
@ 2006-12-07 19:58                             ` Linus Torvalds
  2006-12-07 23:33                               ` Michael K. Edwards
  2006-12-07 19:58                             ` H. Peter Anvin
  1 sibling, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2006-12-07 19:58 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: H. Peter Anvin, Kernel Org Admin, Git Mailing List,
	Jakub Narebski



On Thu, 7 Dec 2006, Shawn Pearce wrote:
> 
> AFAIK it doesn't have such an option, for basically the reason
> you describe.  I worked on a project which had much more difficult
> to answer queries than gitweb and were also very popular.  Yes,
> the system died under any load, no matter how much money was thrown
> at it.  :-)
> 
> > That said, from some of the other horrors I've heard about, "stupid" may 
> > be just scratching at the surface.
> 
> It is.  :-)

Gaah. That's just stupid. This is such a _basic_ issue for caching ("if 
concurrent requests come in, only handle _one_ and give everybody the same 
result") that I claim that any cache that doesn't handle it isn't a cache 
at all, but a total disaster written by incompetent people.

Sure, you may want to disable it for certain kinds of truly dynamic 
content, but that doesn't mean you shouldn't be able to do it at all.

Does anybody who is web-server clueful know if there is some simple 
front-end (squid?) that is easy to set up and can just act as a caching 
proxy in front of such an incompetent server?

Or maybe there is some competent Apache module, not just the default 
mod_cache (which is what I assume kernel.org uses now)?



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:39                           ` Shawn Pearce
  2006-12-07 19:58                             ` Linus Torvalds
@ 2006-12-07 19:58                             ` H. Peter Anvin
  1 sibling, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-07 19:58 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Shawn Pearce wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
>> I'm surprised that Apache can't do that. Or maybe it can, and it just 
>> needs some configuration entry? I don't know apache.. I realize that 
>> because Apache doesn't know before-hand whether something is cacheable or 
>> not, it must probably _default_ to running the CGI scripts to the same 
>> address in parallel, but it would be stupid to not have the option to 
>> serialize.
> 
> AFAIK it doesn't have such an option, for basically the reason
> you describe.  I worked on a project which had much more difficult
> to answer queries than gitweb and were also very popular.  Yes,
> the system died under any load, no matter how much money was thrown
> at it.  :-)

You certainly can be smarter about it when you know the nature of the 
query, though.  I do that with the patch viewer scripts.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:30                         ` Linus Torvalds
  2006-12-07 19:39                           ` Shawn Pearce
@ 2006-12-07 20:05                           ` Junio C Hamano
  2006-12-07 20:09                             ` H. Peter Anvin
  1 sibling, 1 reply; 82+ messages in thread
From: Junio C Hamano @ 2006-12-07 20:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, Kernel Org Admin, H. Peter Anvin

If I understand correctly, kernel.org is still running the
version of gitweb Kay last installed there (I am too busy to
take over the gitweb installation maintenance at kernel.org, and
I did not ask the $DOCUMENTROOT/git/ directory to be transferred
to me when I rolled gitweb into the git.git repository).

I do not know what queries are most popular, but I think a newer
gitweb is more efficient in the summary page (getting list of
branches and tags).  It might be worth a try.


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 20:05                           ` Junio C Hamano
@ 2006-12-07 20:09                             ` H. Peter Anvin
  2006-12-07 22:11                               ` Junio C Hamano
  0 siblings, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-07 20:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git, Kernel Org Admin

Junio C Hamano wrote:
> If I understand correctly, kernel.org is still running the
> version of gitweb Kay last installed there (I am too busy to
> take over the gitweb installation maintenance at kernel.org, and
> I did not ask the $DOCUMENTROOT/git/ directory to be transferred
> to me when I rolled gitweb into the git.git repository).

That's correct.  I can transfer that directory to you if you want; I 
can't realistically track gitweb well enough to do this myself (in fact, 
it was pretty much a condition of having it up there that Kay would keep 
maintaining it).

> I do not know what queries are most popular, but I think a newer
> gitweb is more efficient in the summary page (getting list of
> branches and tags).  It might be worth a try.

How do you want to handle it?

	-hpa



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 20:09                             ` H. Peter Anvin
@ 2006-12-07 22:11                               ` Junio C Hamano
  0 siblings, 0 replies; 82+ messages in thread
From: Junio C Hamano @ 2006-12-07 22:11 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, git, Kernel Org Admin

"H. Peter Anvin" <hpa@zytor.com> writes:

> Junio C Hamano wrote:
>> If I understand correctly, kernel.org is still running the
>> version of gitweb Kay last installed there (I am too busy to
>> take over the gitweb installation maintenance at kernel.org, and
>> I did not ask the $DOCUMENTROOT/git/ directory to be transferred
>> to me when I rolled gitweb into the git.git repository).
>
> That's correct.  I can transfer that directory to you if you want; I
> can't realistically track gitweb well enough to do this myself...

Well, the reason I haven't asked to is that I don't have
enough time myself, so...


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:58                             ` Linus Torvalds
@ 2006-12-07 23:33                               ` Michael K. Edwards
  0 siblings, 0 replies; 82+ messages in thread
From: Michael K. Edwards @ 2006-12-07 23:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn Pearce, H. Peter Anvin, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

On 12/7/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Does anybody who is web-server clueful know if there is some simple
> front-end (squid?) that is easy to set up and can just act as a caching
> proxy in front of such an incompetent server?

Squid in "transparent reverse proxy" mode isn't a bad choice, although
I don't know offhand whether it queues/clusters concurrent requests
for the same URL in the way you want.  I suggest the "transparent"
deployment (netfilter/netlink integration) because you can slap it in
with no changes to the origin server and yank it out again if you have
a problem.  The challenge is in getting conntrack to scale to a
zillion concurrent sessions, but you could probably find someone in
your crowd who knows something about that.  :-)

Ignore any documentation that talks about httpd_accel_*.  Configuring
transparent mode is a great deal simpler and saner in squid 2.6 than
it used to be; you just add a "transparent" parameter to the http_port
tag.  With or without this tag, you set up what used to be called
"accelerator mode" using some parameters to http_port and cache_peer,
as described in
http://www.squid-cache.org/mail-archive/squid-users/200607/0162.html.
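For the queueing of concurrent identical requests asked about above, the squid 2.6 series also has a `collapsed_forwarding` directive. A rough reverse-proxy (rather than transparent) fragment; the ports and exact directive spellings here are assumptions to check against the squid 2.6 documentation:

```
# squid.conf sketch: squid in front of the Apache/gitweb origin server
# (squid 2.6 era; verify option names against your release's docs)
http_port 80 accel vhost
cache_peer 127.0.0.1 parent 8080 0 no-query originserver
# fold concurrent identical cache misses into a single origin fetch
collapsed_forwarding on
```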

If transparent mode looks like the right thing for kernel.org, you
might be interested in some netfilter hackery to offload part of the
conntrack session lookup load to a front-end box that blocks DDoS and
acts more or less as an L4 switch plus session context cache.  I've
been banging on a proof of concept implementation for a while, and am
currently working on integrating against 2.6.19 by splitting
nf_conntrack into front and back halves that interact via a sort of
Layer 2+ header.  I have no idea yet whether it will have any
scalability benefit on dual-x86_64 class hardware (it was originally
conceived for rigid cache architectures where the random access
patterns of session lookups have drastic cache effects).

Cheers,


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:57                           ` H. Peter Anvin
@ 2006-12-07 23:50                             ` Olivier Galibert
  2006-12-07 23:56                               ` H. Peter Anvin
  2006-12-08 11:25                               ` Jakub Narebski
  2006-12-08 12:57                             ` Rogan Dawes
  1 sibling, 2 replies; 82+ messages in thread
From: Olivier Galibert @ 2006-12-07 23:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

On Thu, Dec 07, 2006 at 11:57:34AM -0800, H. Peter Anvin wrote:
> Olivier Galibert wrote:
> >On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
> >>Unfortunately, the most common queries are also extremely expensive.
> >
> >Do you have a top-ten of queries ?  That would be the ones to optimize
> >for.
> 
> The front page, summary page of each project, and the RSS feed for each 
> project.

Hmmm, maybe you could have the summaries and RSS feeds generated on
push, which could also generate elementary files with the lines of the
front page.  That would reduce these top offenders to static page serving.

  OG.


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 23:50                             ` Olivier Galibert
@ 2006-12-07 23:56                               ` H. Peter Anvin
  2006-12-08 11:25                               ` Jakub Narebski
  1 sibling, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-07 23:56 UTC (permalink / raw)
  To: Olivier Galibert
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Olivier Galibert wrote:
> On Thu, Dec 07, 2006 at 11:57:34AM -0800, H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>> Unfortunately, the most common queries are also extremely expensive.
>>> Do you have a top-ten of queries ?  That would be the ones to optimize
>>> for.
>> The front page, summary page of each project, and the RSS feed for each 
>> project.
> 
> Hmmm, maybe you could have the summaries and rss feed generated on
> push, which could also generate elementary files with lines of the
> front page.  That would make these top offenders static page serving.
> 

There are a lot of things which "could be done" given the proper cache 
infrastructure and gitweb support.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:05                     ` kernel.org mirroring (Re: [GIT PULL] MMC update) Linus Torvalds
  2006-12-07 19:16                       ` H. Peter Anvin
@ 2006-12-08  9:43                       ` Jakub Narebski
  1 sibling, 0 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-08  9:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Kernel Org Admin, Git Mailing List, Petr Baudis

Linus Torvalds wrote:
[...] 
> For example, if the git "refs/heads/" (or tags) directory hasn't changed 
> in the last two months, we should probably set any ref-relative gitweb 
> pages to have a caching timeout of a day or two. In contrast, if it's 
> changed in the last hour, maybe we should only cache it for five minutes.
> 
> Jakub: any way to make gitweb set the "expires" fields _much_ more 
> aggressively. I think we should at least have the ability to set a basic 
> rules like
> 
>  - a _minimum_ of five minutes regardless of anything else
> 
>    We might even tweak this based on loadaverage, and it might be 
>    worthwhile to add a randomization, to make sure that you don't get into 
>    situations where everything webpage needs to be recalculated at once.

I think the minimum expires (or minimum _additional_ expires: as of now
gitweb only does expires +1d for explicit hash requests) should depend on
how often the project changes. How often are there pushes to kernel.org?
 
>  - if refs/ directories are old, raise the minimum by the age of the refs
> 
>    If it's more than an hour old, raise it to ten minutes. If it's more 
>    than a day, raise it to an hour. If it's more than a month old, raise 
>    it to a day. And if it's more than half a year, it's some historical 
>    archive like linux-history, and should probably default to a week or 
>    more.

What about packed refs?

We can certainly raise expires for tags (tag objects), as they should not
usually change.
 
>  - infinite for stuff that isn't ref-related.

As a sha1 never changes, everything that is accessed by explicit 
sha1 (hash), or by explicit sha1 (hash_base) plus pathname (file_name),
should have an effectively infinite expires.


Any caching would need some temporary memory, or temporary disk space.
And perhaps mod_perl-specific caching would be useful here...

P.S. I have added Pasky to Cc:, as he manages the http://repo.or.cz public
git repository hosting (much smaller than kernel.org, and I think under less
load, but also without kernel.org's resources).
-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 23:50                             ` Olivier Galibert
  2006-12-07 23:56                               ` H. Peter Anvin
@ 2006-12-08 11:25                               ` Jakub Narebski
  1 sibling, 0 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-08 11:25 UTC (permalink / raw)
  To: git

Olivier Galibert wrote:

> On Thu, Dec 07, 2006 at 11:57:34AM -0800, H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>>On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>>
>>>>Unfortunately, the most common queries are also extremely expensive.
>>>
>>>Do you have a top-ten of queries ?  That would be the ones to optimize
>>>for.
>> 
>> The front page, summary page of each project, and the RSS feed for each 
>> project.
> 
> Hmmm, maybe you could have the summaries and rss feed generated on
> push, which could also generate elementary files with lines of the
> front page.  That would make these top offenders static page serving.

The "extremely aggressive caching solution" could be as follows: cache
everything, and on push invalidate (remove) the cached pages that a push
can change (the list of projects and the OPML on any push; the summary page
and every page without h=<hash> or hb=<hash>;f=<filename> for a given project).

The most important problem is that kernel.org uses an old gitweb, the last
version before gitweb was incorporated into git (which also significantly
reduced the time needed for the summary, heads and tags pages).
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-07 19:57                           ` H. Peter Anvin
  2006-12-07 23:50                             ` Olivier Galibert
@ 2006-12-08 12:57                             ` Rogan Dawes
  2006-12-08 13:38                               ` Jakub Narebski
  2006-12-08 16:16                               ` H. Peter Anvin
  1 sibling, 2 replies; 82+ messages in thread
From: Rogan Dawes @ 2006-12-08 12:57 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

H. Peter Anvin wrote:
> Olivier Galibert wrote:
>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>> Unfortunately, the most common queries are also extremely expensive.
>>
>> Do you have a top-ten of queries ?  That would be the ones to optimize
>> for.
> 
> The front page, summary page of each project, and the RSS feed for each 
> project.
> 
>     -hpa

How about extending gitweb to check whether a cached version of these 
pages already exists, before recreating them?

e.g. structure the temp dir in such a way that each project has a place 
for cached pages. Then, before performing expensive operations, check to 
see if a file corresponding to the requested page already exists. If it 
does, simply return the contents of the file, otherwise go ahead and 
create the page dynamically, and return it to the user. Do not create 
cached pages in gitweb dynamically.

Then, in a post-update hook, for each of the expensive pages, invoke 
something like:

# delete the cached copy of the file, to force gitweb to recreate it
rm -f "$git_temp/$project/rss"
# get gitweb to recreate the page appropriately
# use a tmp file to prevent gitweb from getting confused
# (the URL must be quoted: an unquoted ';' would end the command early)
wget -O "$git_temp/$project/rss.tmp" \
   "http://kernel.org/gitweb.cgi?p=$project;a=rss"
# move the tmp file into place
mv "$git_temp/$project/rss.tmp" "$git_temp/$project/rss"

This way, we get the exact output returned from the usual gitweb 
invocation, but we can now cache the result, and only update it when 
there is a new commit that would affect the page output.

This would also not affect those who do not wish to use this mechanism. 
If the file does not exist, gitweb.cgi will simply revert to its usual 
behaviour.

Possible complications are the content-type headers, etc., but you could 
use the -s (--save-headers) flag to wget to store the server headers in the 
file as well, and get the necessary headers from the file as you stream it.

i.e. read the headers looking for ones that are "interesting" 
(Content-Type, charset, expires) until you get a blank line, print out 
the interesting headers using $cgi->header(), then just dump the 
remainder of the file to the caller via stdout.
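That header-replay step might look roughly like this; a hypothetical Python sketch of the idea rather than gitweb's actual Perl, with the function name and the set of "interesting" headers purely illustrative:

```python
import sys

# Headers worth replaying to the client; everything else is dropped.
INTERESTING = ("content-type", "expires")

def serve_cached(path, out=sys.stdout):
    """Replay a page saved with `wget --save-headers`: skip the status
    line, keep only the interesting response headers, then dump the
    body after the blank line unchanged."""
    with open(path, "rb") as f:
        _status = f.readline()  # e.g. b"HTTP/1.1 200 OK" -- discarded
        for raw in iter(f.readline, b""):
            line = raw.rstrip(b"\r\n")
            if not line:
                break  # blank line ends the header block
            name = line.split(b":", 1)[0].strip().lower()
            if name.decode("ascii", "replace") in INTERESTING:
                out.write(line.decode("latin-1") + "\r\n")
        out.write("\r\n")
        out.write(f.read().decode("latin-1"))
```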



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 12:57                             ` Rogan Dawes
@ 2006-12-08 13:38                               ` Jakub Narebski
  2006-12-08 14:31                                 ` Rogan Dawes
  2006-12-09  1:28                                 ` Martin Langhoff
  2006-12-08 16:16                               ` H. Peter Anvin
  1 sibling, 2 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-08 13:38 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: H. Peter Anvin, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

On Friday, 8 December 2006 at 13:57, Rogan Dawes wrote:
> H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>> Unfortunately, the most common queries are also extremely expensive.

With newer gitweb, which tries to do the same using fewer git commands,
some of the queries (the summary, heads and tags pages) should be less expensive.

>>> Do you have a top-ten of queries ?  That would be the ones to optimize
>>> for.
>> 
>> The front page, summary page of each project, and the RSS feed for each 
>> project.
> 
> How about extending gitweb to check to see if there already exists a 
> cached version of these pages, before recreating them?
> 
> e.g. structure the temp dir in such a way that each project has a place 
> for cached pages. Then, before performing expensive operations, check to 
> see if a file corresponding to the requested page already exists. If it 
> does, simply return the contents of the file, otherwise go ahead and 
> create the page dynamically, and return it to the user. Do not create 
> cached pages in gitweb dynamically.

This would add the need for a directory for temporary files... well,
it could be made optional...

> Then, in a post-update hook, for each of the expensive pages, invoke 
> something like:
> 
> # delete the cached copy of the file, to force gitweb to recreate it
> rm -f $git_temp/$project/rss
> # get gitweb to recreate the page appropriately
> # use a tmp file to prevent gitweb from getting confused
> wget -O $git_temp/$project/rss.tmp \
>    http://kernel.org/gitweb.cgi?p=$project;a=rss
> # move the tmp file into place
> mv $git_temp/$project/rss.tmp $git_temp/$project/rss

Good idea... although there are some page views which shouldn't change
at all... well, with the possible exception of changes in gitweb output,
and even then there are some (the blob_plain and snapshot views) which
don't change at all.

It would be good to avoid removing them on push, and to remove
them only with some tmpwatch-like cleanup.
 
> This way, we get the exact output returned from the usual gitweb 
> invocation, but we can now cache the result, and only update it when 
> there is a new commit that would affect the page output.
> 
> This would also not affect those who do not wish to use this mechanism. 
> If the file does not exist, gitweb.cgi will simply revert to its usual 
> behaviour.

Good idea. Perhaps I should add it to the gitweb TODO file.

Hmmm... perhaps it is time for the next "[RFC] gitweb wishlist and TODO list"
thread?
 
> Possible complications are the content-type headers, etc, but you could 
> use the -s flag to wget, and store the server headers as well in the 
> file, and get the necessary headers from the file as you stream it.
> 
> i.e. read the headers looking for ones that are "interesting" 
> (Content-Type, charset, expires) until you get a blank line, print out 
> the interesting headers using $cgi->header(), then just dump the 
> remainder of the file to the caller via stdout.

No need for that. $cgi->header() is to _generate_ the headers, so if
a file is saved with headers, we can just dump it to STDOUT; the possible
exception is a need to rewrite the 'expires' header, if it is used.

Perhaps gitweb should generate its own ETag instead of messing with
the 'expires' header?
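The ETag idea could look something like this; a hypothetical Python sketch (gitweb itself has no such functions, and the parameter handling here is made up): a strong validator derived from the query parameters plus the sha1 the page depends on, so an unchanged ref means a 304 instead of a regenerated page.

```python
import hashlib

def gitweb_etag(params, ref_sha1):
    """Build a strong ETag from a page's query parameters and the sha1
    of the ref its content depends on: if neither has changed, the
    rendered page is identical and the client can keep its copy."""
    h = hashlib.sha1()
    for key in sorted(params):
        h.update(("%s=%s;" % (key, params[key])).encode("utf-8"))
    h.update(ref_sha1.encode("ascii"))
    return '"%s"' % h.hexdigest()

def not_modified(request_headers, etag):
    """True when the client's If-None-Match matches, so a 304 Not
    Modified can be sent instead of the full page."""
    return request_headers.get("If-None-Match") == etag
```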
-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 13:38                               ` Jakub Narebski
@ 2006-12-08 14:31                                 ` Rogan Dawes
  2006-12-08 15:38                                   ` Jonas Fonseca
  2006-12-09  1:28                                 ` Martin Langhoff
  1 sibling, 1 reply; 82+ messages in thread
From: Rogan Dawes @ 2006-12-08 14:31 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: H. Peter Anvin, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

Jakub Narebski wrote:
> On Friday, 8 December 2006 at 13:57, Rogan Dawes wrote:

>> How about extending gitweb to check to see if there already exists a 
>> cached version of these pages, before recreating them?
>>
>> e.g. structure the temp dir in such a way that each project has a place 
>> for cached pages. Then, before performing expensive operations, check to 
>> see if a file corresponding to the requested page already exists. If it 
>> does, simply return the contents of the file, otherwise go ahead and 
>> create the page dynamically, and return it to the user. Do not create 
>> cached pages in gitweb dynamically.
> 
> This would add the need for a directory for temporary files... well,
> it could be made optional...
> 
It would still be optional. If the "cache" directory structure exists, 
then use it, otherwise, continue as usual. All it would cost is a stat() 
or two, I guess.
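(Editorial sketch of the optional cache check described above: one stat(), the -f test, decides between streaming a pre-rendered page and falling back to dynamic generation. CACHE_DIR, the page naming, and generate_page are assumptions, not real gitweb variables.)

```shell
# Serve a page from the optional cache directory if it exists;
# otherwise fall through to the usual dynamic code path.
serve_page() {
    project=$1 action=$2
    cached="$CACHE_DIR/$project/$action"
    if [ -f "$cached" ]; then               # one stat(), as suggested
        cat "$cached"                       # cache hit: no git work at all
    else
        generate_page "$project" "$action"  # usual gitweb code path
    fi
}
```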

>> Then, in a post-update hook, for each of the expensive pages, invoke 
>> something like:
>>
>> # delete the cached copy of the file, to force gitweb to recreate it
>> rm -f $git_temp/$project/rss
>> # get gitweb to recreate the page appropriately
>> # use a tmp file to prevent gitweb from getting confused
>> wget -O $git_temp/$project/rss.tmp \
>>    http://kernel.org/gitweb.cgi?p=$project;a=rss
>> # move the tmp file into place
>> mv $git_temp/$project/rss.tmp $git_temp/$project/rss
> 
> Good idea... although there are some page views which shouldn't change
> at all... well, with the possible exception of changes in gitweb output
> itself, and even then there are some (blob_plain and snapshot views) which
> don't change at all.
> 
> It would be good to avoid removing them on push, and only expire
> them with some tmpwatch-like cleanup.

Well, my theory was that we would only cache pages that change when new 
data enters the repo. So, using the push as the trigger is almost 
guaranteed to be the right thing to do. New data means new RSS items, 
which means an updated shortlog page, etc.

NOTE: This caching could be problematic for the "changed 2 hours ago" 
notation for various branches/files, etc. But however we implement the 
caching, we'd have this problem.

>> This way, we get the exact output returned from the usual gitweb 
>> invocation, but we can now cache the result, and only update it when 
>> there is a new commit that would affect the page output.
>>
>> This would also not affect those who do not wish to use this mechanism. 
>> If the file does not exist, gitweb.cgi will simply revert to its usual 
>> behaviour.
> 
> Good idea. Perhaps I should add it to the gitweb TODO file.
> 
> Hmmm... perhaps it is time for the next "[RFC] gitweb wishlist and TODO list"
> thread?
>  
>> Possible complications are the content-type headers, etc, but you could 
>> use the -s flag to wget, and store the server headers as well in the 
>> file, and get the necessary headers from the file as you stream it.
>>
>> i.e. read the headers looking for ones that are "interesting" 
>> (Content-Type, charset, expires) until you get a blank line, print out 
>> the interesting headers using $cgi->header(), then just dump the 
>> remainder of the file to the caller via stdout.
> 
> No need for that. $cgi->header() is there to _generate_ the headers, so if
> a file is saved with its headers, we can just dump it to STDOUT; the possible
> exception is a need to rewrite the 'expires' header, if it is used.

Good point. I guess one thing that will be incorrect in the headers is 
the server date, but I doubt that anyone cares much. As you say, though, 
this might relate to the expiry of cached content in upstream caches.

> 
> Perhaps gitweb should generate its own ETag instead of messing with
> the 'expires' header?

Well, we can possibly eliminate the expires header entirely for dynamic 
pages, and check the If-Modified-Since value against the timestamp of 
the cached file, or the server date in the cached file, and return "304 
Not Modified" responses. That would also help to reduce the load on the 
server, by only returning the headers, and not the entire response.

The downside is that it would prevent upstream proxies from caching this 
data for us.
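(Editorial sketch of the 304 idea above. For simplicity the client's If-Modified-Since is taken as seconds-since-epoch; real HTTP uses an RFC 1123 date string that would need parsing. The helper name and the use of the cached file's mtime are assumptions.)

```shell
# Answer a conditional request from the cache: if the cached file has
# not changed since the client's copy, send only headers, not the body.
serve_conditional() {
    cached=$1
    since=${HTTP_IF_MODIFIED_SINCE:-0}
    mtime=$(stat -c %Y "$cached")          # GNU stat: mtime as epoch seconds
    if [ "$mtime" -le "$since" ]; then
        printf 'Status: 304 Not Modified\r\n\r\n'   # headers only
    else
        cat "$cached"                               # full response
    fi
}
```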

Regards,


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 14:31                                 ` Rogan Dawes
@ 2006-12-08 15:38                                   ` Jonas Fonseca
  0 siblings, 0 replies; 82+ messages in thread
From: Jonas Fonseca @ 2006-12-08 15:38 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Jakub Narebski, H. Peter Anvin, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

On 12/8/06, Rogan Dawes <discard@dawes.za.net> wrote:
> NOTE: This caching could be problematic for the "changed 2 hours ago"
> notation for various branches/files, etc. But however we implement the
> caching, we'd have this problem.

It could be solved using ECMAScript (if that is an option): include an exact
time stamp that browsers without ECMAScript support can show as-is,
while other browsers rewrite the time stamp to make it relative
and do the coloring/highlighting of recent activity. This could also slightly
speed up the script, and it might be better to provide an exact time stamp
by default anyway if aggressive caching is applied.

-- 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 12:57                             ` Rogan Dawes
  2006-12-08 13:38                               ` Jakub Narebski
@ 2006-12-08 16:16                               ` H. Peter Anvin
  2006-12-08 16:35                                 ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-08 16:16 UTC (permalink / raw)
  To: Rogan Dawes
  Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Rogan Dawes wrote:
> 
> How about extending gitweb to check to see if there already exists a 
> cached version of these pages, before recreating them?
> 

This goes back to the "gitweb needs native caching" again.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:16                               ` H. Peter Anvin
@ 2006-12-08 16:35                                 ` Linus Torvalds
  2006-12-08 16:42                                   ` H. Peter Anvin
  2006-12-08 16:54                                   ` Jeff Garzik
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-08 16:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski



On Fri, 8 Dec 2006, H. Peter Anvin wrote:
> 
> This goes back to the "gitweb needs native caching" again.

It should be fairly easy to add a caching layer, but I wouldn't do it 
inside gitweb itself - it gets too mixed up. It would be better to have 
it as a separate front-end, that just calls gitweb for anything it doesn't 
find in the cache.

I could write a simple C caching thing that just hashes the CGI arguments 
and uses a hash to create a cache (and proper lock-files etc to serialize 
access to a particular cache object while it's being created) fairly 
easily, but I'm pretty sure people would much prefer a mod_perl thing just 
to avoid the fork/exec overhead with Apache (I think mod_perl allows 
Apache to run perl scripts without it), and that means I'm not the right 
person any more.

Not that I'm the right person anyway, since I don't have a web server set 
up on my machine to even test with ;)
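(Editorial sketch of the front-end described above: hash the CGI argument string into a cache object name, and serialise the first miss per object with a lock file so only one process runs gitweb for a given query. "generate" stands in for the real gitweb invocation; all paths are assumptions, and flock(1) from util-linux is used in place of hand-rolled lock files.)

```shell
# Cache front-end keyed on a hash of the CGI arguments.
cache_serve() {
    key=$(printf '%s' "$QUERY_STRING" | sha1sum | cut -d' ' -f1)
    obj="$CACHE_DIR/$key"
    if [ ! -f "$obj" ]; then
        (
            flock 9                          # the herd queues here
            [ -f "$obj" ] && exit 0          # someone else already filled it
            generate > "$obj.tmp" && mv "$obj.tmp" "$obj"   # atomic fill
        ) 9> "$obj.lock"
    fi
    cat "$obj"
}
```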
	
		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:35                                 ` Linus Torvalds
@ 2006-12-08 16:42                                   ` H. Peter Anvin
  2006-12-08 19:49                                     ` Lars Hjemli
  2006-12-10  9:43                                     ` rda
  2006-12-08 16:54                                   ` Jeff Garzik
  1 sibling, 2 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-08 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
> 
> On Fri, 8 Dec 2006, H. Peter Anvin wrote:
>> This goes back to the "gitweb needs native caching" again.
> 
> It should be fairly easy to add a caching layer, but I wouldn't do it 
> inside gitweb itself - it gets too mixed up. It would be better to have 
> it as a separate front-end, that just calls gitweb for anything it doesn't 
> find in the cache.
> 

If you want to do side-effect generation of cache contents, it might not 
be possible to do it that way.  At the very least, gitweb needs to be 
aware of how to explicitly enter things into the cache.

All of this isn't really all that hard; I have implemented all that 
stuff for diffview, for example (when generating a single diff hunk, you 
naturally end up producing all of them, so you want to have them 
preemptively cached).

> I could write a simple C caching thing that just hashes the CGI arguments 
> and uses a hash to create a cache (and proper lock-files etc to serialize 
> access to a particular cache object while it's being created) fairly 
> easily, but I'm pretty sure people would much prefer a mod_perl thing just 
> to avoid the fork/exec overhead with Apache (I think mod_perl allows 
> Apache to run perl scripts without it), and that means I'm not the right 
> person any more.

True about mod_perl.  Haven't messed with that myself, either. 
fork/exec really is very cheap on Linux, so it's not a huge deal.

> Not that I'm the right person anyway, since I don't have a web server set 
> up on my machine to even test with ;)

Heh :)

	-hpa

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:35                                 ` Linus Torvalds
  2006-12-08 16:42                                   ` H. Peter Anvin
@ 2006-12-08 16:54                                   ` Jeff Garzik
  2006-12-08 17:04                                     ` H. Peter Anvin
  2006-12-08 23:27                                     ` Linus Torvalds
  1 sibling, 2 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-08 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Linus Torvalds wrote:
> I could write a simple C caching thing that just hashes the CGI arguments 
> and uses a hash to create a cache (and proper lock-files etc to serialize 
> access to a particular cache object while it's being created) fairly 
> easily, but I'm pretty sure people would much prefer a mod_perl thing just 
> to avoid the fork/exec overhead with Apache (I think mod_perl allows 
> Apache to run perl scripts without it), and that means I'm not the right 
> person any more.
> 
> Not that I'm the right person anyway, since I don't have a web server set 
> up on my machine to even test with ;)
> 	
> 		Linus
> 

This is quite nice and easy, if memory-only caching works for the 
situation:  http://www.danga.com/memcached/

There are APIs for C, Perl, and plenty of other languages.

	Jeff


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:54                                   ` Jeff Garzik
@ 2006-12-08 17:04                                     ` H. Peter Anvin
  2006-12-08 17:40                                       ` Jeff Garzik
  2006-12-08 23:27                                     ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-08 17:04 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Jeff Garzik wrote:
> 
> This is quite nice and easy, if memory-only caching works for the 
> situation:  http://www.danga.com/memcached/
> 
> There are APIs for C, Perl, and plenty of other languages.
> 

Memory-only caching is kind of nasty.  Memory is a premium resource on 
kernel.org.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 17:04                                     ` H. Peter Anvin
@ 2006-12-08 17:40                                       ` Jeff Garzik
  0 siblings, 0 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-08 17:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

H. Peter Anvin wrote:
> Jeff Garzik wrote:
>>
>> This is quite nice and easy, if memory-only caching works for the 
>> situation:  http://www.danga.com/memcached/
>>
>> There are APIs for C, Perl, and plenty of other languages.
>>
> 
> Memory-only caching is kind of nasty.  Memory is a premium resource on 
> kernel.org.

Hmmm.  Well, I have been wondering why nobody ever came up with a 
system-wide local (==disk) cache for remote and/or calculated objects. 
Maybe it's time to do something about that.

I've been in a daemon-writing mood lately.

	Jeff



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:42                                   ` H. Peter Anvin
@ 2006-12-08 19:49                                     ` Lars Hjemli
  2006-12-08 19:51                                       ` H. Peter Anvin
  2006-12-10  9:43                                     ` rda
  1 sibling, 1 reply; 82+ messages in thread
From: Lars Hjemli @ 2006-12-08 19:49 UTC (permalink / raw)
  To: H. Peter Anvin, Linus Torvalds; +Cc: Git Mailing List

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Linus Torvalds wrote:
> > I could write a simple C caching thing that just hashes the CGI arguments
> > and uses a hash to create a cache (and proper lock-files etc to serialize
> > access to a particular cache object while it's being created) fairly
> > easily, but I'm pretty sure people would much prefer a mod_perl thing just
> > to avoid the fork/exec overhead with Apache (I think mod_perl allows
> > Apache to run perl scripts without it), and that means I'm not the right
> > person any more.
>
> True about mod_perl.  Haven't messed with that myself, either.
> fork/exec really is very cheap on Linux, so it's not a huge deal.

I've been playing around with a "native git" cgi thingy the last week
(I call it cgit),  and I've been thinking about adding exactly this
kind of caching to it. And since it's basically a standard git command
written in C, it should have less overhead than any perl
implementation.

It's far from ready yet, but I'll try to publish some code this
weekend just in case someone finds it interesting.

-- 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 19:49                                     ` Lars Hjemli
@ 2006-12-08 19:51                                       ` H. Peter Anvin
  2006-12-08 19:59                                         ` Lars Hjemli
  0 siblings, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-08 19:51 UTC (permalink / raw)
  To: Lars Hjemli; +Cc: Linus Torvalds, Git Mailing List

Lars Hjemli wrote:
> On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> Linus Torvalds wrote:
>> > I could write a simple C caching thing that just hashes the CGI 
>> arguments
>> > and uses a hash to create a cache (and proper lock-files etc to 
>> serialize
>> > access to a particular cache object while it's being created) fairly
>> > easily, but I'm pretty sure people would much prefer a mod_perl 
>> thing just
>> > to avoid the fork/exec overhead with Apache (I think mod_perl allows
>> > Apache to run perl scripts without it), and that means I'm not the 
>> right
>> > person any more.
>>
>> True about mod_perl.  Haven't messed with that myself, either.
>> fork/exec really is very cheap on Linux, so it's not a huge deal.
> 
> I've been playing around with a "native git" cgi thingy the last week
> (I call it cgit),  and I've been thinking about adding exactly this
> kind of caching to it. And since it's basically a standard git command
> written in C, it should have less overhead than any perl
> implementation.
> 

Trust me, perl, or CGI, is not the problem.  It's all about I/O traffic 
generated by git.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 19:51                                       ` H. Peter Anvin
@ 2006-12-08 19:59                                         ` Lars Hjemli
  2006-12-08 20:02                                           ` H. Peter Anvin
  0 siblings, 1 reply; 82+ messages in thread
From: Lars Hjemli @ 2006-12-08 19:59 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, Git Mailing List

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Trust me, perl, or CGI, is not the problem.  It's all about I/O traffic
> generated by git.

Yes, I understand. That's why I've been thinking about internal
caching of pages.

It's just a kick doing it in C, playing around with the git internals :-)

-- 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 19:59                                         ` Lars Hjemli
@ 2006-12-08 20:02                                           ` H. Peter Anvin
  0 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-08 20:02 UTC (permalink / raw)
  To: Lars Hjemli; +Cc: Linus Torvalds, Git Mailing List

Lars Hjemli wrote:
> On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> Trust me, perl, or CGI, is not the problem.  It's all about I/O traffic
>> generated by git.
> 
> Yes, I understand. That's why I've been thinking about internal
> caching of pages.

Caching, preferably with smarts, is the key.

> It's just a kick doing it in C, playing around with the git internals :-)

That's fine, but it does make it harder to maintain.

	-hpa


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:54                                   ` Jeff Garzik
  2006-12-08 17:04                                     ` H. Peter Anvin
@ 2006-12-08 23:27                                     ` Linus Torvalds
  2006-12-08 23:46                                       ` Michael K. Edwards
                                                         ` (3 more replies)
  1 sibling, 4 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-08 23:27 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski



On Fri, 8 Dec 2006, Jeff Garzik wrote:
> 
> This is quite nice and easy, if memory-only caching works for the situation:
> http://www.danga.com/memcached/
> 
> There are APIs for C, Perl, and plenty of other languages.

Actually, just looking at the examples, it looks like memcached is 
fundamentally flawed, exactly the same way Apache mod_cache is 
fundamentally flawed.

Exactly like mod_cache, it appears that if something isn't cached, the 
memcached server will just return "not cached" to everybody, and all the 
clients will, like a stampeding herd, all do the uncached access. Even if 
they have the exact same query. And you're back to square one: your server 
load went through the roof.

You can't have a cache architecture where the client just does a "get", 
like memcached does. You need to have a "read-for-fill" operation, which 
says:

 - get this cache entry

 - if this cache entry does not exist, get an exclusive lock

 - if you get that exclusive lock, return NULL, and the client promises 
   that it will fill it (inside the kernel, see for example 
   "find_get_page()" vs "grab_cache_page()" - the latter will return a 
   locked page whether it exists or not, and if it didn't exist, it will 
   have inserted it into the cache data structures so that you don't have 
   multiple concurrent readers trying to all create different pages)

 - if you block on the exclusive lock, that means that some other client 
   is busy fulfilling it. When you unblock, do a regular "read" operation 
   (not a "repeat": we only block once, and if that fails, that's it).

 - any cachefill operation will release the lock (and allow pending 
   cache queries to succeed)

 - the locking client going away will release the lock (and allow pending 
   cache queries to fail, and hopefully cause a "set cache" operation)

 - a timeout (settable by some method) will also force-release a lock in 
   the case of buggy clients that do "read-for-fill" but never do the 
   "fill".

The "timeout" thing is to handle the case of buggy clients that crash 
after trying to get - it will slow down things _enormously_ if that 
happens, but hey, it's a buggy client. And it will still continue to work.
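(Editorial sketch: the discipline above, approximated with a per-key lock file and flock(1). The first client to miss takes the lock and promises to fill; blocked clients wait, bounded by a timeout to cover crashed fillers, and then do one regular read, with no retry loop. "generate" and the paths are illustrative assumptions.)

```shell
# "read-for-fill": collapse concurrent misses for one key into a
# single cache fill.
read_for_fill() {
    obj=$1
    [ -f "$obj" ] && { cat "$obj"; return 0; }   # plain "get" hit
    (
        # Wait at most 30s, so a filler that died can't block us forever.
        flock -w 30 9 || exit 1
        if [ ! -f "$obj" ]; then
            # We won the lock on a miss: we are the filler.
            generate > "$obj.tmp" && mv "$obj.tmp" "$obj"
        fi
        # Either we filled it or the previous holder did; read exactly once.
        cat "$obj"
    ) 9> "$obj.lock"
}
```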

Looking at the memcached operations, they have the "read" op (aka "get"), 
but they seem to have no "read-for-fill" op. So memcached fundamentally 
doesn't fix this problem, at least without explicit serialization by the 
client.

(The serialization could be done by the client, but that would serialize 
_everything_, and mean that an uncached lookup will hold up all the cached 
ones too - which is why you do NOT want to serialize in the caller: you 
really want to serialize in the layer that does the caching).

It's fairly easy to do the lock. You could just hash the lookup key using 
some reasonable hash. It doesn't even have to be a _big_ hash: it's ok to 
have just a few bits for lock hashing, since it's only going to be for 
misses.

So hashing to eight bits and using 256 locks is probably fine, as long as 
this is done by the cache server. That means that the cache server only 
ever needs to track that many timeouts, for example (it also indirectly 
sets a limit on the number of possible "outstanding uncached requests", 
which is _exactly_ what you want - but hash collisions will also 
potentially unlock the _wrong_ bucket, so if you have too many of them, it 
can make the "only one outstanding uncached request per key" not be as 
effective).

So assuming you get good cache hit statistics, the locking shouldn't be a 
big issue. But you definitely want to do it, because the whole point of 
caching was to not do the same op multiple times. 

I still don't understand why apache doesn't do it. I guess it wants to be 
stateless or something.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 23:27                                     ` Linus Torvalds
@ 2006-12-08 23:46                                       ` Michael K. Edwards
  2006-12-08 23:49                                         ` H. Peter Anvin
  2006-12-09  0:49                                         ` Linus Torvalds
       [not found]                                       ` <4579FABC.5070509@garzik.org>
                                                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 82+ messages in thread
From: Michael K. Edwards @ 2006-12-08 23:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

On 12/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
> You can't have a cache architecture where the client just does a "get",
> like memcached does. You need to have a "read-for-fill" operation ...

In Squid 2.6:
    collapsed_forwarding on
    refresh_stale_window <seconds>
(apply the latter only to stanzas where you want "readahead" of
about-to-expire cache entries)

Brief design description at http://devel.squid-cache.org/collapsed_forwarding/.
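(Editorial note: a hypothetical squid 2.6 accelerator fragment combining these directives might look like the following. Only the collapsed_forwarding and refresh_stale_window lines come from this thread; the http_port/cache_peer accelerator setup, port numbers, and the 30-second window are assumptions.)

```
http_port 80 accel defaultsite=git.kernel.org
cache_peer 127.0.0.1 parent 8080 0 no-query originserver
collapsed_forwarding on
refresh_stale_window 30
```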

(I didn't write this code, everything I know about squid leaked
through the Google-shaped pinhole in my tinfoil hat, etc.  But if you
go this way I'd like to be in the loop to understand the scalability
issues around netfilter-assisted transparent proxying.)

Cheers,

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 23:46                                       ` Michael K. Edwards
@ 2006-12-08 23:49                                         ` H. Peter Anvin
  2006-12-09  0:18                                           ` Michael K. Edwards
  2006-12-09  0:49                                         ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-08 23:49 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Linus Torvalds, Jeff Garzik, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

Michael K. Edwards wrote:
> On 12/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> You can't have a cache architecture where the client just does a "get",
>> like memcached does. You need to have a "read-for-fill" operation ...
> 
> In Squid 2.6:
>    collapsed_forwarding on
>    refresh_stale_window <seconds>
> (apply the latter only to stanzas where you want "readahead" of
> about-to-expire cache entries)
> 
> Brief design description at 
> http://devel.squid-cache.org/collapsed_forwarding/.
> 
> (I didn't write this code, everything I know about squid leaked
> through the Google-shaped pinhole in my tinfoil hat, etc.  But if you
> go this way I'd like to be in the loop to understand the scalability
> issues around netfilter-assisted transparent proxying.)
> 

There is another thing that probably will be required, and I'm not sure 
if something in front of Apache (like Squid), rather than behind it, can 
easily deal with it: on timeout, the process needs to continue in order to 
feed the cache.  Otherwise, you're still in a failure scenario as soon 
as the timeout happens.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 23:49                                         ` H. Peter Anvin
@ 2006-12-09  0:18                                           ` Michael K. Edwards
  2006-12-09  0:23                                             ` H. Peter Anvin
  0 siblings, 1 reply; 82+ messages in thread
From: Michael K. Edwards @ 2006-12-09  0:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Jeff Garzik, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> There is another thing that probably will be required, and I'm not sure
> if something in front of Apache (like Squid) rather than behind it can
> easily deal with: on timeout, the process needs to continue in order to
> feed the cache.  Otherwise, you're still in a failure scenario as soon
> as timeout happens.

I would think this would be a great deal easier to handle in an
arm's-length "accelerator" than in the origin server.  Only restart
the hit to the origin server if you think that something has actually
gone wrong there.  Serve stale data to the client if you have to.
From the page I quoted:

"In addition an option to shortcut the cache revalidation of
frequently accessed objects is added, making further requests
immediately return as a cache hit while a cache revalidation is
pending. This may temporarily give slightly stale information to the
clients, but at the same time allows for optimal response time while a
frequently accessed object is being revalidated. This too is an
optimization only intended for accelerators, and only for accelerators
where minimizing request latency is more important than freshness."

I don't know how sophisticated this logic is currently, but I would
think that it wouldn't be that hard to tune up.

Cheers,

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  0:18                                           ` Michael K. Edwards
@ 2006-12-09  0:23                                             ` H. Peter Anvin
  0 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09  0:23 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Linus Torvalds, Jeff Garzik, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

Michael K. Edwards wrote:
> On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> There is another thing that probably will be required, and I'm not sure
>> if something in front of Apache (like Squid) rather than behind it can
>> easily deal with: on timeout, the process needs to continue in order to
>> feed the cache.  Otherwise, you're still in a failure scenario as soon
>> as timeout happens.
> 
> I would think this would be a great deal easier to handle in an
> arm's-length "accelerator" than in the origin server

True, but it needs to run behind Apache rather than in front of it.

	-hpa

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
       [not found]                                       ` <4579FABC.5070509@garzik.org>
@ 2006-12-09  0:45                                         ` Linus Torvalds
  2006-12-09  0:47                                           ` H. Peter Anvin
  2006-12-09  9:16                                           ` Jeff Garzik
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-09  0:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski



On Fri, 8 Dec 2006, Jeff Garzik wrote:
>
> This is a bit cheesy, and completely untested, but since mod_cache never
> worked for me either, I bet it works better ;-)

Ok, this doesn't do the locking either, so on cache misses or expiry, 
you're still going to get that thundering herd.

Also, if you want to be nice to clients, I'd seriously suggest that when 
you hit in the cache, but it's expired (or it's close to expired), you 
still serve the cached data back, but you set up a thread in the 
background (with some maximum number of active threads, of course!) that 
refreshes the cached entry and then you extend the expiration time so that 
you won't end up doing this "refresh" _again_.

It's kind of silly to have people wait for 20 seconds just because a cache 
expired five seconds ago. Much nicer to say "ok, we allow a certain 
grace-period during which we'll do the real lookup, but to make things 
_look_ really responsive, we still use the old cached value".
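(Editorial sketch of the grace-period refresh described above: an expired entry is still served immediately while a background job regenerates it; the atomic mv also bumps the file's mtime, i.e. extends the expiration, so the refresh is not retriggered. TTL-via-mtime and "generate" are assumptions, and a real version would cap the number of concurrent refreshers.)

```shell
# Serve possibly-stale cache entries instantly; refresh in the background.
serve_stale_ok() {
    obj=$1 ttl=${2:-300}
    if [ ! -f "$obj" ]; then
        # Cold miss: no choice but to make this client wait.
        generate > "$obj.tmp" && mv "$obj.tmp" "$obj"
    elif [ $(( $(date +%s) - $(stat -c %Y "$obj") )) -gt "$ttl" ]; then
        # Expired: refresh behind the client's back.
        ( generate > "$obj.tmp" && mv "$obj.tmp" "$obj" ) &
    fi
    cat "$obj"   # stale-but-instant beats fresh-but-20-seconds-late
}
```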


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  0:45                                         ` Linus Torvalds
@ 2006-12-09  0:47                                           ` H. Peter Anvin
  2006-12-09  9:16                                           ` Jeff Garzik
  1 sibling, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09  0:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Linus Torvalds wrote:
> 
> It's kind of silly to have people wait for 20 seconds just because a cache 
> expired five seconds ago. Much nicer to say "ok, we allow a certain 
> grace-period during which we'll do the real lookup, but to make things 
> _look_ really responsive, we still use the old cached value".
> 

Yup, DNS does this, and it's a Very Good Thing.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 23:46                                       ` Michael K. Edwards
  2006-12-08 23:49                                         ` H. Peter Anvin
@ 2006-12-09  0:49                                         ` Linus Torvalds
  2006-12-09  0:51                                           ` H. Peter Anvin
                                                             ` (2 more replies)
  1 sibling, 3 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-09  0:49 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski



On Fri, 8 Dec 2006, Michael K. Edwards wrote:
> 
> In Squid 2.6:
>    collapsed_forwarding on
>    refresh_stale_window <seconds>
> (apply the latter only to stanzas where you want "readahead" of
> about-to-expire cache entries)

Yeah, those look like the Right Thing (tm) to do.

That said, I'm not personally convinced that there is much point to using 
netfilter for transparent proxying. Why not just use separate ports for 
squid and for apache?



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  0:49                                         ` Linus Torvalds
@ 2006-12-09  0:51                                           ` H. Peter Anvin
  2006-12-09  4:36                                           ` Michael K. Edwards
  2006-12-09  9:27                                           ` Jeff Garzik
  2 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09  0:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael K. Edwards, Jeff Garzik, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
> 
> On Fri, 8 Dec 2006, Michael K. Edwards wrote:
>> In Squid 2.6:
>>    collapsed_forwarding on
>>    refresh_stale_window <seconds>
>> (apply the latter only to stanzas where you want "readahead" of
>> about-to-expire cache entries)
> 
> Yeah, those look like the Right Thing (tm) to do.
> 
> That said, I'm not personally convinced that there is much point to using 
> netfilter for transparent proxying. Why not just use separate ports for 
> squid and for apache?
> 

Yeah, this is pretty trivial since one can just do redirects.  However, 
I still think a backend cache is better, since it can detach itself from 
Apache when appropriate (e.g. the background refresh scenario, or a timeout).



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 13:38                               ` Jakub Narebski
  2006-12-08 14:31                                 ` Rogan Dawes
@ 2006-12-09  1:28                                 ` Martin Langhoff
  2006-12-09  2:03                                   ` H. Peter Anvin
  1 sibling, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-09  1:28 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Rogan Dawes, H. Peter Anvin, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

On 12/9/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Perhaps gitweb should generate it's own ETag instead of messing with
> 'expires' header?

That'll be the winning solution. A combination of

 - cache SHA1-based requests forever
 - cache ref-based requests a longish time,  setting an ETag that
contains headname+SHA1
 - on 'revalidate', check the ETag vs the ref and only recompute if
things have changed

In the meantime, the code on kernel.org needs to be updated to the
latest gitweb. On our server, I'd say the newer gitweb is 3-4 times
faster at serving the "expensive" summary pages. And much smarter in
terms of caching headers.

cheers




* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 23:27                                     ` Linus Torvalds
  2006-12-08 23:46                                       ` Michael K. Edwards
       [not found]                                       ` <4579FABC.5070509@garzik.org>
@ 2006-12-09  1:56                                       ` Martin Langhoff
  2006-12-09 11:51                                         ` Jakub Narebski
  2006-12-09  7:56                                       ` Steven Grimm
  3 siblings, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-09  1:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

On 12/9/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Actually, just looking at the examples, it looks like memcached is
> fundamentally flawed, exactly the same way Apache mod_cache is
> fundamentally flawed.

I don't know about fundamentally flawed, but (having used memcached) I
don't think it's a big win for this at all.

We can make gitweb detect mod_perl and do a few smarter things if it
is running inside of it. In fact, we can (ab)use mod_perl and perl
facilities a bit to do some serialization which will be a big win for
some pages. What we need for that is to set a sensible ETag and
use some IPC to announce/check if other apache/modperl processes are
preparing content for the same ETag. The first process to announce a
given ETag can then write it to a common temp directory (atomically -
write to a temp-name and move to the expected name) while other
processes wait, polling for the file. Once the file is in place the
latecomers can just serve the content of the file and exit.
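A minimal sketch of that first-process-wins scheme (illustrative only: the function names, directory layout, and polling interval are invented for this example; it is not gitweb code):

```python
import os
import time

def serve_cached(cache_dir, etag, generate, poll_interval=0.05, timeout=30):
    """First process to claim an ETag generates the page into a temp
    file and renames it into place atomically; latecomers poll for it."""
    final = os.path.join(cache_dir, etag)
    claim = final + ".workingon"

    if os.path.exists(final):
        with open(final) as f:
            return f.read()
    try:
        # O_CREAT|O_EXCL makes claiming the ETag atomic: exactly one
        # process wins; everyone else gets FileExistsError.
        fd = os.open(claim, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        # Another process is generating this page: poll for the result.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if os.path.exists(final):
                with open(final) as f:
                    return f.read()
            time.sleep(poll_interval)
        raise TimeoutError("generator died or is too slow: %s" % etag)

    try:
        tmp = final + ".tmp.%d" % os.getpid()
        with open(tmp, "w") as f:
            f.write(generate())        # the expensive part, done once
        os.rename(tmp, final)          # atomic within one filesystem
        with open(final) as f:
            return f.read()
    finally:
        os.remove(claim)
```

The nasty bit is indeed the polling by the latecomers; everything else is just the usual write-then-rename idiom.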

(I am calling the "state we are serving" identifier ETag because I
think we should also set it as the ETag in the HTTP headers, so we'll
be able to check the ETag of future requests for staleness - all we
need is a ref lookup, and if the SHA1 matches, we are sorted). So
having this 'unique request identifier' doubles up nicely...

The ETag should probably be:
 - SHA1+displaytype+args for pages that display an object identified by SHA1
 - refname+SHA1+displaytype+args for pages that display something
identified by a ref
 - SHA1(names and sha1s of all refs) for the summary page

> You can't have a cache architecture where the client just does a "get",
> like memcached does. You need to have a "read-for-fill" operation, which
> says:

You _could_ make do with a convention of polling for "entryname" and
"workingon-entryname" and if "workingon-entryname" is set to 1, you
can expect entryname to be filled real soon now. However, memcached is
completely memory-bound, so it is only nice for really small stuff or
for a large server farm which has gobs of spare RAM.

(Note that memcached does have timeouts which means that the
'workingon' value could have a short timeout in case the request is
cancelled or the process dies - the nasty bit in the above plan would
be the polling.)

> I still don't understand why apache doesn't do it. I guess it wants to be
> stateless or something.

Apache doesn't do it because most web applications don't use the HTTP
protocol correctly - especially when it comes to the idempotency of GET.
So in 99% of the cases, web apps serve truly different pages for the
same GET request, depending on your cookie, IP address, time-of-day,
etc.

Most websites deal with very little traffic, so this isn't a problem.
And many large sites that serve a lot of traffic from a dynamic web
app want to be serving custom ads, let you log in and see your
personalised toolbar, etc., so this wouldn't work for them either.

So in practice, serialising speculatively on GET requests for the same
URL has very little payoff except for static content. And that's quite
fast anyway... especially if the underlying OS is smokin' fast ;-)

cheers,





* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  1:28                                 ` Martin Langhoff
@ 2006-12-09  2:03                                   ` H. Peter Anvin
  2006-12-09  2:52                                     ` Martin Langhoff
  0 siblings, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09  2:03 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

Martin Langhoff wrote:
> On 12/9/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> Perhaps gitweb should generate it's own ETag instead of messing with
>> 'expires' header?
> 
> That'll be the winning solution.

Doesn't solve the thundering herd problem or the timeout problem at all, 
though.

	-hpa



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  2:03                                   ` H. Peter Anvin
@ 2006-12-09  2:52                                     ` Martin Langhoff
  2006-12-09  5:09                                       ` H. Peter Anvin
  0 siblings, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-09  2:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

On 12/9/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Martin Langhoff wrote:
> > On 12/9/06, Jakub Narebski <jnareb@gmail.com> wrote:
> >> Perhaps gitweb should generate it's own ETag instead of messing with
> >> 'expires' header?
> >
> > That'll be the winning solution.
>
> Doesn't solve the thundering herd problem or the timeout problem at all,
> though.

I posted separately about those. And I've been mulling over whether
the thundering herd is really such a big problem that we need to
address it head-on. If we do HTTP caching headers right (that is, a
bit better than now) then the fact that web caches are distributed
means that even a cache restart or cache invalidation won't trigger a
thundering herd.

And gitweb rarely has a "new" URL that gets a ton of hits immediately.
Our real problem is the summary page, and the fact that we aren't
setting an effective ETag there. If we do, a front-end cache plus the
ability to revalidate the ETag cheaply will get us through.

We get 99% of the benefit from ETags and cheap revalidations,
especially if they are coupled with a reverse caching proxy. The
remaining 1% - dealing with the highly infrequent thundering herd -
can be addressed with the scheme I posted 5 minutes ago.

cheers




* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  0:49                                         ` Linus Torvalds
  2006-12-09  0:51                                           ` H. Peter Anvin
@ 2006-12-09  4:36                                           ` Michael K. Edwards
  2006-12-09  9:27                                           ` Jeff Garzik
  2 siblings, 0 replies; 82+ messages in thread
From: Michael K. Edwards @ 2006-12-09  4:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

On 12/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
> That said, I'm not personally convinced that there is much point to using
> netfilter for transparent proxying. Why not just use separate ports for
> squid and for apache?

Just a question of whether you want to be able to yank the squid box
out if it goes pear-shaped, without touching configs on the apache
box.  Some people like to stick the proxy in as a no-op at first, then
tell netfilter to divert 1% of sessions to squid and see how it holds
up, retune, ease it in, ease it out, figure out how much operational
flexibility you will have as demand continues to scale.  If the squid
and apache are on the same box it's probably less of an issue.

Cheers,


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  2:52                                     ` Martin Langhoff
@ 2006-12-09  5:09                                       ` H. Peter Anvin
  2006-12-09  5:34                                         ` Martin Langhoff
  0 siblings, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09  5:09 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

Martin Langhoff wrote:
> I posted separately about those. And I've been mulling about whether
> the thundering herd is really such a big problem that we need to
> address it head-on.

Uhm... yes it is.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  5:09                                       ` H. Peter Anvin
@ 2006-12-09  5:34                                         ` Martin Langhoff
  2006-12-09 16:26                                           ` H. Peter Anvin
  0 siblings, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-09  5:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

On 12/9/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Martin Langhoff wrote:
> > I posted separately about those. And I've been mulling about whether
> > the thundering herd is really such a big problem that we need to
> > address it head-on.
>
> Uhm... yes it is.

Got some more info, discussion points or links to stuff I should read
to appreciate why that is? I am trying to articulate why I consider it
not a high-payoff task, as well as describe how to tackle it.

To recap, the reasons it is not high payoff are:

 - the main benefit comes from being cacheable and able to revalidate
the cache cheaply (with the ETags-based strategy discussed above)
 - highly distributed caches/proxies means we'll seldom see a true
cold cache situation
 - we have a huge set of URLs which are seldom hit, and will never see
a thundering anything
 - we have a tiny set of very popular URLs that are the key target for
the thundering herd - (projects page, summary page, shortlog, fulllog)
- but those are in the clear as soon as the caches are populated

Why do we have to take it head-on? :-)





* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 23:27                                     ` Linus Torvalds
                                                         ` (2 preceding siblings ...)
  2006-12-09  1:56                                       ` Martin Langhoff
@ 2006-12-09  7:56                                       ` Steven Grimm
  3 siblings, 0 replies; 82+ messages in thread
From: Steven Grimm @ 2006-12-09  7:56 UTC (permalink / raw)
  To: Git Mailing List

Linus Torvalds wrote:
> Looking at the memcached operations, they have the "read" op (aka "get"), 
> but they seem to have no "read-for-fill" op. So memcached fundamentally 
> doesn't fix this problem, at least without explicit serialization by the 
> client.
>   

Actually, memcached does support an operation that would work for this: 
the "add" request, which creates a new cache entry if and only if the 
key is not already in the cache. If the key is already present, the 
request fails. You can use that to implement a simple named mutex, and 
it supports a client-specified timeout. The one thing it doesn't support 
that you described is a notion of deleting a key when a particular 
client disconnects, but as you say, that should only happen in the case 
of buggy clients anyway.
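A rough sketch of using "add" as a named mutex (the helper names are invented; the fake client below only mimics the add/delete behaviour described above so the sketch can be exercised without a running server - a real deployment would use an actual memcached client with this add/delete shape):

```python
import time

def try_lock(mc, name, ttl=30):
    """memcached's 'add' succeeds only if the key is absent, so it can
    serve as a named mutex with a server-side timeout."""
    return bool(mc.add("lock:" + name, "1", ttl))

def unlock(mc, name):
    mc.delete("lock:" + name)

class FakeMemcache:
    """Minimal in-memory stand-in for a memcached client (hypothetical,
    for illustration only): add() fails if the key is alive."""
    def __init__(self):
        self.store = {}   # key -> (value, expires_at)

    def _alive(self, key):
        hit = self.store.get(key)
        return hit is not None and hit[1] > time.time()

    def add(self, key, value, ttl):
        if self._alive(key):
            return 0      # key present: add fails, mutex is held
        self.store[key] = (value, time.time() + ttl)
        return 1

    def delete(self, key):
        self.store.pop(key, None)
```

The client-specified timeout means a crashed lock holder cannot wedge the cache forever: the lock key simply expires.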

Mind you, I'm not convinced memcached is necessarily the right answer 
for this problem, but it does provide a way to implement the required 
locking semantics.

BTW, I'm one of the main contributors to memcached, so if it does end up 
looking like a good choice except for some minor issue or another, I may 
be able to tweak it to cover whatever is missing. For example, the 
"delete a key on disconnect" thing would be fairly straightforward, if 
it's actually necessary in practice.

-Steve


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  0:45                                         ` Linus Torvalds
  2006-12-09  0:47                                           ` H. Peter Anvin
@ 2006-12-09  9:16                                           ` Jeff Garzik
  1 sibling, 0 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-09  9:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

Linus Torvalds wrote:
> 
> On Fri, 8 Dec 2006, Jeff Garzik wrote:
>> This is a bit cheesy, and completely untested, but since mod_cache never
>> worked for me either, I bet it works better ;-)
> 
> Ok, this doesn't do the locking either, so on cache misses or expiry, 
> you're still going to be that thundering herd.

Well, gdbm does reader/writer locking.

You still get a bit of a thundering herd, though.  I suppose I could 
open the gdbm db for writing before calling the CGI, which would 
effectively get what you're looking for.


> Also, if you want to be nice to clients, I'd seriously suggest that when 
> you hit in the cache, but it's expired (or it's close to expired), you 
> still serve the cached data back, but you set up a thread in the 
> background (with some maximum number of active threads, of course!) that 
> refreshes the cached entry and then you extend the expiration time so that 
> you won't end up doing this "refresh" _again_.
> 
> It's kind of silly to have people wait for 20 seconds just because a cache 
> expired five seconds ago. Much nicer to say "ok, we allow a certain 
> grace-period during which we'll do the real lookup, but to make things 
> _look_ really responsive, we still use the old cached value".

True, should work with gitweb data at least.

	Jeff



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  0:49                                         ` Linus Torvalds
  2006-12-09  0:51                                           ` H. Peter Anvin
  2006-12-09  4:36                                           ` Michael K. Edwards
@ 2006-12-09  9:27                                           ` Jeff Garzik
  2 siblings, 0 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-09  9:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael K. Edwards, H. Peter Anvin, Rogan Dawes, Kernel Org Admin,
	Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
> That said, I'm not personally convinced that there is much point to using 
> netfilter for transparent proxying. Why not just use separate ports for 
> squid and for apache?


That's what most people using squid in "http accelerator" mode do.  They 
put Apache on port 8080 or somesuch.

	Jeff



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  1:56                                       ` Martin Langhoff
@ 2006-12-09 11:51                                         ` Jakub Narebski
  2006-12-09 12:42                                           ` Jeff Garzik
  0 siblings, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-09 11:51 UTC (permalink / raw)
  To: Martin Langhoff, Git Mailing List
  Cc: Linus Torvalds, Jeff Garzik, H. Peter Anvin, Rogan Dawes,
	Kernel Org Admin

Martin Langhoff wrote:

> We can make gitweb detect mod_perl and do a few smarter things if it
> is running inside of it. In fact, we can (ab)use mod_perl and perl
> facilities a bit to do some serialization which will be a big win for
> some pages. What we need for that is to set a sensible ETag and
> use some IPC to announce/check if other apache/modperl processes are
> preparing content for the same ETag. The first process to announce a
> given ETag can then write it to a common temp directory (atomically -
> write to a temp-name and move to the expected name) while other
> processes wait, polling for the file. Once the file is in place the
> latecomers can just serve the content of the file and exit.

First, it would (and could) work only for serving gitweb over mod_perl.
I'm not sure the IPC overhead and implementation complications are
worth it: this is perhaps better solved by a caching engine.

But let us put aside actual caching for a while (writing an HTML version
of the page to a common temp directory, and serving this static page
if possible), and talk a bit about what gitweb can do with respect to
cache validation.

In addition to setting either the Expires: header or Cache-Control: max-age,
gitweb should also set Last-Modified: and ETag headers, and also 
probably respond to If-Modified-Since: and If-None-Match: requests.

Would it be worth implementing this?
 
> (I am calling the "state we are serving" identifier ETag because I
> think we should also set it as the ETag in the HTTP headers, so we'll
> be able to check the ETag of future requests for staleness - all we
> need is a ref lookup, and if the SHA1 matches, we are sorted). So
> having this 'unique request identifier' doubles up nicely...

For some pages ETag is natural; for others Last-Modified: would be more
natural.

> The ETag should probably be:
>  - SHA1+displaytype+args for pages that display an object identified
>    by SHA1

What uniquely identifies contents in "object" views ("commit", "tag",
"tree", "blob") is either h=SHA1, or hb=SHA1;f=FILENAME (in the absence
of h=SHA1). If both h=SHA1 and hb=SHA1 are present, hb=SHA1 serves as
a backlink. The "diff" views ("commitdiff", "blobdiff") are uniquely
identified by a pair of object identifiers (pairs of SHA1s, or pairs of
hb SHA1 + FILENAME).

Three of those views ("blob", "commitdiff", "blobdiff") have their 
"plain" version; so ETag should include displaytype (action, 'a' 
parameter).

The hb=SHA1;f=FILENAME identifier can be converted at the cost of one
call to a git command (which is a bit expensive, as it recurses
trees), namely git-ls-tree.

The ETag can simply be the args (query), if all h/hb/hbp parameters are SHA1.
Or the ETag can be the SHA1 of an object (or a pair of SHA1s in the case of
a diff), but this is a little more costly to verify. Although we usually 
(always?) convert hb=SHA1;f=FILENAME to h=SHA1 anyway when 
displaying/generating the page.

Usually you can compare ETags based on the URL alone.
   
>  - refname+SHA1+displaytype+args for pages that display something
>    identified by a ref

For object views we can simply convert the refname to a SHA1. I'm not sure if 
it is worth it. In the cases where we have to calculate the SHA1 of the 
object anyway, we can return (and validate) an ETag with SHA1 as above.

- ETag and/or Last-Modified headers for "log" views: "log", 
"shortlog" (which is part of the summary view), "history", "rss"/"atom" views.

On one hand, all log views (at least now) are identified by their 
parameters (action/view name, and the filename in the case of the history 
view) and the SHA1 of the top commit. On the other hand, it might be easier 
to use Last-Modified with the date of the top commit... Verifying a 
SHA1-based ETag could add some overhead in the case of a miss.

>  - SHA1(names and sha1s of all refs) for the summary page

Wouldn't it be simpler to just set the Last-Modified: header (and check
it)?


P.S. Can anyone post some benchmarks comparing gitweb deployed under 
mod_perl with gitweb deployed as a CGI script? Does kernel.org use 
mod_perl, or the CGI version of gitweb?

-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 11:51                                         ` Jakub Narebski
@ 2006-12-09 12:42                                           ` Jeff Garzik
  2006-12-09 13:37                                             ` Jakub Narebski
                                                               ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-09 12:42 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jakub Narebski wrote:
> First, it would (and could) work only for serving gitweb over mod_perl.
> I'm not sure if overhead with IPC and complications implementing are
> worth it: this perhaps be better solved by caching engine.

It is.  At least for kernel.org, the issue isn't that CGI is expensive, 
it's that I/O is expensive.


> In addition to setting either the Expires: header or Cache-Control: max-age,
> gitweb should also set Last-Modified: and ETag headers, and also 
> probably respond to If-Modified-Since: and If-None-Match: requests.
> 
> Would it be worth implementing this?

IMO yes, since most major browsers, caches, and spiders support these 
headers.


> For some pages ETag is natural; for others Last-Modified: would be more
> natural.

Yes, a good point to note.


> Usually you can compare ETags based on the URL alone.

Mostly true:  you must also consider HTTP_ACCEPT


> Wouldn't it be simpler to just set the Last-Modified: header (and check
> it)?

That would be a good start, and suffice for many cases.  If the CGI can 
simply stat(2) files rather than executing git-* programs, that would 
increase efficiency quite a bit.

A core problem with cache hints via HTTP headers (last-modified, etc.) 
is that you don't achieve caching across multiple clients, just across 
repeated queries from the same client (or caching proxy).

At least for the RSS/Atom feeds and the git main page, it makes no sense 
to regenerate that data repeatedly.

Internally, gitweb would need to do a stat() on key files, and return 
pre-generated XML for the feeds if the stat() reveals no changes.  Ditto 
for the front page.
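That stat()-based short-circuit might look roughly like this (a sketch with invented helper names; the choice of packed-refs as the "key file" whose mtime changes on a push is an assumption for illustration, not what gitweb actually does):

```python
import os

def serve_feed(repo_dir, cache_path, generate_feed):
    """Regenerate the feed XML only when the key file has changed
    since the cached copy was written; otherwise serve the cache."""
    # Hypothetical key file: something whose mtime is bumped by a push.
    key_file = os.path.join(repo_dir, "packed-refs")
    try:
        key_mtime = os.stat(key_file).st_mtime
    except FileNotFoundError:
        key_mtime = None
    try:
        cache_mtime = os.stat(cache_path).st_mtime
    except FileNotFoundError:
        cache_mtime = None

    if (cache_mtime is not None and key_mtime is not None
            and cache_mtime >= key_mtime):
        with open(cache_path) as f:
            return f.read()        # repo unchanged: serve pre-generated XML

    xml = generate_feed()          # the expensive git-walking part
    with open(cache_path, "w") as f:
        f.write(xml)
    return xml
```

Two cheap stat(2) calls replace a full feed regeneration on every hit.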


> P.S. Can anyone post some benchmarks comparing gitweb deployed under 
> mod_perl with gitweb deployed as a CGI script? Does kernel.org use 
> mod_perl, or the CGI version of gitweb?

CGI version of gitweb.

But again, mod_perl vs. CGI isn't the issue.

	Jeff



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 12:42                                           ` Jeff Garzik
@ 2006-12-09 13:37                                             ` Jakub Narebski
  2006-12-09 14:43                                               ` Jeff Garzik
  2006-12-10  4:07                                               ` Martin Langhoff
  2006-12-09 18:04                                             ` Linus Torvalds
  2006-12-10  3:55                                             ` Martin Langhoff
  2 siblings, 2 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-09 13:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jeff Garzik wrote:
> Jakub Narebski wrote:

>> In addition to setting either the Expires: header or Cache-Control: max-age,
>> gitweb should also set Last-Modified: and ETag headers, and also 
>> probably respond to If-Modified-Since: and If-None-Match: requests.
>> 
>> Would it be worth implementing this?
> 
> IMO yes, since most major browsers, caches, and spiders support these 
> headers.
 
Sending Last-Modified: should be easy; sending ETag needs some consensus
on the contents: mainly about validation. Responding to If-Modified-Since:
and If-None-Match: should cut at least _some_ of the page generation time.
If the ETag can be calculated from the URL alone, then we can answer
If-None-Match: right at the beginning of the script.
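An early If-None-Match check of that kind could be sketched as follows (in Python rather than gitweb's Perl, with invented helper names; the ETag format here is made up for illustration):

```python
def etag_from_query(params):
    """When every object parameter is already a full SHA1, the query
    itself identifies the content, so an ETag can be built from the
    URL parameters alone - no git command needed."""
    parts = ["%s=%s" % (k, params[k]) for k in sorted(params)]
    return '"' + ";".join(parts) + '"'

def maybe_304(request_headers, params):
    """Return a 304 response if the client's If-None-Match matches the
    URL-derived ETag; otherwise None, meaning: generate the page."""
    etag = etag_from_query(params)
    if request_headers.get("If-None-Match") == etag:
        return "Status: 304 Not Modified\r\nETag: %s\r\n\r\n" % etag
    return None
```

The point is that the 304 short-circuit happens before any repository access at all.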
 
>> For some pages ETag is natural; for others Last-Modified: would be more
>> natural.
> 
> Yes, a good point to note.
> 
>> Usually you can compare ETags based on the URL alone.
> 
> Mostly true:  you must also consider HTTP_ACCEPT

Well, yes, ETag is an HTTP/1.1 header.

>> Wouldn't it be simplier to just set Last-Modified: header (and check
>> it?)
> 
> That would be a good start, and suffice for many cases.  If the CGI can 
> simply stat(2) files rather than executing git-* programs, that would 
> increase efficiency quite a bit.

As I said, I'm not talking (at least for now) about saving generated HTML
output. That, I think, is better solved in a caching engine such as Squid.
Although even here some git specifics can be of help: we can invalidate
the cache on push, and we know that some results don't ever change (well,
with the exception of changing the output of gitweb).

> A core problem with cache hints via HTTP headers (last-modified, etc.) 
> is that you don't achieve caching across multiple clients, just across 
> repeated queries from the same client (or caching proxy).
> 
> At least for the RSS/Atom feeds and the git main page, it makes no sense 
> to regenerate that data repeatedly.
> 
> Internally, gitweb would need to do a stat() on key files, and return 
> pre-generated XML for the feeds if the stat() reveals no changes.  Ditto 
> for the front page.

I'm not sure if it is worth implementing in gitweb, or whether it is better
left to a caching engine. With the projects list page and summary page there
is an additional problem with relative dates, although this can be solved
using Jonas Fonseca's idea of using absolute dates in the page and using
ECMAScript (JavaScript) to convert them to relative ones: on load, and
perhaps on a timer ;-)


What can be _easily_ done:
 * Use post-1.4.4 gitweb, which uses git-for-each-ref to generate the summary
   page; this makes the summary page around 3 times faster.
 * Perhaps using a projects list file (which can now be generated by gitweb)
   instead of scanning directories and stat()-ing for the owner would help
   with the time to generate the projects list page

What can be quite easily incorporated into gitweb:
 * For immutable pages set Expires: or Cache-Control: max-age (or both)
   to infinity
 * Calculate a hash+action based ETag at least for those actions where it is
   easy, and respond with 304 Not Modified as soon as it can.
   This might require some code reorganization to not begin writing output
   before calculating the ETag and doing the ETag comparison (If-Match,
   If-None-Match).
 * Generate Last-Modified: for those views where it can be calculated,
   and respond with 304 Not Modified as soon as it can.

What can be easily done using a caching engine:
 * Select the top 10 common queries, and cache them, invalidating the cache
   on push (depending on the query: for example invalidate the projects list
   on a push to any project, invalidate the RSS/Atom feed and summary pages
   only on a push to a specific project) - can be done with git hooks.
-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 13:37                                             ` Jakub Narebski
@ 2006-12-09 14:43                                               ` Jeff Garzik
  2006-12-09 17:02                                                 ` Jakub Narebski
  2006-12-10  4:07                                               ` Martin Langhoff
  1 sibling, 1 reply; 82+ messages in thread
From: Jeff Garzik @ 2006-12-09 14:43 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jakub Narebski wrote:
> Sending Last-Modified: should be easy; sending ETag needs some consensus
> on the contents: mainly about validation. Responding to If-Modified-Since:
> and If-None-Match: should cut at least _some_ of the page generation time.

Definitely.


> As I said, I'm not talking (at least for now) about saving generated HTML
> output. That, I think, is better solved in a caching engine such as Squid.
> Although even here some git specifics can be of help: we can invalidate
> the cache on push, and we know that some results don't ever change (well,
> with the exception of changing the output of gitweb).

It depends on how creatively you think ;-)

Consider generating static HTML files on each push, via a hook, for many 
of the toplevel files.  The static HTML would then link to the CGI for 
further dynamic querying of the git database.
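
A rough sketch of such a hook; the GITWEB and WEBROOT paths, the project
name, and the header stripping are illustrative assumptions, not a tested
kernel.org setup:

```shell
#!/bin/sh
# Hypothetical post-receive hook rendering toplevel gitweb pages to
# static files by driving gitweb.cgi through the standard CGI
# environment variables.

: "${GITWEB:=/var/www/cgi-bin/gitweb.cgi}"
: "${WEBROOT:=/var/www/gitweb-static}"
project=myproject.git

render() {
	action=$1; out=$2
	# tr/sed drop the CGI header block up to the first blank line,
	# leaving only the page body.
	GATEWAY_INTERFACE="CGI/1.1" HTTP_ACCEPT="*/*" \
	REQUEST_METHOD="GET" QUERY_STRING="p=$project;a=$action" \
		"$GITWEB" | tr -d '\r' | sed '1,/^$/d' > "$WEBROOT/$out"
}

# Regenerate the popular, ref-dependent pages on every push.
if [ -x "$GITWEB" ] && [ -d "$WEBROOT" ]; then
	render summary  "$project-summary.html"
	render shortlog "$project-shortlog.html"
fi
```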



> What can be _easily_ done:
>  * Use post-1.4.4 gitweb, which uses git-for-each-ref to generate the
>    summary page; this makes the summary page around 3 times faster.

This re-opens the question mentioned earlier, is Kay (or anyone?) still 
actively maintaining gitweb on k.org?


>  * Perhaps using a projects list file (which can now be generated by gitweb)
>    instead of scanning directories and stat()-ing for owner would help
>    with the time to generate the projects list page

This could be statically generated by a robot.  I think everybody would 
shrink in horror if a human needed to maintain such a file.


> What can be quite easily incorporated into gitweb:
>  * For immutable pages set Expires: or Cache-Control: max-age (or both)
>    to infinity

nice!


>  * Generate Last-Modified: for those views where it can be calculated,
>    and respond with 304 Not Modified as soon as it can.

agreed


> What can be easily done using a caching engine:
>  * Select the top 10 most common queries and cache them, invalidating the
>    cache on push (depending on the query: for example, invalidate the project
>    list on a push to any project, but invalidate RSS/Atom feeds and summary
>    pages only on a push to the specific project) - can be done with git hooks.

Or simply generate regular filesystem files into the webspace, as 
triggered by a hook.  Let the standard filesystem mirroring/caching work 
its magic.

	Jeff


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09  5:34                                         ` Martin Langhoff
@ 2006-12-09 16:26                                           ` H. Peter Anvin
  0 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09 16:26 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin,
	Git Mailing List, Petr Baudis

Martin Langhoff wrote:
> On 12/9/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> Martin Langhoff wrote:
>> > I posted separately about those. And I've been mulling about whether
>> > the thundering herd is really such a big problem that we need to
>> > address it head-on.
>>
>> Uhm... yes it is.
> 
> Got some more info, discussion points or links to stuff I should read
> to appreciate why that is? I am trying to articulate why I consider it
> is not a high-payoff task, as well as describing how to tackle it.
> 
> To recap, the reasons it is not high payoff are that:
> 
> - the main benefit comes from being cacheable and able to revalidate
> the cache cheaply (with the ETags-based strategy discussed above)
> - highly distributed caches/proxies mean we'll seldom see a true
> cold cache situation
> - we have a huge set of URLs which are seldom hit, and will never see
> a thundering anything
> - we have a tiny set of very popular URLs that are the key target for
> the thundering herd - (projects page, summary page, shortlog, fulllog)
> - but those are in the clear as soon as the caches are populated
> 
> Why do we have to take it head-on? :-)
> 

Because the primary failure scenario is timeout on the common queries 
due to excess parallel invocations under high I/O load resulting in 
catastrophic failure.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 14:43                                               ` Jeff Garzik
@ 2006-12-09 17:02                                                 ` Jakub Narebski
  2006-12-09 17:27                                                   ` Jeff Garzik
  0 siblings, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-09 17:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jeff Garzik wrote:
> Jakub Narebski wrote:

>> As I said, I'm not talking (at least now) about saving generated HTML
>> output. This, I think, is better solved in a caching engine like Squid.
>> Although even here some git specifics can be of help: we can invalidate
>> the cache on push, and we know that some results don't ever change (well,
>> except when the output of gitweb itself changes).
> 
> It depends on how creatively you think ;-)
> 
> Consider generating static HTML files on each push, via a hook, for many 
> of the toplevel files.  The static HTML would then link to the CGI for 
> further dynamic querying of the git database.

You mean that the links in this pre-generated HTML would be to CGI
pages?
 
>> What can be _easily_ done:
>>  * Use post-1.4.4 gitweb, which uses git-for-each-ref to generate the
>>    summary page; this makes the summary page around 3 times faster.
> 
> This re-opens the question mentioned earlier, is Kay (or anyone?) still 
> actively maintaining gitweb on k.org?

By the way, thanks to Martin Waitz it is much easier to install gitweb.
I for example use the following script to test changes I have made to gitweb:

-- >8 --
#!/bin/bash

BINDIR="/home/local/git"

function make_gitweb()
{
	pushd "/home/jnareb/git/"

	make GITWEB_PROJECTROOT="/home/local/scm" \
	     GITWEB_CSS="/gitweb/gitweb.css" \
	     GITWEB_LOGO="/gitweb/git-logo.png" \
	     GITWEB_FAVICON="/gitweb/git-favicon.png" \
	     bindir=$BINDIR \
	     gitweb/gitweb.cgi

	popd
}

function copy_gitweb()
{
	cp -fv /home/jnareb/git/gitweb/gitweb.{cgi,css} /home/local/gitweb/
}

make_gitweb
copy_gitweb

# end of gitweb-update.sh
-- >8 --

>>  * Perhaps using a projects list file (which can now be generated by gitweb)
>>    instead of scanning directories and stat()-ing for owner would help
>>    with the time to generate the projects list page
> 
> This could be statically generated by a robot.  I think everybody would 
> shrink in horror if a human needed to maintain such a file.

Gitweb can generate this file. The problem is that one would have to
temporarily turn off using the index file. This can be done by having the
following gitweb_list_projects.perl file:

-- >8 --
#!/usr/bin/perl

$projects_list = "";
-- >8 --

then use the following invocation to generate the project index file:

$ GATEWAY_INTERFACE="CGI/1.1" HTTP_ACCEPT="*/*" REQUEST_METHOD="GET" \
  GITWEB_CONFIG=gitweb_list_projects.perl QUERY_STRING="a=project_index" \
  gitweb.cgi 

-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 17:02                                                 ` Jakub Narebski
@ 2006-12-09 17:27                                                   ` Jeff Garzik
  0 siblings, 0 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-09 17:27 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jakub Narebski wrote:
> Jeff Garzik wrote:
>> Jakub Narebski wrote:
> 
>>> As I said, I'm not talking (at least now) about saving generated HTML
>>> output. This, I think, is better solved in a caching engine like Squid.
>>> Although even here some git specifics can be of help: we can invalidate
>>> the cache on push, and we know that some results don't ever change (well,
>>> except when the output of gitweb itself changes).
>> It depends on how creatively you think ;-)
>>
>> Consider generating static HTML files on each push, via a hook, for many 
>> of the toplevel files.  The static HTML would then link to the CGI for 
>> further dynamic querying of the git database.
> 
> You mean that the links in this pre-generated HTML would be to CGI
> pages?

Yes, they must be.  Otherwise, the gitweb interface changes.

You don't want to pre-generate HTML for every possible git query, that 
would cause an explosion of data.

Both the HTML generator and the CGI would need to know which pages are 
pre-generated and which are not.

	Jeff


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 12:42                                           ` Jeff Garzik
  2006-12-09 13:37                                             ` Jakub Narebski
@ 2006-12-09 18:04                                             ` Linus Torvalds
  2006-12-09 18:30                                               ` H. Peter Anvin
  2006-12-10  3:55                                             ` Martin Langhoff
  2 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2006-12-09 18:04 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jakub Narebski, Martin Langhoff, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin



On Sat, 9 Dec 2006, Jeff Garzik wrote:
> 
> It is.  At least for kernel.org, the issue isn't that CGI is expensive, it's
> that I/O is expensive.

Note that if we had a new gitweb, we could also use packed refs. 
Those help CPU usage, but they actually help IO patterns more, exactly 
because they avoid all the seeking around in the filesystem.

So with packed refs, there's no need to go from directory lookup to inode 
lookup to data lookup to object lookup for *each* ref - you can do the 
"packed-refs" lookup _once_ (which obviously does the dir->inode->data), 
and you don't need to do the object lookup at all.

Of course, gitweb will then end up doing the object lookup anyway (because 
of getting the dates etc for refs), but if you have packed-refs and a 
reasonably packed repository, that should still really cut down on IO in a 
big way.
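
For illustration, the effect is easy to see in a throwaway repository
(created below just for the demonstration; the paths are not anyone's
real setup):

```shell
#!/bin/sh
# After git-pack-refs, the refs live as lines in the single
# .git/packed-refs file, so enumerating them is one open+read instead
# of a dir->inode->data lookup per loose ref under .git/refs/.

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.org"
echo content > file
git add file
git commit -q -m initial
git branch topic        # a second loose ref under .git/refs/heads/

git pack-refs --all --prune
cat .git/packed-refs    # lists "<sha1> <refname>" for the packed refs
```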

So there's probably tons of room for making this more efficient: using a 
newer gitweb, packing refs, using the cgi cache thing.. It sounds like 
what it really needs is just somebody with the competence and time to be 
willing to step up and maintain gitweb on kernel.org...


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 18:04                                             ` Linus Torvalds
@ 2006-12-09 18:30                                               ` H. Peter Anvin
  0 siblings, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-09 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Jakub Narebski, Martin Langhoff, Git Mailing List,
	Rogan Dawes, Kernel Org Admin

Linus Torvalds wrote:
> 
> So there's probably tons of room for making this more efficient: using a 
> newer gitweb, packing refs, using the cgi cache thing.. It sounds like 
> what it really needs is just somebody with the competence and time to be 
> willing to step up and maintain gitweb on kernel.org...
> 

Indeed.  We have a lot of projects on kernel.org which are like this: 
not at all conceptually hard, but a huge time commitment for Doing It 
Right[TM].  This is why I sometimes think that it would be a Good Thing 
to get paid staff for kernel.org, although I was hoping to defer the 
need for that until at least we have our 501(c)3 paperwork done, which 
looks like mid-2007 at this point (assuming no further delays.)


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 12:42                                           ` Jeff Garzik
  2006-12-09 13:37                                             ` Jakub Narebski
  2006-12-09 18:04                                             ` Linus Torvalds
@ 2006-12-10  3:55                                             ` Martin Langhoff
  2006-12-10  7:05                                               ` H. Peter Anvin
  2 siblings, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-10  3:55 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Jakub Narebski, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

On 12/10/06, Jeff Garzik <jeff@garzik.org> wrote:
> > P.S. Can anyone post some benchmark comparing gitweb deployed under
> > mod_perl as compared to deployed as CGI script? Does kernel.org use
> > mod_perl, or CGI version of gitweb?
>
> CGI version of gitweb.
>
> But again, mod_perl vs. CGI isn't the issue.

IO is the issue, and the CGI startup of Perl is quite IO & CPU
intensive. Even if the caching headers, thundering herds and planet
collisions are resolved, I don't think you'll ever be happy with IO
and CPU load on kernel.org running gitweb as CGI.

cheers,




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-09 13:37                                             ` Jakub Narebski
  2006-12-09 14:43                                               ` Jeff Garzik
@ 2006-12-10  4:07                                               ` Martin Langhoff
  2006-12-10 10:09                                                 ` Jakub Narebski
  1 sibling, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-10  4:07 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

On 12/10/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Jeff Garzik wrote:
> > Jakub Narebski wrote:
>
> >> In addition to setting either Expires: header or Cache-Control: max-age
> >> gitweb should also set Last-Modified: and ETag headers, and also
> >> probably respond to If-Modified-Since: and If-None-Match: requests.
> >>
> >> Would be worth implementing this?
> >
> > IMO yes, since most major browsers, caches, and spiders support these
> > headers.
>
> Sending Last-Modified: should be easy; sending ETag needs some consensus
> on the contents: mainly about validation. Responding to If-Modified-Since:
> and If-None-Match: should cut at least _some_ of the page generating time.
> If ETag can be calculated on URL alone, then we can cut If-None-Match:
> just at beginning of script.

Indeed. Let me add myself to the pileup agreeing that a combination of
setting Last-Modified and checking for If-Modified-Since for
ref-centric pages (log, shortlog, RSS, and summary) is the smartest
scheme. I got locked into thinking ETags.

> > That would be a good start, and suffice for many cases.  If the CGI can
> > simply stat(2) files rather than executing git-* programs, that would
> > increase efficiency quite a bit.
>
> As I said, I'm not talking (at least now) about saving generated HTML
> output. This, I think, is better solved in a caching engine like Squid.
> Although even here some git specifics can be of help: we can invalidate
> the cache on push, and we know that some results don't ever change (well,
> except when the output of gitweb itself changes).

Indeed - gitweb should not be saving HTML around, but giving the best
possible hints to Squid and friends, and improving our ability to
short-cut and send a 304 Not Modified.

> What can be _easily_ done:

Great plan. :-)


cheers,



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10  3:55                                             ` Martin Langhoff
@ 2006-12-10  7:05                                               ` H. Peter Anvin
  2006-12-12 21:19                                                 ` Jakub Narebski
  0 siblings, 1 reply; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-10  7:05 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jeff Garzik, Jakub Narebski, Git Mailing List, Linus Torvalds,
	Rogan Dawes, Kernel Org Admin

Martin Langhoff wrote:
> On 12/10/06, Jeff Garzik <jeff@garzik.org> wrote:
>> > P.S. Can anyone post some benchmark comparing gitweb deployed under
>> > mod_perl as compared to deployed as CGI script? Does kernel.org use
>> > mod_perl, or CGI version of gitweb?
>>
>> CGI version of gitweb.
>>
>> But again, mod_perl vs. CGI isn't the issue.
> 
> IO is the issue, and the CGI startup of Perl is quite IO & CPU
> intensive. Even if the caching headers, thundering herds and planet
> collisions are resolved, I don't think you'll ever be happy with IO
> and CPU load on kernel.org running gitweb as CGI.
> 

I/O - nonexistent; that stuff will be in memory.

CPU - we have more CPU than you can shake a stick at, and it's 95+% idle.

*NOT AN ISSUE*.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-08 16:42                                   ` H. Peter Anvin
  2006-12-08 19:49                                     ` Lars Hjemli
@ 2006-12-10  9:43                                     ` rda
  1 sibling, 0 replies; 82+ messages in thread
From: rda @ 2006-12-10  9:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Rogan Dawes, Kernel Org Admin, Git Mailing List,
	Jakub Narebski

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Linus Torvalds wrote:
> > I could write a simple C caching thing that just hashes the CGI arguments
> > and uses a hash to create a cache (and proper lock-files etc to serialize
> > access to a particular cache object while it's being created) fairly
> > easily, but I'm pretty sure people would much prefer a mod_perl thing just
> > to avoid the fork/exec overhead with Apache (I think mod_perl allows
> > Apache to run perl scripts without it), and that means I'm not the right
> > person any more.
>
> True about mod_perl.  Haven't messed with that myself, either.
> fork/exec really is very cheap on Linux, so it's not a huge deal.

In the case of Perl scripts, it's not really the fork/exec overhead,
but the Perl startup overhead that you want to try to optimize.  But
given your later statement (lots of spare cpu), this ends up just
being a bit of a latency hit.   In general, I think mod_perl has a

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10  4:07                                               ` Martin Langhoff
@ 2006-12-10 10:09                                                 ` Jakub Narebski
  2006-12-10 12:41                                                   ` Jeff Garzik
  0 siblings, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-10 10:09 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Martin Langhoff wrote:
> On 12/10/06, Jakub Narebski <jnareb@gmail.com> wrote:

>> Sending Last-Modified: should be easy; sending ETag needs some consensus
>> on the contents: mainly about validation. Responding to If-Modified-Since:
>> and If-None-Match: should cut at least _some_ of the page generating time.
>> If ETag can be calculated on URL alone, then we can cut If-None-Match:
>> just at beginning of script.
> 
> Indeed. Let me add myself to the pileup agreeing that a combination of
> setting Last-Modified and checking for If-Modified-Since for
> ref-centric pages (log, shortlog, RSS, and summary) is the smartest
> scheme. I got locked into thinking ETags.

Sometimes it is easier to use ETags, sometimes it is easier to use
Last-Modified:. Usually you can check the ETag earlier (after calling
git-rev-list) than Last-Modified (after parsing the first commit). But
some pages don't have a natural ETag...

Besides, because ETag is HTTP/1.1-only, we should provide and validate
both.

P.S. Any hints to how to do this with CGI Perl module?
-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 10:09                                                 ` Jakub Narebski
@ 2006-12-10 12:41                                                   ` Jeff Garzik
  2006-12-10 13:02                                                     ` Jakub Narebski
  0 siblings, 1 reply; 82+ messages in thread
From: Jeff Garzik @ 2006-12-10 12:41 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jakub Narebski wrote:
> P.S. Any hints to how to do this with CGI Perl module?

It's impossible, Apache doesn't supply e-tag info to CGI programs.  (it 
does supply HTTP_CACHE_CONTROL though apparently)

You could probably do it via mod_perl.

	Jeff


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 12:41                                                   ` Jeff Garzik
@ 2006-12-10 13:02                                                     ` Jakub Narebski
  2006-12-10 13:45                                                       ` Jeff Garzik
  0 siblings, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-10 13:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jeff Garzik wrote:
> Jakub Narebski wrote:
>>
>> P.S. Any hints to how to do this with CGI Perl module?
> 
> It's impossible, Apache doesn't supply e-tag info to CGI programs.  (it 
> does supply HTTP_CACHE_CONTROL though apparently)

By ETag info you mean access to the HTTP headers sent by the browser:
If-Modified-Since:, If-Match:, If-None-Match:, do you?
 
It's a pity that CGI interface doesn't cover that...

> You could probably do it via mod_perl.

So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ?
-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 13:02                                                     ` Jakub Narebski
@ 2006-12-10 13:45                                                       ` Jeff Garzik
  2006-12-10 19:11                                                         ` Jakub Narebski
  0 siblings, 1 reply; 82+ messages in thread
From: Jeff Garzik @ 2006-12-10 13:45 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

[-- Attachment #1: Type: text/plain, Size: 1588 bytes --]

Jakub Narebski wrote:
> Jeff Garzik wrote:
>> Jakub Narebski wrote:
>>> P.S. Any hints to how to do this with CGI Perl module?
>> It's impossible, Apache doesn't supply e-tag info to CGI programs.  (it 
>> does supply HTTP_CACHE_CONTROL though apparently)
> 
> By ETag info you mean access to HTTP headers sent by browser
> If-Modified-Since:, If-Match:, If-None-Match: do you?

You can use this attached shell script as a CGI script, to see precisely 
what information Apache gives you.  You can even experiment with passing 
back headers other than Content-type (such as E-tag), to see what sort 
of results are produced.  The script currently passes back both E-Tag 
and Last-Modified of a sample file; modify or delete those lines to suit 
your experiments.


> It's a pity that CGI interface doesn't cover that...
> 
>> You could probably do it via mod_perl.
> 
> So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ?

Sorry, I was /assuming/ mod_perl would make this available.  The HTTP 
header info is available to all Apache modules, but I confess I have no 
idea how mod_perl passes that info to scripts.

Also, an interesting thing while I was testing the attached shell 
script:  even though repeated hits to the script generate a proper 304 
response to the browser, the CGI script and its output run to completion. 
  So, it didn't save work on the CGI side; the savings were solely in not 
transmitting the document from server to client.  The server still went 
through the work of generating the document (by running the CGI), as one 
would expect.

	Jeff



[-- Attachment #2: fenv --]
[-- Type: text/plain, Size: 317 bytes --]

#!/bin/sh

FN=/tmp/foo

if [ ! -f "$FN" ]
then
	echo "blah blah blah" > "$FN"
fi

HASH=`md5sum "$FN" | cut -d' ' -f1`

echo "Content-type: text/plain"
echo "E-tag: $HASH"
echo Last-Modified: `date -r /tmp/foo '+%a, %d %b %Y %T %Z'`
echo ""

# don't pollute server environment output with our local additions
unset FN
unset HASH

set

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 13:45                                                       ` Jeff Garzik
@ 2006-12-10 19:11                                                         ` Jakub Narebski
  2006-12-10 19:50                                                           ` Linus Torvalds
  2006-12-10 22:05                                                           ` Jeff Garzik
  0 siblings, 2 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-10 19:11 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jeff Garzik wrote:
> Jakub Narebski wrote:
>> Jeff Garzik wrote:
>>> Jakub Narebski wrote:
>>>>
>>>> P.S. Any hints to how to do this with CGI Perl module?
>>>
>>> It's impossible, Apache doesn't supply e-tag info to CGI programs.  (it 
>>> does supply HTTP_CACHE_CONTROL though apparently)
>> 
>> By ETag info you mean access to HTTP headers sent by browser
>> If-Modified-Since:, If-Match:, If-None-Match: do you?

And in the CGI standard there is a way to access additional HTTP header
info from a CGI script: the environment variables are named HTTP_<HEADER>;
for example, if the browser sent an If-Modified-Since: header, its value
can be found in the HTTP_IF_MODIFIED_SINCE environment variable.

But of course gitweb should rather use mod_perl if possible, so
somewhere in gitweb there would be the following line:

  $in_date = $ENV{'MOD_PERL'} ?
    $r->header('If-Modified-Since') :
    $ENV{'HTTP_IF_MODIFIED_SINCE'};

or something like that...
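
In plain-CGI terms, the validation short-circuit would look something
like this sketch (exact string comparison on the date only; a real
implementation would parse and compare the timestamps):

```shell
#!/bin/sh
# Sketch of If-Modified-Since handling in a CGI script. The server
# passes the client's header in HTTP_IF_MODIFIED_SINCE; on a match we
# answer 304 and skip generating the page entirely.

respond() {
	last_modified=$1	# e.g. "Sun, 10 Dec 2006 19:11:00 GMT"
	if [ "${HTTP_IF_MODIFIED_SINCE:-}" = "$last_modified" ]; then
		printf 'Status: 304 Not Modified\r\n\r\n'
		return 0
	fi
	printf 'Content-Type: text/html\r\n'
	printf 'Last-Modified: %s\r\n\r\n' "$last_modified"
	echo '<html>...expensively generated page...</html>'
}
```

gitweb would call something like this with the date of the newest commit
affecting the requested view.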
 
> You can use this attached shell script as a CGI script, to see precisely 
> what information Apache gives you.  You can even experiment with passing 
> back headers other than Content-type (such as E-tag), to see what sort 
> of results are produced.  The script currently passes back both E-Tag 
> and Last-Modified of a sample file; modify or delete those lines to suit 
> your experiments.

It is ETag, not E-tag. Besides, I don't see what the attached script is
meant to do: it does not output the sample file anyway.

>> It's a pity that CGI interface doesn't cover that...
>> 
>>> You could probably do it via mod_perl.
>> 
>> So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ?
> 
> Sorry, I was /assuming/ mod_perl would make this available.  The HTTP 
> header info is available to all Apache modules, but I confess I have no 
> idea how mod_perl passes that info to scripts.
> 
> Also, an interesting thing while I was testing the attached shell 
> script:  even though repeated hits to the script generate a proper 304 
> response to the browse, the CGI script and its output run to completion. 
>   So, it didn't save work on the CGI side; the savings was solely in not 
> transmitting the document from server to client.  The server still went 
> through the work of generating the document (by running the CGI), as one 
> would expect.

The idea is of course to stop processing in the CGI / mod_perl script
as soon as possible if the cache validates.

I don't know if Apache intercepts and remembers the ETag and Last-Modified
headers, adds a 304 Not Modified HTTP response on finding that the cache
validates, and cuts out the CGI script output. I.e., if the browser provided
If-Modified-Since:, the script wrote a Last-Modified: header, and
If-Modified-Since: is no earlier than Last-Modified: (usually they are equal
in the case of cache validation), then Apache provides a 304 Not Modified
response instead of the CGI script output.

-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 19:11                                                         ` Jakub Narebski
@ 2006-12-10 19:50                                                           ` Linus Torvalds
  2006-12-10 20:27                                                             ` Jakub Narebski
  2006-12-10 21:01                                                             ` H. Peter Anvin
  2006-12-10 22:05                                                           ` Jeff Garzik
  1 sibling, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-10 19:50 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jeff Garzik, Martin Langhoff, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin



On Sun, 10 Dec 2006, Jakub Narebski wrote:
> >> If-Modified-Since:, If-Match:, If-None-Match: do you?
> 
> And in the CGI standard there is a way to access additional HTTP header
> info from a CGI script: the environment variables are named HTTP_<HEADER>;
> for example, if the browser sent an If-Modified-Since: header, its value
> can be found in the HTTP_IF_MODIFIED_SINCE environment variable.

Guys, you're missing something fairly fundamental. 

It helps almost _nothing_ to support client-side caching with all these 
fancy "If-Modified-Since:" etc crap.

That's not the _problem_.

It's usually not one client asking for the gitweb pages: the load comes 
from just lots of people independently asking for it. So client-side 
caching may help a tiny tiny bit, but it's not actually fixing the 
fundamental problem at all.

So forget about "If-Modified-Since:" etc. It may help in benchmarks when 
you try it yourself, and use "refresh" on the client side. But the basic 
problem is all about lots of clients that do NOT have things cached, 
because the client caches are all filled up with pr0n, not with gitweb 
data from yesterday.

So the thing to help is server-side caching with good access patterns, so 
that the server won't have to seek all over the disk when clients that 
_don't_ have things in their caches want to see the "git projects" summary 
overview (that currently lists something like 200+ projects).

So to get that list of 200+ projects, right now gitweb will literally walk 
them all, look at their refs, their descriptions, their ages (which 
requires looking up the refs, and the objects behind the refs), and if 
they aren't cached, you're going to have several disk seeks for each 
project.

At 200+ projects, the thing that makes it slow is those disk seeks. Even 
with a fast disk and RAID array, the seeks are all basically going to be 
interdependent, so there's no room for disk arm movement optimization, and 
in the absence of any other load it's still going to be several seconds 
just for the seeks (say 10ms per seek, four or five seeks per project, 
you've got 10 seconds _just_ for the seeks to generate the top-level 
summary page, and quite frankly, five seeks is probably optimistic).
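
The back-of-the-envelope numbers above multiply out as follows (a sketch
of the arithmetic only):

```shell
# Seek cost of the top-level summary page, per the estimate above:
projects=200
seeks_per_project=5	# dir, inode, data, ref, object - optimistic
ms_per_seek=10
total_ms=$((projects * seeks_per_project * ms_per_seek))
echo "${total_ms} ms"	# 10000 ms, i.e. ~10 seconds of pure seek time
```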

Now, hopefully some of it will be in the disk cache, but when the 
mirroring happens, it will basically blow the disk caches away totally 
(when using the "--checksum" option), and then you literally have tens of 
seconds to generate that one top-level page. 

And when mirroring is blowing out the disk caches, the thing will be doing 
other things _too_ to the disk, of course.

So what you want is server-side caching, and you basically _never_ want to 
re-generate that data synchronously (because even if the server can take 
the load, having the clients wait for half a minute or more for the data 
is just NOT FRIENDLY). This is why I suggested the grace-period where we 
fill the cache on the server side in the background _while_at_the_same_time 
actually feeding the clients the old cached contents.

Because what matters most to _clients_ is not getting the most recent 
up-to-date data within the last few minutes - people who go to the 
overview page want to just get a list of projects, and they want to get 
them in a second or two, not half a minute later.

And btw, all those "If-Modified-Since:" things are irrelevant, since quite 
often, the top-level page really technically _has_ been modified in the 
last few minutes, because with the kernel and git projects, _somebody_ has 
usually pushed out one of the projects within the last hour.

And no, people don't just sit there refreshing their browser page all the 
time. I bet even "active" git users do it at most once or twice a day, 
which means that their client cache will _never_ be up-to-date.

But if you do it with server-side caches and grace-periods, you can 
generally say "we have something that is at most five minutes old", and 
most importantly, you can hopefully do it without a lot of disk seeks 
(because you just cache the _one_ page as _one_ object), so hopefully you 
can do it in a few hundred ms even if the thing is on disk and even if 
there's a lot of other load going on.

I bet the top-level "all projects" summary page and the individual 
project summary pages are the important things to cache. That's what 
probably most people look at, and they are the ones that have lots of 
server-side cache locality. Individual commits and diffs probably don't 
get the same kind of "lots of people looking at them" and thus don't get 
the same kind of benefit from caching.

(Individual commits hopefully also need fewer disk seeks, at least with 
packed repositories. So even if you have to re-generate them from scratch, 
they won't have the seek times themselves taking up tens of seconds, 
unless the project is entirely unpacked and diffing just generates total 
disk seek hell.)


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 19:50                                                           ` Linus Torvalds
@ 2006-12-10 20:27                                                             ` Jakub Narebski
  2006-12-10 20:30                                                               ` Linus Torvalds
  2006-12-10 21:01                                                             ` H. Peter Anvin
  1 sibling, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-10 20:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Martin Langhoff, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Linus Torvalds wrote:
> On Sun, 10 Dec 2006, Jakub Narebski wrote:
>>>> If-Modified-Since:, If-Match:, If-None-Match: do you?
>> 
>> And the CGI standard provides a way to access additional HTTP header
>> info from the CGI script: the environment variables are named
>> HTTP_<HEADER>; for example, if the browser sent an If-Modified-Since:
>> header, its value can be found in the HTTP_IF_MODIFIED_SINCE
>> environment variable.
> 
> Guys, you're missing something fairly fundamental. 
> 
> It helps almost _nothing_ to support client-side caching with all these 
> fancy "If-Modified-Since:" etc crap.
> 
> That's not the _problem_.
> 
> It's usually not one client asking for the gitweb pages: the load comes 
> from just lots of people independently asking for it. So client-side 
> caching may help a tiny tiny bit, but it's not actually fixing the 
> fundamental problem at all.

Well, the idea (perhaps a stupid idea: I don't know how caching engines
/ reverse proxies work) was that a caching engine / reverse proxy in
front (Squid, for example) would cache results and serve them to the
rampaging hordes. But this caching engine has to ask gitweb whether the
cache is valid using "If-Modified-Since:" and "If-None-Match:" headers.
If gitweb returns 304 Not Modified, then it serves the contents from cache.

> So forget about "If-Modified-Since:" etc. It may help in benchmarks when 
> you try it yourself, and use "refresh" on the client side. But the basic 
> problem is all about lots of clients that do NOT have things cached, 
> because all the client caches are filled up with pr0n, not with gitweb 
> data from yesterday.

What about the other idea, the one with raising expires to infinity for
immutable pages like "commit" view for commit given by SHA-1? Even if
the clients won't cache it, the proxies and caches between gitweb and
client might cache it...
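The combined policy discussed in this thread — far-future expiry for immutable SHA-1-addressed views, plus a timeout that grows with the ref's age for mutable views (a quiet ref may be cached for a day or two, a recently pushed one only minutes) — might look like this sketch (view names and clamp values are illustrative, not gitweb's actual configuration):

```python
import time

def cache_timeout(view, ref_mtime=None, now=None):
    """Suggested max-age (seconds) for a gitweb-like view.

    view      -- "commit" etc. are addressed by SHA-1 and immutable;
                 "summary" etc. depend on refs and can change.
    ref_mtime -- last modification time of the relevant ref(s).
    """
    now = time.time() if now is None else now
    if view in ("commit", "commitdiff", "blob", "tree"):
        return 365 * 24 * 3600          # content-addressed: cache "forever"
    # Ref-relative pages: the longer the ref has been quiet, the longer
    # we may cache, clamped between 5 minutes and 2 days.
    age = now - (now if ref_mtime is None else ref_mtime)
    return max(300, min(2 * 24 * 3600, int(age) // 60))
```

So a ref untouched for two months yields a day of caching, while one pushed in the last hour falls back to the five-minute floor.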

Talking about the most accessed gitweb pages: the projects list page
changes on every push; the project summary page and the project's main
feed (now in both RSS and Atom formats) change on every push to the
given project. With the help of hooks they could be static pages,
generated on push... with the exception that the projects list and
summary pages use _relative_ dates.

-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 20:27                                                             ` Jakub Narebski
@ 2006-12-10 20:30                                                               ` Linus Torvalds
  2006-12-10 22:01                                                                 ` Martin Langhoff
  2006-12-10 22:08                                                                 ` Jeff Garzik
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2006-12-10 20:30 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jeff Garzik, Martin Langhoff, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin



On Sun, 10 Dec 2006, Jakub Narebski wrote:
>
> Well, the idea (perhaps a stupid idea: I don't know how caching engines
> / reverse proxies work) was that a caching engine / reverse proxy in
> front (Squid, for example) would cache results and serve them to the
> rampaging hordes.

Sure, if the proxies actually do the right thing (which they may or may 
not do).

> What about the other idea, the one with raising expires to infinity for
> immutable pages like "commit" view for commit given by SHA-1? Even if
> the clients won't cache it, the proxies and caches between gitweb and
> client might cache it...

I agree, but as mentioned, I think the _real_ problem tends to be the 
pages that don't act that way (ie summary pages, both at the individual 
project level and the top "all projects" level).



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 19:50                                                           ` Linus Torvalds
  2006-12-10 20:27                                                             ` Jakub Narebski
@ 2006-12-10 21:01                                                             ` H. Peter Anvin
  1 sibling, 0 replies; 82+ messages in thread
From: H. Peter Anvin @ 2006-12-10 21:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, Jeff Garzik, Martin Langhoff, Git Mailing List,
	Rogan Dawes, Kernel Org Admin

Linus Torvalds wrote:
> 
> Now, hopefully some of it will be in the disk cache, but when the 
> mirroring happens, it will basically blow the disk caches away totally 
> (when using the "--checksum" option), and then you literally have tens of 
> seconds to generate that one top-level page. 
> 

If that was the only time that happened, it would be a non-issue, since 
that only happens once every 96 hours.  However, the problem is that we 
now have lots of large datasets that blow out the caches on a much more 
frequent basis.



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 20:30                                                               ` Linus Torvalds
@ 2006-12-10 22:01                                                                 ` Martin Langhoff
  2006-12-10 22:14                                                                   ` Jeff Garzik
  2006-12-10 22:08                                                                 ` Jeff Garzik
  1 sibling, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-10 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, Jeff Garzik, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

On 12/11/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Sure, if the proxies actually do the right thing (which they may or may
> not do)

For a high-traffic setup like kernel.org, you can set up a local
reverse proxy -- it's a pretty standard practice. That allows you to
control a well-behaved and locally tuned caching engine just by
emitting good headers.

It beats writing and maintaining an internal caching mechanism for
each CGI script out there by a long mile. It means there'll be no
further tunables or complexity for administrators of other gitweb
installs.

cheers,





* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 19:11                                                         ` Jakub Narebski
  2006-12-10 19:50                                                           ` Linus Torvalds
@ 2006-12-10 22:05                                                           ` Jeff Garzik
  2006-12-10 22:59                                                             ` Jakub Narebski
  1 sibling, 1 reply; 82+ messages in thread
From: Jeff Garzik @ 2006-12-10 22:05 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jakub Narebski wrote:
> And the CGI standard provides a way to access additional HTTP header
> info from the CGI script: the environment variables are named
> HTTP_<HEADER>; for example, if the browser sent an If-Modified-Since:
> header, its value can be found in the HTTP_IF_MODIFIED_SINCE
> environment variable.

The CGI spec does not at all guarantee that the CGI environment will 
contain all the HTTP headers sent by the client.  That was the point of 
the environment dump script -- you can see exactly which headers are, 
and are not, passed through to CGI.

CGI only /guarantees/ a bare minimum (things like QUERY_STRING, 
PATH_INFO, etc.)

Even basic server info environment variables are optional.
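The kind of environment dump script Jeff refers to is trivial; a Python equivalent (the original was presumably a different script) shows exactly which HTTP_* variables a given server passes through:

```python
#!/usr/bin/env python
# Dump the CGI environment so you can see which HTTP_* headers the
# web server actually passes through (e.g. HTTP_IF_MODIFIED_SINCE).
import os
import sys

def dump_environment(environ, out=sys.stdout):
    out.write("Content-Type: text/plain\r\n\r\n")
    for name in sorted(environ):
        out.write("%s=%s\n" % (name, environ[name]))

if __name__ == "__main__":
    dump_environment(os.environ)
```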


> It is ETag, not E-tag. Besides, I don't see what the attached script is
> meant to do: it does not output the sample file anyway.

It's not meant to output the sample file.  It outputs the server 
metadata sent to the CGI script (the environment variables).  The sample 
file was simply a way to play around with etag and last-modified metadata.


> The idea is of course to stop processing in CGI script / mod_perl script
> as soon as possible if cache validates.

Certainly.  That should help cut down on I/O.  FWIW though the projects 
list is particularly painful, with its File::Find call, which you'll 
need to do in order to return 304-not-modified.


> I don't know if Apache intercepts and remembers ETag and Last-Modified
> headers, adds 304 Not Modified HTTP response on finding that cache validates
> and cuts out CGI script output. I.e. if browser provided If-Modified-Since:,
> script wrote Last-Modified: header, If-Modified-Since: is no earlier than
> Last-Modified: (usually is equal in the case of cache validation), then
> Apache provides 304 Not Modified response instead of CGI script output.

This wanders into the realm of mod_cache configuration, I think (which
I have tried to get working as a reverse proxy, and failed several
times). If you are not using mod_*_cache, then Apache must execute the
CGI script every time AFAICS, regardless of ETag / Last-Modified headers.

	Jeff




* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 20:30                                                               ` Linus Torvalds
  2006-12-10 22:01                                                                 ` Martin Langhoff
@ 2006-12-10 22:08                                                                 ` Jeff Garzik
  1 sibling, 0 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-10 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, Martin Langhoff, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Linus Torvalds wrote:
> 
> On Sun, 10 Dec 2006, Jakub Narebski wrote:
>> Well, the idea (perhaps a stupid idea: I don't know how caching engines
>> / reverse proxies work) was that a caching engine / reverse proxy in
>> front (Squid, for example) would cache results and serve them to the
>> rampaging hordes.
> 
> Sure, if the proxies actually do the right thing (which they may or may 
> not do)

squid seems to work well as an HTTP accelerator (reverse proxy). 
Apache's mem|disk cache stuff fails miserably.

Unfortunately squid development seems to have slowed in recent years.

	Jeff




* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 22:01                                                                 ` Martin Langhoff
@ 2006-12-10 22:14                                                                   ` Jeff Garzik
  0 siblings, 0 replies; 82+ messages in thread
From: Jeff Garzik @ 2006-12-10 22:14 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Jakub Narebski, Git Mailing List, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Martin Langhoff wrote:
> On 12/11/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> Sure, if the proxies actually do the right thing (which they may or may
>> not do)
> 
> For a high-traffic setup like kernel.org, you can set up a local
> reverse proxy -- it's a pretty standard practice. That allows you to
> control a well-behaved and locally tuned caching engine just by
> emitting good headers.
> 
> It beats writing and maintaining an internal caching mechanism for
> each CGI script out there by a long mile. It means there'll be no
> further tunables or complexity for administrators of other gitweb
> installs.

If gitweb produced cache-friendly headers, squid could definitely serve 
as an HTTP front-end ("HTTP accelerator" mode in squid talk).

In fact, given kernel.org's slave1/slave2<->master setup, that's a 
pretty natural fit for caching files and/or cache-aware CGI output.

You could even replace rsync to the slaves, if squid was serving as the 
front-end accelerator running on the slaves, communicating to the master.

squid is smart enough to hold off a thundering herd, and only pulls 
single cacheable copies of files as needed.
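For reference, the accelerator setup being described is only a few lines of squid configuration. The following fragment is an untested sketch in Squid 2.6-era syntax, with hostnames invented for illustration:

```
# squid.conf fragment: a slave runs squid as an HTTP accelerator,
# fetching cache misses from the master origin server.
http_port 80 accel defaultsite=git.kernel.org
cache_peer master.kernel.org parent 80 0 no-query originserver name=master
acl gitweb dstdomain git.kernel.org
http_access allow gitweb
cache_peer_access master allow gitweb
```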

	Jeff



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 22:05                                                           ` Jeff Garzik
@ 2006-12-10 22:59                                                             ` Jakub Narebski
  2006-12-11  2:16                                                               ` Martin Langhoff
  0 siblings, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-10 22:59 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Jeff Garzik wrote:
> Jakub Narebski wrote:
>>
>> And the CGI standard provides a way to access additional HTTP header
>> info from the CGI script: the environment variables are named
>> HTTP_<HEADER>; for example, if the browser sent an If-Modified-Since:
>> header, its value can be found in the HTTP_IF_MODIFIED_SINCE
>> environment variable.
> 
> The CGI spec does not at all guarantee that the CGI environment will 
> contain all the HTTP headers sent by the client.  That was the point of 
> the environment dump script -- you can see exactly which headers are, 
> and are not, passed through to CGI.
> 
> CGI only /guarantees/ a bare minimum (things like QUERY_STRING, 
> PATH_INFO, etc.)
> 
> Even basic server info environment variables are optional.

I have checked that at least Apache 2.0.54 passes HTTP_IF_MODIFIED_SINCE
when it receives an If-Modified-Since: header (tested with my own script
+ netcat/nc).
 
>> It is ETag, not E-tag. Besides, I don't see what the attached script is
>> meant to do: it does not output the sample file anyway.
> 
> It's not meant to output the sample file.  It outputs the server 
> metadata sent to the CGI script (the environment variables).  The sample 
> file was simply a way to play around with etag and last-modified metadata.

Ah. 
 
>> The idea is of course to stop processing in CGI script / mod_perl script
>> as soon as possible if cache validates.
> 
> Certainly.  That should help cut down on I/O.  FWIW though the projects 
> list is particularly painful, with its File::Find call, which you'll 
> need to do in order to return 304-not-modified.

First, it is better to use $projects_list, which is a projects index file
in the format:
  <project path> SPC <project owner>
where <project path> is relative to $projectroot and is URI encoded (at
the very least, SPC has to be URI (percent) encoded). <project owner> is
the owner of the given project and is also URI encoded (one would usually
use '+' in place of SPC here).
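As a sketch (in Python rather than gitweb's Perl), parsing that index format back into (path, owner) pairs looks like:

```python
from urllib.parse import unquote, unquote_plus

def parse_project_index(lines):
    """Parse a gitweb $projects_list index: '<path> SPC <owner>' per
    line, both fields URI (percent) encoded; '+' stands for a space in
    the owner field."""
    projects = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        path, _, owner = line.partition(" ")
        projects.append((unquote(path), unquote_plus(owner)))
    return projects
```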

Gitweb can now generate a projects list in the above format via the
"project_index" action ("a=project_index" query string), or by clicking
the 'TXT' link at the bottom of the projects list page in new gitweb
(see http://repo.or.cz by Petr Baudis). The problem is that it generates
the projects list from the list of projects it already sees, so to
generate it from scratch from the filesystem you have to have
$projects_list set to a directory while generating "project_index"
(changing it to something that evaluates to false, e.g. undef or "",
makes gitweb use $projectroot as $projects_list). I have posted how to
do this.

The project list changes rarely, only when a project is added or removed
or a project's owner changes, so it can be generated on demand.


Second, even with $projects_list set to a projects index file, gitweb
currently runs git-for-each-ref (which scans refs and accesses the pack
file for the commit date) and checks for and reads the description file;
with $projects_list being a directory it also checks the project
directory's owner. I plan to make it configurable whether last activity
is read from all heads (all branches) as it is now, from HEAD (the
current branch) as it was before, or from a given branch (for example
'master').

Assuming that gitweb is configured to read last activity from a single
defined branch, generating ETag = checksum(sha1 of heads of projects)
needs to read at least one file from each project.
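A sketch of that ETag — a digest over each project's tip SHA-1 on one configured branch, reading a single small loose-ref file per project (packed-refs handling deliberately omitted):

```python
import hashlib
import os

def projects_etag(project_dirs, branch="master"):
    """Strong ETag for the projects list: a digest over each project's
    tip SHA-1 on a single configured branch. One small file read per
    project; packed-refs are not handled in this sketch."""
    digest = hashlib.sha1()
    for project in sorted(project_dirs):
        ref = os.path.join(project, "refs", "heads", branch)
        try:
            with open(ref) as f:
                digest.update(f.read().strip().encode())
        except OSError:
            digest.update(b"missing")   # unborn branch or packed ref
    return '"%s"' % digest.hexdigest()
```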
 
>> I don't know if Apache intercepts and remembers ETag and Last-Modified
>> headers, adds 304 Not Modified HTTP response on finding that cache validates
>> and cuts out CGI script output. I.e. if browser provided If-Modified-Since:,
>> script wrote Last-Modified: header, If-Modified-Since: is no earlier than
>> Last-Modified: (usually is equal in the case of cache validation), then
>> Apache provides 304 Not Modified response instead of CGI script output.
> 
> This wanders into the realm of mod_cache configuration, I think.  (which 
> I have tried to get working as reverse proxy, and failed several times) 
>   If you are not using mod_*_cache, then Apache must execute the CGI 
> script every time AFAICS, regardless of etag/[if-]last-mod headers.

No, it wanders into the realm of header parsing by Apache, and the NPH
(Non-Parsed Headers) option.

Even if Apache does execute the CGI script to completion every time, it
might not send the output of the script, but an HTTP 304 Not Modified
reply instead. Might. I don't know whether it does.

-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10 22:59                                                             ` Jakub Narebski
@ 2006-12-11  2:16                                                               ` Martin Langhoff
  2006-12-11  8:59                                                                 ` Jakub Narebski
  0 siblings, 1 reply; 82+ messages in thread
From: Martin Langhoff @ 2006-12-11  2:16 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

On 12/11/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Even if Apache does execute CGI script to completion every time, it might
> not send the output of the script, but HTTP 304 Not Modified reply. Might.
> I don't know if it does.

It is up to the script (CGI or via mod_perl) to set the status to 304
and finish execution. Just setting the status to 304 does not forcibly
end execution, as you may want to clean up, log, etc.

cheers,




* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
@ 2006-12-11  3:40 linux
  2006-12-11  9:30 ` Jakub Narebski
  0 siblings, 1 reply; 82+ messages in thread
From: linux @ 2006-12-11  3:40 UTC (permalink / raw)
  To: koreth; +Cc: git

>>> I posted separately about those. And I've been mulling about whether
>>> the thundering herd is really such a big problem that we need to
>>> address it head-on.
>>
>> Uhm... yes it is.
> 
> Got some more info, discussion points or links to stuff I should read
> to appreciate why that is? I am trying to articulate why I consider it
> is not a high-payoff task, as well as describing how to tackle it.
> 
> To recap, the reasons it is not high payoff is that:
>
>  - the main benefit comes from being cacheable and able to revalidate
>    the cache cheaply (with the ETags-based strategy discussed above)
>  - highly distributed caches/proxies means we'll seldom see a true
>    cold cache situation
>  - we have a huge set of URLs which are seldom hit, and will never see
>    a thundering anything
>  - we have a tiny set of very popular URLs that are the key target for
>    the thundering herd (projects page, summary page, shortlog, full log)
>  - but those are in the clear as soon as the caches are populated
> 
> Why do we have to take it head-on? :-)

I think I agree with you, but not as strongly.  Certainly, having any
kind of effective caching (heck, just comparing the timestamp of the
relevant ref(s) with the If-Modified-Since: header) will help kernel.org
enormously.
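That cheap check — the ref file's mtime against the client's If-Modified-Since: header — really is only a few lines. A sketch (the ref path is whatever ref is relevant to the page being served):

```python
import os
from email.utils import parsedate_to_datetime

def not_modified_since(ref_path, if_modified_since):
    """True if the ref file hasn't changed since the client's date,
    i.e. we can answer 304 without generating the page at all."""
    if not if_modified_since:
        return False
    try:
        mtime = os.path.getmtime(ref_path)
        client = parsedate_to_datetime(if_modified_since).timestamp()
    except (OSError, TypeError, ValueError):
        return False
    # HTTP dates have one-second resolution: compare whole seconds.
    return int(mtime) <= int(client)
```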

But as soon as there's a push, particularly a release push, it
invalidates *all* of the popular pages *and* the thundering herd arrives.
The result is that all of the popular "what's new?" summary pages get
fetched 15 times in parallel and, because the front end doesn't serialize
them, populating the caches can be a painful process involving a lot of
repeated work.

I tend to agree that for the basic project summary pages, generating them
preemptively as static pages out of the push script seems best.
("find /usr/src/linux -type d -print | wc -l" is 1492.  Dear me.
Oh!  There is no per-directory shortlog page; that simplifies things.
But there *should* be.)

The only tricky thing is the "n minutes/hours/days ago" timestamps.
Basically, you want to generate a half-formatted, indefinitely-cacheable
page that contains them as absolute timestamps, and have a system for
regenerating the fully-formatted page from that (and the current time).

The ideas that people have been posting seem excellent.  Give a page
two timeouts.  If a GET arrives before the first timeout, and no
prerequisites have changed, it's served directly from cache.  If it
arrives after the second timeout, or the prerequisites have changed,
it blocks until the page is regenerated.  But if it arrives between
those two times, it serves the stale data and starts generating fresh
data in the background.

So for the fully-formed timestamps, the first timeout is when the next
human-readable timestamp on the page ticks over.  But the second timeout
can be past that by, say, 5% of the timeout value.  It's okay to display
"3 hours ago" until 12 minutes past the 4 hour mark.
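One crude way to model those two timeouts from the description above (the unit thresholds are illustrative, not what any real gitweb does):

```python
def timestamp_timeouts(age_seconds):
    """Given the age in seconds of the newest item shown on a page,
    return (first, second): first = seconds until the displayed
    "n units ago" string next ticks over, second = first plus 5% of
    the age at that tick (so "3 hours ago" may be served stale until
    12 minutes past the 4 hour mark)."""
    if age_seconds < 3600:
        unit = 60            # displayed as "n minutes ago"
    elif age_seconds < 86400:
        unit = 3600          # "n hours ago"
    else:
        unit = 86400         # "n days ago"
    until_tick = unit - (age_seconds % unit)
    next_tick_age = age_seconds + until_tick
    return until_tick, until_tick + next_tick_age // 20
```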

It might be okay to allow even the prerequisites to be slightly stale when
serving old data; it's okay if it takes 30 seconds for the kernel.org
web page to notice that Linus pushed.  But on my office gitweb, I'm not
sure that it's okay to take 30 seconds to notice that *I* just pushed.
(I'm also not sure about consistency issues.  If I link from one page
that shows the new release to another, it would be a bit disconcerting
if it disappeared.)


The nasty problem with built-in caching is that you need a whole cache
reclaim infrastructure; it would be so much nicer to let Squid deal
with that whole mess.  But it can't deal with anything other than fully


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-11  2:16                                                               ` Martin Langhoff
@ 2006-12-11  8:59                                                                 ` Jakub Narebski
  2006-12-11 10:18                                                                   ` Martin Langhoff
  0 siblings, 1 reply; 82+ messages in thread
From: Jakub Narebski @ 2006-12-11  8:59 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

Martin Langhoff wrote:
> On 12/11/06, Jakub Narebski <jnareb@gmail.com> wrote:
>>
>> Even if Apache does execute CGI script to completion every time, it might
>> not send the output of the script, but HTTP 304 Not Modified reply. Might.
>> I don't know if it does.
> 
> It is up to the script (CGI or via mod_perl) to set the status to 304
> and finish execution. Just setting the status to 304 does not
> forcefully end execution as you may want to cleanup, log, etc.

I was thinking not about ending execution, but about not sending script
output but sending HTTP 304 Not Modified reply by Apache.

I meant the following sequence of events:
 1. Script sends headers, among those Last-Modified and/or ETag
 2. Apache scans headers (e.g. to add its own), notices that Last-Modified
    is earlier or equal to If-Modified-Since: sent by browser or reverse
    proxy, or ETag matches If-None-Match:, and sends 304 instead of script
    output
 3. Script finishes execution, its output sent to /dev/null

Again, I don't know if Apache (or any other web server) does that. 
-- 
Jakub Narebski


* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-11  3:40 linux
@ 2006-12-11  9:30 ` Jakub Narebski
  0 siblings, 0 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-11  9:30 UTC (permalink / raw)
  To: git

Side note: I wonder why this mail is not attached to the rest of the
thread in the gmane.comp.version-control.git news/Usenet GMane mirror of
the git mailing list.

linux@horizon.com wrote:

> Oh!  There is no per-directory shortlog page; that simplifies things.
> But there *should* be.)

There is. It is called "history" (and currently we have only a
shortlog-like view for history, and no log-like view).

> The only tricky thing is the "n minutes/hours/days ago" timestamps.
> Basically, you want to generate a half-formatted, indefinitely-cacheable
> page that contains them as absolute timestamps, and a have system for
> regenerating the fully-formatted page from that (and the current time).

Or use ECMAScript (JavaScript) for that. I plan (if this feature is
requested) to make it a %feature.
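That %feature could be as simple as gitweb emitting absolute epoch timestamps and letting the browser render them relatively, so the cached page itself never goes stale. A sketch (the function name is made up):

```javascript
// Turn an absolute epoch timestamp (seconds) into "n units ago".
// Taking "now" as a parameter keeps the function deterministic.
function relativeDate(epochSeconds, nowSeconds) {
  var age = nowSeconds - epochSeconds;
  var units = [[86400, "day"], [3600, "hour"], [60, "min"]];
  for (var i = 0; i < units.length; i++) {
    var n = Math.floor(age / units[i][0]);
    if (n >= 1) {
      return n + " " + units[i][1] + (n > 1 ? "s" : "") + " ago";
    }
  }
  return "right now";
}
```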

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-11  8:59                                                                 ` Jakub Narebski
@ 2006-12-11 10:18                                                                   ` Martin Langhoff
  0 siblings, 0 replies; 82+ messages in thread
From: Martin Langhoff @ 2006-12-11 10:18 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin,
	Rogan Dawes, Kernel Org Admin

On 12/11/06, Jakub Narebski <jnareb@gmail.com> wrote:
> I was thinking not about ending execution, but about not sending script
> output but sending HTTP 304 Not Modified reply by Apache.
>
> I meant the following sequence of events:
>  1. Script sends headers, among those Last-Modified and/or ETag
>  2. Apache scans headers (e.g. to add its own), notices that Last-Modified
>     is earlier or equal to If-Modified-Since: sent by browser or reverse
>     proxy, or ETag matches If-None-Match:, and sends 304 instead of script
>     output
>  3. Script finishes execution, its output sent to /dev/null
>
> Again, I don't know if Apache (or any other web server) does that.

It doesn't. You want to make the decision to send a 304, clean up and
exit _inside_ the CGI. If it were up to Apache, the CGI script would
either end up creating the (potentially expensive to produce) content
just to see it sent to /dev/null, or, if Apache were to terminate
execution of the CGI more violently, the CGI wouldn't have a chance to
clean up and release resources.

So it's a matter of setting the header to 304 and exiting.
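In a CGI, that decision looks roughly like this Python sketch (the ETag stands in for whatever cheap validator the script can compute before doing the expensive work):

```python
import os
import sys

def maybe_304(etag, environ=os.environ, out=sys.stdout):
    """If the client's If-None-Match matches our validator, emit a 304
    and tell the caller to clean up and exit instead of rendering."""
    if environ.get("HTTP_IF_NONE_MATCH") == etag:
        out.write("Status: 304 Not Modified\r\n")
        out.write("ETag: %s\r\n\r\n" % etag)
        return True                      # caller: clean up, then exit(0)
    out.write("ETag: %s\r\n" % etag)     # caller continues with the body
    return False
```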

cheers,


martin



* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
  2006-12-10  7:05                                               ` H. Peter Anvin
@ 2006-12-12 21:19                                                 ` Jakub Narebski
  0 siblings, 0 replies; 82+ messages in thread
From: Jakub Narebski @ 2006-12-12 21:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Martin Langhoff, Jeff Garzik, Git Mailing List, Linus Torvalds,
	Rogan Dawes, Kernel Org Admin

By the way, setting Last-Modified: and ETag: and checking for
If-Modified-Since: and If-None-Match: is easy only for log-like views:
"shortlog", "log", "history", "rss"/"atom". With "shortlog" and
"history" we have the additional difficulty of relative dates.
And even for those views we need a reverse proxy / caching engine
(e.g. Squid in "HTTP accelerator" mode) in front.

It would be easier to pre-generate the most commonly accessed views:
"projects_list", "summary" and the main "rss"/"atom" feed for each
project, and just serve static pages. I don't know if we need to modify
gitweb for that.


BTW, for a single client (a rather stupid benchmark, I know) the
mod_perl version of gitweb is about twice as fast in keepalive mode as
the CGI version for the git.git summary page.

-- 
Jakub Narebski


end of thread, other threads:[~2006-12-12 21:17 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <45708A56.3040508@drzeus.cx>
     [not found] ` <Pine.LNX.4.64.0612011639240.3695@woody.osdl.org>
     [not found]   ` <457151A0.8090203@drzeus.cx>
     [not found]     ` <Pine.LNX.4.64.0612020835110.3476@woody.osdl.org>
     [not found]       ` <45744FA3.7020908@zytor.com>
     [not found]         ` <Pine.LNX.4.64.0612061847190.3615@woody.osdl.org>
     [not found]           ` <45778AA3.7080709@zytor.com>
     [not found]             ` <Pine.LNX.4.64.0612061940170.3615@woody.osdl.org>
     [not found]               ` <4577A84C.3010601@zytor.com>
     [not found]                 ` <Pine.LNX.4.64.0612070953290.3615@woody.osdl.org>
     [not found]                   ` <45785697.1060001@zytor.com>
2006-12-07 19:05                     ` kernel.org mirroring (Re: [GIT PULL] MMC update) Linus Torvalds
2006-12-07 19:16                       ` H. Peter Anvin
2006-12-07 19:30                         ` Olivier Galibert
2006-12-07 19:57                           ` H. Peter Anvin
2006-12-07 23:50                             ` Olivier Galibert
2006-12-07 23:56                               ` H. Peter Anvin
2006-12-08 11:25                               ` Jakub Narebski
2006-12-08 12:57                             ` Rogan Dawes
2006-12-08 13:38                               ` Jakub Narebski
2006-12-08 14:31                                 ` Rogan Dawes
2006-12-08 15:38                                   ` Jonas Fonseca
2006-12-09  1:28                                 ` Martin Langhoff
2006-12-09  2:03                                   ` H. Peter Anvin
2006-12-09  2:52                                     ` Martin Langhoff
2006-12-09  5:09                                       ` H. Peter Anvin
2006-12-09  5:34                                         ` Martin Langhoff
2006-12-09 16:26                                           ` H. Peter Anvin
2006-12-08 16:16                               ` H. Peter Anvin
2006-12-08 16:35                                 ` Linus Torvalds
2006-12-08 16:42                                   ` H. Peter Anvin
2006-12-08 19:49                                     ` Lars Hjemli
2006-12-08 19:51                                       ` H. Peter Anvin
2006-12-08 19:59                                         ` Lars Hjemli
2006-12-08 20:02                                           ` H. Peter Anvin
2006-12-10  9:43                                     ` rda
2006-12-08 16:54                                   ` Jeff Garzik
2006-12-08 17:04                                     ` H. Peter Anvin
2006-12-08 17:40                                       ` Jeff Garzik
2006-12-08 23:27                                     ` Linus Torvalds
2006-12-08 23:46                                       ` Michael K. Edwards
2006-12-08 23:49                                         ` H. Peter Anvin
2006-12-09  0:18                                           ` Michael K. Edwards
2006-12-09  0:23                                             ` H. Peter Anvin
2006-12-09  0:49                                         ` Linus Torvalds
2006-12-09  0:51                                           ` H. Peter Anvin
2006-12-09  4:36                                           ` Michael K. Edwards
2006-12-09  9:27                                           ` Jeff Garzik
     [not found]                                       ` <4579FABC.5070509@garzik.org>
2006-12-09  0:45                                         ` Linus Torvalds
2006-12-09  0:47                                           ` H. Peter Anvin
2006-12-09  9:16                                           ` Jeff Garzik
2006-12-09  1:56                                       ` Martin Langhoff
2006-12-09 11:51                                         ` Jakub Narebski
2006-12-09 12:42                                           ` Jeff Garzik
2006-12-09 13:37                                             ` Jakub Narebski
2006-12-09 14:43                                               ` Jeff Garzik
2006-12-09 17:02                                                 ` Jakub Narebski
2006-12-09 17:27                                                   ` Jeff Garzik
2006-12-10  4:07                                               ` Martin Langhoff
2006-12-10 10:09                                                 ` Jakub Narebski
2006-12-10 12:41                                                   ` Jeff Garzik
2006-12-10 13:02                                                     ` Jakub Narebski
2006-12-10 13:45                                                       ` Jeff Garzik
2006-12-10 19:11                                                         ` Jakub Narebski
2006-12-10 19:50                                                           ` Linus Torvalds
2006-12-10 20:27                                                             ` Jakub Narebski
2006-12-10 20:30                                                               ` Linus Torvalds
2006-12-10 22:01                                                                 ` Martin Langhoff
2006-12-10 22:14                                                                   ` Jeff Garzik
2006-12-10 22:08                                                                 ` Jeff Garzik
2006-12-10 21:01                                                             ` H. Peter Anvin
2006-12-10 22:05                                                           ` Jeff Garzik
2006-12-10 22:59                                                             ` Jakub Narebski
2006-12-11  2:16                                                               ` Martin Langhoff
2006-12-11  8:59                                                                 ` Jakub Narebski
2006-12-11 10:18                                                                   ` Martin Langhoff
2006-12-09 18:04                                             ` Linus Torvalds
2006-12-09 18:30                                               ` H. Peter Anvin
2006-12-10  3:55                                             ` Martin Langhoff
2006-12-10  7:05                                               ` H. Peter Anvin
2006-12-12 21:19                                                 ` Jakub Narebski
2006-12-09  7:56                                       ` Steven Grimm
2006-12-07 19:30                         ` Linus Torvalds
2006-12-07 19:39                           ` Shawn Pearce
2006-12-07 19:58                             ` Linus Torvalds
2006-12-07 23:33                               ` Michael K. Edwards
2006-12-07 19:58                             ` H. Peter Anvin
2006-12-07 20:05                           ` Junio C Hamano
2006-12-07 20:09                             ` H. Peter Anvin
2006-12-07 22:11                               ` Junio C Hamano
2006-12-08  9:43                       ` Jakub Narebski
2006-12-11  3:40 linux
2006-12-11  9:30 ` Jakub Narebski
