* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Linus Torvalds @ 2006-12-07 19:05 UTC
To: H. Peter Anvin
Cc: Kernel Org Admin, Git Mailing List, Jakub Narebski

On Thu, 7 Dec 2006, H. Peter Anvin wrote:
>
> That all being said, the lack of intrinsic caching in gitweb continues
> to be a major problem for us. Under high load, it makes all the
> problems worse.

I really don't see what gitweb could do that would be somehow better than
apache doing the caching in front of it. Is there some apache reason why
that isn't sufficient (i.e. limitations on its cache size or timeouts?)

Maybe the cacheability hints from gitweb could be tweaked (a lot of it
should be "infinitely cacheable", but the stuff that depends on refs and
thus can change could be set to some fixed host-wide value - preferably
one that depends on how old the ref is). Having gitweb be potentially up
to an hour out of date is better than causing mirroring problems due to
excessive load.

For example, if the git "refs/heads/" (or tags) directory hasn't changed
in the last two months, we should probably set any ref-relative gitweb
pages to have a caching timeout of a day or two. In contrast, if it's
changed in the last hour, maybe we should only cache it for five minutes.

Jakub: any way to make gitweb set the "expires" fields _much_ more
aggressively? I think we should at least have the ability to set basic
rules like

 - a _minimum_ of five minutes regardless of anything else

   We might even tweak this based on load average, and it might be
   worthwhile to add a randomization, to make sure that you don't get
   into situations where every webpage needs to be recalculated at once.

 - if refs/ directories are old, raise the minimum by the age of the refs

   If it's more than an hour old, raise it to ten minutes. If it's more
   than a day, raise it to an hour. If it's more than a month old, raise
   it to a day. And if it's more than half a year, it's some historical
   archive like linux-history, and should probably default to a week or
   more.

 - infinite for stuff that isn't ref-related.

Hmm?
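[A minimal sketch of the age-based expiry rules above, in gitweb's Perl.
The mtime of the project's refs/ directory stands in for "how old the
refs are"; ref_based_expires() and the exact thresholds are illustrative
assumptions, not existing gitweb code.]

    use CGI qw(:standard);

    our ($projectroot, $project);    # assumed set by gitweb as usual

    sub ref_based_expires {
        my $age = time() - (stat("$projectroot/$project/refs"))[9];
        return '+7d'  if $age > 182 * 24 * 3600;  # half a year: a week
        return '+1d'  if $age >  30 * 24 * 3600;  # a month: a day
        return '+1h'  if $age >       24 * 3600;  # a day: an hour
        return '+10m' if $age >            3600;  # an hour: ten minutes
        return '+5m';                             # the five-minute floor
    }

    # Ref-relative pages; hash-addressed pages could use -expires => '+1y'.
    print header(-type => 'text/html', -expires => ref_based_expires());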
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-07 19:16 UTC
To: Linus Torvalds
Cc: Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
>
> On Thu, 7 Dec 2006, H. Peter Anvin wrote:
>> That all being said, the lack of intrinsic caching in gitweb continues
>> to be a major problem for us. Under high load, it makes all the
>> problems worse.
>
> I really don't see what gitweb could do that would be somehow better
> than apache doing the caching in front of it. Is there some apache
> reason why that isn't sufficient (i.e. limitations on its cache size or
> timeouts?)

What it could do better is it could prevent multiple identical queries
from being launched in parallel. That's the real problem we see; under
high load, Apache times out, so the git query never gets into the cache;
but in the meantime, the common queries might easily have been launched
20 times in parallel.

Unfortunately, the most common queries are also extremely expensive.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Olivier Galibert @ 2006-12-07 19:30 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski

On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
> Unfortunately, the most common queries are also extremely expensive.

Do you have a top ten of queries? Those would be the ones to optimize
for.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-07 19:57 UTC
To: Olivier Galibert
Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski

Olivier Galibert wrote:
> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>> Unfortunately, the most common queries are also extremely expensive.
>
> Do you have a top ten of queries? Those would be the ones to optimize
> for.

The front page, the summary page of each project, and the RSS feed for
each project.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Olivier Galibert @ 2006-12-07 23:50 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski

On Thu, Dec 07, 2006 at 11:57:34AM -0800, H. Peter Anvin wrote:
> Olivier Galibert wrote:
>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>> Unfortunately, the most common queries are also extremely expensive.
>>
>> Do you have a top ten of queries? Those would be the ones to optimize
>> for.
>
> The front page, the summary page of each project, and the RSS feed for
> each project.

Hmmm, maybe you could have the summaries and RSS feeds generated on push,
which could also generate elementary files with the lines of the front
page. That would reduce these top offenders to static page serving.

  OG.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-07 23:56 UTC
To: Olivier Galibert
Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski

Olivier Galibert wrote:
> On Thu, Dec 07, 2006 at 11:57:34AM -0800, H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>> Unfortunately, the most common queries are also extremely expensive.
>>>
>>> Do you have a top ten of queries? Those would be the ones to optimize
>>> for.
>>
>> The front page, the summary page of each project, and the RSS feed for
>> each project.
>
> Hmmm, maybe you could have the summaries and RSS feeds generated on
> push, which could also generate elementary files with the lines of the
> front page. That would reduce these top offenders to static page
> serving.

There are a lot of things which "could be done" given the proper cache
infrastructure and gitweb support.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Jakub Narebski @ 2006-12-08 11:25 UTC
To: git

Olivier Galibert wrote:
> On Thu, Dec 07, 2006 at 11:57:34AM -0800, H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>>
>>>> Unfortunately, the most common queries are also extremely expensive.
>>>
>>> Do you have a top ten of queries? Those would be the ones to optimize
>>> for.
>>
>> The front page, the summary page of each project, and the RSS feed for
>> each project.
>
> Hmmm, maybe you could have the summaries and RSS feeds generated on
> push, which could also generate elementary files with the lines of the
> front page. That would reduce these top offenders to static page
> serving.

The "extremely aggressive caching solution" could be as follows: cache
everything, and on push invalidate (remove) the caches that a push can
change: the list of projects and the OPML on any push; for the given
project, the summary page and every page not addressed by h=<hash> or
hb=<hash>;f=<filename>.

The most important problem is that kernel.org uses an old gitweb, the
last version before gitweb was incorporated into git (which also
significantly reduced the time needed for the summary, heads and tags
pages).

--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
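[A sketch of that push-time invalidation as a hook script. The cache
layout, the GIT_PROJECT variable, and the idea that hash-addressed pages
carry their action in the file name are all hypothetical; none of this
is existing gitweb code.]

    #!/usr/bin/perl
    # post-update hook: drop only the ref-dependent caches; leave
    # blob_plain/snapshot pages for a tmpwatch-style sweep instead.
    use strict;
    use warnings;

    my $cache_dir = '/var/cache/gitweb';                  # assumed layout
    my $project   = $ENV{GIT_PROJECT} or die "no project";

    unlink glob("$cache_dir/index.*");                    # projects list, OPML
    for my $page (glob("$cache_dir/$project/*")) {
        next if $page =~ m{/(blob_plain|snapshot)[.-]};   # immutable by hash
        unlink $page;
    }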
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Rogan Dawes @ 2006-12-08 12:57 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski

H. Peter Anvin wrote:
> Olivier Galibert wrote:
>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>> Unfortunately, the most common queries are also extremely expensive.
>>
>> Do you have a top ten of queries? Those would be the ones to optimize
>> for.
>
> The front page, the summary page of each project, and the RSS feed for
> each project.
>
> -hpa

How about extending gitweb to check whether a cached version of these
pages already exists, before recreating them?

e.g. structure the temp dir in such a way that each project has a place
for cached pages. Then, before performing expensive operations, check
whether a file corresponding to the requested page already exists. If it
does, simply return the contents of the file; otherwise go ahead and
create the page dynamically, and return it to the user. Do not create
cached pages in gitweb dynamically.

Then, in a post-update hook, for each of the expensive pages, invoke
something like:

    # delete the cached copy of the file, to force gitweb to recreate it
    rm -f $git_temp/$project/rss
    # get gitweb to recreate the page appropriately;
    # use a tmp file to prevent gitweb from getting confused
    wget -O $git_temp/$project/rss.tmp \
        "http://kernel.org/gitweb.cgi?p=$project;a=rss"
    # move the tmp file into place
    mv $git_temp/$project/rss.tmp $git_temp/$project/rss

This way, we get the exact output returned from the usual gitweb
invocation, but we can now cache the result, and only update it when
there is a new commit that would affect the page output.

This would also not affect those who do not wish to use this mechanism.
If the file does not exist, gitweb.cgi will simply revert to its usual
behaviour.

Possible complications are the content-type headers, etc., but you could
use the -s (--save-headers) flag to wget, store the server headers as
well in the file, and get the necessary headers from the file as you
stream it.

i.e. read the headers looking for ones that are "interesting"
(Content-Type, charset, expires) until you get a blank line, print out
the interesting headers using $cgi->header(), then just dump the
remainder of the file to the caller via stdout.
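[A sketch of the gitweb-side half of this, assuming the $git_temp layout
and wget --save-headers output from the hook above; the fall-through is
the existing dynamic code path.]

    my $cache_file = "$git_temp/$project/$action";
    if (open my $fh, '<', $cache_file) {
        my $status = <$fh>;   # drop wget's "HTTP/1.0 200 OK" status line,
                              # which is not valid (non-nph) CGI output
        print <$fh>;          # saved headers, blank line, then the body
        close $fh;
        exit 0;
    }
    # cache miss: fall through and generate the page dynamically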
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Jakub Narebski @ 2006-12-08 13:38 UTC
To: Rogan Dawes
Cc: H. Peter Anvin, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

On Friday 8 December 2006 13:57, Rogan Dawes wrote:
> H. Peter Anvin wrote:
>> Olivier Galibert wrote:
>>> On Thu, Dec 07, 2006 at 11:16:58AM -0800, H. Peter Anvin wrote:
>>>> Unfortunately, the most common queries are also extremely expensive.

With newer gitweb, which tries to do the same using fewer git commands,
some of the queries (the summary, heads and tags pages) should be less
expensive.

>>> Do you have a top ten of queries? Those would be the ones to optimize
>>> for.
>>
>> The front page, the summary page of each project, and the RSS feed for
>> each project.
>
> How about extending gitweb to check whether a cached version of these
> pages already exists, before recreating them?
>
> e.g. structure the temp dir in such a way that each project has a place
> for cached pages. Then, before performing expensive operations, check
> whether a file corresponding to the requested page already exists. If
> it does, simply return the contents of the file; otherwise go ahead and
> create the page dynamically, and return it to the user. Do not create
> cached pages in gitweb dynamically.

This would add the need for a directory for temporary files... well, it
would be optional now...

> Then, in a post-update hook, for each of the expensive pages, invoke
> something like:
>
>     # delete the cached copy of the file, to force gitweb to recreate it
>     rm -f $git_temp/$project/rss
>     # get gitweb to recreate the page appropriately;
>     # use a tmp file to prevent gitweb from getting confused
>     wget -O $git_temp/$project/rss.tmp \
>         "http://kernel.org/gitweb.cgi?p=$project;a=rss"
>     # move the tmp file into place
>     mv $git_temp/$project/rss.tmp $git_temp/$project/rss

Good idea... although there are some page views which shouldn't change at
all - well, with the possible exception of changes in gitweb output, and
even then there are some (the blob_plain and snapshot views) which don't
change at all.

It would be good to avoid removing those on push, and only remove them
using some tmpwatch-like removal.

> This way, we get the exact output returned from the usual gitweb
> invocation, but we can now cache the result, and only update it when
> there is a new commit that would affect the page output.
>
> This would also not affect those who do not wish to use this mechanism.
> If the file does not exist, gitweb.cgi will simply revert to its usual
> behaviour.

Good idea. Perhaps I should add it to the gitweb TODO file.

Hmmm... perhaps it is time for the next "[RFC] gitweb wishlist and TODO
list" thread?

> Possible complications are the content-type headers, etc., but you
> could use the -s (--save-headers) flag to wget, store the server
> headers as well in the file, and get the necessary headers from the
> file as you stream it.
>
> i.e. read the headers looking for ones that are "interesting"
> (Content-Type, charset, expires) until you get a blank line, print out
> the interesting headers using $cgi->header(), then just dump the
> remainder of the file to the caller via stdout.

No need for that.
$cgi->header() is to _generate_ the headers, so if a file is saved with
headers, we can just dump it to STDOUT; the possible exception is a need
to rewrite the 'expires' header, if it is used.

Perhaps gitweb should generate its own ETag instead of messing with the
'expires' header?

--
Jakub Narebski
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Rogan Dawes @ 2006-12-08 14:31 UTC
To: Jakub Narebski
Cc: H. Peter Anvin, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

Jakub Narebski wrote:
> On Friday 8 December 2006 13:57, Rogan Dawes wrote:
>> How about extending gitweb to check whether a cached version of these
>> pages already exists, before recreating them?
>>
>> e.g. structure the temp dir in such a way that each project has a
>> place for cached pages. Then, before performing expensive operations,
>> check whether a file corresponding to the requested page already
>> exists. If it does, simply return the contents of the file; otherwise
>> go ahead and create the page dynamically, and return it to the user.
>> Do not create cached pages in gitweb dynamically.
>
> This would add the need for a directory for temporary files... well, it
> would be optional now...

It would still be optional. If the "cache" directory structure exists,
then use it; otherwise, continue as usual. All it would cost is a stat()
or two, I guess.

>> Then, in a post-update hook, for each of the expensive pages, invoke
>> something like:
>>
>>     # delete the cached copy of the file, to force gitweb to recreate it
>>     rm -f $git_temp/$project/rss
>>     # get gitweb to recreate the page appropriately;
>>     # use a tmp file to prevent gitweb from getting confused
>>     wget -O $git_temp/$project/rss.tmp \
>>         "http://kernel.org/gitweb.cgi?p=$project;a=rss"
>>     # move the tmp file into place
>>     mv $git_temp/$project/rss.tmp $git_temp/$project/rss
>
> Good idea... although there are some page views which shouldn't change
> at all - well, with the possible exception of changes in gitweb output,
> and even then there are some (the blob_plain and snapshot views) which
> don't change at all.
>
> It would be good to avoid removing those on push, and only remove them
> using some tmpwatch-like removal.

Well, my theory was that we would only cache pages that change when new
data enters the repo. So using the push as the trigger is almost
guaranteed to be the right thing to do. New data indicates new RSS items,
indicates an updated shortlog page, etc.

NOTE: This caching could be problematic for the "changed 2 hours ago"
notation for various branches/files, etc. But however we implement the
caching, we'd have this problem.

>> This way, we get the exact output returned from the usual gitweb
>> invocation, but we can now cache the result, and only update it when
>> there is a new commit that would affect the page output.
>>
>> This would also not affect those who do not wish to use this
>> mechanism. If the file does not exist, gitweb.cgi will simply revert
>> to its usual behaviour.
>
> Good idea. Perhaps I should add it to the gitweb TODO file.
>
> Hmmm... perhaps it is time for the next "[RFC] gitweb wishlist and TODO
> list" thread?
>
>> Possible complications are the content-type headers, etc., but you
>> could use the -s (--save-headers) flag to wget, store the server
>> headers as well in the file, and get the necessary headers from the
>> file as you stream it.
>>
>> i.e. read the headers looking for ones that are "interesting"
>> (Content-Type, charset, expires) until you get a blank line, print out
>> the interesting headers using $cgi->header(), then just dump the
>> remainder of the file to the caller via stdout.
>
> No need for that. $cgi->header() is to _generate_ the headers, so if a
> file is saved with headers, we can just dump it to STDOUT; the possible
> exception is a need to rewrite the 'expires' header, if it is used.

Good point. I guess one thing that will be incorrect in the headers is
the server date, but I doubt that anyone cares much. As you say, though,
this might relate to the expiry of cached content in upstream caches.

> Perhaps gitweb should generate its own ETag instead of messing with the
> 'expires' header?

Well, we can possibly eliminate the expires header entirely for dynamic
pages, check the If-Modified-Since value against the timestamp of the
cached file (or the server date in the cached file), and return "304 Not
Modified" responses. That would also help to reduce the load on the
server, by only returning the headers, and not the entire response.

The downside is that it would prevent upstream proxies from caching this
data for us.

Regards,
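[A sketch of that If-Modified-Since check against the cached file's
timestamp, using CGI.pm and HTTP::Date; $cache_file follows the layout
discussed above and is an assumption.]

    use CGI qw(:standard);
    use HTTP::Date qw(str2time);

    my $mtime = (stat($cache_file))[9];
    my $ims   = http('If-Modified-Since');    # request header, via CGI.pm
    if (defined $ims && defined $mtime && (str2time($ims) || 0) >= $mtime) {
        print header(-status => '304 Not Modified');
        exit 0;    # headers only: the client's copy is still current
    }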
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Jonas Fonseca @ 2006-12-08 15:38 UTC
To: Rogan Dawes
Cc: Jakub Narebski, H. Peter Anvin, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

On 12/8/06, Rogan Dawes <discard@dawes.za.net> wrote:
> NOTE: This caching could be problematic for the "changed 2 hours ago"
> notation for various branches/files, etc. But however we implement the
> caching, we'd have this problem.

It could be solved using ECMAScript (if that is an option): include an
exact timestamp, which browsers not supporting ECMAScript can show as-is,
and which other browsers can rewrite into a relative timestamp and use
for the coloring/highlighting of recent activity.

This could also slightly speed up the script, and it might be better to
provide an exact timestamp by default if aggressive caching is applied.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Martin Langhoff @ 2006-12-09 1:28 UTC
To: Jakub Narebski
Cc: Rogan Dawes, H. Peter Anvin, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

On 12/9/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Perhaps gitweb should generate its own ETag instead of messing with the
> 'expires' header?

That'll be the winning solution. A combination of:

 - cache SHA1-based requests forever
 - cache ref-based requests a longish time, setting an ETag that contains
   headname+SHA1
 - on revalidation, check the ETag against the ref and only recompute if
   things have changed

In the meantime, the code on kernel.org needs to be updated to the latest
gitweb. On our server, I'd say the newer gitweb is 3~4 times faster
serving the "expensive" summary pages. And much smarter in terms of
caching headers.

cheers
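[A sketch of the headname+SHA1 ETag and the cheap revalidation, in
gitweb's Perl. Resolving the ref via git rev-parse is one cheap way to do
the lookup; the exact ETag layout is just an illustration.]

    use CGI qw(:standard);

    # One ref lookup is all a revalidation costs.
    my $head = `git --git-dir=$projectroot/$project rev-parse $refname`;
    chomp $head;
    my $etag = qq{"$project/$refname/$head/$action"};

    my $inm = http('If-None-Match') || '';
    if ($inm eq $etag) {
        # Ref unchanged since the client cached the page: no recompute.
        print header(-status => '304 Not Modified', -etag => $etag);
        exit 0;
    }
    print header(-type => 'text/html', -etag => $etag);
    # ...generate the page as usual...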
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-09 2:03 UTC
To: Martin Langhoff
Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

Martin Langhoff wrote:
> On 12/9/06, Jakub Narebski <jnareb@gmail.com> wrote:
>> Perhaps gitweb should generate its own ETag instead of messing with
>> the 'expires' header?
>
> That'll be the winning solution.

Doesn't solve the thundering herd problem or the timeout problem at all,
though.

	-hpa
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Martin Langhoff @ 2006-12-09 2:52 UTC
To: H. Peter Anvin
Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

On 12/9/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Martin Langhoff wrote:
>> On 12/9/06, Jakub Narebski <jnareb@gmail.com> wrote:
>>> Perhaps gitweb should generate its own ETag instead of messing with
>>> the 'expires' header?
>>
>> That'll be the winning solution.
>
> Doesn't solve the thundering herd problem or the timeout problem at
> all, though.

I posted separately about those. And I've been mulling over whether the
thundering herd is really such a big problem that we need to address it
head-on.

If we do the HTTP caching headers right (that is, a bit better than now),
then the fact that web caches are distributed means that even a cache
restart or cache invalidation won't trigger a thundering herd. And gitweb
rarely has a "new" URL that gets a ton of hits immediately.

Our real problem is the summary page, and the fact that we aren't setting
an effective ETag there. If we do, a front-end cache plus the ability to
revalidate the ETag cheaply will get us through.

We get 99% of the benefit from ETags and cheap revalidations, especially
if they are coupled with a reverse caching proxy. The remaining 1% of
dealing with the highly infrequent thundering herd can be addressed with
the scheme I posted 5 minutes ago.

cheers
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-09 5:09 UTC
To: Martin Langhoff
Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

Martin Langhoff wrote:
> I posted separately about those. And I've been mulling over whether the
> thundering herd is really such a big problem that we need to address it
> head-on.

Uhm... yes it is.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Martin Langhoff @ 2006-12-09 5:34 UTC
To: H. Peter Anvin
Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

On 12/9/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Martin Langhoff wrote:
>> I posted separately about those. And I've been mulling over whether
>> the thundering herd is really such a big problem that we need to
>> address it head-on.
>
> Uhm... yes it is.

Got some more info, discussion points or links to stuff I should read to
appreciate why that is? I am trying to articulate why I consider it not a
high-payoff task, as well as describing how to tackle it.

To recap, the reasons it is not high payoff are that:

 - the main benefit comes from being cacheable and able to revalidate the
   cache cheaply (with the ETag-based strategy discussed above)
 - highly distributed caches/proxies mean we'll seldom see a true cold
   cache situation
 - we have a huge set of URLs which are seldom hit, and will never see a
   thundering anything
 - we have a tiny set of very popular URLs that are the key target for
   the thundering herd (the projects page, summary page, shortlog,
   fulllog)
 - but those are in the clear as soon as the caches are populated

Why do we have to take it head-on? :-)
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-09 16:26 UTC
To: Martin Langhoff
Cc: Jakub Narebski, Rogan Dawes, Linus Torvalds, Kernel Org Admin, Git Mailing List, Petr Baudis

Martin Langhoff wrote:
> On 12/9/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> Martin Langhoff wrote:
>>> I posted separately about those. And I've been mulling over whether
>>> the thundering herd is really such a big problem that we need to
>>> address it head-on.
>>
>> Uhm... yes it is.
>
> Got some more info, discussion points or links to stuff I should read
> to appreciate why that is? I am trying to articulate why I consider it
> not a high-payoff task, as well as describing how to tackle it.
>
> To recap, the reasons it is not high payoff are that:
>
>  - the main benefit comes from being cacheable and able to revalidate
>    the cache cheaply (with the ETag-based strategy discussed above)
>  - highly distributed caches/proxies mean we'll seldom see a true cold
>    cache situation
>  - we have a huge set of URLs which are seldom hit, and will never see
>    a thundering anything
>  - we have a tiny set of very popular URLs that are the key target for
>    the thundering herd (the projects page, summary page, shortlog,
>    fulllog)
>  - but those are in the clear as soon as the caches are populated
>
> Why do we have to take it head-on? :-)

Because the primary failure scenario is timeout on the common queries,
due to excess parallel invocations under high I/O load, resulting in
catastrophic failure.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-08 16:16 UTC
To: Rogan Dawes
Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski

Rogan Dawes wrote:
>
> How about extending gitweb to check whether a cached version of these
> pages already exists, before recreating them?

This goes back to the "gitweb needs native caching" again.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Linus Torvalds @ 2006-12-08 16:35 UTC
To: H. Peter Anvin
Cc: Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On Fri, 8 Dec 2006, H. Peter Anvin wrote:
>
> This goes back to the "gitweb needs native caching" again.

It should be fairly easy to add a caching layer, but I wouldn't do it
inside gitweb itself - it gets too mixed up. It would be better to have
it as a separate front-end that just calls gitweb for anything it doesn't
find in the cache.

I could write a simple C caching thing that just hashes the CGI arguments
and uses a hash to create a cache (and proper lock-files etc. to
serialize access to a particular cache object while it's being created)
fairly easily, but I'm pretty sure people would much prefer a mod_perl
thing just to avoid the fork/exec overhead with Apache (I think mod_perl
allows Apache to run perl scripts without forking), and that means I'm
not the right person any more.

Not that I'm the right person anyway, since I don't have a web server set
up on my machine to even test with ;)

	Linus
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-08 16:42 UTC
To: Linus Torvalds
Cc: Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
>
> On Fri, 8 Dec 2006, H. Peter Anvin wrote:
>> This goes back to the "gitweb needs native caching" again.
>
> It should be fairly easy to add a caching layer, but I wouldn't do it
> inside gitweb itself - it gets too mixed up. It would be better to have
> it as a separate front-end that just calls gitweb for anything it
> doesn't find in the cache.

If you want to do side-effect generation of cache contents, it might not
be possible to do it that way. At the very least gitweb needs to be aware
of how to explicitly enter things into the cache.

All of this isn't really all that hard; I have implemented all that stuff
for diffview, for example (when generating a single diff hunk, you
naturally end up producing all of them, so you want to have them
preemptively cached.)

> I could write a simple C caching thing that just hashes the CGI
> arguments and uses a hash to create a cache (and proper lock-files etc.
> to serialize access to a particular cache object while it's being
> created) fairly easily, but I'm pretty sure people would much prefer a
> mod_perl thing just to avoid the fork/exec overhead with Apache (I
> think mod_perl allows Apache to run perl scripts without forking), and
> that means I'm not the right person any more.

True about mod_perl. Haven't messed with that myself, either. fork/exec
really is very cheap on Linux, so it's not a huge deal.

> Not that I'm the right person anyway, since I don't have a web server
> set up on my machine to even test with ;)

Heh :)

	-hpa
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Lars Hjemli @ 2006-12-08 19:49 UTC
To: H. Peter Anvin, Linus Torvalds
Cc: Git Mailing List

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Linus Torvalds wrote:
>> I could write a simple C caching thing that just hashes the CGI
>> arguments and uses a hash to create a cache (and proper lock-files
>> etc. to serialize access to a particular cache object while it's being
>> created) fairly easily, but I'm pretty sure people would much prefer a
>> mod_perl thing just to avoid the fork/exec overhead with Apache (I
>> think mod_perl allows Apache to run perl scripts without forking), and
>> that means I'm not the right person any more.
>
> True about mod_perl. Haven't messed with that myself, either. fork/exec
> really is very cheap on Linux, so it's not a huge deal.

I've been playing around with a "native git" cgi thingy the last week (I
call it cgit), and I've been thinking about adding exactly this kind of
caching to it. And since it's basically a standard git command written in
C, it should have less overhead than any perl implementation.

It's far from ready yet, but I'll try to publish some code this weekend
just in case someone finds it interesting.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-08 19:51 UTC
To: Lars Hjemli
Cc: Linus Torvalds, Git Mailing List

Lars Hjemli wrote:
> On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> Linus Torvalds wrote:
>>> I could write a simple C caching thing that just hashes the CGI
>>> arguments and uses a hash to create a cache (and proper lock-files
>>> etc. to serialize access to a particular cache object while it's
>>> being created) fairly easily, but I'm pretty sure people would much
>>> prefer a mod_perl thing just to avoid the fork/exec overhead with
>>> Apache (I think mod_perl allows Apache to run perl scripts without
>>> forking), and that means I'm not the right person any more.
>>
>> True about mod_perl. Haven't messed with that myself, either.
>> fork/exec really is very cheap on Linux, so it's not a huge deal.
>
> I've been playing around with a "native git" cgi thingy the last week
> (I call it cgit), and I've been thinking about adding exactly this kind
> of caching to it. And since it's basically a standard git command
> written in C, it should have less overhead than any perl
> implementation.

Trust me, perl, or CGI, is not the problem. It's all about I/O traffic
generated by git.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Lars Hjemli @ 2006-12-08 19:59 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Git Mailing List

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Trust me, perl, or CGI, is not the problem. It's all about I/O traffic
> generated by git.

Yes, I understand. That's why I've been thinking about internal caching
of pages. It's just a kick doing it in C, playing around with the git
internals :-)
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-08 20:02 UTC
To: Lars Hjemli
Cc: Linus Torvalds, Git Mailing List

Lars Hjemli wrote:
> On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> Trust me, perl, or CGI, is not the problem. It's all about I/O traffic
>> generated by git.
>
> Yes, I understand. That's why I've been thinking about internal caching
> of pages.

Caching, preferably with smarts, is the key.

> It's just a kick doing it in C, playing around with the git internals
> :-)

That's fine, but it does make it harder to maintain.

	-hpa
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: rda @ 2006-12-10 9:43 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Linus Torvalds wrote:
>> I could write a simple C caching thing that just hashes the CGI
>> arguments and uses a hash to create a cache (and proper lock-files
>> etc. to serialize access to a particular cache object while it's being
>> created) fairly easily, but I'm pretty sure people would much prefer a
>> mod_perl thing just to avoid the fork/exec overhead with Apache (I
>> think mod_perl allows Apache to run perl scripts without forking), and
>> that means I'm not the right person any more.
>
> True about mod_perl. Haven't messed with that myself, either. fork/exec
> really is very cheap on Linux, so it's not a huge deal.

In the case of Perl scripts, it's not really the fork/exec overhead, but
the Perl startup overhead that you want to try to optimize. But given
your later statement (lots of spare cpu), this ends up just being a bit
of a latency hit. In general, I think mod_perl has a
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Jeff Garzik @ 2006-12-08 16:54 UTC
To: Linus Torvalds
Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
> I could write a simple C caching thing that just hashes the CGI
> arguments and uses a hash to create a cache (and proper lock-files etc.
> to serialize access to a particular cache object while it's being
> created) fairly easily, but I'm pretty sure people would much prefer a
> mod_perl thing just to avoid the fork/exec overhead with Apache (I
> think mod_perl allows Apache to run perl scripts without forking), and
> that means I'm not the right person any more.
>
> Not that I'm the right person anyway, since I don't have a web server
> set up on my machine to even test with ;)
>
> 	Linus

This is quite nice and easy, if memory-only caching works for the
situation: http://www.danga.com/memcached/

There are APIs for C, Perl, and plenty of other languages.

	Jeff
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-08 17:04 UTC
To: Jeff Garzik
Cc: Linus Torvalds, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Jeff Garzik wrote:
>
> This is quite nice and easy, if memory-only caching works for the
> situation: http://www.danga.com/memcached/
>
> There are APIs for C, Perl, and plenty of other languages.

Memory-only caching is kind of nasty. Memory is a premium resource on
kernel.org.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Jeff Garzik @ 2006-12-08 17:40 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

H. Peter Anvin wrote:
> Jeff Garzik wrote:
>>
>> This is quite nice and easy, if memory-only caching works for the
>> situation: http://www.danga.com/memcached/
>>
>> There are APIs for C, Perl, and plenty of other languages.
>
> Memory-only caching is kind of nasty. Memory is a premium resource on
> kernel.org.

hmmm. Well, I have been wondering why nobody ever came up with a
system-wide local (== disk) cache for remote and/or calculated objects.
Maybe it's time to do something about that. I've been in a daemon-writing
mood lately.

	Jeff
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Linus Torvalds @ 2006-12-08 23:27 UTC
To: Jeff Garzik
Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On Fri, 8 Dec 2006, Jeff Garzik wrote:
>
> This is quite nice and easy, if memory-only caching works for the
> situation: http://www.danga.com/memcached/
>
> There are APIs for C, Perl, and plenty of other languages.

Actually, just looking at the examples, it looks like memcached is
fundamentally flawed, exactly the same way Apache mod_cache is
fundamentally flawed.

Exactly like mod_perl, it appears that if something isn't cached, the
memcached server will just return "not cached" to everybody, and all the
clients will, like a stampeding herd, all do the uncached access. Even if
they have the exact same query. And you're back to square one: your
server load went through the roof.

You can't have a cache architecture where the client just does a "get",
like memcached does. You need to have a "read-for-fill" operation, which
says:

 - get this cache entry

 - if this cache entry does not exist, get an exclusive lock

 - if you get that exclusive lock, return NULL, and the client promises
   that it will fill it

   (inside the kernel, see for example "find_get_page()" vs
   "grab_cache_page()" - the latter will return a locked page whether it
   exists or not, and if it didn't exist, it will have inserted it into
   the cache datastructures so that you don't have multiple concurrent
   readers trying to all create different pages)

 - if you block on the exclusive lock, that means that some other client
   is busy fulfilling it. When you unblock, do a regular "read" operation
   (not a "repeat": we only block once, and if that fails, that's it).

 - any cachefill operation will release the lock (and allow pending
   cache queries to succeed)

 - the locking client going away will release the lock (and allow
   pending cache queries to fail, and hopefully cause a "set cache"
   operation)

 - a timeout (settable by some method) will also force-release a lock in
   the case of buggy clients that do "read-for-fill" but never do the
   fill.

The "timeout" thing is to handle the case of buggy clients that crash
after trying to get - it will slow down things _enormously_ if that
happens, but hey, it's a buggy client. And it will still continue to
work.

Looking at the memcached operations, they have the "read" op (aka "get"),
but they seem to have no "read-for-fill" op. So memcached fundamentally
doesn't fix this problem, at least not without explicit serialization by
the client.

(The serialization could be done by the client, but that would serialize
_everything_, and mean that an uncached lookup will hold up all the
cached ones too - which is why you do NOT want to serialize in the
caller: you really want to serialize in the layer that does the caching.)

It's fairly easy to do the lock. You could just hash the lookup key using
some reasonable hash. It doesn't even have to be a _big_ hash: it's ok to
have just a few bits for lock hashing, since it's only going to be for
misses. So hashing to eight bits and using 256 locks is probably fine, as
long as this is done by the cache server.
That means that the cache server only ever needs to track that many
timeouts, for example (it also indirectly sets a limit on the number of
possible "outstanding uncached requests", which is _exactly_ what you
want - but hash collisions will also potentially unlock the _wrong_
bucket, so if you have too many of them, it can make the "only one
outstanding uncached request per key" not be as effective).

So assuming you get good cache hit statistics, the locking shouldn't be a
big issue. But you definitely want to do it, because the whole point of
caching was to not do the same op multiple times.

I still don't understand why apache doesn't do it. I guess it wants to be
stateless or something.
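[A sketch of that read-for-fill protocol in single-machine form, using
flock() on a small pool of lock files hashed from the cache key. The
names ($cache_dir, the generator callback) are illustrative; a real
memcached-style server would implement the same states over its
protocol. Note that flock() is released automatically when the holder
dies, which stands in for the force-release timeout.]

    use strict;
    use warnings;
    use Fcntl qw(:flock);
    use Digest::MD5 qw(md5_hex);

    my $cache_dir = '/var/cache/gitweb';    # assumed layout

    sub cached {
        my ($key, $generate) = @_;
        my $entry = "$cache_dir/" . md5_hex($key);

        # Fast path: a hit needs no locking at all.
        return slurp($entry) if -e $entry;

        # Miss: take the lock bucket this key hashes to.  The first
        # client in fills the cache; the rest of the stampede blocks
        # here instead of regenerating the same page in parallel.
        my $bucket = hex(substr(md5_hex($key), 0, 2));    # 256 locks
        open my $lock, '>>', "$cache_dir/lock.$bucket" or die "lock: $!";
        flock($lock, LOCK_EX) or die "flock: $!";

        # Re-check: whoever held the lock may have filled the entry.
        return slurp($entry) if -e $entry;

        my $page = $generate->();    # the expensive git work, done once
        open my $out, '>', "$entry.tmp" or die "write: $!";
        print $out $page;
        close $out;
        rename "$entry.tmp", $entry or die "rename: $!";  # atomic publish
        return $page;                # lock drops when $lock goes away
    }

    sub slurp {
        my ($file) = @_;
        open my $in, '<', $file or die "read: $!";
        local $/;                    # slurp mode
        my $data = <$in>;
        return $data;
    }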
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Michael K. Edwards @ 2006-12-08 23:46 UTC
To: Linus Torvalds
Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On 12/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
> You can't have a cache architecture where the client just does a "get",
> like memcached does. You need to have a "read-for-fill" operation ...

In Squid 2.6:

    collapsed_forwarding on
    refresh_stale_window <seconds>

(apply the latter only to stanzas where you want "readahead" of
about-to-expire cache entries)

Brief design description at
http://devel.squid-cache.org/collapsed_forwarding/.

(I didn't write this code, everything I know about squid leaked through
the Google-shaped pinhole in my tinfoil hat, etc. But if you go this way
I'd like to be in the loop to understand the scalability issues around
netfilter-assisted transparent proxying.)

Cheers,
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-08 23:49 UTC
To: Michael K. Edwards
Cc: Linus Torvalds, Jeff Garzik, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Michael K. Edwards wrote:
> On 12/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> You can't have a cache architecture where the client just does a
>> "get", like memcached does. You need to have a "read-for-fill"
>> operation ...
>
> In Squid 2.6:
>
>     collapsed_forwarding on
>     refresh_stale_window <seconds>
>
> (apply the latter only to stanzas where you want "readahead" of
> about-to-expire cache entries)
>
> Brief design description at
> http://devel.squid-cache.org/collapsed_forwarding/.
>
> (I didn't write this code, everything I know about squid leaked through
> the Google-shaped pinhole in my tinfoil hat, etc. But if you go this
> way I'd like to be in the loop to understand the scalability issues
> around netfilter-assisted transparent proxying.)

There is another thing that probably will be required, and I'm not sure
if something in front of Apache (like Squid), rather than behind it, can
easily deal with it: on timeout, the process needs to continue in order
to feed the cache. Otherwise, you're still in a failure scenario as soon
as the timeout happens.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Michael K. Edwards @ 2006-12-09 0:18 UTC
To: H. Peter Anvin
Cc: Linus Torvalds, Jeff Garzik, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
> There is another thing that probably will be required, and I'm not sure
> if something in front of Apache (like Squid), rather than behind it,
> can easily deal with it: on timeout, the process needs to continue in
> order to feed the cache. Otherwise, you're still in a failure scenario
> as soon as the timeout happens.

I would think this would be a great deal easier to handle in an
arm's-length "accelerator" than in the origin server. Only restart the
hit to the origin server if you think that something has actually gone
wrong there. Serve stale data to the client if you have to. From the page
I quoted:

    "In addition an option to shortcut the cache revalidation of
    frequently accessed objects is added, making further requests
    immediately return as a cache hit while a cache revalidation is
    pending. This may temporarily give slightly stale information to the
    clients, but at the same time allows for optimal response time while
    a frequently accessed object is being revalidated. This too is an
    optimization only intended for accelerators, and only for
    accelerators where minimizing request latency is more important than
    freshness."

I don't know how sophisticated this logic is currently, but I would think
that it wouldn't be that hard to tune up.

Cheers,
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-09 0:23 UTC
To: Michael K. Edwards
Cc: Linus Torvalds, Jeff Garzik, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Michael K. Edwards wrote:
> On 12/8/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> There is another thing that probably will be required, and I'm not
>> sure if something in front of Apache (like Squid), rather than behind
>> it, can easily deal with it: on timeout, the process needs to continue
>> in order to feed the cache. Otherwise, you're still in a failure
>> scenario as soon as the timeout happens.
>
> I would think this would be a great deal easier to handle in an
> arm's-length "accelerator" than in the origin server

True, but it needs to run behind Apache rather than in front of it.

	-hpa
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Linus Torvalds @ 2006-12-09 0:49 UTC
To: Michael K. Edwards
Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On Fri, 8 Dec 2006, Michael K. Edwards wrote:
>
> In Squid 2.6:
>
>     collapsed_forwarding on
>     refresh_stale_window <seconds>
>
> (apply the latter only to stanzas where you want "readahead" of
> about-to-expire cache entries)

Yeah, those look like the Right Thing (tm) to do.

That said, I'm not personally convinced that there is much point to using
netfilter for transparent proxying. Why not just use separate ports for
squid and for apache?
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-09 0:51 UTC
To: Linus Torvalds
Cc: Michael K. Edwards, Jeff Garzik, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
>
> On Fri, 8 Dec 2006, Michael K. Edwards wrote:
>> In Squid 2.6:
>>
>>     collapsed_forwarding on
>>     refresh_stale_window <seconds>
>>
>> (apply the latter only to stanzas where you want "readahead" of
>> about-to-expire cache entries)
>
> Yeah, those look like the Right Thing (tm) to do.
>
> That said, I'm not personally convinced that there is much point to
> using netfilter for transparent proxying. Why not just use separate
> ports for squid and for apache?

Yeah, this is pretty trivial, since one can just do redirects. However, I
still think a backend cache is better, since it can detach itself from
Apache when appropriate (e.g. the background refresh scenario, or
timeout.)
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Michael K. Edwards @ 2006-12-09 4:36 UTC
To: Linus Torvalds
Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On 12/8/06, Linus Torvalds <torvalds@osdl.org> wrote:
> That said, I'm not personally convinced that there is much point to
> using netfilter for transparent proxying. Why not just use separate
> ports for squid and for apache?

Just a question of whether you want to be able to yank the squid box out
if it goes pear-shaped, without touching configs on the apache box. Some
people like to stick the proxy in as a no-op at first, then tell
netfilter to divert 1% of sessions to squid and see how it holds up,
retune, ease it in, ease it out, and figure out how much operational
flexibility you will have as demand continues to scale. If squid and
apache are on the same box, it's probably less of an issue.

Cheers,
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Jeff Garzik @ 2006-12-09 9:27 UTC
To: Linus Torvalds
Cc: Michael K. Edwards, H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
> That said, I'm not personally convinced that there is much point to
> using netfilter for transparent proxying. Why not just use separate
> ports for squid and for apache?

That's what most people using squid in "http accelerator" mode do. They
put Apache on port 8080 or some such.

	Jeff
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: Linus Torvalds @ 2006-12-09 0:45 UTC
To: Jeff Garzik
Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

On Fri, 8 Dec 2006, Jeff Garzik wrote:
>
> This is a bit cheesy, and completely untested, but since mod_cache
> never worked for me either, I bet it works better ;-)

Ok, this doesn't do the locking either, so on cache misses or expiry,
you're still going to get that thundering herd.

Also, if you want to be nice to clients, I'd seriously suggest that when
you hit in the cache, but it's expired (or it's close to expired), you
still serve the cached data back, but you set up a thread in the
background (with some maximum number of active threads, of course!) that
refreshes the cached entry, and then you extend the expiration time so
that you won't end up doing this "refresh" _again_.

It's kind of silly to have people wait for 20 seconds just because a
cache expired five seconds ago. Much nicer to say "ok, we allow a certain
grace period during which we'll do the real lookup, but to make things
_look_ really responsive, we still use the old cached value".
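[A sketch of that grace period, grafted onto the flock-based cached()
and slurp() sketch upthread; fork() stands in for a properly bounded
pool of background workers, and $expires, $grace and regenerate() are
illustrative names.]

    # Hit, but stale: serve the old copy immediately and refresh it
    # off the hot path, so no visitor waits on the regeneration.
    my $age = time() - (stat($entry))[9];
    if ($age > $expires && $age <= $expires + $grace) {
        my $pid = fork();
        if (defined $pid && $pid == 0) {
            regenerate($entry);    # the expensive git work, in the child
            exit 0;
        }
    }
    return slurp($entry);          # parent: the old cached value, now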
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)

From: H. Peter Anvin @ 2006-12-09 0:47 UTC
To: Linus Torvalds
Cc: Jeff Garzik, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski

Linus Torvalds wrote:
>
> It's kind of silly to have people wait for 20 seconds just because a
> cache expired five seconds ago. Much nicer to say "ok, we allow a
> certain grace period during which we'll do the real lookup, but to make
> things _look_ really responsive, we still use the old cached value".

Yup, DNS does this, and it's a Very Good Thing.
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 0:45 ` Linus Torvalds 2006-12-09 0:47 ` H. Peter Anvin @ 2006-12-09 9:16 ` Jeff Garzik 1 sibling, 0 replies; 82+ messages in thread From: Jeff Garzik @ 2006-12-09 9:16 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski Linus Torvalds wrote: > > On Fri, 8 Dec 2006, Jeff Garzik wrote: >> This is a bit cheesy, and completely untested, but since mod_cache never >> worked for me either, I bet it works better ;-) > > Ok, this doesn't do the locking either, so on cache misses or expiry, > you're still going to be that thundering herd. Well, gdbm does reader/writer locking. You still get a bit of a thundering herd, though. I suppose I could open the gdbm db for writing before calling the CGI, which would effectively get what you're looking for. > Also, if you want to be nice to clients, I'd seriously suggest that when > you hit in the cache, but it's expired (or it's close to expired), you > still serve the cached data back, but you set up a thread in the > background (with some maximum number of active threads, of course!) that > refreshes the cached entry and then you extend the expiration time so that > you won't end up doing this "refresh" _again_. > > It's kind of silly to have people wait for 20 seconds just because a cache > expired five seconds ago. Much nicer to say "ok, we allow a certain > grace-period during which we'll do the real lookup, but to make things > _look_ really responsive, we still use the old cached value". True, should work with gitweb data at least. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
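The lock-before-fill idea Jeff mentions can be sketched without gdbm at all; since gdbm's open-for-write (at least in common builds) fails rather than blocks when another writer holds the database, this sketch serializes the fillers on a sidecar lock file with flock() instead. Everything here is illustrative:

-- >8 --
#!/usr/bin/perl
# One-filler-at-a-time cache fill: the first process takes an exclusive
# flock and regenerates; the herd blocks on the lock and then serves
# what the winner wrote.
use strict;
use warnings;
use Fcntl qw(:flock);

sub fetch_or_fill {
    my ($cache_file, $generate) = @_;

    return slurp($cache_file) if -e $cache_file;    # fast path, no lock

    open my $lock, '>>', "$cache_file.lock" or die "lock: $!";
    flock($lock, LOCK_EX) or die "flock: $!";

    # Re-check under the lock: the winner may have filled it already.
    unless (-e $cache_file) {
        open my $out, '>', "$cache_file.tmp" or die "write: $!";
        print {$out} $generate->();
        close $out;
        rename "$cache_file.tmp", $cache_file;
    }
    close $lock;                                    # releases the flock
    return slurp($cache_file);
}

sub slurp {
    my ($file) = @_;
    open my $fh, '<', $file or return '';
    local $/;
    return scalar <$fh>;
}
-- >8 --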
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-08 23:27 ` Linus Torvalds 2006-12-08 23:46 ` Michael K. Edwards [not found] ` <4579FABC.5070509@garzik.org> @ 2006-12-09 1:56 ` Martin Langhoff 2006-12-09 11:51 ` Jakub Narebski 2006-12-09 7:56 ` Steven Grimm 3 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2006-12-09 1:56 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin, Git Mailing List, Jakub Narebski On 12/9/06, Linus Torvalds <torvalds@osdl.org> wrote: > Actually, just looking at the examples, it looks like memcached is > fundamentally flawed, exactly the same way Apache mod_cache is > fundamentally flawed. I don't know about fundamentally flawed, but (having used memcached) I don't think it's a big win for this at all. We can make gitweb detect mod_perl and do a few smarter things if it is running inside of it. In fact, we can (ab)use mod_perl and perl facilities a bit to do some serialization which will be a big win for some pages. What we need for that is to set a sensible ETag and use some IPC to announce/check if other apache/modperl processes are preparing content for the same ETag. The first process to announce a given ETag can then write it to a common temp directory (atomically - write to a temp-name and move to the expected name) while other processes wait, polling for the file. Once the file is in place the latecomers can just serve the content of the file and exit. (I am calling the "state we are serving" identifier ETag because I think we should also set it as the ETag in the HTTP headers, so we'll be able to check the ETag of future requests for staleness - all we need is a ref lookup, and if the SHA1 matches, we are sorted). So having this 'unique request identifier' doubles up nicely... The ETag should probably be: - SHA1+displaytype+args for pages that display an object identified by SHA1 - refname+SHA1+displaytype+args for pages that display something identified by a ref - SHA1(names and sha1s of all refs) for the summary page > You can't have a cache architecture where the client just does a "get", > like memcached does. You need to have a "read-for-fill" operation, which > says: You _could_ make do with a convention of polling for "entryname" and "workingon-entryname" and if "workingon-entryname" is set to 1, you can expect entryname to be filled real soon now. However, memcached is completely memory-bound, so it is only nice for really small stuff or for a large server farm which has gobs of spare RAM. (Note that memcached does have timeouts which means that the 'workingon' value could have a short timeout in case the request is cancelled or the process dies - the nasty bit in the above plan would be the polling.) > I still don't understand why apache doesn't do it. I guess it wants to be > stateless or something. Apache doesn't do it because most web applications don't use the HTTP protocol correctly - especially when it comes to the idempotency of GET. So in 99% of the cases, web apps serve truly different pages for the same GET request, depending on your cookie, IP address, time-of-day, etc. Most websites deal with very little traffic, so this isn't a problem. And many large sites that serve a lot of traffic from a dynamic web app want to be serving custom ads, let you log in and see your personalised toolbar, etc., etc., so this wouldn't work for them either. So in practice, serialising speculatively on GET requests for the same URL has very little payoff except for static content.
And that's quite fast anyway.... especially if the underlying OS is smokin' fast ;-) cheers, ^ permalink raw reply [flat|nested] 82+ messages in thread
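Martin's announce/poll serialization, sketched with plain files standing in for the IPC: the O_EXCL create is the "announce", and the rename is the atomic publish. All names and timeouts are invented for illustration:

-- >8 --
#!/usr/bin/perl
# Announce/poll serialization: O_EXCL creation of "workingon-<etag>"
# is the announce; rename of the finished file is the atomic publish;
# latecomers poll for the published file.
use strict;
use warnings;
use Fcntl qw(O_CREAT O_EXCL O_WRONLY);
use Time::HiRes qw(usleep);

my $spool = '/var/cache/gitweb';       # illustrative

sub serve_etag {
    my ($etag, $generate) = @_;
    my $file    = "$spool/$etag";
    my $working = "$spool/workingon-$etag";

    return slurp($file) if -e $file;

    if (sysopen(my $fh, $working, O_CREAT | O_EXCL | O_WRONLY)) {
        close $fh;                     # we won: we are the producer
        my $content = $generate->();
        open my $out, '>', "$file.tmp.$$" or die $!;
        print {$out} $content;
        close $out;
        rename "$file.tmp.$$", $file;  # publish atomically
        unlink $working;
        return $content;
    }

    for (1 .. 200) {                   # someone else announced: poll ~20s
        return slurp($file) if -e $file;
        usleep(100_000);
    }
    return $generate->();              # producer died; do the work ourselves
}

sub slurp {
    my ($file) = @_;
    open my $fh, '<', $file or return '';
    local $/;
    return scalar <$fh>;
}
-- >8 --

A dead producer leaves a stale workingon file behind; a real version would give the announce a timeout, which is exactly the memcached property Martin points to above.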
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 1:56 ` Martin Langhoff @ 2006-12-09 11:51 ` Jakub Narebski 2006-12-09 12:42 ` Jeff Garzik 0 siblings, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-09 11:51 UTC (permalink / raw) To: Martin Langhoff, Git Mailing List Cc: Linus Torvalds, Jeff Garzik, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Martin Langhoff wrote: > We can make gitweb detect mod_perl and do a few smarter things if it > is running inside of it. In fact, we can (ab)use mod_perl and perl > facilities a bit to do some serialization which will be a big win for > some pages. What we need for that is to set a sensible ETag and > use some IPC to announce/check if other apache/modperl processes are > preparing content for the same ETag. The first process to announce a > given ETag can then write it to a common temp directory (atomically - > write to a temp-name and move to the expected name) while other > processes wait, polling for the file. Once the file is in place the > latecomers can just serve the content of the file and exit. First, it would (and could) work only for serving gitweb over mod_perl. I'm not sure if the overhead of IPC and the complications of implementing it are worth it: this is perhaps better solved by a caching engine. But let us put aside for a while actual caching (writing an HTML version of the page to a common temp directory, and serving this static page if possible), and talk a bit about what gitweb can do with respect to cache validation. In addition to setting either an Expires: header or Cache-Control: max-age, gitweb should also set Last-Modified: and ETag headers, and also probably respond to If-Modified-Since: and If-None-Match: requests. Would it be worth implementing this? > (I am calling the "state we are serving" identifier ETag because I > think we should also set it as the ETag in the HTTP headers, so we'll > be able to check the ETag of future requests for staleness - all we > need is a ref lookup, and if the SHA1 matches, we are sorted). So > having this 'unique request identifier' doubles up nicely... For some pages ETag is natural; for others Last-Modified: would be more natural. > The ETag should probably be: > - SHA1+displaytype+args for pages that display an object identified > by SHA1 What uniquely identifies contents in "object" views ("commit", "tag", "tree", "blob") is either h=SHA1, or hb=SHA1;f=FILENAME (in the absence of h=SHA1). If both h=SHA1 and hb=SHA1 are present, hb=SHA1 serves as a backlink. The "diff" views ("commitdiff", "blobdiff") are uniquely identified by a pair of object identifiers (pairs of SHA1s, or pairs of hb SHA1 + FILENAME). Three of those views ("blob", "commitdiff", "blobdiff") have their "plain" versions; so the ETag should include the displaytype (action, 'a' parameter). The hb=SHA1;f=FILENAME identifier can be converted at the cost of one call to a git command (which is a bit expensive, as it recurses trees), namely git-ls-tree. The ETag can be simply the args (query), if all h/hb/hbp parameters are SHA1s. Or the ETag can be the SHA1 of an object (or a pair of SHA1s in the case of a diff), but this is a little more costly to verify. Although we usually (always?) convert hb=SHA1;f=FILENAME to h=SHA1 anyway when displaying/generating the page. Usually you can compare ETags based on URL alone. > - refname+SHA1+displaytype+args for pages that display something > identified by a ref For object views we can simply convert the refname to SHA1. I'm not sure if it is worth it.
In the cases where we have to calculate the SHA1 of an object for a view anyway, we can return (and validate) an ETag with SHA1 as above. - ETag and/or Last-Modified headers for "log" views: "log", "shortlog" (part of the summary view), "history", "rss"/"atom" views. On one hand, all log views (at least now) are identified by their parameters (action/view name, and filename in the case of the history view) and the SHA1 of the top commit. On the other hand, it might be easier to use Last-Modified with the date of the top commit... Verifying a SHA1-based ETag could add some overhead in the case of a miss. > - SHA1(names and sha1s of all refs) for the summary page Wouldn't it be simpler to just set the Last-Modified: header (and check it)? P.S. Can anyone post some benchmark comparing gitweb deployed under mod_perl as compared to deployed as a CGI script? Does kernel.org use the mod_perl or the CGI version of gitweb? -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
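A sketch of the cheap case discussed above, where the ETag can be derived from the query alone because h/hb/hbp are already SHA1s. The parameter names are gitweb's; the composition and hashing are illustrative:

-- >8 --
#!/usr/bin/perl
# Derive an ETag from the query alone, but only when every object
# parameter is already a SHA1 (otherwise validating the tag would
# need a ref lookup, defeating the point of the cheap path).
use strict;
use warnings;
use CGI;
use Digest::MD5 qw(md5_hex);

sub cheap_etag {
    my ($cgi) = @_;
    my $action = $cgi->param('a') || 'summary';
    my @parts  = ("a=$action");
    my $have_sha1;
    for my $p (qw(h hb hbp)) {
        my $v = $cgi->param($p);
        next unless defined $v;
        return undef unless $v =~ /^[0-9a-f]{40}$/;  # not content-addressed
        push @parts, "$p=$v";
        $have_sha1 = 1;
    }
    return undef unless $have_sha1;
    my $f = $cgi->param('f');
    push @parts, "f=$f" if defined $f;
    return '"' . md5_hex(join ';', @parts) . '"';    # short, header-safe
}
-- >8 --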
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 11:51 ` Jakub Narebski @ 2006-12-09 12:42 ` Jeff Garzik 2006-12-09 13:37 ` Jakub Narebski ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Jeff Garzik @ 2006-12-09 12:42 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jakub Narebski wrote: > First, it would (and could) work only for serving gitweb over mod_perl. > I'm not sure if the overhead of IPC and the complications of implementing > it are worth it: this is perhaps better solved by a caching engine. It is. At least for kernel.org, the issue isn't that CGI is expensive, it's that I/O is expensive. > In addition to setting either an Expires: header or Cache-Control: max-age, > gitweb should also set Last-Modified: and ETag headers, and also > probably respond to If-Modified-Since: and If-None-Match: requests. > > Would it be worth implementing this? IMO yes, since most major browsers, caches, and spiders support these headers. > For some pages ETag is natural; for others Last-Modified: would be more > natural. Yes, a good point to note. > Usually you can compare ETags based on URL alone. Mostly true: you must also consider HTTP_ACCEPT. > Wouldn't it be simpler to just set the Last-Modified: header (and check > it)? That would be a good start, and suffice for many cases. If the CGI can simply stat(2) files rather than executing git-* programs, that would increase efficiency quite a bit. A core problem with cache hints via HTTP headers (last-modified, etc.) is that you don't achieve caching across multiple clients, just across repeated queries from the same client (or caching proxy). At least for the RSS/Atom feeds and the git main page, it makes no sense to regenerate that data repeatedly. Internally, gitweb would need to do a stat() on key files, and return pre-generated XML for the feeds if the stat() reveals no changes. Ditto for the front page. > P.S. Can anyone post some benchmark comparing gitweb deployed under > mod_perl as compared to deployed as a CGI script? Does kernel.org use > the mod_perl or the CGI version of gitweb? The CGI version of gitweb. But again, mod_perl vs. CGI isn't the issue. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
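A sketch of the stat()-based shortcut for feeds: serve the pre-generated XML only if nothing that can move a ref is newer than it. The paths and the set of files checked are illustrative:

-- >8 --
#!/usr/bin/perl
# stat()-based freshness test for a pre-generated feed: no git
# command runs at all when the cached XML is still current.
use strict;
use warnings;

sub cached_feed_if_fresh {
    my ($git_dir, $cached_xml) = @_;
    return undef unless -e $cached_xml;
    my $cached_mtime = (stat(_))[9];

    # Anything that moves a ref touches one of these paths.
    for my $path ("$git_dir/packed-refs", glob("$git_dir/refs/heads/*")) {
        next unless -e $path;
        return undef if (stat(_))[9] > $cached_mtime;   # feed is stale
    }
    open my $fh, '<', $cached_xml or return undef;
    local $/;
    return scalar <$fh>;
}
-- >8 --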
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 12:42 ` Jeff Garzik @ 2006-12-09 13:37 ` Jakub Narebski 2006-12-09 14:43 ` Jeff Garzik 2006-12-10 4:07 ` Martin Langhoff 2006-12-09 18:04 ` Linus Torvalds 2006-12-10 3:55 ` Martin Langhoff 2 siblings, 2 replies; 82+ messages in thread From: Jakub Narebski @ 2006-12-09 13:37 UTC (permalink / raw) To: Jeff Garzik Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jeff Garzik wrote: > Jakub Narebski wrote: >> In addition to setting either an Expires: header or Cache-Control: max-age, >> gitweb should also set Last-Modified: and ETag headers, and also >> probably respond to If-Modified-Since: and If-None-Match: requests. >> >> Would it be worth implementing this? > > IMO yes, since most major browsers, caches, and spiders support these > headers. Sending Last-Modified: should be easy; sending ETag needs some consensus on the contents: mainly about validation. Responding to If-Modified-Since: and If-None-Match: should cut at least _some_ of the page generating time. If the ETag can be calculated from the URL alone, then we can answer If-None-Match: right at the beginning of the script. >> For some pages ETag is natural; for others Last-Modified: would be more >> natural. > > Yes, a good point to note. > >> Usually you can compare ETags based on URL alone. > > Mostly true: you must also consider HTTP_ACCEPT. Well, yes, ETag is an HTTP/1.1 header. >> Wouldn't it be simpler to just set the Last-Modified: header (and check >> it)? > > That would be a good start, and suffice for many cases. If the CGI can > simply stat(2) files rather than executing git-* programs, that would > increase efficiency quite a bit. As I said, I'm not talking (at least now) about saving generated HTML output. This, I think, is better solved in a caching engine like Squid. Although even here some git specifics can be of help: we can invalidate the cache on push, and we know that some results don't ever change (well, with the exception of changing the output of gitweb). > A core problem with cache hints via HTTP headers (last-modified, etc.) > is that you don't achieve caching across multiple clients, just across > repeated queries from the same client (or caching proxy). > > At least for the RSS/Atom feeds and the git main page, it makes no sense > to regenerate that data repeatedly. > > Internally, gitweb would need to do a stat() on key files, and return > pre-generated XML for the feeds if the stat() reveals no changes. Ditto > for the front page. I'm not sure if it is worth implementing in gitweb, or whether it is better left to a caching engine. With the projects list page and summary page there is an additional problem with relative dates, although this can be solved using Jonas Fonseca's idea of using absolute dates in the page and using ECMAScript (JavaScript) to convert them to relative: on load, and perhaps on a timer ;-) What can be _easily_ done: * Use post 1.4.4 gitweb, which uses git-for-each-ref to generate the summary page; this leads to an around 3 times faster summary page. * Perhaps using a projects list file (which can now be generated by gitweb) instead of scanning directories and stat()-ing for the owner would help with the time to generate the projects list page What can be quite easily incorporated into gitweb: * For immutable pages set Expires: or Cache-Control: max-age (or both) to infinity * Calculate a hash+action based ETag, at least for those actions where it is easy, and respond with 304 Not Modified as soon as it can.
This might require some code reorganization so as not to begin writing output before calculating the ETag and doing the ETag comparison (If-Match, If-None-Match). * Generate Last-Modified: for those views where it can be calculated, and respond with 304 Not Modified as soon as it can. What can be easily done using a caching engine: * Select the top 10 common queries, and cache them, invalidating the cache on push (depending on the query: for example invalidate the project list on a push to any project, invalidate the RSS/Atom feed and summary pages only on a push to the specific project) - can be done with git hooks. -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
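A sketch of the two easy wins from the list above, as they might look in a CGI: an effectively infinite lifetime for content-addressed pages, and a 304 exit before any git work. This leans on CGI.pm passing extra named parameters through as headers; everything else is illustrative:

-- >8 --
#!/usr/bin/perl
# Immutable-page headers plus an early 304: a page named by a bare
# SHA1 can never change, so validation needs no git work at all.
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;
my $h   = $cgi->param('h') || '';

if ($h =~ /^[0-9a-f]{40}$/) {          # content-addressed request
    my $etag = qq{"$h"};
    if (($ENV{HTTP_IF_NONE_MATCH} || '') eq $etag) {
        print $cgi->header(-status => '304 Not Modified');
        exit 0;                        # before running any git command
    }
    print $cgi->header(
        -type            => 'text/html; charset=utf-8',
        -expires         => '+1y',     # "infinity", for practical purposes
        '-ETag'          => $etag,
        '-Cache-Control' => 'public, max-age=31536000',
    );
}
# ... generate the page as usual ...
-- >8 --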
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 13:37 ` Jakub Narebski @ 2006-12-09 14:43 ` Jeff Garzik 2006-12-09 17:02 ` Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Jeff Garzik @ 2006-12-09 14:43 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jakub Narebski wrote: > Sending Last-Modified: should be easy; sending ETag needs some consensus > on the contents: mainly about validation. Responding to If-Modified-Since: > and If-None-Match: should cut at least _some_ of the page generating time. Definitely. > As I said, I'm not talking (at least now) about saving generated HTML > output. This, I think, is better solved in a caching engine like Squid. > Although even here some git specifics can be of help: we can invalidate > the cache on push, and we know that some results don't ever change (well, > with the exception of changing the output of gitweb). It depends on how creatively you think ;-) Consider generating static HTML files on each push, via a hook, for many of the toplevel files. The static HTML would then link to the CGI for further dynamic querying of the git database. > What can be _easily_ done: > * Use post 1.4.4 gitweb, which uses git-for-each-ref to generate the summary > page; this leads to an around 3 times faster summary page. This re-opens the question mentioned earlier: is Kay (or anyone?) still actively maintaining gitweb on k.org? > * Perhaps using a projects list file (which can now be generated by gitweb) > instead of scanning directories and stat()-ing for the owner would help > with the time to generate the projects list page This could be statically generated by a robot. I think everybody would shrink in horror if a human needed to maintain such a file. > What can be quite easily incorporated into gitweb: > * For immutable pages set Expires: or Cache-Control: max-age (or both) > to infinity nice! > * Generate Last-Modified: for those views where it can be calculated, > and respond with 304 Not Modified as soon as it can. agreed > What can be easily done using a caching engine: > * Select the top 10 common queries, and cache them, invalidating the cache on push > (depending on the query: for example invalidate the project list on a push to any > project, invalidate the RSS/Atom feed and summary pages only on a push to the specific > project) - can be done with git hooks. Or simply generate regular filesystem files into the webspace, as triggered by a hook. Let the standard filesystem mirroring/caching work its magic. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
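A sketch of the hook idea Jeff describes: regenerate a handful of static pages on every push by driving gitweb.cgi the same way the project_index invocation later in this thread does. The page list, paths, and project names are invented for illustration:

-- >8 --
#!/usr/bin/perl
# post-update hook sketch: run gitweb.cgi as a plain CGI and drop the
# results into the webspace as static files.
use strict;
use warnings;

my $webroot = '/srv/www/gitweb-static';        # illustrative paths
my $gitweb  = '/srv/www/cgi-bin/gitweb.cgi';

my %pages = (
    'index.html'   => '',                      # projects list
    'summary.html' => 'p=linux-2.6.git;a=summary',
    'atom.xml'     => 'p=linux-2.6.git;a=atom',
);

while (my ($file, $query) = each %pages) {
    local $ENV{GATEWAY_INTERFACE} = 'CGI/1.1';
    local $ENV{HTTP_ACCEPT}       = '*/*';
    local $ENV{REQUEST_METHOD}    = 'GET';
    local $ENV{QUERY_STRING}      = $query;
    my $out = `$gitweb`;                       # gitweb in CGI mode
    $out =~ s/\A.*?\r?\n\r?\n//s;              # drop the CGI header block
    open my $fh, '>', "$webroot/$file.tmp" or next;
    print {$fh} $out;
    close $fh;
    rename "$webroot/$file.tmp", "$webroot/$file";   # atomic swap
}
-- >8 --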
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 14:43 ` Jeff Garzik @ 2006-12-09 17:02 ` Jakub Narebski 2006-12-09 17:27 ` Jeff Garzik 0 siblings, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-09 17:02 UTC (permalink / raw) To: Jeff Garzik Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jeff Garzik wrote: > Jakub Narebski wrote: >> As I said, I'm not talking (at least now) about saving generated HTML >> output. This, I think, is better solved in a caching engine like Squid. >> Although even here some git specifics can be of help: we can invalidate >> the cache on push, and we know that some results don't ever change (well, >> with the exception of changing the output of gitweb). > > It depends on how creatively you think ;-) > > Consider generating static HTML files on each push, via a hook, for many > of the toplevel files. The static HTML would then link to the CGI for > further dynamic querying of the git database. You mean that the links in this pre-generated HTML would be to CGI pages? >> What can be _easily_ done: >> * Use post 1.4.4 gitweb, which uses git-for-each-ref to generate the summary >> page; this leads to an around 3 times faster summary page. > > This re-opens the question mentioned earlier: is Kay (or anyone?) still > actively maintaining gitweb on k.org? By the way, thanks to Martin Waitz it is much easier to install gitweb. I for example use the following script to test changes I have made to gitweb:

-- >8 --
#!/bin/bash
BINDIR="/home/local/git"

function make_gitweb() {
	pushd "/home/jnareb/git/"
	make GITWEB_PROJECTROOT="/home/local/scm" \
	     GITWEB_CSS="/gitweb/gitweb.css" \
	     GITWEB_LOGO="/gitweb/git-logo.png" \
	     GITWEB_FAVICON="/gitweb/git-favicon.png" \
	     bindir=$BINDIR \
	     gitweb/gitweb.cgi
	popd
}

function copy_gitweb() {
	cp -fv /home/jnareb/git/gitweb/gitweb.{cgi,css} /home/local/gitweb/
}

make_gitweb
copy_gitweb
# end of gitweb-update.sh
-- >8 --

>> * Perhaps using a projects list file (which can now be generated by gitweb) >> instead of scanning directories and stat()-ing for the owner would help >> with the time to generate the projects list page > > This could be statically generated by a robot. I think everybody would > shrink in horror if a human needed to maintain such a file. Gitweb can generate this file. The problem is that one would have to temporarily turn off using the index file. This can be done by having the following gitweb_list_projects.perl file:

-- >8 --
#!/usr/bin/perl
$projects_list = "";
-- >8 --

then use the following invocation to generate the project index file:

$ GATEWAY_INTERFACE="CGI/1.1" HTTP_ACCEPT="*/*" REQUEST_METHOD="GET" \
  GITWEB_CONFIG=gitweb_list_projects.perl QUERY_STRING="a=project_index" \
  gitweb.cgi

-- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 17:02 ` Jakub Narebski @ 2006-12-09 17:27 ` Jeff Garzik 0 siblings, 0 replies; 82+ messages in thread From: Jeff Garzik @ 2006-12-09 17:27 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jakub Narebski wrote: > Jeff Garzik wrote: >> Jakub Narebski wrote: > >>> As I said, I'm not talking (at least now) about saving generated HTML >>> output. This, I think, is better solved in a caching engine like Squid. >>> Although even here some git specifics can be of help: we can invalidate >>> the cache on push, and we know that some results don't ever change (well, >>> with the exception of changing the output of gitweb). >> It depends on how creatively you think ;-) >> >> Consider generating static HTML files on each push, via a hook, for many >> of the toplevel files. The static HTML would then link to the CGI for >> further dynamic querying of the git database. > > You mean that the links in this pre-generated HTML would be to CGI > pages? Yes, they must be. Otherwise, the gitweb interface changes. You don't want to pre-generate HTML for every possible git query; that would cause an explosion of data. Both the HTML generator and CGI would need to know which pages were pre-generated and which are not. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 13:37 ` Jakub Narebski 2006-12-09 14:43 ` Jeff Garzik @ 2006-12-10 4:07 ` Martin Langhoff 2006-12-10 10:09 ` Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2006-12-10 4:07 UTC (permalink / raw) To: Jakub Narebski Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On 12/10/06, Jakub Narebski <jnareb@gmail.com> wrote: > Jeff Garzik wrote: > > Jakub Narebski wrote: > > >> In addition to setting either an Expires: header or Cache-Control: max-age, > >> gitweb should also set Last-Modified: and ETag headers, and also > >> probably respond to If-Modified-Since: and If-None-Match: requests. > >> > >> Would it be worth implementing this? > > > > IMO yes, since most major browsers, caches, and spiders support these > > headers. > > Sending Last-Modified: should be easy; sending ETag needs some consensus > on the contents: mainly about validation. Responding to If-Modified-Since: > and If-None-Match: should cut at least _some_ of the page generating time. > If the ETag can be calculated from the URL alone, then we can answer > If-None-Match: right at the beginning of the script. Indeed. Let me add myself to the pileup agreeing that a combination of setting Last-Modified and checking for If-Modified-Since for ref-centric pages (log, shortlog, RSS, and summary) is the smartest scheme. I got locked into thinking ETags. > > That would be a good start, and suffice for many cases. If the CGI can > > simply stat(2) files rather than executing git-* programs, that would > > increase efficiency quite a bit. > > As I said, I'm not talking (at least now) about saving generated HTML > output. This, I think, is better solved in a caching engine like Squid. > Although even here some git specifics can be of help: we can invalidate > the cache on push, and we know that some results don't ever change (well, > with the exception of changing the output of gitweb). Indeed - gitweb should not be saving HTML around but giving the best possible hints to squid and friends. And improving our ability to short-cut and send a 304 Not Modified. > What can be _easily_ done: Great plan. :-) cheers, ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 4:07 ` Martin Langhoff @ 2006-12-10 10:09 ` Jakub Narebski 2006-12-10 12:41 ` Jeff Garzik 0 siblings, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-10 10:09 UTC (permalink / raw) To: Martin Langhoff Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Martin Langhoff wrote: > On 12/10/06, Jakub Narebski <jnareb@gmail.com> wrote: >> Sending Last-Modified: should be easy; sending ETag needs some consensus >> on the contents: mainly about validation. Responding to If-Modified-Since: >> and If-None-Match: should cut at least _some_ of the page generating time. >> If the ETag can be calculated from the URL alone, then we can answer >> If-None-Match: right at the beginning of the script. > > Indeed. Let me add myself to the pileup agreeing that a combination of > setting Last-Modified and checking for If-Modified-Since for > ref-centric pages (log, shortlog, RSS, and summary) is the smartest > scheme. I got locked into thinking ETags. Sometimes it is easier to use ETags, sometimes it is easier to use Last-Modified:. Usually you can check the ETag earlier (after calling git-rev-list) than Last-Modified (after parsing the first commit). But some pages don't have a natural ETag... Besides, because ETag is an HTTP/1.1 header, we should provide and validate both. P.S. Any hints on how to do this with the CGI Perl module? -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 10:09 ` Jakub Narebski @ 2006-12-10 12:41 ` Jeff Garzik 2006-12-10 13:02 ` Jakub Narebski 0 siblings, 1 reply; 82+ messages in thread From: Jeff Garzik @ 2006-12-10 12:41 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jakub Narebski wrote: > P.S. Any hints on how to do this with the CGI Perl module? It's impossible; Apache doesn't supply e-tag info to CGI programs. (It does supply HTTP_CACHE_CONTROL, though, apparently.) You could probably do it via mod_perl. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 12:41 ` Jeff Garzik @ 2006-12-10 13:02 ` Jakub Narebski 2006-12-10 13:45 ` Jeff Garzik 0 siblings, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-10 13:02 UTC (permalink / raw) To: Jeff Garzik Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jeff Garzik wrote: > Jakub Narebski wrote: >> >> P.S. Any hints on how to do this with the CGI Perl module? > > It's impossible; Apache doesn't supply e-tag info to CGI programs. (It > does supply HTTP_CACHE_CONTROL, though, apparently.) By ETag info you mean access to the HTTP headers sent by the browser: If-Modified-Since:, If-Match:, If-None-Match:, do you? It's a pity that the CGI interface doesn't cover that... > You could probably do it via mod_perl. So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ? -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 13:02 ` Jakub Narebski @ 2006-12-10 13:45 ` Jeff Garzik 2006-12-10 19:11 ` Jakub Narebski 0 siblings, 1 reply; 82+ messages in thread From: Jeff Garzik @ 2006-12-10 13:45 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin [-- Attachment #1: Type: text/plain, Size: 1588 bytes --] Jakub Narebski wrote: > Jeff Garzik wrote: >> Jakub Narebski wrote: >>> P.S. Any hints on how to do this with the CGI Perl module? >> It's impossible; Apache doesn't supply e-tag info to CGI programs. (It >> does supply HTTP_CACHE_CONTROL, though, apparently.) > > By ETag info you mean access to the HTTP headers sent by the browser: > If-Modified-Since:, If-Match:, If-None-Match:, do you? You can use the attached shell script as a CGI script, to see precisely what information Apache gives you. You can even experiment with passing back headers other than Content-type (such as E-tag), to see what sort of results are produced. The script currently passes back both E-Tag and Last-Modified of a sample file; modify or delete those lines to suit your experiments. > It's a pity that the CGI interface doesn't cover that... > >> You could probably do it via mod_perl. > > So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ? Sorry, I was /assuming/ mod_perl would make this available. The HTTP header info is available to all Apache modules, but I confess I have no idea how mod_perl passes that info to scripts. Also, an interesting thing while I was testing the attached shell script: even though repeated hits to the script generate a proper 304 response to the browser, the CGI script and its output run to completion. So, it didn't save work on the CGI side; the savings were solely in not transmitting the document from server to client. The server still went through the work of generating the document (by running the CGI), as one would expect. Jeff [-- Attachment #2: fenv --] [-- Type: text/plain, Size: 317 bytes --]

#!/bin/sh
FN=/tmp/foo

if [ ! -f "$FN" ]
then
	echo "blah blah blah" > "$FN"
fi

HASH=`md5sum "$FN" | cut -d' ' -f1`

echo "Content-type: text/plain"
echo "E-tag: $HASH"
echo Last-Modified: `date -r "$FN" '+%a, %d %b %Y %T %Z'`
echo ""

# don't pollute server environment output with our local additions
unset FN
unset HASH

set

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 13:45 ` Jeff Garzik @ 2006-12-10 19:11 ` Jakub Narebski 2006-12-10 19:50 ` Linus Torvalds 2006-12-10 22:05 ` Jeff Garzik 0 siblings, 2 replies; 82+ messages in thread From: Jakub Narebski @ 2006-12-10 19:11 UTC (permalink / raw) To: Jeff Garzik Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jeff Garzik wrote: > Jakub Narebski wrote: >> Jeff Garzik wrote: >>> Jakub Narebski wrote: >>>> >>>> P.S. Any hints on how to do this with the CGI Perl module? >>> >>> It's impossible; Apache doesn't supply e-tag info to CGI programs. (It >>> does supply HTTP_CACHE_CONTROL, though, apparently.) >> >> By ETag info you mean access to the HTTP headers sent by the browser: >> If-Modified-Since:, If-Match:, If-None-Match:, do you? And in the CGI standard there is a way to access additional HTTP header info from a CGI script: the environment variables are named HTTP_<HEADER>; for example, if the browser sent an If-Modified-Since: header, its value can be found in the HTTP_IF_MODIFIED_SINCE environment variable. But of course gitweb should rather use mod_perl if possible, so somewhere in gitweb there would be the following line: $in_date = $ENV{'MOD_PERL'} ? $r->header('If-Modified-Since') : $ENV{'HTTP_IF_MODIFIED_SINCE'}; or something like that... > You can use the attached shell script as a CGI script, to see precisely > what information Apache gives you. You can even experiment with passing > back headers other than Content-type (such as E-tag), to see what sort > of results are produced. The script currently passes back both E-Tag > and Last-Modified of a sample file; modify or delete those lines to suit > your experiments. It is ETag, not E-tag. Besides, I don't see what the attached script is meant to do: it does not output the sample file anyway. >> It's a pity that the CGI interface doesn't cover that... >> >>> You could probably do it via mod_perl. >> >> So the cache verification should be wrapped in if ($ENV{MOD_PERL}) ? > Sorry, I was /assuming/ mod_perl would make this available. The HTTP > header info is available to all Apache modules, but I confess I have no > idea how mod_perl passes that info to scripts. > > Also, an interesting thing while I was testing the attached shell > script: even though repeated hits to the script generate a proper 304 > response to the browser, the CGI script and its output run to completion. > So, it didn't save work on the CGI side; the savings were solely in not > transmitting the document from server to client. The server still went > through the work of generating the document (by running the CGI), as one > would expect. The idea is of course to stop processing in CGI script / mod_perl script as soon as possible if cache validates. I don't know if Apache intercepts and remembers ETag and Last-Modified headers, adds 304 Not Modified HTTP response on finding that cache validates and cuts out CGI script output. I.e. if browser provided If-Modified-Since:, script wrote Last-Modified: header, If-Modified-Since: is no earlier than Last-Modified: (usually is equal in the case of cache validation), then Apache provides 304 Not Modified response instead of CGI script output. -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
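A sketch of conditional-GET handling built on exactly the variable discussed above, using HTTP::Date (which ships with libwww-perl) to parse and emit the dates; the function name and flow are illustrative, and a mod_perl deployment would read the header from $r instead of %ENV:

-- >8 --
#!/usr/bin/perl
# Conditional GET via the CGI environment: parse If-Modified-Since
# and either answer 304 immediately or emit a well-formed
# Last-Modified with the page.
use strict;
use warnings;
use CGI;
use HTTP::Date qw(str2time time2str);

sub print_headers_or_304 {
    my ($cgi, $last_modified) = @_;    # $last_modified: epoch seconds

    if (defined(my $ims = $ENV{HTTP_IF_MODIFIED_SINCE})) {
        my $since = str2time($ims);    # undef on unparsable dates
        if (defined $since && $last_modified <= $since) {
            print $cgi->header(-status => '304 Not Modified');
            exit 0;                    # stop before any expensive work
        }
    }
    print $cgi->header(
        -type            => 'text/html; charset=utf-8',
        '-Last-Modified' => time2str($last_modified),
    );
}
-- >8 --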
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 19:11 ` Jakub Narebski @ 2006-12-10 19:50 ` Linus Torvalds 2006-12-10 20:27 ` Jakub Narebski 2006-12-10 21:01 ` H. Peter Anvin 1 sibling, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2006-12-10 19:50 UTC (permalink / raw) To: Jakub Narebski Cc: Jeff Garzik, Martin Langhoff, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On Sun, 10 Dec 2006, Jakub Narebski wrote: > >> If-Modified-Since:, If-Match:, If-None-Match:, do you? > > And in the CGI standard there is a way to access additional HTTP header > info from a CGI script: the environment variables are named HTTP_<HEADER>; > for example, if the browser sent an If-Modified-Since: header, its value > can be found in the HTTP_IF_MODIFIED_SINCE environment variable. Guys, you're missing something fairly fundamental. It helps almost _nothing_ to support client-side caching with all these fancy "If-Modified-Since:" etc crap. That's not the _problem_. It's usually not one client asking for the gitweb pages: the load comes from just lots of people independently asking for it. So client-side caching may help a tiny tiny bit, but it's not actually fixing the fundamental problem at all. So forget about "If-Modified-Since:" etc. It may help in benchmarks when you try it yourself, and use "refresh" on the client side. But the basic problem is all about lots of clients that do NOT have things cached, because all the client caches are all filled up with pr0n, not with gitweb data from yesterday. So the thing to help is server-side caching with good access patterns, so that the server won't have to seek all over the disk when clients that _don't_ have things in their caches want to see the "git projects" summary overview (that currently lists something like 200+ projects). So to get that list of 200+ projects, right now gitweb will literally walk them all, look at their refs, their descriptions, their ages (which requires looking up the refs, and the objects behind the refs), and if they aren't cached, you're going to have several disk seeks for each project. At 200+ projects, the thing that makes it slow is those disk seeks. Even with a fast disk and RAID array, the seeks are all basically going to be interdependent, so there's no room for disk arm movement optimization, and in the absence of any other load it's still going to be several seconds just for the seeks (say 10ms per seek, four or five seeks per project, you've got 10 seconds _just_ for the seeks to generate the top-level summary page, and quite frankly, five seeks is probably optimistic). Now, hopefully some of it will be in the disk cache, but when the mirroring happens, it will basically blow the disk caches away totally (when using the "--checksum" option), and then you literally have tens of seconds to generate that one top-level page. And when mirroring is blowing out the disk caches, the thing will be doing other things _too_ to the disk, of course. So what you want is server-side caching, and you basically _never_ want to re-generate that data synchronously (because even if the server can take the load, having the clients wait for half a minute or more for the data is just NOT FRIENDLY). This is why I suggested the grace-period where we fill the cache on the server side in the background _while_at_the_same_time_ actually feeding the clients the old cached contents.
Because what matters most to _clients_ is not getting the most recent up-to-date data within the last few minutes - people who go to the overview page want to just get a list of projects, and they want to get them in a second or two, not half a minute later. And btw, all those "If-Modified-Since:" things are irrelevant, since quite often, the top-level page really technically _has_ been modified in the last few minutes, because with the kernel and git projects, _somebody_ has usually pushed out one of the projects within the last hour. And no, people don't just sit there refreshing their browser page all the time. I bet even "active" git users do it at most once or twice a day, which means that their client cache will _never_ be up-to-date. But if you do it with server-side caches and grace-periods, you can generally say "we have something that is at most five minutes old", and most importantly, you can hopefully do it without a lot of disk seeks (because you just cache the _one_ page as _one_ object), so hopefully you can do it in a few hundred ms even if the thing is on disk and even if there's a lot of other load going on. I bet the top-level "all projects" summary page and the individual project summary pages are the important things to cache. That's what probably most people look at, and they are the ones that have lots of server-side cache locality. Individual commits and diffs probably don't get the same kind of "lots of people looking at them" and thus don't get the same kind of benefit from caching. (Individual commits hopefully also need fewer disk seeks, at least with packed repositories. So even if you have to re-generate them from scratch, they won't have the seek times themselves taking up tens of seconds, unless the project is entirely unpacked and diffing just generates total disk seek hell) ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 19:50 ` Linus Torvalds @ 2006-12-10 20:27 ` Jakub Narebski 2006-12-10 20:30 ` Linus Torvalds 1 sibling, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-10 20:27 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff Garzik, Martin Langhoff, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Linus Torvalds wrote: > On Sun, 10 Dec 2006, Jakub Narebski wrote: >>>> If-Modified-Since:, If-Match:, If-None-Match:, do you? >> >> And in the CGI standard there is a way to access additional HTTP header >> info from a CGI script: the environment variables are named HTTP_<HEADER>; >> for example, if the browser sent an If-Modified-Since: header, its value >> can be found in the HTTP_IF_MODIFIED_SINCE environment variable. > > Guys, you're missing something fairly fundamental. > > It helps almost _nothing_ to support client-side caching with all these > fancy "If-Modified-Since:" etc crap. > > That's not the _problem_. > > It's usually not one client asking for the gitweb pages: the load comes > from just lots of people independently asking for it. So client-side > caching may help a tiny tiny bit, but it's not actually fixing the > fundamental problem at all. Well, the idea (perhaps a stupid idea: I don't know how caching engines / reverse proxies work) was that there would be a caching engine / reverse proxy in front (Squid for example) that would cache results and serve them to the rampaging hordes. But this caching engine has to ask gitweb if the cache is valid using "If-Modified-Since:" and "If-None-Match:" headers. If gitweb returns 304 Not Modified then it serves the contents from the cache. > So forget about "If-Modified-Since:" etc. It may help in benchmarks when > you try it yourself, and use "refresh" on the client side. But the basic > problem is all about lots of clients that do NOT have things cached, > because all the client caches are all filled up with pr0n, not with gitweb > data from yesterday. What about the other idea, the one with raising expires to infinity for immutable pages like the "commit" view for a commit given by SHA-1? Even if the clients won't cache it, the proxies and caches between gitweb and client might cache it... Talking about the most-accessed gitweb pages: the project list page changes on every push, and the project summary page and project main RSS feed (now in both RSS and Atom formats) change on every push to a given project. With the help of hooks they can be static pages, generated on push... ...with the exception that the projects list and summary pages have _relative_ dates. -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 20:27 ` Jakub Narebski @ 2006-12-10 20:30 ` Linus Torvalds 2006-12-10 22:01 ` Martin Langhoff 2006-12-10 22:08 ` Jeff Garzik 0 siblings, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2006-12-10 20:30 UTC (permalink / raw) To: Jakub Narebski Cc: Jeff Garzik, Martin Langhoff, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On Sun, 10 Dec 2006, Jakub Narebski wrote: > > Well, the idea (perhaps a stupid idea: I don't know how caching engines > / reverse proxies work) was that there would be a caching engine / reverse > proxy in front (Squid for example) that would cache results and serve them > to the rampaging hordes. Sure, if the proxies actually do the right thing (which they may or may not do) > What about the other idea, the one with raising expires to infinity for > immutable pages like the "commit" view for a commit given by SHA-1? Even if > the clients won't cache it, the proxies and caches between gitweb and > client might cache it... I agree, but as mentioned, I think the _real_ problem tends to be the pages that don't act that way (ie summary pages, both at the individual project level and the top "all projects" level). ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 20:30 ` Linus Torvalds @ 2006-12-10 22:01 ` Martin Langhoff 2006-12-10 22:14 ` Jeff Garzik 1 sibling, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2006-12-10 22:01 UTC (permalink / raw) To: Linus Torvalds Cc: Jakub Narebski, Jeff Garzik, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On 12/11/06, Linus Torvalds <torvalds@osdl.org> wrote: > Sure, if the proxies actually do the right thing (which they may or may > not do) For a high-traffic setup like kernel.org, you can set up a local reverse proxy -- it's a pretty standard practice. That allows you to control a well-behaved and locally tuned caching engine just by emitting good headers. It beats writing and maintaining an internal caching mechanism for each CGI script out there by a long mile. It means there'll be no further tunables or complexity for administrators of other gitweb installs. cheers, ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 22:01 ` Martin Langhoff @ 2006-12-10 22:14 ` Jeff Garzik 0 siblings, 0 replies; 82+ messages in thread From: Jeff Garzik @ 2006-12-10 22:14 UTC (permalink / raw) To: Martin Langhoff Cc: Linus Torvalds, Jakub Narebski, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Martin Langhoff wrote: > On 12/11/06, Linus Torvalds <torvalds@osdl.org> wrote: >> Sure, if the proxies actually do the right thing (which they may or may >> not do) > > For a high-traffic setup like kernel.org, you can set up a local > reverse proxy -- it's a pretty standard practice. That allows you to > control a well-behaved and locally tuned caching engine just by > emitting good headers. > > It beats writing and maintaining an internal caching mechanism for > each CGI script out there by a long mile. It means there'll be no > further tunables or complexity for administrators of other gitweb > installs. If gitweb produced cache-friendly headers, squid could definitely serve as an HTTP front-end ("HTTP accelerator" mode in squid talk). In fact, given kernel.org's slave1/slave2<->master setup, that's a pretty natural fit for caching files and/or cache-aware CGI output. You could even replace rsync to the slaves, if squid were serving as the front-end accelerator running on the slaves, communicating to the master. squid is smart enough to hold off a thundering herd, and only pulls single cacheable copies of files as needed. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 20:30 ` Linus Torvalds 2006-12-10 22:01 ` Martin Langhoff @ 2006-12-10 22:08 ` Jeff Garzik 0 siblings, 0 replies; 82+ messages in thread From: Jeff Garzik @ 2006-12-10 22:08 UTC (permalink / raw) To: Linus Torvalds Cc: Jakub Narebski, Martin Langhoff, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Linus Torvalds wrote: > > On Sun, 10 Dec 2006, Jakub Narebski wrote: >> Well, the idea (perhaps a stupid idea: I don't know how caching engines >> / reverse proxies work) was that there would be a caching engine / reverse >> proxy in front (Squid for example) that would cache results and serve them >> to the rampaging hordes. > > Sure, if the proxies actually do the right thing (which they may or may > not do) squid seems to work well as an HTTP accelerator (reverse proxy). Apache's mem|disk cache stuff fails miserably. Unfortunately squid development seems to have slowed in recent years. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 19:50 ` Linus Torvalds 2006-12-10 20:27 ` Jakub Narebski @ 2006-12-10 21:01 ` H. Peter Anvin 1 sibling, 0 replies; 82+ messages in thread From: H. Peter Anvin @ 2006-12-10 21:01 UTC (permalink / raw) To: Linus Torvalds Cc: Jakub Narebski, Jeff Garzik, Martin Langhoff, Git Mailing List, Rogan Dawes, Kernel Org Admin Linus Torvalds wrote: > > Now, hopefully some of it will be in the disk cache, but when the > mirroring happens, it will basically blow the disk caches away totally > (when using the "--checksum" option), and then you literally have tens of > seconds to generate that one top-level page. > If that was the only time that happened, it would be a non-issue, since that only happens once every 96 hours. However, the problem is that we now have lots of large datasets that blow out the caches on a much more frequent basis. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 19:11 ` Jakub Narebski 2006-12-10 19:50 ` Linus Torvalds @ 2006-12-10 22:05 ` Jeff Garzik 2006-12-10 22:59 ` Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Jeff Garzik @ 2006-12-10 22:05 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jakub Narebski wrote: > And in the CGI standard there is a way to access additional HTTP header > info from a CGI script: the environment variables are named HTTP_<HEADER>; > for example, if the browser sent an If-Modified-Since: header, its value > can be found in the HTTP_IF_MODIFIED_SINCE environment variable. The CGI spec does not at all guarantee that the CGI environment will contain all the HTTP headers sent by the client. That was the point of the environment dump script -- you can see exactly which headers are, and are not, passed through to CGI. CGI only /guarantees/ a bare minimum (things like QUERY_STRING, PATH_INFO, etc.) Even basic server info environment variables are optional. > It is ETag, not E-tag. Besides, I don't see what the attached script is > meant to do: it does not output the sample file anyway. It's not meant to output the sample file. It outputs the server metadata sent to the CGI script (the environment variables). The sample file was simply a way to play around with etag and last-modified metadata. > The idea is of course to stop processing in CGI script / mod_perl script > as soon as possible if cache validates. Certainly. That should help cut down on I/O. FWIW though the projects list is particularly painful, with its File::Find call, which you'll need to do in order to return 304-not-modified. > I don't know if Apache intercepts and remembers ETag and Last-Modified > headers, adds 304 Not Modified HTTP response on finding that cache validates > and cuts out CGI script output. I.e. if browser provided If-Modified-Since:, > script wrote Last-Modified: header, If-Modified-Since: is no earlier than > Last-Modified: (usually is equal in the case of cache validation), then > Apache provides 304 Not Modified response instead of CGI script output. This wanders into the realm of mod_cache configuration, I think. (which I have tried to get working as a reverse proxy, and failed several times) If you are not using mod_*_cache, then Apache must execute the CGI script every time AFAICS, regardless of etag/[if-]last-mod headers. Jeff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 22:05 ` Jeff Garzik @ 2006-12-10 22:59 ` Jakub Narebski 2006-12-11 2:16 ` Martin Langhoff 0 siblings, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-10 22:59 UTC (permalink / raw) To: Jeff Garzik Cc: Martin Langhoff, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Jeff Garzik wrote: > Jakub Narebski wrote: >> >> And in the CGI standard there is a way to access additional HTTP header >> info from a CGI script: the environment variables are named HTTP_<HEADER>; >> for example, if the browser sent an If-Modified-Since: header, its value >> can be found in the HTTP_IF_MODIFIED_SINCE environment variable. > > The CGI spec does not at all guarantee that the CGI environment will > contain all the HTTP headers sent by the client. That was the point of > the environment dump script -- you can see exactly which headers are, > and are not, passed through to CGI. > > CGI only /guarantees/ a bare minimum (things like QUERY_STRING, > PATH_INFO, etc.) > > Even basic server info environment variables are optional. I have checked that at least Apache 2.0.54 passes HTTP_IF_MODIFIED_SINCE when getting an If-Modified-Since: header (my own script + netcat/nc). >> It is ETag, not E-tag. Besides, I don't see what the attached script is >> meant to do: it does not output the sample file anyway. > > It's not meant to output the sample file. It outputs the server > metadata sent to the CGI script (the environment variables). The sample > file was simply a way to play around with etag and last-modified metadata. Ah. >> The idea is of course to stop processing in CGI script / mod_perl script >> as soon as possible if cache validates. > > Certainly. That should help cut down on I/O. FWIW though the projects > list is particularly painful, with its File::Find call, which you'll > need to do in order to return 304-not-modified. First, it is better to use $projects_list, which is a projects index file in the format: <project path> SPC <project owner> where <project path> is relative to $projectroot and is URI encoded; well, at least SPC has to be URI (percent) encoded. <project owner> is the owner of the given project, and is also URI encoded (one would usually use '+' in the place of SPC here). Gitweb can now generate a projects list in the above format, by using the "project_index" action ("a=project_index" query string), or by clicking the 'TXT' link at the bottom of the projects list page in new gitweb: see http://repo.or.cz by Petr Baudis. The problem is that it generates the projects list from the list of projects it sees, so to generate it from scratch from the filesystem you have to have $projects_list be a directory while generating "project_index" (changing it to something that evaluates to false, e.g. undef or "", makes gitweb use $projectroot for $projects_list). I have posted how to do this. The project list changes rarely, only on addition/removal of a project, and on changing the owner of a project; so it can be generated on demand. Second, even with $projects_list set to a projects index file, as of now gitweb runs git-for-each-ref (which scans refs and accesses the pack file for the commit date), checks for the description file and reads it; with $projects_list being a directory it also checks the project directory owner. I plan to make it configurable to read last activity from all heads (all branches) as it is now, from HEAD (current branch) as it was before, or from a given branch (for example 'master').
Assuming that gitweb is configured to read last activity from a single defined branch, generating ETag = checksum(sha1 of heads of projects) needs to read at least one file from each project. >> I don't know if Apache intercepts and remembers ETag and Last-Modified >> headers, adds 304 Not Modified HTTP response on finding that cache validates >> and cuts out CGI script output. I.e. if browser provided If-Modified-Since:, >> script wrote Last-Modified: header, If-Modified-Since: is no earlier than >> Last-Modified: (usually is equal in the case of cache validation), then >> Apache provides 304 Not Modified response instead of CGI script output. > > This wanders into the realm of mod_cache configuration, I think. (which > I have tried to get working as a reverse proxy, and failed several times) > If you are not using mod_*_cache, then Apache must execute the CGI > script every time AFAICS, regardless of etag/[if-]last-mod headers. No, it wanders into the realm of header parsing by Apache, and the NPH (No Parse Headers) option. Even if Apache does execute the CGI script to completion every time, it might not send the output of the script, but an HTTP 304 Not Modified reply. Might. I don't know if it does. -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
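For scale, a sketch of what reading the $projects_list index costs: one file, one pass, one split per line, against the format given above. The decoding here is illustrative:

-- >8 --
#!/usr/bin/perl
# Read a $projects_list index file: "<project path> SPC <project owner>",
# both URI (percent) encoded, '+' standing for SPC.
use strict;
use warnings;

sub read_project_index {
    my ($path) = @_;
    my @projects;
    open my $fh, '<', $path or return;
    while (my $line = <$fh>) {
        chomp $line;
        my ($proj, $owner) = split / /, $line, 2;
        next unless defined $proj && length $proj;
        for ($proj, $owner) {
            next unless defined;
            tr/+/ /;                                   # '+' is SPC
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;       # %xx escapes
        }
        push @projects, { path => $proj, owner => $owner };
    }
    close $fh;
    return @projects;
}
-- >8 --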
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 22:59 ` Jakub Narebski @ 2006-12-11 2:16 ` Martin Langhoff 2006-12-11 8:59 ` Jakub Narebski 0 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2006-12-11 2:16 UTC (permalink / raw) To: Jakub Narebski Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On 12/11/06, Jakub Narebski <jnareb@gmail.com> wrote: > Even if Apache does execute the CGI script to completion every time, it might > not send the output of the script, but an HTTP 304 Not Modified reply. Might. > I don't know if it does. It is up to the script (CGI or via mod_perl) to set the status to 304 and finish execution. Just setting the status to 304 does not forcefully end execution, as you may want to clean up, log, etc. cheers, ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-11 2:16 ` Martin Langhoff @ 2006-12-11 8:59 ` Jakub Narebski 2006-12-11 10:18 ` Martin Langhoff 0 siblings, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2006-12-11 8:59 UTC (permalink / raw) To: Martin Langhoff Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin Martin Langhoff wrote: > On 12/11/06, Jakub Narebski <jnareb@gmail.com> wrote: >> >> Even if Apache does execute the CGI script to completion every time, it might >> not send the output of the script, but an HTTP 304 Not Modified reply. Might. >> I don't know if it does. > > It is up to the script (CGI or via mod_perl) to set the status to 304 > and finish execution. Just setting the status to 304 does not > forcefully end execution, as you may want to clean up, log, etc. I was thinking not about ending execution, but about Apache not sending the script output and sending an HTTP 304 Not Modified reply instead. I meant the following sequence of events: 1. Script sends headers, among those Last-Modified and/or ETag 2. Apache scans headers (e.g. to add its own), notices that Last-Modified is earlier than or equal to the If-Modified-Since: sent by the browser or reverse proxy, or the ETag matches If-None-Match:, and sends 304 instead of the script output 3. Script finishes execution, its output sent to /dev/null Again, I don't know if Apache (or any other web server) does that. -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-11 8:59 ` Jakub Narebski @ 2006-12-11 10:18 ` Martin Langhoff 0 siblings, 0 replies; 82+ messages in thread From: Martin Langhoff @ 2006-12-11 10:18 UTC (permalink / raw) To: Jakub Narebski Cc: Jeff Garzik, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On 12/11/06, Jakub Narebski <jnareb@gmail.com> wrote: > I was thinking not about ending execution, but about Apache sending an > HTTP 304 Not Modified reply instead of the script output. > > I meant the following sequence of events: > 1. Script sends headers, among those Last-Modified and/or ETag > 2. Apache scans headers (e.g. to add its own), notices that Last-Modified > is earlier than or equal to the If-Modified-Since: sent by the browser or > reverse proxy, or that ETag matches If-None-Match:, and sends a 304 instead > of the script output > 3. Script finishes execution, its output sent to /dev/null > > Again, I don't know if Apache (or any other web server) does that. It doesn't. You want to make the decision to send a 304, clean up, and exit _inside_ the CGI. If it were up to Apache, the CGI script would end up creating the (potentially expensive to produce) content just to see it sent to /dev/null; OR, if Apache were to terminate execution of the CGI more violently, the CGI wouldn't have a chance to clean up and release resources. So it's a matter of setting the header to 304 and exiting. cheers, martin ^ permalink raw reply [flat|nested] 82+ messages in thread
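As an illustration of the pattern Martin describes — the script itself decides to answer 304 and exits — here is a minimal CGI.pm sketch. compute_etag() is a hypothetical helper (e.g. along the lines of the checksum sketch earlier in the thread), and the -etag header parameter is passed through CGI.pm's generic header mechanism:

    use strict;
    use warnings;
    use CGI;

    my $q    = CGI->new;
    my $etag = compute_etag();   # hypothetical helper; must be cheap

    # If the client's cached copy is still valid, answer 304 ourselves
    # and stop before doing any expensive git work.
    if (($ENV{HTTP_IF_NONE_MATCH} || '') eq $etag) {
        print $q->header(-status => '304 Not Modified', -etag => $etag);
        # ... cleanup, logging, etc. go here ...
        exit 0;
    }

    print $q->header(-type => 'text/html', -etag => $etag);
    # ... generate the expensive page body ...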
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 12:42 ` Jeff Garzik 2006-12-09 13:37 ` Jakub Narebski @ 2006-12-09 18:04 ` Linus Torvalds 2006-12-09 18:30 ` H. Peter Anvin 2006-12-10 3:55 ` Martin Langhoff 2 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2006-12-09 18:04 UTC (permalink / raw) To: Jeff Garzik Cc: Jakub Narebski, Martin Langhoff, Git Mailing List, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On Sat, 9 Dec 2006, Jeff Garzik wrote: > > It is. At least for kernel.org, the issue isn't that CGI is expensive, it's > that I/O is expensive. Note that if we had a new gitweb, we could also use the packed refs. Those help CPU usage, but they actually help IO patterns more, exactly because they avoid all the seeking around in the filesystem. So with packed refs, there's no need to go from directory lookup to inode lookup to data lookup to object lookup for *each* ref - you can do the "packed-refs" lookup _once_ (which obviously does the dir->inode->data), and you don't need to do the object lookup at all. Of course, gitweb will then end up doing the object lookup anyway (because of getting the dates etc for refs), but if you have packed-refs and a reasonably packed repository, that should still really cut down on IO in a big way. So there's probably tons of room for making this more efficient: using a newer gitweb, packing refs, using the cgi cache thing.. It sounds like what it really needs is just somebody with the competence and time to be willing to step up and maintain gitweb on kernel.org... ^ permalink raw reply [flat|nested] 82+ messages in thread
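For reference, the packed-refs file Linus mentions is a plain text file (one "sha1 refname" line per ref, plus optional "^" peel lines after annotated tags), so a reader can pick up every ref in a single open/read instead of one directory walk per ref. A rough Perl sketch:

    use strict;
    use warnings;

    # Read every ref of a repository from $git_dir/packed-refs in one pass.
    sub read_packed_refs {
        my ($git_dir) = @_;
        my %refs;
        open my $fh, '<', "$git_dir/packed-refs" or return \%refs;
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^#/;    # header line ("# pack-refs with: ...")
            next if $line =~ /^\^/;   # peeled value of the preceding tag
            my ($sha1, $ref) = split ' ', $line, 2;
            $refs{$ref} = $sha1 if defined $ref;
        }
        close $fh;
        return \%refs;                # refname => sha1
    }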
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 18:04 ` Linus Torvalds @ 2006-12-09 18:30 ` H. Peter Anvin 0 siblings, 0 replies; 82+ messages in thread From: H. Peter Anvin @ 2006-12-09 18:30 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff Garzik, Jakub Narebski, Martin Langhoff, Git Mailing List, Rogan Dawes, Kernel Org Admin Linus Torvalds wrote: > > So there's probably tons of room for making this more efficient: using a > newer gitweb, packing refs, using the cgi cache thing.. It sounds like > what it really needs is just somebody with the competence and time to be > willing to step up and maintain gitweb on kernel.org... > Indeed. We have a lot of projects on kernel.org which are like this: not at all conceptually hard, but a huge time commitment for Doing It Right[TM]. This is why I sometimes think that it would be a Good Thing to get paid staff for kernel.org, although I was hoping to defer the need for that until at least we have our 501(c)3 paperwork done, which looks like mid-2007 at this point (assuming no further delays.) ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-09 12:42 ` Jeff Garzik 2006-12-09 13:37 ` Jakub Narebski 2006-12-09 18:04 ` Linus Torvalds @ 2006-12-10 3:55 ` Martin Langhoff 2006-12-10 7:05 ` H. Peter Anvin 2 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2006-12-10 3:55 UTC (permalink / raw) To: Jeff Garzik Cc: Jakub Narebski, Git Mailing List, Linus Torvalds, H. Peter Anvin, Rogan Dawes, Kernel Org Admin On 12/10/06, Jeff Garzik <jeff@garzik.org> wrote: > > P.S. Can anyone post some benchmark comparing gitweb deployed under > > mod_perl as compared to deployed as CGI script? Does kernel.org use > > mod_perl, or CGI version of gitweb? > > CGI version of gitweb. > > But again, mod_perl vs. CGI isn't the issue. IO is the issue, and the CGI startup of Perl is quite IO & CPU intensive. Even if the caching headers, thundering herds and planet collisions are resolved, I don't think you'll ever be happy with IO and CPU load on kernel.org running gitweb as CGI. cheers, ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 3:55 ` Martin Langhoff @ 2006-12-10 7:05 ` H. Peter Anvin 2006-12-12 21:19 ` Jakub Narebski 0 siblings, 1 reply; 82+ messages in thread From: H. Peter Anvin @ 2006-12-10 7:05 UTC (permalink / raw) To: Martin Langhoff Cc: Jeff Garzik, Jakub Narebski, Git Mailing List, Linus Torvalds, Rogan Dawes, Kernel Org Admin Martin Langhoff wrote: > On 12/10/06, Jeff Garzik <jeff@garzik.org> wrote: >> > P.S. Can anyone post some benchmark comparing gitweb deployed under >> > mod_perl as compared to deployed as CGI script? Does kernel.org use >> > mod_perl, or CGI version of gitweb? >> >> CGI version of gitweb. >> >> But again, mod_perl vs. CGI isn't the issue. > > IO is the issue, and the CGI startup of Perl is quite IO & CPU > intensive. Even if the caching headers, thundering herds and planet > collisions are resolved, I don't think you'll ever be happy with IO > and CPU load on kernel.org running gitweb as CGI. > I/O - nonexistent; that stuff will be in memory. CPU - we have more CPU than you can shake a stick at, and it's 95+% idle. *NOT AN ISSUE*. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-10 7:05 ` H. Peter Anvin @ 2006-12-12 21:19 ` Jakub Narebski 0 siblings, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2006-12-12 21:19 UTC (permalink / raw) To: H. Peter Anvin Cc: Martin Langhoff, Jeff Garzik, Git Mailing List, Linus Torvalds, Rogan Dawes, Kernel Org Admin By the way, setting Last-Modified: and ETag: and checking for If-Modified-Since: and If-None-Match: is easy only for log-like views: "shortlog", "log", "history", "rss"/"atom". With "shortlog" and "history" we have the additional difficulty of using relative dates there. And even for those views we need a reverse proxy / caching engine (e.g. Squid in "HTTP accelerator" mode) in front. It would be easier to pre-generate the most commonly accessed views: "projects_list", "summary" and the main "rss"/"atom" feed for each project, and just serve static pages. I don't know if we need to modify gitweb for that. BTW, for a single client (rather stupid benchmark, I know) mod_perl is about twice as fast in keepalive mode as the CGI version of gitweb for the git.git summary page. -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
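One possible shape for the pre-generation Jakub suggests: a push hook that runs gitweb.cgi with a canned CGI environment and captures its output into static files. This is only a sketch — the paths, project name, and chosen views are illustrative, and the emitted CGI header block would still need stripping before the files could be served statically:

    #!/usr/bin/perl
    # Hypothetical push hook: pre-render the hottest gitweb views.
    use strict;
    use warnings;

    my $gitweb  = '/var/www/cgi-bin/gitweb.cgi';      # assumed install path
    my $outdir  = '/var/www/static-gitweb';           # assumed output dir
    my $project = 'linux/kernel/git/torvalds/linux-2.6.git';

    my %views = (
        'summary.html'  => "p=$project;a=summary",
        'shortlog.html' => "p=$project;a=shortlog",
        'feed.atom'     => "p=$project;a=atom",
    );

    while (my ($file, $query) = each %views) {
        local $ENV{GATEWAY_INTERFACE} = 'CGI/1.1';
        local $ENV{REQUEST_METHOD}    = 'GET';
        local $ENV{QUERY_STRING}      = $query;
        open my $cgi, '-|', $gitweb            or die "run $gitweb: $!";
        open my $out, '>', "$outdir/$file.tmp" or die "open: $!";
        print {$out} $_ while <$cgi>;   # note: includes the CGI header block
        close $cgi;
        close $out;
        rename "$outdir/$file.tmp", "$outdir/$file";   # atomic replace
    }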
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-08 23:27 ` Linus Torvalds ` (2 preceding siblings ...) 2006-12-09 1:56 ` Martin Langhoff @ 2006-12-09 7:56 ` Steven Grimm 3 siblings, 0 replies; 82+ messages in thread From: Steven Grimm @ 2006-12-09 7:56 UTC (permalink / raw) To: Git Mailing List Linus Torvalds wrote: > Looking at the memcached operations, they have the "read" op (aka "get"), > but they seem to have no "read-for-fill" op. So memcached fundamentally > doesn't fix this problem, at least without explicit serialization by the > client. > Actually, memcached does support an operation that would work for this: the "add" request, which creates a new cache entry if and only if the key is not already in the cache. If the key is already present, the request fails. You can use that to implement a simple named mutex, and it supports a client-specified timeout. The one thing it doesn't support that you described is a notion of deleting a key when a particular client disconnects, but as you say, that should only happen in the case of buggy clients anyway. Mind you, I'm not convinced memcached is necessarily the right answer for this problem, but it does provide a way to implement the required locking semantics. BTW, I'm one of the main contributors to memcached, so if it does end up looking like a good choice except for some minor issue or another, I may be able to tweak it to cover whatever is missing. For example, the "delete a key on disconnect" thing would be fairly straightforward, if it's actually necessary in practice. -Steve ^ permalink raw reply [flat|nested] 82+ messages in thread
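A sketch of the named-mutex idiom Steven describes, using the Perl Cache::Memcached client; the key names, expiry times and polling loop are illustrative choices, not part of memcached itself:

    use strict;
    use warnings;
    use Cache::Memcached;

    my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

    # Serve from cache, or take a named mutex (via "add") and generate.
    sub fetch_or_generate {
        my ($key, $generate) = @_;    # $generate is a coderef
        my $page = $memd->get($key);
        return $page if defined $page;

        # "add" succeeds only if the key does not yet exist; the
        # 30-second expiry bounds the damage if the holder crashes.
        if ($memd->add("lock:$key", 1, 30)) {
            $page = $generate->();
            $memd->set($key, $page, 300);
            $memd->delete("lock:$key");
            return $page;
        }
        # Someone else is generating; poll briefly for their result.
        for (1 .. 20) {
            sleep 1;
            $page = $memd->get($key);
            return $page if defined $page;
        }
        return $generate->();   # give up waiting, do the work ourselves
    }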
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:16 ` H. Peter Anvin 2006-12-07 19:30 ` Olivier Galibert @ 2006-12-07 19:30 ` Linus Torvalds 2006-12-07 19:39 ` Shawn Pearce 2006-12-07 20:05 ` Junio C Hamano 1 sibling, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2006-12-07 19:30 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Kernel Org Admin, Git Mailing List, Jakub Narebski On Thu, 7 Dec 2006, H. Peter Anvin wrote: > > What it could do better is it could prevent multiple identical queries from > being launched in parallel. That's the real problem we see; under high load, > Apache times out so the git query never gets into the cache; but in the > meantime, the common queries might easily have been launched 20 times in > parallel. Unfortunately, the most common queries are also extremely > expensive. Ahh. I'd have expected that apache itself had some serialization facility, that would kind of go hand-in-hand with any caching. It really would make more sense to have anything that does caching serialize the address that gets cached (think "page cache" layer in the kernel: the _cache_ is also the serialization point, and is what guarantees that we don't do stupid multiple reads to the same address). I'm surprised that Apache can't do that. Or maybe it can, and it just needs some configuration entry? I don't know apache.. I realize that because Apache doesn't know before-hand whether something is cacheable or not, it must probably _default_ to running the CGI scripts to the same address in parallel, but it would be stupid to not have the option to serialize. That said, from some of the other horrors I've heard about, "stupid" may be just scratching at the surface. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:30 ` Linus Torvalds @ 2006-12-07 19:39 ` Shawn Pearce 2006-12-07 19:58 ` Linus Torvalds 2006-12-07 19:58 ` H. Peter Anvin 1 sibling, 2 replies; 82+ messages in thread From: Shawn Pearce @ 2006-12-07 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Kernel Org Admin, Git Mailing List, Jakub Narebski Linus Torvalds <torvalds@osdl.org> wrote: > I'm surprised that Apache can't do that. Or maybe it can, and it just > needs some configuration entry? I don't know apache.. I realize that > because Apache doesn't know before-hand whether something is cacheable or > not, it must probably _default_ to running the CGI scripts to the same > address in parallel, but it would be stupid to not have the option to > serialize. AFAIK it doesn't have such an option, for basically the reason you describe. I worked on a project whose queries were much more difficult to answer than gitweb's, and which was also very popular. Yes, the system died under any load, no matter how much money was thrown at it. :-) > That said, from some of the other horrors I've heard about, "stupid" may > be just scratching at the surface. It is. :-) -- ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:39 ` Shawn Pearce @ 2006-12-07 19:58 ` Linus Torvalds 2006-12-07 23:33 ` Michael K. Edwards 2006-12-07 19:58 ` H. Peter Anvin 1 sibling, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2006-12-07 19:58 UTC (permalink / raw) To: Shawn Pearce Cc: H. Peter Anvin, Kernel Org Admin, Git Mailing List, Jakub Narebski On Thu, 7 Dec 2006, Shawn Pearce wrote: > > AFAIK it doesn't have such an option, for basically the reason > you describe. I worked on a project which had much more difficult > to answer queries than gitweb and were also very popular. Yes, > the system died under any load, no matter how much money was thrown > at it. :-) > > > That said, from some of the other horrors I've heard about, "stupid" may > > be just scratching at the surface. > > It is. :-) Gaah. That's just stupid. This is such a _basic_ issue for caching ("if concurrent requests come in, only handle _one_ and give everybody the same result") that I claim that any cache that doesn't handle it isn't a cache at all, but a total disaster written by incompetent people. Sure, you may want to disable it for certain kinds of truly dynamic content, but that doesn't mean you shouldn't be able to do it at all. Does anybody who is web-server clueful know if there is some simple front-end (squid?) that is easy to set up and can just act as a caching proxy in front of such an incompetent server? Or maybe there is some competent Apache module, not just the default mod_cache (which is what I assume kernel.org uses now)? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:58 ` Linus Torvalds @ 2006-12-07 23:33 ` Michael K. Edwards 0 siblings, 0 replies; 82+ messages in thread From: Michael K. Edwards @ 2006-12-07 23:33 UTC (permalink / raw) To: Linus Torvalds Cc: Shawn Pearce, H. Peter Anvin, Kernel Org Admin, Git Mailing List, Jakub Narebski On 12/7/06, Linus Torvalds <torvalds@osdl.org> wrote: > Does anybody who is web-server clueful know if there is some simple > front-end (squid?) that is easy to set up and can just act as a caching > proxy in front of such an incompetent server? Squid in "transparent reverse proxy" mode isn't a bad choice, although I don't know offhand whether it queues/clusters concurrent requests for the same URL in the way you want. I suggest the "transparent" deployment (netfilter/netlink integration) because you can slap it in with no changes to the origin server and yank it out again if you have a problem. The challenge is in getting conntrack to scale to a zillion concurrent sessions, but you could probably find someone in your crowd who knows something about that. :-) Ignore any documentation that talks about httpd_accel_*. Configuring transparent mode is a great deal simpler and saner in squid 2.6 than it used to be; you just add a "transparent" parameter to the http_port tag. With or without this tag, you set up what used to be called "accelerator mode" using some parameters to http_port and cache_peer, as described in http://www.squid-cache.org/mail-archive/squid-users/200607/0162.html. If transparent mode looks like the right thing for kernel.org, you might be interested in some netfilter hackery to offload part of the conntrack session lookup load to a front-end box that blocks DDoS and acts more or less as an L4 switch plus session context cache. I've been banging on a proof of concept implementation for a while, and am currently working on integrating against 2.6.19 by splitting nf_conntrack into front and back halves that interact via a sort of Layer 2+ header. I have no idea yet whether it will have any scalability benefit on dual-x86_64 class hardware (it was originally conceived for rigid cache architectures where the random access patterns of session lookups have drastic cache effects). Cheers, ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:39 ` Shawn Pearce 2006-12-07 19:58 ` Linus Torvalds @ 2006-12-07 19:58 ` H. Peter Anvin 1 sibling, 0 replies; 82+ messages in thread From: H. Peter Anvin @ 2006-12-07 19:58 UTC (permalink / raw) To: Shawn Pearce Cc: Linus Torvalds, Kernel Org Admin, Git Mailing List, Jakub Narebski Shawn Pearce wrote: > Linus Torvalds <torvalds@osdl.org> wrote: >> I'm surprised that Apache can't do that. Or maybe it can, and it just >> needs some configuration entry? I don't know apache.. I realize that >> because Apache doesn't know before-hand whether something is cacheable or >> not, it must probably _default_ to running the CGI scripts to the same >> address in parallel, but it would be stupid to not have the option to >> serialize. > > AFAIK it doesn't have such an option, for basically the reason > you describe. I worked on a project which had much more difficult > to answer queries than gitweb and were also very popular. Yes, > the system died under any load, no matter how much money was thrown > at it. :-) You certainly can be smarter about it when you know the nature of the query, though. I do that with the patch viewer scripts. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:30 ` Linus Torvalds 2006-12-07 19:39 ` Shawn Pearce @ 2006-12-07 20:05 ` Junio C Hamano 2006-12-07 20:09 ` H. Peter Anvin 1 sibling, 1 reply; 82+ messages in thread From: Junio C Hamano @ 2006-12-07 20:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Kernel Org Admin If I understand correctly, kernel.org is still running the version of gitweb Kay last installed there (I am too busy to take over the gitweb installation maintenance at kernel.org, and I did not ask for the $DOCUMENTROOT/git/ directory to be transferred to me when I rolled gitweb into the git.git repository). I do not know what queries are most popular, but I think a newer gitweb is more efficient in the summary page (getting the list of branches and tags). It might be worth a try. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 20:05 ` Junio C Hamano @ 2006-12-07 20:09 ` H. Peter Anvin 2006-12-07 22:11 ` Junio C Hamano 0 siblings, 1 reply; 82+ messages in thread From: H. Peter Anvin @ 2006-12-07 20:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git, Kernel Org Admin Junio C Hamano wrote: > If I understand correctly, kernel.org is still running the > version of gitweb Kay last installed there (I am too busy to > take over the gitweb installation maintenance at kernel.org, and > I did not ask the $DOCUMENTROOT/git/ directory to be transferred > to me when I rolled gitweb into the git.git repository). That's correct. I can transfer that directory to you if you want; I can't realistically track gitweb well enough to do this myself (in fact, it was pretty much a condition of having it up there that Kay would keep maintaining it.) > I do not know what queries are most popular, but I think a newer > gitweb is more efficient in the summary page (getting list of > branches and tags). It might be worth a try. How do you want to handle it? -hpa ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 20:09 ` H. Peter Anvin @ 2006-12-07 22:11 ` Junio C Hamano 0 siblings, 0 replies; 82+ messages in thread From: Junio C Hamano @ 2006-12-07 22:11 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Linus Torvalds, git, Kernel Org Admin "H. Peter Anvin" <hpa@zytor.com> writes: > Junio C Hamano wrote: >> If I understand correctly, kernel.org is still running the >> version of gitweb Kay last installed there (I am too busy to >> take over the gitweb installation maintenance at kernel.org, and >> I did not ask the $DOCUMENTROOT/git/ directory to be transferred >> to me when I rolled gitweb into the git.git repository). > > That's correct. I can transfer that directory to you if you want; I > can't realistically track gitweb well enough to do this myself... Well, the reason I haven't asked to is because I don't have enough time myself, so.... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-07 19:05 ` kernel.org mirroring (Re: [GIT PULL] MMC update) Linus Torvalds 2006-12-07 19:16 ` H. Peter Anvin @ 2006-12-08 9:43 ` Jakub Narebski 1 sibling, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2006-12-08 9:43 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Kernel Org Admin, Git Mailing List, Petr Baudis Linus Torvalds wrote: [...] > For example, if the git "refs/heads/" (or tags) directory hasn't changed > in the last two months, we should probably set any ref-relative gitweb > pages to have a caching timeout of a day or two. In contrast, if it's > changed in the last hour, maybe we should only cache it for five minutes. > > Jakub: any way to make gitweb set the "expires" fields _much_ more > aggressively. I think we should at least have the ability to set a basic > rules like > > - a _minimum_ of five minutes regardless of anything else > > We might even tweak this based on loadaverage, and it might be > worthwhile to add a randomization, to make sure that you don't get into > situations where everything webpage needs to be recalculated at once. I think the minimum expires (or minimum _additional_ expires: as of now gitweb only does expires +1d for explicit hash requests) should depend on how often the project changes. How often are there pushes to kernel.org? > - if refs/ directories are old, raise the minimum by the age of the refs > > If it's more than an hour old, raise it to ten minutes. If it's more > than a day, raise it to an hour. If it's more than a month old, raise > it to a day. And if it's more than half a year, it's some historical > archive like linux-history, and should probably default to a week or > more. What about packed refs? We can certainly raise expires for tags (tag objects), as they should not usually change. > - infinite for stuff that isn't ref-related. As sha1 is not changeable, everything that is accessed by explicit sha1 (hash), or by explicit sha1 (hash_base) plus pathname (file_name) should have effectively infinite expires. Any caching would need some temporary memory or temporary disk space. And perhaps mod_perl-specific caching would be useful here... P.S. I have added Pasky to Cc:, as he manages the http://repo.or.cz public git repository hosting (much smaller than kernel.org and I think under less load: but also, I think, without kernel.org's resources). -- Jakub Narebski ^ permalink raw reply [flat|nested] 82+ messages in thread
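The age-based expiry rules being discussed might reduce to something like the following sketch; the thresholds mirror the ones proposed in the quoted message, and the final jitter line is the suggested randomization so that popular pages do not all expire at the same instant:

    use strict;
    use warnings;

    # Map the age of the newest ref to a cache lifetime, with jitter.
    sub expires_for_ref_age {
        my ($newest_ref_mtime) = @_;        # epoch mtime under refs/
        my $age = time() - $newest_ref_mtime;

        my $ttl = 5 * 60;                        # floor: five minutes
        $ttl = 10 * 60   if $age > 3600;         # older than an hour
        $ttl = 3600      if $age > 86400;        # older than a day
        $ttl = 86400     if $age > 30 * 86400;   # older than a month
        $ttl = 7 * 86400 if $age > 182 * 86400;  # older than half a year

        $ttl += int(rand($ttl / 10));   # up to +10%, so pages don't all
        return $ttl;                    # need recalculating at once
    }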
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
@ 2006-12-11 3:40 linux
2006-12-11 9:30 ` Jakub Narebski
0 siblings, 1 reply; 82+ messages in thread
From: linux @ 2006-12-11 3:40 UTC (permalink / raw)
To: koreth; +Cc: git
>>> I posted separately about those. And I've been mulling about whether
>>> the thundering herd is really such a big problem that we need to
>>> address it head-on.
>>
>> Uhm... yes it is.
>
> Got some more info, discussion points or links to stuff I should read
> to appreciate why that is? I am trying to articulate why I consider it
> is not a high-payoff task, as well as describing how to tackle it.
>
> To recap, the reasons it is not high payoff is that:
>
> - the main benefit comes from being cacheable and able to revalidate
> the cache cheaply (with the ETags-based strategy discussed above)
> - highly distributed caches/proxies means we'll seldom see a true
> cold cache situation
> - we have a huge set of URLs which are seldom hit, and will never see
> a thundering anything
> - we have a tiny set of very popular URLs that are the key target for
> the thundering herd - (projects page, summary page, shortlog, full log)
> - but those are in the clear as soon as the caches are populated
>
> Why do we have to take it head-on? :-)
I think I agree with you, but not as strongly. Certainly, having any
kind of effective caching (heck, just comparing the timestamp of the
relevant ref(s) with the If-Modified-Since: header) will help kernel.org
enormously.
But as soon as there's a push, particularly a release push, that
invalidates *all* of the popular pages *and* the thundering herd arrives.
The result is that all of the popular "what's new?" summary pages get
fetched 15 times in parallel and, because the front end doesn't serialize
them, populating the caches can be a painful process involving a lot of
repeated work.
I tend to agree that for the basic project summary pages, generating them
preemptively as static pages out of the push script seems best.
("find /usr/src/linux -type d -print | wc -l" is 1492. Dear me.
Oh! There is no per-directory shortlog page; that simplifies things.
But there *should* be.)
The only tricky thing is the "n minutes/hours/days ago" timestamps.
Basically, you want to generate a half-formatted, indefinitely-cacheable
page that contains them as absolute timestamps, and have a system for
regenerating the fully-formatted page from that (and the current time).
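(A sketch of that half-formatted approach: cache pages with absolute epoch timestamps wrapped in a marker, and substitute the human-readable form at serve time. The marker syntax and helper names here are invented for illustration.)

    use strict;
    use warnings;

    # The cached page carries markers like <!--reldate:1165795200-->;
    # filling them in is cheap compared to regenerating the whole page.
    sub render_relative_dates {
        my ($half_formatted, $now) = @_;
        $half_formatted =~ s{<!--reldate:(\d+)-->}{age_string($now - $1)}ge;
        return $half_formatted;
    }

    sub age_string {
        my ($secs) = @_;
        return int($secs / 86400) . ' days ago'  if $secs >= 2 * 86400;
        return int($secs / 3600)  . ' hours ago' if $secs >= 2 * 3600;
        return int($secs / 60)    . ' min ago'   if $secs >= 2 * 60;
        return 'right now';
    }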
The ideas that people have been posting seem excellent. Give a page
two timeouts. If a GET arrives before the first timeout, and no
prerequisites have changed, it's served directly from cache. If it
arrives after the second timeout, or the prerequisites have changed,
it blocks until the page is regenerated. But if it arrives between
those two times, it serves the stale data and starts generating fresh
data in the background.
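(That two-timeout decision, written out as a sketch; the soft/hard expiry fields and the background-refresh mechanism are assumptions for illustration, not a prescription.)

    # Returns 'fresh' (serve the cached copy), 'stale-serve' (serve the
    # old copy and refresh in the background), or 'regenerate' (block).
    sub cache_decision {
        my ($entry, $now, $prereqs_changed) = @_;
        return 'regenerate'  if !$entry || $prereqs_changed;
        return 'fresh'       if $now < $entry->{soft_expiry};
        return 'stale-serve' if $now < $entry->{hard_expiry};
        return 'regenerate';
    }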
So for the fully-formed timestamps, the first timeout is when the next
human-readable timestamp on the page ticks over. But the second timeout
can be past that by, say, 5% of the timeout value. It's okay to display
"3 hours ago" until 12 minutes past the 4 hour mark.
It might be okay to allow even the prerequisites to be slightly stale when
serving old data; it's okay if it takes 30 seconds for the kernel.org
web page to notice that Linus pushed. But on my office gitweb, I'm not
sure that it's okay to take 30 seconds to notice that *I* just pushed.
(I'm also not sure about consistency issues. If I link from one page
that shows the new release to another, it would be a bit disconcerting
if it disappeared.)
The nasty problem with built-in caching is that you need a whole cache
reclaim infrastructure; it would be so much nicer to let Squid deal
with that whole mess. But it can't deal with anything other than fully
formed pages.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: kernel.org mirroring (Re: [GIT PULL] MMC update) 2006-12-11 3:40 linux @ 2006-12-11 9:30 ` Jakub Narebski 0 siblings, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2006-12-11 9:30 UTC (permalink / raw) To: git Side note: I wonder why this mail is not attached to the rest of the thread in the gmane.comp.version-control.git news/Usenet GMane mirror of the git mailing list. linux@horizon.com wrote: > Oh! There is no per-directory shortlog page; that simplifies things. > But there *should* be.) There is. It is called "history" (and currently we have only a shortlog-like view for history and no log-like view). > The only tricky thing is the "n minutes/hours/days ago" timestamps. > Basically, you want to generate a half-formatted, indefinitely-cacheable > page that contains them as absolute timestamps, and have a system for > regenerating the fully-formatted page from that (and the current time). Or use ECMAScript (JavaScript) for that. I plan (if this feature is requested) to make it a %feature. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread