From: Rogan Dawes
To: Jakub Narebski
Cc: "H. Peter Anvin", Linus Torvalds, Kernel Org Admin,
 Git Mailing List, Petr Baudis
Subject: Re: kernel.org mirroring (Re: [GIT PULL] MMC update)
Date: Fri, 08 Dec 2006 16:31:56 +0200
Message-ID: <4579775C.2010608@dawes.za.net>
In-Reply-To: <200612081438.25493.jnareb@gmail.com>

Jakub Narebski wrote:
> On Friday, 8 December 2006, at 13:57, Rogan Dawes wrote:
>> How about extending gitweb to check whether there already exists a
>> cached version of these pages, before recreating them?
>>
>> e.g. structure the temp dir in such a way that each project has a
>> place for cached pages. Then, before performing expensive operations,
>> check whether a file corresponding to the requested page already
>> exists. If it does, simply return the contents of the file; otherwise
>> go ahead and create the page dynamically, and return it to the user.
>> Do not create cached pages in gitweb dynamically.
>
> This would add the need for a directory for temporary files... well,
> it would be optional now...

It would still be optional. If the "cache" directory structure exists,
then use it; otherwise, continue as usual. All it would cost is a stat()
or two, I guess.

>> Then, in a post-update hook, for each of the expensive pages, invoke
>> something like:
>>
>> # delete the cached copy of the file, to force gitweb to recreate it
>> rm -f $git_temp/$project/rss
>> # get gitweb to recreate the page appropriately;
>> # use a tmp file to prevent gitweb from getting confused
>> wget -O $git_temp/$project/rss.tmp \
>>     "http://kernel.org/gitweb.cgi?p=$project;a=rss"
>> # move the tmp file into place
>> mv $git_temp/$project/rss.tmp $git_temp/$project/rss
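(For concreteness, a fuller version of that hook might look like the
sketch below. The cache root, project path and gitweb URL are just
made-up placeholders, and a real hook would also want to URL-escape
$project before putting it in the query string.)

#!/bin/sh
# hypothetical post-update hook: refresh the cached gitweb pages
cache_root=/var/cache/gitweb              # assumed cache location
project=mygroup/myproject.git             # assumed project path
gitweb_url=http://kernel.org/gitweb.cgi

mkdir -p "$cache_root/$project"
for action in rss shortlog summary; do
    # delete the cached copy, to force gitweb to recreate the page
    rm -f "$cache_root/$project/$action"
    # fetch the freshly generated page into a tmp file, so gitweb
    # never serves a half-written cache entry
    wget -q -O "$cache_root/$project/$action.tmp" \
        "$gitweb_url?p=$project;a=$action"
    # move the complete file into place
    mv "$cache_root/$project/$action.tmp" "$cache_root/$project/$action"
done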
>
> Good idea... although there are some page views which shouldn't change
> at all... well, with the possible exception of changes in gitweb
> output, and even then there are some (blob_plain and snapshot views)
> which don't change at all.
>
> It would be good to avoid removing them on push, and only remove
> them using some tmpwatch-like removal.

Well, my theory was that we would only cache pages that change when new
data enters the repo. So, using the push as the trigger is almost
guaranteed to be the right thing to do. New data indicates new rss
items, indicates an updated shortlog page, etc.

NOTE: This caching could be problematic for the "changed 2 hours ago"
notation for various branches/files, etc. But however we implement the
caching, we'd have this problem.

>> This way, we get the exact output returned from the usual gitweb
>> invocation, but we can now cache the result, and only update it when
>> there is a new commit that would affect the page output.
>>
>> This would also not affect those who do not wish to use this
>> mechanism. If the file does not exist, gitweb.cgi will simply revert
>> to its usual behaviour.
>
> Good idea. Perhaps I should add it to the gitweb TODO file.
>
> Hmmm... perhaps it is time for the next "[RFC] gitweb wishlist and
> TODO list" thread?
>
>> Possible complications are the content-type headers, etc., but you
>> could use the -s flag to wget, and store the server headers as well
>> in the file, and get the necessary headers from the file as you
>> stream it.
>>
>> i.e. read the headers looking for ones that are "interesting"
>> (Content-Type, charset, Expires) until you get a blank line, print
>> out the interesting headers using $cgi->header(), then just dump the
>> remainder of the file to the caller via stdout.
>
> No need for that. $cgi->header() is to _generate_ the headers, so if
> a file is saved with headers, we can just dump it to STDOUT; the
> possible exception is a need to rewrite the 'Expires' header, if it
> is used.

Good point. I guess one thing that will be incorrect in the headers is
the server date, but I doubt that anyone cares much. As you say, though,
this might relate to the expiry of cached content in upstream caches.

> Perhaps gitweb should generate its own ETag instead of messing with
> the 'Expires' header?

Well, we could possibly eliminate the Expires header entirely for
dynamic pages, check the If-Modified-Since value against the timestamp
of the cached file (or the server date in the cached file), and return
"304 Not Modified" responses. That would also help to reduce the load
on the server, by only returning the headers, and not the entire
response.

The downside is that it would prevent upstream proxies from caching
this data for us.

Regards,
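P.S. To make the header handling concrete: if the hook saves pages with
wget's -s/--save-headers option, each cache file starts with the HTTP
status line and headers, then a blank line, then the body, so the
serving side only has to turn the status line into a CGI "Status:"
header and dump the rest. Below is a rough stand-alone sketch of that
logic; the real change would of course live inside gitweb.cgi itself,
the cache path is a made-up example, and parsing If-Modified-Since this
way assumes GNU date.

#!/bin/sh
# hypothetical sketch of the serving side of the page cache
cache_file=/var/cache/gitweb/mygroup/myproject.git/rss  # assumed path

if [ -f "$cache_file" ]; then
    if [ -n "$HTTP_IF_MODIFIED_SINCE" ]; then
        # convert the client's date and the file's mtime to epoch
        # seconds so they can be compared numerically (GNU date)
        since=$(date -d "$HTTP_IF_MODIFIED_SINCE" +%s 2>/dev/null)
        mtime=$(date -r "$cache_file" +%s)
        if [ -n "$since" ] && [ "$mtime" -le "$since" ]; then
            # nothing has changed since the client last fetched it
            printf 'Status: 304 Not Modified\r\n\r\n'
            exit 0
        fi
    fi
    # rewrite the saved "HTTP/1.x 200 OK" line into a CGI "Status:"
    # header, and pass the remaining headers and body through as-is
    sed '1s|^HTTP/[0-9.]* |Status: |' "$cache_file"
    exit 0
fi
# otherwise fall through to generating the page dynamically, as usual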