From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jakub Narebski Subject: [RFC] Implementing gitweb output caching - issues to solve Date: Thu, 4 Nov 2010 18:21:51 +0200 Message-ID: <201011041721.53371.jnareb@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-2 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: John 'Warthog9' Hawley , John 'Warthog9' Hawley , Junio C Hamano , Petr Baudis , admin@repo.or.cz To: git@vger.kernel.org X-From: git-owner@vger.kernel.org Thu Nov 04 17:22:18 2010 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1PE2Zr-0007ar-FW for gcvg-git-2@lo.gmane.org; Thu, 04 Nov 2010 17:22:15 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750987Ab0KDQWI convert rfc822-to-quoted-printable (ORCPT ); Thu, 4 Nov 2010 12:22:08 -0400 Received: from mail-bw0-f46.google.com ([209.85.214.46]:35944 "EHLO mail-bw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750953Ab0KDQWH (ORCPT ); Thu, 4 Nov 2010 12:22:07 -0400 Received: by bwz11 with SMTP id 11so1814088bwz.19 for ; Thu, 04 Nov 2010 09:22:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:from:to:subject:date :user-agent:cc:mime-version:content-type:content-transfer-encoding :content-disposition:message-id; bh=NRzWFmTkGfuNj7qizbmB1x4EQo8Dm5DkFV22jIwLzjk=; b=wOf5z5h3KJgUBwK9wQ6DzFcpcMXtl3VzNIBDLyuPSPTFgsyW6E2ASFRO6wuAsKOvtA UC8PZYCSkGZVEof0jxTryYyThAKCBtLzlJhE+kujO2+N/CWvXCwN8GDnMvKMmDfvH+rQ r48z+odXFHmsXstqTJQDN6ZX322lpbjOeDszg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=from:to:subject:date:user-agent:cc:mime-version:content-type :content-transfer-encoding:content-disposition:message-id; b=LPwlSz03NJAcU5kJyl+5qfmpxPRrvC8NVTpYqsJ8JwVI/1JcxCyWyyMJxt9GOXqtSd 9K34ckQ+cqE0G4y5zzhnJf4ewnRgzxMdosLUNztdJdnMOHp5h+Zj5Ea884EqIHq5TCxR ZH5xh6GxiDu+QxtfRCTwkZd7KYnOznNUeZFUY= Received: by 10.204.113.74 with SMTP id z10mr832812bkp.25.1288887725369; Thu, 04 Nov 2010 09:22:05 -0700 (PDT) Received: from [192.168.1.13] (abvz101.neoplus.adsl.tpnet.pl [83.8.223.101]) by mx.google.com with ESMTPS id v25sm109449bkt.6.2010.11.04.09.22.02 (version=TLSv1/SSLv3 cipher=RC4-MD5); Thu, 04 Nov 2010 09:22:03 -0700 (PDT) User-Agent: KMail/1.9.3 Content-Disposition: inline Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: =46irst, what is the status of gitweb output caching? I have send mini= mal fixup for latest John 'Warthog9' Hawley series as Gitweb caching v7.1[1= ]. John said that he can work on fixes on November 2 at earliest. [1] http://thread.gmane.org/gmane.comp.version-control.git/160147/focus= =3D160473 During working on my rewrite of gitweb output caching, based on ideas (and some code) from John 'Warthog9' Hawley (J.H.) series I have met with the following issues: 1. Error handling There are three separate places where we can encounter error during caching. They are: A. setting up cache (initialization) B. retrieving data from cache C. storing (saving) data in cache. There are two possible choices for errors during cache initializatio= n (case A): * turn off caching * show 500 Internal Server Error The problem with turning off caching is that it hides an error, and admin can not notice that caching does not work. Therefore I think that better solution would be to generate error, especially that suc= h errors are not, usually, incidental. Note that in my opinion cache should not be initialized if caching i= s disabled. =20 The modern CHI caching interface provides a way to change how runtim= e errors are handled via 'on_get_error' and 'on_set_error' parameters = to constructor. The default for CHI is to log error, or ignore if no logger is set. You can provide a coderef (callback). For errors during getting data from cache (case B), we can ignore er= ror which means generating data as if cache was disabled/entry did not exist in cache, or erroring out and showing error to user. Such err= ors should be rare, so erroring out might be appropriate reaction. Note that we have special case of zero sized cache entry file, in which c= ase we treat it as non-existent entry, already. For errors during setting (saving) data to cache (case C), we can ignore error and not re-generate cache (not even calling callback if error is detected early), or error out. Such errors can happen for example if filesystem gets full. It might be better to ignore such errore, at least in the case of ENOSPC / ENOMEM. 2. Progress info and not caching error pages J.H. find out or invented very nice hack to generate progress info page without Ajax (and therefore without need for JavaScript), to ma= ke user aware that data is being regenerated, and that he/she should be patient instead of refreshing or complaining. This trick uses the fact that the internal redirect using meta refresh[2] does not occur untill the web browser gets full page (which means th= at connection get closed / end of request). [2] http://en.wikipedia.org/wiki/Meta_refresh Besides the fact that, as J.H. wrote JH> There are still a few known "issues" with respect to this: JH> - Code needs to be added to be "browser" aware so JH> that clients like wget that are trying to get a JH> binary blob don't obtain a "Generating..." page there is additional problem, at least in my rewrite. (I didn't foll= ow the code flow in J.H. v7 patch series to check if it also affected). Let's assume that 'progress_info' feature is turned on, and that cac= he is generated in background. Let's assume that error pages (die_erro= r) are not cached. Now client tries to access page which would/will result in an error. With caching it acquires exclusive (writers) lock, check that there are no stale data (error pages are never cached), checks that it is provided with 'generating_info' subroutine, so it forks a process to generate data in background. Background process detaches, tries to generate data to cache, die_error is called. die_error turns off capturing, and prints to STDOUT. At the end of die_error there is jump to outer scope to DONE_GITWEB; therefore $lock_fh goes out of scope and lockfile is closed, and lock released. Background process finishes. Parent process runs 'generating_info' subroutine. Now if it waits a bit like in my rewrite before employing trick mentioned above, and does not print anything if lockfile is released before startup delay= , _and_ die_error finishes within this delay, then everything is all right: the error message is sent to client. If die_error doesn't finish before git_generating_data_html prints meta refresh, or there is no startup delay, then error pages would g= et infinite redirects (remember: there is never anything in cache for error pages). This is a bad thing. One possible solution for this problem (beside employing startup delay) is to have tee output, i.e. print it as it is being captured. Streamed (partial) response would serve as progress indicator for process (re)generating cache; only parallel processes waiting for cache would show 'generating_info'. I think that in current implementation of capturing it would be as simple as not closing STDOUT, but I'd have to check that. 3. Using generic cache engine and memory consumption Most Perl caching interfaces support only $cache->set($key, $data), where $data is a Perl variable, and $data =3D $cache->get($key), or their equivalents. Even for file-based cache drivers you save from memory and read into memory. =20 The only exception I know of is Cache interface with Cache::File driver, that provides $cache->handle($key [, $mode]), where optional $mode argument can be any of mode strings that can be used in 'open' function, or one of fopen(3) modes. (Of course for memory-based or network-based (like memcached) caches it might not make sense to provide such interface). The most often recommended capture module, Capture::Tiny, allows onl= y capturing into scalar (into memory)=20 $stdout =3D capture \&code; or (using its prototype) =20 $stdout =3D capture { }; Well, it is *::Tiny, so it supports minimal API. From the list of different capture modules, that allow capturing of Perl code output, with different interfaces (API), in "See also" section of Capture::Tiny documentation[3], only IO::CaptureOutput allow capturi= ng into specified file: capture \&code, \$stdout, undef, $outfile; [3] http://p3rl.org/Capture::Tiny J.H.'s gitweb output caching v7 captures output directly ito cache files. The problem with doing it in my rewrite is to allow capturin= g directly into cache entry file without losing ability to select different caching engine, which might be not file-based (like e.g. memcached-based). P.S. You can always check out my latest work on gitweb output caching rewrite here (warning: rebased!) git://repo.or.cz/git/jnareb-git.git gitweb/cache-kernel-pu (it is also available in git://github.com/jnareb/git.git) --=20 Jakub Nar=EAbski Poland