From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: John 'Warthog9' Hawley <warthog9@kernel.org>,
John 'Warthog9' Hawley <warthog9@eaglescrag.net>,
Junio C Hamano <gitster@pobox.com>, Petr Baudis <pasky@ucw.cz>,
admin@repo.or.cz
Subject: [RFC] Implementing gitweb output caching - issues to solve
Date: Thu, 4 Nov 2010 18:21:51 +0200
Message-ID: <201011041721.53371.jnareb@gmail.com>
First, what is the status of gitweb output caching? I have sent a minimal
fixup for the latest John 'Warthog9' Hawley series as Gitweb caching v7.1 [1].
John said that he could work on fixes on November 2 at the earliest.
[1] http://thread.gmane.org/gmane.comp.version-control.git/160147/focus=160473
While working on my rewrite of gitweb output caching, based on ideas
(and some code) from John 'Warthog9' Hawley's (J.H.) series, I have run
into the following issues:
1. Error handling
There are three separate places where we can encounter an error during
caching. They are:
A. setting up cache (initialization)
B. retrieving data from cache
C. storing (saving) data in cache.
There are two possible choices for errors during cache initialization
(case A):
* turn off caching
* show 500 Internal Server Error
The problem with turning off caching is that it hides the error, so the
admin may not notice that caching does not work. Therefore I think the
better solution is to generate an error, especially since such errors
are usually not incidental.
Note that in my opinion the cache should not be initialized at all if
caching is disabled.
The modern CHI caching interface provides a way to change how runtime
errors are handled via the 'on_get_error' and 'on_set_error' parameters
to its constructor. The default for CHI is to log the error, or to
ignore it if no logger is set. You can also provide a coderef (callback).
For errors while getting data from the cache (case B), we can either
ignore the error, which means generating data as if caching were
disabled or the entry did not exist in the cache, or error out and show
the error to the user. Such errors should be rare, so erroring out
might be an appropriate reaction. Note that we already have a special
case for a zero-sized cache entry file, which we treat as a
non-existent entry.
For errors while setting (saving) data to the cache (case C), we can
either ignore the error and not regenerate the cache entry (not even
calling the callback if the error is detected early), or error out.
Such errors can happen, for example, if the filesystem gets full. It
might be better to ignore such errors, at least in the case of
ENOSPC / ENOMEM.
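To make the options above concrete, here is a rough sketch of how the
CHI error callbacks could be wired up. The policy shown (error out on
read errors, silently ignore a full filesystem on writes) is only one of
the possible choices discussed above, and the root_dir is made up:

    use strict;
    use warnings;
    use CHI;

    my $cache = CHI->new(
        driver   => 'File',
        root_dir => '/var/cache/gitweb',      # example path only
        # callback gets an error message, the key, and the original error
        on_get_error => sub {
            my ($msg, $key, $err) = @_;
            # case B: treat read errors as fatal (500 Internal Server Error)
            die "cache read for '$key' failed: $msg\n";
        },
        on_set_error => sub {
            my ($msg, $key, $err) = @_;
            # case C: ignore "disk full" / "out of memory", report the rest
            return if $msg =~ /ENOSPC|ENOMEM|No space left on device|Cannot allocate memory/;
            warn "cache write for '$key' failed: $msg\n";
        },
    );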
2. Progress info and not caching error pages
J.H. found (or invented) a very nice hack to generate a progress info
page without Ajax (and therefore without the need for JavaScript), to
make the user aware that data is being regenerated and that they should
be patient instead of refreshing or complaining.
This trick uses the fact that an internal redirect using a meta
refresh [2]
<meta http-equiv="refresh" content="0" />
does not occur until the web browser gets the full page (which means
the connection gets closed / the request ends).
[2] http://en.wikipedia.org/wiki/Meta_refresh
Besides the fact that, as J.H. wrote
JH> There are still a few known "issues" with respect to this:
JH> - Code needs to be added to be "browser" aware so
JH> that clients like wget that are trying to get a
JH> binary blob don't obtain a "Generating..." page
there is an additional problem, at least in my rewrite. (I didn't
follow the code flow in the J.H. v7 patch series to check whether it is
also affected.)
Let's assume that the 'progress_info' feature is turned on and that the
cache is generated in the background. Let's also assume that error
pages (die_error) are not cached.
Now a client tries to access a page which will result in an error.
With caching, gitweb acquires an exclusive (writer's) lock, checks that
there is no stale data (error pages are never cached), checks that it
is provided with a 'generating_info' subroutine, and so forks a process
to generate the data in the background.
The background process detaches and tries to generate data for the
cache, and die_error is called. die_error turns off capturing and
prints to STDOUT. At the end of die_error there is a jump to the outer
scope at DONE_GITWEB; therefore $lock_fh goes out of scope, the
lockfile is closed, and the lock is released. The background process
finishes.
The parent process runs the 'generating_info' subroutine. Now if it
waits a bit, as in my rewrite, before employing the trick mentioned
above, and does not print anything if the lockfile is released before
the startup delay elapses, _and_ die_error finishes within this delay,
then everything is all right: the error message is sent to the client.
If die_error doesn't finish before git_generating_data_html prints the
meta refresh, or there is no startup delay, then error pages get
infinite redirects (remember: there is never anything in the cache for
error pages). This is a bad thing.
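For illustration, a rough sketch of the startup-delay idea (subroutine
and variable names are made up for this example; this is not the actual
gitweb code): wait a bit, and only emit the "Generating..." page with
its meta refresh if the process (re)generating the cache still holds
the lock. If the lock was released during the delay (e.g. because
die_error finished quickly), print nothing and let the real response
through.

    use strict;
    use warnings;
    use Fcntl qw(:flock);

    sub generating_info_sketch {
        my ($lockfile, $startup_delay) = @_;

        sleep $startup_delay;

        # if we can grab the lock, the (re)generating process is done
        open my $fh, '<', $lockfile or return;
        if (flock($fh, LOCK_EX | LOCK_NB)) {
            close $fh;
            return;                     # nothing to do; real output follows
        }
        close $fh;

        # cache is still being generated; assume HTTP headers were sent already
        print '<meta http-equiv="refresh" content="0" />', "\n";
        print "Generating...\n";
    }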
One possible solution to this problem (besides employing a startup
delay) is to tee the output, i.e. print it as it is being captured.
The streamed (partial) response would serve as a progress indicator for
the process (re)generating the cache; only parallel processes waiting
for the cache would show 'generating_info'.
I think that in the current implementation of capturing it would be as
simple as not closing STDOUT, but I'd have to check that.
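A minimal, self-contained sketch of what such teeing could look like,
using a hand-rolled tied handle (purely illustrative; not gitweb's
capture code, and the file name is made up):

    use strict;
    use warnings;

    package TeeOut;
    sub TIEHANDLE {
        my ($class, $cache_fh, $real_out) = @_;
        return bless { cache => $cache_fh, out => $real_out }, $class;
    }
    sub PRINT {
        my $self = shift;
        print {$self->{out}}   @_;   # to the client, as it is generated
        print {$self->{cache}} @_;   # and into the cache entry file
    }
    sub PRINTF { my $self = shift; my $fmt = shift; $self->PRINT(sprintf $fmt, @_) }
    sub CLOSE  { my $self = shift; close $self->{cache} }

    package main;

    open my $cache_fh, '>', 'cache-entry.tmp' or die "open: $!";
    open my $real_out, '>&', \*STDOUT         or die "dup: $!";  # keep real STDOUT
    tie *STDOUT, 'TeeOut', $cache_fh, $real_out;

    print "this line goes both to the client and into the cache entry\n";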
3. Using generic cache engine and memory consumption
Most Perl caching interfaces support only $cache->set($key, $data),
where $data is a Perl variable, and $data = $cache->get($key), or
their equivalents. Even for file-based cache drivers you write from
memory and read back into memory.
The only exception I know of is the Cache interface with the
Cache::File driver, which provides $cache->handle($key [, $mode]),
where the optional $mode argument can be any of the mode strings that
can be used in the 'open' function, or one of the fopen(3) modes. (Of
course, for memory-based or network-based caches (like memcached) it
might not make sense to provide such an interface.)
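As an illustration of that handle-based interface (a sketch only; the
cache root and key are made up, and details may differ from the actual
Cache::File behaviour), the entry can be written and read through a
filehandle, so the page never has to be held in memory as one scalar:

    use strict;
    use warnings;
    use Cache::File;

    my $cache = Cache::File->new(cache_root => '/var/cache/gitweb');

    # write the entry through a handle; '>' is an open()-style mode string
    my $fh = $cache->handle('summary.html', '>');
    print {$fh} "<html>...generated page...</html>\n";
    close $fh;

    # read it back, again through a handle
    $fh = $cache->handle('summary.html', '<');
    print while <$fh>;
    close $fh;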
The most often recommended capture module, Capture::Tiny, allows
capturing only into a scalar (i.e. into memory):
$stdout = capture \&code;
or (using its prototype)
$stdout = capture { <code> };
Well, it is *::Tiny, so it supports a minimal API. Of the capture
modules for Perl code output listed (with their different interfaces)
in the "See also" section of the Capture::Tiny documentation [3], only
IO::CaptureOutput allows capturing into a specified file:
capture \&code, \$stdout, undef, $outfile;
[3] http://p3rl.org/Capture::Tiny
J.H.'s gitweb output caching v7 captures output directly into cache
files. The problem with doing this in my rewrite is allowing capture
directly into the cache entry file without losing the ability to select
a different caching engine, which might not be file-based
(e.g. memcached-based).
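One way I am considering to reconcile the two (purely a sketch;
cache_output() is a made-up name, and the assumption that a file-backed
engine exposes a Cache-style handle() method is mine): capture straight
into the cache entry's filehandle when the engine can provide one, and
fall back to in-memory capture plus set() otherwise.

    use strict;
    use warnings;
    use Capture::Tiny qw(capture);

    sub cache_output {
        my ($cache, $key, $generate) = @_;

        if ($cache->can('handle')) {
            # file-based engine: stream output into the cache entry
            my $fh = $cache->handle($key, '>');
            my $saved = select $fh;      # send plain print()s to the cache file
            $generate->();
            select $saved;
            close $fh;
        } else {
            # generic engine (e.g. memcached): whole page lives in memory here
            my ($output) = capture { $generate->() };
            $cache->set($key, $output);
        }
    }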
P.S. You can always check out my latest work on the gitweb output
caching rewrite here (warning: rebased!)
git://repo.or.cz/git/jnareb-git.git gitweb/cache-kernel-pu
(it is also available in git://github.com/jnareb/git.git)
--
Jakub Narębski
Poland