* [RFD] Gitweb caching, part 1 (long)
From: Jakub Narebski @ 2008-03-19  0:54 UTC
  To: git; +Cc: Petr Baudis, J.H., Frank Lichtenheld

[Please Cc: me directly, as I am not subscribed to the git mailing
 list, and the GMane NNTP (news) interface I use doesn't currently
 show any new posts; I wouldn't want to miss any response.  Thanks in
 advance.]


This post presents my ideas on how to implement caching in gitweb, my
thoughts on what the problems are, and what solutions (what code) we
can use.

From what I remember of the discussion about gitweb performance and
bottlenecks on the git mailing list, the main culprits are the projects
list page (which perhaps should be redesigned) and the summary pages of
some of the larger / more popular projects.  Gitweb performance is I/O
bound, not CPU bound, so I guess not all existing caching solutions and
ideas would work for gitweb.

There are some difficulties with adding generic (as opposed to
site-specific) caching to gitweb.  First, gitweb should work both under
mod_perl and as a CGI (perhaps in the future also FastCGI) script.
Second, the solution should not depend on additional packages, at least
not on those that cannot be found packaged in extras or well-trusted
contrib repositories; not all admins allow installing packages from
CPAN.  Third, the solution could be helped by, but should not depend
on, adding helper hooks to users' repositories; while hosting sites
like repo.or.cz control their repositories, sites like kernel.org or
freedesktop.org, which give shell access, do not.


Let us talk first about what to cache.

1. Support for caching in HTTP (HTTP accelerators, caching engines)

My first idea for adding caching support to gitweb was for it to
generate proper "freshness" caching headers (Last-Modified: and ETag:)
and respond to cache validation requests (If-Modified-Since:,
If-None-Match:, etc.), letting a reverse proxy, a.k.a. caching engine,
a.k.a. web/HTTP accelerator (e.g. Varnish or Squid), take care of the
caching.  It is better to use an existing solution, isn't it?

Unfortunately, to generate the date checked against
If-Modified-Since:, or the entity tag checked against If-None-Match:,
gitweb must do hard work; perhaps not as much as generating the whole
page in terms of CPU, but almost the same in terms of I/O.  So it is
not that simple...

Nevertheless, even if using a reverse proxy for gitweb caching is not
so simple, gitweb should play well with the caching support in the
HTTP protocol, so that pages can be cached between gitweb and the
user, either in one of the intermediate caches, in a proxy server with
caching support, or in the browser cache.  Currently gitweb uses the
'Expires:' header with an expiry of 24h / 1d (IIRC the cutoff time for
caches; also IIRC 'forever' is half a year) for pages which we know
will not change (a full SHA-1 identifier used and all required
information filled in).  We should probably generate Last-Modified:
and/or ETag: where possible.

However, if gitweb has some kind of internal caching turned on, it can
respond properly to validation requests at low cost.  That way some of
the requests would be handled by intermediate caches, so gitweb
wouldn't even have to access its cache to return an answer.  But IMHO
that is a secondary concern: it could help, but it isn't possible to
do well without in-gitweb caching (as far as I can see).
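
For illustration, here is a minimal sketch (not existing gitweb code)
of answering a validation request from a CGI script; $page_mtime is a
hypothetical value that would come from gitweb's cache metadata, and
HTTP::Date comes from libwww-perl, not core Perl:

  use CGI qw(header);
  use HTTP::Date qw(str2time time2str);

  my $page_mtime = $^T;  # placeholder: would come from cache metadata

  # If the client's copy is still fresh, send "304 Not Modified" and
  # skip generating the page body entirely.
  my $ims = $ENV{HTTP_IF_MODIFIED_SINCE};
  if (defined $ims and (str2time($ims) || 0) >= $page_mtime) {
      print header(-status => '304 Not Modified');
      exit 0;
  }

  # Otherwise generate the page, sending Last-Modified: so the client
  # can validate its copy next time.
  print header(-status        => '200 OK',
               -last_modified => time2str($page_mtime));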

BTW, besides optionally marking a result as being retrieved from cache
("stale" or "cached"), I think gitweb should also send an appropriate
Warning: header; see sections 13.1.2 and 14.46 of RFC 2616, e.g.
  Warning: 110 git.kernel.org "Response is stale"

References:
* "Caching Tutorial for Web Authors"
  http://www.mnot.net/cache_docs/
* HTTP 1.1 Specification (RFC 2616)
  http://www.ietf.org/rfc/rfc2616.txt


2. Caching Perl structures

One of the solutions (used for example by Petr 'Pasky' Baudis in his
latest post about caching projects list info in gitweb) is to cache
(save) the Perl structures containing the information needed to
generate the response (web page).  The other solution, discussed
below, would be to cache the generated output, i.e. the web page,
optionally together with (some) HTTP headers.

The advantage of storing Perl structures (raw data) in the cache is
that the same data can be reused for different pages (e.g. for paging
the projects list, if/when that gets implemented), for the same page
with a varying part (e.g. the content type being text/html or
application/xhtml+xml depending on what the web browser prefers, or
output transparently compressed via Transfer-Encoding: depending on
the web browser's capabilities), and for replying to cache validation
requests.  Additionally, we can generate web pages with correct
relative time info (e.g. "5 minutes ago").  Not that everything except
the first and the last is impossible when caching the [final] output,
but it would be, I think, much harder...

The disadvantage is that we have to decide what format to use for
serializing the data, i.e. how to represent compound, complex data as
a stream of bytes in the cache... unless of course gitweb relies on
one of the already existing caching solutions, which usually take care
of this problem for us: see the next section (in the next
installment).

Formats I was considering were:
 - Data::Dumper
 - Storable (binary)
 - YAML::Tiny (subset of YAML)
 - tiny gitconfig (subset of the ini-like gitconfig format)

2.1. Data::Dumper

One of the advantages of the Data::Dumper format is that the module
comes with the Perl installation, so there is no problem with
installing it (well, at least it comes with the perl-5.8.6-24 RPM on
Linux).  Another advantage is that it is a textual format, and thus
easy to debug in case of problems.

The main problem with Data::Dumper is that it requires eval() to thaw
(restore) data from the serialized form in the cache, which is a
serious security risk in less secure environments.
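
A minimal sketch of this approach (hypothetical helper names, not
gitweb code):

  use Data::Dumper;

  # Freeze: write the structure to the cache file as Perl source text.
  sub cache_store {
      my ($file, $data) = @_;
      open my $fh, '>', $file or die "Can't write '$file': $!";
      local $Data::Dumper::Terse = 1;  # dump bare structure, no '$VAR1 ='
      print {$fh} Dumper($data);
      close $fh;
  }

  # Thaw: eval() the file contents back into a structure.  This is the
  # security risk mentioned above: a tampered cache file gets executed
  # as arbitrary Perl code.
  sub cache_retrieve {
      my ($file) = @_;
      open my $fh, '<', $file or return undef;
      my $source = do { local $/; <$fh> };
      close $fh;
      return eval $source;
  }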

2.2. Storable

Storable also comes packaged with the Perl distribution.  It offers
writing to and restoring from a file, or to/from an opened file
handle.  It is fast: in my unscientific tests, around 3-4 times as
fast as using eval() to read Data::Dumper data.

One of the disadvantages is the fact that the Storable format is
binary, so you would have to write a separate Perl script to convert
it to a human-readable form (e.g. Data::Dumper form).  The format
includes a format and version header, and modern 'file' installations
should detect it correctly, e.g.:
  filename: perl Storable(v0.7) data (major 2) (minor 6)

Another nuisance is the fact that while the Storable(3pm) manpage
states that:

     The [retrieve()] routine returns "undef" for I/O problems or
     other internal error, a true value otherwise. Serious errors are
     propagated as a "die" exception.

the 'serious errors' include the file not being in the correct format,
so for safety (because a server-side script should return a page with
error info instead of dying silently) one would have to wrap the call
in "eval { ... }" to catch errors.

Storable is used by Cache::Cache and, I think, also by other caching
solutions (packages).
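
A minimal sketch, with the cache file path and the data structure as
hypothetical examples:

  use Storable qw(nstore retrieve);

  my $cache_file = '/var/cache/gitweb/projects_index';    # assumed path
  my %projects_info = ('repo.git' => { owner => 'me' });  # example data

  # nstore() writes in network byte order, so the cache file stays
  # readable even if the machine architecture changes.
  nstore(\%projects_info, $cache_file);

  # retrieve() dies on a corrupt or wrong-format file, hence the eval.
  my $data = eval { retrieve($cache_file) };
  if ($@ or not defined $data) {
      # Treat this as a cache miss: regenerate instead of dying silently.
  }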


2.3. YAML::Tiny (subset of YAML)

YAML was created as a human-readable serialization format that is easy
to parse by machine.  Unfortunately, none of the YAML parsing modules
(YAML, YAML::Syck, YAML::Tiny) is packaged with Perl; on the other
hand, they can be found in the Dries RPM repository, so I guess they
fulfill the criterion of being in an extras or trusted contrib package
repository.

Additionally, at least YAML::Tiny (which implements a subset of YAML
in pure Perl) is around 4-5 times slower than even Data::Dumper, and
more than 10 times slower than Storable.  This _might_ be caused by
the choice of module used to implement the YAML parsing.

We could write a parser (and generator) for an even smaller subset of
YAML, using only those features that are truly needed by gitweb... but
then we might as well go with the next format, which is also a text
format, and which also doesn't have the insecurity of using eval() to
thaw data (read it from the cache).

YAML was designed from the ground up to be an excellent syntax for
configuration files.  Not necessarily so for a cache.

References:
* http://en.wikipedia.org/wiki/YAML
* YAML::Tiny(3pm)
  http://search.cpan.org/dist/YAML-Tiny/lib/YAML/Tiny.pm
* YAML Ain't Markup Language (YAML^TM) Version 1.1, Working Draft
  http://yaml.org/spec/current.html

2.4. Tiny gitconfig (subset of the ini-like gitconfig format)

What we would want to cache is, I think, usually a list of records, or
in Perl terminology an array of hashes; usually the ordering of the
array doesn't matter.

Because of that, I think it would be possible to represent the data to
be saved (cached) in the ini-like extended git config format.  Then
gitweb could either (re)use the Perl config parser used by
git-cvsserver (which accepts a subset of the valid config format), or
run "git config --file=<path> -z -l" to slurp the data in a more
parseable format... but if we do that, we could choose this format, or
a variation of it, as our serialization format.

The cache file could look like this:

   [gitweb "<primary key value>"]
   	key1 = value1
   	key2 = "value with spaces"

where for the projects list info the primary key might be the path to
the repository (relative to the project root).
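
A minimal sketch (not gitweb code) of slurping such a cache file back
into a hash of records via "git config -z -l"; $cache_file is a
hypothetical path:

  my $cache_file = '/var/cache/gitweb/projects_index';  # assumed path

  my %records;
  open my $fh, '-|', 'git', 'config', "--file=$cache_file", '-z', '-l'
      or die "Can't run git config: $!";
  {
      local $/ = "\0";        # entries are NUL-terminated with -z
      while (my $entry = <$fh>) {
          chomp $entry;       # strip the trailing NUL
          # With -z, the key and the value are separated by a newline.
          my ($key, $value) = split /\n/, $entry, 2;
          # Keys look like gitweb.<primary key value>.<field>.
          if ($key =~ /^gitweb\.(.*)\.([^.]+)$/) {
              $records{$1}{$2} = $value;
          }
      }
  }
  close $fh;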


3. Caching output: formatted pages

The alternative to caching Perl structures is caching the final
output, with or without (some or all) HTTP headers.  It has the
advantage that it is simple to implement, and that the same code can
be used to cache all pages.  (But we could get a similar result by
creating something similar to Tie::Memoize, tying a hash or array so
that it automatically gets data either from a git command or from the
cache... or we could implement a universal API, like the Cache::Cache
API.)

From what I understand, this is what the kernel.org (warthog9) gitweb
uses; I don't know what cgit (the web interface in C), which also has
some caching support, uses: does it cache data or output?

How one can simply extend a CGI script with support for caching is
shown by CGI::Cache (a non-standard CPAN Perl module).  On the other
hand, gitweb can afford more extensive surgery.

References:
* CGI::Cache(3pm)
  http://search.cpan.org/~dcoppit/CGI-Cache-1.4200/lib/CGI/Cache.pm


To be continued...

%% .................................................................. %%
In next parts:

Cache lifetime and invalidation
1. static cache, external refreshing e.g. by hooks
2. stat and/or inotify, to check if repository changed
3. cache lifetime (trying to avoid the "thundering herd" problem)

CPAN packages we could use, or take inspiration from
1. Cache::Cache (standard)
2. CHI - unified cache interface
3. Cache
4. other (e.g. Cache::Adaptive, using Cache::Cache)

-- 
Jakub Narebski
Poland


* Re: [RFD] Gitweb caching, part 1 (long)
From: Frank Lichtenheld @ 2008-03-19  8:21 UTC
  To: Jakub Narebski; +Cc: git, Petr Baudis, J.H.

On Wed, Mar 19, 2008 at 01:54:53AM +0100, Jakub Narebski wrote:
> Because of that, I think it would be possible to represent the data
> to be saved (cached) in the ini-like extended git config format.
> Then gitweb could either (re)use the Perl config parser used by
> git-cvsserver (which accepts a subset of the valid config format), or
> run "git config --file=<path> -z -l" to slurp the data in a more
> parseable format... but if we do that, we could choose this format,
> or a variation of it, as our serialization format.

git-cvsserver uses "git config -l", too.

Greetings,
-- 
Frank Lichtenheld <frank@lichtenheld.de>
www: http://www.djpig.de/


* [RFD] Gitweb caching, part 2 (long)
From: Jakub Narebski @ 2008-03-25 17:06 UTC
  To: git; +Cc: Petr Baudis, J.H., Frank Lichtenheld, Lars Hjemli

In previous part:

What to cache.
1. Support for caching in HTTP (external caching)
2. Caching Perl structures (and serialization)
3. Caching gitweb output: formatted pages


TO WHOM IT MAY CONCERN: John 'Warthog9' Hawley (J.H.), who created the
caching for gitweb at kernel.org; Petr 'Pasky' Baudis, who maintains
the repo.or.cz fork of gitweb and lately added caching of the projects
list info; and Lars Hjemli, who is the author of cgit, the git web
interface in C, which includes some caching.  (BTW, I'd like to hear
your thoughts on git web interface caching, and about the solutions
you have implemented.)


This is a continuation of my thoughts about how to implement caching
in gitweb, what problems we could encounter, and what existing
solutions (what code / what packages) we can (re)use.


One of the more important issues to think about when implementing
caching is deciding when to regenerate the cache, i.e. the issues of
cache (in)validation and lifetime.


1. Static cache, external refreshing (invalidation).

The easiest situation is when the cache can be invalidated (removed)
externally; caching support in gitweb would then only need to use the
cached information or cached output if it exists, and to generate the
information and/or output (and cache the appropriate things) if it
does not.

In a closed git hosting system like repo.or.cz, new content can appear
in a repository only via push (if the repo is updated manually) or via
automated fetch (if the repo is mirrored automatically).  This means
that it is known when the information about a given repository gets
stale (out of sync).  It would then be enough to make an 'update' or
'post-receive' hook delete the cache, or invalidate the parts of the
cached info concerning the given repository, as sketched below.
Creation and deletion of repositories should also be handled by
scripts; they affect caching too.
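
A sketch of such a 'post-receive' hook; the cache layout (a $cache_dir
holding a shared projects-list index plus per-project entry files) is
an assumption for illustration, not existing code:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use File::Basename qw(basename);
  use Cwd qw(abs_path);

  my $cache_dir = '/var/cache/gitweb';  # assumed cache location
  my $project   = basename(abs_path($ENV{GIT_DIR} || '.'));  # e.g. 'repo.git'

  # The projects list page shows last-change info, so it is stale too.
  unlink "$cache_dir/projects_index";

  # Drop every cache entry belonging to this project.
  unlink glob "$cache_dir/$project.*";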

This of course assumes that we can control the repository hooks
(perhaps git should learn hook multiplexing first, as proposed some
time ago on the mailing list).  This is not the case when developers
are given shell access, and gitweb is offered as a part of a service
rather than as a part of git hosting; the repositories are not under
the web administrator's control.  According to J.H. on the git mailing
list, this is the case for kernel.org.

So we also have to examine more generic solutions.


2. Checking filesystem (stat and/or inotify).

If new objects come to a repository via commit or via fetch, it is
enough (I guess) to watch for modifications of the GIT_DIR of a
project (I think thanks to the atomic writes, via "create temporary
file, then rename it to the final filename", of files directly in
GIT_DIR: COMMIT_EDITMSG and FETCH_HEAD).  So it should be enough to
check and compare the stat info for the GIT_DIR of a project, or
possibly to implement some inotify-based checking (or an equivalent on
operating systems other than Linux), to see whether the cache can
contain stale info.  In practice what we can truly check is that
nothing changed in the repo.
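
A minimal sketch of the stat check; $cached_mtime is a hypothetical
value stored alongside the cache entry when it was generated:

  use File::stat;

  # Returns true if the repository may have changed since the cache
  # entry was written (or if we cannot tell).
  sub repo_maybe_changed {
      my ($git_dir, $cached_mtime) = @_;
      my $st = stat($git_dir) or return 1;  # can't stat: assume changed
      return $st->mtime > $cached_mtime;
  }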

Unfortunately, the above is not the case if objects come to the
repository via push.  Note that both a push resulting in creating a
pack (this, I think, could use the same mechanism, only checking the
GIT_DIR/objects/pack directory) and a push resulting in the creation
of loose objects have to be supported; additionally, the refs pushed
can have deeply hierarchical names.

I would be grateful if somebody could think of a way to check whether
anything could have changed in such a situation... but as it is now,
we have to move on to more complicated ways of cache invalidation.


3. Cache lifetime.

Finally, for cases such as gitweb, where validating the cache
(checking that the cached information isn't stale, i.e. out of sync
with reality) is almost as costly as calculating the whole information
without using the cache at all, there is one possible approach to
cache validation: simply keep the cache for some time.  For longer
cache lifetimes gitweb should perhaps add a notice that the
information comes from the cache, perhaps stating in human-readable
form how long ago it was generated (human-readable means no "1325
seconds ago" info ;-).  And if we want to be thorough, put it also in
the HTTP "Warning:" header (at least for HTTP/1.1; see sections 13.1.2
and 14.46 of RFC 2616), e.g.:

  Warning: 110 git.kernel.org "Response is stale"

The question is what timeout to use, i.e. how to choose the lifetime
of the cache.  J.H.'s kernel.org gitweb tries to adjust the cache
lifetime to the server load, making the cache lifetime longer when the
server load is higher, but ensuring that it stays within specified
bounds.

Among CPAN modules I have found Cache::Adaptive, where you can also
specify bounds for the expiry time, and a subroutine to adjust the
cache lifetime, e.g. according to the load average, the process time
for building the cache entry, etc. (it can use a specified backend,
for example Cache::FileCache from the Cache::Cache distribution).  Its
subclass Cache::Adaptive::ByLoad tries to adjust the cache lifetime
for bottlenecks under heavy load.  Neither of these modules, I think,
is distributed as a ready package in extras or trusted contrib package
repositories.  Nevertheless, we can "borrow" the algorithms used by
those modules.
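
A sketch of the basic idea (the bounds and the Linux-specific
/proc/loadavg source are assumptions, not kernel.org's actual code):

  # Scale the cache lifetime linearly between fixed bounds using the
  # 1-minute load average, clamped at $max_load.
  sub cache_lifetime {
      my ($min_life, $max_life, $max_load) = (60, 1200, 10.0);
      open my $fh, '<', '/proc/loadavg' or return $min_life;
      my ($load) = split ' ', scalar <$fh>;
      close $fh;
      $load = $max_load if $load > $max_load;
      return int($min_life + ($max_life - $min_life) * $load / $max_load);
  }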

We should also try to avoid the 'thundering herd' problem, namely that
the cache expires, gitweb gets N requests before the cache gets
re-created, and a [poorly designed] cache architecture makes all N do
the work of regenerating the cache.  There are several ideas for how
to deal with this problem:

 * If (a part of) the cache has expired, set its expiration time to
   the current time plus the specified duration (slop) needed to
   regenerate the cache.  This was used by the original Pasky solution
   (and is used by the later solutions for caching the projects list
   sent here); it can be done with CHI (the caching infrastructure)
   via the busy_lock option... well, kind of.

 * Use some kind of locking, so that only one process does the work
   and updates the cache; see the sketch after this list.  From what
   I've briefly checked, that is what the kernel.org gitweb does
   (using flock()).

   The patch implementing projects list info caching does protect,
   using O_EXCL on a temporary/lock file, against more than one
   process writing the cache, but unfortunately doesn't protect
   against more than one process doing the work.

 * Allow items to expire a little earlier than the stated expiration
   time, to help prevent cache miss stampedes.  This is what the CHI
   module does with the expires_variance option.

   The probability of expiration increases as a function of how far
   along we are in the potential expiration window, with the
   probability being near 0 at the beginning of the window and
   approaching 1 at the end.
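
A minimal sketch of the locking idea; cache_is_stale(), write_cache()
and read_cache() are hypothetical helpers:

  use Fcntl qw(:flock);

  # The first process to take the exclusive lock regenerates the
  # cache; the others block on flock() and then read the fresh file.
  sub get_or_regenerate {
      my ($cache_file, $builder) = @_;
      open my $lock, '>>', "$cache_file.lock" or die "Can't open lock: $!";
      flock($lock, LOCK_EX) or die "Can't take lock: $!";
      if (cache_is_stale($cache_file)) {        # hypothetical helper
          write_cache($cache_file, $builder->());
      }
      close $lock;  # releases the lock
      return read_cache($cache_file);
  }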

If the cache size becomes an issue, there will be additional
complications, like deciding which entries (which cached values) to
remove first when we go over the cache size limit; but let us leave
that for later, if it turns out to be needed at all.

%%
In next part:

CPAN packages we could use, or take inspiration from
1. Cache::Cache (standard)
2. CHI - Unified cache interface
3. Cache - the Cache interface 
4. other interesting packages
  * Cache::Adaptive for adaptive cache lifetime solutions
  * Cache::Memcached and/or Cache::Swifty
    for caching using cache daemon 
  * Cache::FastMmap (also example of callbacks),
    and caching benchmark mentioned there

-- 
Jakub Narebski
Poland


* [RFD] Gitweb caching, part 3: examining Perl modules for caching (long)
From: Jakub Narebski @ 2008-03-29 17:13 UTC
  To: git

[Hopefully this resend won't be stopped by the vger antispam filter]

In previous parts:

What to cache.
1. Support for caching in HTTP (external caching).
2. Caching Perl structures (and serialization).
3. Caching gitweb output: formatted pages.

Cache (in)validation and lifetime
1. Static cache, external refreshing (invalidation).
2. Checking filesystem (stat and/or inotify).
3. Cache lifetime (timing-out cached info).
4. LRU (Least Recently Used) and others <- to be written

In this part I will write about existing caching solutions, or to be
more exact about CPAN packages implementing caching or a cache
interface in Perl.

Note that not all sites can install packages from CPAN; on some, only
packages from the main package repository, from the extras repository,
or sometimes from a trusted contrib repository are possible (see
J.H.'s post about this problem).  Of the packages mentioned here, only
Cache::Cache and Cache::Mmap are available in contrib repositories for
the Aurox 11.1 distribution (based on Fedora Core 4): the Cache::Cache
distribution as perl-Cache-Cache in the Dries RPM repository, and
Cache::Mmap as perl-Cache-Mmap in both the Dries and Dag Wieers RPM
repositories.


1. Cache::Cache (standard)

This distribution implements Cache::MemoryCache,
Cache::SharedMemoryCache (using IPC::ShareLite), Cache::FileCache and
Cache::SizeAwareFileCache.  It is the standard: various other modules
often say that they implement the Cache::Cache interface.

It shows its age a bit, so various improvements exist, including CHI,
a unified cache interface (which can use the Cache::Cache modules, and
also other caching backends like Cache::FastMmap, in a unified way),
and Cache, the cache interface, which tries to improve on Cache::Cache
but is not yet complete.

Here is some sample code for instantiating and using a file-system
based cache (it uses Storable for serialization, IIRC):

  use Cache::FileCache;

  # Entries stored without an explicit expiry time last 15 minutes.
  my $cache = Cache::FileCache->new({default_expires_in => "15 minutes"});

  my $customer = $cache->get($name);

  if (not defined $customer) {
    # Cache miss (or expired entry): do the real work, cache the result.
    $customer = get_customer_from_db($name);
    $cache->set($name, $customer, "10 minutes");
  }

  return $customer;

The Cache::Cache distribution can be found (at least on RPM-based
Linux distributions) in the perl-Cache-Cache package (e.g. in the
Dries RPM repository).

Various other modules implement the Cache::Cache interface, for
example Cache::BerkeleyDB (compare with Cache::BDB).


2. CHI - Unified cache interface

CHI provides a unified caching API.  The CHI interface is implemented
by driver classes that support fetching, storing and clearing of data.

CHI is intended as an evolution (and successor) of DeWitt Clinton's
Cache::Cache package, adhering to the basic Cache API but adding new
features and addressing limitations in the Cache::Cache implementation.

The main goals of CHI were performance (minimizing method calls,
serializing data only when necessary) and making the creation of new
drivers as easy as possible.

The latter has led to wrapping the most popular caches available on
CPAN in the CHI interface (CHI handles serialization and expiration
times) as CHI::Driver::CacheCache, CHI::Driver::FastMmap and
CHI::Driver::Memcached.  "Native" CHI drivers include 'File' (one file
per entry), 'Memory' (per-process) and 'Multilevel' (two or more CHI
caches, e.g. memcached bolstered by a local memory cache).  'DBI' and
'BerkeleyDB' drivers are planned...

CHI provides expire_if [CODEREF] for an additional check whether the
cache has expired, busy_lock [DURATION] to set the expiration time to
the current time plus the specified duration once a value has expired,
preventing a "cache stampede", and expires_variance [FLOAT] to allow
items to expire a little earlier, preventing cache miss stampedes
(favored over busy_lock).  Even if gitweb doesn't use the CHI
interface directly, those ideas are worth considering.

In addition to the standard get() and set() methods, it implements a
compute() method which combines the get and set operations in a
single call.  It also has some methods to process multiple keys
and/or values at once.

Here is some sample code for instantiating and using a file-system
based cache:

    use CHI;

    # Choose a standard driver
    #
    my $cache = CHI->new(driver => 'File', 
                         cache_root => '/tmp/cache');

    # Basic cache operations
    #
    my $customer = $cache->get($name);
    if (!defined $customer) {
        $customer = get_customer_from_db($name);
        $cache->set($name, $customer, "10 minutes");
    }

    # or simply:
    $customer = $cache->compute($name, \&get_customer_from_db,
                                "10 minutes");


3. Cache - the Cache interface 

The Cache modules are a total redesign and reimplementation of
Cache::Cache, and are thus not directly compatible.  Contrary to
Cache::Cache, the get() and set() methods do not serialize complex
data types; you have to freeze() and thaw() the data explicitly
instead of using set/get.  You can get an IO::Handle by which data can
be read from, or written to, the cache, e.g. when using Cache::File.
There is no concept of a 'namespace' in the basic cache interface.
Purging is done automatically in the current implementation.

Currently only the Cache::File (a filesystem-based implementation;
could be done more efficiently, and currently supports only LOCK_NFS
locking) and Cache::Memory (a per-process, memory-based
implementation, with namespaces) drivers are implemented.

In the Cache modules one can select the removal strategy for the
cache.  By default the FIFO (First In First Out: remove the oldest
entry) and LRU (Least Recently Used: remove the least recently
accessed entry) strategies are available (used when the cache has a
size limit?).

The Cache modules provide a callback interface: a load_callback, to be
called whenever a get() is issued for data that does not exist in the
cache, and a validate_callback (for example storing and checking a
timestamp or similar).  This means that the sample code for
instantiating and using a file-system based cache can be written as
below.

  use Cache::File;

  my $cache = Cache::File->new(cache_root => '/tmp/cacheroot');
  $cache->set_load_callback(\&get_customer_from_db);

  # calls get_customer_from_db() if needed
  my $customer = $cache->get($name);

The Cache classes can be used via the tie interface, as shown below.
This allows the cache to be accessed via a hash.  All the standard
methods for accessing the hash are supported, with the exception
of the 'keys' or 'each' call.

  tie %hash, 'Cache::File', { cache_root => $tempdir };

  $hash{'key'} = 'some data';
  $data = $hash{'key'};

The tie interface is especially useful with the load_callback to
automatically populate the hash.


Even if gitweb doesn't use the Cache modules (perhaps because of their
lack of maturity, and/or the fact that they are not in an extras or
trusted contrib package repository), the idea of a selectable removal
strategy and the idea of a callback interface are worth considering;
perhaps even the tie interface.  Whether to serialize explicitly or
not... that also remains to be decided.


4. Other interesting caching packages

4.1. Cache::Adaptive for adaptive cache lifetime control

Cache::Adaptive is a cache engine with adaptive lifetime control.
Cache lifetimes can be increased or decreased by any factor, e.g. the
load average, the process time for building the cache entry, etc.  It
can use almost any Cache::Cache object as a backend (the update
algorithm needs a reliable set() method, so Cache::SizeAwareFileCache
cannot be used).

Cache::Adaptive::ByLoad is a subclass of Cache::Adaptive which adjusts
the cache lifetime by two factors: the load average of the platform
and the percentage of the total time spent by the builder.

Cache::Adaptive introduces an additional
  access({ key => $cache_key, builder => sub { ... } })
method, which returns the cached entry if possible, or builds the
entry by calling the builder function, optionally storing the built
entry in the cache.  Compare with the compute() method from CHI, or
with the callback interfaces.

It is worth examining (both the interface and the implementation)
if/when implementing cache lifetime control based on load average,
like the kernel.org gitweb tries to do.  The J.H. (kernel.org) fork of
gitweb uses a longer lifetime under heavier load (within specified
bounds).


4.2. Cache::Memcached and/or Cache::Swifty for caching using cache daemon

For larger installations, where caching is needed not only for gitweb,
it might be worth examining cache daemon solutions, like memcached
(and Cache::Memcached, Cache::Memcached::Fast or the CHI equivalent),
a distributed memory cache daemon; or swifty (and Cache::Swifty), a
very fast shared memory cache, currently in its early alpha stages.

The Cache::Memcached API, besides the set/get methods and
administrative methods, provides add() and replace() methods to set()
conditionally, depending on whether the value does not yet exist or
already exists in the cache.
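
In sketch form (assuming a local memcached on the default port; the
key and value are placeholders):

  use Cache::Memcached;

  my $memd = Cache::Memcached->new({servers => ['127.0.0.1:11211']});

  my ($key, $value) = ('projects_index', '...');  # placeholder data
  $memd->add($key, $value);      # store only if the key does not exist
  $memd->replace($key, $value);  # store only if the key already exists
  $memd->set($key, $value);      # store unconditionally
  my $cached = $memd->get($key);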

Memcached was created to reduce the load on high-traffic sites with a
high database load consisting mostly of reads, so it might not be
appropriate for gitweb, where I/O load is of most concern, not CPU.
The main advantage of memcached is its ability to scale out.  Usually
you can run memcached together with a web server or a database server,
as memcached is CPU-lean and memory-hungry, while web/database servers
are the reverse: CPU-hungry and usually memory-lean.  Note that gitweb
(or rather git access to the repositories behind gitweb) is
I/O-hungry.


4.3. Cache::FastMmap (also example of callbacks),
     and caching benchmark mentioned there

Cache::FastMmap uses an mmap'ed file to act as a shared memory
interprocess cache.  It uses fcntl locking to ensure that multiple
processes can safely access the cache at the same time.  It uses a
basic LRU algorithm to keep the most used entries in the cache, plus
(optionally) a cache timeout.

Cache::FastMmap was created to be very fast.

The class also supports read-through, and write-back or write-through,
callbacks to access the real data if it is not in the cache.  With
those, the code dealing with the cache can be written simply as:

  my $cache = Cache::FastMmap->new(
    ...
    context => $RealDataSourceHandle,
    read_cb  => sub { $_[0]->get($_[1]) },
    write_cb => sub { $_[0]->set($_[1], $_[2]) },
  );

  ...

  my $value = $cache->get($key);

  $cache->set($key, $newvalue);

It also supports a get_and_set() method to atomically retrieve and set
the value of a given key, and has methods for dealing with multiple
keys at once.  There is also the Cache::FastMmap::Tie module providing
a tie interface to Cache::FastMmap.  Even if gitweb doesn't use this
module, the callback-based interface is worth considering for
implementation.
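
For example, a hypothetical hit counter; the subroutine receives the
key and the current value, and returns the new value to store:

  my $hits = $cache->get_and_set('hit_count', sub { ($_[1] || 0) + 1 });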


4.4. CGI::Cache, to help cache the output of time-intensive CGI
     scripts with minimal changes to the CGI script code

This module is intended to be used in CGI scripts that may benefit
from caching; it is written in such a way that existing CGI code can
get caching added with minimal changes to the script.  Here's a simple
example:

  #!/usr/bin/perl

  use CGI;
  use CGI::Cache;

  # Set up cache
  CGI::Cache::setup();

  my $cgi = CGI->new;

  # CGI::Vars requires CGI version 2.50 or better
  CGI::Cache::set_key($cgi->Vars);

  # This should short-circuit the rest of the loop if a cache value is
  # already there
  CGI::Cache::start() or exit;

  print $cgi->header, "\n";

  #...

  print <<EOF;
  This prints to STDOUT, which will be cached.
  If the next visit is within 24 hours, the cached STDOUT
  will be served instead of executing this 'print'.
  EOF

The CGI::Cache module ties the output file descriptor (usually STDOUT)
to an internal variable, to which all output is saved.  This trick
(technique) is worth considering if we decide on caching the final
output in gitweb, or the final output without HTTP headers.


5. Summary

If it is decided that gitweb should cache Perl structures, we would
certainly use Storable, which should be installed as part of the Perl
installation on most systems.  Perhaps gitweb could use the
Cache::Cache packages in general (and Cache::FileCache in particular),
as they should fulfill the "extras or trusted contrib" criterion, but
I'd rather not add another dependency to gitweb, especially as not all
installations need caching.  It could be a good solution for a gitweb
fork, though, and I guess the kernel.org gitweb could use it.

If gitweb is to implement its own solution, so as not to introduce
extra dependencies, and it would cache Perl structures, implementing
the Cache::Cache get/set interface, possibly improved with a callback
interface, would be a good idea.  For very large installations it
would be good to examine a memcached solution (or a multilevel cache;
see CHI).

If gitweb is to cache output, or output without HTTP headers, either
using CGI::Cache or borrowing its technique would be a good idea.

Gitweb caching is meant to reduce load (mainly I/O load, according to
some mails sent to this mailing list by J.H., the kernel.org gitweb
admin, and Pasky, the repo.or.cz gitweb admin).  I think it would be
good to try and check, benchmarking if possible, different solutions
to the "thundering herd" (aka "cache stampede") problem, and to
adaptive cache lifetime control (see Cache::Adaptive).

Thoughts? Comments?

%%
In the next part I'd like to have thoughts and ideas for gitweb caching
from J.H. and Petr 'Pasky' Baudis...


References:
===========
[1] CPAN: http://search.cpan.org
[2] The Cache modules: http://code.google.com/p/perl-cache
[3] memcached: http://www.danga.com/memcached/
[4] Caching benchmark: http://cpan.robm.fastmail.fm/cache_perf.html

-- 
Jakub Narebski
Poland
