Re: [RFC PATCH 10/10] gitweb: Show appropriate "Generating..." page when regenerating cache (WIP)

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jakub Narebski <jnareb@gmail.com>
To: Petr Baudis <pasky@suse.cz>
Cc: git@vger.kernel.org,
	John 'Warthog9' Hawley <warthog9@eaglescrag.net>,
	John 'Warthog9' Hawley <warthog9@kernel.org>
Subject: Re: [RFC PATCH 10/10] gitweb: Show appropriate "Generating..." page when regenerating cache (WIP)
Date: Mon, 25 Jan 2010 21:58:28 +0100	[thread overview]
Message-ID: <201001252158.29516.jnareb@gmail.com> (raw)
In-Reply-To: <20100125135653.GN4159@machine.or.cz>

On Mon, Jan 25, 2010 at 14:56 +0100, Petr Baudis wrote:
> On Mon, Jan 25, 2010 at 02:48:26PM +0100, Jakub Narebski wrote:

> > Now those patches (mine and J.H. both) make gitweb use locking
> > (it is IIRC configurable in J.H. patch) to make only one process
> > generate the page if it is missing from cache, or is stale.  Now
> > if it is missing, we have to wait until it is generated in full
> > before being able to show it to client.  While it is possible to
> > "tee" output (using PerlIO::tee, or Capture::Tiny, or tie like
> > CGI::Cache) writing it simultaneously to browser and to cache for 
> > the process that is generating data, it is as far as I understand
> > it impossible for processes which are waiting for data.  Therefore
> > the need for "Generating..." page, so the user does not think that
> > web server hung or something, and is not generating output.
> 
> Ah, ok, so the message is there to cover up for a technical problem. ;-)
> I didn't quite realize. Then, it would be great to tweak the mechanisms
> so that the user does not really have to wait.

Well, the mechanism would certainly be configurable in final version
(current split version is more of proof of concept of splitting).

> 
> So, I wonder about two things:
> 
> (i) How often does it happen that two requests for the same page are
> received? Has anyone measured it? Or is at least able to make
> a minimally educated guess? IOW, isn't this premature optimization?

To be more exact the question is how often second request for the same
page appears when earlier request didn't finished processing.  It is
the matter of both frequency of given requests, and time it takes to
generate request (which grows with growing load on server).

As to measurements: Pasky, do you have access logs, or their analysis
a la AWStats, Webalizer and the like, for repo.or.cz?  Warthog9, do
you have access logs or analysis for git.kernel.org?  Can you get similar
from fedorahosted?

> (ii) Can't the locked gitwebs do the equivalent of tail -f?

Well, it could, in principle, but it would need some changes.  

First, instead of using temporary file to create cache entry atomically
(write to temporary file, then rename) the process generating data would
have to write to file other processes can read from.  It could be e.g.
lockfile.

Second, there would be needed extended cache API so that generated data
is streamed to cache file, ->set($key, $data) ==> ->set($key, $fh) or
->set_io($key, $fh).  This would mean some complications, but what might
be more important is that this trick would not work as far as I can see
with other caching backends / caching engines that the one from 
gitweb/cache.pm (like memcached or mmap based ones).

Then the code could look like the following (in pseudocode):

  try to acquire writers lock
  if (acquired writers lock) {
  	generate and "tee" response
  	create cache entry
  } else {
  	# <<<<<<
  	while (not acquired writers lock &&
  	       sysread something) {
  		print <data>;
  	}
  	# >>>>>>
  	retrieve and print (rest) of data
  }

where parts between <<<<<< and >>>>>> are new.

But there is another complication: gitweb needs to be able to deal with
the situation where process generating data got interrupted before 
creating full output, or process generating data ran die_error which
does not generate any cache entry (e.g. if the URL we are trying to
access returns 404 not found - the check for existence of object can
take a while if the system is busy, I think).  

Now in current implementation either cache entry is written in full, or it
is not written at all.  It would be, I think, fairly easy to check with the
current code whether cache entry got generated when we acquired readers
lock (when the process get terminated, the lock gets released, which
is advantage over using atomic creating file with O_EXCL for locking),
and if we didn't repeat the whole process.  With the "tee"/"tail" 
solution if the process generating data got interrupted before end,
we can detect such situation, but currently I have no idea what should
be done in such situation.  We can as easily as for the current solution
(which needs "Generating..." page for activity indicator) to detect
die_error situation, and with some care i.e. with not writing to cache
file directly we can ensure that cache entries contain full, correctly
generated data.

> 
> P.S.: Again the disclaimer - if this is "too hard", it's better to
> accept patches like they are, then improve this later. But perhaps
> a better solution would be not to clutter the code by optimizing this
> case at all if it's not clear it really matters in the real world.

See above.

P.S. I have noticed that with current implementation (well, I am not sure
if it is true also for J.H. implementation) there is problem if there is
more than one process trying to request URL which result in die_error being
called.  The design decision, present in original patch, was to not cache
"die_error" / non-"200 OK" pages; it seems sane, but I don't know if it
was a correct decision.  The solution for interrupted generating process,
described above, works also for die_error pages, although it makes 
die_error pages slower for such (hopefully rare) situation of simultaneous
errorneous request.

P.P.S. Both Pasky's approach to caching projects_list page, and Lea Wiemann
work on "gitweb caching" project for Google Summer of Code 2008 approached
caching in different way: by caching (parsed) data, not by caching output.
Note however that for some actions like 'snapshot' we would probably want
to have response/output caching anyway.  Also for output caching we can
use X-Sendfile (or like) extension.

-- 
Jakub Narebski
Poland

next prev parent reply	other threads:[~2010-01-25 20:58 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-01-14  1:22 [PATCH 0/9] Gitweb caching v5 John 'Warthog9' Hawley
2010-01-14  1:22 ` [PATCH 1/9] gitweb: Load checking John 'Warthog9' Hawley
2010-01-14  1:22   ` [PATCH 2/9] gitweb: change die_error to take "extra" argument for extended die information John 'Warthog9' Hawley
2010-01-14  1:22     ` [PATCH 3/9] gitweb: Add option to force version match John 'Warthog9' Hawley
2010-01-14  1:23       ` [PATCH 4/9] gitweb: Makefile improvements John 'Warthog9' Hawley
2010-01-14  1:23         ` [PATCH 5/9] gitweb: add a get function to compliment print_local_time John 'Warthog9' Hawley
2010-01-14  1:23           ` [PATCH 6/9] gitweb: add a get function to compliment print_sort_th John 'Warthog9' Hawley
2010-01-14  1:23             ` [PATCH 7/9] gitweb: cleanup error message produced by undefined $site_header John 'Warthog9' Hawley
2010-01-14  1:23               ` [PATCH 8/9] gitweb: Convert output to using indirect file handle John 'Warthog9' Hawley
2010-01-14  1:23                 ` [PATCH 9/9] gitweb: File based caching layer (from git.kernel.org) John 'Warthog9' Hawley
2010-01-16  2:48                   ` Jakub Narebski
2010-01-23  0:27                   ` [RFC PATCH 00/10] gitweb: Simple file based output caching Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 02/10] gitweb: href(..., -path_info => 0|1) Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 03/10] gitweb/cache.pm - Very simple file based caching Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 04/10] gitweb/cache.pm - Stat-based cache expiration Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 05/10] gitweb: Use Cache::Cache compatibile (get, set) output caching (WIP) Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 06/10] gitweb/cache.pm - Adaptive cache expiration time (WIP) Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 07/10] gitweb: Use CHI compatibile (compute method) caching (WIP) Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 08/10] gitweb/cache.pm - Use locking to avoid 'stampeding herd' problem (WIP) Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 09/10] gitweb/cache.pm - Serve stale data when waiting for filling cache (WIP) Jakub Narebski
2010-01-23  0:27                     ` [RFC PATCH 10/10] gitweb: Show appropriate "Generating..." page when regenerating " Jakub Narebski
2010-01-24 22:24                       ` Petr Baudis
2010-01-25  0:03                         ` Jakub Narebski
2010-01-25  1:17                           ` Jakub Narebski
2010-01-25 11:46                         ` Jakub Narebski
2010-01-25 13:02                           ` Petr Baudis
2010-01-25 13:48                             ` Jakub Narebski
2010-01-25 13:56                               ` Petr Baudis
2010-01-25 20:32                                 ` J.H.
2010-01-26  1:49                                   ` Jakub Narebski
2010-01-28 17:39                                   ` Petr Baudis
2010-01-31 11:58                                     ` Jakub Narebski
2010-01-25 20:58                                 ` Jakub Narebski [this message]
2010-01-25 20:41                               ` J.H.
2010-01-26  2:30                                 ` Jakub Narebski
2010-01-23 19:55                     ` [RFC PATCH 00/10] gitweb: Simple file based output caching J.H.
2010-01-24 13:54                     ` [RFC PATCH 11/10] gitweb: Ajax-y "Generating..." page when regenerating cache (WIP) Jakub Narebski
2010-02-06  0:51                     ` [RFC PATCH 00/10] gitweb: Simple file based output caching J.H.
2010-02-06 23:56                       ` Jakub Narebski
2010-02-07 12:35                         ` Jakub Narebski
     [not found]                   ` <0dd15cb3f18e2a26fc834fd3b071e6d3ecc00557.1264198194.git.jnareb@gmail.com>
2010-01-23  0:48                     ` [RFC PATCH 01/10] gitweb: Print to explicit filehandle (preparing for caching) Jakub Narebski
2010-02-07 21:32                     ` Jakub Narebski
2010-01-16  0:43                 ` [PATCH 8/9] gitweb: Convert output to using indirect file handle Jakub Narebski
2010-01-16  0:58                   ` Junio C Hamano
2010-01-16  1:14                     ` Jakub Narebski
2010-01-16  1:41                       ` Junio C Hamano
2010-01-24 22:14                   ` Petr Baudis
2010-01-25  1:47                     ` Jakub Narebski
2010-01-25 20:48                       ` J.H.
2010-01-25 21:48                         ` Jakub Narebski
2010-01-15 23:49               ` [PATCH 7/9] gitweb: cleanup error message produced by undefined $site_header Jakub Narebski
2010-01-23 11:13           ` [PATCH 5/9] gitweb: add a get function to compliment print_local_time Jakub Narebski
2010-01-15 23:36       ` [PATCH 3/9] gitweb: Add option to force version match Jakub Narebski
2010-01-24 21:59       ` Petr Baudis
2010-01-24 23:17         ` Jakub Narebski
2010-01-15 22:40     ` [PATCH 2/9] gitweb: change die_error to take "extra" argument for extended die information Jakub Narebski
2010-01-15 22:30   ` [PATCH 1/9] gitweb: Load checking Jakub Narebski
2010-01-15  1:40 ` [PATCH 0/9] Gitweb caching v5 Jakub Narebski
2010-01-15  4:29   ` J.H.
2010-01-15 10:28     ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201001252158.29516.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=pasky@suse.cz \
    --cc=warthog9@eaglescrag.net \
    --cc=warthog9@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).