* [RFC] Implementing gitweb output caching - issues to solve @ 2010-11-04 16:21 Jakub Narebski 2010-12-09 1:31 ` J.H. 0 siblings, 1 reply; 9+ messages in thread From: Jakub Narebski @ 2010-11-04 16:21 UTC (permalink / raw) To: git Cc: John 'Warthog9' Hawley, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin First, what is the status of gitweb output caching? I have send minimal fixup for latest John 'Warthog9' Hawley series as Gitweb caching v7.1[1]. John said that he can work on fixes on November 2 at earliest. [1] http://thread.gmane.org/gmane.comp.version-control.git/160147/focus=160473 During working on my rewrite of gitweb output caching, based on ideas (and some code) from John 'Warthog9' Hawley (J.H.) series I have met with the following issues: 1. Error handling There are three separate places where we can encounter error during caching. They are: A. setting up cache (initialization) B. retrieving data from cache C. storing (saving) data in cache. There are two possible choices for errors during cache initialization (case A): * turn off caching * show 500 Internal Server Error The problem with turning off caching is that it hides an error, and admin can not notice that caching does not work. Therefore I think that better solution would be to generate error, especially that such errors are not, usually, incidental. Note that in my opinion cache should not be initialized if caching is disabled. The modern CHI caching interface provides a way to change how runtime errors are handled via 'on_get_error' and 'on_set_error' parameters to constructor. The default for CHI is to log error, or ignore if no logger is set. You can provide a coderef (callback). For errors during getting data from cache (case B), we can ignore error which means generating data as if cache was disabled/entry did not exist in cache, or erroring out and showing error to user. Such errors should be rare, so erroring out might be appropriate reaction. Note that we have special case of zero sized cache entry file, in which case we treat it as non-existent entry, already. For errors during setting (saving) data to cache (case C), we can ignore error and not re-generate cache (not even calling callback if error is detected early), or error out. Such errors can happen for example if filesystem gets full. It might be better to ignore such errore, at least in the case of ENOSPC / ENOMEM. 2. Progress info and not caching error pages J.H. find out or invented very nice hack to generate progress info page without Ajax (and therefore without need for JavaScript), to make user aware that data is being regenerated, and that he/she should be patient instead of refreshing or complaining. This trick uses the fact that the internal redirect using meta refresh[2] <meta http-equiv="refresh" content="0" /> does not occur untill the web browser gets full page (which means that connection get closed / end of request). [2] http://en.wikipedia.org/wiki/Meta_refresh Besides the fact that, as J.H. wrote JH> There are still a few known "issues" with respect to this: JH> - Code needs to be added to be "browser" aware so JH> that clients like wget that are trying to get a JH> binary blob don't obtain a "Generating..." page there is additional problem, at least in my rewrite. (I didn't follow the code flow in J.H. v7 patch series to check if it also affected). Let's assume that 'progress_info' feature is turned on, and that cache is generated in background. Let's assume that error pages (die_error) are not cached. Now client tries to access page which would/will result in an error. With caching it acquires exclusive (writers) lock, check that there are no stale data (error pages are never cached), checks that it is provided with 'generating_info' subroutine, so it forks a process to generate data in background. Background process detaches, tries to generate data to cache, die_error is called. die_error turns off capturing, and prints to STDOUT. At the end of die_error there is jump to outer scope to DONE_GITWEB; therefore $lock_fh goes out of scope and lockfile is closed, and lock released. Background process finishes. Parent process runs 'generating_info' subroutine. Now if it waits a bit like in my rewrite before employing trick mentioned above, and does not print anything if lockfile is released before startup delay, _and_ die_error finishes within this delay, then everything is all right: the error message is sent to client. If die_error doesn't finish before git_generating_data_html prints meta refresh, or there is no startup delay, then error pages would get infinite redirects (remember: there is never anything in cache for error pages). This is a bad thing. One possible solution for this problem (beside employing startup delay) is to have tee output, i.e. print it as it is being captured. Streamed (partial) response would serve as progress indicator for process (re)generating cache; only parallel processes waiting for cache would show 'generating_info'. I think that in current implementation of capturing it would be as simple as not closing STDOUT, but I'd have to check that. 3. Using generic cache engine and memory consumption Most Perl caching interfaces support only $cache->set($key, $data), where $data is a Perl variable, and $data = $cache->get($key), or their equivalents. Even for file-based cache drivers you save from memory and read into memory. The only exception I know of is Cache interface with Cache::File driver, that provides $cache->handle($key [, $mode]), where optional $mode argument can be any of mode strings that can be used in 'open' function, or one of fopen(3) modes. (Of course for memory-based or network-based (like memcached) caches it might not make sense to provide such interface). The most often recommended capture module, Capture::Tiny, allows only capturing into scalar (into memory) $stdout = capture \&code; or (using its prototype) $stdout = capture { <code> }; Well, it is *::Tiny, so it supports minimal API. From the list of different capture modules, that allow capturing of Perl code output, with different interfaces (API), in "See also" section of Capture::Tiny documentation[3], only IO::CaptureOutput allow capturing into specified file: capture \&code, \$stdout, undef, $outfile; [3] http://p3rl.org/Capture::Tiny J.H.'s gitweb output caching v7 captures output directly ito cache files. The problem with doing it in my rewrite is to allow capturing directly into cache entry file without losing ability to select different caching engine, which might be not file-based (like e.g. memcached-based). P.S. You can always check out my latest work on gitweb output caching rewrite here (warning: rebased!) git://repo.or.cz/git/jnareb-git.git gitweb/cache-kernel-pu (it is also available in git://github.com/jnareb/git.git) -- Jakub Narębski Poland ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-11-04 16:21 [RFC] Implementing gitweb output caching - issues to solve Jakub Narebski @ 2010-12-09 1:31 ` J.H. 2010-12-09 5:22 ` Junio C Hamano 2010-12-09 22:30 ` Jakub Narebski 0 siblings, 2 replies; 9+ messages in thread From: J.H. @ 2010-12-09 1:31 UTC (permalink / raw) To: Jakub Narebski Cc: git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin > 1. Error handling [...] > Note that in my opinion cache should not be initialized if caching is > disabled. Well it wasn't getting initialized in my code if caching was turned off. It might have been buffering, but it wasn't mucking with the directory creation and such... > The modern CHI caching interface provides a way to change how runtime > errors are handled via 'on_get_error' and 'on_set_error' parameters to > constructor. The default for CHI is to log error, or ignore if no > logger is set. You can provide a coderef (callback). > > > For errors during getting data from cache (case B), we can ignore error > which means generating data as if cache was disabled/entry did not > exist in cache, or erroring out and showing error to user. Such errors > should be rare, so erroring out might be appropriate reaction. Note > that we have special case of zero sized cache entry file, in which case > we treat it as non-existent entry, already. > > > For errors during setting (saving) data to cache (case C), we can > ignore error and not re-generate cache (not even calling callback if > error is detected early), or error out. Such errors can happen for > example if filesystem gets full. It might be better to ignore such > errore, at least in the case of ENOSPC / ENOMEM. The potential, at least for the deployed sites who are already using the gitweb caching, for disaster if caching isn't running is pretty high. I would agree that erroring out with a 500 that explains the 'why' behind the failure is the only real option right now. The callback mechanism in CHI is nice, but I think for the time being is a little overkill for what we've got right now. I also don't see that many, typical, sysadmins adding in that many error handling hooks and such into gitweb and/or their infrastructure. I could be wrong, I just don't see that as being a common occurrence. > 2. Progress info and not caching error pages > > J.H. find out or invented very nice hack to generate progress info > page without Ajax (and therefore without need for JavaScript), to make > user aware that data is being regenerated, and that he/she should be > patient instead of refreshing or complaining. Not so much found or invented, it's something that's been around since I started doing any web development back in 1995. It's a bit old-school, I'll admit, but it gets the job done still. [...] > JH> There are still a few known "issues" with respect to this: > JH> - Code needs to be added to be "browser" aware so > JH> that clients like wget that are trying to get a > JH> binary blob don't obtain a "Generating..." page I've solved this in v8, there is now an explicit blacklist for clients that we assume are only there for whatever is getting generated. Right now the list is only wget and curl as those are the only two I can think of immediately that wouldn't be able to parse the meta-equiv refresh. There is already a special case for feeds to not get this, so those shouldn't be affected as it is already. Downside is it's a blacklist, and inherits all the problems of a blacklist. Though there really isn't a better way to handle that, given there's no way to really query a browser and ask what it's capabilities really are. > there is additional problem, at least in my rewrite. (I didn't follow > the code flow in J.H. v7 patch series to check if it also affected). > > > Let's assume that 'progress_info' feature is turned on, and that cache > is generated in background. Let's assume that error pages (die_error) > are not cached. > > Now client tries to access page which would/will result in an error. > With caching it acquires exclusive (writers) lock, check that there > are no stale data (error pages are never cached), checks that it is > provided with 'generating_info' subroutine, so it forks a process to > generate data in background. > > Background process detaches, tries to generate data to cache, > die_error is called. die_error turns off capturing, and prints to > STDOUT. At the end of die_error there is jump to outer scope to > DONE_GITWEB; therefore $lock_fh goes out of scope and lockfile is > closed, and lock released. Background process finishes. > > Parent process runs 'generating_info' subroutine. Now if it waits a > bit like in my rewrite before employing trick mentioned above, and > does not print anything if lockfile is released before startup delay, > _and_ die_error finishes within this delay, then everything is all > right: the error message is sent to client. > > If die_error doesn't finish before git_generating_data_html prints > meta refresh, or there is no startup delay, then error pages would get > infinite redirects (remember: there is never anything in cache for > error pages). This is a bad thing. > > > One possible solution for this problem (beside employing startup > delay) is to have tee output, i.e. print it as it is being captured. > Streamed (partial) response would serve as progress indicator for > process (re)generating cache; only parallel processes waiting for > cache would show 'generating_info'. > > I think that in current implementation of capturing it would be as > simple as not closing STDOUT, but I'd have to check that. Actually in reading through this I thought of a better way to get the error message cleanly passed from the backend process to the frontend waiting processes. It's implemented in v8, but it basically boils down to, strangely enough, caching the error. (I've got a hammer, it clearly solves all problems!). It's not a full cache (like what I'm doing with the rest of the cache) but it basically generates a file that can be used to short circuit things a bit. > 3. Using generic cache engine and memory consumption > > Most Perl caching interfaces support only $cache->set($key, $data), > where $data is a Perl variable, and $data = $cache->get($key), or > their equivalents. Even for file-based cache drivers you save from > memory and read into memory. > > The only exception I know of is Cache interface with Cache::File > driver, that provides $cache->handle($key [, $mode]), where optional > $mode argument can be any of mode strings that can be used in 'open' > function, or one of fopen(3) modes. (Of course for memory-based or > network-based (like memcached) caches it might not make sense to > provide such interface). > > The most often recommended capture module, Capture::Tiny, allows only > capturing into scalar (into memory) > > $stdout = capture \&code; > > or (using its prototype) > > $stdout = capture { <code> }; > > Well, it is *::Tiny, so it supports minimal API. From the list of > different capture modules, that allow capturing of Perl code output, > with different interfaces (API), in "See also" section of > Capture::Tiny documentation[3], only IO::CaptureOutput allow capturing > into specified file: > > capture \&code, \$stdout, undef, $outfile; > > [3] http://p3rl.org/Capture::Tiny > > > J.H.'s gitweb output caching v7 captures output directly ito cache > files. The problem with doing it in my rewrite is to allow capturing > directly into cache entry file without losing ability to select > different caching engine, which might be not file-based (like > e.g. memcached-based). One of the big things that my caching engine now stops is the stampeding herd problem, basically by locking everyone out of the cache for that entry till it's prepared. Assuming that can be preserved with any arbitrary caching engine, then capturing directly to the file isn't that big of a deal. If that can be preserved in your full re-write, then it doesn't really matter how or when the data makes it into the caching system be it memory, disk, /dev/random or collapsing quantum wave form I'm going to try and get v8 out tomorrow morning, and I'm going to let this stew on kernel.org overnight. It's looking like 18 patches (which includes Jakub's v7.2, since I was building straight on top of that). In poking around, I'll admit, I keep finding things I'd love to fix / cleanup throughout the entirety of the code base. I'm kinda hoping to finally get caching dealt with and merged so I can start scratching those other itches. Mind you this has been 4+ years in the making already... - John 'Warthog9' Hawley P.S. Most of this e-mail was written a couple of weeks ago, I found it in my drafts folder and wanted to get it out along with a note that I've got v8 percolating. -JH ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-09 1:31 ` J.H. @ 2010-12-09 5:22 ` Junio C Hamano 2010-12-09 5:28 ` J.H. 2010-12-09 22:30 ` Jakub Narebski 1 sibling, 1 reply; 9+ messages in thread From: Junio C Hamano @ 2010-12-09 5:22 UTC (permalink / raw) To: J.H. Cc: Jakub Narebski, git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin "J.H." <warthog9@eaglescrag.net> writes: > P.S. Most of this e-mail was written a couple of weeks ago, I found it > in my drafts folder and wanted to get it out along with a note that I've > got v8 percolating. -JH Thanks. Is "Tomorrow morning" relative to "a couple of weeks ago"? ;-) ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-09 5:22 ` Junio C Hamano @ 2010-12-09 5:28 ` J.H. 0 siblings, 0 replies; 9+ messages in thread From: J.H. @ 2010-12-09 5:28 UTC (permalink / raw) To: Junio C Hamano Cc: Jakub Narebski, git, John 'Warthog9' Hawley, Petr Baudis, admin On 12/08/2010 09:22 PM, Junio C Hamano wrote: > "J.H." <warthog9@eaglescrag.net> writes: > >> P.S. Most of this e-mail was written a couple of weeks ago, I found it >> in my drafts folder and wanted to get it out along with a note that I've >> got v8 percolating. -JH > > Thanks. Is "Tomorrow morning" relative to "a couple of weeks ago"? ;-) No tomorrow morning is relative to today - so ~12hrs from now, ish, maybe big ISH ;-) I finished up the patch series today. The series itself is already pushed out to my repo on kernel.org http://git.kernel.org/?p=git/warthog9/gitweb.git;a=shortlog;h=refs/heads/gitweb-ml-v8 just wanted to let that stew a bit before pushing it out to the mailing list, since it is now an 18 part patch :-/ - John 'Warthog9' Hawley ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-09 1:31 ` J.H. 2010-12-09 5:22 ` Junio C Hamano @ 2010-12-09 22:30 ` Jakub Narebski 2010-12-09 22:52 ` Jonathan Nieder 1 sibling, 1 reply; 9+ messages in thread From: Jakub Narebski @ 2010-12-09 22:30 UTC (permalink / raw) To: J.H. Cc: git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin On Thu, 9 Dec 2010, J.H. wrote: > > 1. Error handling > [...] > > Note that in my opinion cache should not be initialized if caching is > > disabled. > > Well it wasn't getting initialized in my code if caching was turned off. > It might have been buffering, but it wasn't mucking with the directory > creation and such... What I wanted to say here was that, in my opinion, adding support for caching to gitweb shouldn't change gitweb behavior in the case when caching is enabled (and should modify gitweb code only minimally). The original v7 series ran capture_fetch also when caching was disabled, and in such case captured gitweb to in-memory file to print it at end. This was fixed (well, fixed if we consider this a bug/misfeature, which I do) in v7.1, and finally done without errors in v7.4 > > The modern CHI caching interface provides a way to change how runtime > > errors are handled via 'on_get_error' and 'on_set_error' parameters to > > constructor. The default for CHI is to log error, or ignore if no > > logger is set. You can provide a coderef (callback). > > > > > > For errors during getting data from cache (case B), we can ignore error > > which means generating data as if cache was disabled/entry did not > > exist in cache, or erroring out and showing error to user. Such errors > > should be rare, so erroring out might be appropriate reaction. Note > > that we have special case of zero sized cache entry file, in which case > > we treat it as non-existent entry, already. > > > > > > For errors during setting (saving) data to cache (case C), we can > > ignore error and not re-generate cache (not even calling callback if > > error is detected early), or error out. Such errors can happen for > > example if filesystem gets full. It might be better to ignore such > > errore, at least in the case of ENOSPC / ENOMEM. > > The potential, at least for the deployed sites who are already using the > gitweb caching, for disaster if caching isn't running is pretty high. I > would agree that erroring out with a 500 that explains the 'why' behind > the failure is the only real option right now. > > The callback mechanism in CHI is nice, but I think for the time being is > a little overkill for what we've got right now. I also don't see that > many, typical, sysadmins adding in that many error handling hooks and > such into gitweb and/or their infrastructure. I could be wrong, I just > don't see that as being a common occurrence. Well, we would of course provide default callback that uses die_error, see for example [PATCHv6/RFC 21/24] gitweb: Wrap die_error to use as error handler for caching engine http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163043 http://repo.or.cz/w/git/jnareb-git.git/commitdiff/9e8e92f318c6b776486ab19509326f0b470d491d What I want to avoid is using die_error in caching module; this entangles caching module code with gitweb code, and makes it much more difficult to do unit tests / test caching code in isolation (die_error would have to be mocked). This would also prevent making caching engine a proper Perl module, I think... There is also another issue: with error handler passed as callback we can for example use CHI as caching engine (with any of its drivers: File, FastMmap, Memcached) with our own/gitweb error reporting. > > 2. Progress info and not caching error pages > > > > J.H. find out or invented very nice hack to generate progress info > > page without Ajax (and therefore without need for JavaScript), to make > > user aware that data is being regenerated, and that he/she should be > > patient instead of refreshing or complaining. > > Not so much found or invented, it's something that's been around since I > started doing any web development back in 1995. It's a bit old-school, > I'll admit, but it gets the job done still. Nice idea, anyway. I do wonder though if all web browsers follow this behavior, and is this behavior documented / required... > [...] > > JH> There are still a few known "issues" with respect to this: > > JH> - Code needs to be added to be "browser" aware so > > JH> that clients like wget that are trying to get a > > JH> binary blob don't obtain a "Generating..." page > > I've solved this in v8, there is now an explicit blacklist for clients > that we assume are only there for whatever is getting generated. Right > now the list is only wget and curl as those are the only two I can think > of immediately that wouldn't be able to parse the meta-equiv refresh. In my rewrite [PATCHv6 17/24] gitweb: Show appropriate "Generating..." page when regenerating cache http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163040 http://repo.or.cz/w/git/jnareb-git.git/commitdiff/48679f7985ccda16dc54fda97790841bab4a0ba2#patch1 (see the browser_is_robot() subroutine: http://repo.or.cz/w/git/jnareb-git.git/blob/48679f7985ccda16dc54fda97790841bab4a0ba2:/gitweb/gitweb.perl#l870 I use HTTP::BrowserDetect package if available and it's ->robot() method. The fallback is to use *whitelist*, assuming that it would be better to not show "Generating..." page rather than download the wrong thing. I also guess that most (all?) web browsers use "Mozilla compatibile" somewhere in their User-Agent string, thus matching 'Mozilla'. > > There is already a special case for feeds to not get this, so those > shouldn't be affected as it is already. Yes, it is true that you limit actions that are affected. I also use whitelist in my rewrite, allowing only actions which output HTML to use "Generating..." page. Perhaps that's too strict. > Downside is it's a blacklist, and inherits all the problems of a > blacklist. Though there really isn't a better way to handle that, given > there's no way to really query a browser and ask what it's capabilities > really are. Well, we can always use whitelist instead, see above. > > there is additional problem, at least in my rewrite. (I didn't follow > > the code flow in J.H. v7 patch series to check if it also affected). > > > > > > Let's assume that 'progress_info' feature is turned on, and that cache > > is generated in background. Let's assume that error pages (die_error) > > are not cached. > > > > Now client tries to access page which would/will result in an error. > > With caching it acquires exclusive (writers) lock, check that there > > are no stale data (error pages are never cached), checks that it is > > provided with 'generating_info' subroutine, so it forks a process to > > generate data in background. > > > > Background process detaches, tries to generate data to cache, > > die_error is called. die_error turns off capturing, and prints to > > STDOUT. At the end of die_error there is jump to outer scope to > > DONE_GITWEB; therefore $lock_fh goes out of scope and lockfile is > > closed, and lock released. Background process finishes. > > > > Parent process runs 'generating_info' subroutine. Now if it waits a > > bit like in my rewrite before employing trick mentioned above, and > > does not print anything if lockfile is released before startup delay, > > _and_ die_error finishes within this delay, then everything is all > > right: the error message is sent to client. > > > > If die_error doesn't finish before git_generating_data_html prints > > meta refresh, or there is no startup delay, then error pages would get > > infinite redirects (remember: there is never anything in cache for > > error pages). This is a bad thing. > > > > > > One possible solution for this problem (beside employing startup > > delay) is to have tee output, i.e. print it as it is being captured. > > Streamed (partial) response would serve as progress indicator for > > process (re)generating cache; only parallel processes waiting for > > cache would show 'generating_info'. > > > > I think that in current implementation of capturing it would be as > > simple as not closing STDOUT, but I'd have to check that. > > Actually in reading through this I thought of a better way to get the > error message cleanly passed from the backend process to the frontend > waiting processes. > > It's implemented in v8, but it basically boils down to, strangely > enough, caching the error. (I've got a hammer, it clearly solves all > problems!). It's not a full cache (like what I'm doing with the rest of > the cache) but it basically generates a file that can be used to short > circuit things a bit. Hmmm... I'll take a look at your solution then. Looks like it can be interesting. I can think of capturing die_error output to in-memory file (doing caching of its output in-memory), but I don't see how you can change "goto DONE_GITWEB" (or in patch[1] sent to git mailing list "goto DONE_REQUEST") to exit action subroutine but not exit cache_fetch so gitweb would be able to display catched output. [1]: http://permalink.gmane.org/gmane.comp.version-control.git/162156 "[PATCH/RFC] gitweb: Go to DONE_REQUEST rather than DONE_GITWEB in die_error" > > 3. Using generic cache engine and memory consumption > > > > Most Perl caching interfaces support only $cache->set($key, $data), > > where $data is a Perl variable, and $data = $cache->get($key), or > > their equivalents. Even for file-based cache drivers you save from > > memory and read into memory. > > > > The only exception I know of is Cache interface with Cache::File > > driver, that provides $cache->handle($key [, $mode]), where optional > > $mode argument can be any of mode strings that can be used in 'open' > > function, or one of fopen(3) modes. (Of course for memory-based or > > network-based (like memcached) caches it might not make sense to > > provide such interface). > > > > The most often recommended capture module, Capture::Tiny, allows only > > capturing into scalar (into memory) > > > > $stdout = capture \&code; > > > > or (using its prototype) > > > > $stdout = capture { <code> }; > > > > Well, it is *::Tiny, so it supports minimal API. From the list of > > different capture modules, that allow capturing of Perl code output, > > with different interfaces (API), in "See also" section of > > Capture::Tiny documentation[3], only IO::CaptureOutput allow capturing > > into specified file: > > > > capture \&code, \$stdout, undef, $outfile; > > > > [3] http://p3rl.org/Capture::Tiny > > > > > > J.H.'s gitweb output caching v7 captures output directly ito cache > > files. The problem with doing it in my rewrite is to allow capturing > > directly into cache entry file without losing ability to select > > different caching engine, which might be not file-based (like > > e.g. memcached-based). > > One of the big things that my caching engine now stops is the stampeding > herd problem, basically by locking everyone out of the cache for that > entry till it's prepared. Assuming that can be preserved with any > arbitrary caching engine, then capturing directly to the file isn't that > big of a deal. Well, the recommended mechanism that one can use to prevent 'cache miss stampede' problem in CHI[2] caching interface (though not the more generic 'stampeding herd' problem) is to use "expires_variance" option. From CHI manpage: expires_variance [FLOAT] Controls the variable expiration feature, which allows items to expire a little earlier than the stated expiration time to help prevent cache miss stampedes. Value is between 0.0 and 1.0, with 0.0 meaning that items expire exactly when specified (feature is disabled), and 1.0 meaning that items might expire anytime from now til the stated expiration time. The default is 0.0. A setting of 0.10 to 0.25 would introduce a small amount of variation without interfering too much with intended expiration times. The probability of expiration increases as a function of how far along we are in the potential expiration window, with the probability being near 0 at the beginning of the window and approaching 1 at the end. [...] [2]: http://p3rl.org/CHI Though with locking we can for example lock whole cache instead of single entry when average load gets too high (but lower than $maxload, when gitweb returns '503 - The load average on the server is too high' (custom version of '503 Service Unavailable' HTTP error code/status)). I think that CHI can be extended (via roles perhaps) to use locking: flock for 'File' driver, IPC::Lock::Memcached or similar for 'Memcached' driver, get_and_set from Cache::FastMmap which IIUC uses fcntl to lock fragments of mmapped file for 'FastMmap' driver. Cache::File driver from Cache package (http://p3rl.org/Cache) supports locking by itself: lock_level Specify the level of locking to be used. There are three different levels available: Cache::File::LOCK_NONE() No locking is performed. Useful when you can guarantee only one process will be accessing the cache at a time. Cache::File::LOCK_LOCAL() Locking is performed, but it is not suitable for use over NFS filesystems. However it is more efficient. Cache::File::LOCK_NFS() Locking is performed in a way that is suitable for use on NFS filesystems. [...] locking efficiency Currently LOCK_LOCAL is not implemented (if uses the same code as LOCK_NFS). There are two points of locking in Cache::File, index locking and entry locking. The index locking is always exclusive and the lock is required briefly during most operations. The entry locking is either shared or exclusive and is also required during most operations. When locking is enabled, File::NFSLock is used to provide the locking for both situations. This is not overly efficient, especially as the entry lock is only ever grabbed whilst the index lock is held. > If that can be preserved in your full re-write, then it doesn't really > matter how or when the data makes it into the caching system be it > memory, disk, /dev/random or collapsing quantum wave form In my rewrite I use the same flock-based locking mechanism... just rewritten. From the cover letter: JN> The main ideas in lifted from J.H. patches are the following JN> (features in common with "Gitweb caching v7" series by John Hawley): JN> JN> * caching captured output of gitweb in flat files, without any JN> serialization (caching raw data) JN> JN> * using global (per-cache, not per-entry) expiration time, and JN> using difference between mtime of cache file and current time JN> for expiration JN> JN> * using file locking (flock) to prevent 'cache miss stampede' JN> problem, i.e. to ensure that only one process is (re)generating JN> cache entry JN> JN> * serving stale but not too old version, and regenerating data JN> in background, to avoid waiting for data to be regenerated JN> JN> * progress info indicator based on http-equiv refresh trick JN> (described in more detail how it works in the commit message) JN> JN> * capturing gitweb output by redirecting STDOUT to cache entry file > I'm going to try and get v8 out tomorrow morning, and I'm going to let > this stew on kernel.org overnight. It's looking like 18 patches (which > includes Jakub's v7.2, since I was building straight on top of that). Hmmm... I was thinking about doing rewrite of rewrite, keeping only parts which are actually used. Instead of 22 patches it would be 4-9 patches: * capturing output to file or filehandle * caching with locking, using 'compute_fh' interface * module implementing cache_output subroutine, joining those two * adding caching support to gitweb where the following patches might be extracted as additional patches * adaptive cache lifetime * zero-size file check * "Generating..." page for cache and for gitweb perhaps also with separating * serving stale data * background cache generation > In poking around, I'll admit, I keep finding things I'd love to fix / > cleanup throughout the entirety of the code base. I'm kinda hoping to > finally get caching dealt with and merged so I can start scratching > those other itches. Mind you this has been 4+ years in the making > already... I am thinking about adding support for "tee" in addition to "capture", i.e. printing when capturing, and using it in cache... > - John 'Warthog9' Hawley > > P.S. Most of this e-mail was written a couple of weeks ago, I found it > in my drafts folder and wanted to get it out along with a note that I've > got v8 percolating. -JH This email was written before v8 appeared on git mailing list... P.S. Did you had a chance to read mine rerwite of gitweb output caching "[PATCHv6/RFC 00/24] gitweb: Simple file based output caching"? -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-09 22:30 ` Jakub Narebski @ 2010-12-09 22:52 ` Jonathan Nieder 2010-12-10 3:17 ` Olaf Alders 0 siblings, 1 reply; 9+ messages in thread From: Jonathan Nieder @ 2010-12-09 22:52 UTC (permalink / raw) To: Jakub Narebski Cc: J.H., git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin, olaf Jakub Narebski wrote: > In my rewrite > > [PATCHv6 17/24] gitweb: Show appropriate "Generating..." page when regenerating cache > http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163040 > http://repo.or.cz/w/git/jnareb-git.git/commitdiff/48679f7985ccda16dc54fda97790841bab4a0ba2#patch1 > > (see the browser_is_robot() subroutine: > > http://repo.or.cz/w/git/jnareb-git.git/blob/48679f7985ccda16dc54fda97790841bab4a0ba2:/gitweb/gitweb.perl#l870 > > I use HTTP::BrowserDetect package if available and it's ->robot() method. > > The fallback is to use *whitelist*, assuming that it would be better to > not show "Generating..." page rather than download the wrong thing. > I also guess that most (all?) web browsers use "Mozilla compatibile" > somewhere in their User-Agent string, thus matching 'Mozilla'. Interesting. http://www.user-agents.org/ seems to suggest that many robots do use Mozilla (though I don't think it's worth bending over backwards to help them see the page correctly). HTTP::BrowserDetect uses a blacklist as far as I can tell. Maybe in the long term it would be nice to add a whitelist ->human() method. Cc-ing Olaf Alders for ideas. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-09 22:52 ` Jonathan Nieder @ 2010-12-10 3:17 ` Olaf Alders 2010-12-10 4:11 ` Jonathan Nieder 2010-12-10 4:46 ` J.H. 0 siblings, 2 replies; 9+ messages in thread From: Olaf Alders @ 2010-12-10 3:17 UTC (permalink / raw) To: Jonathan Nieder Cc: Jakub Narebski, J.H., git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin On 2010-12-09, at 5:52 PM, Jonathan Nieder wrote: > Jakub Narebski wrote: > >> In my rewrite >> >> [PATCHv6 17/24] gitweb: Show appropriate "Generating..." page when regenerating cache >> http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163040 >> http://repo.or.cz/w/git/jnareb-git.git/commitdiff/48679f7985ccda16dc54fda97790841bab4a0ba2#patch1 >> >> (see the browser_is_robot() subroutine: >> >> http://repo.or.cz/w/git/jnareb-git.git/blob/48679f7985ccda16dc54fda97790841bab4a0ba2:/gitweb/gitweb.perl#l870 >> >> I use HTTP::BrowserDetect package if available and it's ->robot() method. >> >> The fallback is to use *whitelist*, assuming that it would be better to >> not show "Generating..." page rather than download the wrong thing. >> I also guess that most (all?) web browsers use "Mozilla compatibile" >> somewhere in their User-Agent string, thus matching 'Mozilla'. > > Interesting. http://www.user-agents.org/ seems to suggest that many > robots do use Mozilla (though I don't think it's worth bending over > backwards to help them see the page correctly). > > HTTP::BrowserDetect uses a blacklist as far as I can tell. Maybe in > the long term it would be nice to add a whitelist ->human() method. > > Cc-ing Olaf Alders for ideas. Thanks for including me in this. :) I'm certainly open to patching the module, but I'm not 100% clear on how you would want to implement this. How is ->is_human different from !->is_robot? To clarify, I should say that from the snippet above, I'm not 100% clear on what the problem is which needs to be solved. Olaf -- Olaf Alders olaf@wundersolutions.com http://www.wundersolutions.com http://twitter.com/wundercounter 866 503 2204 (Toll free - North America) 416 944 8306 (direct) ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-10 3:17 ` Olaf Alders @ 2010-12-10 4:11 ` Jonathan Nieder 2010-12-10 4:46 ` J.H. 1 sibling, 0 replies; 9+ messages in thread From: Jonathan Nieder @ 2010-12-10 4:11 UTC (permalink / raw) To: Olaf Alders Cc: Jakub Narebski, J.H., git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin Olaf Alders wrote: > On 2010-12-09, at 5:52 PM, Jonathan Nieder wrote: >> HTTP::BrowserDetect uses a blacklist as far as I can tell. Maybe in >> the long term it would be nice to add a whitelist ->human() method. >> >> Cc-ing Olaf Alders for ideas. > > Thanks for including me in this. :) I'm certainly open to patching > the module, but I'm not 100% clear on how you would want to > implement this. How is ->is_human different from !->is_robot? To > clarify, I should say that from the snippet above, I'm not 100% > clear on what the problem is which needs to be solved. Context (sorry I did not include this in the first place): The caching code (in development) for git's web interface uses a page that says "Generating..." for cache misses, with an http refresh redirecting to the generated content. The big downside is that if done naively this breaks wget, curl, and similar user agents that are not patient enough to grab the actual content instead of the redirect page. The first solution tried was to explicitly special case wget and curl. But in this case it is better to be more inclusive[2]; when in doubt, leave out the nice "Generating..." page and just serve the actual content slowly just in case. In other words, the idea was that user agents fall into three categories: A. definitely will not replace content with target of HTTP refresh B. definitely will replace content with target of HTTP refresh C. unknown and maybe ->is_robot could return true for A and ->is_human return true for B (leaving C as !->is_human && !->is_robot). In this case, we should show the "Generating..." page only in the ->is_human (B) case. That said, I know almost nothing on this subject, so it is likely this analysis misses something. J.H. or Jakub can likely say more. Thanks, Jonathan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] Implementing gitweb output caching - issues to solve 2010-12-10 3:17 ` Olaf Alders 2010-12-10 4:11 ` Jonathan Nieder @ 2010-12-10 4:46 ` J.H. 1 sibling, 0 replies; 9+ messages in thread From: J.H. @ 2010-12-10 4:46 UTC (permalink / raw) To: Olaf Alders Cc: Jonathan Nieder, Jakub Narebski, git, John 'Warthog9' Hawley, Junio C Hamano, Petr Baudis, admin >> Interesting. http://www.user-agents.org/ seems to suggest that many >> robots do use Mozilla (though I don't think it's worth bending over >> backwards to help them see the page correctly). If a robot reports itself and we don't know about it, I'm fine with giving it the 'Generating...' page as opposed to what it's expecting. The number of robots and things of that nature that won't handle the meta refresh are fewer than the number of people who will be clicking with eyeballs on a screen. >> HTTP::BrowserDetect uses a blacklist as far as I can tell. Maybe in >> the long term it would be nice to add a whitelist ->human() method. >> >> Cc-ing Olaf Alders for ideas. > > Thanks for including me in this. :) I'm certainly open to patching the module, but I'm not 100% clear on how you would want to implement this. How is ->is_human different from !->is_robot? To clarify, I should say that from the snippet above, I'm not 100% clear on what the problem is which needs to be solved. At this point I don't really see an issue with HTTP::BrowserDetect's robot() function, and I agree with human = !->is_robot. One thing I would like to see is the ability to do some sort of an add to the list of things to check for. As you are probably aware there are more agents that exist than what you have setup, I'm moving forward and handling it with the following: sub is_dumb_client { my($user_agent) = lc $ENV{'HTTP_USER_AGENT'}; my $browser_detect = HTTP::BrowserDetect->new($user_agent); return 1 if ( $browser_detect->robot() ); foreach my $adc ( @additional_dumb_clients ) { return 1 if ( index( $user_agent, lc $adc ) != -1 ); } return 0; } which could be simplified if there was just some way to do my($user_agent) = lc $ENV{'HTTP_USER_AGENT'}; my $browser_detect = HTTP::BrowserDetect->new($user_agent); $browser_detect->add_robots( @array ); return 1 if ( $browser_detect->robot() ); Not sure that particularly generalizes, and honestly it's only 4 lines of code to do add additional checks. - John 'Warthog9' Hawley ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2010-12-10 4:45 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-11-04 16:21 [RFC] Implementing gitweb output caching - issues to solve Jakub Narebski 2010-12-09 1:31 ` J.H. 2010-12-09 5:22 ` Junio C Hamano 2010-12-09 5:28 ` J.H. 2010-12-09 22:30 ` Jakub Narebski 2010-12-09 22:52 ` Jonathan Nieder 2010-12-10 3:17 ` Olaf Alders 2010-12-10 4:11 ` Jonathan Nieder 2010-12-10 4:46 ` J.H.
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).