Re: [PATCH 00/18] Gitweb caching v8


From: Jakub Narebski <jnareb@gmail.com>
To: "J.H." <warthog9@eaglescrag.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 00/18] Gitweb caching v8
Date: Fri, 10 Dec 2010 02:27:27 +0100
Message-ID: <201012100227.27903.jnareb@gmail.com>
In-Reply-To: <4D017796.4030506@eaglescrag.net>

On Fri, 10 Dec 2010, J.H. wrote:
> On 12/09/2010 03:26 PM, Jakub Narebski wrote:

>> John, could you please in the future Cc me?  I am interested in gitweb
>> output caching development.  Thanks in advance.
> 
> Apologies, apparently screwed up on my git send-email line.  I'll get
> that right one of these eons.

Ah, I can understand this.

>> I guess that in the final submission (i.e. the one that is to be
>> merged into the git.git repository) those changes would be squashed in,
>> wouldn't they?
> 
> I have no objections to squashing the reversions into a single patch,
> just figured it was easier to break them out for the time being.

I guess that interdiff in comments would work as well, or almost as well...
 
>>> There is one additional script I've written that the Fedora folks are using,
>>> and that might be useful to include, which is an 'offline' cache file generator.
>>> It basically wraps gitweb.cgi and at the end moves the cache file into the right
>>> place.  The Fedora folks were finding it took hours to generate their front
>>> page, and that doing a background generation almost never completed (due to 
>>> process death).  This was a simple way to handle that.  If people would like
>>> I can add it in as an additional patch.
>> 
>> Are you detaching the background process?

Errr... what I meant here is that perhaps detaching the background process
would keep it from dying, but I am guessing here.
 
> No, in fact I completely turn off forking (using the $cacheDoFork variable.)

BTW, what I don't like is that your code forks indiscriminately, even when
it is not needed (e.g. when background cache generation is turned off).

> 
>> It would be nice to have it as separate patch.
> 
> I can add it easily enough.

It is only about caching the most IO-intensive page, i.e. the projects_list
page, isn't it?  Why doesn't _it_ die, like the background process does?

> 
>>> v8:
>>> 	- Reverting several changes from Jakub's change set that make no sense
>>>                 - is_cacheable changed to always return true - nothing special about
>>>                   blame or blame_incremental as far as the caching engine is concerned
>> 
>> 'blame_incremental' is just another version of 'blame' view.  I have
>> disabled it when caching is enabled in my rewrite (you instead disabled
>> caching for 'blame_incremental' in your v7 and my v7.x) because I
>> couldn't get it to work together with caching.  Did you check that it
>> works?
> 
> blame works fine, blame_incremental generates but doesn't..... ohhhh
> someone added ajaxy kinda stuff and doesn't mention it anywhere.

Errr... I thought that the 'incremental' part made it self-explanatory that
it is Ajax-y stuff.  Well, while the commit is 4af819d (gitweb: Incremental
blame (using JavaScript), 2009-09-01), perhaps I should have added some
comment in the code.

> 
> Exciting.
> 
> blame_data needs to not get a 'generating...' page in all likelihood,
> generating a blame_incremental page, letting it load and then refreshing
> the whole thing gets me what I'm expecting.

Hmmm... I wonder why it didn't work for me at that time...

> 
> Is enough to mask.
> 
> Guess I'm looking at a v9 now.
> 
>> Besides, without "tee"-ing, i.e. printing output as it is captured,
>> cached 'blame_data' means that 'blame_incremental' is not incremental,
>> and therefore it loses its advantage over 'blame'.

I mean here that with the current state of caching 'blame_incremental' stops
being incremental...
 
> There are only 2 ways to get to a blame_incremental page
> 
> 1) By going to a blame page and clicking on the incremental link in the nav
> 
> 2) By enabling it by default so when you click 'blame' it goes to
> incremental first.

  3) By having JavaScript add ';js=1' to all links, so clicking on
  'blame' link (with action set to 'blame') would result in 
  'blame_incremental' view.

> 
>> If the data is already in the cache, then 'blame_incremental' has no
>> advantage over 'blame' either.
> 
> Agreed, though it's easy enough to support in the caching engine,
> basically don't return 'Generating...' and wait for that data to cache.
> Not really an advantage except that you're not waiting for the whole
> generation to get a page back at all.
> 
>>>                 - Reverted change to reset_output as
>>>                         open STDOUT, ">&", \*STDOUT_REAL;
>>>                   causes assertion failures:
>>>                   Assertion !((((s->var)->sv_flags & (0x00004000|0x00008000)) == 0x00008000) && (((svtype)((s->var)->sv_flags & 0xff)) == SVt_PVGV || ((svtype)((s->var)->sv_flags & 0xff)) == SVt_PVLV)) failed: file "scalar.xs", line 49 at gitweb.cgi line 1221.
>>>                   if we encounter an error *BEFORE* we've ever changed the output.
>> 
>> Which Perl version are you using?  I think you have found a bug in Perl.
>> Well, at least I have not run into this bug.
> 
> This is perl, v5.10.0 built for x86_64-linux-thread-multi

Could you check with a newer perl?  I don't get this error.

>> I have nothing against using
>> 
>>   open STDOUT, ">&STDOUT_REAL";
>> 
>> though I really prefer that you used lexical filehandles, instead of
>> "globs" which are global variables.

And using 'print STDOUT_REAL "";' protects against a spurious warning
(the warning is really wrong in this case).
 
>> The following works:
>> 
>>   open STDOUT, '>&', fileno($fh);
>> 
>> Note that fileno(Symbol::qualify_to_ref($fh)) might be needed...
> 
> I see 0 advantage to shifting around STDOUT and STDERR to a lexical
> filehandle vs. a glob in this case.  STDOUT_REAL retains all the
> properties of STDOUT should it be needed elsewhere, including what it
> was going and what it was doing.
> 
> I have no objection to shifting the file handles I'm using to lexical
> variables, if nothing else the argument about them closing when falling
> out of scope is worth it, but for STDOUT, STDERR, etc I don't think
> switching to lexicals makes a lot of sense

Well... I'd have to agree that in the current case (a capturing engine
embedded in gitweb, and gitweb-specific; no need for recursive capture) it
would be enough to use such globs.
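
For comparison, a minimal sketch of the lexical-filehandle variant.  This is
not code from either series: capture_stdout() is a hypothetical helper, and
it captures into an anonymous temporary file rather than a cache entry.

```perl
use strict;
use warnings;
use IO::Handle;   # for STDOUT->flush on older perls

# capture_stdout() is a hypothetical helper, not code from either series.
# It runs $code with STDOUT redirected to an anonymous temporary file
# and returns whatever $code printed.
sub capture_stdout {
    my ($code) = @_;

    # Duplicate the real STDOUT into a lexical filehandle.
    open my $saved, '>&', \*STDOUT
        or die "cannot dup STDOUT: $!";

    # open() with undef as the filename creates an anonymous temporary file.
    open my $tmp, '+>', undef
        or die "cannot create temporary file: $!";

    # Point STDOUT at the temporary file.
    open STDOUT, '>&', $tmp
        or die "cannot redirect STDOUT: $!";

    my $ok  = eval { $code->(); 1 };
    my $err = $@;

    # Restore the real STDOUT, also when $code died.
    STDOUT->flush();
    open STDOUT, '>&', $saved
        or die "cannot restore STDOUT: $!";
    close $saved;

    die $err unless $ok;

    # The duplicated descriptor shares its file offset with $tmp,
    # so rewind before reading back.
    seek $tmp, 0, 0 or die "cannot rewind temporary file: $!";
    local $/;                      # slurp mode
    my $captured = <$tmp>;
    return defined $captured ? $captured : '';
}

my $out = capture_stdout(sub { print "Generating...\n" });
print "captured: $out";
```

The saved handle goes away when it falls out of scope, which is the only
real difference from the STDOUT_REAL glob in this non-recursive case.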

> 
>>>         - Cleanups that were indirectly mentioned by Jakub
>>>                 - Elimination of anything even remotely looking like duplicate code
>>>                         - Creation of isBinaryAction() and isFeedAction()
>> 
>> Could you please not use mixedCase names?
> 
> I'm fine with renaming those if you wish.
> 
>> First, that is what I invented %actions_info for, in
>> 
>>   [PATCH 16/24] gitweb: Introduce %actions_info, gathering information about actions
>>   http://thread.gmane.org/gmane.comp.version-control.git/163052/focus=163038
>>   http://repo.or.cz/w/git/jnareb-git.git/commitdiff/305a10339b33d56b4a50708d71e8f42453c8cb1f
> 
> I have not based any of my caching engine, right now, on anything you've
> done for your rewrite.

What I meant here is that if you do yet another version, you can take a
look at it as a way of avoiding long, not very clear alternatives in a
condition or in a regexp match.
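
A rough sketch of the idea, not the code from that patch: the layout of the
table, the 'output_format' key and its values, and the action_outputs()
helper are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical table; the 'output_format' key and its values are
# assumptions made for this sketch.
my %actions_info = (
    snapshot   => { output_format => 'binary' },
    blob_plain => { output_format => 'binary' },
    atom       => { output_format => 'feed'   },
    rss        => { output_format => 'feed'   },
    blame      => { output_format => 'html'   },
    summary    => { output_format => 'html'   },
);

# One lookup replaces a condition such as
#   $action eq 'snapshot' || $action eq 'blob_plain'
# or the equivalent regexp match.
sub action_outputs {
    my ($action, $format) = @_;
    return 0 unless defined $action && exists $actions_info{$action};
    return $actions_info{$action}{output_format} eq $format ? 1 : 0;
}

for my $action (qw(snapshot blob_plain atom blame)) {
    printf "%-10s binary=%d feed=%d\n", $action,
        action_outputs($action, 'binary'),
        action_outputs($action, 'feed');
}
```

Adding an action then means adding one entry to the table instead of
touching every condition that lists action names.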

> 
>> Second, why 'isBinaryAction()'?  There is nothing inherently
>> different between binary (':raw') and text (':utf8') output, as I have
>> repeatedly said before.
> 
> It's a binary action in that you are shoving something down the pipe
> with the intention of sending the bits completely raw.  You read the
> data raw, and write the data raw.  There is no interpretation of the
> data as being anything but straight raw.
> 
> Right now, in gitweb already, there are two places that treat output
> completely differently:
> 
> 	- snapshot
> 	- blob_plain
> 
> The only reason isBinaryAction() (or any other function name or process
> you want to grant it) exists is so that I can figure out if it's one of
> those actions so I can deal with the cache and output handling
> differently for each.
> 
> Yes, I could flip the entire caching engine over to following the same
> mantra for everything and thus there is no need to care, but gitweb
> itself isn't really setup to handle that separation cleanly right now,
> and I'm trying to make as few bigger changes right now as is.

Always reading from the cache in ':raw' mode and always printing from the
cache in ':raw' mode (i.e. setting STDOUT to ':raw' before printing / copying
the cache entry) would be enough in gitweb's case to avoid special-casing
binary files.

In gitweb, "binmode STDOUT, ':raw';" is always done _after_ starting capture,
which means that it gets applied to the cache file; and gitweb always does
"binmode STDOUT, ':utf8';" before stopping capture.

If you print text data to a file using the ':utf8' layer (applied at the
beginning to the cache file), it ends up in this file as the correct sequence
of bytes.  Therefore you can dump said cache file to STDOUT in ':raw' mode
(or in ':utf8' mode), as long as STDOUT and the filehandle reading the cache
file use the same mode.
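
A minimal sketch of that round trip.  The helpers write_cache_entry() and
read_cache_entry_raw() are hypothetical, and a temporary file stands in for
the cache entry; this is not the caching engine's code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to $path through the ':utf8' layer,
# the way captured text output would land in a cache file.
sub write_cache_entry {
    my ($path, $text) = @_;
    open my $out, '>:utf8', $path or die "cannot write $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";
}

# Hypothetical helper: read the cache entry back as raw bytes,
# without any decoding.
sub read_cache_entry_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "cannot read $path: $!";
    local $/;                      # slurp mode
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# A temporary file stands in for the cache entry.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "Jakub Nar\x{0119}bski\n";   # character string with U+0119
write_cache_entry($path, $text);
my $bytes = read_cache_entry_raw($path);

# Serving from cache: no decoding and no re-encoding, just copy the bytes.
binmode STDOUT, ':raw';
print $bytes;
```

The same read-and-print code then serves 'snapshot' and 'blob_plain'
entries, since those were written as raw bytes in the first place.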

>>>         - Added better error handling
>>>                 - Creation of .err file in the cache directory
>>>                 - Trap STDERR output into $output_err as this was spewing data prior
>>>                   to any header information being sent
>> 
>> Why is it needed?  We capture output of "die" via CGI::Carp::set_message,
>> and "warn" output is captured to web server logs... unless you explicitly
>> use "print STDERR <sth>", so don't do that.
> 
> I have seen, in several instances, a case where git itself will generate
> an error, it shoves it to STDERR which makes it to the client before
> anything else, thus causing 500 level errors.
> 
> Added this so that STDERR got trapped and those messages didn't make it out.

Could you give examples of when this happens?  Anything that happens after
"use CGI::Carp" is parsed should have STDERR redirected to the web server
error log.

I'll read the actual patch and comment on it.

> 
>>>         - Added hidden field in footer for url & hash of url, which is extremely useful
>>>           for debugging
>> 
>> Nice idea, I'll see it.  Can it be disabled (information leakage)?
> 
> There's not really any information leakage per-se, unless you call
> md5suming the url information leakage.

Ah, sorry, I sent this comment before actually reading the patch in question.

-- 
Jakub Narebski
Poland
