Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Eric Wong <e@80x24.org>
To: Jeff King <peff@peff.net>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Josh Triplett <josh@joshtriplett.org>,
	git@vger.kernel.org
Subject: Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
Date: Tue, 7 Feb 2017 00:14:46 +0000	[thread overview]
Message-ID: <20170207001446.GB7128@starla> (raw)
In-Reply-To: <20170206220754.5q2oddr5ej7c6qcg@sigill.intra.peff.net>

Jeff King <peff@peff.net> wrote:
> On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:
> 
> > I haven't hit insurmountable performance problems, even on
> > low-end hardware; especially since I started storing blob ids in
> > Xapian itself, avoiding the expensive tree lookup via git.
> 
> The painful thing is traversing the object graph for clones and fetches.
> Bitmaps help, but you still have to generate them.

Yep.  "public-inbox-init" defaults to enabling bitmaps in the
config for this reason.

> > The main problem seems to be tree size.  Deepening (2/2/36 vs
> > 2/38) might be an option (I think Peff brought that up); but it
> > might be easier to switch to YYYYMM refs (working like
> > logrotate) and rely on Xapian to tie the entire thing together.
> 
> Yes, the hashing is definitely one issue. Some numbers here:
> 
>   http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/
> 
> If you have C commits on a tree with T entries, you have to do C*T hash
> lookups for a flat tree (for each commit, you have to see "yup, already
> saw that object"). Sharding that across H entries at the top level drops
> the tree cost from T to H + T/H (actually, it's a bit worse because we
> have to read the secondary tree, too). Sharding again (at H') gets you
> H + H' + T/H/H'.
> 
> Let's imagine you do one message per commit, so C=T. At 400K messages,
> that's about 160 billion hash lookups flat. At H=256, it's about 700
> million. If you shard again with H'=256, it's 200 million. After that,
> the additive terms start to dominate, and it's not worth going any
> further (and also, we're ignoring the extra-tree cost to each level).

Just to make sure I'm following, here; the entire formulas are:

	C * H + H' + (T / H / H')     # 2/2/36
	C * H + (T / H)               # 2/38 (current)

Right?

> At that point you're better off to start having fewer commits. I know
> that the schema you use does put useful information into the commit
> message, but it's also redundant with what's in the messages themselves.
> And it sounds like you push most of that out to Xapian anyway.

Yeah, there's no benefit to Xapian users for having any info in
the commit.  However, keeping commit-per-message is still
important to me to for better robustness from hardware and
network failures.

But yes, historical stuff could be squashed into a single commit
(much like how linux.git started with v2.6.12-rc2 without
history).  Perhaps some folks will care about NNTP article
numbering being non-chronological...

> Imagine your repo had one commit with 400K historical messages, and then
> grouped the new messages so that on average we got about 10 messages per
> commit (this doesn't seem unrealistic for something that commits every
> few minutes; the messages tend to be bunched in time; I ran some
> numbers against a 10-minute mark in the earlier message).
> 
> Then after another 100K messages, we'd have C=10,001 and T=500K. With
> two levels of hashing at 256 each, that's ~5 million hash lookups to
> walk the graph. And those numbers would be reasonable for a hosting site
> like GitHub.
> 
> I don't know what C is for the kernel repo, but I suspect with the right
> tuning it could be made into large-but-reasonable.

LKML probably has an upper bound of 30K messages per month;
so it could hit 100K in less than 4 months.  Worst case might
be 360K messages a year

	360000 * (256 + 256 + ((360000 + old) / 256 / 256))

That's still at least 180 million hash lookups after a year or
so of real-time updates; right?  (But probably closer to 240
million if there's 10 million old messages in there.

Instead, I think I will add an option to support logrotate-style
monthly heads (YYYYMM); keeping 2/38 and C == T:

	30000 * (256 + (30000 / 256))               => 11 million
	30000 * (256 + 256 + (30000 / 256 / 256))   => 15 million

The monthly heads would each be discontiguous history-wise;
so Xapian would become a requirement for users of this option
for Message-ID lookups, but histories would still be readable
with "git log"

One good side-effect of using monthly heads is --single-branch
clones may be used if someone lacks the bandwidth or space to do
a full mirror.  I'm not sure if the server-side (pack reuse,
bitmaps) will benefit other aside from bandwidth reductions,
though.

A (far-fetched) option I've considered would be to store entire
messages in the commit and have no trees or blobs at all.  But
that would require a significant rework, and would also make
Xapian a hard requirement for even checking if a message is
deleted or not.

next prev parent reply	other threads:[~2017-02-07  0:14 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-06 15:34 Cross-referencing the Git mailing list archive with their corresponding commits in `pu` Johannes Schindelin
2017-02-06 19:10 ` Junio C Hamano
2017-02-09 14:11   ` Lars Schneider
2017-02-09 21:53     ` Johannes Schindelin
2017-02-09 22:18       ` Junio C Hamano
2017-02-06 20:48 ` Eric Wong
2017-02-06 22:07   ` Jeff King
2017-02-07  0:14     ` Eric Wong [this message]
2017-02-17 17:50 ` Johannes Schindelin
2017-02-20 19:33   ` Junio C Hamano
2017-02-20 20:06     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170207001446.GB7128@starla \
    --to=e@80x24.org \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=josh@joshtriplett.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.