git.vger.kernel.org archive mirror
From: Eric Wong <e@80x24.org>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Josh Triplett <josh@joshtriplett.org>, git@vger.kernel.org
Subject: Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
Date: Mon, 6 Feb 2017 20:48:20 +0000
Message-ID: <20170206204820.GA7128@starla>
In-Reply-To: <alpine.DEB.2.20.1702041206130.3496@virtualbox>

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> For details, see:
> http://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/
> (this is also an example where public-inbox' thread detection went utterly
> wrong, including way too many mails in the "thread")

Thanks, it should be fixed in an hour or two when reindexing
finishes...

<https://public-inbox.org/meta/20170206200216.GA26676@dcvr/>

but it looks like reindexing is a little buggy in that it reuses
thread IDs, too... (will fix)

The Tor .onion mirrors should be done first, since they're on
better hardware:
http://hjrcffqmbrq6wope.onion/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/
http://czquwvybam4bgbro.onion/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/

> This last example also demonstrates a very curious test case for a
> different difficulty in trying to reconstruct lost correspondences: the
> patch series was applied *twice*, independently of each other. First, on
> the day v3 was submitted, it was applied on top of v1.8.1-rc0 (as commits
> ee26a6e2b8..dd465ce66f), although it was not merged until v1.8.1-rc3. 22
> days later, it was reapplied on top of maint so it could enter v1.8.0.3
> (back then, Git still had "patchlevel" versions): c2999adcd5..008c208c2c.
> 
> As you can see, there is a many-to-many relationship here, even if you do
> leave the *original* branch out of the picture entirely.

Fwiw, I've always seen the search ability of public-inbox as
analogous to rename detection in git; in that it can never be
perfect, but can still be tweaked and improved after-the-fact
and be used more flexibly.

Right now, thread searching in public-inbox is loose in that it
favors overmatching based on Subject in addition to References,
while the actual threading algorithm (for display) is strict,
relying only on References.  But yeah, there can be tweaks to
improve matching and to introduce git (code) repository awareness
into the mail search...
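
To illustrate the loose-vs-strict distinction, here is a minimal
sketch in Python.  It is not public-inbox's actual code: the message
dicts, the `group` function and the `normalize_subject` helper are all
made up for this example.  Strict grouping joins messages only through
References; loose grouping additionally joins messages whose
normalized Subject is the same, which is how an unrelated resend can
end up in the same "thread".

```python
import re
from collections import defaultdict


def normalize_subject(subject):
    """Drop leading Re:/Fwd: prefixes and fold case (hypothetical helper)."""
    return re.sub(r"^(\s*(re|fwd?):\s*)+", "", subject, flags=re.I).strip().lower()


def group(msgs, by_subject=False):
    """Group Message-IDs with union-find.

    Strict mode links messages only via References.  With
    by_subject=True, messages sharing a normalized Subject are
    linked as well (deliberate overmatching).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_with_subject = {}
    for m in msgs:
        find(m["id"])
        for ref in m.get("refs", []):
            union(ref, m["id"])
        if by_subject:
            key = normalize_subject(m["subject"])
            union(first_with_subject.setdefault(key, m["id"]), m["id"])

    groups = defaultdict(set)
    for m in msgs:
        groups[find(m["id"])].add(m["id"])
    return sorted(sorted(g) for g in groups.values())


msgs = [
    {"id": "a", "refs": [], "subject": "[PATCH] foo"},
    {"id": "b", "refs": ["a"], "subject": "Re: [PATCH] foo"},
    # A later resend with the same Subject but no References:
    {"id": "c", "refs": [], "subject": "[PATCH] foo"},
]
strict = group(msgs)
loose = group(msgs, by_subject=True)
```

With these three messages, strict grouping keeps "c" on its own,
while loose grouping pulls it in with "a" and "b".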

> Will keep you posted,

Likewise :>

> P.S.: I used public-inbox.org links instead of commit references to the
> Git repository containing the mailing list archive, because the format of
> said Git repository is so unfavorable that it was determined very quickly
> in a discussion between Patrick Reynolds (GitHub) and myself that it would
> put totally undue burden on GitHub to mirror it there (compare also Carlos
> Nieto's talk at GitMerge titled "Top Ten Worst Repositories to host on
> GitHub").

Any suggestions on how the repository format can be improved?

I haven't hit insurmountable performance problems, even on
low-end hardware; especially since I started storing blob ids in
Xapian itself, avoiding the expensive tree lookup via git.

The main problem seems to be tree size.  Deepening (2/2/36 vs
2/38) might be an option (I think Peff brought that up); but it
might be easier to switch to YYYYMM refs (working like
logrotate) and rely on Xapian to tie the entire thing together.
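
A rough sketch of what those layouts look like, in Python.  This
assumes paths are derived from the 40-character SHA-1 hex of the
Message-ID; the function names and the `refs/heads/YYYYMM` naming are
hypothetical and only meant to show the shape of each option.

```python
import hashlib
from datetime import datetime, timezone


def mid_hex(message_id):
    """40-char SHA-1 hex of a Message-ID (assumed basis for the path)."""
    return hashlib.sha1(message_id.encode("utf-8")).hexdigest()


def path_2_38(h):
    """Two-level layout: 256 top-level trees, each holding many blobs."""
    return f"{h[:2]}/{h[2:]}"


def path_2_2_36(h):
    """Deepened layout: an extra 256-way fan-out keeps each tree smaller."""
    return f"{h[:2]}/{h[2:4]}/{h[4:]}"


def monthly_ref(dt):
    """Hypothetical per-month ref name, rotated like a log file."""
    return f"refs/heads/{dt:%Y%m}"


h = mid_hex("20170206204820.GA7128@starla")
shallow = path_2_38(h)
deep = path_2_2_36(h)
ref = monthly_ref(datetime(2017, 2, 6, 20, 48, 20, tzinfo=timezone.utc))
```

Either way the blob content is unchanged; only the size of the tree
objects touched per commit (deepening) or the number of messages
reachable from any one ref (YYYYMM) differs.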

Some change will definitely be needed for all of LKML, but most
projects have less traffic than even git, and should be fine.


But, I am working to undermine centralized messaging systems
(which GitHub and GitLab both are), so they would be wise to
undermine public-inbox all the same ;>
