From: Junio C Hamano <gitster@pobox.com>
To: "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"brian m. carlson" <sandals@crustytoothpaste.net>,
"Georgios Kontaxis" <geko1702+commits@99rst.org>
Subject: Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
Date: Mon, 22 Mar 2021 11:32:46 -0700
Message-ID: <xmqqlfaf6nu9.fsf@gitster.g>
In-Reply-To: <pull.910.v4.git.1616396267010.gitgitgadget@gmail.com> (Georgios Kontaxis via GitGitGadget's message of "Mon, 22 Mar 2021 06:57:46 +0000")
"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:
[Note to other reviewers: input from those who are more familiar
with gitweb and Perl is very much appreciated on this patch.]
> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
"... are exposed to web crawlers, which spammers may use." would be
sufficient as a problem description.
After giving a problem description, it is customary to describe the
solution as if you are ordering the codebase to "be like so", so
instead of this ...
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
... we may say something like
Introduce an 'email-privacy' feature, which defaults to false,
that redacts e-mail addresses that appear as author/committer
info and in log messages from the generated HTML content.
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
And this is a good thing to add. Overall, nicely written.
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
> gitweb: redacted e-mail addresses feature.
>
> Gitweb extracts content from the Git log and makes it accessible over
> HTTP. As a result, e-mail addresses found in commits are exposed to web
> crawlers and they may not respect robots.txt. This may result in
> unsolicited messages. This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the unredacted
> commit log, e.g., by cloning the repository, and extracting information
> from it. It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
You do not need to repeat the above, which is in the log message above.
> Changes since v1:
>
> * Turned off the feature by default.
> * Removed duplicate code.
> * Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
>
> * The feature can be set on a per-project basis. ('override' => 1)
>
> Changes since v3:
>
> * Renamed feature to "email-privacy" and improved documentation.
> * Removed UI elements for git-format-patch since it won't be redacted.
> * Simplified calls to the address redaction logic.
> * Mail::Address is now used to reduce false-positive redactions.
Having these under the --- line like this is very helpful.
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..6630c76d92fd 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -21,6 +21,7 @@
> use File::Basename qw(basename);
> use Time::HiRes qw(gettimeofday tv_interval);
> use Digest::MD5 qw(md5_hex);
> +use Git::LoadCPAN::Mail::Address;
I'll defer to others who are more familiar with gitweb and the Perl
ecosystem on whether this is warranted, but I have a feeling that
importing and using Mail::Address->parse() only because we want to
see if a given "<string>" is an address is a bit overkill, and it might
be sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
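A minimal sketch of what such a crude check could look like, assuming
a helper named looks_like_mailaddr (a made-up name, not something in
gitweb). The "@" is escaped and the "-" is moved to the end of the
character class purely to keep the pattern unambiguous; it is meant
to accept the same strings as the pattern above.

```perl
use strict;
use warnings;

# Hypothetical replacement for the patch's is_mailaddr(); the name and
# the exact pattern are illustrative assumptions, not gitweb code.
sub looks_like_mailaddr {
	my $str = shift;
	# "<", a local part without "@" or ">", "@", a host made of
	# letters, digits, dots and dashes, then ">".
	return $str =~ m/^<[^\@>]+\@[a-z0-9.-]+>$/i ? 1 : 0;
}

# Print each candidate next to the verdict (1 = looks like an address).
for my $candidate ('<author@example.com>', '<stdio.h>', '<a@b@c>') {
	printf "%-22s %d\n", $candidate, looks_like_mailaddr($candidate);
}
```

This is deliberately loose: it does not validate addresses, it only
tells "<user@host>" apart from things like "<stdio.h>" that happen to
sit between angle brackets.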
> + # Redact e-mail addresses.
> +
> + # To enable system wide have in $GITWEB_CONFIG
> + # $feature{'email-privacy'}{'default'} = [1];
> + 'email-privacy' => {
> + 'sub' => sub { feature_bool('email-privacy', @_) },
> + 'override' => 1,
> + 'default' => [0]},
> );
>
> sub gitweb_get_feature {
> @@ -3449,6 +3459,32 @@ sub parse_date {
> return %date;
> }
>
> +sub is_mailaddr {
> + my @addrs = Mail::Address->parse(shift);
> + if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
> + return 0;
> + }
> + return 1;
> +}
> +
> +sub hide_mailaddrs_if_private {
> + my $line = shift;
> + return $line unless gitweb_check_feature('email-privacy');
> + while ($line =~ m/(<[^>]+>)/g) {
> + my $match = $1;
> + if (!is_mailaddr($match)) {
> + next;
> + }
> + my $offset = pos $line;
> + my $head = substr $line, 0, $offset - length($match);
> + my $redaction = "<redacted>";
> + my $tail = substr $line, $offset;
> + $line = $head . $redaction . $tail;
> + pos $line = length($head) + length($redaction);
Hmmmm, Perl suggestions from others? It looks quite strange to see
that the s/// operator is not used and the replacement is done
manually with byte positions in a Perl script.
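For illustration only, a rough sketch of the same loop written with
s///ge. The names redact_mailaddrs and looks_like_mailaddr are made up
for this example, the address check is a crude regexp stand-in for the
patch's is_mailaddr(), and the gitweb_check_feature('email-privacy')
guard is left out so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical stand-in for is_mailaddr(); a crude pattern, not a
# real address parser.
sub looks_like_mailaddr {
	my $str = shift;
	return $str =~ m/^<[^\@>]+\@[a-z0-9.-]+>$/i ? 1 : 0;
}

# Replace every "<...>" token that looks like an address with
# "<redacted>" and leave all other bracketed tokens untouched.
sub redact_mailaddrs {
	my $line = shift;
	$line =~ s{(<[^>]+>)}{
		# Copy the capture before calling a sub that runs its own match.
		my $candidate = $1;
		looks_like_mailaddr($candidate) ? '<redacted>' : $candidate;
	}ge;
	return $line;
}

print redact_mailaddrs('A U Thor <author@example.com>'), "\n";
print redact_mailaddrs('#include <stdio.h>'), "\n";
```

With /g the operator walks the line by itself and with /e the
replacement is computed per match, so there is no need to track pos()
or to splice the string with substr by hand.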
> sub parse_tag {
> my $tag_id = shift;
> my %tag;
> @@ -3465,7 +3501,7 @@ sub parse_tag {
> } elsif ($line =~ m/^tag (.+)$/) {
> $tag{'name'} = $1;
> } elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
> - $tag{'author'} = $1;
> + $tag{'author'} = hide_mailaddrs_if_private($1);
> $tag{'author_epoch'} = $2;
> $tag{'author_tz'} = $3;
> if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
This (and the others that follow the same pattern) looks sensible.
> @@ -7489,7 +7526,8 @@ sub git_log_generic {
> -accesskey => "n", -title => "Alt-n"}, "next");
> }
> my $patch_max = gitweb_get_feature('patches');
> - if ($patch_max && !defined $file_name) {
> + if ($patch_max && !defined $file_name &&
> + !gitweb_check_feature('email-privacy')) {
> if ($patch_max < 0 || @commitlist <= $patch_max) {
> $paging_nav .= " ⋅ " .
> $cgi->a({-href => href(action=>"patches", -replay=>1)},
> @@ -7550,7 +7588,8 @@ sub git_commit {
> } @$parents ) .
> ')';
> }
> - if (gitweb_check_feature('patches') && @$parents <= 1) {
> + if (gitweb_check_feature('patches') && @$parents <= 1 &&
> + !gitweb_check_feature('email-privacy')) {
> $formats_nav .= " | " .
> $cgi->a({-href => href(action=>"patch", -replay=>1)},
> "patch");
> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
> $formats_nav =
> $cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
> "raw");
> - if ($patch_max && @{$co{'parents'}} <= 1) {
> + if ($patch_max && @{$co{'parents'}} <= 1 &&
> + !gitweb_check_feature('email-privacy')) {
> $formats_nav .= " | " .
> $cgi->a({-href => href(action=>"patch", -replay=>1)},
> "patch");
I wouldn't have expected to see the above three hunks in the
"patch" codepath. Rather, I was hoping that you'd do something
like this at startup when you find out that the privacy feature
is enabled:
$feature{'patches'}{'default'} = [0]
if gitweb_check_feature('email-privacy');
so that nothing related to the 'patches' has to be modified.
That way, even if there were a fourth place in the 'patch' codepath
that can leak an e-mail address and that the above three hunks in this
patch missed, crawlers won't be able to get at it, no?
> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
> index 1f32ca66ea51..77fc1298d4c6 100644
> --- a/t/lib-gitweb.sh
> +++ b/t/lib-gitweb.sh
> @@ -67,6 +67,9 @@ gitweb_run () {
> GITWEB_CONFIG=$(pwd)/gitweb_config.perl
> export GITWEB_CONFIG
>
> + PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
> + export PERL5LIB
> +
Why does this change suddenly become necessary? This addition is only
about tests---do we need to do something similar in the runtime
environment when the updated gitweb that requires Mail::Address gets
deployed, or is that covered already somewhere else?
Thanks.
Thread overview: 41+ messages
2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
2021-03-21 0:42 ` Ævar Arnfjörð Bjarmason
2021-03-21 1:27 ` brian m. carlson
2021-03-21 3:30 ` Georgios Kontaxis
2021-03-21 3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
2021-03-21 17:28 ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
2021-03-21 18:26 ` Ævar Arnfjörð Bjarmason
2021-03-21 18:48 ` Junio C Hamano
2021-03-21 19:48 ` Georgios Kontaxis
2021-03-21 18:42 ` Junio C Hamano
2021-03-21 18:57 ` Junio C Hamano
2021-03-21 19:05 ` Junio C Hamano
2021-03-21 20:07 ` Georgios Kontaxis
2021-03-21 22:17 ` Junio C Hamano
2021-03-21 23:14 ` Georgios Kontaxis
2021-03-22 4:25 ` Junio C Hamano
2021-03-22 6:57 ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
2021-03-22 18:32 ` Junio C Hamano [this message]
2021-03-22 18:58 ` Georgios Kontaxis
2021-03-28 1:41 ` Junio C Hamano
2021-03-28 21:43 ` Georgios Kontaxis
2021-03-28 22:35 ` Junio C Hamano
2021-03-23 4:27 ` Georgios Kontaxis
2021-03-27 3:56 ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
2021-03-28 23:26 ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
2021-03-29 20:00 ` Junio C Hamano
2021-03-31 21:14 ` Junio C Hamano
2021-04-06 0:56 ` Junio C Hamano
2021-04-08 22:43 ` Ævar Arnfjörð Bjarmason
2021-04-08 22:51 ` Junio C Hamano
2021-03-29 1:47 ` [PATCH v5] " Eric Wong
2021-03-29 3:17 ` Georgios Kontaxis
2021-04-08 17:16 ` Eric Wong
2021-04-08 21:04 ` Junio C Hamano
2021-04-08 21:19 ` Eric Wong
2021-04-08 22:45 ` Ævar Arnfjörð Bjarmason
2021-04-08 22:54 ` Junio C Hamano
2021-03-21 6:00 ` [PATCH] " Junio C Hamano
2021-03-21 6:18 ` Junio C Hamano
2021-03-21 6:43 ` Georgios Kontaxis
2021-03-21 16:55 ` Junio C Hamano