From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org,
"brian m. carlson" <sandals@crustytoothpaste.net>,
Georgios Kontaxis <geko1702+commits@99rst.org>
Subject: Re: [PATCH v3] gitweb: redacted e-mail addresses feature.
Date: Sun, 21 Mar 2021 19:26:08 +0100
Message-ID: <87r1k8qs73.fsf@evledraar.gmail.com>
In-Reply-To: <pull.910.v3.git.1616347731514.gitgitgadget@gmail.com>
On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
>
> Changes since v1:
> - Turned off the feature by default.
> - Removed duplicate code.
> - Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
> - The feature can be set on a per-project basis. ('override' => 1)
>
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
> gitweb: Redacted e-mail addresses feature.
>
> Gitweb extracts content from the Git log and makes it accessible over
> HTTP. As a result, e-mail addresses found in commits are exposed to web
> crawlers. This may result in unsolicited messages. This is a feature for
> redacting e-mail addresses from the generated HTML content.
>
> This feature does not prevent someone from downloading the unredacted
> commit log and extracting information from it. It aims to hinder the
> low-effort bulk collection of e-mail addresses by web crawlers.
>
> Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
> Range-diff vs v2:
>
> 1: 74af11ca8bf2 ! 1: 930cdefe7ee0 gitweb: redacted e-mail addresses feature.
> @@ Commit message
> - Removed duplicate code.
> - Added note about Gitweb consumers receiving redacted logs.
>
> + Changes since v2:
> + - The feature can be set on a per-project basis. ('override' => 1)
> +
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> ## Documentation/gitweb.conf.txt ##
> @@ gitweb/gitweb.perl: sub evaluate_uri {
> + # $feature{'email_privacy'}{'default'} = [1];
> + 'email_privacy' => {
> + 'sub' => sub { feature_bool('email_privacy', @_) },
> -+ 'override' => 0,
> ++ 'override' => 1,
> + 'default' => [0]},
> );
>
>
>
> Documentation/gitweb.conf.txt | 16 +++++++++++++
> gitweb/gitweb.perl | 42 ++++++++++++++++++++++++++++++++---
> 2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..b7af3240177d 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,22 @@ same as of the snippet above:
> It is an error to specify a ref that does not pass "git check-ref-format"
> scrutiny. Duplicated values are filtered.
>
> +email_privacy::
> + Redact e-mail addresses from the generated HTML, etc. content.
> + This hides e-mail addresses found in the commit log from web crawlers.
> + Disabled by default.
> ++
> +It is highly recommended to enable this feature unless web crawlers are
> +hindered in some other way. Note that crawlers intent on harvesting e-mail
> +addresses may disregard robots.txt. You can enable this feature like so:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [1];
> +---------------------------------------------------------------------------
> ++
> +Note that if Gitweb is not the final step in a workflow then subsequent
> +steps may misbehave because of the redacted information they receive.
> +
>
> EXAMPLES
> --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..174cc566d97d 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
> 'sub' => \&feature_extra_branch_refs,
> 'override' => 0,
> 'default' => []},
> +
> + # Redact e-mail addresses.
> +
> + # To enable system wide have in $GITWEB_CONFIG
> + # $feature{'email_privacy'}{'default'} = [1];
> + 'email_privacy' => {
> + 'sub' => sub { feature_bool('email_privacy', @_) },
> + 'override' => 1,
> + 'default' => [0]},
> );
>
> sub gitweb_get_feature {
> @@ -3449,6 +3458,19 @@ sub parse_date {
> return %date;
> }
>
> [snip]
So in the v1 feedback I suggested:
BEGIN QUOTE
    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);
END QUOTE
But:
> sub parse_tag {
> my $tag_id = shift;
> my %tag;
> @@ -3471,6 +3493,10 @@ sub parse_tag {
> if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> $tag{'author_name'} = $1;
> $tag{'author_email'} = $2;
> + if (gitweb_check_feature('email_privacy')) {
> + $tag{'author_email'} = "private";
> + $tag{'author'} = hide_mailaddr($tag{'author'});
> + }
This seems quite awkward: we've already done the regex match at this
point, but this code:
> [snip]
> +sub hide_mailaddr_if_private {
> + my $line = shift;
> + return $line unless (gitweb_check_feature('email_privacy') &&
> + $line =~ m/^([^<]+) <([^>]*)>/);
> + return hide_mailaddr($line)
> +}
> +
> +sub hide_mailaddr {
> + my $mailaddr = shift;
> + $mailaddr =~ s/<([^>]*)>/<private>/;
> + return $mailaddr;
> +}
is going to run the same match again, and then do a search-and-replace
on top of it if we've got the feature enabled.
It seems much simpler to just turn your:
> + if (gitweb_check_feature('email_privacy')) {
> + $tag{'author_email'} = "private";
> + $tag{'author'} = hide_mailaddr($tag{'author'});
> + }
Into:
$tag{'author'} = maybe_hide_mailaddr($tag{author}, \$tag{author_email});
Using:
    sub maybe_hide_mailaddr {
        my ($email, $ref) = @_;
        return $email unless gitweb_check_feature('email_privacy');
        # Also redact the separately stored address, if the caller passed one.
        $$ref = "private" if $ref;
        return hide_mailaddr($email);
    }
That also works for callers that don't have a separate e-mail hash key
to set to "private", since the reference argument is optional. But maybe
it overcomplicates things...
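
For illustration, here is a minimal standalone sketch of that helper. It assumes the hide_mailaddr() substitution from the quoted patch; gitweb_check_feature() is replaced by a hypothetical stub driven by a $EMAIL_PRIVACY variable so the script runs outside gitweb, and the names and addresses are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's feature lookup; toggled here for the demo.
our $EMAIL_PRIVACY = 1;
sub gitweb_check_feature { return $EMAIL_PRIVACY }

# Same substitution as hide_mailaddr() in the quoted patch.
sub hide_mailaddr {
    my $mailaddr = shift;
    $mailaddr =~ s/<([^>]*)>/<private>/;
    return $mailaddr;
}

# Redact $email when the feature is on; if a scalar reference is passed,
# also overwrite the separately stored address it points to.
sub maybe_hide_mailaddr {
    my ($email, $ref) = @_;
    return $email unless gitweb_check_feature('email_privacy');
    $$ref = "private" if $ref;
    return hide_mailaddr($email);
}

my %tag = (
    author       => 'A U Thor <author@example.com>',
    author_email => 'author@example.com',
);
$tag{author} = maybe_hide_mailaddr($tag{author}, \$tag{author_email});
print "$tag{author} / $tag{author_email}\n";

# Without the optional reference only the returned string is redacted.
print maybe_hide_mailaddr('C O Mitter <committer@example.com>'), "\n";
```

With the stub enabled, both the combined string and the separate hash key end up redacted; with it disabled, the input comes back unchanged and the referenced value is left alone.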
> } else {
> $tag{'author_name'} = $tag{'author'};
> }
> @@ -3519,6 +3545,10 @@ sub parse_commit_text {
> if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> $co{'author_name'} = $1;
> $co{'author_email'} = $2;
> + if (gitweb_check_feature('email_privacy')) {
> + $co{'author_email'} = "private";
> + $co{'author'} = hide_mailaddr($co{'author'});
> + }
> } else {
> $co{'author_name'} = $co{'author'};
> }
> @@ -3529,6 +3559,10 @@ sub parse_commit_text {
> if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
> $co{'committer_name'} = $1;
> $co{'committer_email'} = $2;
> + if (gitweb_check_feature('email_privacy')) {
> + $co{'committer_email'} = "private";
> + $co{'committer'} = hide_mailaddr($co{'committer'});
> + }
> [...]
> } else {
> $co{'committer_name'} = $co{'committer'};
> }
> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
> if (! defined $co{'title'} || $co{'title'} eq "") {
> $co{'title'} = $co{'title_short'} = '(no commit message)';
> }
> - # remove added spaces
> + # remove added spaces, redact e-mail addresses if applicable.
> foreach my $line (@commit_lines) {
> $line =~ s/^ //;
> + $line = hide_mailaddr_if_private($line);
> }
> $co{'comment'} = \@commit_lines;
>
> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
> close $fd
> or print "Reading git-diff-tree failed\n";
> } elsif ($format eq 'patch') {
> - local $/ = undef;
> - print <$fd>;
> + while (my $line = <$fd>) {
> + print hide_mailaddr_if_private($line);
> + }
Urm, have you tested this? How does a while loop over <$fd> make sense
when $/ is undef? The readline() operator will then always return just
one record, so the loop body runs at most once.
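
As a rough sketch of how $/ changes what such a loop sees: the script below reads the same made-up input through an in-memory filehandle, once with $/ undefined and once with the default newline separator. Note that the quoted hunk deletes the "local $/ = undef;" line, so which case applies there depends on the value of $/ in effect at that point; count_records() and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Made-up stand-in for the format-patch output read from $fd.
my $patch = "From: A <a\@example.com>\nSubject: demo\n\nbody\n";

# Count how many records a while loop over the handle sees.
sub count_records {
    my ($slurp) = @_;
    open(my $fd, '<', \$patch) or die "cannot open in-memory handle: $!";
    # undef means "no record separator": readline returns everything at once.
    local $/ = $slurp ? undef : "\n";
    my $n = 0;
    while (my $record = <$fd>) {
        $n++;
    }
    close $fd;
    return $n;
}

print "slurp mode: ", count_records(1), " record(s)\n";  # 1
print "line mode:  ", count_records(0), " record(s)\n";  # 4
```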
I'm not sure what the input looks like here, but if you're expecting
this function to replace all e-mail addresses on all lines, that's not
how it'll work: the s/// doesn't have a /g, so it'll stop at the first
replacement.
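
A small sketch of the difference, using the same pattern as hide_mailaddr() in the quoted patch on a made-up line with two addresses:

```perl
use strict;
use warnings;

my $line = 'Cc: A <a@example.com>, B <b@example.com>';

# Without /g only the first <...> is replaced.
(my $first_only = $line) =~ s/<[^>]*>/<private>/;

# With /g every <...> in the string is replaced.
(my $all = $line) =~ s/<[^>]*>/<private>/g;

print "$first_only\n";  # Cc: A <private>, B <b@example.com>
print "$all\n";         # Cc: A <private>, B <private>
```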
> close $fd
> or print "Reading git-format-patch failed\n";
> }
>
> base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f