All of lore.kernel.org
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, peff@peff.net
Subject: Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing
Date: Fri, 16 Sep 2016 12:12:50 -0700	[thread overview]
Message-ID: <xmqqoa3nk6a5.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <7dbb4bc0659056211b27f0033c73f0d558efdb54.1474047135.git.jonathantanmy@google.com> (Jonathan Tan's message of "Fri, 16 Sep 2016 10:37:22 -0700")

Jonathan Tan <jonathantanmy@google.com> writes:

> Within the processing of the commit message, check for a scissors line
> or a patchbreak line first (before checking for in-body headers) so that
> a subsequent patch modifying the processing of in-body headers would not
> cause a scissors line or patchbreak line to be misidentified.
>
> If a line could be both an in-body header and a scissors line (for
> example, "From: -- >8 --"), this is considered a fatal error
> (previously, it would be interpreted as an in-body header).

The scissors line is designed to allow garbage other than scissors
and perforation marks to be on the same line, i.e.

        /*
         * The mark must be at least 8 bytes long (e.g. "-- >8 --").
         * Even though there can be arbitrary cruft on the same line
         * (e.g. "cut here"), in order to avoid misidentification, the
         * perforation must occupy more than a third of the visible
         * width of the line, and dashes and scissors must occupy more
         * than half of the perforation.
         */

Even though it is not likely for people to do so, it would probably
be nicer if we can treat

	From: -- >8 -- cut -- >8 -- >8 -- here -- >8 --

as a scissors line instead of making it a fatal error, by treating
that "From:" as just a random garbage.

But this is a minor point.  It is not worth to make it work like so
if the resulting code will become messier.

> The following enumeration shows that processing is the same except (as
> described above) the in-body header + scissors line case.
>
> o in-body header (check_header OK)
>   o passes UTF-8 conversion
>     o [described above] is scissors line
>     o [not possible] is patchbreak line
>     o [not possible] is blank line
>     o is none of the above - processed as header
>   o fails UTF-8 conversion - processed as header
> o not in-body header
>   o passes UTF-8 conversion
>     o is scissors line - processed as scissors
>     o is patchbreak line - processed as patchbreak
>     o is blank line - ignored if in header_stage
>     o is none of the above - log message
>   o fails UTF-8 conversion - input error
>
> As for the result left in "line" (after the invocation of
> handle_commit_msg), it is unused (by its caller, handle_filter, and by
> handle_filter's callers, handle_boundary and handle_body) unless this
> line is a patchbreak line, in which case handle_patch is subsequently
> called (in handle_filter) on "line". In this case, "line" must have
> passed UTF-8 conversion both before and after this patch, so the result
> is still the same overall.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  mailinfo.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 115 insertions(+), 30 deletions(-)
>
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..23a56c2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>  	return out;
>  }
>  
> -static int convert_to_utf8(struct mailinfo *mi,
> -			   struct strbuf *line, const char *charset)
> +/*
> + * Attempts to convert line into UTF-8, storing the result in line.
> + *
> + * This differs from convert_to_utf8 in that conversion non-success is not
> + * considered an error case - mi->input_error is not set, and no error message
> + * is printed.
> + *
> + * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
> + * old_buf is not NULL).
> + *
> + * If the conversion is successful, returns 0 and stores the unconverted string
> + * in old_buf and old_len (if they are respectively not NULL).
> + *
> + * If the conversion is unsuccessful, returns -1.
> + */
> +static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf *line,
> +			       const char *charset, char **old_buf,
> +			       size_t *old_len)
>  {
> -	char *out;
> +	char *utf8;
>  
> -	if (!mi->metainfo_charset || !charset || !*charset)
> +	if (!mi->metainfo_charset || !charset || !*charset ||
> +	    same_encoding(mi->metainfo_charset, charset)) {
> +		if (old_buf)
> +			*old_buf = NULL;
>  		return 0;
> +	}
>  
> -	if (same_encoding(mi->metainfo_charset, charset))
> +	utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
> +	if (utf8) {
> +		char *temp = strbuf_detach(line, old_len);
> +		if (old_buf)
> +			*old_buf = temp;
> +		strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
>  		return 0;
> -	out = reencode_string(line->buf, mi->metainfo_charset, charset);
> -	if (!out) {
> +	}
> +	return -1;
> +}
> +
> +/*
> + * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
> + */
> +static int convert_to_utf8(struct mailinfo *mi,
> +			   struct strbuf *line, const char *charset)
> +{
> +	if (try_convert_to_utf8(mi, line, charset, NULL, NULL)) {
>  		mi->input_error = -1;
>  		return error("cannot convert from %s to %s",
>  			     charset, mi->metainfo_charset);
>  	}
> -	strbuf_attach(line, out, strlen(out), strlen(out));
>  	return 0;
>  }

Please split this part into its own patch.  IIUC, it moves the meat
of convert_to_utf8() to a more silent try_convert_to_utf8() and then
makes the former a thin wrapper of the latter.  Which by itself is a
good change but does not have anything to do with "fix handling of
the in-body headers", other than that the main fix wants to have
such a more silent helper for its own use.


> @@ -515,6 +548,13 @@ static int check_header(struct mailinfo *mi,
>  	return ret;
>  }
>  
> +static int check_header_raw(struct mailinfo *mi,
> +			    char *buf, size_t len,
> +			    struct strbuf *hdr_data[], int overwrite) {
> +	const struct strbuf sb = {0, len, buf};
> +	return check_header(mi, &sb, hdr_data, overwrite);
> +}

IIUC, this is a helper for callers that do not have a strbuf but
instead have <buf, len> pair to perform the same check_header() the
callers that have strbuf can do.

As check_header() uses the strbuf as a read-only entity, wrapping
the <buf, len> pair in a temporary strbuf like this is safe.

The incoming <buf> should conceptually be "const char *", but it's
OK.

If check_header() didn't call any helper function that gets passed
&sb as a strbuf, or if convertiong the helper function to take a
<buf, len> pair instead, I would actually suggest refactoring this
the other way around, though.  That is, move the implementation of
check_header() to this function, updating its reference to line->buf
and line->len to reference to <buf> and <len>, and then make
check_header() a thin wrapper that does

        check_header(mi, const struct strbuf *line,
                     struct strbuf *hdr_data[], int overwrite)
        {
                return check_header_raw(mi, line->buf, line->len,
                                        hdr_data, overwrite);
        }

I didn't check how involved to update cmp_header() to take <buf,len>
pair.  If it does not look too bad, then I think I would prefer to
do it that way, and as before, make that conversion a separate
preparatory patch.

> @@ -623,32 +663,48 @@ static int is_scissors_line(const struct strbuf *line)
>  		gap * 2 < perforation);
>  }
>  
> -static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +static int resembles_rfc2822_header(const struct strbuf *line)
>  {
> -	assert(!mi->filter_stage);
> +	char *c;
>  
> -	if (mi->header_stage) {
> -		if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
> +	if (!isalpha(line->buf[0]))
> +		return 0;
> +
> +	for (c = line->buf + 1; *c != 0; c++) {
> +		if (*c == ':')
> +			return 1;
> +		else if (*c != '-' && !isalpha(*c))
>  			return 0;
>  	}
> +	return 0;
> +}

Is this helper supposed to handle any rfc2822 looking header, or
only the ones we expect to see as in-body header?


> -	if (mi->use_inbody_headers && mi->header_stage) {
> -		mi->header_stage = check_header(mi, line, mi->s_hdr_data, 0);
> -		if (mi->header_stage)
> -			return 0;
> -	} else
> -		/* Only trim the first (blank) line of the commit message
> -		 * when ignoring in-body headers.
> -		 */
> -		mi->header_stage = 0;
> +static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +{
> +	int ret = 0;
> +	int utf8_result;
> +	char *old_buf;
> +	size_t old_len;
> +
> +	assert(!mi->filter_stage);
>  
> -	/* normalize the log message to UTF-8. */
> -	if (convert_to_utf8(mi, line, mi->charset.buf))
> -		return 0; /* mi->input_error already set */
> +	/*
> +	 * Obtain UTF8 for scissors line and patchbreak checks, but retain the
> +	 * undecoded line in case we need to process it as an in-body header.
> +	 */
> +	utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf, &old_buf,
> +					  &old_len);

Just a minor style suggestion.  As <old_buf, old_len> come in a
pair, fold the line before them, so that the readers can easily
see the association between them.  I.e.

        utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf,
                                          &old_buf, &old_len);


> -	if (mi->use_scissors && is_scissors_line(line)) {
> +	if (!utf8_result && mi->use_scissors && is_scissors_line(line)) {
>  		int i;
>  
> +		if (resembles_rfc2822_header(line))
> +			/*
> +			 * Explicitly reject scissor lines that resemble a RFC
> +			 * 2822 header, to avoid being prone to error.
> +			 */
> +			die("scissors line resembles RFC 2822 header");
> +

I guess "disambiguate to favor scissors" is not that difficult ;-)

>  		strbuf_setlen(&mi->log_message, 0);
>  		mi->header_stage = 1;
>  
> @@ -661,18 +717,47 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
>  				strbuf_release(mi->s_hdr_data[i]);
>  			mi->s_hdr_data[i] = NULL;
>  		}
> -		return 0;
> +		goto handle_commit_msg_out;
>  	}
> -
> -	if (patchbreak(line)) {
> +	if (!utf8_result && patchbreak(line)) {
>  		if (mi->message_id)
>  			strbuf_addf(&mi->log_message,
>  				    "Message-Id: %s\n", mi->message_id);
> -		return 1;
> +		ret = 1;
> +		goto handle_commit_msg_out;
>  	}
>  
> +	if (mi->header_stage) {
> +		char *buf = old_buf ? old_buf : line->buf;
> +		if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
> +			goto handle_commit_msg_out;
> +	}
> +
> +	if (mi->use_inbody_headers && mi->header_stage) {
> +		char *buf = old_buf ? old_buf : line->buf;
> +		size_t len = old_buf ? old_len : line->len;
> +		mi->header_stage = check_header_raw(mi, buf, len,
> +						    mi->s_hdr_data, 0);

Just a minor comment, but I guess check_header_raw() refactoring is
not strictly needed after all, as this callsite can wrap <buf,len>
into a temporary strbuf.

Unlike the real header that is read in read_one_header_line() inside
a loop to implement line folding before check_header() is called, we
call check_header() before possibly-foldable header lines is fully
assembled into one header.  Probably it comes in later patches, I
guess.

It is not immediately obvious to me how this step helps further work
done by later patches in the series until I read them, but so far
what this patch did looks understandable to me ;-)

Thanks.


> +		if (mi->header_stage)
> +			goto handle_commit_msg_out;
> +	} else
> +		/* Only trim the first (blank) line of the commit message
> +		 * when ignoring in-body headers.
> +		 */
> +		mi->header_stage = 0;
> +
> +	/* If adding as a log message, conversion to UTF-8 is required. */
> +	if (utf8_result) {
> +		mi->input_error = -1;
> +		error("cannot convert from %s to %s",
> +		      mi->charset.buf, mi->metainfo_charset);
> +		goto handle_commit_msg_out;
> +	}
>  	strbuf_addbuf(&mi->log_message, line);
> -	return 0;
> +
> +handle_commit_msg_out:
> +	free(old_buf);
> +	return ret;
>  }
>  
>  static void handle_patch(struct mailinfo *mi, const struct strbuf *line)

  reply	other threads:[~2016-09-16 19:12 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-02 19:58 [PATCH] sequencer: support folding in rfc2822 footer Jonathan Tan
2016-09-03  2:23 ` Junio C Hamano
2016-09-06 22:08   ` Jonathan Tan
2016-09-06 23:30     ` Jonathan Tan
2016-09-07  6:38       ` Jeff King
2016-09-16 17:37         ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
2016-09-16 17:37           ` [RFC/PATCH 1/3] mailinfo: refactor commit message processing Jonathan Tan
2016-09-16 19:12             ` Junio C Hamano [this message]
2016-09-16 21:46               ` Jeff King
2016-09-16 17:37           ` [RFC/PATCH 2/3] mailinfo: correct malformed test example Jonathan Tan
2016-09-16 19:19             ` Junio C Hamano
2016-09-16 22:42               ` Jonathan Tan
2016-09-16 22:55                 ` Junio C Hamano
2016-09-17  0:31                   ` Jonathan Tan
2016-09-17  3:48                     ` Junio C Hamano
2016-09-16 17:37           ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
2016-09-16 20:17             ` Junio C Hamano
2016-09-16 20:49               ` Jonathan Tan
2016-09-16 20:59                 ` Junio C Hamano
2016-09-16 22:36                   ` Jonathan Tan
2016-09-16 23:04                     ` Junio C Hamano
2016-09-17  0:22                       ` Jonathan Tan
2016-09-16 21:51             ` Jeff King
2016-09-16 18:29           ` [RFC/PATCH 0/3] handle multiline in-body headers Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqoa3nk6a5.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.