git-mailinfo fixes/features v3

* git-mailinfo fixes/features v3
@ 2007-03-14 20:12 Don Zickus
  2007-03-14 20:12 ` [PATCH 1/5] builtin-mailinfo.c infrastructure changes Don Zickus
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-14 20:12 UTC
  To: git

Another round of cleanups, as noticed by Junio.
Only the first two patches were touched.

-coding style cleanups
-better boundary checking

Cheers,
Don

* [PATCH 1/5] builtin-mailinfo.c infrastructure changes
  2007-03-14 20:12 git-mailinfo fixes/features v3 Don Zickus
@ 2007-03-14 20:12 ` Don Zickus
  2007-03-15 14:35   ` Don Zickus
  2007-03-14 20:12 ` [PATCH 2/5] add the ability to select more email header fields to output Don Zickus
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Don Zickus @ 2007-03-14 20:12 UTC
  To: git; +Cc: Don Zickus

I am working on a project that required parsing through regular mboxes that
didn't necessarily have patches embedded in them.  I started by creating my
own modified copy of git-am and working from there.  Very quickly, I noticed
git-mailinfo wasn't able to handle a big chunk of my email.

After hacking up numerous solutions and running into more limitations, I
decided it was just easier to rewrite a big chunk of it.  The following
patch has a bunch of fixes and features that I needed in order to do
what I wanted.

Note: I didn't follow any email RFCs, but I don't think any of the
changes I made required much knowledge of them (besides the boundary stuff).

List of major changes/fixes:
- fix for not being able to create empty patch files
- empty patch files no longer fail in git-mailinfo; that failure now
  comes from git-am
- nested multipart boundaries are now handled
- only output inbody headers if a patch exists; otherwise assume those
  headers are part of the reply and output the original headers instead
- decode and filter base64 patches correctly
- various other accidental fixes
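
To illustrate the multipart boundary handling in the list above, here is a
minimal standalone sketch of a boundary stack.  It is not the code from this
patch: the names push_boundary, classify_line, pop_boundary and the MAX_DEPTH
limit are invented for the example, and it only models the push/match/pop
bookkeeping, not the reading of each part's headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DEPTH 5	/* hypothetical nesting limit for this sketch */

static char *stack[MAX_DEPTH];	/* open boundaries, innermost last */
static int depth;		/* number of boundaries currently open */

/* Store "--" + value as the new innermost boundary.
 * Returns 0 on success, -1 if nested too deeply or out of memory.
 */
static int push_boundary(const char *value)
{
	size_t len;
	char *b;

	assert(value != NULL);
	if (depth >= MAX_DEPTH)
		return -1;
	len = strlen(value);
	b = malloc(len + 3);	/* "--" + value + NUL */
	if (!b)
		return -1;
	memcpy(b, "--", 2);
	memcpy(b + 2, value, len + 1);
	stack[depth++] = b;
	return 0;
}

/* Compare a body line against the innermost boundary only.
 * Returns 0: not a boundary, 1: start of a new part, 2: closing boundary.
 */
static int classify_line(const char *line)
{
	size_t len;

	if (!depth)
		return 0;
	len = strlen(stack[depth - 1]);
	if (strncmp(line, stack[depth - 1], len))
		return 0;
	return strncmp(line + len, "--", 2) ? 1 : 2;
}

/* Drop the innermost boundary after its closing line was seen. */
static void pop_boundary(void)
{
	if (depth)
		free(stack[--depth]);
}
```

With "outer" and then "inner" pushed, classify_line("--inner\n") reports a
new part and classify_line("--inner--\n") reports the closing boundary, while
"--outer" lines are only recognized again once the inner boundary has been
popped.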

I believe I didn't break any existing functionality or compatibility (other
than what I describe above, which is really only the empty patch file).

I tested this against various mailing list archives (a couple thousand
emails) and everything seemed to parse correctly.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 builtin-mailinfo.c |  520 +++++++++++++++++++++++++++------------------------
 git-am.sh          |    4 +
 git-applymbox.sh   |    4 +
 git-quiltimport.sh |    4 +
 4 files changed, 287 insertions(+), 245 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 766a37e..dacdf77 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -11,19 +11,22 @@ static FILE *cmitmsg, *patchfile, *fin, *fout;
 static int keep_subject;
 static const char *metainfo_charset;
 static char line[1000];
-static char date[1000];
 static char name[1000];
 static char email[1000];
-static char subject[1000];
 
 static enum  {
 	TE_DONTCARE, TE_QP, TE_BASE64,
 } transfer_encoding;
-static char charset[256];
+static enum  {
+	TYPE_TEXT, TYPE_OTHER,
+} message_type;
 
-static char multipart_boundary[1000];
-static int multipart_boundary_len;
+static char charset[256];
 static int patch_lines;
+static char **p_hdr_data, **s_hdr_data;
+
+#define MAX_HDR_PARSED 10
+#define MAX_BOUNDARIES 5
 
 static char *sanity_check(char *name, char *email)
 {
@@ -137,15 +140,13 @@ static int handle_from(char *in_line)
 	return 1;
 }
 
-static int handle_date(char *line)
+static int handle_header(char *line, char *data, int ofs)
 {
-	strcpy(date, line);
-	return 0;
-}
+	if (!line || !data)
+		return 1;
+
+	strcpy(data, line+ofs);
 
-static int handle_subject(char *line)
-{
-	strcpy(subject, line);
 	return 0;
 }
 
@@ -177,17 +178,35 @@ static int slurp_attr(const char *line, const char *name, char *attr)
 	return 1;
 }
 
-static int handle_subcontent_type(char *line)
+struct content_type {
+	char *boundary;
+	int boundary_len;
+};
+
+static struct content_type content[MAX_BOUNDARIES];
+
+static struct content_type *content_top = content;
+
+static int handle_content_type(char *line)
 {
-	/* We do not want to mess with boundary.  Note that we do not
-	 * handle nested multipart.
+	char boundary[256];
+
+	/* treat the part as non-text unless
+	   /line/ contains "text/"
 	 */
-	if (strcasestr(line, "boundary=")) {
-		fprintf(stderr, "Not handling nested multipart message.\n");
-		exit(1);
+	if (strcasestr(line, "text/") == NULL)
+		 message_type = TYPE_OTHER;
+	if (slurp_attr(line, "boundary=", boundary + 2)) {
+		memcpy(boundary, "--", 2);
+		if (++content_top >= &content[MAX_BOUNDARIES]) {
+			fprintf(stderr, "Too many boundaries to handle\n");
+			exit(1);
+		}
+		content_top->boundary_len = strlen(boundary);
+		content_top->boundary = xmalloc(content_top->boundary_len+1);
+		strcpy(content_top->boundary, boundary);
 	}
-	slurp_attr(line, "charset=", charset);
-	if (*charset) {
+	if (slurp_attr(line, "charset=", charset)) {
 		int i, c;
 		for (i = 0; (c = charset[i]) != 0; i++)
 			charset[i] = tolower(c);
@@ -195,17 +214,6 @@ static int handle_subcontent_type(char *line)
 	return 0;
 }
 
-static int handle_content_type(char *line)
-{
-	*multipart_boundary = 0;
-	if (slurp_attr(line, "boundary=", multipart_boundary + 2)) {
-		memcpy(multipart_boundary, "--", 2);
-		multipart_boundary_len = strlen(multipart_boundary);
-	}
-	slurp_attr(line, "charset=", charset);
-	return 0;
-}
-
 static int handle_content_transfer_encoding(char *line)
 {
 	if (strcasestr(line, "base64"))
@@ -219,7 +227,7 @@ static int handle_content_transfer_encoding(char *line)
 
 static int is_multipart_boundary(const char *line)
 {
-	return (!memcmp(line, multipart_boundary, multipart_boundary_len));
+	return (!memcmp(line, content_top->boundary, content_top->boundary_len));
 }
 
 static int eatspace(char *line)
@@ -230,62 +238,6 @@ static int eatspace(char *line)
 	return len;
 }
 
-#define SEEN_FROM 01
-#define SEEN_DATE 02
-#define SEEN_SUBJECT 04
-#define SEEN_BOGUS_UNIX_FROM 010
-#define SEEN_PREFIX  020
-
-/* First lines of body can have From:, Date:, and Subject: or empty */
-static void handle_inbody_header(int *seen, char *line)
-{
-	if (*seen & SEEN_PREFIX)
-		return;
-	if (isspace(*line)) {
-		char *cp;
-		for (cp = line + 1; *cp; cp++) {
-			if (!isspace(*cp))
-				break;
-		}
-		if (!*cp)
-			return;
-	}
-	if (!memcmp(">From", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_BOGUS_UNIX_FROM)) {
-			*seen |= SEEN_BOGUS_UNIX_FROM;
-			return;
-		}
-	}
-	if (!memcmp("From:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_FROM) && handle_from(line+6)) {
-			*seen |= SEEN_FROM;
-			return;
-		}
-	}
-	if (!memcmp("Date:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_DATE)) {
-			handle_date(line+6);
-			*seen |= SEEN_DATE;
-			return;
-		}
-	}
-	if (!memcmp("Subject:", line, 8) && isspace(line[8])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line+9);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	*seen |= SEEN_PREFIX;
-}
-
 static char *cleanup_subject(char *subject)
 {
 	if (keep_subject)
@@ -341,57 +293,62 @@ static void cleanup_space(char *buf)
 }
 
 static void decode_header(char *it);
-typedef int (*header_fn_t)(char *);
-struct header_def {
-	const char *name;
-	header_fn_t func;
-	int namelen;
+static char *header[MAX_HDR_PARSED] = {
+	"From","Subject","Date",
 };
 
-static void check_header(char *line, struct header_def *header)
+static int check_header(char *line, char **hdr_data)
 {
 	int i;
 
-	if (header[0].namelen <= 0) {
-		for (i = 0; header[i].name; i++)
-			header[i].namelen = strlen(header[i].name);
-	}
-	for (i = 0; header[i].name; i++) {
-		int len = header[i].namelen;
-		if (!strncasecmp(line, header[i].name, len) &&
+	/* search for the interesting parts */
+	for (i = 0; header[i]; i++) {
+		int len = strlen(header[i]);
+		if (!hdr_data[i] &&
+		    !strncasecmp(line, header[i], len) &&
 		    line[len] == ':' && isspace(line[len + 1])) {
 			/* Unwrap inline B and Q encoding, and optionally
 			 * normalize the meta information to utf8.
 			 */
 			decode_header(line + len + 2);
-			header[i].func(line + len + 2);
-			break;
+			hdr_data[i] = xmalloc(1000 * sizeof(char));
+			if (! handle_header(line, hdr_data[i], len + 2)) {
+				return 1;
+			}
 		}
 	}
-}
 
-static void check_subheader_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "Content-Type", handle_subcontent_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
-}
-static void check_header_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "From", handle_from },
-		{ "Date", handle_date },
-		{ "Subject", handle_subject },
-		{ "Content-Type", handle_content_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
+	/* Content stuff */
+	if (!strncasecmp(line, "Content-Type", 12) &&
+		line[12] == ':' && isspace(line[12 + 1])) {
+		decode_header(line + 12 + 2);
+		if (! handle_content_type(line)) {
+			return 1;
+		}
+	}
+	if (!strncasecmp(line, "Content-Transfer-Encoding", 25) &&
+		line[25] == ':' && isspace(line[25 + 1])) {
+		decode_header(line + 25 + 2);
+		if (! handle_content_transfer_encoding(line)) {
+			return 1;
+		}
+	}
+
+	/* for inbody stuff */
+	if (!memcmp(">From", line, 5) && isspace(line[5]))
+		return 1;
+	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
+		for (i = 0; header[i]; i++) {
+			if (!memcmp("Subject", header[i], 7) && !hdr_data[i]) {
+				hdr_data[i] = xmalloc(1000 * sizeof(char));
+				handle_header(line, hdr_data[i], 0);
+				return 1;
+			}
+		}
+	}
+
+	/* no match */
+	return 0;
 }
 
 static int is_rfc2822_header(char *line)
@@ -647,147 +604,222 @@ static void decode_transfer_encoding(char *line)
 	}
 }
 
-static void handle_info(void)
+static int handle_filter(char *line);
+
+static int find_boundary(void)
 {
-	char *sub;
+	while(fgets(line, sizeof(line), fin) != NULL) {
+		if (is_multipart_boundary(line))
+			return 1;
+	}
+	return 0;
+}
+
+static int handle_boundary(void)
+{
+again:
+	if (!memcmp(line+content_top->boundary_len, "--", 2)) {
+		/* we hit an end boundary */
+		/* pop the current boundary off the stack */
+		free(content_top->boundary);
+		
+		/* technically won't happen as is_multipart_boundary()
+		   will fail first.  But just in case..
+		 */
+		if (content_top-- < content) {
+			fprintf(stderr, "Detected mismatched boundaries, "
+					"can't recover\n");
+			exit(1);
+		}
+		handle_filter("\n");
+
+		/* skip to the next boundary */
+		if (!find_boundary())
+			return 0;
+		goto again;
+	}
 
-	sub = cleanup_subject(subject);
-	cleanup_space(name);
-	cleanup_space(date);
-	cleanup_space(email);
-	cleanup_space(sub);
+	/* set some defaults */
+	transfer_encoding = TE_DONTCARE;
+	charset[0] = 0;
+	message_type = TYPE_TEXT;
 
-	fprintf(fout, "Author: %s\nEmail: %s\nSubject: %s\nDate: %s\n\n",
-	       name, email, sub, date);
+	/* slurp in this section's info */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	/* eat the blank line after section info */
+	return (fgets(line, sizeof(line), fin) != NULL);
 }
 
-/* We are inside message body and have read line[] already.
- * Spit out the commit log.
- */
-static int handle_commit_msg(int *seen)
+static int handle_commit_msg(char *line)
 {
+	static int still_looking=1;
+
 	if (!cmitmsg)
 		return 0;
-	do {
-		if (!memcmp("diff -", line, 6) ||
-		    !memcmp("---", line, 3) ||
-		    !memcmp("Index: ", line, 7))
-			break;
-		if ((multipart_boundary[0] && is_multipart_boundary(line))) {
-			/* We come here when the first part had only
-			 * the commit message without any patch.  We
-			 * pretend we have not seen this line yet, and
-			 * go back to the loop.
-			 */
-			return 1;
-		}
 
-		/* Unwrap transfer encoding and optionally
-		 * normalize the log message to UTF-8.
-		 */
-		decode_transfer_encoding(line);
-		if (metainfo_charset)
-			convert_to_utf8(line, charset);
+	if (still_looking) {
+		char *cp=line;
+		if (isspace(*line)) {
+			for (cp = line + 1; *cp; cp++) {
+				if (!isspace(*cp))
+					break;
+			}
+			if (!*cp)
+				return 0;
+		}
+		if ((still_looking = check_header(cp, s_hdr_data)) != 0)
+			return 0;
+	}
 
-		handle_inbody_header(seen, line);
-		if (!(*seen & SEEN_PREFIX))
-			continue;
+	if (!memcmp("diff -", line, 6) ||
+	    !memcmp("---", line, 3) ||
+	    !memcmp("Index: ", line, 7)) {
+		fclose(cmitmsg);
+		cmitmsg = NULL;
+		return 1;
+	}
 
-		fputs(line, cmitmsg);
-	} while (fgets(line, sizeof(line), fin) != NULL);
-	fclose(cmitmsg);
-	cmitmsg = NULL;
+	fputs(line, cmitmsg);
 	return 0;
 }
 
-/* We have done the commit message and have the first
- * line of the patch in line[].
- */
-static void handle_patch(void)
+static int handle_patch(char *line)
 {
-	do {
-		if (multipart_boundary[0] && is_multipart_boundary(line))
-			break;
-		/* Only unwrap transfer encoding but otherwise do not
-		 * do anything.  We do *NOT* want UTF-8 conversion
-		 * here; we are dealing with the user payload.
-		 */
-		decode_transfer_encoding(line);
-		fputs(line, patchfile);
-		patch_lines++;
-	} while (fgets(line, sizeof(line), fin) != NULL);
+	fputs(line, patchfile);
+	patch_lines++;
+	return 0;
 }
 
-/* multipart boundary and transfer encoding are set up for us, and we
- * are at the end of the sub header.  do equivalent of handle_body up
- * to the next boundary without closing patchfile --- we will expect
- * that the first part to contain commit message and a patch, and
- * handle other parts as pure patches.
- */
-static int handle_multipart_one_part(int *seen)
+static int handle_filter(char *line)
 {
-	int n = 0;
+	static int filter=0;
 
-	while (fgets(line, sizeof(line), fin) != NULL) {
-	again:
-		n++;
-		if (is_multipart_boundary(line))
+	/* filter tells us which part we left off on
+	 * a non-zero return indicates we hit a filter point
+	 */
+	switch (filter) {
+	case 0:
+		if (!handle_commit_msg(line))
 			break;
-		if (handle_commit_msg(seen))
-			goto again;
-		handle_patch();
-		break;
+		filter++;
+	case 1:
+		if (!handle_patch(line))
+			break;
+		filter++;
+	default:
+		return 1;
 	}
-	if (n == 0)
-		return -1;
+
 	return 0;
 }
 
-static void handle_multipart_body(void)
+static void handle_body(void)
 {
-	int seen = 0;
-	int part_num = 0;
+	int rc=0;
+	static char newline[2000];
+	static char *np=newline;
 
 	/* Skip up to the first boundary */
-	while (fgets(line, sizeof(line), fin) != NULL)
-		if (is_multipart_boundary(line)) {
-			part_num = 1;
+	if (content_top->boundary) {
+		if (!find_boundary())
+			return;
+	}
+
+	do {
+		/* process any boundary lines */
+		if (content_top->boundary && is_multipart_boundary(line)) {
+			/* flush any leftover */
+			if ((transfer_encoding == TE_BASE64)  &&
+			    (np != newline)) {
+				handle_filter(newline);
+			}
+			if (!handle_boundary())
+				return;
+		}
+
+		/* Unwrap transfer encoding and optionally
+		 * normalize the log message to UTF-8.
+		 */
+		decode_transfer_encoding(line);
+		if (metainfo_charset)
+			convert_to_utf8(line, charset);
+
+		switch (transfer_encoding) {
+		case TE_BASE64:
+		{
+			char *op=line;
+
+			/* binary data most likely doesn't have newlines */
+			if (message_type != TYPE_TEXT) {
+				rc=handle_filter(line);
+				break;
+			}
+
+			/* this is a decoded line that may contain
+			 * multiple new lines.  Pass only one chunk
+			 * at a time to handle_filter()
+			 */
+
+			do {
+				while (*op != '\n' && *op != 0)
+					*np++ = *op++;
+				*np = *op;
+				if (*np != 0) {
+					/* should be sitting on a new line */
+					*(++np) = 0;
+					op++;
+					rc=handle_filter(newline);
+					np=newline;
+				}
+			} while (*op != 0);
+			/* the partial chunk is saved in newline and
+			 * will be appended by the next iteration of fgets
+			 */
 			break;
 		}
-	if (!part_num)
-		return;
-	/* We are on boundary line.  Start slurping the subhead. */
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (handle_multipart_one_part(&seen) < 0)
-				return;
-			/* Reset per part headers */
-			transfer_encoding = TE_DONTCARE;
-			charset[0] = 0;
+		default:
+			rc=handle_filter(line);
 		}
-		else
-			check_subheader_line(line);
-	}
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
-	}
+		if (rc)
+			/* nothing left to filter */
+			break;
+	} while (fgets(line, sizeof(line), fin));
+
+	return;
 }
 
-/* Non multipart message */
-static void handle_body(void)
+static void handle_info(void)
 {
-	int seen = 0;
-
-	handle_commit_msg(&seen);
-	handle_patch();
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
+	char *sub;
+	char *hdr;
+	int i;
+
+	for (i=0; header[i]; i++) {
+
+		/* only print inbody headers if we output a patch file */
+		if (patch_lines && s_hdr_data[i])
+			hdr=s_hdr_data[i];
+		else if (p_hdr_data[i])
+			hdr=p_hdr_data[i];
+		else
+			continue;
+
+		if (!memcmp(header[i], "Subject", 7)) {
+			sub = cleanup_subject(hdr);
+			cleanup_space(sub);
+			fprintf(fout, "Subject: %s\n", sub);
+		} else if (!memcmp(header[i], "From", 4)) {
+			handle_from(hdr);
+			fprintf(fout, "Author: %s\n", name);
+			fprintf(fout, "Email: %s\n", email);
+		} else {
+			cleanup_space(hdr);
+			fprintf(fout, "%s: %s\n", header[i], hdr);
+		}
 	}
+	fprintf(fout, "\n");
 }
 
 int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
@@ -809,18 +841,16 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
 		fclose(cmitmsg);
 		return -1;
 	}
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (multipart_boundary[0])
-				handle_multipart_body();
-			else
-				handle_body();
-			handle_info();
-			break;
-		}
-		check_header_line(line);
-	}
+
+	p_hdr_data = xcalloc(MAX_HDR_PARSED, sizeof(char *));
+	s_hdr_data = xcalloc(MAX_HDR_PARSED, sizeof(char *));
+
+	/* process the email header */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	handle_body();
+	handle_info();
 
 	return 0;
 }
diff --git a/git-am.sh b/git-am.sh
index 2c73d11..847a44f 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -290,6 +290,10 @@ do
 		git-mailinfo $keep $utf8 "$dotest/msg" "$dotest/patch" \
 			<"$dotest/$msgnum" >"$dotest/info" ||
 			stop_here $this
+		test -s "$dotest/patch" || {
+			echo "Patch is empty.  Was it split wrong?"
+			stop_here $this
+		}
 		git-stripspace < "$dotest/msg" > "$dotest/msg-clean"
 		;;
 	esac
diff --git a/git-applymbox.sh b/git-applymbox.sh
index 1f68599..2cbdc7e 100755
--- a/git-applymbox.sh
+++ b/git-applymbox.sh
@@ -77,6 +77,10 @@ do
     *)
 	    git-mailinfo $keep_subject $utf8 \
 		.dotest/msg .dotest/patch <$i >.dotest/info || exit 1
+	    test -s .dotest/patch || {
+		echo "Patch is empty.  Was it split wrong?"
+		exit 1
+	    }
 	    git-stripspace < .dotest/msg > .dotest/msg-clean
 	    ;;
     esac
diff --git a/git-quiltimport.sh b/git-quiltimport.sh
index 671a5ff..08ac9bb 100755
--- a/git-quiltimport.sh
+++ b/git-quiltimport.sh
@@ -73,6 +73,10 @@ mkdir $tmp_dir || exit 2
 for patch_name in $(cat "$QUILT_PATCHES/series" | grep -v '^#'); do
 	echo $patch_name
 	(cat $QUILT_PATCHES/$patch_name | git-mailinfo "$tmp_msg" "$tmp_patch" > "$tmp_info") || exit 3
+	test -s "$tmp_patch" || {
+		echo "Patch is empty.  Was it split wrong?"
+		exit 3
+	}
 
 	# Parse the author information
 	export GIT_AUTHOR_NAME=$(sed -ne 's/Author: //p' "$tmp_info")
-- 
1.5.0.2.211.g2ca9-dirty

* Re: [PATCH 1/5] builtin-mailinfo.c infrastructure changes
  2007-03-14 20:12 ` [PATCH 1/5] builtin-mailinfo.c infrastructure changes Don Zickus
@ 2007-03-15 14:35   ` Don Zickus
  0 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-15 14:35 UTC
  To: git

I am working on a project that required parsing through regular mboxes that
didn't necessarily have patches embedded in them.  I started by creating my
own modified copy of git-am and working from there.  Very quickly, I noticed
git-mailinfo wasn't able to handle a big chunk of my email.

After hacking up numerous solutions and running into more limitations, I
decided it was just easier to rewrite a big chunk of it.  The following
patch has a bunch of fixes and features that I needed in order to do
what I wanted.

Note: I didn't follow any email RFCs, but I don't think any of the
changes I made required much knowledge of them (besides the boundary stuff).

List of major changes/fixes:
- fix for not being able to create empty patch files
- empty patch files no longer fail in git-mailinfo; that failure now
  comes from git-am
- nested multipart boundaries are now handled
- only output inbody headers if a patch exists; otherwise assume those
  headers are part of the reply and output the original headers instead
- decode and filter base64 patches correctly
- various other accidental fixes
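
The base64 item above hinges on the fact that a decoded chunk need not end on
a line boundary, so lines have to be reassembled before they can be filtered
into the commit message or the patch.  A rough standalone sketch of that idea
follows; struct line_joiner, feed_chunk, flush_pending, remember_last and
PENDING_MAX are hypothetical names for the example, not identifiers from this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PENDING_MAX 2000	/* hypothetical buffer size for this sketch */

struct line_joiner {
	char pending[PENDING_MAX];	/* partial line carried between chunks */
	size_t used;			/* bytes currently held in pending[] */
};

/* Receives one NUL-terminated line, trailing newline included. */
typedef void (*line_fn)(const char *line, void *ctx);

/* Feed one decoded chunk.  Each completed line is handed to emit();
 * a trailing partial line stays buffered for the next call.  An
 * over-long line is emitted in pieces rather than overflowing.
 * Returns the number of emit() calls made.
 */
static int feed_chunk(struct line_joiner *j, const char *chunk,
		      line_fn emit, void *ctx)
{
	int lines = 0;

	assert(j && chunk && emit);
	for (; *chunk; chunk++) {
		j->pending[j->used++] = *chunk;
		if (*chunk == '\n' || j->used == PENDING_MAX - 1) {
			j->pending[j->used] = '\0';
			emit(j->pending, ctx);
			j->used = 0;
			lines++;
		}
	}
	return lines;
}

/* Emit whatever is left once the encoded part ends.
 * Returns 1 if something was emitted, 0 if the buffer was empty.
 */
static int flush_pending(struct line_joiner *j, line_fn emit, void *ctx)
{
	assert(j && emit);
	if (!j->used)
		return 0;
	j->pending[j->used] = '\0';
	emit(j->pending, ctx);
	j->used = 0;
	return 1;
}

/* Example callback: copy the line into a caller-supplied buffer of at
 * least PENDING_MAX bytes, so the caller can inspect the latest line.
 */
static void remember_last(const char *line, void *ctx)
{
	strcpy((char *)ctx, line);
}
```

Feeding "diff --git a/x" followed by " b/x\n--- a" emits nothing for the first
chunk and one full line for the second; the trailing "--- a" stays buffered
until more data arrives or flush_pending() is called at the end of the part.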

I believe I didn't break any existing functionality or compatibility (other
than what I describe above, which is really only the empty patch file).

I tested this against various mailing list archives (a couple thousand
emails) and everything seemed to parse correctly.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---

Accidentally sent out the wrong patch yesterday.

---
 builtin-mailinfo.c |  520 +++++++++++++++++++++++++++------------------------
 git-am.sh          |    4 +
 git-applymbox.sh   |    4 +
 git-quiltimport.sh |    4 +
 4 files changed, 287 insertions(+), 245 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 766a37e..a5eea82 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -11,19 +11,22 @@ static FILE *cmitmsg, *patchfile, *fin, *fout;
 static int keep_subject;
 static const char *metainfo_charset;
 static char line[1000];
-static char date[1000];
 static char name[1000];
 static char email[1000];
-static char subject[1000];
 
 static enum  {
 	TE_DONTCARE, TE_QP, TE_BASE64,
 } transfer_encoding;
-static char charset[256];
+static enum  {
+	TYPE_TEXT, TYPE_OTHER,
+} message_type;
 
-static char multipart_boundary[1000];
-static int multipart_boundary_len;
+static char charset[256];
 static int patch_lines;
+static char **p_hdr_data, **s_hdr_data;
+
+#define MAX_HDR_PARSED 10
+#define MAX_BOUNDARIES 5
 
 static char *sanity_check(char *name, char *email)
 {
@@ -137,15 +140,13 @@ static int handle_from(char *in_line)
 	return 1;
 }
 
-static int handle_date(char *line)
+static int handle_header(char *line, char *data, int ofs)
 {
-	strcpy(date, line);
-	return 0;
-}
+	if (!line || !data)
+		return 1;
+
+	strcpy(data, line+ofs);
 
-static int handle_subject(char *line)
-{
-	strcpy(subject, line);
 	return 0;
 }
 
@@ -177,17 +178,35 @@ static int slurp_attr(const char *line, const char *name, char *attr)
 	return 1;
 }
 
-static int handle_subcontent_type(char *line)
+struct content_type {
+	char *boundary;
+	int boundary_len;
+};
+
+static struct content_type content[MAX_BOUNDARIES];
+
+static struct content_type *content_top = content;
+
+static int handle_content_type(char *line)
 {
-	/* We do not want to mess with boundary.  Note that we do not
-	 * handle nested multipart.
+	char boundary[256];
+
+	/* treat the part as non-text unless
+	   /line/ contains "text/"
 	 */
-	if (strcasestr(line, "boundary=")) {
-		fprintf(stderr, "Not handling nested multipart message.\n");
-		exit(1);
+	if (strcasestr(line, "text/") == NULL)
+		 message_type = TYPE_OTHER;
+	if (slurp_attr(line, "boundary=", boundary + 2)) {
+		memcpy(boundary, "--", 2);
+		if (++content_top >= &content[MAX_BOUNDARIES]) {
+			fprintf(stderr, "Too many boundaries to handle\n");
+			exit(1);
+		}
+		content_top->boundary_len = strlen(boundary);
+		content_top->boundary = xmalloc(content_top->boundary_len+1);
+		strcpy(content_top->boundary, boundary);
 	}
-	slurp_attr(line, "charset=", charset);
-	if (*charset) {
+	if (slurp_attr(line, "charset=", charset)) {
 		int i, c;
 		for (i = 0; (c = charset[i]) != 0; i++)
 			charset[i] = tolower(c);
@@ -195,17 +214,6 @@ static int handle_subcontent_type(char *line)
 	return 0;
 }
 
-static int handle_content_type(char *line)
-{
-	*multipart_boundary = 0;
-	if (slurp_attr(line, "boundary=", multipart_boundary + 2)) {
-		memcpy(multipart_boundary, "--", 2);
-		multipart_boundary_len = strlen(multipart_boundary);
-	}
-	slurp_attr(line, "charset=", charset);
-	return 0;
-}
-
 static int handle_content_transfer_encoding(char *line)
 {
 	if (strcasestr(line, "base64"))
@@ -219,7 +227,7 @@ static int handle_content_transfer_encoding(char *line)
 
 static int is_multipart_boundary(const char *line)
 {
-	return (!memcmp(line, multipart_boundary, multipart_boundary_len));
+	return (!memcmp(line, content_top->boundary, content_top->boundary_len));
 }
 
 static int eatspace(char *line)
@@ -230,62 +238,6 @@ static int eatspace(char *line)
 	return len;
 }
 
-#define SEEN_FROM 01
-#define SEEN_DATE 02
-#define SEEN_SUBJECT 04
-#define SEEN_BOGUS_UNIX_FROM 010
-#define SEEN_PREFIX  020
-
-/* First lines of body can have From:, Date:, and Subject: or empty */
-static void handle_inbody_header(int *seen, char *line)
-{
-	if (*seen & SEEN_PREFIX)
-		return;
-	if (isspace(*line)) {
-		char *cp;
-		for (cp = line + 1; *cp; cp++) {
-			if (!isspace(*cp))
-				break;
-		}
-		if (!*cp)
-			return;
-	}
-	if (!memcmp(">From", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_BOGUS_UNIX_FROM)) {
-			*seen |= SEEN_BOGUS_UNIX_FROM;
-			return;
-		}
-	}
-	if (!memcmp("From:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_FROM) && handle_from(line+6)) {
-			*seen |= SEEN_FROM;
-			return;
-		}
-	}
-	if (!memcmp("Date:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_DATE)) {
-			handle_date(line+6);
-			*seen |= SEEN_DATE;
-			return;
-		}
-	}
-	if (!memcmp("Subject:", line, 8) && isspace(line[8])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line+9);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	*seen |= SEEN_PREFIX;
-}
-
 static char *cleanup_subject(char *subject)
 {
 	if (keep_subject)
@@ -341,57 +293,62 @@ static void cleanup_space(char *buf)
 }
 
 static void decode_header(char *it);
-typedef int (*header_fn_t)(char *);
-struct header_def {
-	const char *name;
-	header_fn_t func;
-	int namelen;
+static char *header[MAX_HDR_PARSED] = {
+	"From","Subject","Date",
 };
 
-static void check_header(char *line, struct header_def *header)
+static int check_header(char *line, char **hdr_data)
 {
 	int i;
 
-	if (header[0].namelen <= 0) {
-		for (i = 0; header[i].name; i++)
-			header[i].namelen = strlen(header[i].name);
-	}
-	for (i = 0; header[i].name; i++) {
-		int len = header[i].namelen;
-		if (!strncasecmp(line, header[i].name, len) &&
+	/* search for the interesting parts */
+	for (i = 0; header[i]; i++) {
+		int len = strlen(header[i]);
+		if (!hdr_data[i] &&
+		    !strncasecmp(line, header[i], len) &&
 		    line[len] == ':' && isspace(line[len + 1])) {
 			/* Unwrap inline B and Q encoding, and optionally
 			 * normalize the meta information to utf8.
 			 */
 			decode_header(line + len + 2);
-			header[i].func(line + len + 2);
-			break;
+			hdr_data[i] = xmalloc(1000 * sizeof(char));
+			if (! handle_header(line, hdr_data[i], len + 2)) {
+				return 1;
+			}
 		}
 	}
-}
 
-static void check_subheader_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "Content-Type", handle_subcontent_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
-}
-static void check_header_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "From", handle_from },
-		{ "Date", handle_date },
-		{ "Subject", handle_subject },
-		{ "Content-Type", handle_content_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
+	/* Content stuff */
+	if (!strncasecmp(line, "Content-Type", 12) &&
+		line[12] == ':' && isspace(line[12 + 1])) {
+		decode_header(line + 12 + 2);
+		if (! handle_content_type(line)) {
+			return 1;
+		}
+	}
+	if (!strncasecmp(line, "Content-Transfer-Encoding", 25) &&
+		line[25] == ':' && isspace(line[25 + 1])) {
+		decode_header(line + 25 + 2);
+		if (! handle_content_transfer_encoding(line)) {
+			return 1;
+		}
+	}
+
+	/* for inbody stuff */
+	if (!memcmp(">From", line, 5) && isspace(line[5]))
+		return 1;
+	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
+		for (i = 0; header[i]; i++) {
+			if (!memcmp("Subject", header[i], 7) && !hdr_data[i]) {
+				hdr_data[i] = xmalloc(1000 * sizeof(char));
+				handle_header(line, hdr_data[i], 0);
+				return 1;
+			}
+		}
+	}
+
+	/* no match */
+	return 0;
 }
 
 static int is_rfc2822_header(char *line)
@@ -647,147 +604,222 @@ static void decode_transfer_encoding(char *line)
 	}
 }
 
-static void handle_info(void)
+static int handle_filter(char *line);
+
+static int find_boundary(void)
 {
-	char *sub;
+	while(fgets(line, sizeof(line), fin) != NULL) {
+		if (is_multipart_boundary(line))
+			return 1;
+	}
+	return 0;
+}
+
+static int handle_boundary(void)
+{
+again:
+	if (!memcmp(line+content_top->boundary_len, "--", 2)) {
+		/* we hit an end boundary */
+		/* pop the current boundary off the stack */
+		free(content_top->boundary);
+		
+		/* technically won't happen as is_multipart_boundary()
+		   will fail first.  But just in case..
+		 */
+		if (content_top-- < content) {
+			fprintf(stderr, "Detected mismatched boundaries, "
+					"can't recover\n");
+			exit(1);
+		}
+		handle_filter("\n");
+
+		/* skip to the next boundary */
+		if (!find_boundary())
+			return 0;
+		goto again;
+	}
 
-	sub = cleanup_subject(subject);
-	cleanup_space(name);
-	cleanup_space(date);
-	cleanup_space(email);
-	cleanup_space(sub);
+	/* set some defaults */
+	transfer_encoding = TE_DONTCARE;
+	charset[0] = 0;
+	message_type = TYPE_TEXT;
 
-	fprintf(fout, "Author: %s\nEmail: %s\nSubject: %s\nDate: %s\n\n",
-	       name, email, sub, date);
+	/* slurp in this section's info */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	/* eat the blank line after section info */
+	return (fgets(line, sizeof(line), fin) != NULL);
 }
 
-/* We are inside message body and have read line[] already.
- * Spit out the commit log.
- */
-static int handle_commit_msg(int *seen)
+static int handle_commit_msg(char *line)
 {
+	static int still_looking=1;
+
 	if (!cmitmsg)
 		return 0;
-	do {
-		if (!memcmp("diff -", line, 6) ||
-		    !memcmp("---", line, 3) ||
-		    !memcmp("Index: ", line, 7))
-			break;
-		if ((multipart_boundary[0] && is_multipart_boundary(line))) {
-			/* We come here when the first part had only
-			 * the commit message without any patch.  We
-			 * pretend we have not seen this line yet, and
-			 * go back to the loop.
-			 */
-			return 1;
-		}
 
-		/* Unwrap transfer encoding and optionally
-		 * normalize the log message to UTF-8.
-		 */
-		decode_transfer_encoding(line);
-		if (metainfo_charset)
-			convert_to_utf8(line, charset);
+	if (still_looking) {
+		char *cp=line;
+		if (isspace(*line)) {
+			for (cp = line + 1; *cp; cp++) {
+				if (!isspace(*cp))
+					break;
+			}
+			if (!*cp)
+				return 0;
+		}
+		if ((still_looking = check_header(cp, s_hdr_data)) != 0)
+			return 0;
+	}
 
-		handle_inbody_header(seen, line);
-		if (!(*seen & SEEN_PREFIX))
-			continue;
+	if (!memcmp("diff -", line, 6) ||
+	    !memcmp("---", line, 3) ||
+	    !memcmp("Index: ", line, 7)) {
+		fclose(cmitmsg);
+		cmitmsg = NULL;
+		return 1;
+	}
 
-		fputs(line, cmitmsg);
-	} while (fgets(line, sizeof(line), fin) != NULL);
-	fclose(cmitmsg);
-	cmitmsg = NULL;
+	fputs(line, cmitmsg);
 	return 0;
 }
 
-/* We have done the commit message and have the first
- * line of the patch in line[].
- */
-static void handle_patch(void)
+static int handle_patch(char *line)
 {
-	do {
-		if (multipart_boundary[0] && is_multipart_boundary(line))
-			break;
-		/* Only unwrap transfer encoding but otherwise do not
-		 * do anything.  We do *NOT* want UTF-8 conversion
-		 * here; we are dealing with the user payload.
-		 */
-		decode_transfer_encoding(line);
-		fputs(line, patchfile);
-		patch_lines++;
-	} while (fgets(line, sizeof(line), fin) != NULL);
+	fputs(line, patchfile);
+	patch_lines++;
+	return 0;
 }
 
-/* multipart boundary and transfer encoding are set up for us, and we
- * are at the end of the sub header.  do equivalent of handle_body up
- * to the next boundary without closing patchfile --- we will expect
- * that the first part to contain commit message and a patch, and
- * handle other parts as pure patches.
- */
-static int handle_multipart_one_part(int *seen)
+static int handle_filter(char *line)
 {
-	int n = 0;
+	static int filter=0;
 
-	while (fgets(line, sizeof(line), fin) != NULL) {
-	again:
-		n++;
-		if (is_multipart_boundary(line))
+	/* filter tells us which part we left off on
+	 * a non-zero return indicates we hit a filter point
+	 */
+	switch (filter) {
+	case 0:
+		if (!handle_commit_msg(line))
 			break;
-		if (handle_commit_msg(seen))
-			goto again;
-		handle_patch();
-		break;
+		filter++;
+	case 1:
+		if (!handle_patch(line))
+			break;
+		filter++;
+	default:
+		return 1;
 	}
-	if (n == 0)
-		return -1;
+
 	return 0;
 }
 
-static void handle_multipart_body(void)
+static void handle_body(void)
 {
-	int seen = 0;
-	int part_num = 0;
+	int rc=0;
+	static char newline[2000];
+	static char *np=newline;
 
 	/* Skip up to the first boundary */
-	while (fgets(line, sizeof(line), fin) != NULL)
-		if (is_multipart_boundary(line)) {
-			part_num = 1;
+	if (content_top->boundary) {
+		if (!find_boundary())
+			return;
+	}
+
+	do {
+		/* process any boundary lines */
+		if (content_top->boundary && is_multipart_boundary(line)) {
+			/* flush any leftover */
+			if ((transfer_encoding == TE_BASE64)  &&
+			    (np != newline)) {
+				handle_filter(newline);
+			}
+			if (!handle_boundary())
+				return;
+		}
+
+		/* Unwrap transfer encoding and optionally
+		 * normalize the log message to UTF-8.
+		 */
+		decode_transfer_encoding(line);
+		if (metainfo_charset)
+			convert_to_utf8(line, charset);
+
+		switch (transfer_encoding) {
+		case TE_BASE64:
+		{
+			char *op=line;
+
+			/* binary data most likely doesn't have newlines */
+			if (message_type != TYPE_TEXT) {
+				rc=handle_filter(line);
+				break;
+			}
+
+			/* this is a decoded line that may contain
+			 * multiple new lines.  Pass only one chunk
+			 * at a time to handle_filter()
+			 */
+
+			do {
+				while (*op != '\n' && *op != 0)
+					*np++ = *op++;
+				*np = *op;
+				if (*np != 0) {
+					/* should be sitting on a new line */
+					*(++np) = 0;
+					op++;
+					rc=handle_filter(newline);
+					np=newline;
+				}
+			} while (*op != 0);
+			/* the partial chunk is saved in newline and
+			 * will be appended by the next iteration of fgets
+			 */
 			break;
 		}
-	if (!part_num)
-		return;
-	/* We are on boundary line.  Start slurping the subhead. */
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (handle_multipart_one_part(&seen) < 0)
-				return;
-			/* Reset per part headers */
-			transfer_encoding = TE_DONTCARE;
-			charset[0] = 0;
+		default:
+			rc=handle_filter(line);
 		}
-		else
-			check_subheader_line(line);
-	}
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
-	}
+		if (rc)
+			/* nothing left to filter */
+			break;
+	} while (fgets(line, sizeof(line), fin));
+
+	return;
 }
 
-/* Non multipart message */
-static void handle_body(void)
+static void handle_info(void)
 {
-	int seen = 0;
-
-	handle_commit_msg(&seen);
-	handle_patch();
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
+	char *sub;
+	char *hdr;
+	int i;
+
+	for (i = 0; header[i]; i++) {
+
+		/* only print inbody headers if we output a patch file */
+		if (patch_lines && s_hdr_data[i])
+			hdr=s_hdr_data[i];
+		else if (p_hdr_data[i])
+			hdr=p_hdr_data[i];
+		else
+			continue;
+
+		if (!memcmp(header[i], "Subject", 7)) {
+			sub = cleanup_subject(hdr);
+			cleanup_space(sub);
+			fprintf(fout, "Subject: %s\n", sub);
+		} else if (!memcmp(header[i], "From", 4)) {
+			handle_from(hdr);
+			fprintf(fout, "Author: %s\n", name);
+			fprintf(fout, "Email: %s\n", email);
+		} else {
+			cleanup_space(hdr);
+			fprintf(fout, "%s: %s\n", header[i], hdr);
+		}
 	}
+	fprintf(fout, "\n");
 }
 
 int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
@@ -809,18 +841,16 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
 		fclose(cmitmsg);
 		return -1;
 	}
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (multipart_boundary[0])
-				handle_multipart_body();
-			else
-				handle_body();
-			handle_info();
-			break;
-		}
-		check_header_line(line);
-	}
+
+	p_hdr_data = xcalloc(MAX_HDR_PARSED, sizeof(char *));
+	s_hdr_data = xcalloc(MAX_HDR_PARSED, sizeof(char *));
+
+	/* process the email header */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	handle_body();
+	handle_info();
 
 	return 0;
 }
diff --git a/git-am.sh b/git-am.sh
index 2c73d11..847a44f 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -290,6 +290,10 @@ do
 		git-mailinfo $keep $utf8 "$dotest/msg" "$dotest/patch" \
 			<"$dotest/$msgnum" >"$dotest/info" ||
 			stop_here $this
+		test -s $dotest/patch || { 
+			echo "Patch is empty.  Was it split wrong?"
+			stop_here $this
+		}
 		git-stripspace < "$dotest/msg" > "$dotest/msg-clean"
 		;;
 	esac
diff --git a/git-applymbox.sh b/git-applymbox.sh
index 1f68599..2cbdc7e 100755
--- a/git-applymbox.sh
+++ b/git-applymbox.sh
@@ -77,6 +77,10 @@ do
     *)
 	    git-mailinfo $keep_subject $utf8 \
 		.dotest/msg .dotest/patch <$i >.dotest/info || exit 1
+	    test -s .dotest/patch || {
+		echo "Patch is empty.  Was it split wrong?"
+		exit 1
+	    }
 	    git-stripspace < .dotest/msg > .dotest/msg-clean
 	    ;;
     esac
diff --git a/git-quiltimport.sh b/git-quiltimport.sh
index 671a5ff..08ac9bb 100755
--- a/git-quiltimport.sh
+++ b/git-quiltimport.sh
@@ -73,6 +73,10 @@ mkdir $tmp_dir || exit 2
 for patch_name in $(cat "$QUILT_PATCHES/series" | grep -v '^#'); do
 	echo $patch_name
 	(cat $QUILT_PATCHES/$patch_name | git-mailinfo "$tmp_msg" "$tmp_patch" > "$tmp_info") || exit 3
+	test -s "$tmp_patch" || {
+		echo "Patch is empty.  Was it split wrong?"
+		exit 1
+	}
 
 	# Parse the author information
 	export GIT_AUTHOR_NAME=$(sed -ne 's/Author: //p' "$tmp_info")
-- 
1.5.0.2.213.g18c8-dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread
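
The base64 branch of handle_body() in the patch above re-splits decoded
text into single lines before handing it to handle_filter(), carrying any
partial line over to the next chunk.  A minimal standalone sketch of that
carry-over idea follows.  The names line_feeder and next_line are
hypothetical and not part of the patch, and the early split of over-long
lines is an assumption of this sketch (the patch itself has no such bound
check on newline[]).

```c
#include <stddef.h>

/* Hypothetical helper state; not part of builtin-mailinfo.c. */
struct line_feeder {
	char buf[2000];	/* pending partial line, same size as newline[] */
	size_t len;	/* number of bytes currently held in buf */
};

/*
 * Consume bytes from *chunk until one complete line is available.
 * Returns a pointer to the NUL-terminated line (valid until the next
 * call), or NULL when the chunk is exhausted and only a partial line
 * is buffered.  *chunk is advanced past the consumed bytes.
 * Assumption of this sketch: a line that would overflow buf is
 * returned early instead of being written past the end.
 */
static const char *next_line(struct line_feeder *f, const char **chunk)
{
	const char *p = *chunk;

	while (*p) {
		char c = *p++;

		f->buf[f->len++] = c;
		if (c == '\n' || f->len == sizeof(f->buf) - 1) {
			f->buf[f->len] = '\0';
			f->len = 0;
			*chunk = p;
			return f->buf;
		}
	}
	*chunk = p;
	return NULL;
}
```

A caller would loop on next_line() for every decoded chunk and, as the
patch does at a boundary line, flush whatever is still buffered when the
MIME part ends.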

* [PATCH 2/5] add the ability to select more email header fields to output
  2007-03-14 20:12 git-mailinfo fixes/features v3 Don Zickus
  2007-03-14 20:12 ` [PATCH 1/5] builtin-mailinfo.c infrastrcture changes Don Zickus
@ 2007-03-14 20:12 ` Don Zickus
  2007-03-15 14:36   ` Don Zickus
  2007-03-14 20:12 ` [PATCH 3/5] restrict the patch filtering Don Zickus
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Don Zickus @ 2007-03-14 20:12 UTC (permalink / raw)
  To: git; +Cc: Don Zickus

This is useful when scripts need more than just the basic email headers to
parse.  By specifying the "-x=<field>" option, one can search for and output
any header field they want.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 builtin-mailinfo.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index dacdf77..dd0f563 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -856,11 +856,12 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
 }
 
 static const char mailinfo_usage[] =
-	"git-mailinfo [-k] [-u | --encoding=<encoding>] msg patch <mail >info";
+	"git-mailinfo [-k] [-u | --encoding=<encoding>] [-x=<field>] msg patch <mail >info";
 
 int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 {
 	const char *def_charset;
+	int top;
 
 	/* NEEDSWORK: might want to do the optional .git/ directory
 	 * discovery
@@ -870,6 +871,8 @@ int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 	def_charset = (git_commit_encoding ? git_commit_encoding : "utf-8");
 	metainfo_charset = def_charset;
 
+	for (top=0; header[top]; top++){ ; }
+
 	while (1 < argc && argv[1][0] == '-') {
 		if (!strcmp(argv[1], "-k"))
 			keep_subject = 1;
@@ -879,7 +882,10 @@ int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 			metainfo_charset = NULL;
 		else if (!prefixcmp(argv[1], "--encoding="))
 			metainfo_charset = argv[1] + 11;
-		else
+		else if (!prefixcmp(argv[1], "-x=")) {
+			header[top] = xmalloc(256*sizeof(char));
+			strncpy(header[top++], argv[1]+3, 256);
+		} else
 			usage(mailinfo_usage);
 		argc--; argv++;
 	}
-- 
1.5.0.2.211.g2ca9-dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/5] add the ability to select more email header fields to output
  2007-03-14 20:12 ` [PATCH 2/5] add the ability to select more email header fields to output Don Zickus
@ 2007-03-15 14:36   ` Don Zickus
  0 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-15 14:36 UTC (permalink / raw)
  To: git

This is useful when scripts need more than just the basic email headers to
parse.  By specifying the "-x=<field>" option, one can search for and output
any header field they want.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---

 Accidentally sent out the wrong patch yesterday.

---
 builtin-mailinfo.c |   22 ++++++++++++++++------
 1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index a5eea82..8ac6ef4 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -302,7 +302,7 @@ static int check_header(char *line, char **hdr_data)
 	int i;
 
 	/* search for the interesting parts */
-	for (i = 0; header[i]; i++) {
+	for (i = 0; header[i] && i < MAX_HDR_PARSED; i++) {
 		int len = strlen(header[i]);
 		if (!hdr_data[i] &&
 		    !strncasecmp(line, header[i], len) &&
@@ -338,8 +338,8 @@ static int check_header(char *line, char **hdr_data)
 	if (!memcmp(">From", line, 5) && isspace(line[5]))
 		return 1;
 	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
-		for (i = 0; header[i]; i++) {
-			if (!memcmp("Subject: ", header[i], 9)) {
+		for (i = 0; header[i] && i < MAX_HDR_PARSED; i++) {
+			if (!memcmp("Subject", header[i], 7)) {
 				if (! handle_header(line, hdr_data[i], 0)) {
 					return 1;
 				}
@@ -796,7 +796,7 @@ static void handle_info(void)
 	char *hdr;
 	int i;
 
-	for (i = 0; header[i]; i++) {
+	for (i = 0; header[i] && i < MAX_HDR_PARSED; i++) {
 
 		/* only print inbody headers if we output a patch file */
 		if (patch_lines && s_hdr_data[i])
@@ -856,11 +856,12 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
 }
 
 static const char mailinfo_usage[] =
-	"git-mailinfo [-k] [-u | --encoding=<encoding>] msg patch <mail >info";
+	"git-mailinfo [-k] [-u | --encoding=<encoding>] [-x=<field>] msg patch <mail >info";
 
 int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 {
 	const char *def_charset;
+	int top;
 
 	/* NEEDSWORK: might want to do the optional .git/ directory
 	 * discovery
@@ -870,6 +871,8 @@ int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 	def_charset = (git_commit_encoding ? git_commit_encoding : "utf-8");
 	metainfo_charset = def_charset;
 
+	for (top = 0; header[top] && top < MAX_HDR_PARSED; top++){ ; }
+
 	while (1 < argc && argv[1][0] == '-') {
 		if (!strcmp(argv[1], "-k"))
 			keep_subject = 1;
@@ -879,7 +882,14 @@ int cmd_mailinfo(int argc, const char **argv, const char *prefix)
 			metainfo_charset = NULL;
 		else if (!prefixcmp(argv[1], "--encoding="))
 			metainfo_charset = argv[1] + 11;
-		else
+		else if (!prefixcmp(argv[1], "-x=")) {
+			if (top >= MAX_HDR_PARSED) {
+				fprintf(stderr, "too many headers to parse\n");
+				exit(1);
+			}
+			header[top] = xmalloc(256*sizeof(char));
+			strncpy(header[top++], argv[1]+3, 256);
+		} else
 			usage(mailinfo_usage);
 		argc--; argv++;
 	}
-- 
1.5.0.2.213.g18c8-dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread
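
The "-x=<field>" handling above appends to a NULL-terminated header table
with a fixed capacity.  The sketch below shows one way to do that append
with the bound check folded into a helper.  It is an illustration, not the
patch's code: add_header is a hypothetical name, plain malloc() stands in
for xmalloc(), and the copy is sized from the argument instead of using
strncpy() into 256 bytes (strncpy() leaves the result unterminated when the
source is 256 bytes or longer).

```c
#include <stdlib.h>
#include <string.h>

#define MAX_HDR_PARSED 10

/* NULL-terminated table; one extra slot is reserved for the terminator. */
static const char *header[MAX_HDR_PARSED + 1] = {
	"From", "Subject", "Date",
};

/*
 * Append one header name (hypothetical helper, not in the patch).
 * Returns the new number of entries, or -1 if the table is full or
 * memory could not be allocated.
 */
static int add_header(const char *field)
{
	int top = 0;
	size_t len = strlen(field);
	char *copy;

	while (top < MAX_HDR_PARSED && header[top])
		top++;
	if (top >= MAX_HDR_PARSED)
		return -1;

	copy = malloc(len + 1);
	if (!copy)
		return -1;
	memcpy(copy, field, len + 1);	/* includes the trailing NUL */
	header[top] = copy;
	return top + 1;
}
```

Reserving a slot for the terminator keeps loops of the form
"for (i = 0; header[i]; i++)" safe even when the table is full.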

* [PATCH 3/5] restrict the patch filtering
  2007-03-14 20:12 git-mailinfo fixes/features v3 Don Zickus
  2007-03-14 20:12 ` [PATCH 1/5] builtin-mailinfo.c infrastrcture changes Don Zickus
  2007-03-14 20:12 ` [PATCH 2/5] add the ability to select more email header fields to output Don Zickus
@ 2007-03-14 20:12 ` Don Zickus
  2007-03-14 20:12 ` [PATCH 4/5] Add a couple more test cases to the suite Don Zickus
  2007-03-14 20:12 ` [PATCH 5/5] fix a utf8 issue in t5100/patch005 Don Zickus
  4 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-14 20:12 UTC (permalink / raw)
  To: git; +Cc: Don Zickus

I have come across many emails that use long strings of '-'s as separators
for ideas.  This patch limits the patch separator to exactly three '-'
characters, so that longer strings of '-'s stay in the commit message
instead of starting the patch file.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

---
I purposely separated this patch out because I wasn't sure if anyone would
have objections to it.  I tested it on numerous emails with and without
patches and didn't see any issues.
---
 builtin-mailinfo.c |   37 ++++++++++++++++++++++++++++++++++---
 1 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index dd0f563..a8d5b60 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -652,6 +652,39 @@ again:
 	return (fgets(line, sizeof(line), fin) != NULL);
 }
 
+static inline int patchbreak(const char *line)
+{
+	/* Beginning of a "diff -" header? */
+	if (!memcmp("diff -", line, 6))
+		return 1;
+
+	/* CVS "Index: " line? */
+	if (!memcmp("Index: ", line, 7))
+		return 1;
+
+	/*
+	 * "--- <filename>" starts patches without headers
+	 * "---<sp>*" is a manual separator
+	 */
+	if (!memcmp("---", line, 3)) {
+		line += 3;
+		/* space followed by a filename? */
+		if (line[0] == ' ' && !isspace(line[1]))
+			return 1;
+		/* Just whitespace? */
+		for (;;) {
+			unsigned char c = *line++;
+			if (c == '\n')
+				return 1;
+			if (!isspace(c))
+				break;
+		}
+		return 0;
+	}
+	return 0;
+}
+
+
 static int handle_commit_msg(char *line)
 {
 	static int still_looking=1;
@@ -673,9 +706,7 @@ static int handle_commit_msg(char *line)
 			return 0;
 	}
 
-	if (!memcmp("diff -", line, 6) ||
-	    !memcmp("---", line, 3) ||
-	    !memcmp("Index: ", line, 7)) {
+	if (patchbreak(line)) {
 		fclose(cmitmsg);
 		cmitmsg = NULL;
 		return 1;
-- 
1.5.0.2.211.g2ca9-dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread
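
To make the separator rule from patch 3/5 concrete, here is a standalone
sketch of the same test for newline-terminated lines.  The name
is_patch_break is hypothetical, and the sketch uses strncmp() where the
patch uses memcmp(), so that lines shorter than the prefix are never read
past their terminator; that substitution is a choice made for this
illustration only.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if the line ends the commit message and starts the patch. */
static int is_patch_break(const char *line)
{
	/* "diff -" header or CVS "Index: " line */
	if (!strncmp(line, "diff -", 6) || !strncmp(line, "Index: ", 7))
		return 1;

	if (strncmp(line, "---", 3))
		return 0;
	line += 3;

	/* "--- <filename>": a patch without a "diff" header */
	if (line[0] == ' ' && line[1] && !isspace((unsigned char)line[1]))
		return 1;

	/* "---" followed only by whitespace: a manual separator */
	while (*line && *line != '\n') {
		if (!isspace((unsigned char)*line))
			return 0;	/* e.g. "----------" stays in the message */
		line++;
	}
	return *line == '\n';
}
```

With this rule "---\n" and "--- a/foo.c\n" both start the patch, while a
decorative line such as "----------\n" remains part of the commit message.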

* [PATCH 4/5] Add a couple more test cases to the suite.
  2007-03-14 20:12 git-mailinfo fixes/features v3 Don Zickus
                   ` (2 preceding siblings ...)
  2007-03-14 20:12 ` [PATCH 3/5] restrict the patch filtering Don Zickus
@ 2007-03-14 20:12 ` Don Zickus
  2007-03-14 20:12 ` [PATCH 5/5] fix a utf8 issue in t5100/patch005 Don Zickus
  4 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-14 20:12 UTC (permalink / raw)
  To: git; +Cc: Don Zickus

They handle cases where there is no attached patch.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 t/t5100-mailinfo.sh |    2 +-
 t/t5100/info0007    |    5 +++++
 t/t5100/info0008    |    5 +++++
 t/t5100/msg0007     |    2 ++
 t/t5100/msg0008     |    4 ++++
 t/t5100/sample.mbox |   18 ++++++++++++++++++
 6 files changed, 35 insertions(+), 1 deletions(-)
 create mode 100644 t/t5100/info0007
 create mode 100644 t/t5100/info0008
 create mode 100644 t/t5100/msg0007
 create mode 100644 t/t5100/msg0008
 create mode 100644 t/t5100/patch0007
 create mode 100644 t/t5100/patch0008

diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index 4d2b781..ca96918 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -11,7 +11,7 @@ test_expect_success 'split sample box' \
 	'git-mailsplit -o. ../t5100/sample.mbox >last &&
 	last=`cat last` &&
 	echo total is $last &&
-	test `cat last` = 6'
+	test `cat last` = 8'
 
 for mail in `echo 00*`
 do
diff --git a/t/t5100/info0007 b/t/t5100/info0007
new file mode 100644
index 0000000..49bb0fe
--- /dev/null
+++ b/t/t5100/info0007
@@ -0,0 +1,5 @@
+Author: A U Thor
+Email: a.u.thor@example.com
+Subject: another patch
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+
diff --git a/t/t5100/info0008 b/t/t5100/info0008
new file mode 100644
index 0000000..e8a2951
--- /dev/null
+++ b/t/t5100/info0008
@@ -0,0 +1,5 @@
+Author: Junio C Hamano
+Email: junio@kernel.org
+Subject: another patch
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+
diff --git a/t/t5100/msg0007 b/t/t5100/msg0007
new file mode 100644
index 0000000..71b23c0
--- /dev/null
+++ b/t/t5100/msg0007
@@ -0,0 +1,2 @@
+Here is an empty patch from A U Thor.
+
diff --git a/t/t5100/msg0008 b/t/t5100/msg0008
new file mode 100644
index 0000000..a80ecb9
--- /dev/null
+++ b/t/t5100/msg0008
@@ -0,0 +1,4 @@
+>Here is an empty patch from A U Thor.
+
+Hey you forgot the patch!
+
diff --git a/t/t5100/patch0007 b/t/t5100/patch0007
new file mode 100644
index 0000000..e69de29
diff --git a/t/t5100/patch0008 b/t/t5100/patch0008
new file mode 100644
index 0000000..e69de29
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index 86bfc27..b80c981 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -386,3 +386,21 @@ index 9123cdc..918dcf8 100644
 -- 
 1.4.0.g6f2b
 
+From nobody Mon Sep 17 00:00:00 2001
+From: A U Thor <a.u.thor@example.com>
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+Subject: [PATCH] another patch
+
+Here is an empty patch from A U Thor.
+
+From nobody Mon Sep 17 00:00:00 2001
+From: Junio C Hamano <junio@kernel.org>
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+Subject: re: [PATCH] another patch
+
+From: A U Thor <a.u.thor@example.com>
+Subject: [PATCH] another patch
+>Here is an empty patch from A U Thor.
+
+Hey you forgot the patch!
+
-- 
1.5.0.2.211.g2ca9-dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread
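
The expected output in info0008 follows from the rule in handle_info():
in-body headers are used only when a patch was actually written, otherwise
the mail's own headers win.  A minimal sketch of that choice for a single
field is shown below; pick_header is a hypothetical helper name, not a
function from the patch.

```c
#include <stddef.h>

/*
 * Pick the value to report for one header field.
 *   primary:     value from the mail's own header block (may be NULL)
 *   inbody:      value found at the top of the body (may be NULL)
 *   patch_lines: number of patch lines written; 0 means no patch found
 * Returns NULL when neither source applies, in which case the caller
 * would skip the field.
 */
static const char *pick_header(const char *primary, const char *inbody,
			       int patch_lines)
{
	if (patch_lines && inbody)
		return inbody;
	return primary;
}
```

For the second new test mail there is no patch, so the "From:" line quoted
in the body is ignored and the author is taken from the outer header,
which is why info0008 names Junio C Hamano.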

* [PATCH 5/5] fix a utf8 issue in t5100/patch005
  2007-03-14 20:12 git-mailinfo fixes/features v3 Don Zickus
                   ` (3 preceding siblings ...)
  2007-03-14 20:12 ` [PATCH 4/5] Add a couple more test cases to the suite Don Zickus
@ 2007-03-14 20:12 ` Don Zickus
  4 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-14 20:12 UTC (permalink / raw)
  To: git; +Cc: Don Zickus

This issue popped up when testing my changes.  I believe the patch is the
intended output that git-mailinfo should provide.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 t/t5100/patch0005 |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/t/t5100/patch0005 b/t/t5100/patch0005
index 7d24b24..e7d6f66 100644
--- a/t/t5100/patch0005
+++ b/t/t5100/patch0005
@@ -61,7 +61,7 @@ diff --git a/git-cvsimport-script b/git-cvsimport-script
  		push(@old,$fn);
 
 -- 
-David Kågedal
+David KÃ¥gedal
 -
 To unsubscribe from this list: send the line "unsubscribe git" in
 the body of a message to majordomo@vger.kernel.org
-- 
1.5.0.2.211.g2ca9-dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 0/5] git-mailinfo fixes/features
@ 2007-03-12 19:52 Don Zickus
  2007-03-12 19:52 ` [PATCH 1/5] builtin-mailinfo.c infrastrcture changes Don Zickus
  0 siblings, 1 reply; 12+ messages in thread
From: Don Zickus @ 2007-03-12 19:52 UTC (permalink / raw)
  To: git

I am trying to get my own custom git-am to parse non-patches from my Inbox
better.  Using git-mailinfo had a lot of limitations.  I rewrote and
restructured builtin-mailinfo.c to handle what I want to do better.  

In addition to a lot of fixes, I am looking to add a few small backwards
compatible features.  The following patches accomplish that.

This is an update to my previous set of patches.  These new fixes deal with
some of the issues Junio and Linus brought up.

Any feedback would be great.

Cheers,
Don

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/5] builtin-mailinfo.c infrastrcture changes
  2007-03-12 19:52 [PATCH 0/5] git-mailinfo fixes/features Don Zickus
@ 2007-03-12 19:52 ` Don Zickus
  0 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-12 19:52 UTC (permalink / raw)
  To: git; +Cc: Don Zickus

I am working on a project that required parsing through regular mboxes that
didn't necessarily have patches embedded in them.  I started by creating my
own modified copy of git-am and working from there.  Very quickly, I noticed
git-mailinfo wasn't able to handle a big chunk of my email.

After hacking up numerous solutions and running into more limitations, I
decided it was just easier to rewrite a big chunk of it.  The following
patch has a bunch of fixes and features that I needed in order to do
what I wanted.

Note: I didn't follow any of the email RFCs, but I don't think any of the
changes I made required much knowledge of them (besides the boundary stuff).

List of major changes/fixes:
- fix for not being able to create empty patch files
- empty patch files no longer make git-mailinfo fail; that failure now
comes from inside git-am
- multipart boundaries are now handled
- only output inbody headers if a patch exists; otherwise assume those
headers are part of the reply and output the original headers instead
- decode and filter base64 patches correctly
- various other accidental fixes

I believe I didn't break any existing functionality or compatibility (other
than what I describe above, which is really only the empty patch file).

I tested this through various mailing list archives and everything seemed to
parse correctly (a couple thousand emails).

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 builtin-mailinfo.c |  520 +++++++++++++++++++++++++++------------------------
 git-am.sh          |    4 +
 git-applymbox.sh   |    4 +
 git-quiltimport.sh |    4 +
 4 files changed, 287 insertions(+), 245 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 766a37e..dacdf77 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -11,19 +11,22 @@ static FILE *cmitmsg, *patchfile, *fin, *fout;
 static int keep_subject;
 static const char *metainfo_charset;
 static char line[1000];
-static char date[1000];
 static char name[1000];
 static char email[1000];
-static char subject[1000];
 
 static enum  {
 	TE_DONTCARE, TE_QP, TE_BASE64,
 } transfer_encoding;
-static char charset[256];
+static enum  {
+	TYPE_TEXT, TYPE_OTHER,
+} message_type;
 
-static char multipart_boundary[1000];
-static int multipart_boundary_len;
+static char charset[256];
 static int patch_lines;
+static char **p_hdr_data, **s_hdr_data;
+
+#define MAX_HDR_PARSED 10
+#define MAX_BOUNDARIES 5
 
 static char *sanity_check(char *name, char *email)
 {
@@ -137,15 +140,13 @@ static int handle_from(char *in_line)
 	return 1;
 }
 
-static int handle_date(char *line)
+static int handle_header(char *line, char *data, int ofs)
 {
-	strcpy(date, line);
-	return 0;
-}
+	if (!line || !data)
+		return 1;
+
+	strcpy(data, line+ofs);
 
-static int handle_subject(char *line)
-{
-	strcpy(subject, line);
 	return 0;
 }
 
@@ -177,17 +178,35 @@ static int slurp_attr(const char *line, const char *name, char *attr)
 	return 1;
 }
 
-static int handle_subcontent_type(char *line)
+struct content_type {
+	char *boundary;
+	int boundary_len;
+};
+
+static struct content_type content[MAX_BOUNDARIES];
+
+static struct content_type *content_top = content;
+
+static int handle_content_type(char *line)
 {
-	/* We do not want to mess with boundary.  Note that we do not
-	 * handle nested multipart.
+	char boundary[256];
+
+	/* the only time this return less than zero is when 
+	   /line/ does not contain "text/"
 	 */
-	if (strcasestr(line, "boundary=")) {
-		fprintf(stderr, "Not handling nested multipart message.\n");
-		exit(1);
+	if (strcasestr(line, "text/") == NULL)
+		 message_type = TYPE_OTHER;
+	if (slurp_attr(line, "boundary=", boundary + 2)) {
+		memcpy(boundary, "--", 2);
+		if (content_top++ >= &content[MAX_BOUNDARIES]) {
+			fprintf(stderr, "Too many boundaries to handle\n");
+			exit(1);
+		}
+		content_top->boundary_len = strlen(boundary);
+		content_top->boundary = xmalloc(content_top->boundary_len+1);
+		strcpy(content_top->boundary, boundary);
 	}
-	slurp_attr(line, "charset=", charset);
-	if (*charset) {
+	if (slurp_attr(line, "charset=", charset)) {
 		int i, c;
 		for (i = 0; (c = charset[i]) != 0; i++)
 			charset[i] = tolower(c);
@@ -195,17 +214,6 @@ static int handle_subcontent_type(char *line)
 	return 0;
 }
 
-static int handle_content_type(char *line)
-{
-	*multipart_boundary = 0;
-	if (slurp_attr(line, "boundary=", multipart_boundary + 2)) {
-		memcpy(multipart_boundary, "--", 2);
-		multipart_boundary_len = strlen(multipart_boundary);
-	}
-	slurp_attr(line, "charset=", charset);
-	return 0;
-}
-
 static int handle_content_transfer_encoding(char *line)
 {
 	if (strcasestr(line, "base64"))
@@ -219,7 +227,7 @@ static int handle_content_transfer_encoding(char *line)
 
 static int is_multipart_boundary(const char *line)
 {
-	return (!memcmp(line, multipart_boundary, multipart_boundary_len));
+	return (!memcmp(line, content_top->boundary, content_top->boundary_len));
 }
 
 static int eatspace(char *line)
@@ -230,62 +238,6 @@ static int eatspace(char *line)
 	return len;
 }
 
-#define SEEN_FROM 01
-#define SEEN_DATE 02
-#define SEEN_SUBJECT 04
-#define SEEN_BOGUS_UNIX_FROM 010
-#define SEEN_PREFIX  020
-
-/* First lines of body can have From:, Date:, and Subject: or empty */
-static void handle_inbody_header(int *seen, char *line)
-{
-	if (*seen & SEEN_PREFIX)
-		return;
-	if (isspace(*line)) {
-		char *cp;
-		for (cp = line + 1; *cp; cp++) {
-			if (!isspace(*cp))
-				break;
-		}
-		if (!*cp)
-			return;
-	}
-	if (!memcmp(">From", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_BOGUS_UNIX_FROM)) {
-			*seen |= SEEN_BOGUS_UNIX_FROM;
-			return;
-		}
-	}
-	if (!memcmp("From:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_FROM) && handle_from(line+6)) {
-			*seen |= SEEN_FROM;
-			return;
-		}
-	}
-	if (!memcmp("Date:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_DATE)) {
-			handle_date(line+6);
-			*seen |= SEEN_DATE;
-			return;
-		}
-	}
-	if (!memcmp("Subject:", line, 8) && isspace(line[8])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line+9);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	*seen |= SEEN_PREFIX;
-}
-
 static char *cleanup_subject(char *subject)
 {
 	if (keep_subject)
@@ -341,57 +293,62 @@ static void cleanup_space(char *buf)
 }
 
 static void decode_header(char *it);
-typedef int (*header_fn_t)(char *);
-struct header_def {
-	const char *name;
-	header_fn_t func;
-	int namelen;
+static char *header[MAX_HDR_PARSED] = {
+	"From","Subject","Date",
 };
 
-static void check_header(char *line, struct header_def *header)
+static int check_header(char *line, char **hdr_data)
 {
 	int i;
 
-	if (header[0].namelen <= 0) {
-		for (i = 0; header[i].name; i++)
-			header[i].namelen = strlen(header[i].name);
-	}
-	for (i = 0; header[i].name; i++) {
-		int len = header[i].namelen;
-		if (!strncasecmp(line, header[i].name, len) &&
+	/* search for the interesting parts */
+	for (i = 0; header[i]; i++) {
+		int len = strlen(header[i]);
+		if (!hdr_data[i] &&
+		    !strncasecmp(line, header[i], len) &&
 		    line[len] == ':' && isspace(line[len + 1])) {
 			/* Unwrap inline B and Q encoding, and optionally
 			 * normalize the meta information to utf8.
 			 */
 			decode_header(line + len + 2);
-			header[i].func(line + len + 2);
-			break;
+			hdr_data[i] = xmalloc(1000 * sizeof(char));
+			if (! handle_header(line, hdr_data[i], len + 2)) {
+				return 1;
+			}
 		}
 	}
-}
 
-static void check_subheader_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "Content-Type", handle_subcontent_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
-}
-static void check_header_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "From", handle_from },
-		{ "Date", handle_date },
-		{ "Subject", handle_subject },
-		{ "Content-Type", handle_content_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
+	/* Content stuff */
+	if (!strncasecmp(line, "Content-Type", 12) &&
+		line[12] == ':' && isspace(line[12 + 1])) {
+		decode_header(line + 12 + 2);
+		if (! handle_content_type(line)) {
+			return 1;
+		}
+	}
+	if (!strncasecmp(line, "Content-Transfer-Encoding", 25) &&
+		line[25] == ':' && isspace(line[25 + 1])) {
+		decode_header(line + 25 + 2);
+		if (! handle_content_transfer_encoding(line)) {
+			return 1;
+		}
+	}
+
+	/* for inbody stuff */
+	if (!memcmp(">From", line, 5) && isspace(line[5]))
+		return 1;
+	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
+		for (i=0; header[i]; i++) {
+			if (!memcmp("Subject: ", header[i], 9)) {
+				if (! handle_header(line, hdr_data[i], 0)) {
+					return 1;
+				}
+			}
+		}
+	}
+
+	/* no match */
+	return 0;
 }
 
 static int is_rfc2822_header(char *line)
@@ -647,147 +604,222 @@ static void decode_transfer_encoding(char *line)
 	}
 }
 
-static void handle_info(void)
+static int handle_filter(char *line);
+
+static int find_boundary(void)
 {
-	char *sub;
+	while(fgets(line, sizeof(line), fin) != NULL) {
+		if (is_multipart_boundary(line))
+			return 1;
+	}
+	return 0;
+}
+
+static int handle_boundary(void)
+{
+again:
+	if (!memcmp(line+content_top->boundary_len, "--", 2)) {
+		/* we hit an end boundary */
+		/* pop the current boundary off the stack */
+		free(content_top->boundary);
+		
+		/* technically won't happen as is_multipart_boundary()
+		   will fail first.  But just in case..
+		 */
+		if (content_top-- < content) {
+			fprintf(stderr, "Detected mismatched boundaries, "
+					"can't recover\n");
+			exit(1);
+		}
+		handle_filter("\n");
+
+		/* skip to the next boundary */
+		if (!find_boundary())
+			return 0;
+		goto again;
+	}
 
-	sub = cleanup_subject(subject);
-	cleanup_space(name);
-	cleanup_space(date);
-	cleanup_space(email);
-	cleanup_space(sub);
+	/* set some defaults */
+	transfer_encoding = TE_DONTCARE;
+	charset[0] = 0;
+	message_type = TYPE_TEXT;
 
-	fprintf(fout, "Author: %s\nEmail: %s\nSubject: %s\nDate: %s\n\n",
-	       name, email, sub, date);
+	/* slurp in this section's info */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	/* eat the blank line after section info */
+	return (fgets(line, sizeof(line), fin) != NULL);
 }
 
-/* We are inside message body and have read line[] already.
- * Spit out the commit log.
- */
-static int handle_commit_msg(int *seen)
+static int handle_commit_msg(char *line)
 {
+	static int still_looking=1;
+
 	if (!cmitmsg)
 		return 0;
-	do {
-		if (!memcmp("diff -", line, 6) ||
-		    !memcmp("---", line, 3) ||
-		    !memcmp("Index: ", line, 7))
-			break;
-		if ((multipart_boundary[0] && is_multipart_boundary(line))) {
-			/* We come here when the first part had only
-			 * the commit message without any patch.  We
-			 * pretend we have not seen this line yet, and
-			 * go back to the loop.
-			 */
-			return 1;
-		}
 
-		/* Unwrap transfer encoding and optionally
-		 * normalize the log message to UTF-8.
-		 */
-		decode_transfer_encoding(line);
-		if (metainfo_charset)
-			convert_to_utf8(line, charset);
+	if (still_looking) {
+		char *cp=line;
+		if (isspace(*line)) {
+			for (cp = line + 1; *cp; cp++) {
+				if (!isspace(*cp))
+					break;
+			}
+			if (!*cp)
+				return 0;
+		}
+		if ((still_looking = check_header(cp, s_hdr_data)) != 0)
+			return 0;
+	}
 
-		handle_inbody_header(seen, line);
-		if (!(*seen & SEEN_PREFIX))
-			continue;
+	if (!memcmp("diff -", line, 6) ||
+	    !memcmp("---", line, 3) ||
+	    !memcmp("Index: ", line, 7)) {
+		fclose(cmitmsg);
+		cmitmsg = NULL;
+		return 1;
+	}
 
-		fputs(line, cmitmsg);
-	} while (fgets(line, sizeof(line), fin) != NULL);
-	fclose(cmitmsg);
-	cmitmsg = NULL;
+	fputs(line, cmitmsg);
 	return 0;
 }
 
-/* We have done the commit message and have the first
- * line of the patch in line[].
- */
-static void handle_patch(void)
+static int handle_patch(char *line)
 {
-	do {
-		if (multipart_boundary[0] && is_multipart_boundary(line))
-			break;
-		/* Only unwrap transfer encoding but otherwise do not
-		 * do anything.  We do *NOT* want UTF-8 conversion
-		 * here; we are dealing with the user payload.
-		 */
-		decode_transfer_encoding(line);
-		fputs(line, patchfile);
-		patch_lines++;
-	} while (fgets(line, sizeof(line), fin) != NULL);
+	fputs(line, patchfile);
+	patch_lines++;
+	return 0;
 }
 
-/* multipart boundary and transfer encoding are set up for us, and we
- * are at the end of the sub header.  do equivalent of handle_body up
- * to the next boundary without closing patchfile --- we will expect
- * that the first part to contain commit message and a patch, and
- * handle other parts as pure patches.
- */
-static int handle_multipart_one_part(int *seen)
+static int handle_filter(char *line)
 {
-	int n = 0;
+	static int filter=0;
 
-	while (fgets(line, sizeof(line), fin) != NULL) {
-	again:
-		n++;
-		if (is_multipart_boundary(line))
+	/* filter tells us which part we left off on
+	 * a non-zero return indicates we hit a filter point
+	 */
+	switch (filter) {
+	case 0:
+		if (!handle_commit_msg(line))
 			break;
-		if (handle_commit_msg(seen))
-			goto again;
-		handle_patch();
-		break;
+		filter++;
+	case 1:
+		if (!handle_patch(line))
+			break;
+		filter++;
+	default:
+		return 1;
 	}
-	if (n == 0)
-		return -1;
+
 	return 0;
 }
 
-static void handle_multipart_body(void)
+static void handle_body(void)
 {
-	int seen = 0;
-	int part_num = 0;
+	int rc=0;
+	static char newline[2000];
+	static char *np=newline;
 
 	/* Skip up to the first boundary */
-	while (fgets(line, sizeof(line), fin) != NULL)
-		if (is_multipart_boundary(line)) {
-			part_num = 1;
+	if (content_top->boundary) {
+		if (!find_boundary())
+			return;
+	}
+
+	do {
+		/* process any boundary lines */
+		if (content_top->boundary && is_multipart_boundary(line)) {
+			/* flush any leftover */
+			if ((transfer_encoding == TE_BASE64)  &&
+			    (np != newline)) {
+				handle_filter(newline);
+			}
+			if (!handle_boundary())
+				return;
+		}
+
+		/* Unwrap transfer encoding and optionally
+		 * normalize the log message to UTF-8.
+		 */
+		decode_transfer_encoding(line);
+		if (metainfo_charset)
+			convert_to_utf8(line, charset);
+
+		switch (transfer_encoding) {
+		case TE_BASE64:
+		{
+			char *op=line;
+
+			/* binary data most likely doesn't have newlines */
+			if (message_type != TYPE_TEXT) {
+				rc=handle_filter(line);
+				break;
+			}
+
+			/* this is a decoded line that may contain
+			 * multiple new lines.  Pass only one chunk
+			 * at a time to handle_filter()
+			 */
+
+			do {
+				while (*op != '\n' && *op != 0)
+					*np++ = *op++;
+				*np = *op;
+				if (*np != 0) {
+					/* should be sitting on a new line */
+					*(++np) = 0;
+					op++;
+					rc=handle_filter(newline);
+					np=newline;
+				}
+			} while (*op != 0);
+			/* the partial chunk is saved in newline and
+			 * will be appended by the next iteration of fgets
+			 */
 			break;
 		}
-	if (!part_num)
-		return;
-	/* We are on boundary line.  Start slurping the subhead. */
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (handle_multipart_one_part(&seen) < 0)
-				return;
-			/* Reset per part headers */
-			transfer_encoding = TE_DONTCARE;
-			charset[0] = 0;
+		default:
+			rc=handle_filter(line);
 		}
-		else
-			check_subheader_line(line);
-	}
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
-	}
+		if (rc)
+			/* nothing left to filter */
+			break;
+	} while (fgets(line, sizeof(line), fin));
+
+	return;
 }
 
-/* Non multipart message */
-static void handle_body(void)
+static void handle_info(void)
 {
-	int seen = 0;
-
-	handle_commit_msg(&seen);
-	handle_patch();
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
+	char *sub;
+	char *hdr;
+	int i;
+
+	for (i=0; header[i]; i++) {
+
+		/* only print inbody headers if we output a patch file */
+		if (patch_lines && s_hdr_data[i])
+			hdr=s_hdr_data[i];
+		else if (p_hdr_data[i])
+			hdr=p_hdr_data[i];
+		else
+			continue;
+
+		if (!memcmp(header[i], "Subject", 7)) {
+			sub = cleanup_subject(hdr);
+			cleanup_space(sub);
+			fprintf(fout, "Subject: %s\n", sub);
+		} else if (!memcmp(header[i], "From", 4)) {
+			handle_from(hdr);
+			fprintf(fout, "Author: %s\n", name);
+			fprintf(fout, "Email: %s\n", email);
+		} else {
+			cleanup_space(hdr);
+			fprintf(fout, "%s: %s\n", header[i], hdr);
+		}
 	}
+	fprintf(fout, "\n");
 }
 
 int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
@@ -809,18 +841,16 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
 		fclose(cmitmsg);
 		return -1;
 	}
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (multipart_boundary[0])
-				handle_multipart_body();
-			else
-				handle_body();
-			handle_info();
-			break;
-		}
-		check_header_line(line);
-	}
+
+	p_hdr_data = xcalloc(MAX_HDR_PARSED, sizeof(char *));
+	s_hdr_data = xcalloc(MAX_HDR_PARSED, sizeof(char *));
+
+	/* process the email header */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	handle_body();
+	handle_info();
 
 	return 0;
 }
diff --git a/git-am.sh b/git-am.sh
index 2c73d11..847a44f 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -290,6 +290,10 @@ do
 		git-mailinfo $keep $utf8 "$dotest/msg" "$dotest/patch" \
 			<"$dotest/$msgnum" >"$dotest/info" ||
 			stop_here $this
+		test -s $dotest/patch || { 
+			echo "Patch is empty.  Was it split wrong?"
+			stop_here $this
+		}
 		git-stripspace < "$dotest/msg" > "$dotest/msg-clean"
 		;;
 	esac
diff --git a/git-applymbox.sh b/git-applymbox.sh
index 1f68599..2cbdc7e 100755
--- a/git-applymbox.sh
+++ b/git-applymbox.sh
@@ -77,6 +77,10 @@ do
     *)
 	    git-mailinfo $keep_subject $utf8 \
 		.dotest/msg .dotest/patch <$i >.dotest/info || exit 1
+	    test -s $dotest/patch || {
+		echo "Patch is empty.  Was it split wrong?"
+		stop_here $this
+	    }
 	    git-stripspace < .dotest/msg > .dotest/msg-clean
 	    ;;
     esac
diff --git a/git-quiltimport.sh b/git-quiltimport.sh
index 671a5ff..08ac9bb 100755
--- a/git-quiltimport.sh
+++ b/git-quiltimport.sh
@@ -73,6 +73,10 @@ mkdir $tmp_dir || exit 2
 for patch_name in $(cat "$QUILT_PATCHES/series" | grep -v '^#'); do
 	echo $patch_name
 	(cat $QUILT_PATCHES/$patch_name | git-mailinfo "$tmp_msg" "$tmp_patch" > "$tmp_info") || exit 3
+	test -s $dotest/patch || {
+		echo "Patch is empty.  Was it split wrong?"
+		stop_here $this
+	}
 
 	# Parse the author information
 	export GIT_AUTHOR_NAME=$(sed -ne 's/Author: //p' "$tmp_info")
-- 
1.5.0.2.211.g2ca9-dirty


* [PATCH 1/5] builtin-mailinfo.c infrastrcture changes
@ 2007-03-06 21:57 Don Zickus
  2007-03-07  5:34 ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Don Zickus @ 2007-03-06 21:57 UTC (permalink / raw)
  To: git; +Cc: Don Zickus

I am working on a project that required parsing through regular mboxes that
didn't necessarily have patches embedded in them.  I started by creating my
own modified copy of git-am and working from there.  Very quickly, I noticed
git-mailinfo wasn't able to handle a big chunk of my email.

After hacking up numerous solutions and running into more limitations, I
decided it was just easier to rewrite a big chunk of it.  The following
patch has a bunch of fixes and features that I needed in order to do
what I wanted.

Note: I didn't follow any of the email RFCs, but I don't think any of the
changes I made required much knowledge of them (besides the boundary stuff).

Sorry for the large patchset.  It was too hard to break it up without
creating a bunch of intermediate throwaway changes.  If it is too large, I
can try to break it down some.

List of major changes/fixes:
- fix for not being able to create empty patch files
- empty patch files no longer fail here; the failure now comes from git-am
- multipart boundaries are now handled
- only output inbody headers if a patch exists otherwise assume those
headers are part of the reply and instead output the original headers
- decode and filter base64 patches correctly
- various other accidental fixes
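
The base64 item above comes down to re-splitting decoded data into lines,
since one decoded chunk may hold several newlines or end in the middle of a
line.  A minimal sketch of that idea follows.  It is an illustration only,
not the code in this patch; feed_chunk(), emit_line() and CARRY_MAX are
made-up names, and emit_line() merely records what it was given.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CARRY_MAX 2000	/* assumed buffer size, for illustration only */

static char carry[CARRY_MAX];	/* partial line kept between chunks */
static size_t carry_len;
static int lines_seen;
static char last_line[CARRY_MAX];

/* Hypothetical consumer: just records the line it was handed. */
static void emit_line(const char *line)
{
	lines_seen++;
	strcpy(last_line, line);
}

/* Feed one decoded chunk; complete lines are emitted one at a time,
 * a trailing partial line stays in carry[] for the next call. */
static void feed_chunk(const char *chunk)
{
	for (; *chunk; chunk++) {
		if (carry_len == CARRY_MAX - 1) {
			/* over-long line: flush what we have so far */
			carry[carry_len] = '\0';
			emit_line(carry);
			carry_len = 0;
		}
		carry[carry_len++] = *chunk;
		if (*chunk == '\n') {
			assert(carry_len < CARRY_MAX);
			carry[carry_len] = '\0';
			emit_line(carry);
			carry_len = 0;
		}
	}
}
```

Whatever is left in carry[] when a boundary is reached still has to be
flushed by the caller, which is the "flush any leftover" step in the patch.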

I believe I didn't break any existing functionality or compatibility (other
than what I describe above, which is really only the empty patch file).

I tested this through various mailing list archives and everything seemed to
parse correctly (a couple thousand emails).

Any comments or feedback would be great.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 builtin-mailinfo.c |  514 +++++++++++++++++++++++++++-------------------------
 git-am.sh          |    1 +
 2 files changed, 269 insertions(+), 246 deletions(-)

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 766a37e..7456d8a 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -11,19 +11,19 @@ static FILE *cmitmsg, *patchfile, *fin, *fout;
 static int keep_subject;
 static const char *metainfo_charset;
 static char line[1000];
-static char date[1000];
 static char name[1000];
 static char email[1000];
-static char subject[1000];
 
 static enum  {
 	TE_DONTCARE, TE_QP, TE_BASE64,
 } transfer_encoding;
-static char charset[256];
+static enum  {
+	TYPE_TEXT, TYPE_OTHER,
+} message_type;
 
-static char multipart_boundary[1000];
-static int multipart_boundary_len;
+static char charset[256];
 static int patch_lines;
+static char **p_hdr_data, **s_hdr_data;
 
 static char *sanity_check(char *name, char *email)
 {
@@ -137,15 +137,13 @@ static int handle_from(char *in_line)
 	return 1;
 }
 
-static int handle_date(char *line)
+static int handle_header(char *line, char *data, int ofs)
 {
-	strcpy(date, line);
-	return 0;
-}
+	if (!line || !data)
+		return 1;
+
+	strcpy(data, line+ofs);
 
-static int handle_subject(char *line)
-{
-	strcpy(subject, line);
 	return 0;
 }
 
@@ -177,17 +175,35 @@ static int slurp_attr(const char *line, const char *name, char *attr)
 	return 1;
 }
 
-static int handle_subcontent_type(char *line)
+struct content_type {
+	char *boundary;
+	int boundary_len;
+};
+
+static struct content_type content[]={
+	{ NULL },
+	{ NULL },
+	{ NULL },
+	{ NULL },
+	{ NULL },
+};
+
+static struct content_type *content_top = content;
+
+static int handle_content_type(char *line)
 {
-	/* We do not want to mess with boundary.  Note that we do not
-	 * handle nested multipart.
-	 */
-	if (strcasestr(line, "boundary=")) {
-		fprintf(stderr, "Not handling nested multipart message.\n");
-		exit(1);
+	char boundary[256];
+
+	if (strcasestr(line, "text/") < 0)
+		 message_type = TYPE_OTHER;
+	if (slurp_attr(line, "boundary=", boundary + 2)) {
+		memcpy(boundary, "--", 2);
+		content_top++;
+		content_top->boundary_len = strlen(boundary);
+		content_top->boundary = xmalloc(content_top->boundary_len+1);
+		strcpy(content_top->boundary, boundary);
 	}
-	slurp_attr(line, "charset=", charset);
-	if (*charset) {
+	if (slurp_attr(line, "charset=", charset)) {
 		int i, c;
 		for (i = 0; (c = charset[i]) != 0; i++)
 			charset[i] = tolower(c);
@@ -195,17 +211,6 @@ static int handle_subcontent_type(char *line)
 	return 0;
 }
 
-static int handle_content_type(char *line)
-{
-	*multipart_boundary = 0;
-	if (slurp_attr(line, "boundary=", multipart_boundary + 2)) {
-		memcpy(multipart_boundary, "--", 2);
-		multipart_boundary_len = strlen(multipart_boundary);
-	}
-	slurp_attr(line, "charset=", charset);
-	return 0;
-}
-
 static int handle_content_transfer_encoding(char *line)
 {
 	if (strcasestr(line, "base64"))
@@ -219,7 +224,7 @@ static int handle_content_transfer_encoding(char *line)
 
 static int is_multipart_boundary(const char *line)
 {
-	return (!memcmp(line, multipart_boundary, multipart_boundary_len));
+	return (!memcmp(line, content_top->boundary, content_top->boundary_len));
 }
 
 static int eatspace(char *line)
@@ -230,62 +235,6 @@ static int eatspace(char *line)
 	return len;
 }
 
-#define SEEN_FROM 01
-#define SEEN_DATE 02
-#define SEEN_SUBJECT 04
-#define SEEN_BOGUS_UNIX_FROM 010
-#define SEEN_PREFIX  020
-
-/* First lines of body can have From:, Date:, and Subject: or empty */
-static void handle_inbody_header(int *seen, char *line)
-{
-	if (*seen & SEEN_PREFIX)
-		return;
-	if (isspace(*line)) {
-		char *cp;
-		for (cp = line + 1; *cp; cp++) {
-			if (!isspace(*cp))
-				break;
-		}
-		if (!*cp)
-			return;
-	}
-	if (!memcmp(">From", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_BOGUS_UNIX_FROM)) {
-			*seen |= SEEN_BOGUS_UNIX_FROM;
-			return;
-		}
-	}
-	if (!memcmp("From:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_FROM) && handle_from(line+6)) {
-			*seen |= SEEN_FROM;
-			return;
-		}
-	}
-	if (!memcmp("Date:", line, 5) && isspace(line[5])) {
-		if (!(*seen & SEEN_DATE)) {
-			handle_date(line+6);
-			*seen |= SEEN_DATE;
-			return;
-		}
-	}
-	if (!memcmp("Subject:", line, 8) && isspace(line[8])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line+9);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
-		if (!(*seen & SEEN_SUBJECT)) {
-			handle_subject(line);
-			*seen |= SEEN_SUBJECT;
-			return;
-		}
-	}
-	*seen |= SEEN_PREFIX;
-}
-
 static char *cleanup_subject(char *subject)
 {
 	if (keep_subject)
@@ -341,57 +290,65 @@ static void cleanup_space(char *buf)
 }
 
 static void decode_header(char *it);
-typedef int (*header_fn_t)(char *);
-struct header_def {
-	const char *name;
-	header_fn_t func;
-	int namelen;
+static char *header[10] = {
+	"From",
+	"Subject",
+	"Date",
+	NULL,
+	NULL,
+	NULL,
 };
 
-static void check_header(char *line, struct header_def *header)
+static int check_header(char *line, char **hdr_data)
 {
 	int i;
 
-	if (header[0].namelen <= 0) {
-		for (i = 0; header[i].name; i++)
-			header[i].namelen = strlen(header[i].name);
-	}
-	for (i = 0; header[i].name; i++) {
-		int len = header[i].namelen;
-		if (!strncasecmp(line, header[i].name, len) &&
+	/* search for the interesting parts */
+	for (i = 0; header[i]; i++) {
+		int len = strlen(header[i]);
+		if (!hdr_data[i] &&
+		    !strncasecmp(line, header[i], len) &&
 		    line[len] == ':' && isspace(line[len + 1])) {
 			/* Unwrap inline B and Q encoding, and optionally
 			 * normalize the meta information to utf8.
 			 */
 			decode_header(line + len + 2);
-			header[i].func(line + len + 2);
-			break;
+			hdr_data[i] = xmalloc(1000 * sizeof(char));
+			if (! handle_header(line, hdr_data[i], len + 2)) {
+				return 1;
+			}
 		}
 	}
-}
 
-static void check_subheader_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "Content-Type", handle_subcontent_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
-}
-static void check_header_line(char *line)
-{
-	static struct header_def header[] = {
-		{ "From", handle_from },
-		{ "Date", handle_date },
-		{ "Subject", handle_subject },
-		{ "Content-Type", handle_content_type },
-		{ "Content-Transfer-Encoding",
-		  handle_content_transfer_encoding },
-		{ NULL },
-	};
-	check_header(line, header);
+	/* Content stuff */
+	if (!strncasecmp(line, "Content-Type: ", 14)) {
+		decode_header(line + 14);
+		if (! handle_content_type(line)) {
+			return 1;
+		}
+	}
+	if (!strncasecmp(line, "Content-Transfer-Encoding: ", 27)) {
+		decode_header(line + 27);
+		if (! handle_content_transfer_encoding(line)) {
+			return 1;
+		}
+	}
+
+	/* for inbody stuff */
+	if (!memcmp(">From", line, 5) && isspace(line[5]))
+		return 1;
+	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
+		for (i=0; header[i]; i++) {
+			if (!memcmp("Subject: ", header[i], 9)) {
+				if (! handle_header(line, hdr_data[i], 0)) {
+					return 1;
+				}
+			}
+		}
+	}
+
+	/* no match */
+	return 0;
 }
 
 static int is_rfc2822_header(char *line)
@@ -647,147 +604,214 @@ static void decode_transfer_encoding(char *line)
 	}
 }
 
-static void handle_info(void)
+static int handle_filter(char *line);
+
+static int find_boundary(void)
 {
-	char *sub;
+	while(fgets(line, sizeof(line), fin) != NULL) {
+		if (is_multipart_boundary(line))
+			return 1;
+	}
+	return 0;
+}
+
+static int handle_boundary(void)
+{
+again:
+	if (!memcmp(line+content_top->boundary_len, "--", 2)) {
+		/* we hit an end boundary */
+		/* pop the current boundary off the stack */
+		free(content_top->boundary);
+		content_top--;
+		handle_filter("\n");
+
+		/* skip to the next boundary */
+		if (!find_boundary())
+			return 0;
+		goto again;
+	}
+
+	/* set some defaults */
+	transfer_encoding = TE_DONTCARE;
+	charset[0] = 0;
+	message_type = TYPE_TEXT;
 
-	sub = cleanup_subject(subject);
-	cleanup_space(name);
-	cleanup_space(date);
-	cleanup_space(email);
-	cleanup_space(sub);
+	/* slurp in this section's info */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
 
-	fprintf(fout, "Author: %s\nEmail: %s\nSubject: %s\nDate: %s\n\n",
-	       name, email, sub, date);
+	/* eat the blank line after section info */
+	return (fgets(line, sizeof(line), fin) != NULL);
 }
 
-/* We are inside message body and have read line[] already.
- * Spit out the commit log.
- */
-static int handle_commit_msg(int *seen)
+static int handle_commit_msg(char *line)
 {
+	static int still_looking=1;
+
 	if (!cmitmsg)
 		return 0;
-	do {
-		if (!memcmp("diff -", line, 6) ||
-		    !memcmp("---", line, 3) ||
-		    !memcmp("Index: ", line, 7))
-			break;
-		if ((multipart_boundary[0] && is_multipart_boundary(line))) {
-			/* We come here when the first part had only
-			 * the commit message without any patch.  We
-			 * pretend we have not seen this line yet, and
-			 * go back to the loop.
-			 */
-			return 1;
-		}
 
-		/* Unwrap transfer encoding and optionally
-		 * normalize the log message to UTF-8.
-		 */
-		decode_transfer_encoding(line);
-		if (metainfo_charset)
-			convert_to_utf8(line, charset);
+	if (still_looking) {
+		char *cp=line;
+		if (isspace(*line)) {
+			for (cp = line + 1; *cp; cp++) {
+				if (!isspace(*cp))
+					break;
+			}
+			if (!*cp)
+				return 0;
+		}
+		if ((still_looking = check_header(cp, s_hdr_data)) != 0)
+			return 0;
+	}
 
-		handle_inbody_header(seen, line);
-		if (!(*seen & SEEN_PREFIX))
-			continue;
+	if (!memcmp("diff -", line, 6) ||
+	    !memcmp("---", line, 3) ||
+	    !memcmp("Index: ", line, 7)) {
+		fclose(cmitmsg);
+		cmitmsg = NULL;
+		return 1;
+	}
 
-		fputs(line, cmitmsg);
-	} while (fgets(line, sizeof(line), fin) != NULL);
-	fclose(cmitmsg);
-	cmitmsg = NULL;
+	fputs(line, cmitmsg);
 	return 0;
 }
 
-/* We have done the commit message and have the first
- * line of the patch in line[].
- */
-static void handle_patch(void)
+static int handle_patch(char *line)
 {
-	do {
-		if (multipart_boundary[0] && is_multipart_boundary(line))
-			break;
-		/* Only unwrap transfer encoding but otherwise do not
-		 * do anything.  We do *NOT* want UTF-8 conversion
-		 * here; we are dealing with the user payload.
-		 */
-		decode_transfer_encoding(line);
-		fputs(line, patchfile);
-		patch_lines++;
-	} while (fgets(line, sizeof(line), fin) != NULL);
+	fputs(line, patchfile);
+	patch_lines++;
+	return 0;
 }
 
-/* multipart boundary and transfer encoding are set up for us, and we
- * are at the end of the sub header.  do equivalent of handle_body up
- * to the next boundary without closing patchfile --- we will expect
- * that the first part to contain commit message and a patch, and
- * handle other parts as pure patches.
- */
-static int handle_multipart_one_part(int *seen)
+static int handle_filter(char *line)
 {
-	int n = 0;
+	static int filter=0;
 
-	while (fgets(line, sizeof(line), fin) != NULL) {
-	again:
-		n++;
-		if (is_multipart_boundary(line))
+	/* filter tells us which part we left off on
+	 * a non-zero return indicates we hit a filter point
+	 */
+	switch (filter) {
+	case 0:
+		if (!handle_commit_msg(line))
 			break;
-		if (handle_commit_msg(seen))
-			goto again;
-		handle_patch();
-		break;
+		filter++;
+	case 1:
+		if (!handle_patch(line))
+			break;
+		filter++;
+	default:
+		return 1;
 	}
-	if (n == 0)
-		return -1;
+
 	return 0;
 }
 
-static void handle_multipart_body(void)
+static void handle_body(void)
 {
-	int seen = 0;
-	int part_num = 0;
+	int rc=0;
+	static char newline[2000];
+	static char *np=newline;
 
 	/* Skip up to the first boundary */
-	while (fgets(line, sizeof(line), fin) != NULL)
-		if (is_multipart_boundary(line)) {
-			part_num = 1;
+	if (content_top->boundary) {
+		if (!find_boundary())
+			return;
+	}
+
+	do {
+		/* process any boundary lines */
+		if (content_top->boundary && is_multipart_boundary(line)) {
+			/* flush any leftover */
+			if ((transfer_encoding == TE_BASE64)  &&
+			    (np != newline)) {
+				handle_filter(newline);
+			}
+			if (!handle_boundary())
+				return;
+		}
+
+		/* Unwrap transfer encoding and optionally
+		 * normalize the log message to UTF-8.
+		 */
+		decode_transfer_encoding(line);
+		if (metainfo_charset)
+			convert_to_utf8(line, charset);
+
+		switch (transfer_encoding) {
+		case TE_BASE64:
+		{
+			/* binary data most likely doesn't have newlines */
+			if (message_type != TYPE_TEXT) {
+				rc=handle_filter(line);
+				break;
+			}
+
+			/* this is a decoded line that may contain
+			 * multiple new lines.  Pass only one chunk
+			 * at a time to handle_filter()
+			 */
+
+			char *op=line;
+
+			do {
+				while (*op != '\n' && *op != 0)
+					*np++ = *op++;
+				*np = *op;
+				if (*np != 0) {
+					/* should be sitting on a new line */
+					*(++np) = 0;
+					op++;
+					rc=handle_filter(newline);
+					np=newline;
+				}
+			} while (*op != 0);
+			/* the partial chunk is saved in newline and
+			 * will be appended by the next iteration of fgets
+			 */
 			break;
 		}
-	if (!part_num)
-		return;
-	/* We are on boundary line.  Start slurping the subhead. */
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (handle_multipart_one_part(&seen) < 0)
-				return;
-			/* Reset per part headers */
-			transfer_encoding = TE_DONTCARE;
-			charset[0] = 0;
+		default:
+			rc=handle_filter(line);
 		}
-		else
-			check_subheader_line(line);
-	}
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
-	}
+		if (rc)
+			/* nothing left to filter */
+			break;
+	} while (fgets(line, sizeof(line), fin));
+
+	return;
 }
 
-/* Non multipart message */
-static void handle_body(void)
+static void handle_info(void)
 {
-	int seen = 0;
-
-	handle_commit_msg(&seen);
-	handle_patch();
-	fclose(patchfile);
-	if (!patch_lines) {
-		fprintf(stderr, "No patch found\n");
-		exit(1);
+	char *sub;
+	char *hdr;
+	int i;
+
+	for (i=0; header[i]; i++) {
+
+		/* only print inbody headers if we output a patch file */
+		if (patch_lines && s_hdr_data[i])
+			hdr=s_hdr_data[i];
+		else if (p_hdr_data[i])
+			hdr=p_hdr_data[i];
+		else
+			continue;
+
+		if (!memcmp(header[i], "Subject", 7)) {
+			sub = cleanup_subject(hdr);
+			cleanup_space(sub);
+			fprintf(fout, "Subject: %s\n", sub);
+		} else if (!memcmp(header[i], "From", 4)) {
+			handle_from(hdr);
+			fprintf(fout, "Author: %s\n", name);
+			fprintf(fout, "Email: %s\n", email);
+		} else {
+			cleanup_space(hdr);
+			fprintf(fout, "%s: %s\n", header[i], hdr);
+		}
 	}
+	fprintf(fout, "\n");
 }
 
 int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
@@ -809,18 +833,16 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
 		fclose(cmitmsg);
 		return -1;
 	}
-	while (1) {
-		int hdr = read_one_header_line(line, sizeof(line), fin);
-		if (!hdr) {
-			if (multipart_boundary[0])
-				handle_multipart_body();
-			else
-				handle_body();
-			handle_info();
-			break;
-		}
-		check_header_line(line);
-	}
+
+	p_hdr_data = xcalloc(10, sizeof(char *));
+	s_hdr_data = xcalloc(10, sizeof(char *));
+
+	/* process the email header */
+	while (read_one_header_line(line, sizeof(line), fin))
+		check_header(line, p_hdr_data);
+
+	handle_body();
+	handle_info();
 
 	return 0;
 }
diff --git a/git-am.sh b/git-am.sh
index 2c73d11..8fa466a 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -290,6 +290,7 @@ do
 		git-mailinfo $keep $utf8 "$dotest/msg" "$dotest/patch" \
 			<"$dotest/$msgnum" >"$dotest/info" ||
 			stop_here $this
+		test -s $dotest/patch || stop_here $this
 		git-stripspace < "$dotest/msg" > "$dotest/msg-clean"
 		;;
 	esac
-- 
1.5.0.2.212.gd52f-dirty


* Re: [PATCH 1/5] builtin-mailinfo.c infrastrcture changes
  2007-03-06 21:57 Don Zickus
@ 2007-03-07  5:34 ` Junio C Hamano
  2007-03-07 17:17   ` Don Zickus
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-03-07  5:34 UTC (permalink / raw)
  To: Don Zickus; +Cc: git

Don Zickus <dzickus@redhat.com> writes:

> List of major changes/fixes:
> - can't create empty patch files fix
> - empty patch files don't fail, this failure will come inside git-am
> - multipart boundaries are now handled
> - only output inbody headers if a patch exists otherwise assume those
> headers are part of the reply and instead output the original headers
> - decode and filter base64 patches correctly
> - various other accidental fixes
>
> I believe I didn't break any existing functionality or compatibility (other
> than what I describe above, which is really only the empty patch file).
>
> I tested this through various mailing list archives and everything seemed to
> parse correctly (a couple thousand emails).

Thanks.

> @@ -177,17 +175,35 @@ static int slurp_attr(const char *line, const char *name, char *attr)
>  	return 1;
>  }
>  
> -static int handle_subcontent_type(char *line)
> +struct content_type {
> +	char *boundary;
> +	int boundary_len;
> +};
> +
> +static struct content_type content[]={
> +	{ NULL },
> +	{ NULL },
> +	{ NULL },
> +	{ NULL },
> +	{ NULL },
> +};

The reason why this array has 5 elements (not 4, not 6) is...?

> +
> +static struct content_type *content_top = content;
> +
> +static int handle_content_type(char *line)
>  {
> +	char boundary[256];
> +
> +	if (strcasestr(line, "text/") < 0)
> +		 message_type = TYPE_OTHER;

Did you mean

	if (strncasecmp(line, "text/", 5))

It makes me wonder how a couple thousand mails (or for that
matter our own testsuite) could have passed the test with a
thing like this...
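
To spell out the problem with the quoted test: strcasestr() returns a
pointer to the match, or NULL when there is none, so comparing its result
with "< 0" can never report a miss.  A minimal sketch of a NULL-based check
follows.  find_nocase() and classify_content_type() are hypothetical helpers
written for this illustration (strcasestr() itself is a GNU/BSD extension),
and the sketch searches the whole line rather than only its start.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum message_type { TYPE_TEXT, TYPE_OTHER };

/* Hypothetical stand-in for strcasestr(): pointer to the first
 * case-insensitive match of needle in haystack, or NULL. */
static const char *find_nocase(const char *haystack, const char *needle)
{
	size_t n = strlen(needle);

	for (; *haystack; haystack++) {
		size_t i;
		for (i = 0; i < n; i++) {
			if (tolower((unsigned char)haystack[i]) !=
			    tolower((unsigned char)needle[i]))
				break;	/* also stops at haystack's NUL */
		}
		if (i == n)
			return haystack;
	}
	return n ? NULL : haystack;
}

/* A miss is reported as NULL, so test for that rather than "< 0". */
static enum message_type classify_content_type(const char *line)
{
	assert(line != NULL);
	if (!find_nocase(line, "text/"))
		return TYPE_OTHER;
	return TYPE_TEXT;
}
```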

> +	if (slurp_attr(line, "boundary=", boundary + 2)) {
> +		memcpy(boundary, "--", 2);
> +		content_top++;
> +		content_top->boundary_len = strlen(boundary);
> +		content_top->boundary = xmalloc(content_top->boundary_len+1);
> +		strcpy(content_top->boundary, boundary);
>  	}

Does anybody check the content[] stack overflow?
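
A minimal sketch of the kind of check being asked for, as an illustration
only and not code from the patch.  push_boundary() and MAX_BOUNDARIES are
made-up names, the limit of 5 is just the size used in the quoted array, and
content[0] is assumed to be the "no boundary" base slot.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BOUNDARIES 5	/* assumed nesting limit, for illustration */

struct content_type {
	char *boundary;
	int boundary_len;
};

static struct content_type content[MAX_BOUNDARIES];
static struct content_type *content_top = content;

/* Push a boundary string.  Returns 0 on success, -1 if the stack
 * is already full or memory cannot be allocated. */
static int push_boundary(const char *boundary)
{
	size_t len = strlen(boundary);
	char *copy;

	if (content_top >= &content[MAX_BOUNDARIES - 1])
		return -1;	/* one more push would overflow content[] */
	copy = malloc(len + 1);
	if (!copy)
		return -1;
	memcpy(copy, boundary, len + 1);
	content_top++;
	content_top->boundary = copy;
	content_top->boundary_len = (int)len;
	assert(content_top < content + MAX_BOUNDARIES);
	return 0;
}
```

The caller can then decide whether a too-deeply nested message is a fatal
error or something to skip.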

> @@ -341,57 +290,65 @@ static void cleanup_space(char *buf)
>  }
>  
>  static void decode_header(char *it);
> -typedef int (*header_fn_t)(char *);
> -struct header_def {
> -	const char *name;
> -	header_fn_t func;
> -	int namelen;
> +static char *header[10] = {
> +	"From",
> +	"Subject",
> +	"Date",
> +	NULL,
> +	NULL,
> +	NULL,
>  };

Why initialize only three with NULL when you have four more?
What are the 7 entries other than From/Subject/Date for?  Future
extension?  Wouldn't 

	static char *header[] = {
        	"From", "Subject", "Date", NULL,
	};

or even:

	static char *header[] = {
        	"From", "Subject", "Date",
	};

be easier to handle?  For the latter, you would need your loop to be:

	for (i = 0; i < ARRAY_SIZE(header); i++)
        	...
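
A minimal, self-contained sketch of the second variant, for illustration
only.  ARRAY_SIZE() is defined locally here and header_index() is a
hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() (POSIX) */

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* No NULL terminator needed: the compiler knows the element count. */
static const char *header[] = {
	"From", "Subject", "Date",
};

/* Return the index of the header that line starts with, or -1. */
static int header_index(const char *line)
{
	size_t i;

	assert(line != NULL);
	for (i = 0; i < ARRAY_SIZE(header); i++) {
		size_t len = strlen(header[i]);
		if (!strncasecmp(line, header[i], len) && line[len] == ':')
			return (int)i;
	}
	return -1;
}
```

Adding a header name to the initializer is then enough for the loop to pick
it up.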

> +	/* Content stuff */
> +	if (!strncasecmp(line, "Content-Type: ", 14)) {

I'd rather not insist on SP after the colon (I do not think it even has
to exist, but I think we check for isspace() in the current code).
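
One way to be lenient about what follows the colon, sketched for
illustration only.  header_value() is a hypothetical helper, not part of
the patch: it matches the name and the colon, then skips any blanks that
happen to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() (POSIX) */

/* If line starts with "name:" (case-insensitively), return a pointer
 * to the value with leading blanks skipped; otherwise return NULL. */
static const char *header_value(const char *line, const char *name)
{
	size_t len = strlen(name);
	const char *p;

	assert(line != NULL && name != NULL);
	if (strncasecmp(line, name, len) || line[len] != ':')
		return NULL;
	p = line + len + 1;
	while (*p == ' ' || *p == '\t')
		p++;
	return p;
}
```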

> +static int handle_boundary(void)
> +{
> +again:
> +	if (!memcmp(line+content_top->boundary_len, "--", 2)) {
> +		/* we hit an end boundary */
> +		/* pop the current boundary off the stack */
> +		free(content_top->boundary);
> +		content_top--;
> +		handle_filter("\n");

Stack underflow?
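
A minimal sketch of a guarded pop, as an illustration only and not code from
the patch.  pop_boundary() is a made-up name, and content[0] is assumed to be
a base slot that is never popped.

```c
#include <assert.h>
#include <stdlib.h>

struct content_type {
	char *boundary;
	int boundary_len;
};

static struct content_type content[5];
static struct content_type *content_top = content;

/* Pop the current boundary.  Returns 0 on success, -1 if only the
 * base slot is left, in which case nothing is touched. */
static int pop_boundary(void)
{
	if (content_top <= content)
		return -1;	/* mismatched end boundary */
	free(content_top->boundary);
	content_top->boundary = NULL;
	content_top->boundary_len = 0;
	content_top--;
	assert(content_top >= content);
	return 0;
}
```

The important detail is that the comparison happens before the decrement, so
the pointer never moves below content[0].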

> +static void handle_body(void)
>  {
> ...
> +		switch (transfer_encoding) {
> +		case TE_BASE64:
> +		{
> +			/* binary data most likely doesn't have newlines */
> +			if (message_type != TYPE_TEXT) {
> +				rc=handle_filter(line);
> +				break;
> +			}
> +
> +			/* this is a decoded line that may contain
> +			 * multiple new lines.  Pass only one chunk
> +			 * at a time to handle_filter()
> +			 */
> +
> +			char *op=line;
> +

builtin-mailinfo.c:786: warning: ISO C90 forbids mixed declarations and code.

> @@ -809,18 +833,16 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
>  		fclose(cmitmsg);
>  		return -1;
>  	}
> +
> +	p_hdr_data = xcalloc(10, sizeof(char *));
> +	s_hdr_data = xcalloc(10, sizeof(char *));

These tens look unexplained magic...

> diff --git a/git-am.sh b/git-am.sh
> index 2c73d11..8fa466a 100755
> --- a/git-am.sh
> +++ b/git-am.sh
> @@ -290,6 +290,7 @@ do
>  		git-mailinfo $keep $utf8 "$dotest/msg" "$dotest/patch" \
>  			<"$dotest/$msgnum" >"$dotest/info" ||
>  			stop_here $this
> +		test -s $dotest/patch || stop_here $this
>  		git-stripspace < "$dotest/msg" > "$dotest/msg-clean"
>  		;;
>  	esac

I think this interface change probably is a good thing.  I
wonder if we need a matching change to applymbox, though.


* Re: [PATCH 1/5] builtin-mailinfo.c infrastrcture changes
  2007-03-07  5:34 ` Junio C Hamano
@ 2007-03-07 17:17   ` Don Zickus
  0 siblings, 0 replies; 12+ messages in thread
From: Don Zickus @ 2007-03-07 17:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Mar 06, 2007 at 09:34:31PM -0800, Junio C Hamano wrote:
> 
> > @@ -177,17 +175,35 @@ static int slurp_attr(const char *line, const char *name, char *attr)
> >  	return 1;
> >  }
> >  
> > -static int handle_subcontent_type(char *line)
> > +struct content_type {
> > +	char *boundary;
> > +	int boundary_len;
> > +};
> > +
> > +static struct content_type content[]={
> > +	{ NULL },
> > +	{ NULL },
> > +	{ NULL },
> > +	{ NULL },
> > +	{ NULL },
> > +};
> 
> The reason why this array has 5 elements (not 4, not 6) is...?

a flip of a coin.  I got lazy and didn't feel like implementing linked
lists.  So instead I decided to use static char arrays.  I can create a
#define <arraysize> with a comment on its intended use.  Or do you prefer
a different approach?
 
> 
> > +
> > +static struct content_type *content_top = content;
> > +
> > +static int handle_content_type(char *line)
> >  {
> > +	char boundary[256];
> > +
> > +	if (strcasestr(line, "text/") < 0)
> > +		 message_type = TYPE_OTHER;
> 
> Did you mean
> 
> 	if (strncasecmp(line, "text/", 5))
> 
> It makes me wonder how a couple thousand mails (or for that
> matter our own testsuite) could have passed the test with a
> thing like this...

Oops.  Thanks for catching that.  The default TYPE_TEXT works for almost
all situations, which is why it passes my tests.  The only reason I put
that code in there is to handle very large base64 binary blobs (ones over
2k bytes could overflow the char arrays).  I didn't have any binary blobs
that large to test with, so I never ran into this bug.  I'll fix it up.
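
For reference, strcasestr() returns a pointer (or NULL), so comparing its
result with "< 0" can never be true and message_type was never switched.
A minimal sketch of the prefix test suggested above, assuming the argument
points at the start of the Content-Type value; is_text_type() is a
hypothetical helper name, not something from the patch:

```c
#include <strings.h>	/* strncasecmp() */

/*
 * Hypothetical helper: returns 1 when a Content-Type value starts with
 * "text/" (case-insensitively), 0 otherwise.
 */
static int is_text_type(const char *value)
{
	return strncasecmp(value, "text/", 5) == 0;
}
```

A caller would then set message_type = TYPE_OTHER whenever is_text_type()
returns 0.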

> 
> > +	if (slurp_attr(line, "boundary=", boundary + 2)) {
> > +		memcpy(boundary, "--", 2);
> > +		content_top++;
> > +		content_top->boundary_len = strlen(boundary);
> > +		content_top->boundary = xmalloc(content_top->boundary_len+1);
> > +		strcpy(content_top->boundary, boundary);
> >  	}
> 
> Does anybody check the content[] stack overflow?

Nope.  I meant to, though.  I'll fix that too.
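
One possible shape for the overflow check, as a sketch only.  It assumes
a named limit (MAX_BOUNDARIES, a made-up name) in place of the bare 5,
keeps slot 0 unused as in the patch, and uses plain malloc() instead of
xmalloc() so that it stands alone; push_boundary() is likewise a
hypothetical helper:

```c
#include <stdlib.h>
#include <string.h>

/* Assumed limit on multipart nesting depth; 5 mirrors the patch. */
#define MAX_BOUNDARIES 5

struct content_type {
	char *boundary;
	int boundary_len;
};

static struct content_type content[MAX_BOUNDARIES];
static struct content_type *content_top = content;

/*
 * Push "--" + name onto the boundary stack.
 * Returns 0 on success, -1 if the stack is full or allocation fails.
 */
static int push_boundary(const char *name)
{
	size_t len = strlen(name) + 2;	/* length including the "--" prefix */
	char *copy;

	if (content_top >= &content[MAX_BOUNDARIES - 1])
		return -1;		/* nested too deeply, refuse to push */
	copy = malloc(len + 1);
	if (!copy)
		return -1;
	memcpy(copy, "--", 2);
	memcpy(copy + 2, name, len - 1);	/* copies the trailing NUL too */
	content_top++;
	content_top->boundary = copy;
	content_top->boundary_len = (int)len;
	return 0;
}
```

Because slot 0 is never filled, this allows MAX_BOUNDARIES - 1 nested
boundaries before the push is refused.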

> 
> > @@ -341,57 +290,65 @@ static void cleanup_space(char *buf)
> >  }
> >  
> >  static void decode_header(char *it);
> > -typedef int (*header_fn_t)(char *);
> > -struct header_def {
> > -	const char *name;
> > -	header_fn_t func;
> > -	int namelen;
> > +static char *header[10] = {
> > +	"From",
> > +	"Subject",
> > +	"Date",
> > +	NULL,
> > +	NULL,
> > +	NULL,
> >  };
> 
> Why initialize only three with NULL when you have four more?
> What are the 7 entries other than From/Subject/Date for?  Future
> extension?  Wouldn't 
> 
> 	static char *header[] = {
>         	"From", "Subject", "Date", NULL,
> 	};
> 
> or even:
> 
> 	static char *header[] = {
>         	"From", "Subject", "Date",
> 	};
> 
> be easier to handle (for the latter, you would need your loop with:
> 
> 	for (i = 0; i < ARRAY_SIZE(header); i++)
>         	...
> 

See patch 2/5 for a better idea of the direction I was going with this.
Again, it is another lazy trick to avoid linked lists and instead use
static arrays to easily handle new content.  It probably wouldn't hurt for
me to clean this up too.
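
For comparison, a small sketch of the ARRAY_SIZE() form suggested above.
It only covers the fixed From/Subject/Date case, not the extra headers
that patch 2/5 adds at run time; find_header() is a hypothetical helper
and ARRAY_SIZE() is defined locally here so the fragment stands alone:

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() */

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

static const char *header[] = {
	"From", "Subject", "Date",
};

/*
 * Return the index of the header whose name starts the line and is
 * immediately followed by ':', or -1 if none matches.
 */
static int find_header(const char *line)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(header); i++) {
		size_t len = strlen(header[i]);

		if (!strncasecmp(line, header[i], len) && line[len] == ':')
			return (int)i;
	}
	return -1;
}
```

With this layout the loop bound follows the initializer, so no NULL
padding or magic count is needed.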


> > +	/* Content stuff */
> > +	if (!strncasecmp(line, "Content-Type: ", 14)) {
> 
> I'd rather not insist on SP after the colon (I do not think it even has
> to exist, but I think we check for isspace() in the current code).

ok.
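
A sketch of what a more lenient match could look like: compare the header
name, require the colon, then skip any whitespace that follows.
header_value() is a hypothetical helper, not a function in
builtin-mailinfo.c:

```c
#include <ctype.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() */

/*
 * If line begins with "<name>:" (case-insensitively), return a pointer
 * to the value after any optional whitespace; otherwise return NULL.
 */
static const char *header_value(const char *line, const char *name)
{
	size_t len = strlen(name);
	const char *p;

	if (strncasecmp(line, name, len) || line[len] != ':')
		return NULL;
	p = line + len + 1;
	while (*p && isspace((unsigned char)*p))
		p++;
	return p;
}
```

This accepts "Content-Type:text/plain" as well as the usual form with a
space or tab after the colon.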

> 
> > +static int handle_boundary(void)
> > +{
> > +again:
> > +	if (!memcmp(line+content_top->boundary_len, "--", 2)) {
> > +		/* we hit an end boundary */
> > +		/* pop the current boundary off the stack */
> > +		free(content_top->boundary);
> > +		content_top--;
> > +		handle_filter("\n");
> 
> Stack underflow?

Yup.  I'll fix this.
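
The matching guard on the pop side might look like the following sketch.
It assumes the same layout as the patch (slot 0 of content[] is the
unused bottom of the stack); pop_boundary() is a hypothetical helper:

```c
#include <stdlib.h>

struct content_type {
	char *boundary;
	int boundary_len;
};

static struct content_type content[5];
static struct content_type *content_top = content;

/*
 * Pop the current boundary off the stack.
 * Returns 0 on success, -1 if no boundary is active (underflow).
 */
static int pop_boundary(void)
{
	if (content_top <= content)
		return -1;	/* stray end boundary, nothing to pop */
	free(content_top->boundary);
	content_top->boundary = NULL;
	content_top->boundary_len = 0;
	content_top--;
	return 0;
}
```

Clearing the slot after free() also avoids leaving a dangling pointer
behind if the slot is inspected before it is reused.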

> 
> > +static void handle_body(void)
> >  {
> > ...
> > +		switch (transfer_encoding) {
> > +		case TE_BASE64:
> > +		{
> > +			/* binary data most likely doesn't have newlines */
> > +			if (message_type != TYPE_TEXT) {
> > +				rc=handle_filter(line);
> > +				break;
> > +			}
> > +
> > +			/* this is a decoded line that may contain
> > +			 * multiple new lines.  Pass only one chunk
> > +			 * at a time to handle_filter()
> > +			 */
> > +
> > +			char *op=line;
> > +
> 
> builtin-mailinfo.c:786: warning: ISO C90 forbids mixed declarations and code.

Hmm.  I don't know why gcc 4.1.1 didn't catch that.  Anyway, the binary
data part was added much later, so it should be easy to fix.
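
As far as I know, gcc only reports this when asked to, for example with
-Wdeclaration-after-statement or -pedantic in C90 mode, which would
explain the silence.  A sketch of the C90-friendly layout, with the
declarations hoisted to the top of the block and the decoded buffer
handed over one line at a time; handle_chunk() is a hypothetical stand-in
for handle_filter() and simply counts its calls:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for handle_filter(): counts the chunks it gets. */
static int chunks_seen;

static void handle_chunk(const char *chunk, size_t len)
{
	(void)chunk;
	(void)len;
	chunks_seen++;
}

/* Feed a decoded buffer to handle_chunk() one line at a time. */
static void feed_lines(const char *buf)
{
	/* C90: every declaration precedes the first statement of the block */
	const char *op = buf;
	const char *nl;

	while (*op) {
		nl = strchr(op, '\n');
		if (!nl) {
			/* trailing text without a newline */
			handle_chunk(op, strlen(op));
			break;
		}
		handle_chunk(op, (size_t)(nl - op) + 1);
		op = nl + 1;
	}
}
```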

> 
> > @@ -809,18 +833,16 @@ int mailinfo(FILE *in, FILE *out, int ks, const char *encoding,
> >  		fclose(cmitmsg);
> >  		return -1;
> >  	}
> > +
> > +	p_hdr_data = xcalloc(10, sizeof(char *));
> > +	s_hdr_data = xcalloc(10, sizeof(char *));
> 
> These tens look unexplained magic...

I'll create a #define for those too.

> 
> > diff --git a/git-am.sh b/git-am.sh
> > index 2c73d11..8fa466a 100755
> > --- a/git-am.sh
> > +++ b/git-am.sh
> > @@ -290,6 +290,7 @@ do
> >  		git-mailinfo $keep $utf8 "$dotest/msg" "$dotest/patch" \
> >  			<"$dotest/$msgnum" >"$dotest/info" ||
> >  			stop_here $this
> > +		test -s $dotest/patch || stop_here $this
> >  		git-stripspace < "$dotest/msg" > "$dotest/msg-clean"
> >  		;;
> >  	esac
> 
> I think this interface change probably is a good thing.  I
> wonder if we need a matching change to applymbox, though.

That and git-quiltimport too.

I'll send out another draft this afternoon.

Cheers,
Don


end of thread, other threads:[~2007-03-15 14:38 UTC | newest]

Thread overview: 12+ messages
2007-03-14 20:12 git-mailinfo fixes/features v3 Don Zickus
2007-03-14 20:12 ` [PATCH 1/5] builtin-mailinfo.c infrastrcture changes Don Zickus
2007-03-15 14:35   ` Don Zickus
2007-03-14 20:12 ` [PATCH 2/5] add the ability to select more email header fields to output Don Zickus
2007-03-15 14:36   ` Don Zickus
2007-03-14 20:12 ` [PATCH 3/5] restrict the patch filtering Don Zickus
2007-03-14 20:12 ` [PATCH 4/5] Add a couple more test cases to the suite Don Zickus
2007-03-14 20:12 ` [PATCH 5/5] fix a utf8 issue in t5100/patch005 Don Zickus
  -- strict thread matches above, loose matches on Subject: below --
2007-03-12 19:52 [PATCH 0/5] git-mailinfo fixes/features Don Zickus
2007-03-12 19:52 ` [PATCH 1/5] builtin-mailinfo.c infrastrcture changes Don Zickus
2007-03-06 21:57 Don Zickus
2007-03-07  5:34 ` Junio C Hamano
2007-03-07 17:17   ` Don Zickus
