git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Conrad Irwin <conrad.irwin@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Dov Grobgeld <dov.grobgeld@gmail.com>
Subject: [PATCH 2/2] grep: respect diff attributes for binary-ness
Date: Wed, 1 Feb 2012 18:21:29 -0500	[thread overview]
Message-ID: <20120201232129.GB2652@sigill.intra.peff.net> (raw)
In-Reply-To: <20120201221437.GA19044@sigill.intra.peff.net>

There is currently no way for users to tell git-grep that a
particular path is or is not a binary file; instead, grep
always relies on its auto-detection (or the user specifying
"-a" to treat all binary-looking files like text).

This patch teaches git-grep to use the same attribute lookup
that is used by git-diff. We could add a new "grep" flag,
but that is unnecessarily complex and unlikely to be useful.
Despite the name, the "-diff" attribute (or "diff=foo" and
the associated diff.foo.binary config option) are really
about describing the contents of the path. It's simply
historical that diff was the only thing that cared about
these attributes in the past.

And if this simple approach turns out to be insufficient, we
still have a backwards-compatible path forward: we can add a
separate "grep" attribute, and fall back to respecting
"diff" if it is unset.

There are a few things worth nothing about the
implementation.

One is that the attribute lookup happens outside of the
grep.[ch] interface (i.e., outside of grep_buffer). We could
do it at a lower level, which would be slightly more
convenient for callers. However, this interacts badly with
threading.  The attribute-lookup code performs best when
lookup order matches the filesystem order (so looking up
"a/b/c" is cheaper if we just looked up "a/b/d" than if we
just did "x/y/z"). Because we issue many simultaneous
requests to grep_buffer, performing the attribute lookup at
that level will cause requests from unrelated paths to
interleave, and we lose the locality that makes the lookup
optimization work.

Instead, in the threaded case we check the attributes as
they are added to the work queue, meaning they are looked up
in the optimal order (in the single threaded case, this is a
non-issue, as we process the files serially in the optimal
order).

Here are a few numbers showing the difference. The first is
a best-of-five time for "git grep foo" on the linux-2.6 repo
before this patch (on a 4-core HT processor, using 8
threads):

  real    0m0.306s
  user    0m1.512s
  sys     0m0.412s

Now here's the time for the same operation with a trial
implementation looking up attributes in grep_buffer,
showing a 31% slowdown:

  real    0m0.401s
  user    0m1.760s
  sys     0m0.636s

And here's the same operation with this patch, with only an
11% slowdown:

  real    0m0.339s
  user    0m1.444s
  sys     0m0.584s

Note that while the percentages are big, the absolute
numbers are pretty small. In particular, this is a very
inexpensive grep to do. A more complex regex should have the
same absolute slowdown from the attribute lookup, but it
would be a much smaller percentage of the total processing
time.  So doing it this way is not a huge win, but it does
help on small greps.

The second issue worth noting is that while we do a full
attribute lookup, we pass along only the binary flag to
grep_buffer. When the "-p" flag is given to grep, we will
actually look up the same attributes to find funcname
patterns of matches. We could pass along the pointer to the
userdiff driver for reuse.

However, it's not worth doing this. It clutters the code, as
the driver has to be passed through a large number of helper
functions (and pollutes the grep_buffer interface with
userdiff code). And in my tests, it didn't actually improve
performance. Because we only have to look up the attribute
for a grep hit, in most cases we will only do the funcname
lookup for a small subset of files. The cost of the extra
lookups turns out to be negligible.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c         |   26 ++++++++++++++++++++++----
 t/t7008-grep-binary.sh |   24 ++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index e328316..bb38804 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -42,6 +42,7 @@ enum work_type {WORK_SHA1, WORK_FILE};
 struct work_item {
 	enum work_type type;
 	char *name;
+	int is_binary;
 
 	/* if type == WORK_SHA1, then 'identifier' is a SHA1,
 	 * otherwise type == WORK_FILE, and 'identifier' is a NUL
@@ -113,6 +114,14 @@ static pthread_cond_t cond_result;
 
 static int skip_first_line;
 
+static int path_is_binary(const char *path)
+{
+	struct userdiff_driver *drv = userdiff_find_by_path(path);
+	if (drv)
+		return drv->binary;
+	return -1;
+}
+
 static void add_work(enum work_type type, char *name, void *id)
 {
 	grep_lock();
@@ -123,6 +132,11 @@ static void add_work(enum work_type type, char *name, void *id)
 
 	todo[todo_end].type = type;
 	todo[todo_end].name = name;
+
+	pthread_mutex_lock(&grep_attr_mutex);
+	todo[todo_end].is_binary = path_is_binary(name);
+	pthread_mutex_unlock(&grep_attr_mutex);
+
 	todo[todo_end].identifier = id;
 	todo[todo_end].done = 0;
 	strbuf_reset(&todo[todo_end].out);
@@ -221,14 +235,16 @@ static void *run(void *arg)
 			void* data = load_sha1(w->identifier, &sz, w->name);
 
 			if (data) {
-				hit |= grep_buffer(opt, w->name, -1, data, sz);
+				hit |= grep_buffer(opt, w->name, w->is_binary,
+						   data, sz);
 				free(data);
 			}
 		} else if (w->type == WORK_FILE) {
 			size_t sz;
 			void* data = load_file(w->identifier, &sz);
 			if (data) {
-				hit |= grep_buffer(opt, w->name, -1, data, sz);
+				hit |= grep_buffer(opt, w->name, w->is_binary,
+						   data, sz);
 				free(data);
 			}
 		} else {
@@ -421,7 +437,8 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 		if (!data)
 			hit = 0;
 		else
-			hit = grep_buffer(opt, name, -1, data, sz);
+			hit = grep_buffer(opt, name, path_is_binary(name),
+					  data, sz);
 
 		free(data);
 		free(name);
@@ -483,7 +500,8 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 		if (!data)
 			hit = 0;
 		else
-			hit = grep_buffer(opt, name, -1, data, sz);
+			hit = grep_buffer(opt, name, path_is_binary(name),
+					  data, sz);
 
 		free(data);
 		free(name);
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..fd6410f 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,28 @@ test_expect_success 'git grep y<NUL>x a' "
 	test_must_fail git grep -f f a
 "
 
+test_expect_success 'grep respects binary diff attribute' '
+	echo text >t &&
+	git add t &&
+	echo t:text >expect &&
+	git grep text t >actual &&
+	test_cmp expect actual &&
+	echo "t -diff" >.gitattributes &&
+	echo "Binary file t matches" >expect &&
+	git grep text t >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep respects not-binary diff attribute' '
+	echo binQary | q_to_nul >b &&
+	git add b &&
+	echo "Binary file b matches" >expect &&
+	git grep bin b >actual &&
+	test_cmp expect actual &&
+	echo "b diff" >.gitattributes &&
+	echo "b:binQary" >expect &&
+	git grep bin b | nul_to_q >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
1.7.9.3.gc3fce1.dirty

  parent reply	other threads:[~2012-02-01 23:21 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-17  9:14 git-grep while excluding files in a blacklist Dov Grobgeld
2012-01-17  9:19 ` Nguyen Thai Ngoc Duy
2012-01-17 20:09   ` Junio C Hamano
2012-01-18  1:24     ` Nguyen Thai Ngoc Duy
2012-01-23  9:37       ` [PATCH] Don't search files with an unset "grep" attribute conrad.irwin
2012-01-23 18:33         ` Junio C Hamano
2012-01-23 22:59           ` Conrad Irwin
2012-01-24  6:59             ` Junio C Hamano
2012-01-25 21:46               ` Jeff King
2012-01-26 13:51                 ` Stephen Bash
2012-01-26 17:29                   ` Jeff King
2012-01-26 16:45                 ` Michael Haggerty
2012-01-27  6:35                   ` Jeff King
2012-02-01  8:01                 ` Junio C Hamano
2012-02-01  8:20                   ` Jeff King
2012-02-01  9:10                     ` Jeff King
2012-02-01  9:28                       ` Conrad Irwin
2012-02-01 22:14                         ` Jeff King
2012-02-01 23:20                           ` Jeff King
2012-02-02  2:03                             ` Junio C Hamano
2012-02-01 23:21                           ` [PATCH 1/2] grep: let grep_buffer callers specify a binary flag Jeff King
2012-02-02  0:47                             ` Junio C Hamano
2012-02-02  0:52                               ` Jeff King
2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
2012-02-02  8:18                                   ` [PATCH 1/9] grep: make locking flag global Jeff King
2012-02-02  8:18                                   ` [PATCH 2/9] grep: move sha1-reading mutex into low-level code Jeff King
2012-02-02  8:19                                   ` [PATCH 3/9] grep: refactor the concept of "grep source" into an object Jeff King
2012-02-02  8:19                                   ` [PATCH 4/9] convert git-grep to use grep_source interface Jeff King
2012-02-02  8:20                                   ` [PATCH 5/9] grep: drop grep_buffer's "name" parameter Jeff King
2012-02-02  8:20                                   ` [PATCH 6/9] grep: cache userdiff_driver in grep_source Jeff King
2012-02-02 18:34                                     ` Junio C Hamano
2012-02-02 19:37                                       ` Jeff King
2012-02-02  8:21                                   ` [PATCH 7/9] grep: respect diff attributes for binary-ness Jeff King
2012-02-02  8:21                                   ` [PATCH 8/9] grep: load file data after checking binary-ness Jeff King
2012-02-02  8:24                                   ` [PATCH 9/9] grep: pre-load userdiff drivers when threaded Jeff King
2012-02-02  8:30                                   ` [PATCH 0/9] respect binary attribute in grep Jeff King
2012-02-02 11:00                                   ` Thomas Rast
2012-02-02 11:07                                     ` Jeff King
2012-02-02 18:39                                   ` Junio C Hamano
2012-02-04 19:22                                   ` Pete Wyckoff
2012-02-04 23:18                                     ` Jeff King
2012-02-01 23:21                           ` Jeff King [this message]
2012-02-01 16:28                       ` [PATCH] Don't search files with an unset "grep" attribute Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120201232129.GB2652@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=conrad.irwin@gmail.com \
    --cc=dov.grobgeld@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).