git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Conrad Irwin <conrad.irwin@gmail.com>
Cc: git@vger.kernel.org, Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Dov Grobgeld <dov.grobgeld@gmail.com>
Subject: Re: [PATCH] Don't search files with an unset "grep" attribute
Date: Mon, 23 Jan 2012 22:59:45 -0800	[thread overview]
Message-ID: <7vaa5d4mce.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <1327359555-29457-1-git-send-email-conrad.irwin@gmail.com> (Conrad Irwin's message of "Mon, 23 Jan 2012 14:59:15 -0800")

Conrad Irwin <conrad.irwin@gmail.com> writes:

>> in order to avoid uselessly spewing garbage at you while reminding you
>> that the command is not showing the whole truth and you _might_ want to
>> look into the elided file if you wanted to with "grep -a world hello.o".
>> Compared to this, it feels that the result your patch goes too far and
>> hides the truth a bit too much to my taste. Maybe it is just me.
>
> I used to use this approach, hooking into the "diff" attribute directly to mark
> a file as binary, however that was clearly a hack.

After thinking about this a bit more, I have to say I disagree that it is
a hack.

I agree that the issue you are trying to address is real, but I think the
approach taken by the patch is flawed on three counts:

 - You cannot do "grep -a" equivalent once you set your "-grep" attribute;

 - The user has flexibility to set "diff" and "grep" independently, which
   is an unnecessary complication [*1*]; and

 - The user does not get any hint that potential hits are elided, like
   normal "grep" would do.

I also think we can fix the above problems, while simplifying the external
interface visible to the end users.

So let's step back a bit and take a look at the handling of files for
which you do not want to see patch output and/or you do not want to see
grep hits, in a fictional but realisitic use scenario.

I am building a small embedded applicance that links with a binary blob
firmware, xyzzy.so, supplied by an upstream vendor, and this blob is
updated from time to time. In order to keep track of what binary blob went
into which product release, this binary blob is stored in the repository
together with the source files. Because it is useless to view binary
changes in "git log -p" output, I would naturally mark this as "binary".

Now, is it realistic to expect that I might want to see "grep" hits from
this binary blob? I do not think so [*2*].

When I say "xyzzy.so is a binary" in my .gitattributes file, according to
the current documentation, I am saying that "do not run diff, and also do
not run EOL conversion on this file" [*3*], but from the end user's point
of view, it is natural that I am entitled to expect much more than that.
Without having to know how many different kinds of textual operations the
SCM I happen to be using supports, I am telling it that I do not want to
see _any_ textual operations done on it. If a new version of "git grep"
learns to honor the attributes system, I want my "this is binary" to be
considered by the improved "git grep".

If you think about it this way from the very high level description of the
problem, aka, end user's point of view, it is fairly clear that tying the
"binary" attribute to "git grep" to allow us to override the built-in
buffer_is_binary() call you see in grep.c gives the most intuitive result,
without forcing the user to do anything more than they are already doing.

Because "binary" decomposes to "-diff" and "-text", we have two choices.
We _could_ say "-text" should be the one that overrides buffer_is_binary()
autodetection, but using "-diff" instead for that purpose opens the door
for us to give our users even more intuitive and consistent experience.

Suppose that this binary blob firmware came with an API manual formatted
in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
repository, but again, running textual diff between updated versions of
the PDF documentation would not be very useful. I however may have a
textconv filter defined for it to grab the text by running pdf2ascii.

Now if my "git show --textconv xyzzy.pdf" has an output line that says a
string "XYZZY API 2.0" was added to the current version, wouldn't it be
natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
to find it [*4*]?

As an added bonus, if you truly want to omit _any_ hit from "git grep" for
your minified.js files, you can easily do so. Just define a textconv
filter that yields an empty string for them, and "grep --textconv" won't
produce any matches, not even "Binary file ... matches".

So I am inclined to think that reusing the infrastructure we already have
for the `diff` attribute as much as possible would be a better design for
this new feature you are proposing.

The name of the attribute `diff` is somewhat unfortunate, but the end user
interface to the feature at the highest level is already called "binary"
attribute (macro), and our low-level documentation can give precise
meaning of the attribute while alluding to the origin of the name so that
intelligent readers would understand it pretty easily (see attached
patch).

Besides, we already tell the users in "Marking files as binary" section to
mark them with "-diff" in the attributes file if they want their contents
treated as binary.


[Footnotes]

*1* An unused flexibility like this does nothing useful, other than
forcing the user to flip two attributes always as a pair, when one should
suffice.

*2* The essence of my previous message was "why it is insufficient to mark
them binary, i.e. uninteresting for the purpose of all textual operations
not just grep but also for diff." which was unanswered.

*3* This is because "binary" is defined as an attribute macro that expands
to "-diff -text".

*4* This also suggests that the infrastructure we already have for driving
"git diff" is a good match for "git grep".  The "funcname" patterns, which
"git grep -p" already can borrow and use from "diff" infrastructure, is
another indication that using `diff` is a good match for "git grep".


 Documentation/gitattributes.txt |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..63844c4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -388,9 +388,18 @@ Generating diff text
 `diff`
 ^^^^^^
 
-The attribute `diff` affects how 'git' generates diffs for particular
-files. It can tell git whether to generate a textual patch for the path
-or to treat the path as a binary file.  It can also affect what line is
+The attribute `diff` affects if `git` treats the contents of file as text
+or binary. Historically, `git diff` and its friends were the first to use
+this attribute, hence its name, but the attribute governs textual
+operations in general, including `git grep`.
+
+For `git diff` and its friends, differences in contents marked as text
+produces a textual patch, while differences in non-text are reported only
+as "Binary files differ". For `git grep`, matches in contents marked as
+text are reported by showing lines that contain them, while matches in
+non-text are reported only as "Binary file ... matches".
+
+This attribute can also affect what line is
 shown on the hunk header `@@ -k,l +n,m @@` line, tell git to use an
 external command to generate the diff, or ask git to convert binary
 files to a text format before generating the diff.

  reply	other threads:[~2012-01-24  7:00 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-17  9:14 git-grep while excluding files in a blacklist Dov Grobgeld
2012-01-17  9:19 ` Nguyen Thai Ngoc Duy
2012-01-17 20:09   ` Junio C Hamano
2012-01-18  1:24     ` Nguyen Thai Ngoc Duy
2012-01-23  9:37       ` [PATCH] Don't search files with an unset "grep" attribute conrad.irwin
2012-01-23 18:33         ` Junio C Hamano
2012-01-23 22:59           ` Conrad Irwin
2012-01-24  6:59             ` Junio C Hamano [this message]
2012-01-25 21:46               ` Jeff King
2012-01-26 13:51                 ` Stephen Bash
2012-01-26 17:29                   ` Jeff King
2012-01-26 16:45                 ` Michael Haggerty
2012-01-27  6:35                   ` Jeff King
2012-02-01  8:01                 ` Junio C Hamano
2012-02-01  8:20                   ` Jeff King
2012-02-01  9:10                     ` Jeff King
2012-02-01  9:28                       ` Conrad Irwin
2012-02-01 22:14                         ` Jeff King
2012-02-01 23:20                           ` Jeff King
2012-02-02  2:03                             ` Junio C Hamano
2012-02-01 23:21                           ` [PATCH 1/2] grep: let grep_buffer callers specify a binary flag Jeff King
2012-02-02  0:47                             ` Junio C Hamano
2012-02-02  0:52                               ` Jeff King
2012-02-02  8:17                                 ` [PATCH 0/9] respect binary attribute in grep Jeff King
2012-02-02  8:18                                   ` [PATCH 1/9] grep: make locking flag global Jeff King
2012-02-02  8:18                                   ` [PATCH 2/9] grep: move sha1-reading mutex into low-level code Jeff King
2012-02-02  8:19                                   ` [PATCH 3/9] grep: refactor the concept of "grep source" into an object Jeff King
2012-02-02  8:19                                   ` [PATCH 4/9] convert git-grep to use grep_source interface Jeff King
2012-02-02  8:20                                   ` [PATCH 5/9] grep: drop grep_buffer's "name" parameter Jeff King
2012-02-02  8:20                                   ` [PATCH 6/9] grep: cache userdiff_driver in grep_source Jeff King
2012-02-02 18:34                                     ` Junio C Hamano
2012-02-02 19:37                                       ` Jeff King
2012-02-02  8:21                                   ` [PATCH 7/9] grep: respect diff attributes for binary-ness Jeff King
2012-02-02  8:21                                   ` [PATCH 8/9] grep: load file data after checking binary-ness Jeff King
2012-02-02  8:24                                   ` [PATCH 9/9] grep: pre-load userdiff drivers when threaded Jeff King
2012-02-02  8:30                                   ` [PATCH 0/9] respect binary attribute in grep Jeff King
2012-02-02 11:00                                   ` Thomas Rast
2012-02-02 11:07                                     ` Jeff King
2012-02-02 18:39                                   ` Junio C Hamano
2012-02-04 19:22                                   ` Pete Wyckoff
2012-02-04 23:18                                     ` Jeff King
2012-02-01 23:21                           ` [PATCH 2/2] grep: respect diff attributes for binary-ness Jeff King
2012-02-01 16:28                       ` [PATCH] Don't search files with an unset "grep" attribute Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7vaa5d4mce.fsf@alter.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=conrad.irwin@gmail.com \
    --cc=dov.grobgeld@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).