From: ZheNing Hu <adlternative@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: ZheNing Hu via GitGitGadget <gitgitgadget@gmail.com>,
Git List <git@vger.kernel.org>,
Christian Couder <christian.couder@gmail.com>,
Hariom Verma <hariom18599@gmail.com>,
Karthik Nayak <karthik.188@gmail.com>,
Felipe Contreras <felipe.contreras@gmail.com>,
Bagas Sanjaya <bagasdotme@gmail.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 1/2] [GSOC] ref-filter: add %(raw) atom
Date: Fri, 28 May 2021 23:04:59 +0800 [thread overview]
Message-ID: <CAOLTT8SLLgZnF0SV2FPPBJkB=ybeh8mamTPqc-M6CXQeepooOQ@mail.gmail.com> (raw)
In-Reply-To: <xmqq1r9r8spu.fsf@gitster.g>
Junio C Hamano <gitster@pobox.com> 于2021年5月28日周五 上午11:04写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > The raw data of blob, tree objects may contain '\0', but most of
> > the logic in `ref-filter` depands on the output of the atom being
> > a structured string (end with '\0').
>
> Text, yes, string is also OK. But structured? Probably not.
>
> ... being text (specifically, no embedded NULs in it).
>
OK.
> > Beyond, `--format=%(raw)` should not combine with `--python`, `--shell`,
> > `--tcl`, `--perl` because if our binary raw data is passed to a variable
> > in the host language, the host languages may cause escape errors.
>
> OK. I think at least --perl and possibly --python should be able to
> express NULs in the "string" type we use from the host language, but
> I am perfectly fine with the decision to leave it to later updates.
>
Yes, for the time being, unified processing --<lang> will be easier.
> > +Note that `--format=%(raw)` should not combine with `--python`, `--shell`, `--tcl`,
>
> "should not combine" -> "cannot be used" would make it read more
> naturally (ditto for the phrase used in the proposed log message).
>
OK, so the wording is better.
> >
> > struct atom_value {
> > const char *s;
> > + size_t s_size;
>
> OK, so everything used to be a C-string that cannot hold NULs in it,
> but now it is a counted <ptr, len> string. Good.
>
This suddenly reminded me of strbuf...
I don't know if it is worth replacing all s, s_size with strbuf.
> > @@ -588,6 +607,10 @@ static int parse_ref_filter_atom(const struct ref_format *format,
> > return strbuf_addf_ret(err, -1, _("malformed field name: %.*s"),
> > (int)(ep-atom), atom);
> >
> > + if (format->quote_style && starts_with(sp, "raw"))
> > + return strbuf_addf_ret(err, -1, _("--format=%.*s should not combine with"
> > + "--python, --shell, --tcl, --perl"), (int)(ep-atom), atom);
>
> Don't we want to allow "raw:size" that would be a plain text?
> I am not sure if this check belongs here in the first place.
> Shouldn't it be done in raw_atom_parser() instead?
>
You are right: "raw:size" should be keep, but I can't this check to
raw_atom_parser(), becase "if I move it to raw_atom_parser(), when we
use:
`git ref-filter --format=%raw --sort=raw --python`
Git will continue to run instead of die because parse_sorting_atom() will
use a dummy ref_format and don't remember --language details, next time
format_ref_array_item() will reuse the used_atom entry of sorting atom in
parse_ref_filter_atom(), this will skip the check in raw_atom_parser()."
> Another idea is to teach a more generic rule to quote_formatting()
> to detect NULs in v->s[0..v->s_size] at runtime and barf, i.e. a
> plain-text blob object can be used with "--shell --format=%(raw)"
> just fine.
>
The cost of such a check is not small. Maybe can add an option
such as "--only-text" to do it.
> > @@ -652,11 +675,14 @@ static int parse_ref_filter_atom(const struct ref_format *format,
> > return at;
> > }
> >
> > -static void quote_formatting(struct strbuf *s, const char *str, int quote_style)
> > +static void quote_formatting(struct strbuf *s, const char *str, size_t len, int quote_style)
> > {
> > switch (quote_style) {
> > case QUOTE_NONE:
> > - strbuf_addstr(s, str);
> > + if (len != ATOM_VALUE_S_SIZE_INIT)
> > + strbuf_add(s, str, len);
> > + else
> > + strbuf_addstr(s, str);
>
> It probably is a good idea to invent a C preprocessor macro for a
> named constant to be used when a structure is initialized, but it
> would be easier to read if the rule is "len field is negative when
> the value is a C-string", e.g.
>
> if (len < 0)
>
I do not recognize such an approach because we are deal
with "size_t s_size", if (len < 0) will never be established.
I use -1 is because it's equal to 18446744073709551615
and it's impossible to have such a large file in Git.
> > @@ -785,14 +814,16 @@ static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state
> > return 0;
> > }
> >
> > -static int is_empty(const char *s)
> > +static int is_empty(struct strbuf *buf)
> > {
> > - while (*s != '\0') {
> > - if (!isspace(*s))
> > - return 0;
> > + const char *s = buf->buf;
> > + size_t cur_len = 0;
> > +
> > + while ((cur_len != buf->len) && (isspace(*s) || *s == '\0')) {
> > s++;
> > + cur_len++;
> > }
> > - return 1;
> > + return cur_len == buf->len;
> > }
>
> Assuming that we do want to treat NULs the same way as whitespaces,
> the updated code works as intended, which is good. But I have no
> reason to support that design decision. I do not have a strong
> reason to support a design decision that goes the opposite way to
> treat a NUL just like we treat an 'X', but at least I can understand
> it (i.e. "because we have no reason to special case NUL any more
> than 'X' when trying to see if a buffer is 'empty', we don't").
> This code on the other hand must be supported with "because we need
> to special case NUL for such and such reasons for the purpose of
> determining if a buffer is 'empty', we treat them the same way as
> whitespaces".
>
Something like "\0abc", from the perspective of the string, it is empty;
from the perspective of the memory, it is not empty; I don't know any
absolutely good solutions here.
> Can the change in this commit violate the invariant that
> if_then_else->str cannot be NULL, which seems to have been the case
> forever as we see an unchecked strcmp() done in the original?
>
> If so, perhaps you can check the condition upfront, where you
> compute str_len above, e.g.
>
> if (!if_then_else->str) {
> if (if_then_else->cmp_status == COMPARE_EQUAL ||
> if_then_else->cmp_status == COMPARE_UNEQUAL)
> BUG(...);
> } else
> str_len = strlen(...);
>
> If not, then I do not see the point of adding this (and later) check
> with BUG to this code.
>
> Or is the invariant that .str must not be NULL could have been
> violated without this patch (i.e. the original was buggy in running
> strcmp() on .str without checking)? If so, please make it a separate
> preliminary change to add such an assert.
>
The BUG() here actually acts as an "assert()". ".str must not be NULL" is
right, it point to "xxx" in "%(if:equals=xxx)", so it seems that these BUG()
are somewhat redundant, I will remove them.
> > /* See grab_values */
> > -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> > +static void grab_raw_data(struct atom_value *val, int deref, void *buf, unsigned long buf_size, struct object *obj)
> > {
> > int i;
> > const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
> > @@ -1307,10 +1349,22 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> > continue;
> > if (deref)
> > name++;
> > - if (strcmp(name, "body") &&
> > - !starts_with(name, "subject") &&
> > - !starts_with(name, "trailers") &&
> > - !starts_with(name, "contents"))
> > +
> > + if (starts_with(name, "raw")) {
> > + if (atom->u.raw_data.option == RAW_BARE) {
> > + v->s = xmemdupz(buf, buf_size);
> > + v->s_size = buf_size;
> > + } else if (atom->u.raw_data.option == RAW_LENGTH)
> > + v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
> > + continue;
> > + }
>
> I can understand that "raw[:<options>]" handling has been inserted
> above the existing "from here on, we only deal with log message
> components" check. But
>
> > + if ((obj->type != OBJ_TAG &&
> > + obj->type != OBJ_COMMIT) ||
>
> I do not see why these new conditions are added. The change is not
> justified in the proposed log message, the original did not need
> these conditions, and this does not concern the primary point of the
> change, which is to start supporting %(raw[:<option>]) placeholder.
>
> If it is needed as a bugfix (e.g. it may be that you consider "if a
> blob has contents that looks very similar to 'git cat-file commit
> HEAD', %(body) and friends parse these out, even though it is not a
> commit" is a bug and the change to add these extra tests is meant as
> a fix), that should be done as a preliminary change before adding
> the support for a new atom.
>
Almost what I means: Make a strong guarantee that blob and tree
will never pass the check so that we can don't worry about incorrect
parsing in find_subpos(). The reason I put it in this patch is that only
commit and tag objects will execute `grab_sub_body_contents()` before,
but in the current patch it has changed.
> > + (strcmp(name, "body") &&
> > + !starts_with(name, "subject") &&
> > + !starts_with(name, "trailers") &&
> > + !starts_with(name, "contents")))
> > continue;
> > if (!subpos)
> > find_subpos(buf,
> > @@ -1374,25 +1428,30 @@ static void fill_missing_values(struct atom_value *val)
> > * pointed at by the ref itself; otherwise it is the object the
> > * ref (which is a tag) refers to.
> > */
> > -static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
> > +static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
> > {
> > + void *buf = data->content;
> > + unsigned long buf_size = data->size;
> > +
> > switch (obj->type) {
> > case OBJ_TAG:
> > grab_tag_values(val, deref, obj);
> > - grab_sub_body_contents(val, deref, buf);
> > + grab_raw_data(val, deref, buf, buf_size, obj);
>
> It is very strange that a helper that is named to grab raw data can
> still process pieces out of a structured data. The original name is
> still a far better match to what the function does, even after this
> patch teaches it to also honor %(raw) placeholder. It is still
> about grabbing various "sub"-pieces out of "body contents", and the
> sub-piece the %(raw) grabs just happens to be "the whole thing".
>
Well, I can't think of a better name, My original idea was grab_raw_data()
can grab itself, header, contents, It is more general than
grab_sub_body_contents(),
raw data is not a part of "subject" or "body" of "contents"...
> > +static int memcasecmp(const void *vs1, const void *vs2, size_t n)
> > +{
> > + size_t i;
> > + const char *s1 = (const char *)vs1;
> > + const char *s2 = (const char *)vs2;
> > +
> > + for (i = 0; i < n; i++) {
> > + unsigned char u1 = s1[i];
> > + unsigned char u2 = s2[i];
> > + int U1 = toupper (u1);
> > + int U2 = toupper (u2);
>
> Does toupper('\0') even have a defined meaning?
>
> > + int diff = (UCHAR_MAX <= INT_MAX ? U1 - U2
> > + : U1 < U2 ? -1 : U2 < U1);
>
> Looks crazy to worry about uchar wider than int. Such a system is
> not even standard compliant, is it?
>
Forget about this inelegant help function. As I said in my reply to Felipe,
this is copied from gunlib...
> > @@ -2304,6 +2382,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> > int cmp_detached_head = 0;
> > cmp_type cmp_type = used_atom[s->atom].type;
> > struct strbuf err = STRBUF_INIT;
> > + size_t slen = 0;
> >
> > if (get_ref_atom_value(a, s->atom, &va, &err))
> > die("%s", err.buf);
> > @@ -2317,10 +2396,32 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> > } else if (s->sort_flags & REF_SORTING_VERSION) {
> > cmp = versioncmp(va->s, vb->s);
> > } else if (cmp_type == FIELD_STR) {
> > - int (*cmp_fn)(const char *, const char *);
> > - cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > - ? strcasecmp : strcmp;
> > - cmp = cmp_fn(va->s, vb->s);
>
> > + if (va->s_size == ATOM_VALUE_S_SIZE_INIT &&
> > + vb->s_size == ATOM_VALUE_S_SIZE_INIT) {
> > + int (*cmp_fn)(const char *, const char *);
> > + cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > + ? strcasecmp : strcmp;
> > + cmp = cmp_fn(va->s, vb->s);
> > + } else {
> > + int (*cmp_fn)(const void *, const void *, size_t);
> > + cmp_fn = s->sort_flags & REF_SORTING_ICASE
> > + ? memcasecmp : memcmp;
>
> Why not introduce two local temporary variables a_size and b_size
> and initialize them upfront like so:
>
> a_size = va->s_size < 0 ? strlen(va->s) : va->s_size;
> b_size = vb->s_size < 0 ? strlen(vb->s) : vb->s_size;
>
> Wouldn't that allow you to do without the complex "if both are
> counted, do this, if A is counted but B is not, do that, ..."
> cascade?
>
Sorry, such code would be really ugly for reading.
> I can sort-of see the point of special casing "both are traditional
> C strings" case (i.e. the "if" side of the "else" we are discussing
> here) and using strcasecmp/strcmp instead of memcasecmp/memcmp, but
> I do not see much point in having the if/elseif/else cascade inside
> this "else" clause.
>
I will try to modify its logic.
> Thanks.
Your reply is very accurate.
Thanks.
--
ZheNing Hu
next prev parent reply other threads:[~2021-05-28 15:05 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-05-27 14:43 [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-05-27 14:43 ` [PATCH 1/2] " ZheNing Hu via GitGitGadget
2021-05-27 16:36 ` Felipe Contreras
2021-05-28 13:02 ` ZheNing Hu
2021-05-28 16:30 ` Felipe Contreras
2021-05-30 5:37 ` ZheNing Hu
2021-05-29 13:23 ` Phillip Wood
2021-05-29 15:24 ` Felipe Contreras
2021-05-29 17:23 ` Phillip Wood
2021-05-30 6:29 ` ZheNing Hu
2021-05-30 13:05 ` Phillip Wood
2021-05-31 14:15 ` ZheNing Hu
2021-05-31 15:35 ` Felipe Contreras
2021-05-30 6:26 ` ZheNing Hu
2021-05-30 13:02 ` Phillip Wood
2021-05-28 3:03 ` Junio C Hamano
2021-05-28 15:04 ` ZheNing Hu [this message]
2021-05-28 16:38 ` Felipe Contreras
2021-05-30 8:11 ` ZheNing Hu
2021-05-27 14:43 ` [PATCH 2/2] [GSOC] ref-filter: add %(header) atom ZheNing Hu via GitGitGadget
2021-05-27 16:37 ` Felipe Contreras
2021-05-28 3:06 ` Junio C Hamano
2021-05-28 4:36 ` Junio C Hamano
2021-05-28 15:19 ` ZheNing Hu
2021-05-27 15:39 ` [PATCH 0/2] [GSOC] ref-filter: add %(raw) atom Felipe Contreras
2021-05-30 13:01 ` [PATCH v2 " ZheNing Hu via GitGitGadget
2021-05-30 13:01 ` [PATCH v2 1/2] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
2021-05-31 5:34 ` Junio C Hamano
2021-05-30 13:01 ` [PATCH v2 2/2] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-05-31 0:44 ` Junio C Hamano
2021-05-31 14:35 ` ZheNing Hu
2021-06-01 9:54 ` Junio C Hamano
2021-06-01 11:05 ` ZheNing Hu
2021-05-31 4:04 ` Junio C Hamano
2021-05-31 14:40 ` ZheNing Hu
2021-06-01 8:54 ` Junio C Hamano
2021-06-01 11:00 ` ZheNing Hu
2021-06-01 13:48 ` Johannes Schindelin
2021-05-31 4:10 ` Junio C Hamano
2021-05-31 15:41 ` Felipe Contreras
2021-06-01 10:37 ` ZheNing Hu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAOLTT8SLLgZnF0SV2FPPBJkB=ybeh8mamTPqc-M6CXQeepooOQ@mail.gmail.com' \
--to=adlternative@gmail.com \
--cc=bagasdotme@gmail.com \
--cc=christian.couder@gmail.com \
--cc=felipe.contreras@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=hariom18599@gmail.com \
--cc=karthik.188@gmail.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).