* [PATCH] ref-filter: sort numerically when ":size" is used
@ 2023-09-01 14:24 Kousik Sanagavarapu
2023-09-01 16:43 ` Junio C Hamano
2023-09-02 9:00 ` [PATCH v2] " Kousik Sanagavarapu
0 siblings, 2 replies; 14+ messages in thread
From: Kousik Sanagavarapu @ 2023-09-01 14:24 UTC (permalink / raw)
To: git; +Cc: Kousik Sanagavarapu
Atoms like "raw" and "contents" have a ":size" option which can be used
to know the size of the data. Since these atoms have the cmp_type
FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to
'9'. Meaning, even when the ":size" option is used and what we
ultimatlely have is numbers, we still sort alphabetically.
For example, consider the the following case in a repo
refname contents:size raw:size
======= ============= ========
refs/heads/branch1 1130 1210
refs/heads/master 300 410
refs/tags/v1.0 140 260
Sorting with "--format="%(refname) %(contents:size) --sort=contents:size"
would give
refs/heads/branch1 1130
refs/tags/v1.0.0 140
refs/heads/master 300
which is an alphabetic sort, while what one might really expect is
refs/tags/v1.0.0 140
refs/heads/master 300
refs/heads/branch1 1130
which is a numeric sort (that is, a "$ sort -n file" as opposed to a
"$ sort file", where "file" contains only the "contents:size" or
"raw:size" info, each of which is on a newline).
Same is the case with "--sort=raw:size".
So, sort numerically whenever the sort is done with "contents:size" or
"raw:size" and do it the normal alphabetic way when "contents" or "raw"
are used with some other option (they are FIELD_STR anyways).
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---
ref-filter.c | 20 +++++++++++++++-----
t/t6300-for-each-ref.sh | 15 +++++++++++++--
2 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/ref-filter.c b/ref-filter.c
index 1bfaf20fbf..5d7bea5f23 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -932,7 +932,13 @@ struct atom_value {
ssize_t s_size;
int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
struct strbuf *err);
- uintmax_t value; /* used for sorting when not FIELD_STR */
+
+ /*
+ * Used for sorting when not FIELD_STR or when FIELD_STR but the
+ * sort should be numeric and not alphabetic.
+ */
+ uintmax_t value;
+
struct used_atom *atom;
};
@@ -1857,7 +1863,8 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
v->s = xmemdupz(buf, buf_size);
v->s_size = buf_size;
} else if (atom->u.raw_data.option == RAW_LENGTH) {
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
+ v->value = (uintmax_t)buf_size;
+ v->s = xstrfmt("%"PRIuMAX, v->value);
}
continue;
}
@@ -1883,8 +1890,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
v->s = strbuf_detach(&sb, NULL);
} else if (atom->u.contents.option == C_BODY_DEP)
v->s = xmemdupz(bodypos, bodylen);
- else if (atom->u.contents.option == C_LENGTH)
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
+ else if (atom->u.contents.option == C_LENGTH) {
+ v->value = (uintmax_t)strlen(subpos);
+ v->s = xstrfmt("%"PRIuMAX, v->value);
+ }
else if (atom->u.contents.option == C_BODY)
v->s = xmemdupz(bodypos, nonsiglen);
else if (atom->u.contents.option == C_SIG)
@@ -2265,6 +2274,7 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
v->s_size = ATOM_SIZE_UNSPECIFIED;
v->handler = append_atom;
+ v->value = 0;
v->atom = atom;
if (*name == '*') {
@@ -2986,7 +2996,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
cmp_detached_head = 1;
} else if (s->sort_flags & REF_SORTING_VERSION) {
cmp = versioncmp(va->s, vb->s);
- } else if (cmp_type == FIELD_STR) {
+ } else if (cmp_type == FIELD_STR && !va->value && !vb->value) {
if (va->s_size < 0 && vb->s_size < 0) {
int (*cmp_fn)(const char *, const char *);
cmp_fn = s->sort_flags & REF_SORTING_ICASE
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index aa3c7c03c4..7b943fd34c 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -1017,16 +1017,16 @@ test_expect_success 'Verify sorts with raw' '
test_expect_success 'Verify sorts with raw:size' '
cat >expected <<-EOF &&
refs/myblobs/blob8
- refs/myblobs/first
refs/myblobs/blob7
- refs/heads/main
refs/myblobs/blob4
refs/myblobs/blob1
refs/myblobs/blob2
refs/myblobs/blob3
refs/myblobs/blob5
refs/myblobs/blob6
+ refs/myblobs/first
refs/mytrees/first
+ refs/heads/main
EOF
git for-each-ref --format="%(refname)" --sort=raw:size \
refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
@@ -1138,6 +1138,17 @@ test_expect_success 'for-each-ref --format compare with cat-file --batch' '
test_cmp expected actual
'
+test_expect_success 'verify sorts with contents:size' '
+ cat >expect <<-\EOF &&
+ refs/heads/main
+ refs/heads/newtag
+ refs/heads/ambiguous
+ EOF
+ git for-each-ref --format="%(refname)" \
+ --sort=contents:size refs/heads/ >actual &&
+ test_cmp expect actual
+'
+
test_expect_success 'set up multiple-sort tags' '
for when in 100000 200000
do
--
2.42.0.51.g5dc72c0fbc.dirty
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 14:24 [PATCH] ref-filter: sort numerically when ":size" is used Kousik Sanagavarapu
@ 2023-09-01 16:43 ` Junio C Hamano
2023-09-01 17:45 ` Jeff King
2023-09-02 9:00 ` [PATCH v2] " Kousik Sanagavarapu
1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2023-09-01 16:43 UTC (permalink / raw)
To: Kousik Sanagavarapu; +Cc: git
Kousik Sanagavarapu <five231003@gmail.com> writes:
> Atoms like "raw" and "contents" have a ":size" option which can be used
> to know the size of the data. Since these atoms have the cmp_type
> FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to
> '9'. Meaning, even when the ":size" option is used and what we
> ultimatlely have is numbers, we still sort alphabetically.
There are other cmp_types already defined, like ULONG and TIME. How
are they used and affect the comparison? Naively, :size sounds like
a good candidate to compare as ULONG, as it cannot be negative even
though 0 is a valid size.
I understand and agree with the motivation, but the implementation
looks puzzling.
> diff --git a/ref-filter.c b/ref-filter.c
> index 1bfaf20fbf..5d7bea5f23 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -932,7 +932,13 @@ struct atom_value {
> ssize_t s_size;
> int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
> struct strbuf *err);
> - uintmax_t value; /* used for sorting when not FIELD_STR */
> +
> + /*
> + * Used for sorting when not FIELD_STR or when FIELD_STR but the
> + * sort should be numeric and not alphabetic.
> + */
> + uintmax_t value;
This does not explain why we cannot make <anything>:size FIELD_ULONG
for <anything> that is of FIELD_STR type, though. IOW, why such a
strange "when FIELD_STR but the sort should be numeric" is needed?
If you have a <size>, shouldn't it always be numeric?
IOW, when you notice that you need to set, say, u.contents.option of
an atom to C_LENGTH, shouldn't you set cmp_type of the atom to
FIELD_ULONG, somewhere in contents_atom_parser() and friends, and
everything should naturally follow, no?
It seems that support for other cmp_types are incomplete in the
current code. There are FIELD_ULONG and FIELD_TIME defined, but
they do not appear to be used in any way, so the cmp_ref_sorting()
would need to be updated to make it actually pay attention to the
cmp_type and perform numeric comparison.
> @@ -1883,8 +1890,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
> v->s = strbuf_detach(&sb, NULL);
> } else if (atom->u.contents.option == C_BODY_DEP)
> v->s = xmemdupz(bodypos, bodylen);
> - else if (atom->u.contents.option == C_LENGTH)
> - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> + else if (atom->u.contents.option == C_LENGTH) {
> + v->value = (uintmax_t)strlen(subpos);
> + v->s = xstrfmt("%"PRIuMAX, v->value);
> + }
We should take a note that all of these v->value are *per* *item*
that will be sorted.
> @@ -2986,7 +2996,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
> cmp_detached_head = 1;
> } else if (s->sort_flags & REF_SORTING_VERSION) {
> cmp = versioncmp(va->s, vb->s);
> - } else if (cmp_type == FIELD_STR) {
> + } else if (cmp_type == FIELD_STR && !va->value && !vb->value) {
Two refs may point at an empty object with zero length, i.e. for
them, !v->value is true, and another ref may point at a non-empty
object. The two empty refs are compared with an algorithm different
from the algorithm used to compare the empty ref and the non-empty
ref. Isn't this broken as a comparison function to be given to
QSORT(), which must be transitive (e.g. if A < B and B < C, then it
should be guaranteed that A < C and you do not have to compare A and
C)?
IOW, the choice of the comparison algorithm should not depend on an
attribute (like value or s) that is specific to the item being
compared. Things like cmp_type that is defined at the used_atom
lefvel to make the sorting function stable, I would think.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 16:43 ` Junio C Hamano
@ 2023-09-01 17:45 ` Jeff King
2023-09-01 17:59 ` Junio C Hamano
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2023-09-01 17:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Kousik Sanagavarapu, git
On Fri, Sep 01, 2023 at 09:43:07AM -0700, Junio C Hamano wrote:
> IOW, when you notice that you need to set, say, u.contents.option of
> an atom to C_LENGTH, shouldn't you set cmp_type of the atom to
> FIELD_ULONG, somewhere in contents_atom_parser() and friends, and
> everything should naturally follow, no?
Yeah, I had the same thought after reading the patch. Unfortunately the
"type" is used only for comparison, not formatting. So you are still
stuck setting both v->value and v->s in grab_sub_body_contents(). It
feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as
a preparatory refactoring. But it's not that big a deal to work around
it if that turns out to be hard.
> It seems that support for other cmp_types are incomplete in the
> current code. There are FIELD_ULONG and FIELD_TIME defined, but
> they do not appear to be used in any way, so the cmp_ref_sorting()
> would need to be updated to make it actually pay attention to the
> cmp_type and perform numeric comparison.
I think they are covered implicitly by the "else" block of the
conditional that checks for FIELD_STR.
So just this is sufficient to make contents:size work:
diff --git a/ref-filter.c b/ref-filter.c
index 88b021dd1d..02e3b6ba82 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -583,9 +583,10 @@ static int contents_atom_parser(struct ref_format *format, struct used_atom *ato
atom->u.contents.option = C_BARE;
else if (!strcmp(arg, "body"))
atom->u.contents.option = C_BODY;
- else if (!strcmp(arg, "size"))
+ else if (!strcmp(arg, "size")) {
+ atom->type = FIELD_ULONG;
atom->u.contents.option = C_LENGTH;
- else if (!strcmp(arg, "signature"))
+ } else if (!strcmp(arg, "signature"))
atom->u.contents.option = C_SIG;
else if (!strcmp(arg, "subject"))
atom->u.contents.option = C_SUB;
@@ -1885,9 +1886,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
v->s = strbuf_detach(&sb, NULL);
} else if (atom->u.contents.option == C_BODY_DEP)
v->s = xmemdupz(bodypos, bodylen);
- else if (atom->u.contents.option == C_LENGTH)
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
- else if (atom->u.contents.option == C_BODY)
+ else if (atom->u.contents.option == C_LENGTH) {
+ v->value = strlen(subpos);
+ v->s = xstrfmt("%"PRIuMAX, v->value);
+ } else if (atom->u.contents.option == C_BODY)
v->s = xmemdupz(bodypos, nonsiglen);
else if (atom->u.contents.option == C_SIG)
v->s = xmemdupz(sigpos, siglen);
-Peff
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 17:45 ` Jeff King
@ 2023-09-01 17:59 ` Junio C Hamano
2023-09-01 18:32 ` Jeff King
0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2023-09-01 17:59 UTC (permalink / raw)
To: Jeff King; +Cc: Kousik Sanagavarapu, git
Jeff King <peff@peff.net> writes:
> On Fri, Sep 01, 2023 at 09:43:07AM -0700, Junio C Hamano wrote:
>
>> IOW, when you notice that you need to set, say, u.contents.option of
>> an atom to C_LENGTH, shouldn't you set cmp_type of the atom to
>> FIELD_ULONG, somewhere in contents_atom_parser() and friends, and
>> everything should naturally follow, no?
>
> Yeah, I had the same thought after reading the patch. Unfortunately the
> "type" is used only for comparison, not formatting. So you are still
> stuck setting both v->value and v->s in grab_sub_body_contents(). It
> feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as
> a preparatory refactoring. But it's not that big a deal to work around
> it if that turns out to be hard.
Setting of the .value member happens O(N) times for the number of
refs involved, which does not bother me. Do you mean "when we know
we are not sorting with size we should omit parsing the string into
the .value member"? If so, I think that would be nice to have.
>> It seems that support for other cmp_types are incomplete in the
>> current code. There are FIELD_ULONG and FIELD_TIME defined, but
>> they do not appear to be used in any way, so the cmp_ref_sorting()
>> would need to be updated to make it actually pay attention to the
>> cmp_type and perform numeric comparison.
>
> I think they are covered implicitly by the "else" block of the
> conditional that checks for FIELD_STR.
Ah, OK. That needs to be future-proofed to force future developers
who want to add different FIELD_FOO type to look at the comparison
logic. If we want to do so, it should be done as a separate topic
for cleaning-up the mess, not as part of this effort.
> So just this is sufficient to make contents:size work:
>
> diff --git a/ref-filter.c b/ref-filter.c
> index 88b021dd1d..02e3b6ba82 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -583,9 +583,10 @@ static int contents_atom_parser(struct ref_format *format, struct used_atom *ato
> atom->u.contents.option = C_BARE;
> else if (!strcmp(arg, "body"))
> atom->u.contents.option = C_BODY;
> - else if (!strcmp(arg, "size"))
> + else if (!strcmp(arg, "size")) {
> + atom->type = FIELD_ULONG;
> atom->u.contents.option = C_LENGTH;
> - else if (!strcmp(arg, "signature"))
> + } else if (!strcmp(arg, "signature"))
> atom->u.contents.option = C_SIG;
> else if (!strcmp(arg, "subject"))
> atom->u.contents.option = C_SUB;
> @@ -1885,9 +1886,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
> v->s = strbuf_detach(&sb, NULL);
> } else if (atom->u.contents.option == C_BODY_DEP)
> v->s = xmemdupz(bodypos, bodylen);
> - else if (atom->u.contents.option == C_LENGTH)
> - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> - else if (atom->u.contents.option == C_BODY)
> + else if (atom->u.contents.option == C_LENGTH) {
> + v->value = strlen(subpos);
> + v->s = xstrfmt("%"PRIuMAX, v->value);
> + } else if (atom->u.contents.option == C_BODY)
> v->s = xmemdupz(bodypos, nonsiglen);
> else if (atom->u.contents.option == C_SIG)
> v->s = xmemdupz(sigpos, siglen);
Yup, exactly.
Thanks.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 17:59 ` Junio C Hamano
@ 2023-09-01 18:32 ` Jeff King
2023-09-01 18:59 ` Kousik Sanagavarapu
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2023-09-01 18:32 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Kousik Sanagavarapu, git
On Fri, Sep 01, 2023 at 10:59:28AM -0700, Junio C Hamano wrote:
> > Yeah, I had the same thought after reading the patch. Unfortunately the
> > "type" is used only for comparison, not formatting. So you are still
> > stuck setting both v->value and v->s in grab_sub_body_contents(). It
> > feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as
> > a preparatory refactoring. But it's not that big a deal to work around
> > it if that turns out to be hard.
>
> Setting of the .value member happens O(N) times for the number of
> refs involved, which does not bother me. Do you mean "when we know
> we are not sorting with size we should omit parsing the string into
> the .value member"? If so, I think that would be nice to have.
No, I wasn't worried about code efficiency, but rather programmer
effort. IOW, I expected that the second hunk that I showed could look
like this:
diff --git a/ref-filter.c b/ref-filter.c
index 88b021dd1d..02b02d6813 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1886,7 +1886,7 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
} else if (atom->u.contents.option == C_BODY_DEP)
v->s = xmemdupz(bodypos, bodylen);
else if (atom->u.contents.option == C_LENGTH)
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
+ v->value = strlen(subpos);
else if (atom->u.contents.option == C_BODY)
v->s = xmemdupz(bodypos, nonsiglen);
else if (atom->u.contents.option == C_SIG)
rather than setting both "value" and "s", and that some higher level
code would recognize "oh, this is FIELD_ULONG, so I'll format it rather
than looking at v->s". But it seems that such code does not exist. :)
All of the other spots that set v->value (e.g., objectsize), just set
both.
The ref-filter code is a weird mix of almost-object-oriented bits, giant
switch statements, and assumptions about which fields are set when.
> > I think they are covered implicitly by the "else" block of the
> > conditional that checks for FIELD_STR.
>
> Ah, OK. That needs to be future-proofed to force future developers
> who want to add different FIELD_FOO type to look at the comparison
> logic. If we want to do so, it should be done as a separate topic
> for cleaning-up the mess, not as part of this effort.
Yes, agreed.
-Peff
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 18:32 ` Jeff King
@ 2023-09-01 18:59 ` Kousik Sanagavarapu
2023-09-01 19:16 ` Jeff King
2023-09-01 20:04 ` Junio C Hamano
0 siblings, 2 replies; 14+ messages in thread
From: Kousik Sanagavarapu @ 2023-09-01 18:59 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Fri, Sep 01, 2023 at 02:32:06PM -0400, Jeff King wrote:
> On Fri, Sep 01, 2023 at 10:59:28AM -0700, Junio C Hamano wrote:
>
> > > Yeah, I had the same thought after reading the patch. Unfortunately the
> > > "type" is used only for comparison, not formatting. So you are still
> > > stuck setting both v->value and v->s in grab_sub_body_contents(). It
> > > feels like we could hoist that xstrfmt("%"PRIuMAX) to a higher level as
> > > a preparatory refactoring. But it's not that big a deal to work around
> > > it if that turns out to be hard.
> >
> > Setting of the .value member happens O(N) times for the number of
> > refs involved, which does not bother me. Do you mean "when we know
> > we are not sorting with size we should omit parsing the string into
> > the .value member"? If so, I think that would be nice to have.
>
> No, I wasn't worried about code efficiency, but rather programmer
> effort. IOW, I expected that the second hunk that I showed could look
> like this:
>
> diff --git a/ref-filter.c b/ref-filter.c
> index 88b021dd1d..02b02d6813 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -1886,7 +1886,7 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
> } else if (atom->u.contents.option == C_BODY_DEP)
> v->s = xmemdupz(bodypos, bodylen);
> else if (atom->u.contents.option == C_LENGTH)
> - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> + v->value = strlen(subpos);
> else if (atom->u.contents.option == C_BODY)
> v->s = xmemdupz(bodypos, nonsiglen);
> else if (atom->u.contents.option == C_SIG)
This looks very tempting, although too good to be true with the current
ref-filter I guess, as you explain below.
> rather than setting both "value" and "s", and that some higher level
> code would recognize "oh, this is FIELD_ULONG, so I'll format it rather
> than looking at v->s". But it seems that such code does not exist. :)
> All of the other spots that set v->value (e.g., objectsize), just set
> both.
This was also one of the reasons why I decided to set both v->value
and v->s, that is because "objectsize" was implemented in a similar
fashion. Although I left "cmp_type" field untouched for the reasons
below.
> > > I think they are covered implicitly by the "else" block of the
> > > conditional that checks for FIELD_STR.
> >
> > Ah, OK. That needs to be future-proofed to force future developers
> > who want to add different FIELD_FOO type to look at the comparison
> > logic. If we want to do so, it should be done as a separate topic
> > for cleaning-up the mess, not as part of this effort.
What I also find weird is the fact that we assign a "cmp_type" to the
whole atom. Like "contents" is FIELD_STR and "objectsize" is "FIELD_ULONG"
in "valid_atom". This seems wrong because the options of the atoms should be
the ones deciding the "cmp_type", no?
I wanted to leave the "cmp_type" field of the atom untouched because that
would mess up this "global" setting of "contents" to be a "FIELD_STR" (or
even "raw" for that matter). Although that seems like a bad idea, after
I've read Junio's and your comments.
Thanks
>
> Yes, agreed.
>
> -Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 18:59 ` Kousik Sanagavarapu
@ 2023-09-01 19:16 ` Jeff King
2023-09-01 20:21 ` Junio C Hamano
2023-09-01 20:04 ` Junio C Hamano
1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2023-09-01 19:16 UTC (permalink / raw)
To: Kousik Sanagavarapu; +Cc: Junio C Hamano, git
On Sat, Sep 02, 2023 at 12:29:07AM +0530, Kousik Sanagavarapu wrote:
> > > > I think they are covered implicitly by the "else" block of the
> > > > conditional that checks for FIELD_STR.
> > >
> > > Ah, OK. That needs to be future-proofed to force future developers
> > > who want to add different FIELD_FOO type to look at the comparison
> > > logic. If we want to do so, it should be done as a separate topic
> > > for cleaning-up the mess, not as part of this effort.
>
> What I also find weird is the fact that we assign a "cmp_type" to the
> whole atom. Like "contents" is FIELD_STR and "objectsize" is "FIELD_ULONG"
> in "valid_atom". This seems wrong because the options of the atoms should be
> the ones deciding the "cmp_type", no?
I think the data structure is a little confusing if you haven't worked
with it before. But basically each "atom" corresponds to a single "%()"
block in the format. So if you ran:
git for-each-ref --format="%(contents:size) %(contents:body)"
you'd have two atoms in the used_atom struct: one for the size and one
for the body.
IMHO the code would be a lot easier to work with if the atoms were
structured as a parse tree with child pointers (especially when you get
into things like "if" that have sub-expressions). I think one of the
reasons that used_atom is an array is to de-duplicate repeated mentions
(so if you formatted "%(foo) %(foo)" it would only have to store the
computed value once).
But I think that is the wrong way to optimize it. We shouldn't be
storing any strings per-atom, but rather walking the parse tree to
produce a single output buffer. And the values should be cheap to fill
in, because we should parse the object as necessary up front. This is
more or less the way the pretty.c parser does it.
But that is all quite a large tangent from what you're working on, and
would probably be a ground-up rewrite of the formatting code. You can
safely ignore my rant for the purposes of your patch. ;)
> I wanted to leave the "cmp_type" field of the atom untouched because that
> would mess up this "global" setting of "contents" to be a "FIELD_STR" (or
> even "raw" for that matter). Although that seems like a bad idea, after
> I've read Junio's and your comments.
Yeah, I agree that would be a problem if there were one global
"contents". But we are allocating a new atom struct on the fly for the
contents:size directive that we parse, and so on.
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 18:59 ` Kousik Sanagavarapu
2023-09-01 19:16 ` Jeff King
@ 2023-09-01 20:04 ` Junio C Hamano
2023-09-01 20:40 ` Jeff King
1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2023-09-01 20:04 UTC (permalink / raw)
To: Kousik Sanagavarapu; +Cc: Jeff King, git
Kousik Sanagavarapu <five231003@gmail.com> writes:
> What I also find weird is the fact that we assign a "cmp_type" to the
> whole atom. Like "contents" is FIELD_STR and "objectsize" is "FIELD_ULONG"
> in "valid_atom". This seems wrong because the options of the atoms should be
> the ones deciding the "cmp_type", no?
I do not quite get where your confusion comes from.
The use of valid_atom[] purely for catalogging things like
"contents", "refname", etc., before specialization, as opposed to
used_atom[] that lists the actual specialized form of the atoms used
in the format string. If you refer to "contents:body" and
"contents:size" in your format string, they become two entries in
used_atom[], both of which refer to the same atom_type obtained by
consulting the same entry in the valid_atom[] array.
The specialization between "contents:body" and "contents:size" must
be captured somewhere, and that happens by using two used_atom[]
entries. There will be one "struct atom_value" for each of these
placeholders, each of which refers to its own used_atom that knows
for which variant of "contents" it was created. Of course, these
two "struct atom_value" instances will have different content string
for the same ref (one stores the body part of the string, the other
stores the size of the contents).
> I wanted to leave the "cmp_type" field of the atom untouched because that
> would mess up this "global" setting of "contents" to be a "FIELD_STR" (or
> even "raw" for that matter).
We are not talking about futzing with valid_atom[] array.
Because the used_atom[] array is designed to be used to capture the
differences among "contents" vs "contents:body" vs "contents:size",
what types of entities the values that uses an entry in used_atom[]
array (i.e. an instance of "struct atom_value") should be decided
using the information stored there.
I agree that Peff's "the value for 'contents:size' we know is
numeric, so only store the numeric value in atom_value and let the
output logic handle that using cmp_type information" sound very
tempting. If we were to tackle it, however, I think it should be a
separate topic.
In any case, it was very good that you noticed we do not sort
numerically when sorting by size (I guess our sort by timestamp
weren't affected only because we have been lucky?). Thanks for
starting this topic.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 19:16 ` Jeff King
@ 2023-09-01 20:21 ` Junio C Hamano
2023-09-01 20:51 ` Jeff King
0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2023-09-01 20:21 UTC (permalink / raw)
To: Jeff King; +Cc: Kousik Sanagavarapu, git
Jeff King <peff@peff.net> writes:
> But I think that is the wrong way to optimize it. We shouldn't be
> storing any strings per-atom, but rather walking the parse tree to
> produce a single output buffer. And the values should be cheap to fill
> in, because we should parse the object as necessary up front. This is
> more or less the way the pretty.c parser does it.
I thought "as necessary" may be a bit tricky as populate_value()
were taught to omit doing the whole get_object() thing when the
values for used_atom[] are all computable without parsing the object
at all, but it seems that over time the populate_value() callchain
has degraded sufficiently to unconditionally call get_object() these
days, so I agree that the arrangement does not have much optimization
value, at least in the current code.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 20:04 ` Junio C Hamano
@ 2023-09-01 20:40 ` Jeff King
0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2023-09-01 20:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Kousik Sanagavarapu, git
On Fri, Sep 01, 2023 at 01:04:08PM -0700, Junio C Hamano wrote:
> In any case, it was very good that you noticed we do not sort
> numerically when sorting by size (I guess our sort by timestamp
> weren't affected only because we have been lucky?). Thanks for
> starting this topic.
I think the date code works as expected, and by design. It's just that
the code is a little confusing. :) Something like %(committerdate) gets
a used_atom with FIELD_TIME, and the "grab" function sets v->value to
the unix-epoch timestamp. And then the comparison function uses
v->value, since it's not FIELD_STR.
It's a little hard to follow the code that fills in v->value, but the
call stack is roughly grab_values() -> grab_person() -> grab_date().
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] ref-filter: sort numerically when ":size" is used
2023-09-01 20:21 ` Junio C Hamano
@ 2023-09-01 20:51 ` Jeff King
0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2023-09-01 20:51 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Kousik Sanagavarapu, git
On Fri, Sep 01, 2023 at 01:21:10PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > But I think that is the wrong way to optimize it. We shouldn't be
> > storing any strings per-atom, but rather walking the parse tree to
> > produce a single output buffer. And the values should be cheap to fill
> > in, because we should parse the object as necessary up front. This is
> > more or less the way the pretty.c parser does it.
>
> I thought "as necessary" may be a bit tricky as populate_value()
> were taught to omit doing the whole get_object() thing when the
> values for used_atom[] are all computable without parsing the object
> at all, but it seems that over time the populate_value() callchain
> has degraded sufficiently to unconditionally call get_object() these
> days, so I agree that the arrangement does not have much optimization
> value, at least in the current code.
No, I think we still do that optimization. When parsing the format
string, the parser function for each atom sets fields in an object_info
struct to indicate what we're interested in. Then for each ref, we call
populate_value(). If that object_info doesn't need anything (we
byte-wise compare it to an empty dummy struct), then we return early,
before calling get_object().
And that optimization is very important to retain; it makes a format
like %(refname) an order of magnitude faster.
The optimization I was referring to is that if you have a format like:
%(contents:body) %(contents:body)
then we'll de-duplicate that to a single used_atom struct, and they'll
share the same v->s result string. That's much harder to do if you parse
into an abstract syntax tree, since the two occupy different parts of
the tree. But my contention is that it does not matter if you stop
allocating v->s in the first place, and just walk the tree to directly
output the result (either to a strbuf or directly to stdout).
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2] ref-filter: sort numerically when ":size" is used
2023-09-01 14:24 [PATCH] ref-filter: sort numerically when ":size" is used Kousik Sanagavarapu
2023-09-01 16:43 ` Junio C Hamano
@ 2023-09-02 9:00 ` Kousik Sanagavarapu
2023-09-02 9:11 ` Kousik Sanagavarapu
2023-09-02 22:19 ` Junio C Hamano
1 sibling, 2 replies; 14+ messages in thread
From: Kousik Sanagavarapu @ 2023-09-02 9:00 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Kousik Sanagavarapu, Jeff King
Atoms like "raw" and "contents" have a ":size" option which can be used
to know the size of the data. Since these atoms have the cmp_type
FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to
'9'. Meaning, even when the ":size" option is used and what we
ultimatlely have is numbers, we still sort alphabetically.
For example, consider the the following case in a repo
refname contents:size raw:size
======= ============= ========
refs/heads/branch1 1130 1210
refs/heads/master 300 410
refs/tags/v1.0 140 260
Sorting with "--format="%(refname) %(contents:size) --sort=contents:size"
would give
refs/heads/branch1 1130
refs/tags/v1.0.0 140
refs/heads/master 300
which is an alphabetic sort, while what one might really expect is
refs/tags/v1.0.0 140
refs/heads/master 300
refs/heads/branch1 1130
which is a numeric sort (that is, a "$ sort -n file" as opposed to a
"$ sort file", where "file" contains only the "contents:size" or
"raw:size" info, each of which is on a newline).
Same is the case with "--sort=raw:size".
So, sort numerically whenever the sort is done with "contents:size" or
"raw:size" and do it the normal alphabetic way when "contents" or "raw"
are used with some other option (they are FIELD_STR anyways).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---
ref-filter.c | 21 +++++++++++++--------
t/t6300-for-each-ref.sh | 15 +++++++++++++--
2 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/ref-filter.c b/ref-filter.c
index 1bfaf20fbf..9dbc4f71bd 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -582,9 +582,10 @@ static int contents_atom_parser(struct ref_format *format, struct used_atom *ato
atom->u.contents.option = C_BARE;
else if (!strcmp(arg, "body"))
atom->u.contents.option = C_BODY;
- else if (!strcmp(arg, "size"))
+ else if (!strcmp(arg, "size")) {
+ atom->type = FIELD_ULONG;
atom->u.contents.option = C_LENGTH;
- else if (!strcmp(arg, "signature"))
+ } else if (!strcmp(arg, "signature"))
atom->u.contents.option = C_SIG;
else if (!strcmp(arg, "subject"))
atom->u.contents.option = C_SUB;
@@ -690,9 +691,10 @@ static int raw_atom_parser(struct ref_format *format UNUSED,
{
if (!arg)
atom->u.raw_data.option = RAW_BARE;
- else if (!strcmp(arg, "size"))
+ else if (!strcmp(arg, "size")) {
+ atom->type = FIELD_ULONG;
atom->u.raw_data.option = RAW_LENGTH;
- else
+ } else
return err_bad_arg(err, "raw", arg);
return 0;
}
@@ -1857,7 +1859,8 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
v->s = xmemdupz(buf, buf_size);
v->s_size = buf_size;
} else if (atom->u.raw_data.option == RAW_LENGTH) {
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
+ v->value = buf_size;
+ v->s = xstrfmt("%"PRIuMAX, v->value);
}
continue;
}
@@ -1883,9 +1886,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
v->s = strbuf_detach(&sb, NULL);
} else if (atom->u.contents.option == C_BODY_DEP)
v->s = xmemdupz(bodypos, bodylen);
- else if (atom->u.contents.option == C_LENGTH)
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
- else if (atom->u.contents.option == C_BODY)
+ else if (atom->u.contents.option == C_LENGTH) {
+ v->value = strlen(subpos);
+ v->s = xstrfmt("%"PRIuMAX, v->value);
+ } else if (atom->u.contents.option == C_BODY)
v->s = xmemdupz(bodypos, nonsiglen);
else if (atom->u.contents.option == C_SIG)
v->s = xmemdupz(sigpos, siglen);
@@ -2265,6 +2269,7 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
v->s_size = ATOM_SIZE_UNSPECIFIED;
v->handler = append_atom;
+ v->value = 0;
v->atom = atom;
if (*name == '*') {
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index aa3c7c03c4..7b943fd34c 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -1017,16 +1017,16 @@ test_expect_success 'Verify sorts with raw' '
test_expect_success 'Verify sorts with raw:size' '
cat >expected <<-EOF &&
refs/myblobs/blob8
- refs/myblobs/first
refs/myblobs/blob7
- refs/heads/main
refs/myblobs/blob4
refs/myblobs/blob1
refs/myblobs/blob2
refs/myblobs/blob3
refs/myblobs/blob5
refs/myblobs/blob6
+ refs/myblobs/first
refs/mytrees/first
+ refs/heads/main
EOF
git for-each-ref --format="%(refname)" --sort=raw:size \
refs/heads/main refs/myblobs/ refs/mytrees/first >actual &&
@@ -1138,6 +1138,17 @@ test_expect_success 'for-each-ref --format compare with cat-file --batch' '
test_cmp expected actual
'
+test_expect_success 'verify sorts with contents:size' '
+ cat >expect <<-\EOF &&
+ refs/heads/main
+ refs/heads/newtag
+ refs/heads/ambiguous
+ EOF
+ git for-each-ref --format="%(refname)" \
+ --sort=contents:size refs/heads/ >actual &&
+ test_cmp expect actual
+'
+
test_expect_success 'set up multiple-sort tags' '
for when in 100000 200000
do
--
2.42.0.101.g9b561e429b.dirty
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2] ref-filter: sort numerically when ":size" is used
2023-09-02 9:00 ` [PATCH v2] " Kousik Sanagavarapu
@ 2023-09-02 9:11 ` Kousik Sanagavarapu
2023-09-02 22:19 ` Junio C Hamano
1 sibling, 0 replies; 14+ messages in thread
From: Kousik Sanagavarapu @ 2023-09-02 9:11 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King
On Sat, Sep 02, 2023 at 02:30:39PM +0530, Kousik Sanagavarapu wrote:
> Atoms like "raw" and "contents" have a ":size" option which can be used
> to know the size of the data. Since these atoms have the cmp_type
> FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to
> '9'. Meaning, even when the ":size" option is used and what we
> ultimatlely have is numbers, we still sort alphabetically.
[...]
> So, sort numerically whenever the sort is done with "contents:size" or
> "raw:size" and do it the normal alphabetic way when "contents" or "raw"
> are used with some other option (they are FIELD_STR anyways).
>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
> ---
Oops, forgot the range-diff.
Range-diff against v1:
1: 9b561e429b ! 1: 194dcb0b0d ref-filter: sort numerically when ":size" is used
@@ Commit message
"raw:size" and do it the normal alphabetic way when "contents" or "raw"
are used with some other option (they are FIELD_STR anyways).
+ Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
## ref-filter.c ##
-@@ ref-filter.c: struct atom_value {
- ssize_t s_size;
- int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
- struct strbuf *err);
-- uintmax_t value; /* used for sorting when not FIELD_STR */
-+
-+ /*
-+ * Used for sorting when not FIELD_STR or when FIELD_STR but the
-+ * sort should be numeric and not alphabetic.
-+ */
-+ uintmax_t value;
-+
- struct used_atom *atom;
- };
-
+@@ ref-filter.c: static int contents_atom_parser(struct ref_format *format, struct used_atom *ato
+ atom->u.contents.option = C_BARE;
+ else if (!strcmp(arg, "body"))
+ atom->u.contents.option = C_BODY;
+- else if (!strcmp(arg, "size"))
++ else if (!strcmp(arg, "size")) {
++ atom->type = FIELD_ULONG;
+ atom->u.contents.option = C_LENGTH;
+- else if (!strcmp(arg, "signature"))
++ } else if (!strcmp(arg, "signature"))
+ atom->u.contents.option = C_SIG;
+ else if (!strcmp(arg, "subject"))
+ atom->u.contents.option = C_SUB;
+@@ ref-filter.c: static int raw_atom_parser(struct ref_format *format UNUSED,
+ {
+ if (!arg)
+ atom->u.raw_data.option = RAW_BARE;
+- else if (!strcmp(arg, "size"))
++ else if (!strcmp(arg, "size")) {
++ atom->type = FIELD_ULONG;
+ atom->u.raw_data.option = RAW_LENGTH;
+- else
++ } else
+ return err_bad_arg(err, "raw", arg);
+ return 0;
+ }
@@ ref-filter.c: static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
v->s = xmemdupz(buf, buf_size);
v->s_size = buf_size;
v->s = xmemdupz(buf, buf_size);
v->s_size = buf_size;
v->s_size = buf_size;
} else if (atom->u.raw_data.option == RAW_LENGTH) {
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
-+ v->value = (uintmax_t)buf_size;
++ v->value = buf_size;
+ v->s = xstrfmt("%"PRIuMAX, v->value);
}
continue;
@@ ref-filter.c: static void grab_sub_body_contents(struct atom_value *val, int der
v->s = xmemdupz(bodypos, bodylen);
- else if (atom->u.contents.option == C_LENGTH)
- v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
+- else if (atom->u.contents.option == C_BODY)
+ else if (atom->u.contents.option == C_LENGTH) {
-+ v->value = (uintmax_t)strlen(subpos);
++ v->value = strlen(subpos);
+ v->s = xstrfmt("%"PRIuMAX, v->value);
-+ }
- else if (atom->u.contents.option == C_BODY)
++ } else if (atom->u.contents.option == C_BODY)
v->s = xmemdupz(bodypos, nonsiglen);
else if (atom->u.contents.option == C_SIG)
+ v->s = xmemdupz(sigpos, siglen);
@@ ref-filter.c: static int populate_value(struct ref_array_item *ref, struct strbuf *err)
v->s_size = ATOM_SIZE_UNSPECIFIED;
@@ ref-filter.c: static int populate_value(struct ref_array_item *ref, struct strbu
v->atom = atom;
if (*name == '*') {
-@@ ref-filter.c: static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
- cmp_detached_head = 1;
- } else if (s->sort_flags & REF_SORTING_VERSION) {
- cmp = versioncmp(va->s, vb->s);
-- } else if (cmp_type == FIELD_STR) {
-+ } else if (cmp_type == FIELD_STR && !va->value && !vb->value) {
- if (va->s_size < 0 && vb->s_size < 0) {
- int (*cmp_fn)(const char *, const char *);
- cmp_fn = s->sort_flags & REF_SORTING_ICASE
## t/t6300-for-each-ref.sh ##
@@ t/t6300-for-each-ref.sh: test_expect_success 'Verify sorts with raw' '
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2] ref-filter: sort numerically when ":size" is used
2023-09-02 9:00 ` [PATCH v2] " Kousik Sanagavarapu
2023-09-02 9:11 ` Kousik Sanagavarapu
@ 2023-09-02 22:19 ` Junio C Hamano
1 sibling, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2023-09-02 22:19 UTC (permalink / raw)
To: Kousik Sanagavarapu; +Cc: git, Jeff King
Kousik Sanagavarapu <five231003@gmail.com> writes:
> } else if (atom->u.raw_data.option == RAW_LENGTH) {
> - v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
> + v->value = buf_size;
> + v->s = xstrfmt("%"PRIuMAX, v->value);
> }
> continue;
Interesting that typeof(.value) happens to be uintmax_t, keeping
this a safe change.
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2023-09-02 22:19 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-01 14:24 [PATCH] ref-filter: sort numerically when ":size" is used Kousik Sanagavarapu
2023-09-01 16:43 ` Junio C Hamano
2023-09-01 17:45 ` Jeff King
2023-09-01 17:59 ` Junio C Hamano
2023-09-01 18:32 ` Jeff King
2023-09-01 18:59 ` Kousik Sanagavarapu
2023-09-01 19:16 ` Jeff King
2023-09-01 20:21 ` Junio C Hamano
2023-09-01 20:51 ` Jeff King
2023-09-01 20:04 ` Junio C Hamano
2023-09-01 20:40 ` Jeff King
2023-09-02 9:00 ` [PATCH v2] " Kousik Sanagavarapu
2023-09-02 9:11 ` Kousik Sanagavarapu
2023-09-02 22:19 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).