git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Shuqi Liang <cheskaqiqi@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, Victoria Dye <vdye@github.com>
Subject: Re: [PATCH v3 1/3] attr.c: read attributes in a sparse directory
Date: Thu, 13 Jul 2023 16:13:39 -0400	[thread overview]
Message-ID: <CAMO4yUEVbeLSeOq42V=7RhcLG_4e_D2fBS72Dz-5Oq_u6-RhNw@mail.gmail.com> (raw)
In-Reply-To: <xmqqjzv6w3o2.fsf@gitster.g>

On Tue, Jul 11, 2023 at 5:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Shuqi Liang <cheskaqiqi@gmail.com> writes:
>
> > 'git check-attr' cannot currently find attributes of a file within a
> > sparse directory. This is due to .gitattributes files are irrelevant in
> > sparse-checkout cone mode, as the file is considered sparse only if all
> > paths within its parent directory are also sparse.
>
> I do not quite understand what these two sentences want to say.  If
> the attribute files are truly irrelevant then "cannot find" does not
> matter, because there is no point in finding irrelevant things that
> by definition will not affect the outcome of any commands at all,
> no?
>
> > In addition,
> > searching for a .gitattributes file causes expansion of the sparse
> > index, which is avoided to prevent potential performance degradation.
>
> Does this sentence want to say that there is a price to pay, in
> order to read an attribute file that is not part of the cones of
> interest, that you first need to expand the sparse index?  I think
> that is a given and I am not sure what the point of saying it is.
>
> > However, this behavior can lead to missing attributes for files inside
> > sparse directories, causing inconsistencies in file handling.
>
> I agree.  Not reading attribute files correctly will lead to a bug.
>
> Let me rephase what (I think) you wrote below to see if I understand
> what you are doing correctly.
>
> Suppose that sub1/.gitattributes need to be read, when the calling
> command wants to know about attributes of sub1/file.  Imagine that
> sub1/ and sub2/ are both outside the cones of interest. It would be
> better not to expand sub2/ even though we need to expand sub1/.  Not
> calling ensure_full_index() upfront and instead expanding the
> necessary subdirectories on demand would be a good way to solve it.
>
> Is that what going on?

Sorry for the confusion. I was actually trying to explain why the original
comment isn't needed anymore. Here's my updated comment - does this
make more sense?

Previously, the `read_attr_from_index()` function was structured to handle
attribute reading from a `.gitattributes` file only when the file
paths were part
of the cone-mode sparse-checkout. This was based on the fact that the
`.gitattributes` file only applied to files within its parent
directory. As a result,
we avoided loading the `.gitattributes` file if the sparse-checkout was in
cone-mode, as the file is sparse only if all paths within that directory are
also sparse.

However, this approach was not capable of handling scenarios where we
needed to read attributes from sparse directories。

To resolve this, revise 'git check-attr' to allow attribute reading for
files in sparse directories from the corresponding .gitattributes files:

1.Utilize path_in_cone_mode_sparse_checkout() and index_name_pos_sparse
to check if a path falls within a sparse directory.

2.If path is inside a sparse directory, employ the value of
index_name_pos_sparse() to find the sparse directory containing path and
path relative to sparse directory. Proceed to read attributes from the
tree OID of the sparse directory using read_attr_from_blob().

3.If path is not inside a sparse directory,ensure that attributes are
fetched from the index blob with read_blob_data_from_index().



> > diff --git a/attr.c b/attr.c
> > index 7d39ac4a29..be06747b0d 100644
> > --- a/attr.c
> > +++ b/attr.c
> > @@ -808,35 +808,44 @@ static struct attr_stack *read_attr_from_blob(struct index_state *istate,
> >  static struct attr_stack *read_attr_from_index(struct index_state *istate,
> >                                              const char *path, unsigned flags)
> >  {
> > +     struct attr_stack *stack = NULL;
> >       char *buf;
> >       unsigned long size;
> > +     int pos;
> >
> >       if (!istate)
> >               return NULL;
> >
> >       /*
> > -      * The .gitattributes file only applies to files within its
> > -      * parent directory. In the case of cone-mode sparse-checkout,
> > -      * the .gitattributes file is sparse if and only if all paths
> > -      * within that directory are also sparse. Thus, don't load the
> > -      * .gitattributes file since it will not matter.
>
> Imagine that you have a tree with sub1/ outside the cones of
> interest and sub2/ and sub9/ inside the cones of interest, and
> further imagine that sub1/.gitattributes and sub2/.gitattributes
> give attribute X to sub1/file and sub2/file respectively.  There
> is no sub9/.gitattributes file.
>
> Then "git ls-files ':(attr:X)sub[0-9]'" _could_ have two equally
> sensible behaviours:
>
>  (1) Only show sub2/file because sub1/ is outside the cones of
>      interest and the user does not want to clutter the output
>      from the parts of the tree they are not interested in.
>
>  (2) Show both sub1/file and sub2/file, even though sub1/ is outside
>      the cones of interest, in response to the fact that the mention
>      of "sub[0-9]" on the command line is an explicit indication of
>      interest by the user (it would become more and more interesting
>      if the pathspec gets less specific, like ":(attr:X)" that is
>      treewide, though).
> The original comment seems to say that only behaviour (1) is
> supported, but I wonder if we eventually want to support both,
> choice made by the calling code (and perhaps options)?  In any case,
> offering the choice of (2) is a good thing in the longer run.
> Anyway...

If we use '--sparse' as an option, then by default we'd only apply (1).
However, if the user types '--sparse', we'd switch to (2). Do you think that's
a good approach?

> > +      * If the pos value is negative, it means the path is not in the index.
> > +      * However, the absolute value of pos minus 1 gives us the position where the path
> > +      * would be inserted in lexicographic order. By subtracting another 1 from this
> > +      * value (pos = -pos - 2), we find the position of the last index entry
> > +      * which is lexicographically smaller than the provided path. This would be
> > +      * the sparse directory containing the path.
>
> That is true only if the directory containing the .gitattribute file
> is sparsified (e.g. sub1/.gitattributes does not appear in the index
> but sub1/ does; sub2/.gitattributes however does appear in the index
> and there is no sub2/ in the index).
>
> If not, there are two cases:
>
>  * sub2/.gitattributes does appear in the index (and there is no
>    sub2/ in the index).  "pos = - pos - 2" computes a nonsense
>    number in this case; hopefully we can reject it early by noticing
>    that the resulting pos is negative.
>
>  * sub9/.gitattributes does not belong to the project.  The pos is
>    negative and "- pos - 2" does not poihnt at sub9/ (as it is not
>    sparse).  Depending on what other paths appear in sub9/., the
>    path that appears at (-pos-2) may be inside or outside sub9/.  In
>    the worst case, it could be a sparsified directory that sorts
>    directly before sub9/ (say, there is sub8/ that is sparse, which
>    may have .gitattributes in it).  Would the updated code
>    mistakenly check S_ISSPARSEDIR() on sub8/ that has no relevance
>    when we are dealing with sub9/.gitattributes that does not exist?

I made some modifications to the code to address your points. Now, only
calculating 'pos' when the path falls outside of the cone. Moreover, we
only compute '-pos -2' when 'pos' is negative. I believe this adjustment can
effectively address the issues you've brought up.

> > -     if (!path_in_cone_mode_sparse_checkout(path, istate))
> > -             return NULL;
> > +     pos = index_name_pos_sparse(istate, path, strlen(path));
> > +     pos = - pos - 2;
> >
> > -     buf = read_blob_data_from_index(istate, path, &size);
> > -     if (!buf)
> > -             return NULL;
> > -     if (size >= ATTR_MAX_FILE_SIZE) {
> > -             warning(_("ignoring overly large gitattributes blob '%s'"), path);
> > -             return NULL;
> > -     }
> > +     if (!path_in_cone_mode_sparse_checkout(path, istate) && 0 <= pos) {
> > +             if (!S_ISSPARSEDIR(istate->cache[pos]->ce_mode))
> > +                     return NULL;
>
> So earlier, the code, given say sub1/.gitattributes, checked if that
> path is outside the cones of interest and skipped reading it.  But
> the updated code tries to check the same "is it outside or inside?"
> condition for sub1/ directory itself.  Does it make a practical
> difference that you can demonstrate with a test?

I'm not sure I understand the point. Are you suggesting that checking
(!S_ISSPARSEDIR(istate->cache[pos]->ce_mode) is unnecessary because
 we don't need to verify it as we already outside the cone?

> I do not know if the updated code does the right thing for
> sub2/.gitattributes (exists in a non-sparse directory) and
> sub9/.gitattributes (does not exist in non-sparse directory),
> though.
>
> > +             if (strncmp(istate->cache[pos]->name, path, ce_namelen(istate->cache[pos])) == 0) {
>
> Don't compare with "==0", write !strncmp(...) instead.

Will fix!

> > +                     const char *relative_path = path + ce_namelen(istate->cache[pos]);
> > +                     stack = read_attr_from_blob(istate, &istate->cache[pos]->oid, relative_path, flags);
> > +             }
>
> If the earlier "- pos - 2" misidentified the parent sparse directory
> entry in the index and the strncmp() noticed that mistake, we would
> come here without reading any new attribute stack frame.  Don't we
> need to fallback reading from the path in the correct directory that
> is not at "- pos - 2"?
>
> Let's imagine this case where sub/ is a directory outside the cones
> of interest, and our sparse-index may or may not have it as a
> directory in the index, and then the caller asks to read from the
> "sub/sub1/.gitattributes" file.  Even when "sub/" is expanded in the
> index, "sub/sub1/" may not and appear as a directory in the index.
>
> The above "find relative_path and read from the tree object" code
> would of course work when the direct parent directory of
> ".gitattributes" is visible in the index, but interestingly, it
> would also work when it does not.  E.g. if "sub/" is represented as
> a directory in the index, then asking for "sub1/.gitattributes"
> inside the tree object of "sub/" would work as get_tree_entry() used
> by read_attr_from_blob() would get to the right object recursively,
> so that is nice.  If that is why "'- pos - 2' must be the directory
> entry in the index that _would_ include $leadingpath/.gitattributes
> regardless of how many levels of directory hierarchy there are
> inside $leadingpath" idea was chosen, I'd have to say that it is
> clever ;-)
>
> I however find the "'- pos - 2' must be the directory entry in the
> index" trick hard to reason about and explain.  I wonder if we write
> this in a more straight-forward and stupid way, the result becomes
> easier to read and less prone to future bugs...

Here is my updated code. Is it more straightforward and less prone to bugs?

if (!path_in_cone_mode_sparse_checkout(path, istate)) {
    pos = index_name_pos_sparse(istate, path, strlen(path));

    if (pos < 0)
        pos = -pos - 2;
}

if (pos >= 0 && !path_in_cone_mode_sparse_checkout(path, istate) &&
S_ISSPARSEDIR(istate->cache[pos]->ce_mode) &&
!strncmp(istate->cache[pos]->name, path,
ce_namelen(istate->cache[pos])) &&
!normalize_path_copy(normalize_path, path)) {
    relative_path = normalize_path + ce_namelen(istate->cache[pos]);
    stack = read_attr_from_blob(istate, &istate->cache[pos]->oid,
relative_path, flags);

    stack = read_attr_from_blob(istate, &istate->cache[pos]->oid,
relative_path, flags);
}

  parent reply	other threads:[~2023-07-13 20:14 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-01  6:48 [PATCH v1 0/3] check-attr: integrate with sparse-index Shuqi Liang
2023-07-01  6:48 ` [PATCH v1 1/3] attr.c: read attributes in a sparse directory Shuqi Liang
2023-07-03 17:59   ` Victoria Dye
2023-07-01  6:48 ` [PATCH v1 2/3] t1092: add tests for `git check-attr` Shuqi Liang
2023-07-03 18:11   ` Victoria Dye
2023-07-01  6:48 ` [PATCH v1 3/3] check-attr: integrate with sparse-index Shuqi Liang
2023-07-03 18:21   ` Victoria Dye
2023-07-07 15:18 ` [PATCH v2 0/3] " Shuqi Liang
2023-07-07 15:18   ` [PATCH v2 1/3] Enable gitattributes read from sparse directories Shuqi Liang
2023-07-07 23:15     ` Junio C Hamano
2023-07-07 15:18   ` [PATCH v2 2/3] t1092: add tests for `git check-attr` Shuqi Liang
2023-07-07 15:18   ` [PATCH v2 3/3] check-attr: integrate with sparse-index Shuqi Liang
2023-07-11 13:30   ` [PATCH v3 0/3] " Shuqi Liang
2023-07-11 13:30     ` [PATCH v3 1/3] attr.c: read attributes in a sparse directory Shuqi Liang
2023-07-11 21:15       ` Junio C Hamano
2023-07-11 22:08         ` Junio C Hamano
2023-07-13 20:22           ` Shuqi Liang
2023-07-13 20:13         ` Shuqi Liang [this message]
2023-07-11 21:24       ` Victoria Dye
2023-07-11 13:30     ` [PATCH v3 2/3] t1092: add tests for `git check-attr` Shuqi Liang
2023-07-11 18:52       ` Junio C Hamano
2023-07-11 20:47       ` Victoria Dye
2023-07-11 13:30     ` [PATCH v3 3/3] check-attr: integrate with sparse-index Shuqi Liang
2023-07-11 20:07       ` Junio C Hamano
2023-07-11 16:56     ` [PATCH v3 0/3] " Junio C Hamano
2023-07-18 23:29     ` [PATCH v4 " Shuqi Liang
2023-07-18 23:29       ` [PATCH v4 1/3] t1092: add tests for 'git check-attr' Shuqi Liang
2023-07-20 18:43         ` Victoria Dye
2023-07-18 23:29       ` [PATCH v4 2/3] attr.c: read attributes in a sparse directory Shuqi Liang
2023-07-20 20:18         ` Victoria Dye
2023-08-03 16:22           ` Glen Choo
2023-08-15  8:05             ` Shuqi Liang
2023-07-18 23:29       ` [PATCH v4 3/3] check-attr: integrate with sparse-index Shuqi Liang
2023-08-11 14:22       ` [PATCH v5 0/3] " Shuqi Liang
2023-08-11 14:22         ` [PATCH v5 1/3] t1092: add tests for 'git check-attr' Shuqi Liang
2023-08-11 14:22         ` [PATCH v5 2/3] attr.c: read attributes in a sparse directory Shuqi Liang
2023-08-11 14:22         ` [PATCH v5 3/3] check-attr: integrate with sparse-index Shuqi Liang
2023-08-14 16:24         ` [PATCH v5 0/3] " Victoria Dye
2023-08-14 17:10           ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMO4yUEVbeLSeOq42V=7RhcLG_4e_D2fBS72Dz-5Oq_u6-RhNw@mail.gmail.com' \
    --to=cheskaqiqi@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).