* icase pathspec magic support in ls-tree @ 2022-09-30 12:04 Tao Klerks 2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 16+ messages in thread From: Tao Klerks @ 2022-09-30 12:04 UTC (permalink / raw) To: git Hi folks, I just found out today both that icase magic exists (awesome!), and that it isn't supported in ls-tree (boo). As far as I can tell, getting it supported would be the only way to *efficiently* prevent caseless-duplicate files from being created in a repo, in an "update" hook: I'd want to call ls-tree on the new head commit for the branch, passing in an icase pathspec for all the files being added since the previous state - and then sort and uniq. Of course, for entirely new branches I'd have to do a full check of the tree, and for very large changes that might be the fastest/best thing to do anyway, rather than creating a silly-sized pathspec - but checking the full tree costs me about 1 second, a price that I'm loath to pay for everyday commit verifications of a handful of files, vs a 200,000-file full tree. I tried changing ls-tree "naively" to just permit the icase magic, without any logic changes, and found at least one case where it doesn't work: when combining wildcards with case-insensitivity, like an icased "T/*" patchspec in the git repo; ls-files finds all the tests, and a naively updated ls-tree does not. I think I see the last person to update this was Nguyễn Thái Ngọc Duy in 2013, giving a hint as to what would need to be done to make this be supported; is this an area anyone else might be looking at at the moment? Thanks, Tao ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks @ 2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason 2022-10-02 19:07 ` brian m. carlson 0 siblings, 1 reply; 16+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-09-30 13:53 UTC (permalink / raw) To: Tao Klerks; +Cc: git On Fri, Sep 30 2022, Tao Klerks wrote: > Hi folks, > > I just found out today both that icase magic exists (awesome!), and > that it isn't supported in ls-tree (boo). > > As far as I can tell, getting it supported would be the only way to > *efficiently* prevent caseless-duplicate files from being created in a > repo, in an "update" hook: I'd want to call ls-tree on the new head > commit for the branch, passing in an icase pathspec for all the files > being added since the previous state - and then sort and uniq. > > Of course, for entirely new branches I'd have to do a full check of > the tree, and for very large changes that might be the fastest/best > thing to do anyway, rather than creating a silly-sized pathspec - but > checking the full tree costs me about 1 second, a price that I'm loath > to pay for everyday commit verifications of a handful of files, vs a > 200,000-file full tree. > > I tried changing ls-tree "naively" to just permit the icase magic, > without any logic changes, and found at least one case where it > doesn't work: when combining wildcards with case-insensitivity, like > an icased "T/*" patchspec in the git repo; ls-files finds all the > tests, and a naively updated ls-tree does not. > > I think I see the last person to update this was Nguyễn Thái Ngọc Duy > in 2013, giving a hint as to what would need to be done to make this > be supported; is this an area anyone else might be looking at at the > moment? You might find ASCII-only sufficient, but note that even if you get this working you won't catch the more complex Unicode normalization rules various filesystems perform, see the fsck code we carefully crafted to make sure we don't get something those FS's will mistake for a ".git" directory in-tree. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason @ 2022-10-02 19:07 ` brian m. carlson 2022-10-13 6:35 ` Tao Klerks 0 siblings, 1 reply; 16+ messages in thread From: brian m. carlson @ 2022-10-02 19:07 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Tao Klerks, git [-- Attachment #1: Type: text/plain, Size: 1187 bytes --] On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote: > You might find ASCII-only sufficient, but note that even if you get this > working you won't catch the more complex Unicode normalization rules > various filesystems perform, see the fsck code we carefully crafted to > make sure we don't get something those FS's will mistake for a ".git" > directory in-tree. What's even worse is that different OSes case-fold differently and the behaviour differs based on the version of the OS that formatted the file system (which is of course not exposed to userspace), so in general it's impossible to know exactly how case folding works on a particular system. It might be possible to implement some general rules that are overzealous (in that they will catch patterns that will case-fold on _some_ system), but in general this is very difficult. The rules will also almost certainly change with newer versions of Unicode. I'll also point out that there is no locale-independent way to correctly case-fold Unicode text. Correct case-folding is sensitive to the language, script, and region. -- brian m. carlson (he/him or they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 263 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-02 19:07 ` brian m. carlson @ 2022-10-13 6:35 ` Tao Klerks 2022-10-14 4:51 ` Torsten Bögershausen 2022-10-14 7:41 ` Elijah Newren 0 siblings, 2 replies; 16+ messages in thread From: Tao Klerks @ 2022-10-13 6:35 UTC (permalink / raw) To: brian m. carlson, Ævar Arnfjörð Bjarmason, Tao Klerks, git On Sun, Oct 2, 2022 at 9:07 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote: > > You might find ASCII-only sufficient, but note that even if you get this > > working you won't catch the more complex Unicode normalization rules > > various filesystems perform, see the fsck code we carefully crafted to > > make sure we don't get something those FS's will mistake for a ".git" > > directory in-tree. > > What's even worse is that different OSes case-fold differently and the > behaviour differs based on the version of the OS that formatted the file > system (which is of course not exposed to userspace), so in general it's > impossible to know exactly how case folding works on a particular > system. > > It might be possible to implement some general rules that are > overzealous (in that they will catch patterns that will case-fold on > _some_ system), but in general this is very difficult. The rules will > also almost certainly change with newer versions of Unicode. > > I'll also point out that there is no locale-independent way to correctly > case-fold Unicode text. Correct case-folding is sensitive to the > language, script, and region. Thanks for the feedback! If I'm understanding correctly, both of these responses were targeted specifically at my motivation/usecase (preventing the submission of case-insensitively duplicate files into a repository) rather than the question of whether anyone has worked or is working on anything relevant to adding icase pathspec magic support to ls-tree. I understand that case-folding is a complex topic, and doing it correctly in some universal sense is undoubtedly beyond me - but "my" context certainly does not require a high standard of correctness: There's a repo shared by some 1000 engineers, 200k files, lots of activity, three different OSes of which two default to case-insensitive filesystems, and every once in a while a user on linux creates a case-insensitive-duplicate file with differing content, which causes git on case-insensitive filesystems to lose the plot (you end up with one file's content looking like changes to the other file's content - "ghost changes" that appear as soon as you check out, that prevent you from doing a simple "pull", and that you just can't reset). I don't imagine I can make a perfectly correct and universal fix to this, but with case-insensitive matching on ls-tree in an update hook I believe I could reduce the frequency of this already-infrequent issue by at least 1000X, which would suit my purposes just fine. In my case filenames are mostly ansi-based, and I don't expect we've ever had Turkish filenames (turkish "i" being the most famous case-folding gotcha I think?). Coming at this from another angle, I guess we could teach git on case-insensitive filesystems to detect this situation (where two files in the index, with different contents, are pointing to the exact same filesystem file) and more explicitly warn the user of what's wrong, giving them clear help on how to fix it? And temporarily exclude those two files from its change reconciliation processes altogether to avoid ghost changes interfering with recovery actions like "pull"? Certainly that would be better than the current "ghost changes" behavior... but it would still be far less convenient than preventing (the vast majority of) these issues altogether, be that with a custom hook or a core option prohibiting clearly case-insensitive-duplicate files from being pushed. By the time a case-insensitive-FS-user encounters this issue in their repo as they checkout or clone, it's likely that the problem is in master/main and others are already affected, and both the cycle time to fixing the issue, and the communication impact in the current state ("please wait, the issue is being addressed, once the remote branch is fixed here's what you'll do to 'pull' successfully in spite of the local repo thinking there are filesystem changes that really don't exist and can't be reset") are... suboptimal. It feels like adding case-insensitivity pathspec magic support to ls-tree (however reliable or universal the subsequent duplicate-detection is or isn't) *should* be much simpler than what it would have taken to support it in ls-files in the first place - but at a glance, I see the official pathspec-supporting function "match_pathspec()" is deep in index-land, with an "index_state" structure being passed around all over the place. If it really was easy, someone would already have done it I guess? :) I don't see this being something I can take on in my spare time, so for now I suspect I'll have to do a full-tree duplicate-file-search on every ref update, and simply accept the 1-second update hook processing time/delay per pushed ref :( I'm assuming the "ghost changes" behavior I allude to here (where two different files in the index, with different contents, point to the same single file in the case-insensitive filesystem, and one or the other index file appears modified / the working tree looks "dirty") is a known issue, but if there's any value in my opening a thread more clearly/explicitly about this behavior, please let me know. Thanks, Tao ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-13 6:35 ` Tao Klerks @ 2022-10-14 4:51 ` Torsten Bögershausen 2022-10-14 8:31 ` Tao Klerks 2022-10-14 7:41 ` Elijah Newren 1 sibling, 1 reply; 16+ messages in thread From: Torsten Bögershausen @ 2022-10-14 4:51 UTC (permalink / raw) To: Tao Klerks; +Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git On Thu, Oct 13, 2022 at 08:35:11AM +0200, Tao Klerks wrote: > On Sun, Oct 2, 2022 at 9:07 PM brian m. carlson > <sandals@crustytoothpaste.net> wrote: > > > > On 2022-09-30 at 13:53:16, Ævar Arnfjörð Bjarmason wrote: > > > You might find ASCII-only sufficient, but note that even if you get this > > > working you won't catch the more complex Unicode normalization rules > > > various filesystems perform, see the fsck code we carefully crafted to > > > make sure we don't get something those FS's will mistake for a ".git" > > > directory in-tree. > > > > What's even worse is that different OSes case-fold differently and the > > behaviour differs based on the version of the OS that formatted the file > > system (which is of course not exposed to userspace), so in general it's > > impossible to know exactly how case folding works on a particular > > system. > > > > It might be possible to implement some general rules that are > > overzealous (in that they will catch patterns that will case-fold on > > _some_ system), but in general this is very difficult. The rules will > > also almost certainly change with newer versions of Unicode. > > > > I'll also point out that there is no locale-independent way to correctly > > case-fold Unicode text. Correct case-folding is sensitive to the > > language, script, and region. > > Thanks for the feedback! > > If I'm understanding correctly, both of these responses were targeted > specifically at my motivation/usecase (preventing the submission of > case-insensitively duplicate files into a repository) rather than the > question of whether anyone has worked or is working on anything > relevant to adding icase pathspec magic support to ls-tree. > > I understand that case-folding is a complex topic, and doing it > correctly in some universal sense is undoubtedly beyond me - but "my" > context certainly does not require a high standard of correctness: > There's a repo shared by some 1000 engineers, 200k files, lots of > activity, three different OSes of which two default to > case-insensitive filesystems, and every once in a while a user on > linux creates a case-insensitive-duplicate file with differing > content, which causes git on case-insensitive filesystems to lose the > plot (you end up with one file's content looking like changes to the > other file's content - "ghost changes" that appear as soon as you > check out, that prevent you from doing a simple "pull", and that you > just can't reset). > > I don't imagine I can make a perfectly correct and universal fix to > this, but with case-insensitive matching on ls-tree in an update hook > I believe I could reduce the frequency of this already-infrequent > issue by at least 1000X, which would suit my purposes just fine. In my > case filenames are mostly ansi-based, and I don't expect we've ever > had Turkish filenames (turkish "i" being the most famous case-folding > gotcha I think?). > > Coming at this from another angle, I guess we could teach git on > case-insensitive filesystems to detect this situation (where two files > in the index, with different contents, are pointing to the exact same > filesystem file) and more explicitly warn the user of what's wrong, > giving them clear help on how to fix it? And temporarily exclude those > two files from its change reconciliation processes altogether to avoid > ghost changes interfering with recovery actions like "pull"? Certainly > that would be better than the current "ghost changes" behavior... but > it would still be far less convenient than preventing (the vast > majority of) these issues altogether, be that with a custom hook or a > core option prohibiting clearly case-insensitive-duplicate files from > being pushed. > > By the time a case-insensitive-FS-user encounters this issue in their > repo as they checkout or clone, it's likely that the problem is in > master/main and others are already affected, and both the cycle time > to fixing the issue, and the communication impact in the current state > ("please wait, the issue is being addressed, once the remote branch is > fixed here's what you'll do to 'pull' successfully in spite of the > local repo thinking there are filesystem changes that really don't > exist and can't be reset") are... suboptimal. > > It feels like adding case-insensitivity pathspec magic support to > ls-tree (however reliable or universal the subsequent > duplicate-detection is or isn't) *should* be much simpler than what it > would have taken to support it in ls-files in the first place - but at > a glance, I see the official pathspec-supporting function > "match_pathspec()" is deep in index-land, with an "index_state" > structure being passed around all over the place. If it really was > easy, someone would already have done it I guess? :) > > I don't see this being something I can take on in my spare time, so > for now I suspect I'll have to do a full-tree duplicate-file-search on > every ref update, and simply accept the 1-second update hook > processing time/delay per pushed ref :( > > I'm assuming the "ghost changes" behavior I allude to here (where two > different files in the index, with different contents, point to the > same single file in the case-insensitive filesystem, and one or the > other index file appears modified / the working tree looks "dirty") is > a known issue, but if there's any value in my opening a thread more > clearly/explicitly about this behavior, please let me know. > > Thanks, > Tao Thanks for sharing your experience in detail. Did you ever consider to write a shell script, that can detect icase-collisions ? For example, we can use Linux: git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $? include/uapi/linux/netfilter_ipv4/ipt_ecn.h include/uapi/linux/netfilter_ipv4/ipt_ttl.h [snip the other files] The GNU versions of uniq allow an even shorter command, (But the POSIX versions don't) git ls-files | sort | uniq -i -d I think that a script like this could do the trick: #!/bin/sh ret=1 >/tmp/$$-exp git ls-files | sort | uniq -i -d >/tmp/$$-act && cmp /tmp/$$-exp /tmp/$$-act && ret=0 rm -f /tmp/$$-exp /tmp/$$-act exit $ret #################### The usage of files in /tmp is probably debatable, I want just illustrate how a combination of shell scripts in combination with existing commands can be used. The biggest step may be to introduce a server-side hook that does a check. But once that is done and working, you probably do not want to miss it. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 4:51 ` Torsten Bögershausen @ 2022-10-14 8:31 ` Tao Klerks 2022-10-14 8:37 ` Erik Cervin Edin 0 siblings, 1 reply; 16+ messages in thread From: Tao Klerks @ 2022-10-14 8:31 UTC (permalink / raw) To: Torsten Bögershausen Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git On Fri, Oct 14, 2022 at 6:51 AM Torsten Bögershausen <tboegi@web.de> wrote: > > On Thu, Oct 13, 2022 at 08:35:11AM +0200, Tao Klerks wrote: > > Did you ever consider to write a shell script, > that can detect icase-collisions ? > > For example, we can use Linux: > git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $? > include/uapi/linux/netfilter_ipv4/ipt_ecn.h > include/uapi/linux/netfilter_ipv4/ipt_ttl.h > [snip the other files] > > The GNU versions of uniq allow an even shorter command, > (But the POSIX versions don't) > > git ls-files | sort | uniq -i -d > > I think that a script like this could do the trick: > > #!/bin/sh > ret=1 > >/tmp/$$-exp > git ls-files | sort | uniq -i -d >/tmp/$$-act && > cmp /tmp/$$-exp /tmp/$$-act && > ret=0 > rm -f /tmp/$$-exp /tmp/$$-act > exit $ret > > > #################### > The usage of files in /tmp is probably debatable, > I want just illustrate how a combination of shell > scripts in combination with existing commands can be used. > > The biggest step may be to introduce a server-side hook > that does a check. > But once that is done and working, you probably do > not want to miss it. Thanks for the proposal! Sorry I was a bit vague in my "I suspect I'll have to do a full-tree duplicate-file-search on every ref update", but your suggestion is almost exactly what I meant. On my machine, on this repo, a full-tree case-insensitive duplicate search costs me about 800ms for 100k files, or 1,800ms for 200k files: git ls-tree --name-only -r $NEWHASH | sort | uniq -i -d I need to use ls-tree rather than ls-files because this is indeed a command to run in an update hook, and there is no working tree - no (relevant) index, in a server-side update hook. The 800ms for 100k files are composed of 200ms of ls-tree, 600ms of sort, and about 10ms of uniq. My intent with supporting icase pathspec magic was to do something like: git --icase-pathspecs ls-tree --name-only -r $NEWHASH -- PATHS OF ADDED FILES | sort | uniq -i -d Which would be near-instantaneous in the vast majority of cases (and I'd have some file count limit past which I would fall back to doing the full tree, to avoid excessive command lengths). Unfortunately, "--icase-pathspecs" is not supported in ls-tree, hence this thread :) But yes - ultimately, paying that "full dupe search" per-update server hook processing time cost has seemed like the only sensible way of doing this - until I thought about Elijah's suggestion a little harder that is! More in the next part of the thread. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 8:31 ` Tao Klerks @ 2022-10-14 8:37 ` Erik Cervin Edin 0 siblings, 0 replies; 16+ messages in thread From: Erik Cervin Edin @ 2022-10-14 8:37 UTC (permalink / raw) To: Tao Klerks Cc: Torsten Bögershausen, brian m. carlson, Ævar Arnfjörð Bjarmason, git On Fri, Oct 14, 2022 at 10:32 AM Tao Klerks <tao@klerks.biz> wrote: > > I need to use ls-tree rather than ls-files because this is indeed a > command to run in an update hook, and there is no working tree - no > (relevant) index, in a server-side update hook. I believe git diff --name-only doesn't need a working tree ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-13 6:35 ` Tao Klerks 2022-10-14 4:51 ` Torsten Bögershausen @ 2022-10-14 7:41 ` Elijah Newren 2022-10-14 8:03 ` Erik Cervin Edin 2022-10-14 8:48 ` Tao Klerks 1 sibling, 2 replies; 16+ messages in thread From: Elijah Newren @ 2022-10-14 7:41 UTC (permalink / raw) To: Tao Klerks; +Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git On Wed, Oct 12, 2022 at 11:49 PM Tao Klerks <tao@klerks.biz> wrote: > [...] > I understand that case-folding is a complex topic, and doing it > correctly in some universal sense is undoubtedly beyond me - but "my" > context certainly does not require a high standard of correctness: > There's a repo shared by some 1000 engineers, 200k files, lots of > activity, three different OSes of which two default to > case-insensitive filesystems, and every once in a while a user on > linux creates a case-insensitive-duplicate file with differing > content, which causes git on case-insensitive filesystems to lose the > plot (you end up with one file's content looking like changes to the > other file's content - "ghost changes" that appear as soon as you > check out, that prevent you from doing a simple "pull", and that you > just can't reset). I've felt that same pain. > I don't imagine I can make a perfectly correct and universal fix to > this, but with case-insensitive matching on ls-tree in an update hook > I believe I could reduce the frequency of this already-infrequent > issue by at least 1000X, which would suit my purposes just fine. In my > case filenames are mostly ansi-based, and I don't expect we've ever > had Turkish filenames (turkish "i" being the most famous case-folding > gotcha I think?). How exactly would case-insensitive matching in ls-tree help you here? Can't you write a hook without such capability that rejects such collisions? > I don't see this being something I can take on in my spare time, so > for now I suspect I'll have to do a full-tree duplicate-file-search on > every ref update, and simply accept the 1-second update hook > processing time/delay per pushed ref :( I don't see why you need to do full-tree with existing options, nor why the ls-tree option you want would somehow make it easier to avoid. I think you can avoid the full-tree search with something like: git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | \ sort | uniq -i -d The final "sort | uniq -i -d" is taken from Torsten's suggestion. The git diff ... xargs git ls-tree section on the first line will provide a list of all files (& subdirs) in the same directory as any added file. (Although, it has a blind spot for paths in the toplevel directory.) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 7:41 ` Elijah Newren @ 2022-10-14 8:03 ` Erik Cervin Edin 2022-10-14 8:57 ` Tao Klerks 2022-10-14 8:48 ` Tao Klerks 1 sibling, 1 reply; 16+ messages in thread From: Erik Cervin Edin @ 2022-10-14 8:03 UTC (permalink / raw) To: Elijah Newren Cc: Tao Klerks, brian m. carlson, Ævar Arnfjörð Bjarmason, git > I don't imagine I can make a perfectly correct and universal fix to > this, but with case-insensitive matching on ls-tree in an update hook > I believe I could reduce the frequency of this already-infrequent > issue by at least 1000X, which would suit my purposes just fine. In my > case filenames are mostly ansi-based, and I don't expect we've ever > had Turkish filenames (turkish "i" being the most famous case-folding > gotcha I think?). How about doing it in something that's not ls-tree? Sounds like you already have a script, it just takes a bit too long? Something similar to On Fri, Oct 14, 2022 at 6:59 AM Torsten Bögershausen <tboegi@web.de> wrote: > > For example, we can use Linux: > git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $? In a repo with many files, maybe use git diff --name-only and just run it periodically as a part of a check-in hook or something? git diff --name-only HEAD~100..HEAD | tr 'A-Z' 'a-z' | sort | uniq -d On Fri, Oct 14, 2022 at 9:59 AM Elijah Newren <newren@gmail.com> wrote: > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > \ > sort | uniq -i -d Or what Elijah just wrote > Coming at this from another angle, I guess we could teach git on > case-insensitive filesystems to detect this situation (where two files > in the index, with different contents, are pointing to the exact same > filesystem file) and more explicitly warn the user of what's wrong, > giving them clear help on how to fix it? And temporarily exclude those > two files from its change reconciliation processes altogether to avoid > ghost changes interfering with recovery actions like "pull"? Certainly > that would be better than the current "ghost changes" behavior... but > it would still be far less convenient than preventing (the vast > majority of) these issues altogether, be that with a custom hook or a > core option prohibiting clearly case-insensitive-duplicate files from > being pushed. That's not to say this isn't a good idea but for now I'd advice an automated scripted route. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 8:03 ` Erik Cervin Edin @ 2022-10-14 8:57 ` Tao Klerks 0 siblings, 0 replies; 16+ messages in thread From: Tao Klerks @ 2022-10-14 8:57 UTC (permalink / raw) To: Erik Cervin Edin Cc: Elijah Newren, brian m. carlson, Ævar Arnfjörð Bjarmason, git On Fri, Oct 14, 2022 at 10:04 AM Erik Cervin Edin <erik@cervined.in> wrote: > > > On Fri, Oct 14, 2022 at 6:59 AM Torsten Bögershausen <tboegi@web.de> wrote: > > > > For example, we can use Linux: > > git ls-files | tr 'A-Z' 'a-z' | sort | uniq -d ; echo $? > > In a repo with many files, maybe use git diff --name-only and just run > it periodically as a part of a check-in hook or something? > > git diff --name-only HEAD~100..HEAD | tr 'A-Z' 'a-z' | sort | uniq -d > > [... next email...] > I believe > git diff --name-only > doesn't need a working tree I don't understand this suggestion; doesn't it only catch duplicates where both instances were introduced in the same 100-commit range? That has often or typically not been the case, in my experience. Often one version of the file or folder will have existed for some time (days, months, years), and then a "duplicate" will be introduced. As far as I can tell this "diff large range" approach is quite expensive (37 seconds in a trivial test on this repo), and non-comprehensive. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 7:41 ` Elijah Newren 2022-10-14 8:03 ` Erik Cervin Edin @ 2022-10-14 8:48 ` Tao Klerks 2022-10-14 9:07 ` Tao Klerks 2022-10-14 17:06 ` Elijah Newren 1 sibling, 2 replies; 16+ messages in thread From: Tao Klerks @ 2022-10-14 8:48 UTC (permalink / raw) To: Elijah Newren Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote: > > > How exactly would case-insensitive matching in ls-tree help you here? I just attempted to demonstrate in response to Torsten's email; the assumption was that I could list the added files' paths in a case-insensitive pathspec, and thereby get all duplicates fast and efficiently, for reasonably-sized commits. (new refs and very large commits would still need a full-tree dupe scan) > Can't you write a hook without such capability that rejects such > collisions? It is possible, but far less convenient and I'm not confident that my shell scripting abilities will get me to a good place. That said, having thought about your point, my shell scripting abilities are more likely to get me to a good place than attempting to add icase pathspec magic support to ls-tree :) > > > I don't see this being something I can take on in my spare time, so > > for now I suspect I'll have to do a full-tree duplicate-file-search on > > every ref update, and simply accept the 1-second update hook > > processing time/delay per pushed ref :( > > I don't see why you need to do full-tree with existing options, nor > why the ls-tree option you want would somehow make it easier to avoid. > I think you can avoid the full-tree search with something like: > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > \ > sort | uniq -i -d > > The final "sort | uniq -i -d" is taken from Torsten's suggestion. > > The git diff ... xargs git ls-tree section on the first line will > provide a list of all files (& subdirs) in the same directory as any > added file. (Although, it has a blind spot for paths in the toplevel > directory.) The theoretical problem with this approach is that it only addresses case-insensitive-duplicate files, not directories. Directories have been the problem, in "my" repo, around one-third of the time - typically someone does a directory rename, and someone else does a bad merge and reintroduces the old directory. That said, what "icase pathspec magic" actually *does*, is break down the pathspec into iteratively more complete paths, level by level, looking for case-duplicates at each level. That's something I could presumably do in shell scripting, collecting all the interesting sub-paths first, and then getting ls-tree to tell me about the immediate children for each sub-path, doing case-insensitive dupe searches across children for each of these sub-paths. ls-tree supporting icase pathspec magic would clearly be more efficient (I wouldn't need N ls-tree git processes, where N is the number of sub-paths in the diff), but this should be plenty efficient for normal commits, with a fallback to the full search This seems like a sensible direction, I'll have a play. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 8:48 ` Tao Klerks @ 2022-10-14 9:07 ` Tao Klerks 2022-10-14 12:00 ` Erik Cervin Edin 2022-10-14 17:06 ` Elijah Newren 1 sibling, 1 reply; 16+ messages in thread From: Tao Klerks @ 2022-10-14 9:07 UTC (permalink / raw) To: Elijah Newren Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git Small typo / confusion on my part: On Fri, Oct 14, 2022 at 10:48 AM Tao Klerks <tao@klerks.biz> wrote: > > That said, what "icase pathspec magic" actually *does*, is break down > the pathspec into iteratively more complete paths, level by level, > looking for case-duplicates at each level. I meant what it does is *collect path matches* at iteratively more complete paths, level by level - it doesn't care about duplicates; any duplicate-detection is what you can then do on the result. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 9:07 ` Tao Klerks @ 2022-10-14 12:00 ` Erik Cervin Edin 0 siblings, 0 replies; 16+ messages in thread From: Erik Cervin Edin @ 2022-10-14 12:00 UTC (permalink / raw) To: Tao Klerks Cc: Elijah Newren, brian m. carlson, Ævar Arnfjörð Bjarmason, git On Fri, Oct 14, 2022 at 10:58 AM Tao Klerks <tao@klerks.biz> wrote: > > I don't understand this suggestion; doesn't it only catch duplicates > where both instances were introduced in the same 100-commit range? Yes. It was a bit half-baked but the main idea was to limit the tree to a smaller subset (and not the whole tree) and incrementally checking for introduced duplicates instead of a full tree search. I think that's basically Elijah's idea. Get all (added?) files introduced in a certain revision range (last change, since yesterday etc.) and then only check those against the tree for duplicates in a manner of how you define duplicates On Fri, Oct 14, 2022 at 10:50 AM Tao Klerks <tao@klerks.biz> wrote: > > Directories have been the problem, in "my" repo, around one-third of > the time - typically someone does a directory rename, and someone else > does a bad merge and reintroduces the old directory. That adds a bit of complexity :/ but should still be doable. Not perfect but maybe something along these lines? (caveat, possibly GNU only) #!/bin/sh # files added between revisions x y added_files() { git diff --diff-filter=A --name-only --no-renames $1 $2 ; } # folders of those added files added_folders() { added_files $1 $2 | sed -e '/[^\/]*/s@^@./@' -e 's@/[^/]*$@/@' | sort -u ; } # all files tracked by git in *those* folders at HEAD possible_dupes() { added_folders $1 $2 | xargs git ls-tree --name-only HEAD ; } # case insensitive columns separated by \x1 # eg. #path\x1PaTh #path\x1path case_insensitive() { sed -e 's@.*@\L\0\E\x1\0@' | sort ; } x=$1 y=$2 # Find all duplicates paths (case insensitive) # in directories which were added between $x $y possible_dupes $x $y | case_insensitive | awk -F '\x1' ' # actual "duplicate" paths, column $2 # as determined by case-insensitive column $1 $1 in a { print a[$1]; print $2 } { a[$1]=$2 } ' | uniq ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 8:48 ` Tao Klerks 2022-10-14 9:07 ` Tao Klerks @ 2022-10-14 17:06 ` Elijah Newren 2022-10-15 22:06 ` Tao Klerks 1 sibling, 1 reply; 16+ messages in thread From: Elijah Newren @ 2022-10-14 17:06 UTC (permalink / raw) To: Tao Klerks; +Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote: > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote: > > [...] > > I don't see why you need to do full-tree with existing options, nor > > why the ls-tree option you want would somehow make it easier to avoid. > > I think you can avoid the full-tree search with something like: > > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > > \ > > sort | uniq -i -d > > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion. > > > > The git diff ... xargs git ls-tree section on the first line will > > provide a list of all files (& subdirs) in the same directory as any > > added file. (Although, it has a blind spot for paths in the toplevel > > directory.) > > The theoretical problem with this approach is that it only addresses > case-insensitive-duplicate files, not directories. It'll catch some case-insensitive-duplicate directories too -- note that I did call out that it'd print subdirs. But to be more cautious, you would need to carefully grab all leading directories of any added path, not just the immediate leading directory. > Directories have been the problem, in "my" repo, around one-third of > the time - typically someone does a directory rename, and someone else > does a bad merge and reintroduces the old directory. > > That said, what "icase pathspec magic" actually *does*, is break down > the pathspec into iteratively more complete paths, level by level, > looking for case-duplicates at each level. That's something I could > presumably do in shell scripting, collecting all the interesting > sub-paths first, and then getting ls-tree to tell me about the > immediate children for each sub-path, doing case-insensitive dupe > searches across children for each of these sub-paths. > > ls-tree supporting icase pathspec magic would clearly be more > efficient (I wouldn't need N ls-tree git processes, where N is the > number of sub-paths in the diff), but this should be plenty efficient > for normal commits, with a fallback to the full search > > This seems like a sensible direction, I'll have a play. If you create a script that gives you all leading directories of any listed path (plus replacing the toplevel dir with ':/'), such as this (which I'm calling 'all-leading-dirs.py'): """ #!/usr/bin/env python3 import os import sys paths = sys.stdin.read().splitlines() dirs_seen = set() for path in paths: dir = path while dir: dir = os.path.dirname(dir) if dir in dirs_seen: continue dirs_seen.add(dir) if dirs_seen: # Replace top-level dir of "" with ":"; we'll add the trailing '/' below when adding it to all other dirs dirs_seen.remove("") dirs_seen.add(':') for dir in dirs_seen: print(dir+'/') # ls-tree wants the trailing '/' if we are going to list contents within that tree rather than just the tree itself """ Then the following will catch duplicates files and directories for you: git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only -t HEAD | sort | uniq -i -d and it no longer has problems catching duplicates in the toplevel directory either. It does have (at most) two git invocations, but only one invocation of ls-tree. Here's a test script to prove it works: """ #!/bin/bash git init -b main nukeme cd nukeme mkdir -p dir1/subdir/whatever mkdir -p dir2/subdir/whatever >dir1/subdir/whatever/foo >dir2/subdir/whatever/foo git add . git commit -m initial mkdir -p dir1/SubDir/whatever >dir1/SubDir/whatever/foo git add . git commit -m stuff git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only -t HEAD | sort | uniq -i -d """ The output of this script is """ dir1/subdir """ which correctly notifies on the duplicate (dir1/SubDir being the other; uniq is the one that picks which of the two duplicate names to print) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-14 17:06 ` Elijah Newren @ 2022-10-15 22:06 ` Tao Klerks 2022-10-17 15:46 ` Tao Klerks 0 siblings, 1 reply; 16+ messages in thread From: Tao Klerks @ 2022-10-15 22:06 UTC (permalink / raw) To: Elijah Newren Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git This seems to be working, thank you!!! Two updates I had to make, in case this is useful to anyone else: 1: I'm getting some weird behavior I can't explain yet, where some paths are returned from the ls-tree call twice: The input to ls-tree is all unique paths, but the output somehow includes a relatively small subset of paths twice. This mysterious issue is easily addressed by adding an extra "uniq" call to remove the "trivial dupes" before hunting for the "case-insensitive dupes" we're interested in: git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only -t HEAD | sort | uniq | uniq -i -d 2: The xargs call has issues with paths with spaces in them. Adding -d"\n" seems to be a clean way to fix this git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only -t HEAD | sort | uniq | uniq -i -d Not only does this approach seem to work well, but it also has far better performance characteristics than I was expecting! Simple small commit (10 files): 20ms Reasonably large commit (10,000 files): 250ms Diff from empty on a smaller branch (100,000 files): 2,800ms Diff from empty on a larger branch (200,000 files): 5,400ms It still makes sense to check the number of files/lines after doing the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check instead of this when your diff size goes past some file count threshold, but it looks like that threshold would be quite high in my environment - 30k files maybe? I will have a go at writing a full update hook, and (without knowing whether it will make sense from a performance perspective) I'd like to try to implement the "all-leading-dirs" logic in bash 4 using associative arrays, to remove the python dependency. If I make it work I'll post back here. This seems to cover what I needed icase pathspec magic for in ls-tree, without having to implement it - so thanks again! Tao On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@gmail.com> wrote: > > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote: > > > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote: > > > > [...] > > > I don't see why you need to do full-tree with existing options, nor > > > why the ls-tree option you want would somehow make it easier to avoid. > > > I think you can avoid the full-tree search with something like: > > > > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > > > \ > > > sort | uniq -i -d > > > > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion. > > > > > > The git diff ... xargs git ls-tree section on the first line will > > > provide a list of all files (& subdirs) in the same directory as any > > > added file. (Although, it has a blind spot for paths in the toplevel > > > directory.) > > > > The theoretical problem with this approach is that it only addresses > > case-insensitive-duplicate files, not directories. > > It'll catch some case-insensitive-duplicate directories too -- note > that I did call out that it'd print subdirs. But to be more cautious, > you would need to carefully grab all leading directories of any added > path, not just the immediate leading directory. > > > Directories have been the problem, in "my" repo, around one-third of > > the time - typically someone does a directory rename, and someone else > > does a bad merge and reintroduces the old directory. > > > > That said, what "icase pathspec magic" actually *does*, is break down > > the pathspec into iteratively more complete paths, level by level, > > looking for case-duplicates at each level. That's something I could > > presumably do in shell scripting, collecting all the interesting > > sub-paths first, and then getting ls-tree to tell me about the > > immediate children for each sub-path, doing case-insensitive dupe > > searches across children for each of these sub-paths. > > > > ls-tree supporting icase pathspec magic would clearly be more > > efficient (I wouldn't need N ls-tree git processes, where N is the > > number of sub-paths in the diff), but this should be plenty efficient > > for normal commits, with a fallback to the full search > > > > This seems like a sensible direction, I'll have a play. > > If you create a script that gives you all leading directories of any > listed path (plus replacing the toplevel dir with ':/'), such as this > (which I'm calling 'all-leading-dirs.py'): > > """ > #!/usr/bin/env python3 > > import os > import sys > > paths = sys.stdin.read().splitlines() > dirs_seen = set() > for path in paths: > dir = path > while dir: > dir = os.path.dirname(dir) > if dir in dirs_seen: > continue > dirs_seen.add(dir) > if dirs_seen: > # Replace top-level dir of "" with ":"; we'll add the trailing '/' > below when adding it to all other dirs > dirs_seen.remove("") > dirs_seen.add(':') > for dir in dirs_seen: > print(dir+'/') # ls-tree wants the trailing '/' if we are going > to list contents within that tree rather than just the tree itself > """ > > Then the following will catch duplicates files and directories for you: > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq -i -d > > and it no longer has problems catching duplicates in the toplevel > directory either. It does have (at most) two git invocations, but > only one invocation of ls-tree. Here's a test script to prove it > works: > > """ > #!/bin/bash > > git init -b main nukeme > cd nukeme > mkdir -p dir1/subdir/whatever > mkdir -p dir2/subdir/whatever > >dir1/subdir/whatever/foo > >dir2/subdir/whatever/foo > git add . > git commit -m initial > > mkdir -p dir1/SubDir/whatever > >dir1/SubDir/whatever/foo > git add . > git commit -m stuff > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq -i -d > """ > > The output of this script is > """ > dir1/subdir > """ > which correctly notifies on the duplicate (dir1/SubDir being the > other; uniq is the one that picks which of the two duplicate names to > print) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: icase pathspec magic support in ls-tree 2022-10-15 22:06 ` Tao Klerks @ 2022-10-17 15:46 ` Tao Klerks 0 siblings, 0 replies; 16+ messages in thread From: Tao Klerks @ 2022-10-17 15:46 UTC (permalink / raw) To: Elijah Newren Cc: brian m. carlson, Ævar Arnfjörð Bjarmason, git [-- Attachment #1: Type: text/plain, Size: 7541 bytes --] Thanks again for the code samples above! I've spent some time productizing this for my environment, and I think it's pretty close to optimal for my environment at least. In case this can be of value to anyone else, I've attached the final thing. This version is genericized a little - two specific specializations for my environment not included here are: * a specific strategy for determining the most-likely-correct upstream branch to use as a default for "diff from" for new branches, to avoid having to do a full-tree dupe search; in most environments choosing "master" or "main" would likely be the right thing most of the time * a specific list of branch paths that we should validate (vs others that we should not) Best regards, Tao On Sun, Oct 16, 2022 at 12:06 AM Tao Klerks <tao@klerks.biz> wrote: > > This seems to be working, thank you!!! > > Two updates I had to make, in case this is useful to anyone else: > > 1: I'm getting some weird behavior I can't explain yet, where some > paths are returned from the ls-tree call twice: The input to ls-tree > is all unique paths, but the output somehow includes a relatively > small subset of paths twice. > This mysterious issue is easily addressed by adding an extra "uniq" > call to remove the "trivial dupes" before hunting for the > "case-insensitive dupes" we're interested in: > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq | uniq -i -d > > 2: The xargs call has issues with paths with spaces in them. Adding > -d"\n" seems to be a clean way to fix this > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > all-leading-dirs.py | xargs -d"\n" --no-run-if-empty git ls-tree --name-only > -t HEAD | sort | uniq | uniq -i -d > > > Not only does this approach seem to work well, but it also has far > better performance characteristics than I was expecting! > > Simple small commit (10 files): 20ms > Reasonably large commit (10,000 files): 250ms > Diff from empty on a smaller branch (100,000 files): 2,800ms > Diff from empty on a larger branch (200,000 files): 5,400ms > > > It still makes sense to check the number of files/lines after doing > the diff, and do a "simple" 800ms full-tree (no-pathspec) dupe check > instead of this when your diff size goes past some file count > threshold, but it looks like that threshold would be quite high in my > environment - 30k files maybe? > > I will have a go at writing a full update hook, and (without knowing > whether it will make sense from a performance perspective) I'd like to > try to implement the "all-leading-dirs" logic in bash 4 using > associative arrays, to remove the python dependency. If I make it work > I'll post back here. > > This seems to cover what I needed icase pathspec magic for in ls-tree, > without having to implement it - so thanks again! > > Tao > > On Fri, Oct 14, 2022 at 7:06 PM Elijah Newren <newren@gmail.com> wrote: > > > > On Fri, Oct 14, 2022 at 1:48 AM Tao Klerks <tao@klerks.biz> wrote: > > > > > > On Fri, Oct 14, 2022 at 9:41 AM Elijah Newren <newren@gmail.com> wrote: > > > > > > [...] > > > > I don't see why you need to do full-tree with existing options, nor > > > > why the ls-tree option you want would somehow make it easier to avoid. > > > > I think you can avoid the full-tree search with something like: > > > > > > > > git diff --diff-filter=A --no-renames --name-only $OLDHASH $NEWHASH | > > > > sed -e s%/[^/]*$%/% | uniq | xargs git ls-tree --name-only $NEWHASH | > > > > \ > > > > sort | uniq -i -d > > > > > > > > The final "sort | uniq -i -d" is taken from Torsten's suggestion. > > > > > > > > The git diff ... xargs git ls-tree section on the first line will > > > > provide a list of all files (& subdirs) in the same directory as any > > > > added file. (Although, it has a blind spot for paths in the toplevel > > > > directory.) > > > > > > The theoretical problem with this approach is that it only addresses > > > case-insensitive-duplicate files, not directories. > > > > It'll catch some case-insensitive-duplicate directories too -- note > > that I did call out that it'd print subdirs. But to be more cautious, > > you would need to carefully grab all leading directories of any added > > path, not just the immediate leading directory. > > > > > Directories have been the problem, in "my" repo, around one-third of > > > the time - typically someone does a directory rename, and someone else > > > does a bad merge and reintroduces the old directory. > > > > > > That said, what "icase pathspec magic" actually *does*, is break down > > > the pathspec into iteratively more complete paths, level by level, > > > looking for case-duplicates at each level. That's something I could > > > presumably do in shell scripting, collecting all the interesting > > > sub-paths first, and then getting ls-tree to tell me about the > > > immediate children for each sub-path, doing case-insensitive dupe > > > searches across children for each of these sub-paths. > > > > > > ls-tree supporting icase pathspec magic would clearly be more > > > efficient (I wouldn't need N ls-tree git processes, where N is the > > > number of sub-paths in the diff), but this should be plenty efficient > > > for normal commits, with a fallback to the full search > > > > > > This seems like a sensible direction, I'll have a play. > > > > If you create a script that gives you all leading directories of any > > listed path (plus replacing the toplevel dir with ':/'), such as this > > (which I'm calling 'all-leading-dirs.py'): > > > > """ > > #!/usr/bin/env python3 > > > > import os > > import sys > > > > paths = sys.stdin.read().splitlines() > > dirs_seen = set() > > for path in paths: > > dir = path > > while dir: > > dir = os.path.dirname(dir) > > if dir in dirs_seen: > > continue > > dirs_seen.add(dir) > > if dirs_seen: > > # Replace top-level dir of "" with ":"; we'll add the trailing '/' > > below when adding it to all other dirs > > dirs_seen.remove("") > > dirs_seen.add(':') > > for dir in dirs_seen: > > print(dir+'/') # ls-tree wants the trailing '/' if we are going > > to list contents within that tree rather than just the tree itself > > """ > > > > Then the following will catch duplicates files and directories for you: > > > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > > -t HEAD | sort | uniq -i -d > > > > and it no longer has problems catching duplicates in the toplevel > > directory either. It does have (at most) two git invocations, but > > only one invocation of ls-tree. Here's a test script to prove it > > works: > > > > """ > > #!/bin/bash > > > > git init -b main nukeme > > cd nukeme > > mkdir -p dir1/subdir/whatever > > mkdir -p dir2/subdir/whatever > > >dir1/subdir/whatever/foo > > >dir2/subdir/whatever/foo > > git add . > > git commit -m initial > > > > mkdir -p dir1/SubDir/whatever > > >dir1/SubDir/whatever/foo > > git add . > > git commit -m stuff > > > > git diff --diff-filter=A --no-renames --name-only HEAD~1 HEAD | > > all-leading-dirs.py | xargs --no-run-if-empty git ls-tree --name-only > > -t HEAD | sort | uniq -i -d > > """ > > > > The output of this script is > > """ > > dir1/subdir > > """ > > which correctly notifies on the duplicate (dir1/SubDir being the > > other; uniq is the one that picks which of the two duplicate names to > > print) [-- Attachment #2: case-insensitive-duplicate-check-update-hook.bash --] [-- Type: application/octet-stream, Size: 8047 bytes --] #!/bin/bash UPDATING_REF="$1" FROM_HASH="$2" TO_HASH="$3" # GENERAL Problem Background: # - In git, repo paths are inherently case-sensitive # - In two of the three major dev workstation OSs (Windows, OSX), filesystems are typically case-insensitive # - When files are added in the repo that are duplicates in the filesystem, but not in git, usability problems abound # - A particularly onerous issue occurs when case-insensitive-duplicate files have differing content; after a checkout # of such a tree, the case-insensitive working tree always looks like it has a change to one or the other of the # VCS files. In such an "unavoidably/unfixably dirty tree" state, a casual user cannot even "pull" a fixed branch # state, as git will will refuse to pull in a dirty working tree. # - Git currently offers no built-in mechanism to prevent this # - It is preferable to avoid this issue in a server hook, rather than in CI validations (although the latter can also # work) - the earlier the issue is "nipped in the bud" the better. # - There is a reasonably trivial shell scripting test for "Tree has duplicates", but validating simply on the basis of # this test presents two challenges: # - It might be too expensive to run on every commit in a large repo # - It might trigger "false positives" when creating new refs pointing to old already-problematic commits # SPECIFIC Problem Repo Background: # - 200K files, in deeply nested (java) folder structures (+50k directories) # - Doing a "naive" full-tree case-insensitive duplicate search takes around 2s # - This is not a price we're willing to pay on every branch update # - Case-insensitive duplicates are sometimes folders, sometimes files # - Repo has lots of refs - 110k - so scanning them is costly # - New refs are created *very* often - most often as "keeparound" refs in GitLab # - Validations don't make sense / don't add value for some or many classes of ref (those created by automated # systems, where a violation is not the "fault" of the pusher) # - Some (active) branches are very distant from each other - there is no single "master" # Test Cases / situations: # - Single "normal" change, a dozen added files # - Single "large-ish" change, a few hundred added files # - on an existing branch (eg simple change, merge from upstream) # - on a new branch, varying the original/upstream branch # - on new refs that are excluded from validation by path/pattern # Expected & tested performance characteristics of this hook script: # - For non-branch or otherwise disqualified refs, close to no overhead. # - For updates to existing refs, it is reasonably optimal # - There are four non-trivial operations, which all scale with the diff size: # - Finding the files added in the ref update (git diff) # - Finding all the ancestor-paths for these files (output_all_ancestor_gitpaths) # - This *can* be made slightly more efficient by using eg python # - Finding all items in these paths (git ls-tree) # - Sorting the resulting paths/items (sort) # - When the number of added files is too high, we fall back to a full-tree search # - For new non-disqualified refs, there are additional overheads: # - Finding the best/closest "standard branch" to diff from, for a new ref, might be non-trivial in some repos # - (in the generic implementation, it is a trivial/cheap check for "master" and "main") # - Depending on the strategy used to find a "diff from" ref, diffs on new refs will likely be much larger/more expensive than they should ideally need to be # Future work / possible improvements: # - If git provided an easy/cheap way to detect "commits new to this repo" in an update hook, the diffs could be consistently much smaller/more efficient. # - (it is possible to find out whether a commit is net-new with something like "git rev-list MYREF --not --exclude MYREF --all", but this is *expensive* in a large repo) # - If "git ls-tree" supported icase pathspec magic, we could pass in full added-file pathspecs, and subsequent dupe-detection could be more efficient # - (there would be a functional difference: the current strategy finds dupe directories even if they have non-overlapping contents; this other approach would not) # Requirements: # - Bash 4 for associative arrays #--------- #FUNCTIONS - called via subshell #--------- output_all_ancestor_gitpaths() { declare -A INTERESTING_PATHS while IFS= read -r INPUT_LINE; do CURRENT_PATH="$INPUT_LINE" while [ "$CURRENT_PATH" != ":" ]; do [[ "$CURRENT_PATH" == */* ]] || CURRENT_PATH=":/$CURRENT_PATH" PARENT_PATH="${CURRENT_PATH%/*}" if [ "${INTERESTING_PATHS["${PARENT_PATH}/"]}" = "1" ]; then break fi INTERESTING_PATHS["${PARENT_PATH}/"]="1" CURRENT_PATH="$PARENT_PATH" done done <<< "$1" for key in ${!INTERESTING_PATHS[@]}; do echo "$key"; done debug_log "FOUND ${#INTERESTING_PATHS[@]} INTERESTING PATHS" } find_closest_standard_branch() { # Simplest implementation, look for "master" or "main" or give up. # there will likely be better repo-specific strategies in specific repos. git for-each-ref --format="%(refname)" --count=1 refs/heads/master refs/heads/main } START_TIME="$(date +%s%N)" debug_log() { if [ -n "$DEBUG_DUPE_HOOK" ]; then CURRENT_TIME="$(date +%s%N)" echo "Debug at $(($(($CURRENT_TIME-$START_TIME))/1000000))ms: $1" >&2 fi } #----------- #MAIN SCRIPT #----------- if [ -z "$TO_HASH" ]; then # branch deletion, no possibility of new dupes debug_log "BRANCH DELETION - SKIPPING DUPE VALIDATION" exit 0 fi VALIDATING_PATH_PREFIXES=("refs/heads/") REF_IS_VALIDATABLE=0 for PATH_PREFIX in "${VALIDATING_PATH_PREFIXES[@]}" do if [[ "$UPDATING_REF" = "$PATH_PREFIX"* ]]; then REF_IS_VALIDATABLE=1 fi done if [[ "$REF_IS_VALIDATABLE" = "0" ]]; then # we're not interested in validating for dupes on this ref. debug_log "NON-VALIDATABLE REF - SKIPPING DUPE VALIDATION" exit 0 fi FULL_SEARCH=false if [ -z "$FROM_HASH" ]; then # New ref; in most repos new "interesting" (not-excluded-from-dupe-check) # refs will likely be closely related to master/main most of the time, so # diff from there rather than starting from scratch. debug_log "NEW REF - LOOKING FOR STANDARD BRANCH" FROM_HASH=$(find_closest_standard_branch "$TO_HASH") fi if [ -z "$FROM_HASH" ]; then # If this is a new ref and no "standard upstream ref" was found, fall back # to full dupe search. debug_log "NO STANDARD UPSTREAM REF FOUND; DOING FULL SEARCH" FULL_SEARCH=true else debug_log "CHECKING DIFF: git diff --diff-filter=A --no-renames --name-only '$FROM_HASH' '$TO_HASH'" NEW_FILES=$(git diff --diff-filter=A --no-renames --name-only "$FROM_HASH" "$TO_HASH") NEW_FILE_LINE_COUNT=$(wc -l <<< "$NEW_FILES") if [ "$NEW_FILE_LINE_COUNT" -le "1" ] && [ -z "$NEW_FILES" ]; then # no newly added files, no possibility of new dupes # (count "1" can be an empty line) debug_log "EXITING WITH 0 ADDED FILES" exit 0 fi if [ "$NEW_FILE_LINE_COUNT" -gt 10000 ]; then # If there are too many files, then generating all the sub-paths and # running ls-tree with a huge pathspec will be slower than simply # checking the whole tree. debug_log "FILE COUNT $NEW_FILE_LINE_COUNT OVER LIMIT; DOING FULL SEARCH" FULL_SEARCH=true fi fi if [ "$FULL_SEARCH" = "true" ]; then debug_log "DOING FULL DUPE SEARCH" DUPE_PATHS=$(git ls-tree --name-only -r "$TO_HASH" | sort | uniq -i -D) else debug_log "DOING DIFF SEARCH OF $NEW_FILE_LINE_COUNT ADDED FILES BETWEEN $FROM_HASH AND $TO_HASH" DUPE_PATHS=$(output_all_ancestor_gitpaths "$NEW_FILES" | sort | xargs -d"\n" --no-run-if-empty git ls-tree --name-only -t "$TO_HASH" | sort | uniq | uniq -i -D) fi if [ -z "$DUPE_PATHS" ]; then debug_log "EXITING - NO DUPES FOUND" exit 0 else debug_log "EXITING WITH DUPES ERROR" echo "This ref cannot be accepted - duplicate folders or files found:" echo "$DUPE_PATHS" exit 1 fi ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-10-17 15:47 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-09-30 12:04 icase pathspec magic support in ls-tree Tao Klerks 2022-09-30 13:53 ` Ævar Arnfjörð Bjarmason 2022-10-02 19:07 ` brian m. carlson 2022-10-13 6:35 ` Tao Klerks 2022-10-14 4:51 ` Torsten Bögershausen 2022-10-14 8:31 ` Tao Klerks 2022-10-14 8:37 ` Erik Cervin Edin 2022-10-14 7:41 ` Elijah Newren 2022-10-14 8:03 ` Erik Cervin Edin 2022-10-14 8:57 ` Tao Klerks 2022-10-14 8:48 ` Tao Klerks 2022-10-14 9:07 ` Tao Klerks 2022-10-14 12:00 ` Erik Cervin Edin 2022-10-14 17:06 ` Elijah Newren 2022-10-15 22:06 ` Tao Klerks 2022-10-17 15:46 ` Tao Klerks
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).