* Git Rename Detection Bug
@ 2023-11-06 12:00 Jeremy Pridmore
2023-11-07 8:05 ` Elijah Newren
0 siblings, 1 reply; 13+ messages in thread
From: Jeremy Pridmore @ 2023-11-06 12:00 UTC (permalink / raw)
To: git@vger.kernel.org; +Cc: Paul Baumgartner
Thank you for filling out a Git bug report!
Please answer the following questions to help us understand your issue.
What did you do before the bug happened? (Steps to reproduce your issue)
I have two GIT repositories (A and B). Both migrated from the same TFS server using git-tfs tool. I migrated code into A and made lots of changes, including moving 50,000+ files from folder "/Landscape" to "/Landscape/src". B contains the same code but with various other changes made since my original migration from TFS to A. All the files in B are still in the "/Landscape" folder. I recently needed to merge my changes from A to B, so I added A as a remote to B and then performed a number of cherry-picks from A to B, but got stuck when trying to cherry-pick the commit containing the results of moving all files into "/Landscape/src".
What did you expect to happen? (Expected behavior)
I expected the git rename detection to match all files in A "/Landscape" to files in B "/Landscape/src".
What happened instead? (Actual behavior)
Although many files were matched successfully, git mismatched over two dozen similarly named files, e.g.
Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb
Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb
Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
Incorrect path match: Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx -> Landscape/src/Main/uiMain/My Project/licenses.licx
Incorrect path match: Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
Incorrect path match: Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx
Incorrect path match: Landscape/Rates/uiRates/My Project/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx
Incorrect path match: Landscape/Rdt.Claim.UI/Properties/licenses.licx -> Landscape/src/Rates/uiRates/My Project/licenses.licx
Incorrect path match: Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Claim.UI/Properties/licenses.licx
Incorrect path match: Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx
Incorrect path match: Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx
Incorrect path match: Landscape/Rdt.Landscape.UI/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx
Incorrect path match: Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx -> Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx
Incorrect path match: Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb -> Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb
Incorrect path match: Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx -> Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx
Incorrect path match: Landscape/Services/busServices/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
Incorrect path match: Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
Incorrect path match: Landscape/Startup/uiStartup/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
Incorrect path match: Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx -> Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx
Incorrect path match: Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx -> Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx
Incorrect path match: Landscape/Utils/uiUtils/My Project/licenses.licx -> Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx
Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Utils/uiUtils/My Project/licenses.licx
What's different between what you expected and what actually happened?
As you can see, although the filenames (and content) are the same, there are a few files that have been incorrectly matched to files with teh same name, but in a different sub folder. In some cases, it seems that the catalyst has been git thinking that a file from B has been deleted from A, when in fact it has not actually been deleted at all.
For example, the file Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 has not been deleted in A or B, therefore git should not have attempted to rename Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 to Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1, especially as it then attempts to rename Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 to Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 and so on.
Git status contains, for example:
deleted by them: Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
renamed: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
The correct renames should have been:
git mv "Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb" "Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb"
git mv "Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb" "Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb"
git mv "Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1" "Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1"
git mv "Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1" "Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1"
git mv "Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx" "Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx"
git mv "Landscape/Documentation/uiDocumentation/licenses.licx" "Landscape/src/Documentation/uiDocumentation/licenses.licx"
git mv "Landscape/Import/uiImport/My Project/licenses.licx" "Landscape/src/Import/uiImport/My Project/licenses.licx"
git mv "Landscape/Main/uiMain.Workflow/My Project/licenses.licx" "Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx"
git mv "Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico" "Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico"
git mv "Landscape/Main/uiMain/My Project/licenses.licx" "Landscape/src/Main/uiMain/My Project/licenses.licx"
git mv "Landscape/Main/uiMain/Resources/RDT_Logo.ico" "Landscape/src/Main/uiMain/Resources/RDT_Logo.ico"
git mv "Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx" "Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx"
git mv "Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx" "Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx"
git mv "Landscape/Rdt.Claim.UI/Properties/licenses.licx" "Landscape/src/Rdt.Claim.UI/Properties/licenses.licx"
git mv "Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx" "Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx"
git mv "Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx" "Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx"
git mv "Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx" "Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx"
git mv "Landscape/Rdt.Landscape.UI/Properties/licenses.licx" "Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx"
git mv "Landscape/Services/busServices/RDT_Logo.ico" "Landscape/src/Services/busServices/RDT_Logo.ico"
git mv "Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb" "Landscape/src/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb"
git mv "Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb" "Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb"
git mv "Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx" "Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx"
git mv "Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico" "Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico"
git mv "Landscape/Startup/uiStartup/Resources/RDT_Logo.ico" "Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico"
git mv "Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx" "Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx"
git mv "Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx" "Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx"
git mv "Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx" "Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx"
git mv "Landscape/Utils/uiUtils/My Project/licenses.licx" "Landscape/src/Utils/uiUtils/My Project/licenses.licx"
Anything else you want to add:
I can't help but think that this is related to changes made by Palantir:
https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1
I have tried to unstage these renames using "git restore --staged <file_name>" so I can then apply the correct "git mv" commands, but bizzarely, this then results in "git status" reporting a different, smaller set of mismatched names:
Incorrect name match: Landscape/Services/busServices/Service References/DataCollectorService/busServices.DataCollectorService.GetDataResponse.datasource -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
Incorrect name match: Landscape/Services/busServices/Service References/GlobalGatewayService/busServices.GlobalGatewayService.OrderResponse.datasource -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
Incorrect path match: Landscape/WebServices/WCFServices/Landscape WCF Service/wsWcfLandscapeServices/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
Incorrect path match: Landscape/WindowsServices/Rdt.BatchProcessingService/Rdt.BatchProcessingService.Integration/RDT_Logo.ico -> Landscape/src/Services/busServices/RDT_Logo.ico
Incorrect path match: Landscape/WindowsServices/Rdt.BatchProcessingService/Rdt.BatchProcessingService.Setup/UIContent/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
Incorrect path match: Landscape/WindowsServices/Rdt.BatchProcessingService/Rdt.BatchProcessingService/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
Incorrect path match: Landscape/_Tests/Rdt.BatchProcessingService.Tests/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
Incorrect path match: Landscape/uiStartup.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/WebServices/WCFServices/Landscape WCF Service/wsWcfLandscapeServices/RDT_Logo.ico
Incorrect path match: Landscape/wisWorkflow.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/WindowsServices/Rdt.BatchProcessingService/Rdt.BatchProcessingService.Integration/RDT_Logo.ico
Incorrect path match: Landscape/wsXmlLandscapeServices.Setup/Setup/Binary/RDT_Logo.ico -> Landscape/src/WindowsServices/Rdt.BatchProcessingService/Rdt.BatchProcessingService.Setup/UIContent/RDT_Logo.ico
Notice how "busServices.DataCollectorService.GetDataResponse.datasource" is now renamed to "Landscape.Net/pre-req.ps1", whereas the datasource file wasn't even listed previously?
Please help. I've been on this for a couple of weeks now and I'm running out of ideas.
Please review the rest of the bug report below.
You can delete any lines you don't wish to share.
[System Info]
git version:
git version 2.42.0.windows.2
cpu: x86_64
built from commit: 2f819d1670fff9a1818f63b6722e9959405378e3
sizeof-long: 4
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
uname: Windows 10.0 19044
compiler info: gnuc: 13.2
libc info: no libc information available
$SHELL (typically, interactive shell): <unset>
[Enabled Hooks]
<none>
Regards,
Jeremy Pridmore
Lead Solution Architect
________________________________
DISCLAIMER This email is confidential. It should only be read by those persons to whom it is addressed. RDT Ltd accept no liability for the consequences of any person acting, or refraining from acting, on any information contained within this e-mail or any attached documents prior to the receipt by those persons of subsequent written confirmation of that information. If you think this e-mail may not be intended for you, do not use, pass on or copy the transmission in any way. While all reasonable precautions are taken to minimise the risk of transmitting software viruses we advise you to carry out your own virus checks on any attachment to this message. We cannot accept liability for any loss or damage caused by software viruses.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-06 12:00 Git Rename Detection Bug Jeremy Pridmore
@ 2023-11-07 8:05 ` Elijah Newren
2023-11-10 11:28 ` Jeremy Pridmore
0 siblings, 1 reply; 13+ messages in thread
From: Elijah Newren @ 2023-11-07 8:05 UTC (permalink / raw)
To: Jeremy Pridmore; +Cc: git@vger.kernel.org, Paul Baumgartner
Hi,
On Mon, Nov 6, 2023 at 4:01 AM Jeremy Pridmore <jpridmore@rdt.co.uk> wrote:
>
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.
I think it might be worthwhile to point out a few facts about rename
handling in Git, as background information that might clarify a few
things about how Git's mental model seems to differ from yours:
* In git, renames are not tracked; they are detected (based on file
similarity of the commits being compared).
* So, when you run "git mv A B", there is no rename recorded. It's
basically the same as "git rm A", followed by creating B with the same
contents, followed by "git add B".
* The detection happens whenever an operation (diff, log -p, merge,
status, etc.) needs or wants to know about renames.
* In git, directory renames are detected _after_ normal renames, and
via amalgamation of the individual renames.
* As a corollary of the last item, the only way individual renames
can be affected by directory renames, is if the individual rename on
one side of history was into a directory that the other side of
history renamed away; in such a case, we apply an _extra_ rename to
move it into the new directory. But we don't "undo" individual
renames to make them fit the majority-determined directory rename or
anything like that.
(Also, if it matters, all of this is true of both `recursive` and
`ort` merge strategies, i.e. the old default merge backend and the new
one.)
> What did you do before the bug happened? (Steps to reproduce your issue)
> I have two GIT repositories (A and B). Both migrated from the same TFS server using git-tfs tool. I migrated code into A and made lots of changes, including moving 50,000+ files from folder "/Landscape" to "/Landscape/src". B contains the same code but with various other changes made since my original migration from TFS to A. All the files in B are still in the "/Landscape" folder. I recently needed to merge my changes from A to B, so I added A as a remote to B and then performed a number of cherry-picks from A to B, but got stuck when trying to cherry-pick the commit containing the results of moving all files into "/Landscape/src".
In case anyone else wants to dig into this, note that this problem
setup precludes directory rename detection being involved. Directory
rename detection has a rule where if the source directory wasn't
entirely removed on one side, then that directory was not renamed on
that side. Seems obvious, but the upshot of that rule is that a
directory cannot be renamed into a subdirectory of itself, because by
virtue of being a subdirectory that means its parent directory still
exists.
So, this is a problem where only regular rename detection (i.e. rename
detection of individual files) is going to be at play.
> What did you expect to happen? (Expected behavior)
> I expected the git rename detection to match all files in A "/Landscape" to files in B "/Landscape/src".
Are all files under "/Landscape" from the merge base commit
individually more similar to the counterpart under "/Landscape/src"
than to files under any other directory? If not, the expectation goes
against how rename detection has worked in git from the beginning.
> What happened instead? (Actual behavior)
> Although many files were matched successfully, git mismatched over two dozen similarly named files, e.g.
>
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx -> Landscape/src/Main/uiMain/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rates/uiRates/My Project/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Claim.UI/Properties/licenses.licx -> Landscape/src/Rates/uiRates/My Project/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Claim.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx -> Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb -> Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx -> Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx
> Incorrect path match: Landscape/Services/busServices/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
> Incorrect path match: Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx -> Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx -> Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/uiUtils/My Project/licenses.licx -> Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx
> Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Utils/uiUtils/My Project/licenses.licx
>
>
> What's different between what you expected and what actually happened?
>
> As you can see, although the filenames (and content) are the same,
The content is the same as well? So, these renames that you label as
incorrect are actually _exact_ renames -- and further, in most cases
they also have an identical basename for the file as well.
Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be; see the "Too many identical
alternatives? Pick one" code comment).
And this, too, is true of both the `recursve` and `ort` backends; no
change has been made to how exact renames are handled.
> In some cases, it seems that the catalyst has been git thinking that a file from B has been deleted from A, when in fact it has not actually been deleted at all.
>
> For example, the file Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 has not been deleted in A or B, therefore git should not have attempted to rename Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 to Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1, especially as it then attempts to rename Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 to Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 and so on.
Renamed files, from Git's perspective, always involve files that have
been deleted.
> Git status contains, for example:
> deleted by them: Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
This means that it wasn't sufficiently similar to any of the new
files...or that _other_ deleted files were more similar to the new
files and thus that they were paired up instead of this file, leaving
this file to simply be marked as deleted. (Or that other deleted
files were just as similar; tie-breakers are kinda random in such a
case.)
[...]
> Anything else you want to add:
> I can't help but think that this is related to changes made by Palantir:
> https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1
Curious. What makes you think it's related?
If there is some reason you think it's related, there's an easy way to
check -- just repeat the cherry-pick with the "-s recursive" flag to
use the old merge backend and compare the results.
I'll be somewhat surprised if it's related, though.
> I have tried to unstage these renames using "git restore --staged <file_name>" so I can then apply the correct "git mv" commands
Why? Just modify all the files to have the correct end results and then commit.
>, but bizzarely, this then results in "git status" reporting a different, smaller set of mismatched names:
As mentioned earlier, git does _not_ record renames. So, running the
correct "git mv" command doesn't really mean much. If you use
completely "incorrect" git-mv commands, but then manually tweak files
until they have the correct results, then what's recorded is exactly
the same as if you had used the "correct" git-mv commands.
Further, when you run "git status", it can't access any renames you
did because that information isn't recorded anywhere. It instead
recomputes renames on the fly. And it does so each and every time you
run "git status", even if you make no changes between two invocations.
In fact, from this you can probably also deduce that there are other
ways to affect what will be shown as renames, when you have multiple
files similar to any given source file. In particular, you can cause
a different pairing modifying one of the similar files enough that it
becomes the most similar to the source file, or so that it becomes no
longer the most similar to the source file. However, what "git
status" reports for renames is irrelevant, since that info won't be
recorded in the commit. Renames are never recorded. Anywhere.
In fact, you can even run "git status --no-renames" to just see the
old filenames that were removed and the new ones that were added
without having all the files be paired up as renames.
Hope that helps,
Elijah
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-07 8:05 ` Elijah Newren
@ 2023-11-10 11:28 ` Jeremy Pridmore
2023-11-11 5:46 ` Elijah Newren
0 siblings, 1 reply; 13+ messages in thread
From: Jeremy Pridmore @ 2023-11-10 11:28 UTC (permalink / raw)
To: Elijah Newren; +Cc: git@vger.kernel.org, Paul Baumgartner
Hi Elijah,
Many thanks for your reply, the detail is much appreciated. I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag.
I think the issue I'm encountering is described by what you say here:
"Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be)"
That is close to the behaviour I'm seeing. As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches:
(I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text)
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
Given git compares both the content and the directory\filenames, and as the repositories have unrelated histories, the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base. That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different. Any ideas?
I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head -- <file>" or a "git rm <file>" based upon which set the file is in and whether it is in another set or not. This has worked really well and helped me through the large changeset with 3k conflicts.
As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides), an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference. The difference value would be the number of differences in folder names, e.g.
deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
(path\name difference = 2)
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
(path\name difference = 1)
added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
(path\name difference = 2)
So, given the above, git would chose the second "added in both" entry.
Food for thought? Happy to discuss the idea further.
Regards,
Jeremy Pridmore
Lead Solution Architect
From: Elijah Newren <newren@gmail.com>
Sent: 07 November 2023 08:05
To: Jeremy Pridmore <jpridmore@rdt.co.uk>
Cc: git@vger.kernel.org <git@vger.kernel.org>; Paul Baumgartner <pbaumgartner@rdt.co.uk>
Subject: Re: Git Rename Detection Bug
[You don't often get email from newren@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
Caution: This email originated from outside of the organisation. Please treat any attachments or links with caution. If in doubt please contact IT
Hi,
On Mon, Nov 6, 2023 at 4:01 AM Jeremy Pridmore <jpridmore@rdt.co.uk> wrote:
>
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.
I think it might be worthwhile to point out a few facts about rename
handling in Git, as background information that might clarify a few
things about how Git's mental model seems to differ from yours:
* In git, renames are not tracked; they are detected (based on file
similarity of the commits being compared).
* So, when you run "git mv A B", there is no rename recorded. It's
basically the same as "git rm A", followed by creating B with the same
contents, followed by "git add B".
* The detection happens whenever an operation (diff, log -p, merge,
status, etc.) needs or wants to know about renames.
* In git, directory renames are detected _after_ normal renames, and
via amalgamation of the individual renames.
* As a corollary of the last item, the only way individual renames
can be affected by directory renames, is if the individual rename on
one side of history was into a directory that the other side of
history renamed away; in such a case, we apply an _extra_ rename to
move it into the new directory. But we don't "undo" individual
renames to make them fit the majority-determined directory rename or
anything like that.
(Also, if it matters, all of this is true of both `recursive` and
`ort` merge strategies, i.e. the old default merge backend and the new
one.)
> What did you do before the bug happened? (Steps to reproduce your issue)
> I have two GIT repositories (A and B). Both migrated from the same TFS server using git-tfs tool. I migrated code into A and made lots of changes, including moving 50,000+ files from folder "/Landscape" to "/Landscape/src". B contains the same code but with various other changes made since my original migration from TFS to A. All the files in B are still in the "/Landscape" folder. I recently needed to merge my changes from A to B, so I added A as a remote to B and then performed a number of cherry-picks from A to B, but got stuck when trying to cherry-pick the commit containing the results of moving all files into "/Landscape/src".
In case anyone else wants to dig into this, note that this problem
setup precludes directory rename detection being involved. Directory
rename detection has a rule where if the source directory wasn't
entirely removed on one side, then that directory was not renamed on
that side. Seems obvious, but the upshot of that rule is that a
directory cannot be renamed into a subdirectory of itself, because by
virtue of being a subdirectory that means its parent directory still
exists.
So, this is a problem where only regular rename detection (i.e. rename
detection of individual files) is going to be at play.
> What did you expect to happen? (Expected behavior)
> I expected the git rename detection to match all files in A "/Landscape" to files in B "/Landscape/src".
Are all files under "/Landscape" from the merge base commit
individually more similar to the counterpart under "/Landscape/src"
than to files under any other directory? If not, the expectation goes
against how rename detection has worked in git from the beginning.
> What happened instead? (Actual behavior)
> Although many files were matched successfully, git mismatched over two dozen similarly named files, e.g.
>
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx -> Landscape/src/Main/uiMain/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rates/uiRates/My Project/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Claim.UI/Properties/licenses.licx -> Landscape/src/Rates/uiRates/My Project/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Claim.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx -> Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb -> Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx -> Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx
> Incorrect path match: Landscape/Services/busServices/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
> Incorrect path match: Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx -> Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx -> Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/uiUtils/My Project/licenses.licx -> Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx
> Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Utils/uiUtils/My Project/licenses.licx
>
>
> What's different between what you expected and what actually happened?
>
> As you can see, although the filenames (and content) are the same,
The content is the same as well? So, these renames that you label as
incorrect are actually _exact_ renames -- and further, in most cases
they also have an identical basename for the file as well.
Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be; see the "Too many identical
alternatives? Pick one" code comment).
And this, too, is true of both the `recursve` and `ort` backends; no
change has been made to how exact renames are handled.
> In some cases, it seems that the catalyst has been git thinking that a file from B has been deleted from A, when in fact it has not actually been deleted at all.
>
> For example, the file Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 has not been deleted in A or B, therefore git should not have attempted to rename Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 to Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1, especially as it then attempts to rename Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 to Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 and so on.
Renamed files, from Git's perspective, always involve files that have
been deleted.
> Git status contains, for example:
> deleted by them: Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
This means that it wasn't sufficiently similar to any of the new
files...or that _other_ deleted files were more similar to the new
files and thus that they were paired up instead of this file, leaving
this file to simply be marked as deleted. (Or that other deleted
files were just as similar; tie-breakers are kinda random in such a
case.)
[...]
> Anything else you want to add:
> I can't help but think that this is related to changes made by Palantir:
> https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1
Curious. What makes you think it's related?
If there is some reason you think it's related, there's an easy way to
check -- just repeat the cherry-pick with the "-s recursive" flag to
use the old merge backend and compare the results.
I'll be somewhat surprised if it's related, though.
> I have tried to unstage these renames using "git restore --staged <file_name>" so I can then apply the correct "git mv" commands
Why? Just modify all the files to have the correct end results and then commit.
>, but bizzarely, this then results in "git status" reporting a different, smaller set of mismatched names:
As mentioned earlier, git does _not_ record renames. So, running the
correct "git mv" command doesn't really mean much. If you use
completely "incorrect" git-mv commands, but then manually tweak files
until they have the correct results, then what's recorded is exactly
the same as if you had used the "correct" git-mv commands.
Further, when you run "git status", it can't access any renames you
did because that information isn't recorded anywhere. It instead
recomputes renames on the fly. And it does so each and every time you
run "git status", even if you make no changes between two invocations.
In fact, from this you can probably also deduce that there are other
ways to affect what will be shown as renames, when you have multiple
files similar to any given source file. In particular, you can cause
a different pairing modifying one of the similar files enough that it
becomes the most similar to the source file, or so that it becomes no
longer the most similar to the source file. However, what "git
status" reports for renames is irrelevant, since that info won't be
recorded in the commit. Renames are never recorded. Anywhere.
In fact, you can even run "git status --no-renames" to just see the
old filenames that were removed and the new ones that were added
without having all the files be paired up as renames.
Hope that helps,
Elijah
________________________________
DISCLAIMER This email is confidential. It should only be read by those persons to whom it is addressed. RDT Ltd accept no liability for the consequences of any person acting, or refraining from acting, on any information contained within this e-mail or any attached documents prior to the receipt by those persons of subsequent written confirmation of that information. If you think this e-mail may not be intended for you, do not use, pass on or copy the transmission in any way. While all reasonable precautions are taken to minimise the risk of transmitting software viruses we advise you to carry out your own virus checks on any attachment to this message. We cannot accept liability for any loss or damage caused by software viruses.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-10 11:28 ` Jeremy Pridmore
@ 2023-11-11 5:46 ` Elijah Newren
2023-11-11 11:08 ` Philip Oakley
2023-11-15 16:51 ` Philip Oakley
0 siblings, 2 replies; 13+ messages in thread
From: Elijah Newren @ 2023-11-11 5:46 UTC (permalink / raw)
To: Jeremy Pridmore; +Cc: git@vger.kernel.org, Paul Baumgartner
Hi Jeremy,
On Fri, Nov 10, 2023 at 3:28 AM Jeremy Pridmore <jpridmore@rdt.co.uk> wrote:
>
> Hi Elijah,
>
> Many thanks for your reply, the detail is much appreciated. I was aware, from recently read articles, that git doesn't record renames as such, hence my investigations into the rename detection, but I also found some interesting points in your email, such as the "git status --no-renames" flag.
The fact that you were trying to "undo" renames and "redo the correct
ones" suggested there's something you still didn't understand about
rename detection, though. The fact that you were worried that "git
status" showing the "wrong renames" and the implication that it needed
to show the right ones before you committed also suggested there's
something that was still not being understood. Likewise, the fact
that the renames reported by "git status" change even when you haven't
renamed files further but have simply made additional changes to the
contents of some files suggests there is something that was still not
being understood about the phrase that "renames are detected rather
than recorded".
While renames are used in the merge algorithm in order to know what
files to match up for three-way content merges (so that changes from
both sides to the "same" file can be incorporated into the end
result), once the merge stops to ask you to resolve conflicts, the
detection of renames doesn't matter beyond that point. The fact that
renames aren't recorded means whatever renames are shown is only there
as a guide to help you understand. All that matters to Git is that
all files have the intended content. If some of the files have the
wrong content, then by all means go and correct it. If "git mv"
commands help you do that, great. If simply editing all files of
interest (including adding and deleting files) until they match the
expected contents works, that's fine too. Once all files have the
correct content, commit it. Git will have no way whatsoever of
knowing which of those two routes you picked, and won't behave any
differently in the future based on which way you ended up with the
right contents in each file. You could have even done the extreme of
"git merge -s ours --no-commit ${OTHER_BRANCH}" (merge the other
branch but completely ignore *every* change the other side made and
then stop for user to make further changes), followed by opening and
editing every relevant file to include changes made by the other side,
deleting and adding files as relevant, to end up with the right
contents in all files and then committed, and Git wouldn't know the
difference and no one who pulled your merge commit would be able to
tell the difference either.
> I think the issue I'm encountering is described by what you say here:
> "Exact renames are detected first, before any other method of rename
> detection is employed, and other than giving a preference to files
> with the same basename, if there are multiple choices it just picks
> one (what I'd call at random, though technically based on what the
> internal processing order happens to be)"
>
> That is close to the behaviour I'm seeing. As I mentioned, git seems to think a file has been deleted and then as it continues to detect renames, it's as if it is going through lists of "Local-Base" and "Base-Remote" changes trying to match them up, but the directories of the files being matched are "offset", as highlighted by this list of mismatches:
>
> (I'd put the paths in a table for easier analysis, but for some reason the emails need to be plain text)
> > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> > Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> > Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> > Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> > Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> > Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> > Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
>
> Given git compares both the content and the directory\filenames,
For exact renames, it only will look at filenames if there are
multiple 100% matches. If one match is 100%, and the other 99.9999%
match, the 100% match is taken.
(Since we think that exact renames seem to be describing your problem,
there's no point discussing the inexact rename handling.)
> and as the repositories have unrelated histories,
What do you mean by this? If there's no common point in history (i.e.
no merge base), rename detection doesn't even get invoked, so you must
mean something different by this phrase than what I would normally
take it to mean.
Any chance, if you're still in the middle of the merge, that you could run
git merge-base --all HEAD MERGE_HEAD
and report the output? It'll be commit hashes that don't mean a lot
to anyone who doesn't have your repository, but I'm interested in how
many commit hashes it responds with (0, 1, or more than 1).
> the "Base" file is going to be empty, therefore, even if Local and Remote are identical, they are both 100% different to Base.
I assume by "Base" you are referring to the commit in history common
to both branches being merged (i.e. what we call the "merge base"), or
to the relevant file from that commit. Is that right?
Even if I'm right about that, this comment has me lost. Could you
clarify it, with one particular example? For example (I'm making
stuff up since I'm not familiar with your repo, but showing how to
clarify for a given set of path(s)):
* In this repository, with the renames detected above,
Landscape/Main/somefile.licx is an empty file in the merge base.
* On my local side, I renamed Landscape/Main/somefile.licx to
Landscape/src/Main/somefile.licx and populated it with some content
(with hash A). There is also another new file on the local side (at
least new relative to the merge base) named
Landscape/src/Other/somefile.licx that happens to have hash A.
* On the remote side, Landscape/Main/somefile.licx was left in
place but populated with some content (with hash A).
* Git is detecting the rename as Landscape/Main/somefile.licx ->
Landscape/src/Other/somefile.licx, when I wanted it to detect a rename
to Landscape/src/Main/somefile.licx.
I'm pretty sure this example is not what you're seeing, even if
components of it are, because the empty file thing is impossible with
the rest of the story.
> That given, I'm not sure why git would state that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx and Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 are a "both added" conflict given their file names and paths are completely different. Any ideas?
The fact that Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
is marked as "both added" means that this file
* did not exist in the merge base
* did exist on your local side
* did exist on the remote side
* the version of this file on the local and remote sides do not match
Basically, both sides added a new file and they are not the same, so
you have a conflict.
The answer for Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1,
is an identical set of bullet points; it was a file added by both
sides of history that did not exist in the merge base, and the sides
are different so you have a conflict in this file too.
> I wrote a script to resolve the conflicts best I can which categorises the files into sets according to the file status (i.e. "added by them", "added by us" etc), and then either does a "git checkout head --
<file>" or a "git rm <file>" based upon which set the file is in and
whether it is in another set or not. This has worked really well and
helped me through the large changeset with 3k conflicts.
So, you're simply throwing away the changes made by the remote side?
I mean, that's one way to merge, and it might be right in your case,
but to someone unfamiliar with your repo it smells like a hack to just
ignore conflicts and throw away other people's changes in order to
complete the merge.
> As git only needs to try and match files in the "deleted by us" and "deleted by them" sets (although including the "deleted in both" set would allow matching renames/moves on both sides),
"deleted by us" and "deleted by them" means no rename was detected for
the file (at least on the side that the delete is reported for). So,
a "deleted in both" only happens when neither side detects a rename,
and if the file isn't renamed on either side and both removed it, then
there's no conflict -- just delete the file.
> an idea for a potential improvement to the matching algorithm (where you say there's a comment "too many alternatives, pick one") could be to compute a "difference value" for the path\filename of those files in one of the other sets (i.e. "added by us", "added by them" or "added in both"), and chose a potential rename based upon the smallest calculated difference. The difference value would be the number of differences in folder names, e.g.
>
> deleted in both: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
>
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> (path\name difference = 2)
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> (path\name difference = 1)
> added in both: Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> (path\name difference = 2)
>
> So, given the above, git would chose the second "added in both" entry.
Three problems here:
* If git were to report "deleted in both" for one path and "added in
both" in another as you suggest, that would only be because the files
are dissimilar. Not only would they not be an exact match, their
contents would have less than 50% similarity. Thus augmenting exact
matching like this just wouldn't work, because the files aren't
matches at all. The only correct thing to do would be to not report
the "deleted in both" because that file is not a rename of anything
else and has simply been deleted by both sides.
* filename similarity is extraordinarily expensive compared to exact
renames, and if not carefully handled, can sometimes rival the cost of
file content similarity computations given our spanhash
representations. Exact renames are tasked with finding renames even
if they are known to not be relevant, simply because exact renames can
do so very quickly. If we change that, we throw a monkey wrench in
our performance handling elsewhere and have to rethink a number of
other things.
* While I was optimizing rename detection while investigating the new
merge backend, I actually attempted a few versions of filename
similarity looking for something that was predictive and useful.
While I think the idea was potentially helpful for some repositories,
it has a significant risk of hurting merges in other repositories.
While what I tried was far from an exhaustive checking of all filename
similarity ideas, I came away doubting there was a useful heuristic
other than exact matches of basenames (i.e. exact matches of
everything in the filename after the final slash). If someone else
wants to try more ideas, and do a study on various existing
repositories, they can go ahead, but I suspect most work here is going
to end up at a dead end and I'm unwilling to put further time of my
own into it.
> Food for thought? Happy to discuss the idea further.
So, I've occasionally seen repositories that have something like the following:
* base version: directory named library-x-1.7/
* stable branch: many changes to files under library-x-1.7/
* development branch: library-x-1.7/ no longer exists. However,
library-x-1.8/ and library-x-1.9/ both do. Both are obviously
"similar" to library-x-1.7/ but both have many changes.
What happens when someone tries to merge the stable branch into the
development branch? There are two obvious guesses:
* Changes from library-x-1.7/ on the stable branch are applied to
library-x-1.8/
* Changes from library-x-1.7/ on the stable branch are applied to
library-x-1.9/
Either answer can be suboptimal depending on your viewpoint.
(Applying the changes to both directories would also have other
suboptimal effects even if it might sound right based for this exact
problem as I've worded it. But git doesn't do copy detection as part
of merges so Git won't choose this third choice ever.) So, which of
those two happens? Well, since renames are detected based upon file
similarity, the changes will go to whatever file is most similar.
What does that mean? It means that both answers above are wrong.
Instead:
* Some of the changes from library-x-1.7/ on the stable branch are
applied to files from library-x-1.8/, while others are applied to
files from library-x-1.9/, and to determine which files from which
directory are matched up is an individual file choice based on which
file in library-x-1.8/ or library-x-1.9/ is most similar to the file
from the base version in library-x-1.7/.
This answer is clearly worse than either of the two above, and is
virtually never what people would want. But it's also fundamental to
the idea of matching up files and detecting renames individually based
upon file similarity. It's part of both the old and new merge
backends in Git, because both were based upon this fundamental idea.
So, if you have this kind of situation, or even something like it
where files from one old directory could match files from multiple
other directories, it's just something you have to be aware of.
All that said, here's something that might help:
*** Hack to workaround rename detection in special cases where there
are directories of multiple possible matches ***
1. Get back to a clean slate from before the merge.
$ git merge --abort
OR
$ cd ${OTHER_DIRECTORY} && git clone ${url} && cd ${REPONAME} && git
checkout ${relevant_branch}
2. Temporarily undo your local renames and make a temporary commit
$ git mv Landscape/src/* Landscape/
$ git commit -m "TEMPORARY COMMIT"
3. Perform the merge (files will be in Landscape/ instead of
Landscape/src/ for now). Don't worry, we'll fix the merge commit
later.
$ git merge ${REMOTE_BRANCH}
[...fix up any conflicts and commit, if needed...]
4. Rename Landscape/ back to Landscape/src/ and make another (temporary) commit.
5. Create a corrected merge commit with the current tree, the commit
message from your merge commit, and the correct parents:
$ git commit-tree -p HEAD~3 -p HEAD~1^2 -F $(git log -1 --format=%B
HEAD~1) HEAD^{tree}
[...the above command will print out a new commit id for a corrected
merge commit. You can inspect it first, but we just need to pass this
to reset --hard...]
6. Reset your branch to this corrected merge commit (which will orphan
the temporary commits from steps 2, 3, and 4 so they can later be
garbage collected)
$ git reset --hard [...output of commit-tree command...]
Hope that helps,
Elijah
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-11 5:46 ` Elijah Newren
@ 2023-11-11 11:08 ` Philip Oakley
2023-11-11 15:13 ` Elijah Newren
2023-11-15 16:51 ` Philip Oakley
1 sibling, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2023-11-11 11:08 UTC (permalink / raw)
To: Elijah Newren, Jeremy Pridmore; +Cc: git@vger.kernel.org, Paul Baumgartner
Hi all,
On 11/11/2023 05:46, Elijah Newren wrote:
> The fact that you were trying to "undo" renames and "redo the correct
> ones" suggested there's something you still didn't understand about
> rename detection, though.
Could I suggest that we are missing a piece of terminology, to wit,
BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
history simplification (based on a tree's pathspec, most commonly a
commit's top level path).
File rename, at it's most basic, is when the blob associated with that
changed path is identical, i.e. BLOBSAME. There is no need to 'record'
the action of renaming, moving or whatever, the content sameness is
right there, in plain sight, as an identical blob name. After that
(files with slight variations) it is a load of heuristics, but starting
with BLOBSAME we see how easy the basic rename detection is, and why
renames (and de-dup) don't need recording.
The heuristics of 'rename with small change' is trickier, but for a
basic understanding, starting at BLOBSAME (and TREESAME for directory
renames) should make it easier to grasp the concepts.
--
Philip
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-11 11:08 ` Philip Oakley
@ 2023-11-11 15:13 ` Elijah Newren
2023-11-12 23:09 ` Junio C Hamano
2023-11-15 14:36 ` Philip Oakley
0 siblings, 2 replies; 13+ messages in thread
From: Elijah Newren @ 2023-11-11 15:13 UTC (permalink / raw)
To: Philip Oakley; +Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
Hi,
On Sat, Nov 11, 2023 at 3:08 AM Philip Oakley <philipoakley@iee.email> wrote:
>
> Hi all,
>
> On 11/11/2023 05:46, Elijah Newren wrote:
> > The fact that you were trying to "undo" renames and "redo the correct
> > ones" suggested there's something you still didn't understand about
> > rename detection, though.
>
>
> Could I suggest that we are missing a piece of terminology, to wit,
> BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
> history simplification (based on a tree's pathspec, most commonly a
> commit's top level path).
We could add it, but I'm not sure how it helps. We already had 'exact
rename' which seems to fit the bill as well, and 'blob' is something
someone new to Git is unlikely to know.
Perhaps it's useful in some other context, though?
> File rename, at it's most basic, is when the blob associated with that
> changed path is identical, i.e. BLOBSAME. There is no need to 'record'
> the action of renaming, moving or whatever, the content sameness is
> right there, in plain sight, as an identical blob name. After that
> (files with slight variations) it is a load of heuristics, but starting
> with BLOBSAME we see how easy the basic rename detection is, and why
> renames (and de-dup) don't need recording.
This is incorrect. Let's say you have a file foo:
* base version: foo has hash A
* our version: foo has been renamed to bar, but bar still has hash A
* their version: foo has been modified; it now has hash B
The foo->bar is an exact rename (or they are BLOBSAME if you prefer),
but the renaming/moving/whatever is a critical piece of information
because the changes to foo in 'their' version need to be applied to
bar to get the correct end results.
I do not know if in Jeremy's case foo has been modified on the
unrenamed side. But the following hypothetical is exactly the type of
problem Jeremy is hitting: what should happen when 'our' version has
both a new 'bar' and a new 'baz' file that each have hash A? In that
case, to which one was foo renamed? It's inherently ambiguous.
> The heuristics of 'rename with small change' is trickier, but for a
> basic understanding, starting at BLOBSAME (and TREESAME for directory
> renames) should make it easier to grasp the concepts.
Interesting; TREESAME isn't used within directory rename detection
currently; it is only used currently when two (or three) trees with
the same name are TREESAME, in order to potentially avoid recursing
into the tree. But even then, having two trees with the same name be
TREESAME isn't enough on its own to avoid recursing into that tree,
because the other side could have added files within the same-named
tree and we need to know about those added files because they could be
part of renames involving other files outside that tree. There would
probably be similar challenges to attempting to apply the concept of
TREESAME to directory rename detection to two trees of different
names, but it's at least an interesting idea. Hmm....
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-11 15:13 ` Elijah Newren
@ 2023-11-12 23:09 ` Junio C Hamano
2023-11-15 15:35 ` Philip Oakley
2023-11-15 14:36 ` Philip Oakley
1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2023-11-12 23:09 UTC (permalink / raw)
To: Elijah Newren
Cc: Philip Oakley, Jeremy Pridmore, git@vger.kernel.org,
Paul Baumgartner
Elijah Newren <newren@gmail.com> writes:
>> Could I suggest that we are missing a piece of terminology, to wit,
>> BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
>> history simplification (based on a tree's pathspec, most commonly a
>> commit's top level path).
>
> We could add it, but I'm not sure how it helps. We already had 'exact
> rename' which seems to fit the bill as well, and 'blob' is something
> someone new to Git is unlikely to know.
Also, as Philip said, TREESAME is a concept foreign to rename
detection codepath. It is a property of a commit (not a tree) and
tells us if it has the same tree object as its relevant parents (in
which case it can be simplified away if it is a merge). I do not
mind rename codepath using a jargon (or two) to express "in trees A
and B, this subtree of A records the same tree object as a subtree
of B at a different path (i.e., the contents of these two subtrees
at different paths are the same)" but the word used to express that
should not be TREESAME to avoid confusion. And the other word to
express "this path in tree A records a blob object that is identical
to this other path in tree B" should not be BLOBSAME, as the word
strongly hints it is somehow related to TREESAME.
Thanks.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-11 15:13 ` Elijah Newren
2023-11-12 23:09 ` Junio C Hamano
@ 2023-11-15 14:36 ` Philip Oakley
2023-11-16 6:26 ` Elijah Newren
1 sibling, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2023-11-15 14:36 UTC (permalink / raw)
To: Elijah Newren; +Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
Hi Elijah,
sorry for the delay in replying.
On 11/11/2023 15:13, Elijah Newren wrote:
> Hi,
>
> On Sat, Nov 11, 2023 at 3:08 AM Philip Oakley <philipoakley@iee.email> wrote:
>>
>> Hi all,
>>
>> On 11/11/2023 05:46, Elijah Newren wrote:
>>> The fact that you were trying to "undo" renames and "redo the correct
>>> ones" suggested there's something you still didn't understand about
>>> rename detection, though.
>>
>>
>> Could I suggest that we are missing a piece of terminology, to wit,
>> BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
>> history simplification (based on a tree's pathspec, most commonly a
>> commit's top level path).
>
> We could add it, but I'm not sure how it helps. We already had 'exact
> rename' which seems to fit the bill as well,
My point was that we already had the confusion of mental models, with
both sides essentially thinking they had an "exact rename", hence my
thought was to add a rather distinct technical name which reflected the
Git mind-shift. Without something to bring folks up short they'll
continue, erroneously, with their prior mental models.
and 'blob' is something
> someone new to Git is unlikely to know.
I'd agree that BLOBSAME is new, but we should be proactive in ensuring
folk do have the mind shift from the old centralised VCS misunderstandings.
>
> Perhaps it's useful in some other context, though?
>
>> File rename, at it's most basic, is when the blob associated with that
>> changed path is identical, i.e. BLOBSAME. There is no need to 'record'
>> the action of renaming, moving or whatever, the content sameness is
>> right there, in plain sight, as an identical blob name. After that
>> (files with slight variations) it is a load of heuristics, but starting
>> with BLOBSAME we see how easy the basic rename detection is, and why
>> renames (and de-dup) don't need recording.
>
> This is incorrect. Let's say you have a file foo:
> * base version: foo has hash A
> * our version: foo has been renamed to bar, but bar still has hash A
> * their version: foo has been modified; it now has hash B
>
> The foo->bar is an exact rename (or they are BLOBSAME if you prefer),
> but the renaming/moving/whatever is a critical piece of information
> because the changes to foo in 'their' version need to be applied to
> bar to get the correct end results.
Isn't that what I thought I'd said?
Hash A = Hash A => identical content;
Hash A != B => different content.
>
> I do not know if in Jeremy's case foo has been modified on the
> unrenamed side. But the following hypothetical is exactly the type of
> problem Jeremy is hitting: what should happen when 'our' version has
> both a new 'bar' and a new 'baz' file that each have hash A? In that
> case, to which one was foo renamed? It's inherently ambiguous.
true, the terminology hasn't kept up with the methodology for blob
content, and the independent meta-data. In previous 'ort' discussions I
didn't really understand what the '1/2' renames (and other
nomenclatures) really meant with respect to paths, filenames, content
and the ours / theirs / base distinctions.
>
>> The heuristics of 'rename with small change' is trickier, but for a
>> basic understanding, starting at BLOBSAME (and TREESAME for directory
>> renames) should make it easier to grasp the concepts.
>
> Interesting; TREESAME isn't used within directory rename detection
> currently; it is only used currently when two (or three) trees with
> the same name are TREESAME, in order to potentially avoid recursing
> into the tree. But even then, having two trees with the same name be
> TREESAME isn't enough on its own to avoid recursing into that tree,
> because the other side could have added files within the same-named
> tree and we need to know about those added files because they could be
> part of renames involving other files outside that tree.
definitely easy to get confused on these cases...
> There would
> probably be similar challenges to attempting to apply the concept of
> TREESAME to directory rename detection to two trees of different
> names, but it's at least an interesting idea. Hmm....
>
Thanks for the insights.
Philip
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-12 23:09 ` Junio C Hamano
@ 2023-11-15 15:35 ` Philip Oakley
0 siblings, 0 replies; 13+ messages in thread
From: Philip Oakley @ 2023-11-15 15:35 UTC (permalink / raw)
To: Junio C Hamano, Elijah Newren
Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
On 12/11/2023 23:09, Junio C Hamano wrote:
> Elijah Newren <newren@gmail.com> writes:
>
>>> Could I suggest that we are missing a piece of terminology, to wit,
>>> BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
>>> history simplification (based on a tree's pathspec, most commonly a
>>> commit's top level path).
>>
>> We could add it, but I'm not sure how it helps. We already had 'exact
>> rename' which seems to fit the bill as well, and 'blob' is something
>> someone new to Git is unlikely to know.
>
> Also, as Philip said, TREESAME is a concept foreign to rename
> detection codepath. It is a property of a commit (not a tree)
That (property of a commit?) wasn't really obvious to me.
I'd always thought of it as a comparison between two trees, commonly
those associated with two commits. Though it could also be thought of as
the operation "TREESAME to" that binds in the second tree.
and
> tells us if it has the same tree object as its relevant parents (in
> which case it can be simplified away if it is a merge). I do not
> mind rename codepath using a jargon (or two) to express "in trees A
> and B, this subtree of A records the same tree object as a subtree
> of B at a different path (i.e., the contents of these two subtrees
> at different paths are the same)" but the word used to express that
> should not be TREESAME to avoid confusion.
Maybe it's that the explanation of TREESAME in
rev-list-options.txt#L419-L436 [1] has a similar set of confusions about
how subtrees are considered and the path v filename confusions.
And the other word to
> express "this path in tree A records a blob object that is identical
> to this other path in tree B" should not be BLOBSAME, as the word
> strongly hints it is somehow related to TREESAME.
Yep. Naming is hard.
>
> Thanks.
>
>
Philip
[1]
https://github.com/git/git/blob/master/Documentation/rev-list-options.txt#L419-L436
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-11 5:46 ` Elijah Newren
2023-11-11 11:08 ` Philip Oakley
@ 2023-11-15 16:51 ` Philip Oakley
2023-12-24 7:46 ` Elijah Newren
1 sibling, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2023-11-15 16:51 UTC (permalink / raw)
To: Elijah Newren, Jeremy Pridmore; +Cc: git@vger.kernel.org, Paul Baumgartner
Hi Elijah,
On 11/11/2023 05:46, Elijah Newren wrote:
> * filename similarity is extraordinarily expensive compared to exact
> renames, and if not carefully handled, can sometimes rival the cost of
> file content similarity computations given our spanhash
> representations.
I've not heard of spanhash representation before. Any references or
further reading?
> Exact renames are tasked with finding renames even
> if they are known to not be relevant, simply because exact renames can
> do so very quickly. If we change that, we throw a monkey wrench in
> our performance handling elsewhere and have to rethink a number of
> other things.
--
Philip
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-15 14:36 ` Philip Oakley
@ 2023-11-16 6:26 ` Elijah Newren
0 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2023-11-16 6:26 UTC (permalink / raw)
To: Philip Oakley; +Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
Hi Philip,
On Wed, Nov 15, 2023 at 6:36 AM Philip Oakley <philipoakley@iee.email> wrote:
>
> Hi Elijah,
> sorry for the delay in replying.
It was only a few days; no need to apologize.
>
> On 11/11/2023 15:13, Elijah Newren wrote:
> > Hi,
> >
> > On Sat, Nov 11, 2023 at 3:08 AM Philip Oakley <philipoakley@iee.email> wrote:
> >>
> >> Hi all,
> >>
> >> On 11/11/2023 05:46, Elijah Newren wrote:
> >>> The fact that you were trying to "undo" renames and "redo the correct
> >>> ones" suggested there's something you still didn't understand about
> >>> rename detection, though.
> >>
> >>
> >> Could I suggest that we are missing a piece of terminology, to wit,
> >> BLOBSAME. It's a compatriot to TREESAME, as used in `git log` for
> >> history simplification (based on a tree's pathspec, most commonly a
> >> commit's top level path).
> >
> > We could add it, but I'm not sure how it helps. We already had 'exact
> > rename' which seems to fit the bill as well,
>
> My point was that we already had the confusion of mental models, with
> both sides essentially thinking they had an "exact rename", hence my
> thought was to add a rather distinct technical name which reflected the
> Git mind-shift. Without something to bring folks up short they'll
> continue, erroneously, with their prior mental models.
Maybe I'm missing the other mental model you're alluding to? The
mental model problems I suspected Jeremy had did not appear to hinge
on exact renames; the fact that Jeremy was trying to "undo" renames
and "redo the correct ones" suggested to me that it's an issue with
understanding detecting vs. tracking renames, something for which it
doesn't matter whether the renames are exact or inexact. (Also, the
fact that he thought it was a bug that the detected renames could
change in `git status` output by merely editing files, suggested also
a detected vs. tracked rename misunderstanding -- and in that case,
there's virtually no chance that an explanation of exact renames or
BLOBSAME or whatever could help since the editing of files in question
means that they aren't going to have the same contents anymore.)
> and 'blob' is something
> > someone new to Git is unlikely to know.
>
> I'd agree that BLOBSAME is new, but we should be proactive in ensuring
> folk do have the mind shift from the old centralised VCS misunderstandings.
I agree it's good to help folks make the mind shift, I'm just not
following how BLOBSAME or any other alternative for "exact rename"
could help in this particular case.
Maybe it'd help to understand why I brought up "exact renames" in the
first place? When most people see "rename detection bug" (as in the
subject of this thread), they're likely to assume some heuristic went
haywire. When Jeremy pointed out that the content of the "mismatched"
file was identical to the source file of interest, I thought it worth
pointing out so anyone else watching the thread would notice. It
means all the normal heuristics involved in inexact renames cannot be
the source of this particular problem. And, in particular, that the
changes that ort brought aren't relevant to this particular problem
either.
> > Perhaps it's useful in some other context, though?
> >
> >> File rename, at it's most basic, is when the blob associated with that
> >> changed path is identical, i.e. BLOBSAME. There is no need to 'record'
> >> the action of renaming, moving or whatever, the content sameness is
> >> right there, in plain sight, as an identical blob name. After that
> >> (files with slight variations) it is a load of heuristics, but starting
> >> with BLOBSAME we see how easy the basic rename detection is, and why
> >> renames (and de-dup) don't need recording.
> >
> > This is incorrect. Let's say you have a file foo:
> > * base version: foo has hash A
> > * our version: foo has been renamed to bar, but bar still has hash A
> > * their version: foo has been modified; it now has hash B
> >
> > The foo->bar is an exact rename (or they are BLOBSAME if you prefer),
> > but the renaming/moving/whatever is a critical piece of information
> > because the changes to foo in 'their' version need to be applied to
> > bar to get the correct end results.
>
> Isn't that what I thought I'd said?
> Hash A = Hash A => identical content;
> Hash A != B => different content.
My apologies, I misread what you had written. I had somehow read you
as saying that the rename wasn't needed, but you didn't say that at
all.
Now that I've re-read what you wrote, and hopefully understand better,
I did notice something else that might be worth responding to, though.
Let me re-quote a piece of what you said and respond to it:
> >> ...the content sameness is
> >> right there, in plain sight, as an identical blob name. After that
> >> (files with slight variations) it is a load of heuristics, ...
This suggests there aren't heuristics involved when we have identical
blob names. That's not quite accurate, though. When there are
multiple identical blob names that a given source file could be paired
with (e.g. old/foo.txt deleted, and all three of A/foo.txt and
B/bar.txt and C/foo.txt all added on the same side of history and all
with identical contents to the old/foo.txt), we give a higher score to
files that share the same basename (so A/foo.txt and C/foo.txt would
be picked as rename targets over B/bar.txt). If we have multiple
files with an identical blob name and the same file basename (i.e.
A/foo.txt and C/foo.txt in this case), then we pick one, basically at
random as far as the user is concerned (so old/foo.txt could be a
rename to either A/foo.txt or C/foo.txt, just depending on processing
order).
> > I do not know if in Jeremy's case foo has been modified on the
> > unrenamed side. But the following hypothetical is exactly the type of
> > problem Jeremy is hitting: what should happen when 'our' version has
> > both a new 'bar' and a new 'baz' file that each have hash A? In that
> > case, to which one was foo renamed? It's inherently ambiguous.
>
> true, the terminology hasn't kept up with the methodology for blob
> content, and the independent meta-data. In previous 'ort' discussions I
> didn't really understand what the '1/2' renames (and other
> nomenclatures) really meant with respect to paths, filenames, content
> and the ours / theirs / base distinctions.
None of that other nomenclature in 'ort' really matters; the problem
presented is almost certainly unrelated to any changes that came with
'ort'. If he had switched to git anytime in the last 15 years he'd
probably see exactly the same problem.
I'm guessing by "'1/2' renames" that you're referring to what I termed
"rename/rename(1to2)" cases, which exist in both merge-recursive and
merge-ort. Those are cases where the base version had a file named A,
and one side of history renamed A->B, while the other side of history
renamed A->C. That kind of situation is not relevant to the current
problem; for the current problem we have a file named D that was
renamed on only one side, but there are both files E and F each with
contents identical to D and rename detection has to decide whether D
was renamed to E or renamed to F on that side.
(This contrasts with "rename/rename(2to1)" cases, where the base
version has both files G & H, and then one side of history renames
G->I and the other side renames H->I. Or in short, 2to1 means two
files are renamed to one file, and 1to2 means one file is renamed to
two.)
> >> The heuristics of 'rename with small change' is trickier, but for a
> >> basic understanding, starting at BLOBSAME (and TREESAME for directory
> >> renames) should make it easier to grasp the concepts.
> >
> > Interesting; TREESAME isn't used within directory rename detection
> > currently; it is only used currently when two (or three) trees with
> > the same name are TREESAME, in order to potentially avoid recursing
> > into the tree. But even then, having two trees with the same name be
> > TREESAME isn't enough on its own to avoid recursing into that tree,
> > because the other side could have added files within the same-named
> > tree and we need to know about those added files because they could be
> > part of renames involving other files outside that tree.
>
> definitely easy to get confused on these cases...
yeah, and apparently I shouldn't have used TREESAME this way as per
Junio's comment elsewhere in the thread. Oops.
> Thanks for the insights.
Thanks for your comments, and again, my apologies for misreading what
you wrote the first time. I hope I didn't do it again.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-11-15 16:51 ` Philip Oakley
@ 2023-12-24 7:46 ` Elijah Newren
2023-12-28 15:33 ` Philip Oakley
0 siblings, 1 reply; 13+ messages in thread
From: Elijah Newren @ 2023-12-24 7:46 UTC (permalink / raw)
To: Philip Oakley; +Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
Hi Philip,
Sorry for the late reply; I somehow missed this earlier.
On Wed, Nov 15, 2023 at 8:51 AM Philip Oakley <philipoakley@iee.email> wrote:
>
> Hi Elijah,
>
> On 11/11/2023 05:46, Elijah Newren wrote:
> > * filename similarity is extraordinarily expensive compared to exact
> > renames, and if not carefully handled, can sometimes rival the cost of
> > file content similarity computations given our spanhash
> > representations.
>
> I've not heard of spanhash representation before. Any references or
> further reading?
You can find more in diffcore-delta.c, especially the big comment near
the top of the file. But here's a short explanation of spanhashes:
* Split files into chunks delimited either by LF or 64 bytes,
whichever comes first.
* Hash every chunk into an integer between 0 and 107926
* Keep a character count for each of those integers as well (thus if
a line has N characters, but appears twice in the file, the associated
count for that integer will be 2N).
* A "spanhash" is the combination of the integer that a chunk (or
span) hashes to, plus the count associated with it.
* The list/array of spanhashes for a file (i.e. the list/array of
integers and character counts) is used to compare one file to another.
Now, why do I claim that comparison of filenames can rival cost of
file content similarity? Well, in a monorepo of interest, the median
sized file is named something close to
"modules/client-resources/src/main/resources/buttons/smallTriangleBlackRight.png"
and is 2875 bytes. As a png, all its chunks are probably the full 64
characters, which works out to about 45 chunks (assuming the 64-byte
chunks are different from each other). The filename is 79 characters.
So, for this case, 45 pairs of integers vs 79 characters. So, the
comparison cost is roughly the same order of magnitude.
(Yes, creating the spanhashes is a heavy overhead; however, we only
initialize it once and then do N comparisons of each spanhash to the
other spanhashes. And we'd be doing N comparisons of each filename to
other filenames, so the overhead of creating the spanhashes can be
overlooked if your merge has enough files modified on both sides of
history.)
Yes, this particular repository is a case I randomly picked that you
can argue is special. But rather than look at the particular example,
I think it's interesting to check how the spanhash size vs. filename
size scale with repository size. From my experience: (1) I don't
think the median-sized file varies all that much between small and big
repositories; every time I check a repo the median size seems to be
order of a few thousand bytes, regardless of whether the repository
I'm looking at is tiny or huge, (2) while small repositories often
have much shorter filenames, big repositories often will have
filenames even longer than my example; length of filename tends to
grow with repository size from deep directory nestings. So, between
these two facts, I'd expect the filename comparison costs to grow
relative to file content comparison costs, when considering only
median-sized files being modified. And since it's common to have
merges or rebases or diffs where only approximately-median-sized files
are involved, I think this is relevant to look at. Finally, since I
already had an example that showed the cost likely roughly comparable
for a random repository of interest, and it's not even all that big a
repository compared to many out there, I think the combination
motivates pretty well my claim that filename similarity costs _could_
rival file content similarity costs if one wasn't careful.
I don't have a rigorous proof here. And, in fact, I ended up doing
this rough back-of-the-envelope analysis _after_ implementing some
filename similarity comparison ideas and seeing performance degrade
badly, and wondering why it made such a difference. I don't know if I
ever got exact numbers, but I certainly didn't record them. This
rough analysis, though, was what made me realize that I needed to be
careful with any such added filename comparisons, though, and is why
I'm leery of adding more.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Git Rename Detection Bug
2023-12-24 7:46 ` Elijah Newren
@ 2023-12-28 15:33 ` Philip Oakley
0 siblings, 0 replies; 13+ messages in thread
From: Philip Oakley @ 2023-12-28 15:33 UTC (permalink / raw)
To: Elijah Newren; +Cc: Jeremy Pridmore, git@vger.kernel.org, Paul Baumgartner
Hi Elijah,
Many thanks.. personal notes in-line.
On 24/12/2023 07:46, Elijah Newren wrote:
> Hi Philip,
>
> Sorry for the late reply; I somehow missed this earlier.
>
> On Wed, Nov 15, 2023 at 8:51 AM Philip Oakley <philipoakley@iee.email> wrote:
>>
>> Hi Elijah,
>>
>> On 11/11/2023 05:46, Elijah Newren wrote:
>>> * filename similarity is extraordinarily expensive compared to exact
>>> renames, and if not carefully handled, can sometimes rival the cost of
>>> file content similarity computations given our spanhash
>>> representations.
>>
>> I've not heard of spanhash representation before. Any references or
>> further reading?
>
> You can find more in diffcore-delta.c, especially the big comment near
> the top of the file.
+1
> But here's a short explanation of spanhashes:
> * Split files into chunks delimited either by LF or 64 bytes,
> whichever comes first.
neat
> * Hash every chunk into an integer between 0 and 107926
as per the comment, this is 1 less than a nice prime 107927 that fits
17bits.
Some discussions at
https://lore.kernel.org/git/7vwtezt202.fsf@assigned-by-dhcp.cox.net/ and
surrounding messages.
The hash is very similar to a CRC, a rotating 64bit value, using 7 bit
shifts and a 8bit char addition, then reduced to a hash computed at ~#L157
> * Keep a character count for each of those integers as well (thus if
> a line has N characters, but appears twice in the file, the associated
> count for that integer will be 2N).
> * A "spanhash" is the combination of the integer that a chunk (or
> span) hashes to, plus the count associated with it.
> * The list/array of spanhashes for a file (i.e. the list/array of
> integers and character counts) is used to compare one file to another.
I was surprised to see that I'd been in the area at #L162 ;-)
Thank you for the useful summary.
>
> Now, why do I claim that comparison of filenames can rival cost of
> file content similarity? Well, in a monorepo of interest, the median
> sized file is named something close to
> "modules/client-resources/src/main/resources/buttons/smallTriangleBlackRight.png"
> and is 2875 bytes. As a png, all its chunks are probably the full 64
> characters, which works out to about 45 chunks (assuming the 64-byte
> chunks are different from each other). The filename is 79 characters.
> So, for this case, 45 pairs of integers vs 79 characters. So, the
> comparison cost is roughly the same order of magnitude.
> (Yes, creating the spanhashes is a heavy overhead; however, we only
> initialize it once and then do N comparisons of each spanhash to the
> other spanhashes. And we'd be doing N comparisons of each filename to
> other filenames, so the overhead of creating the spanhashes can be
> overlooked if your merge has enough files modified on both sides of
> history.)
Nice point about the hashes only being computed once.
>
> Yes, this particular repository is a case I randomly picked that you
> can argue is special. But rather than look at the particular example,
> I think it's interesting to check how the spanhash size vs. filename
> size scale with repository size. From my experience: (1) I don't
> think the median-sized file varies all that much between small and big
> repositories; every time I check a repo the median size seems to be
> order of a few thousand bytes, regardless of whether the repository
> I'm looking at is tiny or huge, (2) while small repositories often
> have much shorter filenames, big repositories often will have
> filenames even longer than my example; length of filename tends to
> grow with repository size from deep directory nestings. So, between
> these two facts, I'd expect the filename comparison costs to grow
> relative to file content comparison costs, when considering only
> median-sized files being modified. And since it's common to have
> merges or rebases or diffs where only approximately-median-sized files
> are involved, I think this is relevant to look at. Finally, since I
> already had an example that showed the cost likely roughly comparable
> for a random repository of interest, and it's not even all that big a
> repository compared to many out there, I think the combination
> motivates pretty well my claim that filename similarity costs _could_
> rival file content similarity costs if one wasn't careful.
>
> I don't have a rigorous proof here. And, in fact, I ended up doing
> this rough back-of-the-envelope analysis _after_ implementing some
> filename similarity comparison ideas and seeing performance degrade
> badly, and wondering why it made such a difference. I don't know if I
> ever got exact numbers, but I certainly didn't record them. This
> rough analysis, though, was what made me realize that I needed to be
> careful with any such added filename comparisons, though, and is why
> I'm leery of adding more.
Thanks again.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-12-28 15:59 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-06 12:00 Git Rename Detection Bug Jeremy Pridmore
2023-11-07 8:05 ` Elijah Newren
2023-11-10 11:28 ` Jeremy Pridmore
2023-11-11 5:46 ` Elijah Newren
2023-11-11 11:08 ` Philip Oakley
2023-11-11 15:13 ` Elijah Newren
2023-11-12 23:09 ` Junio C Hamano
2023-11-15 15:35 ` Philip Oakley
2023-11-15 14:36 ` Philip Oakley
2023-11-16 6:26 ` Elijah Newren
2023-11-15 16:51 ` Philip Oakley
2023-12-24 7:46 ` Elijah Newren
2023-12-28 15:33 ` Philip Oakley
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).