* Fw: Curiosity
From: João Victor Bonfim @ 2021-12-15 3:52 UTC
To: git@vger.kernel.org

I sent this message to Junio Hamano kind of forever ago, and since then
I haven't been able to follow up on it or do anything about it, really
(I am writing a report on Git for the conclusion of my technician
course so I can get my certification, yada yada yada, so I couldn't get
to it).

These days I have been reading Junio's responses on the git mailing
list archive (https://marc.info/?l=git or rather
https://marc.info/?a=118086005800002&r=1&w=4) from May to now to see if
Junio said anything. Junio didn't, but I did read
https://marc.info/?i=xmqqpmudng5x.fsf%20()%20gitster%20!%20g and kind
of felt that it was targeted at me, or at people like me at least...
`:-) (me sweating in exasperation).

Also, since then I may have improved on my confusing line of thought,
so here are the past message and my current version, so to speak:

------- Second attempt --------

Since Git is used for almost everything at this point, is there any
intent to provide better support for non-textual file types?

Why do I say this? Take this game mod which I follow as an example ->
https://github.com/SolariusScorch/XComFiles <- whenever I clone it, Git
takes forever to download 452 MB of files, some part of which, from my
perspective, isn't being delta compressed like the text files are
(since, whenever reading a log of what changes were made, git creates
and undoes modes for all binary files, some of which only changed by a
pixel from one colour to another).

From my perspective it would be interesting to enhance the
effectiveness/performance of git for such files, since some projects
are very heavy on multimedia that isn't hard coded, and those will
eventually come around to using git.

From a personal perspective: I intend to create an open source game and
track it with git. However, I am concerned that it might take forever
for users to clone the repo if a few versions of a single file of,
perhaps, some gigabytes in size aren't stored and compressed
efficiently and instead all the versions are stored in full, totalling
some terabytes of storage for a few such files.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
Wednesday, 27, May, 2021, 22:12, João Victor Bonfim <JoaoVictorBonfim@protonmail.com> wrote:

> I am assuming you are the Git maintainer, hence this message;
> otherwise, forgive me.
>
> Considering the ubiquity of Git as a versioning system and my
> internal queries about the future of software development, especially
> game development, is there any intent to provide support for
> non-textual file types? What I mean is that binary files, from my
> perspective as a user, are tracked in full rather than partially;
> that is, the files are discarded and replaced whenever they are
> altered when, instead, the differences between versions could be
> tracked. Of course this would require several changes to Git so it
> can interpret images and so on, but I think that it could be good for
> software development that requires extensive multimedia use and,
> therefore, may require that better tracking for such material is made
> available.
>
> Do you understand where I want to get to?
>
> Graciously yours, João Victor Bonfim.

^ permalink raw reply [flat|nested] 14+ messages in thread

* Re: Fw: Curiosity
From: Junio C Hamano @ 2021-12-15 18:07 UTC
To: João Victor Bonfim
Cc: git@vger.kernel.org

João Victor Bonfim <JoaoVictorBonfim@protonmail.com> writes:

> I sent this message to Junio Hamano kind of forever ago, and since
> then I haven't been able to follow up on it or do anything about it,
> really...

My spam filter has learned that anything that goes to the gitster@
address without being cc'ed to the git@vger list is to be caught, so it
is very plausible that I didn't see it.

Sending any inquiry here on the list is the right thing to do,
especially because it is likely that I may not be the area expert for
whatever you want to learn about Git, while there are others who are
more familiar with various parts of the system and other ways the
system is used.

You will also increase your chances of being read if you make your
message look more like the ones typically posted here (see the
archive), by wrapping overly long lines, etc.

> Since Git is used for almost everything at this point, is there
> any intent to provide better support for non-textual file types?
> Why do I say this? Take this game mod which I follow as an example ->
> https://github.com/SolariusScorch/XComFiles <- whenever I clone it,
> Git takes forever to download 452 MB of files, some part of which,
> from my perspective, isn't being delta compressed like the text files
> are (since, whenever reading a log of what changes were made, git
> creates and undoes modes for all binary files, some of which only
> changed by a pixel from one colour to another).

Our delta compression does not care whether the contents are text or
binary, so if it is not compressed well, it can be a sign that the
contents are not compressible to begin with, at least with the xdelta
binary-diff-patch engine we use. Improvement designs, algorithms and
patches are always welcome ;-)

* Re: Fw: Curiosity
From: João Victor Bonfim @ 2021-12-15 23:45 UTC
To: Junio C Hamano
Cc: git@vger.kernel.org

> João Victor Bonfim JoaoVictorBonfim@protonmail.com writes:
>
> > Our delta compression does not care whether the contents are text or
> > binary, so if it is not compressed well, so it can be a sign that
> > the contents are not compressible to begin with, at least with the
> > xdelta binary-diff-patch engine we use. Improvement designs,
> > algorithms and patches are always welcome ;-)

Gosh, I wish I could do anything about it. I am but a mere code monkey,
haven't done much writing practice either. Maybe one day, but that is
yet to be seen.

* Re: Fw: Curiosity
From: brian m. carlson @ 2021-12-16 2:19 UTC
To: Junio C Hamano
Cc: João Victor Bonfim, git@vger.kernel.org

On 2021-12-15 at 18:07:20, Junio C Hamano wrote:
> João Victor Bonfim <JoaoVictorBonfim@protonmail.com> writes:
> > Since Git is used for almost everything at this point, is there
> > any intent to provide better support for non-textual file types?
> > Why do I say this? Take this game mod which I follow as an example ->
> > https://github.com/SolariusScorch/XComFiles <- whenever I clone it,
> > Git takes forever to download 452 MB of files, some part of which,
> > from my perspective, isn't being delta compressed like the text
> > files are (since, whenever reading a log of what changes were made,
> > git creates and undoes modes for all binary files, some of which
> > only changed by a pixel from one colour to another).
>
> Our delta compression does not care whether the contents are text or
> binary, so if it is not compressed well, it can be a sign that
> the contents are not compressible to begin with, at least with the
> xdelta binary-diff-patch engine we use. Improvement designs,
> algorithms and patches are always welcome ;-)

To expand on this, if what you're storing is already compressed, such
as the Ogg Vorbis files or PNGs found in that repository, then
generally it will not delta well. This is also true of things like
Microsoft Office or OpenOffice documents, because they're essentially
Zip files.

The delta algorithm looks for similarities between files to compress
them. If a file is already compressed using something like Deflate,
which is used in PNGs and Zip files, then even very similar files will
generally look very different, so deltification will generally be
ineffective.

There are two main solutions to this. One is to store your data
uncompressed in the repository and compress it as part of a build step.
This makes your checkouts larger, but it makes your repository smaller.

The other is to store them outside of the repository proper. Some folks
use Git LFS for this, but you could also just store a manifest with file
names and secure hashes, plus a download location for a public server.

-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
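
A rough way to see the effect brian describes is sketched below in
Python. It is only an illustration under stated assumptions: the
"pixel" data is made up, and zlib with a preset dictionary (`zdict`)
stands in for a delta encoder, which is not how Git's own delta code
works. The helper names `delta_size` and `one_trial` are hypothetical.

```python
import random
import zlib


def delta_size(base, target):
    """Bytes needed to encode `target` when `base` is available as a
    preset dictionary. A rough stand-in for a delta encoder only; this
    is NOT Git's delta algorithm."""
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(target) + comp.flush())


def one_trial(seed):
    rng = random.Random(seed)
    # 8 KiB of made-up "pixel" data drawn from a 16-colour palette.
    old_raw = bytes(rng.randrange(16) for _ in range(8192))
    # The same data with a single "pixel" changed to a new colour.
    new_raw = bytes([255]) + old_raw[1:]
    # Both versions as they would look if stored already deflated.
    old_packed = zlib.compress(old_raw, 9)
    new_packed = zlib.compress(new_raw, 9)
    return delta_size(old_raw, new_raw), delta_size(old_packed, new_packed)


# Average over several seeds so one lucky or unlucky input does not
# dominate the picture.
results = [one_trial(seed) for seed in range(20)]
avg_raw = sum(r for r, _ in results) / len(results)
avg_packed = sum(p for _, p in results) / len(results)
print(f"average delta between raw versions:      {avg_raw:.0f} bytes")
print(f"average delta between deflated versions: {avg_packed:.0f} bytes")
```

On typical runs the second figure should come out far larger than the
first, because a one-byte change usually shifts the whole Deflate bit
stream. The exact numbers depend on the zlib build in use.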

* Re: Fw: Curiosity
From: João Victor Bonfim @ 2021-12-16 21:20 UTC
To: brian m. carlson
Cc: Junio C Hamano, git@vger.kernel.org

> To expand on this, if what you're storing is already compressed, such
> as the Ogg Vorbis files or PNGs found in that repository, then
> generally it will not delta well. This is also true of things like
> Microsoft Office or OpenOffice documents, because they're essentially
> Zip files.
>
> The delta algorithm looks for similarities between files to compress
> them. If a file is already compressed using something like Deflate,
> which is used in PNGs and Zip files, then even very similar files will
> generally look very different, so deltification will generally be
> ineffective.

This also explains why Git opens a new mode every time an edit is made,
since it cannot recognize any similarities between the files, even
though there are some.

> There are two main solutions to this. One is to store your data
> uncompressed in the repository and compress it as part of a build step.
> This makes your checkouts larger, but it makes your repository smaller.
>
> The other is to store them outside of the repository proper. Some folks
> use Git LFS for this, but you could also just store a manifest with file
> names and secure hashes, plus a download location for a public server.

Maybe I am thinking too far outside the box, but wouldn't it be much
more effective for git to identify compressed files, especially in edge
cases where the compression doesn't have good chemistry with delta
compression, decompress them for repo storage while also storing the
compression algorithm as some metadata tag (like a text string or an ID
code decided beforehand), and, when creating the work mirrors, return
the compression to its default state before checkout?

Of course you would also need reversing functions for when you want to
check the info back in to the repo.

Just throwing ideas out there.

-------------------------------
João Victor Bonfim, any pronouns are welcome.

* Re: Fw: Curiosity
From: Martin Fick @ 2021-12-16 21:33 UTC
To: João Victor Bonfim
Cc: brian m. carlson, Junio C Hamano, git

On 2021-12-16 14:20, João Victor Bonfim wrote:
>> To expand on this, if what you're storing is already compressed, such
>> as the Ogg Vorbis files or PNGs found in that repository, then
>> generally it will not delta well. This is also true of things like
>> Microsoft Office or OpenOffice documents, because they're essentially
>> Zip files.
>>
>> The delta algorithm looks for similarities between files to compress
>> them. If a file is already compressed using something like Deflate,
>> which is used in PNGs and Zip files, then even very similar files
>> will generally look very different, so deltification will generally
>> be ineffective.
...
> Maybe I am thinking too far outside the box, but wouldn't it be much
> more effective for git to identify compressed files, especially in
> edge cases where the compression doesn't have good chemistry with
> delta compression, decompress them for repo storage while also storing
> the compression algorithm as some metadata tag (like a text string or
> an ID code decided beforehand), and, when creating the work mirrors,
> return the compression to its default state before checkout?

I suspect that for most algorithms and their implementations, this would
not result in repeatable "recompressed" results. Thus the checked-out
files might be different every time you checked them out. :(

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

* Re: Fw: Curiosity
From: Junio C Hamano @ 2021-12-16 21:42 UTC
To: Martin Fick
Cc: João Victor Bonfim, brian m. carlson, git

Martin Fick <mfick@codeaurora.org> writes:

> On 2021-12-16 14:20, João Victor Bonfim wrote:
>>> To expand on this, if what you're storing is already compressed, such
>>> as the Ogg Vorbis files or PNGs found in that repository, then
>>> generally it will not delta well. This is also true of things like
>>> Microsoft Office or OpenOffice documents, because they're essentially
>>> Zip files.
>>> The delta algorithm looks for similarities between files to
>>> compress them. If a file is already compressed using something like
>>> Deflate, which is used in PNGs and Zip files, then even very similar
>>> files will generally look very different, so deltification will
>>> generally be ineffective.
> ...
>> Maybe I am thinking too far outside the box, but wouldn't it be much
>> more effective for git to identify compressed files, especially in
>> edge cases where the compression doesn't have good chemistry with
>> delta compression, decompress them for repo storage while also
>> storing the compression algorithm as some metadata tag (like a text
>> string or an ID code decided beforehand), and, when creating the work
>> mirrors, return the compression to its default state before checkout?
>
> I suspect that for most algorithms and their implementations, this would
> not result in repeatable "recompressed" results. Thus the checked-out
> files might be different every time you checked them out. :(

That is probably too application specific to be in core-git, but it
is probably a good application for smudge/clean filters like brian
alluded to?
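
To make the smudge/clean idea a little more concrete, here is a minimal
sketch in Python of what such a filter pair could look like for gzip
files. It is an assumption-laden illustration, not something from Git
itself: the script and its `clean`/`smudge` function names are
hypothetical, it would have to be wired up by hand through the
`filter.<name>.clean` and `filter.<name>.smudge` configuration keys and
a matching `.gitattributes` entry, and, as Martin points out, the
recompressed file is not guaranteed to match the file that was
originally committed.

```python
import gzip
import sys


def clean(blob):
    """Working tree -> repository: store the uncompressed payload so
    that Git's delta compression has something to work with."""
    return gzip.decompress(blob)


def smudge(blob):
    """Repository -> working tree: recompress with fixed settings.
    mtime=0 keeps the gzip header free of the current time. The result
    may still differ from the file that was originally added."""
    return gzip.compress(blob, compresslevel=9, mtime=0)


# Hypothetical command-line entry point: "clean" or "smudge" as the
# only argument, data on stdin, converted data on stdout.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("clean", "smudge"):
    data = sys.stdin.buffer.read()
    convert = clean if sys.argv[1] == "clean" else smudge
    sys.stdout.buffer.write(convert(data))
```

The content survives a round trip through both functions, but only the
uncompressed bytes are what Git would hash and store, so a file whose
exact compressed form matters would not be a good fit for this.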

* Re: Fw: Curiosity
From: João Victor Bonfim @ 2021-12-18 0:17 UTC
To: Junio C Hamano
Cc: Martin Fick, brian m. carlson, git

> That is probably too application specific to be in core-git, but it

Application specific as in it being too much of an edge case to be
useful to all git users?

> is probably a good application for smudge/clean filters like brian
> alluded to?

Perhaps.

* Re: Fw: Curiosity
From: João Victor Bonfim @ 2021-12-18 0:15 UTC
To: Martin Fick
Cc: brian m. carlson, Junio C Hamano, git

> I suspect that for most algorithms and their implementations, this would
> not result in repeatable "recompressed" results. Thus the checked-out
> files might be different every time you checked them out. :(

How or why?

Sincere question.

* Re: Fw: Curiosity
From: Junio C Hamano @ 2021-12-18 0:24 UTC
To: João Victor Bonfim
Cc: Martin Fick, brian m. carlson, git

João Victor Bonfim <JoaoVictorBonfim@protonmail.com> writes:

>> I suspect that for most algorithms and their implementations, this would
>> not result in repeatable "recompressed" results. Thus the checked-out
>> files might be different every time you checked them out. :(
>
> How or why?
>
> Sincere question.

Two immediate things that come to my mind are lossy compression
algorithms (jpeg pictures?) and compressors that do not necessarily
produce bit-for-bit identical results (e.g. gzip by default embeds a
timestamp unless explicitly told not to by a command line option).
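
The gzip timestamp point can be illustrated with a short Python sketch.
It uses the standard library's `gzip.compress`, whose `mtime` argument
sets the timestamp field in the gzip header; the payload and variable
names are made up for the example.

```python
import gzip

data = b"the same payload every time\n" * 50

# Identical input, compressed "at" two different times.
at_t1 = gzip.compress(data, mtime=1)
at_t2 = gzip.compress(data, mtime=2)

# Identical input, with the timestamp pinned to a fixed value.
fixed_a = gzip.compress(data, mtime=0)
fixed_b = gzip.compress(data, mtime=0)

# Bytes 4..7 of a gzip stream hold the timestamp (little-endian).
print("timestamp fields:", at_t1[4:8], at_t2[4:8])
print("streams identical with different timestamps:", at_t1 == at_t2)
print("streams identical with a pinned timestamp:  ", fixed_a == fixed_b)
print("same content after decompression:",
      gzip.decompress(at_t1) == gzip.decompress(at_t2) == data)
```

Both streams decompress to the same content, yet they are not
byte-for-byte equal, which is exactly what would make a checked-out
file look modified.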

* Re: Fw: Curiosity
From: João Victor Bonfim @ 2021-12-18 0:50 UTC
To: Junio C Hamano
Cc: Martin Fick, brian m. carlson, git

Yeah, that sounds reasonable, Junio.

* Re: Fw: Curiosity
From: Martin Fick @ 2021-12-18 1:06 UTC
To: João Victor Bonfim
Cc: brian m. carlson, Junio C Hamano, git

On 2021-12-17 17:15, João Victor Bonfim wrote:
>> I suspect that for most algorithms and their implementations, this
>> would not result in repeatable "recompressed" results. Thus the
>> checked-out files might be different every time you checked them
>> out. :(
>
> How or why?

Here are some reasons I can think of (I am no expert):

1) Most compression formats are file formats, not exact algorithms, thus
different program implementations of similar algorithms can create
vastly different outputs.

2) The same program will evolve over time, get improvements, bug fixes,
etc., so the output of each version of the same program could vary even
with the same settings. The same program version on different platforms
could also produce different output.

3) Settings. Compression programs have compression levels, perhaps
memory utilization parameters... The way the program measures these may
be non-deterministic and therefore not repeatable.

4) Threading. Some compression algorithms, such as git repack itself,
can use several threads to analyze the input data. Since the timing
between different threads is not deterministic, they can produce
different results when cooperating.

Much of this has to do with the idea that there is usually no such thing
as "done" when it comes to compression. You can probably search
indefinitely for more data patterns to compress the data further. Thus
compression programs have to have limits based on heuristics (how far to
look ahead/behind, how many patterns to remember...) programmed into
them to come to an end somehow. How these limits are determined can
sometimes be non-deterministic; it may even involve system resources
(how much RAM the machine has, how long it has run...) or system config.

I hope that helps,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

* Re: Fw: Curiosity
From: brian m. carlson @ 2021-12-18 1:34 UTC
To: João Victor Bonfim
Cc: Martin Fick, Junio C Hamano, git

On 2021-12-18 at 00:15:59, João Victor Bonfim wrote:
> > I suspect that for most algorithms and their implementations, this would
> > not result in repeatable "recompressed" results. Thus the checked-out
> > files might be different every time you checked them out. :(
>
> How or why?
>
> Sincere question.

A lossless compression algorithm has to produce an encoded value that,
when decoded, must produce the original input. Ideally, it will also
reduce the file size of the original input. Beyond that, there's a
great deal of freedom in how to implement that.

Just taking Deflate, which is used in zlib and gzip, as an example,
there are different compression settings that control the size of the
window to use and that affect compression speed, quality of compression
(resulting size), and memory usage. One might prefer using gzip -1 to
get better performance or use less memory, or gzip -9 to reduce the file
size as much as possible.

Even when the same settings are used, the technique used can vary
between versions of the software. For example, GitHub effectively uses
git archive to generate archives, and one time when they upgraded their
servers, the compression changed in the tarballs and zip files, and
everybody who was relying on the archives being bit-for-bit identical[0]
had a problem.

So it would be nearly impossible to produce bit-for-bit repeatable
results without specifying a specific, hard-coded implementation, and
even in that case, the behavior might need to change for security
reasons, so it would end up being difficult to achieve.

[0] Neither Git nor GitHub provides this guarantee, so please do not
make this mistake. If you need a fixed bit-for-bit tarball, save it as
a release artifact.

-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
* Re: Fw: Curiosity
  2021-12-18  1:34 ` brian m. carlson
@ 2021-12-18  1:40   ` João Victor Bonfim
  0 siblings, 0 replies; 14+ messages in thread
From: João Victor Bonfim @ 2021-12-18 1:40 UTC (permalink / raw)
To: brian m. carlson; +Cc: Martin Fick, Junio C Hamano, git

How does one make a release artifact? o-o

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, December 17, 2021 at 22:34, brian m. carlson <sandals@crustytoothpaste.net> wrote:

> On 2021-12-18 at 00:15:59, João Victor Bonfim wrote:
> > > I suspect that for most algorithms and their implementations, this would
> > > not result in repeatable "recompressed" results. Thus the checked-out
> > > files might be different every time you checked them out. :(
> >
> > How or why?
> >
> > Sincere question.
>
> A lossless compression algorithm has to produce an encoded value that,
> when decoded, must produce the original input. Ideally, it will also
> reduce the file size of the original input. Beyond that, there's a
> great deal of freedom to implement that.
>
> Just taking Deflate, which is used in zlib and gzip, as an example,
> there are different compression settings that control the size of the
> window to use that affect compression speed, quality of compression
> (resulting size), and memory usage. One might prefer using gzip -1 to
> get better performance or use less memory, or gzip -9 to reduce the file
> size as much as possible.
>
> Even when the same settings are used, the technique used can vary
> between versions of the software. For example, GitHub effectively uses
> git archive to generate archives, and one time when they upgraded their
> servers, the compression changed in the tarballs and zip files, and
> everybody who was relying on the archives being bit-for-bit identical[0]
> had a problem.
>
> So it would be nearly impossible to produce bit-for-bit repeatable
> results without specifying a specific, hard-coded implementation, and
> even in that case, the behavior might need to change for security
> reasons, so it would end up being difficult to achieve.
>
> [0] Neither Git nor GitHub provides this guarantee, so please do not
> make this mistake. If you need a fixed bit-for-bit tarball, save it as
> a release artifact.
>
> --
> brian m. carlson (he/him or they/them)
> Toronto, Ontario, CA

^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2021-12-18 1:40 UTC | newest]

Thread overview: 14+ messages
[not found] <Wlh_w2gSCDQ2ieJnIY7TStWrzxbwP98SNRIFMTYpva7SRFipqk63HEYFVF7wFn1oSHOkQNsjWGOa5L49vyRlvSLbuZqpmvOaDOHmFkdt2zw=@protonmail.com>
2021-12-15 3:52 ` Fw: Curiosity João Victor Bonfim
2021-12-15 18:07 ` Junio C Hamano
2021-12-15 23:45 ` João Victor Bonfim
2021-12-16 2:19 ` brian m. carlson
2021-12-16 21:20 ` João Victor Bonfim
2021-12-16 21:33 ` Martin Fick
2021-12-16 21:42 ` Junio C Hamano
2021-12-18 0:17 ` João Victor Bonfim
2021-12-18 0:15 ` João Victor Bonfim
2021-12-18 0:24 ` Junio C Hamano
2021-12-18 0:50 ` João Victor Bonfim
2021-12-18 1:06 ` Martin Fick
2021-12-18 1:34 ` brian m. carlson
2021-12-18 1:40 ` João Victor Bonfim