* [Buildroot] Large number of duplicate files in sdk
From: Grant Edwards @ 2024-11-20 20:00 UTC
To: buildroot

When I do a "make sdk" (using 2024.02.6), the resulting tarball
contains tons of duplicate files. I'm using an external Linaro ARM
toolchain. With a fairly bare-bones package selection, the sdk tarball
generated by buildroot appears to be about 40% duplicate files by
size (about 20% by count).

According to a Python app I hacked together it looks like there are
2300+ duplicated files taking up 380MB of wasted disk space.

[The fdupes utility finds about 100 fewer duplicates than my Python
application, so my numbers might be slightly off.]

Pretty much all of the include and library files from the external
toolchain are found under both

  ./opt/ext-toolchain/arm-linux-gnueabihf

and again under

  ./arm-buildroot-linux-gnueabihf/sysroot

Are both copies of all those files needed?

Is there some option to prune unneeded files?

My first impulse was to just delete ./opt/ext-toolchain, but I noticed
that there are a couple dozen files under ./bin that are symlinked to
.../opt/ instead of being duplicated like everything else.

--
Grant

_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot
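For readers who want to reproduce this kind of measurement, here is a minimal sketch of a content-based duplicate finder. It is not the Python app mentioned in the message (that code was not posted); the function names `file_digest`, `find_duplicates` and `wasted_bytes` are made up for illustration, and the choice of SHA-256 is an assumption.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root by (size, digest).

    Symlinks are ignored, and paths that already share an inode
    (existing hard links) are counted only once.
    """
    by_size = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue
            inode = (st.st_dev, st.st_ino)
            if inode in seen_inodes:
                continue
            seen_inodes.add(inode)
            by_size[st.st_size].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # empty or unique size: nothing to gain
        for path in paths:
            groups[(size, file_digest(path))].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


def wasted_bytes(groups):
    """Bytes that would be freed if every group kept a single copy."""
    return sum(size * (len(paths) - 1) for (size, _digest), paths in groups.items())
```

Calling `find_duplicates()` on an extracted SDK directory and passing the result to `wasted_bytes()` gives a figure comparable to the "wasted disk space" number above. Small differences from fdupes are to be expected, since tools differ in how they treat empty files, symlinks and existing hard links.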
* Re: [Buildroot] Large number of duplicate files in sdk
From: Grant Edwards @ 2024-11-20 23:27 UTC
To: buildroot; +Cc: buildroot

On 2024-11-20, Grant Edwards <grant.b.edwards@gmail.com> wrote:

> When I do a "make sdk" (using 2024.02.6), the resulting tarball
> contains tons of duplicate files. I'm using an external Linaro ARM
> toolchain. With a fairly bare-bones package selection, the sdk tarball
> generated by buildroot appears to be about 40% duplicate files by
> size (about 20% by count).

It's actually a bit worse than that. My app wasn't finding files that
were duplicated more than once.

Running a simple de-dupe utility on output/host reduced disk space
from 928MB to 556MB.

That app is pretty conservative: it only links files that have the
same name, the same parent directory name, and differing top
directory names.
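The conservative rule described above can be sketched roughly as follows. This is one reading of the description, not the utility that was actually run: the byte-for-byte comparison before linking is an assumption, and `top_dir`, `conservative_candidates` and `link_pair` are hypothetical names.

```python
import filecmp
import os
from collections import defaultdict


def top_dir(root, path):
    """First path component of path, relative to root."""
    return os.path.relpath(path, root).split(os.sep)[0]


def conservative_candidates(root):
    """Yield (keep, dup) pairs: same file name, same parent directory
    name, different top-level directory, identical content.

    Simplification: every later path is compared only against the
    first path (in sorted order) that shares its key.
    """
    by_key = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_key[(os.path.basename(dirpath), name)].append(path)

    for paths in by_key.values():
        paths.sort()
        keep = paths[0]
        for dup in paths[1:]:
            if top_dir(root, dup) == top_dir(root, keep):
                continue  # same tree: leave it alone
            if os.path.samefile(keep, dup):
                continue  # already hard-linked
            if filecmp.cmp(keep, dup, shallow=False):
                yield keep, dup


def link_pair(keep, dup):
    """Replace dup with a hard link to keep via a temporary name."""
    tmp = dup + ".dedupe-tmp"
    os.link(keep, tmp)
    os.replace(tmp, dup)  # rename over the duplicate
```

Collecting the pairs into a list before calling `link_pair()` on each avoids modifying the tree while it is still being scanned. Hard links only work within one filesystem, which holds for a single output/host tree.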
* Re: [Buildroot] Large number of duplicate files in sdk
From: Peter Korsgaard @ 2024-11-21 8:18 UTC
To: Grant Edwards; +Cc: buildroot, buildroot

>>>>> "Grant" == Grant Edwards <grant.b.edwards@gmail.com> writes:

 > On 2024-11-20, Grant Edwards <grant.b.edwards@gmail.com> wrote:
 >> When I do a "make sdk" (using 2024.02.6), the resulting tarball
 >> contains tons of duplicate files. I'm using an external Linaro ARM
 >> toolchain. With a fairly bare-bones package selection, the sdk tarball
 >> generated by buildroot appears to be about 40% duplicate files by
 >> size (about 20% by count).

 > It's actually a bit worse than that. My app wasn't finding files that
 > were duplicated more than once.

 > Running a simple de-dupe utility on output/host reduced disk space
 > from 928MB to 556MB.

 > That app is pretty conservative: it only links files that have the
 > same name, the same parent directory name, and differing top
 > directory names.

Some duplicates are expected, E.G. we have a number of packages that
can be built for the host and the target (E.G. python3), so if your
SDK has both host and target variant enabled then there will be some
duplicated files.

As a comparison, I have an SDK built with 2024.02.8 and a
Buildroot-generated external toolchain where the SDK .tar.gz is 227MB
and extracted:

du -hs
741M    .

fdupes -rm .
4179 duplicate files (in 3982 sets), occupying 100.6 megabytes

Focusing on the big files I see:

fdupes -rS -G $(( 1000 * 1024 )) .
1467305 bytes each:
./opt/ext-toolchain/share/man/man1/aarch64-buildroot-linux-gnu-g++.1
./opt/ext-toolchain/share/man/man1/aarch64-buildroot-linux-gnu-gcc.1

3334288 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/bin/ld.bfd
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/bin/ld

5427708 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libc.a
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libc.a

6316256 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.a
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/lib64/libstdc++.a
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.a

1582810 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libm-2.38.a
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libm-2.38.a

2892864 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.so.6.0.30
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/lib64/libstdc++.so.6.0.30
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.so.6.0.30

3386286 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_common
./aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_common

1111538 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_pinyin
./aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_pinyin

4523291 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/cns11643_stroke
./aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/cns11643_stroke

2148440 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/lib/libc.so.6
./aarch64-buildroot-linux-gnu/sysroot/lib/libc.so.6

So mainly the copy we do of the external toolchain into host/.

I think we could be smarter about using hard links instead of actually
copying files / perhaps use hardlink before creating the SDK tarball:

hardlink .
Mode:           real
Method:         sha256
Files:          15298
Linked:         2903 files
Compared:       0 xattrs
Compared:       14922 files
Saved:          72.68 MiB
Duration:       0.855040 seconds

du -hs
661M    .

--
Bye, Peter Korsgaard
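Hard-linking before archiving only helps if the archiver records the links instead of storing each name as a full copy. The sketch below illustrates that with Python's standard `tarfile` module, which stores a second path to an already-archived inode as a hard-link entry. It does not reproduce how Buildroot creates the SDK tarball; `make_tarball` and `count_entries` are illustrative names only.

```python
import os
import tarfile


def make_tarball(src_dir, tar_path):
    """Archive src_dir into a gzip-compressed tarball.

    tar_path should live outside src_dir so the archive does not
    try to include itself.
    """
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")


def count_entries(tar_path):
    """Return (regular_files, hard_links) counted from the archive."""
    regular = links = 0
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if member.islnk():
                links += 1  # stored as a link record, no file data
            elif member.isreg():
                regular += 1
    return regular, links
```

A tree with two hard-linked names for one file yields one regular entry and one link entry, so the file's data is stored once. Extracting such an archive recreates the hard link rather than a second copy.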
* Re: [Buildroot] Large number of duplicate files in sdk
From: Grant Edwards @ 2024-11-21 15:06 UTC
To: buildroot; +Cc: buildroot

On 2024-11-21, Peter Korsgaard <peter@korsgaard.com> wrote:

>  > On 2024-11-20, Grant Edwards <grant.b.edwards@gmail.com> wrote:
>  >> When I do a "make sdk" (using 2024.02.6), the resulting tarball
>  >> contains tons of duplicate files. [...]
>
> Some duplicates are expected, E.G. we have a number of packages that
> can be built for the host and the target (E.G. python3), so if your
> SDK has both host and target variant enabled then there will be some
> duplicated files.

Yes, that I expected. Those files need to be in both places. You can
save some space by hardlinking them, but you actually need two
"copies".

What surprised me was the duplication of the include and library trees
from the toolchain. I would have thought that foo.h, libfoo.a,
libfoo.so, and libfoo.so.1 would only need to be in one place (each).
Does there need to be a copy in both sysroot and opt/ext-toolchain?

Some effort was expended to avoid duplicating the toolchain binaries
themselves, but all of the other files are duplicated.

> So mainly the copy we do of the external toolchain into host/.

And I limited my custom "de-dupe" utility to that case: hard-linking
files that were conceptually "the same file" in duplicate trees. I
didn't link files within sysroot that happened to have identical
content or link files within opt/ext-toolchain because they had
identical content.

> I think we could be smarter about using hard links instead of
> actually copying files / perhaps use hardlink before creating the
> SDK tarball:

We're probably already further down this rabbit hole than it deserves,
but if an include/library file needs to be in sysroot, does it still
need to be in the opt/ext-toolchain directory also?
Can we move the file instead of copying/linking it?

Are there some situations where libfoo.so.1 is used from sysroot and
other situations where it is used from opt/ext-toolchain? If yes, then
moving the file won't work.

Are the two libfoo.so.1 files ever expected to differ? If not, then
linking seems to be the right answer instead of copying.

I tried using symlinks from sysroot to opt/ext-toolchain instead of
hardlinks, but I could not get that to work. I didn't pursue that for
long, but it appeared that <something> was refusing to follow symlinks
for <some?> library files. The file was found in the expected
location, but it was the "wrong type": it was a symlink.

--
Grant
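The "wrong type" symptom fits how a symlink and a hard link look to a program that inspects a path without following links. The thread does not say which tool rejected the symlinks, so the following is only an illustration of the general difference, with `describe` as a made-up helper name.

```python
import os
import stat


def describe(path):
    """Report how a path looks with and without following symlinks."""
    st_nofollow = os.lstat(path)  # examines the link itself
    return {
        "is_symlink": stat.S_ISLNK(st_nofollow.st_mode),
        "regular_without_following": stat.S_ISREG(st_nofollow.st_mode),
        "regular_when_following": os.path.isfile(path),  # follows links
    }
```

For a hard link, both "regular" fields are true, because a hard link is just another name for the same regular file. For a symlink to a regular file, `regular_without_following` is false, so any code that checks the file type with an lstat-style call will treat the two differently.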