Buildroot Archive on lore.kernel.org
* [Buildroot] Large number of duplicate files in sdk
@ 2024-11-20 20:00 Grant Edwards
  2024-11-20 23:27 ` Grant Edwards
  0 siblings, 1 reply; 4+ messages in thread
From: Grant Edwards @ 2024-11-20 20:00 UTC (permalink / raw)
  To: buildroot

When I do a "make sdk" (using 2024.02.6), the resulting tarball
contains tons of duplicate files.  I'm using an external Linaro ARM
toolchain.  With a fairly bare-bones package selection, the sdk tarball
generated by buildroot appears to be about 40% duplicate files by
size (about 20% by count).  According to a Python app I hacked
together, it looks like there are 2300+ duplicated files taking up
380MB of wasted disk space.

[The fdupes utility finds about 100 fewer duplicates than my Python
application, so my numbers might be slightly off.]
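
A minimal sketch of how such a count could be produced. This is not the script used above; the names `find_duplicates` and `summarize` are hypothetical. It groups regular files by size and SHA-256 content hash, skips symlinks, and counts paths that already share an inode only once. Groups of any size are counted, so a file present three times contributes two redundant copies.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of (size, [paths]) for files with identical content."""
    by_size = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # ignore symlinks and special files
            st = os.stat(path)
            if (st.st_dev, st.st_ino) in seen_inodes:
                continue  # existing hard link: costs no extra space
            seen_inodes.add((st.st_dev, st.st_ino))
            if st.st_size > 0:
                by_size[st.st_size].append(path)

    groups = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for same in by_hash.values():
            if len(same) > 1:
                groups.append((size, sorted(same)))
    return groups


def summarize(groups):
    """Return (redundant_file_count, wasted_bytes), keeping one copy per group."""
    count = sum(len(paths) - 1 for _size, paths in groups)
    wasted = sum(size * (len(paths) - 1) for size, paths in groups)
    return count, wasted
```

Calling `summarize(find_duplicates("output/host"))` would give figures comparable in kind to the ones quoted above. Tools differ on details such as empty files and existing hard links, which may explain small disagreements with fdupes.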

Pretty much all of the include and library files from the external
toolchain are found under both ./opt/ext-toolchain/arm-linux-gnueabihf
and again under ./arm-buildroot-linux-gnueabihf/sysroot

Are both copies of all those files needed?

Is there some option to prune unneeded files?

My first impulse was to just delete ./opt/ext-toolchain, but I noticed
that there are a couple dozen files under ./bin that are symlinked to
.../opt/ instead of being duplicated like everything else.
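
Before deleting a directory like that, it may help to list which symlinks would be left dangling. The following is a sketch only; the function name `symlinks_into` is hypothetical and nothing here is part of Buildroot. It walks one directory and reports every symlink whose resolved target lies inside another.

```python
import os


def symlinks_into(search_dir, target_dir):
    """Return sorted (link_path, resolved_target) pairs for symlinks under
    search_dir whose resolved target lies inside target_dir."""
    target_real = os.path.realpath(target_dir)
    hits = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in dirnames, so check both lists.
    for dirpath, dirnames, filenames in os.walk(search_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            resolved = os.path.realpath(path)
            if os.path.commonpath([resolved, target_real]) == target_real:
                hits.append((path, resolved))
    return sorted(hits)
```

For the layout described above, `symlinks_into("bin", "opt/ext-toolchain")`, run from the top of the extracted SDK, should list the couple dozen links in question.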

--
Grant

_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot


* Re: [Buildroot] Large number of duplicate files in sdk
  2024-11-20 20:00 [Buildroot] Large number of duplicate files in sdk Grant Edwards
@ 2024-11-20 23:27 ` Grant Edwards
  2024-11-21  8:18   ` Peter Korsgaard
  0 siblings, 1 reply; 4+ messages in thread
From: Grant Edwards @ 2024-11-20 23:27 UTC (permalink / raw)
  To: buildroot; +Cc: buildroot

On 2024-11-20, Grant Edwards <grant.b.edwards@gmail.com> wrote:

> When I do a "make sdk" (using 2024.02.6), the resulting tarball
> contains tons of duplicate files.  I'm using an external Linaro ARM
> toolchain.  With a fairly bare-bones package selection, the sdk tarball
> generated by buildroot appears to be about 40% duplicate files by
> size (about 20% by count).

It's actually a bit worse than that. My app wasn't finding files that
were duplicated more than once.

Running a simple de-dupe utility on output/host reduced disk space
from 928MB to 556MB.

That app is pretty conservative: it only links files that have the
same name, the same parent directory name, and differing top
directory names.
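
The rule described above could be sketched roughly as follows. This is an illustration under assumptions, not the utility that produced the 928MB to 556MB figure: the name `link_matching_files` is hypothetical, "top directory" is taken to mean the first path component below the root, and a byte-for-byte comparison is added before anything is linked.

```python
import filecmp
import os
from collections import defaultdict


def link_matching_files(root, dry_run=True):
    """Hard-link identical files that share a file name and a parent
    directory name but sit under different top-level directories of root.
    Returns a list of (kept_path, replaced_path) pairs."""
    candidates = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        if rel_dir == ".":
            continue  # files directly in root have no top-level directory
        top = rel_dir.split(os.sep)[0]
        parent = os.path.basename(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            candidates[(parent, name)].append((top, path))

    linked = []
    for entries in candidates.values():
        entries.sort()
        kept = []  # one representative per distinct content
        for top, path in entries:
            for kept_top, kept_path in kept:
                if os.path.samefile(kept_path, path):
                    break  # already the same inode, nothing to do
                if kept_top != top and filecmp.cmp(kept_path, path, shallow=False):
                    if not dry_run:
                        # Link under a temporary name, then rename over the
                        # duplicate so the path never disappears.
                        tmp = path + ".dedupe-tmp"
                        os.link(kept_path, tmp)
                        os.replace(tmp, path)
                    linked.append((kept_path, path))
                    break
            else:
                kept.append((top, path))
    return linked
```

Hard-linked paths share one inode, so permissions and timestamps become common to both names. With `dry_run=True` (the default here) the function only reports what it would link.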


* Re: [Buildroot] Large number of duplicate files in sdk
  2024-11-20 23:27 ` Grant Edwards
@ 2024-11-21  8:18   ` Peter Korsgaard
  2024-11-21 15:06     ` Grant Edwards
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Korsgaard @ 2024-11-21  8:18 UTC (permalink / raw)
  To: Grant Edwards; +Cc: buildroot, buildroot

>>>>> "Grant" == Grant Edwards <grant.b.edwards@gmail.com> writes:

 > On 2024-11-20, Grant Edwards <grant.b.edwards@gmail.com> wrote:
 >> When I do a "make sdk" (using 2024.02.6), the resulting tarball
 >> contains tons of duplicate files.  I'm using an external Linaro ARM
 >> toolchain.  With a fairly bare-bones package selection, the sdk tarball
 >> generated by buildroot appears to be about 40% duplicate files by
 >> size (about 20% by count).

 > It's actually a bit worse than that. My app wasn't finding files that
 > were duplicated more than once.

 > Running a simple de-dupe utility on output/host reduced disk space
 > from 928MB to 556MB.

 > That app is pretty conservative: it only links files that have the
 > same name, the same parent directory name, and differing top
 > directory names.

Some duplicates are expected, e.g. we have a number of packages that can
be built for the host and the target (e.g. python3), so if your SDK has
both host and target variants enabled then there will be some duplicated
files.

As a comparison, I have an SDK built with 2024.02.8 and a
Buildroot-generated external toolchain where the SDK .tar.gz is 227MB
and extracted:

du -hs
741M    .

fdupes -rm .
4179 duplicate files (in 3982 sets), occupying 100.6 megabytes

Focusing on the big files I see:

fdupes -rS -G $(( 1000 * 1024 )) .
1467305 bytes each:
./opt/ext-toolchain/share/man/man1/aarch64-buildroot-linux-gnu-g++.1
./opt/ext-toolchain/share/man/man1/aarch64-buildroot-linux-gnu-gcc.1

3334288 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/bin/ld.bfd
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/bin/ld

5427708 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libc.a
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libc.a

6316256 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.a
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/lib64/libstdc++.a
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.a

1582810 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libm-2.38.a
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libm-2.38.a

2892864 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.so.6.0.30
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/lib64/libstdc++.so.6.0.30
./aarch64-buildroot-linux-gnu/sysroot/usr/lib/libstdc++.so.6.0.30

3386286 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_common
./aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_common

1111538 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_pinyin
./aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/iso14651_t1_pinyin

4523291 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/cns11643_stroke
./aarch64-buildroot-linux-gnu/sysroot/usr/share/i18n/locales/cns11643_stroke

2148440 bytes each:
./opt/ext-toolchain/aarch64-buildroot-linux-gnu/sysroot/lib/libc.so.6
./aarch64-buildroot-linux-gnu/sysroot/lib/libc.so.6


So mainly the copy we do of the external toolchain into host/. I think
we could be smarter about using hard links instead of actually copying
files / perhaps use hardlink before creating the SDK tarball:

hardlink .
Mode:                     real
Method:                   sha256
Files:                    15298
Linked:                   2903 files
Compared:                 0 xattrs
Compared:                 14922 files
Saved:                    72.68 MiB
Duration:                 0.855040 seconds

du -hs
661M    .
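
Whether hard-linking before packaging also shrinks the tarball depends on the archiver recording the second name as a link entry rather than a second copy. The sketch below illustrates that with Python's tarfile module, which does so by default. The helper name `tar_summary` is hypothetical, and how the SDK tarball is actually created is not shown here.

```python
import io
import tarfile


def tar_summary(src_dir):
    """Archive src_dir into an in-memory, uncompressed tar and return
    (archive_size_in_bytes, names_of_hard_link_members)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # With the default dereference=False, a second path to an inode
        # that was already added is stored as a link entry with no data.
        tar.add(src_dir, arcname=".")
    size = len(buf.getvalue())
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        links = [m.name for m in tar.getmembers() if m.islnk()]
    return size, links
```

Running `tar_summary` on a tree before and after `hardlink .` should show the archive size dropping by roughly the amount hardlink reports as saved, before compression.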

-- 
Bye, Peter Korsgaard

* Re: [Buildroot] Large number of duplicate files in sdk
  2024-11-21  8:18   ` Peter Korsgaard
@ 2024-11-21 15:06     ` Grant Edwards
  0 siblings, 0 replies; 4+ messages in thread
From: Grant Edwards @ 2024-11-21 15:06 UTC (permalink / raw)
  To: buildroot; +Cc: buildroot

On 2024-11-21, Peter Korsgaard <peter@korsgaard.com> wrote:
> > On 2024-11-20, Grant Edwards <grant.b.edwards@gmail.com> wrote:
> >> When I do a "make sdk" (using 2024.02.6), the resulting tarball
> >> contains tons of duplicate files. [...]
>
> Some duplicates are expected, e.g. we have a number of packages that
> can be built for the host and the target (e.g. python3), so if your
> SDK has both host and target variants enabled then there will be some
> duplicated files.

Yes, that I expected. Those files need to be in both places. You can
save some space by hardlinking them, but you actually need two
"copies".  What surprised me was the duplication of the include and
library trees from the toolchain.

I would have thought that foo.h, libfoo.a, libfoo.so, and libfoo.so.1
would only need to be in one place (each). Does there need to be a
copy in both sysroot and opt/ext-toolchain?

Some effort was expended to avoid duplicating the toolchain binaries
themselves, but all of the other files are duplicated.

> So mainly the copy we do of the external toolchain into host/.

And I limited my custom "de-dupe" utility to that case: hard-linking
files that were conceptually "the same file" in duplicate trees. I
didn't link files within sysroot that happened to have identical
content or link files within opt/ext-toolchain because they had
identical content.

> I think we could be smarter about using hard links instead of
> actually copying files / perhaps use hardlink before creating the
> SDK tarball:

We're probably already further down this rabbit hole than it deserves,
but if an include/library file needs to be in sysroot, does it still
need to be in the opt/ext-toolchain directory also? Can we move the
file instead of copying/linking it? Are there some situations where
libfoo.so.1 is used from sysroot and other situations where it is used
from opt/ext-toolchain?  If yes, then moving the file won't work. Are
the two libfoo.so.1 files ever expected to differ? If not, then
linking seems to be the right answer instead of copying.

I tried using symlinks from sysroot to opt/ext-toolchain instead of
hardlinks, but I could not get that to work. I didn't pursue that for
long, but it appeared that <something> was refusing to follow symlinks
for <some?> library files. The file was found in the expected
location, but it was the "wrong type": it was a symlink.
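
One general way a "wrong type" result can arise, offered only as an illustration since the tool and the check involved were not identified: a program that inspects a path without following links sees a symlink, whereas a hard link is indistinguishable from a regular file. The helper name `classify` is hypothetical.

```python
import os
import stat


def _kind(mode):
    """Map an st_mode value to a short file-type name."""
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISREG(mode):
        return "regular"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def classify(path):
    """Return (type_without_following_links, type_after_following_links)."""
    return _kind(os.lstat(path).st_mode), _kind(os.stat(path).st_mode)
```

For a symlink to a library, `classify` returns `("symlink", "regular")`. For a hard link to the same library it returns `("regular", "regular")`, which is consistent with hard links working where symlinks did not.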

--
Grant


