Buildroot Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
To: buildroot@busybox.net
Subject: [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
Date: Thu, 3 Jan 2019 22:37:52 +0100	[thread overview]
Message-ID: <20190103223752.5262bdac@windsurf> (raw)
In-Reply-To: <20190103160431.13306-1-yann.morin.1998@free.fr>

Hello,

On Thu,  3 Jan 2019 17:04:31 +0100, Yann E. MORIN wrote:
> Currently, we check that no two packages write to the same files, as a
> sanity check. We do so by checking which files were touched since the
> end of the build (aka begining of the installation).
> 
> However, when the packages do install the exact same file, i,e, the
> same content, we in fact do not really care what package had provided
> said file.
> 
> In the past, we avoided that situation because we were md5sum-inf every
> files before and after installation. Anything that changed was new or
> modified, and everything that did not change was not modified (but could
> have been reinstalled).
> 
> However, since 7fb6e78254 (core/instrumentation: shave minutes off the
> build time), we're now using mtimes, and we're in the situation that the
> exact same file installed by two-or-more packages is reported.
> 
> In such a situation, it is not very interesting to know what package
> installed the file, because whatever the ordering, or whatever the
> subset of said packages, we'd have ended up with the same file anyway.
> One prominent case where this happens, if the fftw familly of pcackages,
> that all install different libraries, but install the same set of
> identical headers and some common utilities.
> 
> The rationale for 7fb6e78254 was that the impact on the build time was
> horrible, so we can't revert it.
> 
> However, we can partially restore use of md5 to detect if the files were
> actually modified or not, and limiting the md5sum-ing only to those
> actually modified files. The comparison of the md5 is then offloaded to
> later, during the final check-uniq-files calls.
> 
> Since the md5sum is run only on new files, they are cache-hot, and so
> the md5 is not going to storage to fetch them (and if they were not
> cache-hot, it would mean memory pressure for a reason or another, and
> the system would already be slowed for other reasons).
> 
> Using a defconfig with a various set of ~236 packages, the build times
> reported by running "time make >log.file 2>&1", on an otherwise unloaded
> system, were (smallest value of 5 runs each):
> 
>     before:     35:15.92
>     after:      35:22.03
> 
> Which is a slight overhead of just 6.11s, which is less than 26ms per
> package. Also, the system, although pretty idle, was still doing
> background work like fetching mails and such, so the overhead is not
> even consistent across various measurements...

Overall, I find the idea good and useful.

A couple of comments/questions (here and below):

 - I believe we should still fix the fact that the .la file timestamp
   is changed everytime we $(SED) the .la files, even if they are not
   changed. This will also help reduce the number of files to check
   using md5sum: a .la file should be fixed only once, and its
   timestamp never touched again.

 - Could we make a check-uniq-files error a hard build failure, as a
   follow-up patch from yours ? So far, check-uniq-files errors have
   been a bit useless, because they do not cause the build to fail, so
   nobody really cared.

> diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk
> index f5cab2b9c2..7c6a995c19 100644
> --- a/package/pkg-generic.mk
> +++ b/package/pkg-generic.mk
> @@ -64,12 +64,14 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time
>  # $(3): suffix of file  (optional)
>  define step_pkg_size_inner
>  	@touch $(BUILD_DIR)/packages-file-list$(3).txt
> -	$(SED) '/^$(1),/d' $(BUILD_DIR)/packages-file-list$(3).txt
> +	$(SED) '/^$(1)  /d' $(BUILD_DIR)/packages-file-list$(3).txt

Why is the separator changed from , to space ?

>  	cd $(2); \
> -	find . \( -type f -o -type l \) \
> +	find . -xtype f \

Why are you changing from -type f -o -type l to -xtype f ? Is this
really related to the change ? Is -xtype supported in old version of
find ?

>  		-newer $($(PKG)_DIR)/.stamp_built \
> -		-exec printf '$(1),%s\n' {} + \
> -		>> $(BUILD_DIR)/packages-file-list$(3).txt  
> +		-print0 \
> +	|xargs -0 -r md5sum \

Ah, you're changing the separator to two spaces, because md5sum's output
already uses two spaces as a separator ?

This could break people who have scripts that parse the
package-file-list.txt contents. Maybe we should keep the compatibility,
and keep using , as a separator, and push the md5sum as an additional
third column ?

Like:
	|xargs -0 -r md5sum \
	| sed -r -e 's/([0-9a-f]{32})  (.*)/$(1),\2,\1/'

One drawback is that it won't work with filenames that contain a comma.
But in this case, we're forced to keep the filename field as the last
one, and therefore break backward compatibility of the
package-file-lists.txt format every time we want to add a new
information to it.

> +	|sed -r -e 's/^/$(1)  /' \
> +	>> $(BUILD_DIR)/packages-file-list$(3).txt
>  endef
>  
>  define step_pkg_size
> diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
> index fbc6b5d6e7..94deea01db 100755
> --- a/support/scripts/check-uniq-files
> +++ b/support/scripts/check-uniq-files
> @@ -24,24 +24,30 @@ def main():
>          sys.stderr.write('No type was provided\n')
>          return False
>  
> -    file_to_pkg = defaultdict(list)
> +    file_to_pkg = defaultdict(set)

You're changing from list to set, it's a bit unrelated. Should be a
separate commit ?

> +    file_md5 = defaultdict(set)
>      with open(args.packages_file_list[0], 'rb') as pkg_file_list:
>          for line in pkg_file_list.readlines():
> -            pkg, _, file = line.rstrip(b'\n').partition(b',')
> -            file_to_pkg[file].append(pkg)
> +            pkg, md5, file = line.rstrip(b'\n').split(b'  ', 2)

The ", 2" in split() is to make sure that if the filename contains two
consecutive spaces, we don't split it up ?

> +            file_to_pkg[file].add(pkg)
> +            file_md5[file].add(md5)
>  
>      for file in file_to_pkg:
> -        if len(file_to_pkg[file]) > 1:
> -            # If possible, try to decode the binary strings with
> -            # the default user's locale
> -            try:
> -                sys.stderr.write(warn.format(args.type, file.decode(),
> -                                             [p.decode() for p in file_to_pkg[file]]))
> -            except UnicodeDecodeError:
> -                # ... but fallback to just dumping them raw if they
> -                # contain non-representable chars
> -                sys.stderr.write(warn.format(args.type, file,
> -                                             file_to_pkg[file]))
> +        if len(file_to_pkg[file]) == 1 or len(file_md5[file]) == 1:
> +            # If only one package installed that file, or they all
> +            # installed the same content, don't report that file.
> +            continue
> +
> +        # If possible, try to decode the binary strings with
> +        # the default user's locale
> +        try:
> +            sys.stderr.write(warn.format(args.type, file.decode(),
> +                                         [p.decode() for p in file_to_pkg[file]]))
> +        except UnicodeDecodeError:
> +            # ... but fallback to just dumping them raw if they
> +            # contain non-representable chars
> +            sys.stderr.write(warn.format(args.type, file,
> +                                         file_to_pkg[file]))

I'm not sure why the whole test is inverted, it causes a lot of noise
in the diff. If there is a good reason for it, should be a separate
patch.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

  reply	other threads:[~2019-01-03 21:37 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-03 16:04 [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files Yann E. MORIN
2019-01-03 21:37 ` Thomas Petazzoni [this message]
2019-01-03 22:38   ` Yann E. MORIN
2019-01-04  9:24     ` Thomas Petazzoni
2019-01-04  9:51       ` Yann E. MORIN
2019-01-04 10:03         ` Thomas Petazzoni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190103223752.5262bdac@windsurf \
    --to=thomas.petazzoni@bootlin.com \
    --cc=buildroot@busybox.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox