All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ola x Nilsson <ola.x.nilsson@axis.com>
To: "richard.purdie@linuxfoundation.org"
	<richard.purdie@linuxfoundation.org>
Cc: bitbake-devel <bitbake-devel@lists.openembedded.org>
Subject: Re: [RFC][WIP][PATCHv1] lib/bb/checksum.py: Speed-up checksum gen when directory is git
Date: Fri, 11 Oct 2019 14:36:06 +0200	[thread overview]
Message-ID: <jwq1rvjwjbd.fsf@axis.com> (raw)
In-Reply-To: <c0c730e057be4e09d35a85fbcb960a3784c394ed.camel@linuxfoundation.org>


On Wed, Oct 09 2019, richard.purdie@linuxfoundation.org wrote:

> On Wed, 2019-10-09 at 11:02 +0200, Nicolas Dechesne wrote:
>> On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie@linuxfoundation.org>
>> wrote:
>> > On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
>> > > In some cases people/organizations are using a SRC_URI with
>> > > file:///PATH_TO_DIR that contains a git repository for different
>> > > reasons, it is useful when want to do an internal build without
>> > > clone sources from outside.
>> > > 
>> > > This could consume a lot of CPU time because the current taskhash
>> > > generation mechanism didn't identify that the folder is a VCS
>> > > (git, svn, cvs) and makes the cksum for every file including the
>> > > .git repository in this case.
>> > > 
>> > > There are different ways to improve the situation,
>> > > 
>> > > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>> > >   calculated before the fetcher identifies the protocol, will
>> > > require
>> > >   some changes in bitbake codebase.
>> > > * This patch: When directory is a git repository (contains .git)
>> > >   use HEAD rev + git diff to calculate checksum instead of do it
>> > >   in every file, that is hackish because make some assumptions
>> > > about
>> > >   .git directory contents.
>> > > * Variant of this patch: Make a list of VCS directories (.git,
>> > > .svn,
>> > >   .cvs) and take out for cksum calculations, same as before
>> > > making
>> > >   assumptions about the . folders content.
>> > 
>> > This is an interesting one.
>> 
>> Are you referring to the last bullet here? I suspect it's the second
>> one.
>
> Sorry I wasn't clear, I was meaning in general.
>
>> 
>> Also to give a bit more background to everyone, as it might not be
>> obvious. I've seen the same pattern used several times , especially
>> in
>> large/corporate deployment of OE/YP. the whole build workspace is
>> built as:
>> 
>> > - sources
>> > ---- kernel
>> > ---- component_A
>> > ---- component_B
>> > - layers
>> > ---- poky
>> > ---- meta-mycompany
>> > -------- recipes for kernel, component_A, ...
>> 
>> The whole workspace is managed with a repo manifest, and the recipes
>> are written to use source code from the 'sources' local folder.
>> 
>> I am not trying to argue whether this is a good practice or not ;-)
>> but from the perspectives of the folks I've talked to , there are a
>> couple of critical advantages of doing something like that:
>> * it looks like Android development workflow ;-)
>> * it relates to the company license/legal process and review. e.g.
>> all
>> the software that gets out of the company is managed by a single repo
>> manifest xml file
>> * it solves "nicely" the problem of being able to iteratively develop
>> using bitbake natively. e.g. "bibtake myimage" always work, and uses
>> local changes from 'sources'
>> 
>> So overall, i am being convinced that this is a valid use case for OE
>> end users. I don't think we can use the git:// fetcher as we need the
>> snapshot of the current 'sources' (with local changes), and using the
>> file:// fetcher has important performance impacts:
>> * checksum for 'each' file (which can be large, especially for
>> kernel)
>> * un-expected rebuild when running repo sync, if any new git objects
>> are put in .git (even when no changes are made to the local worktree
>> of the git project).
>> 
>> > File checksums are added to the hashes "late" so that we don't have
>> > to
>> > reparse entire recipes when files change. We do need a mechanism to
>> > know when we need to reparse the checksum. I think this means you
>> > can
>> > skip the checksum calculation for each file but you do still end up
>> > having to stat all files in the tree separately for bitbake's
>> > tracking
>> > and for git. We also have to notice when new files are added.
>> > 
>> > As such I'm not convinced this patch will work correctly (e.g.
>> > would it
>> > notice if I copy in a new file to the directory untracked by git).
>> 
>> At least I confirm that with the file:// fetcher everything works
>> fine, when modifying files. I don't think I have tried adding new
>> files. But I will try that.
>
> I'd like to check that bitbake's hashes are changing correctly in the
> different modification cases.
>
> I did also wondering about this kind of trick, borrowed from stack
> overflow:
>
> if [ ! -e  .git/allfilesindex ]; then
>     cp .git/index .git/allfilesindex
> fi
> GIT_INDEX_FILE=.git/allfilesindex git add -u; git write-tree
>
> to get a hash which represents the state of the tree. Using git add -A
> might track untracked files too.

This is what externalsrc.bbclass does to detect changes.

/Ola

>> Are you trying to say that to fix this properly we might need another
>> Fetcher , something in between file:// and git://, e.g. localgit://?
>> Would that make this problem easier to solve?
>
> I'm not sure about that. I'm mainly worried that we have reports that
> file:// doesn't work correctly today before we add this kind of
> complexity on top. Hence my comments about needing better tests in this
> area, with and without git involved.
>
> I'm a little bit too focused on the release to be able to think clearly
> about this right now which doesn't help.
>
> Cheers,
>
> Richard
>
>> > A first step may be to add some further tests to bitbake-selftest
>> > to
>> > better cover this area...
>> > 
>> > Cheers,
>> > 
>> > Richard
>> > 
>> > 
>> > 
>> > 
>> > 


-- 
Ola x Nilsson


  reply	other threads:[~2019-10-11 12:36 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-08 18:45 [RFC][WIP][PATCHv1] lib/bb/checksum.py: Speed-up checksum gen when directory is git Aníbal Limón
2019-10-08 20:53 ` Mark Hatle
2019-10-08 21:07   ` Nicolas Dechesne
2019-10-08 22:31     ` Mark Hatle
2019-10-08 23:15 ` richard.purdie
2019-10-09  9:02   ` Nicolas Dechesne
2019-10-09 11:53     ` richard.purdie
2019-10-11 12:36       ` Ola x Nilsson [this message]
2019-10-10 23:53     ` Mark Hatle
2019-11-06  4:50 ` Nicolas Dechesne
2019-11-13 19:44   ` Anibal Limon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jwq1rvjwjbd.fsf@axis.com \
    --to=ola.x.nilsson@axis.com \
    --cc=bitbake-devel@lists.openembedded.org \
    --cc=richard.purdie@linuxfoundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.