Re: [Openembedded-architecture] Adding more information to the SBOM

From: Mark Hatle <mark.hatle@kernel.crashing.org>
To: Joshua Watt <JPEWhacker@gmail.com>,
	Marta Rybczynska <rybczynska@gmail.com>
Cc: OE-core <openembedded-core@lists.openembedded.org>,
	openembedded-architecture@lists.openembedded.org
Subject: Re: [Openembedded-architecture] Adding more information to the SBOM
Date: Wed, 14 Sep 2022 20:16:42 -0500
Message-ID: <85baa0f3-937d-fce5-9fc2-0d9315d030ac@kernel.crashing.org>
In-Reply-To: <CAJdd5GaVFL-GyPrCHKTGn1rBU1-khVN0TnbeYv_Yr0QHUDLwig@mail.gmail.com>



On 9/14/22 9:56 AM, Joshua Watt wrote:
> On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@gmail.com> wrote:
>>
>> Dear all,
>> (cross-posting to oe-core and *-architecture)
>> In the last months, we have worked in Oniro on using the create-spdx
>> class for both IP compliance and security.
>>
>> During this work, Alberto Pianon has found that some information is
>> missing from the SBOM and it does not contain enough for Software
>> Composition Analysis. The main missing point is the relation between
>> the actual upstream sources and the final binaries (create-spdx uses
>> composite sources).
> 
> I believe we map the binaries to the source code from the -dbg
> packages; is the premise that this is insufficient? Can you elaborate
> more on why that is, I don't quite understand. The debug sources are
> (basically) what we actually compiled (e.g. post-do_patch) to produce
> the binary, and you can in turn follow these back to the upstream
> sources with the downloadLocation property.

When I last looked at this, it was critical that the analysis be:

binary -> patched & configured source (dbg package) -> how the sources were 
constructed.

As Joshua said above, I believe all of the information for this is present: you 
can tie the binary (through debug symbols) back to the debug package, and the 
sources in the debug package back to the sources that constructed them via 
heuristics.  (If you enable the git patch mechanism, it should even be possible 
to use git blame to find exactly which upstreams constructed the patched 
sources.)
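
As an illustration of the git blame idea, here is a minimal Python sketch. It 
assumes the patched source tree is a git repository in which the upstream import 
and each applied patch are separate commits, and that the text passed in was 
produced by "git blame --line-porcelain <file>".  The function name 
lines_per_commit is hypothetical, not part of any existing tool.

```python
import re
from collections import Counter

# A porcelain entry starts with "<commit id> <orig line> <final line>".
# 40 hex digits for SHA-1 repositories, 64 for SHA-256 ones.
ENTRY_HEADER = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+")


def lines_per_commit(porcelain):
    """Count how many source lines each commit contributed.

    `porcelain` is the text printed by `git blame --line-porcelain <file>`.
    Returns a Counter keyed by the commit summary (the subject line).
    """
    counts = Counter()
    summaries = {}
    current = None
    for line in porcelain.split("\n"):
        if line.startswith("\t"):
            # Tab-prefixed lines carry the blamed file content itself.
            counts[summaries.get(current, current)] += 1
            continue
        header = ENTRY_HEADER.match(line)
        if header:
            current = header.group(1)
        elif line.startswith("summary ") and current is not None:
            summaries[current] = line[len("summary "):]
    return counts
```

Lines attributed to the upstream import commit are untouched upstream code; 
lines attributed to any other commit come from whatever that commit applied.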

For generated content, it's more difficult -- but for those items usually there 
is a header which indicates what generated the content so other heuristics can 
be used.
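
A rough Python sketch of such a header heuristic follows.  The marker patterns 
and the function name find_generator are illustrative assumptions (real 
generators word their banners in many different ways), and this is not something 
create-spdx is known to do today.

```python
import re

# Hypothetical banner patterns; extend these for the generators you care about.
GENERATED_MARKERS = [
    re.compile(r"generated by\s+(?P<tool>[^\n]+)", re.IGNORECASE),
    re.compile(r"do not edit", re.IGNORECASE),
]


def find_generator(text, max_lines=10):
    """Return the generator named in a header banner, "unknown" if the
    file only says it is generated, or None if no banner is found."""
    header = "\n".join(text.split("\n")[:max_lines])
    for pattern in GENERATED_MARKERS:
        match = pattern.search(header)
        if match:
            tool = match.groupdict().get("tool")
            # Trim comment delimiters and trailing punctuation from the name.
            return tool.strip(" \t.*/") if tool else "unknown"
    return None
```

Only the first few lines are scanned, so a matching phrase deep inside the file 
body does not cause a false positive.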

>>
>> Alberto has worked on how to obtain the missing data and now has a
>> POC. This POC provides full source-to-binary tracking of Yocto builds
>> through a couple of scripts (intended to be transformed into a new
>> bbclass at a later stage). The goal is to add the missing pieces of
>> information in order to get a "real" SBOM from Yocto, which should, at
>> a minimum:
> 
> Please be a little careful with the wording; SBoMs have a lot of uses,
> and many of them we can satisfy with what we currently generate; it
> may not do the exact use case you are looking for, but that doesn't
> mean it's not a "real" SBoM :)
> 
>>
>> - carefully describe what is found in a final image (i.e. binary files
>> and their dependencies), since that is what is actually distributed
>> and goes into the final product;
>> - describe how such binary files have been generated and where they
>> come from (i.e. upstream sources, including patches and other stuff
>> added from meta-layers); provenance is important for a number of
>> reasons related to IP Compliance and security.

Full compliance will require binaries mapped to patched sources, patched sources 
mapped to upstream sources, _AND_ the instructions (layer/recipe/configuration) 
used to build them.  But it's up to the local legal determination to figure out 
'how far you really need to go', vs. just "here are the layers I used to build 
my project".

>> The aim is to become able to:
>>
>> - map binaries to their corresponding upstream source packages (and
>> not to the "internal" source packages created by recipes by combining
>> multiple upstream sources and patches)
>> - map binaries to the source files that have been actually used to
>> build them - which usually are a small subset of the whole source
>> package
>>
>> With respect to IP compliance, this would allow to, among other things:
>>
>> - get the real license text for each binary file, by getting the
>> license of the specific source files it has been generated from
>> (provided by Fossology, for instance), - and not the main license
>> stated in the corresponding recipe (which may be as confusing as
>> GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
>> even worse)
> 
> IIUC this is the difference between the "Declared" license and the
> "Concluded" license. You can report both, and I think
> create-spdx.bbclass can currently do this with its rudimentary source
> license scanning. You really do want both and it's a great way to make
> sure that the "Declared" license (that is the license in the recipe)
> reflects the reality of the source code.

And the thing to keep in mind is that in a given package the "Declared" license 
is usually what a LICENSE file or header says, while the "Concluded" license has 
levels of quality behind it.  The first level of quality is "Declared".  The 
next level is automation (something like FOSSology), the next level is human 
reviewed, and the highest level is "lawyer reviewed".
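
The ordering of these levels can be sketched in Python as below.  The names 
ConclusionQuality and best_conclusion are hypothetical, and the quality level is 
assumed to be stored alongside the license by the tooling; it is not a field 
that SPDX itself defines.

```python
from enum import IntEnum


class ConclusionQuality(IntEnum):
    """Confidence in a concluded license, lowest first."""
    DECLARED = 0         # taken from the LICENSE file or header
    AUTOMATED = 1        # produced by a scanner such as FOSSology
    HUMAN_REVIEWED = 2   # checked by a person
    LAWYER_REVIEWED = 3  # checked by legal counsel


def best_conclusion(conclusions):
    """Pick the (license, quality) pair with the highest quality level."""
    return max(conclusions, key=lambda item: item[1])
```

Given several candidate conclusions for the same file, the one with the highest 
level wins, so a lawyer-reviewed value overrides a scanner result.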

So being able to inject SPDX information with Concluded values for evaluation 
and track the 'quality level' has always been something I wanted to do, but 
never had time.

At the time, my idea was a database (and/or bbappend) for each component that 
would include pre-processed SPDX data for each recipe.  This data would run 
through a validation step to show it actually matches the patched sources.  (If 
any file checksums do NOT match, then they would be flagged for follow up.)
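
A minimal Python sketch of that validation step is shown below.  It assumes the 
pre-processed data has already been reduced to a mapping of relative file path 
to SHA-256 digest; the function names and the choice of SHA-256 are assumptions 
made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def validate_checksums(source_dir, recorded):
    """Compare recorded checksums against the patched source tree.

    `recorded` maps relative file paths to expected SHA-256 digests.
    Returns the sorted list of paths that are missing or do not match.
    """
    root = Path(source_dir)
    flagged = []
    for rel_path, expected in recorded.items():
        candidate = root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            flagged.append(rel_path)
    return sorted(flagged)
```

An empty result means the recorded data still describes the patched sources; 
anything returned is a file to flag for follow up.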

>> - automatically check license incompatibilities at the binary file level.
>>
>> Other possible interesting things could be done also on the security side.
>>
>> This work intends to add a way to provide additional data that can be
>> used by create-spdx, not to replace create-spdx in any way.
>>
>> The sources with a long README are available at
>> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker
>>
>> What do you think of this work? Would it be of interest to integrate
>> into YP at some point? Shall we discuss this?
> 
> This seems promising as something that could potentially move into
> core. I have a few points:
>   - The extraction of the sources to a dedicated directory is something
> that Richard has been toying around with for quite a while, and I
> think it would greatly simplify that part of your process. I would
> very much encourage you to look at the work he's done, and work on
> that to get it pushed across the finish line as it's a really good
> improvement that would benefit not just your source scanning.
>   - I would encourage you to not wait to turn this into a bbclass
> and/or library functions. You should be able to do this in a new
> layer, and that would make it much clearer as to what the path to
> being included in OE-core would look like. It also would (IMHO) be
> nicer to the users :)

Agreed, this looks useful.  The key is to start turning it into one or more 
bbclasses now: things that work with the Yocto Project process.  Don't try to 
"post-process" and reconstruct sources.  Instead, inject steps that will run 
your file checksums and build up your database as the sources are constructed 
(during do_unpack, do_patch, etc.).
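
To make the idea concrete, here is a small Python sketch of recording checksums 
at two points and comparing them.  It assumes one snapshot is taken after 
do_unpack and another after do_patch; the function names and labels are 
hypothetical, and wiring this into actual task hooks in a bbclass is not shown.

```python
import hashlib
from pathlib import Path


def snapshot(tree):
    """Map each file under `tree` (relative POSIX path) to its SHA-256 digest."""
    root = Path(tree)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def classify(after_unpack, after_patch):
    """Label every file in the patched tree by how the patch step affected it."""
    labels = {}
    for name, digest in after_patch.items():
        if name not in after_unpack:
            labels[name] = "added-by-patch"
        elif after_unpack[name] != digest:
            labels[name] = "modified-by-patch"
        else:
            labels[name] = "upstream"
    return labels
```

Because the snapshots are taken while the sources are being constructed, nothing 
has to be reconstructed afterwards.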

The key is, all of the information IS available.  It just may not be in the 
format you want.

--Mark

>>
>> Marta and Alberto
>>
>>
