Re: [Openembedded-architecture] Adding more information to the SBOM


From: Mark Hatle <mark.hatle@kernel.crashing.org>
To: Alberto Pianon <alberto@pianon.eu>,
	Richard Purdie <richard.purdie@linuxfoundation.org>
Cc: Marta Rybczynska <rybczynska@gmail.com>,
	OE-core <openembedded-core@lists.openembedded.org>,
	openembedded-architecture@lists.openembedded.org,
	Joshua Watt <JPEWhacker@gmail.com>,
	"'Carlo Piana'" <carlo@piana.eu>,
	davide.ricci@huawei.com
Subject: Re: [Openembedded-architecture] Adding more information to the SBOM
Date: Fri, 16 Sep 2022 10:49:58 -0500
Message-ID: <4c2cee7b-2e17-adc6-f603-86c78468cc55@kernel.crashing.org>
In-Reply-To: <10e816efb661938db17c512199720580@pianon.eu>

On 9/16/22 10:18 AM, Alberto Pianon wrote:

... trimmed ...

>> I also can see the issue with multiple sources in SRC_URI, although you
>> should be able to map those back if you assume subtrees are "owned" by
>> given SRC_URI entries. I suspect there may be a SPDX format limit in
>> documenting that piece?
> 
> I'm replying in reverse order:
> 
> - there is a SPDX format limit, but it is by design: a SPDX package
>     entity is a single sw distribution unit, so it may have only one
>     downloadLocation; if you have more than one downloadLocation, you must
>     have more than one SPDX package, according to SPDX specs;

I think my interpretation of this is different.  I've got a view of 'sourcing 
materials', and then verifying they are what we think they are and can be used 
the way we want.  The "upstream sources" (and patches) are really just 'raw 
materials' that we use the Yocto Project to combine to create "the source".

So for the purpose of the SPDX, each upstream source _may_ have a corresponding 
SPDX, but for the binaries their source is the combined unit, not multiple 
SPDXes.  Think of it as something like:

upstream source1 - SPDX
upstream source2 - SPDX
upstream patch
recipe patch1
recipe patch2

In the above, each of those items would be combined by the recipe system to 
construct the source used to build an individual recipe (and collection of 
packages).  Automation _IS_ used to combine the components [unpack/fetch] and 
_MAY_ be used to generate a combined SPDX.

So your "upstream" location for this recipe is the local machine's source 
archive.  The SPDX for the local recipe files can merge the SPDX information 
they know and (if it's at a file level) can use checksums to identify the items 
not captured/modified by the patches for further review (either manually or 
through automation like FOSSology).  In the case where an upstream has SPDX 
data, you should be able to inherit MOST files this way... but the output is 
specific to your configuration and patches.

1 - SPDX |
2 - SPDX |
patch    |---> recipe specific SPDX
patch    |
patch    |

In some cases someone may want to generate SPDX data for the 3 patches, but that 
may or may not be useful in this context.
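
As a rough illustration of the checksum-based inheritance described above, here 
is a minimal Python sketch.  It is not how any existing class works.  The 
function name and the dictionary layout (relative path mapped to a "sha256" and 
a "license" entry) are assumptions made up for this example; real upstream SPDX 
data would have to be parsed into that shape first.

```python
from typing import Dict, List, Tuple


def inherit_file_info(
    upstream: Dict[str, Dict[str, str]],
    local_checksums: Dict[str, str],
) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Split local source files into inherited records and files to review.

    upstream: relative path -> {"sha256": ..., "license": ...}, taken from an
        upstream SPDX document (hypothetical layout, for illustration only).
    local_checksums: relative path -> sha256 of the file as it exists in the
        recipe's combined (fetched, unpacked, patched) source tree.
    """
    inherited: Dict[str, Dict[str, str]] = {}
    needs_review: List[str] = []
    for path, digest in sorted(local_checksums.items()):
        record = upstream.get(path)
        if record is not None and record["sha256"] == digest:
            # Byte-identical to what upstream described: reuse its data.
            inherited[path] = record
        else:
            # Added or modified locally (for example by a patch), or never
            # covered upstream: leave it for manual or automated review.
            needs_review.append(path)
    return inherited, needs_review
```

Anything that lands in the second list is what would go on to manual review or 
a scanner; the first list is the "inherit MOST files" part.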

> - I understand that my solution is a bit hacky; but IMHO any other
>     *post-mortem* solution would be far more hacky; the real solution
>     would be collecting required information directly in do_fetch and
>     do_unpack

I've not looked at the current SPDX spec, but past versions had a notes section. 
Assuming this is still present you can use it to reference back to how this 
component was constructed and the upstream source URIs (and SPDX files) you used 
for processing.

This way nothing really changes in do_fetch or do_unpack.  (You may want to find 
a way to capture file checksums and what the source was for a particular file.. 
but it may not really be necessary!)
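
A minimal sketch of the notes idea, assuming the free-text field is the package 
"comment" entry of the SPDX 2.x JSON layout (this should be checked against the 
spec version actually in use).  The function name and the example URIs are 
hypothetical.

```python
import json
from typing import Dict, List


def recipe_package_record(
    name: str, version: str, upstream_uris: List[str]
) -> Dict[str, str]:
    """Describe the combined recipe source as one package-like record.

    The upstream URIs are listed in the free-text comment instead of being
    split into several packages.  Key names follow the SPDX 2.x JSON layout
    as an assumption; verify them against the spec before relying on this.
    """
    comment_lines = ["Source assembled by fetch/unpack/patch from:"]
    comment_lines += [f"  {uri}" for uri in upstream_uris]
    return {
        "name": name,
        "versionInfo": version,
        # The combined source has no single upstream download location.
        "downloadLocation": "NOASSERTION",
        "comment": "\n".join(comment_lines),
    }


if __name__ == "__main__":
    record = recipe_package_record(
        "foo",
        "1.0",
        ["https://example.com/foo-1.0.tar.gz", "file://fix-build.patch"],
    )
    print(json.dumps(record, indent=2))
```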

> - I also understand that we should reduce pain, otherwise nobody would
>     use our solution; the simplest and cleanest way I can think about is
>     collecting just package (in the SPDX sense) files' relative paths and
>     checksums at every stage (fetch, unpack, patch, package), and leave
>     data processing (i.e. mapping upstream source packages -> recipe's
>     WORKDIR package -> debug source package -> binary packages -> binary
>     image) to a separate tool, that may use (just a thought) a graph
>     database to process things more efficiently.

Even in do_patch nothing really changes, other than that again you may want to 
capture checksums to identify things that need further processing.

This approach greatly simplifies things, and gives people doing code reviews the 
insight into what is the source used when shipping the binaries (which is really 
an important aspect of this), as well as which recipe and "build" (really 
fetch/unpack/patch) were used to construct the sources.  If they want to 
investigate the sources further back to their provider, then the notes would 
have the information for that, and you could transition back to the "raw 
materials" providers.
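
The checksum capture around do_patch mentioned above could look something like 
the following sketch.  It is a standalone Python illustration, not BitBake task 
code; the function names are invented for the example, and sha256 is an 
arbitrary choice of hash.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum_tree(root: Path) -> Dict[str, str]:
    """Map each regular file under root (relative path) to its sha256."""
    sums: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        # Skip directories and symlinks; only file contents are hashed.
        if path.is_file() and not path.is_symlink():
            rel = path.relative_to(root).as_posix()
            sums[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return sums


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return files that are new or whose content differs in 'after'.

    Files deleted between the two snapshots are not reported here.
    """
    return sorted(p for p, digest in after.items() if before.get(p) != digest)
```

Taking one snapshot of the source tree before patching and one after, the 
result of changed_files() is the set of files that need further processing; 
everything else is unchanged from the unpacked upstream source.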

>>
>> Where I became puzzled is where you say "Information about debug
>> sources for each actual binary file is then taken from
>> tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
>> and use for the spdx class so you shouldn't need to reinvent that
>> piece. It should be the exact same data the spdx class uses.
>>
> 
> you're right, but in the context of a POC it was easier to extract them
> directly from json files than from SPDX data :) It's just a POC to show
> that required information may be retrieved in some way, implementation
> details do not matter
> 
>> I was also puzzled about the difference between rpm and the other
>> package backends. The exact same files are packaged by all the package
>> backends so the checksums from do_package should be fine.
>>
> 
> Here I may miss some piece of information. I looked at files in
> tmp/pkgdata but I couldn't find package file checksums anywhere: that is
> why I parsed rpm packages. But if such checksums were already available
> somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
> at all... Could you point me to what I'm (maybe) missing here? Thanks!

File checksumming is expensive.  There are checksums available to individual 
packaging engines, as well as aggregate checksums for "hash equivalency".. but 
I'm not aware of any per-file checksum that is stored.

You definitely shouldn't be parsing packages of any type (rpm or otherwise), as 
packages are truly optional.  It's the binaries that matter here.

--Mark

> In any case, thank you much so for all your insights, they were
> super-useful!
> 
> Cheers,
> 
> Alberto
> 

