From: Stefano Tondo <stondo@gmail.com>
To: openembedded-core@lists.openembedded.org
Cc: stefano.tondo.ext@siemens.com, adrian.freihofer@siemens.com,
Peter.Marko@siemens.com, jpewhacker@gmail.com,
Ross.Burton@arm.com
Subject: [PATCH 06/14] sbom30: Fix object deduplication to preserve complete data
Date: Sat, 21 Feb 2026 05:24:10 +0100
Message-ID: <20260221042418.317535-7-stondo@gmail.com>
In-Reply-To: <20260221042418.317535-1-stondo@gmail.com>

From: Stefano Tondo <stefano.tondo.ext@siemens.com>

When consolidating SPDX documents via expand_collection(), objects
with the same SPDX ID can appear in multiple source documents with
different levels of completeness. The previous implementation used
a simple set union (self.objects |= other.objects), which would keep
an arbitrary version when duplicates existed.

This caused data loss during consolidation, particularly affecting
externalIdentifier arrays where one version might have a basic PURL
while another has multiple PURLs with Git metadata qualifiers.

Fix by implementing intelligent object merging that:
- Detects objects with duplicate SPDX IDs
- Compares completeness based on externalIdentifier count
- Keeps the more complete version (more externalIdentifiers); on a
  tie, the object already in the set is kept
- Replaces only Element objects; other types (such as CreationInfo)
  keep the existing instance to preserve object identity
- Preserves objects without IDs as-is

This ensures that consolidated SBOMs contain the most complete
metadata available from all source documents.

The bug was discovered while testing multi-PURL support where
packages can have varying externalIdentifier counts (base PURL
vs base + Git commit + Git branch PURLs), but affects any
scenario with duplicate SPDX IDs during consolidation.

Signed-off-by: Stefano Tondo <stefano.tondo.ext@siemens.com>
---
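Illustration only, not part of the commit: a minimal sketch of why a
plain set union cannot prefer the richer duplicate. FakeElement is a
hypothetical stand-in, not the oe.spdx30 class. It assumes that objects
sharing an SPDX ID hash and compare as equal, which may not match how
the real bindings behave. The identifier strings are placeholders.

```python
class FakeElement:
    """Hypothetical stand-in for an SPDX element (not the oe.spdx30 class)."""

    def __init__(self, _id, externalIdentifier=None):
        self._id = _id
        self.externalIdentifier = list(externalIdentifier or [])

    # Assumption for this sketch: same SPDX ID means same set member.
    def __eq__(self, other):
        return isinstance(other, FakeElement) and self._id == other._id

    def __hash__(self):
        return hash(self._id)


# Two versions of the same element with different levels of detail.
basic = FakeElement("pkg-example", ["purl-base"])
rich = FakeElement("pkg-example", ["purl-base", "purl-git-commit", "purl-git-branch"])

objects = {basic}
# A set union never replaces a member that is already present, so the
# version seen first survives regardless of how complete it is.
objects |= {rich}

kept = next(iter(objects))
```

Under that assumption the set still holds the basic version after the
union, and the two Git-qualified identifiers are lost.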
meta/lib/oe/sbom30.py | 47 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
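Illustration only, not part of the commit: the merge rule described in
the commit message can be sketched on plain dicts. merge_objects and the
"id" and "ext_ids" keys are hypothetical names chosen for this example.
The sketch leaves out the Element type check that the actual patch
performs.

```python
def merge_objects(mine, theirs):
    """Merge `theirs` into a copy of `mine`, preferring richer duplicates.

    Objects are dicts with an optional "id" and an optional "ext_ids" list.
    """
    merged = list(mine)
    by_id = {o["id"]: o for o in merged if o.get("id")}

    for obj in theirs:
        obj_id = obj.get("id")
        if not obj_id:
            # Objects without an ID cannot be deduplicated: keep as-is.
            merged.append(obj)
        elif obj_id not in by_id:
            # First time this ID is seen: just add it.
            merged.append(obj)
            by_id[obj_id] = obj
        else:
            current = by_id[obj_id]
            # Strictly more identifiers wins; a tie keeps the existing one.
            if len(obj.get("ext_ids", [])) > len(current.get("ext_ids", [])):
                idx = next(i for i, o in enumerate(merged) if o is current)
                merged[idx] = obj
                by_id[obj_id] = obj
    return merged


mine = [
    {"id": "a", "ext_ids": ["p1"]},
    {"id": "b", "ext_ids": ["p1", "p2"]},
]
theirs = [
    {"id": "a", "ext_ids": ["p1", "p2", "p3"]},  # richer: replaces mine
    {"id": "b", "ext_ids": ["p1"]},              # poorer: ignored
    {"id": "c"},                                 # new ID: added
    {"ext_ids": []},                             # no ID: added as-is
]
result = merge_objects(mine, theirs)
```

The comparison is deliberately crude: it counts identifiers and does not
check that the richer list is a superset of the poorer one.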
diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
index 227ac51877..c77e18f4e8 100644
--- a/meta/lib/oe/sbom30.py
+++ b/meta/lib/oe/sbom30.py
@@ -822,7 +822,52 @@ class ObjectSet(oe.spdx30.SHACLObjectSet):
                 if not e.externalSpdxId in imports:
                     imports[e.externalSpdxId] = e

-            self.objects |= other.objects
+            # Merge objects intelligently: if same SPDX ID exists, keep the one with more complete data
+            #
+            # WHY DUPLICATES OCCUR: When consolidating SPDX documents (e.g., recipe -> package -> image),
+            # the same package can be referenced at different build stages, each with varying levels of
+            # detail. Early stages may have basic PURLs, while later stages add Git metadata qualifiers.
+            # This is architectural - multi-stage builds naturally create multiple representations of
+            # the same entity.
+            #
+            # However, preserve object identity for types that get referenced (like CreationInfo)
+            # to avoid breaking serialization
+            other_by_id = {}
+            for obj in other.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    other_by_id[obj_id] = obj
+
+            self_by_id = {}
+            for obj in self.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    self_by_id[obj_id] = obj
+
+            # Merge: for duplicate IDs, prefer the object with more externalIdentifier entries
+            # but only for Element types (not CreationInfo or other non-Element objects)
+            for obj_id, other_obj in other_by_id.items():
+                if obj_id in self_by_id:
+                    self_obj = self_by_id[obj_id]
+                    # Only replace Elements with more complete data
+                    # Do NOT replace CreationInfo or other supporting types to preserve object identity
+                    if isinstance(self_obj, oe.spdx30.Element):
+                        # If both have externalIdentifier, keep the one with more entries
+                        self_ext_ids = getattr(self_obj, 'externalIdentifier', [])
+                        other_ext_ids = getattr(other_obj, 'externalIdentifier', [])
+                        if len(other_ext_ids) > len(self_ext_ids):
+                            # Replace self object with other (more complete) object
+                            self.objects.discard(self_obj)
+                            self.objects.add(other_obj)
+                    # For non-Element types (e.g. CreationInfo), keep existing to preserve identity
+                else:
+                    # New object, just add it
+                    self.objects.add(other_obj)
+
+            # Add any objects without IDs
+            for obj in other.objects:
+                if not getattr(obj, '_id', None):
+                    self.objects.add(obj)

         for o in add_objectsets:
             merge_doc(o)
--
2.53.0
Thread overview: 15+ messages
2026-02-21 4:24 [PATCH 00/14] spdx30: SBOM enrichment for PURL, metadata, and compliance Stefano Tondo
2026-02-21 4:24 ` [PATCH 01/14] spdx30: Add configurable file filtering support Stefano Tondo
2026-02-21 4:24 ` [PATCH 02/14] spdx30: Add supplier support for image and SDK SBOMs Stefano Tondo
2026-02-21 4:24 ` [PATCH 03/14] spdx30: Add ecosystem-specific PURL generation Stefano Tondo
2026-02-21 4:24 ` [PATCH 04/14] spdx30: Add version extraction from SRCREV for Git source components Stefano Tondo
2026-02-21 4:24 ` [PATCH 05/14] spdx30: Add SPDX_GIT_PURL_MAPPINGS for Git hosting Stefano Tondo
2026-02-21 4:24 ` Stefano Tondo [this message]
2026-02-21 4:24 ` [PATCH 07/14] spdx30: Enrich source downloads with external refs and PURLs Stefano Tondo
2026-02-21 4:24 ` [PATCH 08/14] spdx30: Include recipe base PURL in package external identifiers Stefano Tondo
2026-02-21 4:24 ` [PATCH 09/14] spdx30: Add image root metadata package with describes relationship Stefano Tondo
2026-02-21 4:24 ` [PATCH 10/14] spdx30_tasks: Fix non-deterministic BUILDNAME in image package version Stefano Tondo
2026-02-21 4:24 ` [PATCH 11/14] spdx30: Add rootfs version and dependency scope classification Stefano Tondo
2026-02-21 4:24 ` [PATCH 12/14] oeqa/selftest: Add test for download_location defensive handling Stefano Tondo
2026-02-21 4:24 ` [PATCH 13/14] spdx.py: Add test for version extraction patterns Stefano Tondo
2026-02-21 4:24 ` [PATCH 14/14] cve_check: Escape special characters in CPE 2.3 formatted strings Stefano Tondo