From: Stefano Tondo <stondo@gmail.com>
To: openembedded-core@lists.openembedded.org
Cc: stefano.tondo.ext@siemens.com, adrian.freihofer@siemens.com,
	Peter.Marko@siemens.com, jpewhacker@gmail.com,
	Ross.Burton@arm.com
Subject: [PATCH 06/14] sbom30: Fix object deduplication to preserve complete data
Date: Sat, 21 Feb 2026 05:24:10 +0100
Message-ID: <20260221042418.317535-7-stondo@gmail.com>
In-Reply-To: <20260221042418.317535-1-stondo@gmail.com>

From: Stefano Tondo <stefano.tondo.ext@siemens.com>

When consolidating SPDX documents via expand_collection(), objects
with the same SPDX ID can appear in multiple source documents with
different levels of completeness. The previous implementation used
simple set union (self.objects |= other.objects), which would keep
an arbitrary version when duplicates existed.

This caused data loss during consolidation, particularly affecting
externalIdentifier arrays where one version might have a basic PURL
while another has multiple PURLs with Git metadata qualifiers.

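Fix this by merging objects by SPDX ID instead. The merge:
- Detects objects with duplicate SPDX IDs
- Compares completeness based on externalIdentifier count
- Keeps the version with strictly more externalIdentifiers; on a
  tie the existing object is kept
- Only replaces Element objects; CreationInfo, Agent, Tool and
  other supporting types keep the existing instance so that
  object identity is preserved
- Preserves objects without IDs as-is

This ensures that a consolidated SBOM keeps the version of each
element with the most externalIdentifier entries found across the
source documents.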

The bug was discovered while testing multi-PURL support where
packages can have varying externalIdentifier counts (base PURL
vs base + Git commit + Git branch PURLs), but affects any
scenario with duplicate SPDX IDs during consolidation.

Signed-off-by: Stefano Tondo <stefano.tondo.ext@siemens.com>
---
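Note (not part of the commit log): the replacement rule described
above can be sketched outside of BitBake roughly as follows. This is
a minimal illustration under stated assumptions, not the code in
sbom30.py: FakeObject, FakeElement and merge_objects are hypothetical
stand-ins for the oe.spdx30 classes and the merge_doc() logic, and the
PURL strings are made-up examples.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so sets work
class FakeObject:
    """Hypothetical stand-in for any SPDX object (e.g. CreationInfo)."""
    _id: Optional[str]


@dataclass(eq=False)
class FakeElement(FakeObject):
    """Hypothetical stand-in for an SPDX Element."""
    externalIdentifier: list = field(default_factory=list)


def merge_objects(mine, theirs):
    """Merge the set `theirs` into the set `mine` in place and return it."""
    mine_by_id = {o._id: o for o in mine if o._id}
    for other in theirs:
        if not other._id:
            mine.add(other)  # objects without an ID are kept as-is
            continue
        current = mine_by_id.get(other._id)
        if current is None:
            mine.add(other)  # first time this ID is seen
            mine_by_id[other._id] = other
        elif isinstance(current, FakeElement) and (
            len(getattr(other, "externalIdentifier", []))
            > len(current.externalIdentifier)
        ):
            # Only Elements are replaced, and only by a strictly richer copy.
            mine.discard(current)
            mine.add(other)
            mine_by_id[other._id] = other
        # Otherwise keep the existing object to preserve its identity.
    return mine


# Two versions of the same element; the identifiers are illustrative only.
basic = FakeElement("pkg-zlib", ["pkg:generic/zlib@1.3"])
rich = FakeElement(
    "pkg-zlib",
    ["pkg:generic/zlib@1.3", "pkg:generic/zlib@1.3?vcs_url=example"],
)
merged = merge_objects({basic}, {rich})
```

In this sketch the result does not depend on merge order for the two
elements above: merging rich first and basic second also leaves only
rich in the set, because a tie or a poorer duplicate never replaces
the existing object.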
 meta/lib/oe/sbom30.py | 47 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
index 227ac51877..c77e18f4e8 100644
--- a/meta/lib/oe/sbom30.py
+++ b/meta/lib/oe/sbom30.py
@@ -822,7 +822,52 @@ class ObjectSet(oe.spdx30.SHACLObjectSet):
                 if not e.externalSpdxId in imports:
                     imports[e.externalSpdxId] = e
 
-            self.objects |= other.objects
+            # Merge objects intelligently: if same SPDX ID exists, keep the one with more complete data
+            #
+            # WHY DUPLICATES OCCUR: When consolidating SPDX documents (e.g., recipe -> package -> image),
+            # the same package can be referenced at different build stages, each with varying levels of
+            # detail. Early stages may have basic PURLs, while later stages add Git metadata qualifiers.
+            # This is architectural - multi-stage builds naturally create multiple representations of
+            # the same entity.
+            #
+            # However, preserve object identity for types that get referenced (like CreationInfo)
+            # to avoid breaking serialization
+            other_by_id = {}
+            for obj in other.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    other_by_id[obj_id] = obj
+
+            self_by_id = {}
+            for obj in self.objects:
+                obj_id = getattr(obj, '_id', None)
+                if obj_id:
+                    self_by_id[obj_id] = obj
+
+            # Merge: for duplicate IDs, prefer the object with more externalIdentifier entries
+            # but only for Element types (not CreationInfo, Agent, Tool, etc.)
+            for obj_id, other_obj in other_by_id.items():
+                if obj_id in self_by_id:
+                    self_obj = self_by_id[obj_id]
+                    # Only replace Elements with more complete data
+                    # Do NOT replace CreationInfo or other supporting types to preserve object identity
+                    if isinstance(self_obj, oe.spdx30.Element):
+                        # Keep whichever has more externalIdentifier entries (a missing attribute counts as empty)
+                        self_ext_ids = getattr(self_obj, 'externalIdentifier', [])
+                        other_ext_ids = getattr(other_obj, 'externalIdentifier', [])
+                        if len(other_ext_ids) > len(self_ext_ids):
+                            # Replace self object with other (more complete) object
+                            self.objects.discard(self_obj)
+                            self.objects.add(other_obj)
+                    # For non-Element types (CreationInfo, Agent, Tool), keep existing to preserve identity
+                else:
+                    # New object, just add it
+                    self.objects.add(other_obj)
+
+            # Add any objects without IDs
+            for obj in other.objects:
+                if not getattr(obj, '_id', None):
+                    self.objects.add(obj)
 
         for o in add_objectsets:
             merge_doc(o)
-- 
2.53.0


