linux-debuggers.vger.kernel.org archive mirror
* [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x
@ 2024-07-10 20:47 Omar Sandoval
  2024-07-10 20:47 ` [PATCH 1/3] debuginfod: factor out common code for responding from an archive Omar Sandoval
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Omar Sandoval @ 2024-07-10 20:47 UTC (permalink / raw)
  To: elfutils-devel; +Cc: linux-debuggers

From: Omar Sandoval <osandov@fb.com>

drgn [1] currently uses debuginfod with great success for debugging
userspace processes.  However, for debugging the Linux kernel (drgn's
main use case), we have had some performance issues with debuginfod, so
we intentionally avoid using it.  Specifically, it sometimes takes over
a minute for debuginfod to respond to queries for vmlinux and kernel
modules (not including the actual download time).

The reason for the slowness is that Linux kernel debuginfo packages are
very large and contain lots of files.  To respond to a query for a Linux
kernel debuginfo file, debuginfod has to decompress and iterate through
the whole package until it finds that file.  If the file is towards the
end of the package, this can take a very long time.  This was previously
reported for vdso files [2][3], which debuginfod was able to mitigate
with improved caching and prefetching.  However, kernel modules are far
greater in number, vary drastically by hardware and workload, and can be
spread all over the package, so in practice I've still been seeing long
delays.  This was also discussed on the drgn issue tracker [4].
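
For a rough sense of scale (the decompression rate here is an
assumption for illustration, not a measurement): a kernel debuginfo
package is several gigabytes uncompressed, and single-threaded xz
decompression typically runs on the order of 100 MB/s, so sequentially
scanning most of such a package accounts for the ~50 second worst case
mentioned below.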

The fundamental limitation is that Linux packages, which are essentially
compressed archives with extra metadata headers, don't support random
access to specific files.  However, xz streams compressed in
multi-threaded mode are split into independently compressed blocks, so
they do support random access.  And, luckily, the kernel debuginfo
packages on Fedora, Debian, and Ubuntu all happen to use multi-threaded
xz compression!
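
For illustration, here is a minimal sketch (not from the patches; the
thread count, block size, and integrity check are arbitrary) of how
liblzma's multi-threaded encoder produces that seekable structure:

  #include <lzma.h>

  // Multi-threaded xz compression splits the input into independently
  // compressed Blocks, all of which are listed in the Index at the end
  // of the stream.  Single-threaded xz normally emits one big Block,
  // which is why most xz files are not seekable.
  static bool init_mt_xz_encoder (lzma_stream *strm)
  {
    lzma_mt mt = {};
    mt.threads = 4;                  // number of worker threads
    mt.block_size = 0;               // 0: liblzma picks a default size
    mt.preset = LZMA_PRESET_DEFAULT;
    mt.check = LZMA_CHECK_CRC64;
    return lzma_stream_encoder_mt (strm, &mt) == LZMA_OK;
  }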

debuginfod can take advantage of this: when it scans a package, if it is
a seekable xz archive, it can save the uncompressed offset and size of
each file.  Then, when it needs a file, it can seek to that offset and
extract it from there.  This requires some understanding of the xz
format and low-level liblzma code, but the speedup is massive: where the
worst case was previously about 50 seconds just to find a file in a
kernel debuginfo package, with this change the worst case is 0.25
seconds, a ~200x improvement! This works for both .rpm and .deb files.

Patch 1 is a preparatory refactor.  Patch 2 implements saving the
uncompressed offsets and sizes in the database.  Patch 3 implements the
seekable xz extraction.

I tested this by requesting and verifying the digest of every file from
a few kernel debuginfo rpms and debs [5].

P.S. The biggest downside of this change is that it depends on a very
specific compression format.  I think this is something we should
formalize with Linux distributions: large debuginfo packages should use
a seekable format.  Currently, xz in multi-threaded mode is the only
option, but Zstandard also has an experimental seekable format that is
worth looking into [6].

1: https://github.com/osandov/drgn
2: https://sourceware.org/bugzilla/show_bug.cgi?id=29478
3: https://bugzilla.redhat.com/show_bug.cgi?id=1970578
4: https://github.com/osandov/drgn/pull/380
5: https://gist.github.com/osandov/89d521fdc6c9a07aa8bb0ebf91974346
6: https://github.com/facebook/zstd/tree/dev/contrib/seekable_format

Omar Sandoval (3):
  debuginfod: factor out common code for responding from an archive
  debuginfod: add archive entry size, mtime, and uncompressed offset to
    database
  debuginfod: optimize extraction from seekable xz archives

 configure.ac              |   5 +
 debuginfod/Makefile.am    |   2 +-
 debuginfod/debuginfod.cxx | 870 +++++++++++++++++++++++++++++++-------
 3 files changed, 722 insertions(+), 155 deletions(-)

-- 
2.45.2



* [PATCH 1/3] debuginfod: factor out common code for responding from an archive
  2024-07-10 20:47 [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Omar Sandoval
@ 2024-07-10 20:47 ` Omar Sandoval
  2024-07-10 20:47 ` [PATCH 2/3] debuginfod: add archive entry size, mtime, and uncompressed offset to database Omar Sandoval
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Omar Sandoval @ 2024-07-10 20:47 UTC (permalink / raw)
  To: elfutils-devel; +Cc: linux-debuggers

From: Omar Sandoval <osandov@fb.com>

handle_buildid_r_match has two very similar branches where it optionally
extracts a section and then creates a microhttpd response.  In
preparation for adding a third such branch, factor the common code out
into a function.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 debuginfod/debuginfod.cxx | 213 +++++++++++++++++---------------------
 1 file changed, 96 insertions(+), 117 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 305edde8..2d709026 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -1965,6 +1965,81 @@ string canonicalized_archive_entry_pathname(struct archive_entry *e)
 }
 
 
+// NB: takes ownership of, and may reassign, fd.
+static struct MHD_Response*
+create_buildid_r_response (int64_t b_mtime0,
+                           const string& b_source0,
+                           const string& b_source1,
+                           const string& section,
+                           const string& ima_sig,
+                           const char* tmppath,
+                           int& fd,
+                           off_t size,
+                           time_t mtime,
+                           const string& metric,
+                           const struct timespec& extract_begin)
+{
+  if (tmppath != NULL)
+    {
+      struct timespec extract_end;
+      clock_gettime (CLOCK_MONOTONIC, &extract_end);
+      double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
+        + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
+      fdcache.intern(b_source0, b_source1, tmppath, size, true, extract_time);
+    }
+
+  if (!section.empty ())
+    {
+      int scn_fd = extract_section (fd, b_mtime0,
+                                    b_source0 + ":" + b_source1,
+                                    section, extract_begin);
+      close (fd);
+      if (scn_fd >= 0)
+        fd = scn_fd;
+      else
+        {
+          if (verbose)
+            obatched (clog) << "cannot find section " << section
+                            << " for archive " << b_source0
+                            << " file " << b_source1 << endl;
+          return 0;
+        }
+
+      struct stat fs;
+      if (fstat (fd, &fs) < 0)
+        {
+          close (fd);
+          throw libc_exception (errno,
+            string ("fstat ") + b_source0 + string (" ") + section);
+        }
+      size = fs.st_size;
+    }
+
+  struct MHD_Response* r = MHD_create_response_from_fd (size, fd);
+  if (r == 0)
+    {
+      if (verbose)
+        obatched(clog) << "cannot create fd-response for " << b_source0 << endl;
+      close(fd);
+    }
+  else
+    {
+      inc_metric ("http_responses_total","result",metric);
+      add_mhd_response_header (r, "Content-Type", "application/octet-stream");
+      add_mhd_response_header (r, "X-DEBUGINFOD-SIZE", to_string(size).c_str());
+      add_mhd_response_header (r, "X-DEBUGINFOD-ARCHIVE", b_source0.c_str());
+      add_mhd_response_header (r, "X-DEBUGINFOD-FILE", b_source1.c_str());
+      if(!ima_sig.empty()) add_mhd_response_header(r, "X-DEBUGINFOD-IMASIGNATURE", ima_sig.c_str());
+      add_mhd_last_modified (r, mtime);
+      if (verbose > 1)
+        obatched(clog) << "serving " << metric << " " << b_source0
+                       << " file " << b_source1
+                       << " section=" << section
+                       << " IMA signature=" << ima_sig << endl;
+      /* libmicrohttpd will close fd. */
+    }
+  return r;
+}
 
 static struct MHD_Response*
 handle_buildid_r_match (bool internal_req_p,
@@ -2142,57 +2217,15 @@ handle_buildid_r_match (bool internal_req_p,
           break; // branch out of if "loop", to try new libarchive fetch attempt
         }
 
-      if (!section.empty ())
-	{
-	  int scn_fd = extract_section (fd, fs.st_mtime,
-					b_source0 + ":" + b_source1,
-					section, extract_begin);
-	  close (fd);
-	  if (scn_fd >= 0)
-	    fd = scn_fd;
-	  else
-	    {
-	      if (verbose)
-	        obatched (clog) << "cannot find section " << section
-				<< " for archive " << b_source0
-				<< " file " << b_source1 << endl;
-	      return 0;
-	    }
-
-	  rc = fstat(fd, &fs);
-	  if (rc < 0)
-	    {
-	      close (fd);
-	      throw libc_exception (errno,
-		string ("fstat archive ") + b_source0 + string (" file ") + b_source1
-		+ string (" section ") + section);
-	    }
-	}
-
-      struct MHD_Response* r = MHD_create_response_from_fd (fs.st_size, fd);
+      struct MHD_Response* r = create_buildid_r_response (b_mtime, b_source0,
+                                                          b_source1, section,
+                                                          ima_sig, NULL, fd,
+                                                          fs.st_size,
+                                                          fs.st_mtime,
+                                                          "archive fdcache",
+                                                          extract_begin);
       if (r == 0)
-        {
-          if (verbose)
-            obatched(clog) << "cannot create fd-response for " << b_source0 << endl;
-          close(fd);
-          break; // branch out of if "loop", to try new libarchive fetch attempt
-        }
-
-      inc_metric ("http_responses_total","result","archive fdcache");
-
-      add_mhd_response_header (r, "Content-Type", "application/octet-stream");
-      add_mhd_response_header (r, "X-DEBUGINFOD-SIZE",
-			       to_string(fs.st_size).c_str());
-      add_mhd_response_header (r, "X-DEBUGINFOD-ARCHIVE", b_source0.c_str());
-      add_mhd_response_header (r, "X-DEBUGINFOD-FILE", b_source1.c_str());
-      if(!ima_sig.empty()) add_mhd_response_header(r, "X-DEBUGINFOD-IMASIGNATURE", ima_sig.c_str());
-      add_mhd_last_modified (r, fs.st_mtime);
-      if (verbose > 1)
-	obatched(clog) << "serving fdcache archive " << b_source0
-		       << " file " << b_source1
-		       << " section=" << section
-		       << " IMA signature=" << ima_sig << endl;
-      /* libmicrohttpd will close it. */
+        break; // branch out of if "loop", to try new libarchive fetch attempt
       if (result_fd)
         *result_fd = fd;
       return r;
@@ -2307,13 +2340,12 @@ handle_buildid_r_match (bool internal_req_p,
       tvs[1].tv_nsec = archive_entry_mtime_nsec(e);
       (void) futimens (fd, tvs);  /* best effort */
 
-      struct timespec extract_end;
-      clock_gettime (CLOCK_MONOTONIC, &extract_end);
-      double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
-        + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
-      
       if (r != 0) // stage 3
         {
+          struct timespec extract_end;
+          clock_gettime (CLOCK_MONOTONIC, &extract_end);
+          double extract_time = (extract_end.tv_sec - extract_begin.tv_sec)
+            + (extract_end.tv_nsec - extract_begin.tv_nsec)/1.e9;
           // NB: now we know we have a complete reusable file; make fdcache
           // responsible for unlinking it later.
           fdcache.intern(b_source0, fn,
@@ -2324,69 +2356,16 @@ handle_buildid_r_match (bool internal_req_p,
           continue;
         }
 
-      // NB: now we know we have a complete reusable file; make fdcache
-      // responsible for unlinking it later.
-      fdcache.intern(b_source0, b_source1,
-                     tmppath, archive_entry_size(e),
-                     true, extract_time); // requested ones go to the front of the line
-
-      if (!section.empty ())
-	{
-	  int scn_fd = extract_section (fd, b_mtime,
-					b_source0 + ":" + b_source1,
-					section, extract_begin);
-	  close (fd);
-	  if (scn_fd >= 0)
-	    fd = scn_fd;
-	  else
-	    {
-	      if (verbose)
-	        obatched (clog) << "cannot find section " << section
-				<< " for archive " << b_source0
-				<< " file " << b_source1 << endl;
-	      return 0;
-	    }
-
-	  rc = fstat(fd, &fs);
-	  if (rc < 0)
-	    {
-	      close (fd);
-	      throw libc_exception (errno,
-		string ("fstat ") + b_source0 + string (" ") + section);
-	    }
-	  r = MHD_create_response_from_fd (fs.st_size, fd);
-	}
-      else
-	r = MHD_create_response_from_fd (archive_entry_size(e), fd);
-
-      inc_metric ("http_responses_total","result",archive_extension + " archive");
+      r = create_buildid_r_response (b_mtime, b_source0, b_source1, section,
+                                     ima_sig, tmppath, fd,
+                                     archive_entry_size(e),
+                                     archive_entry_mtime(e),
+                                     archive_extension + " archive",
+                                     extract_begin);
       if (r == 0)
-        {
-          if (verbose)
-            obatched(clog) << "cannot create fd-response for " << b_source0 << endl;
-          close(fd);
-          break; // assume no chance of better luck around another iteration; no other copies of same file
-        }
-      else
-        {
-          add_mhd_response_header (r, "Content-Type",
-                                   "application/octet-stream");
-          add_mhd_response_header (r, "X-DEBUGINFOD-SIZE",
-                                   to_string(archive_entry_size(e)).c_str());
-          add_mhd_response_header (r, "X-DEBUGINFOD-ARCHIVE", b_source0.c_str());
-          add_mhd_response_header (r, "X-DEBUGINFOD-FILE", b_source1.c_str());
-          if(!ima_sig.empty()) add_mhd_response_header(r, "X-DEBUGINFOD-IMASIGNATURE", ima_sig.c_str());
-          add_mhd_last_modified (r, archive_entry_mtime(e));
-          if (verbose > 1)
-	    obatched(clog) << "serving archive " << b_source0
-			   << " file " << b_source1
-			   << " section=" << section
-			   << " IMA signature=" << ima_sig << endl;
-          /* libmicrohttpd will close it. */
-          if (result_fd)
-            *result_fd = fd;
-          continue;
-        }
+        break; // assume no chance of better luck around another iteration; no other copies of same file
+      if (result_fd)
+        *result_fd = fd;
     }
 
   // XXX: rpm/file not found: delete this R entry?
-- 
2.45.2



* [PATCH 2/3] debuginfod: add archive entry size, mtime, and uncompressed offset to database
  2024-07-10 20:47 [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Omar Sandoval
  2024-07-10 20:47 ` [PATCH 1/3] debuginfod: factor out common code for responding from an archive Omar Sandoval
@ 2024-07-10 20:47 ` Omar Sandoval
  2024-07-10 20:47 ` [PATCH 3/3] debuginfod: optimize extraction from seekable xz archives Omar Sandoval
  2024-07-11 20:16 ` [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Frank Ch. Eigler
  3 siblings, 0 replies; 6+ messages in thread
From: Omar Sandoval @ 2024-07-10 20:47 UTC (permalink / raw)
  To: elfutils-devel; +Cc: linux-debuggers

From: Omar Sandoval <osandov@fb.com>

In order to extract a file from a seekable archive, we need to know
where the file data starts in the uncompressed archive and how large it
is.  Additionally, in order to populate the response headers, we need
the file modification time (since we won't be able to get it from the
archive metadata).  Add the size, modification time, and uncompressed
offset to the _r_de table and the _query_d and _query_e views.  Note
that _r_de already has a column for the mtime of the archive itself, so
that column is renamed to mtime0 and the new per-entry column is named
mtime1.

We need a little bit of liblzma magic to detect whether a file is
seekable.  If so, we populate the uncompressed_offset column, and
otherwise we set it to null.  size and mtime1 are populated
unconditionally for simplicity.
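
For reference, the variable-length integers in the xz Index (which is
what lzma_vli_decode parses for us) use a simple base-128 encoding.  A
hypothetical standalone decoder, only to illustrate the format the
detection code walks through:

  #include <stdint.h>
  #include <stddef.h>

  // Decode an xz variable-length integer: 7 bits per byte,
  // least-significant first, high bit set on every byte except the
  // last, at most 9 bytes.  (Illustration only; the patch uses
  // liblzma's lzma_vli_decode.)
  static bool decode_xz_vli (const uint8_t *buf, size_t len, uint64_t *out)
  {
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 9; i++)
      {
        value |= (uint64_t) (buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0)
          {
            *out = value;
            return true;
          }
      }
    return false;  // truncated or overlong encoding
  }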

Before this change, the database for a single kernel debuginfo RPM
(kernel-debuginfo-6.9.6-200.fc40.x86_64.rpm) was about 15MB.  This
change increases that by about 70kB, only a 0.5% increase.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 configure.ac              |   5 +
 debuginfod/Makefile.am    |   2 +-
 debuginfod/debuginfod.cxx | 234 +++++++++++++++++++++++++++++++++-----
 3 files changed, 213 insertions(+), 28 deletions(-)

diff --git a/configure.ac b/configure.ac
index 24e68d94..9c5f7e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -441,8 +441,13 @@ eu_ZIPLIB(bzlib,BZLIB,bz2,BZ2_bzdopen,bzip2)
 # We need this since bzip2 doesn't have a pkgconfig file.
 BZ2_LIB="$LIBS"
 AC_SUBST([BZ2_LIB])
+save_LIBS="$LIBS"
+LIBS=
 eu_ZIPLIB(lzma,LZMA,lzma,lzma_auto_decoder,[LZMA (xz)])
+lzma_LIBS="$LIBS"
+LIBS="$lzma_LIBS $save_LIBS"
 AS_IF([test "x$with_lzma" = xyes], [LIBLZMA="liblzma"], [LIBLZMA=""])
+AC_SUBST([lzma_LIBS])
 AC_SUBST([LIBLZMA])
 eu_ZIPLIB(zstd,ZSTD,zstd,ZSTD_decompress,[ZSTD (zst)])
 AS_IF([test "x$with_zstd" = xyes], [LIBZSTD="libzstd"], [LIBLZSTD=""])
diff --git a/debuginfod/Makefile.am b/debuginfod/Makefile.am
index b74e3673..e199dc0c 100644
--- a/debuginfod/Makefile.am
+++ b/debuginfod/Makefile.am
@@ -70,7 +70,7 @@ bin_PROGRAMS += debuginfod-find
 endif
 
 debuginfod_SOURCES = debuginfod.cxx
-debuginfod_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(libmicrohttpd_LIBS) $(sqlite3_LIBS) $(libarchive_LIBS) $(rpm_LIBS) $(jsonc_LIBS) $(libcurl_LIBS) -lpthread -ldl
+debuginfod_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(libmicrohttpd_LIBS) $(sqlite3_LIBS) $(libarchive_LIBS) $(rpm_LIBS) $(jsonc_LIBS) $(libcurl_LIBS) $(lzma_LIBS) -lpthread -ldl
 
 debuginfod_find_SOURCES = debuginfod-find.c
 debuginfod_find_LDADD = $(libdw) $(libelf) $(libeu) $(libdebuginfod) $(argp_LDADD) $(fts_LIBS) $(jsonc_LIBS)
diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 2d709026..95a7d941 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -63,6 +63,10 @@ extern "C" {
 #undef __attribute__ /* glibc bug - rhbz 1763325 */
 #endif
 
+#ifdef USE_LZMA
+#include <lzma.h>
+#endif
+
 #include <unistd.h>
 #include <stdlib.h>
 #include <locale.h>
@@ -162,7 +166,7 @@ string_endswith(const string& haystack, const string& needle)
 
 
 // Roll this identifier for every sqlite schema incompatibility.
-#define BUILDIDS "buildids10"
+#define BUILDIDS "buildids11"
 
 #if SQLITE_VERSION_NUMBER >= 3008000
 #define WITHOUT_ROWID "without rowid"
@@ -239,15 +243,18 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
   "        debuginfo_p integer not null,\n"
   "        executable_p integer not null,\n"
   "        file integer not null,\n"
-  "        mtime integer not null,\n"
+  "        mtime0 integer not null,\n"
   "        content integer not null,\n"
+  "        size integer not null,\n"
+  "        mtime1 integer not null,\n"
+  "        uncompressed_offset integer,\n"
   "        foreign key (file) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
   "        foreign key (content) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
   "        foreign key (buildid) references " BUILDIDS "_buildids(id) on update cascade on delete cascade,\n"
-  "        primary key (buildid, debuginfo_p, executable_p, file, content, mtime)\n"
+  "        primary key (buildid, debuginfo_p, executable_p, file, content, mtime0)\n"
   "        ) " WITHOUT_ROWID ";\n"
   // Index for faster delete by archive file identifier
-  "create index if not exists " BUILDIDS "_r_de_idx on " BUILDIDS "_r_de (file, mtime);\n"
+  "create index if not exists " BUILDIDS "_r_de_idx on " BUILDIDS "_r_de (file, mtime0);\n"
   // Index for metadata searches
   "create index if not exists " BUILDIDS "_r_de_idx2 on " BUILDIDS "_r_de (content);\n"  
   "create table if not exists " BUILDIDS "_r_sref (\n" // outgoing dwarf sourcefile references from rpm
@@ -268,22 +275,22 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
   // create views to glue together some of the above tables, for webapi D queries
   "create view if not exists " BUILDIDS "_query_d as \n"
   "select\n"
-  "        b.hex as buildid, n.mtime, 'F' as sourcetype, f0.name as source0, n.mtime as mtime, null as source1\n"
+  "        b.hex as buildid, 'F' as sourcetype, f0.name as source0, n.mtime as mtime0, null as source1, null as size, null as mtime1, null as uncompressed_offset\n"
   "        from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_f_de n\n"
   "        where b.id = n.buildid and f0.id = n.file and n.debuginfo_p = 1\n"
   "union all select\n"
-  "        b.hex as buildid, n.mtime, 'R' as sourcetype, f0.name as source0, n.mtime as mtime, f1.name as source1\n"
+  "        b.hex as buildid, 'R' as sourcetype, f0.name as source0, n.mtime0 as mtime0, f1.name as source1, n.size as size, n.mtime1 as mtime1, n.uncompressed_offset as uncompressed_offset\n"
   "        from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_files_v f1, " BUILDIDS "_r_de n\n"
   "        where b.id = n.buildid and f0.id = n.file and f1.id = n.content and n.debuginfo_p = 1\n"
   ";"
   // ... and for E queries
   "create view if not exists " BUILDIDS "_query_e as \n"
   "select\n"
-  "        b.hex as buildid, n.mtime, 'F' as sourcetype, f0.name as source0, n.mtime as mtime, null as source1\n"
+  "        b.hex as buildid, 'F' as sourcetype, f0.name as source0, n.mtime as mtime0, null as source1, null as size, null as mtime1, null as uncompressed_offset\n"
   "        from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_f_de n\n"
   "        where b.id = n.buildid and f0.id = n.file and n.executable_p = 1\n"
   "union all select\n"
-  "        b.hex as buildid, n.mtime, 'R' as sourcetype, f0.name as source0, n.mtime as mtime, f1.name as source1\n"
+  "        b.hex as buildid, 'R' as sourcetype, f0.name as source0, n.mtime0 as mtime0, f1.name as source1, n.size as size, n.mtime1 as mtime1, n.uncompressed_offset as uncompressed_offset\n"
   "        from " BUILDIDS "_buildids b, " BUILDIDS "_files_v f0, " BUILDIDS "_files_v f1, " BUILDIDS "_r_de n\n"
   "        where b.id = n.buildid and f0.id = n.file and f1.id = n.content and n.executable_p = 1\n"
   ";"
@@ -324,8 +331,23 @@ static const char DEBUGINFOD_SQLITE_DDL[] =
 // data over instead of just dropping it.  But that could incur
 // doubled storage costs.
 //
-// buildids10: split the _files table into _parts
+// buildids11: add size, mtime1, and uncompressed_offset to _r_de, _query_d, and _query_e
   "" // <<< we are here
+// buildids10: split the _files table into _parts
+  "DROP VIEW IF EXISTS buildids10_stats;\n"
+  "DROP VIEW IF EXISTS buildids10_query_s;\n"
+  "DROP VIEW IF EXISTS buildids10_query_e;\n"
+  "DROP VIEW IF EXISTS buildids10_query_d;\n"
+  "DROP TABLE IF EXISTS buildids10_r_sdef;\n"
+  "DROP TABLE IF EXISTS buildids10_r_sref;\n"
+  "DROP TABLE IF EXISTS buildids10_r_de;\n"
+  "DROP TABLE IF EXISTS buildids10_f_s;\n"
+  "DROP TABLE IF EXISTS buildids10_f_de;\n"
+  "DROP TABLE IF EXISTS buildids10_file_mtime_scanned;\n"
+  "DROP TABLE IF EXISTS buildids10_buildids;\n"
+  "DROP VIEW IF EXISTS buildids10_files_v;\n"
+  "DROP TABLE IF EXISTS buildids10_files;\n"
+  "DROP TABLE IF EXISTS buildids10_fileparts;\n"
 // buildids9: widen the mtime_scanned table
   "DROP VIEW IF EXISTS buildids9_stats;\n"
   "DROP INDEX IF EXISTS buildids9_r_de_idx;\n"
@@ -1947,6 +1969,140 @@ handle_buildid_f_match (bool internal_req_t,
   return r;
 }
 
+
+#ifdef USE_LZMA
+// Neither RPM nor deb files support seeking to a specific file in the package.
+// Instead, to extract a specific file, we normally need to read the archive
+// sequentially until we find the file.  This is very slow for files at the end
+// of a large package with lots of files, like kernel debuginfo.
+//
+// However, if the compression format used in the archive supports seeking, we
+// can accelerate this.  As of July 2024, xz is the only widely-used format that
+// supports seeking, and usually only in multi-threaded mode.  Luckily, the
+// kernel-debuginfo package in Fedora and its downstreams, and the
+// linux-image-*-dbg package in Debian and its downstreams, all happen to use
+// this.
+//
+// The xz format [1] ends with an index of independently compressed blocks in
+// the stream.  In RPM and deb files, the xz stream is the last thing in the
+// file, so we assume that the xz Stream Footer is at the end of the package
+// file and do everything relative to that.  For each file in the archive, we
+// remember the size and offset of the file data in the uncompressed xz stream,
+// then we use the index to seek to that offset when we need that file.
+//
+// 1: https://xz.tukaani.org/format/xz-file-format.txt
+
+// Return whether an archive supports seeking.
+static bool
+is_seekable_archive (const string& rps, struct archive* a)
+{
+  // Only xz supports seeking.
+  if (archive_filter_code (a, 0) != ARCHIVE_FILTER_XZ)
+    return false;
+
+  int fd = open (rps.c_str(), O_RDONLY);
+  if (fd < 0)
+    return false;
+  defer_dtor<int,int> fd_closer (fd, close);
+
+  // Seek to the xz Stream Footer.  We assume that it's the last thing in the
+  // file, which is true for RPM and deb files.
+  off_t footer_pos = -LZMA_STREAM_HEADER_SIZE;
+  if (lseek (fd, footer_pos, SEEK_END) == -1)
+    return false;
+
+  // Decode the Stream Footer.
+  uint8_t footer[LZMA_STREAM_HEADER_SIZE];
+  size_t footer_read = 0;
+  while (footer_read < sizeof (footer))
+    {
+      ssize_t bytes_read = read (fd, footer + footer_read,
+                                 sizeof (footer) - footer_read);
+      if (bytes_read < 0)
+        {
+          if (errno == EINTR)
+            continue;
+          return false;
+        }
+      if (bytes_read == 0)
+        return false;
+      footer_read += bytes_read;
+    }
+
+  lzma_stream_flags stream_flags;
+  lzma_ret ret = lzma_stream_footer_decode (&stream_flags, footer);
+  if (ret != LZMA_OK)
+    return false;
+
+  // Seek to the xz Index.
+  if (lseek (fd, footer_pos - stream_flags.backward_size, SEEK_END) == -1)
+    return false;
+
+  // Decode the Number of Records in the Index.  liblzma doesn't have an API for
+  // this if you don't want to decode the whole Index, so we have to do it
+  // ourselves.
+  //
+  // We need 1 byte for the Index Indicator plus 1-9 bytes for the
+  // variable-length integer Number of Records.
+  uint8_t index[10];
+  size_t index_read = 0;
+  while (index_read == 0) {
+      ssize_t bytes_read = read (fd, index, sizeof (index));
+      if (bytes_read < 0)
+        {
+          if (errno == EINTR)
+            continue;
+          return false;
+        }
+      if (bytes_read == 0)
+        return false;
+      index_read += bytes_read;
+  }
+  // The Index Indicator must be 0.
+  if (index[0] != 0)
+    return false;
+
+  lzma_vli num_records;
+  size_t pos = 0;
+  size_t in_pos = 1;
+  while (true)
+    {
+      if (in_pos >= index_read)
+        {
+          ssize_t bytes_read = read (fd, index, sizeof (index));
+          if (bytes_read < 0)
+          {
+            if (errno == EINTR)
+              continue;
+            return false;
+          }
+          if (bytes_read == 0)
+            return false;
+          index_read = bytes_read;
+          in_pos = 0;
+        }
+      ret = lzma_vli_decode (&num_records, &pos, index, &in_pos, index_read);
+      if (ret == LZMA_STREAM_END)
+        break;
+      else if (ret != LZMA_OK)
+        return false;
+    }
+
+  if (verbose > 3)
+    obatched(clog) << rps << " has " << num_records << " xz Blocks" << endl;
+
+  // The file is only seekable if it has more than one Block.
+  return num_records > 1;
+}
+#else
+static bool
+is_seekable_archive (const string& rps, struct archive* a)
+{
+  return false;
+}
+#endif
+
+
 // For security/portability reasons, many distro-package archives have
 // a "./" in front of path names; others have nothing, others have
 // "/".  Canonicalize them all to a single leading "/", with the
@@ -2557,16 +2713,16 @@ handle_buildid (MHD_Connection* conn,
   if (atype_code == "D")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-d",
-                          "select mtime, sourcetype, source0, source1 from " BUILDIDS "_query_d where buildid = ? "
-                          "order by mtime desc");
+                          "select mtime0, sourcetype, source0, source1 from " BUILDIDS "_query_d where buildid = ? "
+                          "order by mtime0 desc");
       pp->reset();
       pp->bind(1, buildid);
     }
   else if (atype_code == "E")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-e",
-                          "select mtime, sourcetype, source0, source1 from " BUILDIDS "_query_e where buildid = ? "
-                          "order by mtime desc");
+                          "select mtime0, sourcetype, source0, source1 from " BUILDIDS "_query_e where buildid = ? "
+                          "order by mtime0 desc");
       pp->reset();
       pp->bind(1, buildid);
     }
@@ -2589,9 +2745,9 @@ handle_buildid (MHD_Connection* conn,
   else if (atype_code == "I")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-i",
-	"select mtime, sourcetype, source0, source1, 1 as debug_p from " BUILDIDS "_query_d where buildid = ? "
+	"select mtime0, sourcetype, source0, source1, 1 as debug_p from " BUILDIDS "_query_d where buildid = ? "
 	"union all "
-	"select mtime, sourcetype, source0, source1, 0 as debug_p from " BUILDIDS "_query_e where buildid = ? "
+	"select mtime0, sourcetype, source0, source1, 0 as debug_p from " BUILDIDS "_query_e where buildid = ? "
 	"order by debug_p desc, mtime desc");
       pp->reset();
       pp->bind(1, buildid);
@@ -3821,7 +3977,7 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
                   sqlite_ps& ps_upsert_buildids, sqlite_ps& ps_upsert_fileparts, sqlite_ps& ps_upsert_file,
                   sqlite_ps& ps_lookup_file,
                   sqlite_ps& ps_upsert_de, sqlite_ps& ps_upsert_sref, sqlite_ps& ps_upsert_sdef,
-                  time_t mtime,
+                  time_t mtime0,
                   unsigned& fts_executable, unsigned& fts_debuginfo, unsigned& fts_sref, unsigned& fts_sdef,
                   bool& fts_sref_complete_p)
 {
@@ -3875,6 +4031,10 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
   if (verbose > 3)
     obatched(clog) << "libarchive scanning " << rps << " id " << archiveid << endl;
 
+  bool seekable = is_seekable_archive (rps, a);
+  if (verbose > 2 && seekable)
+    obatched(clog) << rps << " is seekable" << endl;
+
   bool any_exceptions = false;
   while(1) // parse archive entries
     {
@@ -3896,6 +4056,12 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
           if (verbose > 3)
             obatched(clog) << "libarchive checking " << fn << endl;
 
+          int64_t size = archive_entry_size (e);
+          time_t mtime1 = archive_entry_mtime (e);
+          int64_t uncompressed_offset;
+          if (seekable)
+            uncompressed_offset = archive_filter_bytes (a, 0);
+
           // extract this file to a temporary file
           char* tmppath = NULL;
           rc = asprintf (&tmppath, "%s/debuginfod-classify.XXXXXX", tmpdir.c_str());
@@ -3978,15 +4144,22 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
 
           if (executable_p || debuginfo_p)
             {
-              ps_upsert_de
+              auto& ps =
+                ps_upsert_de
                 .reset()
                 .bind(1, buildid)
                 .bind(2, debuginfo_p ? 1 : 0)
                 .bind(3, executable_p ? 1 : 0)
                 .bind(4, archiveid)
-                .bind(5, mtime)
+                .bind(5, mtime0)
                 .bind(6, fileid)
-                .step_ok_done();
+                .bind(7, size)
+                .bind(8, mtime1);
+              if (seekable)
+                ps.bind(9, uncompressed_offset);
+              else
+                ps.bind(9);
+              ps.step_ok_done();
             }
           else // potential source - sdef record
             {
@@ -3994,18 +4167,25 @@ archive_classify (const string& rps, string& archive_extension, int64_t archivei
               ps_upsert_sdef
                 .reset()
                 .bind(1, archiveid)
-                .bind(2, mtime)
+                .bind(2, mtime0)
                 .bind(3, fileid)
                 .step_ok_done();
             }
 
           if ((verbose > 2) && (executable_p || debuginfo_p))
-            obatched(clog) << "recorded buildid=" << buildid << " rpm=" << rps << " file=" << fn
-                           << " mtime=" << mtime << " atype="
+            {
+              obatched ob(clog);
+              auto& o = ob << "recorded buildid=" << buildid << " rpm=" << rps << " file=" << fn
+                           << " mtime0=" << mtime0 << " size=" << size
+                           << " mtime1=" << mtime1
+                           << " atype="
                            << (executable_p ? "E" : "")
                            << (debuginfo_p ? "D" : "")
-                           << " sourcefiles=" << sourcefiles.size() << endl;
-
+                           << " sourcefiles=" << sourcefiles.size();
+              if (seekable)
+                o << " uncompressed_offset=" << uncompressed_offset;
+              o << endl;
+            }
         }
       catch (const reportable_exception& e)
         {
@@ -4169,8 +4349,8 @@ scan ()
                               " from " BUILDIDS "_files f, " BUILDIDS "_fileparts p1, " BUILDIDS "_fileparts p2 \n"
                               " where f.dirname = p1.id and f.basename = p2.id and p1.name = ? and p2.name = ?;\n");
   sqlite_ps ps_r_upsert_de (db, "rpm-de-insert",
-                          "insert or ignore into " BUILDIDS "_r_de (buildid, debuginfo_p, executable_p, file, mtime, content) values ("
-                          "(select id from " BUILDIDS "_buildids where hex = ?), ?, ?, ?, ?, ?);");
+                          "insert or ignore into " BUILDIDS "_r_de (buildid, debuginfo_p, executable_p, file, mtime0, content, size, mtime1, uncompressed_offset) values ("
+                          "(select id from " BUILDIDS "_buildids where hex = ?), ?, ?, ?, ?, ?, ?, ?, ?);");
   sqlite_ps ps_r_upsert_sref (db, "rpm-sref-insert",
                             "insert or ignore into " BUILDIDS "_r_sref (buildid, artifactsrc) values ("
                             "(select id from " BUILDIDS "_buildids where hex = ?), "
@@ -4559,7 +4739,7 @@ void groom()
   // as long as we make progress.
 
   sqlite_ps files_del_f_de (db, "nuke f_de", "delete from " BUILDIDS "_f_de where file = ? and mtime = ?");
-  sqlite_ps files_del_r_de (db, "nuke r_de", "delete from " BUILDIDS "_r_de where file = ? and mtime = ?");
+  sqlite_ps files_del_r_de (db, "nuke r_de", "delete from " BUILDIDS "_r_de where file = ? and mtime0 = ?");
   sqlite_ps files_del_scan (db, "nuke f_m_s", "delete from " BUILDIDS "_file_mtime_scanned "
                             "where file = ? and mtime = ?");
 
-- 
2.45.2



* [PATCH 3/3] debuginfod: optimize extraction from seekable xz archives
  2024-07-10 20:47 [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Omar Sandoval
  2024-07-10 20:47 ` [PATCH 1/3] debuginfod: factor out common code for responding from an archive Omar Sandoval
  2024-07-10 20:47 ` [PATCH 2/3] debuginfod: add archive entry size, mtime, and uncompressed offset to database Omar Sandoval
@ 2024-07-10 20:47 ` Omar Sandoval
  2024-07-11 20:16 ` [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Frank Ch. Eigler
  3 siblings, 0 replies; 6+ messages in thread
From: Omar Sandoval @ 2024-07-10 20:47 UTC (permalink / raw)
  To: elfutils-devel; +Cc: linux-debuggers

From: Omar Sandoval <osandov@fb.com>

The kernel debuginfo packages on Fedora, Debian, and Ubuntu, and many of
their downstreams, are all compressed with xz in multi-threaded mode,
which allows random access.  We can use this to bypass the full archive
extraction and dramatically speed up kernel debuginfo requests (from ~50
seconds in the worst case to < 0.25 seconds).

This works because multi-threaded xz compression splits up the stream
into many independently compressed blocks.  The stream ends with an
index of blocks.  So, to seek to an offset, we find the block containing
that offset in the index and then decompress and throw away data until
we reach the offset within the block.  We can then decompress the
desired amount of data, possibly from subsequent blocks.  There's no
high-level API in liblzma to do this, but we can do it by stitching
together a few low-level APIs.
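
For reference, the stitching boils down to the following condensed
sketch (I/O loops, error checks, and cleanup of the decoded filter
options are omitted; see the patch below for the real version).
footer_buf, index_buf, and header_buf are assumed to already hold the
Stream Footer, the raw Index, and the start of the target Block:

  #include <lzma.h>

  static void seek_sketch (const uint8_t *footer_buf,
                           const uint8_t *index_buf, size_t index_len,
                           const uint8_t *header_buf, uint64_t offset)
  {
    // 1. Decode the Stream Footer to find the Index.
    lzma_stream_flags footer;
    lzma_stream_footer_decode (&footer, footer_buf);

    // 2. Decode the Index, which maps uncompressed offsets to Blocks.
    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_index *index = NULL;
    lzma_index_decoder (&strm, &index, UINT64_MAX);
    strm.next_in = index_buf;
    strm.avail_in = index_len;
    while (lzma_code (&strm, LZMA_RUN) == LZMA_OK)
      ;

    // 3. Find the Block containing the target uncompressed offset.
    lzma_index_iter iter;
    lzma_index_iter_init (&iter, index);
    lzma_index_iter_locate (&iter, offset);

    // 4. Decode that Block's header...
    lzma_filter filters[LZMA_FILTERS_MAX + 1] = {};
    lzma_block block = {};
    block.filters = filters;
    block.check = iter.stream.flags->check;
    block.header_size = lzma_block_header_size_decode (header_buf[0]);
    lzma_block_header_decode (&block, NULL, header_buf);

    // 5. ... and decompress it, discarding output until `offset` is
    // reached, then keeping the requested number of bytes, moving on
    // with lzma_index_iter_next () if the range spans multiple Blocks.
    lzma_block_decoder (&strm, &block);

    lzma_end (&strm);
    lzma_index_end (index, NULL);
  }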

Since we now save the uncompressed offset and size of each archive file,
pass that information down so we can do the optimized extraction when
applicable.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 debuginfod/debuginfod.cxx | 431 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 417 insertions(+), 14 deletions(-)

diff --git a/debuginfod/debuginfod.cxx b/debuginfod/debuginfod.cxx
index 95a7d941..c3822be3 100644
--- a/debuginfod/debuginfod.cxx
+++ b/debuginfod/debuginfod.cxx
@@ -1971,6 +1971,16 @@ handle_buildid_f_match (bool internal_req_t,
 
 
 #ifdef USE_LZMA
+struct lzma_exception: public reportable_exception
+{
+  lzma_exception(int rc, const string& msg):
+    // liblzma doesn't have a lzma_ret -> string conversion function, so just
+    // report the value.
+    reportable_exception(string ("lzma error: ") + msg + ": error " + to_string(rc)) {
+      inc_metric("error_count","lzma",to_string(rc));
+    }
+};
+
 // Neither RPM nor deb files support seeking to a specific file in the package.
 // Instead, to extract a specific file, we normally need to read the archive
 // sequentially until we find the file.  This is very slow for files at the end
@@ -2094,12 +2104,352 @@ is_seekable_archive (const string& rps, struct archive* a)
   // The file is only seekable if it has more than one Block.
   return num_records > 1;
 }
+
+// Read the Index at the end of an xz file.
+static lzma_index*
+read_xz_index (int fd)
+{
+  off_t footer_pos = -LZMA_STREAM_HEADER_SIZE;
+  if (lseek (fd, footer_pos, SEEK_END) == -1)
+    throw libc_exception (errno, "lseek");
+
+  uint8_t footer[LZMA_STREAM_HEADER_SIZE];
+  size_t footer_read = 0;
+  while (footer_read < sizeof (footer))
+    {
+      ssize_t bytes_read = read (fd, footer + footer_read,
+                                 sizeof (footer) - footer_read);
+      if (bytes_read < 0)
+        {
+          if (errno == EINTR)
+            continue;
+          throw libc_exception (errno, "read");
+        }
+      if (bytes_read == 0)
+        throw reportable_exception ("truncated file");
+      footer_read += bytes_read;
+    }
+
+  lzma_stream_flags stream_flags;
+  lzma_ret ret = lzma_stream_footer_decode (&stream_flags, footer);
+  if (ret != LZMA_OK)
+    throw lzma_exception (ret, "lzma_stream_footer_decode");
+
+  if (lseek (fd, footer_pos - stream_flags.backward_size, SEEK_END) == -1)
+    throw libc_exception (errno, "lseek");
+
+  lzma_stream strm = LZMA_STREAM_INIT;
+  lzma_index* index = NULL;
+  ret = lzma_index_decoder (&strm, &index, UINT64_MAX);
+  if (ret != LZMA_OK)
+    throw lzma_exception (ret, "lzma_index_decoder");
+  defer_dtor<lzma_stream*,void> strm_ender (&strm, lzma_end);
+
+  uint8_t in_buf[4096];
+  while (true)
+    {
+      if (strm.avail_in == 0)
+        {
+          ssize_t bytes_read = read (fd, in_buf, sizeof (in_buf));
+          if (bytes_read < 0)
+            {
+              if (errno == EINTR)
+                continue;
+              throw libc_exception (errno, "read");
+            }
+          if (bytes_read == 0)
+            throw reportable_exception ("truncated file");
+          strm.avail_in = bytes_read;
+          strm.next_in = in_buf;
+        }
+
+        ret = lzma_code (&strm, LZMA_RUN);
+        if (ret == LZMA_STREAM_END)
+          break;
+        else if (ret != LZMA_OK)
+          throw lzma_exception (ret, "lzma_code index");
+    }
+
+  ret = lzma_index_stream_flags (index, &stream_flags);
+  if (ret != LZMA_OK)
+    {
+      lzma_index_end (index, NULL);
+      throw lzma_exception (ret, "lzma_index_stream_flags");
+    }
+  return index;
+}
+
+static void
+my_lzma_index_end (lzma_index* index)
+{
+  lzma_index_end (index, NULL);
+}
+
+static void
+free_lzma_block_filter_options (lzma_block* block)
+{
+  for (int i = 0; i < LZMA_FILTERS_MAX; i++)
+    {
+      free (block->filters[i].options);
+      block->filters[i].options = NULL;
+    }
+}
+
+static void
+free_lzma_block_filters (lzma_block* block)
+{
+  if (block->filters != NULL)
+    {
+      free_lzma_block_filter_options (block);
+      free (block->filters);
+    }
+}
+
+static void
+extract_xz_blocks_into_fd (const string& srcpath,
+                           int src,
+                           int dst,
+                           lzma_index_iter* iter,
+                           uint64_t offset,
+                           uint64_t size)
+{
+  // Seek to the Block.  Seeking from the end using the compressed size from the
+  // footer means we don't need to know where the xz stream starts in the
+  // archive.
+  if (lseek (src,
+             (off_t) iter->block.compressed_stream_offset
+             - (off_t) iter->stream.compressed_size,
+             SEEK_END) == -1)
+    throw libc_exception (errno, "lseek");
+
+  offset -= iter->block.uncompressed_file_offset;
+
+  lzma_block block{};
+  block.filters = (lzma_filter*) calloc (LZMA_FILTERS_MAX + 1,
+                                         sizeof (lzma_filter));
+  if (block.filters == NULL)
+    throw libc_exception (ENOMEM, "cannot allocate lzma_block filters");
+  defer_dtor<lzma_block*,void> filters_freer (&block, free_lzma_block_filters);
+
+  uint8_t in_buf[4096];
+  uint8_t out_buf[4096];
+  size_t header_read = 0;
+  bool need_log_extracting = verbose > 3;
+  while (true)
+    {
+      // The first byte of the Block is the encoded Block Header Size.  Read the
+      // first byte and whatever extra fits in the buffer.
+      while (header_read == 0)
+        {
+          ssize_t bytes_read = read (src, in_buf, sizeof (in_buf));
+          if (bytes_read < 0)
+            {
+              if (errno == EINTR)
+                continue;
+              throw libc_exception (errno, "read");
+            }
+          if (bytes_read == 0)
+            throw reportable_exception ("truncated file");
+          header_read += bytes_read;
+        }
+
+      block.header_size = lzma_block_header_size_decode (in_buf[0]);
+
+      // If we didn't buffer the whole Block Header earlier, get the rest.
+      eu_static_assert (sizeof (in_buf)
+                        >= lzma_block_header_size_decode (UINT8_MAX));
+      while (header_read < block.header_size)
+        {
+          ssize_t bytes_read = read (src, in_buf + header_read,
+                                     sizeof (in_buf) - header_read);
+          if (bytes_read < 0)
+            {
+              if (errno == EINTR)
+                continue;
+              throw libc_exception (errno, "read");
+            }
+          if (bytes_read == 0)
+            throw reportable_exception ("truncated file");
+          header_read += bytes_read;
+        }
+
+      // Decode the Block Header.
+      block.check = iter->stream.flags->check;
+      lzma_ret ret = lzma_block_header_decode (&block, NULL, in_buf);
+      if (ret != LZMA_OK)
+        throw lzma_exception (ret, "lzma_block_header_decode");
+      ret = lzma_block_compressed_size (&block, iter->block.unpadded_size);
+      if (ret != LZMA_OK)
+        throw lzma_exception (ret, "lzma_block_compressed_size");
+
+      // Start decoding the Block data.
+      lzma_stream strm = LZMA_STREAM_INIT;
+      ret = lzma_block_decoder (&strm, &block);
+      if (ret != LZMA_OK)
+        throw lzma_exception (ret, "lzma_block_decoder");
+      defer_dtor<lzma_stream*,void> strm_ender (&strm, lzma_end);
+
+      // We might still have some input buffered from when we read the header.
+      strm.avail_in = header_read - block.header_size;
+      strm.next_in = in_buf + block.header_size;
+      strm.avail_out = sizeof (out_buf);
+      strm.next_out = out_buf;
+      while (true)
+        {
+          if (strm.avail_in == 0)
+            {
+              ssize_t bytes_read = read (src, in_buf, sizeof (in_buf));
+              if (bytes_read < 0)
+                {
+                  if (errno == EINTR)
+                    continue;
+                  throw libc_exception (errno, "read");
+                }
+              if (bytes_read == 0)
+                throw reportable_exception ("truncated file");
+              strm.avail_in = bytes_read;
+              strm.next_in = in_buf;
+            }
+
+          ret = lzma_code (&strm, LZMA_RUN);
+          if (ret != LZMA_OK && ret != LZMA_STREAM_END)
+            throw lzma_exception (ret, "lzma_code block");
+
+          // Throw away anything we decode until we reach the offset, then
+          // start writing to the destination.
+          if (strm.total_out > offset)
+            {
+              size_t bytes_to_write = strm.next_out - out_buf;
+              uint8_t* buf_to_write = out_buf;
+
+              // Ignore anything in the buffer before the offset.
+              if (bytes_to_write > strm.total_out - offset)
+                {
+                  buf_to_write += bytes_to_write - (strm.total_out - offset);
+                  bytes_to_write = strm.total_out - offset;
+                }
+
+              // Ignore anything after the size.
+              if (strm.total_out - offset >= size)
+                bytes_to_write -= strm.total_out - offset - size;
+
+              if (need_log_extracting)
+                {
+                  obatched(clog) << "extracting from xz archive " << srcpath
+                                 << " size=" << size << endl;
+                  need_log_extracting = false;
+                }
+
+              while (bytes_to_write > 0)
+                {
+                  ssize_t written = write (dst, buf_to_write, bytes_to_write);
+                  if (written < 0)
+                    {
+                      if (errno == EAGAIN)
+                        continue;
+                      throw libc_exception (errno, "write");
+                    }
+                  bytes_to_write -= written;
+                  buf_to_write += written;
+                }
+
+              // If we reached the size, we're done.
+              if (strm.total_out - offset >= size)
+                return;
+            }
+
+          strm.avail_out = sizeof (out_buf);
+          strm.next_out = out_buf;
+
+          if (ret == LZMA_STREAM_END)
+            break;
+        }
+
+      // This Block didn't have enough data.  Go to the next one.
+      if (lzma_index_iter_next (iter, LZMA_INDEX_ITER_BLOCK))
+        throw reportable_exception ("no more blocks");
+      if (strm.total_out > offset)
+        size -= strm.total_out - offset;
+      offset = 0;
+      // If we had any buffered input left, move it to the beginning of the
+      // buffer to decode the next Block Header.
+      if (strm.avail_in > 0)
+        {
+          memmove (in_buf, strm.next_in, strm.avail_in);
+          header_read = strm.avail_in;
+        }
+      else
+        header_read = 0;
+      free_lzma_block_filter_options (&block);
+    }
+}
+
+static int
+extract_from_seekable_archive (const string& srcpath,
+                               char* tmppath,
+                               uint64_t offset,
+                               uint64_t size)
+{
+  int src = open (srcpath.c_str(), O_RDONLY);
+  if (src < 0)
+    throw libc_exception (errno, string("open ") + srcpath);
+  defer_dtor<int,int> src_closer (src, close);
+
+  try
+    {
+      lzma_index* index = read_xz_index (src);
+      defer_dtor<lzma_index*,void> index_ender (index, my_lzma_index_end);
+
+      // Find the Block containing the offset.
+      lzma_index_iter iter;
+      lzma_index_iter_init (&iter, index);
+      if (lzma_index_iter_locate (&iter, offset))
+        throw reportable_exception ("offset not found");
+
+      if (verbose > 3)
+        obatched(clog) << "seeking in xz archive " << srcpath
+                       << " offset=" << offset << " block_offset="
+                       << iter.block.uncompressed_file_offset << endl;
+
+      int dst = mkstemp (tmppath);
+      if (dst < 0)
+        throw libc_exception (errno, "cannot create temporary file");
+
+      try
+        {
+          extract_xz_blocks_into_fd (srcpath, src, dst, &iter, offset, size);
+        }
+      catch (...)
+        {
+          unlink (tmppath);
+          close (dst);
+          throw;
+        }
+
+      return dst;
+    }
+  catch (const reportable_exception &e)
+    {
+      if (verbose)
+        obatched(clog) << "failed to extract from seekable archive " << srcpath
+                       << ": " << e.message << endl;
+      return -1;
+    }
+}
 #else
 static bool
 is_seekable_archive (const string& rps, struct archive* a)
 {
   return false;
 }
+static int
+extract_from_seekable_archive (const string& srcpath,
+                               char* tmppath,
+                               uint64_t offset,
+                               uint64_t size)
+{
+  return -1;
+}
 #endif
 
 
@@ -2199,9 +2549,12 @@ create_buildid_r_response (int64_t b_mtime0,
 
 static struct MHD_Response*
 handle_buildid_r_match (bool internal_req_p,
-                        int64_t b_mtime,
+                        int64_t b_mtime0,
                         const string& b_source0,
                         const string& b_source1,
+                        int64_t b_size,
+                        int64_t b_mtime1,
+                        int64_t b_uncompressed_offset,
                         const string& section,
                         int *result_fd)
 {
@@ -2213,7 +2566,7 @@ handle_buildid_r_match (bool internal_req_p,
   if (rc != 0)
     throw libc_exception (errno, string("stat ") + b_source0);
 
-  if ((int64_t) fs.st_mtime != b_mtime)
+  if ((int64_t) fs.st_mtime != b_mtime0)
     {
       if (verbose)
         obatched(clog) << "mtime mismatch for " << b_source0 << endl;
@@ -2373,7 +2726,7 @@ handle_buildid_r_match (bool internal_req_p,
           break; // branch out of if "loop", to try new libarchive fetch attempt
         }
 
-      struct MHD_Response* r = create_buildid_r_response (b_mtime, b_source0,
+      struct MHD_Response* r = create_buildid_r_response (b_mtime0, b_source0,
                                                           b_source1, section,
                                                           ima_sig, NULL, fd,
                                                           fs.st_size,
@@ -2388,7 +2741,41 @@ handle_buildid_r_match (bool internal_req_p,
       // NB: see, we never go around the 'loop' more than once
     }
 
-  // no match ... grumble, must process the archive
+  // no match ... if the archive is seekable, try that first
+  if (b_uncompressed_offset > 0)
+    {
+      char* tmppath = NULL;
+      if (asprintf (&tmppath, "%s/debuginfod-fdcache.XXXXXX", tmpdir.c_str()) < 0)
+        throw libc_exception (ENOMEM, "cannot allocate tmppath");
+      defer_dtor<void*,void> tmmpath_freer (tmppath, free);
+
+      fd = extract_from_seekable_archive (b_source0, tmppath,
+                                          b_uncompressed_offset, b_size);
+      if (fd >= 0)
+        {
+          // Set the mtime so the fdcache file mtimes propagate to future webapi
+          // clients.
+          struct timespec tvs[2];
+          tvs[0].tv_sec = 0;
+          tvs[0].tv_nsec = UTIME_OMIT;
+          tvs[1].tv_sec = b_mtime1;
+          tvs[1].tv_nsec = 0;
+          (void) futimens (fd, tvs);  /* best effort */
+          struct MHD_Response* r = create_buildid_r_response (b_mtime0,
+                                                              b_source0,
+                                                              b_source1,
+                                                              section, ima_sig,
+                                                              tmppath, fd,
+                                                              b_size, b_mtime1,
+                                                              "seekable archive",
+                                                              extract_begin);
+          if (r != 0 && result_fd)
+            *result_fd = fd;
+          return r;
+        }
+    }
+
+  // still nothing ...  process the archive
   string archive_decoder = "/dev/null";
   string archive_extension = "";
   for (auto&& arch : scan_archives)
@@ -2512,7 +2899,7 @@ handle_buildid_r_match (bool internal_req_p,
           continue;
         }
 
-      r = create_buildid_r_response (b_mtime, b_source0, b_source1, section,
+      r = create_buildid_r_response (b_mtime0, b_source0, b_source1, section,
                                      ima_sig, tmppath, fd,
                                      archive_entry_size(e),
                                      archive_entry_mtime(e),
@@ -2583,21 +2970,26 @@ and will not query the upstream servers");
 
 static struct MHD_Response*
 handle_buildid_match (bool internal_req_p,
-                      int64_t b_mtime,
+                      int64_t b_mtime0,
                       const string& b_stype,
                       const string& b_source0,
                       const string& b_source1,
+                      int64_t b_size,
+                      int64_t b_mtime1,
+                      int64_t b_uncompressed_offset,
                       const string& section,
                       int *result_fd)
 {
   try
     {
       if (b_stype == "F")
-        return handle_buildid_f_match(internal_req_p, b_mtime, b_source0,
+        return handle_buildid_f_match(internal_req_p, b_mtime0, b_source0,
 				      section, result_fd);
       else if (b_stype == "R")
-        return handle_buildid_r_match(internal_req_p, b_mtime, b_source0,
-				      b_source1, section, result_fd);
+        return handle_buildid_r_match(internal_req_p, b_mtime0, b_source0,
+                                      b_source1, b_size, b_mtime1,
+                                      b_uncompressed_offset, section,
+                                      result_fd);
     }
   catch (const reportable_exception &e)
     {
@@ -2713,7 +3105,7 @@ handle_buildid (MHD_Connection* conn,
   if (atype_code == "D")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-d",
-                          "select mtime0, sourcetype, source0, source1 from " BUILDIDS "_query_d where buildid = ? "
+                          "select mtime0, sourcetype, source0, source1, size, mtime1, uncompressed_offset from " BUILDIDS "_query_d where buildid = ? "
                           "order by mtime0 desc");
       pp->reset();
       pp->bind(1, buildid);
@@ -2721,7 +3113,7 @@ handle_buildid (MHD_Connection* conn,
   else if (atype_code == "E")
     {
       pp = new sqlite_ps (thisdb, "mhd-query-e",
-                          "select mtime0, sourcetype, source0, source1 from " BUILDIDS "_query_e where buildid = ? "
+                          "select mtime0, sourcetype, source0, source1, size, mtime1, uncompressed_offset from " BUILDIDS "_query_e where buildid = ? "
                           "order by mtime0 desc");
       pp->reset();
       pp->bind(1, buildid);
@@ -2765,19 +3157,30 @@ handle_buildid (MHD_Connection* conn,
       if (rc != SQLITE_ROW)
         throw sqlite_exception(rc, "step");
 
-      int64_t b_mtime = sqlite3_column_int64 (*pp, 0);
+      int64_t b_mtime0 = sqlite3_column_int64 (*pp, 0);
       string b_stype = string((const char*) sqlite3_column_text (*pp, 1) ?: ""); /* by DDL may not be NULL */
       string b_source0 = string((const char*) sqlite3_column_text (*pp, 2) ?: ""); /* may be NULL */
       string b_source1 = string((const char*) sqlite3_column_text (*pp, 3) ?: ""); /* may be NULL */
+      int64_t b_size = 0;
+      int64_t b_mtime1 = 0;
+      int64_t b_uncompressed_offset = 0;
+      if (atype_code == "D" || atype_code == "E")
+        {
+          b_mtime1 = sqlite3_column_int64 (*pp, 5);
+          // These may be NULL, in which case sqlite3_column_int64() returns 0.
+          b_size = sqlite3_column_int64 (*pp, 4);
+          b_uncompressed_offset = sqlite3_column_int64 (*pp, 6);
+        }
 
       if (verbose > 1)
-        obatched(clog) << "found mtime=" << b_mtime << " stype=" << b_stype
+        obatched(clog) << "found mtime=" << b_mtime0 << " stype=" << b_stype
              << " source0=" << b_source0 << " source1=" << b_source1 << endl;
 
       // Try accessing the located match.
       // XXX: in case of multiple matches, attempt them in parallel?
       auto r = handle_buildid_match (conn ? false : true,
-                                     b_mtime, b_stype, b_source0, b_source1,
+                                     b_mtime0, b_stype, b_source0, b_source1,
+                                     b_size, b_mtime1, b_uncompressed_offset,
 				     section, result_fd);
       if (r)
         return r;
-- 
2.45.2



* Re: [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x
  2024-07-10 20:47 [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Omar Sandoval
                   ` (2 preceding siblings ...)
  2024-07-10 20:47 ` [PATCH 3/3] debuginfod: optimize extraction from seekable xz archives Omar Sandoval
@ 2024-07-11 20:16 ` Frank Ch. Eigler
  2024-07-11 23:00   ` Omar Sandoval
  3 siblings, 1 reply; 6+ messages in thread
From: Frank Ch. Eigler @ 2024-07-11 20:16 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: elfutils-devel, linux-debuggers

Hi, Omar -

Thanks.  I wish this sort of amazing kludge weren't necessary, but
given that it helps, so be it.

I'd like to commend you on the effort needed to match your code up
with the stylistic idiosyncrasies of the debuginfod C++ code.  It
looks just like the other code.  My only reservation is the schema
change.  Reindexing some of our large repos takes WEEKS.  Here's a
possible way to avoid that:

- Preserve the current BUILDID schema id and tables as is.

- Add a new table for the intra-archive coordinates.  Think of it like a cache.
  Index it with archive-file-name and content-file-name (source0, source1 IIRC);
  see the strawman sketch after this list.

- During a fetch out of the archive-file-name, check whether the new
  table has a record for that file.  If yes, cache hit, go through to
  the xz extraction stuff, winner!

- If not, try the is_seekable() check on the archive.  If it is true, we have an
  archive that should be seekable, but we don't have it in the intra-archive cache.
  So take this opportunity to index that archive (only), populate the cache table,
  as the archive is being extracted.  (No need to use the new cache data then, since
  we've just paid the effort of decompressing/reading the whole thing already.)

- Need to confirm that during grooming, a disappeared
  archive-file-name would also drop the corresponding intra-archive
  rows.

- Heck, during grooming or scanning, maybe the tool could preemptively
  do the intra-archive coordinate cache thing if it's not already
  done, just to defeat the latency of doing it on demand.
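
Concretely, something like this hypothetical fragment, in the style of
the existing DEBUGINFOD_SQLITE_DDL (table and column names are just a
strawman for discussion):

  // Strawman cache table for intra-archive coordinates, keyed by the
  // archive file and the content file, holding just enough to drive
  // the seekable xz extraction.
  "create table if not exists " BUILDIDS "_r_seekable (\n"
  "        file integer not null,\n"                // archive-file-name
  "        content integer not null,\n"             // content-file-name
  "        size integer not null,\n"
  "        mtime integer not null,\n"
  "        uncompressed_offset integer not null,\n"
  "        foreign key (file) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
  "        foreign key (content) references " BUILDIDS "_files(id) on update cascade on delete cascade,\n"
  "        primary key (file, content)\n"
  "        ) " WITHOUT_ROWID ";\n"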


What do you think?


- FChE



* Re: [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x
  2024-07-11 20:16 ` [PATCH 0/3] debuginfod: speed up extraction from kernel debuginfo packages by 200x Frank Ch. Eigler
@ 2024-07-11 23:00   ` Omar Sandoval
  0 siblings, 0 replies; 6+ messages in thread
From: Omar Sandoval @ 2024-07-11 23:00 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: elfutils-devel, linux-debuggers

On Thu, Jul 11, 2024 at 04:16:25PM -0400, Frank Ch. Eigler wrote:
> Hi, Omar -
> 
> Thanks.  I wish this sort of amazing kludge weren't necessary, but
> given that it helps, so be it.
> 
> I'd like to commend you on the effort needed to match your code up
> with the stylistic idiosyncrasies of the debuginfod C++ code.  It
> looks just like the other code.  My only reservation is the schema
> change.  Reindexing some of our large repos takes WEEKS.  Here's a
> possible way to avoid that:
> 
> - Preserve the current BUILDID schema id and tables as is.
> 
> - Add a new table for the intra-archive coordinates.  Think of it like a cache.
>   Index it with archive-file-name and content-file-name (source0, source1 IIRC).
> 
> - During a fetch out of the archive-file-name, check whether the new
>   table has a record for that file.  If yes, cache hit, go through to
>   the xz extraction stuff, winner!
> 
> - If not, try the is_seekable() check on the archive.  If it is true, we have an
>   archive that should be seekable, but we don't have it in the intra-archive cache.
>   So take this opportunity to index that archive (only), populate the cache table,
>   as the archive is being extracted.  (No need to use the new cache data then, since
>   we've just paid the effort of decompressing/reading the whole thing already.)
> 
> - Need to confirm that during grooming, a disappeared
>   archive-file-name would also drop the corresponding intra-archive
>   rows.
> 
> - Heck, during grooming or scanning, maybe the tool could preemptively
>   do the intra-archive coordinate cache thing if it's not already
>   done, just to defeat the latency of doing it on demand.
> 
> 
> What do you think?

Hi, Frank,

I didn't realize how expensive reindexing could be; thank you for
pointing that out.  Your proposal makes sense to me; I'll rework this.

Thanks,
Omar
