[PATCH v3 00/17] Refactor chunk-format into an API - Derrick Stolee via GitGitGadget

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de,
	szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Derrick Stolee <derrickstolee@github.com>
Subject: [PATCH v3 00/17] Refactor chunk-format into an API
Date: Fri, 05 Feb 2021 14:30:35 +0000	[thread overview]
Message-ID: <pull.848.v3.git.1612535452.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.848.v2.git.1611759716.gitgitgadget@gmail.com>

This is a restart on the topic previously submitted [1] but dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further to make struct chunkfile
anonymous to API consumers. It is defined only in chunk-format.c, which
should hopefully deter future users from interacting with that data
directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API is beneficial to reduce duplicated logic. Or rather, to
ensure that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.

Here are some stats for the end-to-end change:

 * 570 insertions(+), 456 deletions(-).
 * commit-graph.c: 107 insertions(+), 192 deletions(-)
 * midx.c: 164 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get
smaller. Boilerplate things like abstracting method to match chunk_write_fn
and chunk_read_fn make up a lot of these insertions. The "interesting" code
gets a lot smaller and cleaner.


Updates in V3
=============

 * API methods use better types and changed their order to match internal
   data more closely.

 * Use hashfile_total() instead of internal data values.

 * The implementation of pair_chunk() uses read_chunk().

 * init_chunkfile() has an in-code doc comment warning against using the
   same struct chunkfile for reads and writes.

 * More multiplications are correctly cast in midx.c.

 * The chunk-format technical docs are expanded.


Updates in V2
=============

 * The method pair_chunk() now automatically sets a pointer while
   read_chunk() uses the callback. This greatly reduces the code size.

 * Pointer casts are now implicit instead of explicit.

 * Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      | 116 +++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 180 ++++++++
 chunk-format.h                                |  65 +++
 commit-graph.c                                | 299 +++++-------
 midx.c                                        | 431 +++++++-----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 648 insertions(+), 458 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/848

Range-diff vs v2:

  1:  243dcec94368 =  1:  243dcec94368 commit-graph: anonymize data in chunk_write_fn
  2:  814512f21671 !  2:  16c37d2370cf chunk-format: create chunk format write API
     @@ Commit message
           5. free the chunkfile struct using free_chunkfile().
      
          Helped-by: Taylor Blau <me@ttaylorr.com>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Makefile ##
     @@ chunk-format.c (new)
      +}
      +
      +void add_chunk(struct chunkfile *cf,
     -+	       uint64_t id,
     -+	       chunk_write_fn fn,
     -+	       size_t size)
     ++	       uint32_t id,
     ++	       size_t size,
     ++	       chunk_write_fn fn)
      +{
      +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
      +
     @@ chunk-format.c (new)
      +int write_chunkfile(struct chunkfile *cf, void *data)
      +{
      +	int i;
     -+	size_t cur_offset = cf->f->offset + cf->f->total;
     ++	uint64_t cur_offset = hashfile_total(cf->f);
      +
      +	/* Add the table of contents to the current offset */
      +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
     @@ chunk-format.c (new)
      +	hashwrite_be64(cf->f, cur_offset);
      +
      +	for (i = 0; i < cf->chunks_nr; i++) {
     -+		uint64_t start_offset = cf->f->total + cf->f->offset;
     ++		off_t start_offset = hashfile_total(cf->f);
      +		int result = cf->chunks[i].write_fn(cf->f, data);
      +
      +		if (result)
      +			return result;
      +
     -+		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
     ++		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
      +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
      +			    cf->chunks[i].size, cf->chunks[i].id,
     -+			    cf->f->total + cf->f->offset - start_offset);
     ++			    hashfile_total(cf->f) - start_offset);
      +	}
      +
      +	return 0;
     @@ chunk-format.h (new)
      +struct chunkfile *init_chunkfile(struct hashfile *f);
      +void free_chunkfile(struct chunkfile *cf);
      +int get_num_chunks(struct chunkfile *cf);
     -+typedef int (*chunk_write_fn)(struct hashfile *f,
     -+			      void *data);
     ++typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
      +void add_chunk(struct chunkfile *cf,
     -+	       uint64_t id,
     -+	       chunk_write_fn fn,
     -+	       size_t size);
     ++	       uint32_t id,
     ++	       size_t size,
     ++	       chunk_write_fn fn);
      +int write_chunkfile(struct chunkfile *cf, void *data);
      +
      +#endif
  3:  70af6e3083f4 !  3:  e549e24d79af commit-graph: use chunk-format write API
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	chunks[2].write_fn = write_graph_chunk_data;
      +	cf = init_chunkfile(f);
      +
     -+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
     -+		  write_graph_chunk_fanout, GRAPH_FANOUT_SIZE);
     -+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP,
     -+		  write_graph_chunk_oids, hashsz * ctx->commits.nr);
     -+	add_chunk(cf, GRAPH_CHUNKID_DATA,
     -+		  write_graph_chunk_data, (hashsz + 16) * ctx->commits.nr);
     ++	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, GRAPH_FANOUT_SIZE,
     ++		  write_graph_chunk_fanout);
     ++	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, hashsz * ctx->commits.nr,
     ++		  write_graph_chunk_oids);
     ++	add_chunk(cf, GRAPH_CHUNKID_DATA, (hashsz + 16) * ctx->commits.nr,
     ++		  write_graph_chunk_data);
       
       	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
       		ctx->write_generation_data = 0;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	}
      +	if (ctx->write_generation_data)
      +		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
     -+			  write_graph_chunk_generation_data,
     -+			  sizeof(uint32_t) * ctx->commits.nr);
     ++			  sizeof(uint32_t) * ctx->commits.nr,
     ++			  write_graph_chunk_generation_data);
      +	if (ctx->num_generation_data_overflows)
      +		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
     -+			  write_graph_chunk_generation_data_overflow,
     -+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows);
     ++			  sizeof(timestamp_t) * ctx->num_generation_data_overflows,
     ++			  write_graph_chunk_generation_data_overflow);
      +	if (ctx->num_extra_edges)
      +		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
     -+			  write_graph_chunk_extra_edges,
     -+			  4 * ctx->num_extra_edges);
     ++			  4 * ctx->num_extra_edges,
     ++			  write_graph_chunk_extra_edges);
       	if (ctx->changed_paths) {
      -		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
      -		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	chunks[num_chunks].id = 0;
      -	chunks[num_chunks].size = 0;
      +		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
     -+			  write_graph_chunk_bloom_indexes,
     -+			  sizeof(uint32_t) * ctx->commits.nr);
     ++			  sizeof(uint32_t) * ctx->commits.nr,
     ++			  write_graph_chunk_bloom_indexes);
      +		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -+			  write_graph_chunk_bloom_data,
      +			  sizeof(uint32_t) * 3
     -+				+ ctx->total_bloom_filter_data_size);
     ++				+ ctx->total_bloom_filter_data_size,
     ++			  write_graph_chunk_bloom_data);
      +	}
      +	if (ctx->num_commit_graphs_after > 1)
      +		add_chunk(cf, GRAPH_CHUNKID_BASE,
     -+			  write_graph_chunk_base,
     -+			  hashsz * (ctx->num_commit_graphs_after - 1));
     ++			  hashsz * (ctx->num_commit_graphs_after - 1),
     ++			  write_graph_chunk_base);
       
       	hashwrite_be32(f, GRAPH_SIGNATURE);
       
  4:  0cac7890bed7 =  4:  66ff49ed9309 midx: rename pack_info to write_midx_context
  5:  4a4e90b129ae =  5:  1d7484c0cffa midx: use context in write_midx_pack_names()
  6:  30ad423997b7 =  6:  ea0e7d40e537 midx: add entries to write_midx_context
  7:  2f1c496f3ab5 =  7:  b283a38fb775 midx: add pack_perm to write_midx_context
  8:  c4939548e51c =  8:  e7064512ab7f midx: add num_large_offsets to write_midx_context
  9:  b3cc73c22567 !  9:  7aa3242e15b7 midx: return success/failure in chunk write methods
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
       	stop_progress(&progress);
       
      -	if (written != chunk_offsets[num_chunks])
     -+	if (f->total + f->offset != chunk_offsets[num_chunks])
     ++	if (hashfile_total(f) != chunk_offsets[num_chunks])
       		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
      -		    written,
     -+		    f->total + f->offset,
     ++		    hashfile_total(f),
       		    chunk_offsets[num_chunks]);
       
       	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 10:  78744d3b7016 ! 10:  70f68c95e479 midx: drop chunk progress during write
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
       	}
      -	stop_progress(&progress);
       
     - 	if (f->total + f->offset != chunk_offsets[num_chunks])
     + 	if (hashfile_total(f) != chunk_offsets[num_chunks])
       		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
 11:  07dc0cf8c683 ! 11:  787cd7f18d2e midx: use chunk-format API in write_midx_internal()
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      -			case MIDX_CHUNKID_PACKNAMES:
      -				write_midx_pack_names(f, &ctx);
      -				break;
     -+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES,
     -+		  write_midx_pack_names, pack_name_concat_len);
     -+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
     -+		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
     ++	add_chunk(cf, MIDX_CHUNKID_PACKNAMES, pack_name_concat_len,
     ++		  write_midx_pack_names);
     ++	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
     ++		  write_midx_oid_fanout);
      +	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
     -+		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
     ++		  ctx.entries_nr * the_hash_algo->rawsz,
     ++		  write_midx_oid_lookup);
      +	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
     -+		  write_midx_object_offsets,
     -+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
     ++		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     ++		  write_midx_object_offsets);
       
      -			case MIDX_CHUNKID_OIDFANOUT:
      -				write_midx_oid_fanout(f, &ctx);
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      -	}
      +	if (ctx.large_offsets_needed)
      +		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
     -+			write_midx_large_offsets,
     -+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     ++			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     ++			write_midx_large_offsets);
       
     --	if (f->total + f->offset != chunk_offsets[num_chunks])
     +-	if (hashfile_total(f) != chunk_offsets[num_chunks])
      -		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
     --		    f->total + f->offset,
     +-		    hashfile_total(f),
      -		    chunk_offsets[num_chunks]);
      +	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
      +	write_chunkfile(cf, &ctx);
 12:  d8d8e9e2aa3f ! 12:  366eb2afee83 chunk-format: create read chunk API
     @@ Commit message
          read. If the same struct instance was used for both reads and writes,
          then there would be failures.
      
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## chunk-format.c ##
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +	return 0;
      +}
      +
     ++static int pair_chunk_fn(const unsigned char *chunk_start,
     ++			 size_t chunk_size,
     ++			 void *data)
     ++{
     ++	const unsigned char **p = data;
     ++	*p = chunk_start;
     ++	return 0;
     ++}
     ++
      +int pair_chunk(struct chunkfile *cf,
      +	       uint32_t chunk_id,
      +	       const unsigned char **p)
      +{
     -+	int i;
     -+
     -+	for (i = 0; i < cf->chunks_nr; i++) {
     -+		if (cf->chunks[i].id == chunk_id) {
     -+			*p = cf->chunks[i].start;
     -+			return 0;
     -+		}
     -+	}
     -+
     -+	return CHUNK_NOT_FOUND;
     ++	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
      +}
      +
      +int read_chunk(struct chunkfile *cf,
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +}
      
       ## chunk-format.h ##
     +@@
     + struct hashfile;
     + struct chunkfile;
     + 
     ++/*
     ++ * Initialize a 'struct chunkfile' for writing _or_ reading a file
     ++ * with the chunk format.
     ++ *
     ++ * If writing a file, supply a non-NULL 'struct hashfile *' that will
     ++ * be used to write.
     ++ *
     ++ * If reading a file, then supply the memory-mapped data to the
     ++ * pair_chunk() or read_chunk() methods, as appropriate.
     ++ *
     ++ * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
     ++ * for reading and writing.
     ++ */
     + struct chunkfile *init_chunkfile(struct hashfile *f);
     + void free_chunkfile(struct chunkfile *cf);
     + int get_num_chunks(struct chunkfile *cf);
      @@ chunk-format.h: void add_chunk(struct chunkfile *cf,
     - 	       size_t size);
     + 	       chunk_write_fn fn);
       int write_chunkfile(struct chunkfile *cf, void *data);
       
      +int read_table_of_contents(struct chunkfile *cf,
 13:  8744d2785965 = 13:  7838ad32e2e0 commit-graph: use chunk-format read API
 14:  750c03253c95 ! 14:  6bddd9e63b9b midx: use chunk-format read API
     @@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, i
       	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
       
       	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
     +@@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
     + cleanup_fail:
     + 	free(m);
     + 	free(midx_name);
     ++	free(cf);
     + 	if (midx_map)
     + 		munmap(midx_map, midx_size);
     + 	if (0 <= fd)
      
       ## t/t5319-multi-pack-index.sh ##
      @@ t/t5319-multi-pack-index.sh: test_expect_success 'verify bad OID version' '
 15:  83d292532a0f ! 15:  3cd97f389f1f midx: use 64-bit multiplication for chunk sizes
     @@ Commit message
          multiplication always. This allows us to properly predict the chunk
          sizes without risk of overflow.
      
     +    Other possible overflows were discovered by evaluating each
     +    multiplication in midx.c and ensuring that at least one side of the
     +    operator was of type size_t or off_t.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## midx.c ##
     +@@ midx.c: static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
     + 	const unsigned char *offset_data;
     + 	uint32_t offset32;
     + 
     +-	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
     ++	offset_data = m->chunk_object_offsets + (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH;
     + 	offset32 = get_be32(offset_data + sizeof(uint32_t));
     + 
     + 	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
     +@@ midx.c: static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
     + 
     + static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
     + {
     +-	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
     ++	return get_be32(m->chunk_object_offsets +
     ++			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
     + }
     + 
     + static int nth_midxed_pack_entry(struct repository *r,
      @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
     - 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
     - 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
     + 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
     + 		  write_midx_oid_fanout);
       	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
     --		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
     -+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
     +-		  ctx.entries_nr * the_hash_algo->rawsz,
     ++		  (size_t)ctx.entries_nr * the_hash_algo->rawsz,
     + 		  write_midx_oid_lookup);
       	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
     - 		  write_midx_object_offsets,
     - 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
     -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
     +-		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     ++		  (size_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     + 		  write_midx_object_offsets);
     + 
       	if (ctx.large_offsets_needed)
       		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
     - 			write_midx_large_offsets,
     --			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     -+			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     +-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     ++			(size_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     + 			write_midx_large_offsets);
       
       	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
     - 	write_chunkfile(cf, &ctx);
 16:  669eeec707ab ! 16:  b9a1bddf615f chunk-format: restore duplicate chunk checks
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## chunk-format.c ##
     -@@ chunk-format.c: struct chunk_info {
     - 	chunk_write_fn write_fn;
     - 
     - 	const void *start;
     -+	unsigned found:1;
     - };
     - 
     - struct chunkfile {
      @@ chunk-format.c: int read_table_of_contents(struct chunkfile *cf,
       			   uint64_t toc_offset,
       			   int toc_length)
 17:  8f3985ab5df3 ! 17:  4c7d751f1e39 chunk-format: add technical docs
     @@ Documentation/technical/chunk-format.txt (new)
      +
      +Functions for working with chunk-based file formats are declared in
      +`chunk-format.h`. Using these methods provide extra checks that assist
     -+developers when creating new file formats, including:
     ++developers when creating new file formats.
      +
     -+ 1. Writing and reading the table of contents.
     ++Writing chunk-based file formats
     ++--------------------------------
      +
     -+ 2. Verifying that the data written in a chunk matches the expected size
     -+    that was recorded in the table of contents.
     ++To write a chunk-based file format, create a `struct chunkfile` by
     ++calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
     ++caller is responsible for opening the `hashfile` and writing header
     ++information so the file format is identifiable before the chunk-based
     ++format begins.
      +
     -+ 3. Checking that a table of contents describes offsets properly within
     -+    the file boundaries.
     ++Then, call `add_chunk()` for each chunk that is intended for write. This
     ++populates the `chunkfile` with information about the order and size of
     ++each chunk to write. Provide a `chunk_write_fn` function pointer to
     ++perform the write of the chunk data upon request.
     ++
     ++Call `write_chunkfile()` to write the table of contents to the `hashfile`
     ++followed by each of the chunks. This will verify that each chunk wrote
     ++the expected amount of data so the table of contents is correct.
     ++
     ++Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
     ++caller is responsible for finalizing the `hashfile` by writing the trailing
     ++hash and closing the file.
     ++
     ++Reading chunk-based file formats
     ++--------------------------------
     ++
     ++To read a chunk-based file format, the file must be opened as a
     ++memory-mapped region. The chunk-format API expects that the entire file
     ++is mapped as a contiguous memory region.
     ++
     ++Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
     ++
     ++After reading the header information from the beginning of the file,
     ++including the chunk count, call `read_table_of_contents()` to populate
     ++the `struct chunkfile` with the list of chunks, their offsets, and their
     ++sizes.
     ++
     ++Extract the data information for each chunk using `pair_chunk()` or
     ++`read_chunk()`:
     ++
     ++* `pair_chunk()` assigns a given pointer with the location inside the
     ++  memory-mapped file corresponding to that chunk's offset. If the chunk
     ++  does not exist, then the pointer is not modified.
     ++
     ++* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
     ++  with the appropriate initial pointer and size information. The function
     ++  is not called if the chunk does not exist. Use this method to read chunks
     ++  if you need to perform immediate parsing or if you need to execute logic
     ++  based on the size of the chunk.
     ++
     ++After calling these methods, call `free_chunkfile()` to clear the
     ++`struct chunkfile` data. This will not close the memory-mapped region.
     ++Callers are expected to own that data for the timeframe the pointers into
     ++the region are needed.
     ++
     ++Examples
     ++--------
     ++
     ++These file formats use the chunk-format API, and can be used as examples
     ++for future formats:
     ++
     ++* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
     ++  in `commit-graph.c` for how the chunk-format API is used to write and
     ++  parse the commit-graph file format documented in
     ++  link:technical/commit-graph-format.html[the commit-graph file format].
     ++
     ++* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
     ++  in `midx.c` for how the chunk-format API is used to write and
     ++  parse the multi-pack-index file format documented in
     ++  link:technical/pack-format.html[the multi-pack-index file format].
      
       ## Documentation/technical/commit-graph-format.txt ##
      @@ Documentation/technical/commit-graph-format.txt: CHUNK LOOKUP:

-- 
gitgitgadget

next prev parent reply	other threads:[~2021-02-05 22:14 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27  1:53   ` Chris Torek
2021-01-27  2:36     ` Taylor Blau
2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-01-27  2:42   ` Taylor Blau
2021-01-27 13:49     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27  2:47   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27  2:49   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-01-27  3:02   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27  3:06   ` Taylor Blau
2021-01-27 13:50     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-01-26 22:37 ` [PATCH 00/17] Refactor chunk-format into an API Junio C Hamano
2021-01-27  2:29 ` Taylor Blau
2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-04 21:24     ` Junio C Hamano
2021-02-04 22:40       ` Junio C Hamano
2021-02-05 11:37       ` Derrick Stolee
2021-02-05 19:25         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-04 22:59     ` Junio C Hamano
2021-02-05 11:42       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-04 23:40     ` Junio C Hamano
2021-02-05 12:19       ` Derrick Stolee
2021-02-05 19:37         ` Junio C Hamano
2021-02-08 22:26           ` Junio C Hamano
2021-02-09  1:33             ` Derrick Stolee
2021-02-09 20:47               ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05  0:00     ` Junio C Hamano
2021-02-05 10:59       ` Chris Torek
2021-02-05 20:41         ` Junio C Hamano
2021-02-06 20:35           ` Chris Torek
2021-02-05 12:30       ` Derrick Stolee
2021-02-05 19:42         ` Junio C Hamano
2021-02-07 19:50       ` SZEDER Gábor
2021-02-08  5:41         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05  0:05     ` Junio C Hamano
2021-02-05 12:31       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-05  0:15     ` Junio C Hamano
2021-01-27 16:03   ` [PATCH v2 00/17] Refactor chunk-format into an API Taylor Blau
2021-02-05  2:08   ` Junio C Hamano
2021-02-05  2:27     ` Derrick Stolee
2021-02-05 14:30   ` Derrick Stolee via GitGitGadget [this message]
2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-07 21:13       ` SZEDER Gábor
2021-02-08 13:44         ` Derrick Stolee
2021-02-11 19:43           ` SZEDER Gábor
2021-02-05 14:30     ` [PATCH v3 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-07 20:20       ` SZEDER Gábor
2021-02-08 13:35         ` Derrick Stolee
2021-02-05 14:30     ` [PATCH v3 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-24 16:52         ` SZEDER Gábor
2021-02-24 17:12           ` Taylor Blau
2021-02-24 17:52             ` Derrick Stolee
2021-02-24 19:44               ` Junio C Hamano
2021-02-18 14:07       ` [PATCH v4 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 21:47         ` Junio C Hamano
2021-02-19 12:42           ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.848.v3.git.1612535452.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.