From: Johan Herland <johan@herland.net>
To: git@vger.kernel.org
Cc: Junio C Hamano <junkio@cox.net>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH] Change softrefs file format from text (82 bytes per entry) to binary (40 bytes per entry)
Date: Sat, 09 Jun 2007 20:25:30 +0200 [thread overview]
Message-ID: <200706092025.30156.johan@herland.net> (raw)
In-Reply-To: <200706092019.13185.johan@herland.net>
The text-based softrefs file format uses 82 bytes per entry (40 bytes
from_sha1 in hex, 1 byte SP, 40 bytes to_sha1 in hex, 1 byte LF).
The binary softrefs file format uses 40 bytes per entry (20 bytes
from_sha1, 20 bytes to_sha1).
Moving to a binary format increases performance slightly, but sacrifices
easy readability of the softrefs files.
Signed-off-by: Johan Herland <johan@herland.net>
---
To illustrate the change in performance from changing the softrefs file
format, I prepared a linux repo (holding 57274 commits) with 10 tag
objects, and created softrefs from every commit to every tag object
(572740 softrefs in total). The resulting softrefs db was 46964680 bytes
when using the text format, and 22909600 bytes when using the binary
format. The experiment was done on a 32-bit Intel Pentium 4
(3 GHz w/HyperThreading) with 1 GB RAM:
========
Operations on unsorted softrefs:
(572740 (10 per commit) entries in random/unsorted order)
========
Listing all softrefs
(sequential reading of unsorted softrefs file)
--------
[text format]
$ /usr/bin/time git softref --list > /dev/null
0.44user 0.02system 0:00.47elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+11786minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --list > /dev/null
0.35user 0.01system 0:00.38elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5913minor)pagefaults 0swaps
Listing HEAD's softrefs
(sequential reading of unsorted softrefs file)
--------
[text format]
$ /usr/bin/time git softref --list HEAD > /dev/null
0.11user 0.01system 0:00.14elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+11790minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --list HEAD > /dev/null
0.02user 0.01system 0:00.03elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5918minor)pagefaults 0swaps
Sorting softrefs
--------
[text format]
$ /usr/bin/time git softref --merge-unsorted
2.73user 4.97system 0:07.77elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+15833minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --merge-unsorted
1.78user 5.00system 0:06.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+9961minor)pagefaults 0swaps
Sorting softrefs into existing sorted file
(throwing away duplicates)
--------
[text format]
$ /usr/bin/time git softref --merge-unsorted
3.49user 5.12system 0:08.64elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+27300minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --merge-unsorted
2.03user 4.92system 0:07.05elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+15556minor)pagefaults 0swaps
========
Operations on sorted softrefs:
(572740 (10 per commit) entries in sorted order)
========
Listing all softrefs
(sequential reading of sorted softrefs file)
--------
[text format]
$ /usr/bin/time git softref --list > /dev/null
0.43user 0.02system 0:00.48elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+11786minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --list > /dev/null
0.37user 0.00system 0:00.38elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5914minor)pagefaults 0swaps
Listing HEAD's softrefs
(256-fanout followed by binary search in sorted softrefs file)
--------
[text format]
$/usr/bin/time git softref --list HEAD > /dev/null
0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+334minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --list HEAD > /dev/null
0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+333minor)pagefaults 0swaps
Sorting softrefs
(no-op)
--------
[text format]
$ /usr/bin/time git softref --merge-unsorted
0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+312minor)pagefaults 0swaps
[binary format]
$ /usr/bin/time git softref --merge-unsorted
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+312minor)pagefaults 0swaps
As expected, the binary format almost halved the number of pagefaults for
cases causing the entire softrefs db to be read (the reason for this, of
course, being the halved size of the db).
For the most common use case (looking up a given commit in a sorted db)
the binary format has no measurable effect on performance.
...Johan
Documentation/git-softref.txt | 6 ++--
softrefs.c | 78 +++++++++++------------------------------
softrefs.h | 7 ++--
3 files changed, 27 insertions(+), 64 deletions(-)
diff --git a/Documentation/git-softref.txt b/Documentation/git-softref.txt
index 6a3e13b..e21aaf3 100644
--- a/Documentation/git-softref.txt
+++ b/Documentation/git-softref.txt
@@ -93,9 +93,9 @@ Also, softrefs are stored in a way that makes it easy and quick to find all
the "to" objects reachable from a given "from" object.
The softrefs db consists of two files: `.git/softrefs.unsorted` and
-`.git/softrefs.sorted`. Both files use the same format; one softref per line
-of the form "`<from-sha1> <to-sha1>\n`". Each sha1 sum is 40 bytes long; this
-makes each entry exactly 82 bytes long.
+`.git/softrefs.sorted`. Both files use the same binary format; `<from-sha1>`
+followed by `<to-sha1>` per entry. Each sha1 sum is 20 bytes long; this
+makes each entry exactly 40 bytes long.
The entries in `.git/softrefs.sorted` are sorted on `<from-sha1>`, in order to
make lookup fast. This file is also known as the "sorted softrefs db".
diff --git a/softrefs.c b/softrefs.c
index c7308c8..8bb3a83 100644
--- a/softrefs.c
+++ b/softrefs.c
@@ -9,10 +9,8 @@ static const unsigned int MAX_UNSORTED_ENTRIES = 1000;
/* softref entry as it appears in a softrefs file */
struct softrefs_entry {
- char from_sha1_hex[40];
- char space;
- char to_sha1_hex[40];
- char lf;
+ unsigned char from_sha1[20];
+ unsigned char to_sha1[20];
};
/* simple encapsulation of a softrefs file */
@@ -106,8 +104,9 @@ static int write_entry(int fd, const struct softrefs_entry *entry)
if (write(fd, (const void *) entry, sizeof(struct softrefs_entry))
< sizeof(struct softrefs_entry))
{
- return error("Failed to write entry '%.40s -> %.40s' to softrefs file descriptor %i: %s",
- entry->from_sha1_hex, entry->to_sha1_hex,
+ return error("Failed to write entry '%s -> %s' to softrefs file descriptor %i: %s",
+ sha1_to_hex(entry->from_sha1),
+ sha1_to_hex(entry->to_sha1),
fd, strerror(errno));
}
return 0;
@@ -139,29 +138,13 @@ void deinit_softrefs_access()
internal_file = 0;
}
-/* comparison between a SHA1 sum and a softrefs entry */
-static int sha1_to_entry_cmp(
- const unsigned char *from_sha1, const struct softrefs_entry *entry)
-{
- unsigned char sha1[20];
- get_sha1_hex(entry->from_sha1_hex, sha1);
- return hashcmp(from_sha1, sha1);
-}
-
/* comparison between softrefs entries */
static int softrefs_entry_cmp(
const struct softrefs_entry *a, const struct softrefs_entry *b)
{
- unsigned char sa[20], sb[20];
- int ret;
- get_sha1_hex(a->from_sha1_hex, sa);
- get_sha1_hex(b->from_sha1_hex, sb);
- ret = hashcmp(sa, sb);
- if (!ret) {
- get_sha1_hex(a->to_sha1_hex, sa);
- get_sha1_hex(b->to_sha1_hex, sb);
- ret = hashcmp(sa, sb);
- }
+ int ret = hashcmp(a->from_sha1, b->from_sha1);
+ if (!ret)
+ ret = hashcmp(a->to_sha1, b->to_sha1);
return ret;
}
@@ -193,26 +176,15 @@ static int do_for_each_sequential(
unsigned long i,
int stop_at_first_non_match)
{
- unsigned char f_sha1[20], t_sha1[20]; /* Holds sha1 per entry */
int ret = 0;
for (; i < file->data_len; ++i) { /* Step through file, starting at i */
- /* sanity check entry */
- if (file->data[i].space != ' ' || file->data[i].lf != '\n') {
- ret = error("Entry #%lu in softrefs file %s failed sanity check",
- i, file->filename);
- break;
- }
- /* retrieve SHA1 values */
- if (get_sha1_hex(file->data[i].from_sha1_hex, f_sha1) ||
- get_sha1_hex(file->data[i].to_sha1_hex, t_sha1)) {
- ret = error("Failed to read SHA1 values from entry #%lu in softrefs file %s",
- i, file->filename);
- break;
- }
/* Compare to lookup value */
- if (!from_sha1 || !hashcmp(from_sha1, f_sha1)) {
- if ((ret = fn(f_sha1, t_sha1, cb_data)))
+ if (!from_sha1 || !hashcmp(from_sha1, file->data[i].from_sha1)) {
+ if ((ret = fn(file->data[i].from_sha1,
+ file->data[i].to_sha1, cb_data)))
+ {
break; /* bail out if callback returns != 0 */
+ }
}
else if (stop_at_first_non_match)
break;
@@ -277,7 +249,7 @@ static int do_for_each_sorted(
i = (from_sha1[0] * file->data_len) / 256;
/* Binary search */
- while ((cmp_result = sha1_to_entry_cmp(from_sha1, &(file->data[i])))) {
+ while ((cmp_result = hashcmp(from_sha1, file->data[i].from_sha1))) {
if (right - left <= 1) /* not found; give up */
goto done;
if (cmp_result > 0) /* go right */
@@ -288,7 +260,7 @@ static int do_for_each_sorted(
}
/* i points to a matching entry, but not necessarily the first */
- while (i >= 1 && sha1_to_entry_cmp(from_sha1, &(file->data[i - 1])) == 0)
+ while (i >= 1 && !hashcmp(from_sha1, file->data[i - 1].from_sha1))
--i;
/* i points to the first matching entry */
@@ -432,7 +404,6 @@ static int merge_unsorted_into_sorted(
while (!ret && (i < unsorted.data_len || j < sorted.data_len)) {
/* there are still entries in either list */
struct softrefs_entry *cur;
- unsigned char from_sha1[20], to_sha1[20];
if (i < unsorted.data_len && j < sorted.data_len) {
/* there are still entries in _both_ lists */
/* choose "lowest" entry from either list */
@@ -451,10 +422,8 @@ static int merge_unsorted_into_sorted(
prev = cur;
/* skip writing if softref involves a non-existing object */
- if (get_sha1_hex(cur->from_sha1_hex, from_sha1) ||
- !has_sha1_file(from_sha1) ||
- get_sha1_hex(cur->to_sha1_hex, to_sha1) ||
- !has_sha1_file( to_sha1))
+ if (!has_sha1_file(cur->from_sha1) ||
+ !has_sha1_file(cur->to_sha1))
{
continue;
}
@@ -523,7 +492,6 @@ static int merge_sorted_into_sorted(const char *from_file, const char *to_file)
while (!ret && (i < file1.data_len || j < file2.data_len)) {
/* there are still entries in either list */
struct softrefs_entry *cur;
- unsigned char from_sha1[20], to_sha1[20];
if (i < file1.data_len && j < file2.data_len) {
/* there are still entries in _both_ lists */
/* choose "lowest" entry from either list */
@@ -542,10 +510,8 @@ static int merge_sorted_into_sorted(const char *from_file, const char *to_file)
prev = cur;
/* skip writing if softref involves a non-existing object */
- if (get_sha1_hex(cur->from_sha1_hex, from_sha1) ||
- !has_sha1_file(from_sha1) ||
- get_sha1_hex(cur->to_sha1_hex, to_sha1) ||
- !has_sha1_file( to_sha1))
+ if (!has_sha1_file(cur->from_sha1) ||
+ !has_sha1_file(cur->to_sha1))
{
continue;
}
@@ -608,10 +574,8 @@ int add_softrefs(const struct softref_list *list)
/* nada */;
}
else { /* softref is ok */
- strcpy(entry.from_sha1_hex, sha1_to_hex(list->from_sha1));
- strcpy(entry.to_sha1_hex, sha1_to_hex(list->to_sha1));
- entry.space = ' ';
- entry.lf = '\n';
+ hashcpy(entry.from_sha1, list->from_sha1);
+ hashcpy(entry.to_sha1, list->to_sha1);
if (write_entry(fd, &entry))
error("Failed to write entry to softrefs file %s: %s",
git_path(UNSORTED_FILENAME),
diff --git a/softrefs.h b/softrefs.h
index db0f8b9..89d25ce 100644
--- a/softrefs.h
+++ b/softrefs.h
@@ -19,10 +19,9 @@
* the "to" objects reachable from a given "from" object.
*
* The softrefs db consists of two files: .git/softrefs.unsorted and
- * .git/softrefs.sorted. Both files use the same format; one softref per line
- * of the form "<from-sha1> <to-sha1>\n". Each sha1 sum is 40 bytes long; this
- * makes each entry exactly 82 bytes long (including the space between the sha1
- * sums and the terminating linefeed).
+ * .git/softrefs.sorted. Both files use the same binary format; <from-sha1>
+ * followed by <to-sha1> per entry. Each sha1 sum is 20 bytes long; this
+ * makes each entry exactly 40 bytes long.
*
* The entries in .git/softrefs.sorted are sorted on <from-sha1>, in order to
* make lookup fast.
--
1.5.2.1.144.gabc40
next prev parent reply other threads:[~2007-06-09 18:25 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-06-04 0:51 Refactoring the tag object; Introducing soft references (softrefs); Git 'notes' (take 2) Johan Herland
2007-06-04 0:51 ` [PATCH 0/6] Refactor the tag object Johan Herland
2007-06-04 0:52 ` [PATCH 1/6] Refactor git tag objects; make "tag" header optional; introduce new optional "keywords" header Johan Herland
2007-06-04 6:08 ` Matthias Lederhofer
2007-06-04 7:30 ` Johan Herland
2007-06-04 0:53 ` [PATCH 2/6] git-show: When showing tag objects with no tag name, show tag object's SHA1 instead of an empty string Johan Herland
2007-06-04 0:53 ` [PATCH 3/6] git-fsck: Do thorough verification of tag objects Johan Herland
2007-06-04 5:56 ` Matthias Lederhofer
2007-06-04 7:51 ` Johan Herland
2007-06-06 7:18 ` Junio C Hamano
2007-06-06 8:06 ` Johan Herland
2007-06-06 9:03 ` Junio C Hamano
2007-06-06 9:21 ` Junio C Hamano
2007-06-06 10:26 ` Johan Herland
2007-06-06 10:35 ` Junio C Hamano
2007-06-04 0:54 ` [PATCH 4/6] Documentation/git-mktag: Document the changes in tag object structure Johan Herland
2007-06-04 0:54 ` [PATCH 5/6] git-mktag tests: Fix and expand the mktag tests according to the new " Johan Herland
2007-06-04 0:54 ` [PATCH 6/6] Add fsck_verify_ref_to_tag_object() to verify that refname matches name stored in tag object Johan Herland
2007-06-04 20:32 ` [PATCH 0/6] Refactor the " Junio C Hamano
2007-06-07 22:13 ` [PATCH] Fix bug in tag parsing when thorough verification was in effect Johan Herland
2007-06-09 18:19 ` [PATCH 0/7] Introduce soft references (softrefs) Johan Herland
2007-06-09 18:21 ` [PATCH 1/7] Softrefs: Add softrefs header file with API documentation Johan Herland
2007-06-10 6:58 ` Johannes Schindelin
2007-06-10 7:43 ` Junio C Hamano
2007-06-10 7:54 ` Johannes Schindelin
2007-06-10 14:00 ` Johan Herland
2007-06-10 14:27 ` Jakub Narebski
2007-06-10 14:45 ` [PATCH] Teach git-gc to merge unsorted softrefs Johan Herland
2007-06-09 18:22 ` [PATCH 2/7] Softrefs: Add implementation of softrefs API Johan Herland
2007-06-09 18:22 ` [PATCH 3/7] Softrefs: Add git-softref, a builtin command for adding, listing and administering softrefs Johan Herland
2007-06-09 18:23 ` [PATCH 4/7] Softrefs: Add manual page documenting git-softref and softrefs subsystem in general Johan Herland
2007-06-09 18:23 ` [PATCH 5/7] Softrefs: Add testcases for basic softrefs behaviour Johan Herland
2007-06-09 18:24 ` [PATCH 6/7] Softrefs: Administrivia associated with softrefs subsystem and git-softref builtin Johan Herland
2007-06-09 18:24 ` [PATCH 7/7] Teach git-mktag to register softrefs for all tag objects Johan Herland
2007-06-09 18:25 ` Johan Herland [this message]
2007-06-10 8:02 ` [PATCH] Change softrefs file format from text (82 bytes per entry) to binary (40 bytes per entry) Johannes Schindelin
2007-06-10 8:30 ` Junio C Hamano
2007-06-10 9:46 ` Johannes Schindelin
2007-06-10 14:03 ` Johan Herland
2007-06-09 23:55 ` Comment on weak refs Junio C Hamano
2007-06-10 1:25 ` Johan Herland
2007-06-10 6:33 ` Johannes Schindelin
2007-06-10 13:41 ` Johan Herland
2007-06-10 14:09 ` Pierre Habouzit
2007-06-10 14:25 ` Pierre Habouzit
2007-06-10 9:03 ` Pierre Habouzit
2007-06-10 15:26 ` Jakub Narebski
2007-06-09 22:57 ` Refactoring the tag object; Introducing soft references (softrefs); Git 'notes' (take 2) Steven Grimm
2007-06-09 23:16 ` Johan Herland
2007-06-10 8:29 ` Pierre Habouzit
2007-06-10 14:31 ` Johan Herland
2007-06-10 19:42 ` Steven Grimm
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200706092025.30156.johan@herland.net \
--to=johan@herland.net \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).