From: Karsten Blees <karsten.blees@gmail.com>
To: Git List <git@vger.kernel.org>
Subject: [PATCH v1 4/4] hashmap: add string interning API
Date: Thu, 03 Jul 2014 00:22:54 +0200 [thread overview]
Message-ID: <53B4863E.1020701@gmail.com> (raw)
In-Reply-To: <53B48540.5070600@gmail.com>
Interning short strings with high probability of duplicates can reduce the
memory footprint and speed up comparisons.
Add strintern() and memintern() APIs that use a hashmap to manage the pool
of unique, interned strings.
Note: strintern(getenv()) could be used to sanitize git's use of getenv(),
in case we ever encounter a platform where a call to getenv() invalidates
previous getenv() results (which is allowed by POSIX).
Signed-off-by: Karsten Blees <blees@dcon.de>
---
Documentation/technical/api-hashmap.txt | 15 +++++++++++++
hashmap.c | 38 +++++++++++++++++++++++++++++++++
hashmap.h | 8 +++++++
t/t0011-hashmap.sh | 13 +++++++++++
test-hashmap.c | 14 ++++++++++++
5 files changed, 88 insertions(+)
diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index f9215d6..00c4c29 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -193,6 +193,21 @@ more entries.
`hashmap_iter_first` is a combination of both (i.e. initializes the iterator
and returns the first entry, if any).
+`const char *strintern(const char *string)`::
+`const void *memintern(const void *data, size_t len)`::
+
+ Returns the unique, interned version of the specified string or data,
+ similar to the `String.intern` API in Java and .NET, respectively.
+ Interned strings remain valid for the entire lifetime of the process.
++
+Can be used as `[x]strdup()` or `xmemdupz` replacement, except that interned
+strings / data must not be modified or freed.
++
+Interned strings are best used for short strings with high probability of
+duplicates.
++
+Uses a hashmap to store the pool of interned strings.
+
Usage example
-------------
diff --git a/hashmap.c b/hashmap.c
index d1b8056..f693839 100644
--- a/hashmap.c
+++ b/hashmap.c
@@ -226,3 +226,41 @@ void *hashmap_iter_next(struct hashmap_iter *iter)
current = iter->map->table[iter->tablepos++];
}
}
+
+struct pool_entry {
+ struct hashmap_entry ent;
+ size_t len;
+ unsigned char data[FLEX_ARRAY];
+};
+
+static int pool_entry_cmp(const struct pool_entry *e1,
+ const struct pool_entry *e2,
+ const unsigned char *keydata)
+{
+ return e1->data != keydata &&
+ (e1->len != e2->len || memcmp(e1->data, keydata, e1->len));
+}
+
+const void *memintern(const void *data, size_t len)
+{
+ static struct hashmap map;
+ struct pool_entry key, *e;
+
+ /* initialize string pool hashmap */
+ if (!map.tablesize)
+ hashmap_init(&map, (hashmap_cmp_fn) pool_entry_cmp, 0);
+
+ /* lookup interned string in pool */
+ hashmap_entry_init(&key, memhash(data, len));
+ key.len = len;
+ e = hashmap_get(&map, &key, data);
+ if (!e) {
+ /* not found: create it */
+ e = xmallocz(sizeof(struct pool_entry) + len);
+ hashmap_entry_init(e, key.ent.hash);
+ e->len = len;
+ memcpy(e->data, data, len);
+ hashmap_add(&map, e);
+ }
+ return e->data;
+}
diff --git a/hashmap.h b/hashmap.h
index 12f0668..507884b 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -87,4 +87,12 @@ static inline void *hashmap_iter_first(struct hashmap *map,
return hashmap_iter_next(iter);
}
+/* string interning */
+
+extern const void *memintern(const void *data, size_t len);
+static inline const char *strintern(const char *string)
+{
+ return memintern(string, strlen(string));
+}
+
#endif
diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
index 391e2b6..f97c805 100755
--- a/t/t0011-hashmap.sh
+++ b/t/t0011-hashmap.sh
@@ -237,4 +237,17 @@ test_expect_success 'grow / shrink' '
'
+test_expect_success 'string interning' '
+
+test_hashmap "intern value1
+intern Value1
+intern value2
+intern value2
+" "value1
+Value1
+value2
+value2"
+
+'
+
test_done
diff --git a/test-hashmap.c b/test-hashmap.c
index 3c9f67b..07aa7ec 100644
--- a/test-hashmap.c
+++ b/test-hashmap.c
@@ -234,6 +234,20 @@ int main(int argc, char *argv[])
/* print table sizes */
printf("%u %u\n", map.tablesize, map.size);
+ } else if (!strcmp("intern", cmd) && l1) {
+
+ /* test that strintern works */
+ const char *i1 = strintern(p1);
+ const char *i2 = strintern(p1);
+ if (strcmp(i1, p1))
+ printf("strintern(%s) returns %s\n", p1, i1);
+ else if (i1 == p1)
+ printf("strintern(%s) returns input pointer\n", p1);
+ else if (i1 != i2)
+ printf("strintern(%s) != strintern(%s)", i1, i2);
+ else
+ printf("%s\n", i1);
+
} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
perf_hashmap(atoi(p1), atoi(p2));
--
1.9.4.msysgit.0.dirty
next prev parent reply other threads:[~2014-07-02 22:23 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
2014-07-07 17:22 ` Junio C Hamano
2014-07-02 22:21 ` [PATCH v1 2/4] hashmap: improve struct hashmap member documentation Karsten Blees
2014-07-02 22:22 ` [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API Karsten Blees
2014-07-07 17:43 ` Junio C Hamano
2014-07-11 19:11 ` Karsten Blees
2014-07-11 22:21 ` Junio C Hamano
2014-07-02 22:22 ` Karsten Blees [this message]
2014-07-03 7:22 ` [PATCH v1 4/4] hashmap: add string interning API Matthieu Moy
2014-07-07 17:44 ` Junio C Hamano
2014-07-03 7:23 ` [PATCH v1 0/4] hashmap improvements Matthieu Moy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53B4863E.1020701@gmail.com \
--to=karsten.blees@gmail.com \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).