From: Denys Duchier <duchier@ps.uni-sb.de>
To: gnu-arch-users@gnu.org
Cc: gnu-arch-dev@lists.seyza.com, git@vger.kernel.org
Subject: Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'
Date: Thu, 21 Apr 2005 02:05:22 +0200 [thread overview]
Message-ID: <867jixata5.fsf@speedy.lifl.fr> (raw)
In-Reply-To: <200504202304.QAA17069@emf.net> (Tom Lord's message of "Wed, 20 Apr 2005 16:04:00 -0700 (PDT)")
[-- Attachment #1: Type: text/plain, Size: 2045 bytes --]
Tom Lord <lord@emf.net> writes:
> Thank you for your experiment.
you are welcome.
> I think that to a large extent you are seeing artifacts
> of the questionable trade-offs that (reports tell me) the
> ext* filesystems make. With a different filesystem, the
> results would be very different.
No, this is not the only thing that we observe. For example, here are the
reports for the following two experiments:
Indexing method = [2]
Max keys at level 0: 256
Max keys at level 1: 108
Total number of dirs: 257
Total number of keys: 21662
Disk footprint : 1.8M
Indexing method = [4 4]
Max keys at level 0: 18474
Max keys at level 1: 5
Max keys at level 2: 1
Total number of dirs: 40137
Total number of keys: 21662
Disk footprint : 157M
Notice the huge number of directories in the second experiment and they don't
help at all in performing discrimination.
> I'm imagining a blob database containing may revisions of the linux
> kernel. It will contain millions of blobs.
It is very easy to write code that uses an adaptive discrimination method
(i.e. when a directory becomes too full, introduce an additional level of
discrimination and rehash). In fact I have code that does that (rehashing if
the size of a leaf directory exceed 256), but the [2] method above doesn't even
need it even though it has 21662 keys.
Just in case there is some interest, I attach below the python scripts which I
used for my experiments:
To create an indexed archive:
python build.py SRC DST N1 ... Nk
where SRC is the root directory of the tree to be indexed, and DST names the
root directory of the indexed archive to be created. N1 through Nk are integers
that each indicate how many chars to chop off the key to create the next level
indexing key.
python info.py DST
collects and then prints out statistics about an indexed archive.
For example, the invocation that relates to your original proposal would be:
python build.py /usr/src/linux store 4 4
python info.py store
[-- Attachment #2: script to build an indexed archive --]
[-- Type: text/plain, Size: 1741 bytes --]
import os,os.path,stat,sha
tree = None
archive = None
slices = []
lastslice = (0,-1)
def recurse(path):
s = os.stat(path)
if stat.S_ISDIR(s.st_mode):
print path
contents = []
for n in os.listdir(path):
uid = recurse(os.path.join(path,n))
contents.append('\t'.join((n,uid)))
contents = '\n'.join(contents)
buf = sha.new(contents)
uid = buf.hexdigest()
uid = ','.join((uid,str(len(contents))))
store(uid)
return uid
else:
fd = file(path,"rb")
contents = fd.read()
fd.close()
buf = sha.new(contents)
uid = ','.join((buf.hexdigest(),str(s.st_size)))
store(uid)
return uid
def store(uid):
p = archive
if not os.path.exists(p):
os.mkdir(p)
for s in slices:
p = os.path.join(p,uid[s[0]:s[1]])
if not os.path.exists(p):
os.mkdir(p)
p = os.path.join(p,uid[lastslice[0]:lastslice[1]])
fd = file(p,"wb")
fd.close()
if __name__ == '__main__':
import sys
from optparse import OptionParser
from types import IntType
parser = OptionParser(usage="usage: %prog TREE ARCHIVE N1 ... Nk")
(options, args) = parser.parse_args()
if len(args) < 3:
print sys.stderr, "expected at least 3 positional arguments"
sys.exit(1)
tree = args[0]
archive = args[1]
prev = 0
for a in args[2:]:
try:
next = prev+int(a)
slices.append((prev,next))
prev = next
except:
print >>sys.stderr, "positional argument is not an integer:",a
sys.exit(1)
lastslice = (next,-1)
recurse(tree)
sys.exit(0)
[-- Attachment #3: script to print statistics about an indexed archive --]
[-- Type: text/plain, Size: 1214 bytes --]
import os,os.path,stat
info = []
archive = None
total_keys = 0
total_dirs = 0
def collect_info(path,i):
global total_dirs,total_keys
s = os.stat(path)
if stat.S_ISDIR(s.st_mode):
total_dirs += 1
l = os.listdir(path)
n = len(l)
if i==len(info):
info.append(n)
elif n>info[i]:
info[i] = n
i += 1
for f in l:
collect_info(os.path.join(path,f),i)
else:
total_keys += 1
def print_info():
i = 0
for n in info:
print "Max keys at level %2s: %7s" % (i,n)
i += 1
print "Total number of dirs: %7s" % total_dirs
print "Total number of keys: %7s" % total_keys
fd = os.popen("du -csh %s" % archive,"r")
s = fd.read()
fd.close()
s = s.split()[0]
print "Disk footprint : %7s" % s
if __name__ == '__main__':
import sys
from optparse import OptionParser
parser = OptionParser(usage="usage: %prog ARCHIVE")
(options, args) = parser.parse_args()
if len(args) != 1:
print sys.stderr, "expected exactly 1 positional argument"
sys.exit(1)
archive = args[0]
collect_info(archive,0)
print_info()
sys.exit(0)
[-- Attachment #4: Type: text/plain, Size: 216 bytes --]
Cheers,
PS: I should mention again, that my indexed archives only contain empty files
because I am only interested in measuring overhead.
--
Dr. Denys Duchier - IRI & LIFL - CNRS, Lille, France
AIM: duchierdenys
next prev parent reply other threads:[~2005-04-21 0:01 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-04-20 10:00 [ANNOUNCEMENT] /Arch/ embraces `git' Tom Lord
2005-04-20 10:19 ` Miles Bader
2005-04-20 17:15 ` duchier
2005-04-20 22:40 ` [Gnu-arch-users] Re: [GNU-arch-dev] " Tomas Mraz
2005-04-21 9:09 ` Denys Duchier
2005-04-21 10:21 ` Tomas Mraz
2005-04-21 11:46 ` [Gnu-arch-users] " duchier
2005-04-20 22:51 ` Tomas Mraz
2005-04-21 19:04 ` Tom Lord
2005-04-21 20:35 ` [Gnu-arch-users] Re: [GNU-arch-dev] " Tom Lord
2005-04-20 23:04 ` Tom Lord
2005-04-21 0:05 ` Denys Duchier [this message]
2005-04-21 20:39 ` [Gnu-arch-users] " Tom Lord
2005-04-21 7:49 ` Tomas Mraz
2005-04-21 21:51 ` [Gnu-arch-users] Re: [GNU-arch-dev] " Tom Lord
2005-04-21 21:52 ` Tom Lord
2005-04-22 16:13 ` Linus Torvalds
2005-04-22 17:39 ` Edésio Costa e Silva
2005-04-20 21:31 ` Petr Baudis
2005-04-20 21:55 ` C. Scott Ananian
2005-04-20 22:22 ` chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git') Linus Torvalds
2005-04-20 23:42 ` C. Scott Ananian
2005-04-22 21:02 ` blowing chunks (quick update) C. Scott Ananian
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=867jixata5.fsf@speedy.lifl.fr \
--to=duchier@ps.uni-sb.de \
--cc=git@vger.kernel.org \
--cc=gnu-arch-dev@lists.seyza.com \
--cc=gnu-arch-users@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.