From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Haggerty Subject: Re: [PATCH v5 2/3] path: optimize common dir checking Date: Mon, 05 Oct 2015 05:00:01 +0200 Message-ID: <5611E7B1.3090001@alum.mit.edu> References: <1441073591-639-1-git-send-email-dturner@twopensource.com> <1441073591-639-3-git-send-email-dturner@twopensource.com> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit Cc: Junio C Hamano To: David Turner , git@vger.kernel.org, Christian Couder X-From: git-owner@vger.kernel.org Mon Oct 05 05:07:16 2015 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Ziw7P-0003AN-9Y for gcvg-git-2@plane.gmane.org; Mon, 05 Oct 2015 05:07:15 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751959AbbJEDHI (ORCPT ); Sun, 4 Oct 2015 23:07:08 -0400 Received: from alum-mailsec-scanner-4.mit.edu ([18.7.68.15]:44846 "EHLO alum-mailsec-scanner-4.mit.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751495AbbJEDHH (ORCPT ); Sun, 4 Oct 2015 23:07:07 -0400 X-Greylist: delayed 423 seconds by postgrey-1.27 at vger.kernel.org; Sun, 04 Oct 2015 23:07:07 EDT X-AuditID: 1207440f-f79df6d000007c0f-97-5611e7b4be3a Received: from outgoing-alum.mit.edu (OUTGOING-ALUM.MIT.EDU [18.7.68.33]) by alum-mailsec-scanner-4.mit.edu (Symantec Messaging Gateway) with SMTP id 95.26.31759.4B7E1165; Sun, 4 Oct 2015 23:00:04 -0400 (EDT) Received: from [192.168.69.130] (p4FC96E0C.dip0.t-ipconnect.de [79.201.110.12]) (authenticated bits=0) (User authenticated as mhagger@ALUM.MIT.EDU) by outgoing-alum.mit.edu (8.13.8/8.12.4) with ESMTP id t953028K018696 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Sun, 4 Oct 2015 23:00:03 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.8.0 In-Reply-To: <1441073591-639-3-git-send-email-dturner@twopensource.com> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFprFKsWRmVeSWpSXmKPExsUixO6iqLvluWCYwdGVXBa3Z7YwW8zfdILR outKN5NFQ+8VZgcWj52z7rJ7XLyk7LHg+X12j8+b5AJYorhtkhJLyoIz0/P07RK4M75/ucte sE6s4sbJHpYGxmahLkYODgkBE4nr/xO7GDmBTDGJC/fWs3UxcnEICVxmlLjR9pMdwjnHJPHy wlwWkAZeAW2J6wusQRpYBFQlVsx/ygRiswnoSizqaQazRQWCJFYsf8EIYvMKCEqcnPmEBcQW EaiQWPH1HjuIzSygJnFoySOwGmEBO4lL935BLW5klNj3/ApYA6eAu0TD7ZvMEA16Ejuu/2KF sOUlmrfOZp7AKDALyY5ZSMpmISlbwMi8ilEuMac0Vzc3MTOnODVZtzg5MS8vtUjXRC83s0Qv NaV0EyMknPl3MHatlznEKMDBqMTDeyBeMEyINbGsuDL3EKMkB5OSKG+oGlCILyk/pTIjsTgj vqg0J7X4EKMEB7OSCO/rx0A53pTEyqrUonyYlDQHi5I4r/oSdT8hgfTEktTs1NSC1CKYrAwH h5IE76NnQI2CRanpqRVpmTklCGkmDk6Q4VxSIsWpeSmpRYmlJRnxoEiNLwbGKkiKB2ivBUg7 b3FBYi5QFKL1FKOilDjvDZCEAEgiozQPbiwsSb1iFAf6Uph3E0gVDzDBwXW/AhrMBDR4gSEf yOCSRISUVAPjhgTLo0ZPrl89Y9i9vv7beeYjfJt1XOz/ON6u3/HjXdjmnK8Hy6Nv2UpOjt7L eI+h4h+LvyHD22d7WsXmGTs4hax69VxX4MjD8qmzEjKPK4YE5h0ou3+o9+Vm1xu8 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On 09/01/2015 04:13 AM, David Turner wrote: > Instead of a linear search over common_list to check whether > a path is common, use a trie. The trie search operates on > path prefixes, and handles excludes. Here I am, coming late to the discussion as usual. Sorry for that. I dug into this code yesterday and got all nerd-tingly and started refactoring and optimizing it. But after I slept on it I realized that I'm a bit hazy on its justification. A trie is a beautiful data structure and all, but have there been any benchmarks showing (1) that this lookup is a bottleneck and (2) that the trie is an improvement on something simpler, like, say, a sorted string_list? Or is there a realistic hope that the trie might be generally useful for other purposes? The latter seems unlikely, at least with the current implementation, because it is very wasteful of space. It allocates one or two (usually two) `struct trie` for every single string that is added to it. Each `struct trie` contains an array of 256 pointers and usually needs two malloc calls to instantiate. So *each entry* stored in the trie costs something like 4 kilobytes on a 64-bit system, plus usually 4 calls to malloc. The large memory footprint, in turn, will cause access non-locality and might impact the lookup performance. So it is pretty clear that the current code would be unusable for a large number of strings. For this particular application, where we only have 19 strings to store, I suppose we could tolerate the use of approximately 64k of RAM to store 174 characters worth of strings *if* it would bring us big time savings. But I think we need some evidence of the time savings. If this lookup is really a bottleneck, I bet there are other alternatives that are just as fast as this trie and use less code, especially given that there are only 19 strings that need checking. With respect to the implementation, it looks correct to me. I would make the following three suggestions: * Please document that the `contents` field of `struct trie` is not NUL-terminated, because that would otherwise be a common assumption. (It is clearly not NUL-terminated because of the way it is initialized using xmalloc and memcpy in make_trie_node() and because of the way its length is adjusted in add_to_trie().) * But in add_to_trie(), you set `contents` using xstrdup(), which *does* NUL-terminate the string. It would be more consistent to set it using xmalloc and memcpy here. * Please use size_t instead of int for indexing into strings, at least in the trie_find() function, where the input is likely to be under the control of users. If we expected to use the trie for other purposes, then I would suggest a raft of other improvements. Ask me if you are interested. Michael -- Michael Haggerty mhagger@alum.mit.edu