From mboxrd@z Thu Jan 1 00:00:00 1970 From: "George Spelvin" Subject: Re: [PATCH 07/11] object: try naive cuckoo hashing Date: 13 Aug 2011 15:12:06 -0400 Message-ID: <20110813191206.25129.qmail@science.horizon.com> References: <20110813102244.9033.qmail@science.horizon.com> Cc: git@vger.kernel.org To: gitster@pobox.com, linux@horizon.com X-From: git-owner@vger.kernel.org Sat Aug 13 21:12:17 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1QsJd1-0000AR-UE for gcvg-git-2@lo.gmane.org; Sat, 13 Aug 2011 21:12:16 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751932Ab1HMTML (ORCPT ); Sat, 13 Aug 2011 15:12:11 -0400 Received: from science.horizon.com ([71.41.210.146]:54166 "HELO science.horizon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1751529Ab1HMTMJ (ORCPT ); Sat, 13 Aug 2011 15:12:09 -0400 Received: (qmail 25130 invoked by uid 1000); 13 Aug 2011 15:12:06 -0400 In-Reply-To: <20110813102244.9033.qmail@science.horizon.com> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: I've been doing a lot of reading on Cuckoo hashing. Yes, the single-table variant is described and used. However, the insertion procedure is not the way you do it. Also, d-ary Cuckoo hashing (also called d-Cuckoo hashing) where you use more than 2 hash functions is also used. The insertion algorithm, however, is not really agreed on. One algorithm (mostly proposed for hardware) uses d separate tables. Every entry displaced from table i is displaced to table i+1 (mod d). http://infoscience.epfl.ch/record/164147/files/cuckoo_dir_hpca2011_camera_ready.pdf In the single-table case, and in general, however, a displaced entry has more than one possible new location. This leads to the question of how to choose. One proposal is to do a breadth-first search looking for a path to a free slot. It's provable that this will succeed with high probability before the exponential growth of the breadth of the search tree gets too bad. See "Space Efficient Hash Tables with Worst Case Constant Access Time" http://www.itu.dk/people/pagh/papers/d-cuckoo-jour.pdf Another suggested tehcnique is to just pick an alternative at random and proceed. This is recommended in e.g. "Efficient Hash Probes on Modern Processors" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.1189&rep=rep1&type=pdf Both of these lead to rather complex implementations. The random number generator is probably simpler than the breadth-first search, but either way there's a bunch of auxiliary code. Sticking with two hash functions, but using multi-entry buckets is definitely an attractive possibility.