From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Haggerty Subject: Bloom filters for have/want negotiation Date: Fri, 11 Sep 2015 23:13:25 +0200 Message-ID: <55F343F5.6010903@alum.mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Cc: Wilhelm Bierbaum , Jeff King To: git discussion list X-From: git-owner@vger.kernel.org Fri Sep 11 23:13:54 2015 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1ZaVdh-0004jF-JN for gcvg-git-2@plane.gmane.org; Fri, 11 Sep 2015 23:13:45 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753922AbbIKVNl (ORCPT ); Fri, 11 Sep 2015 17:13:41 -0400 Received: from alum-mailsec-scanner-2.mit.edu ([18.7.68.13]:62632 "EHLO alum-mailsec-scanner-2.mit.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752890AbbIKVNl (ORCPT ); Fri, 11 Sep 2015 17:13:41 -0400 X-AuditID: 1207440d-f79136d00000402c-64-55f343f70cb9 Received: from outgoing-alum.mit.edu (OUTGOING-ALUM.MIT.EDU [18.7.68.33]) by alum-mailsec-scanner-2.mit.edu (Symantec Messaging Gateway) with SMTP id 65.84.16428.7F343F55; Fri, 11 Sep 2015 17:13:27 -0400 (EDT) Received: from [192.168.69.130] (p4FC96DD6.dip0.t-ipconnect.de [79.201.109.214]) (authenticated bits=0) (User authenticated as mhagger@ALUM.MIT.EDU) by outgoing-alum.mit.edu (8.13.8/8.12.4) with ESMTP id t8BLDPrv015150 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Fri, 11 Sep 2015 17:13:26 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.8.0 X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFvrFIsWRmVeSWpSXmKPExsUixO6iqPvd+XOowcVF4hZTn81it+i60s1k 8aOlh9mB2WPnrLvsHs969zB6fN4kF8AcxW2TlFhSFpyZnqdvl8Cd8a7rF2vBA6WKUzuOMjYw /pXuYuTkkBAwkTi84wg7hC0mceHeerYuRi4OIYHLjBIX+h8xQTgXmCTajs5nAaniFdCWWLtu JhuIzSKgKrGrfREziM0moCuxqKeZCcQWFQiSWLH8BSNEvaDEyZlPwHpFgGp2PbsK1sss4CJx 9eIGsM3CAgYSK9//ZYeIq0v8mXeJGcKWl2jeOpt5AiPfLCSjZiEpm4WkbAEj8ypGucSc0lzd 3MTMnOLUZN3i5MS8vNQiXSO93MwSvdSU0k2MkHDk3cH4f53MIUYBDkYlHl4L1U+hQqyJZcWV uYcYJTmYlER5TRw+hwrxJeWnVGYkFmfEF5XmpBYfYpTgYFYS4S0WBsrxpiRWVqUW5cOkpDlY lMR51Zao+wkJpCeWpGanphakFsFkZTg4lCR4jzkBNQoWpaanVqRl5pQgpJk4OEGGc0mJFKfm paQWJZaWZMSDYi++GBh9ICkeoL2rQdp5iwsSc4GiEK2nGBWlxHl3gyQEQBIZpXlwY2FJ5hWj ONCXwrxfQap4gAkKrvsV0GAmoMHviz+BDC5JREhJNTDWicpf1oudGXLMtNLjpcKzp3L1h+ZN vmhQWzH72oXGz2Fyjyre8h0R0HvXcUlcNvqTnN78zb5ystGimw91K665tPZ0+rm/i3/wNUo1 r2vpu25zu6MqZU1YvPCGUlb3IzECDnUOrmErM04s5L9//qTzd+apN1y2Nryf0p70 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: I have been thinking about Wilhelm Bierbaum's talk at the last GitMerge conference [1] in which he describes a scheme for using Bloom filters to make the initial reference advertisement less expensive. In his scheme (if I understand correctly) the client starts off the conversation by passing the server a Bloom filter that indicates what (refname, SHA-1) pairs the client already has. This makes it unnecessary for the server to advertise those references, thereby reducing the cost of incremental fetches dramatically if the server has very many references. Because Bloom filters have false positives, this scheme is not 100% reliable. Therefore I don't think we would want Git to depend on it. But it got me thinking about how the client could use a Bloom filter in a later stage of the negotiation, when telling the server what objects it already has, while preserving 100% reliability. The idea is to use connectivity information to correct mistakes caused by reliance on the Bloom-filter: 1. The server advertises the references that it has in the way that it is currently done. 2. The client advertises the objects that it has (or some subset of them; see below) via a Bloom filter. 3. The server sends the client the packfile that results from assuming that the Bloom filter is giving correct answers. (This might mean that too few objects are sent to the client.) 4. The client does a connectivity check on the packfile. If any objects are missing, it asks the server for them via a reliable (non-Bloom-filter-based) request. How would one construct the Bloom filter for step 2? (Remember that a properly-configured Bloom filter requires about 5 bits of space per value stored for each factor of 0.1 in the false-positive rate. So, for example, to store 5000 values with a 1% false-positive rate, the Bloom filter would need to be approximately 5000 * 10 bits = 6.2 kB in size.) Here are some possible schemes: * Record *all* objects in the Bloom filter. The Git repo has approximately 200k objects, so, supposing that we could live with a 10% false-positive rate (see below), the Bloom filter would need to be about 125 kB. * Record all commit objects in the Bloom filter. For the Git repo that is about 40k commits, so for a 10% error rate the Bloom filter would have to be about 25 kB. * Record some subset of commits; for example, all unique branch and tag tips, the peeled tags, plus some sparse subsets of commits deeper in the history. The ls-remote for the Git repo lists 1730 unique SHA-1s, so, supposing we choose 10x that number with a 1% error rate, the Bloom filter would be about 20 kB. * Record only the branch and tag tips and peeled tags. Please note that for situations where the client has fetched from the server before and still has the remote-tracking references from that fetch, this scheme might work surprisingly well. For the Git repository, with a 1% error rate, this would be about 2 kB. For the first two schemes, we could tolerate a pretty high error rate because the server could perform additional consistency checks to reduce the error rate. For example, if the Bloom filter reports that the client has commit X, but that the client does *not* have a parent of X, then the server can assume that the check of X was a false positive and discard it. Such consistency checks would not be possible with the third or fourth schemes, so I have chosen lower false-positive rates for those schemes. Additional points: * The client can decide what to include in the Bloom filter. For example, if it has done a recent fetch from the server, it might want to send only the remote-tracking branch tips. But if it has never fetched from this server before, it might want to send all commits. * A Bloom filter could be computed at repack time rather than at each fetch. On fetch, the precomputed Bloom filters could be loaded, any loose objects added to it, and the result sent to the server. I don't have a gut feeling about the cost of this phase of the negotiation, so I don't know whether this would be a net savings, let alone one that is worth the added complexity. But I wanted to document the idea in case somebody thinks it has promise. (I have no plans to pursue it.) Michael [1] http://git-merge.com/videos/scaling-git-at-twitter-wilhelm-bierbaum.html -- Michael Haggerty mhagger@alum.mit.edu