From: Lukas Fleischer
Subject: Re: [PATCH] receive-pack: avoid sending duplicate "have" lines
Date: Sat, 07 Nov 2015 10:04:31 +0100
Message-ID: <20151107090431.576.3746@typhoon.lan>
References: <1446851773-32390-1-git-send-email-lfleischer@lfos.de>
To: Junio C Hamano
Cc: git@vger.kernel.org

On Sat, 07 Nov 2015 at 00:38:00, Junio C Hamano wrote:
> [...]
> I suspect that many of these that turn into ".have"s point the same
> object as those sit at the tip of our refs (e.g. tags in alternates
> we borrow from, which are the folks of the same project that copy
> the same tags from the same upstream). I wonder if it easy to filter
> them out? After all, if we say object X sits at refs/tags/v1.0
> there is no point showing ".have" for that same object X.
> [...]

It is certainly doable, but we would need some additional
infrastructure. The easiest implementation (i.e. the one closest to
what we currently do) would be to add a second SHA1 array that collects
all the hashes we have already advertised. We could then pass a struct
holding both SHA1 arrays as the second parameter to for_each_ref() and,
in show_ref_cb(), insert each object identifier into one of the two
arrays, depending on whether we immediately advertise it as a ref or
decide to convert it into a "have" line. To transmit the "have" lines,
instead of calling sha1_array_for_each_unique(), we would then call a
function that prints the unique entries of the first array that do not
appear in the second array. Something like this (fully untested):

-- 8< --
void sha1_array_for_each_diff(struct sha1_array *array1,
			      struct sha1_array *array2,
			      for_each_sha1_fn fn,
			      void *data)
{
	int i, j;

	if (!array1->sorted)
		sha1_array_sort(array1);
	if (!array2->sorted)
		sha1_array_sort(array2);

	for (i = j = 0; i < array1->nr; i++) {
		/* skip duplicates within array1 itself */
		if (i > 0 && !hashcmp(array1->sha1[i], array1->sha1[i - 1]))
			continue;
		/* advance in array2 until we reach or pass array1->sha1[i] */
		while (j < array2->nr &&
		       hashcmp(array1->sha1[i], array2->sha1[j]) > 0)
			j++;
		/* skip entries that also appear in array2 */
		if (j < array2->nr && !hashcmp(array1->sha1[i], array2->sha1[j]))
			continue;
		fn(array1->sha1[i], data);
	}
}
-- >8 --

If we decide to implement all that, however, I wonder whether we should
use a more sophisticated data structure instead. Currently, if there
are n refs outside the current namespace that all point to the same
object, we append n entries to the SHA1 array and sort them afterwards,
which requires O(n) space.
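Just to make the caller side of the two-array idea above a bit more
concrete, here is a very rough sketch of how it could be wired up. All
struct, field and function names below are invented for illustration
(in particular, advertise_as_ref() merely stands in for whatever check
receive-pack actually performs), and it is just as untested as the
function above:

-- 8< --
#include "cache.h"
#include "refs.h"
#include "sha1-array.h"

/* Illustration only: one array per "kind" of advertisement. */
struct advertise_state {
	struct sha1_array advertised;	/* objects shown as real refs */
	struct sha1_array extra;	/* candidates for "have" lines */
};

static int advertise_as_ref(const char *refname);	/* placeholder */

static int collect_ref_cb(const char *refname, const struct object_id *oid,
			  int flag, void *cb_data)
{
	struct advertise_state *state = cb_data;

	if (advertise_as_ref(refname)) {
		/* print the ref advertisement here, then record the object */
		sha1_array_append(&state->advertised, oid->hash);
	} else {
		/* otherwise remember it as a "have" candidate */
		sha1_array_append(&state->extra, oid->hash);
	}
	return 0;
}

static void show_one_have(const unsigned char sha1[20], void *data)
{
	/* emit a "have" advertisement for this object */
}

static void advertise_refs(void)
{
	struct advertise_state state = { SHA1_ARRAY_INIT, SHA1_ARRAY_INIT };

	for_each_ref(collect_ref_cb, &state);

	/* only print "have"s for objects we did not already advertise */
	sha1_array_for_each_diff(&state.extra, &state.advertised,
				 show_one_have, NULL);

	sha1_array_clear(&state.advertised);
	sha1_array_clear(&state.extra);
}
-- >8 --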
If we maintained a set of hashes instead of an array in the first
place, we could reduce space usage significantly by storing only what
is needed (O(1) in the scenario I mentioned). We could also add markers
to the object identifiers we have already advertised instead of
maintaining two separate lists. Unfortunately, in order to keep the
running time at O(n log n), we would probably need something like a
self-balancing binary search tree. Alternatively, a hash map based
approach could be used. Is something like this already used anywhere in
the Git source tree, so that we could borrow the required data
structures and functions?
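One candidate that comes to mind is the generic hashmap.{c,h} that is
already in the tree. A minimal sketch of the marker idea on top of it
could look like the following; the entry layout and the helpers are
invented, I am writing the hashmap calls from memory, and none of this
is tested:

-- 8< --
#include "cache.h"
#include "hashmap.h"

/*
 * Illustration only: one entry per object, carrying a marker that
 * records whether the object was already advertised as a real ref.
 */
struct advertised_object {
	struct hashmap_entry ent;	/* must be the first member */
	unsigned char sha1[20];
	unsigned advertised : 1;
};

static int advertised_object_cmp(const void *entry, const void *entry_or_key,
				 const void *keydata)
{
	const struct advertised_object *a = entry;
	const struct advertised_object *b = entry_or_key;

	return hashcmp(a->sha1, keydata ? keydata : b->sha1);
}

static unsigned int sha1_hash(const unsigned char *sha1)
{
	unsigned int hash;

	/* SHA-1s are already well distributed; the first bytes suffice */
	memcpy(&hash, sha1, sizeof(hash));
	return hash;
}

/* Return the entry for sha1, inserting an unmarked one if necessary. */
static struct advertised_object *lookup_or_insert(struct hashmap *map,
						  const unsigned char *sha1)
{
	struct advertised_object key, *found;

	hashmap_entry_init(&key, sha1_hash(sha1));
	hashcpy(key.sha1, sha1);

	found = hashmap_get(map, &key, NULL);
	if (!found) {
		found = xmalloc(sizeof(*found));
		hashmap_entry_init(found, sha1_hash(sha1));
		hashcpy(found->sha1, sha1);
		found->advertised = 0;
		hashmap_add(map, found);
	}
	return found;
}
-- >8 --

Advertising a ref would then boil down to setting
lookup_or_insert(&map, sha1)->advertised = 1, and the "have" pass would
iterate over the map and skip every entry whose marker is set. Whether
that is actually nicer than the two-array approach above, I am not
sure.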