From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yw1-f174.google.com (mail-yw1-f174.google.com [209.85.128.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D79722003D5 for ; Fri, 8 Nov 2024 15:17:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.174 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1731079049; cv=none; b=nJ2IAaHA2WzP5rc1NwD/nxcypOIqDhb2jImTP/66IPZtfadtW7YkkN3zHIiRV1r+dcpyVXfhfHdw+Wn594gt+sEi/lADw+6AYHVgWQ6MM2+zHte6/6Vohj5ET6EhCy9XxhbrBwKk3KpfS7h/Gq/CeZfJ6MwaMbAwvgJfCewWBfE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1731079049; c=relaxed/simple; bh=vacnqkbTat5BQcIe0T/r/clRlj3WydqiRcu7WVjG+2M=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=tvRCAOQi7BmW2r/2LojHXtdlnXPX7xWVrD/0dvC1+V3avjmEgj7n7oOVOEbb6QREHxKVbnL32zHhChENFwbwKakeMPLTIMPZLAxWeRYL2Hm2BDk1Y246NBvM/Wohal9jT35AfxykaYmTbAR5CsBlxajf7XMXYRubgt2NECsqYQY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=f/iWgg4x; arc=none smtp.client-ip=209.85.128.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="f/iWgg4x" Received: by mail-yw1-f174.google.com with SMTP id 00721157ae682-6ea1407e978so21409877b3.1 for ; Fri, 08 Nov 2024 07:17:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1731079047; x=1731683847; darn=vger.kernel.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=EzgmnxDrw5vgJ+foGwIXnUXL30SoUhbQZe3ddKmckHo=; b=f/iWgg4xSdT9DF+ZhT/n6fDVYqmtT0hSPO/LWihHTZ3mNviYMfkfpVKlrbXlggrH/h 0KaMywh/DjufFpuPQYB+6u3N3YGp6IsWF9hC+buFvzRbPeFTlAavRrO+xSJdxgP0xsJf VfMXnZDaB8wjsTFWsYrFTc6FranptlmJ2GxV8+Mzm7+OFjDVc7+4sNEj5NNuaTgnmkQP 2K7mCkzJY8a5AjAe91FrINy4plNDWiGc3NbiZXKHRyIQqzTUEGSAi6WOvCzymdl8gw3f fu4jUxerAehR1M6gV5UjY6YM7qZq2JWoc0Q3uAu0RuRaR9ale+FgCddDM8rnCy7NdyeK Fo1A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731079047; x=1731683847; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=EzgmnxDrw5vgJ+foGwIXnUXL30SoUhbQZe3ddKmckHo=; b=gwB12CAmIsC65WKG3cx243mhRYEVGF3gOVyzjWWulqazNwMFmQliICWV2fQG84wQtC miV+YmfmA6ZDFHVkuh/Q+T4jrCZJbacp4mhxTGF7llhZUb/AiZguQtS7iqTPWR/xYKE0 CfejcHWWhEBYmpURUVecqHMuDd4P5z5y1Z48yQZ4Pqerw+728GfCsFHH5My+hrBHnzfi 2ZU4r4N+va+b4Q5ArNF8eSFWEtozisssfmM7vKyfT7yOtREkNT+Ez5tqNa+7Hn2lDr7l FMXb2fvMgT9hRZW2qsPJ8cya+VTrBBP5Sd9dM1+3/3afhaoWoHijL3hMfgKQTqO9J4sn r6kQ== X-Forwarded-Encrypted: i=1; AJvYcCV0IFQqyoKRy7m+jK04CLeCK6W9ysAb1GByYDXt3iDmlGrFxTwSsbadOTx6c/lFQyvOmRQ=@vger.kernel.org X-Gm-Message-State: AOJu0YwjwYEGy3fMTpnb9PJFEdfF6Lz3Z83IVrVGTgCb7VOEtat96s2H dq8cH0rLOOwv78GfTFUXkw+1Gz4YIawq3LN/Si3nj+B2dAZUuy/H X-Google-Smtp-Source: AGHT+IEimG9/s/Xe+UU6bjiRzd8eZAsW95XVEKFL0MWOzazR7Z5VVBqUbArbNwFuKNVMRF1ibKMplg== X-Received: by 2002:a05:690c:c96:b0:6e2:50a:f436 with SMTP id 00721157ae682-6eaddfc5255mr37331797b3.36.1731079046686; Fri, 08 Nov 2024 07:17:26 -0800 (PST) Received: from ?IPV6:2600:1700:60ba:9810:cdf0:7186:dd7d:aa91? ([2600:1700:60ba:9810:cdf0:7186:dd7d:aa91]) by smtp.gmail.com with ESMTPSA id 00721157ae682-6eace8d77e6sm7418567b3.17.2024.11.08.07.17.24 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 08 Nov 2024 07:17:25 -0800 (PST) Message-ID: Date: Fri, 8 Nov 2024 10:17:24 -0500 Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 0/6] PATH WALK I: The path-walk API To: Junio C Hamano , Jeff King Cc: Taylor Blau , Derrick Stolee via GitGitGadget , git@vger.kernel.org, johannes.schindelin@gmx.de, ps@pks.im, johncai86@gmail.com, newren@gmail.com, christian.couder@gmail.com, kristofferhaugsbakk@fastmail.com, jonathantanmy@google.com References: <2d2940ef-0b26-4060-90b6-9b6969f23754@gmail.com> <20241104172533.GA2985568@coredump.intra.peff.net> Content-Language: en-US From: Derrick Stolee In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit On 11/4/24 7:11 PM, Junio C Hamano wrote: > Jeff King writes: > >> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote: >>> I disagree that all environments will prefer the --full-name-hash. I'm >>> currently repeating the performance tests right now, and I've added one. >>> The issues are: >>> >>> 1. The --full-name-hash approach sometimes leads to a larger pack when >>> using "git push" on the client, especially when the name-hash is >>> already effective for compressing across paths. >> >> That's interesting. I wonder which cases get worse, and if a larger >> window size might help. I.e., presumably we are pushing the candidates >> further away in the sorted delta list. I think the cases that make things get worse with --full-name-hash are: 1. The presence of renames, partitioning objects that used to fit into the same bucket (in the case of directory renames). 2. Some standard kinds of files may appear several times across the tree but do not change very often and are similar across path. 3. Common patterns across similar file types, such as similar includes in .c and .h files or other kinds of boilerplate in different languages. A larger window size will always expand the range of possible deltas, at a cost to the time taken to compute the deltas. My first experiment in these repositories was to increase the window size to 250 (from the default 10). This caused a very slow repack, but the repositories shrunk. For example, the big Javascript monorepo that repacked to ~100 GB with default settings would repack to ~30 GB with --window=250. This was an indicator that delta compression would work if we can find the right pairs to use for deltas. The point of the two strategies (--full-name-hash and --path-walk) is about putting objects close to each other in better ways than the name hash sort. >>> 2. A depth 1 shallow clone cannot use previous versions of a path, so >>> those situations will want to use the normal name hash. This can be >>> accomplished simply by disabling the --full-name-hash option when >>> the --shallow option is present; a more detailed version could be >>> used to check for a large depth before disabling it. This case also >>> disables bitmaps, so that isn't something to worry about. >> >> I'm not sure why a larger hash would be worse in a shallow clone. As you >> note, with only one version of each path the name-similarity heuristic >> is not likely to buy you much. But I'd have thought that would be true >> for the existing name hash as well as a longer one. Maybe this is the >> "over-emphasizing" case. I'm confused by your wording of "larger hash" because the hash size is the exact same: 32 bits. It's just that the --full-name-hash option has fewer collisions by abandoning the hope of locality. In a depth 1 shallow clone, there are no repeated paths, so any hash collisions are true collisions instead of good candidates for deltas. The full name hash is essentially random, so the delta compression algorithm basically says: 1. Sort by type. 2. Within each type, sort the objects randomly. With that sort, the delta compression scan is less effective than the standard name hash. > I too am curious to hear Derrick explain the above points and what > was learned from the performance tests. The original hash was > designed to place files that are renamed across directories closer > to each other in the list sorted by the name hash, so a/Makefile and > b/Makefile would likely be treated as delta-base candidates while > foo/bar.c and bar/foo.c are treated as unrelated things. A push > of a handful of commits that rename paths would likely place the > rename source of older commits and rename destination of newer > commits into the same delta chain, even with a smaller delta window. > In such a history, uniformly-distributed-without-regard-to-renames > hash is likely to make them into two distinct delta chains, leading > to less optimal delta-base selection. Yes. This is the downside of the --full-name-hash compared to the standard name hash. When repacking an entire repository, the effect of these renames is typically not important in the long run as it's basically a single break in the delta chain. The downside comes in when doing a small fetch or push where the rename has more impact. The --path-walk approach does not suffer from this problem because it has a second pass that sorts by the name hash and looks for better deltas than the ones that already exist. Thus, it gets the best of both worlds. The performance impact of the two passes of the --path-walk approach is interesting, as you'd typically expect this to always be slower. However: 1. The delta compression within each batch only compares the objects within that batch. We do not compare these objects to unrelated objects, which can be expensive and wasteful. This also means that small batches may even be smaller than the delta window, reducing the number of comparisons. 2. In the second pass, the delta calculation can short-circuit if the computed delta would be larger than the current-best delta. Thus, the good deltas from the first pass make the second pass faster. > A whole-repository packing, or a large push or fetch, of the same > history with renamed files are affected a lot less by such negative > effects of full-name hash. When generating a pack with more commits > than the "--window", use of the original hash would mean blobs from > paths that share similar names (e.g., "Makefile"s everywhere in the > directory hierarchy) are placed close to each other, but full-name > hash will likely group the blobs from exactly the same path and > nothing else together, and the resulting delta-chain for identical > (and not similar) paths would be sufficiently long. A long delta > chain has to be broken into multiple chains _anyway_ due to finite > "--depth" setting, so placing blobs from each path into its own > (initial) delta chain, completely ignoring renamed paths, would > likely to give us long enough (initial) delta chain to be split at > the depth limit. > > It would lead to a good delta-base selection with smaller window > size quite efficiently with full-name hash.> > I think a full-name hash forces a single-commit pack of a wide tree > to give up on deltified blobs, but with the original hash, at least > similar and common files (e.g. Makefile and COPYING) would sit close > together in the delta queue and can be deltified with each other, > which may be where the inefficiency comes from when full-name hash > is used. Yes, this is a good summary of why this works for the data efficiency in long histories. Your earlier observations are why the full-name hash has demonstrated issues with smaller time scales. These numbers are carefully detailed in the performance tests in the refreshed series [1]. The series also has a way to disable the full-name hash when serving a shallow clone for this reason. [1] https://lore.kernel.org/git/pull.1823.git.1730775907.gitgitgadget@gmail.com/ Thanks, -Stolee