Date: Mon, 10 Apr 2023 16:14:14 -0400
From: Jeff King
To: ZheNing Hu
Cc: Taylor Blau, Junio C Hamano, Git List, johncai86@gmail.com
Subject: Re: [Question] Can git cat-file have a type filtering option?
Message-ID: <20230410201414.GC104097@coredump.intra.peff.net>

On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:

> > Perhaps slightly so, since there is naturally going to be some
> > duplicated effort spawning processes, loading any shared libraries,
> > initializing the repository and reading its configuration, etc.
> >
> > But I'd wager that these are all a negligible cost when compared to the
> > time we'll have to spend reading, inflating, and printing out all of the
> > objects in your repository.
>
> What you said makes sense. I implemented the --type-filter option for
> git cat-file and compared the performance of outputting all blobs in the
> git repository with and without the type filter. I found that the
> difference was not significant.
>
>   time git cat-file --batch-all-objects --batch-check="%(objectname) %(objecttype)" |
>     awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
>   17.10s user 0.27s system 102% cpu 16.987 total
>
>   time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
>   16.74s user 0.19s system 95% cpu 17.655 total
>
> At first, I thought the processes that provide all the blob oids (via
> git rev-list or git cat-file --batch-all-objects --batch-check) might
> waste CPU, IO, and memory, because they need to read a large number of
> objects which are then read again by git cat-file --batch. However, it
> seems that this is not actually the performance bottleneck.

Yeah, I think most of your time there is spent on the --batch command
itself, which is just pushing a lot of bytes through.

You might also try "--unordered". The default ordering for
--batch-all-objects is sha1 order, which has pretty bad locality
characteristics for delta caching. Using --unordered goes in pack
order, which should be optimal. E.g., in git.git, running:

  time \
  git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  perl -lne 'print $1 if /^blob (.*)/' |
  git cat-file --batch >/dev/null

takes:

  real    0m29.961s
  user    0m29.128s
  sys     0m1.461s

Adding "--unordered" to the initial cat-file gives:

  real    0m1.970s
  user    0m2.170s
  sys     0m0.126s
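(Spelled out, that just means adding the flag to the first command of
the same pipeline, i.e. something like:

  time \
  git cat-file --batch-all-objects --unordered \
    --batch-check='%(objecttype) %(objectname)' |
  perl -lne 'print $1 if /^blob (.*)/' |
  git cat-file --batch >/dev/null

with everything downstream unchanged.)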
So reducing the size of the actual --batch printing may make the
relative cost of using multiple processes much higher (I didn't apply
your --type-filter patches to test myself).

In general, I do think having a processing pipeline like this is OK, as
it's pretty flexible. But especially for smaller queries (even ones
that don't ask for the whole object contents), the per-object lookup
costs can start to dominate (especially in a repository that hasn't
been recently packed). Right now, even your "--batch --type-filter"
example is probably making at least two lookups per object, because we
don't have a way to open a "handle" to an object to check its type, and
then extract the contents conditionally. And of course with multiple
processes, we're naturally doing a separate lookup in each one.

So a nice thing about being able to do the filtering in one process is
that we could _eventually_ do it all with one object lookup. But I'd
probably wait on adding something like --type-filter until we have an
internal single-lookup API, and then we could time it to see how much
speedup we can get.

-Peff