Subject: Re: [BUG] add_again() off-by-one error in custom format
From: René Scharfe
To: Jeff King
Cc: Junio C Hamano, Michael Giuffrida, git@vger.kernel.org, SZEDER Gábor
Date: Thu, 15 Jun 2017 13:33:34 +0200
References: <99d19e5a-9f79-9c1e-3a23-7b2437b04ce9@web.de> <20170615055654.efvsouhr3leszz3i@sigill.intra.peff.net>
In-Reply-To: <20170615055654.efvsouhr3leszz3i@sigill.intra.peff.net>

On 15.06.2017 at 07:56, Jeff King wrote:
> One interesting thing is that the cost of finding short hashes very much
> depends on your loose object setup. I timed:
>
>   git log --format=%H >/dev/null
>
> versus
>
>   git log --format=%h >/dev/null
>
> on git.git. It went from about 400ms to about 800ms. But then I noticed
> I had a lot of loose object directories, and ran "git gc --prune=now".
> Afterwards, my timings were more like 380ms and 460ms.
>
> The difference is that in the "before" case, we actually opened each
> directory and ran getdents(). But after gc, the directories are gone
> totally and open() fails. We also have to do a linear walk through the
> objects in each directory, since the contents are sorted.

Do you mean "unsorted"?

> So I wonder if it is worth trying to optimize the short-sha1 computation
> in the first place. Double-%h aside, that would make _everything_
> faster, including --oneline.

Right.

> I'm not really sure how, though, short of caching the directory
> contents. That opens up questions of whether and when to invalidate the
> cache. If the cache were _just_ about short hashes, it might be OK to
> just assume that it remains valid through the length of the program (so
> worst case, a simultaneous write might mean that we generate a sha1
> which just became ambiguous, but that's generally going to be racy
> anyway).
>
> The other downside of course is that we'd spend RAM on it. We could
> bound the size of the cache, I suppose.

You mean like an in-memory pack index for loose objects? In order to
avoid the readdir call in sha1_name.c::find_short_object_filename()?
We'd only need to keep the hashes of found objects. An oid_array
would be quite compact.

Non-racy writes inside a process should not be ignored (write, read
later) -- e.g. checkout does something like that.

Can we trust object directory time stamps for cache invalidation?

René
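
To make the in-memory index idea a bit more concrete, here is a minimal
sketch of what I have in mind. It is not git code: struct loose_cache and
the loose_cache_*() functions are made-up names, and it stores 40-character
hex strings for simplicity where an oid_array would hold binary object IDs.
Names are appended as directory entries are read, sorted lazily, and a
binary search then counts how many cached objects share a prefix.
Invalidation is left out on purpose, since that question is still open.

```c
#include <stdlib.h>
#include <string.h>

#define HEX_LEN 40 /* length of a SHA-1 object name in hex */

/* Hypothetical cache: a lazily sorted array of full hex object names. */
struct loose_cache {
	char (*hex)[HEX_LEN + 1];
	size_t nr, alloc;
	int sorted;
};

static int cmp_hex(const void *a, const void *b)
{
	return memcmp(a, b, HEX_LEN);
}

/* Append one object name, e.g. per directory entry. Returns 0 on success. */
static int loose_cache_add(struct loose_cache *c, const char *hex)
{
	if (strlen(hex) != HEX_LEN)
		return -1;
	if (c->nr == c->alloc) {
		size_t n = c->alloc ? 2 * c->alloc : 64;
		void *p = realloc(c->hex, n * sizeof(*c->hex));
		if (!p)
			return -1;
		c->hex = p;
		c->alloc = n;
	}
	memcpy(c->hex[c->nr++], hex, HEX_LEN + 1);
	c->sorted = 0;
	return 0;
}

/* Count cached names that start with the first len characters of prefix. */
static size_t loose_cache_count_prefix(struct loose_cache *c,
				       const char *prefix, size_t len)
{
	size_t lo = 0, hi = c->nr, n = 0;

	if (len > HEX_LEN)
		len = HEX_LEN;
	if (!c->sorted) {
		if (c->nr)
			qsort(c->hex, c->nr, sizeof(*c->hex), cmp_hex);
		c->sorted = 1;
	}
	/* Binary search for the first entry that is not below the prefix. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (memcmp(c->hex[mid], prefix, len) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* Matching entries are adjacent because the array is sorted. */
	while (lo + n < c->nr && !memcmp(c->hex[lo + n], prefix, len))
		n++;
	return n;
}

/* Shortest length >= min_len at which hex matches at most one cached name. */
static size_t loose_cache_unique_len(struct loose_cache *c,
				     const char *hex, size_t min_len)
{
	size_t len = min_len;

	while (len < HEX_LEN && loose_cache_count_prefix(c, hex, len) > 1)
		len++;
	return len;
}
```

The sketch does not deduplicate, so a name added twice counts twice. Packed
objects would still have to be consulted separately, and an object written
by the same process after the cache was filled would need an explicit
loose_cache_add() call to be seen.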