git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: "brian m. carlson" <sandals@crustytoothpaste.net>, git@vger.kernel.org
Cc: Andreas Schwab <schwab@linux-m68k.org>,
	"Kyle J. McKay" <mackyle@gmail.com>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Junio C Hamano <gitster@pobox.com>, Johannes Sixt <j6t@kdbg.org>,
	David Kastrup <dak@gnu.org>, James Denholm <nod.helm@gmail.com>
Subject: Re: [PATCH v2 00/10] Use a structure for object IDs.
Date: Wed, 11 Mar 2015 17:08:58 +0100	[thread overview]
Message-ID: <5500689A.5090101@alum.mit.edu> (raw)
In-Reply-To: <1425770645-628957-1-git-send-email-sandals@crustytoothpaste.net>

On 03/08/2015 12:23 AM, brian m. carlson wrote:
> This is a patch series to convert some of the relevant uses of unsigned
> char [20] to struct object_id.
> 
> The goal of this series to improve type-checking in the codebase and to
> make it easier to move to a different hash function if the project
> decides to do that.  This series does not convert all of the codebase,
> but only parts.  I've dropped some of the patches from earlier (which no
> longer apply) and added others.
> 
> Certain parts of the code have to be converted before others to keep the
> patch sizes small, maintainable, and bisectable, so functions and
> structures that are used across the codebase (e.g. struct object) will
> be converted later.  Conversion has been done in a somewhat haphazard
> manner by converting modules with leaf functions and less commonly used
> structs first.
> 
> Since part of the goal is to ease a move to a different hash function if
> the project decides to do that, I've adopted Michael Haggerty's
> suggestion of using variables named "oid", naming the structure member
> "sha1", and using GIT_SHA1_RAWSZ and GIT_SHA1_HEXSZ as the constants.
> 
> I've been holding on to this series for a long time in hopes of
> converting more of the code before submitting, but realized that this
> deprived others of the ability to use the new structure and required me
> to rebase extremely frequently.
> [...]

I've added CC to several people who commented on v1.

I think this is a really interesting project and I hope that it works out.

In my opinion, the biggest risk (aside from the sheer amount of work
required) is the issue that was brought up on the mailing list when you
submitted v1 [1]: Converting an arbitrary (unsigned char *) to a (struct
object_id *) is not allowed, because the alignment requirements of the
latter might be stricter than those of the former. This means that if,
for example, we are reading some data from disk or from the network, and
we expect the 20 bytes starting with byte number 17 to be a SHA-1 in
binary format, we used to be able to do

    unsigned char *sha1 = buf + 17;

and use sha1 like any other SHA-1, without the need for any copying. But
we can't do the analogous

    struct object_id *oid = (struct object_id *)(buf + 17);

because the alignment is not necessarily correct. So in a pure "struct
object_id" world, I think we would be forced to change such code to

    struct object_id oid;
    hashcpy(oid.sha1, buf + 17);

This uses additional memory and introduces copying overhead. Also, the
lifetime of oid might exceed the scope of a function, so oid might have
to be allocated on the heap and later freed. This adds more
computational overhead, more memory overhead, and more programming
effort to get it all right.

So as much as I like the idea of wrapping SHA-1s in objects, I think it
would be a good idea to first make sure that we can all agree on a plan
for dealing with situations like this. I can think of the following
possibilities:

1. Maybe code that needs to handle SHA-1s with screwy alignment does not
exist, or maybe it does the copying anyway. I haven't actually checked.

2. Maybe somebody can prove that

       struct object_id *oid = (struct object_id *)(buf + 17);

   somehow *is* allowed by the C standard, or at least that it is
sufficiently portable for our purposes.

3. We accept the additional runtime costs and development effort for the
extra copies. To accept this approach, we would need some idea of which
areas of the code will be affected, and some estimate of how much
additional memory it would cost.

4. We continue to support working with SHA-1s declared to be (unsigned
char *) in some performance-critical code, even as we migrate most other
code to using SHA-1s embedded within a (struct object_id). This will
cost some duplication of code. To accept this approach, we would need an
idea of *how much* code duplication would be needed. E.g., how many
functions will need both (unsigned char *) versions and (struct
object_id *) versions?

5. We only make the change opportunistically. Whenever we find a
function that needs to work with non-struct-aligned SHA-1s, we leave the
function as-is rather than converting it or creating a second version
that works with object_id objects. This approach would leave the
codebase somewhat schizophrenic.

I'm not trying to dissuade you from this project, but I think that for
your project to have a chance of success, we need an answer to this
question. So let's confront it now rather than after you have invested
lots more time and/or gotten patches merged.

Michael

[1] http://thread.gmane.org/gmane.comp.version-control.git/248054

-- 
Michael Haggerty
mhagger@alum.mit.edu

  parent reply	other threads:[~2015-03-11 16:09 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-07 23:23 [PATCH v2 00/10] Use a structure for object IDs brian m. carlson
2015-03-07 23:23 ` [PATCH v2 01/10] Define " brian m. carlson
     [not found]   ` <CEA07500-9F47-4B24-AD5D-1423A601A4DD@gmail.com>
2015-03-11 22:08     ` brian m. carlson
2015-03-12  0:26       ` Junio C Hamano
2015-03-12  9:34         ` brian m. carlson
2015-03-12 10:28         ` Michael Haggerty
2015-03-12 10:46           ` brian m. carlson
2015-03-12 11:16             ` Duy Nguyen
2015-03-12 18:24               ` Junio C Hamano
2015-03-12 18:44                 ` Junio C Hamano
2015-03-13  0:58                 ` Duy Nguyen
2015-03-13  6:03                   ` Junio C Hamano
2015-03-14 11:49                     ` Duy Nguyen
2015-03-14 22:19                       ` Junio C Hamano
2015-03-15  0:17                         ` Duy Nguyen
2015-03-15  2:32                           ` Junio C Hamano
2015-03-07 23:23 ` [PATCH v2 02/10] Define utility functions " brian m. carlson
2015-03-08  9:57   ` Duy Nguyen
2015-03-08 14:48     ` brian m. carlson
2015-03-11 12:44   ` Michael Haggerty
2015-03-07 23:23 ` [PATCH v2 03/10] bisect.c: convert leaf functions to use struct object_id brian m. carlson
2015-03-07 23:23 ` [PATCH v2 04/10] archive.c: convert " brian m. carlson
2015-03-11 14:20   ` Michael Haggerty
2015-03-11 22:12     ` brian m. carlson
2015-03-07 23:24 ` [PATCH v2 05/10] zip: use GIT_SHA1_HEXSZ for trailers brian m. carlson
2015-03-07 23:24 ` [PATCH v2 06/10] bulk-checkin.c: convert to use struct object_id brian m. carlson
2015-03-07 23:24 ` [PATCH v2 07/10] diff: convert struct combine_diff_path to object_id brian m. carlson
2015-03-07 23:24 ` [PATCH v2 08/10] commit: convert parts to struct object_id brian m. carlson
2015-03-11 14:46   ` Michael Haggerty
2015-03-07 23:24 ` [PATCH v2 09/10] patch-id: convert to use " brian m. carlson
2015-03-07 23:24 ` [PATCH v2 10/10] apply: convert threeway_stage to object_id brian m. carlson
2015-03-08  7:43 ` [PATCH v2 00/10] Use a structure for object IDs Junio C Hamano
2015-03-11  2:38 ` Kyle J. McKay
2015-03-11 16:08 ` Michael Haggerty [this message]
2015-03-11 20:35   ` Junio C Hamano
2015-03-13 22:45     ` brian m. carlson
  -- strict thread matches above, loose matches on Subject: below --
2015-03-13 23:39 brian m. carlson
2015-03-14 14:16 ` brian m. carlson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5500689A.5090101@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=dak@gnu.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=j6t@kdbg.org \
    --cc=mackyle@gmail.com \
    --cc=nod.helm@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=schwab@linux-m68k.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).