Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.org>, git@vger.kernel.org
Subject: Re: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Date: Tue, 24 Mar 2015 16:42:17 +0100	[thread overview]
Message-ID: <551185D9.6050200@alum.mit.edu> (raw)
In-Reply-To: <xmqq7fudld61.fsf@gitster.dls.corp.google.com>

On 03/19/2015 07:22 AM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> * It allows leading whitespace.
> 
> This might be blessing in disguise.  Our friends on MacOS may be
> relying on that
> 
>     git cmd --abbrev="$(wc -c <foo)"
> 
> to work "as expected", even though their "wc" gives leading spaces,
> for example.

Yuck. If you'd like, I can make sure to continue allowing leading
whitespace everywhere that it is allowed now. Alternatively, we could
make the parsing tighter and see if anybody squeals. What do you think?

>> * It allows arbitrary trailing characters.
> 
> Which is why we have introduced strtoul_ui() and such.
> 
>> * It allows a leading sign character ('+' or '-'), even though the
>>   result is unsigned.
> 
> I do not particularly see it a bad thing to accept "--abbrev=+7".
> Using unsigned type to accept a width and parsing "--abbrev=-7" into
> a large positive integer _is_ a problem, and we may want to fix it,
> but I think that is still within the scope of the original "better
> strtol/strtoul", and I do not see it as a justification to introduce
> a set of functions with completely unfamiliar name.

It is easy to allow "--abbrev=+7"; I would just need to add NUM_PLUS to
those call sites. Should I do so?

It would even be possible to allow "--abbrev=-0", which is also a
non-negative integer. But it's more work and it seems pretty silly to me.

>> * If the string doesn't contain a number at all, it sets its "endptr"
>>   argument to point at the start of the string but doesn't set errno.
> 
> Why is that a problem?  A caller that cares is supposed to check
> endptr and the string it gave, no?  Now, if strtoul_ui() and friends
> do not do so, I again think that is still within the scope of the
> original "better strtol/strtoul".

The text you are quoting was my argument for why we need wrappers around
strtol() and strtoul(), not for how the wrappers should be named.

The "endptr" convention for detecting errors is fine in theory, but in
practice a large majority of callers didn't check for errors correctly.
This is partly because the "endptr" convention is so awkward.

The existing strtoul_ui() and strtol_i() did check endptr correctly, but
there were only int-sized variants of the functions, and they didn't
give the caller the chance to capture endptr if the caller wanted to
allow trailing characters.

Why did I rename the wrapper functions?

* The old names, I think, emphasize that they take the long-sized
results of strtou?l() and convert them to integer size, which I think is
an implementation detail.
* The new functions will also have long versions. The obvious way to
generalize the existing function names for long variants (srtoul_ul()
and strtol_l()) seem kindof redundant.
* I wanted to change the signature and behavior of the functions, so
knowledge of the existing functions wouldn't be super helpful in
understanding the new ones.

>> * If the value is larger than fits in an unsigned long, it returns the
>>   value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).
> 
> Ditto.
> 
>> * If the value is between -ULONG_MAX and 0, it returns the positive
>>   integer with the same bit pattern, without setting errno(!) (I can
>>   hardly think of a scenario where this would be useful.)
> 
> Ditto.
> 
>> * For base=0 (autodetect base), it allows a base specifier prefix "0x"
>>   or "0" and parses the number accordingly. For base=16 it also allows
>>   a "0x" prefix.
> 
> Why is it a problem to allow "git cmd --hexval=0x1234", even if "git
> cmd --hexval=1234" would suffice?

In some cases we would like to allow that flexibility; in some cases
not. But the strtol()/strtoul() functions *always* allow it.

>> When I looked around, I found scores of sites that call atoi(),
>> strtoul(), and strtol() carelessly. And it's understandable. Calling
>> any of these functions safely is so much work as to be completely
>> impractical in day-to-day code.
> 
> Yes, these burdens on the caller were exactly why strtoul_ui()
> etc. wanted to reduce---and it will be a welcome change to do
> necessary checks that are currently not done.
> 
>> Please see the docstrings in numparse.h in the first commit for
>> detailed API docs. Briefly, there are eight new functions:
>>
>>     parse_{l,ul,i,ui}(const char *s, unsigned int flags,
>>                       T *result, char **endptr);
>>     convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);
>>
>> The parse_*() functions are for parsing a number from the start of a
>> string; the convert_*() functions are for parsing a string that
>> consists of a single number.
> 
> I am not sure if I follow.  Why not unify them into one 4-function
> set, with optional endptr that could be NULL?

In the next version of this patch series I plan to include only four
functions, str_to_{i,ui,l,ul}(), named that way to make them more
reminiscent of the functions that they replace. They will take an entptr
argument, but if that argument is set to NULL, then the function will
error out if there are any trailing characters.

> While we are on the topic of improving number parsing, one thing
> that I found somewhat frustrating is that config.c layer knows the
> scaling suffix but option parsing layer does not.  These functions
> should offer an option (via their "flags") to say "I allow scaled
> numbers like 2k, 4M, etc.".

That's a good idea. But I think I will leave that for a later patch
series, if that is OK with you.

Thanks for the great feedback!
Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

next prev parent reply	other threads:[~2015-03-24 16:28 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-17 16:00 [PATCH 00/14] numparse module: systematically tighten up integer parsing Michael Haggerty
2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
2015-03-18 18:27   ` Eric Sunshine
2015-03-18 22:47     ` Michael Haggerty
2015-03-20  8:54       ` Eric Sunshine
2015-03-20 17:51   ` Junio C Hamano
2015-03-17 16:00 ` [PATCH 02/14] cacheinfo_callback(): use convert_ui() when handling "--cacheinfo" Michael Haggerty
2015-03-17 16:00 ` [PATCH 03/14] write_subdirectory(): use convert_ui() for parsing mode Michael Haggerty
2015-03-17 16:00 ` [PATCH 04/14] handle_revision_opt(): use skip_prefix() in many places Michael Haggerty
2015-03-17 16:00 ` [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>" Michael Haggerty
2015-03-19  6:34   ` Junio C Hamano
2015-03-17 16:00 ` [PATCH 06/14] strtoul_ui(), strtol_i(): remove functions Michael Haggerty
2015-03-17 16:00 ` [PATCH 07/14] handle_revision_opt(): use convert_ui() when handling "--abbrev=" Michael Haggerty
2015-03-17 16:00 ` [PATCH 08/14] builtin_diff(): detect errors when parsing --unified argument Michael Haggerty
2015-03-17 16:00 ` [PATCH 09/14] opt_arg(): val is always non-NULL Michael Haggerty
2015-03-17 16:00 ` [PATCH 10/14] opt_arg(): use convert_i() in implementation Michael Haggerty
2015-03-17 16:00 ` [PATCH 11/14] opt_arg(): report errors parsing option values Michael Haggerty
2015-03-17 16:00 ` [PATCH 12/14] opt_arg(): simplify pointer handling Michael Haggerty
2015-03-17 16:00 ` [PATCH 13/14] diff_opt_parse(): use convert_i() when handling "-l<num>" Michael Haggerty
2015-03-17 16:00 ` [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num> Michael Haggerty
2015-03-19  6:37   ` Junio C Hamano
2015-03-17 18:48 ` [PATCH 00/14] numparse module: systematically tighten up integer parsing Junio C Hamano
2015-03-17 19:46   ` Michael Haggerty
2015-03-19  6:31     ` Junio C Hamano
2015-03-17 23:05 ` Duy Nguyen
2015-03-18  9:47   ` Michael Haggerty
2015-03-18  9:58     ` Duy Nguyen
2015-03-18 10:03     ` Jeff King
2015-03-18 10:20       ` Michael Haggerty
2015-03-19  5:26 ` Jeff King
2015-03-19  6:41   ` Junio C Hamano
2015-03-19  7:32   ` Junio C Hamano
2015-03-24 16:06     ` Michael Haggerty
2015-03-24 16:49       ` René Scharfe
2015-03-25 21:14         ` Michael Haggerty
2015-03-25 21:59           ` Junio C Hamano
2015-03-24 15:05   ` Michael Haggerty
2015-03-19  6:22 ` Junio C Hamano
2015-03-24 15:42   ` Michael Haggerty [this message]
2015-03-24 15:58     ` Junio C Hamano
2015-03-24 16:09       ` Junio C Hamano
2015-03-24 17:39       ` Michael Haggerty
2015-03-24 18:08         ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=551185D9.6050200@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).