From: Michael Haggerty <mhagger@alum.mit.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.org>,
git@vger.kernel.org, Michael Haggerty <mhagger@alum.mit.edu>
Subject: [PATCH 00/14] numparse module: systematically tighten up integer parsing
Date: Tue, 17 Mar 2015 17:00:02 +0100 [thread overview]
Message-ID: <1426608016-2978-1-git-send-email-mhagger@alum.mit.edu> (raw)
Sorry for the long email. Feel free to skip over the background info
and continue reading at "My solution" below.
This odyssey started with a typo:
$ git shortlog -snl v2.2.0..v2.3.0
14795 Junio C Hamano
1658 Jeff King
1400 Shawn O. Pearce
1109 Linus Torvalds
760 Jonathan Nieder
[...]
It took me a minute to realize why so many commits are listed. It
turns out that "-l", which I added by accident, requires an integer
argument, but it happily swallows "v2.2.0..v2.3.0" without error. This
leaves no range argument, so the command traverses the entire history
of HEAD.
The relevant code is
else if ((argcount = short_opt('l', av, &optarg))) {
options->rename_limit = strtoul(optarg, NULL, 10);
return argcount;
}
, which is broken in many ways. First of all, strtoul() is way too
permissive for simple tasks like this:
* It allows leading whitespace.
* It allows arbitrary trailing characters.
* It allows a leading sign character ('+' or '-'), even though the
result is unsigned.
* If the string doesn't contain a number at all, it sets its "endptr"
argument to point at the start of the string but doesn't set errno.
* If the value is larger than fits in an unsigned long, it returns the
value clamped to the range 0..ULONG_MAX (setting errno to ERANGE).
* If the value is between -ULONG_MAX and 0, it returns the positive
integer with the same bit pattern, without setting errno(!) (I can
hardly think of a scenario where this would be useful.)
* For base=0 (autodetect base), it allows a base specifier prefix "0x"
or "0" and parses the number accordingly. For base=16 it also allows
a "0x" prefix.
strtol() has similar foibles.
The caller compounds the permissivity of strtoul() with further sins:
* It does absolutely no error detection.
* It assigns the return value, which is an unsigned long, to an int
value. This could truncate the result, perhaps even resulting in
rename_limit being set to a negative value.
When I looked around, I found scores of sites that call atoi(),
strtoul(), and strtol() carelessly. And it's understandable. Calling
any of these functions safely is so much work as to be completely
impractical in day-to-day code.
git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that
try to make parsing integers a little bit easier. But they are only
used in a few places, they hardly restrict the pathological
flexibility of strtoul()/strtol(), strtoul_ui() doesn't implement
clamping consistently when converting from unsigned long to unsigned
int, and neither function can be used when the integer value *is*
followed by other characters.
My solution: the numparse module
So I hereby propose a new module, numparse, to make it easier to parse
integral values from strings in a convenient, safe, and flexible way.
The first commit introduces the new module, and subsequent commits
change a sample (a small fraction!) of call sites to use the new
functions. Consider it a proof-of-concept. If people are OK with this
approach, I will continue sending patches to fix other call sites. (I
already have another two dozen such patches in my repo).
Please see the docstrings in numparse.h in the first commit for
detailed API docs. Briefly, there are eight new functions:
parse_{l,ul,i,ui}(const char *s, unsigned int flags,
T *result, char **endptr);
convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result);
The parse_*() functions are for parsing a number from the start of a
string; the convert_*() functions are for parsing a string that
consists of a single number. The flags argument selects not only the
base of the number, but also which of strtol()/strtoul()'s many
features should be allowed. The *_i() and *.ui() functions are for
parsing int and unsigned int values; they are careful about how they
truncate the corresponding long values. The functions all return 0 on
success and a negative number on error.
Here are a few examples of how these functions can be used (taken from
the header of numparse.h):
* Convert hexadecimal string s into an unsigned int. Die if there are
any characters in s besides hexadecimal digits, or if the result
exceeds the range of an unsigned int:
if (convert_ui(s, 16, &result))
die("...");
* Read a base-ten long number from the front of a string, allowing
sign characters and setting endptr to point at any trailing
characters:
if (parse_l(s, 10 | NUM_SIGN | NUM_TRAILING, &result, &endptr))
die("...");
* Convert decimal string s into a signed int, but not allowing the
string to contain a '+' or '-' prefix (and thereby indirectly
ensuring that the result will be non-negative):
if (convert_i(s, 10, &result))
die("...");
* Convert s into a signed int, interpreting prefix "0x" to mean
hexadecimal and "0" to mean octal. If the value doesn't fit in an
unsigned int, set result to INT_MIN or INT_MAX.
if (convert_i(s, NUM_SLOPPY, &result))
die("...");
My main questions:
* Do people like the API? My main goal was to make these functions as
painless as possible to use correctly, because there are so many
call sites.
* Is it too gimmicky to encode the base together with other options in
`flags`? (I think it is worth it to avoid the need for another
parameter, which callers could easily put in the wrong order.)
* Am I making callers too strict? In many cases where a positive
integer is expected (e.g., "--abbrev=<num>"), I have been replacing
code like
result = strtoul(s, NULL, 10);
with
if (convert_i(s, 10, &result))
die("...");
Strictly speaking, this is backwards incompatible, because the
former allows things like
--abbrev="+10"
--abbrev=" 10"
--abbrev="10 bananas"
--abbrev=-18446744073709551606
(all equivalent to "--abbrev=10"), whereas the new code rejects all
of them. It would be easy to change the new code to allow the first
three, by adding flags NUM_PLUS, NUM_LEADING_WHITESPACE, or
NUM_TRAILING respectively. But my inclination is to be strict
(though I could easily be persuaded to permit the first one).
* Should I submit tiny patches, each rewriting a single call site, or
make changes in larger chunks (say, file by file)? For example, in
revision.c there is a function handle_revision_opt() that contains
about 10 call sites that should be rewritten, for 10 different
options that take integers. My gut feeling is that the rewrite of
"--min-age=<value>" should be done in a different patch than that of
"--min-parents=<value>", especially since the invocations might use
different levels of parsing strictness that might be topics of
discussion on the ML. On the other hand, I feel silly bombarding the
list with tons of tiny patches.
* I couldn't think of any places where today's sloppy parsing could
result in security vulnerabilities, but people should think about
this, too. I would be especially wary of sites that call strtoul()
and assign the result to an int, seemingly not considering that the
result could end up negative.
These patches apply to "master". They are also available on my GitHub
repo [1].
Michael
[1] https://github.com/mhagger/git.git, branch "numparse1"
Michael Haggerty (14):
numparse: new module for parsing integral numbers
cacheinfo_callback(): use convert_ui() when handling "--cacheinfo"
write_subdirectory(): use convert_ui() for parsing mode
handle_revision_opt(): use skip_prefix() in many places
handle_revision_opt(): use convert_i() when handling "-<digit>"
strtoul_ui(), strtol_i(): remove functions
handle_revision_opt(): use convert_ui() when handling "--abbrev="
builtin_diff(): detect errors when parsing --unified argument
opt_arg(): val is always non-NULL
opt_arg(): use convert_i() in implementation
opt_arg(): report errors parsing option values
opt_arg(): simplify pointer handling
diff_opt_parse(): use convert_i() when handling "-l<num>"
diff_opt_parse(): use convert_i() when handling --abbrev=<num>
Makefile | 1 +
builtin/update-index.c | 3 +-
contrib/convert-objects/convert-objects.c | 3 +-
diff.c | 55 ++++----
git-compat-util.h | 26 ----
numparse.c | 180 ++++++++++++++++++++++++++
numparse.h | 207 ++++++++++++++++++++++++++++++
revision.c | 64 ++++-----
8 files changed, 447 insertions(+), 92 deletions(-)
create mode 100644 numparse.c
create mode 100644 numparse.h
--
2.1.4
next reply other threads:[~2015-03-17 16:00 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-17 16:00 Michael Haggerty [this message]
2015-03-17 16:00 ` [PATCH 01/14] numparse: new module for parsing integral numbers Michael Haggerty
2015-03-18 18:27 ` Eric Sunshine
2015-03-18 22:47 ` Michael Haggerty
2015-03-20 8:54 ` Eric Sunshine
2015-03-20 17:51 ` Junio C Hamano
2015-03-17 16:00 ` [PATCH 02/14] cacheinfo_callback(): use convert_ui() when handling "--cacheinfo" Michael Haggerty
2015-03-17 16:00 ` [PATCH 03/14] write_subdirectory(): use convert_ui() for parsing mode Michael Haggerty
2015-03-17 16:00 ` [PATCH 04/14] handle_revision_opt(): use skip_prefix() in many places Michael Haggerty
2015-03-17 16:00 ` [PATCH 05/14] handle_revision_opt(): use convert_i() when handling "-<digit>" Michael Haggerty
2015-03-19 6:34 ` Junio C Hamano
2015-03-17 16:00 ` [PATCH 06/14] strtoul_ui(), strtol_i(): remove functions Michael Haggerty
2015-03-17 16:00 ` [PATCH 07/14] handle_revision_opt(): use convert_ui() when handling "--abbrev=" Michael Haggerty
2015-03-17 16:00 ` [PATCH 08/14] builtin_diff(): detect errors when parsing --unified argument Michael Haggerty
2015-03-17 16:00 ` [PATCH 09/14] opt_arg(): val is always non-NULL Michael Haggerty
2015-03-17 16:00 ` [PATCH 10/14] opt_arg(): use convert_i() in implementation Michael Haggerty
2015-03-17 16:00 ` [PATCH 11/14] opt_arg(): report errors parsing option values Michael Haggerty
2015-03-17 16:00 ` [PATCH 12/14] opt_arg(): simplify pointer handling Michael Haggerty
2015-03-17 16:00 ` [PATCH 13/14] diff_opt_parse(): use convert_i() when handling "-l<num>" Michael Haggerty
2015-03-17 16:00 ` [PATCH 14/14] diff_opt_parse(): use convert_i() when handling --abbrev=<num> Michael Haggerty
2015-03-19 6:37 ` Junio C Hamano
2015-03-17 18:48 ` [PATCH 00/14] numparse module: systematically tighten up integer parsing Junio C Hamano
2015-03-17 19:46 ` Michael Haggerty
2015-03-19 6:31 ` Junio C Hamano
2015-03-17 23:05 ` Duy Nguyen
2015-03-18 9:47 ` Michael Haggerty
2015-03-18 9:58 ` Duy Nguyen
2015-03-18 10:03 ` Jeff King
2015-03-18 10:20 ` Michael Haggerty
2015-03-19 5:26 ` Jeff King
2015-03-19 6:41 ` Junio C Hamano
2015-03-19 7:32 ` Junio C Hamano
2015-03-24 16:06 ` Michael Haggerty
2015-03-24 16:49 ` René Scharfe
2015-03-25 21:14 ` Michael Haggerty
2015-03-25 21:59 ` Junio C Hamano
2015-03-24 15:05 ` Michael Haggerty
2015-03-19 6:22 ` Junio C Hamano
2015-03-24 15:42 ` Michael Haggerty
2015-03-24 15:58 ` Junio C Hamano
2015-03-24 16:09 ` Junio C Hamano
2015-03-24 17:39 ` Michael Haggerty
2015-03-24 18:08 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1426608016-2978-1-git-send-email-mhagger@alum.mit.edu \
--to=mhagger@alum.mit.edu \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).