From: Al Viro <viro@zeniv.linux.org.uk>
To: chriscli@google.com
Cc: linux-sparse@vger.kernel.org, dan.carpenter@linaro.org,
ben.dooks@codethink.co.uk, rf@opensource.cirrus.com,
torvalds@linux-foundation.org, Eric Zhang <zxh@xh-zhang.com>
Subject: Re: [RFC PATCH] pre-process: add __VA_OPT__ support
Date: Tue, 31 Mar 2026 09:06:31 +0100 [thread overview]
Message-ID: <20260331080631.GA1328137@ZenIV> (raw)
In-Reply-To: <20260316065622.GA607739@ZenIV>
On Mon, Mar 16, 2026 at 06:56:22AM +0000, Al Viro wrote:
... and here's are the whitespace fixes and a bunch of optimizations; I
really wonder if we ought to simply switch to uint64_t for struct position
contents and go for several helpers for accessing individual parts.
Code generation with clang is reasonable, but gcc is bloody awful -
I'm sick and tired of fighting it all the time ;-/ Anyway, I've spent
way too much time in that rabbit hole. I've a few more sitting in the
local queue (tokenizer optimizations, mostly), but that's a separate
story and I'd rather leave that for later.
See git://git.kernel.org/pub/scm/linux/kernel/git/viro/sparse.git #whitespaces
(or individual patches in followups); it's on top of #va_opt.
Re whitespace: there are several places where gcc and clang handling of
that stuff differs. The origin of the headache is that historically
preprocessor had been a text filter sitting in front of compiler -
its output had been fed through tokenizer and quite a few things in the
original ANSI C bear the resulting scars. By C99 times it mostly switched
to working in terms of token stream, which simplified a lot of stuff.
Unfortunately, what it's working with is not quite a list of tokens -
it's tokens mixed with whitespace. By that point comments are replaced
with spaces and implementation is allowed to collapse any sequence of
whitespace other than newline into a single space character.
All operations on the stream are described in terms of replacing a
token or a sequence of tokens with a sequence of tokens; what happens
to possible whitespace is almost unspecified.
Some things are obvious and generally agreed upon:
* if we are replacing a sequence of tokens, all whitespace stuck
between them is removed. All whitespace before to the first or after the
last affected token in left as-is.
* all whitespace stuck between the tokens in the replacement
sequence is inserted along with those tokens.
What is *not* obvious is what should be done to leading or trailing
whitespace in the replacement sequence. Part of that is explicitly
covered (e.g. macro body has no leading or trailing whitespace), but in
some cases the things get iffy. Trivial example:
#define E
#define F E - E
#define S(X) #X
#define A(X) S([X])
char *s = A(F);
A(F) is a macro invocation; there's one argument - Ident(F). No whitespace
anywhere. What should these 4 tokens be replaced with? We start with the
replacement list of A - Ident(S),Punctuator('('),Punctuator('['),Ident(X),
Punctuator(']'),Punctuator(')'). The only parameter (X) occurs only once
and there's neither # nor ## operator in sight, so that occurrence gets
replaced with the token sequence resulting from full expansion of the
corresponding argument.
Ident(F) expands to Ident(E),whitespace,Punctuator('-'),whitespace,Ident(E),
which, after expanding both instances of E turns into
whitespace,Punctuator('-'),whitespace
Fair enough, there's only one token in that, so there's no inter-token
whitespace in sight. There is both leading and trailing whitespace, though.
One interpretation is that this one token is precisely what gets substituted,
yielding S([-]) with no whitespace anywhere. That's an invocation of
S with the sole argument being Punctuator('['),Punctuator('-),Punctuator(']'),
and the end result is "[-]". This is what clang does.
Another way to handle that is to have the whitespace preceding and following
Punctuator('-') in the expansion of F stuck to it when we are substituting
the damn thing, yielding S([ - ]) (same tokens as in the first variant, but
with whitespaces stuck before and after the minus). The end result in this
interpretation is "[ - ]". This is what gcc goes for.
Unfortunately, gcc rules are convoluted and when you add __VA_OPT__ into
the mix they get really nasty.
Example:
#define E
#define F E - E
#define S(X) #X
#define A(X,...) S([__VA_OPT__(X)])
char *s = A(F,_);
ends up with "[- ]" - the leading whitespace is stripped, the trailing one
remains stuck to the token sequence. Mixing that with ## leads to outright
bugs:
#define E
#define F1 E 1
#define A(X,...) 1##__VA_OPT__(X)
int m1 = A(F1,...);
expands into
int m1 = 11;
In other words, left concatenation works fine. However, adding the right-hand
analogue
#define F2 sizeof m E
#define B(X,...) __VA_OPT__(X)##1
int m2 = B(F2,_);
which should result in
int m2 = sizeof m1;
does *not* work with gcc - it ends up with
int m2 = sizeof m 1;
with predictable objections from the parser. Note that it's *not* a matter of
confusion re "do we expand the argument here?" - if that argument hadn't been
expanded, we would've ended up with
int m2 = sizeof F21;
and resulting errors would've been different. With clang everything works
as expected and I don't see any interpretation of the standard that would
allow what gcc is doing. The trigger for that bug is their treatment of the
trailing whitespace, AFAICS; leading whitespace is stripped here.
Note that the current mainline sparse doesn't match either gcc or clang
semantics - it loses whitespaces in some cases where they definitely should
be retained. That needs to be fixed; e.g.
#define E
#define S(X) #X
#define A(X) S(X)
char *s = A([1 E]);
ought to produce "[1 ]", not "[1]". Nothing subtle about that case...
IMO we are be better off with the clang approach; it is easy to describe
and it matches the behaviour of gcc unless you step into the areas where gcc
is rather random. We can always reconsider if/when gcc folks come up with
something that looks consistent...
Shortlog:
Al Viro (6):
nextchar(): get rid of special[]
simplify the inlined side of nextchar()
tokenize_stream(): don't bother with isspace()
TOKEN_DIRECTIVE: recognize directive-introducing # in tokenizer
saner collect_arg() code generation
try to get whitespaces right
Diffstat:
pre-process.c | 169 ++++++++++++++++++--------------
token.h | 1 +
tokenize.c | 91 +++++++++--------
validation/preprocessor/va_opt-concat.c | 24 +++++
validation/preprocessor/va_opt.c | 23 +++--
validation/preprocessor/va_opt2.c | 2 +-
6 files changed, 184 insertions(+), 126 deletions(-)
create mode 100644 validation/preprocessor/va_opt-concat.c
next prev parent reply other threads:[~2026-03-31 8:03 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <cover.1771930766.git.dan.carpenter@linaro.org>
2026-02-24 11:07 ` [PATCH] sparse: add support for __VA_OPT__ Dan Carpenter
2026-02-24 11:16 ` Ben Dooks
2026-02-24 11:56 ` Dan Carpenter
2026-02-24 12:42 ` Richard Fitzgerald
2026-02-24 13:15 ` Ben Dooks
2026-02-25 2:39 ` Chris Li
2026-02-25 3:36 ` Al Viro
2026-02-25 5:29 ` [RFC PATCH] pre-process: add __VA_OPT__ support Eric Zhang
2026-02-25 6:40 ` Al Viro
2026-02-25 7:27 ` Al Viro
2026-02-25 8:14 ` Eric Zhang
2026-02-25 22:18 ` Al Viro
2026-02-26 7:29 ` Al Viro
2026-03-16 6:56 ` Al Viro
2026-03-16 7:03 ` [PATCH 01/21] split copy() into "need to copy" and "can move in place" cases Al Viro
2026-03-16 7:03 ` [PATCH 02/21] expand and simplify the call of dup_token() in copy() Al Viro
2026-03-16 7:03 ` [PATCH 03/21] more dup_token() optimizations Al Viro
2026-03-16 7:03 ` [PATCH 04/21] parsing #define: saner handling of argument count, part 1 Al Viro
2026-03-16 7:03 ` [PATCH 05/21] simplify collect_arguments() and fix error handling there Al Viro
2026-03-16 7:04 ` [PATCH 06/21] try_arg(): don't use arglist for argument name lookups Al Viro
2026-03-16 7:04 ` [PATCH 07/21] make expand_has_...() responsible for expanding its argument Al Viro
2026-03-16 7:04 ` [PATCH 08/21] preparing to change argument number encoding for TOKEN_..._ARGUMENT Al Viro
2026-03-16 7:04 ` [PATCH 09/21] steal 2 bits from argnum for argument kind Al Viro
2026-03-16 7:04 ` [PATCH 10/21] on-demand argument expansion Al Viro
2026-03-16 7:04 ` [PATCH 11/21] kill create_arglist() Al Viro
2026-03-16 7:04 ` [PATCH 12/21] stop mangling arglist, get rid of TOKEN_ARG_COUNT Al Viro
2026-03-16 7:04 ` [PATCH 13/21] deal with ## on arguments separately Al Viro
2026-03-16 7:04 ` [PATCH 14/21] preparations for __VA_OPT__ support: reshuffle argument slot assignments Al Viro
2026-03-16 7:04 ` [PATCH 15/21] pre-process.c: split try_arg() Al Viro
2026-03-16 7:04 ` [PATCH 16/21] __VA_OPT__: parsing Al Viro
2026-03-16 7:04 ` [PATCH 17/21] expansion-time va_opt handling Al Viro
2026-03-16 7:04 ` [PATCH 18/21] merge(): saner handling of ->noexpand Al Viro
2026-03-16 7:04 ` [PATCH 19/21] simplify the calling conventions of collect_arguments() Al Viro
2026-03-16 7:04 ` [PATCH 20/21] make expand_one_symbol() inline Al Viro
2026-03-16 7:04 ` [PATCH 21/21] substitute(): convert switch() into cascade of ifs Al Viro
2026-03-16 16:42 ` [RFC PATCH] pre-process: add __VA_OPT__ support Linus Torvalds
2026-03-19 3:53 ` Al Viro
2026-03-19 4:07 ` Linus Torvalds
2026-03-19 5:34 ` Al Viro
2026-03-17 7:41 ` Chris Li
2026-03-18 6:35 ` Eric Zhang
2026-03-31 8:06 ` Al Viro [this message]
2026-03-31 8:07 ` [PATCH 1/6] nextchar(): get rid of special[] Al Viro
2026-03-31 8:07 ` [PATCH 2/6] simplify the inlined side of nextchar() Al Viro
2026-03-31 8:07 ` [PATCH 3/6] tokenize_stream(): don't bother with isspace() Al Viro
2026-03-31 8:07 ` [PATCH 4/6] TOKEN_DIRECTIVE: recognize directive-introducing # in tokenizer Al Viro
2026-03-31 8:07 ` [PATCH 5/6] saner collect_arg() code generation Al Viro
2026-03-31 8:07 ` [PATCH 6/6] try to get whitespaces right Al Viro
2026-04-01 10:39 ` [RFC PATCH] pre-process: add __VA_OPT__ support Al Viro
2026-04-01 16:18 ` Linus Torvalds
2026-04-01 19:52 ` Al Viro
2026-04-01 20:22 ` Al Viro
2026-02-25 7:05 ` [PATCH] sparse: add support for __VA_OPT__ Chris Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260331080631.GA1328137@ZenIV \
--to=viro@zeniv.linux.org.uk \
--cc=ben.dooks@codethink.co.uk \
--cc=chriscli@google.com \
--cc=dan.carpenter@linaro.org \
--cc=linux-sparse@vger.kernel.org \
--cc=rf@opensource.cirrus.com \
--cc=torvalds@linux-foundation.org \
--cc=zxh@xh-zhang.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox