From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from zeniv.linux.org.uk (zeniv.linux.org.uk [62.89.141.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3E5DD3C0630 for ; Tue, 31 Mar 2026 08:03:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=62.89.141.173 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774944185; cv=none; b=rSG7Ie808i4uViVzbVbq1cAmxrIPGNW9uiUp7Yjy4mxeQeuuQDOzGSNgT5gyDfHOAu8Mk3drvv8ktkt2rA19JtctiLq4386Ot8n/ZePQ8xdlO3dIWGRRbmHAXqjsNMvFSLGz/xGUf0EjpNgxfRXDlZpl9N6azC+tL5TTp4qFb4A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774944185; c=relaxed/simple; bh=Qwn2Qgy1c/1CHOz2Z0arcndzmP4W2wUnoZhebDQ2bxo=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=IKLglp2ER+EKM6lbd47kFshcs0+DZIDe2Ys0IOhqr88hu4sL28L2f3nkbLCgcZV106bsJw5o872szKk4A4v7xXj5OpxlHWFrrN3rhcjVjMsPsSZBrGcqD53hKIUZwBSyLMJLi5La8/EyEjTloHN2eo+nDwrg9N9qnOW3/6XQx/Q= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=zeniv.linux.org.uk; spf=none smtp.mailfrom=ftp.linux.org.uk; dkim=pass (2048-bit key) header.d=linux.org.uk header.i=@linux.org.uk header.b=Ai9szDPp; arc=none smtp.client-ip=62.89.141.173 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=zeniv.linux.org.uk Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=ftp.linux.org.uk Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linux.org.uk header.i=@linux.org.uk header.b="Ai9szDPp" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=linux.org.uk; s=zeniv-20220401; h=Sender:In-Reply-To:Content-Type: MIME-Version:References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=lZhuqiEKbQG/bQAqFXKDqQ8dqVS1yEK7shfBRAtGeKE=; b=Ai9szDPp+VNN2PFRelSmI2EwqF gIKG7QZuUuI0X4139TKHGC8otE7k8SbU4l4OmHdcJjCKJEtapWE8U/KOJtgve5+R/shQfyNmTK01p PT45imOuMtZMKIomnz/9Yf6EjJBNmzps5s/A0H4biqBRWt3Eaub66iZGY8zV6wU/i+rRPb/sGBJwk r7Rs8wmAodC3d9UczJ+OAqHq6kfTAc6uoUTfcznvJT+T/wBPTfP6VMR841ErpRJN2R9RR195ncDxJ EBFzsktWv0hq5TDi3uDtCr71pITLCPA5CUgj7hHEksWCz7wyY3i428TppYh9HAp2wTHfDh5IK9I84 m1NRUUZg==; Received: from viro by zeniv.linux.org.uk with local (Exim 4.99.1 #2 (Red Hat Linux)) id 1w7U7L-00000005mXl-0n0R; Tue, 31 Mar 2026 08:06:31 +0000 Date: Tue, 31 Mar 2026 09:06:31 +0100 From: Al Viro To: chriscli@google.com Cc: linux-sparse@vger.kernel.org, dan.carpenter@linaro.org, ben.dooks@codethink.co.uk, rf@opensource.cirrus.com, torvalds@linux-foundation.org, Eric Zhang Subject: Re: [RFC PATCH] pre-process: add __VA_OPT__ support Message-ID: <20260331080631.GA1328137@ZenIV> References: <20260225072731.GA3093958@ZenIV> <20260225081413.2480484-1-zxh@xh-zhang.com> <20260225221851.GE1762976@ZenIV> <20260226072945.GA4104757@ZenIV> <20260316065622.GA607739@ZenIV> Precedence: bulk X-Mailing-List: linux-sparse@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260316065622.GA607739@ZenIV> Sender: Al Viro On Mon, Mar 16, 2026 at 06:56:22AM +0000, Al Viro wrote: ... and here's are the whitespace fixes and a bunch of optimizations; I really wonder if we ought to simply switch to uint64_t for struct position contents and go for several helpers for accessing individual parts. Code generation with clang is reasonable, but gcc is bloody awful - I'm sick and tired of fighting it all the time ;-/ Anyway, I've spent way too much time in that rabbit hole. I've a few more sitting in the local queue (tokenizer optimizations, mostly), but that's a separate story and I'd rather leave that for later. See git://git.kernel.org/pub/scm/linux/kernel/git/viro/sparse.git #whitespaces (or individual patches in followups); it's on top of #va_opt. Re whitespace: there are several places where gcc and clang handling of that stuff differs. The origin of the headache is that historically preprocessor had been a text filter sitting in front of compiler - its output had been fed through tokenizer and quite a few things in the original ANSI C bear the resulting scars. By C99 times it mostly switched to working in terms of token stream, which simplified a lot of stuff. Unfortunately, what it's working with is not quite a list of tokens - it's tokens mixed with whitespace. By that point comments are replaced with spaces and implementation is allowed to collapse any sequence of whitespace other than newline into a single space character. All operations on the stream are described in terms of replacing a token or a sequence of tokens with a sequence of tokens; what happens to possible whitespace is almost unspecified. Some things are obvious and generally agreed upon: * if we are replacing a sequence of tokens, all whitespace stuck between them is removed. All whitespace before to the first or after the last affected token in left as-is. * all whitespace stuck between the tokens in the replacement sequence is inserted along with those tokens. What is *not* obvious is what should be done to leading or trailing whitespace in the replacement sequence. Part of that is explicitly covered (e.g. macro body has no leading or trailing whitespace), but in some cases the things get iffy. Trivial example: #define E #define F E - E #define S(X) #X #define A(X) S([X]) char *s = A(F); A(F) is a macro invocation; there's one argument - Ident(F). No whitespace anywhere. What should these 4 tokens be replaced with? We start with the replacement list of A - Ident(S),Punctuator('('),Punctuator('['),Ident(X), Punctuator(']'),Punctuator(')'). The only parameter (X) occurs only once and there's neither # nor ## operator in sight, so that occurrence gets replaced with the token sequence resulting from full expansion of the corresponding argument. Ident(F) expands to Ident(E),whitespace,Punctuator('-'),whitespace,Ident(E), which, after expanding both instances of E turns into whitespace,Punctuator('-'),whitespace Fair enough, there's only one token in that, so there's no inter-token whitespace in sight. There is both leading and trailing whitespace, though. One interpretation is that this one token is precisely what gets substituted, yielding S([-]) with no whitespace anywhere. That's an invocation of S with the sole argument being Punctuator('['),Punctuator('-),Punctuator(']'), and the end result is "[-]". This is what clang does. Another way to handle that is to have the whitespace preceding and following Punctuator('-') in the expansion of F stuck to it when we are substituting the damn thing, yielding S([ - ]) (same tokens as in the first variant, but with whitespaces stuck before and after the minus). The end result in this interpretation is "[ - ]". This is what gcc goes for. Unfortunately, gcc rules are convoluted and when you add __VA_OPT__ into the mix they get really nasty. Example: #define E #define F E - E #define S(X) #X #define A(X,...) S([__VA_OPT__(X)]) char *s = A(F,_); ends up with "[- ]" - the leading whitespace is stripped, the trailing one remains stuck to the token sequence. Mixing that with ## leads to outright bugs: #define E #define F1 E 1 #define A(X,...) 1##__VA_OPT__(X) int m1 = A(F1,...); expands into int m1 = 11; In other words, left concatenation works fine. However, adding the right-hand analogue #define F2 sizeof m E #define B(X,...) __VA_OPT__(X)##1 int m2 = B(F2,_); which should result in int m2 = sizeof m1; does *not* work with gcc - it ends up with int m2 = sizeof m 1; with predictable objections from the parser. Note that it's *not* a matter of confusion re "do we expand the argument here?" - if that argument hadn't been expanded, we would've ended up with int m2 = sizeof F21; and resulting errors would've been different. With clang everything works as expected and I don't see any interpretation of the standard that would allow what gcc is doing. The trigger for that bug is their treatment of the trailing whitespace, AFAICS; leading whitespace is stripped here. Note that the current mainline sparse doesn't match either gcc or clang semantics - it loses whitespaces in some cases where they definitely should be retained. That needs to be fixed; e.g. #define E #define S(X) #X #define A(X) S(X) char *s = A([1 E]); ought to produce "[1 ]", not "[1]". Nothing subtle about that case... IMO we are be better off with the clang approach; it is easy to describe and it matches the behaviour of gcc unless you step into the areas where gcc is rather random. We can always reconsider if/when gcc folks come up with something that looks consistent... Shortlog: Al Viro (6): nextchar(): get rid of special[] simplify the inlined side of nextchar() tokenize_stream(): don't bother with isspace() TOKEN_DIRECTIVE: recognize directive-introducing # in tokenizer saner collect_arg() code generation try to get whitespaces right Diffstat: pre-process.c | 169 ++++++++++++++++++-------------- token.h | 1 + tokenize.c | 91 +++++++++-------- validation/preprocessor/va_opt-concat.c | 24 +++++ validation/preprocessor/va_opt.c | 23 +++-- validation/preprocessor/va_opt2.c | 2 +- 6 files changed, 184 insertions(+), 126 deletions(-) create mode 100644 validation/preprocessor/va_opt-concat.c