From: "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "D. Ben Knoble" <ben.knoble@gmail.com>,
Eric Sunshine <sunshine@sunshineco.com>,
Michael Montalbo <mmontalbo@gmail.com>
Subject: [PATCH 2/6] t: extract chainlint's parser into shared module
Date: Thu, 04 Jun 2026 07:45:54 +0000
Message-ID: <a1069efa8fafe17b88f701168e7fe7c4ac663304.1780559158.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.2135.git.1780559158.gitgitgadget@gmail.com>
From: Michael Montalbo <mmontalbo@gmail.com>
Move the Lexer, ShellParser, and ScriptParser packages from
chainlint.pl into t/lib-shell-parser.pl so that other tools can reuse
them. In the shared module, ScriptParser's check_test() is a no-op;
callers subclass ScriptParser and override it.

chainlint.pl keeps TestParser (&&-chain detection) and gains
ChainlintParser, a ScriptParser subclass whose check_test() runs
TestParser and formats the results. The shared module is loaded
via do() for portability with minimal Perl installations.

A subsequent commit introduces lint-style.pl, which needs the same
shell parser to tokenize test scripts properly. Sharing the parser
avoids reimplementing heredoc handling, $(...) nesting, pipe
tracking, quoting, and test body extraction.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
---
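
Illustration only, not part of the commit: a minimal sketch of the
override pattern the message describes, in which the base class drives
the traversal and calls a no-op check_test() hook that a subclass
replaces. MiniScriptParser, GrepFinder, and run() are hypothetical
stand-ins, not names from this series. The real ScriptParser finds
test_expect_success/test_expect_failure calls in parse_cmd() and passes
tokens to the hook; the stand-in simply passes plain strings.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ScriptParser: the base class owns the
# traversal and calls check_test() as a hook that does nothing here.
package MiniScriptParser;

sub new {
	my ($class) = @_;
	return bless { output => [] }, $class;
}

sub check_test {
	# no-op; subclasses override to implement specific checks
}

# Stand-in for parse_cmd(): hand each (title, body) pair to the hook.
sub run {
	my ($self, @tests) = @_;
	$self->check_test(@$_) for @tests;
	return $self;
}

# Hypothetical checker: records titles of tests whose body uses grep.
package GrepFinder;

our @ISA = ('MiniScriptParser');

sub check_test {
	my ($self, $title, $body) = @_;
	push(@{$self->{output}}, $title) if $body =~ /\bgrep\b/;
}

package main;

my @tests = (
	['setup', 'git init repo'],
	['check log', 'git log >out && grep commit out'],
);
my $base = MiniScriptParser->new->run(@tests);
my $finder = GrepFinder->new->run(@tests);
my @base = @{$base->{output}};
my @found = @{$finder->{output}};
print scalar(@base), " from base, found: @found\n";
```

With this arrangement the base class reports nothing on its own, so a
new checker only has to supply check_test().
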
t/chainlint.pl | 521 ++----------------------------------------
t/lib-shell-parser.pl | 517 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 530 insertions(+), 508 deletions(-)
create mode 100644 t/lib-shell-parser.pl
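
Illustration only, not part of the commit: a small sketch of the do()
loading idiom used in chainlint.pl below. As far as perlfunc documents
it, do FILE searches @INC unless the path starts with "/", "./", or
"../", and "." has not been in @INC by default since Perl 5.26, which
is presumably why the patch prefixes "./" to relative paths. do FILE
also returns the value of the last statement in the file, so the
"do ... or die" form needs the file to end with a true value. The file
name lib-demo.pl and the Demo package are hypothetical, and the sketch
assumes POSIX-style paths.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a throwaway library file; its name and contents are made up.
my $dir = tempdir(CLEANUP => 1);
my $lib = "$dir/lib-demo.pl";
open(my $fh, '>', $lib) or die "cannot write $lib: $!\n";
print $fh "package Demo;\nsub answer { return 42 }\n1;\n";
close($fh) or die "cannot close $lib: $!\n";

# Same guard as in the patch: a path that is not absolute gets a
# leading "./" so that do() reads it directly instead of searching @INC.
$lib = "./$lib" unless $lib =~ m{^/};

# $@ is set when the file fails to compile, $! when it cannot be read.
do $lib or die "failed to load $lib: $@$!\n";

my $answer = Demo::answer();
print "loaded, answer is $answer\n";
```
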
diff --git a/t/chainlint.pl b/t/chainlint.pl
index f0598e3934..49b7cc6cb8 100755
--- a/t/chainlint.pl
+++ b/t/chainlint.pl
@@ -23,458 +23,10 @@ my $jobs = -1;
my $show_stats;
my $emit_all;
-# Lexer tokenizes POSIX shell scripts. It is roughly modeled after section 2.3
-# "Token Recognition" of POSIX chapter 2 "Shell Command Language". Although
-# similar to lexical analyzers for other languages, this one differs in a few
-# substantial ways due to quirks of the shell command language.
-#
-# For instance, in many languages, newline is just whitespace like space or
-# TAB, but in shell a newline is a command separator, thus a distinct lexical
-# token. A newline is significant and returned as a distinct token even at the
-# end of a shell comment.
-#
-# In other languages, `1+2` would typically be scanned as three tokens
-# (`1`, `+`, and `2`), but in shell it is a single token. However, the similar
-# `1 + 2`, which embeds whitepace, is scanned as three token in shell, as well.
-# In shell, several characters with special meaning lose that meaning when not
-# surrounded by whitespace. For instance, the negation operator `!` is special
-# when standing alone surrounded by whitespace; whereas in `foo!uucp` it is
-# just a plain character in the longer token "foo!uucp". In many other
-# languages, `"string"/foo:'string'` might be scanned as five tokens ("string",
-# `/`, `foo`, `:`, and 'string'), but in shell, it is just a single token.
-#
-# The lexical analyzer for the shell command language is also somewhat unusual
-# in that it recursively invokes the parser to handle the body of `$(...)`
-# expressions which can contain arbitrary shell code. Such expressions may be
-# encountered both inside and outside of double-quoted strings.
-#
-# The lexical analyzer is responsible for consuming shell here-doc bodies which
-# extend from the line following a `<<TAG` operator until a line consisting
-# solely of `TAG`. Here-doc consumption begins when a newline is encountered.
-# It is legal for multiple here-doc `<<TAG` operators to be present on a single
-# line, in which case their bodies must be present one following the next, and
-# are consumed in the (left-to-right) order the `<<TAG` operators appear on the
-# line. A special complication is that the bodies of all here-docs must be
-# consumed when the newline is encountered even if the parse context depth has
-# changed. For instance, in `cat <<A && x=$(cat <<B &&\n`, bodies of here-docs
-# "A" and "B" must be consumed even though "A" was introduced outside the
-# recursive parse context in which "B" was introduced and in which the newline
-# is encountered.
-package Lexer;
-
-sub new {
- my ($class, $parser, $s) = @_;
- bless {
- parser => $parser,
- buff => $s,
- lineno => 1,
- heretags => []
- } => $class;
-}
-
-sub scan_heredoc_tag {
- my $self = shift @_;
- ${$self->{buff}} =~ /\G(-?)/gc;
- my $indented = $1;
- my $token = $self->scan_token();
- return "<<$indented" unless $token;
- my $tag = $token->[0];
- $tag =~ s/['"\\]//g;
- $$token[0] = $indented ? "\t$tag" : "$tag";
- push(@{$self->{heretags}}, $token);
- return "<<$indented$tag";
-}
-
-sub scan_op {
- my ($self, $c) = @_;
- my $b = $self->{buff};
- return $c unless $$b =~ /\G(.)/sgc;
- my $cc = $c . $1;
- return scan_heredoc_tag($self) if $cc eq '<<';
- return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
- pos($$b)--;
- return $c;
-}
-
-sub scan_sqstring {
- my $self = shift @_;
- ${$self->{buff}} =~ /\G([^']*'|.*\z)/sgc;
- my $s = $1;
- $self->{lineno} += () = $s =~ /\n/sg;
- return "'" . $s;
-}
-
-sub scan_dqstring {
- my $self = shift @_;
- my $b = $self->{buff};
- my $s = '"';
- while (1) {
- # slurp up non-special characters
- $s .= $1 if $$b =~ /\G([^"\$\\]+)/gc;
- # handle special characters
- last unless $$b =~ /\G(.)/sgc;
- my $c = $1;
- $s .= '"', last if $c eq '"';
- $s .= '$' . $self->scan_dollar(), next if $c eq '$';
- if ($c eq '\\') {
- $s .= '\\', last unless $$b =~ /\G(.)/sgc;
- $c = $1;
- $self->{lineno}++, next if $c eq "\n"; # line splice
- # backslash escapes only $, `, ", \ in dq-string
- $s .= '\\' unless $c =~ /^[\$`"\\]$/;
- $s .= $c;
- next;
- }
- die("internal error scanning dq-string '$c'\n");
- }
- $self->{lineno} += () = $s =~ /\n/sg;
- return $s;
-}
-
-sub scan_balanced {
- my ($self, $c1, $c2) = @_;
- my $b = $self->{buff};
- my $depth = 1;
- my $s = $c1;
- while ($$b =~ /\G([^\Q$c1$c2\E]*(?:[\Q$c1$c2\E]|\z))/gc) {
- $s .= $1;
- $depth++, next if $s =~ /\Q$c1\E$/;
- $depth--;
- last if $depth == 0;
- }
- $self->{lineno} += () = $s =~ /\n/sg;
- return $s;
-}
-
-sub scan_subst {
- my $self = shift @_;
- my @tokens = $self->{parser}->parse(qr/^\)$/);
- $self->{parser}->next_token(); # closing ")"
- return @tokens;
-}
-
-sub scan_dollar {
- my $self = shift @_;
- my $b = $self->{buff};
- return $self->scan_balanced('(', ')') if $$b =~ /\G\((?=\()/gc; # $((...))
- return '(' . join(' ', map {$_->[0]} $self->scan_subst()) . ')' if $$b =~ /\G\(/gc; # $(...)
- return $self->scan_balanced('{', '}') if $$b =~ /\G\{/gc; # ${...}
- return $1 if $$b =~ /\G(\w+)/gc; # $var
- return $1 if $$b =~ /\G([@*#?$!0-9-])/gc; # $*, $1, $$, etc.
- return '';
-}
-
-sub swallow_heredocs {
- my $self = shift @_;
- my $b = $self->{buff};
- my $tags = $self->{heretags};
- while (my $tag = shift @$tags) {
- my $start = pos($$b);
- my $indent = $$tag[0] =~ s/^\t// ? '\\s*' : '';
- $$b =~ /(?:\G|\n)$indent\Q$$tag[0]\E(?:\n|\z)/gc;
- if (pos($$b) > $start) {
- my $body = substr($$b, $start, pos($$b) - $start);
- $self->{parser}->{heredocs}->{$$tag[0]} = {
- content => substr($body, 0, length($body) - length($&)),
- start_line => $self->{lineno},
- };
- $self->{lineno} += () = $body =~ /\n/sg;
- next;
- }
- push(@{$self->{parser}->{problems}}, ['HEREDOC', $tag]);
- $$b =~ /(?:\G|\n).*\z/gc; # consume rest of input
- my $body = substr($$b, $start, pos($$b) - $start);
- $self->{lineno} += () = $body =~ /\n/sg;
- last;
- }
-}
-
-sub scan_token {
- my $self = shift @_;
- my $b = $self->{buff};
- my $token = '';
- my ($start, $startln);
-RESTART:
- $startln = $self->{lineno};
- $$b =~ /\G[ \t]+/gc; # skip whitespace (but not newline)
- $start = pos($$b) || 0;
- $self->{lineno}++, return ["\n", $start, pos($$b), $startln, $startln] if $$b =~ /\G#[^\n]*(?:\n|\z)/gc; # comment
- while (1) {
- # slurp up non-special characters
- $token .= $1 if $$b =~ /\G([^\\;&|<>(){}'"\$\s]+)/gc;
- # handle special characters
- last unless $$b =~ /\G(.)/sgc;
- my $c = $1;
- pos($$b)--, last if $c =~ /^[ \t]$/; # whitespace ends token
- pos($$b)--, last if length($token) && $c =~ /^[;&|<>(){}\n]$/;
- $token .= $self->scan_sqstring(), next if $c eq "'";
- $token .= $self->scan_dqstring(), next if $c eq '"';
- $token .= $c . $self->scan_dollar(), next if $c eq '$';
- $self->{lineno}++, $self->swallow_heredocs(), $token = $c, last if $c eq "\n";
- $token = $self->scan_op($c), last if $c =~ /^[;&|<>]$/;
- $token = $c, last if $c =~ /^[(){}]$/;
- if ($c eq '\\') {
- $token .= '\\', last unless $$b =~ /\G(.)/sgc;
- $c = $1;
- $self->{lineno}++, next if $c eq "\n" && length($token); # line splice
- $self->{lineno}++, goto RESTART if $c eq "\n"; # line splice
- $token .= '\\' . $c;
- next;
- }
- die("internal error scanning character '$c'\n");
- }
- return length($token) ? [$token, $start, pos($$b), $startln, $self->{lineno}] : undef;
-}
-
-# ShellParser parses POSIX shell scripts (with minor extensions for Bash). It
-# is a recursive descent parser very roughly modeled after section 2.10 "Shell
-# Grammar" of POSIX chapter 2 "Shell Command Language".
-package ShellParser;
-
-sub new {
- my ($class, $s) = @_;
- my $self = bless {
- buff => [],
- stop => [],
- output => [],
- heredocs => {},
- insubshell => 0,
- } => $class;
- $self->{lexer} = Lexer->new($self, $s);
- return $self;
-}
-
-sub next_token {
- my $self = shift @_;
- return pop(@{$self->{buff}}) if @{$self->{buff}};
- return $self->{lexer}->scan_token();
-}
-
-sub untoken {
- my $self = shift @_;
- push(@{$self->{buff}}, @_);
-}
-
-sub peek {
- my $self = shift @_;
- my $token = $self->next_token();
- return undef unless defined($token);
- $self->untoken($token);
- return $token;
-}
-
-sub stop_at {
- my ($self, $token) = @_;
- return 1 unless defined($token);
- my $stop = ${$self->{stop}}[-1] if @{$self->{stop}};
- return defined($stop) && $token->[0] =~ $stop;
-}
-
-sub expect {
- my ($self, $expect) = @_;
- my $token = $self->next_token();
- return $token if defined($token) && $token->[0] eq $expect;
- push(@{$self->{output}}, "?!ERR?! expected '$expect' but found '" . (defined($token) ? $token->[0] : "<end-of-input>") . "'\n");
- $self->untoken($token) if defined($token);
- return ();
-}
-
-sub optional_newlines {
- my $self = shift @_;
- my @tokens;
- while (my $token = $self->peek()) {
- last unless $token->[0] eq "\n";
- push(@tokens, $self->next_token());
- }
- return @tokens;
-}
-
-sub parse_group {
- my $self = shift @_;
- return ($self->parse(qr/^}$/),
- $self->expect('}'));
-}
-
-sub parse_subshell {
- my $self = shift @_;
- $self->{insubshell}++;
- my @tokens = ($self->parse(qr/^\)$/),
- $self->expect(')'));
- $self->{insubshell}--;
- return @tokens;
-}
-
-sub parse_case_pattern {
- my $self = shift @_;
- my @tokens;
- while (defined(my $token = $self->next_token())) {
- push(@tokens, $token);
- last if $token->[0] eq ')';
- }
- return @tokens;
-}
-
-sub parse_case {
- my $self = shift @_;
- my @tokens;
- push(@tokens,
- $self->next_token(), # subject
- $self->optional_newlines(),
- $self->expect('in'),
- $self->optional_newlines());
- while (1) {
- my $token = $self->peek();
- last unless defined($token) && $token->[0] ne 'esac';
- push(@tokens,
- $self->parse_case_pattern(),
- $self->optional_newlines(),
- $self->parse(qr/^(?:;;|esac)$/)); # item body
- $token = $self->peek();
- last unless defined($token) && $token->[0] ne 'esac';
- push(@tokens,
- $self->expect(';;'),
- $self->optional_newlines());
- }
- push(@tokens, $self->expect('esac'));
- return @tokens;
-}
-
-sub parse_for {
- my $self = shift @_;
- my @tokens;
- push(@tokens,
- $self->next_token(), # variable
- $self->optional_newlines());
- my $token = $self->peek();
- if (defined($token) && $token->[0] eq 'in') {
- push(@tokens,
- $self->expect('in'),
- $self->optional_newlines());
- }
- push(@tokens,
- $self->parse(qr/^do$/), # items
- $self->expect('do'),
- $self->optional_newlines(),
- $self->parse_loop_body(),
- $self->expect('done'));
- return @tokens;
-}
-
-sub parse_if {
- my $self = shift @_;
- my @tokens;
- while (1) {
- push(@tokens,
- $self->parse(qr/^then$/), # if/elif condition
- $self->expect('then'),
- $self->optional_newlines(),
- $self->parse(qr/^(?:elif|else|fi)$/)); # if/elif body
- my $token = $self->peek();
- last unless defined($token) && $token->[0] eq 'elif';
- push(@tokens, $self->expect('elif'));
- }
- my $token = $self->peek();
- if (defined($token) && $token->[0] eq 'else') {
- push(@tokens,
- $self->expect('else'),
- $self->optional_newlines(),
- $self->parse(qr/^fi$/)); # else body
- }
- push(@tokens, $self->expect('fi'));
- return @tokens;
-}
-
-sub parse_loop_body {
- my $self = shift @_;
- return $self->parse(qr/^done$/);
-}
-
-sub parse_loop {
- my $self = shift @_;
- return ($self->parse(qr/^do$/), # condition
- $self->expect('do'),
- $self->optional_newlines(),
- $self->parse_loop_body(),
- $self->expect('done'));
-}
-
-sub parse_func {
- my $self = shift @_;
- return ($self->expect('('),
- $self->expect(')'),
- $self->optional_newlines(),
- $self->parse_cmd()); # body
-}
-
-sub parse_bash_array_assignment {
- my $self = shift @_;
- my @tokens = $self->expect('(');
- while (defined(my $token = $self->next_token())) {
- push(@tokens, $token);
- last if $token->[0] eq ')';
- }
- return @tokens;
-}
-
-my %compound = (
- '{' => \&parse_group,
- '(' => \&parse_subshell,
- 'case' => \&parse_case,
- 'for' => \&parse_for,
- 'if' => \&parse_if,
- 'until' => \&parse_loop,
- 'while' => \&parse_loop);
-
-sub parse_cmd {
- my $self = shift @_;
- my $cmd = $self->next_token();
- return () unless defined($cmd);
- return $cmd if $cmd->[0] eq "\n";
-
- my $token;
- my @tokens = $cmd;
- if ($cmd->[0] eq '!') {
- push(@tokens, $self->parse_cmd());
- return @tokens;
- } elsif (my $f = $compound{$cmd->[0]}) {
- push(@tokens, $self->$f());
- } elsif (defined($token = $self->peek()) && $token->[0] eq '(') {
- if ($cmd->[0] !~ /\w=$/) {
- push(@tokens, $self->parse_func());
- return @tokens;
- }
- my @array = $self->parse_bash_array_assignment();
- $tokens[-1]->[0] .= join(' ', map {$_->[0]} @array);
- $tokens[-1]->[2] = $array[$#array][2] if @array;
- }
-
- while (defined(my $token = $self->next_token())) {
- $self->untoken($token), last if $self->stop_at($token);
- push(@tokens, $token);
- last if $token->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
- }
- push(@tokens, $self->next_token()) if $tokens[-1]->[0] ne "\n" && defined($token = $self->peek()) && $token->[0] eq "\n";
- return @tokens;
-}
-
-sub accumulate {
- my ($self, $tokens, $cmd) = @_;
- push(@$tokens, @$cmd);
-}
-
-sub parse {
- my ($self, $stop) = @_;
- push(@{$self->{stop}}, $stop);
- goto DONE if $self->stop_at($self->peek());
- my @tokens;
- while (my @cmd = $self->parse_cmd()) {
- $self->accumulate(\@tokens, \@cmd);
- last if $self->stop_at($self->peek());
- }
-DONE:
- pop(@{$self->{stop}});
- return @tokens;
-}
+use File::Basename;
+my $_lib = dirname($0) . "/lib-shell-parser.pl";
+$_lib = "./$_lib" unless $_lib =~ m{^/};
+do $_lib or die "failed to load $_lib: $@$!\n";
# TestParser is a subclass of ShellParser which, beyond parsing shell script
# code, is also imbued with semantic knowledge of test construction, and checks
@@ -484,7 +36,7 @@ DONE:
# scripts in which the tests are defined.
package TestParser;
-use base 'ShellParser';
+our @ISA = ('ShellParser');
sub new {
my $class = shift @_;
@@ -578,14 +130,12 @@ DONE:
$self->SUPER::accumulate($tokens, $cmd);
}
-# ScriptParser is a subclass of ShellParser which identifies individual test
-# definitions within test scripts, and passes each test body through TestParser
-# to identify possible problems. ShellParser detects test definitions not only
-# at the top-level of test scripts but also within compound commands such as
-# loops and function definitions.
-package ScriptParser;
+# ChainlintParser is a subclass of ScriptParser which checks each test
+# body for broken &&-chains via TestParser, then formats and collects
+# the results.
+package ChainlintParser;
-use base 'ShellParser';
+our @ISA = ('ScriptParser');
sub new {
my $class = shift @_;
@@ -595,35 +145,6 @@ sub new {
return $self;
}
-# extract the raw content of a token, which may be a single string or a
-# composition of multiple strings and non-string character runs; for instance,
-# `"test body"` unwraps to `test body`; `word"a b"42'c d'` to `worda b42c d`
-sub unwrap {
- my $token = (@_ ? shift @_ : $_)->[0];
- # simple case: 'sqstring' or "dqstring"
- return $token if $token =~ s/^'([^']*)'$/$1/;
- return $token if $token =~ s/^"([^"]*)"$/$1/;
-
- # composite case
- my ($s, $q, $escaped);
- while (1) {
- # slurp up non-special characters
- $s .= $1 if $token =~ /\G([^\\'"]*)/gc;
- # handle special characters
- last unless $token =~ /\G(.)/sgc;
- my $c = $1;
- $q = undef, next if defined($q) && $c eq $q;
- $q = $c, next if !defined($q) && $c =~ /^['"]$/;
- if ($c eq '\\') {
- last unless $token =~ /\G(.)/sgc;
- $c = $1;
- $s .= '\\' if $c eq "\n"; # preserve line splice
- }
- $s .= $c;
- }
- return $s
-}
-
sub format_problem {
local $_ = shift;
/^AMP$/ && return "missing '&&'";
@@ -635,10 +156,10 @@ sub format_problem {
sub check_test {
my $self = shift @_;
- my $title = unwrap(shift @_);
+ my $title = ScriptParser::unwrap(shift @_);
my $body = shift @_;
my $lineno = $body->[3];
- $body = unwrap($body);
+ $body = ScriptParser::unwrap($body);
if ($body eq '-') {
my $herebody = shift @_;
$body = $herebody->{content};
@@ -673,22 +194,6 @@ sub check_test {
push(@{$self->{output}}, "$c->{blue}# chainlint: $title$c->{reset}\n$checked");
}
-sub parse_cmd {
- my $self = shift @_;
- my @tokens = $self->SUPER::parse_cmd();
- return @tokens unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/;
- my $n = $#tokens;
- $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
- my $herebody;
- if ($n >= 2 && $tokens[$n-1]->[0] eq '-' && $tokens[$n]->[0] =~ /^<<-?(.+)$/) {
- $herebody = $self->{heredocs}->{$1};
- $n--;
- }
- $self->check_test($tokens[1], $tokens[2], $herebody) if $n == 2; # title body
- $self->check_test($tokens[2], $tokens[3], $herebody) if $n > 2; # prereq title body
- return @tokens;
-}
-
# main contains high-level functionality for processing command-line switches,
# feeding input test scripts to ScriptParser, and reporting results.
package main;
@@ -803,7 +308,7 @@ sub check_script {
}
my $s = do { local $/; <$fh> };
close($fh);
- my $parser = ScriptParser->new(\$s);
+ my $parser = ChainlintParser->new(\$s);
1 while $parser->parse_cmd();
if (@{$parser->{output}}) {
my $c = fd_colors(1);
diff --git a/t/lib-shell-parser.pl b/t/lib-shell-parser.pl
new file mode 100644
index 0000000000..1e521a94f8
--- /dev/null
+++ b/t/lib-shell-parser.pl
@@ -0,0 +1,517 @@
+use strict;
+use warnings;
+
+# Copyright (c) 2021-2022 Eric Sunshine <sunshine@sunshineco.com>
+#
+# Lexer tokenizes POSIX shell scripts. It is roughly modeled after section 2.3
+# "Token Recognition" of POSIX chapter 2 "Shell Command Language". Although
+# similar to lexical analyzers for other languages, this one differs in a few
+# substantial ways due to quirks of the shell command language.
+#
+# For instance, in many languages, newline is just whitespace like space or
+# TAB, but in shell a newline is a command separator, thus a distinct lexical
+# token. A newline is significant and returned as a distinct token even at the
+# end of a shell comment.
+#
+# In other languages, `1+2` would typically be scanned as three tokens
+# (`1`, `+`, and `2`), but in shell it is a single token. However, the similar
+# `1 + 2`, which embeds whitespace, is scanned as three tokens in shell, as well.
+# In shell, several characters with special meaning lose that meaning when not
+# surrounded by whitespace. For instance, the negation operator `!` is special
+# when standing alone surrounded by whitespace; whereas in `foo!uucp` it is
+# just a plain character in the longer token "foo!uucp". In many other
+# languages, `"string"/foo:'string'` might be scanned as five tokens ("string",
+# `/`, `foo`, `:`, and 'string'), but in shell, it is just a single token.
+#
+# The lexical analyzer for the shell command language is also somewhat unusual
+# in that it recursively invokes the parser to handle the body of `$(...)`
+# expressions which can contain arbitrary shell code. Such expressions may be
+# encountered both inside and outside of double-quoted strings.
+#
+# The lexical analyzer is responsible for consuming shell here-doc bodies which
+# extend from the line following a `<<TAG` operator until a line consisting
+# solely of `TAG`. Here-doc consumption begins when a newline is encountered.
+# It is legal for multiple here-doc `<<TAG` operators to be present on a single
+# line, in which case their bodies must be present one following the next, and
+# are consumed in the (left-to-right) order the `<<TAG` operators appear on the
+# line. A special complication is that the bodies of all here-docs must be
+# consumed when the newline is encountered even if the parse context depth has
+# changed. For instance, in `cat <<A && x=$(cat <<B &&\n`, bodies of here-docs
+# "A" and "B" must be consumed even though "A" was introduced outside the
+# recursive parse context in which "B" was introduced and in which the newline
+# is encountered.
+package Lexer;
+
+sub new {
+ my ($class, $parser, $s) = @_;
+ bless {
+ parser => $parser,
+ buff => $s,
+ lineno => 1,
+ heretags => []
+ } => $class;
+}
+
+sub scan_heredoc_tag {
+ my $self = shift @_;
+ ${$self->{buff}} =~ /\G(-?)/gc;
+ my $indented = $1;
+ my $token = $self->scan_token();
+ return "<<$indented" unless $token;
+ my $tag = $token->[0];
+ $tag =~ s/['"\\]//g;
+ $$token[0] = $indented ? "\t$tag" : "$tag";
+ push(@{$self->{heretags}}, $token);
+ return "<<$indented$tag";
+}
+
+sub scan_op {
+ my ($self, $c) = @_;
+ my $b = $self->{buff};
+ return $c unless $$b =~ /\G(.)/sgc;
+ my $cc = $c . $1;
+ return scan_heredoc_tag($self) if $cc eq '<<';
+ return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
+ pos($$b)--;
+ return $c;
+}
+
+sub scan_sqstring {
+ my $self = shift @_;
+ ${$self->{buff}} =~ /\G([^']*'|.*\z)/sgc;
+ my $s = $1;
+ $self->{lineno} += () = $s =~ /\n/sg;
+ return "'" . $s;
+}
+
+sub scan_dqstring {
+ my $self = shift @_;
+ my $b = $self->{buff};
+ my $s = '"';
+ while (1) {
+ # slurp up non-special characters
+ $s .= $1 if $$b =~ /\G([^"\$\\]+)/gc;
+ # handle special characters
+ last unless $$b =~ /\G(.)/sgc;
+ my $c = $1;
+ $s .= '"', last if $c eq '"';
+ $s .= '$' . $self->scan_dollar(), next if $c eq '$';
+ if ($c eq '\\') {
+ $s .= '\\', last unless $$b =~ /\G(.)/sgc;
+ $c = $1;
+ $self->{lineno}++, next if $c eq "\n"; # line splice
+ # backslash escapes only $, `, ", \ in dq-string
+ $s .= '\\' unless $c =~ /^[\$`"\\]$/;
+ $s .= $c;
+ next;
+ }
+ die("internal error scanning dq-string '$c'\n");
+ }
+ $self->{lineno} += () = $s =~ /\n/sg;
+ return $s;
+}
+
+sub scan_balanced {
+ my ($self, $c1, $c2) = @_;
+ my $b = $self->{buff};
+ my $depth = 1;
+ my $s = $c1;
+ while ($$b =~ /\G([^\Q$c1$c2\E]*(?:[\Q$c1$c2\E]|\z))/gc) {
+ $s .= $1;
+ $depth++, next if $s =~ /\Q$c1\E$/;
+ $depth--;
+ last if $depth == 0;
+ }
+ $self->{lineno} += () = $s =~ /\n/sg;
+ return $s;
+}
+
+sub scan_subst {
+ my $self = shift @_;
+ my @tokens = $self->{parser}->parse(qr/^\)$/);
+ $self->{parser}->next_token(); # closing ")"
+ return @tokens;
+}
+
+sub scan_dollar {
+ my $self = shift @_;
+ my $b = $self->{buff};
+ return $self->scan_balanced('(', ')') if $$b =~ /\G\((?=\()/gc; # $((...))
+ return '(' . join(' ', map {$_->[0]} $self->scan_subst()) . ')' if $$b =~ /\G\(/gc; # $(...)
+ return $self->scan_balanced('{', '}') if $$b =~ /\G\{/gc; # ${...}
+ return $1 if $$b =~ /\G(\w+)/gc; # $var
+ return $1 if $$b =~ /\G([@*#?$!0-9-])/gc; # $*, $1, $$, etc.
+ return '';
+}
+
+sub swallow_heredocs {
+ my $self = shift @_;
+ my $b = $self->{buff};
+ my $tags = $self->{heretags};
+ while (my $tag = shift @$tags) {
+ my $start = pos($$b);
+ my $indent = $$tag[0] =~ s/^\t// ? '\\s*' : '';
+ $$b =~ /(?:\G|\n)$indent\Q$$tag[0]\E(?:\n|\z)/gc;
+ if (pos($$b) > $start) {
+ my $body = substr($$b, $start, pos($$b) - $start);
+ $self->{parser}->{heredocs}->{$$tag[0]} = {
+ content => substr($body, 0, length($body) - length($&)),
+ start_line => $self->{lineno},
+ };
+ $self->{lineno} += () = $body =~ /\n/sg;
+ next;
+ }
+ push(@{$self->{parser}->{problems}}, ['HEREDOC', $tag]);
+ $$b =~ /(?:\G|\n).*\z/gc; # consume rest of input
+ my $body = substr($$b, $start, pos($$b) - $start);
+ $self->{lineno} += () = $body =~ /\n/sg;
+ last;
+ }
+}
+
+sub scan_token {
+ my $self = shift @_;
+ my $b = $self->{buff};
+ my $token = '';
+ my ($start, $startln);
+RESTART:
+ $startln = $self->{lineno};
+ $$b =~ /\G[ \t]+/gc; # skip whitespace (but not newline)
+ $start = pos($$b) || 0;
+ $self->{lineno}++, return ["\n", $start, pos($$b), $startln, $startln] if $$b =~ /\G#[^\n]*(?:\n|\z)/gc; # comment
+ while (1) {
+ # slurp up non-special characters
+ $token .= $1 if $$b =~ /\G([^\\;&|<>(){}'"\$\s]+)/gc;
+ # handle special characters
+ last unless $$b =~ /\G(.)/sgc;
+ my $c = $1;
+ pos($$b)--, last if $c =~ /^[ \t]$/; # whitespace ends token
+ pos($$b)--, last if length($token) && $c =~ /^[;&|<>(){}\n]$/;
+ $token .= $self->scan_sqstring(), next if $c eq "'";
+ $token .= $self->scan_dqstring(), next if $c eq '"';
+ $token .= $c . $self->scan_dollar(), next if $c eq '$';
+ $self->{lineno}++, $self->swallow_heredocs(), $token = $c, last if $c eq "\n";
+ $token = $self->scan_op($c), last if $c =~ /^[;&|<>]$/;
+ $token = $c, last if $c =~ /^[(){}]$/;
+ if ($c eq '\\') {
+ $token .= '\\', last unless $$b =~ /\G(.)/sgc;
+ $c = $1;
+ $self->{lineno}++, next if $c eq "\n" && length($token); # line splice
+ $self->{lineno}++, goto RESTART if $c eq "\n"; # line splice
+ $token .= '\\' . $c;
+ next;
+ }
+ die("internal error scanning character '$c'\n");
+ }
+ return length($token) ? [$token, $start, pos($$b), $startln, $self->{lineno}] : undef;
+}
+
+# ShellParser parses POSIX shell scripts (with minor extensions for Bash). It
+# is a recursive descent parser very roughly modeled after section 2.10 "Shell
+# Grammar" of POSIX chapter 2 "Shell Command Language".
+package ShellParser;
+
+sub new {
+ my ($class, $s) = @_;
+ my $self = bless {
+ buff => [],
+ stop => [],
+ output => [],
+ heredocs => {},
+ insubshell => 0,
+ } => $class;
+ $self->{lexer} = Lexer->new($self, $s);
+ return $self;
+}
+
+sub next_token {
+ my $self = shift @_;
+ return pop(@{$self->{buff}}) if @{$self->{buff}};
+ return $self->{lexer}->scan_token();
+}
+
+sub untoken {
+ my $self = shift @_;
+ push(@{$self->{buff}}, @_);
+}
+
+sub peek {
+ my $self = shift @_;
+ my $token = $self->next_token();
+ return undef unless defined($token);
+ $self->untoken($token);
+ return $token;
+}
+
+sub stop_at {
+ my ($self, $token) = @_;
+ return 1 unless defined($token);
+ my $stop = ${$self->{stop}}[-1] if @{$self->{stop}};
+ return defined($stop) && $token->[0] =~ $stop;
+}
+
+sub expect {
+ my ($self, $expect) = @_;
+ my $token = $self->next_token();
+ return $token if defined($token) && $token->[0] eq $expect;
+ push(@{$self->{output}}, "?!ERR?! expected '$expect' but found '" . (defined($token) ? $token->[0] : "<end-of-input>") . "'\n");
+ $self->untoken($token) if defined($token);
+ return ();
+}
+
+sub optional_newlines {
+ my $self = shift @_;
+ my @tokens;
+ while (my $token = $self->peek()) {
+ last unless $token->[0] eq "\n";
+ push(@tokens, $self->next_token());
+ }
+ return @tokens;
+}
+
+sub parse_group {
+ my $self = shift @_;
+ return ($self->parse(qr/^}$/),
+ $self->expect('}'));
+}
+
+sub parse_subshell {
+ my $self = shift @_;
+ $self->{insubshell}++;
+ my @tokens = ($self->parse(qr/^\)$/),
+ $self->expect(')'));
+ $self->{insubshell}--;
+ return @tokens;
+}
+
+sub parse_case_pattern {
+ my $self = shift @_;
+ my @tokens;
+ while (defined(my $token = $self->next_token())) {
+ push(@tokens, $token);
+ last if $token->[0] eq ')';
+ }
+ return @tokens;
+}
+
+sub parse_case {
+ my $self = shift @_;
+ my @tokens;
+ push(@tokens,
+ $self->next_token(), # subject
+ $self->optional_newlines(),
+ $self->expect('in'),
+ $self->optional_newlines());
+ while (1) {
+ my $token = $self->peek();
+ last unless defined($token) && $token->[0] ne 'esac';
+ push(@tokens,
+ $self->parse_case_pattern(),
+ $self->optional_newlines(),
+ $self->parse(qr/^(?:;;|esac)$/)); # item body
+ $token = $self->peek();
+ last unless defined($token) && $token->[0] ne 'esac';
+ push(@tokens,
+ $self->expect(';;'),
+ $self->optional_newlines());
+ }
+ push(@tokens, $self->expect('esac'));
+ return @tokens;
+}
+
+sub parse_for {
+ my $self = shift @_;
+ my @tokens;
+ push(@tokens,
+ $self->next_token(), # variable
+ $self->optional_newlines());
+ my $token = $self->peek();
+ if (defined($token) && $token->[0] eq 'in') {
+ push(@tokens,
+ $self->expect('in'),
+ $self->optional_newlines());
+ }
+ push(@tokens,
+ $self->parse(qr/^do$/), # items
+ $self->expect('do'),
+ $self->optional_newlines(),
+ $self->parse_loop_body(),
+ $self->expect('done'));
+ return @tokens;
+}
+
+sub parse_if {
+ my $self = shift @_;
+ my @tokens;
+ while (1) {
+ push(@tokens,
+ $self->parse(qr/^then$/), # if/elif condition
+ $self->expect('then'),
+ $self->optional_newlines(),
+ $self->parse(qr/^(?:elif|else|fi)$/)); # if/elif body
+ my $token = $self->peek();
+ last unless defined($token) && $token->[0] eq 'elif';
+ push(@tokens, $self->expect('elif'));
+ }
+ my $token = $self->peek();
+ if (defined($token) && $token->[0] eq 'else') {
+ push(@tokens,
+ $self->expect('else'),
+ $self->optional_newlines(),
+ $self->parse(qr/^fi$/)); # else body
+ }
+ push(@tokens, $self->expect('fi'));
+ return @tokens;
+}
+
+sub parse_loop_body {
+ my $self = shift @_;
+ return $self->parse(qr/^done$/);
+}
+
+sub parse_loop {
+ my $self = shift @_;
+ return ($self->parse(qr/^do$/), # condition
+ $self->expect('do'),
+ $self->optional_newlines(),
+ $self->parse_loop_body(),
+ $self->expect('done'));
+}
+
+sub parse_func {
+ my $self = shift @_;
+ return ($self->expect('('),
+ $self->expect(')'),
+ $self->optional_newlines(),
+ $self->parse_cmd()); # body
+}
+
+sub parse_bash_array_assignment {
+ my $self = shift @_;
+ my @tokens = $self->expect('(');
+ while (defined(my $token = $self->next_token())) {
+ push(@tokens, $token);
+ last if $token->[0] eq ')';
+ }
+ return @tokens;
+}
+
+my %compound = (
+ '{' => \&parse_group,
+ '(' => \&parse_subshell,
+ 'case' => \&parse_case,
+ 'for' => \&parse_for,
+ 'if' => \&parse_if,
+ 'until' => \&parse_loop,
+ 'while' => \&parse_loop);
+
+sub parse_cmd {
+ my $self = shift @_;
+ my $cmd = $self->next_token();
+ return () unless defined($cmd);
+ return $cmd if $cmd->[0] eq "\n";
+
+ my $token;
+ my @tokens = $cmd;
+ if ($cmd->[0] eq '!') {
+ push(@tokens, $self->parse_cmd());
+ return @tokens;
+ } elsif (my $f = $compound{$cmd->[0]}) {
+ push(@tokens, $self->$f());
+ } elsif (defined($token = $self->peek()) && $token->[0] eq '(') {
+ if ($cmd->[0] !~ /\w=$/) {
+ push(@tokens, $self->parse_func());
+ return @tokens;
+ }
+ my @array = $self->parse_bash_array_assignment();
+ $tokens[-1]->[0] .= join(' ', map {$_->[0]} @array);
+ $tokens[-1]->[2] = $array[$#array][2] if @array;
+ }
+
+ while (defined(my $token = $self->next_token())) {
+ $self->untoken($token), last if $self->stop_at($token);
+ push(@tokens, $token);
+ last if $token->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
+ }
+ push(@tokens, $self->next_token()) if $tokens[-1]->[0] ne "\n" && defined($token = $self->peek()) && $token->[0] eq "\n";
+ return @tokens;
+}
+
+sub accumulate {
+ my ($self, $tokens, $cmd) = @_;
+ push(@$tokens, @$cmd);
+}
+
+sub parse {
+ my ($self, $stop) = @_;
+ push(@{$self->{stop}}, $stop);
+ goto DONE if $self->stop_at($self->peek());
+ my @tokens;
+ while (my @cmd = $self->parse_cmd()) {
+ $self->accumulate(\@tokens, \@cmd);
+ last if $self->stop_at($self->peek());
+ }
+DONE:
+ pop(@{$self->{stop}});
+ return @tokens;
+}
+
+# ScriptParser is a subclass of ShellParser which identifies individual test
+# definitions within test scripts and calls check_test() for each test body
+# found. Callers subclass ScriptParser and override check_test() to
+# implement specific checks (e.g. chainlint checks &&-chains, lint-style
+# checks grep usage).
+package ScriptParser;
+
+our @ISA = ('ShellParser');
+
+# extract the raw content of a token, which may be a single string or a
+# composition of multiple strings and non-string character runs; for instance,
+# `"test body"` unwraps to `test body`; `word"a b"42'c d'` to `worda b42c d`
+sub unwrap {
+ my $token = (@_ ? shift @_ : $_)->[0];
+ # simple case: 'sqstring' or "dqstring"
+ return $token if $token =~ s/^'([^']*)'$/$1/;
+ return $token if $token =~ s/^"([^"]*)"$/$1/;
+
+ # composite case
+ my ($s, $q, $escaped);
+ while (1) {
+ # slurp up non-special characters
+ $s .= $1 if $token =~ /\G([^\\'"]*)/gc;
+ # handle special characters
+ last unless $token =~ /\G(.)/sgc;
+ my $c = $1;
+ $q = undef, next if defined($q) && $c eq $q;
+ $q = $c, next if !defined($q) && $c =~ /^['"]$/;
+ if ($c eq '\\') {
+ last unless $token =~ /\G(.)/sgc;
+ $c = $1;
+ $s .= '\\' if $c eq "\n"; # preserve line splice
+ }
+ $s .= $c;
+ }
+ return $s
+}
+
+sub check_test {
+ # no-op; subclasses override to implement specific checks
+}
+
+sub parse_cmd {
+ my $self = shift @_;
+ my @tokens = $self->SUPER::parse_cmd();
+ return @tokens unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/;
+ my $n = $#tokens;
+ $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
+ my $herebody;
+ if ($n >= 2 && $tokens[$n-1]->[0] eq '-' && $tokens[$n]->[0] =~ /^<<-?(.+)$/) {
+ $herebody = $self->{heredocs}->{$1};
+ $n--;
+ }
+ $self->check_test($tokens[1], $tokens[2], $herebody) if $n == 2; # title body
+ $self->check_test($tokens[2], $tokens[3], $herebody) if $n > 2; # prereq title body
+ return @tokens;
+}
+
+1;
--
gitgitgadget
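
For readers who want to try the subclassing pattern described in the
ScriptParser comment without the rest of the series, here is a minimal,
self-contained sketch. It is not part of the patch. StubScriptParser,
dispatch(), TitleCollector and tok() are hypothetical names made up for
illustration: the stub stands in for the real ScriptParser and copies only
unwrap() and the title/body selection from parse_cmd(). It takes
already-tokenized commands rather than calling SUPER::parse_cmd(), and it
omits the here-doc branch. A real caller would presumably load
t/lib-shell-parser.pl via do() and subclass ScriptParser directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for ScriptParser so the sketch runs on its own.
package StubScriptParser;

sub new {
	my $class = shift;
	return bless({}, $class);
}

# Copied from the patch: strip quoting from a token's text.
sub unwrap {
	my $token = (@_ ? shift @_ : $_)->[0];
	return $token if $token =~ s/^'([^']*)'$/$1/;
	return $token if $token =~ s/^"([^"]*)"$/$1/;
	my ($s, $q);
	while (1) {
		$s .= $1 if $token =~ /\G([^\\'"]*)/gc;
		last unless $token =~ /\G(.)/sgc;
		my $c = $1;
		$q = undef, next if defined($q) && $c eq $q;
		$q = $c, next if !defined($q) && $c =~ /^['"]$/;
		if ($c eq '\\') {
			last unless $token =~ /\G(.)/sgc;
			$c = $1;
			$s .= '\\' if $c eq "\n"; # preserve line splice
		}
		$s .= $c;
	}
	return $s;
}

sub check_test {
	# no-op; subclasses override
}

# Same title/body selection as parse_cmd() in the patch, minus the
# here-doc case; operates on tokens passed in by the caller.
sub dispatch {
	my ($self, @tokens) = @_;
	return unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/;
	my $n = $#tokens;
	$n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/;
	$self->check_test($tokens[1], $tokens[2], undef) if $n == 2; # title body
	$self->check_test($tokens[2], $tokens[3], undef) if $n > 2;  # prereq title body
}

# Hypothetical subclass: records the unwrapped title of each test.
package TitleCollector;
our @ISA = ('StubScriptParser');

sub check_test {
	my ($self, $title, $body, $herebody) = @_;
	push(@{$self->{titles}}, StubScriptParser::unwrap($title));
}

package main;

# Build minimal tokens; only element [0] (the text) is used here.
sub tok { return map { [$_] } @_ }

my $p = TitleCollector->new();
$p->dispatch(tok('test_expect_success', "'setup'", "'git init'", "\n"));
$p->dispatch(tok('test_expect_failure', 'PERL', "'needs perl'", "'true'", "\n"));
$p->dispatch(tok('echo', 'hello', "\n")); # not a test definition; ignored
my $composite = StubScriptParser::unwrap([q{word"a b"42'c d'}]);
print "$_\n" for @{$p->{titles}}, $composite;
```

The sketch prints the two test titles followed by the composite example
from the unwrap() comment. The point is that the base class only locates
the title and body tokens, while each tool decides what to do with them in
its own check_test().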
Thread overview: 7+ messages
2026-06-04 7:45 [PATCH 0/6] t: add lint-style.pl and convert grep to test_grep Michael Montalbo via GitGitGadget
2026-06-04 7:45 ` [PATCH 1/6] t/README: document test_grep helper Michael Montalbo via GitGitGadget
2026-06-04 7:45 ` Michael Montalbo via GitGitGadget [this message]
2026-06-04 7:45 ` [PATCH 3/6] t: fix Lexer line count for $() inside double-quoted strings Michael Montalbo via GitGitGadget
2026-06-04 7:45 ` [PATCH 4/6] t: add lint-style.pl with test_grep negation rule Michael Montalbo via GitGitGadget
2026-06-04 7:45 ` [PATCH 5/6] t: fix grep assertions missing file arguments Michael Montalbo via GitGitGadget
2026-06-04 7:45 ` [PATCH 6/6] t: lint and convert grep assertions to test_grep Michael Montalbo via GitGitGadget