From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: "Nícolas F. R. A. Prado" <n@nfraprado.net>
Cc: linux-doc@vger.kernel.org, "Jonathan Corbet" <corbet@lwn.net>,
"Matthew Wilcox" <willy@infradead.org>,
linux-kernel@vger.kernel.org,
"André Almeida" <andrealmeid@collabora.com>,
~lkcamp/patches@lists.sr.ht
Subject: Re: [RFC PATCH 0/2] docs: automarkup.py: Add literal markup of known constants
Date: Wed, 9 Jun 2021 09:45:35 +0200 [thread overview]
Message-ID: <20210609094535.7ed61991@coco.lan> (raw)
In-Reply-To: <20210609014308.234027-1-n@nfraprado.net>
Hi Nícolas,
Em Tue, 8 Jun 2021 22:43:06 -0300
Nícolas F. R. A. Prado <n@nfraprado.net> escreveu:
> Hi,
>
> a while back Matthew suggested adding automatic markup of known constants, like
> NULL and error codes, as literals [1]. This is a great idea since those literals
> are very frequently used throughout the documentation and are fixed names, so we
> can easily match them without false positives and make the source files less
> cluttered.
>
> Patch 1 adds that feature to automarkup.py, while patch 2 removes the no longer
> needed markup from the xarray.rst file which was given as an example.
>
> Some things I'd like to discuss:
>
> * As a first draft I added the constants I found on xarray.rst plus all error
> codes from errno-base.h and errno.h. We don't need to start with everything,
> but we can certainly do better than this. What are common constants that
> should be added here? (Matthew mentioned ANSI C or POSIX, some reference to
> the constants in those would be great, even more if they are easily parsable)
>
> * The Literals list added is already a bit big with just the error codes, so
> perhaps we should move them to a separate plain text file that is read on
> startup?
If we're willing to keep a long list, they should be parsed from files,
instead of directly included. For things like errno-base.h/errno.h, the
better would be if automarkup would read them directly from the header
file, as otherwise there will be a maintainance burden.
Yet, a regex like "-E[A-Z\d]+\b" would likely get all cases, but this
will produce false positives, like EXAMPLE, EXAMPLES and some other
non-trivial exceptions.
With regards to NULL, I would just use a trivial regex like: "\bNULL\b"
(but see below my notes about using "\b").
> * There was also mention of automatically converting uppercase words to
> literals. I'm not so sure about that one so I left it out for now.
>
> The example given was XA_MARK_0, which is a #define in xarray.h. The way
> to go here is certainly to first use kernel-doc to get it included in the
> documentation, and then cross-reference to it. FWICT at the moment kernel-doc
> for defines should be done like for functions (but the parenthesis can be
> omitted, although they show up in the output):
>
> /**
> * XA_MARK_0 - Brief description.
> *
> * Description of the type.
> */
>
> At the current state, the cross-reference would then need to be done either
> through plain :c:macro:`XA_MARK_0`, or, by relying on automarkup, typedef
> XA_MARK_0 (which is not ideal since this isn't a typedef). Options for
> improvements would be to add 'macro' as a possible prefix (so eg. macro
> XA_MARK_0), or go all out and try to cross-reference on every uppercase word
> like suggested. We should strive for what is the most natural to write, as
> long as the regex isn't too broad.
>
> Since automarkup.py is opt-out rather than opt-in, I think we should be extra
> careful not to make the regexes too broad, even if in this case it relies on a
> C symbol being present.
>
> One other idea I had was that we could limit to all uppercase words as
> long as it has an _ on it, since I don't expect we would get false positives
> with that. The downside is that it might be very confusing to people why their
> MACRONAME isn't being automatically cross-referenced to...
What it can be done would be to first check/apply cross-references. Only if it
fails, then replace everything in uppercase to literals.
There is a drawback, tough: this would cause texts in uppercase. We used to
have lots of them before ReST conversion. I won't doubt that some files
would have kept a few of them. So, in this case, the regex would need to
require at least one _, e. g. something like:
\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b
Again, tests are needed in order to check if the regex won't get anything
that shouldn't be caught. So, I would grep the source codes in order to
check if the regexes won't bring false positives, e. g.:
$ git grep -E "\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b" Documentation/
Btw, the "\b" at the end won't actually work, due to things like:
GLOBAL_GEN_STORAGE{0:3}
GAUDI_ENGINE_ID_*
If you take a look at the scripts/get_abi.pl that I wrote, I faced
the same problem there with \b. So, I ended implementing my own set
of \b:
my $start = qr {(^|\s|\() }x;
my $bondary = qr { ([,.:;\)\s]|\z) }x;
Using the above replacements for \b, I would do something like this to
double-check if the regex won't be producing false-positives:
$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$1 \e[0;31m$2\e[0;37m $3\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i; done
Btw, on a quick look, this specific regex:
(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)
Seems to be producing just one false positive: a sequence of _, e. g.:
\b\_+\b
While we could make the regex more complex, I would just test if the second
match is '^\_+$', skipping it if found. On other words, a python-equivalent
to this:
$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq
-
That's said, it is worth to also mention the false negatives:
$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq >OK_list
$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE " "; done|sort|uniq >FULL_list
$ grep -v -f OK_list FULL_list >excluded_list
$ wc -l OK_list FULL_list excluded_list
7398 OK_list
14487 FULL_list
7520 excluded_list
29405 total
There are a lot of constants that won't be parsed if we require at least
one '_'. Looking at the excluded_list, indeed there are things there
which should be excluded, like EXAMPLE, EXAMPLES, EXCEPTION, WITH, WITHIN,
WITHOUT..., but also there are a several matches that are constants,
like KERNELRELEASE, DISCONTIGMEM, SIOCGIFINDEX, SIGALRM, SETREGSET,
CDROMREADCOOKED, CDROMREADRAW, ...
I doubt that there would be an easy way to handle such cases, as
a file with a ~7000 constants would be a maintenance nightmare.
---
In summary, my suggestion is that we should stay out of having a big list
of constants. So, I would start with:
1. a regex to get FOO_BAR cases, like:
(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)
with a test to exclude this pattern:
^\_+$
2. a logic that will pick and use errno codes from the errno*.h
files;
3. an specific handler handler for the NULL special case.
This would get 7000+ different constants, which seems a very good
start. If needed, later we could get a few other relevant constants
by checking the most-used ones with something like:
for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -v "_"; done|sort|uniq -c|sort -n
in order to grab the most common ones.
Regards,
Mauro
prev parent reply other threads:[~2021-06-09 7:45 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-09 1:43 [RFC PATCH 0/2] docs: automarkup.py: Add literal markup of known constants Nícolas F. R. A. Prado
2021-06-09 1:43 ` [RFC PATCH 1/2] " Nícolas F. R. A. Prado
2021-06-09 8:11 ` Mauro Carvalho Chehab
2021-06-13 23:13 ` Jonathan Corbet
2021-06-14 7:27 ` Mauro Carvalho Chehab
2021-06-09 1:43 ` [RFC PATCH 2/2] XArray: Remove literal markup from " Nícolas F. R. A. Prado
2021-06-09 7:50 ` Mauro Carvalho Chehab
2021-06-09 7:45 ` Mauro Carvalho Chehab [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210609094535.7ed61991@coco.lan \
--to=mchehab+huawei@kernel.org \
--cc=andrealmeid@collabora.com \
--cc=corbet@lwn.net \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=n@nfraprado.net \
--cc=willy@infradead.org \
--cc=~lkcamp/patches@lists.sr.ht \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.