From: Aditya <yashsri421@gmail.com>
To: Joe Perches <joe@perches.com>
Cc: linux-kernel@vger.kernel.org,
	linux-kernel-mentees@lists.linuxfoundation.org,
	lukas.bulwahn@gmail.com, dwaipayanray1@gmail.com
Subject: Re: [PATCH v2] checkpatch: fix false positives in REPEATED_WORD warning
Date: Fri, 23 Oct 2020 00:44:59 +0530
Message-ID: <5121bf7c-a126-6178-62ff-e54f0bb4cb6e@gmail.com>
In-Reply-To: <4cbbd8d8b6c4d686f71648af8bc970baa4b0ee9b.camel@perches.com>

On 22/10/20 9:40 pm, Joe Perches wrote:
> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -3051,7 +3051,10 @@ sub process {
>>  		}
>>  
>>  # check for repeated words separated by a single space
>> -		if ($rawline =~ /^\+/ || $in_commit_log) {
>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
>> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> 
> Alignment and use \b before and after the regex please.

If we use \b before the regex, after it, or both, it no longer matches
lines such as:
+   -rw-r--r--. 1 root root 112K Mar 20 12:16 selinux-policy-3.14.4-48.fc31.noarch.rpm

This is probably because '-' is not a word character, so \b finds no
word boundary between the preceding space and the leading '-' (or
between the trailing '-' and the '.').
I have not observed any drawbacks from leaving \b out, though.
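
To make the \b behaviour concrete, here is a small sketch. It is written in Python rather than Perl only so that it can be run standalone; for these ASCII inputs \b means the same thing in both (a position between a word character and a non-word character). The variable names are mine and are not part of the patch.

```python
import re

# The `ls -l` line from the example above (the leading '+' is the diff marker).
line = ("+   -rw-r--r--. 1 root root 112K Mar 20 12:16 "
        "selinux-policy-3.14.4-48.fc31.noarch.rpm")

# Same character classes as in the patch.
perm = r"[bcCdDlMnpPs?-][rwxsStT-]{9}"

plain = re.search(perm, line)              # no \b at all
leading = re.search(r"\b" + perm, line)    # \b before the regex
trailing = re.search(perm + r"\b", line)   # \b after the regex

# ' ' and '-' are both non-word characters, so there is no boundary
# before the leading '-'; likewise none between the final '-' and '.'.
print(plain.group(0) if plain else None)
print(leading)
print(trailing)
```

Only the variant without \b finds the permission string in this line.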

> 
> 		if (($rawline =~ /^\+/ || $in_commit_log) &&
> 		    $rawline !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/) {
>> @@ -3065,6 +3068,34 @@ sub process {
>>  				next if ($first ne $second);
>>  				next if ($first eq 'long');
>>  
>> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +				if ($first =~ /\b[0-9a-f]{2,}/) {
>> +					# if such sequence occurs more than 4, it is most probably part of some of code
>> +					next if ((scalar @hex_seq)>4);
>> +					# for hex occurrences which are less than 4
>> +					# get first hex word in the line
>> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +						my $post_hex_seq = $';
>> +
>> +						# set suffieciently high default values to avoid ignoring or counting in absence of another
>> +						my $non_hex_char_pos = 1000;
>> +						my $special_chars_pos = 500;
>> +
>> +						if ($post_hex_seq =~ /[g-z]+/) {
>> +							# first non hex character in post_hex_seq
>> +							$non_hex_char_pos = $-[0];
>> +						}
>> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +							# first occurrence of 2 or more special chars
>> +							$special_chars_pos = $-[0];
>> +						}
> 
> What does all this code actually avoid?
> 
> 

Sir, there are multiple variations of hex output that trigger this
warning, e.g.:
1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
2) ffffffff ffffffff 00000000 c070058c
3)     f5a:       48 c7 44 24 78 ff ff    movq $0xffffffffffffffff,0x78(%rsp)
4) +    fe fe
5) +    fe fe   - ? end marker ?
6) Code: ff ff 48 (...)

So I first check whether the repeated word matches /\b[0-9a-f]{2,}/. If
it does, and it occurs in a sequence of more than 4 such words (i.e. 5
or more), then it is most probably part of hexadecimal code. This is
implemented here:

+				if ($first =~ /\b[0-9a-f]{2,}/) {
+					# if such sequence occurs more than 4, it is most probably part of some of code
+					next if ((scalar @hex_seq)>4);

This addresses the warnings for cases similar to examples (1), (2) and (3).

But this still does not cover examples (4), (5) and (6). One could argue
that we can replace:

+					next if ((scalar @hex_seq)>4);

with (scalar @hex_seq)>2 or (scalar @hex_seq)>3

but then we would also suppress genuine warnings such as:

7) +	 * sets this to -1, the slack value will be calculated to be be halfway
8) + * @seg: index of packet segment whose raw fields are to be be extracted
9) The data in destination buffer is expected to be be parsed in big
10) +	 *   1. New session or device can'be be created - session sysfs files

Here I observed that in hex output there are at least 2 consecutive
special characters before any non-hex character, e.g. in (5). Such
sequences are also very rare in written English, which helps in our
case.

This is implemented here:

>> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +				if ($first =~ /\b[0-9a-f]{2,}/) {
>> +					# if such sequence occurs more than 4, it is most probably part of some of code
>> +					next if ((scalar @hex_seq)>4);
>> +					# for hex occurrences which are less than 4
>> +					# get first hex word in the line
>> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +						my $post_hex_seq = $';
>> +
>> +						# set suffieciently high default values to avoid ignoring or counting in absence of another
>> +						my $non_hex_char_pos = 1000;
>> +						my $special_chars_pos = 500;
>> +
>> +						if ($post_hex_seq =~ /[g-z]+/) {
>> +							# first non hex character in post_hex_seq
>> +							$non_hex_char_pos = $-[0];
>> +						}
>> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +							# first occurrence of 2 or more special chars
>> +							$special_chars_pos = $-[0];
>> +						}

I have used these two lines for cases like example (4):
+						my $non_hex_char_pos = 1000;
+						my $special_chars_pos = 500;

Here there are no non-hex characters after the hex words, so the
default values give us the desired result.
I have also set the defaults high enough that if only one of the two
patterns occurs in a line, the result is not affected, as it could be
with lower default values.
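
The position check can be tried on the examples above with the following sketch. It is a Python port of the quoted hunk, used only so that it runs standalone. The quoted hunk ends before the two positions are compared, so the final comparison (treat the line as hex when the special characters come first) is an assumption here, and the function name looks_like_hex_dump is hypothetical.

```python
import re

def looks_like_hex_dump(rawline):
    """Position heuristic from the quoted hunk, ported to Python."""
    # Find the first hex word in the line (two or more hex digits
    # followed by a space), as in the patch.
    first_hex = re.search(r"\b[0-9a-f]{2,} ", rawline)
    if first_hex is None:
        return False
    post_hex_seq = rawline[first_hex.end():]   # Perl's $'

    # Same defaults as in the patch.
    non_hex_char_pos = 1000
    special_chars_pos = 500

    m = re.search(r"[g-z]+", post_hex_seq)
    if m:
        non_hex_char_pos = m.start()           # Perl's $-[0]
    m = re.search(r"[^a-zA-Z0-9]{2,}", post_hex_seq)
    if m:
        special_chars_pos = m.start()

    # ASSUMPTION: this comparison is not shown in the quoted hunk.
    return special_chars_pos < non_hex_char_pos

for text in ["+    fe fe",
             "+    fe fe   - ? end marker ?",
             "Code: ff ff 48 (...)",
             "The data in destination buffer is expected to be be parsed in big"]:
    print(looks_like_hex_dump(text), repr(text))
```

Under that assumption, examples (4), (5) and (6) are treated as hex and skipped, while (9) still produces the warning, because the 'p' of "parsed" comes before any run of special characters.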


Thanks
Aditya
