* [PATCH] Check charset in scripts/checkpatch.pl
From: David Woodhouse @ 2007-07-06 7:01 UTC
To: torvalds, akpm; +Cc: linux-kernel
Reject all legacy 8-bit character sets and allow only ASCII or UTF-8 to
be added to files or used in patch descriptions.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: asd
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 277c326..7a7f283 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -395,6 +395,22 @@ sub process {
 			$clean = 0;
 		}
+# UTF-8 regex found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php
+		if ( ($realfile =~ /^$/ || $line =~ /^\+/) &&
+		     !($line =~ m/^(
+				[\x09\x0A\x0D\x20-\x7E]              # ASCII
+				| [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
+				|  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
+				| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
+				|  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
+				|  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
+				| [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
+				|  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
+				)*$/x ) ) {
+			print "Invalid UTF-8\n";
+			print "$herecurr";
+			$clean = 0;
+		}
 #ignore lines being removed
 		if ($line=~/^-/) {next;}
--
dwmw2
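
For readers who want to try the check outside checkpatch.pl, here is a minimal standalone sketch. It reuses the pattern from the patch above unchanged; the helper name is_ascii_or_utf8 and the sample lines are assumptions made for illustration and are not part of the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 when the byte string is plain ASCII or well-formed UTF-8,
# 0 otherwise. The pattern is the one from the patch; the sub name
# is_ascii_or_utf8 is hypothetical.
sub is_ascii_or_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/^(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$/x ? 1 : 0;
}

# Illustrative added lines: the same word encoded as UTF-8 and as Latin-1.
my %samples = (
    'UTF-8 e-acute'   => "+caf\xC3\xA9\n",
    'Latin-1 e-acute' => "+caf\xE9\n",
);

for my $name (sort keys %samples) {
    my $verdict = is_ascii_or_utf8($samples{$name}) ? "ok" : "Invalid UTF-8";
    print "$name: $verdict\n";
}
```

A lone Latin-1 byte such as 0xE9 looks like the lead byte of a 3-byte sequence with no continuation bytes after it, so no alternative in the pattern can consume it and the line is rejected.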
* Re: [PATCH] Check charset in scripts/checkpatch.pl
From: Andrew Morton @ 2007-07-06 7:08 UTC
To: David Woodhouse; +Cc: torvalds, linux-kernel
On Fri, 06 Jul 2007 03:01:03 -0400 David Woodhouse <dwmw2@infradead.org> wrote:
> Reject all legacy 8-bit character sets and allow only ASCII or UTF-8 to
> be added to files or used in patch descriptions.
What is the reasoning behind this?
> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
>
> Signed-off-by: asd
Jekyll & Hyde?
* Re: [PATCH] Check charset in scripts/checkpatch.pl
From: David Woodhouse @ 2007-07-06 7:19 UTC
To: Andrew Morton; +Cc: torvalds, linux-kernel
On Fri, 2007-07-06 at 00:08 -0700, Andrew Morton wrote:
> On Fri, 06 Jul 2007 03:01:03 -0400 David Woodhouse <dwmw2@infradead.org> wrote:
>
> > Reject all legacy 8-bit character sets and allow only ASCII or UTF-8 to
> > be added to files or used in patch descriptions.
>
> What is the reasoning behind this?
The character set used by the kernel is UTF-8. So we should check for
people trying to add invalid stuff in other character sets. There's no
way for them to _label_ legacy character sets as such; it's not like
MIME email where we can use a different charset for every mail and
expect people to cope. We need to be consistent.
> > Signed-off-by: David Woodhouse <dwmw2@infradead.org>
> >
> > Signed-off-by: asd
>
> Jekyll & Hyde?
Oops. I prepended that to the output of 'git-diff' just to make
checkpatch happy. And then forgot about it when I inserted the text file
into my mailer.
--
dwmw2
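
The labelling argument in the reply above can be made concrete with a small sketch: the same unlabelled byte decodes to a different character under each legacy charset, so a reader cannot tell which one was meant. This assumes Perl's core Encode module; the helper name codepoint_in and the choice of byte and charsets are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# One raw byte with no charset label attached.
my $byte = "\xE9";

# Returns the Unicode code point that the given legacy charset assigns
# to the first byte of $raw. codepoint_in is a hypothetical helper name.
sub codepoint_in {
    my ($charset, $raw) = @_;
    return ord(decode($charset, $raw));
}

# The same byte, interpreted under three different 8-bit charsets.
for my $charset (qw(iso-8859-1 iso-8859-7 koi8-r)) {
    printf "%-10s U+%04X\n", $charset, codepoint_in($charset, $byte);
}
```

Each charset yields a different code point for the same byte, which is why a source tree without per-file charset labels has to settle on a single encoding.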
* Re: [PATCH] Check charset in scripts/checkpatch.pl
From: Matt Mackall @ 2007-07-11 1:11 UTC
To: David Woodhouse; +Cc: Andrew Morton, torvalds, linux-kernel
On Fri, Jul 06, 2007 at 03:19:50AM -0400, David Woodhouse wrote:
> On Fri, 2007-07-06 at 00:08 -0700, Andrew Morton wrote:
> > On Fri, 06 Jul 2007 03:01:03 -0400 David Woodhouse <dwmw2@infradead.org> wrote:
> >
> > > Reject all legacy 8-bit character sets and allow only ASCII or UTF-8 to
> > > be added to files or used in patch descriptions.
> >
> > What is the reasoning behind this?
>
> The character set used by the kernel is UTF-8. So we should check for
> people trying to add invalid stuff in other character sets. There's no
> way for them to _label_ legacy character sets as such; it's not like
> MIME email where we can use a different charset for every mail and
> expect people to cope. We need to be consistent.
Seconded.
--
Mathematics is the supreme nostalgia of our time.