* Re: b4: unicode control characters -- warn or remove? [not found] <20211101175020.5r4cwmy4qppi7dis@meerkat.local> @ 2021-11-01 19:09 ` Eric Wong 2021-11-01 19:17 ` Konstantin Ryabitsev 2021-11-01 20:02 ` Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 7+ messages in thread From: Eric Wong @ 2021-11-01 19:09 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: users, tools, git Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote: > Hi, all: > > Per exhibit a, what should we do in the situation where we discover unicode > control characters in an email? > > 1. Warn and strip these chars out, because they are extremely unlikely to be > doing anything legitimate in the context of a patch (unless someone is > sending patches for docs actually written in RTL languages) > 2. Warn and error out, refusing to produce an mbox > 3. Just warn and produce an mbox anyway > > I'd normally do #3, but with many people piping things to git-am, I'm not sure > if it's the safest choice. > > Exibit a: https://lwn.net/Articles/874546/ +Cc: git@vger IMHO, defense for this belongs in git-am (which already checks things like whitespace). ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: b4: unicode control characters -- warn or remove? 2021-11-01 19:09 ` b4: unicode control characters -- warn or remove? Eric Wong @ 2021-11-01 19:17 ` Konstantin Ryabitsev 2021-11-01 20:02 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 7+ messages in thread From: Konstantin Ryabitsev @ 2021-11-01 19:17 UTC (permalink / raw) To: Eric Wong; +Cc: users, tools, git On Mon, Nov 01, 2021 at 07:09:05PM +0000, Eric Wong wrote: > > Per exhibit a, what should we do in the situation where we discover unicode > > control characters in an email? > > > > 1. Warn and strip these chars out, because they are extremely unlikely to be > > doing anything legitimate in the context of a patch (unless someone is > > sending patches for docs actually written in RTL languages) > > 2. Warn and error out, refusing to produce an mbox > > 3. Just warn and produce an mbox anyway > > > > I'd normally do #3, but with many people piping things to git-am, I'm not sure > > if it's the safest choice. > > > > Exibit a: https://lwn.net/Articles/874546/ > > +Cc: git@vger > > IMHO, defense for this belongs in git-am (which already checks > things like whitespace). I agree, but even if that is implemented in git, we'll still probably want to catch this on the b4 side of things until everyone uses the git client where that's handled natively. -K ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: b4: unicode control characters -- warn or remove? 2021-11-01 19:09 ` b4: unicode control characters -- warn or remove? Eric Wong 2021-11-01 19:17 ` Konstantin Ryabitsev @ 2021-11-01 20:02 ` Ævar Arnfjörð Bjarmason 2021-11-01 20:22 ` Konstantin Ryabitsev 1 sibling, 1 reply; 7+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-11-01 20:02 UTC (permalink / raw) To: Eric Wong; +Cc: Konstantin Ryabitsev, users, tools, git On Mon, Nov 01 2021, Eric Wong wrote: > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote: >> Hi, all: >> >> Per exhibit a, what should we do in the situation where we discover unicode >> control characters in an email? >> >> 1. Warn and strip these chars out, because they are extremely unlikely to be >> doing anything legitimate in the context of a patch (unless someone is >> sending patches for docs actually written in RTL languages) >> 2. Warn and error out, refusing to produce an mbox >> 3. Just warn and produce an mbox anyway >> >> I'd normally do #3, but with many people piping things to git-am, I'm not sure >> if it's the safest choice. >> >> Exibit a: https://lwn.net/Articles/874546/ > > +Cc: git@vger > > IMHO, defense for this belongs in git-am (which already checks > things like whitespace). It checks whitespace because that's something that's commonly a source of patch corruption. I'm not adverse to adding this to core.whitespace, but trying to catch malicious injected code seems like a rather big expansion of its scope, particularly since: "[...]sending patches for docs actually written in RTL languages[...]" Or just code? People write comment and even in their native languages, and not all projects are as anglo-centric as those hosted on kernel.org. I haven't checked what the overlap is between solving this issue & i18n support, but we definitely should not be assuming that git's only using by kernel.org users & similar, even something as relatively obscure as git-am. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: b4: unicode control characters -- warn or remove? 2021-11-01 20:02 ` Ævar Arnfjörð Bjarmason @ 2021-11-01 20:22 ` Konstantin Ryabitsev 2021-11-01 20:49 ` Pavel Machek 2021-11-02 14:09 ` Konstantin Ryabitsev 0 siblings, 2 replies; 7+ messages in thread From: Konstantin Ryabitsev @ 2021-11-01 20:22 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, users, tools, git On Mon, Nov 01, 2021 at 09:02:34PM +0100, Ævar Arnfjörð Bjarmason wrote: > It checks whitespace because that's something that's commonly a source > of patch corruption. I'm not adverse to adding this to core.whitespace, > but trying to catch malicious injected code seems like a rather big > expansion of its scope, particularly since: > > "[...]sending patches for docs actually written in RTL languages[...]" > > Or just code? People write comment and even in their native languages, > and not all projects are as anglo-centric as those hosted on kernel.org. My comment about docs was purely within the scope of the Linux kernel. I think the following would be a sane check: 1. are there unicode control characters (CCs) present? 2. are there other characters from RTL languages present in the same line? if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is true, then it's likely worth a warning. Maybe even relax #2 to just check for unicode characters above a certain barrier where RTL languages live. I think everyone will agree that if there are unicode CCs and no other unicode characters in that same line, it's likely not a legitimate use of control characters. -K ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: b4: unicode control characters -- warn or remove? 2021-11-01 20:22 ` Konstantin Ryabitsev @ 2021-11-01 20:49 ` Pavel Machek 2021-11-01 21:02 ` Konstantin Ryabitsev 2021-11-02 14:09 ` Konstantin Ryabitsev 1 sibling, 1 reply; 7+ messages in thread From: Pavel Machek @ 2021-11-01 20:49 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Ævar Arnfjörð Bjarmason, Eric Wong, users, tools, git [-- Attachment #1: Type: text/plain, Size: 1429 bytes --] Hi! > > It checks whitespace because that's something that's commonly a source > > of patch corruption. I'm not adverse to adding this to core.whitespace, > > but trying to catch malicious injected code seems like a rather big > > expansion of its scope, particularly since: > > > > "[...]sending patches for docs actually written in RTL languages[...]" > > > > Or just code? People write comment and even in their native languages, > > and not all projects are as anglo-centric as those hosted on kernel.org. > > My comment about docs was purely within the scope of the Linux kernel. > > I think the following would be a sane check: > > 1. are there unicode control characters (CCs) present? > 2. are there other characters from RTL languages present in the same line? > > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is > true, then it's likely worth a warning. > > Maybe even relax #2 to just check for unicode characters above a certain > barrier where RTL languages live. I think everyone will agree that if there > are unicode CCs and no other unicode characters in that same line, it's likely > not a legitimate use of control characters. If you are worried about malicious patches, then it should be easy for attackers to add some RTL characters and escape the check... Best regards, Pavel -- http://www.livejournal.com/~pavelmachek [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: b4: unicode control characters -- warn or remove? 2021-11-01 20:49 ` Pavel Machek @ 2021-11-01 21:02 ` Konstantin Ryabitsev 0 siblings, 0 replies; 7+ messages in thread From: Konstantin Ryabitsev @ 2021-11-01 21:02 UTC (permalink / raw) To: Pavel Machek Cc: Ævar Arnfjörð Bjarmason, Eric Wong, users, tools, git [-- Attachment #1: Type: text/plain, Size: 1444 bytes --] On Mon, Nov 01, 2021 at 09:49:14PM +0100, Pavel Machek wrote: > > I think the following would be a sane check: > > > > 1. are there unicode control characters (CCs) present? > > 2. are there other characters from RTL languages present in the same line? > > > > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is > > true, then it's likely worth a warning. > > > > Maybe even relax #2 to just check for unicode characters above a certain > > barrier where RTL languages live. I think everyone will agree that if there > > are unicode CCs and no other unicode characters in that same line, it's likely > > not a legitimate use of control characters. > > If you are worried about malicious patches, then it should be easy for > attackers to add some RTL characters and escape the check... Well, the point of this attack was to trick the reviewer into accepting code that the compiler would treat differently (e.g. something that looked to be inside a comment block is actually outside of it). So, if attackers include some actual RTL text, then the reviewer would no longer be (as easily) tricked because there would be stuff other than just invisible characters in the line of code. This actually similar to how we treat unicode domains. Most browsers only allow unicode domains when the entire domain name consists of unicode characters. I suggest we take a similar approach. -K [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: b4: unicode control characters -- warn or remove? 2021-11-01 20:22 ` Konstantin Ryabitsev 2021-11-01 20:49 ` Pavel Machek @ 2021-11-02 14:09 ` Konstantin Ryabitsev 1 sibling, 0 replies; 7+ messages in thread From: Konstantin Ryabitsev @ 2021-11-02 14:09 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Eric Wong, users, tools, git On Mon, Nov 01, 2021 at 04:22:20PM -0400, Konstantin Ryabitsev wrote: > I think the following would be a sane check: > > 1. are there unicode control characters (CCs) present? > 2. are there other characters from RTL languages present in the same line? > > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is > true, then it's likely worth a warning. I implemented this solution in b4 master, so it should error out only when it finds control characters without any "other letter" unicode character category present in the same line (where Hebrew, Arabic, etc live). There's probably still a way to take advantage of this, but hopefully it's a lot less trivial now and less likely to go unnoticed by the reviewer. The error message will point where it found the problem: WARNING: Message contains suspicious unicode control characters! Subject: [PATCH 1/2] SPI: Add SPI driver for Sunplus SP7021 Line: + /* } if (isAdmin) begin admins only */ -----------^ Char: RIGHT-TO-LEFT OVERRIDE (0x202e) If you are sure about this, rerun with the right flag to allow. One can rerun with --allow-unicode-control-chars to override this. -K ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-11-02 14:09 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20211101175020.5r4cwmy4qppi7dis@meerkat.local>
2021-11-01 19:09 ` b4: unicode control characters -- warn or remove? Eric Wong
2021-11-01 19:17 ` Konstantin Ryabitsev
2021-11-01 20:02 ` Ævar Arnfjörð Bjarmason
2021-11-01 20:22 ` Konstantin Ryabitsev
2021-11-01 20:49 ` Pavel Machek
2021-11-01 21:02 ` Konstantin Ryabitsev
2021-11-02 14:09 ` Konstantin Ryabitsev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).