[RFC] Proposal: Static SECCOMP Policies

* [RFC] Proposal: Static SECCOMP Policies
@ 2024-09-12 16:02 Maxwell Bland
  2024-09-12 20:57 ` Neill Kapron
  2024-09-17  7:34 ` Kees Cook
  0 siblings, 2 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-12 16:02 UTC
  To: linux-arm-msm@vger.kernel.org
  Cc: Andrew Wheeler, Sammy BS2 Que | 阙斌生,
	Neill Kapron, Todd Kjos, Viktor Martensson, Andy Lutomirski,
	keescook@chromium.org, Will Drewry, Andy Gross, Bjorn Andersson,
	Konrad Dybcio, kernel-team

(Resending as plaintext for msm-kernel mailing list.
Original message was intended for android kernel team
though msm-kernel should be aware.)

Hi Kernel Team,

+ Kees, Andy, and Will since their input may be valuable.

It has been a while! (~9 months to be exact). This January, I sent out a small
message on BPF code loading ("unprivileged BPF considered harmful" or something
like that). In it, I noted new BPF programs are compiled all the time and
thrown into the kernel. At the time, I did not know these programs were just
compiled seccomp filter policies, loaded in as new BPF programs continuously
through the libminijail interface as well as direct syscalls. As of two days
ago, I now know this (and now you do too, if not already).

OK, yes, syscall filtering is very important, but this is creating a catch-22
issue. For one, see step (4) under "Exploitation overview" for
https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt.
Second, this minor lack of caching is adding load time to more than 90
binaries/services on the standard QCOM baseline. I'll admit, it is probably
negligible in the grand scheme of things (a quick approximation puts the data
operated on around 0.1188 MB). But most importantly, third, without some degree
of provenance, I have no way of telling if someone has injected malicious code
into the kernel, and unfortunately even knowing the correct bytes is still
"iffy", as in order to prevent JIT spray attacks, each of these filters is
offset by some random number of uint32_t's, making every 4-byte shift of the
filter a "valid" codepage to be loaded at runtime.
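
To make the 4-byte-shift problem concrete, here is a minimal Python sketch. It
assumes the expected filter bytes are already known and that a loaded image is
simply those bytes placed at some uint32_t-aligned offset inside a page. The
names fingerprint and find_aligned are hypothetical and not part of any
existing tool.

```python
import hashlib


def fingerprint(filter_bytes: bytes) -> str:
    """Return a SHA-256 hex digest identifying one filter's bytes."""
    return hashlib.sha256(filter_bytes).hexdigest()


def find_aligned(page: bytes, expected: bytes, align: int = 4):
    """Return the first align-byte-aligned offset of expected in page, or None.

    Assumption: the only randomisation is the start offset, in units of
    `align` bytes; whatever fills the rest of the page is ignored.
    """
    last_start = len(page) - len(expected)
    for off in range(0, last_start + 1, align):
        if page[off:off + len(expected)] == expected:
            return off
    return None
```

A checker along these lines has to accept every aligned offset, which is why
knowing the bytes alone does not pin down a single valid page image.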

You might be thinking, "but wait, bionic's libc only defines a couple of
restricted policies, primary and secondary for system and user apps
respectively." I know! For the most part, apps fall under what I presume
are the default app/system policies, but there are lots of QCOM binaries and
other magic programs (dolby dax) that are sending up these programs as well.
I'm seeing more than 20 different programs for around a minute's worth of
runtime. One example is attached at the end.

So, the proposal: a "CONFIG_SECCOMP_STATIC_POLICY" for seccomp. This
would change the Android kernel's generic SYS_seccomp call, which takes in a
filter with an array of BPF instructions, to instead reference an ID which
corresponds to a fixed file on /sys/bpf/seccomp or something like that. The
sandboxing behavior of these apps should be known at compile-time, even if
there are multiple "permission set types" that may need to be dispatched. User
apps should always have a single, fixed policy. This way it is possible to say
for every code page loaded into the kernel where it came from and what it
should look like.
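
A rough model of the proposed interface, sketched in Python rather than kernel
C: the caller passes only an ID, and the filter bytes come from a table fixed
at build time. The table contents, the IDs, and the function names are all
hypothetical; this is not an existing kernel interface.

```python
from types import MappingProxyType


def freeze_policies(policies):
    """Return a read-only ID -> filter-bytes table, fixed at build time."""
    return MappingProxyType(
        {int(pid): bytes(blob) for pid, blob in policies.items()}
    )


def attach_policy(table, policy_id):
    """Resolve a policy by ID; callers never supply bytecode themselves."""
    if policy_id not in table:
        raise PermissionError(f"unknown seccomp policy id: {policy_id}")
    return table[policy_id]


# Hypothetical ID and placeholder bytes, for illustration only.
TABLE = freeze_policies({1: b"\x95" + b"\x00" * 7})
```

Because the table is read-only and unknown IDs are rejected, every filter that
reaches the kernel in this model can be traced back to a build-time entry.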

Unfortunately, I do not know whether Motorola has enough "weight" to convince
QCOM to do the right foundational thing here, or to "define" the seccomp APIs
for Android, so it would be good to have Google's buy-in, to know whether there
are plans to fix this issue, or to have some discussion of how best to fix the
problem. At the very least, is there a contact at QCOM who might be able to
actually hunt down and document valid bytes for these policies?

The end goal is simple: when we see a code page is allocated in the kernel, we
can be sure that (1) it isn't malicious and (2) it has not been modified in
transit. I'm fine putting code where my mouth is, but right now that code
would involve having to fingerprint the signatures loaded by Qualcomm
components every time a new one is released, or pinging Google with a huge
patch changing how seccomp works with no idea of what requirements QCOM may
have on seccomp policy generation.

Thoughts? Is this doable, and if not, why? I'd also love help with the code and
adapting existing minijail code to use a new, more integrity-preserving
interface. If I am mistaken and it is possible to grab out valid BPF policy
code at compile time, please let me know how!

Regards,
Maxwell Bland

Standard filter (from, for example, com.google.android.gms)
"ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007b000000000055000200d7000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f950000000000000055000100e200000005008f000000000055000200a7000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f9500000000000000550002003e000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f9500000000000000550002005c000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f95000000000000005500020060000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f950000000000000055000200cb00000
0b40000000000ff7f950000000000000055000100c600000005004c0000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f9500000000000000550002003d000000b40000000000ff7f95000000000000005500020017000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
Unknown filter (from QCOM's /vendor/bin/qesdk-secmanager)
 "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007e000000000055000100e2000000050098000000000055000200d7000000b40000000000ff7f950000000000000055000200a7000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002004f000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002007b000000b40000000000ff7f9500000000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f950000000000000055000100c600000005005800000000005500020060000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff
7f950000000000000055000200cb000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f9500000000000000550002003b000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f950000000000000055000200b2000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
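
For readers who want to inspect these dumps, the following is a minimal Python
sketch that splits such a hex string into 8-byte instructions. It assumes the
dump is little-endian eBPF bytecode laid out like struct bpf_insn (opcode,
two register nibbles, 16-bit offset, 32-bit immediate); that reading is an
assumption about the dump, and the names Insn and decode are hypothetical.

```python
import struct
from typing import NamedTuple


class Insn(NamedTuple):
    opcode: int
    dst: int   # destination register (low nibble of the register byte)
    src: int   # source register (high nibble of the register byte)
    off: int   # signed 16-bit jump/memory offset
    imm: int   # 32-bit immediate, kept unsigned here for readability


def decode(hex_dump: str) -> list:
    """Split a hex dump into Insn tuples, 8 bytes per instruction."""
    raw = bytes.fromhex(hex_dump)
    if len(raw) % 8:
        raise ValueError("dump is not a whole number of 8-byte instructions")
    out = []
    for opcode, regs, off, imm in struct.iter_unpack("<BBhI", raw):
        out.append(Insn(opcode, regs & 0x0F, regs >> 4, off, imm))
    return out


# Fourth and fifth instructions of the dumps above.
for insn in decode("6160040000000000b4020000b70000c0"):
    print(f"op={insn.opcode:#04x} dst=r{insn.dst} src=r{insn.src} "
          f"off={insn.off} imm={insn.imm:#010x}")
```

Under that assumption, the fifth instruction of both dumps (b4020000b70000c0)
is a 32-bit move of 0xc00000b7, the value of AUDIT_ARCH_AARCH64, into register
2, and the recurring b40000000000ff7f loads 0x7fff0000 (SECCOMP_RET_ALLOW).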

List of services loading seccomp filters pulled from one run of the phone:
com.google.android.deskclock
/vendor/bin/qesdk-secmanager
media.hwcodec/vendor.qti.media.c2@1.0-service
media.audio.qc.codec.qti.media.c2audio@1.0-service
/vendor/bin/vendor.qti.qspmhal-service
/vendor/bin/qsap_sensors
media.extractoraextractor
/system_ext/bin/perfservice
/vendor/bin/wfdhdcphalservice
/vendor/bin/wifidisplayhalservice
/vendor/bin/qsap_dcfd
/vendor/bin/qms
/vendor/bin/qsap_location
/vendor/bin/qsap_qapeservice
/vendor/bin/wfdvndservice
media.swcodecoid.media.swcodec/bin/mediaswcodec
/vendor/bin/hw/qcrilNrd
qsap_qms_13qms16
qsap_qms_24qms17
/vendor/bin/ATFWD-daemon
/vendor/bin/hw/sxrservice
/vendor/bin/hw/qcrilNrd-c2
system_server
/vendor/bin/qmi_motext_hook1013170
/vendor/bin/qmi_motext_hook1013171
/vendor/bin/ims_rtp_daemon
com.android.systemui
webview_zygote
com.dolby.daxservice
vendor.qti.qesdk.sysservice
org.codeaurora.ims
com.android.se
com.android.phone
com.qti.qcc
com.google.android.ext.services
com.google.android.gms
com.google.android.euicc
com.google.android.googlequicksearchbox:interactor
com.google.android.apps.messaging:rcs
com.android.nfc
com.qualcomm.qti.workloadclassifier
com.qualcomm.location
com.google.android.gms.unstable
com.thundercomm.ar.core
com.android.vending:background
com.android.vending:quick_launch
com.android.dynsystem
com.android.managedprovisioning
com.android.shell

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-12 16:02 [RFC] Proposal: Static SECCOMP Policies Maxwell Bland
@ 2024-09-12 20:57 ` Neill Kapron
  2024-09-12 21:39   ` Maciej Żenczykowski
  2024-09-17  7:34 ` Kees Cook
  1 sibling, 1 reply; 26+ messages in thread
From: Neill Kapron @ 2024-09-12 20:57 UTC
  To: Maxwell Bland
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, maze, adelva, jeffv

+ Jeff, Alistair, and Maciej

Maxwell,

Thanks for the details on this. I have added several people who may be
better suited to comment.

Neill

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-12 20:57 ` Neill Kapron
@ 2024-09-12 21:39   ` Maciej Żenczykowski
  2024-09-13 17:07     ` [External] " Maxwell Bland
  0 siblings, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-12 21:39 UTC
  To: Neill Kapron
  Cc: Maxwell Bland, linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva, jeffv

wrt. BPF on Android:

(a) eBPF should already be locked down to just the bpfloader boot time process.

If you can prove it isn't, please let us know, but as this is sepolicy
around the bpf(BPF_PROG_LOAD) system call, it should be pretty
airtight:

allow bpfloader self:bpf { ... prog_load ... };
...
neverallow { domain -bpfloader } *:bpf prog_load;

(basically the only exception to the above is root/su on userdebug/eng
builds, which runs sepolicy in permissive mode and thus doesn't
enforce the above - but that obviously doesn't matter for user builds)

(b) cBPF [classic BPF, which the kernel internally translates to eBPF]
is still allowed, for both seccomp() and normal old-style socket
filters.

- bpf seccomp() is to the best of my knowledge used by normal play
store updatable applications (including the chrome web browser) for
sandboxing (of rendering processes), as such it would be basically
impossible to lock it down (as apps update independently of the rest
of the system) - and would probably be a net loss for security if you
did lock it down / break it...

If you wanted to pursue this you'd need to get agreement from Chrome &
other applications and provide some 'better' alternative.  Likely some
sort of hard-coded seccomp version that blocks things that most
sandboxing apps agree are beneficial to block...

(bpf seccomp() is also used by the Android zygote itself to block
various extra system calls from processes/apps it spawns, but as this
list is hardcoded at build time, it's not actually a problem)

- similarly old style BPF socket filters are 'normal' 'ancient'
BSD/Unix/Linux API.  They're used in the (privileged) network stack
itself (which is mainline updatable via the play store, including the
cbpf code), but could also AFAIK be used by random play store
applications - filtering on sockets is a truly ancient API.
https://www.tcpdump.org/papers/bpf-usenix93.pdf is from 1992
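
For reference, a classic BPF program of the kind discussed in (b) is just an
array of 8-byte struct sock_filter entries (16-bit code, two 8-bit jump
offsets, 32-bit constant). The Python sketch below only builds the bytes of a
small seccomp filter and does not install it; the helper names and the choice
of which syscall to deny are hypothetical, and the architecture check assumes
arm64.

```python
import struct

# Classic BPF opcodes, composed from the values in linux/bpf_common.h.
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_AARCH64 = 0xC00000B7
EPERM = 1


def insn(code, jt, jf, k):
    """Pack one struct sock_filter entry in native byte order."""
    return struct.pack("=HBBI", code, jt, jf, k)


def deny_one_syscall(nr):
    """Build a filter that fails syscall `nr` with EPERM and allows the rest."""
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, 4),                # load seccomp_data.arch
        insn(BPF_JEQ_K, 1, 0, AUDIT_ARCH_AARCH64),  # arch matches: skip kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        insn(BPF_LD_W_ABS, 0, 0, 0),                # load seccomp_data.nr
        insn(BPF_JEQ_K, 0, 1, nr),                  # no match: skip to allow
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM),
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
    ]
    return b"".join(prog)
```

Any unprivileged process can assemble an array like this and hand it to
seccomp(), which is why the cBPF path cannot be covered by the bpfloader
sepolicy rule above.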

-

Is there some eBPF program loading API I'm not aware of that we thus
haven't blocked?

On Thu, Sep 12, 2024 at 1:57 PM Neill Kapron <nkapron@google.com> wrote:
>
> On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > (Resending as plaintext for msm-kernel mailing list.
> > Original message was intended for android kernel team
> > though msm-kernel should be aware.)
> >
> > Hi Kernel Team,
> >
> > + Kees, Andy, and Will since their input may be valuable.
> >
> > It has been a while! (~9 months to be exact). This January, I sent out a small
> > message on BPF code loading ("unprivileged BPF considered harmful" or something
> > like that). In it, I noted new BPF programs are compiled all the time and
> > thrown into the kernel. At the time, I did not know these programs were just
> > compiled seccomp filter policies, loaded in as new BPF programs continuously
> > through the libminijail interface as well as direct syscall. As of two days
> > ago, I now know this (and now you do too, if not already).
> >
> > OK, yes, syscall filtering is very important, but this is creating a catch-22
> > issue. For one, see step (4) under "Exploitation overview" for
> > https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt.
> > Second, this minor lack of caching is adding load time to more than 90
> > binaries/services on the standard QCOM baseline—I'll admit, it is probably
> > negligible in the grand scheme of things (a quick approximation puts the data
> > operated on around 0.1188 MB). But most importantly, third, without some degree
> > of provenance, I have no way of telling if someone has injected malicious code
> > into the kernel, and unfortunately even knowing the correct bytes is still
> > "iffy", as in order to prevent JIT spray attacks, each of these filters is
> > offset by some random number of uint32_t's, making every 4-byte shift of the
> > filter a "valid" codepage to be loaded at runtime.
> >
> > You might be thinking, "but wait, bionic's libc only defines a couple of
> > restricted policies, primary and secondary for system and user apps
> > respectively." I know! For the most part, apps fall into either what I presume
> > is the default app/system policies, but there are lots of QCOM binaries and
> > other magic programs (dolby dax) that are sending up these programs as well.
> > I'm seeing more than 20 different programs for around a minute's worth of
> > runtime. One example is attached at the end.
> >
> > So, the proposal: a "CONFIG_SECCOMMP_STATIC_POLICY" for seccomp. This
> > would change the Android kernel's generic SYS_seccomp call, which takes in a
> > filter with an array of BPF instructions, to instead reference an ID which
> > corresponds to a fixed file on /sys/bpf/seccomp or something like that. The
> > sandboxing behavior of these apps should be known at compile-time, even if
> > there are multiple "permission set types" that may need to be dispatched. User
> > apps should always have a single, fixed policy. This way it is possible to say
> > for every code page loaded into the kernel where it came from and what it
> > should look like.
> >
> > Unfortunately, I do not know Motorola has enough "weight" to convince QCOM to
> > do the right foundational thing here, or to "define" the seccomp APIs for
> > Android, so it would be good to have Google's buy in, know if there are plans
> > to fix this issue, or some discussion of how to best fix the problem? If
> > anything, a contact at QCOM that might be able to actually hunt down and
> > document valid bytes for these policies?
> >
> > The end goal is simple: when we see a code page is allocated in the kernel, we
> > can be sure that (1) it isn't malicious and (2) has not been modified in
> > transit. I'm fine putting code where my mouth is, but right now that code
> > would involve having to fingerprint the signatures loaded by Qualcomm
> > components every time a new one is released, or pinging Google with a huge
> > patch changing how seccomp works with no idea of what requirements QCOM may
> > have on seccomp policy generation.
> >
> > Thoughts? Is this doable, and if not, why? I'd also love help with the code and
> > adapting existing minijail code to use a new, more integrity-preserving
> > interface. If I am mistaken and it is possible to grab out valid BPF policy
> > code at compile time, please let me know how!
> >
> > Regards,
> > Maxwell Bland
> >
> > Standard filter, (from, for example, com.google.android.gms)
> > "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007b000000000055000200d7000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f950000000000000055000100e200000005008f000000000055000200a7000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f9500000000000000550002003e000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f9500000000000000550002005c000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f95000000000000005500020060000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f950000000000000055000200cb0
00000b40000000000ff7f950000000000000055000100c600000005004c0000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f9500000000000000550002003d000000b40000000000ff7f95000000000000005500020017000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
> > Unknown filter (from QCOM's /vendor/bin/qesdk-secmanager)
> >  "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007e000000000055000100e2000000050098000000000055000200d7000000b40000000000ff7f950000000000000055000200a7000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002004f000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002007b000000b40000000000ff7f9500000000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f950000000000000055000100c600000005005800000000005500020060000000b40000000000ff7f9500000000000000550002008b000000b400000000
00ff7f950000000000000055000200cb000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f9500000000000000550002003b000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f950000000000000055000200b2000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
> >
> > List of services loading seccomp filters pulled from one run of the phone:
> > com.google.android.deskclock
> > /vendor/bin/qesdk-secmanager
> > media.hwcodec/vendor.qti.media.c2@1.0-service
> > media.audio.qc.codec.qti.media.c2audio@1.0-service
> > /vendor/bin/vendor.qti.qspmhal-service
> > /vendor/bin/qsap_sensors
> > media.extractoraextractor
> > /system_ext/bin/perfservice
> > /vendor/bin/wfdhdcphalservice
> > /vendor/bin/wifidisplayhalservice
> > /vendor/bin/qsap_dcfd
> > /vendor/bin/qms
> > /vendor/bin/qsap_location
> > /vendor/bin/qsap_qapeservice
> > /vendor/bin/wfdvndservice
> > media.swcodecoid.media.swcodec/bin/mediaswcodec
> > /vendor/bin/hw/qcrilNrd
> > qsap_qms_13qms16
> > qsap_qms_24qms17
> > /vendor/bin/ATFWD-daemon
> > /vendor/bin/hw/sxrservice
> > /vendor/bin/hw/qcrilNrd-c2
> > system_server
> > /vendor/bin/qmi_motext_hook1013170
> > /vendor/bin/qmi_motext_hook1013171
> > /vendor/bin/ims_rtp_daemon
> > com.android.systemui
> > webview_zygote
> > com.dolby.daxservice
> > vendor.qti.qesdk.sysservice
> > org.codeaurora.ims
> > com.android.se
> > com.android.phone
> > com.qti.qcc
> > com.google.android.ext.services
> > com.google.android.gms
> > com.google.android.euicc
> > com.google.android.googlequicksearchbox:interactor
> > com.google.android.apps.messaging:rcs
> > com.android.nfc
> > com.qualcomm.qti.workloadclassifier
> > com.qualcomm.location
> > com.google.android.gms.unstable
> > com.thundercomm.ar.core
> > com.android.vending:background
> > com.android.vending:quick_launch
> > com.android.dynsystem
> > com.android.managedprovisioning
> > com.android.shell
>
>
> + Jeff, Alistair, and Maciej
>
> Maxwell,
>
> Thanks for the details on this, I have added several people who may be
> better suited to comment on this.
>
> Neill

* Re: [External] Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-12 21:39   ` Maciej Żenczykowski
@ 2024-09-13 17:07     ` Maxwell Bland
  2024-09-13 17:12       ` Maxwell Bland
                         ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-13 17:07 UTC (permalink / raw)
  To: Maciej Żenczykowski, Neill Kapron
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva@google.com, jeffv@google.com

OK, after spending three hours working on this email, I think I know what to do
here. Since Moto's code for this stuff is forced to be open source anyways,
I'll spoil the solution:

Add a hook to seccomp which triggers/enables hooks in BPF's JIT to instrument
the output machine code page so that EL2 can (1) invert the machine code back
to BPF then (2) check the BPF corresponds to a valid seccomp filter policy.

It would need to be kept up to date with whatever seccomp decides to do, but I
can see a world where the result guarantees the code page has not been modified
in transit and corresponds to a reasonable seccomp policy.
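
As a rough illustration of step (2), here is a minimal userspace sketch that decodes one 8-byte eBPF instruction from hex text in the same memory-order form as the filter dumps in the original message, then checks whether it loads a recognised seccomp return action into w0. The names insn_fields, decode_insn_hex, is_known_seccomp_action and is_seccomp_ret_load are hypothetical, a little-endian layout is assumed, and a real check at EL2 would need to follow control flow rather than inspect single instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of one 8-byte eBPF instruction (hypothetical mirror of struct bpf_insn). */
struct insn_fields {
        uint8_t opcode;
        uint8_t dst_reg;
        uint8_t src_reg;
        int16_t off;
        int32_t imm;
};

static int hex_nibble(char c)
{
        if (c >= '0' && c <= '9')
                return c - '0';
        if (c >= 'a' && c <= 'f')
                return c - 'a' + 10;
        if (c >= 'A' && c <= 'F')
                return c - 'A' + 10;
        return -1;
}

/* Decode 16 hex characters given in memory order (assumes little-endian).
 * Returns 0 on success, -1 if the input is short or not hex. */
static int decode_insn_hex(const char *hex, struct insn_fields *out)
{
        uint8_t b[8];

        for (size_t i = 0; i < 8; i++) {
                int hi = hex_nibble(hex[2 * i]);
                /* Stop at the first bad character so we never read past a NUL */
                int lo = hi < 0 ? -1 : hex_nibble(hex[2 * i + 1]);

                if (hi < 0 || lo < 0)
                        return -1;
                b[i] = (uint8_t)((hi << 4) | lo);
        }
        out->opcode = b[0];
        out->dst_reg = b[1] & 0x0f;     /* low nibble on little-endian */
        out->src_reg = b[1] >> 4;
        out->off = (int16_t)(uint16_t)(b[2] | (b[3] << 8));
        out->imm = (int32_t)((uint32_t)b[4] | ((uint32_t)b[5] << 8) |
                             ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24));
        return 0;
}

/* Compare the action bits against the SECCOMP_RET_* values from the UAPI header. */
static int is_known_seccomp_action(uint32_t ret)
{
        switch (ret & 0xffff0000u) {
        case 0x80000000u: /* SECCOMP_RET_KILL_PROCESS */
        case 0x00000000u: /* SECCOMP_RET_KILL_THREAD */
        case 0x00030000u: /* SECCOMP_RET_TRAP */
        case 0x00050000u: /* SECCOMP_RET_ERRNO */
        case 0x7fc00000u: /* SECCOMP_RET_USER_NOTIF */
        case 0x7ff00000u: /* SECCOMP_RET_TRACE */
        case 0x7ffc0000u: /* SECCOMP_RET_LOG */
        case 0x7fff0000u: /* SECCOMP_RET_ALLOW */
                return 1;
        default:
                return 0;
        }
}

/* Heuristic: opcode 0xb4 is a 32-bit "mov dst, imm"; dst 0 is the return register. */
static int is_seccomp_ret_load(const struct insn_fields *in)
{
        return in->opcode == 0xb4 && in->dst_reg == 0 &&
               is_known_seccomp_action((uint32_t)in->imm);
}
```

For instance, the recurring pair "b40000000000ff7f" followed by "9500000000000000" in the dumps decodes to a load of 0x7fff0000 (SECCOMP_RET_ALLOW) into w0 followed by an exit.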

I will say, I'm not the biggest fan of this. I am a fan of SYS_seccomp
implicitly compiling the filters at build-time, so I can just know immediately
what the new code pages "should be". That said, I think my solution also
resolves the issue of an adversary using the BRK instruction padding to
generate a "valid" codepage at an invalid offset.
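
To make the padding concern concrete, here is a minimal sketch of a check that accepts a code region only if it is exactly fill words, then the expected program, then fill words, and reports where the program starts. It assumes the fill word is BRK #0x100 (0xd4202000), ignores the JIT's binary header, and assumes the program does not itself begin with the fill word; locate_padded_prog and FILL_INSN are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed fill pattern: the A64 encoding of BRK #0x100 */
#define FILL_INSN 0xd4202000u

/* Returns the word index at which prog starts inside page, or -1 if the page
 * is anything other than: fill words, prog, fill words. */
static long locate_padded_prog(const uint32_t *page, size_t page_words,
                               const uint32_t *prog, size_t prog_words)
{
        size_t start = 0;
        size_t i;

        if (prog_words == 0 || prog_words > page_words)
                return -1;

        /* Skip the leading fill; the program starts at the first other word */
        while (start < page_words && page[start] == FILL_INSN)
                start++;
        if (page_words - start < prog_words)
                return -1;

        /* The program must match word for word */
        for (i = 0; i < prog_words; i++)
                if (page[start + i] != prog[i])
                        return -1;

        /* Everything after the program must be fill again */
        for (i = start + prog_words; i < page_words; i++)
                if (page[i] != FILL_INSN)
                        return -1;

        return (long)start;
}
```

A check of this shape accepts the same program at any 4-byte offset, which is why knowing the correct bytes alone does not pin down a single valid page image.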

I've included a few other responses just for kicks, since you should know I've
been working hard on this problem for more than a year; I'm not just emailing
things to sound cool and waste time (OK maybe a little of that, but this is
also a serious, honest effort to understand the problem!). (-:

> If you can prove it isn't

To test it yourself, it is easiest to add a printk statement under
bpf_int_jit_compile, or try to implement a system for checking the integrity of
page table updates, or add a print statement to the page table update code in
vmalloc, or enable the CONFIG_PTDUMP_DEBUGFS options. Use my patch here if you
want to see decent output:
https://lore.kernel.org/all/2bcb3htsjhepxdybpw2bwot2jnuezl3p5mnj5rhjwgitlsufe7@xzhkyntridw3/
or I've also attached a kernel module which is a part of this "OpenKP" project
I am working on, which should provide a larger, open-source framework and
standard for the ARM community to provide hypervisor-enforced code integrity on
Android / QCOM chipsets, so you can see 2% of the work I've done over the past
2 years and test it out yourself. Uncomment the part under "DEBUG" and read
through it, test it out.

I can submit a formal patch with printk statements for you to test out if that
is needed? Or just trust me, lol. I'm probably going to just go work on
that instrumentation step I mentioned earlier. (-:

>selinux

Note that all of this loading is outside the scope of SELinux; it is a
side effect of the SYS_seccomp system call as used by privileged system
services.

>cBPF [classic BPF, internally the kernel translates this to eBPF] is still
>allowed,

These programs will not print out using PTRACE and are difficult to audit
without patching the seccomp calls yourself, because the ptrace call to
PTRACE_SECCOMP_GET_FILTER will fail. I believe (have not checked) this is
because they are not cBPF, and seccomp's logic makes prog->fprog evaluate to
null despite prog existing if it is cBPF, at least on Android 14. I spent a
whole day getting frustrated with the failing ptrace call before finally ending
up with my patches (attached at the end) that instrument ptrace and can print
the programs.
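
For reference, the userspace side of that request can be sketched as below. Per ptrace(2), PTRACE_SECCOMP_GET_FILTER takes the filter index in addr and returns the number of classic BPF instructions when data is NULL; it needs CAP_SYS_ADMIN and a kernel built with CONFIG_CHECKPOINT_RESTORE, which may be one reason the call fails on a given device. seccomp_filter_len is a hypothetical helper, and the tracee is assumed to be attached and stopped already.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Older libc headers may not define the request; 0x420c is its UAPI value */
#ifndef PTRACE_SECCOMP_GET_FILTER
#define PTRACE_SECCOMP_GET_FILTER 0x420c
#endif

/* Returns the instruction count of seccomp filter number `index` on the
 * tracee `pid`, or -1 with errno set if the request fails. */
static long seccomp_filter_len(pid_t pid, unsigned long index)
{
        errno = 0;
        /* data == NULL asks only for the length, not the instructions */
        return ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)index, NULL);
}
```

The request value 0x420c is the same number the ptrace_request_handler in main.c compares against.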

>a net loss for security if you did lock it down / break it

I am a fan of seccomp and I don't want to break it and I don't want to "lock it
down", I want to ask people nicely to provide the code pages they want in the
kernel!

Thanks,
Maxwell Bland

As a P.S., maybe I should add context, though I don't know whether it is
needed:
Many, many exploits for the kernel over the past decade rely on write
gadgets to modify kernel resources, such as the exploit I linked in my original
email, Project Zero's recent
https://googleprojectzero.blogspot.com/2023/09/analyzing-modern-in-wild-android-exploit.html
or the more recent https://pwning.tech/nftables/. We can't begin to make honest
progress on the existing exploits until we nail down the basic rule that
privileged executable pages are immutable in Android. My goal is to eventually
make a standard framework for EL2-based kernel protection open source, then we
have a counter of the 29,000ish writable data structures, and well-defined
mechanisms for preventing malicious modification via write gadgets (like we see
with kworker queues, task cred structs back in the day, etc, etc). Once I've
locked down 1 and 3 of 1) integrity of loaded code pages, 2) system control
register modifications such as TCR (this is a pain in the *** because
snapdragon chipsets are a pain in the *** sometimes), and 3) writing a couple
of testcases to lock-down kworker queues and other data structures (e.g. fops)
at EL2 and fix, among other exploits,
https://github.blog/security/vulnerability-research/the-android-kernel-mitigations-obstacle-race/,
I will work with Moto's legal to try and open source the solution and send it
to the ARM mailing list, since eventually these hacks should be polished and
made into kconfigs as part of the GKI for Android's good.

This is all "goals" though, but I figured I would plug the effort.

main.c:

// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2023 Motorola Mobility, Inc.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Kernel module that hooks the vmalloc infrastructure to ensure that code
 * pages are not interleaved with data pages unless at a PMD level granularity.
 * Must be loaded prior to other kernel mechanisms leveraging code page
 * allocation, e.g. BPF, EROFS fixmap.
 */


#include <linux/kernel.h>
#include <linux/bpf.h>
#include <linux/mutex.h>
#include <linux/atomic.h>
#include <linux/highmem.h>
#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/of_platform.h>
#include <linux/pagewalk.h>
#include <linux/types.h>
#include <linux/moduleloader.h>
#include <linux/vmalloc.h>
#include <linux/gfp_types.h>
#include <linux/seccomp.h>
#include <asm/pgalloc.h>
#include <asm/ptrace.h>
#include <asm/patching.h>
#include <asm/module.h>
#include <asm/page.h>
#include <asm/seccomp.h>

#ifdef SECCOMP_ARCH_NATIVE
/**
 * struct action_cache - per-filter cache of seccomp actions per
 * arch/syscall pair
 *
 * @allow_native: A bitmap where each bit represents whether the
 *                filter will always allow the syscall, for the
 *                native architecture.
 * @allow_compat: A bitmap where each bit represents whether the
 *                filter will always allow the syscall, for the
 *                compat architecture.
 */
struct action_cache {
        DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
#ifdef SECCOMP_ARCH_COMPAT
        DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
#endif
};
#else
struct action_cache { };
#endif

struct seccomp_filter {
        refcount_t refs;
        refcount_t users;
        bool log;
        bool wait_killable_recv;
        struct action_cache cache;
        struct seccomp_filter *prev;
        struct bpf_prog *prog;
        struct notification *notif;
        struct mutex notify_lock;
        wait_queue_head_t wqh;
};



void print_bpf_prog_aux(struct bpf_prog_aux *aux) {
        printk("BPF Program Aux Details:\n");
        printk("Ref Count: %lld\n", atomic64_read(&aux->refcnt));
        printk("Used Map Count: %u\n", aux->used_map_cnt);
        printk("Used BTF Count: %u\n", aux->used_btf_cnt);
        printk("Max Context Offset: %u\n", aux->max_ctx_offset);
        printk("Max Packet Offset: %u\n", aux->max_pkt_offset);
        printk("Max TP Access: %u\n", aux->max_tp_access);
        printk("Stack Depth: %u\n", aux->stack_depth);
        printk("ID: %u\n", aux->id);
        printk("Function Count: %u\n", aux->func_cnt);
        printk("Function Index: %u\n", aux->func_idx);
        printk("Attach BTF ID: %u\n", aux->attach_btf_id);
        printk("Context Arg Info Size: %u\n", aux->ctx_arg_info_size);
        printk("Max Read-Only Access: %u\n", aux->max_rdonly_access);
        printk("Max Read-Write Access: %u\n", aux->max_rdwr_access);
        printk("Attach BTF: %p\n", aux->attach_btf);
        printk("Context Arg Info: %p\n", aux->ctx_arg_info);
        printk("DST Mutex: %p\n", &aux->dst_mutex);
        printk("DST Program: %p\n", aux->dst_prog);
        printk("DST Trampoline: %p\n", aux->dst_trampoline);
        printk("Saved DST Program Type: %d\n", aux->saved_dst_prog_type);
        printk("Saved DST Attach Type: %d\n", aux->saved_dst_attach_type);
        printk("Verifier Zero Extension: %u\n", aux->verifier_zext);
        printk("Attach BTF Trace: %u\n", aux->attach_btf_trace);
        printk("Function Proto Unreliable: %u\n", aux->func_proto_unreliable);
        printk("Sleepable: %u\n", aux->sleepable);
        printk("Tail Call Reachable: %u\n", aux->tail_call_reachable);
        printk("XDP Has Frags: %u\n", aux->xdp_has_frags);
        printk("Attach Func Proto: %p\n", aux->attach_func_proto);
        printk("Attach Func Name: %s\n", aux->attach_func_name);
        printk("Functions: %p\n", aux->func);
        printk("JIT Data: %p\n", aux->jit_data);
        printk("Poke Table: %p\n", aux->poke_tab);
        printk("Kfunc Table: %p\n", aux->kfunc_tab);
        printk("Kfunc BTF Table: %p\n", aux->kfunc_btf_tab);
        printk("Size Poke Table: %u\n", aux->size_poke_tab);
        printk("Ksym: %p\n", &aux->ksym);
        printk("Operations: %p\n", aux->ops);
        printk("Used Maps: %p\n", aux->used_maps);
        printk("Used Maps Mutex: %p\n", &aux->used_maps_mutex);
        printk("Used BTFs: %p\n", aux->used_btfs);
        printk("Program: %p\n", aux->prog);
        printk("User: %p\n", aux->user);
        printk("Load Time: %llu\n", aux->load_time);
        printk("Verified Instructions: %u\n", aux->verified_insns);
        printk("Cgroup Attach Type: %d\n", aux->cgroup_atype);
        printk("Cgroup Storage: %p\n", aux->cgroup_storage);
        printk("Name: %s\n", aux->name);
}

void print_bpf_prog_insnsi(struct bpf_insn * insns, uint64_t len) {
        int i;
        for (i = 0; i < len; i++) {
                const struct bpf_insn *insn = &insns[i];
                printk("BPF INSN %016llx\n", *((uint64_t *)insn));
        }
}

void print_bpf_prog(struct bpf_prog *prog) {
        printk("BPF Program Details:\n");
        printk("Pages: %u\n", prog->pages);
        printk("JITed: %u\n", prog->jited);
        printk("JIT Requested: %u\n", prog->jit_requested);
        printk("GPL Compatible: %u\n", prog->gpl_compatible);
        printk("Control Block Access: %u\n", prog->cb_access);
        printk("DST Needed: %u\n", prog->dst_needed);
        printk("Blinding Requested: %u\n", prog->blinding_requested);
        printk("Blinded: %u\n", prog->blinded);
        printk("Is Function: %u\n", prog->is_func);
        printk("Kprobe Override: %u\n", prog->kprobe_override);
        printk("Has Callchain Buffer: %u\n", prog->has_callchain_buf);
        printk("Enforce Expected Attach Type: %u\n", prog->enforce_expected_attach_type);
        printk("Call Get Stack: %u\n", prog->call_get_stack);
        printk("Call Get Func IP: %u\n", prog->call_get_func_ip);
        printk("Timestamp Type Access: %u\n", prog->tstamp_type_access);
        printk("Type: %d\n", prog->type);
        printk("Expected Attach Type: %d\n", prog->expected_attach_type);
        printk("Length: %u\n", prog->len);
        printk("JITed Length: %u\n", prog->jited_len);
        /* %*phN prints the whole tag as one hex string on a single log line;
         * separate printk calls would each start a new line */
        printk("Tag: %*phN\n", BPF_TAG_SIZE, prog->tag);
        printk("Stats: %p\n", prog->stats);
        printk("Active: %p\n", prog->active);
        printk("AUX FIELDS:\n");
        print_bpf_prog_aux(prog->aux);
        print_bpf_prog_insnsi(prog->insnsi, prog->len);
}


/* Functions we need for patching dynamic code allocations */
typedef void *(*module_alloc_t)(unsigned long size);
module_alloc_t module_alloc_ind;
typedef void (*module_memfree_t)(void *module_region);
module_memfree_t module_memfree_ind;

/* TODO: actually we could probably just include "net/bpf_jit.h" */
typedef int (*aarch64_insn_patch_text_nosync_t)(void *addr, u32 insn);
aarch64_insn_patch_text_nosync_t aarch64_insn_patch_text_nosync_ind;
typedef u32 (*aarch64_insn_gen_branch_imm_t)(unsigned long pc,
                                             unsigned long addr,
                                             enum aarch64_insn_branch_type type);
aarch64_insn_gen_branch_imm_t aarch64_insn_gen_branch_imm_ind;
typedef u32 (*aarch64_insn_gen_hint_t)(enum aarch64_insn_hint_cr_op op);
aarch64_insn_gen_hint_t aarch64_insn_gen_hint_ind;
typedef u32 (*aarch64_insn_gen_branch_reg_t)(
        enum aarch64_insn_register reg, enum aarch64_insn_branch_type type);
aarch64_insn_gen_branch_reg_t aarch64_insn_gen_branch_reg_ind;
typedef void *(*__vmalloc_node_range_t)(unsigned long size, unsigned long align,
                                        unsigned long start, unsigned long end,
                                        gfp_t gfp_mask, pgprot_t prot,
                                        unsigned long vm_flags, int node,
                                        const void *caller);
__vmalloc_node_range_t __vmalloc_node_range_ind;

/* Used for reworking the kprobe allocator */
typedef int (*collect_garbage_slots_t)(struct kprobe_insn_cache *c);
collect_garbage_slots_t collect_garbage_slots_ind;

static struct kprobe kallsyms_lookup_name_kp = {
        .symbol_name = "kallsyms_lookup_name",
        .addr = 0
};
typedef unsigned long (*kallsyms_lookup_name_t)(const char *name);
kallsyms_lookup_name_t kallsyms_lookup_name_ind;

/* Functions we are patching */
static struct kprobe alloc_vmap_area_kp = {
        .symbol_name = "alloc_vmap_area",
        .addr = 0
};

/* DEBUG: bpf allocation printing */
// static struct kprobe bpf_int_jit_compile_kp = { .symbol_name = "bpf_int_jit_compile",
// .addr = 0 };
static struct kprobe ptrace_request_kp = {
        .symbol_name = "ptrace_request",
        .addr = 0
};
/* END DEBUG */

/* Static variables that must be manually accessed for definition */
u64 module_alloc_base;
struct kprobe_insn_cache *kprobe_insn_slots_ptr;

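/**
 * get_kp_addr - resolve a symbol's address by briefly registering a kprobe
 *
 * Registers @kp, reads back the address the kprobe core resolved for
 * kp->symbol_name, then unregisters it. Returns 0 if registration fails.
 *
 * TODO comment rest of file
 */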
static __always_inline void *get_kp_addr(struct kprobe *kp)
{
        void *res = 0;
        if (register_kprobe(kp)) {
                pr_err("Error: moto_org_mem failed to get kp addr for %s\n",
                       kp->symbol_name);
                return 0;
        }
        res = kp->addr;
        unregister_kprobe(kp);
        return res;
}

static void *bpf_jit_alloc_exec_handler(unsigned long size)
{
        return module_alloc_ind(size);
}

static void bpf_jit_free_exec_handler(void *addr)
{
        module_memfree_ind(addr);
}

static u64 bpf_jit_alloc_exec_limit_handler(void)
{
        return MODULES_END - MODULES_VADDR;
}

static void *alloc_insn_page_handler(void)
{
        return __vmalloc_node_range_ind(PAGE_SIZE, 1, module_alloc_base,
                                        module_alloc_base + SZ_2G, GFP_KERNEL,
                                        PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
                                        NUMA_NO_NODE,
                                        __builtin_return_address(0));
}

static bool allocation_balance = false;

/**
 * alloc_vmap_area_handler - adjusts vstart, vend to not interleave code/data
 *
 * Right now, vmalloc infrastructure does the following:
 * |<-----data----->||<-----code and data pages----->||<-----data----->|
 * Maintainers likely do not want to touch vmalloc internals for fear of
 * breaking everything, so we provide an open-source work-around with hopes
 * that these fixes will make their way into the mainline kernel.
 *
 * We adjust the parameters to the call to avoid the code memory range by
 * selecting the lower half, then in a separate post handler, we check whether
 * the allocation failed, and if so, run the allocation with the upper half.
 *
 * TODO: we need to remove the flip/flopping and properly segment the memory
 * here, but it is not clear how to do this without modifying core vmalloc
 * infrastructure. See upstream patch here:
 * https://lore.kernel.org/all/20240423095843.446565600-1-mbland@motorola.com/#t
 *
 * Parameters are passed in the arm64 linux kernel following the AAPCS64 ABI
 * convention, and thus it is safe to interpolate based upon the signature
 * the location of the specific values for vstart and vend.
 * https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
 */
static int alloc_vmap_area_handler(struct kprobe *kp, struct pt_regs *regs)
{
        unsigned long size;
        unsigned long vstart;
        size = regs->regs[0];
        vstart = regs->regs[2];
        if (vstart == VMALLOC_START) { /* We are attempting to vmalloc data */
                /* Everything is fine, do nothing */
                if (module_alloc_base + SZ_2G <= VMALLOC_START ||
                    module_alloc_base > VMALLOC_END)
                        return 0;

                allocation_balance = !allocation_balance;

                /* Not enough room below, else if not enough room above.
                 * Room above is measured from the end of the 2G code range. */
                if (module_alloc_base - VMALLOC_START < size)
                        allocation_balance = true;
                else if (VMALLOC_END - (module_alloc_base + SZ_2G) < size)
                        allocation_balance = false;

                /* Allocate from higher valued addresses or lower valued
                 * address evenly. since these are virtual it does not
                 * really matter */
                if (allocation_balance) {
                        regs->regs[2] = module_alloc_base + SZ_2G;
                } else {
                        regs->regs[3] = module_alloc_base;
                }
        }

        return 0;
}

/* DEBUG: Analyze allocated BPF programs */
// static int bpf_int_jit_compile_handler(struct kprobe *kp, struct pt_regs *regs)
// {
//         // struct bpf_prog *prog = (struct bpf_prog *)regs->regs[0];
//         // print_bpf_prog(prog);
//         return 0;
// }
//
static int ptrace_request_handler(struct kprobe *kp, struct pt_regs *regs)
{
        struct task_struct *task = (struct task_struct *)regs->regs[0];
        long request = regs->regs[1];
        unsigned long addr = regs->regs[2];
        struct seccomp_filter *filter;
        /* 0x420c is PTRACE_SECCOMP_GET_FILTER */
        if (request != 0x420c) {
                return 0;
        }
        if (addr != 13371337) {
                printk("waiting for regs ... %llx\n", regs->regs[1]);
                return 0;
        }

        if (!task) {
                printk("ptrace_request_handler no task\n");
                return 0;
        }

        filter = READ_ONCE(task->seccomp.filter);
        printk("TASK PID %d or %d\n", task->pid, pid_vnr(task_pgrp(task)));
        if (!filter) {
                printk("ptrace_request_handler no filter\n");
                return 0;
        }
        if (filter->prog)
                print_bpf_prog(filter->prog);

        return 0;
}
/* END DEBUG */


void __always_inline patch_jump_to_handler(void *faddr, void *helper)
{
        u32 insn;
        insn = aarch64_insn_gen_branch_imm_ind((unsigned long)faddr,
                                               (unsigned long)helper,
                                               AARCH64_INSN_BRANCH_NOLINK);
        aarch64_insn_patch_text_nosync_ind(faddr, insn);
}

struct kprobe_insn_page {
        struct list_head list;
        kprobe_opcode_t *insns; /* Page of instruction slots */
        struct kprobe_insn_cache *cache;
        int nused;
        int ngarbage;
        char slot_used[];
};

void free_insn_pages(struct kprobe_insn_cache *kic)
{
        struct kprobe_insn_page *kip, *tmp;

        /* TODO: Since the slot array is not protected by rcu, we need a mutex,
         * but we should also be the only thing running that is touching
         * the kprobes */
        /* Entries are unlinked and freed inside the loop, so use the _safe
         * iterator, which caches the next entry before the body runs */
        list_for_each_entry_safe (kip, tmp, &kic->pages, list) {
                /* Clear slot_used[0..nused-1] and reset the count; the old
                 * loop advanced i while shrinking nused and stopped halfway */
                while (kip->nused > 0)
                        kip->slot_used[--kip->nused] = 0;
                list_del_rcu(&kip->list);
                synchronize_rcu();
                kip->cache->free(kip->insns);
                kfree(kip);
        }
}

/**
 * mod_init - TODO
 *
 * TODO FAIL IF ANY OF THE BELOW FAILS
 */
static int __init mod_init(void)
{
        void *bpf_jit_alloc_exec_addr = 0;
        void *bpf_jit_free_exec_addr = 0;
        void *bpf_jit_alloc_exec_limit_addr = 0;
        void *alloc_insn_page_addr = 0;
        kallsyms_lookup_name_ind =
                (kallsyms_lookup_name_t)get_kp_addr(&kallsyms_lookup_name_kp);

        module_alloc_ind =
                (module_alloc_t)kallsyms_lookup_name_ind("module_alloc");
        module_memfree_ind =
                (module_memfree_t)kallsyms_lookup_name_ind("module_memfree");
        __vmalloc_node_range_ind =
                (__vmalloc_node_range_t)kallsyms_lookup_name_ind(
                        "__vmalloc_node_range");
        aarch64_insn_patch_text_nosync_ind =
                (aarch64_insn_patch_text_nosync_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_patch_text_nosync");
        aarch64_insn_gen_branch_imm_ind =
                (aarch64_insn_gen_branch_imm_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_gen_branch_imm");
        aarch64_insn_gen_hint_ind =
                (aarch64_insn_gen_hint_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_gen_hint");
        aarch64_insn_gen_branch_reg_ind =
                (aarch64_insn_gen_branch_reg_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_gen_branch_reg");

        collect_garbage_slots_ind =
                (collect_garbage_slots_t)kallsyms_lookup_name_ind(
                        "collect_garbage_slots");

        bpf_jit_alloc_exec_addr =
                (void *)kallsyms_lookup_name_ind("bpf_jit_alloc_exec");
        bpf_jit_free_exec_addr =
                (void *)kallsyms_lookup_name_ind("bpf_jit_free_exec");
        bpf_jit_alloc_exec_limit_addr =
                (void *)kallsyms_lookup_name_ind("bpf_jit_alloc_exec_limit");
        alloc_insn_page_addr =
                (void *)kallsyms_lookup_name_ind("alloc_insn_page");

        module_alloc_base =
                *((u64 *)kallsyms_lookup_name_ind("module_alloc_base"));

        patch_jump_to_handler(bpf_jit_alloc_exec_addr,
                              bpf_jit_alloc_exec_handler);
        patch_jump_to_handler(bpf_jit_free_exec_addr,
                              bpf_jit_free_exec_handler);
        patch_jump_to_handler(bpf_jit_alloc_exec_limit_addr,
                              bpf_jit_alloc_exec_limit_handler);
        patch_jump_to_handler(alloc_insn_page_addr, alloc_insn_page_handler);

        /*
         * Under the hood, arm64 calls __get_insn_slot to generate memory pages for
         * kprobes, and these memory pages *supposedly* access an indirect pointer to
         * their allocation function through kprobe_insn_slots. Because we allocated
         * a kprobe in order to access kallsyms_lookup_name, one page is already allocated.
         * However, even kprobe garbage collection cowardly refuses to kill the last page,
         * so we have our own free routine that nixes that last survivor.
         */
        kprobe_insn_slots_ptr =
                (struct kprobe_insn_cache *)kallsyms_lookup_name_ind(
                        "kprobe_insn_slots");
        free_insn_pages(kprobe_insn_slots_ptr);

        alloc_vmap_area_kp.pre_handler = alloc_vmap_area_handler;
        if (register_kprobe(&alloc_vmap_area_kp)) {
                pr_err("moto_org_mem.ko failed to hook alloc_vmap_area!\n");
                return -EACCES;
        }

        /* DEBUG */
        // bpf_int_jit_compile_kp.pre_handler = bpf_int_jit_compile_handler;
        // if (register_kprobe(&bpf_int_jit_compile_kp)) {
        //         pr_err("moto_org_mem.ko failed to hook bpf_int_jit_compile!\n");
        //         return -EACCES;
        // }

        ptrace_request_kp.pre_handler = ptrace_request_handler;
        if (register_kprobe(&ptrace_request_kp)) {
                pr_err("moto_org_mem.ko failed to hook ptrace_request_kp!\n");
                return -EACCES;
        }

        /* END DEBUG */
        pr_info("moto_org_mem loaded!\n");

        return 0;
}

static void __exit mod_exit(void)
{
}

module_init(mod_init);
module_exit(mod_exit);

MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Maxwell Bland <mbland@motorola.com>");
MODULE_DESCRIPTION("Organizes vmalloc memory so that code pages are not "
                   "interleaved with data pages.");

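For anyone who wants to read the filter dumps quoted in this thread offline,
here is a minimal userspace sketch. It assumes each group of 16 hex digits is
the in-memory image of one struct bpf_insn on a little-endian machine (opcode
byte, dst/src register nibbles, 16-bit offset, 32-bit immediate). The names
struct insn_fields and decode_insn() are made up for illustration and are not
kernel APIs.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields of one eBPF instruction. */
struct insn_fields {
        uint8_t code;    /* opcode */
        uint8_t dst_reg; /* low nibble of byte 1 */
        uint8_t src_reg; /* high nibble of byte 1 */
        int16_t off;     /* bytes 2..3, little-endian */
        int32_t imm;     /* bytes 4..7, little-endian */
};

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
        if (c >= '0' && c <= '9')
                return c - '0';
        if (c >= 'a' && c <= 'f')
                return c - 'a' + 10;
        if (c >= 'A' && c <= 'F')
                return c - 'A' + 10;
        return -1;
}

/*
 * Decode the first 16 hex digits of `hex` into `out`.
 * Returns 0 on success, -1 if the input is too short or not hex.
 */
int decode_insn(const char *hex, struct insn_fields *out)
{
        uint8_t b[8];
        int i;

        for (i = 0; i < 8; i++) {
                int hi = hex_nibble(hex[2 * i]);
                /* Stop before reading past a terminating NUL. */
                int lo = hi < 0 ? -1 : hex_nibble(hex[2 * i + 1]);

                if (hi < 0 || lo < 0)
                        return -1;
                b[i] = (uint8_t)((hi << 4) | lo);
        }

        out->code = b[0];
        out->dst_reg = b[1] & 0x0f;
        out->src_reg = b[1] >> 4;
        out->off = (int16_t)(uint16_t)(b[2] | (b[3] << 8));
        out->imm = (int32_t)((uint32_t)b[4] | ((uint32_t)b[5] << 8) |
                             ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24));
        return 0;
}
```

Under that assumption, the fifth group of the standard filter,
b4020000b70000c0, decodes to opcode 0xb4 (32-bit move immediate) with dst_reg 2
and imm 0xc00000b7. That immediate is the value of AUDIT_ARCH_AARCH64, so it
looks like the usual architecture check at the top of a seccomp filter.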


________________________________________
From: Maciej Żenczykowski <maze@google.com>
Sent: Thursday, September 12, 2024 4:39 PM
To: Neill Kapron
Cc: Maxwell Bland; linux-arm-msm@vger.kernel.org; Andrew Wheeler; Sammy BS2 Que | 阙斌生; Todd Kjos; Viktor Martensson; Andy Lutomirski; keescook@chromium.org; Will Drewry; Andy Gross; Bjorn Andersson; Konrad Dybcio; kernel-team; adelva@google.com; jeffv@google.com
Subject: [External] Re: [RFC] Proposal: Static SECCOMP Policies

wrt. BPF on Android:

(a) eBPF should already be locked down to just the bpfloader boot time process.

If you can prove it isn't, please let us know, but as this is sepolicy
around the bpf(BPF_PROG_LOAD) system call, it should be pretty
airtight:

allow bpfloader self:bpf { ... prog_load ... };
...
neverallow { domain -bpfloader } *:bpf prog_load;

(basically the only exception to the above is root/su on userdebug/eng
builds, which runs sepolicy in permissive mode and thus doesn't
enforce the above - but that obviously doesn't matter for user builds)

(b) cBPF [classic BPF, internally the kernel translates this to eBPF]
is still allowed, for both seccomp() and normal old style socket
filters

- bpf seccomp() is to the best of my knowledge used by normal play
store updatable applications (including the chrome web browser) for
sandboxing (of rendering processes), as such it would be basically
impossible to lock it down (as apps update independently of the rest
of the system) - and would probably be a net loss for security if you
did lock it down / break it...

If you wanted to pursue this you'd need to get agreement from Chrome &
other applications and provide some 'better' alternative.  Likely some
sort of hard coded seccomp version that blocks things that most
sandboxing apps agree is beneficial to block...

(bpf seccomp() is also used by the Android zygote itself to block
various extra system calls from processes/apps it spawns, but as this
list is hardcoded at build time, it's not actually a problem)

- similarly old style BPF socket filters are 'normal' 'ancient'
BSD/Unix/Linux API.  They're used in the (privileged) network stack
itself (which is mainline updatable via the play store, including the
cbpf code), but could also AFAIK be used by random play store
applications - filtering on sockets is truly ancient api.
https://www.tcpdump.org/papers/bpf-usenix93.pdf is from 1992
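
To make point (b) concrete: as far as I can tell, a classic-BPF seccomp filter
never goes through bpf(BPF_PROG_LOAD), so the sepolicy rule above does not
cover it. Below is a minimal sketch of how an unprivileged process builds and
installs such a filter. build_deny_one_filter() and install_filter() are
hypothetical helper names, and the architecture check that a real filter needs
is left out for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define DENY_ONE_FILTER_LEN 4

/*
 * Fill `out` (room for DENY_ONE_FILTER_LEN entries) with a cBPF program that
 * fails syscall number `nr` with EPERM and allows everything else.
 * NOTE: no seccomp_data.arch check here; a production filter must have one.
 */
size_t build_deny_one_filter(struct sock_filter *out, unsigned int nr)
{
        const struct sock_filter prog[DENY_ONE_FILTER_LEN] = {
                /* A = seccomp_data.nr */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                /* if (A == nr) fall through, else skip one instruction */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
                BPF_STMT(BPF_RET | BPF_K,
                         SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };

        memcpy(out, prog, sizeof(prog));
        return DENY_ONE_FILTER_LEN;
}

/* Install the filter on the calling thread. Returns 0 on success. */
int install_filter(struct sock_filter *insns, size_t len)
{
        struct sock_fprog fprog = {
                .len = (unsigned short)len,
                .filter = insns,
        };

        /* Needed when the caller lacks CAP_SYS_ADMIN. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
                return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

Every process that does this hands the kernel a fresh cBPF program, which is
then translated to eBPF internally as described in (b).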

-

Is there some eBPF program loading API I'm not aware of that we thus
haven't blocked?

On Thu, Sep 12, 2024 at 1:57 PM Neill Kapron <nkapron@google.com> wrote:
>
> On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > (Resending as plaintext for msm-kernel mailing list.
> > Original message was intended for android kernel team
> > though msm-kernel should be aware.)
> >
> > Hi Kernel Team,
> >
> > + Kees, Andy, and Will since their input may be valuable.
> >
> > It has been a while! (~9 months to be exact). This January, I sent out a small
> > message on BPF code loading ("unprivileged BPF considered harmful" or something
> > like that). In it, I noted new BPF programs are compiled all the time and
> > thrown into the kernel. At the time, I did not know these programs were just
> > compiled seccomp filter policies, loaded in as new BPF programs continuously
> > through the libminijail interface as well as direct syscall. As of two days
> > ago, I now know this (and now you do too, if not already).
> >
> > OK, yes, syscall filtering is very important, but this is creating a catch-22
> > issue. For one, see step (4) under "Exploitation overview" for
> > https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt.
> > Second, this minor lack of caching is adding load time to more than 90
> > binaries/services on the standard QCOM baseline—I'll admit, it is probably
> > negligible in the grand scheme of things (a quick approximation puts the data
> > operated on around 0.1188 MB). But most importantly, third, without some degree
> > of provenance, I have no way of telling if someone has injected malicious code
> > into the kernel, and unfortunately even knowing the correct bytes is still
> > "iffy", as in order to prevent JIT spray attacks, each of these filters is
> > offset by some random number of uint32_t's, making every 4-byte shift of the
> > filter a "valid" codepage to be loaded at runtime.
> >
> > You might be thinking, "but wait, bionic's libc only defines a couple of
> > restricted policies, primary and secondary for system and user apps
> > respectively." I know! For the most part, apps fall into either what I presume
> > is the default app/system policies, but there are lots of QCOM binaries and
> > other magic programs (dolby dax) that are sending up these programs as well.
> > I'm seeing more than 20 different programs for around a minute's worth of
> > runtime. One example is attached at the end.
> >
> > So, the proposal: a "CONFIG_SECCOMMP_STATIC_POLICY" for seccomp. This
> > would change the Android kernel's generic SYS_seccomp call, which takes in a
> > filter with an array of BPF instructions, to instead reference an ID which
> > corresponds to a fixed file on /sys/bpf/seccomp or something like that. The
> > sandboxing behavior of these apps should be known at compile-time, even if
> > there are multiple "permission set types" that may need to be dispatched. User
> > apps should always have a single, fixed policy. This way it is possible to say
> > for every code page loaded into the kernel where it came from and what it
> > should look like.
> >
> > Unfortunately, I do not know Motorola has enough "weight" to convince QCOM to
> > do the right foundational thing here, or to "define" the seccomp APIs for
> > Android, so it would be good to have Google's buy in, know if there are plans
> > to fix this issue, or some discussion of how to best fix the problem? If
> > anything, a contact at QCOM that might be able to actually hunt down and
> > document valid bytes for these policies?
> >
> > The end goal is simple: when we see a code page is allocated in the kernel, we
> > can be sure that (1) it isn't malicious and (2) has not been modified in
> > transit. I'm fine putting code where my mouth is, but right now that code
> > would involve having to fingerprint the signatures loaded by Qualcomm
> > components every time a new one is released, or pinging Google with a huge
> > patch changing how seccomp works with no idea of what requirements QCOM may
> > have on seccomp policy generation.
> >
> > Thoughts? Is this doable, and if not, why? I'd also love help with the code and
> > adapting existing minijail code to use a new, more integrity-preserving
> > interface. If I am mistaken and it is possible to grab out valid BPF policy
> > code at compile time, please let me know how!
> >
> > Regards,
> > Maxwell Bland
> >
> > Standard filter, (from, for example, com.google.android.gms)
> > "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007b000000000055000200d7000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f950000000000000055000100e200000005008f000000000055000200a7000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f9500000000000000550002003e000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f9500000000000000550002005c000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f95000000000000005500020060000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f950000000000000055000200cb0
00000b40000000000ff7f950000000000000055000100c600000005004c0000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f9500000000000000550002003d000000b40000000000ff7f95000000000000005500020017000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
> > Unknown filter (from QCOM's /vendor/bin/qesdk-secmanager)
> >  "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007e000000000055000100e2000000050098000000000055000200d7000000b40000000000ff7f950000000000000055000200a7000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002004f000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002007b000000b40000000000ff7f9500000000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f950000000000000055000100c600000005005800000000005500020060000000b40000000000ff7f9500000000000000550002008b000000b400000000
00ff7f950000000000000055000200cb000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f9500000000000000550002003b000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f950000000000000055000200b2000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
> >
> > List of services loading seccomp filters pulled from one run of the phone:
> > com.google.android.deskclock
> > /vendor/bin/qesdk-secmanager
> > media.hwcodec/vendor.qti.media.c2@1.0-service
> > media.audio.qc.codec.qti.media.c2audio@1.0-service
> > /vendor/bin/vendor.qti.qspmhal-service
> > /vendor/bin/qsap_sensors
> > media.extractoraextractor
> > /system_ext/bin/perfservice
> > /vendor/bin/wfdhdcphalservice
> > /vendor/bin/wifidisplayhalservice
> > /vendor/bin/qsap_dcfd
> > /vendor/bin/qms
> > /vendor/bin/qsap_location
> > /vendor/bin/qsap_qapeservice
> > /vendor/bin/wfdvndservice
> > media.swcodecoid.media.swcodec/bin/mediaswcodec
> > /vendor/bin/hw/qcrilNrd
> > qsap_qms_13qms16
> > qsap_qms_24qms17
> > /vendor/bin/ATFWD-daemon
> > /vendor/bin/hw/sxrservice
> > /vendor/bin/hw/qcrilNrd-c2
> > system_server
> > /vendor/bin/qmi_motext_hook1013170
> > /vendor/bin/qmi_motext_hook1013171
> > /vendor/bin/ims_rtp_daemon
> > com.android.systemui
> > webview_zygote
> > com.dolby.daxservice
> > vendor.qti.qesdk.sysservice
> > org.codeaurora.ims
> > com.android.se
> > com.android.phone
> > com.qti.qcc
> > com.google.android.ext.services
> > com.google.android.gms
> > com.google.android.euicc
> > com.google.android.googlequicksearchbox:interactor
> > com.google.android.apps.messaging:rcs
> > com.android.nfc
> > com.qualcomm.qti.workloadclassifier
> > com.qualcomm.location
> > com.google.android.gms.unstable
> > com.thundercomm.ar.core
> > com.android.vending:background
> > com.android.vending:quick_launch
> > com.android.dynsystem
> > com.android.managedprovisioning
> > com.android.shell
>
>
> + Jeff, Alistair, and Maciej
>
> Maxwell,
>
> Thanks for the details on this, I have added several people who may be
> better suited to comment on this.
>
> Neill

* Re: [RFC] Proposal: Static SECCOMP Policies
From: Maxwell Bland @ 2024-09-13 17:12 UTC (permalink / raw)
  To: Maciej Żenczykowski, Neill Kapron
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, Andy Lutomirski,
	keescook@chromium.org, Will Drewry, Andy Gross, Bjorn Andersson,
	Konrad Dybcio, kernel-team, adelva, jeffv

Here's that main.c from my prior reply, this time sent via neomutt (Outlook
wraps plaintext messages at 80 chars), if you want to check it out and don't
want to fix the random newlines:

// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2023 Motorola Mobility, Inc.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Kernel module that hooks the vmalloc infrastructure to ensure that code
 * pages are not interleaved with data pages unless at a PMD level granularity.
 * Must be loaded prior to other kernel mechanisms leveraging code page
 * allocation, e.g. BPF, EROFS fixmap.
 */


#include <linux/kernel.h>
#include <linux/bpf.h>
#include <linux/mutex.h>
#include <linux/atomic.h>
#include <linux/highmem.h>
#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/of_platform.h>
#include <linux/pagewalk.h>
#include <linux/types.h>
#include <linux/moduleloader.h>
#include <linux/vmalloc.h>
#include <linux/gfp_types.h>
#include <linux/seccomp.h>
#include <asm/pgalloc.h>
#include <asm/ptrace.h>
#include <asm/patching.h>
#include <asm/module.h>
#include <asm/page.h>
#include <asm/seccomp.h>

#ifdef SECCOMP_ARCH_NATIVE
/**
 * struct action_cache - per-filter cache of seccomp actions per
 * arch/syscall pair
 *
 * @allow_native: A bitmap where each bit represents whether the
 *                filter will always allow the syscall, for the
 *                native architecture.
 * @allow_compat: A bitmap where each bit represents whether the
 *                filter will always allow the syscall, for the
 *                compat architecture.
 */
struct action_cache {
        DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
#ifdef SECCOMP_ARCH_COMPAT
        DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
#endif
};
#else
struct action_cache { };
#endif

struct seccomp_filter {
        refcount_t refs;
        refcount_t users;
        bool log;
        bool wait_killable_recv;
        struct action_cache cache;
        struct seccomp_filter *prev;
        struct bpf_prog *prog;
        struct notification *notif;
        struct mutex notify_lock;
        wait_queue_head_t wqh;
};



void print_bpf_prog_aux(struct bpf_prog_aux *aux) {
        printk("BPF Program Aux Details:\n");
        printk("Ref Count: %lld\n", atomic64_read(&aux->refcnt));
        printk("Used Map Count: %u\n", aux->used_map_cnt);
        printk("Used BTF Count: %u\n", aux->used_btf_cnt);
        printk("Max Context Offset: %u\n", aux->max_ctx_offset);
        printk("Max Packet Offset: %u\n", aux->max_pkt_offset);
        printk("Max TP Access: %u\n", aux->max_tp_access);
        printk("Stack Depth: %u\n", aux->stack_depth);
        printk("ID: %u\n", aux->id);
        printk("Function Count: %u\n", aux->func_cnt);
        printk("Function Index: %u\n", aux->func_idx);
        printk("Attach BTF ID: %u\n", aux->attach_btf_id);
        printk("Context Arg Info Size: %u\n", aux->ctx_arg_info_size);
        printk("Max Read-Only Access: %u\n", aux->max_rdonly_access);
        printk("Max Read-Write Access: %u\n", aux->max_rdwr_access);
        printk("Attach BTF: %p\n", aux->attach_btf);
        printk("Context Arg Info: %p\n", aux->ctx_arg_info);
        printk("DST Mutex: %p\n", &aux->dst_mutex);
        printk("DST Program: %p\n", aux->dst_prog);
        printk("DST Trampoline: %p\n", aux->dst_trampoline);
        printk("Saved DST Program Type: %d\n", aux->saved_dst_prog_type);
        printk("Saved DST Attach Type: %d\n", aux->saved_dst_attach_type);
        printk("Verifier Zero Extension: %u\n", aux->verifier_zext);
        printk("Attach BTF Trace: %u\n", aux->attach_btf_trace);
        printk("Function Proto Unreliable: %u\n", aux->func_proto_unreliable);
        printk("Sleepable: %u\n", aux->sleepable);
        printk("Tail Call Reachable: %u\n", aux->tail_call_reachable);
        printk("XDP Has Frags: %u\n", aux->xdp_has_frags);
        printk("Attach Func Proto: %p\n", aux->attach_func_proto);
        printk("Attach Func Name: %s\n", aux->attach_func_name);
        printk("Functions: %p\n", aux->func);
        printk("JIT Data: %p\n", aux->jit_data);
        printk("Poke Table: %p\n", aux->poke_tab);
        printk("Kfunc Table: %p\n", aux->kfunc_tab);
        printk("Kfunc BTF Table: %p\n", aux->kfunc_btf_tab);
        printk("Size Poke Table: %u\n", aux->size_poke_tab);
        printk("Ksym: %p\n", &aux->ksym);
        printk("Operations: %p\n", aux->ops);
        printk("Used Maps: %p\n", aux->used_maps);
        printk("Used Maps Mutex: %p\n", &aux->used_maps_mutex);
        printk("Used BTFs: %p\n", aux->used_btfs);
        printk("Program: %p\n", aux->prog);
        printk("User: %p\n", aux->user);
        printk("Load Time: %llu\n", aux->load_time);
        printk("Verified Instructions: %u\n", aux->verified_insns);
        printk("Cgroup Attach Type: %d\n", aux->cgroup_atype);
        printk("Cgroup Storage: %p\n", aux->cgroup_storage);
        printk("Name: %s\n", aux->name);
}

/* Dump each instruction as one raw 64-bit word (struct bpf_insn is 8 bytes). */
void print_bpf_prog_insnsi(struct bpf_insn *insns, uint64_t len) {
        uint64_t i;
        for (i = 0; i < len; i++) {
                const struct bpf_insn *insn = &insns[i];
                printk("BPF INSN %016llx\n", *((const uint64_t *)insn));
        }
}

void print_bpf_prog(struct bpf_prog *prog) {
        printk("BPF Program Details:\n");
        printk("Pages: %u\n", prog->pages);
        printk("JITed: %u\n", prog->jited);
        printk("JIT Requested: %u\n", prog->jit_requested);
        printk("GPL Compatible: %u\n", prog->gpl_compatible);
        printk("Control Block Access: %u\n", prog->cb_access);
        printk("DST Needed: %u\n", prog->dst_needed);
        printk("Blinding Requested: %u\n", prog->blinding_requested);
        printk("Blinded: %u\n", prog->blinded);
        printk("Is Function: %u\n", prog->is_func);
        printk("Kprobe Override: %u\n", prog->kprobe_override);
        printk("Has Callchain Buffer: %u\n", prog->has_callchain_buf);
        printk("Enforce Expected Attach Type: %u\n", prog->enforce_expected_attach_type);
        printk("Call Get Stack: %u\n", prog->call_get_stack);
        printk("Call Get Func IP: %u\n", prog->call_get_func_ip);
        printk("Timestamp Type Access: %u\n", prog->tstamp_type_access);
        printk("Type: %d\n", prog->type);
        printk("Expected Attach Type: %d\n", prog->expected_attach_type);
        printk("Length: %u\n", prog->len);
        printk("JITed Length: %u\n", prog->jited_len);
        /* %*phN prints the tag bytes as one hex string on a single log line;
         * separate printk() calls would each start a new line. */
        printk("Tag: %*phN\n", BPF_TAG_SIZE, prog->tag);
        printk("Stats: %p\n", prog->stats);
        printk("Active: %p\n", prog->active);
        printk("AUX FIELDS:\n");
        print_bpf_prog_aux(prog->aux);
        print_bpf_prog_insnsi(prog->insnsi, prog->len);
}


/* Functions we need for patching dynamic code allocations */
typedef void *(*module_alloc_t)(unsigned long size);
module_alloc_t module_alloc_ind;
typedef void (*module_memfree_t)(void *module_region);
module_memfree_t module_memfree_ind;

/* TODO: actually we could probably just include "net/bpf_jit.h" */
typedef int (*aarch64_insn_patch_text_nosync_t)(void *addr, u32 insn);
aarch64_insn_patch_text_nosync_t aarch64_insn_patch_text_nosync_ind;
typedef u32 (*aarch64_insn_gen_branch_imm_t)(unsigned long pc,
                                             unsigned long addr,
                                             enum aarch64_insn_branch_type type);
aarch64_insn_gen_branch_imm_t aarch64_insn_gen_branch_imm_ind;
typedef u32 (*aarch64_insn_gen_hint_t)(enum aarch64_insn_hint_cr_op op);
aarch64_insn_gen_hint_t aarch64_insn_gen_hint_ind;
typedef u32 (*aarch64_insn_gen_branch_reg_t)(
        enum aarch64_insn_register reg, enum aarch64_insn_branch_type type);
aarch64_insn_gen_branch_reg_t aarch64_insn_gen_branch_reg_ind;
typedef void *(*__vmalloc_node_range_t)(unsigned long size, unsigned long align,
                                        unsigned long start, unsigned long end,
                                        gfp_t gfp_mask, pgprot_t prot,
                                        unsigned long vm_flags, int node,
                                        const void *caller);
__vmalloc_node_range_t __vmalloc_node_range_ind;

/* Used for reworking the kprobe allocator */
typedef int (*collect_garbage_slots_t)(struct kprobe_insn_cache *c);
collect_garbage_slots_t collect_garbage_slots_ind;

static struct kprobe kallsyms_lookup_name_kp = {
        .symbol_name = "kallsyms_lookup_name",
        .addr = 0,
};
typedef unsigned long (*kallsyms_lookup_name_t)(const char *name);
kallsyms_lookup_name_t kallsyms_lookup_name_ind;

/* Functions we are patching */
static struct kprobe alloc_vmap_area_kp = {
        .symbol_name = "alloc_vmap_area",
        .addr = 0,
};

/* DEBUG: bpf allocation printing */
// static struct kprobe bpf_int_jit_compile_kp = { .symbol_name = "bpf_int_jit_compile",
// .addr = 0 };
static struct kprobe ptrace_request_kp = {
        .symbol_name = "ptrace_request",
        .addr = 0,
};
/* END DEBUG */

/* Static variables that must be manually accessed for definition */
u64 module_alloc_base;
struct kprobe_insn_cache *kprobe_insn_slots_ptr;

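/**
 * get_kp_addr - resolve a symbol address by briefly registering a kprobe
 *
 * Registers @kp so that the kprobes core fills in kp->addr from
 * kp->symbol_name, then unregisters it again. Returns the resolved address,
 * or NULL if registration fails.
 *
 * TODO comment rest of file
 */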
static __always_inline void *get_kp_addr(struct kprobe *kp)
{
        void *res = 0;
        if (register_kprobe(kp)) {
                pr_err("Error: moto_org_mem failed to get kp addr for %s\n",
                       kp->symbol_name);
                return 0;
        }
        res = kp->addr;
        unregister_kprobe(kp);
        return res;
}

static void *bpf_jit_alloc_exec_handler(unsigned long size)
{
        return module_alloc_ind(size);
}

static void bpf_jit_free_exec_handler(void *addr)
{
        module_memfree_ind(addr);
}

static u64 bpf_jit_alloc_exec_limit_handler(void)
{
        return MODULES_END - MODULES_VADDR;
}

static void *alloc_insn_page_handler(void)
{
        return __vmalloc_node_range_ind(PAGE_SIZE, 1, module_alloc_base,
                                        module_alloc_base + SZ_2G, GFP_KERNEL,
                                        PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
                                        NUMA_NO_NODE,
                                        __builtin_return_address(0));
}

static bool allocation_balance = false;

/**
 * alloc_vmap_area_handler - adjusts vstart, vend to not interleave code/data
 *
 * Right now, vmalloc infrastructure does the following:
 * |<-----data----->||<-----code and data pages----->||<-----data----->|
 * Maintainers likely do not want to touch vmalloc internals for fear of
 * breaking everything, so we provide an open-source work-around with hopes
 * that these fixes will make their way into the mainline kernel.
 *
 * We adjust the parameters to the call to avoid the code memory range by
 * selecting the lower half, then in a separate post handler, we check whether
 * the allocation failed, and if so, run the allocation with the upper half.
 *
 * TODO: we need to remove the flip/flopping and properly segment the memory
 * here, but it is not clear how to do this without modifying core vmalloc
 * infrastructure. See upstream patch here:
 * https://lore.kernel.org/all/20240423095843.446565600-1-mbland@motorola.com/#t
 *
 * Parameters are passed in the arm64 linux kernel following the AAPCS64 ABI
 * convention, and thus it is safe to infer from the function signature which
 * registers hold the values for vstart and vend.
 * https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
 */
static int alloc_vmap_area_handler(struct kprobe *kp, struct pt_regs *regs)
{
        unsigned long size;
        unsigned long vstart;
        size = regs->regs[0];
        vstart = regs->regs[2];
        if (vstart == VMALLOC_START) { /* We are attempting to vmalloc data */
                /* Everything is fine, do nothing */
                if (module_alloc_base + SZ_2G <= VMALLOC_START ||
                    module_alloc_base > VMALLOC_END)
                        return 0;

                allocation_balance = !allocation_balance;

                /* Not enough room below, else if not enough room above */
                if (module_alloc_base - VMALLOC_START < size)
                        allocation_balance = true;
                else if (VMALLOC_END - (module_alloc_base + SZ_2G) < size)
                        allocation_balance = false;

                /* Allocate from higher valued addresses or lower valued
                 * addresses evenly. Since these are virtual, it does not
                 * really matter */
                if (allocation_balance) {
                        regs->regs[2] = module_alloc_base + SZ_2G;
                } else {
                        regs->regs[3] = module_alloc_base;
                }
        }

        return 0;
}

/* DEBUG: Analyze allocated BPF programs */
// static int bpf_int_jit_compile_handler(struct kprobe *kp, struct pt_regs *regs)
// {
//         // struct bpf_prog *prog = (struct bpf_prog *)regs->regs[0];
//         // print_bpf_prog(prog);
//         return 0;
// }
// 
static int ptrace_request_handler(struct kprobe *kp, struct pt_regs *regs)
{
        struct task_struct *task = (struct task_struct *)regs->regs[0];
        long request = regs->regs[1];
        unsigned long addr = regs->regs[2];
        struct seccomp_filter *filter;

        /* 0x420c is PTRACE_SECCOMP_GET_FILTER */
        if (request != 0x420c)
                return 0;
        /* Only act on requests carrying the magic debug address */
        if (addr != 13371337) {
                printk("waiting for regs ... %llx\n", regs->regs[1]);
                return 0;
        }

        if (!task) {
                printk("ptrace_request_handler no task\n");
                return 0;
        }

        filter = READ_ONCE(task->seccomp.filter);
        printk("TASK PID %d or %d\n", task->pid, pid_vnr(task_pgrp(task)));
        if (!filter) {
                printk("ptrace_request_handler no filter\n");
                return 0;
        }
        if (filter->prog)
                print_bpf_prog(filter->prog);

        return 0;
}
/* END DEBUG */


void __always_inline patch_jump_to_handler(void *faddr, void *helper)
{
        u32 insn;
        insn = aarch64_insn_gen_branch_imm_ind((unsigned long)faddr,
                                               (unsigned long)helper,
                                               AARCH64_INSN_BRANCH_NOLINK);
        aarch64_insn_patch_text_nosync_ind(faddr, insn);
}

struct kprobe_insn_page {
        struct list_head list;
        kprobe_opcode_t *insns; /* Page of instruction slots */
        struct kprobe_insn_cache *cache;
        int nused;
        int ngarbage;
        char slot_used[];
};

void free_insn_pages(struct kprobe_insn_cache *kic)
{
        struct kprobe_insn_page *kip, *next;

        /* TODO: Since the slot array is not protected by rcu, we need a mutex,
         * but we should also be the only thing running that is touching
         * the kprobes */
        /* _safe iteration: each entry is unlinked and freed inside the loop */
        list_for_each_entry_safe (kip, next, &kic->pages, list) {
                /* The whole page is freed below, so just drop the use count */
                kip->nused = 0;
                list_del_rcu(&kip->list);
                synchronize_rcu();
                kip->cache->free(kip->insns);
                kfree(kip);
        }
}

/**
 * mod_init - TODO
 *
 * TODO FAIL IF ANY OF THE BELOW FAILS
 */
static int __init mod_init(void)
{
        void *bpf_jit_alloc_exec_addr = 0;
        void *bpf_jit_free_exec_addr = 0;
        void *bpf_jit_alloc_exec_limit_addr = 0;
        void *alloc_insn_page_addr = 0;
        kallsyms_lookup_name_ind =
                (kallsyms_lookup_name_t)get_kp_addr(&kallsyms_lookup_name_kp);

        module_alloc_ind =
                (module_alloc_t)kallsyms_lookup_name_ind("module_alloc");
        module_memfree_ind =
                (module_memfree_t)kallsyms_lookup_name_ind("module_memfree");
        __vmalloc_node_range_ind =
                (__vmalloc_node_range_t)kallsyms_lookup_name_ind(
                        "__vmalloc_node_range");
        aarch64_insn_patch_text_nosync_ind =
                (aarch64_insn_patch_text_nosync_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_patch_text_nosync");
        aarch64_insn_gen_branch_imm_ind =
                (aarch64_insn_gen_branch_imm_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_gen_branch_imm");
        aarch64_insn_gen_hint_ind =
                (aarch64_insn_gen_hint_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_gen_hint");
        aarch64_insn_gen_branch_reg_ind =
                (aarch64_insn_gen_branch_reg_t)kallsyms_lookup_name_ind(
                        "aarch64_insn_gen_branch_reg");

        collect_garbage_slots_ind =
                (collect_garbage_slots_t)kallsyms_lookup_name_ind(
                        "collect_garbage_slots");

        bpf_jit_alloc_exec_addr =
                (void *)kallsyms_lookup_name_ind("bpf_jit_alloc_exec");
        bpf_jit_free_exec_addr =
                (void *)kallsyms_lookup_name_ind("bpf_jit_free_exec");
        bpf_jit_alloc_exec_limit_addr =
                (void *)kallsyms_lookup_name_ind("bpf_jit_alloc_exec_limit");
        alloc_insn_page_addr =
                (void *)kallsyms_lookup_name_ind("alloc_insn_page");

        module_alloc_base =
                *((u64 *)kallsyms_lookup_name_ind("module_alloc_base"));

        patch_jump_to_handler(bpf_jit_alloc_exec_addr,
                              bpf_jit_alloc_exec_handler);
        patch_jump_to_handler(bpf_jit_free_exec_addr,
                              bpf_jit_free_exec_handler);
        patch_jump_to_handler(bpf_jit_alloc_exec_limit_addr,
                              bpf_jit_alloc_exec_limit_handler);
        patch_jump_to_handler(alloc_insn_page_addr, alloc_insn_page_handler);

        /*
         * Under the hood, arm64 calls __get_insn_slot to generate memory pages for
         * kprobes, and these memory pages *supposedly* access an indirect pointer to
         * their allocation function through kprobe_insn_slots. Because we allocated
         * a kprobe in order to access kallsyms_lookup_name, one page is already allocated.
         * However, even kprobe garbage collection cowardly refuses to kill the last page,
         * so we have our own free routine that nixes that last survivor.
         */
        kprobe_insn_slots_ptr =
                (struct kprobe_insn_cache *)kallsyms_lookup_name_ind(
                        "kprobe_insn_slots");
        free_insn_pages(kprobe_insn_slots_ptr);

        alloc_vmap_area_kp.pre_handler = alloc_vmap_area_handler;
        if (register_kprobe(&alloc_vmap_area_kp)) {
                pr_err("moto_org_mem.ko failed to hook alloc_vmap_area!\n");
                return -EACCES;
        }

        /* DEBUG */
        // bpf_int_jit_compile_kp.pre_handler = bpf_int_jit_compile_handler;
        // if (register_kprobe(&bpf_int_jit_compile_kp)) {
        //         pr_err("moto_org_mem.ko failed to hook bpf_int_jit_compile!\n");
        //         return -EACCES;
        // }

        ptrace_request_kp.pre_handler = ptrace_request_handler;
        if (register_kprobe(&ptrace_request_kp)) {
                pr_err("moto_org_mem.ko failed to hook ptrace_request_kp!\n");
                return -EACCES;
        }

        /* END DEBUG */
        pr_info("moto_org_mem loaded!\n");

        return 0;
}

static void __exit mod_exit(void)
{
}

module_init(mod_init);
module_exit(mod_exit);

MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Maxwell Bland <mbland@motorola.com>");
MODULE_DESCRIPTION("Organizes the vmalloc memory so code pages are not interleaved "
                   "with data pages.");

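The range selection done in alloc_vmap_area_handler can be modelled in
userspace. The following is only a sketch under stated assumptions: the
names pick_data_range and struct vrange are hypothetical, the code region
is assumed to lie entirely inside the vmalloc range whenever the two
overlap, and the "room above" check measures from the end of the code
region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the search window for one data allocation. */
struct vrange {
        uint64_t vstart;
        uint64_t vend;
};

/*
 * Pick the half of [vm_start, vm_end) that avoids the code region
 * [code_base, code_end). *balance alternates between calls unless one
 * side is too small for the request.
 */
static struct vrange pick_data_range(uint64_t vm_start, uint64_t vm_end,
                                     uint64_t code_base, uint64_t code_end,
                                     uint64_t size, bool *balance)
{
        struct vrange r = { vm_start, vm_end };

        /* Code region does not overlap the vmalloc range: nothing to do */
        if (code_end <= vm_start || code_base > vm_end)
                return r;

        /* Assumption of this sketch: the code region is fully inside */
        assert(code_base >= vm_start && code_end <= vm_end);

        *balance = !*balance;

        if (code_base - vm_start < size)        /* not enough room below */
                *balance = true;
        else if (vm_end - code_end < size)      /* not enough room above */
                *balance = false;

        if (*balance)
                r.vstart = code_end;            /* search above the code */
        else
                r.vend = code_base;             /* search below the code */
        return r;
}
```

Successive calls with a small size alternate between the upper and lower
window; a request larger than the lower window is always sent above.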


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-13 17:07     ` [External] " Maxwell Bland
  2024-09-13 17:12       ` Maxwell Bland
@ 2024-09-13 17:30       ` Maxwell Bland
  2024-09-14  4:18         ` Andy Lutomirski
  2024-09-13 18:17       ` Maxwell Bland
  2024-09-13 21:16       ` [External] " Maciej Żenczykowski
  3 siblings, 1 reply; 26+ messages in thread
From: Maxwell Bland @ 2024-09-13 17:30 UTC (permalink / raw)
  To: Maciej Żenczykowski, Neill Kapron
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, Andy Lutomirski,
	keescook@chromium.org, Will Drewry, Andy Gross, Bjorn Andersson,
	Konrad Dybcio, kernel-team, adelva, jeffv

On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:

> These programs will not print out using PTRACE and are difficult to audit
> without patching the seccomp calls yourself because the ptrace call to
> PTRACE_SECCOMP_GET_FILTER will fail. I believe (have not checked) because they
> are not cBPF, and seccomp's logic makes prog->fprog evaluates to null despite
> prog existing if it is cBPF, at least on Android 14. I spent a whole day
> getting frustrated with the failing ptrace call before finally ending up my
> patches (attached to the end) that instrument ptrace and can print the
> programs.

LOL, this paragraph is a mess, apologies: I'm referencing the failure of
get_seccomp_filter in seccomp.c here:

fprog = filter->prog->orig_prog;
if (!fprog) {
	/* This must be a new non-cBPF filter, since we save
	 * every cBPF filter's orig_prog above when
	 * CONFIG_CHECKPOINT_RESTORE is enabled.
	 */
	ret = -EMEDIUMTYPE;
	goto out;
}

Though CONFIG_CHECKPOINT_RESTORE is not set on Android 14, so I think
the ptrace probably failed for all sorts of reasons unrelated to cBPF.

But don't let me distract from the issue, which is that however these
filters (cBPF or eBPF) get compiled to machine code,
bpf_int_jit_compile ends up getting called and a new
privileged-executable page gets allocated without compile-time
provenance (at least, without reverse engineering) for where that code
came from.

But I think instrumentation of the BPF JIT compiler (which I will work
on next) should fix that.


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-13 17:30       ` Maxwell Bland
@ 2024-09-14  4:18         ` Andy Lutomirski
  2024-09-17 15:08           ` Maxwell Bland
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2024-09-14  4:18 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: Maciej Żenczykowski, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
>
> On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
>
> > These programs will not print out using PTRACE and are difficult to audit
> > without patching the seccomp calls yourself because the ptrace call to
> > PTRACE_SECCOMP_GET_FILTER will fail. I believe (have not checked) because they
> > are not cBPF, and seccomp's logic makes prog->fprog evaluates to null despite
> > prog existing if it is cBPF, at least on Android 14. I spent a whole day
> > getting frustrated with the failing ptrace call before finally ending up my
> > patches (attached to the end) that instrument ptrace and can print the
> > programs.
>
> LOL, this paragraph is a mess, apologies: I'm referencing the failure of
> get_seccomp_filter in seccomp.c here:
>
> fprog = filter->prog->orig_prog;
> if (!fprog) {
>         /* This must be a new non-cBPF filter, since we save
>          * every cBPF filter's orig_prog above when
>          * CONFIG_CHECKPOINT_RESTORE is enabled.
>          */
>         ret = -EMEDIUMTYPE;
>         goto out;
> }
>
> Though CONFIG_CHECKPOINT_RESTORE is not set on Android 14, so I think
> the ptrace probably failed for all sorts of reasons unrelated to cBPF.
>
> But don't let me distract from the issue, which is that
> cBPF/eBPF/however these filters get allocated to machine code,
> bpf_int_jit_compile ends up getting called and a new
> privileged-executable page gets allocated without compile-time
> provenance (at least, without reverse engineering) for where that code
> came from.

Mulling over this a bit, I think there are sort of two issues here,
and they're sort of orthogonal to each other.

The easy one first: can there be a static or somewhat static or at
least administrator-controlled list of seccomp cBPF programs?  (Where
administrator is, sadly, probably not the actual owner of a phone, but
that ship sailed a long time ago.)  Trying to make a list *and
reference that list from programs loading filters* seems like a huge
breaking change, not to mention that getting it to work right in
namespaces will be extra complex.

But what if there was a mechanism to *cryptographically hash* a BPF
program as part of the loading process?  Then that hash could be
looked up in a list, and a decision could be made based on the result?
 Would this help solve any problems?
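
A minimal sketch of the hash-then-look-up idea, under loud assumptions:
struct cbpf_insn mirrors the layout of struct sock_filter, the names
cbpf_digest and cbpf_allowed are hypothetical, and FNV-1a stands in for
the digest purely to keep the example short. FNV-1a is NOT cryptographic;
a real implementation would use something like SHA-256.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct sock_filter in <linux/filter.h> */
struct cbpf_insn {
        uint16_t code;
        uint8_t jt;
        uint8_t jf;
        uint32_t k;
};

/* FNV-1a 64-bit step: placeholder digest only, not cryptographic */
static uint64_t digest_update(uint64_t h, const void *data, size_t len)
{
        const uint8_t *p = data;

        for (size_t i = 0; i < len; i++) {
                h ^= p[i];
                h *= 0x100000001b3ULL;
        }
        return h;
}

/* Digest a whole program, field by field (host byte order) */
static uint64_t cbpf_digest(const struct cbpf_insn *prog, size_t n)
{
        uint64_t h = 0xcbf29ce484222325ULL;

        assert(prog != NULL || n == 0);
        for (size_t i = 0; i < n; i++) {
                h = digest_update(h, &prog[i].code, sizeof(prog[i].code));
                h = digest_update(h, &prog[i].jt, sizeof(prog[i].jt));
                h = digest_update(h, &prog[i].jf, sizeof(prog[i].jf));
                h = digest_update(h, &prog[i].k, sizeof(prog[i].k));
        }
        return h;
}

/* Decision point: is this program's digest on the administrator's list? */
static bool cbpf_allowed(const struct cbpf_insn *prog, size_t n,
                         const uint64_t *allow, size_t n_allow)
{
        uint64_t h = cbpf_digest(prog, n);

        for (size_t i = 0; i < n_allow; i++)
                if (allow[i] == h)
                        return true;
        return false;
}
```

The point of the sketch is only the shape of the decision: the loader
computes a digest over the filter it was handed and consults a list before
the filter ever reaches the JIT.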

Okay, on to the hard part: code integrity.  I've mulled over this a
bit from the perspective of userspace JITs and their interaction with
kernel-enforced security.  Kernel-based JITs and their interactions
with hypervisor security are rather similar.  (They're *not* the same.
The kernel can and does muck with its own pagetables.  User code
can't.  But I don't think this is a huge difference here as to the big
picture.)  There's also self-modifying code (existing executable code
that changes) and code generation (code that is created where code
previously didn't exist).  I'm going to focus on the latter.

Today, userspace can use nasty APIs to allocate writable memory, then
write to it, then change it to be executable.  This comes with gnarly
architecture-specific coherency issues, and it doesn't give a great
way for the kernel to render an intelligent opinion.  And, today, the
kernel can allocate memory (by futzing with pagetables or just using
existing maps), write some code, then either change the permissions to
executable or create a new executable alias, and then do the
architecture-specific incantation to make it coherent, then run it.
In neither case is there an amazing way for the supervisor (kernel or
hypervisor) to render an opinion about the code, and in the userspace
case, the actual efficiency of the process is quite low.

So what would a good solution look like?  It seems to me that the
program being supervised (a userspace or kernel JIT) could generate
some kind of data structure along these lines:

- machine code to be materialized

- address and length at which to materialize it (probably
page-aligned, but maybe not)

- an "origin" of this code (perhaps a file handle?) -- I'm not 100%
sure this is useful

- a "justification" for the code.  This could be something like "Hey,
this is JITted from cBPF for seccomp, and here's the cBPF".

Or there could be a more indirect variant:

- source to be JITed (cBPF, WASM, eBPF, whatever)

- enough relocation info for the supervisor to JIT it appropriately

- address to materialize the code at, along with maximum size

and the supervisor JITs it and materializes it.
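
For concreteness, the first variant could be sketched as a request
structure plus a sanity check on the supervisor side. Everything named
here (struct jit_materialize_req, jit_req_well_formed, the 4 KiB page
size, the page-alignment requirement, the "window" of permitted
addresses) is an assumption made for illustration, not an existing
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

enum jit_source_kind { JIT_SRC_CBPF, JIT_SRC_EBPF, JIT_SRC_WASM };

/* Hypothetical request handed from the JIT to its supervisor */
struct jit_materialize_req {
        const uint8_t *code;            /* machine code to be materialized */
        size_t code_len;
        uint64_t dest_addr;             /* where to materialize it */
        size_t dest_len;
        int origin_fd;                  /* "origin", -1 if none */
        enum jit_source_kind src_kind;  /* "justification": kind ... */
        const void *src;                /* ... and the source it came from */
        size_t src_len;
};

/*
 * Structural checks only; verifying that code really is the JIT output
 * of src is the hard part and is not attempted here.
 */
static bool jit_req_well_formed(const struct jit_materialize_req *req,
                                uint64_t win_start, uint64_t win_end)
{
        assert(win_start <= win_end);

        if (!req->code || req->code_len == 0 || req->code_len > req->dest_len)
                return false;
        if (req->dest_addr % SKETCH_PAGE_SIZE != 0)
                return false;
        if (req->dest_addr < win_start || req->dest_addr > win_end)
                return false;
        /* Written this way so dest_addr + dest_len cannot overflow */
        if (req->dest_len > win_end - req->dest_addr)
                return false;
        return req->src != NULL && req->src_len > 0;
}
```

The supervisor would copy the code into a page it controls and map it
executable and non-writable only after a check like this (and whatever
real verification it wants) has passed.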

I could imagine this being used for userspace and for hypervisor-based
kernel integrity.  Would it do what's needed here if there were a
hypercall kind of like this?

I can also imagine this being considerably faster than what current
userspace does.  On x86, for example, the kernel could populate a page
with the JITted code, then map that page at an address where nothing
was previously mapped, and return to userspace, and userspace could
execute that code, even on a different CPU, with no heavyweight
serialization at all.  I think the only practical way on Linux today
to do this would be to create a memfd, use write(2) or similar to fill
in the code, then mmap it executable.  And to fight with LSMs to make
sure they allow it and to maybe seal it as read-only before mmapping
it.  That latter bit kind of kills it if the goal is to write a web
browser, though -- you don't really want a whole new memfd for each
javascript block that gets JITted.
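
The memfd route described above can be sketched as follows. This is a
minimal illustration, assuming a Linux system with memfd_create(2) and
file sealing available; the helper name map_code_via_memfd is
hypothetical, and an LSM or the vm.memfd_noexec sysctl may still refuse
the executable mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes of code into a fresh memfd, seal it against further
 * modification, and map it read+execute. The mapping is never writable.
 * Returns NULL on any failure.
 */
static void *map_code_via_memfd(const void *code, size_t len)
{
        void *p = MAP_FAILED;
        int fd;

        assert(code != NULL);

        fd = memfd_create("jit-code", MFD_CLOEXEC | MFD_ALLOW_SEALING);
        if (fd < 0)
                return NULL;

        if (write(fd, code, len) != (ssize_t)len)
                goto out;

        /* Make the contents immutable before anything maps them */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SHRINK |
                                   F_SEAL_GROW | F_SEAL_SEAL) < 0)
                goto out;

        p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
out:
        close(fd);      /* the mapping, if any, outlives the descriptor */
        return p == MAP_FAILED ? NULL : p;
}
```

The caller is responsible for munmap(2) and for any instruction-cache
maintenance its architecture requires before jumping into the mapping.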

Is any of this helpful?

--Andy


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-14  4:18         ` Andy Lutomirski
@ 2024-09-17 15:08           ` Maxwell Bland
  2024-09-25 18:16             ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Maxwell Bland @ 2024-09-17 15:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Maciej Żenczykowski, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> >
> > But don't let me distract from the issue, which is that
> > cBPF/eBPF/however these filters get allocated to machine code,
> > bpf_int_jit_compile ends up getting called and a new
> > privileged-executable page gets allocated without compile-time
> > provenance (at least, without reverse engineering) for where that code
> > came from.
> 
> But what if there was a mechanism to *cryptographically hash* a BPF
> program as part of the loading process?  Then that hash could be
> looked up in a list, and a decision could be made based on the result?
>  Would this help solve any problems?

The issue I have seen, both in the Qualys exploit linked from my initial
message and in talks by security researchers elsewhere (for example
Google Project Zero's recent "Analyzing a Modern In-the-wild Android
Exploit" by Seth Jenkins), is that people have the ability to target
these pages during the window between the page being allocated as
writable by vmalloc.c and the update to the PTE which makes it
executable. So a signature does help (it creates the requirement of more
than one write to commit "forgery"), but it doesn't 100% solve the
problem.

Right now, every time I open up Chrome on our latest flagship, the
browser's sandbox filters trigger my EL2 monitor because they are
attempting to follow the standard W^X protocol. If I were to build one
of these exploits, I'd:

(1) find a non-crashing leak for code page and data values
(2) determine from vmalloc's rb-tree where the next one-page allocation
    is likely to occur
(3) prime my write gadget for an offset into that page
(4) spin up Chrome in a second thread
(5) attempt to trigger a write (or two) at precisely the right time using
    prior empirical measurement or my read gadget for kernel mem

Which is messy, but people have been known to do more given good enough
stakes. Hell, I spent a few months working on something similar for
airplane communication management units.

> So what would a good solution look like?  It seem to me that the
> program being supervised (a userspace or kernel JIT) could generate
> some kind of data structure along these lines:
> 
> - machine code to be materialized
> 
> - address and length at which to materialize it (probably
> page-aligned, but maybe not)
> 
> - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> sure this is useful
> 
> - a "justification" for the code.  This could be something like "Hey,
> this is JITted from cBPF for seccomp, and here's the cBPF".
> 
> Or there could be a more indirect variant:
> 
> - source to be JITed (cBPF, WASM, eBPF, whatever)
> 
> - enough relocation info for the supervisor to JIT it appropriately
> 
> - address to materialize the code at, along with maximum size
> 
> and the supervisor JITs it and materializes it.
> 
> I could imagine this being used for userspace and for hypervisor-based
> kernel integrity.  Does it do what's needed here if there was a
> hypercall kind of like this?
>
"Origin" to me seems like the most significant part, as it should be
possible for engineers to hack in the rest based upon the implicit
contract provided by the software that is trying to compile the program.

Expanding on the other points, right now, I'm trying to see if it is
possible to orient EL2 so that there is little to no standard "runtime"
interface to the security monitor, as Samsung historically had issues
with these routes leading to exploits because the engineers
(like me) were not super skilled. That is, pushing the verification
effort to EL2 will be more dangerous, since EL2's code would then include
a JIT, and any error in it (say, an out-of-bounds write) lands in EL2.

Returning to the idea of origins, at the end of the work day yesterday I
queried Maciej to "have Android choose one compiler for seccomp policies
to BPF and stick with it", because if I knew filters were chosen by
libminijail or some other userspace system, I could pretty easily figure
out what EL2 needs to expect at runtime. An "origin" field would be
equally effective, and retain flexibility.

Here's what I have now, which is actually enough to lock down almost everything
except the seccomp filters and dynamic data structures (kworker queues, e.g.
call_usermode_exec_helper, will be the motivating example at that
point):

case MARK_RANGE_RO: /* Set the RO bit on a stage-2 PTE/PMD range */
case ADD_JUMP_ENTRY_LOOKUP: /* Add in exceptions for static_keys */
case LOCK: /* Prevent any further SMC calls outside of *_TUPLE */
case SPLIT_BLOCK: /* Demote (PMD) hugepage to PTEs */
case REGISTER_AMEM: /* Preserve region of physical mem for just EL2 */

Maxwell


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-17 15:08           ` Maxwell Bland
@ 2024-09-25 18:16             ` Andy Lutomirski
  2024-09-25 19:52               ` Maciej Żenczykowski
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2024-09-25 18:16 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: Maciej Żenczykowski, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
>
> On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > >
> > > But don't let me distract from the issue, which is that
> > > cBPF/eBPF/however these filters get allocated to machine code,
> > > bpf_int_jit_compile ends up getting called and a new
> > > privileged-executable page gets allocated without compile-time
> > > provenance (at least, without reverse engineering) for where that code
> > > came from.
> >
> > But what if there was a mechanism to *cryptographically hash* a BPF
> > program as part of the loading process?  Then that hash could be
> > looked up in a list, and a decision could be made based on the result?
> >  Would this help solve any problems?
>
> The issue I have seen in the prior Qualys linked exploit from my initial
> message and from talks by security researchers elsewhere, for example
> Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> Exploit" by Seth Jenkins, is that people have the ability to target
> these pages during the window between the page being allocated as
> writable by vmalloc.c and the update to the PTE which makes it
> executable, so a signature does help (creates the requirement of more
> than one write to commit "forgery"), but doesn't totally 100% solve the
> problem.
>
> Right now, every time I open up chrome on our latest flagship the
> browsers sandbox filters trigger my EL2 monitor because they are
> attempting to follow the standard W^X protocol. If I were to build one
> of these exploits, I'd:
>
> (1) find out a non-crashing leak for code page and data values
> (2) determine from vmalloc's rb-tree where the next one-page allocation
>     is likely to occur
> (3) prime my write gadget for an offset into that page
> (4) spin up chrome in a second thread
> (5) attempt to trigger a write (or two) at the right precise time using
>     prior empirical measurement or my read gadget for kernel mem
>
> Which is messy, but people have been known to do more given good enough
> stakes. Hell, I spent a few months working on something similar for
> airplane communication management units.

My vague proposal for a "better JIT API" (which you quoted below)
explicitly and completely solves this problem:

>
> > So what would a good solution look like?  It seem to me that the
> > program being supervised (a userspace or kernel JIT) could generate
> > some kind of data structure along these lines:
> >
> > - machine code to be materialized
> >
> > - address and length at which to materialize it (probably
> > page-aligned, but maybe not)
> >
> > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > sure this is useful
> >
> > - a "justification" for the code.  This could be something like "Hey,
> > this is JITted from cBPF for seccomp, and here's the cBPF".

Even ignoring the origin and justification parts, there's no WX window
in here.  The code is generated, then it's shipped off to the
hypervisor/supervisor, and *exactly that code* is materialized !W, X.

Of course, this still leaves verification to be handled.

> Returning to the idea of origins, at the end of the work day yesterday I
> queried Maciej to "have Android choose one compiler for seccomp policies
> to BPF and stick with it", because if I knew filters were chosen by
> libminijail or some other userspace system, I could pretty easily figure
> out what EL2 needs to expect at runtime. An "origin" field would be
> equally as effective, and retain flexibility.

At the risk of a silly suggestion, what if the entire JIT compiler and
verifier (or a sufficient portion) were, itself, a WASM (or similar)
program, signed or whatever, and shipped off to the hypervisor?  The
hypervisor could run it (in whatever sandbox it likes -- hypervisors
are capable of spawning a separate VM to host it if needed), and only
then accept the output.

I, personally, think that this is of extremely dubious value unless
it's paired with a control flow integrity system.  But maybe it could
be!  Something like x86 IBT would be a start, and FineIBT would be
better, as would an ARM equivalent.

--Andy


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-25 18:16             ` Andy Lutomirski
@ 2024-09-25 19:52               ` Maciej Żenczykowski
  2024-09-25 19:53                 ` Maciej Żenczykowski
  0 siblings, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-25 19:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Maxwell Bland, Neill Kapron, linux-arm-msm@vger.kernel.org,
	Andrew Wheeler, Sammy BS2 Que, Todd Kjos, Viktor Martensson,
	keescook@chromium.org, Will Drewry, Andy Gross, Bjorn Andersson,
	Konrad Dybcio, kernel-team, adelva, jeffv

On Wed, Sep 25, 2024 at 11:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > > >
> > > > But don't let me distract from the issue, which is that
> > > > cBPF/eBPF/however these filters get allocated to machine code,
> > > > bpf_int_jit_compile ends up getting called and a new
> > > > privileged-executable page gets allocated without compile-time
> > > > provenance (at least, without reverse engineering) for where that code
> > > > came from.
> > >
> > > But what if there was a mechanism to *cryptographically hash* a BPF
> > > program as part of the loading process?  Then that hash could be
> > > looked up in a list, and a decision could be made based on the result?
> > >  Would this help solve any problems?
> >
> > The issue I have seen in the prior Qualys linked exploit from my initial
> > message and from talks by security researchers elsewhere, for example
> > Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> > Exploit" by Seth Jenkins, is that people have the ability to target
> > these pages during the window between the page being allocated as
> > writable by vmalloc.c and the update to the PTE which makes it
> > executable, so a signature does help (creates the requirement of more
> > than one write to commit "forgery"), but doesn't totally 100% solve the
> > problem.
> >
> > Right now, every time I open up chrome on our latest flagship the
> > browsers sandbox filters trigger my EL2 monitor because they are
> > attempting to follow the standard W^X protocol. If I were to build one
> > of these exploits, I'd:
> >
> > (1) find out a non-crashing leak for code page and data values
> > (2) determine from vmalloc's rb-tree where the next one-page allocation
> >     is likely to occur
> > (3) prime my write gadget for an offset into that page
> > (4) spin up chrome in a second thread
> > (5) attempt to trigger a write (or two) at the right precise time using
> >     prior empirical measurement or my read gadget for kernel mem
> >
> > Which is messy, but people have been known to do more given good enough
> > stakes. Hell, I spent a few months working on something similar for
> > airplane communication management units.
>
> My vague proposal for a "better JIT API" (which you quoted below)
> explicitly and completely solves this problem:
>
> >
> > > So what would a good solution look like?  It seem to me that the
> > > program being supervised (a userspace or kernel JIT) could generate
> > > some kind of data structure along these lines:
> > >
> > > - machine code to be materialized
> > >
> > > - address and length at which to materialize it (probably
> > > page-aligned, but maybe not)
> > >
> > > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > > sure this is useful
> > >
> > > - a "justification" for the code.  This could be something like "Hey,
> > > this is JITted from cBPF for seccomp, and here's the cBPF".
>
> Even ignoring the origin and justification parts, there's no WX window
> in here.  The code is generated, then it's shipped off to the
> hypervisor/supervisor, and *exactly that code* is materialized !W, X.
>
> Of course, this still leaves verification to be handled.
>
> > Returning to the idea of origins, at the end of the work day yesterday I
> > queried Maciej to "have Android choose one compiler for seccomp policies
> > to BPF and stick with it", because if I knew filters were chosen by
> > libminijail or some other userspace system, I could pretty easily figure
> > out what EL2 needs to expect at runtime. An "origin" field would be
> > equally as effective, and retain flexibility.
>
> At the risk of a silly suggestion, what if the entire JIT compiler and
> verifier (or a sufficient portion) were, itself, a WASM (or similar)
> program, signed or whatever, and shipped off to the hypervisor?  The
> hypervisor could run it (in whatever sandbox it likes -- hypervisors
> are capable of spawning a separate VM to host it if needed), and only
> then accept the output.
>
> I, personally, think that this is of extremely dubious value unless
> it's paired with a control flow integrity system.  But maybe it could
> be!  Something like x86 IBT would be a start, and FineIBT would be
> better, as would an ARM equivalent.
>
> --Andy

I've heard rumours (probably read some LWN article perhaps
https://lwn.net/Articles/836693/ ) that protected kvm for Android has
some mechanism to start the kernel in some higher priv level (EL2?),
then move most of it to EL1 while keeping a protected VPN shim in EL2.

Perhaps the answer is to leave the bpf verifier + jit compiler in EL2?

--
Maciej Żenczykowski, Kernel Networking Developer @ Google

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-25 19:52               ` Maciej Żenczykowski
@ 2024-09-25 19:53                 ` Maciej Żenczykowski
  2024-09-30 11:22                   ` Sebastian Ene
  0 siblings, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-25 19:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Maxwell Bland, Neill Kapron, linux-arm-msm@vger.kernel.org,
	Andrew Wheeler, Sammy BS2 Que, Todd Kjos, Viktor Martensson,
	keescook@chromium.org, Will Drewry, Andy Gross, Bjorn Andersson,
	Konrad Dybcio, kernel-team, adelva, jeffv

On Wed, Sep 25, 2024 at 12:52 PM Maciej Żenczykowski <maze@google.com> wrote:
>
> On Wed, Sep 25, 2024 at 11:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
> > >
> > > On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > > > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > > > >
> > > > > But don't let me distract from the issue, which is that
> > > > > cBPF/eBPF/however these filters get allocated to machine code,
> > > > > bpf_int_jit_compile ends up getting called and a new
> > > > > privileged-executable page gets allocated without compile-time
> > > > > provenance (at least, without reverse engineering) for where that code
> > > > > came from.
> > > >
> > > > But what if there was a mechanism to *cryptographically hash* a BPF
> > > > program as part of the loading process?  Then that hash could be
> > > > looked up in a list, and a decision could be made based on the result?
> > > >  Would this help solve any problems?
> > >
> > > The issue I have seen in the prior Qualys linked exploit from my initial
> > > message and from talks by security researchers elsewhere, for example
> > > Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> > > Exploit" by Seth Jenkins, is that people have the ability to target
> > > these pages during the window between the page being allocated as
> > > writable by vmalloc.c and the update to the PTE which makes it
> > > executable, so a signature does help (creates the requirement of more
> > > than one write to commit "forgery"), but doesn't totally 100% solve the
> > > problem.
> > >
> > > Right now, every time I open up chrome on our latest flagship the
> > > browsers sandbox filters trigger my EL2 monitor because they are
> > > attempting to follow the standard W^X protocol. If I were to build one
> > > of these exploits, I'd:
> > >
> > > (1) find out a non-crashing leak for code page and data values
> > > (2) determine from vmalloc's rb-tree where the next one-page allocation
> > >     is likely to occur
> > > (3) prime my write gadget for an offset into that page
> > > (4) spin up chrome in a second thread
> > > (5) attempt to trigger a write (or two) at the right precise time using
> > >     prior empirical measurement or my read gadget for kernel mem
> > >
> > > Which is messy, but people have been known to do more given good enough
> > > stakes. Hell, I spent a few months working on something similar for
> > > airplane communication management units.
> >
> > My vague proposal for a "better JIT API" (which you quoted below)
> > explicitly and completely solves this problem:
> >
> > >
> > > > So what would a good solution look like?  It seem to me that the
> > > > program being supervised (a userspace or kernel JIT) could generate
> > > > some kind of data structure along these lines:
> > > >
> > > > - machine code to be materialized
> > > >
> > > > - address and length at which to materialize it (probably
> > > > page-aligned, but maybe not)
> > > >
> > > > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > > > sure this is useful
> > > >
> > > > - a "justification" for the code.  This could be something like "Hey,
> > > > this is JITted from cBPF for seccomp, and here's the cBPF".
> >
> > Even ignoring the origin and justification parts, there's no WX window
> > in here.  The code is generated, then it's shipped off to the
> > hypervisor/supervisor, and *exactly that code* is materialized !W, X.
> >
> > Of course, this still leaves verification to be handled.
> >
> > > Returning to the idea of origins, at the end of the work day yesterday I
> > > queried Maciej to "have Android choose one compiler for seccomp policies
> > > to BPF and stick with it", because if I knew filters were chosen by
> > > libminijail or some other userspace system, I could pretty easily figure
> > > out what EL2 needs to expect at runtime. An "origin" field would be
> > > equally as effective, and retain flexibility.
> >
> > At the risk of a silly suggestion, what if the entire JIT compiler and
> > verifier (or a sufficient portion) were, itself, a WASM (or similar)
> > program, signed or whatever, and shipped off to the hypervisor?  The
> > hypervisor could run it (in whatever sandbox it likes -- hypervisors
> > are capable of spawning a separate VM to host it if needed), and only
> > then accept the output.
> >
> > I, personally, think that this is of extremely dubious value unless
> > it's paired with a control flow integrity system.  But maybe it could
> > be!  Something like x86 IBT would be a start, and FineIBT would be
> > better, as would an ARM equivalent.
> >
> > --Andy
>
> I've heard rumours (probably read some LWN article perhaps
> https://lwn.net/Articles/836693/ ) that protected kvm for Android has
> some mechanism to start the kernel in some higher priv level (EL2?),
> then move most of it to EL1 while keeping a protected VPN shim in EL2.

s/VPN/KVM/

>
> Perhaps the answer is to leave the bpf verifier + jit compiler in EL2?

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-25 19:53                 ` Maciej Żenczykowski
@ 2024-09-30 11:22                   ` Sebastian Ene
  2024-09-30 18:43                     ` Maxwell Bland
  2024-09-30 23:35                     ` Maciej Żenczykowski
  0 siblings, 2 replies; 26+ messages in thread
From: Sebastian Ene @ 2024-09-30 11:22 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Andy Lutomirski, Maxwell Bland, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Wed, Sep 25, 2024 at 12:53:11PM -0700, 'Maciej Żenczykowski' via kernel-team wrote:
> On Wed, Sep 25, 2024 at 12:52 PM Maciej Żenczykowski <maze@google.com> wrote:
> >
> > On Wed, Sep 25, 2024 at 11:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > >
> > > > On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > > > > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > > > > >
> > > > > > But don't let me distract from the issue, which is that
> > > > > > cBPF/eBPF/however these filters get allocated to machine code,
> > > > > > bpf_int_jit_compile ends up getting called and a new
> > > > > > privileged-executable page gets allocated without compile-time
> > > > > > provenance (at least, without reverse engineering) for where that code
> > > > > > came from.
> > > > >
> > > > > But what if there was a mechanism to *cryptographically hash* a BPF
> > > > > program as part of the loading process?  Then that hash could be
> > > > > looked up in a list, and a decision could be made based on the result?
> > > > >  Would this help solve any problems?
> > > >
> > > > The issue I have seen in the prior Qualys linked exploit from my initial
> > > > message and from talks by security researchers elsewhere, for example
> > > > Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> > > > Exploit" by Seth Jenkins, is that people have the ability to target
> > > > these pages during the window between the page being allocated as
> > > > writable by vmalloc.c and the update to the PTE which makes it
> > > > executable, so a signature does help (creates the requirement of more
> > > > than one write to commit "forgery"), but doesn't totally 100% solve the
> > > > problem.
> > > >
> > > > Right now, every time I open up chrome on our latest flagship the
> > > > browsers sandbox filters trigger my EL2 monitor because they are
> > > > attempting to follow the standard W^X protocol. If I were to build one
> > > > of these exploits, I'd:
> > > >
> > > > (1) find out a non-crashing leak for code page and data values
> > > > (2) determine from vmalloc's rb-tree where the next one-page allocation
> > > >     is likely to occur
> > > > (3) prime my write gadget for an offset into that page
> > > > (4) spin up chrome in a second thread
> > > > (5) attempt to trigger a write (or two) at the right precise time using
> > > >     prior empirical measurement or my read gadget for kernel mem
> > > >
> > > > Which is messy, but people have been known to do more given good enough
> > > > stakes. Hell, I spent a few months working on something similar for
> > > > airplane communication management units.
> > >
> > > My vague proposal for a "better JIT API" (which you quoted below)
> > > explicitly and completely solves this problem:
> > >
> > > >
> > > > > So what would a good solution look like?  It seem to me that the
> > > > > program being supervised (a userspace or kernel JIT) could generate
> > > > > some kind of data structure along these lines:
> > > > >
> > > > > - machine code to be materialized
> > > > >
> > > > > - address and length at which to materialize it (probably
> > > > > page-aligned, but maybe not)
> > > > >
> > > > > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > > > > sure this is useful
> > > > >
> > > > > - a "justification" for the code.  This could be something like "Hey,
> > > > > this is JITted from cBPF for seccomp, and here's the cBPF".
> > >
> > > Even ignoring the origin and justification parts, there's no WX window
> > > in here.  The code is generated, then it's shipped off to the
> > > hypervisor/supervisor, and *exactly that code* is materialized !W, X.
> > >
> > > Of course, this still leaves verification to be handled.
> > >
> > > > Returning to the idea of origins, at the end of the work day yesterday I
> > > > queried Maciej to "have Android choose one compiler for seccomp policies
> > > > to BPF and stick with it", because if I knew filters were chosen by
> > > > libminijail or some other userspace system, I could pretty easily figure
> > > > out what EL2 needs to expect at runtime. An "origin" field would be
> > > > equally as effective, and retain flexibility.
> > >
> > > At the risk of a silly suggestion, what if the entire JIT compiler and
> > > verifier (or a sufficient portion) were, itself, a WASM (or similar)
> > > program, signed or whatever, and shipped off to the hypervisor?  The
> > > hypervisor could run it (in whatever sandbox it likes -- hypervisors
> > > are capable of spawning a separate VM to host it if needed), and only
> > > then accept the output.
> > >
> > > I, personally, think that this is of extremely dubious value unless
> > > it's paired with a control flow integrity system.  But maybe it could
> > > be!  Something like x86 IBT would be a start, and FineIBT would be
> > > better, as would an ARM equivalent.
> > >
> > > --Andy
> >

Hi,

In response to your previous message (this is Seb from pKVM team):


> > I've heard rumours (probably read some LWN article perhaps
> > https://lwn.net/Articles/836693/ ) that protected kvm for Android has
> > some mechanism to start the kernel in some higher priv level (EL2?),
> > then move most of it to EL1 while keeping a protected VPN shim in EL2.
> 
> s/VPN/KVM/

Yes, we do initialize the pKVM hypervisor at EL2 fairly early, at
device_initcall_sync (initcall 5), before we deprivilege the rest of the
kernel to EL1.

> 
> >
> > Perhaps the answer is to leave the bpf verifier + jit compiler in EL2?
> 

What are the gains of moving this to EL2? I am a bit late to this party.
We don't have any init at that stage because it is too early. We do
support loading some EL2 vendor modules from a ramdisk, but this is a
different story.


Thanks,
Seb

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-30 11:22                   ` Sebastian Ene
@ 2024-09-30 18:43                     ` Maxwell Bland
  2024-09-30 23:35                     ` Maciej Żenczykowski
  1 sibling, 0 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-30 18:43 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: Andy Lutomirski, Maxwell Bland, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Mon, Sep 30, 2024 at 11:22:22AM GMT, Sebastian Ene wrote:
> On Wed, Sep 25, 2024 at 12:53:11PM -0700, 'Maciej Żenczykowski' via kernel-team wrote:
> > On Wed, Sep 25, 2024 at 12:52 PM Maciej Żenczykowski <maze@google.com> wrote:
> > >
> > > On Wed, Sep 25, 2024 at 11:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > >
> > > > > On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > > > > > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > > > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > > > > > >
> > > > > > > But don't let me distract from the issue, which is that
> > > > > > > cBPF/eBPF/however these filters get allocated to machine code,
> > > > > > > bpf_int_jit_compile ends up getting called and a new
> > > > > > > privileged-executable page gets allocated without compile-time
> > > > > > > provenance (at least, without reverse engineering) for where that code
> > > > > > > came from.
> > > > > >
> > > > > > But what if there was a mechanism to *cryptographically hash* a BPF
> > > > > > program as part of the loading process?  Then that hash could be
> > > > > > looked up in a list, and a decision could be made based on the result?
> > > > > >  Would this help solve any problems?
> > > > >
> > > > > The issue I have seen in the prior Qualys linked exploit from my initial
> > > > > message and from talks by security researchers elsewhere, for example
> > > > > Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> > > > > Exploit" by Seth Jenkins, is that people have the ability to target
> > > > > these pages during the window between the page being allocated as
> > > > > writable by vmalloc.c and the update to the PTE which makes it
> > > > > executable, so a signature does help (creates the requirement of more
> > > > > than one write to commit "forgery"), but doesn't totally 100% solve the
> > > > > problem.
> > > > >
> > > > > Right now, every time I open up chrome on our latest flagship the
> > > > > browsers sandbox filters trigger my EL2 monitor because they are
> > > > > attempting to follow the standard W^X protocol. If I were to build one
> > > > > of these exploits, I'd:
> > > > >
> > > > > (1) find out a non-crashing leak for code page and data values
> > > > > (2) determine from vmalloc's rb-tree where the next one-page allocation
> > > > >     is likely to occur
> > > > > (3) prime my write gadget for an offset into that page
> > > > > (4) spin up chrome in a second thread
> > > > > (5) attempt to trigger a write (or two) at the right precise time using
> > > > >     prior empirical measurement or my read gadget for kernel mem
> > > > >
> > > > > Which is messy, but people have been known to do more given good enough
> > > > > stakes. Hell, I spent a few months working on something similar for
> > > > > airplane communication management units.
> > > >
> > > > My vague proposal for a "better JIT API" (which you quoted below)
> > > > explicitly and completely solves this problem:
> > > >
> > > > >
> > > > > > So what would a good solution look like?  It seem to me that the
> > > > > > program being supervised (a userspace or kernel JIT) could generate
> > > > > > some kind of data structure along these lines:
> > > > > >
> > > > > > - machine code to be materialized
> > > > > >
> > > > > > - address and length at which to materialize it (probably
> > > > > > page-aligned, but maybe not)
> > > > > >
> > > > > > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > > > > > sure this is useful
> > > > > >
> > > > > > - a "justification" for the code.  This could be something like "Hey,
> > > > > > this is JITted from cBPF for seccomp, and here's the cBPF".
> > > >
> > > > Even ignoring the origin and justification parts, there's no WX window
> > > > in here.  The code is generated, then it's shipped off to the
> > > > hypervisor/supervisor, and *exactly that code* is materialized !W, X.
> > > >
> > > > Of course, this still leaves verification to be handled.
> > > >
> > > > > Returning to the idea of origins, at the end of the work day yesterday I
> > > > > queried Maciej to "have Android choose one compiler for seccomp policies
> > > > > to BPF and stick with it", because if I knew filters were chosen by
> > > > > libminijail or some other userspace system, I could pretty easily figure
> > > > > out what EL2 needs to expect at runtime. An "origin" field would be
> > > > > equally as effective, and retain flexibility.
> > > >
> > > > At the risk of a silly suggestion, what if the entire JIT compiler and
> > > > verifier (or a sufficient portion) were, itself, a WASM (or similar)
> > > > program, signed or whatever, and shipped off to the hypervisor?  The
> > > > hypervisor could run it (in whatever sandbox it likes -- hypervisors
> > > > are capable of spawning a separate VM to host it if needed), and only
> > > > then accept the output.
> > > >
> > > > I, personally, think that this is of extremely dubious value unless
> > > > it's paired with a control flow integrity system.  But maybe it could
> > > > be!  Something like x86 IBT would be a start, and FineIBT would be
> > > > better, as would an ARM equivalent.
> > > >
> > > > --Andy
> > >
> 
> Hi,
> 
> In response to your previous message (this is Seb from pKVM team):
> 
> 
> > > I've heard rumours (probably read some LWN article perhaps
> > > https://lwn.net/Articles/836693/ ) that protected kvm for Android has
> > > some mechanism to start the kernel in some higher priv level (EL2?),
> > > then move most of it to EL1 while keeping a protected VPN shim in EL2.
> > 
> > s/VPN/KVM/
> 
> Yes we do initialize the pKVM hypervisor at EL2 fairly early at
> device_initcall_sync (initcall 5) before we depriviledge the rest of the
> kernel at EL1.
> 

Implementing code page integrity checks in pKVM as a reference spec for all the
other EL2 developers and the kernel to "do the right thing" for
hypervisor-based exploit prevention and kernel integrity checking would be a
major success for ARM/Google. I am hoping I can get Moto to release our code.

> >
> > >
> > > Perhaps the answer is to leave the bpf verifier + jit compiler in EL2?
> >
>
> What are the gains to move this at EL2 ? I am a bit late to this party.
> We don't have any init at that stage because it is too early. We do
> support some EL2 vendor modules loading from a ramdisk but this is a
> different story.
>

I see moving the full JIT/verifier into EL2 as problematic because of the
increased attack surface. We've seen many Project Zero-originated and
third-party exploits targeting EL2 SMC interfaces on Android. *cough* A certain
galactic-themed phone manufacturer claims to have a system protecting these
code pages, yet never seemed to mention the complications seccomp creates, let
alone the impossibility of filtering page table updates on Snapdragon chipsets
without reworking vmalloc infrastructure in what must be a GPL-2.0 compliant
interface they never made open source, and has had serious SMC-call based CVEs
in the past *cough* https://project-zero.issues.chromium.org/issues/42452502 *cough*

From empirical evidence of implementing hypervisor-enforced code/data
integrity, the only runtime interface needed for protecting everything but
dynamically modifiable data structures (e.g. kworker queues) on Android is the
standard EL2 page-level permission fault handler.

---

I hope it was clear from my base PoC code that to ensure the filter is a "pure
function" it is enough to reproduce the memory access semantics and protections
introduced by BPF's verifier.c with additional limited scope. This at least
practically ensures something using the mechanism of CVE-2021-33909 (_on
Android in particular, generic linux is another ballpark_) cannot transform the
seccomp code page into something breaking verifier.c's semantics. Though it may
break the intended seccomp policy itself, I think that level of checking can be
added on as an additional layer once this basic exploit is resolved.

This in mind, as of last week, I've gone ahead and gotten my earlier, buggy PoC
for an EL2-level seccomp verifier (in another earlier email in this chain)
running on a real device. After I fixed some other bugs (they are pretty
obvious once you look through the PoC code), I discovered empirically a QCOM
hardware abstraction layer (HAL) service has a filter program that uses the
stack (uses a store instruction relative to BPF_REG_FP), so my initial hope of
"banning" stores to memory outright *did* end up a no-go for one empirical
case. The clear solution seems to be to relax the restrictions the smallest
amount possible: check that stores in the program are in the predefined stack
memory scratch area.
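
A minimal sketch of that relaxed check at the classic BPF level, under stated
assumptions: stores are plain cBPF BPF_ST/BPF_STX instructions whose k field
indexes the scratch words M[0..BPF_MEMWORDS-1]. The struct and constants are
local stand-ins for the linux/filter.h definitions, and the helper name
cbpf_stores_in_scratch is hypothetical. This is not the PoC code from earlier
in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in with the same layout as struct sock_filter, so the
 * sketch builds without kernel headers. */
struct cbpf_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

#define CBPF_CLASS(code) ((code) & 0x07)
#define CBPF_ST          0x02	/* M[k] = A */
#define CBPF_STX         0x03	/* M[k] = X */
#define CBPF_MEMWORDS    16	/* number of scratch words (BPF_MEMWORDS) */

/* Hypothetical helper: returns true only if every store instruction in
 * the program targets a slot inside the scratch area. */
static bool cbpf_stores_in_scratch(const struct cbpf_insn *prog, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint16_t cls = CBPF_CLASS(prog[i].code);

		if ((cls == CBPF_ST || cls == CBPF_STX) &&
		    prog[i].k >= CBPF_MEMWORDS)
			return false;
	}
	return true;
}
```

Non-store instructions pass through untouched, so the check only narrows what
a store may touch rather than banning stores outright.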

Thankfully, BPF_REG_FP is read-only. And I totally understand and support the
possibility that a filter program can load/store from its well-defined stack
space as a scratch area. Additionally, bpf_check_classic and the existence of

        BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK);

seems to communicate to me that the intention here is that the BPF_STX
allowed by seccomp's verifier is limited to the defined stack space relative to
FP. My gut says check_load_and_stores and bpf_check_classic are not technically
as strict as they should be for the intentions of SECCOMP, but just happen to
work. I'd expect to see some code that just says "every store must index memory
from a well-defined offset from the read-only FP", but I don't quite see that,
unless I missed it (I think it has something to do with how fp->k is decided
for struct sock_filter in the classic verifier), despite there being
indications elsewhere that this was the intention.

But maybe I am incorrect to assume the stack is the theoretical limit on what
can be the destination of a seccomp filter store instruction, for now (and
into the future)? If I am, why? Is there an explicit exception I can make
in an EL2 verifier for filter programs that do not abide by these rules?

For now, based on what I am seeing in the kernel, I think it may be fine to
assume BPF_MEMWORDS associated with the stack is the theoretical limit on which
memory should be store-able, and "hope" that all associated instructions are
FP-relative.  If not, the alias register should be readily resolvable back to
FP, though having a formal contract that it would be FP-relative in the
kernel's JIT for cBPF would be awesome.
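
The "every store is FP-relative and lands in the stack" rule can be sketched
for the eBPF form of a filter as follows. This is an illustration under
assumptions, not the kernel's verifier: the struct mirrors the layout of
struct bpf_insn, the opcode constants are local copies of the UAPI values,
register 10 is taken as the frame pointer, the stack is taken to be 512 bytes
(MAX_BPF_STACK), and the helper names are hypothetical. It does not resolve
aliases of FP held in other registers; it simply rejects any store whose base
register is not FP.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the eBPF instruction layout (mirrors struct bpf_insn). */
struct ebpf_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

#define EBPF_CLASS(code) ((code) & 0x07)
#define EBPF_SIZE(code)  ((code) & 0x18)
#define EBPF_MODE(code)  ((code) & 0xe0)
#define EBPF_ST          0x02	/* store immediate */
#define EBPF_STX         0x03	/* store register */
#define EBPF_MEM         0x60	/* plain memory access mode */
#define EBPF_REG_FP      10	/* read-only frame pointer */
#define EBPF_MAX_STACK   512	/* assumed stack size in bytes */

/* Width in bytes of the access encoded in the opcode's size bits. */
static int ebpf_access_bytes(uint8_t code)
{
	switch (EBPF_SIZE(code)) {
	case 0x00: return 4;	/* BPF_W */
	case 0x08: return 2;	/* BPF_H */
	case 0x10: return 1;	/* BPF_B */
	default:   return 8;	/* BPF_DW (0x18) */
	}
}

/* Hypothetical helper: true only if every store is a plain memory store
 * based on FP and fully contained in [FP - EBPF_MAX_STACK, FP). */
static bool ebpf_stores_fp_relative(const struct ebpf_insn *prog, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t cls = EBPF_CLASS(prog[i].code);
		int off = prog[i].off;

		if (cls != EBPF_ST && cls != EBPF_STX)
			continue;
		/* Anything other than a plain store (e.g. atomics) is refused. */
		if (EBPF_MODE(prog[i].code) != EBPF_MEM)
			return false;
		if (prog[i].dst_reg != EBPF_REG_FP)
			return false;
		if (off < -EBPF_MAX_STACK ||
		    off + ebpf_access_bytes(prog[i].code) > 0)
			return false;
	}
	return true;
}
```

An EL2 monitor looking at JITted machine code would need the equivalent check
expressed against the architecture's store encodings and whichever native
register the JIT maps FP to.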

Regards,
Maxwell Bland

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-30 11:22                   ` Sebastian Ene
  2024-09-30 18:43                     ` Maxwell Bland
@ 2024-09-30 23:35                     ` Maciej Żenczykowski
  2024-09-30 23:41                       ` Maciej Żenczykowski
  1 sibling, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-30 23:35 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: Andy Lutomirski, Maxwell Bland, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Mon, Sep 30, 2024 at 4:22 AM Sebastian Ene <sebastianene@google.com> wrote:
>
> On Wed, Sep 25, 2024 at 12:53:11PM -0700, 'Maciej Żenczykowski' via kernel-team wrote:
> > On Wed, Sep 25, 2024 at 12:52 PM Maciej Żenczykowski <maze@google.com> wrote:
> > >
> > > On Wed, Sep 25, 2024 at 11:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > >
> > > > > On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > > > > > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > > > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > > > > > >
> > > > > > > But don't let me distract from the issue, which is that
> > > > > > > cBPF/eBPF/however these filters get allocated to machine code,
> > > > > > > bpf_int_jit_compile ends up getting called and a new
> > > > > > > privileged-executable page gets allocated without compile-time
> > > > > > > provenance (at least, without reverse engineering) for where that code
> > > > > > > came from.
> > > > > >
> > > > > > But what if there was a mechanism to *cryptographically hash* a BPF
> > > > > > program as part of the loading process?  Then that hash could be
> > > > > > looked up in a list, and a decision could be made based on the result?
> > > > > >  Would this help solve any problems?
> > > > >
> > > > > The issue I have seen in the prior Qualys linked exploit from my initial
> > > > > message and from talks by security researchers elsewhere, for example
> > > > > Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> > > > > Exploit" by Seth Jenkins, is that people have the ability to target
> > > > > these pages during the window between the page being allocated as
> > > > > writable by vmalloc.c and the update to the PTE which makes it
> > > > > executable, so a signature does help (creates the requirement of more
> > > > > than one write to commit "forgery"), but doesn't totally 100% solve the
> > > > > problem.
> > > > >
> > > > > Right now, every time I open up chrome on our latest flagship the
> > > > > browsers sandbox filters trigger my EL2 monitor because they are
> > > > > attempting to follow the standard W^X protocol. If I were to build one
> > > > > of these exploits, I'd:
> > > > >
> > > > > (1) find out a non-crashing leak for code page and data values
> > > > > (2) determine from vmalloc's rb-tree where the next one-page allocation
> > > > >     is likely to occur
> > > > > (3) prime my write gadget for an offset into that page
> > > > > (4) spin up chrome in a second thread
> > > > > (5) attempt to trigger a write (or two) at the right precise time using
> > > > >     prior empirical measurement or my read gadget for kernel mem
> > > > >
> > > > > Which is messy, but people have been known to do more given good enough
> > > > > stakes. Hell, I spent a few months working on something similar for
> > > > > airplane communication management units.
> > > >
> > > > My vague proposal for a "better JIT API" (which you quoted below)
> > > > explicitly and completely solves this problem:
> > > >
> > > > >
> > > > > > So what would a good solution look like?  It seem to me that the
> > > > > > program being supervised (a userspace or kernel JIT) could generate
> > > > > > some kind of data structure along these lines:
> > > > > >
> > > > > > - machine code to be materialized
> > > > > >
> > > > > > - address and length at which to materialize it (probably
> > > > > > page-aligned, but maybe not)
> > > > > >
> > > > > > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > > > > > sure this is useful
> > > > > >
> > > > > > - a "justification" for the code.  This could be something like "Hey,
> > > > > > this is JITted from cBPF for seccomp, and here's the cBPF".
> > > >
> > > > Even ignoring the origin and justification parts, there's no WX window
> > > > in here.  The code is generated, then it's shipped off to the
> > > > hypervisor/supervisor, and *exactly that code* is materialized !W, X.
> > > >
> > > > Of course, this still leaves verification to be handled.
> > > >
> > > > > Returning to the idea of origins, at the end of the work day yesterday I
> > > > > queried Maciej to "have Android choose one compiler for seccomp policies
> > > > > to BPF and stick with it", because if I knew filters were chosen by
> > > > > libminijail or some other userspace system, I could pretty easily figure
> > > > > out what EL2 needs to expect at runtime. An "origin" field would be
> > > > > equally as effective, and retain flexibility.
> > > >
> > > > At the risk of a silly suggestion, what if the entire JIT compiler and
> > > > verifier (or a sufficient portion) were, itself, a WASM (or similar)
> > > > program, signed or whatever, and shipped off to the hypervisor?  The
> > > > hypervisor could run it (in whatever sandbox it likes -- hypervisors
> > > > are capable of spawning a separate VM to host it if needed), and only
> > > > then accept the output.
> > > >
> > > > I, personally, think that this is of extremely dubious value unless
> > > > it's paired with a control flow integrity system.  But maybe it could
> > > > be!  Something like x86 IBT would be a start, and FineIBT would be
> > > > better, as would an ARM equivalent.
> > > >
> > > > --Andy
> > >
>
> Hi,
>
> In response to your previous message (this is Seb from pKVM team):
>
>
> > > I've heard rumours (probably read some LWN article perhaps
> > > https://lwn.net/Articles/836693/ ) that protected kvm for Android has
> > > some mechanism to start the kernel in some higher priv level (EL2?),
> > > then move most of it to EL1 while keeping a protected VPN shim in EL2.
> >
> > s/VPN/KVM/
>
> Yes we do initialize the pKVM hypervisor at EL2 fairly early at
> device_initcall_sync (initcall 5) before we deprivilege the rest of the
> kernel at EL1.

I'd love to learn more about this for some unrelated reasons.
Even been considering dropping by London to chat about it (with Will)
at some point.

> > > Perhaps the answer is to leave the bpf verifier + jit compiler in EL2?
> >
> What are the gains to move this at EL2 ? I am a bit late to this party.
> We don't have any init at that stage because it is too early. We do
> support some EL2 vendor modules loading from a ramdisk but this is a
> different story.

I think the OP is trying to verify the 'sanctity' of EL1 code pages.
(ie. prove via signature that they're all legit, which is hard with jit)
Presumably he's doing this from EL2 (I seriously doubt he's in EL3).
There's been talk of
unjitting/rejitting/regenerating/peephole-verifying the BPF jitted
dynamically generated kernel executable pages - to verify they're
'safe'.
Moving just the 'bpf verifier/jit' into EL2 would seem to solve that
particular problem.
Though of course that is a fair bit of code (though the only untrusted
input to it, post boot completion, is cBPF which is pretty small in
scope)...
A compromised EL0/EL1 would no longer be able to use a write gadget to
overwrite the bpf jitted kernel executable pages before they are marked -W+X.
I'm not certain how much of a win in safety this is though?
I guess it depends on how easy the bpf verifier/jitter is to audit.


>
>
> Thanks,
> Seb


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-30 23:35                     ` Maciej Żenczykowski
@ 2024-09-30 23:41                       ` Maciej Żenczykowski
  2024-10-01 16:34                         ` Maxwell Bland
  0 siblings, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-30 23:41 UTC (permalink / raw)
  To: Sebastian Ene
  Cc: Andy Lutomirski, Maxwell Bland, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Mon, Sep 30, 2024 at 4:35 PM Maciej Żenczykowski <maze@google.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:22 AM Sebastian Ene <sebastianene@google.com> wrote:
> >
> > On Wed, Sep 25, 2024 at 12:53:11PM -0700, 'Maciej Żenczykowski' via kernel-team wrote:
> > > On Wed, Sep 25, 2024 at 12:52 PM Maciej Żenczykowski <maze@google.com> wrote:
> > > >
> > > > On Wed, Sep 25, 2024 at 11:16 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > > > >
> > > > > On Tue, Sep 17, 2024 at 8:08 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > > >
> > > > > > On Fri, Sep 13, 2024 at 09:18:58PM GMT, Andy Lutomirski wrote:
> > > > > > > On Fri, Sep 13, 2024 at 10:30 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > > > > > > On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> > > > > > > >
> > > > > > > > But don't let me distract from the issue, which is that
> > > > > > > > cBPF/eBPF/however these filters get allocated to machine code,
> > > > > > > > bpf_int_jit_compile ends up getting called and a new
> > > > > > > > privileged-executable page gets allocated without compile-time
> > > > > > > > provenance (at least, without reverse engineering) for where that code
> > > > > > > > came from.
> > > > > > >
> > > > > > > But what if there was a mechanism to *cryptographically hash* a BPF
> > > > > > > program as part of the loading process?  Then that hash could be
> > > > > > > looked up in a list, and a decision could be made based on the result?
> > > > > > >  Would this help solve any problems?
> > > > > >
> > > > > > The issue I have seen in the prior Qualys linked exploit from my initial
> > > > > > message and from talks by security researchers elsewhere, for example
> > > > > > Google Project Zero's recent "Analyzing a Modern In-the-wild Android
> > > > > > Exploit" by Seth Jenkins, is that people have the ability to target
> > > > > > these pages during the window between the page being allocated as
> > > > > > writable by vmalloc.c and the update to the PTE which makes it
> > > > > > executable, so a signature does help (creates the requirement of more
> > > > > > than one write to commit "forgery"), but doesn't totally 100% solve the
> > > > > > problem.
> > > > > >
> > > > > > Right now, every time I open up chrome on our latest flagship the
> > > > > > browsers sandbox filters trigger my EL2 monitor because they are
> > > > > > attempting to follow the standard W^X protocol. If I were to build one
> > > > > > of these exploits, I'd:
> > > > > >
> > > > > > (1) find out a non-crashing leak for code page and data values
> > > > > > (2) determine from vmalloc's rb-tree where the next one-page allocation
> > > > > >     is likely to occur
> > > > > > (3) prime my write gadget for an offset into that page
> > > > > > (4) spin up chrome in a second thread
> > > > > > (5) attempt to trigger a write (or two) at the right precise time using
> > > > > >     prior empirical measurement or my read gadget for kernel mem
> > > > > >
> > > > > > Which is messy, but people have been known to do more given good enough
> > > > > > stakes. Hell, I spent a few months working on something similar for
> > > > > > airplane communication management units.
> > > > >
> > > > > My vague proposal for a "better JIT API" (which you quoted below)
> > > > > explicitly and completely solves this problem:
> > > > >
> > > > > >
> > > > > > > So what would a good solution look like?  It seem to me that the
> > > > > > > program being supervised (a userspace or kernel JIT) could generate
> > > > > > > some kind of data structure along these lines:
> > > > > > >
> > > > > > > - machine code to be materialized
> > > > > > >
> > > > > > > - address and length at which to materialize it (probably
> > > > > > > page-aligned, but maybe not)
> > > > > > >
> > > > > > > - an "origin" of this code (perhaps a file handle?) -- I'm not 100%
> > > > > > > sure this is useful
> > > > > > >
> > > > > > > - a "justification" for the code.  This could be something like "Hey,
> > > > > > > this is JITted from cBPF for seccomp, and here's the cBPF".
> > > > >
> > > > > Even ignoring the origin and justification parts, there's no WX window
> > > > > in here.  The code is generated, then it's shipped off to the
> > > > > hypervisor/supervisor, and *exactly that code* is materialized !W, X.
> > > > >
> > > > > Of course, this still leaves verification to be handled.
> > > > >
> > > > > > Returning to the idea of origins, at the end of the work day yesterday I
> > > > > > queried Maciej to "have Android choose one compiler for seccomp policies
> > > > > > to BPF and stick with it", because if I knew filters were chosen by
> > > > > > libminijail or some other userspace system, I could pretty easily figure
> > > > > > out what EL2 needs to expect at runtime. An "origin" field would be
> > > > > > equally as effective, and retain flexibility.
> > > > >
> > > > > At the risk of a silly suggestion, what if the entire JIT compiler and
> > > > > verifier (or a sufficient portion) were, itself, a WASM (or similar)
> > > > > program, signed or whatever, and shipped off to the hypervisor?  The
> > > > > hypervisor could run it (in whatever sandbox it likes -- hypervisors
> > > > > are capable of spawning a separate VM to host it if needed), and only
> > > > > then accept the output.
> > > > >
> > > > > I, personally, think that this is of extremely dubious value unless
> > > > > it's paired with a control flow integrity system.  But maybe it could
> > > > > be!  Something like x86 IBT would be a start, and FineIBT would be
> > > > > better, as would an ARM equivalent.
> > > > >
> > > > > --Andy
> > > >
> >
> > Hi,
> >
> > In response to your previous message (this is Seb from pKVM team):
> >
> >
> > > > I've heard rumours (probably read some LWN article perhaps
> > > > https://lwn.net/Articles/836693/ ) that protected kvm for Android has
> > > > some mechanism to start the kernel in some higher priv level (EL2?),
> > > > then move most of it to EL1 while keeping a protected VPN shim in EL2.
> > >
> > > s/VPN/KVM/
> >
> > Yes we do initialize the pKVM hypervisor at EL2 fairly early at
> > device_initcall_sync (initcall 5) before we deprivilege the rest of the
> > kernel at EL1.
>
> I'd love to learn more about this for some unrelated reasons.
> Even been considering dropping by London to chat about it (with Will)
> at some point.
>
> > > > Perhaps the answer is to leave the bpf verifier + jit compiler in EL2?
> > >
> > What are the gains to move this at EL2 ? I am a bit late to this party.
> > We don't have any init at that stage because it is too early. We do
> > support some EL2 vendor modules loading from a ramdisk but this is a
> > different story.
>
> I think the OP is trying to verify the 'sanctity' of EL1 code pages.
> (ie. prove via signature that they're all legit, which is hard with jit)
> Presumably he's doing this from EL2 (I seriously doubt he's in EL3).
> There's been talk of
> unjitting/rejitting/regenerating/peephole-verifying the BPF jitted
> dynamically generated kernel executable pages - to verify they're
> 'safe'.
> Moving just the 'bpf verifier/jit' into EL2 would seem to solve that
> particular problem.
> Though of course that is a fair bit of code (though the only untrusted
> input to it, post boot completion, is cBPF which is pretty small in
> scope)...
> Compromises of EL0/EL1 would no longer be able to write gadget over
> the bpf jitted kernel executable page prior to them being marked -W+X.
> I'm not certain how much of a win in safety this is though?
> I guess it depends on how easy the bpf verifier/jitter is to audit.

Note: if the full blown bpf verifier/jitter is too hard to audit, you
could potentially write a new EL2 jitter just for cBPF.  It could just
be a trimmed down version of the generic eBPF jitter.  cBPF is much
much simpler.
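
To give a rough sense of how small a cBPF-only engine can be, here is a
minimal userspace sketch of a validator and interpreter for a handful of
the opcodes seccomp filters use. It is not the kernel's code and not an
EL2 jitter: struct cbpf_insn, cbpf_validate() and cbpf_run() are
hypothetical names made up for this illustration, only five opcodes are
handled, and a real implementation would need the full set seccomp
accepts. The instruction layout and opcode values follow the classic BPF
encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type; same field layout as the kernel's struct sock_filter. */
struct cbpf_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* Classic BPF opcode encodings for the subset handled in this sketch. */
#define CBPF_LD_W_ABS 0x20 /* A = 32-bit word at data[k] */
#define CBPF_JA       0x05 /* pc += k */
#define CBPF_JEQ_K    0x15 /* pc += (A == k) ? jt : jf */
#define CBPF_JGE_K    0x35 /* pc += (A >= k) ? jt : jf */
#define CBPF_RET_K    0x06 /* return k */

#define CBPF_DATA_SIZE 64u  /* sizeof(struct seccomp_data) */
#define CBPF_MAX_INSNS 4096u

/* Returns 0 if every instruction is in the subset, every load is aligned and
 * in bounds, every jump lands inside the program, and the last instruction is
 * a return. Jumps are forward-only, so a valid program always terminates. */
int cbpf_validate(const struct cbpf_insn *prog, size_t len)
{
	if (prog == NULL || len == 0 || len > CBPF_MAX_INSNS)
		return -1;
	for (size_t pc = 0; pc < len; pc++) {
		size_t left = len - pc - 1; /* instructions after this one */

		switch (prog[pc].code) {
		case CBPF_LD_W_ABS:
			if ((prog[pc].k & 3) || prog[pc].k > CBPF_DATA_SIZE - 4)
				return -1;
			break;
		case CBPF_JA:
			if (prog[pc].k >= left)
				return -1;
			break;
		case CBPF_JEQ_K:
		case CBPF_JGE_K:
			if (prog[pc].jt >= left || prog[pc].jf >= left)
				return -1;
			break;
		case CBPF_RET_K:
			break;
		default:
			return -1;
		}
	}
	return prog[len - 1].code == CBPF_RET_K ? 0 : -1;
}

/* Interprets a validated program over a 64-byte seccomp_data-sized buffer. */
uint32_t cbpf_run(const struct cbpf_insn *prog, size_t len,
		  const uint8_t data[CBPF_DATA_SIZE])
{
	uint32_t acc = 0;
	size_t pc = 0;

	assert(cbpf_validate(prog, len) == 0);
	for (;;) {
		const struct cbpf_insn *in = &prog[pc++];

		switch (in->code) {
		case CBPF_LD_W_ABS:
			memcpy(&acc, data + in->k, sizeof(acc));
			break;
		case CBPF_JA:
			pc += in->k;
			break;
		case CBPF_JEQ_K:
			pc += (acc == in->k) ? in->jt : in->jf;
			break;
		case CBPF_JGE_K:
			pc += (acc >= in->k) ? in->jt : in->jf;
			break;
		case CBPF_RET_K:
			return in->k;
		default:
			return 0; /* unreachable after validation */
		}
	}
}
```

A jitter restricted to the same subset would have the same shape: one
validation pass, then one emit step per opcode in place of the switch
arms above.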

>
>
> >
> >
> > Thanks,
> > Seb


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-30 23:41                       ` Maciej Żenczykowski
@ 2024-10-01 16:34                         ` Maxwell Bland
  0 siblings, 0 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-10-01 16:34 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Sebastian Ene, Andy Lutomirski, Neill Kapron,
	linux-arm-msm@vger.kernel.org, Andrew Wheeler, Sammy BS2 Que,
	Todd Kjos, Viktor Martensson, keescook@chromium.org, Will Drewry,
	Andy Gross, Bjorn Andersson, Konrad Dybcio, kernel-team, adelva,
	jeffv

On Mon, Sep 30, 2024 at 04:41:19PM GMT, Maciej Żenczykowski wrote:
> On Mon, Sep 30, 2024 at 4:35 PM Maciej Żenczykowski <maze@google.com> wrote:
> > I think the OP is trying to verify the 'sanctity' of EL1 code pages.
> > (ie. prove via signature that they're all legit, which is hard with jit)
> > Presumably he's doing this from EL2 (I seriously doubt he's in EL3).
> > There's been talk of
> > unjitting/rejitting/regenerating/peephole-verifying the BPF jitted
> > dynamically generated kernel executable pages - to verify they're
> > 'safe'.
> > Moving just the 'bpf verifier/jit' into EL2 would seem to solve that
> > particular problem.
> > Though of course that is a fair bit of code (though the only untrusted
> > input to it, post boot completion, is cBPF which is pretty small in
> > scope)...
> > Compromises of EL0/EL1 would no longer be able to write gadget over
> > the bpf jitted kernel executable page prior to them being marked -W+X.
> > I'm not certain how much of a win in safety this is though?
> > I guess it depends on how easy the bpf verifier/jitter is to audit.
> 
> Note: if the full blown bpf verifier/jitter is too hard to audit, you
> could potentially write a new EL2 jitter just for cBPF.  It could just
> be a trimmed down version of the generic eBPF jitter.  cBPF is much
> much simpler.

As of yesterday, I confirmed that a simple version of the above, which I
was able to whip up in 2-3 days, works on Android 14. It operates at EL2
and passes standard tests for camera, browsing, etc. cBPF is, in fact, the
saving grace here! :-)

Cheers,
Maxwell Bland


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-13 17:07     ` [External] " Maxwell Bland
  2024-09-13 17:12       ` Maxwell Bland
  2024-09-13 17:30       ` Maxwell Bland
@ 2024-09-13 18:17       ` Maxwell Bland
  2024-09-13 21:16       ` [External] " Maciej Żenczykowski
  3 siblings, 0 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-13 18:17 UTC (permalink / raw)
  To: Maciej Żenczykowski, Neill Kapron
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva@google.com, jeffv@google.com

On Fri, Sep 13, 2024 at 05:07:46PM GMT, Maxwell Bland wrote:
> make a standard framework for EL2-based kernel protection open source, then we
> have a counter of the 29,000ish writable datastructures, and well defined
> mechanisms for preventing malicious modification via write gadgets

Ugh, this is a complicated issue and I wrote this email quickly.
Apologies; let me clarify:

1 I am worried about write gadgets (e.g. UAF + heap spray)
2 _Some_ modern exploits use write gadgets to modify read-only data
  (e.g. code pages), most target dynamic data, such as device struct
  pointers and kworker queues.
3 I'm working to build an open-source system that will reduce the ARM64
  kernel's threat surface for write gadgets to _just_ those targeting
  dynamic data.
4 After that point, there is still the issue of developing a
  verification framework for updates to approx. 29,000 dynamic data
  structures (based on our generated vmlinux) in the kernel. Attempts
  like ARM MTE are the most promising approaches so far.

That is, I'm suggesting empirically measuring the set of datastructures
vulnerable to the write gadget stage of current exploits and then taking
steps to reduce the number of datastructures and impact on those
datastructures a write gadget can have.

Hopefully the above explanation will help remove some of the confusion
resulting from my poor writing.

Thanks,
Maxwell


* Re: [External] Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-13 17:07     ` [External] " Maxwell Bland
                         ` (2 preceding siblings ...)
  2024-09-13 18:17       ` Maxwell Bland
@ 2024-09-13 21:16       ` Maciej Żenczykowski
  2024-09-16 22:17         ` Maxwell Bland
  3 siblings, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-13 21:16 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: Neill Kapron, linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva@google.com, jeffv@google.com

On Fri, Sep 13, 2024 at 10:07 AM Maxwell Bland <mbland@motorola.com> wrote:
>
> OK, after spending three hours working on this email, I think I know what to do
> here. Since Moto's code for this stuff is forced to be open source anyways,
> I'll spoil the solution:
>
> Add a hook to seccomp which triggers/enables hooks in BPF's JIT to instrument
> the output machine code  page so that EL2 can (1) invert the machine code back
> to BPF then (2) check the BPF corresponds to a valid seccomp filter policy.

If you care that deeply about this: you could simply turn off jit
compilation of cBPF (including seccomp) - but you'll take a
performance hit.
If you care about performance you could only jit compile *recognized*
cBPF programs.
Hell, instead of jit-ing them you could replace them outright with
native functions (pre)compiled into the kernel that accomplish the
same thing.
There are probably fewer than 10 of these in common use / part of
the platform.
That said, you'd still pay a performance hit for (Chrome web browser
style) sandboxes since those policies *will* be updated without os
updates.  Similarly with the mainline shipped cBPF code (which does
process all packets) - you can't guarantee it won't change.
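
A minimal sketch of the "only jit compile *recognized* cBPF programs"
idea, assuming the platform ships a list of digests of its known
filters: digest the instruction stream and JIT only on a match,
otherwise fall back to the interpreter. Everything named here
(struct cbpf_insn, cbpf_digest(), cbpf_should_jit()) is hypothetical,
and FNV-1a is used only to keep the example short. It is not
collision-resistant, so a real deployment would want a cryptographic
hash instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type; same field layout as the kernel's struct sock_filter. */
struct cbpf_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* One FNV-1a step over a single byte (64-bit variant). */
static uint64_t fnv1a_step(uint64_t h, uint8_t byte)
{
	return (h ^ byte) * 0x100000001b3ULL;
}

/* Digest a program field by field so the result does not depend on
 * struct padding or host endianness. */
uint64_t cbpf_digest(const struct cbpf_insn *prog, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a offset basis */

	for (size_t i = 0; i < len; i++) {
		h = fnv1a_step(h, (uint8_t)(prog[i].code & 0xff));
		h = fnv1a_step(h, (uint8_t)(prog[i].code >> 8));
		h = fnv1a_step(h, prog[i].jt);
		h = fnv1a_step(h, prog[i].jf);
		for (unsigned int shift = 0; shift < 32; shift += 8)
			h = fnv1a_step(h, (uint8_t)((prog[i].k >> shift) & 0xff));
	}
	return h;
}

/* True if the program's digest is in the allowlist, i.e. it is one of the
 * recognized filters and may be JITed; false means "interpret instead". */
bool cbpf_should_jit(const struct cbpf_insn *prog, size_t len,
		     const uint64_t *allowlist, size_t allowlist_len)
{
	uint64_t digest = cbpf_digest(prog, len);

	for (size_t i = 0; i < allowlist_len; i++) {
		if (allowlist[i] == digest)
			return true;
	}
	return false;
}
```

Sandbox policies that change without os updates would simply miss the
list and take the slower non-jitted path, which is the performance hit
described above.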

> It would need to be kept up to date with whatever seccomp decides to do, but I
> can see a world where the result guarantees the code page has not been modified
> in transit and corresponds to a reasonable seccomp policy.
>
> I will say, I'm not the biggest fan of this. I am a fan of SYS_seccomp
> implicitly compiling the filters at build-time, so I can just know immediately
> what the new code pages "should be".

We/you simply don't know what all of the filters are going to be at
kernel/platform build time - because they're provided from outside the
platform.

I guess for the mainline shipped cBPF programs we could technically
probably swap them for eBPF.  Taking a quick glance at uses of
BpfClassic.h in aosp I see 6 socket filter cBPF programs, of which
only 1 is dynamic (for matching clat IP addresses), so the remaining 5
are probably trivial to eBPF-ify (and thus hide behind selinux
restrictions).

Or I guess you could just exempt CAP_NET_ADMIN privileged code from
the "don't jit cBPF" restriction.

> That said, I think my solution also
> resolves the issue of an adversary using the BRK instruction padding to
> generate a "valid" codepage at an invalid offset.
>
> I've included a few other responses just for kicks, since you should know I've
> been working hard on this problem for more than a year, I'm not just emailing
> things to sound cool and waste time (OK maybe a little of that, but this is
> also a serious, honest effort to understand the problem!). (-:
>
> > If you can prove it isn't
>
> To test it yourself, it is easiest to add a printk statement under
> bpf_int_jit_compile,

I didn't say this doesn't get called at runtime.

I said (that I believe that) it doesn't get called for eBPF, only for cBPF.
cBPF is much more limited in what you can do with it.
(obviously it does get called for eBPF during the boot process by the
bpfloader as well, but that's super duper privileged code running as
root with capabilities...)

> or try to implement a system for checking the integrity of
> page table updates, or add a print statement to the page table update code in
> vmalloc, or enable the CONFIG_PTDUMP_DEBUGFS options. Use my patch here if you
> want to see decent output.
> https://lore.kernel.org/all/2bcb3htsjhepxdybpw2bwot2jnuezl3p5mnj5rhjwgitlsufe7@xzhkyntridw3/
> or I've also attached a kernel module which is a part of this "OpenKP" project
> I am working on, which should provide a larger, open-source framework and
> standard for the ARM community to provide hypervisor-enforced code integrity on
> Android / QCOM chipsets, so you can see 2% of the work I've done over the past
> 2 years and test it out yourself. Uncomment the part under "DEBUG" and read
> through it, test it out.
>
> I can submit a formal patch with printk statements for you to test out if that
> is needed? Or just trust me, lol. I'm probably going to just go work on
> that instrumentation step I mentioned earlier. (-:
>
> >selinux
>
> Note this whole loading is outside the scope of SELinux, it is a side-effect of
> the SYS_Seccomp system call as used by privileged system services.
>
> >cBPF [classic BPF, internally the kernel translates this to eBPF] is still
> >allowed,
>
> These programs will not print out using PTRACE and are difficult to audit
> without patching the seccomp calls yourself because the ptrace call to
> PTRACE_SECCOMP_GET_FILTER will fail. I believe (have not checked) because they
> are not cBPF, and seccomp's logic makes prog->fprog evaluate to null despite
> prog existing if it is cBPF, at least on Android 14. I spent a whole day
> getting frustrated with the failing ptrace call before finally ending up with
> my patches (attached to the end) that instrument ptrace and can print the
> programs.
>
> >a net loss for security if you did lock it down / break it
>
> I am a fan of seccomp and I don't want to break it and I don't want to "lock it
> down", I want to ask people nicely to provide the code pages they want in the
> kernel!
>
> Thanks,
> Maxwell Bland
>
> As a P.S., maybe I should add context, though I don't know whether it is
> needed:
> Many, many exploits for the kernel over the past decade rely on write
> gadgets to modify kernel resources, such as the exploit I linked in my original
> email, Project Zero's recent
> https://googleprojectzero.blogspot.com/2023/09/analyzing-modern-in-wild-android-exploit.html
> or the more recent https://pwning.tech/nftables/. We can't begin to make honest
> progress on the existing exploits until we nail down the basic rule that
> privileged executable pages are immutable in Android. My goal is to eventually
> make a standard framework for EL2-based kernel protection open source, then we
> have a counter of the 29,000ish writable datastructures, and well defined
> mechanisms for preventing malicious modification via write gadgets (like we see
> with kworker queues, task cred structs back in the day, etc, etc). Once I've
> locked down 1 and 3 of 1) integrity of loaded code pages, 2) system control
> register modifications such as TCR (this is a pain in the *** because
> snapdragon chipsets are a pain in the *** sometimes), and 3) writing a couple
> of testcases to lock-down kworker queues and other data structures (e.g. fops)
> at EL2 and fix, among other exploits,
> https://github.blog/security/vulnerability-research/the-android-kernel-mitigations-obstacle-race/,
> I will work with Moto's legal to try and open source the solution and send it
> to the ARM mailing list, since eventually these hacks should be polished and
> made into kconfigs as part of the GKI for Android's good.
>
> This is all "goals" though, but I figured I would plug the effort.
>
> main.c:
>
> // SPDX-License-Identifier: GPL-2.0
> /*
>  * Copyright (C) 2023 Motorola Mobility, Inc.
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * Kernel module that hooks the vmalloc infrastructure to ensure that code
>  * pages are not interleaved with data pages unless at a PMD level granularity.
>  * Must be loaded prior to other kernel mechanisms leveraging code page
>  * allocation, e.g. BPF, EROFS fixmap.
>  */
>
>
> #include <linux/kernel.h>
> #include <linux/bpf.h>
> #include <linux/mutex.h>
> #include <linux/atomic.h>
> #include <linux/highmem.h>
> #include <linux/kprobes.h>
> #include <linux/list.h>
> #include <linux/mm_types.h>
> #include <linux/module.h>
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/pagewalk.h>
> #include <linux/types.h>
> #include <linux/moduleloader.h>
> #include <linux/vmalloc.h>
> #include <linux/gfp_types.h>
> #include <linux/seccomp.h>
> #include <asm/pgalloc.h>
> #include <asm/ptrace.h>
> #include <asm/patching.h>
> #include <asm/module.h>
> #include <asm/page.h>
> #include <asm/seccomp.h>
>
> #ifdef SECCOMP_ARCH_NATIVE
> /**
>  * struct action_cache - per-filter cache of seccomp actions per
>  * arch/syscall pair
>  *
>  * @allow_native: A bitmap where each bit represents whether the
>  *                filter will always allow the syscall, for the
>  *                native architecture.
>  * @allow_compat: A bitmap where each bit represents whether the
>  *                filter will always allow the syscall, for the
>  *                compat architecture.
>  */
> struct action_cache {
>         DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
> #ifdef SECCOMP_ARCH_COMPAT
>         DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
> #endif
> };
> #else
> struct action_cache { };
> #endif
>
> struct seccomp_filter {
>         refcount_t refs;
>         refcount_t users;
>         bool log;
>         bool wait_killable_recv;
>         struct action_cache cache;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
>         struct notification *notif;
>         struct mutex notify_lock;
>         wait_queue_head_t wqh;
> };
>
>
>
> void print_bpf_prog_aux(struct bpf_prog_aux *aux) {
>         printk("BPF Program Aux Details:\n");
>         printk("Ref Count: %lld\n", atomic64_read(&aux->refcnt));
>         printk("Used Map Count: %u\n", aux->used_map_cnt);
>         printk("Used BTF Count: %u\n", aux->used_btf_cnt);
>         printk("Max Context Offset: %u\n", aux->max_ctx_offset);
>         printk("Max Packet Offset: %u\n", aux->max_pkt_offset);
>         printk("Max TP Access: %u\n", aux->max_tp_access);
>         printk("Stack Depth: %u\n", aux->stack_depth);
>         printk("ID: %u\n", aux->id);
>         printk("Function Count: %u\n", aux->func_cnt);
>         printk("Function Index: %u\n", aux->func_idx);
>         printk("Attach BTF ID: %u\n", aux->attach_btf_id);
>         printk("Context Arg Info Size: %u\n", aux->ctx_arg_info_size);
>         printk("Max Read-Only Access: %u\n", aux->max_rdonly_access);
>         printk("Max Read-Write Access: %u\n", aux->max_rdwr_access);
>         printk("Attach BTF: %p\n", aux->attach_btf);
>         printk("Context Arg Info: %p\n", aux->ctx_arg_info);
>         printk("DST Mutex: %p\n", &aux->dst_mutex);
>         printk("DST Program: %p\n", aux->dst_prog);
>         printk("DST Trampoline: %p\n", aux->dst_trampoline);
>         printk("Saved DST Program Type: %d\n", aux->saved_dst_prog_type);
>         printk("Saved DST Attach Type: %d\n", aux->saved_dst_attach_type);
>         printk("Verifier Zero Extension: %u\n", aux->verifier_zext);
>         printk("Attach BTF Trace: %u\n", aux->attach_btf_trace);
>         printk("Function Proto Unreliable: %u\n", aux->func_proto_unreliable);
>         printk("Sleepable: %u\n", aux->sleepable);
>         printk("Tail Call Reachable: %u\n", aux->tail_call_reachable);
>         printk("XDP Has Frags: %u\n", aux->xdp_has_frags);
>         printk("Attach Func Proto: %p\n", aux->attach_func_proto);
>         printk("Attach Func Name: %s\n", aux->attach_func_name);
>         printk("Functions: %p\n", aux->func);
>         printk("JIT Data: %p\n", aux->jit_data);
>         printk("Poke Table: %p\n", aux->poke_tab);
>         printk("Kfunc Table: %p\n", aux->kfunc_tab);
>         printk("Kfunc BTF Table: %p\n", aux->kfunc_btf_tab);
>         printk("Size Poke Table: %u\n", aux->size_poke_tab);
>         printk("Ksym: %p\n", &aux->ksym);
>         printk("Operations: %p\n", aux->ops);
>         printk("Used Maps: %p\n", aux->used_maps);
>         printk("Used Maps Mutex: %p\n", &aux->used_maps_mutex);
>         printk("Used BTFs: %p\n", aux->used_btfs);
>         printk("Program: %p\n", aux->prog);
>         printk("User: %p\n", aux->user);
>         printk("Load Time: %llu\n", aux->load_time);
>         printk("Verified Instructions: %u\n", aux->verified_insns);
>         printk("Cgroup Attach Type: %d\n", aux->cgroup_atype);
>         printk("Cgroup Storage: %p\n", aux->cgroup_storage);
>         printk("Name: %s\n", aux->name);
> }
>
> void print_bpf_prog_insnsi(const struct bpf_insn *insns, uint64_t len) {
>         uint64_t i;
>         for (i = 0; i < len; i++) {
>                 const struct bpf_insn *insn = &insns[i];
>                 /* Dump each 8-byte instruction as one raw 64-bit word */
>                 printk("BPF INSN %016llx\n", *((const uint64_t *)insn));
>         }
> }
>
> void print_bpf_prog(struct bpf_prog *prog) {
>         printk("BPF Program Details:\n");
>         printk("Pages: %u\n", prog->pages);
>         printk("JITed: %u\n", prog->jited);
>         printk("JIT Requested: %u\n", prog->jit_requested);
>         printk("GPL Compatible: %u\n", prog->gpl_compatible);
>         printk("Control Block Access: %u\n", prog->cb_access);
>         printk("DST Needed: %u\n", prog->dst_needed);
>         printk("Blinding Requested: %u\n", prog->blinding_requested);
>         printk("Blinded: %u\n", prog->blinded);
>         printk("Is Function: %u\n", prog->is_func);
>         printk("Kprobe Override: %u\n", prog->kprobe_override);
>         printk("Has Callchain Buffer: %u\n", prog->has_callchain_buf);
>         printk("Enforce Expected Attach Type: %u\n", prog->enforce_expected_attach_type);
>         printk("Call Get Stack: %u\n", prog->call_get_stack);
>         printk("Call Get Func IP: %u\n", prog->call_get_func_ip);
>         printk("Timestamp Type Access: %u\n", prog->tstamp_type_access);
>         printk("Type: %d\n", prog->type);
>         printk("Expected Attach Type: %d\n", prog->expected_attach_type);
>         printk("Length: %u\n", prog->len);
>         printk("JITed Length: %u\n", prog->jited_len);
>         /* %*phN prints the tag as one hex string on a single log line */
>         printk("Tag: %*phN\n", BPF_TAG_SIZE, prog->tag);
>         printk("Stats: %p\n", prog->stats);
>         printk("Active: %p\n", prog->active);
>         printk("AUX FIELDS:\n");
>         print_bpf_prog_aux(prog->aux);
>         print_bpf_prog_insnsi(prog->insnsi, prog->len);
> }
>
>
> /* Functions we need for patching dynamic code allocations */
> typedef void *(*module_alloc_t)(unsigned long size);
> module_alloc_t module_alloc_ind;
> typedef void (*module_memfree_t)(void *module_region);
> module_memfree_t module_memfree_ind;
>
> /* TODO: actually we could probably just include "net/bpf_jit.h" */
> typedef int (*aarch64_insn_patch_text_nosync_t)(void *addr, u32 insn);
> aarch64_insn_patch_text_nosync_t aarch64_insn_patch_text_nosync_ind;
> typedef u32 (*aarch64_insn_gen_branch_imm_t)(unsigned long pc,
>                                              unsigned long addr,
>                                              enum aarch64_insn_branch_type type);
> aarch64_insn_gen_branch_imm_t aarch64_insn_gen_branch_imm_ind;
> typedef u32 (*aarch64_insn_gen_hint_t)(enum aarch64_insn_hint_cr_op op);
> aarch64_insn_gen_hint_t aarch64_insn_gen_hint_ind;
> typedef u32 (*aarch64_insn_gen_branch_reg_t)(
>         enum aarch64_insn_register reg, enum aarch64_insn_branch_type type);
> aarch64_insn_gen_branch_reg_t aarch64_insn_gen_branch_reg_ind;
> typedef void *(*__vmalloc_node_range_t)(unsigned long size, unsigned long align,
>                                         unsigned long start, unsigned long end,
>                                         gfp_t gfp_mask, pgprot_t prot,
>                                         unsigned long vm_flags, int node,
>                                         const void *caller);
> __vmalloc_node_range_t __vmalloc_node_range_ind;
>
> /* Used for reworking the kprobe allocator */
> typedef int (*collect_garbage_slots_t)(struct kprobe_insn_cache *c);
> collect_garbage_slots_t collect_garbage_slots_ind;
>
> static struct kprobe kallsyms_lookup_name_kp = {
>         .symbol_name = "kallsyms_lookup_name",
>         .addr = 0
> };
> typedef unsigned long (*kallsyms_lookup_name_t)(const char *name);
> kallsyms_lookup_name_t kallsyms_lookup_name_ind;
>
> /* Functions we are patching */
> static struct kprobe alloc_vmap_area_kp = { .symbol_name = "alloc_vmap_area",
>                                             .addr = 0 };
>
> /* DEBUG: bpf allocation printing */
> // static struct kprobe bpf_int_jit_compile_kp = { .symbol_name = "bpf_int_jit_compile",
> // .addr = 0 };
> static struct kprobe ptrace_request_kp = { .symbol_name = "ptrace_request",
>                                            .addr = 0 };
> /* END DEBUG */
>
> /* Static variables that must be manually accessed for definition */
> u64 module_alloc_base;
> struct kprobe_insn_cache *kprobe_insn_slots_ptr;
>
> /**
>  * get_kp_addr - TODO comment rest of file
>  */
> static __always_inline void *get_kp_addr(struct kprobe *kp)
> {
>         void *res = 0;
>         if (register_kprobe(kp)) {
>                 pr_err("Error: moto_org_mem failed to get kp addr for %s\n",
>                        kp->symbol_name);
>                 return 0;
>         }
>         res = kp->addr;
>         unregister_kprobe(kp);
>         return res;
> }
>
> static void *bpf_jit_alloc_exec_handler(unsigned long size)
> {
>         return module_alloc_ind(size);
> }
>
> static void bpf_jit_free_exec_handler(void *addr)
> {
>         module_memfree_ind(addr);
> }
>
> static u64 bpf_jit_alloc_exec_limit_handler(void)
> {
>         return MODULES_END - MODULES_VADDR;
> }
>
> static void *alloc_insn_page_handler(void)
> {
>         return __vmalloc_node_range_ind(PAGE_SIZE, 1, module_alloc_base,
>                                         module_alloc_base + SZ_2G, GFP_KERNEL,
>                                         PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
>                                         NUMA_NO_NODE,
>                                         __builtin_return_address(0));
> }
>
> static bool allocation_balance = false;
>
> /**
>  * alloc_vmap_area_handler - adjusts vstart, vend to not interleave code/data
>  *
>  * Right now, vmalloc infrastructure does the following:
>  * |<-----data----->||<-----code and data pages----->||<-----data----->|
>  * Maintainers likely do not want to touch vmalloc internals for fear of
>  * breaking everything, so we provide an open-source work-around with hopes
>  * that these fixes will make their way into the mainline kernel.
>  *
>  * We adjust the parameters to the call to avoid the code memory range by
>  * selecting the lower half, then in a separate post handler, we check whether
>  * the allocation failed, and if so, run the allocation with the upper half.
>  *
>  * TODO: we need to remove the flip/flopping and properly segment the memory
>  * here, but it is not clear how to do this without modifying core vmalloc
>  * infrastructure. See upstream patch here:
>  * https://lore.kernel.org/all/20240423095843.446565600-1-mbland@motorola.com/#t
>  *
>  * Parameters are passed in the arm64 linux kernel following the AAPCS64 ABI
>  * convention, and thus it is safe to infer from the function signature
>  * which registers hold the specific values for vstart and vend.
>  * https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
>  */
> static int alloc_vmap_area_handler(struct kprobe *kp, struct pt_regs *regs)
> {
>         unsigned long size;
>         unsigned long vstart;
>         size = regs->regs[0];
>         vstart = regs->regs[2];
>         if (vstart == VMALLOC_START) { /* We are attempting to vmalloc data */
>                 /* Everything is fine, do nothing */
>                 if (module_alloc_base + SZ_2G <= VMALLOC_START ||
>                     module_alloc_base > VMALLOC_END)
>                         return 0;
>
>                 allocation_balance = !allocation_balance;
>
>                 /* Not enough room below, else if not enough room above */
>                 if (module_alloc_base - VMALLOC_START < size)
>                         allocation_balance = true;
>                 else if (VMALLOC_END - (module_alloc_base + SZ_2G) < size)
>                         allocation_balance = false;
>
>                 /* Allocate from higher valued addresses or lower valued
>                  * address evenly. since these are virtual it does not
>                  * really matter */
>                 if (allocation_balance) {
>                         regs->regs[2] = module_alloc_base + SZ_2G;
>                 } else {
>                         regs->regs[3] = module_alloc_base;
>                 }
>         }
>
>         return 0;
> }
>
> /* DEBUG: Analyze allocated BPF programs */
> // static int bpf_int_jit_compile_handler(struct kprobe *kp, struct pt_regs *regs)
> // {
> //         // struct bpf_prog *prog = (struct bpf_prog *)regs->regs[0];
> //         // print_bpf_prog(prog);
> //         return 0;
> // }
> //
> static int ptrace_request_handler(struct kprobe *kp, struct pt_regs *regs)
> {
>         struct task_struct *task = (struct task_struct *)regs->regs[0];
>         long request = regs->regs[1];
>         unsigned long addr = regs->regs[2];
>         struct seccomp_filter *filter;
>         if (request != 0x420c) {
>                 return 0;
>         }
>         if (addr != 13371337) {
>                 printk("waiting for regs ... %llx\n", regs->regs[1]);
>                 return 0;
>         }
>
>         if (!task) {
>                 printk("ptrace_request_handler no task\n");
>                 return 0;
>         }
>
>         filter = READ_ONCE(task->seccomp.filter);
>         printk("TASK PID %d or %d\n", task->pid, pid_vnr(task_pgrp(task)));
>         if (!filter) {
>                 printk("ptrace_request_handler no filter\n");
>                 return 0;
>         }
>         if (filter->prog)
>                 print_bpf_prog(filter->prog);
>
>         return 0;
> }
> /* END DEBUG */
>
>
> void __always_inline patch_jump_to_handler(void *faddr, void *helper)
> {
>         u32 insn;
>         insn = aarch64_insn_gen_branch_imm_ind((unsigned long)faddr,
>                                                (unsigned long)helper,
>                                                AARCH64_INSN_BRANCH_NOLINK);
>         aarch64_insn_patch_text_nosync_ind(faddr, insn);
> }
>
> struct kprobe_insn_page {
>         struct list_head list;
>         kprobe_opcode_t *insns; /* Page of instruction slots */
>         struct kprobe_insn_cache *cache;
>         int nused;
>         int ngarbage;
>         char slot_used[];
> };
>
> void free_insn_pages(struct kprobe_insn_cache *kic)
> {
>         struct kprobe_insn_page *kip, *tmp;
>
>         /* TODO: Since the slot array is not protected by rcu, we need a mutex,
>          * but we also should be the only thing running that is touching
>          * the kprobes */
>         /* Use the _safe iterator: each entry is unlinked and freed while
>          * walking the list, so the next pointer must be saved beforehand */
>         list_for_each_entry_safe (kip, tmp, &kic->pages, list) {
>                 /* The page is freed below, so the per-slot state does not
>                  * need to be cleared one by one; just reset the counter */
>                 kip->nused = 0;
>                 list_del_rcu(&kip->list);
>                 synchronize_rcu();
>                 kip->cache->free(kip->insns);
>                 kfree(kip);
>         }
> }
>
> /**
>  * mod_init - TODO
>  *
>  * TODO FAIL IF ANY OF THE BELOW FAILS
>  */
> static int __init mod_init(void)
> {
>         void *bpf_jit_alloc_exec_addr = 0;
>         void *bpf_jit_free_exec_addr = 0;
>         void *bpf_jit_alloc_exec_limit_addr = 0;
>         void *alloc_insn_page_addr = 0;
>         kallsyms_lookup_name_ind =
>                 (kallsyms_lookup_name_t)get_kp_addr(&kallsyms_lookup_name_kp);
>
>         module_alloc_ind =
>                 (module_alloc_t)kallsyms_lookup_name_ind("module_alloc");
>         module_memfree_ind =
>                 (module_memfree_t)kallsyms_lookup_name_ind("module_memfree");
>         __vmalloc_node_range_ind =
>                 (__vmalloc_node_range_t)kallsyms_lookup_name_ind(
>                         "__vmalloc_node_range");
>         aarch64_insn_patch_text_nosync_ind =
>                 (aarch64_insn_patch_text_nosync_t)kallsyms_lookup_name_ind(
>                         "aarch64_insn_patch_text_nosync");
>         aarch64_insn_gen_branch_imm_ind =
>                 (aarch64_insn_gen_branch_imm_t)kallsyms_lookup_name_ind(
>                         "aarch64_insn_gen_branch_imm");
>         aarch64_insn_gen_hint_ind =
>                 (aarch64_insn_gen_hint_t)kallsyms_lookup_name_ind(
>                         "aarch64_insn_gen_hint");
>         aarch64_insn_gen_branch_reg_ind =
>                 (aarch64_insn_gen_branch_reg_t)kallsyms_lookup_name_ind(
>                         "aarch64_insn_gen_branch_reg");
>
>         collect_garbage_slots_ind =
>                 (collect_garbage_slots_t)kallsyms_lookup_name_ind(
>                         "collect_garbage_slots");
>
>         bpf_jit_alloc_exec_addr =
>                 (void *)kallsyms_lookup_name_ind("bpf_jit_alloc_exec");
>         bpf_jit_free_exec_addr =
>                 (void *)kallsyms_lookup_name_ind("bpf_jit_free_exec");
>         bpf_jit_alloc_exec_limit_addr =
>                 (void *)kallsyms_lookup_name_ind("bpf_jit_alloc_exec_limit");
>         alloc_insn_page_addr =
>                 (void *)kallsyms_lookup_name_ind("alloc_insn_page");
>
>         module_alloc_base =
>                 *((u64 *)kallsyms_lookup_name_ind("module_alloc_base"));
>
>         patch_jump_to_handler(bpf_jit_alloc_exec_addr,
>                               bpf_jit_alloc_exec_handler);
>         patch_jump_to_handler(bpf_jit_free_exec_addr,
>                               bpf_jit_free_exec_handler);
>         patch_jump_to_handler(bpf_jit_alloc_exec_limit_addr,
>                               bpf_jit_alloc_exec_limit_handler);
>         patch_jump_to_handler(alloc_insn_page_addr, alloc_insn_page_handler);
>
>         /*
>          * Under the hood, arm64 calls __get_insn_slot to generate memory pages for
>          * kprobes, and these memory pages *supposedly* access an indirect pointer to
>          * their allocation function through kprobe_insn_slots. Because we allocated
>          * a kprobe in order to access kallsyms_lookup_name, one page is already allocated.
>          * However, even kprobe garbage collection cowardly refuses to kill the last page,
>          * so we have our own free routine that nixes that last survivor.
>          */
>         kprobe_insn_slots_ptr =
>                 (struct kprobe_insn_cache *)kallsyms_lookup_name_ind(
>                         "kprobe_insn_slots");
>         free_insn_pages(kprobe_insn_slots_ptr);
>
>         alloc_vmap_area_kp.pre_handler = alloc_vmap_area_handler;
>         if (register_kprobe(&alloc_vmap_area_kp)) {
>                 pr_err("moto_org_mem.ko failed to hook alloc_vmap_area!\n");
>                 return -EACCES;
>         }
>
>         /* DEBUG */
>         // bpf_int_jit_compile_kp.pre_handler = bpf_int_jit_compile_handler;
>         // if (register_kprobe(&bpf_int_jit_compile_kp)) {
>         //         pr_err("moto_org_mem.ko failed to hook bpf_int_jit_compile!\n");
>         //         return -EACCES;
>         // }
>
>         ptrace_request_kp.pre_handler = ptrace_request_handler;
>         if (register_kprobe(&ptrace_request_kp)) {
>                 pr_err("moto_org_mem.ko failed to hook ptrace_request_kp!\n");
>                 return -EACCES;
>         }
>
>         /* END DEBUG */
>         pr_info("moto_org_mem loaded!\n");
>
>         return 0;
> }
>
> static void __exit mod_exit(void)
> {
> }
>
> module_init(mod_init);
> module_exit(mod_exit);
>
> MODULE_LICENSE("GPL v2");
> MODULE_AUTHOR("Maxwell Bland <mbland@motorola.com>");
> MODULE_DESCRIPTION("Organizes the vmalloc memory so code pages are not "
>                    "interleaved with data pages.");
>
>
>
> ________________________________________
> From: Maciej Żenczykowski <maze@google.com>
> Sent: Thursday, September 12, 2024 4:39 PM
> To: Neill Kapron
> Cc: Maxwell Bland; linux-arm-msm@vger.kernel.org; Andrew Wheeler; Sammy BS2 Que | 阙斌生; Todd Kjos; Viktor Martensson; Andy Lutomirski; keescook@chromium.org; Will Drewry; Andy Gross; Bjorn Andersson; Konrad Dybcio; kernel-team; adelva@google.com; jeffv@google.com
> Subject: [External] Re: [RFC] Proposal: Static SECCOMP Policies
>
> wrt. BPF on Android:
>
> (a) eBPF should already be locked down to just the bpfloader boot time process.
>
> If you can prove it isn't, please let us know, but as this is sepolicy
> around the bpf(BPF_PROG_LOAD) system call, it should be pretty
> airtight:
>
> allow bpfloader self:bpf { ... prog_load ... };
> ...
> neverallow { domain -bpfloader } *:bpf prog_load;
>
> (basically the only exception to the above is root/su on userdebug/eng
> builds, which runs sepolicy in permissive mode and thus doesn't
> enforce the above - but that obviously doesn't matter for user builds)
>
> (b) cBPF [classic BPF, internally the kernel translates this to eBPF]
> is still allowed, for both seccomp() and normal old style socket
> filters
>
> - bpf seccomp() is to the best of my knowledge used by normal play
> store updatable applications (including the chrome web browser) for
> sandboxing (of rendering processes), as such it would be basically
> impossible to lock it down (as apps update independently of the rest
> of the system) - and would probably be a net loss for security if you
> did lock it down / break it...
>
> If you wanted to pursue this you'd need to get agreement from Chrome &
> other applications and provide some 'better' alternative.  Likely some
> sort of hard coded seccomp version that blocks things that most
> sandboxing apps agree is beneficial to block...
>
> (bpf seccomp() is also used by the Android zygote itself to block
> various extra system calls from processes/apps it spawns, but as this
> list is hardcoded at build time, it's not actually a problem)
>
> - similarly old style BPF socket filters are 'normal' 'ancient'
> BSD/Unix/Linux API.  They're used in the (privileged) network stack
> itself (which is mainline updatable via the play store, including the
> cbpf code), but could also AFAIK be used by random play store
> applications - filtering on sockets is truly ancient api.
> https://www.tcpdump.org/papers/bpf-usenix93.pdf is from 1992
>
> -
>
> Is there some eBPF program loading API I'm not aware of that we thus
> haven't blocked?
>
> On Thu, Sep 12, 2024 at 1:57 PM Neill Kapron <nkapron@google.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > > (Resending as plaintext for msm-kernel mailing list.
> > > Original message was intended for android kernel team
> > > though msm-kernel should be aware.)
> > >
> > > Hi Kernel Team,
> > >
> > > + Kees, Andy, and Will since their input may be valuable.
> > >
> > > It has been a while! (~9 months to be exact). This January, I sent out a small
> > > message on BPF code loading ("unprivileged BPF considered harmful" or something
> > > like that). In it, I noted new BPF programs are compiled all the time and
> > > thrown into the kernel. At the time, I did not know these programs were just
> > > compiled seccomp filter policies, loaded in as new BPF programs continuously
> > > through the libminijail interface as well as direct syscall. As of two days
> > > ago, I now know this (and now you do too, if not already).
> > >
> > > OK, yes, syscall filtering is very important, but this is creating a catch-22
> > > issue. For one, see step (4) under "Exploitation overview" for
> > > https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt.
> > > Second, this minor lack of caching is adding load time to more than 90
> > > binaries/services on the standard QCOM baseline—I'll admit, it is probably
> > > negligible in the grand scheme of things (a quick approximation puts the data
> > > operated on around 0.1188 MB). But most importantly, third, without some degree
> > > of provenance, I have no way of telling if someone has injected malicious code
> > > into the kernel, and unfortunately even knowing the correct bytes is still
> > > "iffy", as in order to prevent JIT spray attacks, each of these filters is
> > > offset by some random number of uint32_t's, making every 4-byte shift of the
> > > filter a "valid" codepage to be loaded at runtime.
> > >
> > > You might be thinking, "but wait, bionic's libc only defines a couple of
> > > restricted policies, primary and secondary for system and user apps
> > > respectively." I know! For the most part, apps fall into either what I presume
> > > is the default app/system policies, but there are lots of QCOM binaries and
> > > other magic programs (dolby dax) that are sending up these programs as well.
> > > I'm seeing more than 20 different programs for around a minute's worth of
> > > runtime. One example is attached at the end.
> > >
> > > So, the proposal: a "CONFIG_SECCOMMP_STATIC_POLICY" for seccomp. This
> > > would change the Android kernel's generic SYS_seccomp call, which takes in a
> > > filter with an array of BPF instructions, to instead reference an ID which
> > > corresponds to a fixed file on /sys/bpf/seccomp or something like that. The
> > > sandboxing behavior of these apps should be known at compile-time, even if
> > > there are multiple "permission set types" that may need to be dispatched. User
> > > apps should always have a single, fixed policy. This way it is possible to say
> > > for every code page loaded into the kernel where it came from and what it
> > > should look like.
> > >
> > > Unfortunately, I do not know Motorola has enough "weight" to convince QCOM to
> > > do the right foundational thing here, or to "define" the seccomp APIs for
> > > Android, so it would be good to have Google's buy in, know if there are plans
> > > to fix this issue, or some discussion of how to best fix the problem? If
> > > anything, a contact at QCOM that might be able to actually hunt down and
> > > document valid bytes for these policies?
> > >
> > > The end goal is simple: when we see a code page is allocated in the kernel, we
> > > can be sure that (1) it isn't malicious and (2) has not been modified in
> > > transit. I'm fine putting code where my mouth is, but right now that code
> > > would involve having to fingerprint the signatures loaded by Qualcomm
> > > components every time a new one is released, or pinging Google with a huge
> > > patch changing how seccomp works with no idea of what requirements QCOM may
> > > have on seccomp policy generation.
> > >
> > > Thoughts? Is this doable, and if not, why? I'd also love help with the code and
> > > adapting existing minijail code to use a new, more integrity-preserving
> > > interface. If I am mistaken and it is possible to grab out valid BPF policy
> > > code at compile time, please let me know how!
> > >
> > > Regards,
> > > Maxwell Bland
> > >
> > > Standard filter, (from, for example, com.google.android.gms)
> > > "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007b000000000055000200d7000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f950000000000000055000100e200000005008f000000000055000200a7000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f9500000000000000550002003e000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f9500000000000000550002005c000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f95000000000000005500020060000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f950000000000000055000200c
b000000b40000000000ff7f950000000000000055000100c600000005004c0000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f9500000000000000550002003d000000b40000000000ff7f95000000000000005500020017000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
> > > Unknown filter (from QCOM's /vendor/bin/qesdk-secmanager)
> > >  "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007e000000000055000100e2000000050098000000000055000200d7000000b40000000000ff7f950000000000000055000200a7000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002004f000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002007b000000b40000000000ff7f9500000000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f950000000000000055000100c600000005005800000000005500020060000000b40000000000ff7f9500000000000000550002008b000000b4000000
0000ff7f950000000000000055000200cb000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f9500000000000000550002003b000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f950000000000000055000200b2000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
> > >
> > > List of services loading seccomp filters pulled from one run of the phone:
> > > com.google.android.deskclock
> > > /vendor/bin/qesdk-secmanager
> > > media.hwcodec/vendor.qti.media.c2@1.0-service
> > > media.audio.qc.codec.qti.media.c2audio@1.0-service
> > > /vendor/bin/vendor.qti.qspmhal-service
> > > /vendor/bin/qsap_sensors
> > > media.extractoraextractor
> > > /system_ext/bin/perfservice
> > > /vendor/bin/wfdhdcphalservice
> > > /vendor/bin/wifidisplayhalservice
> > > /vendor/bin/qsap_dcfd
> > > /vendor/bin/qms
> > > /vendor/bin/qsap_location
> > > /vendor/bin/qsap_qapeservice
> > > /vendor/bin/wfdvndservice
> > > media.swcodecoid.media.swcodec/bin/mediaswcodec
> > > /vendor/bin/hw/qcrilNrd
> > > qsap_qms_13qms16
> > > qsap_qms_24qms17
> > > /vendor/bin/ATFWD-daemon
> > > /vendor/bin/hw/sxrservice
> > > /vendor/bin/hw/qcrilNrd-c2
> > > system_server
> > > /vendor/bin/qmi_motext_hook1013170
> > > /vendor/bin/qmi_motext_hook1013171
> > > /vendor/bin/ims_rtp_daemon
> > > com.android.systemui
> > > webview_zygote
> > > com.dolby.daxservice
> > > vendor.qti.qesdk.sysservice
> > > org.codeaurora.ims
> > > com.android.se
> > > com.android.phone
> > > com.qti.qcc
> > > com.google.android.ext.services
> > > com.google.android.gms
> > > com.google.android.euicc
> > > com.google.android.googlequicksearchbox:interactor
> > > com.google.android.apps.messaging:rcs
> > > com.android.nfc
> > > com.qualcomm.qti.workloadclassifier
> > > com.qualcomm.location
> > > com.google.android.gms.unstable
> > > com.thundercomm.ar.core
> > > com.android.vending:background
> > > com.android.vending:quick_launch
> > > com.android.dynsystem
> > > com.android.managedprovisioning
> > > com.android.shell
> >
> >
> > + Jeff, Alistair, and Maciej
> >
> > Maxwell,
> >
> > Thanks for the details on this, I have added several people who may be
> > better suited to comment on this.
> >
> > Neill

--
Maciej Żenczykowski, Kernel Networking Developer @ Google

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-13 21:16       ` [External] " Maciej Żenczykowski
@ 2024-09-16 22:17         ` Maxwell Bland
  2024-09-16 22:50           ` Maciej Żenczykowski
  0 siblings, 1 reply; 26+ messages in thread
From: Maxwell Bland @ 2024-09-16 22:17 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Neill Kapron, linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva@google.com, jeffv@google.com

Another long email follows. The TL;DR: given the related issues, such as
changes in cBPF, and some interesting thoughts regarding Google's
maintenance of seccomp inside Android, Android maintainers should make
the decision to "use minijail" or "use bionic's tools" for compiling
policies to BPF. Is there any reason multiple seccomp-policy-to-BPF
compilers need to exist in the AOSP (or even, maybe, in Linux's use
of seccomp)? The shift to a single project for policy compilation to BPF
would remove duplicate effort in maintaining these compilers, solve the
code page integrity issue, and lower potential sources of policy
compiler errors. See below.

On Fri, Sep 13, 2024 at 02:16:40PM GMT, Maciej Żenczykowski wrote:
> On Fri, Sep 13, 2024 at 10:07 AM Maxwell Bland <mbland@motorola.com> wrote:
> > Add a hook to seccomp which triggers/enables hooks in BPF's JIT to instrument
> > the output machine code  page so that EL2 can (1) invert the machine code back
> > to BPF then (2) check the BPF corresponds to a valid seccomp filter policy.
>
> If you care that deeply about this: you could simply turn off jit
> compilation of cBPF (including seccomp) - but you'll take a
> performance hit.
> If you care about performance you could only jit compile *recognized*
> cBPF programs.
> Hell, instead of jit-ing them you could replace them with outright
> (pre)compiled into the kernel native functions that accomplish the
> same thing.
> There's probably only somewhere <10 of these in common use / part of
> the platform.
> That said, you'd still pay a performance hit for (Chrome web browser
> style) sandboxes since those policies *will* be updated without os
> updates.  Similarly with the mainline shipped cBPF code (which does
> process all packets) - you can't guarantee it won't change.

I am hesitant to turn off JIT, as a few months ago I got a
warning from Alexei Starovoitov about this approach:
https://lore.kernel.org/all/CAADnVQJCxFt2R=fbqx1T_03UioAsBO4UXYGh58kJaYHDpMHyxw@mail.gmail.com/

I would be hesitant for Moto (or anyone) to maintain a dynamic list of
acceptable code pages for each AOSP (or subpackage) release, and the
list will only grow with time. It would be really difficult, as well,
for me to even begin to figure out if I have "caught" all of them, since
Qualcomm services use seccomp and I have no idea if I am testing every
edge condition in the phone while developing this.
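
As a rough illustration of what such a list would amount to, the
userspace sketch below hashes the raw filter bytes and looks the digest
up in a fixed table. Everything in it is an assumption made for this
example: the names fnv1a64, known_filter and lookup_filter are
hypothetical, and FNV-1a is only a stand-in digest (a real list would
need a cryptographic hash).

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 64-bit. Stand-in digest for illustration only: it is NOT
 * collision resistant and would not be acceptable for real enforcement. */
static uint64_t fnv1a64(const uint8_t *buf, size_t len)
{
        uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */
        size_t i;

        for (i = 0; i < len; i++) {
                h ^= buf[i];
                h *= 0x100000001b3ULL; /* FNV prime */
        }
        return h;
}

/* Hypothetical entry in a per-release table of known filters. */
struct known_filter {
        const char *origin; /* e.g. the service the filter was seen from */
        uint64_t digest;    /* digest of the raw BPF program bytes */
};

/* Returns the matching table entry, or NULL if the program is unknown. */
static const struct known_filter *
lookup_filter(const struct known_filter *tbl, size_t n,
              const uint8_t *prog, size_t len)
{
        uint64_t d = fnv1a64(prog, len);
        size_t i;

        for (i = 0; i < n; i++)
                if (tbl[i].digest == d)
                        return &tbl[i];
        return NULL;
}
```

Any filter that is not byte-identical to a table entry comes back as
unknown, which is exactly why the table has to be regenerated for every
release of every component that ships its own policy.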

Since we do not know exactly what these code pages will be, and given the
dangers of (or growing lack of support for) the BPF interpreter, the
current SYS_seccomp user environment, e.g. libminijail or bionic's libc
or whatever Qualcomm is using, ends up being the de facto specification
of the seccomp BPF "language", rather than a translation layer from a
standard policy file format which uniformly gets translated to BPF for
the kernel's consumption. The disconnect is that the current seccomp.c
semantics _only_ encode the cBPF operations and some sanity checking
for the ranges of referenced memory; seccomp.c is currently not
sufficient to provide an EL2-enforceable or Android-enforceable
contract on the integrity of the desired policy.

For example, I took some measurements today on-device, and the three
programs that were triggering EL2-level code page integrity failures in
the basic case follow the same general structure:

- Load system call _NR_ definition values
- Generate "priority" JEQ statements (opcode 0x15)
- Generate additional jump statements (opcode 0xa5, 0x35, etc)
- Standard(ish) suffix consisting of loads/movs/exits (opcode 0x61,0xb4,0x95)
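
The shape above can be checked mechanically. A minimal sketch, assuming
the listed values are the opcode bytes of the converted (eBPF) program
and that nothing else appears in it; opcode_allowed() and
insn_shape_ok() are hypothetical names used only for illustration, and a
real checker would likely need a longer allow-list:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allow-list: only the opcode bytes observed in the list above. */
static bool opcode_allowed(uint8_t op)
{
	switch (op) {
	case 0x15: /* jump if equal to immediate */
	case 0xa5: /* jump if less than immediate */
	case 0x35: /* jump if greater or equal to immediate */
	case 0x61: /* 32-bit load from memory */
	case 0xb4: /* 32-bit move of an immediate */
	case 0x95: /* exit */
		return true;
	default:
		return false;
	}
}

/* True if every opcode is on the allow-list and the program ends in exit. */
static bool insn_shape_ok(const uint8_t *ops, size_t n)
{
	if (n == 0 || ops[n - 1] != 0x95)
		return false;
	for (size_t i = 0; i < n; i++)
		if (!opcode_allowed(ops[i]))
			return false;
	return true;
}
```

This only constrains which opcodes appear, not what they compare
against, so it says nothing about whether the policy itself is the one
that was intended.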

But there's nothing to guarantee that this is what will happen for
arbitrary programs with SYS_seccomp permission, as they could be using
different generators for their BPF. For example,
compile_seccomp_policy.py under the minijail project and genseccomp.py
under the bionic libc project solve this same problem in two different
ways: both generate a couple of _NR_ checks and jump statements, but
with different Python code.

Can Android just say "use minijail" or "use bionic's tools" and call it
a day, similar to the intent system, or binder, or any number of the
ecosystem "hard rules"? That way, Google also does not have to maintain
the two separate projects doing the same thing, we can figure out what
the heck Qualcomm is doing, and I can sleep better at night. Seccomp is
not C, there's not the fight over clang vs gcc: system call numbers are
baked into struct seccomp_data, so why bother with multiple (potentially
buggy and differing in flexibility) ways of compiling the desired policy
into BPF? Maybe this is too opinionated, but the nice world we would get
as a result is that every single code page in Android's kernel would be
verifiable (and, if it were adopted in Linux generally, in most ARM
systems).

Regardless, the clear hack, to me, is that when EL2 gets a code page
integrity failure on one of these seccomp pages, for now I do some
simple binary analysis to check that the code page consists only of what
is effectively a giant case statement. Over time, this needs to be
refined to ensure the adversary has not mucked with the policy in a
valid way, like seccomp_check_filter in kernel/seccomp.c but better.

> I guess for the mainline shipped cBPF programs we could technically
> probably swap them for eBPF.  Taking a quick glance at uses of
> BpfClassic.h in aosp I see 6 socket filter cBPF programs, of which
> only 1 is dynamic (for matching clat IP addresses), so the remaining 5
> are probably trivial to eBPF-ify (and thus hide behind selinux
> restrictions).

clatd, netd, gpuWork, and others have turned out not to be an issue so
far (or at least I have not run into any code page errors yet), maybe
because I'm running the kernel protection drivers at the book-ends of
the kernel boot process. The first runs prior to any memory allocation
so that it can ensure pages get allocated in regions permissible for the
Snapdragon chipset's performance constraints on EL2 write checks. The
second runs after the allocation of all boot-time kernel modules and BPF
program loads, since at that point I can check the allocated pages
against SHA256 hashes computed at build time from the .ko files
(allowing holes for self-patching and static_keys), only because I am
paranoid someone will circumvent the existing verified boot routines.
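
Hashing a page while leaving holes for bytes that are legitimately
patched at runtime can be sketched roughly as follows. This is only an
illustration: FNV-1a stands in for SHA256 to keep the example short, and
struct hole and hash_skipping_holes() are hypothetical names, not the
driver's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a byte range excluded from the hash. */
struct hole {
	size_t start; /* offset of the first excluded byte */
	size_t len;   /* number of excluded bytes */
};

static int in_hole(size_t off, const struct hole *holes, size_t nholes)
{
	for (size_t i = 0; i < nholes; i++)
		if (off >= holes[i].start && off - holes[i].start < holes[i].len)
			return 1;
	return 0;
}

/* FNV-1a (a stand-in for SHA256) over buf, skipping bytes inside any hole. */
static uint64_t hash_skipping_holes(const uint8_t *buf, size_t len,
				    const struct hole *holes, size_t nholes)
{
	uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */

	for (size_t off = 0; off < len; off++) {
		if (in_hole(off, holes, nholes))
			continue;
		h ^= buf[off];
		h *= 0x100000001b3ULL; /* FNV-1a 64-bit prime */
	}
	return h;
}
```

Bytes inside a hole never reach the hash, so a static_key flip there
leaves the digest unchanged, while a write anywhere else changes it.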

As mentioned, I will work with Motorola to see if I can figure out a
permissive license for the EL2 components for this part, especially
considering I have seen ... questionable promises ... regarding this
subject in my research and an apparent lack of acknowledgement of issues
like dynamic data structures and seccomp filters from others (not Google)
promising hypervisor-enforced code integrity. Thankfully, due to GPL-2.0
the EL1 drivers will be open source. I will share them once they are
ready, along with test cases of existing exploits for page table
modification, code page modification, system control register
modification, kworker queue manipulation, and BPF page manipulation,
like the below:

#define MODIFY_KERNEL_CODE                                                     \
	do {                                                                   \
		fake_je = (struct jump_entry *)kallsyms_lookup_name_ind(       \
			"spectre_bhb_state");                                  \
		attack_addr = kallsyms_lookup_name_ind("udp_recvmsg");         \
		if (register_kprobe(&kp2)) {                                   \
			return -1;                                             \
		}                                                              \
		arch_jump_label_transform =                                    \
			(arch_jump_label_transform_t)kp2.addr;                 \
		fake_je->code = attack_addr - (unsigned long)&(fake_je->code); \
		fake_je->target = stext - (unsigned long)&(fake_je->target);   \
		arch_jump_label_transform(fake_je, JUMP_LABEL_JMP);            \
		return 0;                                                      \
	} while (0)


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-16 22:17         ` Maxwell Bland
@ 2024-09-16 22:50           ` Maciej Żenczykowski
  2024-09-17 15:15             ` Maxwell Bland
  0 siblings, 1 reply; 26+ messages in thread
From: Maciej Żenczykowski @ 2024-09-16 22:50 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: Neill Kapron, linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva@google.com, jeffv@google.com

On Mon, Sep 16, 2024 at 3:18 PM Maxwell Bland <mbland@motorola.com> wrote:
>
> Another long email follows. The TL;DR is considering the related issues
> such as changes in cBPF and some interesting thoughts regarding Google's
> maintenance of seccomp inside Android, Android maintainers should make
> the decision to "use minijail" or "use bionic's tools" for compiling
> policies to BPF. Is there any reason multiple seccomp policy to BPF
> program compilers need to exist in the AOSP (or even, maybe, Linux's use
> of seccomp)? The shift to a single project for policy compilation to BPF
> would remove duplicate effort in maintaining seccomp policy to BPF
> compilers, solve the code page integrity issue, and lower potential
> sources of policy compiler errors. See below.
>
> On Fri, Sep 13, 2024 at 02:16:40PM GMT, Maciej Żenczykowski wrote:
> > On Fri, Sep 13, 2024 at 10:07 AM Maxwell Bland <mbland@motorola.com> wrote:
> > > Add a hook to seccomp which triggers/enables hooks in BPF's JIT to instrument
> > > the output machine code  page so that EL2 can (1) invert the machine code back
> > > to BPF then (2) check the BPF corresponds to a valid seccomp filter policy.
> >
> > If you care that deeply about this: you could simply turn of jit
> > compilation of cBPF (including seccomp) - but you'll take a
> > performance hit.
> > If you care about performance you could only jit compile *recognized*
> > cBPF programs.
> > Hell, instead of jit-ing them you could replace them with outright
> > (pre)compiled into the kernel native functions that accomplish the
> > same thing.
> > There's probably only somewhere <10 of these in common use / part of
> > the platform.
> > That said, you'd still pay a performance hit for (Chrome web browser
> > style) sandboxes since those policies *will* be updated without os
> > updates.  Similarly with the mainline shipped cBPF code (which does
> > process all packets) - you can't guarantee it won't change.
>
> I am hesistant with opting to turn off JIT, as a few months ago I got a
> warning from Alexei Starovoitov about this approach:
> https://lore.kernel.org/all/CAADnVQJCxFt2R=fbqx1T_03UioAsBO4UXYGh58kJaYHDpMHyxw@mail.gmail.com/
>
> I would be hesitant for Moto (or anyone) to maintain a dynamic list of
> acceptable code pages for each AOSP (or subpackage) release, and the
> list will only grow with time. It would be really difficult, as well,
> for me to even begin to figure out if I have "caught" all of them, since
> Qualcomm services use seccomp and I have no idea if I am testing every
> edge condition in the phone while developing this.
>
> In lieu of knowing exactly what these code pages will be and the dangers
> or growing lack of support for the BPF interpreter: the current
> SYS_Seccomp user environment, e.g. libminijail or bionic's libc or
> whatever Qualcomm is using, ends up being the de dacto specification of
> the seccomp BPF "language", rather than a translation layer to a
> standard policy file format which uniformly gets translated to BPF for
> the kernel's consumption. The disconnect is that the current seccomp.c
> semantics _only_ encode the cBPF operations and some sensibility
> checking for the ranges of referenced memory, but seccomp.c is currently
> not sufficient to provide an EL2-enforcable or Android-enforceable
> contract on the integrity of the desired policy.
>
> For example, I took some measurements today on-device, and the three
> programs that were triggering EL2-level code page integrity failures in
> the basic case follow the same general structure:
>
> - Load systemcall _NR_ definition values
> - Generate "priority" JEQ statements (opcode 0x15)
> - Generate additional jump statements (opcode 0xa5, 0x35, etc)
> - Standard(ish) suffix consisting of loads/movs/exits (opcode 0x61,0xb4,0x95)
>
> But there's nothing to guarantee that this is what will happen in for
> arbitrary programs with SYS_seccomp permission, as they could be using
> different generators for their BPF. For example,
> compile_seccomp_policy.py under the minijail project and genseccomp.py
> under the bionic libc project solve this same problem in two different
> ways, though they both generate a couple of _NR_ checks and jump
> statements, but with different python code.
>
> Can Android just say "use minijail" or "use bionic's tools" and call it
> a day, similar to the intent system, or binder, or any number of the
> ecosystem "hard rules"? That way, Google also does not have to maintain
> the two separate projects doing the same thing, we can figure out what
> the heck Qualcomm is doing, and I can sleep better at night. Seccomp is
> not C, there's not the fight over clang vs gcc: system call numbers are
> baked into struct seccomp_data, why bother with multiple (potentially
> buggy and differing in flexibility) ways of compiling the desired policy
> into BPF. Maybe this is too opinionated, but the nice world we would get
> as a result is every single code page in Android's kernel would be
> verifiable (and, if it was adopted in Linux generally) most ARM systems.
>
> Regardless, the clear hack, to me, is that when EL2 gets a code page
> integrity failure on one of these seccomp pages, for now I do some
> simple binary analysis to check that the code page consists only of what
> is effectively a giant case statement. Over time, this needs to be
> refined to ensure the adversary has not mucked with the policy in a
> valid way, like seccomp_check_filter in kernel/seccomp.c but better.
>
> > I guess for the mainline shipped cBPF programs we could technically
> > probably swap them for eBPF.  Taking a quick glance at uses of
> > BpfClassic.h in aosp I see 6 socket filter cBPF programs, of which
> > only 1 is dynamic (for matching clat IP addresses), so the remaining 5
> > are probably trivial to eBPF-ify (and thus hide behind selinux
> > restrictions).
>
> clatd, netd, gpuWork, and others turned out to not be an issue (or I
> have not run into any code page errors) yet, maybe because I'm running
> drivers for the kernel protection at the book-ends of the kernel boot
> process: one prior to any memory allocation so that it can ensure pages
> get allocated in regions permissible for the Snapdragon chipset's
> performance constraints on EL2 write checks, and the second after the
> allocation of all boot-time kernel modules and BPF program loads, since
> at that point I can check the allocated pages w.r.t. SHA256 hashes
> computed (considering holes for self-patching and static_keys) at build
> time using the .ko files, only because I am paranoid someone will
> circumvent the existing verified boot routines.
>
> As mentioned, I will work with Motorola see if I can figure out a
> permissive license for the EL2 components for this part, especially
> considering I have seen ... questionable promises ... regarding this
> subject in my research and a apparent lack of acknowledgement of issues
> like dynamic datastructures and seccomp filters from others (not Google)
> promising hypervisor-enforced code integrity. Thankfully, due to GPL-2.0
> the EL1 drivers will be open source. I will share them once they are
> ready with testcases of existing exploits for page table modification,
> code page modification, system control register modification, kworker
> queue manipulation, BPF page manipulation, like the below:
>
> #define MODIFY_KERNEL_CODE                                                     \
>         do {                                                                   \
>                 fake_je = (struct jump_entry *)kallsyms_lookup_name_ind(       \
>                         "spectre_bhb_state");                                  \
>                 attack_addr = kallsyms_lookup_name_ind("udp_recvmsg");         \
>                 if (register_kprobe(&kp2)) {                                   \
>                         return -1;                                             \
>                 }                                                              \
>                 arch_jump_label_transform =                                    \
>                         (arch_jump_label_transform_t)kp2.addr;                 \
>                 fake_je->code = attack_addr - (unsigned long)&(fake_je->code); \
>                 fake_je->target = stext - (unsigned long)&(fake_je->target);   \
>                 arch_jump_label_transform(fake_je, JUMP_LABEL_JMP);            \
>                 return 0;                                                      \
>         } while (0)

That's not valid cBPF

--
Maciej Żenczykowski, Kernel Networking Developer @ Google


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-16 22:50           ` Maciej Żenczykowski
@ 2024-09-17 15:15             ` Maxwell Bland
  2024-09-18 19:22               ` Maxwell Bland
  0 siblings, 1 reply; 26+ messages in thread
From: Maxwell Bland @ 2024-09-17 15:15 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Neill Kapron, linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, keescook@chromium.org,
	Will Drewry, Andy Gross, Bjorn Andersson, Konrad Dybcio,
	kernel-team, adelva@google.com, jeffv@google.com

On Mon, Sep 16, 2024 at 03:50:04PM GMT, Maciej Żenczykowski wrote:
> On Mon, Sep 16, 2024 at 3:18 PM Maxwell Bland <mbland@motorola.com> wrote:
> >
> > #define MODIFY_KERNEL_CODE                                                     \
> >         do {                                                                   \
> >                 fake_je = (struct jump_entry *)kallsyms_lookup_name_ind(       \
> >                         "spectre_bhb_state");                                  \
> >                 attack_addr = kallsyms_lookup_name_ind("udp_recvmsg");         \
> >                 if (register_kprobe(&kp2)) {                                   \
> >                         return -1;                                             \
> >                 }                                                              \
> >                 arch_jump_label_transform =                                    \
> >                         (arch_jump_label_transform_t)kp2.addr;                 \
> >                 fake_je->code = attack_addr - (unsigned long)&(fake_je->code); \
> >                 fake_je->target = stext - (unsigned long)&(fake_je->target);   \
> >                 arch_jump_label_transform(fake_je, JUMP_LABEL_JMP);            \
> >                 return 0;                                                      \
> >         } while (0)
> 
> That's not valid cBPF

It is not intended to be: see the Qualys exploit from my original
message. People are not loading bad BPF, they are targeting BPF code
pages for modification during the window between JIT and execution,
using a write-gadget exploit, e.g. UAF + Heap Spray.

Also, I read through and responded to Andy's message on this thread just
now. Andy had the really good idea that, rather than Android saying "use
this seccomp->BPF compiler", the code page or BPF program could come with
an "origin" tag, that is, something saying "this was generated by
libminijail" or "bionic libc". That would work just as well, supposing
that if I were to see a tag for something I did not know (likely one of
these QCOM services), I could email someone at QCOM to get the compiler
spec, hopefully.
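
To make the idea concrete, a rough sketch of the check I have in mind.
The tag strings, the known_origins table, and origin_known() are all
hypothetical: no such tag exists in the seccomp interface today, and how
it would be attached to a filter is an open question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tags for generators whose output format is understood. */
static const char *const known_origins[] = {
	"libminijail",
	"bionic-libc",
};

/* True if the claimed generator is one we hold a compiler spec for. */
static bool origin_known(const char *tag)
{
	if (!tag)
		return false;
	for (size_t i = 0; i < sizeof(known_origins) / sizeof(known_origins[0]); i++)
		if (strcmp(tag, known_origins[i]) == 0)
			return true;
	return false;
}
```

An unknown or missing tag would then be the trigger for asking whoever
owns the service for their compiler spec, rather than a hard failure.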


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-17 15:15             ` Maxwell Bland
@ 2024-09-18 19:22               ` Maxwell Bland
  0 siblings, 0 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-18 19:22 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Kees Cook, linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Neill Kapron, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, Will Drewry, Andy Gross,
	Bjorn Andersson, Konrad Dybcio, kernel-team, Mimi Zohar,
	Dmitry Kasatkin, linux-integrity

Hi All!

First, thanks for your help! This will likely be my last email on the
subject until I have something completed. I figured out how to get this
working(ish) well enough.

Introducing the Seccomp Filter Purity Test! (Attached at the end), for
preventing naughty JIT'ed code pages from mucking up your kernel (they
are not bad, per se, just naughty, since they are not acting as proper
"filters" in the standard sense of the word).

I got to thinking, and realized all we really want is for these new code
pages to not suddenly start exerting their freedom to store all sorts of
illegitimate content into different kernel regions, and that what they
decide to filter or whether their filter ends up being "dirty" since an
adversary used a write gadget to swap a comparison of one value with a
comparison of another is their own, private business, not mine.

What does matter is that if they are going to use my room in the kernel
to do all this stuff without me knowing exactly what the stuff they are
doing in their filter is, I am going to set down some baseline rules:
notably, much the same as for a hotel room, leave the kernel the way you
got it.

I could imagine pretty easily expanding this to rerun a "version" of the
BPF verifier and ensure that any changes made to the kernel's data stay
in a confined boundary.
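
One of those baseline rules, that an immediate branch must land inside
the program, can be sketched on its own as below. It relies only on the
A64 encoding of the unconditional B instruction (bits [31:26] = 000101,
signed imm26 in bits [25:0], counted in instructions); the function
names are hypothetical and are not the ones used in the attached file.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A64 unconditional branch: bits [31:26] == 000101, imm26 in bits [25:0]. */
static bool is_b_imm(uint32_t insn)
{
	return (insn & 0xfc000000u) == 0x14000000u;
}

/* Sign-extend imm26; the result is an offset in instructions, not bytes. */
static int32_t b_imm_offset(uint32_t insn)
{
	uint32_t imm26 = insn & 0x03ffffffu;

	if (imm26 & 0x02000000u) /* sign bit of the 26-bit field */
		return (int32_t)imm26 - 0x04000000;
	return (int32_t)imm26;
}

/* True if the branch at instruction index idx lands inside [0, n_insns). */
static bool branch_stays_inside(uint32_t insn, size_t idx, size_t n_insns)
{
	int64_t target;

	if (!is_b_imm(insn))
		return false;
	target = (int64_t)idx + b_imm_offset(insn);
	return target >= 0 && (uint64_t)target < n_insns;
}
```

Conditional branches (B.cond, CBZ and friends) carry a 19-bit immediate
in a different position and would need the same treatment separately.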

Here's the code. I've not tested it extensively, but it works for at
least one case (linked here https://github.com/KSPP/linux/issues/154).
Additional prescriptions are also in my comment at the link above.

Thanks again,
Maxwell Bland

// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) 2024 Motorola Mobility, Inc.
 *
 * Author: Maxwell Bland
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Routine for verification of JIT'ed ARM64 seccomp code.
 *
 * Because of issues in determining the origin and source of seccomp filters,
 * and a lack of support for the provenance of seccomp filters, we must
 * create a special case for these allocated pages. Critically, if a page
 * does not match a known SHA256 hash, we allow the allocation of a pure
 * function matching the following restrictions:
 *
 * (1) The start can be padded by some number of BRK trap instructions matching
 * standard BPF JIT semantics.
 * (2) The prologue must match the prologue given by bpf_jit_comp.c
 * (3) The epilogue must match the epilogue given by bpf_jit_comp.c
 * (4) The body must only consist of:
 *
 * - Loads (of general purpose registers only)
 * - Arithmetic/Logical instructions (on general purpose registers only)
 * - Comparisons (on general purpose registers, because I am paranoid)
 * - Branches to immediate offsets guaranteed to be within the program
 *   (branch register is considered harmful)
 *
 * This is sufficient to guarantee that the program can be trusted to not touch
 * the rest of the kernel, though it may of course leak critical information and
 * secrets.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

/* TODO REMOVE BELOW --- integrate with actual kernel
 * For now this is done so that you can quickly "test it out"
 * by placing this file in arch/arm64/net/ and compiling with
 * `clang -static seccomp_jit_check_patched.c ../lib/lib.a`
 *
 * After the following easy patch:

--- a/arch/arm64/include/asm/insn-def.h
 #ifndef __ASM_INSN_DEF_H
 #define __ASM_INSN_DEF_H
 
-#include <asm/brk-imm.h>
+#include "brk-imm.h"
--- a/arch/arm64/include/asm/insn.h
 #ifndef        __ASM_INSN_H
 #define        __ASM_INSN_H
-#include <linux/build_bug.h>
 #include <linux/types.h>
 
-#include <asm/insn-def.h>
+#include "insn-def.h"
 
 #ifndef __ASSEMBLY__
 
@@ -301,7 +300,6 @@ enum aarch64_insn_mb_type {
 #define        __AARCH64_INSN_FUNCS(abbr, mask, val)                           \
 static __always_inline bool aarch64_insn_is_##abbr(u32 code)           \
 {                                                                      \
-       BUILD_BUG_ON(~(mask) & (val));                                  \
        return (code & (mask)) == (val);                                \
 }                                                                      \
 static __always_inline u32 aarch64_insn_get_##abbr##_value(void)       \
--- a/arch/arm64/net/bpf_jit.h
 #ifndef _BPF_JIT_H
 #define _BPF_JIT_H
 
-#include <asm/insn.h>

 *
 */
int _printk(const char *format, ...) {
	return 0;
}

void __sw_hweight64() {
	printf("ACKBAR!\n");
}

#define u8 uint8_t
#define u32 uint32_t
#define s32 int32_t
#define u64 uint64_t

#define A64_HINT(x) aarch64_insn_gen_hint(x)             
#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP)
#define A64_PACIASP A64_HINT(AARCH64_INSN_HINT_PACIASP)  
#define A64_AUTIASP A64_HINT(AARCH64_INSN_HINT_AUTIASP)  
#define A64_R(x)        AARCH64_INSN_REG_##x
#define A64_FP          AARCH64_INSN_REG_FP 
#define A64_LR          AARCH64_INSN_REG_LR 
#define A64_ZR          AARCH64_INSN_REG_ZR 
#define A64_SP          AARCH64_INSN_REG_SP 

/* TODO REMOVE ABOVE */

#include "../include/asm/insn.h"
#include "bpf_jit.h"

#define PAGE_SIZE 0x1000
/* Vals may increase based on CONFIGs */
#define PROLOGUE_BASE_NUM_INSNS 11
#define EPILOGUE_BASE_NUM_INSNS 8
#define COMMON_GP_TARGET_REG_MASK 0x1F /* Common general purpose target register mask */
#define PAIR_GP_TARGET_REG_MASK 0x7C00
#define BRANCH_IMM_MASK  0x0FFFFFF
#define CBRANCH_IMM_MASK 0x0FFFFE0

int32_t sign_extend_branch_mask(uint32_t imm) {
	return ((int32_t) (imm << 8)) >> 6;
}

int32_t sign_extend_cbranch_mask(uint32_t imm) {
	return (((int32_t) (imm << 8)) >> 6) >> 5;
}

int match_padding(uint32_t insn) {
        if (insn == 0xd4202000)
                return 1;
        return 0;
}

/*
 * Effectively a copy of the semantics from build_prologue in
 * bpf_jit_comp.c: we might as well call to this function
 * directly to create an "integrity verification" buffer at boot-time and
 * then use this read-only verification buffer to guarantee
 * the contents of the prologue after JIT.
 */
int match_prologue(uint32_t page[PAGE_SIZE], uint64_t *ind) {
        const uint8_t r6 = A64_R(19);
	const uint8_t r7 = A64_R(20);
	const uint8_t r8 = A64_R(21);
	const uint8_t r9 = A64_R(22);
	const uint8_t fp = A64_R(25);
	const uint8_t tcc = A64_R(26);
	const uint8_t fpb = A64_R(27);
        uint64_t max_num_insns = PROLOGUE_BASE_NUM_INSNS;
        // if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
        //         max_num_insns++
	// if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
                max_num_insns++;

        if (*ind + max_num_insns >= PAGE_SIZE) {
		printf("Index passed PAGE_SIZE\n");
                return 0;
	}

        // if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
        //         if (page[*ind++] != (A64_BTI_JC))
        //                 return 0;

        if (page[(*ind)++] != (A64_MOV(1, A64_R(9), A64_LR))) {
		printf("page[(*ind)++] != (A64_MOV(1, A64_R(9), A64_LR)) %x != %x\n", page[*ind - 1], (A64_MOV(1, A64_R(9), A64_LR)));
                return 0;
	}

        if (page[(*ind)++] != (A64_NOP)) {
		printf("page[(*ind)++] != (A64_NOP) %x != %x\n", page[*ind - 1], (A64_NOP));
                return 0;
	}

	/* Sign lr */
	// if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
                if (page[(*ind)++] != (A64_PACIASP)) {
			printf("page[(*ind)++] != (A64_PACIASP) %x != %x\n", page[*ind - 1], (A64_PACIASP));
                        return 0;
		}

	/* Save FP and LR registers to stay aligned with ARM64 AAPCS */
	if (page[(*ind)++] != (A64_PUSH(A64_FP, A64_LR, A64_SP))) {
		printf("page[(*ind)++] != (A64_PUSH(A64_FP, A64_LR, A64_SP)) %x != %x\n", page[*ind - 1], (A64_PUSH(A64_FP, A64_LR, A64_SP)));
                return 0;
	}
	if (page[(*ind)++] != (A64_MOV(1, A64_FP, A64_SP))) {
		printf("page[(*ind)++] != (A64_MOV(1, A64_FP, A64_SP)) %x != %x\n", page[*ind - 1], (A64_MOV(1, A64_FP, A64_SP)));
                return 0;
	}

	/* Save callee-saved registers */
	if (page[(*ind)++] != (A64_PUSH(r6, r7, A64_SP))) {
		printf("page[(*ind)++] != (A64_PUSH(r6, r7, A64_SP)) %x != %x\n", page[*ind - 1], (A64_PUSH(r6, r7, A64_SP)));
                return 0;
	}
	if (page[(*ind)++] != (A64_PUSH(r8, r9, A64_SP))) {
		printf("page[(*ind)++] != (A64_PUSH(r8, r9, A64_SP)) %x != %x\n", page[*ind - 1], (A64_PUSH(r8, r9, A64_SP)));
                return 0;
	}
	if (page[(*ind)++] != (A64_PUSH(fp, tcc, A64_SP))) {
		printf("page[(*ind)++] != (A64_PUSH(fp, tcc, A64_SP)) %x != %x\n", page[*ind - 1], (A64_PUSH(fp, tcc, A64_SP)));
                return 0;
	}
	if (page[(*ind)++] != (A64_PUSH(fpb, A64_R(28), A64_SP))) {
		printf("page[(*ind)++] != (A64_PUSH(fpb, A64_R(28), A64_SP)) %x != %x\n", page[*ind - 1], (A64_PUSH(fpb, A64_R(28), A64_SP)));
                return 0;
	}

	/* Set up BPF prog stack base register */
	if (page[(*ind)++] != (A64_MOV(1, fp, A64_SP))) {
		printf("page[(*ind)++] != (A64_MOV(1, fp, A64_SP)) %x != %x\n", page[*ind - 1], (A64_MOV(1, fp, A64_SP)));
                return 0;
	}

        /* Program should always be ebpf_from_cbpf for
         * seccomp, so ignore the tail_call_cnt and bti j initialization
         * which would normally be in the prologue at this point */

        /* Based on the semantics of find_fpb_offset, ctx->fpb_offset, used
         * to decide this next instruction, is non-zero iff there is a
         * store/load involving the frame pointer, which would be exceedingly
         * weird to have in a seccomp filter (I'd like to see a justification if
         * such a program does exist) and is therefore assumed to be 0. */
        if (page[(*ind)++] != (A64_SUB_I(1, fpb, fp, 0))) {
		printf("page[(*ind)++] != (A64_SUB_I(1, fpb, fp, 0)) %x != %x\n", page[*ind - 1], (A64_SUB_I(1, fpb, fp, 0)));
                return 0;
	}

        /* Standard program semantics here only make the restriction that
         * the program stack must be a multiple of 16 bytes, but why the
         * heck is a seccomp filter using the stack? So we force it 0 as
         * well */
	if (page[(*ind)++] != (A64_SUB_I(1, A64_SP, A64_SP, 0))) {
		printf("page[(*ind)++] != (A64_SUB_I(1, A64_SP, A64_SP, 0)) %x != %x\n", page[*ind - 1], (A64_SUB_I(1, A64_SP, A64_SP, 0)));
                return 0;
	}

	return 1;
}

int match_epilogue(uint32_t page[PAGE_SIZE], uint64_t *ind) {
	const uint8_t r0 = A64_R(7);
        const uint8_t r6 = A64_R(19);
	const uint8_t r7 = A64_R(20);
	const uint8_t r8 = A64_R(21);
	const uint8_t r9 = A64_R(22);
	const uint8_t fp = A64_R(25);
	const uint8_t fpb = A64_R(27);
        uint64_t max_num_insns = EPILOGUE_BASE_NUM_INSNS;
	// if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
                max_num_insns++;

        if (*ind + max_num_insns >= PAGE_SIZE) {
		printf("Epilogue past page size!\n");
                return 0;
	}

        if (page[(*ind)++] != (A64_ADD_I(1, A64_SP, A64_SP, 0))) {
		printf("page[(*ind)++] != (A64_ADD_I(1, A64_SP, A64_SP, 0)) %x != %x\n", page[*ind - 1], (A64_ADD_I(1, A64_SP, A64_SP, 0)));
                return 0;
	}
                                                                 
        if (page[(*ind)++] != (A64_POP(fpb, A64_R(28), A64_SP))) {
		printf("page[(*ind)++] != (A64_POP(fpb, A64_R(28), A64_SP)) %x != %x\n", page[*ind - 1], (A64_POP(fpb, A64_R(28), A64_SP)));
                return 0;
	}
        if (page[(*ind)++] != (A64_POP(fp, A64_R(26), A64_SP))) {
		printf("page[(*ind)++] != (A64_POP(fp, A64_R(26), A64_SP)) %x != %x\n", page[*ind - 1], (A64_POP(fp, A64_R(26), A64_SP)));
                return 0;
	}
                                                                         
        if (page[(*ind)++] != (A64_POP(r8, r9, A64_SP)))                              {
		printf("page[(*ind)++] != (A64_POP(r8, r9, A64_SP)) %x != %x\n", page[*ind - 1], (A64_POP(r8, r9, A64_SP)));
                return 0;
	}
        if (page[(*ind)++] != (A64_POP(r6, r7, A64_SP)))                              {
		printf("page[(*ind)++] != (A64_POP(r6, r7, A64_SP)) %x != %x\n", page[*ind - 1], (A64_POP(r6, r7, A64_SP)));
                return 0;
	}
                                                                         
        if (page[(*ind)++] != (A64_POP(A64_FP, A64_LR, A64_SP))) {
		printf("page[(*ind)++] != (A64_POP(A64_FP, A64_LR, A64_SP)) %x != %x\n", page[*ind - 1], (A64_POP(A64_FP, A64_LR, A64_SP)));
                return 0;
	}
                                                                         
        if (page[(*ind)++] != (A64_MOV(1, A64_R(0), r0))) {
		printf("page[(*ind)++] != (A64_MOV(1, A64_R(0), r0)) %x != %x\n", page[*ind - 1], (A64_MOV(1, A64_R(0), r0)));
                return 0;
	}
                                                                         
         // if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))                   
                if (page[(*ind)++] != (A64_AUTIASP)) {
			printf("page[(*ind)++] != (A64_AUTIASP) %x != %x\n", page[*ind - 1], (A64_AUTIASP));
                        return 0;
		}
                                                                         
        if (page[(*ind)++] != (A64_RET(A64_LR))) {
		printf("page[(*ind)++] != (A64_RET(A64_LR)) %x != %x\n", page[*ind - 1], (A64_RET(A64_LR)));
                return 0;
	}

        return 1;
}


int insn_ok(uint32_t insn) {
	printf("Entering insn_ok\n");
        // Dest is in least sig 5 bits. Note CMP* instructions
        // are implemented as subs, etc, with a special dest.
        if (aarch64_insn_is_adr(insn) ||
            aarch64_insn_is_adrp(insn) ||
            aarch64_insn_is_load_imm(insn) ||
            aarch64_insn_is_load_pre(insn) ||
            aarch64_insn_is_load_post(insn) ||
            aarch64_insn_is_ldr_reg(insn) ||
            aarch64_insn_is_ldr_imm(insn) ||
            aarch64_insn_is_ldr_lit(insn) ||
            aarch64_insn_is_ldrsw_lit(insn) ||
            aarch64_insn_is_add_imm(insn) ||
            aarch64_insn_is_adds_imm(insn) ||
            aarch64_insn_is_sub_imm(insn) ||
            aarch64_insn_is_subs_imm(insn) ||
            aarch64_insn_is_movn(insn) ||
            aarch64_insn_is_sbfm(insn) ||
            aarch64_insn_is_bfm(insn) ||
            aarch64_insn_is_movz(insn) ||
            aarch64_insn_is_ubfm(insn) ||
            aarch64_insn_is_movk(insn) ||
            aarch64_insn_is_add(insn) ||
            aarch64_insn_is_adds(insn) ||
            aarch64_insn_is_sub(insn) ||
            aarch64_insn_is_subs(insn) ||
            aarch64_insn_is_madd(insn) ||
            aarch64_insn_is_msub(insn) ||
            aarch64_insn_is_udiv(insn) ||
            aarch64_insn_is_sdiv(insn) ||
            aarch64_insn_is_lslv(insn) ||
            aarch64_insn_is_lsrv(insn) ||
            aarch64_insn_is_asrv(insn) ||
            aarch64_insn_is_rorv(insn) ||
            aarch64_insn_is_rev16(insn) ||
            aarch64_insn_is_rev32(insn) ||
            aarch64_insn_is_rev64(insn) ||
            aarch64_insn_is_and(insn) ||
            aarch64_insn_is_bic(insn) ||
            aarch64_insn_is_orr(insn) ||
            aarch64_insn_is_mov_reg(insn) ||
            aarch64_insn_is_orn(insn) ||
            aarch64_insn_is_eor(insn) ||
            aarch64_insn_is_eon(insn) ||
            aarch64_insn_is_ands(insn) ||
            aarch64_insn_is_bics(insn) ||
            aarch64_insn_is_and_imm(insn) ||
            aarch64_insn_is_orr_imm(insn) ||
            aarch64_insn_is_eor_imm(insn) ||
            aarch64_insn_is_ands_imm(insn) ||
            aarch64_insn_is_extr(insn)
        )
        {
                if ((insn & COMMON_GP_TARGET_REG_MASK) > 28) {
                        if ((insn & COMMON_GP_TARGET_REG_MASK) == 0x1f) {
				printf("Checking is cmp op type!\n");
                                /* Only allow  aliases, otherwise fail */
				if (
						aarch64_insn_is_adds(insn) ||
						aarch64_insn_is_adds_imm(insn) ||
						aarch64_insn_is_subs(insn) ||
						aarch64_insn_is_subs_imm(insn) ||
						aarch64_insn_is_ands(insn) ||
						aarch64_insn_is_ands_imm(insn)
				   ) {
					printf("Returning OK!\n");
                                        return 1;
				}
                        }
                        return 0;
                }
                return 1;
        }

        // Pair of registers
        if (
                aarch64_insn_is_ldp(insn) ||
                aarch64_insn_is_ldp_post(insn) || // post-index variant
                aarch64_insn_is_ldp_pre(insn) // pre-index variant
        )
        {
                if ((insn & PAIR_GP_TARGET_REG_MASK) > 28 ||
                    (insn & COMMON_GP_TARGET_REG_MASK) > 28)
                        return 0;
                return 1;
        }
        
        // No dest effect
        if (
                aarch64_insn_is_prfm(insn) ||
                aarch64_insn_is_prfm_lit(insn) ||
                aarch64_insn_is_dmb(insn) ||
                aarch64_insn_is_dsb_base(insn) ||
                aarch64_insn_is_dsb_nxs(insn) ||
                aarch64_insn_is_isb(insn) ||
                aarch64_insn_is_sb(insn) ||
                aarch64_insn_is_ssbb(insn) ||
                aarch64_insn_is_pssbb(insn)
        ) {
                return 1;
        }

        // Branch insn
        if (
                aarch64_insn_is_b(insn) ||
                aarch64_insn_is_cbz(insn) ||
                aarch64_insn_is_cbnz(insn) ||
                aarch64_insn_is_tbz(insn) ||
                aarch64_insn_is_tbnz(insn) ||
                aarch64_insn_is_bcond(insn)
        ) {
                /* 
                 * For now, do nothing. After we determine the
                 * size of the body, we will check each index
                 * is not outside of the expected bounds
                 */
                return 1;
        }

        return 0;
}

/*
 * branch_imms_ok - checks if any branches immediate values exceed the program length
 * @branch_imms: array storing adjacent pairs of immediate value, PC offset
 * @pairs_len: the length of this array
 * @prog_len: the size of the program in bytes
 *
 * Yes, you may still jump to arbitrary offsets in the program, but since everything
 * is "pure" you won't exactly be able to store your results.
 */
int branch_imms_ok(uint32_t branch_imms[PAGE_SIZE * 2], uint64_t pairs_len, uint64_t prog_len) {
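	uint64_t i = 0;
	uint32_t pc;
	int32_t imm;
	int64_t target;
	while (i < pairs_len) {
		imm = (int32_t) branch_imms[i];
		pc = branch_imms[i + 1];
		printf("Checking branch imm %x pc %x prog_len %lx\n", imm, pc, prog_len);
		/*
		 * Compute the target in a signed 64-bit type. uint32_t + int32_t
		 * is evaluated as unsigned, so a plain "pc + imm < 0" test can
		 * never be true.
		 */
		target = (int64_t) pc + (int64_t) imm;
		if (target < 0 || (uint64_t) target > prog_len)
			return 0;
		i += 2;
	}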
	return 1;
}

int purity_check(uint32_t page[PAGE_SIZE], uint64_t *ind) {
	uint32_t branch_imms[PAGE_SIZE * 2];
	uint64_t branch_imms_ind = 0;
	uint64_t prog_start = 0;
	uint64_t prog_end = 0;

        if (!match_prologue(page, ind)) {
		printf("Failed to match prologue\n");
                return 0; /* main() tests !purity_check(), so failure must be 0 */
	}

	prog_start = *ind;

	/* Check instructions and record branch offset information */
        while (*ind < PAGE_SIZE) {
		printf("Checking insn %x at ind %lx\n", page[*ind], *ind);
                if (!insn_ok(page[*ind]))
                        break;
		if (aarch64_insn_is_b(page[*ind])) {
			branch_imms[branch_imms_ind++] = sign_extend_branch_mask(page[*ind] & BRANCH_IMM_MASK);
			branch_imms[branch_imms_ind++] = (*ind) - prog_start;
		} else if (aarch64_insn_is_cbz(page[*ind]) ||
			   aarch64_insn_is_cbnz(page[*ind]) ||
			   aarch64_insn_is_tbz(page[*ind]) ||
			   aarch64_insn_is_tbnz(page[*ind]) ||
			   aarch64_insn_is_bcond(page[*ind])) {
			branch_imms[branch_imms_ind++] = sign_extend_cbranch_mask(page[*ind] & CBRANCH_IMM_MASK);
			branch_imms[branch_imms_ind++] = (*ind) - prog_start;
		}
		(*ind)++;
        }
	branch_imms[branch_imms_ind] = 0;

	printf("Leaving instruction check loop\n");

	prog_end = (*ind) - 1;

        if (!match_epilogue(page, ind)) {
		printf("Invalid epilogue!\n");
                return 0;
	}


	if (!branch_imms_ok(branch_imms, branch_imms_ind, (prog_end - prog_start) * sizeof(uint32_t))) {
		printf("Invalid Branch Immediate Offset Value!\n");
		return 0;
	}

        return 1;
}

int main(int argc, char **argv) {
        FILE * inp = freopen(NULL, "rb", stdin);
        uint32_t page[PAGE_SIZE] = { 0 }; /* short inputs must not leave stack garbage */
        uint32_t insn = 0;
        uint64_t sz = 0;
        uint64_t ind = 0;

        while ((sz = fread(&insn, sizeof(insn), 1, inp))) {
                if (ind >= PAGE_SIZE)
                        return -1; /* Nah, we don't play like that */
                page[ind++] = insn;
        }

        ind = 0;
        while (ind < PAGE_SIZE) { 
                if (!match_padding(page[ind]))
                    break;
		ind++;
        }

        if (ind == PAGE_SIZE) {
		printf("Ind == PAGE_SIZE\n");
                return -1;
	}

        if (!purity_check(page, &ind)) {
		printf("Failed purity check\n");
                return -1;
	}

	printf("Passed purity check\n");
        return 0;
}



* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-12 16:02 [RFC] Proposal: Static SECCOMP Policies Maxwell Bland
  2024-09-12 20:57 ` Neill Kapron
@ 2024-09-17  7:34 ` Kees Cook
  2024-09-17 16:54   ` Maxwell Bland
  1 sibling, 1 reply; 26+ messages in thread
From: Kees Cook @ 2024-09-17  7:34 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Neill Kapron, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, Will Drewry, Andy Gross,
	Bjorn Andersson, Konrad Dybcio, kernel-team

On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> operated on around 0.1188 MB). But most importantly, third, without some degree
> of provenance, I have no way of telling if someone has injected malicious code
> into the kernel, and unfortunately even knowing the correct bytes is still
> "iffy", as in order to prevent JIT spray attacks, each of these filters is
> offset by some random number of uint32_t's, making every 4-byte shift of the
> filter a "valid" codepage to be loaded at runtime.

I wanted to focus this thread on the problem, rather than potential
solutions. I think we risk losing sight of getting a complete description
of what is needed if we dive into solutions too quickly.

So, let's start here. What I've seen from the thread is that there isn't
a way to verify that a given JIT matches the cBPF. Is validating the
cBPF itself also needed?

This reminds me of two related topics, which might help either better
define the problem or help find some other folks with similar needs.

- The IMA subsystem has wanted a way to measure (and validate) seccomp
  filters. We could get more details from them for defining this need
  more clearly.

- The JIT needs to be verified against the cBPF that it was generated
  from. We currently do only a single pass and don't validate it once
  the region has been set read-only. We have a standing feature request
  for improving this: https://github.com/KSPP/linux/issues/154

For solutions, I didn't see much discussion around the "orig_prog"
copy of the cBPF. Under CHECKPOINT_RESTORE, the original cBPF remains
associated with the JIT via the orig_prog member of struct
seccomp_filter's struct bpf_prog prog. If it has value outside of
CHECKPOINT_RESTORE, then we could do it for those conditions too.

-Kees

-- 
Kees Cook


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-17  7:34 ` Kees Cook
@ 2024-09-17 16:54   ` Maxwell Bland
  2024-09-17 17:01     ` Maxwell Bland
  0 siblings, 1 reply; 26+ messages in thread
From: Maxwell Bland @ 2024-09-17 16:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Neill Kapron, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, Will Drewry, Andy Gross,
	Bjorn Andersson, Konrad Dybcio, kernel-team

On Tue, Sep 17, 2024 at 12:34:28AM GMT, Kees Cook wrote:
> On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > operated on around 0.1188 MB). But most importantly, third, without some degree
> > of provenance, I have no way of telling if someone has injected malicious code
> > into the kernel, and unfortunately even knowing the correct bytes is still
> > "iffy", as in order to prevent JIT spray attacks, each of these filters is
> > offset by some random number of uint32_t's, making every 4-byte shift of the
> > filter a "valid" codepage to be loaded at runtime.
> 
> So, let's start here. What I've seen from the thread is that there isn't
> a way to verify that a given JIT matches the cBPF. Is validating the
> cBPF itself also needed?

Yes(ish) but mostly no. Current kernel exploits, from what I have seen
and what is readily available, consist of three stages:

- Find a UAF
- Bootstrap this UAF into an unconstrained read/write
- Modify some core kernel resource to get arbitrary execution.

Example dating back to 2019:
https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html

An adversary could modify the loaded cBPF program prior to loading in
order to, say, change the range of syscall _NR_'s accepted by the
seccomp switch statement in order to stage their escape from Chrome's
sandbox.
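
To make the "change the range of syscall _NR_'s" case concrete, here is a
minimal sketch of a cBPF range filter built with the uapi macros from
<linux/filter.h> and <linux/seccomp.h>. The helper name build_range_filter
and the [lo, hi] policy are made up for illustration; a real policy would
also check seccomp_data.arch, and nothing here is taken from Chrome's or
minijail's actual filters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: write a classic BPF seccomp program into out[] that
 * allows syscall numbers in [lo, hi] and kills the process otherwise.
 * Returns the number of instructions written.
 */
static size_t build_range_filter(struct sock_filter *out, size_t cap,
				 unsigned int lo, unsigned int hi)
{
	const struct sock_filter prog[] = {
		/* 0: A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* 1: if (A >= lo) fall through, else skip 2 insns to the kill */
		BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, lo, 0, 2),
		/* 2: if (A > hi) skip 1 insn to the kill, else fall through */
		BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, hi, 1, 0),
		/* 3: in range */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		/* 4: out of range */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	};
	const size_t n = sizeof(prog) / sizeof(prog[0]);

	assert(cap >= n);
	memcpy(out, prog, sizeof(prog));
	return n;
}
```

The whole policy hinges on the two k immediates in instructions 1 and 2, so
a single 32-bit write to either of them before the program is loaded is
enough to widen the accepted range.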

However, JIT presents a more general issue, hence the mostly no, since
an exploited native system service could target the JITed code page
in order to exploit the kernel, rather than requiring something to
be staged within the modified seccomp sandbox as in the "cBPF itself"
example.

For example, Motorola (as well as QCOM) has a few system services for
hardware and other things, written in C, such as our native dropbox
agent. Supposing there were an exploit for this agent allowing execution
within that service's context, an adversary could find a UAF, and target
the page of Chrome's JITed seccomp filter in order to exploit the full
kernel. That is, they are not worried about escaping the sandbox so much
as finding a writable resource from which they can gain privileges in
the rest of the kernel.

Admittedly, there are ~29,000 other writable data structures (in
msm-kernel) they could also target, but the JIT'ed seccomp filter is the
only code page they could modify (since it is not possible to get
compile-time provenance/signatures). The dilemma is that, as opposed to
modifying, say, the system_unbound_wq and adding an entry to it that
holds a pointer to call_usermodehelper_exec_work, you could add some
code to this page instead, making the kernel the same level of
exploitable.

The goal at the end of the day is to fix this and then try to build a
system to lock down the rest of the data in a sensible way. Likely an
ARM-MTE-like, EL2-maintained tag system conditioned on the kernel's
scheduler and memory allocation infrastructure. At least, that is what I
want to be working on, after I figure out this seccomp stuff.

> - The IMA subsystem has wanted a way to measure (and validate) seccomp
>   filters. We could get more details from them for defining this need
>   more clearly.

You are right. I have added Mimi, Dmitry, and the integrity list. Their
work with linked lists and other data structures is right in line with
these concerns. I do not know if they have looked at building verifiers
for JIT'ed cBPF pages already.

> - The JIT needs to be verified against the cBPF that it was generated
>   from. We currently do only a single pass and don't validate it once
>   the region has been set read-only. We have a standing feature request
>   for improving this: https://github.com/KSPP/linux/issues/154
>
Kees, this is exactly what I'm talking about, you are awesome!

I'll share the (pretty straightforward) EL2 logic for this, though not the
code, since licensing and all that, but this public mailing list should
hopefully serve as prior art for any questionable chipset vendor attempting to
patent public domain security for the everyday person:

- Marking PTEs null is fine
- If a new PTE is allocated, mark it PXN atomically using the EL2
  permission fault failure triggered from the page table lockdown (see
  GPL-2.0 kernel module below).
- If a PTE is updated and the PXN bit is switched from 1 to 0, SHA256
  the page, mark it immutable, and let it through if it is OK.

This lets the page be mucked with during the whole JIT process, but ensures
that the second the page wants to be priv-executable, no further modifications
happen. To "unlock" the page for freeing, one just needs to set the PXN bit
back. Then if we ever want to execute from it again, the process repeats, and
so on. This relies on my prior main.c vmalloc maintenance and the below ptprotect
logic (note, WIP, no warranty on this code).
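
The three rules above can be modelled in userspace as a small state machine.
This is only a sketch under stated assumptions, not the EL2 code: struct
sim_page, sim_set_pxn, sim_write and sim_approve_any are hypothetical names,
the page is a plain byte array, and 64-bit FNV-1a stands in for SHA256 to
keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u

/* Hypothetical per-page state tracked by the monitor. */
struct sim_page {
	uint8_t data[SIM_PAGE_SIZE];
	bool pxn;        /* true: page is not privileged-executable */
	bool immutable;  /* true: writes are refused */
	uint64_t digest; /* measurement taken on the PXN 1 -> 0 transition */
};

/* Stand-in for SHA256 (assumption): 64-bit FNV-1a over the page. */
static uint64_t sim_digest(const uint8_t *p, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ull;
	size_t i;

	for (i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ull;
	}
	return h;
}

/* Placeholder policy hook: accepts every measurement. */
static bool sim_approve_any(uint64_t digest)
{
	(void)digest;
	return true;
}

/*
 * Handle a request to change the PXN bit. Clearing PXN measures the page,
 * asks the policy hook, and freezes the page. Setting PXN again unlocks it.
 */
static bool sim_set_pxn(struct sim_page *pg, bool new_pxn,
			bool (*approve)(uint64_t))
{
	assert(pg != NULL && approve != NULL);

	if (pg->pxn && !new_pxn) {
		uint64_t d = sim_digest(pg->data, sizeof(pg->data));

		if (!approve(d))
			return false; /* page stays PXN and writable */
		pg->digest = d;
		pg->immutable = true;
	} else if (!pg->pxn && new_pxn) {
		pg->immutable = false;
	}
	pg->pxn = new_pxn;
	return true;
}

/* Writes succeed only while the page is not frozen. */
static bool sim_write(struct sim_page *pg, size_t off, uint8_t val)
{
	if (pg->immutable || off >= sizeof(pg->data))
		return false;
	pg->data[off] = val;
	return true;
}
```

In this model the JIT may call sim_write freely while pxn is set; the first
sim_set_pxn(pg, false, ...) freezes the contents, and only setting PXN back
makes the page writable again.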

> For solutions, I didn't see much discussion around the "orig_prog"
> copy of the cBPF. Under CHECKPOINT_RESTORE, the original cBPF remains
> associated with the JIT. struct seccomp_filter's struct bpf_prog prog's
> orig_prog member. If it has value outside of CHECKPOINT_RESTORE, then
> we could do it for those conditions too.

Unfortunately the Android GKI does not support checkpoint restore and makes the
orig_prog reference fail (at least in the case I'm trying to work towards for
cell phones).

I could lock the orig_prog as immutable during the JIT and then, given the
resulting code page, attempt to reproduce that code page in EL2 from the
original cBPF, but that seems dangerous and potentially buggy as opposed to
checking the reference addresses in the final machine code against knowledge
of struct seccomp_data (what I am working on right now).
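
As a rough illustration of that last idea, the sketch below checks one A64
instruction form, LDR Wt, [Xn, #imm] (32-bit, unsigned offset), and accepts
it only if the base register is the one assumed to hold the struct
seccomp_data pointer and the 4-byte access stays inside the structure. The
function name and the ctx_reg parameter are hypothetical, and whether the
arm64 JIT actually emits this addressing form for a given filter is an
assumption; a real checker would need to cover every load form the JIT can
produce.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* LDR Wt, [Xn, #imm] (32-bit, unsigned offset): bits 31..22 = 1011100101 */
#define LDR_W_UIMM_MASK 0xffc00000u
#define LDR_W_UIMM_BITS 0xb9400000u

/*
 * Returns  1 if insn is the load form above, uses ctx_reg as its base, and
 *            reads 4 bytes entirely within struct seccomp_data;
 *          0 if it is that load form but fails either check;
 *         -1 if it is some other instruction (not judged here).
 */
static int ldr_w_within_seccomp_data(uint32_t insn, unsigned int ctx_reg)
{
	uint32_t imm12, rn, offset;

	assert(ctx_reg < 32);

	if ((insn & LDR_W_UIMM_MASK) != LDR_W_UIMM_BITS)
		return -1;

	imm12 = (insn >> 10) & 0xfffu; /* offset scaled by the access size */
	rn = (insn >> 5) & 0x1fu;      /* base register */
	offset = imm12 * 4u;

	if (rn != ctx_reg)
		return 0;
	if (offset + 4u > sizeof(struct seccomp_data))
		return 0;
	return 1;
}
```

For instance, 0xb9400401 encodes ldr w1, [x0, #4], which reads
seccomp_data.arch when x0 is the context pointer, while 0xb9404000
(ldr w0, [x0, #64]) reads just past the end of the 64-byte structure and
is rejected.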

Maxwell

// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2023 Motorola Mobility, Inc.
 *
 * Authors: Maxwell Bland
 * Binsheng "Sammy" Que
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Initializes hypervisor-level protections for the kernel pagetables.  In
 * coordination with the moto_org_mem driver, which restricts executable code
 * pages to a well defined region in-between
 *
 * stext <-> module_alloc_base + SZ_2G
 *
 * It is able to mark all page tables not corresponding to this virtual address
 * range PXNTable. Mark the table these descriptors exist within as immutable.
 * For all tables/descriptors which are marked privileged executable, these are
 * marked permanently immutable, and their modifications are tracked directly.
 */
#ifndef _PTPROTECT_H
#define _PTPROTECT_H

#include <linux/delay.h>
#include <linux/highmem.h>
#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/of_platform.h>
#include <linux/pagewalk.h>
#include <linux/types.h>
#include <asm/pgalloc.h>
#include <asm/pgtable-hwdef.h>
#include <asm/pgtable.h>
#include <mm/pgalloc-track.h>
#include <trace/hooks/fault.h>
#include <trace/hooks/vendor_hooks.h>
#include <fs/erofs/compress.h>

uint64_t stext_vaddr = 0;
uint64_t etext_vaddr = 0;
uint64_t module_alloc_base_vaddr = 0;

uint64_t last_pmd_range[2] = { 0, 0 };
uint64_t pmd_range_list[1024][2] = { 0 };
int pmd_range_list_index = 0;

/**
 * add_to_pmd_range_list - adds a range to the pmd range list
 * @start: Start of the range
 * @end: End of the range
 *
 * Used to implement a naive set of adjacent pmd segments to
 * speed up protection code, as otherwise we will treat each
 * pmd (there are a lot of them) as a separate region to protect.
 */
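static void add_to_pmd_range_list(uint64_t start, uint64_t end)
{
	/* Do not write past the end of the fixed-size table */
	if (pmd_range_list_index >= (int)ARRAY_SIZE(pmd_range_list)) {
		pr_warn("MotoRKP: pmd range list full\n");
		return;
	}
	pmd_range_list[pmd_range_list_index][0] = start;
	pmd_range_list[pmd_range_list_index][1] = end;
	pmd_range_list_index++;
}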

void lock_last_pmd_range(void)
{
	if (last_pmd_range[0] == 0 || last_pmd_range[1] == 0)
		return;
	split_block(last_pmd_range[0]);
	mark_range_ro_smc(last_pmd_range[0], last_pmd_range[1],
			  KERN_PROT_PAGE_TABLE);
	msleep(10);
}

/**
 * prot_pmd_entry - protects a range pointed to by a pmd entry
 *
 * @pmd: Pointer to the pmd entry
 * @addr: Virtual address of the pmd entry
 */
static void prot_pmd_entry(pmd_t *pmd, unsigned long addr)
{
	uint64_t pgaddr = pmd_page_vaddr(*pmd);
	uint64_t start_range = 0;
	uint64_t end_range = 0;

	/*
         * Just found that QCOM's gic_intr_routing.c kernel module is getting
         * allocated at vaddr ffffffdb87f67000, but modules code region should
         * only be allocated from ffffffdb8fc00000 to ffffffdc0fdfffff...
         * 
         * It seems to be because arm64's module.h defines module_alloc_base as
         * ((u64)_etext - MODULES_VSIZE). But this module_alloc_base preprocessor
         * define should be redefined/randomized by kernel/kaslr.c, however, it
         * appears that early init modules get allocated before
         * module_alloc_base is relocated, so c'est la vie, and the efforts of
         * kaslr.c are for naught (_etext's vaddr is randomized though, so it
         * does not matter, I guess).
         */
	uint64_t module_alloc_start = module_alloc_base_vaddr;
	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;

	if (!pmd_present(*pmd) || pmd_bad(*pmd) || pmd_none(*pmd) ||
	    !pmd_val(*pmd))
		return;

	/* Round the starts and ends of each region to their boundary limits */
	// module_alloc_start -= (module_alloc_start % PMD_SIZE);
	// module_alloc_end += PMD_SIZE - (module_alloc_end % PMD_SIZE) - 1;

	start_range = __virt_to_phys(pgaddr);
	end_range = __virt_to_phys(pgaddr) + sizeof(pte_t) * PTRS_PER_PMD - 1;

	/* If the PMD potentially points to code, check it in the hypervisor */
	if (!pmd_leaf(*pmd) &&
	    ((addr <= etext_vaddr && (addr + PMD_SIZE - 1) >= stext_vaddr) ||
	     (addr <= module_alloc_end &&
	      (addr + PMD_SIZE - 1) >= module_alloc_start))) {
		if (start_range == last_pmd_range[1] + 1) {
			last_pmd_range[1] = end_range;
		} else if (end_range + 1 == last_pmd_range[0]) {
			last_pmd_range[0] = start_range;
		} else if (last_pmd_range[0] == 0 && last_pmd_range[1] == 0) {
			last_pmd_range[0] = start_range;
			last_pmd_range[1] = end_range;
		} else {
			add_to_pmd_range_list(last_pmd_range[0],
					      last_pmd_range[1]);
			lock_last_pmd_range();
			last_pmd_range[0] = start_range;
			last_pmd_range[1] = end_range;
		}
		/* If the PMD points to data only, mark it PXN, as the caller will
                 * mark the PMD immutable after this function returns */
	} else {
		if (!pmd_leaf(*pmd)) {
			set_pmd(pmd, __pmd(pmd_val(*pmd) | PMD_TABLE_PXN));
		} else {
			/* TODO: if block, ensure range is marked immutable */
			pr_info("MotoRKP: pmd block at %llx\n", start_range);
		}
	}
}

pgd_t *swapper_pg_dir_ind;
void (*set_swapper_pgd_ind)(pgd_t *pgdp, pgd_t pgd);

static inline bool in_swapper_pgdir_ind(void *addr)
{
	return ((unsigned long)addr & PAGE_MASK) ==
	       ((unsigned long)swapper_pg_dir_ind & PAGE_MASK);
}

static inline void set_pgd_ind(pgd_t *pgdp, pgd_t pgd)
{
	if (in_swapper_pgdir_ind(pgdp)) {
		set_swapper_pgd_ind(pgdp, __pgd(pgd_val(pgd)));
		return;
	}

	WRITE_ONCE(*pgdp, pgd);
	dsb(ishst);
	isb();
}

/**
 * prot_pgd_entry - protects a range pointed to by a pgd entry
 * @pgd: pgd struct with descriptor values
 * @addr: vaddr of start of pgds referenced memory range
 */
static int prot_pgd_entry(pgd_t *pgd, unsigned long addr, unsigned long next,
			  struct mm_walk *walk)
{
	uint64_t pgaddr = pgd_page_vaddr(*pgd);
	uint64_t start_range = 0;
	uint64_t end_range = 0;
	uint64_t module_alloc_start = module_alloc_base_vaddr;
	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
	uint64_t i = 0;
	pmd_t *subdescriptor = 0;
	unsigned long subdescriptor_addr = addr;

	if (!pgd_present(*pgd) || pgd_bad(*pgd) || pgd_none(*pgd) ||
	    !pgd_val(*pgd))
		return 0;

	/* Round the starts and ends of each region to their boundary limits */
	// module_alloc_start -= (module_alloc_start % PGDIR_SIZE);
	// module_alloc_end += PGDIR_SIZE - (module_alloc_end % PGDIR_SIZE) - 1;

	if (!pgd_leaf(*pgd)) {
		start_range = __virt_to_phys(pgaddr);
		end_range = __virt_to_phys(pgaddr) +
			    sizeof(p4d_t) * PTRS_PER_PGD - 1;

		/* If the PGD contains addresses between stext_vaddr and etext_vaddr or
                 * module_alloc_base and module_alloc_base + SZ_2G, then do not mark it
                 * PXN */
		if ((addr <= etext_vaddr &&
		     (addr + PGDIR_SIZE - 1) >= stext_vaddr) ||
		    (addr <= module_alloc_end &&
		     (addr + PGDIR_SIZE - 1) >= module_alloc_start)) {
			/* Protect all second-level PMD entries */
			for (i = 0; i < PTRS_PER_PGD; i++) {
				subdescriptor =
					(pmd_t *)(pgaddr + i * sizeof(pmd_t));
				prot_pmd_entry(subdescriptor,
					       subdescriptor_addr);
				subdescriptor_addr += PMD_SIZE;
			}
			lock_last_pmd_range();

			split_block(start_range);
			mark_range_ro_smc(start_range, end_range,
					  KERN_PROT_PAGE_TABLE);
		} else {
			/* Further modifications protected by immutability from hyp_rodata_end to __inittext_begin in kickoff */
			set_pgd_ind(pgd, __pgd(pgd_val(*pgd) | 1UL << 59));
		}
	} else {
		/* TODO: Handle block case at this level? */
		pr_info("MotoRKP: pgd block at %llx\n", start_range);
	}
	return 0;
}

/*
 * Locks down the ranges of memory pointed to by all PGDs as read-only.
 * Current kernel configurations do not bother with p4ds or puds, and
 * thus we do not need protections for these layers (pgd points directly
 * to pmd).
 */
static const struct mm_walk_ops protect_pgds = {
	.pgd_entry = prot_pgd_entry,
};

#endif /* _PTPROTECT_H */


* Re: [RFC] Proposal: Static SECCOMP Policies
  2024-09-17 16:54   ` Maxwell Bland
@ 2024-09-17 17:01     ` Maxwell Bland
  0 siblings, 0 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-17 17:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-arm-msm@vger.kernel.org, Andrew Wheeler,
	Sammy BS2 Que | 阙斌生, Neill Kapron, Todd Kjos,
	Viktor Martensson, Andy Lutomirski, Will Drewry, Andy Gross,
	Bjorn Andersson, Konrad Dybcio, kernel-team, Mimi Zohar,
	Dmitry Kasatkin, linux-integrity

+ Mimi, Dmitry, Integrity, FYI

On Tue, Sep 17, 2024 at 11:54:17AM GMT, Maxwell Bland wrote:
> On Tue, Sep 17, 2024 at 12:34:28AM GMT, Kees Cook wrote:
> > On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > > operated on around 0.1188 MB). But most importantly, third, without some degree
> > > of provenance, I have no way of telling if someone has injected malicious code
> > > into the kernel, and unfortunately even knowing the correct bytes is still
> > > "iffy", as in order to prevent JIT spray attacks, each of these filters is
> > > offset by some random number of uint32_t's, making every 4-byte shift of the
> > > filter a "valid" codepage to be loaded at runtime.
> > 
> > So, let's start here. What I've seen from the thread is that there isn't
> > a way to verify that a given JIT matches the cBPF. Is validating the
> > cBPF itself also needed?
> 
> Yes(ish) but mostly no. Current kernel exploits, from what I have seen
> and what is readily available consist of three stages:
> 
> - Find a UAF
> - Bootstrap this UAF into an unconstrained read/write
> - Modify some core kernel resource to get arbitrary execution.
> 
> Example dating back to 2019:
> https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html
> 
> An adversary could modify the loaded cBPF program prior to loading in
> order to, say, change the range of syscall _NR_'s accepted by the
> seccomp switch statement in order to stage their escape from Chrome's
> sandbox.
> 
> However, JIT presents a more general issue, hence the mostly no, since
> and exploited native system service could target the JITed code page
> in order to exploit the kernel, rather than requiring something to
> be staged within the modified seccomp sandbox in the "cBPF itself"
> example.
> 
> For example, Motorola has a few system services for hardware and other
> things (as well as QCOM), written in C, for example, our native dropbox
> agent. Supposing there were an exploit for this agent allowing execution
> within that service's context, an adversary could find a UAF, and target
> the page of Chrome's JITed seccomp filter in order to exploit the full
> kernel. That is, they are not worried about escaping the sandbox so much
> as finding a writable resource from which they can gain privileges in
> the rest of the kernel.
> 
> Admitted, there are ~29,000 other writable data structures (in
> msm-kernel) they could also target, but the JIT'ed seccomp filter is the
> only code page they could modify (since it is not possible to get
> compile-time provenance/signatures). The dilemma is that opposed to
> modifying, say, the system_unbound_wq and adding an entry to it that
> holds a pointer to call_usermodehelper_exec_work, you could add some
> code to this page instead, making the kernel the same level of
> exploitable.
> 
> The goal at the end of the day is to fix this and then try to build a
> system to lock down the rest of the data in a sensible way. Likely an
> ARM-MTE like, EL2-maintained tag system conditioned on the kernel's
> scheduler and memory allocation infrastructure. At least, that is what I
> want to be working on, after I figure out this seccomp stuff.
> 
> > - The IMA subsystem has wanted a way to measure (and validate) seccomp
> >   filters. We could get more details from them for defining this need
> >   more clearly.
> 
> You are right. I have added Mimi, Dmitry, and the integrity list. Their
> work with linked lists and other data structures is right in line with
> these concerns. I do not know if they have looked at building verifiers
> for JIT'ed cBPF pages already.
> 
> > - The JIT needs to be verified against the cBPF that it was generated
> >   from. We currently do only a single pass and don't validate it once
> >   the region has been set read-only. We have a standing feature request
> >   for improving this: https://github.com/KSPP/linux/issues/154
> >
> Kees, this is exactly what I'm talking about, you are awesome!
> 
> I'll share the (pretty straightforward) EL2 logic for this, though not the
> code, since licensing and all that, but this public mailing list should
> hopefully serve as prior art for any questionable chipset vendor attempting to
> patent public domain security for the everyday person:
> 
> - Marking PTEs null is fine
> - If a new PTE is allocated, mark it PXN atomically using the EL2
>   permission fault failure triggered from the page table lockdown (see
>   GPL-2.0 kernel module below).
> - If a PTE is updated and the PXN bit is switched from 1 to 0, SHA256
>   the page, mark it immutable, and let it through if it is OK.
> 
> This lets the page be mucked with during the whole JIT process, but ensures
> that the second the page wants to be priv-executable, no further modifications
> happen. To "unlock" the page for free-ing, one just needs to set the PXN bit
> back. Then if we ever want to execute from it again, the process repeats, so
> on. This relies on my prior main.c vmalloc maintenance and the below ptprotect
> logic (note, WIP, no warranty on this code).
> 
> > For solutions, I didn't see much discussion around the "orig_prog"
> > copy of the cBPF. Under CHECKPOINT_RESTORE, the original cBPF remains
> > associated with the JIT. struct seccomp_filter's struct bpf_prog prog's
> > orig_prog member. If it has value outside of CHECKPOINT_RESTORE, then
> > we could do it for those conditions too.
> 
> Unfortunately the Android GKI does not support checkpoint restore and makes the
> orig_prog reference fail (at least in the case I'm trying to work towards for
> cell phones).
> 
> I could lock the orig_prog as immutable during the JIT, and given the resulting
> code page, and then attempt to reproduce the code page in EL2 from the original
> cBPF, but that seems dangerous and potentially buggy as opposed to checking the
> reference addresses in the final machine code against knowledge of struct
> seccomp_data (what I am working on right now).
> 
> Maxwell
> 
> // SPDX-License-Identifier: GPL-2.0
> /*
>  * Copyright (C) 2023 Motorola Mobility, Inc.
>  *
>  * Authors: Maxwell Bland
>  * Binsheng "Sammy" Que
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * Initializes hypervisor-level protections for the kernel pagetables.  In
>  * coordination with the moto_org_mem driver, which restricts executable code
>  * pages to a well defined region in-between
>  *
>  * stext <-> module_alloc_base + SZ_2G
>  *
>  * It is able to mark all page tables not corresponding to this virtual address
>  * range PXNTable. Mark the table these descriptors exist within as immutable.
>  * For all tables/descriptors which are marked privileged executable, these are
>  * marked permanently immutable, and their modifications are tracked directly.
>  */
> #ifndef _PTPROTECT_H
> #define _PTPROTECT_H
> 
> #include <linux/delay.h>
> #include <linux/highmem.h>
> #include <linux/kprobes.h>
> #include <linux/list.h>
> #include <linux/mm_types.h>
> #include <linux/module.h>
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/pagewalk.h>
> #include <linux/types.h>
> #include <asm/pgalloc.h>
> #include <asm/pgtable-hwdef.h>
> #include <asm/pgtable.h>
> #include <mm/pgalloc-track.h>
> #include <trace/hooks/fault.h>
> #include <trace/hooks/vendor_hooks.h>
> #include <fs/erofs/compress.h>
> 
> uint64_t stext_vaddr = 0;
> uint64_t etext_vaddr = 0;
> uint64_t module_alloc_base_vaddr = 0;
> 
> uint64_t last_pmd_range[2] = { 0, 0 };
> uint64_t pmd_range_list[1024][2] = { 0 };
> int pmd_range_list_index = 0;
> 
> /**
>  * add_to_pmd_range_list - adds a range to the pmd range list
>  * @start: Start of the range
>  * @end: End of the range
>  *
>  * Used to implement a naive set of adjacent pmd segments to speed up
>  * the protection code, as otherwise we would treat each pmd (there are
>  * a lot of them) as a separate region to protect.
>  */
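> static void add_to_pmd_range_list(uint64_t start, uint64_t end)
> {
> 	/* Drop the range rather than write past the end of the list */
> 	if (pmd_range_list_index >= ARRAY_SIZE(pmd_range_list))
> 		return;
> 	pmd_range_list[pmd_range_list_index][0] = start;
> 	pmd_range_list[pmd_range_list_index][1] = end;
> 	pmd_range_list_index++;
> }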
> 
> void lock_last_pmd_range(void)
> {
> 	if (last_pmd_range[0] == 0 || last_pmd_range[1] == 0)
> 		return;
> 	split_block(last_pmd_range[0]);
> 	mark_range_ro_smc(last_pmd_range[0], last_pmd_range[1],
> 			  KERN_PROT_PAGE_TABLE);
> 	msleep(10);
> }
> 
> /**
>  * prot_pmd_entry - protects a range pointed to by a pmd entry
>  *
>  * @pmd: Pointer to the pmd entry
>  * @addr: Virtual address of the pmd entry
>  */
> static void prot_pmd_entry(pmd_t *pmd, unsigned long addr)
> {
> 	uint64_t pgaddr = pmd_page_vaddr(*pmd);
> 	uint64_t start_range = 0;
> 	uint64_t end_range = 0;
> 
> 	/*
> 	 * QCOM's gic_intr_routing.c kernel module was found allocated at
> 	 * vaddr ffffffdb87f67000, but the modules code region should only
> 	 * be allocated from ffffffdb8fc00000 to ffffffdc0fdfffff...
> 	 *
> 	 * This seems to be because arm64's module.h defines
> 	 * module_alloc_base as ((u64)_etext - MODULES_VSIZE). This
> 	 * preprocessor define should be redefined/randomized by
> 	 * kernel/kaslr.c; however, it appears that early init modules get
> 	 * allocated before module_alloc_base is relocated, so c'est la vie,
> 	 * and the efforts of kaslr.c are for naught (_etext's vaddr is
> 	 * randomized though, so it does not matter, I guess).
> 	 */
> 	uint64_t module_alloc_start = module_alloc_base_vaddr;
> 	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
> 
> 	if (!pmd_present(*pmd) || pmd_bad(*pmd) || pmd_none(*pmd) ||
> 	    !pmd_val(*pmd))
> 		return;
> 
> 	/* Round the starts and ends of each region to their boundary limits */
> 	// module_alloc_start -= (module_alloc_start % PMD_SIZE);
> 	// module_alloc_end += PMD_SIZE - (module_alloc_end % PMD_SIZE) - 1;
> 
> 	start_range = __virt_to_phys(pgaddr);
> 	end_range = __virt_to_phys(pgaddr) + sizeof(pte_t) * PTRS_PER_PMD - 1;
> 
> 	/* If the PMD potentially points to code, check it in the hypervisor */
> 	if (!pmd_leaf(*pmd) &&
> 	    ((addr <= etext_vaddr && (addr + PMD_SIZE - 1) >= stext_vaddr) ||
> 	     (addr <= module_alloc_end &&
> 	      (addr + PMD_SIZE - 1) >= module_alloc_start))) {
> 		if (start_range == last_pmd_range[1] + 1) {
> 			last_pmd_range[1] = end_range;
> 		} else if (end_range + 1 == last_pmd_range[0]) {
> 			last_pmd_range[0] = start_range;
> 		} else if (last_pmd_range[0] == 0 && last_pmd_range[1] == 0) {
> 			last_pmd_range[0] = start_range;
> 			last_pmd_range[1] = end_range;
> 		} else {
> 			add_to_pmd_range_list(last_pmd_range[0],
> 					      last_pmd_range[1]);
> 			lock_last_pmd_range();
> 			last_pmd_range[0] = start_range;
> 			last_pmd_range[1] = end_range;
> 		}
> 	} else {
> 		/*
> 		 * If the PMD points to data only, mark it PXN, as the caller
> 		 * will mark the PMD immutable after this function returns.
> 		 */
> 		if (!pmd_leaf(*pmd)) {
> 			set_pmd(pmd, __pmd(pmd_val(*pmd) | PMD_TABLE_PXN));
> 		} else {
> 			/* TODO: if block, ensure range is marked immutable */
> 			pr_info("MotoRKP: pmd block at %llx\n", start_range);
> 		}
> 	}
> }
> 
> pgd_t *swapper_pg_dir_ind;
> void (*set_swapper_pgd_ind)(pgd_t *pgdp, pgd_t pgd);
> 
> static inline bool in_swapper_pgdir_ind(void *addr)
> {
> 	return ((unsigned long)addr & PAGE_MASK) ==
> 	       ((unsigned long)swapper_pg_dir_ind & PAGE_MASK);
> }
> 
> static inline void set_pgd_ind(pgd_t *pgdp, pgd_t pgd)
> {
> 	if (in_swapper_pgdir_ind(pgdp)) {
> 		set_swapper_pgd_ind(pgdp, __pgd(pgd_val(pgd)));
> 		return;
> 	}
> 
> 	WRITE_ONCE(*pgdp, pgd);
> 	dsb(ishst);
> 	isb();
> }
> 
> /**
>  * prot_pgd_entry - protects a range pointed to by a pgd entry
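>  * @pgd: pgd struct with descriptor values
>  * @addr: vaddr of the start of the memory range referenced by the pgd
>  * @next: vaddr of the end of that range (unused)
>  * @walk: page walk state (unused)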
>  */
> static int prot_pgd_entry(pgd_t *pgd, unsigned long addr, unsigned long next,
> 			  struct mm_walk *walk)
> {
> 	uint64_t pgaddr = pgd_page_vaddr(*pgd);
> 	uint64_t start_range = 0;
> 	uint64_t end_range = 0;
> 	uint64_t module_alloc_start = module_alloc_base_vaddr;
> 	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
> 	uint64_t i = 0;
> 	pmd_t *subdescriptor = 0;
> 	unsigned long subdescriptor_addr = addr;
> 
> 	if (!pgd_present(*pgd) || pgd_bad(*pgd) || pgd_none(*pgd) ||
> 	    !pgd_val(*pgd))
> 		return 0;
> 
> 	/* Round the starts and ends of each region to their boundary limits */
> 	// module_alloc_start -= (module_alloc_start % PGDIR_SIZE);
> 	// module_alloc_end += PGDIR_SIZE - (module_alloc_end % PGDIR_SIZE) - 1;
> 
> 	if (!pgd_leaf(*pgd)) {
> 		start_range = __virt_to_phys(pgaddr);
> 		end_range = __virt_to_phys(pgaddr) +
> 			    sizeof(p4d_t) * PTRS_PER_PGD - 1;
> 
> 		/*
> 		 * If the PGD contains addresses between stext_vaddr and
> 		 * etext_vaddr or module_alloc_base and module_alloc_base +
> 		 * SZ_2G, then do not mark it PXN.
> 		 */
> 		if ((addr <= etext_vaddr &&
> 		     (addr + PGDIR_SIZE - 1) >= stext_vaddr) ||
> 		    (addr <= module_alloc_end &&
> 		     (addr + PGDIR_SIZE - 1) >= module_alloc_start)) {
> 			/* Protect all second-level PMD entries */
> 			for (i = 0; i < PTRS_PER_PGD; i++) {
> 				subdescriptor =
> 					(pmd_t *)(pgaddr + i * sizeof(pmd_t));
> 				prot_pmd_entry(subdescriptor,
> 					       subdescriptor_addr);
> 				subdescriptor_addr += PMD_SIZE;
> 			}
> 			lock_last_pmd_range();
> 
> 			split_block(start_range);
> 			mark_range_ro_smc(start_range, end_range,
> 					  KERN_PROT_PAGE_TABLE);
> 		} else {
> 			/*
> 			 * Further modifications protected by immutability
> 			 * from hyp_rodata_end to __inittext_begin in kickoff.
> 			 */
> 			set_pgd_ind(pgd, __pgd(pgd_val(*pgd) | 1UL << 59));
> 		}
> 	} else {
> 		/* TODO: Handle block case at this level? */
> 		pr_info("MotoRKP: pgd block at %llx\n", start_range);
> 	}
> 	return 0;
> }
> 
> /*
>  * Locks down the ranges of memory pointed to by all PGDs as read-only.
>  * Current kernel configurations do not bother with p4ds or puds, and
>  * thus we do not need protections for these layers (pgd points directly
>  * to pmd).
>  */
> static const struct mm_walk_ops protect_pgds = {
> 	.pgd_entry = prot_pgd_entry,
> };
> 
> #endif /* _PTPROTECT_H */
> 
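
The code-range checks in prot_pmd_entry() and prot_pgd_entry() above are
both closed-interval overlap tests: does [addr, addr + SIZE - 1] touch
[stext, etext] or the module region? A minimal standalone sketch of that
test follows. The helper name ranges_overlap() is hypothetical and does not
appear in the quoted header.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the quoted header): returns true when the
 * closed intervals [a_start, a_end] and [b_start, b_end] share at least
 * one address.
 */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
			   uint64_t b_start, uint64_t b_end)
{
	return a_start <= b_end && a_end >= b_start;
}
```

With such a helper, the PMD check would read
ranges_overlap(addr, addr + PMD_SIZE - 1, stext_vaddr, etext_vaddr).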

* [RFC] Proposal: Static SECCOMP Policies
@ 2024-09-12 16:18 Maxwell Bland
  0 siblings, 0 replies; 26+ messages in thread
From: Maxwell Bland @ 2024-09-12 16:18 UTC (permalink / raw)
  To: linux-arm-msm

(
Resending via neomutt as plaintext since I think mlmmj may be filtering,
please keep in mind this was originally intended for android kernel team
and the set of CC's on replies should be:

"Andrew Wheeler" <awheeler@motorola.com>;
"Sammy Que" <quebs2@motorola.com>;
Neill Kapron <nkapron@google.com>;
Todd Kjos <tkjos@google.com>;
Viktor Martensson <vmartensson@google.com>;
Andy Lutomirski <luto@amacapital.net>;
keescook@chromium.org <keescook@chromium.org>;
Will Drewry <wad@chromium.org>;
Andy Gross <agross@kernel.org>;
Bjorn Andersson <andersson@kernel.org>;
Konrad Dybcio <konrad.dybcio@somainline.org>;
kernel-team <kernel-team@android.com>
)

Apologies if this is a "duplicate".

I am sending this to msm-kernel since this list should also be somewhat aware
and may have engineers with knowledge of generated seccomp sandbox code.
Thanks! (-:

Hi Kernel Team,

+ Kees, Andy, and Will since their input may be valuable.

It has been a while! (~9 months to be exact). This January, I sent out a small
message on BPF code loading ("unprivileged BPF considered harmful" or something
like that). In it, I noted new BPF programs are compiled all the time and
thrown into the kernel. At the time, I did not know these programs were just
compiled seccomp filter policies, loaded in as new BPF programs continuously
through the libminijail interface as well as direct syscall. As of two days
ago, I now know this (and now you do too, if not already).

OK, yes, syscall filtering is very important, but this is creating a catch-22
issue. For one, see step (4) under "Exploitation overview" for
https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt.
Second, this minor lack of caching is adding load time to more than 90
binaries/services on the standard QCOM baseline—I'll admit, it is probably
negligible in the grand scheme of things (a quick approximation puts the data
operated on around 0.1188 MB). But most importantly, third, without some degree
of provenance, I have no way of telling if someone has injected malicious code
into the kernel, and unfortunately even knowing the correct bytes is still
"iffy", as in order to prevent JIT spray attacks, each of these filters is
offset by some random number of uint32_t's, making every 4-byte shift of the
filter a "valid" codepage to be loaded at runtime.
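
To make the "every 4-byte shift is valid" problem concrete, here is a
minimal sketch of what a checker would have to do, assuming it holds a
byte-exact reference image of the filter. That assumption is optimistic,
since real JIT output may also embed address-dependent values. The function
name find_aligned_image() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical check: search a code page for a known-good filter image at
 * every 4-byte-aligned offset. Returns the matching byte offset, or -1 when
 * the reference bytes do not appear at any aligned offset.
 */
static long find_aligned_image(const uint8_t *page, size_t page_len,
			       const uint8_t *ref, size_t ref_len)
{
	size_t off;

	if (ref_len == 0 || ref_len > page_len)
		return -1;

	for (off = 0; off + ref_len <= page_len; off += sizeof(uint32_t)) {
		if (memcmp(page + off, ref, ref_len) == 0)
			return (long)off;
	}
	return -1;
}
```

Even in this simplified form, the checker has to accept many placements of
the same image rather than one fixed page layout.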

You might be thinking, "but wait, bionic's libc only defines a couple of
restricted policies, primary and secondary for system and user apps
respectively." I know! For the most part, apps fall under what I presume are
the default app/system policies, but there are lots of QCOM binaries and
other magic programs (dolby dax) that are sending up these programs as well.
I'm seeing more than 20 different programs for around a minute's worth of
runtime. One example is attached at the end.

So, the proposal: a "CONFIG_SECCOMP_STATIC_POLICY" for seccomp. This
would change the Android kernel's generic SYS_seccomp call, which takes in a
filter with an array of BPF instructions, to instead reference an ID which
corresponds to a fixed file on /sys/bpf/seccomp or something like that. The
sandboxing behavior of these apps should be known at compile-time, even if
there are multiple "permission set types" that may need to be dispatched. User
apps should always have a single, fixed policy. This way it is possible to say
for every code page loaded into the kernel where it came from and what it
should look like.
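
As a rough illustration of the ID-based idea, here is a userspace sketch of
a fixed policy table with lookup by ID. Everything in it is hypothetical:
the struct, the IDs, and the function name are made up for illustration, and
the two placeholder images are just "return allow" and "return kill", not
real policies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record describing one build-time seccomp policy. */
struct static_policy {
	uint32_t id;          /* ID userspace would pass instead of a filter */
	const uint8_t *insns; /* fixed filter image, known at build time */
	size_t len;           /* image length in bytes */
};

/* Placeholder images (assumed for illustration), two instructions each. */
static const uint8_t policy_allow[] = {
	0xb4, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x7f, /* w0 = 0x7fff0000 */
	0x95, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* exit */
};
static const uint8_t policy_kill[] = {
	0xb4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* w0 = 0 */
	0x95, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* exit */
};

static const struct static_policy policy_table[] = {
	{ 1, policy_allow, sizeof(policy_allow) },
	{ 2, policy_kill, sizeof(policy_kill) },
};

/* Return the policy registered under id, or NULL if there is none. */
static const struct static_policy *lookup_static_policy(uint32_t id)
{
	size_t i;

	for (i = 0; i < sizeof(policy_table) / sizeof(policy_table[0]); i++) {
		if (policy_table[i].id == id)
			return &policy_table[i];
	}
	return NULL;
}
```

The point of the sketch is only that an unknown ID fails the lookup, so
every filter that can reach the kernel is one whose bytes are known ahead of
time.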

Unfortunately, I do not know if Motorola has enough "weight" to convince QCOM to
do the right foundational thing here, or to "define" the seccomp APIs for
Android, so it would be good to have Google's buy-in, to know if there are plans
to fix this issue, or to have some discussion of how best to fix the problem. If
nothing else, is there a contact at QCOM who might be able to actually hunt down
and document valid bytes for these policies?

The end goal is simple: when we see a code page is allocated in the kernel, we
can be sure that (1) it isn't malicious and (2) has not been modified in
transit. I'm fine putting code where my mouth is, but right now that code
would involve having to fingerprint the signatures loaded by Qualcomm
components every time a new one is released, or pinging Google with a huge
patch changing how seccomp works with no idea of what requirements QCOM may
have on seccomp policy generation.

Thoughts? Is this doable, and if not, why? I'd also love help with the code and
adapting existing minijail code to use a new, more integrity-preserving
interface. If I am mistaken and it is possible to grab out valid BPF policy
code at compile time, please let me know how!

Regards,
Maxwell Bland

Standard filter (from, for example, com.google.android.gms):
"ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007b000000000055000200d7000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f950000000000000055000100e200000005008f000000000055000200a7000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f9500000000000000550002003e000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f9500000000000000550002005c000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f95000000000000005500020060000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f950000000000000055000200cb00000
0b40000000000ff7f950000000000000055000100c600000005004c0000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f9500000000000000550002003d000000b40000000000ff7f95000000000000005500020017000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
Unknown filter (from QCOM's /vendor/bin/qesdk-secmanager)
 "ac00000000000000ac77000000000000bf160000000000006160040000000000b4020000b70000c01d20020000000000b4000000000000009500000000000000616000000000000055000200cb000000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f950000000000000055000200ce000000b40000000000ff7f950000000000000055000200c6000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000100de00000005007e000000000055000100e2000000050098000000000055000200d7000000b40000000000ff7f950000000000000055000200a7000000b40000000000ff7f95000000000000005500020062000000b40000000000ff7f9500000000000000550002001d000000b40000000000ff7f95000000000000005500020038000000b40000000000ff7f9500000000000000550002003f000000b40000000000ff7f95000000000000005500020039000000b40000000000ff7f95000000000000005500020050000000b40000000000ff7f9500000000000000550002004e000000b40000000000ff7f9500000000000000550002004f000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f95000000000000005500020043000000b40000000000ff7f9500000000000000550002002c000000b40000000000ff7f95000000000000005500020087000000b40000000000ff7f95000000000000005500020086000000b40000000000ff7f95000000000000005500020030000000b40000000000ff7f950000000000000055000200ae000000b40000000000ff7f95000000000000005500020016010000b40000000000ff7f95000000000000005500020019000000b40000000000ff7f95000000000000005500020042000000b40000000000ff7f950000000000000055000200dc000000b40000000000ff7f9500000000000000550002005e000000b40000000000ff7f9500000000000000550002007b000000b40000000000ff7f9500000000000000550002005d000000b40000000000ff7f950000000000000055000200ac000000b40000000000ff7f95000000000000005500020084000000b40000000000ff7f950000000000000055000200a3000000b40000000000ff7f95000000000000005500020080000000b40000000000ff7f95000000000000005500020078000000b40000000000ff7f950000000000000055000200dd000000b40000000000ff7f950000000000000055000100c600000005005800000000005500020060000000b40000000000ff7f9500000000000000550002008b000000b40000000000ff
7f950000000000000055000200cb000000b40000000000ff7f95000000000000005500020071000000b40000000000ff7f95000000000000005500020040000000b40000000000ff7f9500000000000000550002003b000000b40000000000ff7f950000000000000055000200e9000000b40000000000ff7f950000000000000055000200b2000000b40000000000ff7f9500000000000000550002008c000000b40000000000ff7f950000000000000055000200d8000000b40000000000ff7f9500000000000000b400000000000300950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100040000000500010000000000050001000000000005000e000000000005000000000000006160200000000000630afcff000000006160240000000000630af8ff00000000450003000000000061a0fcff0000000045000100020000000500010000000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f950000000000000005000000000000006160100000000000630afcff000000006160140000000000630af8ff00000000550002000000000061a0fcff000000001500010001000000050001000000000005000300000000000500000000000000b40000000000030095000000000000000500000000000000b40000000000ff7f9500000000000000",
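
For anyone wanting to read the dumps above, here is a minimal decoding
sketch. It assumes the hex strings are raw little-endian eBPF instructions
of 8 bytes each (opcode, a register byte with dst in the low nibble and src
in the high nibble, a 16-bit offset, a 32-bit immediate). The struct and
function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded 8-byte instruction (fields mirror the eBPF encoding). */
struct insn {
	uint8_t code;
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
	int32_t imm;
};

/* Convert one hex character to its value, or -1 if it is not hex. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode instruction number idx from a hex string of hex_len characters.
 * Returns 0 on success, -1 if the string is too short or not valid hex.
 */
static int decode_insn(const char *hex, size_t hex_len, size_t idx,
		       struct insn *out)
{
	uint8_t b[8];
	size_t base = idx * 16;
	size_t i;

	if (hex_len < 16 || base > hex_len - 16)
		return -1;
	for (i = 0; i < 8; i++) {
		int hi = hex_nibble(hex[base + 2 * i]);
		int lo = hex_nibble(hex[base + 2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		b[i] = (uint8_t)(hi << 4 | lo);
	}
	out->code = b[0];
	out->dst_reg = b[1] & 0x0f;
	out->src_reg = b[1] >> 4;
	out->off = (int16_t)(uint16_t)(b[2] | b[3] << 8);
	out->imm = (int32_t)((uint32_t)b[4] | (uint32_t)b[5] << 8 |
			     (uint32_t)b[6] << 16 | (uint32_t)b[7] << 24);
	return 0;
}
```

Under that assumption, the recurring triple "5500020019000000",
"b40000000000ff7f", "9500000000000000" decodes as: skip two instructions if
r0 != 0x19, set w0 = 0x7fff0000 (the value of SECCOMP_RET_ALLOW), exit.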

List of services loading seccomp filters pulled from one run of the phone:
com.google.android.deskclock
/vendor/bin/qesdk-secmanager
media.hwcodec/vendor.qti.media.c2@1.0-service
media.audio.qc.codec.qti.media.c2audio@1.0-service
/vendor/bin/vendor.qti.qspmhal-service
/vendor/bin/qsap_sensors
media.extractoraextractor
/system_ext/bin/perfservice
/vendor/bin/wfdhdcphalservice
/vendor/bin/wifidisplayhalservice
/vendor/bin/qsap_dcfd
/vendor/bin/qms
/vendor/bin/qsap_location
/vendor/bin/qsap_qapeservice
/vendor/bin/wfdvndservice
media.swcodecoid.media.swcodec/bin/mediaswcodec
/vendor/bin/hw/qcrilNrd
qsap_qms_13qms16
qsap_qms_24qms17
/vendor/bin/ATFWD-daemon
/vendor/bin/hw/sxrservice
/vendor/bin/hw/qcrilNrd-c2
system_server
/vendor/bin/qmi_motext_hook1013170
/vendor/bin/qmi_motext_hook1013171
/vendor/bin/ims_rtp_daemon
com.android.systemui
webview_zygote
com.dolby.daxservice
vendor.qti.qesdk.sysservice
org.codeaurora.ims
com.android.se
com.android.phone
com.qti.qcc
com.google.android.ext.services
com.google.android.gms
com.google.android.euicc
com.google.android.googlequicksearchbox:interactor
com.google.android.apps.messaging:rcs
com.android.nfc
com.qualcomm.qti.workloadclassifier
com.qualcomm.location
com.google.android.gms.unstable
com.thundercomm.ar.core
com.android.vending:background
com.android.vending:quick_launch
com.android.dynsystem
com.android.managedprovisioning
com.android.shell

end of thread, other threads:[~2024-10-01 16:34 UTC | newest]

Thread overview: 26+ messages
2024-09-12 16:02 [RFC] Proposal: Static SECCOMP Policies Maxwell Bland
2024-09-12 20:57 ` Neill Kapron
2024-09-12 21:39   ` Maciej Żenczykowski
2024-09-13 17:07     ` [External] " Maxwell Bland
2024-09-13 17:12       ` Maxwell Bland
2024-09-13 17:30       ` Maxwell Bland
2024-09-14  4:18         ` Andy Lutomirski
2024-09-17 15:08           ` Maxwell Bland
2024-09-25 18:16             ` Andy Lutomirski
2024-09-25 19:52               ` Maciej Żenczykowski
2024-09-25 19:53                 ` Maciej Żenczykowski
2024-09-30 11:22                   ` Sebastian Ene
2024-09-30 18:43                     ` Maxwell Bland
2024-09-30 23:35                     ` Maciej Żenczykowski
2024-09-30 23:41                       ` Maciej Żenczykowski
2024-10-01 16:34                         ` Maxwell Bland
2024-09-13 18:17       ` Maxwell Bland
2024-09-13 21:16       ` [External] " Maciej Żenczykowski
2024-09-16 22:17         ` Maxwell Bland
2024-09-16 22:50           ` Maciej Żenczykowski
2024-09-17 15:15             ` Maxwell Bland
2024-09-18 19:22               ` Maxwell Bland
2024-09-17  7:34 ` Kees Cook
2024-09-17 16:54   ` Maxwell Bland
2024-09-17 17:01     ` Maxwell Bland
  -- strict thread matches above, loose matches on Subject: below --
2024-09-12 16:18 Maxwell Bland
