* [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
@ 2023-05-11 7:05 Tianjia Zhang
2023-05-11 7:05 ` [PATCH 1/2] " Tianjia Zhang
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Tianjia Zhang @ 2023-05-11 7:05 UTC (permalink / raw)
To: Serge Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
Frederick Lawler, Jens Axboe, Joseph Qi, linux-security-module,
selinux, linux-block, linux-kernel
Cc: Tianjia Zhang
Split a fine-grained capability, CAP_BLOCK_ADMIN, out of CAP_SYS_ADMIN.
For backward compatibility, CAP_SYS_ADMIN continues to grant everything
that CAP_BLOCK_ADMIN grants.
Some database products rely on shared storage to implement their
write-once-read-many and write-many-read-many modes. When an HA
failover occurs, they rely on the Persistent Reservations (PR) protocol
provided by the storage layer to manage access to the block device and
ensure data correctness.
The kernel's PR implementation for block devices currently requires
CAP_SYS_ADMIN, which carries too many sensitive permissions and exposes
risks such as container escape. The kernel needs finer-grained
permission management, as CAP_NET_ADMIN provides for networking, so
that production services do not have to run as root.
In the future, CAP_BLOCK_ADMIN can also cover other block device
operations that currently require CAP_SYS_ADMIN, so that applications
can run with least privilege.
Tianjia Zhang (2):
capability: Introduce CAP_BLOCK_ADMIN
block: use block_admin_capable() for Persistent Reservations
block/ioctl.c | 10 +++++-----
include/linux/capability.h | 5 +++++
include/uapi/linux/capability.h | 7 ++++++-
security/selinux/include/classmap.h | 4 ++--
4 files changed, 18 insertions(+), 8 deletions(-)
--
2.24.3 (Apple Git-128)
* [PATCH 1/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-11 7:05 [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Tianjia Zhang
@ 2023-05-11 7:05 ` Tianjia Zhang
2023-05-11 7:05 ` [PATCH 2/2] block: use block_admin_capable() for Persistent Reservations Tianjia Zhang
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Tianjia Zhang @ 2023-05-11 7:05 UTC (permalink / raw)
To: Serge Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
Frederick Lawler, Jens Axboe, Joseph Qi, linux-security-module,
selinux, linux-block, linux-kernel
Cc: Tianjia Zhang
Split a fine-grained capability, CAP_BLOCK_ADMIN, out of CAP_SYS_ADMIN.
For backward compatibility, CAP_SYS_ADMIN continues to grant everything
that CAP_BLOCK_ADMIN grants.
Some database products rely on shared storage to implement their
write-once-read-many and write-many-read-many modes. When an HA
failover occurs, they rely on the Persistent Reservations (PR) protocol
provided by the storage layer to manage access to the block device and
ensure data correctness.
The kernel's PR implementation for block devices currently requires
CAP_SYS_ADMIN, which carries too many sensitive permissions and exposes
risks such as container escape. The kernel needs finer-grained
permission management, as CAP_NET_ADMIN provides for networking, so
that production services do not have to run as root.
In the future, CAP_BLOCK_ADMIN can also cover other block device
operations that currently require CAP_SYS_ADMIN, so that applications
can run with least privilege.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
include/linux/capability.h | 5 +++++
include/uapi/linux/capability.h | 7 ++++++-
security/selinux/include/classmap.h | 4 ++--
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 0c356a517991..95b81a75806f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -208,6 +208,11 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
ns_capable(ns, CAP_SYS_ADMIN);
}
+static inline bool block_admin_capable(void)
+{
+ return capable(CAP_BLOCK_ADMIN) || capable(CAP_SYS_ADMIN);
+}
+
/* audit system wants to get cap info from files as well */
int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
const struct dentry *dentry,
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 3d61a0ae055d..7c07f5916289 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -417,7 +417,12 @@ struct vfs_ns_cap_data {
#define CAP_CHECKPOINT_RESTORE 40
-#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
+/*
+ * Allow Persistent Reservation operations on block devices
+ */
+#define CAP_BLOCK_ADMIN 41
+
+#define CAP_LAST_CAP CAP_BLOCK_ADMIN
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index a3c380775d41..83eb32e3a5cd 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -28,9 +28,9 @@
#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
- "checkpoint_restore"
+ "checkpoint_restore", "block_admin"
-#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
+#if CAP_LAST_CAP > CAP_BLOCK_ADMIN
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif
--
2.24.3 (Apple Git-128)
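For readers following the thread, the OR semantics of block_admin_capable() in this patch can be modelled in userspace on a plain 64-bit effective-capability mask. This is only a sketch under stated assumptions: the `sketch_*` names are hypothetical, and the value 41 for CAP_BLOCK_ADMIN is the one proposed here, not one defined by any released kernel header.

```c
#include <stdbool.h>
#include <stdint.h>

/* CAP_SYS_ADMIN is 21 in include/uapi/linux/capability.h. */
#define SKETCH_CAP_SYS_ADMIN   21
/* Value proposed by this patch; an assumption, not a released constant. */
#define SKETCH_CAP_BLOCK_ADMIN 41

/* Hypothetical helper: test one capability bit in an effective-set mask. */
static bool sketch_has_cap(uint64_t effective, unsigned int cap)
{
	return (effective >> cap) & 1u;
}

/*
 * Userspace model of the proposed block_admin_capable(): either the new
 * capability or the legacy CAP_SYS_ADMIN is sufficient.
 */
static bool sketch_block_admin_capable(uint64_t effective)
{
	return sketch_has_cap(effective, SKETCH_CAP_BLOCK_ADMIN) ||
	       sketch_has_cap(effective, SKETCH_CAP_SYS_ADMIN);
}
```

Because either bit satisfies the check, existing callers that hold CAP_SYS_ADMIN keep working, while a task holding only the new bit passes this one check and no other CAP_SYS_ADMIN check.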
* [PATCH 2/2] block: use block_admin_capable() for Persistent Reservations
2023-05-11 7:05 [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Tianjia Zhang
2023-05-11 7:05 ` [PATCH 1/2] " Tianjia Zhang
@ 2023-05-11 7:05 ` Tianjia Zhang
2023-05-11 16:17 ` [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Casey Schaufler
2023-05-23 6:18 ` Christoph Hellwig
3 siblings, 0 replies; 10+ messages in thread
From: Tianjia Zhang @ 2023-05-11 7:05 UTC (permalink / raw)
To: Serge Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
Frederick Lawler, Jens Axboe, Joseph Qi, linux-security-module,
selinux, linux-block, linux-kernel
Cc: Tianjia Zhang
Use the newly introduced capability CAP_BLOCK_ADMIN for Persistent
Reservations.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
block/ioctl.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/block/ioctl.c b/block/ioctl.c
index 9c5f637ff153..83af050eaa42 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -260,7 +260,7 @@ static int blkdev_pr_register(struct block_device *bdev,
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
struct pr_registration reg;
- if (!capable(CAP_SYS_ADMIN))
+ if (!block_admin_capable())
return -EPERM;
if (!ops || !ops->pr_register)
return -EOPNOTSUPP;
@@ -278,7 +278,7 @@ static int blkdev_pr_reserve(struct block_device *bdev,
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
struct pr_reservation rsv;
- if (!capable(CAP_SYS_ADMIN))
+ if (!block_admin_capable())
return -EPERM;
if (!ops || !ops->pr_reserve)
return -EOPNOTSUPP;
@@ -296,7 +296,7 @@ static int blkdev_pr_release(struct block_device *bdev,
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
struct pr_reservation rsv;
- if (!capable(CAP_SYS_ADMIN))
+ if (!block_admin_capable())
return -EPERM;
if (!ops || !ops->pr_release)
return -EOPNOTSUPP;
@@ -314,7 +314,7 @@ static int blkdev_pr_preempt(struct block_device *bdev,
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
struct pr_preempt p;
- if (!capable(CAP_SYS_ADMIN))
+ if (!block_admin_capable())
return -EPERM;
if (!ops || !ops->pr_preempt)
return -EOPNOTSUPP;
@@ -332,7 +332,7 @@ static int blkdev_pr_clear(struct block_device *bdev,
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
struct pr_clear c;
- if (!capable(CAP_SYS_ADMIN))
+ if (!block_admin_capable())
return -EPERM;
if (!ops || !ops->pr_clear)
return -EOPNOTSUPP;
--
2.24.3 (Apple Git-128)
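As an illustration of the userspace side of the checks changed above, the following sketch issues a PR registration through the IOC_PR_REGISTER ioctl from <linux/pr.h>, which is the request that reaches blkdev_pr_register(). The helper name `sketch_pr_register` is hypothetical, and the key values are the caller's choice; nothing here is taken from the submitter's application.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/pr.h>

/*
 * Hypothetical helper: register new_key as a reservation key on the block
 * device open at fd. Returns 0 on success or a negative errno value.
 */
static int sketch_pr_register(int fd, uint64_t old_key, uint64_t new_key)
{
	struct pr_registration reg;

	memset(&reg, 0, sizeof(reg));
	reg.old_key = old_key;
	reg.new_key = new_key;

	/*
	 * On a block device this reaches blkdev_pr_register(), which
	 * returns -EPERM when the capability check fails.
	 */
	if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0)
		return -errno;
	return 0;
}
```

With this series applied, a caller lacking both CAP_BLOCK_ADMIN and CAP_SYS_ADMIN would see -EPERM from this helper; without the series, only CAP_SYS_ADMIN avoids that result.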
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-11 7:05 [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Tianjia Zhang
2023-05-11 7:05 ` [PATCH 1/2] " Tianjia Zhang
2023-05-11 7:05 ` [PATCH 2/2] block: use block_admin_capable() for Persistent Reservations Tianjia Zhang
@ 2023-05-11 16:17 ` Casey Schaufler
2023-05-16 12:05 ` Tianjia Zhang
2023-05-23 6:18 ` Christoph Hellwig
3 siblings, 1 reply; 10+ messages in thread
From: Casey Schaufler @ 2023-05-11 16:17 UTC (permalink / raw)
To: Tianjia Zhang, Serge Hallyn, Paul Moore, Stephen Smalley,
Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi,
linux-security-module, selinux, linux-block, linux-kernel,
Casey Schaufler
On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
> within CAP_SYS_ADMIN.
>
> Some database products rely on shared storage to complete the
> write-once-read-multiple and write-multiple-read-multiple functions.
> When HA occurs, they rely on the PR (Persistent Reservations) protocol
> provided by the storage layer to manage block device permissions to
> ensure data correctness.
>
> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
> block devices in the Linux kernel, which has too many sensitive
> permissions, which may lead to risks such as container escape. The
> kernel needs to provide more fine-grained permission management like
> CAP_NET_ADMIN to avoid online products directly relying on root to run.
>
> CAP_BLOCK_ADMIN can also provide support for other block device
> operations that require CAP_SYS_ADMIN capabilities in the future,
> ensuring that applications run with least privilege.
Can you demonstrate that there are cases where a program that needs
CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other operations?
How much of what's allowed by CAP_SYS_ADMIN would be allowed by
CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
justify.
>
> Tianjia Zhang (2):
> capability: Introduce CAP_BLOCK_ADMIN
> block: use block_admin_capable() for Persistent Reservations
>
> block/ioctl.c | 10 +++++-----
> include/linux/capability.h | 5 +++++
> include/uapi/linux/capability.h | 7 ++++++-
> security/selinux/include/classmap.h | 4 ++--
> 4 files changed, 18 insertions(+), 8 deletions(-)
>
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-11 16:17 ` [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Casey Schaufler
@ 2023-05-16 12:05 ` Tianjia Zhang
2023-05-18 0:01 ` Casey Schaufler
0 siblings, 1 reply; 10+ messages in thread
From: Tianjia Zhang @ 2023-05-16 12:05 UTC (permalink / raw)
To: Casey Schaufler, Serge Hallyn, Paul Moore, Stephen Smalley,
Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi,
linux-security-module, selinux, linux-block, linux-kernel,
louxiao.lx
Hi Casey,
On 5/12/23 12:17 AM, Casey Schaufler wrote:
> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
>> within CAP_SYS_ADMIN.
>>
>> Some database products rely on shared storage to complete the
>> write-once-read-multiple and write-multiple-read-multiple functions.
>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>> provided by the storage layer to manage block device permissions to
>> ensure data correctness.
>>
>> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
>> block devices in the Linux kernel, which has too many sensitive
>> permissions, which may lead to risks such as container escape. The
>> kernel needs to provide more fine-grained permission management like
>> CAP_NET_ADMIN to avoid online products directly relying on root to run.
>>
>> CAP_BLOCK_ADMIN can also provide support for other block device
>> operations that require CAP_SYS_ADMIN capabilities in the future,
>> ensuring that applications run with least privilege.
>
> Can you demonstrate that there are cases where a program that needs
> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other operations?
> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
> justify.
>
In traditional non-container deployments, the block device is shared,
because applications generally work through a file system on top of
it. Operating on the block device directly is therefore likely to
affect other processes on the same host, so requiring CAP_SYS_ADMIN is
reasonable.
For a database running in a container, however, especially in the
cloud, the container is likely to have exclusive use of a block device.
Its access to the device then does not affect other processes, so there
is no need for the broader CAP_SYS_ADMIN capability.
A distributed write-once-read-many file system must recover correctly:
once recovery occurs, it must ensure that no in-flight I/O completes
after recovery.
This can be guaranteed by issuing operations such as SCSI/NVMe
Persistent Reservations on the block devices underlying the distributed
file system.
At present, therefore, we only need permission support for the control
commands of such container-exclusive block devices.
Kind regards,
Tianjia
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-16 12:05 ` Tianjia Zhang
@ 2023-05-18 0:01 ` Casey Schaufler
2023-05-22 2:53 ` Tianjia Zhang
0 siblings, 1 reply; 10+ messages in thread
From: Casey Schaufler @ 2023-05-18 0:01 UTC (permalink / raw)
To: Tianjia Zhang, Serge Hallyn, Paul Moore, Stephen Smalley,
Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi,
linux-security-module, selinux, linux-block, linux-kernel,
louxiao.lx, Casey Schaufler
On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
> Hi Casey,
>
> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
>>> within CAP_SYS_ADMIN.
>>>
>>> Some database products rely on shared storage to complete the
>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>>> provided by the storage layer to manage block device permissions to
>>> ensure data correctness.
>>>
>>> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
>>> block devices in the Linux kernel, which has too many sensitive
>>> permissions, which may lead to risks such as container escape. The
>>> kernel needs to provide more fine-grained permission management like
>>> CAP_NET_ADMIN to avoid online products directly relying on root to run.
>>>
>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>> ensuring that applications run with least privilege.
>>
>> Can you demonstrate that there are cases where a program that needs
>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>> operations?
>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>> justify.
>>
>
> For the previous non-container scenarios, the block device is a shared
> device, because the business-system generally operates the file system
> on the block. Therefore, directly operating the block device has a high
> probability of affecting other processes on the same host, and it is a
> reasonable requirement to need the CAP_SYS_ADMIN capability.
>
> But for a database running in a container scenario, especially a
> container scenario on the cloud, it is likely that a container
> exclusively occupies a block device. That is to say, for a container,
> its access to the block device will not affect other process, there is
> no need to obtain a higher CAP_SYS_ADMIN capability.
If I understand correctly, you're saying that the process that requires
CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
other operations.
That's good, but it isn't clear how a process on bare metal would
require CAP_SYS_ADMIN while the same process in a container wouldn't.
>
> For a file system similar to distributed write-once-read-many, it is
> necessary to ensure the correctness of recovery, then when recovery
> occurs, it is necessary to ensure that no inflighting-io is completed
> after recovery.
>
> This can be guaranteed by performing operations such as SCSI/NVME
> Persistent Reservations on block devices on the distributed file system.
Does your cloud based system always run "real" devices? My
understanding is that cloud based deployment usually uses
virtual machines and virtio or other simulated devices.
A container deployment in the cloud seems unlikely to be able
to take advantage of block administration. But I can't say
I know the specifics of your environment.
> Therefore, at present, it is only necessary to have the relevant
> permission support of the control command of such container-exclusive
> block devices.
This looks like an extremely special case in which breaking out
block management would make sense.
>
> Kind regards,
> Tianjia
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-18 0:01 ` Casey Schaufler
@ 2023-05-22 2:53 ` Tianjia Zhang
2023-05-22 19:13 ` Casey Schaufler
0 siblings, 1 reply; 10+ messages in thread
From: Tianjia Zhang @ 2023-05-22 2:53 UTC (permalink / raw)
To: Casey Schaufler, Serge Hallyn, Paul Moore, Stephen Smalley,
Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi,
linux-security-module, selinux, linux-block, linux-kernel,
louxiao.lx
Hi Casey,
On 5/18/23 8:01 AM, Casey Schaufler wrote:
> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>> Hi Casey,
>>
>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
>>>> within CAP_SYS_ADMIN.
>>>>
>>>> Some database products rely on shared storage to complete the
>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>>>> provided by the storage layer to manage block device permissions to
>>>> ensure data correctness.
>>>>
>>>> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
>>>> block devices in the Linux kernel, which has too many sensitive
>>>> permissions, which may lead to risks such as container escape. The
>>>> kernel needs to provide more fine-grained permission management like
>>>> CAP_NET_ADMIN to avoid online products directly relying on root to run.
>>>>
>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>> ensuring that applications run with least privilege.
>>>
>>> Can you demonstrate that there are cases where a program that needs
>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>>> operations?
>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>> justify.
>>>
>>
>> For the previous non-container scenarios, the block device is a shared
>> device, because the business-system generally operates the file system
>> on the block. Therefore, directly operating the block device has a high
>> probability of affecting other processes on the same host, and it is a
>> reasonable requirement to need the CAP_SYS_ADMIN capability.
>>
>> But for a database running in a container scenario, especially a
>> container scenario on the cloud, it is likely that a container
>> exclusively occupies a block device. That is to say, for a container,
>> its access to the block device will not affect other process, there is
>> no need to obtain a higher CAP_SYS_ADMIN capability.
>
> If I understand correctly, you're saying that the process that requires
> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
> other operations.
>
> That's good, but it isn't clear how a process on bare metal would
> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>
>>
>> For a file system similar to distributed write-once-read-many, it is
>> necessary to ensure the correctness of recovery, then when recovery
>> occurs, it is necessary to ensure that no inflighting-io is completed
>> after recovery.
>>
>> This can be guaranteed by performing operations such as SCSI/NVME
>> Persistent Reservations on block devices on the distributed file system.
>
> Does your cloud based system always run "real" devices? My
> understanding is that cloud based deployment usually uses
> virtual machines and virtio or other simulated devices.
> A container deployment in the cloud seems unlikely to be able
> to take advantage of block administration. But I can't say
> I know the specifics of your environment.
>
>> Therefore, at present, it is only necessary to have the relevant
>> permission support of the control command of such container-exclusive
>> block devices.
>
> This looks like an extremely special case in which breaking out
> block management would make sense.
>
Our scenario is as follows. In simple terms, a distributed database has
one read-write instance and one or more read-only instances. Each
instance runs in an isolated container, and all containers share the
same block device.
Besides the database instance, the container also runs a control-plane
program. The database ensures data correctness through Persistent
Reservations (PR) on the block device, and this is the only operation
in the container that requires CAP_SYS_ADMIN.
For the system as a whole, it makes little difference whether it runs
on a VM or on bare metal.
To support PR on block devices, we have to grant CAP_SYS_ADMIN to the
container, which not only greatly increases the risk of container
escape but also forces us to configure the container's permissions very
carefully. Many container escapes that have occurred had this cause.
This is essentially a permission isolation problem. We hope to split
the smallest possible set of permissions out of CAP_SYS_ADMIN to
support the necessary operations, and to avoid granting CAP_SYS_ADMIN
to containers wherever possible.
Kind regards,
Tianjia
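One way to see whether a container was granted CAP_SYS_ADMIN, as described in the message above, is to inspect the "CapEff:" line of /proc/<pid>/status. The sketch below parses such a line and tests bit 21; the `sketch_*` helpers are hypothetical names introduced only for illustration, and reading the file itself is left to the caller.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* CAP_SYS_ADMIN is 21 in include/uapi/linux/capability.h. */
#define SKETCH_CAP_SYS_ADMIN 21

/* Hypothetical helper: parse one "CapEff:" line into a bit mask. */
static bool sketch_parse_capeff(const char *line, uint64_t *mask)
{
	return sscanf(line, "CapEff: %" SCNx64, mask) == 1;
}

/* Hypothetical helper: does the effective mask include CAP_SYS_ADMIN? */
static bool sketch_has_sys_admin(uint64_t mask)
{
	return (mask >> SKETCH_CAP_SYS_ADMIN) & 1u;
}
```

A deployment check could feed each container's "CapEff:" line through these helpers and flag any process for which sketch_has_sys_admin() returns true.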
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-22 2:53 ` Tianjia Zhang
@ 2023-05-22 19:13 ` Casey Schaufler
2023-05-23 3:05 ` Tianjia Zhang
0 siblings, 1 reply; 10+ messages in thread
From: Casey Schaufler @ 2023-05-22 19:13 UTC (permalink / raw)
To: Tianjia Zhang, Serge Hallyn, Paul Moore, Stephen Smalley,
Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi,
linux-security-module, selinux, linux-block, linux-kernel,
louxiao.lx, Casey Schaufler
On 5/21/2023 7:53 PM, Tianjia Zhang wrote:
> Hi Casey,
>
> On 5/18/23 8:01 AM, Casey Schaufler wrote:
>> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>>> Hi Casey,
>>>
>>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is
>>>>> included
>>>>> within CAP_SYS_ADMIN.
>>>>>
>>>>> Some database products rely on shared storage to complete the
>>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>>> When HA occurs, they rely on the PR (Persistent Reservations)
>>>>> protocol
>>>>> provided by the storage layer to manage block device permissions to
>>>>> ensure data correctness.
>>>>>
>>>>> CAP_SYS_ADMIN is required in the PR protocol implementation of
>>>>> existing
>>>>> block devices in the Linux kernel, which has too many sensitive
>>>>> permissions, which may lead to risks such as container escape. The
>>>>> kernel needs to provide more fine-grained permission management like
>>>>> CAP_NET_ADMIN to avoid online products directly relying on root to
>>>>> run.
>>>>>
>>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>>> ensuring that applications run with least privilege.
>>>>
>>>> Can you demonstrate that there are cases where a program that needs
>>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>>>> operations?
>>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>>> justify.
>>>>
>>>
>>> For the previous non-container scenarios, the block device is a shared
>>> device, because the business-system generally operates the file system
>>> on the block. Therefore, directly operating the block device has a high
>>> probability of affecting other processes on the same host, and it is a
>>> reasonable requirement to need the CAP_SYS_ADMIN capability.
>>>
>>> But for a database running in a container scenario, especially a
>>> container scenario on the cloud, it is likely that a container
>>> exclusively occupies a block device. That is to say, for a container,
>>> its access to the block device will not affect other process, there is
>>> no need to obtain a higher CAP_SYS_ADMIN capability.
>>
>> If I understand correctly, you're saying that the process that requires
>> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
>> other operations.
>>
>> That's good, but it isn't clear how a process on bare metal would
>> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>>
>>>
>>> For a file system similar to distributed write-once-read-many, it is
>>> necessary to ensure the correctness of recovery, then when recovery
>>> occurs, it is necessary to ensure that no inflighting-io is completed
>>> after recovery.
>>>
>>> This can be guaranteed by performing operations such as SCSI/NVME
>>> Persistent Reservations on block devices on the distributed file
>>> system.
>>
>> Does your cloud based system always run "real" devices? My
>> understanding is that cloud based deployment usually uses
>> virtual machines and virtio or other simulated devices.
>> A container deployment in the cloud seems unlikely to be able
>> to take advantage of block administration. But I can't say
>> I know the specifics of your environment.
>>
>>> Therefore, at present, it is only necessary to have the relevant
>>> permission support of the control command of such container-exclusive
>>> block devices.
>>
>> This looks like an extremely special case in which breaking out
>> block management would make sense.
>>
> Our scenario is like this. In simply terms, a distributed database has
> a read-write instance and one or more read-only instances. Each instance
> runs in an isolated container. All containers share the same block
> device.
>
> In addition to the database instance, there is also a control program
> running on the control plane in the container. The database ensures
> the correctness of the data through the PR (Persistent Reservations)
> of the block device. This operation is also the only operation in the
> container that requires CAP_SYS_ADMIN privileges.
>
> This system as a whole, whether it is running on VM or bare metal, the
> difference is not big.
>
> In order to support the PR of block devices, we need to grant
> CAP_SYS_ADMIN permissions to the container, which not only greatly
> increases the risk of container escape, but also makes us have to
> carefully configure the permissions of the container. Many container
> escapes that have occurred are also caused by these reasons.
>
> This is essentially a problem of permission isolation. We hope to
> share the smallest possible permissions from CAP_SYS_ADMIN to support
> necessary operations, and avoid providing CAP_SYS_ADMIN permissions
> to containers as much as possible.
Your use case is interesting, but not compelling. While you may have
come up with a specific case where you can completely break CAP_BLOCK_ADMIN
out from CAP_SYS_ADMIN, it's hardly general.
>
> Kind regards,
> Tianjia
>
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-22 19:13 ` Casey Schaufler
@ 2023-05-23 3:05 ` Tianjia Zhang
0 siblings, 0 replies; 10+ messages in thread
From: Tianjia Zhang @ 2023-05-23 3:05 UTC (permalink / raw)
To: Casey Schaufler, Serge Hallyn, Paul Moore, Stephen Smalley,
Eric Paris, Frederick Lawler, Jens Axboe, Joseph Qi,
linux-security-module, selinux, linux-block, linux-kernel,
louxiao.lx
On 5/23/23 3:13 AM, Casey Schaufler wrote:
> On 5/21/2023 7:53 PM, Tianjia Zhang wrote:
>> Hi Casey,
>>
>> On 5/18/23 8:01 AM, Casey Schaufler wrote:
>>> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>>>> Hi Casey,
>>>>
>>>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>>>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is
>>>>>> included
>>>>>> within CAP_SYS_ADMIN.
>>>>>>
>>>>>> Some database products rely on shared storage to complete the
>>>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>>>> When HA occurs, they rely on the PR (Persistent Reservations)
>>>>>> protocol
>>>>>> provided by the storage layer to manage block device permissions to
>>>>>> ensure data correctness.
>>>>>>
>>>>>> CAP_SYS_ADMIN is required in the PR protocol implementation of
>>>>>> existing
>>>>>> block devices in the Linux kernel, which has too many sensitive
>>>>>> permissions, which may lead to risks such as container escape. The
>>>>>> kernel needs to provide more fine-grained permission management like
>>>>>> CAP_NET_ADMIN to avoid online products directly relying on root to
>>>>>> run.
>>>>>>
>>>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>>>> ensuring that applications run with least privilege.
>>>>>
>>>>> Can you demonstrate that there are cases where a program that needs
>>>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>>>>> operations?
>>>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>>>> justify.
>>>>>
>>>>
>>>> For the previous non-container scenarios, the block device is a shared
>>>> device, because the business-system generally operates the file system
>>>> on the block. Therefore, directly operating the block device has a high
>>>> probability of affecting other processes on the same host, and it is a
>>>> reasonable requirement to need the CAP_SYS_ADMIN capability.
>>>>
>>>> But for a database running in a container scenario, especially a
>>>> container scenario on the cloud, it is likely that a container
>>>> exclusively occupies a block device. That is to say, for a container,
>>>> its access to the block device will not affect other process, there is
>>>> no need to obtain a higher CAP_SYS_ADMIN capability.
>>>
>>> If I understand correctly, you're saying that the process that requires
>>> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
>>> other operations.
>>>
>>> That's good, but it isn't clear how a process on bare metal would
>>> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>>>
>>>>
>>>> For a distributed write-once-read-many style file system, recovery
>>>> must be correct: once recovery occurs, it is necessary to ensure
>>>> that no in-flight I/O completes after the recovery.
>>>>
>>>> This can be guaranteed by performing operations such as SCSI/NVME
>>>> Persistent Reservations on block devices on the distributed file
>>>> system.
>>>
>>> Does your cloud based system always run "real" devices? My
>>> understanding is that cloud based deployment usually uses
>>> virtual machines and virtio or other simulated devices.
>>> A container deployment in the cloud seems unlikely to be able
>>> to take advantage of block administration. But I can't say
>>> I know the specifics of your environment.
>>>
>>>> Therefore, at present, it is only necessary to have the relevant
>>>> permission support of the control command of such container-exclusive
>>>> block devices.
>>>
>>> This looks like an extremely special case in which breaking out
>>> block management would make sense.
>>>
>> Our scenario is like this. In simple terms, a distributed database has
>> a read-write instance and one or more read-only instances. Each instance
>> runs in an isolated container. All containers share the same block
>> device.
>>
>> In addition to the database instance, there is also a control program
>> running on the control plane in the container. The database ensures
>> the correctness of the data through the PR (Persistent Reservations)
>> of the block device. This operation is also the only operation in the
>> container that requires CAP_SYS_ADMIN privileges.
>>
>> For the system as a whole, it makes little difference whether it runs
>> on a VM or on bare metal.
>>
>> In order to support the PR of block devices, we need to grant
>> CAP_SYS_ADMIN to the container, which not only greatly increases the
>> risk of container escape, but also forces us to configure the
>> permissions of the container very carefully. Many of the container
>> escapes that have occurred were caused by exactly this.
>>
>> This is essentially a problem of permission isolation. We hope to
>> split the smallest possible set of permissions out of CAP_SYS_ADMIN to
>> support the necessary operations, and to avoid granting CAP_SYS_ADMIN
>> to containers as much as possible.
>
> Your use case is interesting, but not compelling. While you may have
> come up with a specific case where you can completely break CAP_BLOCK_ADMIN
> out from CAP_SYS_ADMIN, it's hardly general.
>
That is a pity. Thanks for your reply; we will try to provide support
through our own patches first.
Kind regards,
Tianjia
* Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN
2023-05-11 7:05 [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Tianjia Zhang
` (2 preceding siblings ...)
2023-05-11 16:17 ` [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Casey Schaufler
@ 2023-05-23 6:18 ` Christoph Hellwig
3 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2023-05-23 6:18 UTC (permalink / raw)
To: Tianjia Zhang
Cc: Serge Hallyn, Paul Moore, Stephen Smalley, Eric Paris,
Frederick Lawler, Jens Axboe, Joseph Qi, linux-security-module,
selinux, linux-block, linux-kernel
On Thu, May 11, 2023 at 03:05:18PM +0800, Tianjia Zhang wrote:
> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
> within CAP_SYS_ADMIN.
Splitting out capabilities tends to massively break userspace. Don't
do it.
> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
> block devices in the Linux kernel, which has too many sensitive
> permissions, which may lead to risks such as container escape. The
> kernel needs to provide more fine-grained permission management like
> CAP_NET_ADMIN to avoid online products directly relying on root to run.
I'm pretty sure the PR API can be keyed off just permissions on the
block device node as nothing in it is fundamentally unsafe.
Please work on relaxing the permissions checks there.
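
For illustration only, here is a minimal userspace model of the two policies
discussed in this thread. It is not kernel code and is not taken from any
posted patch. The struct and function names are hypothetical, and treating
"opened for write" as the device-node permission that matters is an
assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a caller issuing a PR ioctl; not a kernel type. */
struct pr_request_ctx {
	bool has_sys_admin;    /* caller holds CAP_SYS_ADMIN */
	bool opened_for_write; /* assumption: node was opened with write access */
};

/* Policy described in the cover letter: CAP_SYS_ADMIN is required. */
static bool pr_allowed_current(const struct pr_request_ctx *ctx)
{
	return ctx->has_sys_admin;
}

/*
 * Relaxed policy in the spirit of the suggestion above: a privileged
 * caller is still allowed, and otherwise access to the device node
 * (modelled here as a writable open) is sufficient.
 */
static bool pr_allowed_relaxed(const struct pr_request_ctx *ctx)
{
	return ctx->has_sys_admin || ctx->opened_for_write;
}

/* Exercises both policies for the container database case. */
void pr_policy_selftest(void)
{
	struct pr_request_ctx container_db = {
		.has_sys_admin = false,
		.opened_for_write = true,
	};
	struct pr_request_ctx read_only = {
		.has_sys_admin = false,
		.opened_for_write = false,
	};

	assert(!pr_allowed_current(&container_db)); /* needs CAP_SYS_ADMIN today */
	assert(pr_allowed_relaxed(&container_db));  /* node access would suffice */
	assert(!pr_allowed_relaxed(&read_only));    /* no write access, no PR */
}
```

Under this model the container in the scenario above would only need the
block device node passed in with write access, and no new capability bit
would be required.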
end of thread
Thread overview: 10+ messages
2023-05-11 7:05 [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Tianjia Zhang
2023-05-11 7:05 ` [PATCH 1/2] " Tianjia Zhang
2023-05-11 7:05 ` [PATCH 2/2] block: use block_admin_capable() for Persistent Reservations Tianjia Zhang
2023-05-11 16:17 ` [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN Casey Schaufler
2023-05-16 12:05 ` Tianjia Zhang
2023-05-18 0:01 ` Casey Schaufler
2023-05-22 2:53 ` Tianjia Zhang
2023-05-22 19:13 ` Casey Schaufler
2023-05-23 3:05 ` Tianjia Zhang
2023-05-23 6:18 ` Christoph Hellwig