* [PATCH ghak90 V6 04/10] audit: log container info of syscalls
From: Richard Guy Briggs @ 2019-04-09 3:39 UTC (permalink / raw)
To: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel
Cc: Paul Moore, sgrubb, omosnace, dhowells, simo, eparis, serge,
ebiederm, nhorman, Richard Guy Briggs
In-Reply-To: <cover.1554732921.git.rgb@redhat.com>
Create a new audit record AUDIT_CONTAINER_ID to document the audit
container identifier of a process if it is present.
Called from audit_log_exit(), syscalls are covered.
A sample raw event:
type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
type=CWD msg=audit(1519924845.499:257): cwd="/root"
type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
Please see the github audit userspace issue for supporting additions:
https://github.com/linux-audit/audit-userspace/issues/51
Please see the github audit testsuiite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
---
include/linux/audit.h | 5 +++++
include/uapi/linux/audit.h | 1 +
kernel/audit.c | 20 ++++++++++++++++++++
kernel/auditsc.c | 20 ++++++++++++++------
4 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 301337776193..43438192ca2a 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -199,6 +199,8 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
return tsk->audit->contid;
}
+extern void audit_log_contid(struct audit_context *context, u64 contid);
+
extern u32 audit_enabled;
#else /* CONFIG_AUDIT */
static inline int audit_alloc(struct task_struct *task)
@@ -265,6 +267,9 @@ static inline u64 audit_get_contid(struct task_struct *tsk)
return AUDIT_CID_UNSET;
}
+static inline void audit_log_contid(struct audit_context *context, u64 contid)
+{ }
+
#define audit_enabled AUDIT_OFF
#endif /* CONFIG_AUDIT */
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 4a6a8bf1de32..55fde9970762 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -115,6 +115,7 @@
#define AUDIT_REPLACE 1329 /* Replace auditd if this packet unanswerd */
#define AUDIT_KERN_MODULE 1330 /* Kernel Module events */
#define AUDIT_FANOTIFY 1331 /* Fanotify access decision */
+#define AUDIT_CONTAINER_ID 1332 /* Container ID */
#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index 182b0f2c183d..3e0af53f3c4d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2127,6 +2127,26 @@ void audit_log_session_info(struct audit_buffer *ab)
audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
}
+/*
+ * audit_log_contid - report container info
+ * @context: task or local context for record
+ * @contid: container ID to report
+ */
+void audit_log_contid(struct audit_context *context, u64 contid)
+{
+ struct audit_buffer *ab;
+
+ if (!audit_contid_valid(contid))
+ return;
+ /* Generate AUDIT_CONTAINER_ID record with container ID */
+ ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
+ if (!ab)
+ return;
+ audit_log_format(ab, "contid=%llu", (unsigned long long)contid);
+ audit_log_end(ab);
+}
+EXPORT_SYMBOL(audit_log_contid);
+
void audit_log_key(struct audit_buffer *ab, char *key)
{
audit_log_format(ab, " key=");
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 1f7edf035b16..eea445b7a181 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1523,7 +1523,7 @@ static void audit_log_exit(void)
for (aux = context->aux_pids; aux; aux = aux->next) {
struct audit_aux_data_pids *axs = (void *)aux;
- for (i = 0; i < axs->pid_count; i++)
+ for (i = 0; i < axs->pid_count; i++) {
if (audit_log_pid_context(context, axs->target_pid[i],
axs->target_auid[i],
axs->target_uid[i],
@@ -1531,14 +1531,20 @@ static void audit_log_exit(void)
axs->target_sid[i],
axs->target_comm[i]))
call_panic = 1;
+ audit_log_contid(context, axs->target_cid[i]);
+ }
}
- if (context->target_pid &&
- audit_log_pid_context(context, context->target_pid,
- context->target_auid, context->target_uid,
- context->target_sessionid,
- context->target_sid, context->target_comm))
+ if (context->target_pid) {
+ if (audit_log_pid_context(context, context->target_pid,
+ context->target_auid,
+ context->target_uid,
+ context->target_sessionid,
+ context->target_sid,
+ context->target_comm))
call_panic = 1;
+ audit_log_contid(context, context->target_cid);
+ }
if (context->pwd.dentry && context->pwd.mnt) {
ab = audit_log_start(context, GFP_KERNEL, AUDIT_CWD);
@@ -1557,6 +1563,8 @@ static void audit_log_exit(void)
audit_log_proctitle();
+ audit_log_contid(context, audit_get_contid(current));
+
/* Send end of event record to help user space know we are finished */
ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE);
if (ab)
--
1.8.3.1
^ permalink raw reply related
* [PATCH ghak90 V6 03/10] audit: read container ID of a process
From: Richard Guy Briggs @ 2019-04-09 3:39 UTC (permalink / raw)
To: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel
Cc: Paul Moore, sgrubb, omosnace, dhowells, simo, eparis, serge,
ebiederm, nhorman, Richard Guy Briggs
In-Reply-To: <cover.1554732921.git.rgb@redhat.com>
Add support for reading the audit container identifier from the proc
filesystem.
This is a read from the proc entry of the form
/proc/PID/audit_containerid where PID is the process ID of the task
whose audit container identifier is sought.
The read expects up to a u64 value (unset: 18446744073709551615).
This read requires CAP_AUDIT_CONTROL.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
---
fs/proc/base.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 43fd0c4b87de..acc70239d0cb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1211,7 +1211,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
};
#ifdef CONFIG_AUDIT
-#define TMPBUFLEN 11
+#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
size_t count, loff_t *ppos)
{
@@ -1295,6 +1295,24 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
.llseek = generic_file_llseek,
};
+static ssize_t proc_contid_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ struct task_struct *task = get_proc_task(inode);
+ ssize_t length;
+ char tmpbuf[TMPBUFLEN];
+
+ if (!task)
+ return -ESRCH;
+ /* if we don't have caps, reject */
+ if (!capable(CAP_AUDIT_CONTROL))
+ return -EPERM;
+ length = scnprintf(tmpbuf, TMPBUFLEN, "%llu", audit_get_contid(task));
+ put_task_struct(task);
+ return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
+}
+
static ssize_t proc_contid_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -1325,6 +1343,7 @@ static ssize_t proc_contid_write(struct file *file, const char __user *buf,
}
static const struct file_operations proc_contid_operations = {
+ .read = proc_contid_read,
.write = proc_contid_write,
.llseek = generic_file_llseek,
};
@@ -3067,7 +3086,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
- REG("audit_containerid", S_IWUSR, proc_contid_operations),
+ REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3466,7 +3485,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
- REG("audit_containerid", S_IWUSR, proc_contid_operations),
+ REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
--
1.8.3.1
^ permalink raw reply related
* [PATCH ghak90 V6 02/10] audit: add container id
From: Richard Guy Briggs @ 2019-04-09 3:39 UTC (permalink / raw)
To: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel
Cc: Paul Moore, sgrubb, omosnace, dhowells, simo, eparis, serge,
ebiederm, nhorman, Richard Guy Briggs
In-Reply-To: <cover.1554732921.git.rgb@redhat.com>
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.
This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.
The write expects up to a u64 value (unset: 18446744073709551615).
The writer must have capability CAP_AUDIT_CONTROL.
This will produce a record such as this:
type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes
The "op" field indicates an initial set. The "pid" to "ses" fields are
the orchestrator while the "opid" field is the object's PID, the process
being "contained". New and old audit container identifier values are
given in the "contid" fields, while res indicates its success.
It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.
Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
Please see the github audit userspace issue for supporting additions:
https://github.com/linux-audit/audit-userspace/issues/51
Please see the github audit testsuiite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
---
fs/proc/base.c | 36 ++++++++++++++++++++++++
include/linux/audit.h | 25 +++++++++++++++++
include/uapi/linux/audit.h | 2 ++
kernel/audit.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/audit.h | 1 +
kernel/auditsc.c | 4 +++
6 files changed, 137 insertions(+)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ddef482f1334..43fd0c4b87de 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1294,6 +1294,40 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
.read = proc_sessionid_read,
.llseek = generic_file_llseek,
};
+
+static ssize_t proc_contid_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file_inode(file);
+ u64 contid;
+ int rv;
+ struct task_struct *task = get_proc_task(inode);
+
+ if (!task)
+ return -ESRCH;
+ if (*ppos != 0) {
+ /* No partial writes. */
+ put_task_struct(task);
+ return -EINVAL;
+ }
+
+ rv = kstrtou64_from_user(buf, count, 10, &contid);
+ if (rv < 0) {
+ put_task_struct(task);
+ return rv;
+ }
+
+ rv = audit_set_contid(task, contid);
+ put_task_struct(task);
+ if (rv < 0)
+ return rv;
+ return count;
+}
+
+static const struct file_operations proc_contid_operations = {
+ .write = proc_contid_write,
+ .llseek = generic_file_llseek,
+};
#endif
#ifdef CONFIG_FAULT_INJECTION
@@ -3033,6 +3067,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
+ REG("audit_containerid", S_IWUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3431,6 +3466,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
#ifdef CONFIG_AUDIT
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
+ REG("audit_containerid", S_IWUSR, proc_contid_operations),
#endif
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
diff --git a/include/linux/audit.h b/include/linux/audit.h
index bde346e73f0c..301337776193 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -89,6 +89,7 @@ struct audit_field {
struct audit_task_info {
kuid_t loginuid;
unsigned int sessionid;
+ u64 contid;
#ifdef CONFIG_AUDITSYSCALL
struct audit_context *ctx;
#endif
@@ -189,6 +190,15 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return tsk->audit->sessionid;
}
+extern int audit_set_contid(struct task_struct *tsk, u64 contid);
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return AUDIT_CID_UNSET;
+ return tsk->audit->contid;
+}
+
extern u32 audit_enabled;
#else /* CONFIG_AUDIT */
static inline int audit_alloc(struct task_struct *task)
@@ -250,6 +260,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
return AUDIT_SID_UNSET;
}
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ return AUDIT_CID_UNSET;
+}
+
#define audit_enabled AUDIT_OFF
#endif /* CONFIG_AUDIT */
@@ -606,6 +621,16 @@ static inline bool audit_loginuid_set(struct task_struct *tsk)
return uid_valid(audit_get_loginuid(tsk));
}
+static inline bool audit_contid_valid(u64 contid)
+{
+ return contid != AUDIT_CID_UNSET;
+}
+
+static inline bool audit_contid_set(struct task_struct *tsk)
+{
+ return audit_contid_valid(audit_get_contid(tsk));
+}
+
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
{
audit_log_n_string(ab, buf, strlen(buf));
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 3901c51c0b93..4a6a8bf1de32 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -71,6 +71,7 @@
#define AUDIT_TTY_SET 1017 /* Set TTY auditing status */
#define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */
#define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
+#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
#define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
#define AUDIT_USER_AVC 1107 /* We filter this differently */
@@ -485,6 +486,7 @@ struct audit_tty_status {
#define AUDIT_UID_UNSET (unsigned int)-1
#define AUDIT_SID_UNSET ((unsigned int)-1)
+#define AUDIT_CID_UNSET ((u64)-1)
/* audit_rule_data supports filter rules with both integer and string
* fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
diff --git a/kernel/audit.c b/kernel/audit.c
index 3fb09783cd4a..182b0f2c183d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -244,6 +244,7 @@ int audit_alloc(struct task_struct *tsk)
}
info->loginuid = audit_get_loginuid(current);
info->sessionid = audit_get_sessionid(current);
+ info->contid = audit_get_contid(current);
tsk->audit = info;
ret = audit_alloc_syscall(tsk);
@@ -258,6 +259,7 @@ int audit_alloc(struct task_struct *tsk)
struct audit_task_info init_struct_audit = {
.loginuid = INVALID_UID,
.sessionid = AUDIT_SID_UNSET,
+ .contid = AUDIT_CID_UNSET,
#ifdef CONFIG_AUDITSYSCALL
.ctx = NULL,
#endif
@@ -2341,6 +2343,73 @@ int audit_set_loginuid(kuid_t loginuid)
}
/**
+ * audit_set_contid - set current task's audit contid
+ * @contid: contid value
+ *
+ * Returns 0 on success, -EPERM on permission failure.
+ *
+ * Called (set) from fs/proc/base.c::proc_contid_write().
+ */
+int audit_set_contid(struct task_struct *task, u64 contid)
+{
+ u64 oldcontid;
+ int rc = 0;
+ struct audit_buffer *ab;
+ uid_t uid;
+ struct tty_struct *tty;
+ char comm[sizeof(current->comm)];
+
+ task_lock(task);
+ /* Can't set if audit disabled */
+ if (!task->audit) {
+ task_unlock(task);
+ return -ENOPROTOOPT;
+ }
+ oldcontid = audit_get_contid(task);
+ read_lock(&tasklist_lock);
+ /* Don't allow the audit containerid to be unset */
+ if (!audit_contid_valid(contid))
+ rc = -EINVAL;
+ /* if we don't have caps, reject */
+ else if (!capable(CAP_AUDIT_CONTROL))
+ rc = -EPERM;
+ /* if task has children or is not single-threaded, deny */
+ else if (!list_empty(&task->children))
+ rc = -EBUSY;
+ else if (!(thread_group_leader(task) && thread_group_empty(task)))
+ rc = -EALREADY;
+ read_unlock(&tasklist_lock);
+ if (!rc)
+ task->audit->contid = contid;
+ task_unlock(task);
+
+ if (!audit_enabled)
+ return rc;
+
+ ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
+ if (!ab)
+ return rc;
+
+ uid = from_kuid(&init_user_ns, task_uid(current));
+ tty = audit_get_tty();
+ audit_log_format(ab,
+ "op=set opid=%d contid=%llu old-contid=%llu pid=%d uid=%u auid=%u tty=%s ses=%u",
+ task_tgid_nr(task), contid, oldcontid,
+ task_tgid_nr(current), uid,
+ from_kuid(&init_user_ns, audit_get_loginuid(current)),
+ tty ? tty_name(tty) : "(none)",
+ audit_get_sessionid(current));
+ audit_put_tty(tty);
+ audit_log_task_context(ab);
+ audit_log_format(ab, " comm=");
+ audit_log_untrustedstring(ab, get_task_comm(comm, current));
+ audit_log_d_path_exe(ab, current->mm);
+ audit_log_format(ab, " res=%d", !rc);
+ audit_log_end(ab);
+ return rc;
+}
+
+/**
* audit_log_end - end one audit record
* @ab: the audit_buffer
*
diff --git a/kernel/audit.h b/kernel/audit.h
index c00e2ee3c6b3..e2912924af0d 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -148,6 +148,7 @@ struct audit_context {
kuid_t target_uid;
unsigned int target_sessionid;
u32 target_sid;
+ u64 target_cid;
char target_comm[TASK_COMM_LEN];
struct audit_tree_refs *trees, *first_trees;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index fd7ca983de4f..1f7edf035b16 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -113,6 +113,7 @@ struct audit_aux_data_pids {
kuid_t target_uid[AUDIT_AUX_PIDS];
unsigned int target_sessionid[AUDIT_AUX_PIDS];
u32 target_sid[AUDIT_AUX_PIDS];
+ u64 target_cid[AUDIT_AUX_PIDS];
char target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
int pid_count;
};
@@ -2368,6 +2369,7 @@ void __audit_ptrace(struct task_struct *t)
context->target_uid = task_uid(t);
context->target_sessionid = audit_get_sessionid(t);
security_task_getsecid(t, &context->target_sid);
+ context->target_cid = audit_get_contid(t);
memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
}
@@ -2408,6 +2410,7 @@ int audit_signal_info(int sig, struct task_struct *t)
ctx->target_uid = t_uid;
ctx->target_sessionid = audit_get_sessionid(t);
security_task_getsecid(t, &ctx->target_sid);
+ ctx->target_cid = audit_get_contid(t);
memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
return 0;
}
@@ -2429,6 +2432,7 @@ int audit_signal_info(int sig, struct task_struct *t)
axp->target_uid[axp->pid_count] = t_uid;
axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
+ axp->target_cid[axp->pid_count] = audit_get_contid(t);
memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
axp->pid_count++;
--
1.8.3.1
^ permalink raw reply related
* [PATCH ghak90 V6 01/10] audit: collect audit task parameters
From: Richard Guy Briggs @ 2019-04-09 3:39 UTC (permalink / raw)
To: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel
Cc: Paul Moore, sgrubb, omosnace, dhowells, simo, eparis, serge,
ebiederm, nhorman, Richard Guy Briggs
In-Reply-To: <cover.1554732921.git.rgb@redhat.com>
The audit-related parameters in struct task_struct should ideally be
collected together and accessed through a standard audit API.
Collect the existing loginuid, sessionid and audit_context together in a
new struct audit_task_info called "audit" in struct task_struct.
Use kmem_cache to manage this pool of memory.
Un-inline audit_free() to be able to always recover that memory.
Please see the upstream github issue
https://github.com/linux-audit/audit-kernel/issues/81
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
---
include/linux/audit.h | 49 +++++++++++++++++++++++------------
include/linux/sched.h | 7 +----
init/init_task.c | 3 +--
init/main.c | 2 ++
kernel/audit.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/audit.h | 5 ++++
kernel/auditsc.c | 26 ++++++++++---------
kernel/fork.c | 1 -
8 files changed, 124 insertions(+), 40 deletions(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 1e69d9fe16da..bde346e73f0c 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -86,6 +86,16 @@ struct audit_field {
u32 op;
};
+struct audit_task_info {
+ kuid_t loginuid;
+ unsigned int sessionid;
+#ifdef CONFIG_AUDITSYSCALL
+ struct audit_context *ctx;
+#endif
+};
+
+extern struct audit_task_info init_struct_audit;
+
extern int is_audit_feature_set(int which);
extern int __init audit_register_class(int class, unsigned *list);
@@ -122,6 +132,9 @@ struct audit_field {
#ifdef CONFIG_AUDIT
/* These are defined in audit.c */
/* Public API */
+extern int audit_alloc(struct task_struct *task);
+extern void audit_free(struct task_struct *task);
+extern void __init audit_task_init(void);
extern __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...);
@@ -164,16 +177,28 @@ extern void audit_log_key(struct audit_buffer *ab,
static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
{
- return tsk->loginuid;
+ if (!tsk->audit)
+ return INVALID_UID;
+ return tsk->audit->loginuid;
}
static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
{
- return tsk->sessionid;
+ if (!tsk->audit)
+ return AUDIT_SID_UNSET;
+ return tsk->audit->sessionid;
}
extern u32 audit_enabled;
#else /* CONFIG_AUDIT */
+static inline int audit_alloc(struct task_struct *task)
+{
+ return 0;
+}
+static inline void audit_free(struct task_struct *task)
+{ }
+static inline void __init audit_task_init(void)
+{ }
static inline __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...)
@@ -239,8 +264,6 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
/* These are defined in auditsc.c */
/* Public API */
-extern int audit_alloc(struct task_struct *task);
-extern void __audit_free(struct task_struct *task);
extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3);
extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -263,12 +286,14 @@ extern void audit_seccomp_actions_logged(const char *names,
static inline void audit_set_context(struct task_struct *task, struct audit_context *ctx)
{
- task->audit_context = ctx;
+ task->audit->ctx = ctx;
}
static inline struct audit_context *audit_context(void)
{
- return current->audit_context;
+ if (!current->audit)
+ return NULL;
+ return current->audit->ctx;
}
static inline bool audit_dummy_context(void)
@@ -276,11 +301,7 @@ static inline bool audit_dummy_context(void)
void *p = audit_context();
return !p || *(int *)p;
}
-static inline void audit_free(struct task_struct *task)
-{
- if (unlikely(task->audit_context))
- __audit_free(task);
-}
+
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
@@ -470,12 +491,6 @@ static inline void audit_fanotify(unsigned int response)
extern int audit_n_rules;
extern int audit_signals;
#else /* CONFIG_AUDITSYSCALL */
-static inline int audit_alloc(struct task_struct *task)
-{
- return 0;
-}
-static inline void audit_free(struct task_struct *task)
-{ }
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..3e0699a26dab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,7 +32,6 @@
#include <linux/rseq.h>
/* task_struct member predeclarations (sorted alphabetically): */
-struct audit_context;
struct backing_dev_info;
struct bio_list;
struct blk_plug;
@@ -873,11 +872,7 @@ struct task_struct {
struct callback_head *task_works;
#ifdef CONFIG_AUDIT
-#ifdef CONFIG_AUDITSYSCALL
- struct audit_context *audit_context;
-#endif
- kuid_t loginuid;
- unsigned int sessionid;
+ struct audit_task_info *audit;
#endif
struct seccomp seccomp;
diff --git a/init/init_task.c b/init/init_task.c
index c70ef656d0f4..32420485e54b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -123,8 +123,7 @@ struct task_struct init_task
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
#ifdef CONFIG_AUDIT
- .loginuid = INVALID_UID,
- .sessionid = AUDIT_SID_UNSET,
+ .audit = &init_struct_audit,
#endif
#ifdef CONFIG_PERF_EVENTS
.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
diff --git a/init/main.c b/init/main.c
index 598e278b46f7..26b6c80f5b1d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -92,6 +92,7 @@
#include <linux/rodata_test.h>
#include <linux/jump_label.h>
#include <linux/mem_encrypt.h>
+#include <linux/audit.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -734,6 +735,7 @@ asmlinkage __visible void __init start_kernel(void)
nsfs_init();
cpuset_init();
cgroup_init();
+ audit_task_init();
taskstats_init_early();
delayacct_init();
diff --git a/kernel/audit.c b/kernel/audit.c
index b96bf69183f4..3fb09783cd4a 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -215,6 +215,73 @@ struct audit_reply {
struct sk_buff *skb;
};
+static struct kmem_cache *audit_task_cache;
+
+void __init audit_task_init(void)
+{
+ audit_task_cache = kmem_cache_create("audit_task",
+ sizeof(struct audit_task_info),
+ 0, SLAB_PANIC, NULL);
+}
+
+/**
+ * audit_alloc - allocate an audit info block for a task
+ * @tsk: task
+ *
+ * Call audit_alloc_syscall to filter on the task information and
+ * allocate a per-task audit context if necessary. This is called from
+ * copy_process, so no lock is needed.
+ */
+int audit_alloc(struct task_struct *tsk)
+{
+ int ret = 0;
+ struct audit_task_info *info;
+
+ info = kmem_cache_alloc(audit_task_cache, GFP_KERNEL);
+ if (!info) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ info->loginuid = audit_get_loginuid(current);
+ info->sessionid = audit_get_sessionid(current);
+ tsk->audit = info;
+
+ ret = audit_alloc_syscall(tsk);
+ if (ret) {
+ tsk->audit = NULL;
+ kmem_cache_free(audit_task_cache, info);
+ }
+out:
+ return ret;
+}
+
+struct audit_task_info init_struct_audit = {
+ .loginuid = INVALID_UID,
+ .sessionid = AUDIT_SID_UNSET,
+#ifdef CONFIG_AUDITSYSCALL
+ .ctx = NULL,
+#endif
+};
+
+/**
+ * audit_free - free per-task audit info
+ * @tsk: task whose audit info block to free
+ *
+ * Called from copy_process and do_exit
+ */
+void audit_free(struct task_struct *tsk)
+{
+ struct audit_task_info *info = tsk->audit;
+
+ audit_free_syscall(tsk);
+ /* Freeing the audit_task_info struct must be performed after
+ * audit_log_exit() due to need for loginuid and sessionid.
+ */
+ info = tsk->audit;
+ tsk->audit = NULL;
+ kmem_cache_free(audit_task_cache, info);
+}
+
/**
* auditd_test_task - Check to see if a given task is an audit daemon
* @task: the task to check
@@ -2266,8 +2333,8 @@ int audit_set_loginuid(kuid_t loginuid)
sessionid = (unsigned int)atomic_inc_return(&session_id);
}
- current->sessionid = sessionid;
- current->loginuid = loginuid;
+ current->audit->sessionid = sessionid;
+ current->audit->loginuid = loginuid;
out:
audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc);
return rc;
diff --git a/kernel/audit.h b/kernel/audit.h
index 958d5b8fc1b3..c00e2ee3c6b3 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -264,6 +264,8 @@ extern void audit_log_d_path_exe(struct audit_buffer *ab,
extern unsigned int audit_serial(void);
extern int auditsc_get_stamp(struct audit_context *ctx,
struct timespec64 *t, unsigned int *serial);
+extern int audit_alloc_syscall(struct task_struct *tsk);
+extern void audit_free_syscall(struct task_struct *tsk);
extern void audit_put_watch(struct audit_watch *watch);
extern void audit_get_watch(struct audit_watch *watch);
@@ -305,6 +307,9 @@ extern void audit_filter_inodes(struct task_struct *tsk,
extern struct list_head *audit_killed_trees(void);
#else /* CONFIG_AUDITSYSCALL */
#define auditsc_get_stamp(c, t, s) 0
+#define audit_alloc_syscall(t) 0
+#define audit_free_syscall(t) {}
+
#define audit_put_watch(w) {}
#define audit_get_watch(w) {}
#define audit_to_watch(k, p, l, o) (-EINVAL)
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 98a98e6dca05..fd7ca983de4f 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -892,23 +892,25 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
return context;
}
-/**
- * audit_alloc - allocate an audit context block for a task
+/*
+ * audit_alloc_syscall - allocate an audit context block for a task
* @tsk: task
*
* Filter on the task information and allocate a per-task audit context
* if necessary. Doing so turns on system call auditing for the
- * specified task. This is called from copy_process, so no lock is
- * needed.
+ * specified task. This is called from copy_process via audit_alloc, so
+ * no lock is needed.
*/
-int audit_alloc(struct task_struct *tsk)
+int audit_alloc_syscall(struct task_struct *tsk)
{
struct audit_context *context;
enum audit_state state;
char *key = NULL;
- if (likely(!audit_ever_enabled))
+ if (likely(!audit_ever_enabled)) {
+ audit_set_context(tsk, NULL);
return 0; /* Return if not auditing. */
+ }
state = audit_filter_task(tsk, &key);
if (state == AUDIT_DISABLED) {
@@ -918,7 +920,7 @@ int audit_alloc(struct task_struct *tsk)
if (!(context = audit_alloc_context(state))) {
kfree(key);
- audit_log_lost("out of memory in audit_alloc");
+ audit_log_lost("out of memory in audit_alloc_syscall");
return -ENOMEM;
}
context->filterkey = key;
@@ -1563,14 +1565,15 @@ static void audit_log_exit(void)
}
/**
- * __audit_free - free a per-task audit context
+ * audit_free_syscall - free per-task audit context info
* @tsk: task whose audit context block to free
*
- * Called from copy_process and do_exit
+ * Called from audit_free
*/
-void __audit_free(struct task_struct *tsk)
+void audit_free_syscall(struct task_struct *tsk)
{
- struct audit_context *context = tsk->audit_context;
+ struct audit_task_info *info = tsk->audit;
+ struct audit_context *context = info->ctx;
if (!context)
return;
@@ -1593,7 +1596,6 @@ void __audit_free(struct task_struct *tsk)
if (context->current_state == AUDIT_RECORD_CONTEXT)
audit_log_exit();
}
-
audit_set_context(tsk, NULL);
audit_free_context(context);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..9167a8f3edae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1834,7 +1834,6 @@ static __latent_entropy struct task_struct *copy_process(
posix_cpu_timers_init(p);
p->io_context = NULL;
- audit_set_context(p, NULL);
cgroup_fork(p);
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
--
1.8.3.1
^ permalink raw reply related
* [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Richard Guy Briggs @ 2019-04-09 3:39 UTC (permalink / raw)
To: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel
Cc: Paul Moore, sgrubb, omosnace, dhowells, simo, eparis, serge,
ebiederm, nhorman, Richard Guy Briggs
Implement kernel audit container identifier.
This patchset is a fifth based on the proposal document (V3)
posted:
https://www.redhat.com/archives/linux-audit/2018-January/msg00014.html
The first patch was the last patch from ghak81 that was absorbed into
this patchset since its primary justification is the rest of this
patchset.
The second patch implements the proc fs write to set the audit container
identifier of a process, emitting an AUDIT_CONTAINER_OP record to
announce the registration of that audit container identifier on that
process. This patch requires userspace support for record acceptance
and proper type display.
The third implements reading the audit container identifier from the
proc filesystem for debugging. This patch wasn't planned for upstream
inclusion but is starting to become more likely.
The fourth implements the auxiliary record AUDIT_CONTAINER_ID if an audit
container identifier is associated with an event. This patch requires
userspace support for proper type display.
The 5th adds audit daemon signalling provenance through audit_sig_info2.
The 6th creates a local audit context to be able to bind a standalone
record with a locally created auxiliary record.
The 7th patch adds audit container identifier records to the user
standalone records.
The 8th adds audit container identifier filtering to the exit,
exclude and user lists. This patch adds the AUDIT_CONTID field and
requires auditctl userspace support for the --contid option.
The 9th adds network namespace audit container identifier labelling
based on member tasks' audit container identifier labels.
The 10th adds audit container identifier support to standalone netfilter
records that don't have a task context and lists each container to which
that net namespace belongs.
Example: Set an audit container identifier of 123456 to the "sleep" task:
sleep 2&
child=$!
echo 123456 > /proc/$child/audit_containerid; echo $?
ausearch -ts recent -m container_op
echo child:$child contid:$( cat /proc/$child/audit_containerid)
This should produce a record such as:
type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes
Example: Set a filter on an audit container identifier 123459 on /tmp/tmpcontainerid:
contid=123459
key=tmpcontainerid
auditctl -a exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
perl -e "sleep 1; open(my \$tmpfile, '>', \"/tmp/$key\"); close(\$tmpfile);" &
child=$!
echo $contid > /proc/$child/audit_containerid
sleep 2
ausearch -i -ts recent -k $key
auditctl -d exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
rm -f /tmp/$key
This should produce an event such as:
type=CONTAINER_ID msg=audit(2018-06-06 12:46:31.707:26953) : contid=123459
type=PROCTITLE msg=audit(2018-06-06 12:46:31.707:26953) : proctitle=perl -e sleep 1; open(my $tmpfile, '>', "/tmp/tmpcontainerid"); close($tmpfile);
type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=1 name=/tmp/tmpcontainerid inode=25656 dev=00:26 mode=file,644 ouid=root ogid=root rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=0 name=/tmp/ inode=8985 dev=00:26 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(2018-06-06 12:46:31.707:26953) : cwd=/root
type=SYSCALL msg=audit(2018-06-06 12:46:31.707:26953) : arch=x86_64 syscall=openat success=yes exit=3 a0=0xffffffffffffff9c a1=0x5621f2b81900 a2=O_WRONLY|O_CREAT|O_TRUNC a3=0x1b6 items=2 ppid=628 pid=2232 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=tmpcontainerid
Example: Test multiple containers on one netns:
sleep 5 &
child1=$!
containerid1=123451
echo $containerid1 > /proc/$child1/audit_containerid
sleep 5 &
child2=$!
containerid2=123452
echo $containerid2 > /proc/$child2/audit_containerid
iptables -I INPUT -i lo -p icmp --icmp-type echo-request -j AUDIT --type accept
iptables -I INPUT -t mangle -i lo -p icmp --icmp-type echo-request -j MARK --set-mark 0x12345555
sleep 1;
bash -c "ping -q -c 1 127.0.0.1 >/dev/null 2>&1"
sleep 1;
ausearch -i -m NETFILTER_PKT -ts boot|grep mark=0x12345555
ausearch -i -m NETFILTER_PKT -ts boot|grep contid=|grep $containerid1|grep $containerid2
This should produce an event such as:
type=NETFILTER_PKT msg=audit(03/15/2019 14:16:13.369:244) : mark=0x12345555 saddr=127.0.0.1 daddr=127.0.0.1 proto=icmp
type=CONTAINER_ID msg=audit(03/15/2019 14:16:13.369:244) : contid=123452,123451
Includes the last patch of https://github.com/linux-audit/audit-kernel/issues/81
Please see the github audit kernel issue for the main feature:
https://github.com/linux-audit/audit-kernel/issues/90
and the kernel filter code:
https://github.com/linux-audit/audit-kernel/issues/91
and the network support:
https://github.com/linux-audit/audit-kernel/issues/92
Please see the github audit userspace issue for supporting record types:
https://github.com/linux-audit/audit-userspace/issues/51
and filter code:
https://github.com/linux-audit/audit-userspace/issues/40
Please see the github audit testsuiite issue for the test case:
https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Changelog:
v6
- change TMPBUFLEN from 11 to 21 to cover the decimal value of contid
u64 (nhorman)
- fix bug overwriting ctx in struct audit_sig_info, move cid above
ctx[0] (nhorman)
- fix bug skipping remaining fields and not advancing bufp when copying
out contid in audit_krule_to_data (omosnacec)
- add acks, tidy commit descriptions, other formatting fixes (checkpatch
wrong on audit_log_lost)
- cast ull for u64 prints
- target_cid tracking was moved from the ptrace/signal patch to
container_op
- target ptrace and signal records were moved from the ptrace/signal
patch to container_id
- auditd signaller tracking was moved to a new AUDIT_SIGNAL_INFO2
request and record
- ditch unnecessary list_empty() checks
- check for null net and aunet in audit_netns_contid_add()
- swap CONTAINER_OP contid/old-contid order to ease parsing
v5
- address loginuid and sessionid syscall scope in ghak104
- address audit_context in CONFIG_AUDIT vs CONFIG_AUDITSYSCALL in ghak105
- remove tty patch, addressed in ghak106
- rebase on audit/next v5.0-rc1
w/ghak59/ghak104/ghak103/ghak100/ghak107/ghak105/ghak106/ghak105sup
- update CONTAINER_ID to CONTAINER_OP in patch description
- move audit_context in audit_task_info to CONFIG_AUDITSYSCALL
- move audit_alloc() and audit_free() out of CONFIG_AUDITSYSCALL and into
CONFIG_AUDIT and create audit_{alloc,free}_syscall
- use plain kmem_cache_alloc() rather than kmem_cache_zalloc() in audit_alloc()
- fix audit_get_contid() declaration type error
- move audit_set_contid() from auditsc.c to audit.c
- audit_log_contid() returns void
- audit_log_contid() handed contid rather than tsk
- switch from AUDIT_CONTAINER to AUDIT_CONTAINER_ID for aux record
- move audit_log_contid(tsk/contid) & audit_contid_set(tsk)/audit_contid_valid(contid)
- switch from tsk to current
- audit_alloc_local() calls audit_log_lost() on failure to allocate a context
- add AUDIT_USER* non-syscall contid record
- cosmetic cleanup double parens, goto out on err
- ditch audit_get_ns_contid_list_lock(), fix aunet lock race
- switch from all-cpu read spinlock to rcu, keep spinlock for write
- update audit_alloc_local() to use ktime_get_coarse_real_ts64()
- add nft_log support
- add call from do_exit() in audit_free() to remove contid from netns
- relegate AUDIT_CONTAINER ref= field (was op=) to debug patch
v4
- preface set with ghak81:"collect audit task parameters"
- add shallyn and sgrubb acks
- rename feature bitmap macro
- rename cid_valid() to audit_contid_valid()
- rename AUDIT_CONTAINER_ID to AUDIT_CONTAINER_OP
- delete audit_get_contid_list() from headers
- move work into inner if, delete "found"
- change netns contid list function names
- move exports for audit_log_contid audit_alloc_local audit_free_context to non-syscall patch
- list contids CSV
- pass in gfp flags to audit_alloc_local() (fix audit_alloc_context callers)
- use "local" in lieu of abusing in_syscall for auditsc_get_stamp()
- read_lock(&tasklist_lock) around children and thread check
- task_lock(tsk) should be taken before first check of tsk->audit
- add spin lock to contid list in aunet
- restrict /proc read to CAP_AUDIT_CONTROL
- remove set again prohibition and inherited flag
- delete contidion spelling fix from patchset, send to netdev/linux-wireless
v3
- switched from containerid in task_struct to audit_task_info (depends on ghak81)
- drop INVALID_CID in favour of only AUDIT_CID_UNSET
- check for !audit_task_info, throw -ENOPROTOOPT on set
- changed -EPERM to -EEXIST for parent check
- return AUDIT_CID_UNSET if !audit_enabled
- squash child/thread check patch into AUDIT_CONTAINER_ID patch
- changed -EPERM to -EBUSY for child check
- separate child and thread checks, use -EALREADY for latter
- move addition of op= from ptrace/signal patch to AUDIT_CONTAINER patch
- fix && to || bashism in ptrace/signal patch
- uninline and export function for audit_free_context()
- drop CONFIG_CHANGE, FEATURE_CHANGE, ANOM_ABEND, ANOM_SECCOMP patches
- move audit_enabled check (xt_AUDIT)
- switched from containerid list in struct net to net_generic's struct audit_net
- move containerid list iteration into audit (xt_AUDIT)
- create function to move namespace switch into audit
- switched /proc/PID/ entry from containerid to audit_containerid
- call kzalloc with GFP_ATOMIC on in_atomic() in audit_alloc_context()
- call kzalloc with GFP_ATOMIC on in_atomic() in audit_log_container_info()
- use xt_net(par) instead of sock_net(skb->sk) to get net
- switched record and field names: initial CONTAINER_ID, aux CONTAINER, field CONTID
- allow to set own contid
- open code audit_set_containerid
- add contid inherited flag
- ccontainerid and pcontainerid eliminated due to inherited flag
- change name of container list funcitons
- rename containerid to contid
- convert initial container record to syscall aux
- fix spelling mistake of contidion in net/rfkill/core.c to avoid contid name collision
v2
- add check for children and threads
- add network namespace container identifier list
- add NETFILTER_PKT audit container identifier logging
- patch description and documentation clean-up and example
- reap unused ppid
Richard Guy Briggs (10):
audit: collect audit task parameters
audit: add container id
audit: read container ID of a process
audit: log container info of syscalls
audit: add contid support for signalling the audit daemon
audit: add support for non-syscall auxiliary records
audit: add containerid support for user records
audit: add containerid filtering
audit: add support for containerid to network namespaces
audit: NETFILTER_PKT: record each container ID associated with a netNS
fs/proc/base.c | 57 +++++++-
include/linux/audit.h | 113 +++++++++++++--
include/linux/sched.h | 7 +-
include/uapi/linux/audit.h | 9 +-
init/init_task.c | 3 +-
init/main.c | 2 +
kernel/audit.c | 325 ++++++++++++++++++++++++++++++++++++++++++--
kernel/audit.h | 9 ++
kernel/auditfilter.c | 47 +++++++
kernel/auditsc.c | 90 ++++++++----
kernel/fork.c | 1 -
kernel/nsproxy.c | 4 +
net/netfilter/nft_log.c | 11 +-
net/netfilter/xt_AUDIT.c | 11 +-
security/selinux/nlmsgtab.c | 1 +
15 files changed, 627 insertions(+), 63 deletions(-)
--
1.8.3.1
^ permalink raw reply
* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Carlos O'Donell @ 2019-04-08 21:45 UTC (permalink / raw)
To: Tulio Magno Quites Machado Filho, Florian Weimer,
Michael Meissner, Alan Modra, Peter Bergner, Michael Ellerman,
Mathieu Desnoyers
Cc: Paul Burton, Will Deacon, Boqun Feng, Heiko Carstens,
Vasily Gorbik, Martin Schwidefsky, Russell King,
Benjamin Herrenschmidt, Paul Mackerras, carlos, Joseph Myers,
Szabolcs Nagy, libc-alpha, Thomas Gleixner, Ben Maurer,
Peter Zijlstra, Paul E. McKenney, Dave Watson, Paul Turner,
Rich Felker
In-Reply-To: <877ec4pam2.fsf@linux.ibm.com>
On 4/8/19 3:20 PM, Tulio Magno Quites Machado Filho wrote:
> Carlos O'Donell <codonell@redhat.com> writes:
>
>> On 4/5/19 5:16 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>> It is valuable that it be a trap, particularly for constant pools because
>>>> it means that a jump into the constant pool will trap.
>>>
>>> Sorry, I don't understand why this matters in this context. Would you
>>> please elaborate?
>>
>> Sorry, I wasn't very clear.
>>
>> My point is only that any accidental jumps, either with off-by-one (like you
>> fixed in gcc/glibc's signal unwinding most recently), result in a process fault
>> rather than executing RSEQ_SIG as a valid instruction *and then* continuing
>> onwards to the handler.
>>
>> A process fault is achieved either by a trap, or an invalid instruction, or
>> a privileged insn (like suggested for MIPS in this thread).
>
> In that case, mtmsr (Move to Machine State Register) seems a good candidate.
>
> mtmsr is available both on 32 and 64 bits since their first implementations.
>
> It's a privileged instruction and should never appear in userspace
> code (causes SIGILL).
>
> Any comments?
That seems good to me.
Mathieu,
What's required to move this forward for POWER?
--
Cheers,
Carlos.
^ permalink raw reply
* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Tulio Magno Quites Machado Filho @ 2019-04-08 19:20 UTC (permalink / raw)
To: Carlos O'Donell, Florian Weimer, Michael Meissner, Alan Modra,
Peter Bergner, Michael Ellerman
Cc: Mathieu Desnoyers, Paul Burton, Will Deacon, Boqun Feng,
Heiko Carstens, Vasily Gorbik, Martin Schwidefsky, Russell King,
Benjamin Herrenschmidt, Paul Mackerras, carlos, Joseph Myers,
Szabolcs Nagy, libc-alpha, Thomas Gleixner, Ben Maurer,
Peter Zijlstra, Paul E. McKenney, Dave Watson, Paul Turner
In-Reply-To: <43f97ddb-c8df-27ea-9517-63252ebd3183@redhat.com>
Carlos O'Donell <codonell@redhat.com> writes:
> On 4/5/19 5:16 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>> It is valuable that it be a trap, particularly for constant pools because
>>> it means that a jump into the constant pool will trap.
>>
>> Sorry, I don't understand why this matters in this context. Would you
>> please elaborate?
>
> Sorry, I wasn't very clear.
>
> My point is only that any accidental jumps, either with off-by-one (like you
> fixed in gcc/glibc's signal unwinding most recently), result in a process fault
> rather than executing RSEQ_SIG as a valid instruction *and then* continuing
> onwards to the handler.
>
> A process fault is achieved either by a trap, or an invalid instruction, or
> a privileged insn (like suggested for MIPS in this thread).
In that case, mtmsr (Move to Machine State Register) seems a good candidate.
mtmsr is available both on 32 and 64 bits since their first implementations.
It's a privileged instruction and should never appear in userspace
code (causes SIGILL).
Any comments?
--
Tulio Magno
^ permalink raw reply
* [PATCH linux-next v9 6/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-04-08 17:42 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Stephen Rothwell, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA,
Eugene Syromyatnikov, Oleg Nesterov, Andy Lutomirski,
strace-devel-3+4lAyCyj6AWlMsSdNXQLw
In-Reply-To: <20190408174036.GA11889-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
From: Elvira Khabirova <lineprinter-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid,
void *addr, void *data);
...
PTRACE_GET_SYSCALL_INFO
Retrieve information about the syscall that caused the stop.
The information is placed into the buffer pointed by "data"
argument, which should be a pointer to a buffer of type
"struct ptrace_syscall_info".
The "addr" argument contains the size of the buffer pointed to
by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
The return value contains the number of bytes available
to be written by the kernel.
If the size of data to be written by the kernel exceeds the size
specified by "addr" argument, the output is truncated.
Co-authored-by: Dmitry V. Levin <ldv-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
Reviewed-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Cc: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Eugene Syromyatnikov <esyr-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: strace-devel-3+4lAyCyj6AWlMsSdNXQLw@public.gmane.org
Signed-off-by: Elvira Khabirova <lineprinter-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
Signed-off-by: Dmitry V. Levin <ldv-u2l5PoMzF/Vg9hUCZPvPmw@public.gmane.org>
---
Notes:
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7: unchanged
v6:
* Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings
into account, use the end of the last field of the structure being written.
* Change struct ptrace_syscall_info:
* remove .frame_pointer field, is is not needed and not portable;
* make .arch field explicitly aligned, remove no longer needed
padding before .arch field;
* remove trailing pads, they are no longer needed.
v5:
* Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Change struct ptrace_syscall_info: generalize instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Make available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
v4:
* Do not introduce task_struct.ptrace_event,
use child->last_siginfo->si_code instead.
* Implement PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Change struct ptrace_syscall_info.
* Support PTRACE_EVENT_SECCOMP by adding ptrace_event to task_struct.
* Add proper defines for ptrace_syscall_info.op values.
* Rename PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
* and move them to uapi.
v2:
* Do not use task->ptrace.
* Replace entry_info.is_compat with entry_info.arch, use syscall_get_arch().
* Use addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
include/linux/tracehook.h | 9 ++--
include/uapi/linux/ptrace.h | 35 ++++++++++++
kernel/ptrace.c | 103 +++++++++++++++++++++++++++++++++++-
3 files changed, 143 insertions(+), 4 deletions(-)
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index df20f8bdbfa3..6bc7a3d58e2f 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -57,13 +57,15 @@ struct linux_binprm;
/*
* ptrace report for syscall entry and exit looks identical.
*/
-static inline int ptrace_report_syscall(struct pt_regs *regs)
+static inline int ptrace_report_syscall(struct pt_regs *regs,
+ unsigned long message)
{
int ptrace = current->ptrace;
if (!(ptrace & PT_PTRACED))
return 0;
+ current->ptrace_message = message;
ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
/*
@@ -76,6 +78,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
current->exit_code = 0;
}
+ current->ptrace_message = 0;
return fatal_signal_pending(current);
}
@@ -101,7 +104,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
static inline __must_check int tracehook_report_syscall_entry(
struct pt_regs *regs)
{
- return ptrace_report_syscall(regs);
+ return ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_ENTRY);
}
/**
@@ -126,7 +129,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
if (step)
user_single_step_report(regs);
else
- ptrace_report_syscall(regs);
+ ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_EXIT);
}
/**
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..a71b6e3b03eb 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,41 @@ struct seccomp_metadata {
__u64 flags; /* Output: filter's flags */
};
+#define PTRACE_GET_SYSCALL_INFO 0x420e
+#define PTRACE_SYSCALL_INFO_NONE 0
+#define PTRACE_SYSCALL_INFO_ENTRY 1
+#define PTRACE_SYSCALL_INFO_EXIT 2
+#define PTRACE_SYSCALL_INFO_SECCOMP 3
+
+struct ptrace_syscall_info {
+ __u8 op; /* PTRACE_SYSCALL_INFO_* */
+ __u32 arch __attribute__((__aligned__(sizeof(__u32))));
+ __u64 instruction_pointer;
+ __u64 stack_pointer;
+ union {
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ } entry;
+ struct {
+ __s64 rval;
+ __u8 is_error;
+ } exit;
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ __u32 ret_data;
+ } seccomp;
+ };
+};
+
+/*
+ * These values are stored in task->ptrace_message
+ * by tracehook_report_syscall_* to describe the current syscall-stop.
+ */
+#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
+#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
+
/* Read signals from a shared (process wide) queue */
#define PTRACE_PEEKSIGINFO_SHARED (1 << 0)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 6f357f4fc859..f68c0d44461e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,10 @@
#include <linux/compat.h>
#include <linux/sched/signal.h>
+#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
+#include <asm/syscall.h> /* For syscall_get_* */
+#endif
+
/*
* Access another process' address space via ptrace.
* Source/target buffer must be kernel space,
@@ -879,7 +883,100 @@ static int ptrace_regset(struct task_struct *task, int req, unsigned int type,
* to ensure no machine forgets it.
*/
EXPORT_SYMBOL_GPL(task_user_regset_view);
-#endif
+
+static unsigned long
+ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ unsigned long args[ARRAY_SIZE(info->entry.args)];
+ int i;
+
+ info->op = PTRACE_SYSCALL_INFO_ENTRY;
+ info->entry.nr = syscall_get_nr(child, regs);
+ syscall_get_arguments(child, regs, args);
+ for (i = 0; i < ARRAY_SIZE(args); i++)
+ info->entry.args[i] = args[i];
+
+ /* args is the last field in struct ptrace_syscall_info.entry */
+ return offsetofend(struct ptrace_syscall_info, entry.args);
+}
+
+static unsigned long
+ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ /*
+ * As struct ptrace_syscall_info.entry is currently a subset
+ * of struct ptrace_syscall_info.seccomp, it makes sense to
+ * initialize that subset using ptrace_get_syscall_info_entry().
+ * This can be reconsidered in the future if these structures
+ * diverge significantly enough.
+ */
+ ptrace_get_syscall_info_entry(child, regs, info);
+ info->op = PTRACE_SYSCALL_INFO_SECCOMP;
+ info->seccomp.ret_data = child->ptrace_message;
+
+ /* ret_data is the last field in struct ptrace_syscall_info.seccomp */
+ return offsetofend(struct ptrace_syscall_info, seccomp.ret_data);
+}
+
+static unsigned long
+ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
+ struct ptrace_syscall_info *info)
+{
+ info->op = PTRACE_SYSCALL_INFO_EXIT;
+ info->exit.rval = syscall_get_error(child, regs);
+ info->exit.is_error = !!info->exit.rval;
+ if (!info->exit.is_error)
+ info->exit.rval = syscall_get_return_value(child, regs);
+
+ /* is_error is the last field in struct ptrace_syscall_info.exit */
+ return offsetofend(struct ptrace_syscall_info, exit.is_error);
+}
+
+static int
+ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
+ void __user *datavp)
+{
+ struct pt_regs *regs = task_pt_regs(child);
+ struct ptrace_syscall_info info = {
+ .op = PTRACE_SYSCALL_INFO_NONE,
+ .arch = syscall_get_arch(child),
+ .instruction_pointer = instruction_pointer(regs),
+ .stack_pointer = user_stack_pointer(regs),
+ };
+ unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
+ unsigned long write_size;
+
+ /*
+ * This does not need lock_task_sighand() to access
+ * child->last_siginfo because ptrace_freeze_traced()
+ * called earlier by ptrace_check_attach() ensures that
+ * the tracee cannot go away and clear its last_siginfo.
+ */
+ switch (child->last_siginfo ? child->last_siginfo->si_code : 0) {
+ case SIGTRAP | 0x80:
+ switch (child->ptrace_message) {
+ case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+ actual_size = ptrace_get_syscall_info_entry(child, regs,
+ &info);
+ break;
+ case PTRACE_EVENTMSG_SYSCALL_EXIT:
+ actual_size = ptrace_get_syscall_info_exit(child, regs,
+ &info);
+ break;
+ }
+ break;
+ case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8):
+ actual_size = ptrace_get_syscall_info_seccomp(child, regs,
+ &info);
+ break;
+ }
+
+ write_size = min(actual_size, user_size);
+ return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size;
+}
+#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data)
@@ -1096,6 +1193,10 @@ int ptrace_request(struct task_struct *child, long request,
ret = __put_user(kiov.iov_len, &uiov->iov_len);
break;
}
+
+ case PTRACE_GET_SYSCALL_INFO:
+ ret = ptrace_get_syscall_info(child, addr, datavp);
+ break;
#endif
case PTRACE_SECCOMP_GET_FILTER:
--
ldv
--
Strace-devel mailing list
Strace-devel-3+4lAyCyj6AWlMsSdNXQLw@public.gmane.org
https://lists.strace.io/mailman/listinfo/strace-devel
^ permalink raw reply related
* [PATCH linux-next v9 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-04-08 17:40 UTC (permalink / raw)
To: Stephen Rothwell, Oleg Nesterov, Andy Lutomirski
Cc: Elvira Khabirova, Eugene Syromyatnikov, Benjamin Herrenschmidt,
Greentime Hu, Helge Deller, James E.J. Bottomley, James Hogan,
Kees Cook, Michael Ellerman, Paul Burton, Paul Mackerras,
Ralf Baechle, Richard Kuo, Shuah Khan, Vincent Chen, linux-api,
linux-hexagon, linux-kernel, linux-kselftest, linux-mips,
linux-parisc, linuxppc-dev
[I suggest to stop waiting for more acks and merge this into linux-next as is.]
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
PTRACE_GET_SYSCALL_INFO returns the following structure:
struct ptrace_syscall_info {
__u8 op; /* PTRACE_SYSCALL_INFO_* */
__u32 arch __attribute__((__aligned__(sizeof(__u32))));
__u64 instruction_pointer;
__u64 stack_pointer;
union {
struct {
__u64 nr;
__u64 args[6];
} entry;
struct {
__s64 rval;
__u8 is_error;
} exit;
struct {
__u64 nr;
__u64 args[6];
__u32 ret_data;
} seccomp;
};
};
The structure was chosen according to [2], except for the following
changes:
* seccomp substructure was added as a superset of entry substructure;
* the type of nr field was changed from int to __u64 because syscall
numbers are, as a practical matter, 64 bits;
* stack_pointer field was added along with instruction_pointer field
since it is readily available and can save the tracer from extra
PTRACE_GETREGS/PTRACE_GETREGSET calls;
* arch is always initialized to aid with tracing system calls
* such as execve();
* instruction_pointer and stack_pointer are always initialized
so they could be easily obtained for non-syscall stops;
* a boolean is_error field was added along with rval field, this way
the tracer can more reliably distinguish a return value
from an error value.
strace has been ported to PTRACE_GET_SYSCALL_INFO.
Starting with release 4.26, strace uses PTRACE_GET_SYSCALL_INFO API
as the preferred mechanism of obtaining syscall information.
[1] https://lore.kernel.org/lkml/CA+55aFzcSVmdDj9Lh_gdbz1OzHyEm6ZrGPBDAJnywm2LF_eVyg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAObL_7GM0n80N7J_DFw_eQyfLyzq+sf4y2AvsCCV88Tb3AwEHA@mail.gmail.com/
---
Notes:
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Moved syscall_get_arch() specific patches to a separate patchset
which is now merged into audit/next tree.
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7:
* Rebased to v5.0-rc1.
* 5 arch-specific preparatory patches out of 25 have been merged
into v5.0-rc1 via arch trees.
v6:
* Add syscall_get_arguments and syscall_set_arguments wrappers
to asm-generic/syscall.h, requested by Geert.
* Change PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings
into account, use the end of the last field of the structure being written.
* Change struct ptrace_syscall_info:
* remove .frame_pointer field, is is not needed and not portable;
* make .arch field explicitly aligned, remove no longer needed
padding before .arch field;
* remove trailing pads, they are no longer needed.
v5:
* Merge separate series and patches into the single series.
* Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Change struct ptrace_syscall_info: generalize instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Patch all remaining architectures to provide all necessary
syscall_get_* functions.
* Make available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
* Add a test for PTRACE_GET_SYSCALL_INFO to selftests/ptrace.
v4:
* Do not introduce task_struct.ptrace_event,
use child->last_siginfo->si_code instead.
* Implement PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Change struct ptrace_syscall_info.
* Support PTRACE_EVENT_SECCOMP by adding ptrace_event to task_struct.
* Add proper defines for ptrace_syscall_info.op values.
* Rename PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
* and move them to uapi.
v2:
* Do not use task->ptrace.
* Replace entry_info.is_compat with entry_info.arch, use syscall_get_arch().
* Use addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
Dmitry V. Levin (6):
nds32: fix asm/syscall.h # waiting for ack since early January
hexagon: define syscall_get_error() and syscall_get_return_value() # waiting for ack since November
mips: define syscall_get_error() # acked
parisc: define syscall_get_error() # acked
powerpc: define syscall_get_error() # waiting for ack since early December
selftests/ptrace: add a test case for PTRACE_GET_SYSCALL_INFO
Elvira Khabirova (1):
ptrace: add PTRACE_GET_SYSCALL_INFO request # reviewed
arch/hexagon/include/asm/syscall.h | 14 +
arch/mips/include/asm/syscall.h | 6 +
arch/nds32/include/asm/syscall.h | 27 +-
arch/parisc/include/asm/syscall.h | 7 +
arch/powerpc/include/asm/syscall.h | 10 +
include/linux/tracehook.h | 9 +-
include/uapi/linux/ptrace.h | 35 +++
kernel/ptrace.c | 103 ++++++-
tools/testing/selftests/ptrace/.gitignore | 1 +
tools/testing/selftests/ptrace/Makefile | 2 +-
.../selftests/ptrace/get_syscall_info.c | 271 ++++++++++++++++++
11 files changed, 470 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/get_syscall_info.c
--
ldv
^ permalink raw reply
* Re: [PATCH v8 01/16] sched/core: uclamp: Add CPU's clamp buckets refcounting
From: Patrick Bellasi @ 2019-04-08 11:49 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <CAJuCfpFdDK+xYZDq+TDVjRUcF8BTqeXWvF2Wex8n7nB-O5qH1g@mail.gmail.com>
On 06-Apr 16:51, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
[...]
> > + * The first few values calculated by this routine:
> > + * bf(0) = 1
> > + * bf(1) = 1
> > + * bf(2) = 2
> > + * bf(3) = 2
> > + * bf(4) = 3
> > + * ... and so on.
> > + */
> > +#define bits_per(n) \
> > +( \
> > + __builtin_constant_p(n) ? ( \
> > + ((n) == 0 || (n) == 1) ? 1 : ( \
> > + ((n) & (n - 1)) == 0 ? \
>
> missing braces around 'n'
> - ((n) & (n - 1)) == 0 ? \
> + ((n) & ((n) - 1)) == 0 ? \
>
> > + ilog2((n) - 1) + 2 : \
> > + ilog2((n) - 1) + 1 \
>
> Isn't this "((n) & ((n) - 1)) == 0 ? ilog2((n) - 1) + 2 : ilog2((n) -
> 1) + 1" expression equivalent to a simple "ilog2(n) + 1"?
Right, since we already have n=0 and n=1 as special cases, what you
propose should work for all n>=2.
>
> > + ) \
> > + ) : \
> > + __bits_per(n) \
> > +)
> > #endif /* _LINUX_LOG2_H */
[...]
> > +static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
>
> Where are you using uclamp_bucket_base_value()? I would expect its
> usage somewhere inside uclamp_rq_dec_id() when the last task in the
> bucket is dequeued but I don't see it...
This behavior is not move into a dedicated patch, as per Peter
request:
Message-ID: <20190314111849.gx6bl6myfjtaan7r@e110439-lin>
This functions was left here to support a the intialization code in
init_uclamp() but... I notice know I'm doing the initialization in a
different way thus, I'll move it into the following patch.
> > +{
> > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > +}
> > +
[...]
> > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> > + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> > + struct uclamp_bucket *bucket;
> > + unsigned int rq_clamp;
> > +
> > + bucket = &uc_rq->bucket[uc_se->bucket_id];
> > + SCHED_WARN_ON(!bucket->tasks);
> > + if (likely(bucket->tasks))
> > + bucket->tasks--;
> > +
> > + if (likely(bucket->tasks))
>
> Shouldn't you adjust bucket->value if the remaining tasks in the
> bucket have a lower clamp value than the task that was just removed?
No, this is never done. As long as a bucket is not empty/idle we never
reset it to its nominal value. In this patch specifically, the value
is never changed since we moved the "local max tracking" bits into a
dedicated patch.
> > + return;
> > +
> > + rq_clamp = READ_ONCE(uc_rq->value);
> > + /*
> > + * Defensive programming: this should never happen. If it happens,
> > + * e.g. due to future modification, warn and fixup the expected value.
> > + */
> > + SCHED_WARN_ON(bucket->value > rq_clamp);
> > + if (bucket->value >= rq_clamp)
> > + WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> > +}
[...]
> > +static void __init init_uclamp(void)
> > +{
> > + unsigned int clamp_id;
> > + int cpu;
> > +
> > + for_each_possible_cpu(cpu) {
> > + struct uclamp_bucket *bucket;
> > + struct uclamp_rq *uc_rq;
> > + unsigned int bucket_id;
> > +
> > + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
> > +
> > + bucket_id = 1;
> > + while (bucket_id < UCLAMP_BUCKETS) {
> > + bucket = &uc_rq->bucket[bucket_id];
> > + bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
> > + ++bucket_id;
> > + }
> > + }
> > + }
All the initialization code above is not more required after the next
patch introducing "local max tracking".
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> > +
> > + uc_se->value = uclamp_none(clamp_id);
> > + uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
> > + }
> > +}
[...]
--
#include <best/regards.h>
Patrick Bellasi
^ permalink raw reply
* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Florian Weimer @ 2019-04-08 7:49 UTC (permalink / raw)
To: Alexey Dobriyan; +Cc: Peter Zijlstra, mingo, linux-kernel, linux-api
In-Reply-To: <20190406194825.GA5106@avx2>
* Alexey Dobriyan:
>> >> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
>> >> This will make gettting CPU mask require at most 2 system calls
>> >> and will eliminate unnecessary code.
>> >>
>> >> len=0 is chosen so that
>> >> * passing zeroes is the simplest thing
>> >>
>> >> syscall(__NR_sched_getaffinity, 0, 0, NULL)
>> >>
>> >> will simply do the right thing,
>> >>
>> >> * old kernels returned -EINVAL unconditionally.
>> >>
>> >> Note: glibc segfaults upon exiting from system call because it tries to
>> >> clear the rest of the buffer if return value is positive, so
>> >> applications will have to use syscall(3).
>> >> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).
>>
>> Given that old kernels fail with EINVAL, that evidence is fairly
>> restricted.
>>
>> I'm not sure if it's a good idea to overload this interface. I expect
>> that users will want to call sched_getaffinity (the system call wrapper)
>> with cpusetsize == 0 to query the value, so there will be pressure on
>> glibc to remove the memset. At that point we have an API that obscurely
>> fails with old glibc versions, but suceeds with newer ones, which isn't
>> great.
>
> I can do "if (len == 536870912)" so that bit count overflows on old
> kernels into EINVAL and is unlikely to be used ever.
I don't see how this solves this particular issue. It will still result
in a mysterious crash if programs use an updated system call wrapper.
Thanks,
Florian
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-08 2:33 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
Andi Kleen, Tim Chen, Dave Hansen, Arjan van de Ven, aubrey.li,
LKML, Linux API, Alexey Dobriyan, Andrew Morton
In-Reply-To: <CALCETrVxkcn=R+T-9P3CNkEiRwUGmttem52vykBcYhVSYWwh8A@mail.gmail.com>
On 2019/4/8 9:52, Andy Lutomirski wrote:
> On Sun, Apr 7, 2019 at 5:38 PM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>>
>> On 2019/4/8 1:34, Andy Lutomirski wrote:
>>> On Fri, Apr 5, 2019 at 12:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>
>>>> On Sun, 24 Feb 2019, Aubrey Li wrote:
>>>>
>>>>> The architecture specific information of the running processes could
>>>>> be useful to the userland. Add support to examine process architecture
>>>>> specific information externally.
>>>>>
>>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>>> Cc: Andi Kleen <ak@linux.intel.com>
>>>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>>>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>>>>
>>>> This really lacks
>>>>
>>>> Cc: Linux API <linux-api@vger.kernel.org>
>>>> Cc: Alexey Dobriyan <adobriyan@gmail.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>
>>>> Cc'ed now.
>>>>
>>>
>>> I certainly understand why you want to expose this info, but would it
>>> make more sense to instead add an arch_status file in /proc with
>>> architecture-specific info? Or maybe an x86_status field for x86
>>> status, etc.
>>>
>>
>> I tried this, but no other architecture showed interest in arch_status
>> under /proc.
>>
>
> Why is that a problem? It could exist on x86 and not exist on other
> arches until needed.
>
I placed it in tid_base_stuff, under live_patch entry, so it exists for
all arches, is there a better way to do this?
Thanks,
-Aubrey
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Andy Lutomirski @ 2019-04-08 1:52 UTC (permalink / raw)
To: Li, Aubrey
Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
H. Peter Anvin, Andi Kleen, Tim Chen, Dave Hansen,
Arjan van de Ven, aubrey.li, LKML, Linux API, Alexey Dobriyan,
Andrew Morton
In-Reply-To: <c9662fea-faaf-beb3-246a-c60dad6d7420@linux.intel.com>
On Sun, Apr 7, 2019 at 5:38 PM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
>
> On 2019/4/8 1:34, Andy Lutomirski wrote:
> > On Fri, Apr 5, 2019 at 12:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> On Sun, 24 Feb 2019, Aubrey Li wrote:
> >>
> >>> The architecture specific information of the running processes could
> >>> be useful to the userland. Add support to examine process architecture
> >>> specific information externally.
> >>>
> >>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> >>> Cc: Peter Zijlstra <peterz@infradead.org>
> >>> Cc: Andi Kleen <ak@linux.intel.com>
> >>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >>> Cc: Dave Hansen <dave.hansen@intel.com>
> >>> Cc: Arjan van de Ven <arjan@linux.intel.com>
> >>
> >> This really lacks
> >>
> >> Cc: Linux API <linux-api@vger.kernel.org>
> >> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>
> >> Cc'ed now.
> >>
> >
> > I certainly understand why you want to expose this info, but would it
> > make more sense to instead add an arch_status file in /proc with
> > architecture-specific info? Or maybe an x86_status field for x86
> > status, etc.
> >
>
> I tried this, but no other architecture showed interest in arch_status
> under /proc.
>
Why is that a problem? It could exist on x86 and not exist on other
arches until needed.
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-08 0:45 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Thomas Gleixner, mingo, Peter Zijlstra, H. Peter Anvin, ak,
tim.c.chen, dave.hansen, arjan, aubrey.li, LKML, Linux API,
Andrew Morton
In-Reply-To: <20190407154622.GA16004@avx2>
On 2019/4/7 23:46, Alexey Dobriyan wrote:
> On Sun, Apr 07, 2019 at 09:02:38PM +0800, Li, Aubrey wrote:
>> On 2019/4/7 5:41, Alexey Dobriyan wrote:
>>> On Fri, Apr 05, 2019 at 09:32:35PM +0200, Thomas Gleixner wrote:
>>>>> +/* Add support for architecture specific output in /proc/pid/status */
>>>>> +extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
>>> ^^^^^^
>>>
>>> Unnecessary extern.
>>>
>> The linkage is default extern, but with this functions and variables
>> can be treated the same way.
>>
>> Is it mandatory not to use it explicitly? ./script/checkpatch.pl did
>> not report this.
>
> "extern" is only necessary for variables. For prototypes, it is more typing
> and more characters on the screen.
>
> checkpatch.pl doesn't report it because opening floodgates is not the plan.
>
> Yours,
> Extern Police.
>
No problem, Mr. Police, I can remove it in the next version.
Thanks,
-Aubrey
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-08 0:38 UTC (permalink / raw)
To: Andy Lutomirski, Thomas Gleixner
Cc: Ingo Molnar, Peter Zijlstra, H. Peter Anvin, Andi Kleen, Tim Chen,
Dave Hansen, Arjan van de Ven, aubrey.li, LKML, Linux API,
Alexey Dobriyan, Andrew Morton
In-Reply-To: <CALCETrVhpqZiBe6zvK+z6HmD9AhK_O_O=2H=PYzBHid4C=SMYQ@mail.gmail.com>
On 2019/4/8 1:34, Andy Lutomirski wrote:
> On Fri, Apr 5, 2019 at 12:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Sun, 24 Feb 2019, Aubrey Li wrote:
>>
>>> The architecture specific information of the running processes could
>>> be useful to the userland. Add support to examine process architecture
>>> specific information externally.
>>>
>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Andi Kleen <ak@linux.intel.com>
>>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>>
>> This really lacks
>>
>> Cc: Linux API <linux-api@vger.kernel.org>
>> Cc: Alexey Dobriyan <adobriyan@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>
>> Cc'ed now.
>>
>
> I certainly understand why you want to expose this info, but would it
> make more sense to instead add an arch_status file in /proc with
> architecture-specific info? Or maybe an x86_status field for x86
> status, etc.
>
I tried this, but no other architecture showed interest in arch_status
under /proc.
Thanks,
-Aubrey
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Andy Lutomirski @ 2019-04-07 17:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
Andi Kleen, Tim Chen, Dave Hansen, Arjan van de Ven, aubrey.li,
LKML, Linux API, Alexey Dobriyan, Andrew Morton
In-Reply-To: <alpine.DEB.2.21.1904052129371.1802@nanos.tec.linutronix.de>
On Fri, Apr 5, 2019 at 12:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, 24 Feb 2019, Aubrey Li wrote:
>
> > The architecture specific information of the running processes could
> > be useful to the userland. Add support to examine process architecture
> > specific information externally.
> >
> > Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Andi Kleen <ak@linux.intel.com>
> > Cc: Tim Chen <tim.c.chen@linux.intel.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Arjan van de Ven <arjan@linux.intel.com>
>
> This really lacks
>
> Cc: Linux API <linux-api@vger.kernel.org>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
>
> Cc'ed now.
>
I certainly understand why you want to expose this info, but would it
make more sense to instead add an arch_status file in /proc with
architecture-specific info? Or maybe an x86_status field for x86
status, etc.
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Alexey Dobriyan @ 2019-04-07 15:46 UTC (permalink / raw)
To: Li, Aubrey
Cc: Thomas Gleixner, mingo, Peter Zijlstra, H. Peter Anvin, ak,
tim.c.chen, dave.hansen, arjan, aubrey.li, LKML, Linux API,
Andrew Morton
In-Reply-To: <7cd8fa2d-301a-5c96-ce3d-3dc373a5d8f5@linux.intel.com>
On Sun, Apr 07, 2019 at 09:02:38PM +0800, Li, Aubrey wrote:
> On 2019/4/7 5:41, Alexey Dobriyan wrote:
> > On Fri, Apr 05, 2019 at 09:32:35PM +0200, Thomas Gleixner wrote:
> >>> +/* Add support for architecture specific output in /proc/pid/status */
> >>> +extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
> > ^^^^^^
> >
> > Unnecessary extern.
> >
> The linkage is default extern, but with this functions and variables
> can be treated the same way.
>
> Is it mandatory not to use it explicitly? ./script/checkpatch.pl did
> not report this.
"extern" is only necessary for variables. For prototypes, it is more typing
and more characters on the screen.
checkpatch.pl doesn't report it because opening floodgates is not the plan.
Yours,
Extern Police.
^ permalink raw reply
* Re: [PATCH v10 00/18] Introduce the Counter subsystem
From: Jonathan Cameron @ 2019-04-07 14:25 UTC (permalink / raw)
To: William Breathitt Gray
Cc: gregkh, linux-arm-kernel, devicetree, linux-kernel, linux-iio,
fabrice.gasnier, benjamin.gaignard, knaack.h, lars, pmeerw, akpm,
david, robh+dt, mark.rutland, shawnguo, leoyang.li,
daniel.lezcano, tglx, thierry.reding, esben, linux-pwm,
linuxppc-dev, patrick.havelange, linux-api
In-Reply-To: <cover.1554184734.git.vilhelm.gray@gmail.com>
On Tue, 2 Apr 2019 15:30:35 +0900
William Breathitt Gray <vilhelm.gray@gmail.com> wrote:
> Changes in v10:
> - Fix minor typographical errors in documentation
> - Merge the FlexTimer Module Quadrature decoder counter driver patches
>
> This revision is functionally identical to the last; changes in this
> version were made to fix minor typos in the documentation files and also
> to pull in the new FTM quadrature decoder counter driver.
>
> The Generic Counter API has been and is still in a feature freeze until
> it is merged into the mainline. The following features will be
> investigated after the merge: interrupt support for counter devices, and
> a character device interface for low-latency applications.
Hi William / al,
So the question is how to move this forwards? I'm happy with how it turned
out and the existing drivers we had in IIO are a lot cleaner under
the counter subsystem (other than the backwards compatibility for those that
ever existed in IIO). For those not following closely the situation is:
1. Counter drivers never really fitted that well in IIO, because IIO is
focused on an abstraction of individual channels that just doesn't match
to these devices. It's just the wrong model.
2. William tried hard in earlier proposals to extend IIO to support these
devices well, but it became so convoluted and involved I advised him that
we were better off with a separate subsystem. The amount of code overlap
between the core IIO support for counters and the reset of IIO was
become very small and it would have been a maintenance problem for both.
https://lwn.net/Articles/729363/ gives some of the history
3. The new subsystem introduced by this series is fairly simple, clean
and well aligned with the way these devices work. There are (I think)
4 initial drivers in this series from 4 different authors so it's got
some practical review that way!
There are a couple more drivers under development. Right now, not
everyone is aware of this work and so we have had a few developers potentially
waste their time writing IIO drivers (which are then ported to this) rather
that starting with the counter subsystem.
So what we are after is more review, or agreement that we can move this
series forwards. For now the intent is that the counter subsystem will
share the linux-iio mailing list etc but I don't think either William
or I have any particularly strong views on how we actually handle the
patches. I'm more than happy to take them through the IIO tree, if that
works for everyone, particularly Greg as IIO goes through him after me.
Once it is in a release, the cross dependency is broken and we can think
about longer term approaches.
So Greg and others, how do we make progress here? If there are any obvious
reviewers not on the CC list, please do draw their attention to this.
Thanks,
Jonathan
+CC linux-api as obviously one of the biggest areas for review is the new
userspace ABI.
>
> Benjamin Gaignard (2):
> counter: Add STM32 Timer quadrature encoder
> dt-bindings: counter: Document stm32 quadrature encoder
>
> Fabrice Gasnier (2):
> counter: stm32-lptimer: add counter device
> dt-bindings: counter: Adjust dt-bindings for STM32 lptimer move
>
> Patrick Havelange (7):
> include/fsl: add common FlexTimer #defines in a separate header.
> drivers/pwm: pwm-fsl-ftm: use common header for FlexTimer #defines
> drivers/clocksource: timer-fsl-ftm: use common header for FlexTimer
> #defines
> dt-bindings: counter: ftm-quaddec
> counter: add FlexTimer Module Quadrature decoder counter driver
> counter: ftm-quaddec: Documentation: Add specific counter sysfs
> documentation
> LS1021A: dtsi: add ftm quad decoder entries
>
> William Breathitt Gray (7):
> counter: Introduce the Generic Counter interface
> counter: Documentation: Add Generic Counter sysfs documentation
> docs: Add Generic Counter interface documentation
> iio: 104-quad-8: Update license boilerplate
> counter: 104-quad-8: Add Generic Counter interface support
> counter: 104-quad-8: Documentation: Add Generic Counter sysfs
> documentation
> iio: counter: Add deprecation markings for IIO Counter attributes
>
> Documentation/ABI/testing/sysfs-bus-counter | 230 +++
> .../ABI/testing/sysfs-bus-counter-104-quad-8 | 36 +
> .../ABI/testing/sysfs-bus-counter-ftm-quaddec | 16 +
> Documentation/ABI/testing/sysfs-bus-iio | 8 +
> .../testing/sysfs-bus-iio-counter-104-quad-8 | 16 +
> .../bindings/counter/ftm-quaddec.txt | 18 +
> .../{iio => }/counter/stm32-lptimer-cnt.txt | 0
> .../bindings/counter/stm32-timer-cnt.txt | 31 +
> .../devicetree/bindings/mfd/stm32-lptimer.txt | 2 +-
> .../devicetree/bindings/mfd/stm32-timers.txt | 7 +
> Documentation/driver-api/generic-counter.rst | 342 ++++
> Documentation/driver-api/index.rst | 1 +
> MAINTAINERS | 15 +-
> arch/arm/boot/dts/ls1021a.dtsi | 28 +
> drivers/Kconfig | 2 +
> drivers/Makefile | 1 +
> drivers/clocksource/timer-fsl-ftm.c | 15 +-
> drivers/{iio => }/counter/104-quad-8.c | 782 +++++++-
> drivers/counter/Kconfig | 60 +
> drivers/counter/Makefile | 10 +
> drivers/counter/counter.c | 1567 +++++++++++++++++
> drivers/counter/ftm-quaddec.c | 356 ++++
> drivers/{iio => }/counter/stm32-lptimer-cnt.c | 361 +++-
> drivers/counter/stm32-timer-cnt.c | 390 ++++
> drivers/iio/Kconfig | 1 -
> drivers/iio/Makefile | 1 -
> drivers/iio/counter/Kconfig | 34 -
> drivers/iio/counter/Makefile | 8 -
> drivers/pwm/pwm-fsl-ftm.c | 44 +-
> include/linux/counter.h | 510 ++++++
> include/linux/counter_enum.h | 45 +
> include/linux/fsl/ftm.h | 88 +
> 32 files changed, 4877 insertions(+), 148 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-bus-counter
> create mode 100644 Documentation/ABI/testing/sysfs-bus-counter-104-quad-8
> create mode 100644 Documentation/ABI/testing/sysfs-bus-counter-ftm-quaddec
> create mode 100644 Documentation/devicetree/bindings/counter/ftm-quaddec.txt
> rename Documentation/devicetree/bindings/{iio => }/counter/stm32-lptimer-cnt.txt (100%)
> create mode 100644 Documentation/devicetree/bindings/counter/stm32-timer-cnt.txt
> create mode 100644 Documentation/driver-api/generic-counter.rst
> rename drivers/{iio => }/counter/104-quad-8.c (44%)
> create mode 100644 drivers/counter/Kconfig
> create mode 100644 drivers/counter/Makefile
> create mode 100644 drivers/counter/counter.c
> create mode 100644 drivers/counter/ftm-quaddec.c
> rename drivers/{iio => }/counter/stm32-lptimer-cnt.c (51%)
> create mode 100644 drivers/counter/stm32-timer-cnt.c
> delete mode 100644 drivers/iio/counter/Kconfig
> delete mode 100644 drivers/iio/counter/Makefile
> create mode 100644 include/linux/counter.h
> create mode 100644 include/linux/counter_enum.h
> create mode 100644 include/linux/fsl/ftm.h
>
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Li, Aubrey @ 2019-04-07 13:02 UTC (permalink / raw)
To: Alexey Dobriyan, Thomas Gleixner
Cc: mingo, Peter Zijlstra, H. Peter Anvin, ak, tim.c.chen,
dave.hansen, arjan, aubrey.li, LKML, Linux API, Andrew Morton
In-Reply-To: <20190406214143.GA21551@avx2>
On 2019/4/7 5:41, Alexey Dobriyan wrote:
> On Fri, Apr 05, 2019 at 09:32:35PM +0200, Thomas Gleixner wrote:
>>> +/* Add support for architecture specific output in /proc/pid/status */
>>> +extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
> ^^^^^^
>
> Unnecessary extern.
>
The linkage is default extern, but with this functions and variables
can be treated the same way.
Is it mandatory not to use it explicitly? ./script/checkpatch.pl did
not report this.
Thanks,
-Aubrey
^ permalink raw reply
* Re: [PATCH v8 01/16] sched/core: uclamp: Add CPU's clamp buckets refcounting
From: Suren Baghdasaryan @ 2019-04-06 23:51 UTC (permalink / raw)
To: Patrick Bellasi
Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <20190402104153.25404-2-patrick.bellasi@arm.com>
On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> Utilization clamping allows to clamp the CPU's utilization within a
> [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> that CPU. Each task references two "clamp buckets" defining its minimum
> and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> bucket is active if there is at least one RUNNABLE tasks enqueued on
> that CPU and refcounting that bucket.
>
> When a task is {en,de}queued {on,from} a rq, the set of active clamp
> buckets on that CPU can change. If the set of active clamp buckets
> changes for a CPU a new "aggregated" clamp value is computed for that
> CPU. This is because each clamp bucket enforces a different utilization
> clamp value.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This ensures that no task can affect the performance of other
> co-scheduled tasks which are more boosted (i.e. with higher util_min
> clamp) or less capped (i.e. with higher util_max clamp).
>
> A tasks has:
A task has | Tasks have
> task_struct::uclamp[clamp_id]::bucket_id
> to track the "bucket index" of the CPU's clamp bucket it refcounts while
> enqueued, for each clamp index (clamp_id).
>
> A runqueue has:
> rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> to track how many RUNNABLE tasks on that CPU refcount each
> clamp bucket (bucket_id) of a clamp index (clamp_id).
> It also has a:
> rq::uclamp[clamp_id]::bucket[bucket_id].value
> to track the clamp value of each clamp bucket (bucket_id) of a clamp
> index (clamp_id).
>
> The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
> needed to find a new MAX aggregated clamp value for a clamp_id. This
> operation is required only when it's dequeued the last task of a clamp
> bucket tracking the current MAX aggregated clamp value. In this case,
> the CPU is either entering IDLE or going to schedule a less boosted or
> more clamped task.
> The expected number of different clamp values configured at build time
> is small enough to fit the full unordered array into a single cache
> line, for configurations of up to 7 buckets.
>
> Add to struct rq the basic data structures required to refcount the
> number of RUNNABLE tasks for each clamp bucket. Add also the max
> aggregation required to update the rq's clamp value at each
> enqueue/dequeue event.
>
> Use a simple linear mapping of clamp values into clamp buckets.
> Pre-compute and cache bucket_id to avoid integer divisions at
> enqueue/dequeue time.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> ---
> Changes in v8:
> Message-ID: <20190314111849.gx6bl6myfjtaan7r@e110439-lin>
> - remove "bucket local boosting" code and move it into a dedicated
> patch
> Message-ID: <20190313161229.pkib2tmjass5chtb@e110439-lin>
> - refactored uclamp_rq_update() code to make code cleaner
> Message-ID: <20190314122256.7wb3ydswpkfmntvf@e110439-lin>
> - s/uclamp_rq_update/uclamp_rq_max_value/ and move update into caller
> Message-ID: <CAJuCfpEWCcWj=B2SPai2pQt+wcjsAhEfVV1O+H0A+_fqLCnb8Q@mail.gmail.com>
> - update changelog to clarify the configuration fitting in one cache line
> Message-ID: <20190314145456.5qpxchfltfauqaem@e110439-lin>
> - s/uclamp_bucket_value/uclamp_bucket_base_value/
> Message-ID: <20190313113757.aeaksz5akv6y5uep@e110439-lin>
> - update UCLAMP_BUCKET_DELTA to use DIV_ROUND_CLOSEST()
> ---
> include/linux/log2.h | 37 ++++++++
> include/linux/sched.h | 39 ++++++++
> include/linux/sched/topology.h | 6 --
> init/Kconfig | 53 +++++++++++
> kernel/sched/core.c | 160 +++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 51 +++++++++++
> 6 files changed, 340 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/log2.h b/include/linux/log2.h
> index 2af7f77866d0..e2db25734532 100644
> --- a/include/linux/log2.h
> +++ b/include/linux/log2.h
> @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
> ilog2((n) - 1) + 1) : \
> __order_base_2(n) \
> )
> +
> +static inline __attribute__((const))
> +int __bits_per(unsigned long n)
> +{
> + if (n < 2)
> + return 1;
> + if (is_power_of_2(n))
> + return order_base_2(n) + 1;
> + return order_base_2(n);
> +}
> +
> +/**
> + * bits_per - calculate the number of bits required for the argument
> + * @n: parameter
> + *
> + * This is constant-capable and can be used for compile time
> + * initiaizations, e.g bitfields.
s/initiaizations/initializations
> + *
> + * The first few values calculated by this routine:
> + * bf(0) = 1
> + * bf(1) = 1
> + * bf(2) = 2
> + * bf(3) = 2
> + * bf(4) = 3
> + * ... and so on.
> + */
> +#define bits_per(n) \
> +( \
> + __builtin_constant_p(n) ? ( \
> + ((n) == 0 || (n) == 1) ? 1 : ( \
> + ((n) & (n - 1)) == 0 ? \
missing braces around 'n'
- ((n) & (n - 1)) == 0 ? \
+ ((n) & ((n) - 1)) == 0 ? \
> + ilog2((n) - 1) + 2 : \
> + ilog2((n) - 1) + 1 \
Isn't this "((n) & ((n) - 1)) == 0 ? ilog2((n) - 1) + 2 : ilog2((n) -
1) + 1" expression equivalent to a simple "ilog2(n) + 1"?
> + ) \
> + ) : \
> + __bits_per(n) \
> +)
> #endif /* _LINUX_LOG2_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 18696a194e06..0c0dd7aac8e9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -281,6 +281,18 @@ struct vtime {
> u64 gtime;
> };
>
> +/*
> + * Utilization clamp constraints.
> + * @UCLAMP_MIN: Minimum utilization
> + * @UCLAMP_MAX: Maximum utilization
> + * @UCLAMP_CNT: Utilization clamp constraints count
> + */
> +enum uclamp_id {
> + UCLAMP_MIN = 0,
> + UCLAMP_MAX,
> + UCLAMP_CNT
> +};
> +
> struct sched_info {
> #ifdef CONFIG_SCHED_INFO
> /* Cumulative counters: */
> @@ -312,6 +324,10 @@ struct sched_info {
> # define SCHED_FIXEDPOINT_SHIFT 10
> # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
>
> +/* Increase resolution of cpu_capacity calculations */
> +# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> +# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> +
> struct load_weight {
> unsigned long weight;
> u32 inv_weight;
> @@ -560,6 +576,25 @@ struct sched_dl_entity {
> struct hrtimer inactive_timer;
> };
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/* Number of utilization clamp buckets (shorter alias) */
> +#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
> +
> +/*
> + * Utilization clamp for a scheduling entity
> + * @value: clamp value "assigned" to a se
> + * @bucket_id: bucket index corresponding to the "assigned" value
> + *
> + * The bucket_id is the index of the clamp bucket matching the clamp value
> + * which is pre-computed and stored to avoid expensive integer divisions from
> + * the fast path.
> + */
> +struct uclamp_se {
> + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> union rcu_special {
> struct {
> u8 blocked;
> @@ -640,6 +675,10 @@ struct task_struct {
> #endif
> struct sched_dl_entity dl;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + struct uclamp_se uclamp[UCLAMP_CNT];
> +#endif
> +
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> /* List of struct preempt_notifier: */
> struct hlist_head preempt_notifiers;
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 57c7ed3fe465..bb5d77d45b09 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -6,12 +6,6 @@
>
> #include <linux/sched/idle.h>
>
> -/*
> - * Increase resolution of cpu_capacity calculations
> - */
> -#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> -#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> -
> /*
> * sched-domains (multiprocessor balancing) declarations:
> */
> diff --git a/init/Kconfig b/init/Kconfig
> index c9386a365eea..7439cbf4d02e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -651,6 +651,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> config GENERIC_SCHED_CLOCK
> bool
>
> +menu "Scheduler features"
> +
> +config UCLAMP_TASK
> + bool "Enable utilization clamping for RT/FAIR tasks"
> + depends on CPU_FREQ_GOV_SCHEDUTIL
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks scheduled on that CPU.
> +
> + With this option, the user can specify the min and max CPU
> + utilization allowed for RUNNABLE tasks. The max utilization defines
> + the maximum frequency a task should use while the min utilization
> + defines the minimum frequency it should use.
> +
> + Both min and max utilization clamp values are hints to the scheduler,
> + aiming at improving its frequency selection policy, but they do not
> + enforce or grant any specific bandwidth for tasks.
> +
> + If in doubt, say N.
> +
> +config UCLAMP_BUCKETS_COUNT
> + int "Number of supported utilization clamp buckets"
> + range 5 20
> + default 5
> + depends on UCLAMP_TASK
> + help
> + Defines the number of clamp buckets to use. The range of each bucket
> + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> + number of clamp buckets the finer their granularity and the higher
> + the precision of clamping aggregation and tracking at run-time.
> +
> + For example, with the minimum configuration value we will have 5
> + clamp buckets tracking 20% utilization each. A 25% boosted tasks will
> + be refcounted in the [20..39]% bucket and will set the bucket clamp
> + effective value to 25%.
> + If a second 30% boosted task should be co-scheduled on the same CPU,
> + that task will be refcounted in the same bucket of the first task and
> + it will boost the bucket clamp effective value to 30%.
> + The clamp effective value of a bucket is reset to its nominal value
> + (20% in the example above) when there are no more tasks refcounted in
> + that bucket.
> +
> + An additional boost/capping margin can be added to some tasks. In the
> + example above the 25% task will be boosted to 30% until it exits the
> + CPU. If that should be considered not acceptable on certain systems,
> + it's always possible to reduce the margin by increasing the number of
> + clamp buckets to trade off used memory for run-time tracking
> + precision.
> +
> + If in doubt, use the default value.
> +
> +endmenu
> +
> #
> # For architectures that want to enable the support for NUMA-affine scheduler
> # balancing logic:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6b2c055564b5..032211b72110 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -732,6 +732,162 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +
> +/* Integer rounded range for each bucket */
> +#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
> +
> +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> +{
> + return clamp_value / UCLAMP_BUCKET_DELTA;
> +}
> +
> +static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
Where are you using uclamp_bucket_base_value()? I would expect its
usage somewhere inside uclamp_rq_dec_id() when the last task in the
bucket is dequeued but I don't see it...
> +{
> + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> +}
> +
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> + if (clamp_id == UCLAMP_MIN)
> + return 0;
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline
> +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> +{
> + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> + int bucket_id = UCLAMP_BUCKETS - 1;
> +
> + /*
> + * Since both min and max clamps are max aggregated, find the
> + * top most bucket with tasks in.
> + */
> + for ( ; bucket_id >= 0; bucket_id--) {
> + if (!bucket[bucket_id].tasks)
> + continue;
> + return bucket[bucket_id].value;
> + }
> +
> + /* No tasks -- default clamp values */
> + return uclamp_none(clamp_id);
> +}
> +
> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is refcounted on that rq. This also immediately
> + * updates the rq's clamp value if required.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> + struct uclamp_bucket *bucket;
> +
> + bucket = &uc_rq->bucket[uc_se->bucket_id];
> + bucket->tasks++;
> +
> + if (uc_se->value > READ_ONCE(uc_rq->value))
> + WRITE_ONCE(uc_rq->value, bucket->value);
> +}
> +
> +/*
> + * When a task is dequeued from a rq, the clamp bucket refcounted by the task
> + * is released. If this is the last task reference counting the rq's max
> + * active clamp value, then the rq's clamp value is updated.
> + *
> + * Both refcounted tasks and rq's cached clamp values are expected to be
> + * always valid. If it's detected they are not, as defensive programming,
> + * enforce the expected state and warn.
> + */
> +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> + struct uclamp_bucket *bucket;
> + unsigned int rq_clamp;
> +
> + bucket = &uc_rq->bucket[uc_se->bucket_id];
> + SCHED_WARN_ON(!bucket->tasks);
> + if (likely(bucket->tasks))
> + bucket->tasks--;
> +
> + if (likely(bucket->tasks))
Shouldn't you adjust bucket->value if the remaining tasks in the
bucket have a lower clamp value than the task that was just removed?
> + return;
> +
> + rq_clamp = READ_ONCE(uc_rq->value);
> + /*
> + * Defensive programming: this should never happen. If it happens,
> + * e.g. due to future modification, warn and fixup the expected value.
> + */
> + SCHED_WARN_ON(bucket->value > rq_clamp);
> + if (bucket->value >= rq_clamp)
> + WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> +}
> +
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_rq_inc_id(p, rq, clamp_id);
> +}
> +
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_rq_dec_id(p, rq, clamp_id);
> +}
> +
> +static void __init init_uclamp(void)
> +{
> + unsigned int clamp_id;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct uclamp_bucket *bucket;
> + struct uclamp_rq *uc_rq;
> + unsigned int bucket_id;
> +
> + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
> +
> + bucket_id = 1;
> + while (bucket_id < UCLAMP_BUCKETS) {
> + bucket = &uc_rq->bucket[bucket_id];
> + bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
> + ++bucket_id;
> + }
> + }
> + }
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> +
> + uc_se->value = uclamp_none(clamp_id);
> + uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
> + }
> +}
> +
> +#else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline void init_uclamp(void) { }
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (!(flags & ENQUEUE_NOCLOCK))
> @@ -742,6 +898,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_enqueue(p, flags & ENQUEUE_WAKEUP);
> }
>
> + uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -755,6 +912,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
>
> + uclamp_rq_dec(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }
>
> @@ -6088,6 +6246,8 @@ void __init sched_init(void)
>
> psi_init();
>
> + init_uclamp();
> +
> scheduler_running = 1;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 71208b67e58a..c3d1ae1e7eec 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> #endif
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * struct uclamp_bucket - Utilization clamp bucket
> + * @value: utilization clamp value for tasks on this clamp bucket
> + * @tasks: number of RUNNABLE tasks on this clamp bucket
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_bucket {
> + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> +};
> +
> +/*
> + * struct uclamp_rq - rq's utilization clamp
> + * @value: currently active clamp values for a rq
> + * @bucket: utilization clamp buckets affecting a rq
> + *
> + * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
> + * A clamp value is affecting a rq when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * There are up to UCLAMP_CNT possible different clamp values, currently there
> + * are only two: minmum utilization and maximum utilization.
s/minmum/minimum
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + * utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + * maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
> + * the metrics required to compute all the per-rq utilization clamp values.
> + */
> +struct uclamp_rq {
> + unsigned int value;
> + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -835,6 +877,11 @@ struct rq {
> unsigned long nr_load_updates;
> u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> + struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> +#endif
> +
> struct cfs_rq cfs;
> struct rt_rq rt;
> struct dl_rq dl;
> @@ -1649,6 +1696,10 @@ extern const u32 sched_prio_to_wmult[40];
> struct sched_class {
> const struct sched_class *next;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + int uclamp_enabled;
> +#endif
> +
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*yield_task) (struct rq *rq);
> --
> 2.20.1
>
^ permalink raw reply
* Re: [PATCH v13 1/3] /proc/pid/status: Add support for architecture specific output
From: Alexey Dobriyan @ 2019-04-06 21:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Aubrey Li, mingo, Peter Zijlstra, H. Peter Anvin, ak, tim.c.chen,
dave.hansen, arjan, aubrey.li, LKML, Linux API, Andrew Morton
In-Reply-To: <alpine.DEB.2.21.1904052129371.1802@nanos.tec.linutronix.de>
On Fri, Apr 05, 2019 at 09:32:35PM +0200, Thomas Gleixner wrote:
> > +/* Add support for architecture specific output in /proc/pid/status */
> > +extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
^^^^^^
Unnecessary extern.
^ permalink raw reply
* Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs
From: Alexey Dobriyan @ 2019-04-06 19:48 UTC (permalink / raw)
To: Florian Weimer; +Cc: Peter Zijlstra, mingo, linux-kernel, linux-api
In-Reply-To: <87wok83gfs.fsf@oldenburg2.str.redhat.com>
On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:
> * Peter Zijlstra:
>
> > On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> >> Currently there is no easy way to get the number of CPUs on the system.
>
> The size of the affinity mask is only related to the number of CPUs in
> the system in such a way that the number of CPUs cannot be larger than
> the number of bits in the affinity mask.
>
> >> Glibc in particular shipped with 1024 CPUs support maximum at some point
> >> which is quite surprising as glibc maitainers should know better.
>
> This dates back to a time when the kernel was never going to support
> more than 1024 CPUs.
>
> A lot of distribution kernels still enforce a hard limit, which papers
> over firmware bugs which tell the kernel that the system can be
> hot-plugged to a ridiculous number of sockets/CPUs.
>
> >> Another group dynamically grow buffer until cpumask fits. This is
> >> inefficient as multiple system calls are done.
> >>
> >> Nobody seems to parse "/sys/devices/system/cpu/possible".
> >> Even if someone does, parsing sysfs is much slower than necessary.
> >
> > True; but I suppose glibc already does lots of that anyway, right? It
> > does contain the right information.
>
> If I recall correctly my last investigation,
> /sys/devices/system/cpu/possible does not reflect the size of the
> affinity mask, either.
>
> >> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
> >> This will make gettting CPU mask require at most 2 system calls
> >> and will eliminate unnecessary code.
> >>
> >> len=0 is chosen so that
> >> * passing zeroes is the simplest thing
> >>
> >> syscall(__NR_sched_getaffinity, 0, 0, NULL)
> >>
> >> will simply do the right thing,
> >>
> >> * old kernels returned -EINVAL unconditionally.
> >>
> >> Note: glibc segfaults upon exiting from system call because it tries to
> >> clear the rest of the buffer if return value is positive, so
> >> applications will have to use syscall(3).
> >> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).
>
> Given that old kernels fail with EINVAL, that evidence is fairly
> restricted.
>
> I'm not sure if it's a good idea to overload this interface. I expect
> that users will want to call sched_getaffinity (the system call wrapper)
> with cpusetsize == 0 to query the value, so there will be pressure on
> glibc to remove the memset. At that point we have an API that obscurely
> fails with old glibc versions, but suceeds with newer ones, which isn't
> great.
I can do "if (len == 536870912)" so that bit count overflows on old
kernels into EINVAL and is unlikely to be used ever.
^ permalink raw reply
* Re: [PATCH v13 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
From: Li, Aubrey @ 2019-04-06 15:41 UTC (permalink / raw)
To: Jann Horn
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H . Peter Anvin,
Andi Kleen, tim.c.chen, Dave Hansen, Arjan van de Ven, aubrey.li,
kernel list, Linux API, Alexey Dobriyan, Andrew Morton
In-Reply-To: <CAG48ez2hJvjRD-immEJ9uJAY1WRth3pMf0Qz8bApRke05kC+2Q@mail.gmail.com>
On 2019/4/6 4:27, Jann Horn wrote:
> On Fri, Apr 5, 2019 at 10:02 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
>> AVX-512 components use could cause core turbo frequency drop. So
>> it's useful to expose AVX-512 usage elapsed time as a heuristic hint
>> for the user space job scheduler to cluster the AVX-512 using tasks
>> together.
>>
>> Tensorflow example:
>> $ while [ 1 ]; do cat /proc/pid/status | grep AVX; sleep 1; done
>> AVX512_elapsed_ms: 4
>> AVX512_elapsed_ms: 8
>> AVX512_elapsed_ms: 4
>>
>> This means that 4 milliseconds have elapsed since the AVX512 usage
>> of tensorflow task was detected when the task was scheduled out.
>>
>> Or:
>> $ cat /proc/pid/status | grep AVX512_elapsed_ms
>> AVX512_elapsed_ms: -1
>
> (Very nitpicky, feel free to ignore: If you change the /proc/pid to
> /proc/tid in the commit message, it becomes clearer that this status
> is really per-task/thread, not per-process/threadgroup.)
Thanks, I'll refine.
>
> [...]
>> +
>> +/*
>> + * Report the amount of time elapsed in millisecond since last AVX512
>> + * use in the task.
>> + */
>> +static void avx512_status(struct seq_file *m, struct task_struct *task)
>> +{
>> + unsigned long timestamp = task->thread.fpu.avx512_timestamp;
>
> This is theoretically a data race, right? Should this have a READ_ONCE() on it?
Thanks, I'll refine.
>
> Is there something that zeroes out the avx512_timestamp on
> fork()/clone(), or will every child inherit the avx512 timestamp? As
> far as I can tell, the timestamp is inherited; I think it would be
> nicer to zero it out at that point. Either way, It might be worth
> documenting this decision.
>
This timestamp is not inherited, see below:
_do_fork()
->copy_process()
-->dup_task_struct()
--->arch_dup_task_struct()
---->fpu__copy()
Thanks,
-Aubrey
^ permalink raw reply
* [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Alexey Gladkov @ 2019-04-06 12:14 UTC (permalink / raw)
To: Masahiro Yamada
Cc: linux-kernel, linux-kbuild, linux-api, linux-modules,
Kirill A . Shutemov, Gleb Fotengauer-Malinovskiy, Dmitry V. Levin,
Michal Marek, Dmitry Torokhov, Rusty Russell, Jessica Yu,
Lucas De Marchi
Problem:
When a kernel module is compiled as a separate module, some important
information about the kernel module is available via .modinfo section of
the module. In contrast, when the kernel module is compiled into the
kernel, that information is not available.
Information about built-in modules is necessary in the following cases:
1. When it is necessary to find out what additional parameters can be
passed to the kernel at boot time.
2. When you need to know which module names and their aliases are in
the kernel. This is very useful for creating an initrd image.
Proposal:
The proposed patch does not remove .modinfo section with module
information from the vmlinux at the build time and saves it into a
separate file after kernel linking. So, the kernel does not increase in
size and no additional information remains in it. Information is stored
in the same format as in the separate modules (null-terminated string
array). Because the .modinfo section is already exported with a separate
modules, we are not creating a new API.
It can be easily read in the userspace:
$ tr '\0' '\n' < kernel.builtin
ext4.softdep=pre: crc32c
ext4.license=GPL
ext4.description=Fourth Extended Filesystem
ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
ext4.alias=fs-ext4
ext4.alias=ext3
ext4.alias=fs-ext3
ext4.alias=ext2
ext4.alias=fs-ext2
md_mod.alias=block-major-9-*
md_mod.alias=md
md_mod.description=MD RAID framework
md_mod.license=GPL
md_mod.parmtype=create_on_open:bool
md_mod.parmtype=start_dirty_degraded:int
...
v2:
* Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
* Rename output file to kernel.builtin;
* Add MODULE_VERSION to modinfo that is saved to the kernel.builtin;
* Fix build warnings on powerpc.
Co-Developed-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
.gitignore | 1 +
Makefile | 2 ++
include/asm-generic/vmlinux.lds.h | 1 +
include/linux/module.h | 1 +
include/linux/moduleparam.h | 12 +++++-------
scripts/link-vmlinux.sh | 4 ++++
6 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/.gitignore b/.gitignore
index a20ac26aa2f5..432332fd745e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -45,6 +45,7 @@
*.xz
Module.symvers
modules.builtin
+kernel.builtin
#
# Top-level generic files
diff --git a/Makefile b/Makefile
index d5713e7b1e50..d9dc6211fbc7 100644
--- a/Makefile
+++ b/Makefile
@@ -1288,6 +1288,7 @@ _modinst_:
fi
@cp -f $(objtree)/modules.order $(MODLIB)/
@cp -f $(objtree)/modules.builtin $(MODLIB)/
+ @cp -f $(objtree)/kernel.builtin $(MODLIB)/
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modinst
# This depmod is only for convenience to give the initial
@@ -1328,6 +1329,7 @@ endif # CONFIG_MODULES
# Directories & files removed with 'make clean'
CLEAN_DIRS += $(MODVERDIR) include/ksym
+CLEAN_FILES += kernel.builtin
# Directories & files removed with 'make mrproper'
MRPROPER_DIRS += include/config usr/include include/generated \
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..44c724bf7d3a 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -844,6 +844,7 @@
EXIT_CALL \
*(.discard) \
*(.discard.*) \
+ *(.modinfo) \
}
/**
diff --git a/include/linux/module.h b/include/linux/module.h
index f5bc4c046461..1cae28b1172a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -237,6 +237,7 @@ extern typeof(name) __mod_##type##__##name##_device_table \
#define MODULE_VERSION(_version) MODULE_INFO(version, _version)
#else
#define MODULE_VERSION(_version) \
+ MODULE_INFO(version, _version); \
static struct module_version_attribute ___modver_attr = { \
.mattr = { \
.attr = { \
diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index ba36506db4fb..5ba250d9172a 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -10,23 +10,21 @@
module name. */
#ifdef MODULE
#define MODULE_PARAM_PREFIX /* empty */
+#define __MODULE_INFO_PREFIX /* empty */
#else
#define MODULE_PARAM_PREFIX KBUILD_MODNAME "."
+/* We cannot use MODULE_PARAM_PREFIX because some modules override it. */
+#define __MODULE_INFO_PREFIX KBUILD_MODNAME "."
#endif
/* Chosen so that structs with an unsigned long line up. */
#define MAX_PARAM_PREFIX_LEN (64 - sizeof(unsigned long))
-#ifdef MODULE
#define __MODULE_INFO(tag, name, info) \
static const char __UNIQUE_ID(name)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
- = __stringify(tag) "=" info
-#else /* !MODULE */
-/* This struct is here for syntactic coherency, it is not used */
-#define __MODULE_INFO(tag, name, info) \
- struct __UNIQUE_ID(name) {}
-#endif
+ = __MODULE_INFO_PREFIX __stringify(tag) "=" info
+
#define __MODULE_PARM_TYPE(name, _type) \
__MODULE_INFO(parmtype, name##type, #name ":" _type)
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index c8cf45362bd6..b914e026ef28 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -226,6 +226,10 @@ modpost_link vmlinux.o
# modpost vmlinux.o to check for section mismatches
${MAKE} -f "${srctree}/scripts/Makefile.modpost" vmlinux.o
+info MODINFO kernel.builtin
+"${OBJCOPY}" -j .modinfo -O binary vmlinux.o kernel.builtin
+chmod 444 kernel.builtin
+
kallsymso=""
kallsyms_vmlinux=""
if [ -n "${CONFIG_KALLSYMS}" ]; then
--
2.21.0
^ permalink raw reply related
* Re: [PATCH v13 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
From: Jann Horn @ 2019-04-05 20:27 UTC (permalink / raw)
To: Aubrey Li
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, H . Peter Anvin,
Andi Kleen, tim.c.chen, Dave Hansen, Arjan van de Ven, aubrey.li,
kernel list, Linux API, Alexey Dobriyan, Andrew Morton
In-Reply-To: <20190224044400.34975-2-aubrey.li@linux.intel.com>
On Fri, Apr 5, 2019 at 10:02 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
> AVX-512 components use could cause core turbo frequency drop. So
> it's useful to expose AVX-512 usage elapsed time as a heuristic hint
> for the user space job scheduler to cluster the AVX-512 using tasks
> together.
>
> Tensorflow example:
> $ while [ 1 ]; do cat /proc/pid/status | grep AVX; sleep 1; done
> AVX512_elapsed_ms: 4
> AVX512_elapsed_ms: 8
> AVX512_elapsed_ms: 4
>
> This means that 4 milliseconds have elapsed since the AVX512 usage
> of tensorflow task was detected when the task was scheduled out.
>
> Or:
> $ cat /proc/pid/status | grep AVX512_elapsed_ms
> AVX512_elapsed_ms: -1
(Very nitpicky, feel free to ignore: If you change the /proc/pid to
/proc/tid in the commit message, it becomes clearer that this status
is really per-task/thread, not per-process/threadgroup.)
[...]
> +
> +/*
> + * Report the amount of time elapsed in millisecond since last AVX512
> + * use in the task.
> + */
> +static void avx512_status(struct seq_file *m, struct task_struct *task)
> +{
> + unsigned long timestamp = task->thread.fpu.avx512_timestamp;
This is theoretically a data race, right? Should this have a READ_ONCE() on it?
Is there something that zeroes out the avx512_timestamp on
fork()/clone(), or will every child inherit the avx512 timestamp? As
far as I can tell, the timestamp is inherited; I think it would be
nicer to zero it out at that point. Either way, It might be worth
documenting this decision.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox