From: KaiGai Kohei <kaigai@ak.jp.nec.com>
To: linux-kernel@vger.kernel.org
Cc: akpm@osdl.org
Subject: Add pacct_struct to fix some pacct bugs.
Date: Tue, 20 Jun 2006 15:24:59 +0900 [thread overview]
Message-ID: <449794BB.8010108@ak.jp.nec.com> (raw)
Hi, I noticed three problems in pacct facility.
1. Pacct facility has a possibility to write incorrect ac_flag
in multi-threading cases.
2. There is a possibility to be waken up OOM Killer from
pacct facility. It will cause system stall.
3. If several threads are killed at same time, There is
a possibility not to pick up a part of those accountings.
The attached patch will resolve those matters.
Any comments please. Thanks,
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
[The details of those matters]
(1) about incorrect ac_flag
When pacct facility generate an 'ac_flag' field in accounting record,
it refers a task_struct of the thread which died last in the process.
But any other task_structs are ignored.
Therefore, pacct facility drops ASU flag even if root-privilege
operations are used by any other threads except the last one.
In addition, AFORK flag is always set when the thread of group-leader
didn't die last, although this process has called execve() after fork().
---- in kernel/acct.c : do_acct_process() ----
ac.ac_flag = 0;
if (current->flags & PF_FORKNOEXEC)
ac.ac_flag |= AFORK;
if (current->flags & PF_SUPERPRIV)
ac.ac_flag |= ASU;
if (current->flags & PF_DUMPCORE)
ac.ac_flag |= ACORE;
if (current->flags & PF_SIGNALED)
ac.ac_flag |= AXSIG;
----------------------------------------------
I think ASU, ACORE, AXSIG flag should be set if any task_structs
satisfy the conditions. And, AFORK flag should not be checked
for any threads except the group leader.
(2) about OOM killer
The pacct facility need an i/o operation when an accounting record
is generated. There is a possibility to wake OOM killer up.
If OOM killer is activated, it kills some processes to make them
release process memory regions.
But acct_process() is called in the killed processes context
before calling exit_mm(), so those processes cannot release
own memory. In the results, any processes stop in this point and
it finally cause a system stall.
---- in kernel/exit.c : do_exit() ------------
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
acct_process(code);
}
:
- snip -
:
exit_mm(tsk);
----------------------------------------------
I think acct_process() should be called after exit_mm() to avoid
this matter. As mm_struct is necessary to calculate ac_mem field,
it should be kept before exit_mm().
(3) about race condition in pacct
In current 2.6.17 implementation, signal_struct refered from task_struct
is used for per-process data structure. The pacct facility also uses it
as a per-process data structure to save stime, utime, minflt, majflt.
But those members are saved in __exit_signal(). It's too late.
For example, if some threads exits at same time, pacct facility has
a possibility to drop accountings for a part of those threads.
(see, the following 'The results of original 2.6.17 kernel')
I think accounting information should be completely collected into
the per-process data structure before writing out an accounting record.
[The solution for those matters]
a. pacct_struct is newly defined, and it's deployed into signal_struct
as a per-process accounting storage.
b. acct_collect() is newly added to collect any thread's accountings
before generating accounting record and calling exit_mm().
c. Calling acct_process() is moved after exit_mm().
d. do_acct_process() becomes to refer the per-process accounting
structure, not task_strcut of the last thread.
[The results of original 2.6.17 kernel]
* test for ac_flag
# time -p ./bugacct
real 10.11
user 5.94
sys 0.17
# touch /tmp/acct-`uname -r`
# accton /tmp/acct-`uname -r`
# time -p ./bugacct
real 10.04
user 5.86
sys 0.17
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
F--- 06/20-14:31:47 1003 586 16 140992 32776 0 bugacct
(*) 'F' means it didn't execve() after fork().
'P' means it used root-privilege operations.
=> original 2.6.17 drops 'P' flag, and appends undesired 'F' flag.
* test for the race condition
# time -p ./raceacct 4
real 2.59
user 7.56
sys 0.00
# time -p ./raceacct 4
real 2.27
user 6.62
sys 0.00
# time -p ./raceacct 4
real 2.16
user 6.16
sys 0.00
# time -p ./raceacct 4
real 2.87
user 8.36
sys 0.00
# time -p ./raceacct 4
real 2.77
user 8.11
sys 0.00
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
F--- 06/20-11:18:57 259 596 0 28528 38 0 raceacct *
---- 06/20-11:19:14 227 662 0 28528 171 0 raceacct
F--- 06/20-11:19:18 216 509 0 28528 38 0 raceacct *
F--- 06/20-11:19:22 287 651 0 28528 39 0 raceacct *
---- 06/20-11:19:25 276 811 0 28528 170 0 raceacct
(*) = UTIME does not match with the result of 'time -p'
[The results of patched 2.6.17-kg kernel]
* test for ac_flag
# time -p ./bugacct
real 10.07
user 5.88
sys 0.17
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
-P-- 06/20-14:08:06 1007 588 16 140992 32944 0 bugacct
* test for the race condition
# time -p ./raceacct 4
real 2.55
user 7.48
sys 0.00
# time -p ./raceacct 4
real 2.59
user 7.58
sys 0.00
# time -p ./raceacct 4
real 2.64
user 7.75
sys 0.00
# time -p ./raceacct 4
real 2.29
user 6.71
sys 0.00
# time -p ./raceacct 4
real 2.68
user 7.83
sys 0.00
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
---- 06/20-11:09:33 255 748 0 28528 171 0 raceacct
---- 06/20-11:09:37 259 758 0 28528 171 0 raceacct
---- 06/20-11:09:41 264 775 0 28528 170 0 raceacct
---- 06/20-11:09:44 229 671 0 28528 170 0 raceacct
---- 06/20-11:09:57 268 783 0 28528 170 0 raceacct
[The test program : raceacct.c]
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void *worker(void *dummy) {
int i, *flag = dummy;
for (i=0; i < 100000000 && *flag; i++);
*flag = 0;
pthread_exit(NULL);
}
int main(int argc, char *argv[]) {
pthread_t pthread;
int i, num = 0, flag = 1;
if (argc > 2) {
fprintf(stderr, "argument too large\n");
return 1;
} else if (argc == 2) {
num = atoi(argv[1]);
}
for (i=1; i < num; i++)
pthread_create(&pthread, NULL, worker, (void *)&flag);
worker((void *)&flag);
}
[The test program : bugacct.c]
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>
#define BUFFER_SIZE (128 * 1024 * 1024) /* 128MB */
void *mychild(void *buffer) {
struct timeval tv;
u_int64_t t1, t2;
gettimeofday(&tv, NULL);
t1 = tv.tv_sec * 1000000 + tv.tv_usec;
t2 = t1 + 6 * 1000000;
/* heavy CPU/Mem job */
srand(tv.tv_usec);
while(t1 < t2) {
memset(buffer, rand(), BUFFER_SIZE);
gettimeofday(&tv, NULL);
t1 = tv.tv_sec * 1000000 + tv.tv_usec;
}
return NULL;
}
int main(int argc, char *argv[]) {
pthread_t pthread;
void *buffer = NULL;
buffer = malloc(BUFFER_SIZE);
if (!buffer)
return 1;
sleep(4);
pthread_create(&pthread, NULL, mychild, buffer);
nice(-1);
sleep(1);
pthread_exit(0);
}
[The patch against to 2.6.17]
--- linux-2.6.17/include/linux/acct.h 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/include/linux/acct.h 2006-06-19 17:04:11.000000000 +0900
@@ -123,13 +123,17 @@ struct vfsmount;
struct super_block;
extern void acct_auto_close_mnt(struct vfsmount *m);
extern void acct_auto_close(struct super_block *sb);
-extern void acct_process(long exitcode);
+extern void acct_init_pacct(struct pacct_struct *pacct, struct task_struct *tsk);
+extern void acct_collect(long exitcode, int group_dead);
+extern void acct_process(void);
extern void acct_update_integrals(struct task_struct *tsk);
extern void acct_clear_integrals(struct task_struct *tsk);
#else
#define acct_auto_close_mnt(x) do { } while (0)
#define acct_auto_close(x) do { } while (0)
-#define acct_process(x) do { } while (0)
+#define acct_init_pacct(x,y) do { } while (0)
+#define acct_collect(x,y) do { } while (0)
+#define acct_process() do { } while (0)
#define acct_update_integrals(x) do { } while (0)
#define acct_clear_integrals(task) do { } while (0)
#endif
--- linux-2.6.17/include/linux/sched.h 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/include/linux/sched.h 2006-06-19 16:59:16.000000000 +0900
@@ -358,6 +358,15 @@ struct sighand_struct {
spinlock_t siglock;
};
+struct pacct_struct {
+ int ac_flag;
+ long ac_exitcode;
+ cputime_t ac_utime, ac_stime;
+ unsigned long ac_mem;
+ unsigned long ac_minflt, ac_majflt;
+ struct timespec ac_start_time;
+};
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -449,6 +458,9 @@ struct signal_struct {
struct key *session_keyring; /* keyring inherited over fork */
struct key *process_keyring; /* keyring private to this process */
#endif
+#ifdef CONFIG_BSD_PROCESS_ACCT
+ struct pacct_struct pacct; /* per-process accounting information */
+#endif
};
/* Context switch must be unlocked if interrupts are to be enabled */
--- linux-2.6.17/kernel/acct.c 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/kernel/acct.c 2006-06-20 14:55:34.000000000 +0900
@@ -75,7 +75,7 @@ int acct_parm[3] = {4, 2, 30};
/*
* External references and all of the globals.
*/
-static void do_acct_process(long, struct file *);
+static void do_acct_process(struct file *);
/*
* This structure is used so that all the data protected by lock
@@ -196,7 +196,7 @@ static void acct_file_reopen(struct file
if (old_acct) {
mnt_unpin(old_acct->f_vfsmnt);
spin_unlock(&acct_globals.lock);
- do_acct_process(0, old_acct);
+ do_acct_process(old_acct);
filp_close(old_acct, NULL);
spin_lock(&acct_globals.lock);
}
@@ -419,16 +419,16 @@ static u32 encode_float(u64 value)
/*
* do_acct_process does all actual work. Caller holds the reference to file.
*/
-static void do_acct_process(long exitcode, struct file *file)
+static void do_acct_process(struct file *file)
{
+ struct signal_struct *sig = current->signal;
+ struct pacct_struct *pacct = &sig->pacct;
acct_t ac;
mm_segment_t fs;
- unsigned long vsize;
unsigned long flim;
u64 elapsed;
u64 run_time;
struct timespec uptime;
- unsigned long jiffies;
/*
* First check to see if there is enough free_space to continue
@@ -449,8 +449,8 @@ static void do_acct_process(long exitcod
/* calculate run_time in nsec*/
do_posix_clock_monotonic_gettime(&uptime);
run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec;
- run_time -= (u64)current->group_leader->start_time.tv_sec * NSEC_PER_SEC
- + current->group_leader->start_time.tv_nsec;
+ run_time -= (u64)pacct->ac_start_time.tv_sec * NSEC_PER_SEC
+ + pacct->ac_start_time.tv_nsec;
/* convert nsec -> AHZ */
elapsed = nsec_to_AHZ(run_time);
#if ACCT_VERSION==3
@@ -469,12 +469,8 @@ static void do_acct_process(long exitcod
#endif
do_div(elapsed, AHZ);
ac.ac_btime = xtime.tv_sec - elapsed;
- jiffies = cputime_to_jiffies(cputime_add(current->utime,
- current->signal->utime));
- ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies));
- jiffies = cputime_to_jiffies(cputime_add(current->stime,
- current->signal->stime));
- ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies));
+ ac.ac_utime = encode_comp_t(jiffies_to_AHZ(cputime_to_jiffies(pacct->ac_utime)));
+ ac.ac_stime = encode_comp_t(jiffies_to_AHZ(cputime_to_jiffies(pacct->ac_stime)));
/* we really need to bite the bullet and change layout */
ac.ac_uid = current->uid;
ac.ac_gid = current->gid;
@@ -496,37 +492,14 @@ static void do_acct_process(long exitcod
old_encode_dev(tty_devnum(current->signal->tty)) : 0;
read_unlock(&tasklist_lock);
- ac.ac_flag = 0;
- if (current->flags & PF_FORKNOEXEC)
- ac.ac_flag |= AFORK;
- if (current->flags & PF_SUPERPRIV)
- ac.ac_flag |= ASU;
- if (current->flags & PF_DUMPCORE)
- ac.ac_flag |= ACORE;
- if (current->flags & PF_SIGNALED)
- ac.ac_flag |= AXSIG;
-
- vsize = 0;
- if (current->mm) {
- struct vm_area_struct *vma;
- down_read(¤t->mm->mmap_sem);
- vma = current->mm->mmap;
- while (vma) {
- vsize += vma->vm_end - vma->vm_start;
- vma = vma->vm_next;
- }
- up_read(¤t->mm->mmap_sem);
- }
- vsize = vsize / 1024;
- ac.ac_mem = encode_comp_t(vsize);
+ ac.ac_flag = pacct->ac_flag;
+ ac.ac_mem = encode_comp_t(pacct->ac_mem);
ac.ac_io = encode_comp_t(0 /* current->io_usage */); /* %% */
ac.ac_rw = encode_comp_t(ac.ac_io / 1024);
- ac.ac_minflt = encode_comp_t(current->signal->min_flt +
- current->min_flt);
- ac.ac_majflt = encode_comp_t(current->signal->maj_flt +
- current->maj_flt);
+ ac.ac_minflt = encode_comp_t(pacct->ac_minflt);
+ ac.ac_majflt = encode_comp_t(pacct->ac_majflt);
ac.ac_swaps = encode_comp_t(0);
- ac.ac_exitcode = exitcode;
+ ac.ac_exitcode = pacct->ac_exitcode;
/*
* Kernel segment override to datasegment and write it
@@ -545,13 +518,56 @@ static void do_acct_process(long exitcod
set_fs(fs);
}
+void acct_init_pacct(struct pacct_struct *pacct, struct task_struct *tsk)
+{
+ memset(pacct, 0, sizeof(struct pacct_struct));
+ pacct->ac_utime = pacct->ac_stime = cputime_zero;
+}
+
+void acct_collect(long exitcode, int group_dead)
+{
+ struct pacct_struct *pacct = ¤t->signal->pacct;
+ unsigned long vsize = 0;
+
+ if (group_dead && current->mm) {
+ struct vm_area_struct *vma;
+ down_read(¤t->mm->mmap_sem);
+ vma = current->mm->mmap;
+ while (vma) {
+ vsize += vma->vm_end - vma->vm_start;
+ vma = vma->vm_next;
+ }
+ up_read(¤t->mm->mmap_sem);
+ }
+ spin_lock(¤t->sighand->siglock);
+ if (group_dead)
+ pacct->ac_mem = vsize / 1024;
+ if (thread_group_leader(current)) {
+ pacct->ac_start_time = current->start_time;
+ pacct->ac_exitcode = exitcode;
+ if (current->flags & PF_FORKNOEXEC)
+ pacct->ac_flag |= AFORK;
+ }
+ if (current->flags & PF_SUPERPRIV)
+ pacct->ac_flag |= ASU;
+ if (current->flags & PF_DUMPCORE)
+ pacct->ac_flag |= ACORE;
+ if (current->flags & PF_SIGNALED)
+ pacct->ac_flag |= AXSIG;
+ pacct->ac_utime = cputime_add(pacct->ac_utime, current->utime);
+ pacct->ac_stime = cputime_add(pacct->ac_stime, current->stime);
+ pacct->ac_minflt += current->min_flt;
+ pacct->ac_majflt += current->maj_flt;
+ spin_unlock(¤t->sighand->siglock);
+}
+
/**
* acct_process - now just a wrapper around do_acct_process
* @exitcode: task exit code
*
* handles process accounting for an exiting task
*/
-void acct_process(long exitcode)
+void acct_process()
{
struct file *file = NULL;
@@ -570,11 +586,10 @@ void acct_process(long exitcode)
get_file(file);
spin_unlock(&acct_globals.lock);
- do_acct_process(exitcode, file);
+ do_acct_process(file);
fput(file);
}
-
/**
* acct_update_integrals - update mm integral fields in task_struct
* @tsk: task_struct for accounting
--- linux-2.6.17/kernel/fork.c 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/kernel/fork.c 2006-06-19 16:16:30.000000000 +0900
@@ -871,6 +871,7 @@ static inline int copy_signal(unsigned l
tsk->it_prof_expires =
secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
}
+ acct_init_pacct(&sig->pacct, tsk);
return 0;
}
--- linux-2.6.17/kernel/exit.c 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/kernel/exit.c 2006-06-19 16:14:31.000000000 +0900
@@ -895,8 +895,8 @@ fastcall NORET_TYPE void do_exit(long co
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
- acct_process(code);
}
+ acct_collect(code, group_dead);
if (unlikely(tsk->robust_list))
exit_robust_list(tsk);
#ifdef CONFIG_COMPAT
@@ -907,6 +907,8 @@ fastcall NORET_TYPE void do_exit(long co
audit_free(tsk);
exit_mm(tsk);
+ if (group_dead)
+ acct_process();
exit_sem(tsk);
__exit_files(tsk);
__exit_fs(tsk);
next reply other threads:[~2006-06-20 6:25 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-06-20 6:24 KaiGai Kohei [this message]
2006-06-20 6:42 ` Add pacct_struct to fix some pacct bugs Andrew Morton
2006-06-20 7:27 ` KaiGai Kohei
2006-06-22 3:07 ` KaiGai Kohei
2006-06-22 3:08 ` KaiGai Kohei
2006-06-22 3:09 ` KaiGai Kohei
2006-06-22 3:09 ` KaiGai Kohei
2006-06-22 3:35 ` Andrew Morton
2006-06-22 4:05 ` KaiGai Kohei
2006-06-22 4:14 ` Randy.Dunlap
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=449794BB.8010108@ak.jp.nec.com \
--to=kaigai@ak.jp.nec.com \
--cc=akpm@osdl.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.