From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DC4CA7470 for ; Sun, 28 Jul 2024 21:30:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722202244; cv=none; b=oarG/smreRRjbeO74CBT61mHXmmtBKC8zN1Dlumu5jh56nONF465KfFBbHC4ozzTKyXXC8T3Cce2K70Bcs24T05gKm2KSlJBWyfP2hqOGTYPObqdeIK0JANnU+RG8138zkA0vg3Ca5F9d6iYlQPLi+RwOztCu1Ce3bnXsB/sTG0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722202244; c=relaxed/simple; bh=WvDLWZXNj7JvefON5oN/oI6hLOY2sERSFpKQzRgTBzc=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=heETyEG+lFzljh0WAz5IVSK+qVH/by5SOstxPBTG/ySbkl9fjxhYtHU10CAZxLCxNHRt5I6Q7YJsSK2CwlT+1HOMidKeMbzE1FLYvz2gPLkkzxCUG6JQTpqBTn6u1d2AoBzazJPxffoca3Sk4HTc+DVFrIO3Y2muHLC0rpjArzA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=AegzNYdP; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=HIUWeODf; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="AegzNYdP"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="HIUWeODf" From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1722202239; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Mkxxkg5G+QCKXORXqfQheO8NG7nyryepkIDctyVjL18=; b=AegzNYdPO9Dp5Bc+D6MKm89CRgP4G7VK8YM4UOapwi48YT73TQMImAqqNPI+pb/bhW4SiF u5gihWQl7/F2P6Aas4XOm9Oc26OSa1zirtRfBFKKxTjXlPPy5YvEQ2XBozDvxu81qaKzAs 9F5OvEBt1BEHKY3yVLuRzeolv69QTKdroB8whPxJ4F99T2s+2ycxYmF6BVLmp+6PbcYfdI pAn2D94xL/Ut83lQAm7EMOWtd9XTeKNAqVh2BQ1/mKF5HYVURWly0taDCZRpixCAqR9ovg YQSHbke+10OrsYNiG+8wcIX0mBZE+M8cJh9KXTOW2G3tThALTOujkzEM+Xs/kg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1722202239; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Mkxxkg5G+QCKXORXqfQheO8NG7nyryepkIDctyVjL18=; b=HIUWeODfButgjkayOEsDjAdS9bXa8TqVjTMQe5l2xrxZ0Y25r7JiWt3jvB0dPIgALGPkt3 hcIy1oeuSfjmPkDw== To: Peter Zijlstra , mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org Cc: kprateek.nayak@amd.com, wuyun.abel@bytedance.com, youssefesmat@chromium.org, efault@gmx.de Subject: Re: [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID In-Reply-To: <20240727105031.053611186@infradead.org> References: <20240727102732.960974693@infradead.org> <20240727105031.053611186@infradead.org> Date: Sun, 28 Jul 2024 23:30:38 +0200 Message-ID: <874j89tda9.ffs@tglx> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain On Sat, Jul 27 2024 at 12:27, Peter Zijlstra wrote: > In order to measure thread time in a DVFS world, introduce > CLOCK_THREAD_DVFS_ID -- a copy of CLOCK_THREAD_CPUTIME_ID that slows > down with both DVFS scaling and CPU capacity. > > The clock does *NOT* support setting timers. That's not the only limitation. See below. > Useful for both SCHED_DEADLINE and the newly introduced > sched_attr::sched_runtime usage for SCHED_NORMAL. Can this please have an explanation about the usage of the previously reserved value of 0x7 in the lower 3 bits? > * > * Bit 2 indicates whether a cpu clock refers to a thread or a process. > * > - * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or FD=3. > + * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or DVSF=3 > * > - * A clockid is invalid if bits 2, 1, and 0 are all set. > + * (DVFS is PERTHREAD only) This drops the information about the FD usage. Something like: /* * Bit fields within a clockid: * * Bit 31:3 hold either a pid or a file descriptor. * * Bit 2 Bit 1 Bit 0 * 0 0 0 Per process CPUCLOCK_PROF * 0 0 1 Per process CPUCLOCK_VIRT * 0 1 0 Per process CPUCLOCK_SCHED * 0 1 1 Posixclock FD CLOCKFD * 1 0 0 Per thread CPUCLOCK_PROF * 1 0 1 Per thread CPUCLOCK_VIRT * 1 1 0 Per thread CPUCLOCK_SCHED * 1 1 1 Per thread CPUCLOCK_DVSF * * CPUCLOCK_DVSF is per thread only and shares the type code in Bit 1:0 * with CLOCKFD. CLOCKFD uses a file descriptor to access dynamically * registered POSIX clocks (e.g. PTP hardware clocks). */ should be clear enough, no? But, all of this is wishful thinking because the provided implementation only works for: sys_clock_getres(CLOCK_THREAD_DVFS_ID, ...) which falls back to thread_cpu_clock_getres(). The variant which has the TID encoded in bit 31:3 and the type in bit 2:0 fails the test in pid_for_clock(): if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX) return NULL; Worse for sys_clock_gettime(). That fails in both cases for the very same reason. See the uncompiled delta patch below for a cure of that and the rest of my comments. > #define CPUCLOCK_PROF 0 > #define CPUCLOCK_VIRT 1 > #define CPUCLOCK_SCHED 2 > +#define CPUCLOCK_DVFS 3 > #define CPUCLOCK_MAX 3 > #define CLOCKFD CPUCLOCK_MAX > #define CLOCKFD_MASK (CPUCLOCK_PERTHREAD_MASK|CPUCLOCK_CLOCK_MASK) With that DVFS addition CPUCLOCK_MAX is misleading at best. See delta patch. > + > + rq = task_rq_lock(p, &rf); > + /* > + * Must be ->curr _and_ ->on_rq. If dequeued, we would > + * project cycles that may never be accounted to this > + * thread, breaking clock_gettime(). Must be? For what? I assume you want to say: Update the runtime if the task is the current task and on the runqueue. The latter is important because if current is dequeued, .... > + */ > + if (task_current(rq, p) && task_on_rq_queued(p)) { > + prefetch_curr_exec_start(p); > + update_rq_clock(rq); > + p->sched_class->update_curr(rq); > + } > + ns = p->se.sum_dvfs_runtime; > + task_rq_unlock(rq, p, &rf); > @@ -1664,6 +1668,11 @@ static int thread_cpu_timer_create(struc > timer->it_clock = THREAD_CLOCK; > return posix_cpu_timer_create(timer); > } > +static int thread_dvfs_cpu_clock_get(const clockid_t which_clock, > + struct timespec64 *tp) Please align the second line properly with the argument in the first line. Thanks, tglx --- --- a/include/linux/posix-timers_types.h +++ b/include/linux/posix-timers_types.h @@ -9,27 +9,42 @@ /* * Bit fields within a clockid: * - * The most significant 29 bits hold either a pid or a file descriptor. + * Bit 31:3 hold either a PID/TID or a file descriptor. * - * Bit 2 indicates whether a cpu clock refers to a thread or a process. + * Bit 2 Bit 1 Bit 0 + * 0 0 0 Per process CPUCLOCK_PROF + * 0 0 1 Per process CPUCLOCK_VIRT + * 0 1 0 Per process CPUCLOCK_SCHED + * 0 1 1 Posixclock FD CLOCKFD + * 1 0 0 Per thread CPUCLOCK_PROF + * 1 0 1 Per thread CPUCLOCK_VIRT + * 1 1 0 Per thread CPUCLOCK_SCHED + * 1 1 1 Per thread CPUCLOCK_DVSF * - * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or DVSF=3 - * - * (DVFS is PERTHREAD only) + * CPUCLOCK_DVSF is per thread only and shares the type code in Bit 1:0 + * with CLOCKFD. CLOCKFD uses a file descriptor to access dynamically + * registered POSIX clocks (e.g. PTP hardware clocks). */ + #define CPUCLOCK_PID(clock) ((pid_t) ~((clock) >> 3)) -#define CPUCLOCK_PERTHREAD(clock) \ - (((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0) +#define CPUCLOCK_PERTHREAD(clock) (((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0) -#define CPUCLOCK_PERTHREAD_MASK 4 -#define CPUCLOCK_WHICH(clock) ((clock) & (clockid_t) CPUCLOCK_CLOCK_MASK) -#define CPUCLOCK_CLOCK_MASK 3 #define CPUCLOCK_PROF 0 #define CPUCLOCK_VIRT 1 #define CPUCLOCK_SCHED 2 -#define CPUCLOCK_DVFS 3 -#define CPUCLOCK_MAX 3 -#define CLOCKFD CPUCLOCK_MAX +#define CPUCLOCK_SAMPLE_MAX (CPUCLOCK_SCHED + 1) + +#define CPUCLOCK_CLOCK_MASK 3 +#define CPUCLOCK_PERTHREAD_MASK 4 +#define CPUCLOCK_WHICH(clock) ((clock) & (clockid_t) CPUCLOCK_CLOCK_MASK) + +/* + * CPUCLOCK_DVFS and CLOCKFD share the type code in bit 1:0. CPUCLOCK_DVFS + * does not belong to the sampling clocks and does not allow timers to be + * armed on it. + */ +#define CPUCLOCK_DVFS CPUCLOCK_SAMPLE_MAX +#define CLOCKFD CPUCLOCK_DVFS #define CLOCKFD_MASK (CPUCLOCK_PERTHREAD_MASK|CPUCLOCK_CLOCK_MASK) #ifdef CONFIG_POSIX_TIMERS @@ -55,7 +70,7 @@ struct posix_cputimer_base { * Used in task_struct and signal_struct */ struct posix_cputimers { - struct posix_cputimer_base bases[CPUCLOCK_MAX]; + struct posix_cputimer_base bases[CPUCLOCK_SAMPLE_MAX]; unsigned int timers_active; unsigned int expiry_active; }; --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5413,9 +5413,10 @@ unsigned long long task_sched_dvfs_runti rq = task_rq_lock(p, &rf); /* - * Must be ->curr _and_ ->on_rq. If dequeued, we would - * project cycles that may never be accounted to this - * thread, breaking clock_gettime(). + * Update the runtime if the task is the current task and on the + * runqueue. The latter is important because if current is + * dequeued, we would project cycles that may never be accounted to + * this thread, breaking clock_gettime(). */ if (task_current(rq, p) && task_on_rq_queued(p)) { prefetch_curr_exec_start(p); --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -54,13 +54,13 @@ int update_rlimit_cpu(struct task_struct /* * Functions for validating access to tasks. */ -static struct pid *pid_for_clock(const clockid_t clock, bool gettime) +static struct pid *__pid_for_clock(const clockid_t clock, const clockid_t maxclock, bool gettime) { const bool thread = !!CPUCLOCK_PERTHREAD(clock); const pid_t upid = CPUCLOCK_PID(clock); struct pid *pid; - if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX) + if (CPUCLOCK_WHICH(clock) > maxclock) return NULL; /* @@ -94,12 +94,17 @@ static struct pid *pid_for_clock(const c return pid_has_task(pid, PIDTYPE_TGID) ? pid : NULL; } +static inline struct pid *pid_for_clock(const clockid_t clock, bool gettime) +{ + return __pid_for_clock(clock, CPUCLOCK_SCHED, gettime); +} + static inline int validate_clock_permissions(const clockid_t clock) { int ret; rcu_read_lock(); - ret = pid_for_clock(clock, false) ? 0 : -EINVAL; + ret = __pid_for_clock(clock, CPUCLOCK_DVFS, false) ? 0 : -EINVAL; rcu_read_unlock(); return ret; @@ -344,7 +349,7 @@ static u64 cpu_clock_sample_group(const { struct thread_group_cputimer *cputimer = &p->signal->cputimer; struct posix_cputimers *pct = &p->signal->posix_cputimers; - u64 samples[CPUCLOCK_MAX]; + u64 samples[CPUCLOCK_SAMPLE_MAX]; if (!READ_ONCE(pct->timers_active)) { if (start) @@ -365,7 +370,7 @@ static int posix_cpu_clock_get(const clo u64 t; rcu_read_lock(); - tsk = pid_task(pid_for_clock(clock, true), clock_pid_type(clock)); + tsk = pid_task(__pid_for_clock(clock, CPUCLOCK_DVFS, true), clock_pid_type(clock)); if (!tsk) { rcu_read_unlock(); return -EINVAL; @@ -864,7 +869,7 @@ static void collect_posix_cputimers(stru struct posix_cputimer_base *base = pct->bases; int i; - for (i = 0; i < CPUCLOCK_MAX; i++, base++) { + for (i = 0; i < CPUCLOCK_SAMPLE_MAX; i++, base++) { base->nextevt = collect_timerqueue(&base->tqhead, firing, samples[i]); } @@ -901,7 +906,7 @@ static void check_thread_timers(struct t struct list_head *firing) { struct posix_cputimers *pct = &tsk->posix_cputimers; - u64 samples[CPUCLOCK_MAX]; + u64 samples[CPUCLOCK_SAMPLE_MAX]; unsigned long soft; if (dl_task(tsk)) @@ -979,7 +984,7 @@ static void check_process_timers(struct { struct signal_struct *const sig = tsk->signal; struct posix_cputimers *pct = &sig->posix_cputimers; - u64 samples[CPUCLOCK_MAX]; + u64 samples[CPUCLOCK_SAMPLE_MAX]; unsigned long soft; /* @@ -1098,7 +1103,7 @@ task_cputimers_expired(const u64 *sample { int i; - for (i = 0; i < CPUCLOCK_MAX; i++) { + for (i = 0; i < CPUCLOCK_SAMPLE_MAX; i++) { if (samples[i] >= pct->bases[i].nextevt) return true; } @@ -1121,7 +1126,7 @@ static inline bool fastpath_timer_check( struct signal_struct *sig; if (!expiry_cache_is_inactive(pct)) { - u64 samples[CPUCLOCK_MAX]; + u64 samples[CPUCLOCK_SAMPLE_MAX]; task_sample_cputime(tsk, samples); if (task_cputimers_expired(samples, pct)) @@ -1146,7 +1151,7 @@ static inline bool fastpath_timer_check( * delays with signals actually getting sent are expected. */ if (READ_ONCE(pct->timers_active) && !READ_ONCE(pct->expiry_active)) { - u64 samples[CPUCLOCK_MAX]; + u64 samples[CPUCLOCK_SAMPLE_MAX]; proc_sample_cputime_atomic(&sig->cputimer.cputime_atomic, samples); @@ -1669,7 +1674,7 @@ static int thread_cpu_timer_create(struc return posix_cpu_timer_create(timer); } static int thread_dvfs_cpu_clock_get(const clockid_t which_clock, - struct timespec64 *tp) + struct timespec64 *tp) { return posix_cpu_clock_get(THREAD_DVFS_CLOCK, tp); }