[PATCH] um: insert scheduler ticks when userspace does not yield

* [PATCH] um: insert scheduler ticks when userspace does not yield
@ 2024-09-13 20:17 Benjamin Berg
  2024-09-19 14:11 ` Benjamin Beichler
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Berg @ 2024-09-13 20:17 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

In time-travel mode userspace can do a lot of work without any time
passing. Unfortunately, this can result in OOM situations as the RCU
core code will never be run.

Work around this by keeping track of userspace processes that do not
yield for a lot of operations. When this happens, insert a jiffy into
sched_clock to account time against the process and cause the
bookkeeping to run.

As sched_clock is used for tracing, it is useful to keep it in sync
between the different VMs. As such, try to remove added ticks again when
the actual clock ticks.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/kernel/time.c           | 20 ++++++++++++++++++++
 arch/um/os-Linux/skas/process.c | 25 +++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/arch/um/kernel/time.c b/arch/um/kernel/time.c
index 29b27b90581f..e7da5e6cd4ce 100644
--- a/arch/um/kernel/time.c
+++ b/arch/um/kernel/time.c
@@ -25,6 +25,8 @@
 #include <shared/init.h>
 
 #ifdef CONFIG_UML_TIME_TRAVEL_SUPPORT
+#include <linux/sched/clock.h>
+
 enum time_travel_mode time_travel_mode;
 EXPORT_SYMBOL_GPL(time_travel_mode);
 
@@ -47,6 +49,15 @@ static u16 time_travel_shm_id;
 static struct um_timetravel_schedshm *time_travel_shm;
 static union um_timetravel_schedshm_client *time_travel_shm_client;
 
+unsigned long tt_extra_sched_jiffies;
+
+notrace unsigned long long sched_clock(void)
+{
+	return (unsigned long long)(jiffies - INITIAL_JIFFIES +
+				    tt_extra_sched_jiffies)
+					* (NSEC_PER_SEC / HZ);
+}
+
 static void time_travel_set_time(unsigned long long ns)
 {
 	if (unlikely(ns < time_travel_time))
@@ -443,6 +454,11 @@ static void time_travel_periodic_timer(struct time_travel_event *e)
 {
 	time_travel_add_event(&time_travel_timer_event,
 			      time_travel_time + time_travel_timer_interval);
+
+	/* Remove inserted sched_clock ticks again to avoid timestamp drift */
+	if (tt_extra_sched_jiffies > 0)
+		tt_extra_sched_jiffies -= 1;
+
 	deliver_alarm();
 }
 
@@ -594,6 +610,10 @@ EXPORT_SYMBOL_GPL(time_travel_add_irq_event);
 
 static void time_travel_oneshot_timer(struct time_travel_event *e)
 {
+	/* Remove inserted sched_clock ticks again to avoid timestamp drift */
+	if (tt_extra_sched_jiffies > 0)
+		tt_extra_sched_jiffies -= 1;
+
 	deliver_alarm();
 }
 
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index b6f656bcffb1..e1a6f97a000b 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -336,6 +336,9 @@ int start_userspace(unsigned long stub_stack)
 	return err;
 }
 
+int unscheduled_userspace_iterations;
+extern unsigned long tt_extra_sched_jiffies;
+
 void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 {
 	int err, status, op, pid = userspace_pid[0];
@@ -345,6 +348,26 @@ void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 	interrupt_end();
 
 	while (1) {
+		/*
+		 * When we are in time-travel mode, userspace can theoretically
+		 * do a *lot* of work without being scheduled. The problem with
+		 * this is that it will prevent kernel bookkeeping (primarily
+		 * the RCU) from running and this can for example cause OOM
+		 * situations.
+		 *
+		 * This code accounts a jiffie against the scheduling clock
+		 * after 10000 userspace iterations (syscall or pagefault) in
+		 * the same thread. By doing so the situation is effectively
+		 * prevented.
+		 */
+		if (time_travel_mode == TT_MODE_INFCPU ||
+		    time_travel_mode == TT_MODE_EXTERNAL) {
+			if (unscheduled_userspace_iterations++ > 10000) {
+				tt_extra_sched_jiffies += 1;
+				unscheduled_userspace_iterations = 0;
+			}
+		}
+
 		time_travel_print_bc_msg();
 
 		current_mm_sync();
@@ -487,6 +510,8 @@ void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
 
 void switch_threads(jmp_buf *me, jmp_buf *you)
 {
+	unscheduled_userspace_iterations = 0;
+
 	if (UML_SETJMP(me) == 0)
 		UML_LONGJMP(you, 1);
 }
-- 
2.46.0
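
A minimal userspace model of the mechanism in this patch, for readers
who want to experiment outside the kernel. It is a sketch, not the
kernel code: the tt_model struct, the model_* functions and the MODEL_*
constants are made up for illustration, HZ is assumed to be 100 and
INITIAL_JIFFIES is taken as 0.

```c
/* Illustrative constants; the patch uses the kernel's HZ and 10000. */
#define MODEL_HZ 100ULL
#define MODEL_NSEC_PER_SEC 1000000000ULL
#define MODEL_ITERATION_LIMIT 10000

/* Hypothetical container for the three counters the patch touches. */
struct tt_model {
	unsigned long jiffies;             /* real ticks delivered so far */
	unsigned long extra_sched_jiffies; /* ticks inserted for busy userspace */
	int unscheduled_iterations;        /* iterations since the last switch */
};

/* Same formula as sched_clock() in the patch, with INITIAL_JIFFIES = 0. */
unsigned long long model_sched_clock(const struct tt_model *m)
{
	return (unsigned long long)(m->jiffies + m->extra_sched_jiffies)
		* (MODEL_NSEC_PER_SEC / MODEL_HZ);
}

/* One pass through the userspace() loop (a syscall or a page fault). */
void model_userspace_iteration(struct tt_model *m)
{
	if (m->unscheduled_iterations++ > MODEL_ITERATION_LIMIT) {
		m->extra_sched_jiffies += 1;
		m->unscheduled_iterations = 0;
	}
}

/* switch_threads() resets the counter, so only non-yielding runs count. */
void model_switch_threads(struct tt_model *m)
{
	m->unscheduled_iterations = 0;
}
```

Because the comparison uses a post-increment, the first extra tick in
this model is inserted on iteration 10002 of an uninterrupted run, and
a thread switch before that point starts the count over.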




* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-13 20:17 [PATCH] um: insert scheduler ticks when userspace does not yield Benjamin Berg
@ 2024-09-19 14:11 ` Benjamin Beichler
  2024-09-19 14:22   ` Benjamin Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Beichler @ 2024-09-19 14:11 UTC (permalink / raw)
  To: Benjamin Berg, linux-um; +Cc: Benjamin Berg


Hi,

Could this also eliminate/address the busy-loop hack in timer_read in 
time.c?

And another question: Why do you remove only 1 extra jiffy in the timer
callbacks and not all the extra jiffies? Is there always only 1, or could
there be multiple?

regards

Benjamin Beichler

On 13.09.2024 at 22:17, Benjamin Berg wrote:
> From: Benjamin Berg <benjamin.berg@intel.com>
> 
> In time-travel mode userspace can do a lot of work without any time
> passing. Unfortunately, this can result in OOM situations as the RCU
> core code will never be run.
> 
> Work around this by keeping track of userspace processes that do not
> yield for a lot of operations. When this happens, insert a jiffie into
> the sched_clock clock to account time against the process and cause the
> bookkeeping to run.
> 
> As sched_clock is used for tracing, it is useful to keep it in sync
> between the different VMs. As such, try to remove added ticks again when
> the actual clock ticks.




* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-19 14:11 ` Benjamin Beichler
@ 2024-09-19 14:22   ` Benjamin Berg
  2024-09-19 14:37     ` Benjamin Beichler
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Berg @ 2024-09-19 14:22 UTC (permalink / raw)
  To: Benjamin Beichler, linux-um

Hi,

On Thu, 2024-09-19 at 16:11 +0200, Benjamin Beichler wrote:
> Could this also eliminate/address the busy-loop hack in timer_read in
> time.c?

Hmm, I was considering changing the other hack in handle_syscall to
also use this approach.

But I don't think the timer_read hack can be removed. When userspace
reads the time, it should not see the ticks inserted into sched_clock.
So even though the process would start to use CPU time, the realtime
clock would not change and python would continue to busy-loop.

> And another question: Why you remove only 1 extra jiffy in the timer 
> callbacks and not all the extra jiffies? Is there always only 1 or
> could there be multiple?

Oh, we can absolutely end up inserting multiple jiffies there. However,
deliver_alarm will cause "jiffies" to be incremented by one. So by
subtracting only one jiffy we can avoid a situation where sched_clock
moves backwards.
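
The monotonicity argument can be checked with a small model. This is a
sketch under the assumption stated above that delivering the alarm
advances jiffies by exactly one. The names clock_model, sched_ticks,
tick_remove_one and tick_remove_all are hypothetical, and
tick_remove_all is the alternative asked about, not something the patch
does.

```c
/* Hypothetical model: real ticks plus ticks inserted for busy userspace. */
struct clock_model {
	unsigned long jiffies;
	unsigned long extra;
};

/* The quantity sched_clock() scales to nanoseconds in the patch. */
unsigned long sched_ticks(const struct clock_model *c)
{
	return c->jiffies + c->extra;
}

/* Timer callback as in the patch: drop at most one inserted tick. */
void tick_remove_one(struct clock_model *c)
{
	if (c->extra > 0)
		c->extra -= 1;
	c->jiffies += 1; /* assumed effect of deliver_alarm() */
}

/* Alternative from the question: drop all inserted ticks at once. */
void tick_remove_all(struct clock_model *c)
{
	c->extra = 0;
	c->jiffies += 1; /* assumed effect of deliver_alarm() */
}
```

With three inserted ticks pending, tick_remove_one leaves the sum
unchanged across the timer callback, while tick_remove_all makes it go
from 13 to 11 when starting at jiffies = 10, i.e. the clock would jump
backwards.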

Benjamin

> 
> regards
> 
> Benjamin Beichler
> 
> On 13.09.2024 at 22:17, Benjamin Berg wrote:
> > From: Benjamin Berg <benjamin.berg@intel.com>
> > 
> > In time-travel mode userspace can do a lot of work without any time
> > passing. Unfortunately, this can result in OOM situations as the
> > RCU
> > core code will never be run.
> > 
> > Work around this by keeping track of userspace processes that do
> > not
> > yield for a lot of operations. When this happens, insert a jiffie
> > into
> > the sched_clock clock to account time against the process and cause
> > the
> > bookkeeping to run.
> > 
> > As sched_clock is used for tracing, it is useful to keep it in sync
> > between the different VMs. As such, try to remove added ticks again
> > when
> > the actual clock ticks.
> 




* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-19 14:22   ` Benjamin Berg
@ 2024-09-19 14:37     ` Benjamin Beichler
  2024-09-19 16:55       ` Benjamin Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Beichler @ 2024-09-19 14:37 UTC (permalink / raw)
  To: Benjamin Berg, linux-um

On 19.09.2024 at 16:22, Benjamin Berg wrote:
>> Could this also eliminate/address the busy-loop hack in timer_read in
>> time.c?
> Hmm, I was considering changing the other hack in handle_syscall to
> also use this approach.
>
> But, I don't think the timer_read hack can be removed. In the case of
> userspace reading the time, it should not see the difference of the
> sched_clock. So even though the process would start use CPU time, the
> realtime clock should not change and python would continue to busyloop.

Okay, I think you are right.

Nonetheless, my proposal would be to advance the realtime clock only in
user space system calls, and not in every kernel function that reads
the timer.

Ages ago I made the mediocre proposal to check the syscall number here
to detect "malicious" user space programs doing busy loops.

For "clean" semantics of a simulated execution of the kernel, it feels
erroneous to advance time even if this value is only read once.

In my experiments timer_read was called much more often than I
anticipated (e.g., by filesystem code).


* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-19 14:37     ` Benjamin Beichler
@ 2024-09-19 16:55       ` Benjamin Berg
  2024-09-23 13:56         ` Benjamin Beichler
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Berg @ 2024-09-19 16:55 UTC (permalink / raw)
  To: Benjamin Beichler, linux-um

Hi,

On Thu, 2024-09-19 at 16:37 +0200, Benjamin Beichler wrote:
> On 19.09.2024 at 16:22, Benjamin Berg wrote:
> > > Could this also eliminate/address the busy-loop hack in timer_read in
> > > time.c?
> > Hmm, I was considering changing the other hack in handle_syscall to
> > also use this approach.
> > 
> > But, I don't think the timer_read hack can be removed. In the case of
> > userspace reading the time, it should not see the difference of the
> > sched_clock. So even though the process would start use CPU time, the
> > realtime clock should not change and python would continue to busyloop.
> 
> Okay, I think you are right.
> 
> Nonetheless, my proposal would be anyway to shift the advance of the 
> realtime clock only into user space systemcalls and not every function 
> in the kernel, that reads the timer.
> 
> I made ages ago the mediocre proposal to check the syscall number here, 
> to detect "malicious" user space programs doing busy loops.

And now here I am proposing another form of "malicious" userspace
detection …

> For "clean" semantics of a simulative execution of the kernel, it feels 
> erroneous to advance time even if this value is only read once.
> 
> In my experiments timer_read was called much more often than I 
> anticipated (e.g., filesystem code).

Yeah, that does not really sound like something we would want (and it
will also not help with performance with time-travel=ext).

Looking at the old discussion, it doesn't seem that Johannes was
against the idea of doing the time insertion only in more specific
scenarios. So, we "just" need a reasonably elegant solution.

If we accept writing a list of syscalls, then maybe we could just do it
within handle_syscall and do a um_udelay(1) for any syscall that takes
a timeout parameter (select, pselect6, poll, ...)? It is going to be a
pretty long list, but could still be reasonable.
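
Such a list could be a simple predicate on the syscall number. This is
only a sketch: syscall_takes_timeout is a hypothetical helper, it covers
just the three syscalls named above, and it uses the host's SYS_*
constants from <sys/syscall.h> with guards because not every
architecture defines all of them. The caller in handle_syscall that
would do the um_udelay(1) is not shown.

```c
#include <stdbool.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper: true for syscalls that take a timeout parameter.
 * Deliberately incomplete; a real list would be much longer.
 */
bool syscall_takes_timeout(long nr)
{
	switch (nr) {
#ifdef SYS_select
	case SYS_select:
#endif
#ifdef SYS_pselect6
	case SYS_pselect6:
#endif
#ifdef SYS_poll
	case SYS_poll:
#endif
		return true;
	default:
		return false;
	}
}
```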

One neat side effect is that if reading time does not actually cost
time, then we could implement clock_gettime in the VDSO.

Benjamin



* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-19 16:55       ` Benjamin Berg
@ 2024-09-23 13:56         ` Benjamin Beichler
  2024-09-23 14:48           ` Benjamin Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Beichler @ 2024-09-23 13:56 UTC (permalink / raw)
  To: Benjamin Berg, linux-um

>> For "clean" semantics of a simulative execution of the kernel, it feels
>> erroneous to advance time even if this value is only read once.
>>
>> In my experiments timer_read was called much more often than I
>> anticipated (e.g., filesystem code).
> Yeah, that does not really sound like something we would want (and it
> will also not help with performance with time-travel=ext).
>
> Looking at the old discussion, it doesn't seem that Johannes was
> against the idea of doing the time insertion only in more specific
> scenarios. So, we "just" need a reasonably elegant solution.
>
> If we accept writing a list of syscalls, then maybe we could just do it
> within handle_syscall and do a um_udelay(1) for any syscall that takes
> a timeout parameter (select, pselect6, poll, ...)? It is going to be a
> pretty long list, but could still be reasonable.

That's actually not what my "hack" did. I filtered for all syscalls that
give some information about the current timestamp of the system.

Actually, I think timeouts are no problem if we can ensure that a
timeout is never rounded down to 0. Mostly, a direct input of 0 has a
special meaning, or provokes wrong behavior from the user space program
in the first place.

Since time-travel mode has a very limited niche, I would not try to
prevent every possible dumb behavior that bad user space programs could
have. I think busy-waiting on a system clock advancement is not the best
style, but acceptable.

So my list was:

sys_getitimer
sys_gettimeofday
sys_time
sys_timer_gettime
sys_clock_gettime
sys_timerfd_gettime

While overthinking it, I see the possibility of reading the access
timestamps of a file to create an endless loop, so maybe the stat
syscalls should be included, although this makes me a bit uncomfortable
again. I tend to say that this "bad" behavior of asking for the same
information over and over again should only be punished if it happens
multiple times.

I was thinking about storing the PID of a busy-looping process and only
increasing time if the same PID is "suspicious". However, this "hack"
becomes more and more costly, which on the other hand is not important
for time-travel mode.
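
A rough sketch of how the list and the "same PID, multiple times" idea
could fit together. Everything here is hypothetical: the
busy_loop_tracker struct, the function names and the threshold of 3 are
invented for illustration, and the SYS_* constants come from the host's
<sys/syscall.h> (guarded, since some architectures lack a few of them).

```c
#include <stdbool.h>
#include <sys/syscall.h>

#define SUSPICIOUS_THRESHOLD 3 /* hypothetical value */

/* True for the syscalls from the list above that report the current time. */
bool syscall_reads_time(long nr)
{
	switch (nr) {
#ifdef SYS_getitimer
	case SYS_getitimer:
#endif
#ifdef SYS_gettimeofday
	case SYS_gettimeofday:
#endif
#ifdef SYS_time
	case SYS_time:
#endif
#ifdef SYS_timer_gettime
	case SYS_timer_gettime:
#endif
#ifdef SYS_clock_gettime
	case SYS_clock_gettime:
#endif
#ifdef SYS_timerfd_gettime
	case SYS_timerfd_gettime:
#endif
		return true;
	default:
		return false;
	}
}

/* Remembers which PID is on a streak of time-reading syscalls. */
struct busy_loop_tracker {
	int pid;
	unsigned int repeats;
};

/*
 * Returns true once the same PID has issued SUSPICIOUS_THRESHOLD
 * time-reading syscalls in a row. Any other syscall, or a different
 * PID, restarts the streak (a simplification).
 */
bool should_advance_time(struct busy_loop_tracker *t, int pid, long nr)
{
	if (!syscall_reads_time(nr)) {
		t->repeats = 0;
		return false;
	}
	if (t->pid != pid) {
		t->pid = pid;
		t->repeats = 0;
	}
	t->repeats++;
	return t->repeats >= SUSPICIOUS_THRESHOLD;
}
```

A single read of the time therefore costs nothing in this model; only
a run of consecutive reads from one PID is treated as a busy loop.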

>
> One neat side effect is that if reading time does not actually cost
> time, then we could implement clock_gettime in the VDSO.

That is exactly what would not work, because of my comment above.


* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-23 13:56         ` Benjamin Beichler
@ 2024-09-23 14:48           ` Benjamin Berg
  2024-09-23 21:50             ` Benjamin Beichler
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Berg @ 2024-09-23 14:48 UTC (permalink / raw)
  To: Benjamin Beichler, linux-um

On Mon, 2024-09-23 at 15:56 +0200, Benjamin Beichler wrote:
> > > For "clean" semantics of a simulative execution of the kernel, it feels
> > > erroneous to advance time even if this value is only read once.
> > > 
> > > In my experiments timer_read was called much more often than I
> > > anticipated (e.g., filesystem code).
> > Yeah, that does not really sound like something we would want (and it
> > will also not help with performance with time-travel=ext).
> > 
> > Looking at the old discussion, it doesn't seem that Johannes was
> > against the idea of doing the time insertion only in more specific
> > scenarios. So, we "just" need a reasonably elegant solution.
> > 
> > If we accept writing a list of syscalls, then maybe we could just do it
> > within handle_syscall and do a um_udelay(1) for any syscall that takes
> > a timeout parameter (select, pselect6, poll, ...)? It is going to be a
> > pretty long list, but could still be reasonable.
> 
> That's actually not what my "hack" did. I filtered out all syscalls, 
> that give some information about the current timestamp of the system.

Yes, I know.

> Actually, I think, timeouts are no problem, if we can assure, that a 
> timeout is never rounded down to 0. Mostly a direct input of 0 have 
> special meanings, or provokes wrong behavior in the first place from 
> user space program.

I don't think that is a problem. The kernel should guarantee that a
timeout never fires too early.

I believe in the case of the linked python code, the timeout fires at
exactly the correct time. And then the python code (incorrectly)
detects that the timeout has not passed and tries to "select" again
with a timeout of exactly zero.

Really, that implementation is just buggy in subtle ways. It could
probably just trust the kernel to not wake up early. And, if it does
check whether the timeout has passed, then it should just accept the
exact time.

(Note that e.g. python asyncio explicitly takes into account the clock
resolution to avoid this type of issue.)

> Since time-travel mode has a very limited niche, I would not try to 
> prevent every possible dumb behavior that bad user space programs could 
> have. I think busy-waiting on a system clock advancement is not the best 
> style, but acceptable.
> 
> So my list was:
> 
> sys_getitimer
> sys_gettimeofday
> sys_time
> sys_timer_gettime
> sys_clock_gettime
> sys_timerfd_gettime
> 
> While overthinking it, I see the possibility to read the access 
> timestamps of a file to create an endless loop, so maybe the stat 
> syscalls may be included, although this makes me a bit uncomfortable 
> again. I tend to say, this "bad" behavior of asking the same information 
> over and over again, should only be punished, if it happens multiple times.
> 
> I was thinking about, storing the PID of a busy-looped process, and only 
> increase time, if the same PID is "suspicious".  However, this "hack"
> becomes more and more costly, which is on the other hand not important 
> for timetravel mode.

Maybe a stupid question, but aren't we overthinking this in general?

While I think that Johannes' solution to make reading the time cost
time is kind of ingenious, I really wonder how much of an issue this
actually is. Because if this is just a few userspace applications and
libraries misbehaving, then we might as well fix the issue there
instead of doing anything special in UML.

> > One neat side effect is that if reading time does not actually cost
> > time, then we could implement clock_gettime in the VDSO.
> 
> That would exactly not work, because of my comment from before.

Of course. It is just that I have always in the back of my mind that
syscalls and pagefaults (including minor faults) are really expensive
in UML. So if the hack is moved elsewhere then implementing
clock_gettime in the vDSO could be an easy win to speed up the
simulation.

Benjamin


* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-23 14:48           ` Benjamin Berg
@ 2024-09-23 21:50             ` Benjamin Beichler
  2024-09-24 10:46               ` Benjamin Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Beichler @ 2024-09-23 21:50 UTC (permalink / raw)
  To: Benjamin Berg, linux-um

Hi,

On 23.09.2024 at 16:48, Benjamin Berg wrote:
>> Actually, I think, timeouts are no problem, if we can assure, that a
>> timeout is never rounded down to 0. Mostly a direct input of 0 have
>> special meanings, or provokes wrong behavior in the first place from
>> user space program.
> I don't think that is a problem. The kernel should guarantee that a
> timeout never fires too early.
>
> I believe in the case of the linked python code, the timeout fires at
> exactly the correct time. And then the python code (incorrectly)
> detects that the timeout has not passed and tries to "select" again
> with a timeout of exactly zero.
>
> Really, that implementation is just buggy in subtle ways. It could
> probably just trust the kernel to not wake up early. And, if it does
> check whether the timeout has passed, then it should just accept the
> exact time.

Maybe I'm being Captain Obvious here, but I had the impression this
code was written this way to handle interruptions by signals, not
because it doubts the time accuracy. Possibly I'm totally wrong, but it
seems quite elegant to simply use time here to avoid the dance of
masking signals or checking for interruptions etc.

I believe this code was written with the assumption that time() will
advance, so this will never be an endless loop, and even the corner case
of a timeout of 0 would be covered by this.

>> Since time-travel mode has a very limited niche, I would not try to
>> prevent every possible dumb behavior that bad user space programs could
>> have. I think busy-waiting on a system clock advancement is not the best
>> style, but acceptable.
>>
>> So my list was:
>>
>> sys_getitimer
>> sys_gettimeofday
>> sys_time
>> sys_timer_gettime
>> sys_clock_gettime
>> sys_timerfd_gettime
>>
>> While overthinking it, I see the possibility to read the access
>> timestamps of a file to create an endless loop, so maybe the stat
>> syscalls may be included, although this makes me a bit uncomfortable
>> again. I tend to say, this "bad" behavior of asking the same information
>> over and over again, should only be punished, if it happens multiple times.
>>
>> I was thinking about, storing the PID of a busy-looped process, and only
>> increase time, if the same PID is "suspicious".  However, this "hack"
>> becomes more and more costly, which is on the other hand not important
>> for timetravel mode.
> Maybe a stupid question, but aren't we overthinking this in general?
>
> While I think that Johannes' solution to make reading the time cost
> time is kind of ingenious, I really wonder how much of an issue this
> actually is. Because if this is just a few userspace applications and
> libraries misbehaving, then we might as well fix the issue there
> instead of doing anything special in UML.

Your point is right, and such bugs may be fixed in user space. On the
other hand, what about software we can't or don't want to fix, which
simply works in the wild? For my future use cases, I will run code that
I'm not able to compile myself. I would even consider having a runtime
switch to change the behavior of this hack, to reduce the overhead in
simulations that behave nicely while keeping a quick workaround for
misbehaving code.

And sorry for repeating myself, but I believe that busy waiting on an
increasing timer value is not the best style, but is considered
okay/normal for some use cases. So I think it would be helpful to be
able to execute such user space code.

But I want to bring in another idea: Could we use an eBPF program to
dynamically hook into syscalls and do a timetravel_update or something
similar? Actually, I do not know whether eBPF works normally in UM, but
that way it would be flexible and would move the dirty hacks into small
portions outside the kernel. From what I understand, we would need to
add an eBPF-callable wrapper for the time travel update function,
wouldn't we?

>
>>> One neat side effect is that if reading time does not actually cost
>>> time, then we could implement clock_gettime in the VDSO.
>> That would exactly not work, because of my comment from before.
> Of course. It is just that I have always in the back of my mind that
> syscalls and pagefaults (including minor faults) are really expensive
> in UML. So if the hack is moved elsewhere then implementing
> clock_gettime in the vDSO could be an easy win to speed up the
> simulation.

Mhh, I only had a quick look into "arch/x86/um/vdso/um_vdso.c", and from
my understanding, currently every vDSO call is converted into a syscall
of the host. So we would need much more code to use the time travel
clock here, wouldn't we? Of course, my proposed eBPF hook would not work
here either...

>
> Benjamin
>
  kind regards

(the other) Benjamin


* Re: [PATCH] um: insert scheduler ticks when userspace does not yield
  2024-09-23 21:50             ` Benjamin Beichler
@ 2024-09-24 10:46               ` Benjamin Berg
  0 siblings, 0 replies; 9+ messages in thread
From: Benjamin Berg @ 2024-09-24 10:46 UTC (permalink / raw)
  To: Benjamin Beichler, linux-um

On Mon, 2024-09-23 at 23:50 +0200, Benjamin Beichler wrote:
> Hi,
> 
> On 23.09.2024 at 16:48, Benjamin Berg wrote:
> > > Actually, I think, timeouts are no problem, if we can assure, that a
> > > timeout is never rounded down to 0. Mostly a direct input of 0 have
> > > special meanings, or provokes wrong behavior in the first place from
> > > user space program.
> > I don't think that is a problem. The kernel should guarantee that a
> > timeout never fires too early.
> > 
> > I believe in the case of the linked python code, the timeout fires at
> > exactly the correct time. And then the python code (incorrectly)
> > detects that the timeout has not passed and tries to "select" again
> > with a timeout of exactly zero.
> > 
> > Really, that implementation is just buggy in subtle ways. It could
> > probably just trust the kernel to not wake up early. And, if it does
> > check whether the timeout has passed, then it should just accept the
> > exact time.
> 
> Maybe I'm doing a captain obvious here, but I had the impression this
> code was written this way, to handle interruptions by signals and not to 
> doubt the time accuracy. Possibly I'm totally wrong, but it seems quite 
> elegant to simply use time here to avoid that dance to mask signals or 
> check for interruptions etc.

Yeah, you are right. I wasn't sure what python does in case of EINTR.
But it does look like it'll stop the select in that case after handling
the signal.

> I believe this code was written in mind that time() will advance, so 
> this will never be an endless loop, so even the corner case that timeout 
> was 0 would be covered by this

Dunno, it feels to me like someone just didn't think much about whether
it should be a "<" or "<=" operator there. Simply changing the
    if timeout < 0: 
to
    if timeout <= 0:
will resolve the problem and is more correct in my view. A specific
"time" is valid in the interval from the clock tick until the next
tick. So the only sensible thing is to assume that you are somewhere
within the interval *after* the tick, which means the timeout has
passed already if it is exactly zero.

Trying to do a zero length sleep to wait for the next clock tick does
not make sense. Especially as a zero length sleep is defined to not
sleep.
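
The effect of the two comparisons can be modeled in C. This is not the
python code under discussion, only a sketch of the same retry logic
against a clock that stands still while the caller spins, as in
time-travel mode. The function wait_until and its parameters are
hypothetical, and the model assumes the kernel wakes the sleeper at
exactly the deadline.

```c
#include <stdbool.h>

/*
 * Retry loop around a sleep with a deadline. accept_exact selects the
 * "<=" comparison, otherwise "<" is used. Returns the number of
 * zero-length sleeps performed, or -1 if the loop did not finish
 * within max_iterations.
 */
int wait_until(long long now, long long deadline, bool accept_exact,
	       int max_iterations)
{
	int zero_sleeps = 0;

	for (int i = 0; i < max_iterations; i++) {
		long long timeout = deadline - now;

		if (accept_exact ? timeout <= 0 : timeout < 0)
			return zero_sleeps;

		if (timeout == 0) {
			/* Zero-length sleep: returns at once, clock unchanged. */
			zero_sleeps++;
			continue;
		}

		/* Assumed: the sleep ends exactly at the deadline. */
		now = deadline;
	}
	return -1;
}
```

With the "<=" variant the loop ends right after the first wakeup. With
"<" it keeps issuing zero-length sleeps and, since the modeled clock
never advances, only stops at the iteration cap.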

> > > Since time-travel mode has a very limited niche, I would not try to
> > > prevent every possible dumb behavior that bad user space programs could
> > > have. I think busy-waiting on a system clock advancement is not the best
> > > style, but acceptable.
> > > 
> > > So my list was:
> > > 
> > > sys_getitimer
> > > sys_gettimeofday
> > > sys_time
> > > sys_timer_gettime
> > > sys_clock_gettime
> > > sys_timerfd_gettime
> > > 
> > > While overthinking it, I see the possibility to read the access
> > > timestamps of a file to create an endless loop, so maybe the stat
> > > syscalls may be included, although this makes me a bit uncomfortable
> > > again. I tend to say, this "bad" behavior of asking the same information
> > > over and over again, should only be punished, if it happens multiple times.
> > > 
> > > I was thinking about, storing the PID of a busy-looped process, and only
> > > increase time, if the same PID is "suspicious".  However, this "hack"
> > > becomes more and more costly, which is on the other hand not important
> > > for timetravel mode.
> > Maybe a stupid question, but aren't we overthinking this in general?
> > 
> > While I think that Johannes' solution to make reading the time cost
> > time is kind of ingenious, I really wonder how much of an issue this
> > actually is. Because if this is just a few userspace applications and
> > libraries misbehaving, then we might as well fix the issue there
> > instead of doing anything special in UML.
> 
> Your point is right, and such bugs may be fixed in user space. On the
> other hand, what about software we can't or don't want to fix, which in 
> the wild simply works. For my future use cases, I will run code, that
> I'm not able to compile myself. I would even consider to have a runtime 
> switch to change the behavior of this hack, to reduce the overhead in
> simulations that behave nicely, but have some quick workaround for 
> misbehaving code.

Sure. But we could still decide to not support that in the upstream
kernel. How you solve the problem in your application would be up to
you. You can apply a simple patch or find another solution in
userspace.

> And sorry for repeating myself, but I believe, that busy waiting on an 
> increasing timer value is not the best style, but considered okay/normal 
> for some use cases. So I think it would be helpful to be able to execute 
> such user space code.

Right, it is just that I have not actually seen anyone wanting to do a
busy wait on a time. The python example explicitly tries to sleep and
just gets the rounding wrong.

> But I want to bring in another idea: Could we use an ebpf program to 
> dynamically hook into syscalls and do a timetravel_update or something 
> similar? Actually, I do not know whether ebpf works normally in UM, but 
> that way it would be flexible and moving the dirty hacks into small 
> portions outside the kernel. From what I understand, we would need to
> add an ebpf callable wrapper for the time travel update function, isn't it?

Hmm, possibly. I am not familiar with what is possible to do when
tracing syscalls using eBPF.

That said, I wonder if we should be inserting a clock_nanosleep()
conceptually. If you want to force a process to continue running after
the next clock tick, then you might want to schedule all other runnable
tasks *before* letting the clock tick.

> > > > One neat side effect is that if reading time does not actually cost
> > > > time, then we could implement clock_gettime in the VDSO.
> > > That would exactly not work, because of my comment from before.
> > Of course. It is just that I have always in the back of my mind that
> > syscalls and pagefaults (including minor faults) are really expensive
> > in UML. So if the hack is moved elsewhere then implementing
> > clock_gettime in the vDSO could be an easy win to speed up the
> > simulation.
> 
> Mhh I did only a quick look into "arch/x86/um/vdso/um_vdso.c" and from 
> my understanding, currently every vdso call is converted into syscalls 
> of the host. So we need much more code to use here the time travel 
> clock, isn't it? Of course, my proposed ebpf hook would not work here
> either...

Yeah, but I don't expect it to be that complicated. If the clock
doesn't actually change, then it would be trivial to just return a
constant value.

It gets more complicated if reading time should take time. But even
then, we can probably hide some (writable) bookkeeping data on the stub
page (as the vDSO data should be read-only). And outside of time-travel
a SECCOMP execution model would allow us to do direct host syscalls.

Anyway, I suppose vDSO improvements are not necessarily a short term
project.

Benjamin


