* Dovetail <-> PREEMPT_RT hybridization
@ 2020-07-20 20:47 Philippe Gerum
  2020-07-20 22:44 ` Paul
  2020-07-21  5:26 ` Meng, Fino
  0 siblings, 2 replies; 13+ messages in thread
From: Philippe Gerum @ 2020-07-20 20:47 UTC (permalink / raw)
  To: Evl; +Cc: Xenomai@xenomai.org

FWIW, I'm investigating the opportunity for rebasing Dovetail - and therefore the EVL core - on the PREEMPT_RT code base, ahead of the final integration of the latter into the mainline kernel tree. In the same move, the goal would be to leverage the improvements brought by native preemption with respect to fine-grained interrupt protection, while keeping the alternate scheduling [1] feature, which still exhibits significantly shorter preemption times and much cleaner jitter compared to what is - at least currently - achievable with a plain PREEMPT_RT kernel under meaningful stress.

With such hybridization, the Dovetail implementation should be even simpler. Companion cores based on it could run on the out-of-band execution stage unimpeded by other forms of preemption disabling in the in-band kernel (e.g. locks). This would preserve the most significant advantage of the pipelining model: reliable response times for applications, delivered at a modest processing cost by a lightweight real-time infrastructure.

This work entails porting the latest Dovetail code base I have been working on lately from 5.8 back to 5.6-rt, since this is the most recent public release of PREEMPT_RT so far. In addition, interrupt-free sections which are deemed too long for the companion core on top to cope with need to be identified in the target PREEMPT_RT release so that they can be mitigated (see below for an explanation of how this could be done). In the future, we should look for a way to automate this search, since finding these spots is likely going to be the boring task to carry out each time this new Dovetail implementation is ported to the next PREEMPT_RT release. Plus, there is the truckload of other tricky issues I may have overlooked.

If anyone is interested in participating in this work, let me know. I cannot guarantee success, but the data I have collected over time with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are combined the right way.

-- Nitty-gritty details about why and how to do this

Those acquainted with the interrupt pipelining technique Dovetail implements [2] may already know that decoupling the interrupt mask flag as perceived by the CPU from the one perceived by the kernel induces a number of tricky issues. We want interrupts to be unmasked in the CPU as long as possible while the kernel runs; to this end, the local_irq_*() helpers are switched to a software-based implementation which virtualizes the interrupt mask as perceived by the kernel (aka "stall bit"), while leaving interrupts enabled in the CPU and postponing the delivery of IRQs blocked by the virtual masking until they are accepted again by the kernel. This is a plain and simple log-if-blocked-then-replay-when-unblocked game [3].

However, we also have to synchronize the hardware and software interrupt masks in some specific places of the kernel in order to keep some hardware and software logic happy. Two examples come to mind; there are more of them:

- hardware-wise, we want updates to some registers to remain fully atomic despite the fact that interrupt pipelining is in effect. For arm64, we have to ensure that updates to the translation table registers (TTBRs) cannot be preempted; likewise for updates to the CR4 register on x86, which is notably used during TLB management. In both cases, we have to locally revert/override the changes Dovetail implicitly made by re-introducing CPU-based forms of interrupt disabling, instead of the software-based one.
- software-wise, keeping the LOCKDEP logic usable in a pipelined system requires fixing up the virtual interrupt mask on boundaries between kernel and user mode, so that it properly reflects what the locking validation engine expects at all times. This has been the most time-consuming work in a number of Dovetail upgrades to recent kernel releases, 5.8-rc included. Besides, I'm still not happy with the way this is done, which looks like playing whack-a-mole to some extent.

Many of these issues are hard to identify, and some may not be trivial to address (LOCKDEP support can become really ugly in this respect). Several other sub-systems like CPU idleness and power management have similar requirements for particular code paths.

Now, we may have another option for gaining fine-grained interrupt protection, which would build on the relentless work the PREEMPT_RT folks did on shrinking the interrupt-free sections in the kernel code to the bare minimum which is acceptable for native preemption, mainly by threading IRQs and introducing sleeping locks.

Instead of systematizing the virtualization of the local_irq_*() helpers, we could switch them back to their original - hardware-based - behavior, adding controlled mask-breaking statements manually to any remaining problematic code path. Such a statement would enable interrupts in the CPU while blocking them for the in-band kernel, using a local, non-pervasive variant of the current interrupt pipeline.

Within those long interrupt-free sections created by the in-band code, the companion core would nevertheless be allowed to process pending interrupts immediately while maintaining the interrupt protection for the in-band kernel. Identifying these sections for enabling the out-of-band code to preempt locally should be a matter of properly using the irqsoff tracer, provided the trace_hardirqs* instrumentation is correct.

e.g. roughly sketching a possible use case:

__schedule()
    lock(rq)            /* hard irqs off */
    ...
    context_switch()
        switch_mm
        switch_to
    ...
    unlock(rq)          /* hard irqs on */

The interrupt-free section above could amount to tens of microseconds on armv7 under significant pressure (especially with a sluggish L2 outer cache) and would prevent the out-of-band (companion) core from preempting in the meantime. To address this, switching the virtual interrupt state could be done manually by some dedicated service, say, "oob_synchronize()", which would first stall the in-band stage to keep the code interrupt-free in-band wise, then allow any pending hard IRQ to be taken by toggling the CPU mask flag, some of which the companion core might handle. Other IRQs to be handled by the in-band code would have to wait in a deferred interrupt log until hard IRQs are generally re-enabled later on, which is what happens today with the common pipelining technique on a broader scope.

__schedule()
    lock(rq)            /* hard irqs off */
    ...
    context_switch()
        switch_mm
        cond_sync_oob();    /* pending IRQs are synchronized for oob only */
        switch_to
    ...
    unlock(rq)          /* hard irqs on */

Ideally, switch_mm() should allow out-of-band IRQs to flow normally while changing the memory context for in-band tasks - we once had that for armv4/5 in the early days of the I-pipe, but doing this properly in current kernels would require non-trivial magic. So maybe next, when all the rest is functional.

Congrats if you read up to there. Comments welcome as usual.

[1] https://evlproject.org/dovetail/altsched/
[2] https://www.usenix.org/legacy/publications/library/proceedings/micro93/full_papers/stodolsky.txt
[3] https://evlproject.org/dovetail/pipeline/#virtual-i-flag

--
Philippe.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization 2020-07-20 20:47 Dovetail <-> PREEMPT_RT hybridization Philippe Gerum @ 2020-07-20 22:44 ` Paul 2020-07-21 8:18 ` Philippe Gerum 2020-07-21 5:26 ` Meng, Fino 1 sibling, 1 reply; 13+ messages in thread From: Paul @ 2020-07-20 22:44 UTC (permalink / raw) To: Philippe Gerum via Xenomai; +Cc: Philippe Gerum, Evl On Mon, 20 Jul 2020 22:47:29 +0200 Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote: > > Congrats if you read up to there. Comments welcome as usual. > > [1] https://evlproject.org/dovetail/altsched/ Getting a "Forbidden, You don't have permission to access this resource" error for that URL... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization 2020-07-20 22:44 ` Paul @ 2020-07-21 8:18 ` Philippe Gerum 2020-07-21 8:39 ` Paul 0 siblings, 1 reply; 13+ messages in thread From: Philippe Gerum @ 2020-07-21 8:18 UTC (permalink / raw) To: Paul, Philippe Gerum via Xenomai; +Cc: Evl On 7/21/20 12:44 AM, Paul wrote: > On Mon, 20 Jul 2020 22:47:29 +0200 > Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote: > >> >> Congrats if you read up to there. Comments welcome as usual. >> >> [1] https://evlproject.org/dovetail/altsched/ > > Getting a "Forbidden, You don't have permission to access this > resource" error for that URL... > Is this a general issue with the site or can you access the top page at https://evlproject.org for instance? -- Philippe. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization 2020-07-21 8:18 ` Philippe Gerum @ 2020-07-21 8:39 ` Paul 2020-07-21 9:25 ` Philippe Gerum 0 siblings, 1 reply; 13+ messages in thread From: Paul @ 2020-07-21 8:39 UTC (permalink / raw) To: Philippe Gerum; +Cc: Philippe Gerum via Xenomai, Evl On Tue, 21 Jul 2020 10:18:01 +0200 Philippe Gerum <rpm@xenomai.org> wrote: > On 7/21/20 12:44 AM, Paul wrote: > > On Mon, 20 Jul 2020 22:47:29 +0200 > > Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote: > > > >> > >> Congrats if you read up to there. Comments welcome as usual. > >> > >> [1] https://evlproject.org/dovetail/altsched/ > > > > Getting a "Forbidden, You don't have permission to access this > > resource" error for that URL... > > > > Is this a general issue with the site or can you access the top page > at https://evlproject.org for instance? > I can access the rest of the site including much of the Dovetail docs. Just the altsched page is out of bounds. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization 2020-07-21 8:39 ` Paul @ 2020-07-21 9:25 ` Philippe Gerum 2020-07-21 9:43 ` Paul 0 siblings, 1 reply; 13+ messages in thread From: Philippe Gerum @ 2020-07-21 9:25 UTC (permalink / raw) To: Paul; +Cc: Philippe Gerum via Xenomai, Evl On 7/21/20 10:39 AM, Paul wrote: > On Tue, 21 Jul 2020 10:18:01 +0200 > Philippe Gerum <rpm@xenomai.org> wrote: > >> On 7/21/20 12:44 AM, Paul wrote: >>> On Mon, 20 Jul 2020 22:47:29 +0200 >>> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote: >>> >>>> >>>> Congrats if you read up to there. Comments welcome as usual. >>>> >>>> [1] https://evlproject.org/dovetail/altsched/ >>> >>> Getting a "Forbidden, You don't have permission to access this >>> resource" error for that URL... >>> >> >> Is this a general issue with the site or can you access the top page >> at https://evlproject.org for instance? >> > > I can access the rest of the site including much of the Dovetail docs. > Just the altsched page is out of bounds. > Can you try again, forcing a cache reload in your browser? Thanks, -- Philippe. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization 2020-07-21 9:25 ` Philippe Gerum @ 2020-07-21 9:43 ` Paul 2020-07-21 9:46 ` Philippe Gerum 0 siblings, 1 reply; 13+ messages in thread From: Paul @ 2020-07-21 9:43 UTC (permalink / raw) To: Philippe Gerum; +Cc: Philippe Gerum via Xenomai, Evl On Tue, 21 Jul 2020 11:25:43 +0200 Philippe Gerum <rpm@xenomai.org> wrote: > On 7/21/20 10:39 AM, Paul wrote: > > On Tue, 21 Jul 2020 10:18:01 +0200 > > Philippe Gerum <rpm@xenomai.org> wrote: > > > >> On 7/21/20 12:44 AM, Paul wrote: > >>> On Mon, 20 Jul 2020 22:47:29 +0200 > >>> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote: > >>> > >>>> > >>>> Congrats if you read up to there. Comments welcome as usual. > >>>> > >>>> [1] https://evlproject.org/dovetail/altsched/ > >>> > >>> Getting a "Forbidden, You don't have permission to access this > >>> resource" error for that URL... > >>> > >> > >> Is this a general issue with the site or can you access the top > >> page at https://evlproject.org for instance? > >> > > > > I can access the rest of the site including much of the Dovetail > > docs. Just the altsched page is out of bounds. > > > > Can you try again, forcing a cache reload in your browser? That's fixed it ;-) ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization 2020-07-21 9:43 ` Paul @ 2020-07-21 9:46 ` Philippe Gerum 0 siblings, 0 replies; 13+ messages in thread From: Philippe Gerum @ 2020-07-21 9:46 UTC (permalink / raw) To: Paul; +Cc: Philippe Gerum via Xenomai, Evl On 7/21/20 11:43 AM, Paul wrote: > On Tue, 21 Jul 2020 11:25:43 +0200 > Philippe Gerum <rpm@xenomai.org> wrote: > >> On 7/21/20 10:39 AM, Paul wrote: >>> On Tue, 21 Jul 2020 10:18:01 +0200 >>> Philippe Gerum <rpm@xenomai.org> wrote: >>> >>>> On 7/21/20 12:44 AM, Paul wrote: >>>>> On Mon, 20 Jul 2020 22:47:29 +0200 >>>>> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote: >>>>> >>>>>> >>>>>> Congrats if you read up to there. Comments welcome as usual. >>>>>> >>>>>> [1] https://evlproject.org/dovetail/altsched/ >>>>> >>>>> Getting a "Forbidden, You don't have permission to access this >>>>> resource" error for that URL... >>>>> >>>> >>>> Is this a general issue with the site or can you access the top >>>> page at https://evlproject.org for instance? >>>> >>> >>> I can access the rest of the site including much of the Dovetail >>> docs. Just the altsched page is out of bounds. >>> >> >> Can you try again, forcing a cache reload in your browser? > > That's fixed it ;-) > Ok, thanks. -- Philippe. ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: Dovetail <-> PREEMPT_RT hybridization 2020-07-20 20:47 Dovetail <-> PREEMPT_RT hybridization Philippe Gerum 2020-07-20 22:44 ` Paul @ 2020-07-21 5:26 ` Meng, Fino 2020-07-21 17:18 ` Philippe Gerum 1 sibling, 1 reply; 13+ messages in thread From: Meng, Fino @ 2020-07-21 5:26 UTC (permalink / raw) To: Philippe Gerum, Evl; +Cc: Xenomai (xenomai@xenomai.org) >Sent: Tuesday, July 21, 2020 4:47 AM > >FWIW, I'm investigating the opportunity for rebasing Dovetail - and therefore the EVL core - on the PREEMPT_RT code base, >ahead of the final integration of the latter into the mainline kernel tree. In the same move, the goal would be to leverage the >improvements brought by native preemption with respect to fine-grained interrupt protection, while keeping the alternate >scheduling [1] feature, which still exhibits significantly shorter preemption times and much cleaner jitter compared to what >is - at least currently - achievable with a plain PREEMPT_RT kernel under meaningful stress. > >With such hybridization, the Dovetail implementation should be even simpler. >Companion cores based on it could run on the out-of-band execution stage unimpeded by other forms of preemption >disabling in the in-band kernel (e.g. >locks). This would preserve the most significant advantage of the pipelining model when it comes to reliable response times >for applications at a modest processing cost by a lightweight real-time infrastructure. > >This work entails porting the latest Dovetail code base I have been working on lately from 5.8 back to 5.6-rt, since this is the >most recent public release of PREEMPT_RT so far. In addition, interrupt-free sections which are deemed too long for the >companion core on top to cope with, need to be identified in the target PREEMPT_RT release so that they could be mitigated >(see below for an explanation about how it could be done). 
In the future, a way to automate such research should be
>looked for, since finding these spots is likely going to be the boring task to carry out each time this new Dovetail
>implementation is ported to the next PREEMPT_RT release. Plus, the truckload of other tricky issues I may have overlooked.
>
>If anyone is interested in participating in this work, let me know. I cannot guarantee success, but the data I have collected
>over time with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are
>combined the right way.

Hi Philippe,

I would like to participate. One of the motivations is that the TSN stack is now within Preempt-RT Linux.
Some time ago we discussed a similar idea with Jan: patch Ipipe/Xenomai onto a Preempt-RT kernel rather than a vanilla kernel,
then assign the Cobalt threads and Preempt-RT's RT threads to different cores.

BR / Fino (孟祥夫)
Intel – IOTG Developer Enabling

>
>-- Nitty-gritty details about why and how to do this
>
>Those acquainted with the interrupt pipelining technique Dovetail implements [2] may already know that decoupling the
>interrupt mask flag as perceived by the CPU from the one perceived by the kernel induces a number of tricky issues. We
>want interrupts to be unmasked in the CPU as long as possible while the kernel runs; to this end local_irq_*() helpers are
>switched to a software-based implementation which virtualizes the interrupt mask as perceived by the kernel, while leaving
>interrupts enabled in the CPU, postponing the delivery of IRQs blocked by the virtual masking until they are accepted again
>by the kernel (aka "stall bit"). This is a plain simple log-if-blocked-then-replay-when-unblocked game [3].
>
>However, we also have to synchronize the hardware and software interrupt masks in some specific places of the kernel in
>order to keep some hardware and software logic happy.
Two examples come to mind, there are more of them: > >- hardware-wise, we want updates to some registers to remain fully atomic despite the fact interrupt pipelining is in effect. >For arm64, we have to ensure that updates to the translation table registers (TTBRs) cannot be preempted, likewise for >updates to the CR4 register on x86 which is notably used during TLB management. In both cases, we have to locally >revert/override the changes Dovetail implicitly did by re-introducing CPU-based forms of interrupt disabling, instead of the >software-based one. > >- software-wise, maintaining the LOCKDEP logic usable in a pipelined system requires fixing up the virtual interrupt mask on >kernel boundaries between kernel and user mode, so that it properly reflects what the locking validation engine expects at >all times. This has been the most time-consuming work in a number of Dovetail upgrades to recent kernel releases, 5.8-rc >included. >Besides, I'm still not happy with the way this is done, which looks like playing whack-a-mole to some extent. > >Many of these issues are hard to identify, some may not be trivial to address (LOCKDEP support can become really ugly in >this respect). Several other sub-systems like CPU idleness and power management have similar requirements for particular >code paths. > >Now, we may have another option for gaining fine-grained interrupt protection, which would build on the relentless work >the PREEMPT_RT folks did about shrinking the interrupt-free sections in the kernel code to the bare minimum which is >acceptable for native preemption, by threading IRQs and introducing sleeping locks mainly. > >Instead of systematizing the virtualization of the local_irq_*() helpers, we could switch them back to their original - >hardware-based - behavior, adding controlled mask-breaking statements manually to any remaining problematic code path. 
>Such statement would enable interrupts in the CPU while blocking them for the in-band kernel, using a local, non-pervasive >variant of the current interrupt pipeline. > >Within those long interrupt-free sections created by the in-band code, the companion core would nevertheless be allowed >to process pending interrupts immediately while maintaining the interrupt protection for the in-band kernel. >Identifying these sections for enabling the out-of-band code to preempt locally should be a matter of properly using the >irqsoff tracer, provided the >trace_hardirqs* instrumentation is correct. > >e.g. roughly sketching a possible use case: > >__schedule() >lock(rq) /* hard irqs off */ >... >context_switch() > switch_mm > switch_to >... >unlock(rq) /* hard irqs on */ > >The interrupt-free section above could amount to tenths of microseconds on >armv7 under significant pressure (especially with a sluggish L2 outer cache) and would prevent the out-of-band (companion) >core to preempt in the meantime. >To address this, switching the virtual interrupt state could be done manually by some dedicated service, say, >"oob_synchronize()", which would first stall the in-band stage to keep the code interrupt-free in-band wise, then allow any >pending hard IRQ to be taken by toggling the CPU mask flag, possibly some of which the companion core would handle. >Other IRQs to be handled by the in-band code would have to wait into a deferred interrupt log until hard IRQs are generally >re-enabled later on, which is what happens today with the common pipelining technique on a broader scope. > >__schedule() >lock(rq) /* hard irqs off */ >... >context_switch() > switch_mm > cond_sync_oob(); /* pending IRQs are synchronized for oob only */ > switch_to >... 
>unlock(rq) /* hard irqs on */ > >Ideally, switch_mm() should allow out-of-band IRQs to flow normally while changing the memory context for in-band tasks - >we once had that for armv4/5 in the early days of the I-pipe, but this would require non-trivial magic to do this properly in >current kernels. So maybe next when all the rest is functional. > >Congrats if you read up to there. Comments welcome as usual. > >[1] https://evlproject.org/dovetail/altsched/ >[2] >https://www.usenix.org/legacy/publications/library/proceedings/micro93/full_papers/stodolsky.txt >[3] https://evlproject.org/dovetail/pipeline/#virtual-i-flag > >-- >Philippe. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21  5:26 ` Meng, Fino
@ 2020-07-21 17:18   ` Philippe Gerum
  2020-07-22 12:26     ` Meng, Fino
  2020-07-23 13:09     ` Steven Seeger
  0 siblings, 2 replies; 13+ messages in thread
From: Philippe Gerum @ 2020-07-21 17:18 UTC (permalink / raw)
  To: Meng, Fino, Evl; +Cc: Xenomai (xenomai@xenomai.org)

On 7/21/20 7:26 AM, Meng, Fino wrote:
>
>> Sent: Tuesday, July 21, 2020 4:47 AM
>>
>> FWIW, I'm investigating the opportunity for rebasing Dovetail - and therefore the EVL core - on the PREEMPT_RT code base,
>> ahead of the final integration of the latter into the mainline kernel tree. In the same move, the goal would be to leverage the
>> improvements brought by native preemption with respect to fine-grained interrupt protection, while keeping the alternate
>> scheduling [1] feature, which still exhibits significantly shorter preemption times and much cleaner jitter compared to what
>> is - at least currently - achievable with a plain PREEMPT_RT kernel under meaningful stress.
>>
>> With such hybridization, the Dovetail implementation should be even simpler.
>> Companion cores based on it could run on the out-of-band execution stage unimpeded by other forms of preemption
>> disabling in the in-band kernel (e.g.
>> locks). This would preserve the most significant advantage of the pipelining model when it comes to reliable response times
>> for applications at a modest processing cost by a lightweight real-time infrastructure.
>>
>> This work entails porting the latest Dovetail code base I have been working on lately from 5.8 back to 5.6-rt, since this is the
>> most recent public release of PREEMPT_RT so far. In addition, interrupt-free sections which are deemed too long for the
>> companion core on top to cope with, need to be identified in the target PREEMPT_RT release so that they could be mitigated
>> (see below for an explanation about how it could be done). In the future, a way to automate such research should be
>> looked for, since finding these spots is likely going to be the boring task to carry out each time this new Dovetail
>> implementation is ported to the next PREEMPT_RT release. Plus, the truckload of other tricky issues I may have overlooked.
>>
>> If anyone is interested in participating in this work, let me know. I cannot guarantee success, but the data I have collected
>> over time with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are
>> combined the right way.
>
> Hi Philippe,
>
> I would like to participate. One the of motivation is the TSN stack is now within Preempt-RT Linux.
> Some time ago we have discussed with Jan about similar idea, patch Ipipe/Xenomai onto Preempt-RT kernel but not vanilla kernel,
> then separate Cobalt thread and Preempt-RT's RT thread to different cores.
>

Ok. As far as I'm concerned, I'm only scratching an itch: I am interested in finding ways to downsize the hardware needed to run applications with demanding response time requirements, without necessarily resorting to a plain RTOS.

Back to the initial point, this work should involve, roughly:

- implementing a Dovetail variant in the native preemption kernel. This is actually not a direct port: the new implementation would depart from the current Dovetail code in significant ways, although the basics would be the same, only used differently. I plan to work on this, although it would be much better if other folks joined me in the implementation once the thing is bootstrapped.

- identifying and quantifying the longest interrupt-free sections in the target preempt-rt kernel under meaningful stress load, with the irqsoff tracer. I wrote down some information [1] about the stress workloads which actually make a difference when benchmarking, as far as I can tell. At any rate, the results we would get there would be crucial for figuring out where to add the out-of-band synchronization points, and likely of some interest upstream too. I'm primarily targeting armv7 and armv8; it would be great if you could help with x86.

- the two previous points are obviously part of an iterative process centered on testing the implementation with a real-time core. I'm going to use the EVL core for this, since it is sitting on Dovetail already.

[1] https://evlproject.org/core/benchmarks/#stress-load

--
Philippe.

^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: Dovetail <-> PREEMPT_RT hybridization 2020-07-21 17:18 ` Philippe Gerum @ 2020-07-22 12:26 ` Meng, Fino 2020-07-23 13:09 ` Steven Seeger 1 sibling, 0 replies; 13+ messages in thread From: Meng, Fino @ 2020-07-22 12:26 UTC (permalink / raw) To: Philippe Gerum, Evl; +Cc: Xenomai (xenomai@xenomai.org) >On 7/21/20 7:26 AM, Meng, Fino wrote: >> >>> Sent: Tuesday, July 21, 2020 4:47 AM >>> >>> FWIW, I'm investigating the opportunity for rebasing Dovetail - and >>> therefore the EVL core - on the PREEMPT_RT code base, ahead of the >>> final integration of the latter into the mainline kernel tree. In the >>> same move, the goal would be to leverage the improvements brought by >>> native preemption with respect to fine-grained interrupt protection, while keeping the alternate scheduling [1] feature, >which still exhibits significantly shorter preemption times and much cleaner jitter compared to what is - at least currently - >achievable with a plain PREEMPT_RT kernel under meaningful stress. >>> >>> With such hybridization, the Dovetail implementation should be even simpler. >>> Companion cores based on it could run on the out-of-band execution >>> stage unimpeded by other forms of preemption disabling in the in-band kernel (e.g. >>> locks). This would preserve the most significant advantage of the >>> pipelining model when it comes to reliable response times for applications at a modest processing cost by a lightweight >real-time infrastructure. >>> >>> This work entails porting the latest Dovetail code base I have been >>> working on lately from 5.8 back to 5.6-rt, since this is the most >>> recent public release of PREEMPT_RT so far. In addition, >>> interrupt-free sections which are deemed too long for the companion >>> core on top to cope with, need to be identified in the target >>> PREEMPT_RT release so that they could be mitigated (see below for an explanation about how it could be done). 
In the >future, a way to automate such research should be looked for, since finding these spots is likely going to be the boring task >to carry out each time this new Dovetail implementation is ported to the next PREEMPT_RT release. Plus, the truckload of >other tricky issues I may have overlooked. >>> >>> If anyone is interested in participating in this work, let me know. I >>> cannot guarantee success, but the data I have collected over time >>> with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are combined >the right way. >> >> Hi Philippe, >> >> I would like to participate. One the of motivation is the TSN stack is now within Preempt-RT Linux. >> Some time ago we have discussed with Jan about similar idea, patch >> Ipipe/Xenomai onto Preempt-RT kernel but not vanilla kernel, then separate Cobalt thread and Preempt-RT's RT thread to >different cores. >> > >Ok. As far as I'm concerned, I'm only scratching an itch, I find some interest in looking for ways to downsize the hardware >for running applications with demanding response time requirements, without necessarily resorting to a plain rtos. > >Back to the initial point, this work should involve, roughly: > >- implementing a Dovetail variant in the native preemption kernel. This is actually not a direct port, the new implementation >would depart from the current Dovetail code in significant ways, although the basics would be the same, only used >differently. I plan to work on this, although it would be much better if other folks would join me in the implementation once >the thing is bootstrapped. > >- identifying and quantifying the longest interrupt-free sections in the target preempt-rt kernel under meaningful stress load, >with the irqoff tracer. >I wrote down some information [1] about the stress workloads which actually make a difference when benchmarking as far >as I can tell. 
At any rate, the results we would get there would be crucial in order to figure out where to add the out-of-band
>synchronization points, and likely of some interest upstream too. I'm primarily targeting armv7 and armv8, it would be great
>if you could help with x86.
>
>- the two previous points are obviously part of an iterative process centered on testing the implementation with a real-time
>core. I'm going to use the EVL core for this, since it is sitting on Dovetail already.
>
>[1] https://evlproject.org/core/benchmarks/#stress-load
>

I will use an UP Xtreme board (WHL8565U) for testing, since it is easy for developers worldwide to buy.
https://up-shop.org/up-xtreme-series.html

The Intel IOTG kernel team maintains a Preempt-RT kernel, well tested on x86, but only up to 5.4:
https://github.com/intel/linux-intel-lts/tree/5.4/preempt-rt
I don't know whether it would help in this case.

BR / Fino

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21 17:18   ` Philippe Gerum
  2020-07-22 12:26     ` Meng, Fino
@ 2020-07-23 13:09     ` Steven Seeger
  2020-07-23 16:23       ` Philippe Gerum
  1 sibling, 1 reply; 13+ messages in thread
From: Steven Seeger @ 2020-07-23 13:09 UTC (permalink / raw)
  To: Meng, Fino, Evl, xenomai; +Cc: Philippe Gerum

On Tuesday, July 21, 2020 1:18:21 PM EDT Philippe Gerum wrote:
>
> - identifying and quantifying the longest interrupt-free sections in the
> target preempt-rt kernel under meaningful stress load, with the irqoff
> tracer. I wrote down some information [1] about the stress workloads which
> actually make a difference when benchmarking as far as I can tell. At any
> rate, the results we would get there would be crucial in order to figure
> out where to add the out-of-band synchronization points, and likely of some
> interest upstream too. I'm primarily targeting armv7 and armv8, it would be
> great if you could help with x86.

So from my perspective, one of the beauties of Xenomai with traditional IPIPE is that you can analyze the fast interrupt path and see that by design you have an upper bound on latency. You can even calculate it. It's based on the number of cpu cycles at irq entry multiplied by the total number of IRQs that could happen at the same time. Depending on your hardware, maybe you know the priority of handling the interrupt in question.

The point was the system was analyzable by design.

When you start talking about looking for long critical sections and adding sync points to them, I think you take away the by-design guarantees for latency. This might make it less suitable for hard realtime systems.

IMHO this is not any better than Preempt-RT. But maybe I am missing something. :)

Steven

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-23 13:09 ` Steven Seeger
@ 2020-07-23 16:23 ` Philippe Gerum
  2020-07-23 21:53   ` Steven Seeger
  0 siblings, 1 reply; 13+ messages in thread
From: Philippe Gerum @ 2020-07-23 16:23 UTC (permalink / raw)
  To: Steven Seeger, Meng, Fino, Evl, xenomai

On 7/23/20 3:09 PM, Steven Seeger wrote:
> On Tuesday, July 21, 2020 1:18:21 PM EDT Philippe Gerum wrote:
>>
>> - identifying and quantifying the longest interrupt-free sections in the
>> target preempt-rt kernel under meaningful stress load, with the irqoff
>> tracer. I wrote down some information [1] about the stress workloads which
>> actually make a difference when benchmarking as far as I can tell. At any
>> rate, the results we would get there would be crucial in order to figure
>> out where to add the out-of-band synchronization points, and likely of some
>> interest upstream too. I'm primarily targeting armv7 and armv8, it would be
>> great if you could help with x86.
>
> So from my perspective, one of the beauties of Xenomai with traditional IPIPE
> is you can analyze the fast interrupt path and see that by design you have an
> upper bound on latency. You can even calculate it. It's based on the number of
> cpu cycles at irq entry multiplied by the total number of IRQs that could
> happen at the same time. Depending on your hardware, maybe you know the
> priority of handling the interrupt in question.
>
> The point was the system was analyzable by design.
>

Two misunderstandings it seems:

- this work is all about evolving Dovetail, not Xenomai. If such work does
bring the upsides I'm expecting, then I would surely switch EVL to it. In
parallel, you would still have the opportunity to keep the current Dovetail
implementation - currently under validation on top of 5.8 - and maintain it
for Xenomai, once the latter is rebased over the former. You could also
stick to the I-pipe for Xenomai, so no issue.
- you seem to be assuming that every code path of the kernel is
interruptible with the I-pipe/Dovetail; this is not the case, by far. Some
key portions run with hard irqs off, either because there is no other way to
share some code paths between the regular kernel and the real-time core, or
because the hardware may require it (as hinted in my introductory post).
Some of those sections may take ages under cache pressure (switch_to comes
to mind), tens of microseconds, happening mostly randomly from the
standpoint of the external observer (i.e. you, me). So much for quantifying
timings by design.

We can only figure out a worst-case value by submitting the system to a
reckless stress workload, for long enough. This game of sharing the very
same hardware between GPOS and RTOS activities has been based on a
probabilistic approach so far, which can be summarized as: do your best to
keep the interrupts enabled as long as possible, ensure fine-grained
preemption of tasks, make sure to give the result hell to detect issues,
and hope for the hardware not to rain on the parade.

Back to the initial point: virtualizing the effect of the local_irq helpers
you refer to is required when their use is front and center in serializing
kernel activities. However, in a preempt-rt kernel, most interrupt handlers
are threaded, regular spinlocks are blocking mutexes in disguise, so what
remains is:

- sections covered by the raw_spin_lock API, which is primarily a problem
because we would spin with hard irqs off attempting to acquire the lock.
There is a proven technical solution to this based on an application of
interrupt pipelining.

- few remaining local_irq disabled sections which may run for too long, but
could be relaxed enough in order for the real-time core to preempt without
prejudice. This is where pro-actively tracing the kernel under stress comes
into play.
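For readers less familiar with what "virtualizing the effect of the
local_irq helpers" means, the following toy model may help. It is only a
sketch of the log-if-blocked-then-replay-when-unblocked idea described in
the introductory post, not Dovetail code: the class, the method names
(stall, unstall, hardware_irq) and the IRQ numbers are all hypothetical.
It also assumes that IRQs claimed by the real-time core are dispatched at
once regardless of the virtual mask.

```python
from collections import deque


class PipelinedIrqStage:
    """Toy model of a virtualized (software) interrupt mask, not kernel code."""

    def __init__(self, inband_handler, oob_handler, oob_irqs=()):
        self.stalled = False                 # "stall bit": mask seen by in-band code
        self.pending = deque()               # in-band IRQs logged while stalled
        self.inband_handler = inband_handler
        self.oob_handler = oob_handler
        self.oob_irqs = frozenset(oob_irqs)  # IRQs claimed by the real-time core

    def stall(self):
        """Stand-in for local_irq_disable(): only the software flag is set."""
        self.stalled = True

    def unstall(self):
        """Stand-in for local_irq_enable(): clear the flag, then replay the log."""
        self.stalled = False
        while self.pending and not self.stalled:
            self.inband_handler(self.pending.popleft())

    def hardware_irq(self, irq):
        """Entry point for every IRQ taken by the (simulated) CPU."""
        if irq in self.oob_irqs:
            self.oob_handler(irq)        # never delayed by the stall bit
        elif self.stalled:
            self.pending.append(irq)     # log if blocked
        else:
            self.inband_handler(irq)     # deliver immediately


events = []
stage = PipelinedIrqStage(
    inband_handler=lambda irq: events.append(("in-band", irq)),
    oob_handler=lambda irq: events.append(("oob", irq)),
    oob_irqs={7},
)
stage.stall()
stage.hardware_irq(3)   # in-band IRQ: logged, not handled yet
stage.hardware_irq(7)   # real-time IRQ: handled at once
stage.unstall()         # replays IRQ 3
print(events)           # [('oob', 7), ('in-band', 3)]
```

In this model the real-time handler runs while the in-band side believes
interrupts are off, which is the property the raw_spin_lock and remaining
local_irq sections listed above would otherwise defeat, since those mask
interrupts in the CPU itself.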
Working on these three aspects specifically does not bring fewer guarantees
than hoping for no assembly code to create long uninterruptible sections
(therefore not covered by local_irq_* helpers), no driver talking to a GPU
killing latency with CPU stalls, no shared cache architecture causing all
sorts of insane traffic between cache levels, causing memory access speed to
sink and overall performance to degrade.

Again, the key issue there is about running two competing workloads on the
same hardware, GPOS and RTOS. They tend to not get along that much. Each
time the latter pauses, the former may resume and happily trash some
hardware sub-system both rely on. So for your calculation to be right, you
would have not only to involve the RTOS code, but also what the GPOS is up
to, and how this might change the timings.

In this respect, no I-pipe - and for that matter no current or future
Dovetail implementation - can provide any guarantee. In short, I'm pretty
convinced that any calculation you would try would be wrong by design,
missing quite a few significant variables in the equation, at least for
precise timing.

However, it is true that relying on native preemption creates more
opportunities for some hog code to cause havoc in the latency chart because
of brain-damaged locking for instance, which may have been solved in a
serendipitous way with the current I-pipe/Dovetail model, due to the
systematized virtualization of the local_irq helpers. This said,
spinlock-wise for preempt-rt, this would only be an issue with rogue code
contributions explicitly using raw spinlocks mindlessly, which would be yet
another step toward a nomination of their author for the Darwin awards
(especially if they get caught by some upstream reviewer on the lkml).

At this point, there are two options:

- consider that direct or indirect local_irq* usage is the only factor of
increased interrupt latency, which is provably wrong.
- assume that future work aimed at statically detecting misbehaving code
(rt-wise) in the kernel may succeed, which may be optimistic but at least
not fundamentally flawed. So I'll go for the optimistic view.

> When you start talking about looking for long critical sections and adding
> sync points in it, I think you take away the by-design guarantees for latency.
> This might make it less-suitable for hard realtime systems.
>
> IMHO this is not any better than Preempt-RT. But maybe I am missing something.
> :)
>

You may be missing the reasoning behind the alternate scheduling Dovetail
implements, which is a generalization of what the I-pipe and Xenomai do to
schedule tasks regardless of the preemption status of the regular kernel.

IMHO, the issue shouldn't be about decreasing interrupt masking throughout
the kernel, which should be fine-grained enough to only require manual
fixups for addressing the remaining problems - this assumption triggered
the idea of a Dovetail overhaul.

The other granularity, the one that still matters today, relates to task
preemption: how snappy can the kernel be made in order to direct the CPU to
run a different task when some urgent event has to be handled asap?
Dovetail still brings the alternate scheduling feature for expedited tasks,
which preempt-rt has no reason to have, by essence.

This task preemption issue is a harder nut to crack, because techniques
which make the granularity finer may also increase the overall cost induced
in maintaining the infrastructure (complex priority inheritance for
mutexes, threaded irq model and sleeping spinlocks which increase the
context switching rate, preemptible RCU and so on). That cost, which is
clearly visible in latency analysis, is by design not applicable to tasks
scheduled by a companion core, with much simpler locking semantics which
are not shared with the regular kernel.
In that sense specifically, I would definitely agree that estimating a WCET
based on the scheduling behavior of Xenomai or EVL is way simpler than
mathematizing what might happen in the linux kernel.

-- 
Philippe.
* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-23 16:23 ` Philippe Gerum
@ 2020-07-23 21:53 ` Steven Seeger
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Seeger @ 2020-07-23 21:53 UTC (permalink / raw)
  To: Meng, Fino, Evl, xenomai, Philippe Gerum

On Thursday, July 23, 2020 12:23:53 PM EDT Philippe Gerum wrote:
> Two misunderstandings it seems:
>
> - this work is all about evolving Dovetail, not Xenomai. If such work does
> bring the upsides I'm expecting, then I would surely switch EVL to it. In
> parallel, you would still have the opportunity to keep the current Dovetail
> implementation - currently under validation on top of 5.8 - and maintain it
> for Xenomai, once the latter is rebased over the former. You could also
> stick to the I-pipe for Xenomai, so no issue.

That may be my misunderstanding. I thought Dovetail's ultimate goal was to at
least match the performance of IPIPE while being simpler to maintain.

> - you seem to be assuming that every code path of the kernel is
> interruptible with the I-pipe/Dovetail, this is not the case, by far. Some
> key portions run with hard irqs off, just because there is no other way to
> 1) share some code paths between the regular kernel and the real-time core,
> 2) the hardware may require it (as hinted in my introductory post). Some of
> those sections may take ages under cache pressure (switch_to comes to
> mind), tens of microseconds, happening mostly randomly from the
> standpoint of the external observer (i.e. you, me). So much for quantifying
> timings by design.

So with switch_to having hard irqs off, the cache pressure should be
deterministic because there's an upper bound on cache lines, the number of
memory pages that need to be accessed, and the code path is pretty
straightforward if memory serves. I would think that this being well bounded
supports my initial point.

>
> We can only figure out a worst-case value by submitting the system to a
> reckless stress workload, for long enough.
> This game of sharing the very
> same hardware between GPOS and a RTOS activities has been based on a
> probabilistic approach so far, which can be summarized as: do your best to
> keep the interrupts enabled as long as possible, ensure fine-grained
> preemption of tasks, make sure to give the result hell to detect issues,
> and hope for the hardware not to rain on the parade.

I agree that in practice, a reckless stress workload is necessary to quantify
system latency. However, relying on this is a problem when it comes time to
convince managers who want to spend tons of money for expensive and proven OS
solutions instead of using the fun and cool stuff we do. ;)

At some point, if possible, someone should try and actually prove the system
given the bounds.

1) There's only so many pages of memory
2) There's only so much cache and so many cache lines
3) There's only so many sources of interrupts
4) There's only so many sources of CPU stalls, and the number of those stalls
should have a limit in hardware.

I can't really think of anything else, but I don't know why there'd be any
sort of randomness on top of this.

One thing we might not be on the same page about is that typically
(especially on single processor systems) when I talk about timing-by-design
calculations I am referring to one single high priority thing. That could be
the time from a timer interrupt to the first instruction running in that
timer interrupt handler, or it could be to the point where the highest
priority thread in the system resumes.

>
> Back to the initial point: virtualizing the effect of the local_irq helpers
> you refer to is required when their use is front and center in serializing
> kernel activities. However, in a preempt-rt kernel, most interrupt handlers
> are threaded, regular spinlocks are blocking mutexes in disguise, so what
> remains is:

Yes, but this depends on a cooperative model. Other drivers can mess you up,
as described by you below.
>
> - sections covered by the raw_spin_lock API, which is primarily a problem
> because we would spin with hard irqs off attempting to acquire the lock.
> There is a proven technical solution to this based on a application of
> interrupt pipelining.

Yes.

> - few remaining local_irq disabled sections which may run for too long, but
> could be relaxed enough in order for the real-time core to preempt without
> prejudice. This is where pro-actively tracing the kernel under stress comes
> into play.

This is my problem with preempt-rt. Ipipe forces this preemption by changing
what the macros that Linux devs think turn interrupts off actually do. We
never need to worry about this in the RTOS domain.

> Working on these three aspects specifically does not bring less guarantees
> than hoping for no assembly code to create long uninterruptible section
> (therefore not covered by local_irq_* helpers), no driver talking to a GPU
> killing latency with CPU stalls, no shared cache architecture causing all
> sort of insane traffic between cache levels, causing memory access speed to
> sink and overall performances to degrade.

I haven't had a chance to work with these sorts of systems, but we are doing
more with ARM processors with a multi-level MMU and I'm very curious about
how this will affect performance when you're trying to do AMP and RTOS and
GPOS on the same chip.

>
> Again, the key issue there is about running two competing workloads on the
> same hardware, GPOS and RTOS. They tend to not get along that much. Each
> time the latter pauses, the former may resume and happily trash some
> hardware sub-system both rely on. So for your calculation to be right, you
> would have not only to involve the RTOS code, but also what the GPOS is up
> to, and how this might change the timings.

That is true. This could possibly be handled with a hypervisor that actually
touches the hardware, though.
>
> In this respect, no I-pipe - and for that matter no current or future
> Dovetail implementation - can provide any guarantee. In short, I'm pretty
> convinced that any calculation you would try would be wrong by design,
> missing quite a few significant variables in the equation, at least for
> precise timing.

There's only so many ways to break the system, so we just need to make sure
we find all the variables. ;) I do think what I am saying is true for single
processor systems, but you have a point for multi.

> However, it is true that relying on native preemption creates more
> opportunities for some hog code to cause havoc in the latency chart because
> of braindamage locking for instance, which may have been solved in a
> serendipitous way with the current I-pipe/Dovetail model, due to the
> systematized virtualization of the local_irq helpers. This said,
> spinlock-wise for preempt-rt, this would only be an issue with rogue code
> contributions explicitly using raw spinlocks mindlessly, which would be yet
> another step toward a nomination of their author for the Darwin awards
> (especially if they get caught by some upstream reviewer on the lkml).

There are people who insist that the system be safe and robust against rogue
code.

>
> At this point, there are two options:
>
> - consider that direct or indirect local_irq* usage is the only factor of
> increased interrupt latency, which is provably wrong.

I would think this to be the largest contributor to it, though.

>
> - assume that future work aimed at statically detecting misbehaving code
> (rt-wise) in the kernel may succeed, which may be optimistic but at least
> not fundamentally flawed. So I'll go for the optimistic view.

This is a good point. But I think a system where you depend on millions of
lines of code to be good instead of a few thousand lines of code is asking
for trouble.
>
> This task preemption issue is a harder nut to crack, because techniques
> which make the granularity finer may also increase the overall cost induced
> in maintaining the infrastructure (complex priority inheritance for
> mutexes, threaded irq model and sleeping spinlocks which increase the
> context switching rate, preemptible RCU and so on). That cost, which is
> clearly visible in latency analysis, is by design not applicable to tasks
> scheduled by a companion core, with much simpler locking semantics which
> are not shared with the regular kernel. In that sense specifically, I would
> definitely agree that estimating a WCET based on the scheduling behavior of
> Xenomai or EVL is way simpler than mathematizing what might happen in the
> linux kernel.

So in your opinion what is more important? Lowest possible latency on a
best-case or average basis, or determinism? What I mean is that the things
you mention that come with a higher cost in terms of latency also come with
more determinism (less jitter, more predictable results, etc.). So what is
the goal you are working towards?

I've known you for like 20 years and probably never won an argument, so
history says I will be headed to defeat here. ;)

Steven