From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <43D21144.8040005@domain.hid>
Date: Sat, 21 Jan 2006 11:47:32 +0100
From: Jan Kiszka
Sender: jan.kiszka@domain.hid
Subject: [Xenomai-core] [BUG] racy xnshadow_harden under CONFIG_PREEMPT
List-Id: "Xenomai life and development (bug reports, patches, discussions)"
To: xenomai-core

Hi,

well, if I'm not totally wrong, we have a design problem in the RT-thread hardening path. I dug into the crash Jeroen reported and I'm quite sure that this is the reason.

So that's the bad news. The good news is that we can at least work around it by switching off CONFIG_PREEMPT for Linux (this implicitly means that it's a 2.6-only issue).

@Jeroen: Did you verify that your setup also works fine without CONFIG_PREEMPT?

But let's start with two assumptions my further analysis is based on:

[Xenomai]
  o Shadow threads have only one stack, i.e. one context. If the real-time part is active (this includes being blocked on some xnsynch object or delayed), the original Linux task must NEVER EVER be executed, even if it would immediately fall asleep again. That's because the stack is in use by the real-time part at that time. And this condition is checked in do_schedule_event() [1].

[Linux]
  o A Linux task which has called set_current_state() will remain in the run-queue until it calls schedule() on its own. This means that it can be preempted (if CONFIG_PREEMPT is set) between set_current_state() and schedule() and then even be resumed again.
    Only the explicit call of schedule() will trigger deactivate_task(), which will in turn remove current from the run-queue.

Ok, if this is true, let's have a look at xnshadow_harden(): After grabbing the gatekeeper sem and putting itself in gk->thread, a task going for RT then marks itself TASK_INTERRUPTIBLE and wakes up the gatekeeper [2]. This does not include a Linux reschedule due to the _sync version of wake_up_interruptible. What can happen now?

1) No interruption until we have called schedule() [3]. All fine as we will now be removed from the run-queue before the gatekeeper starts kicking our RT part, thus no conflict in using the thread's stack.

2) Interruption by an RT IRQ. This would just delay the path described above, even if some RT threads get executed. Once they are finished, we continue in xnshadow_harden() - given that the RT part does not trigger the following case:

3) Interruption by some Linux IRQ. This may cause other threads to become runnable as well, but the gatekeeper has the highest prio and will therefore be the next. The problem is that the rescheduling on Linux IRQ exit will PREEMPT our task in xnshadow_harden(), it will NOT remove it from the Linux run-queue. And now we are in real trouble: The gatekeeper will kick off our RT part which will take over the thread's stack. As soon as the RT domain falls asleep and Linux takes over again, it will continue our non-RT part as well! Actually, this seems to be the reason for the panic in do_schedule_event(). Without CONFIG_XENO_OPT_DEBUG and this check, we will run both parts AT THE SAME TIME now, thus violating my first assumption. The system gets fatally corrupted.

Well, I would be happy if someone could prove me wrong here.

The problem is that I don't see a solution because Linux does not provide an atomic wake-up + schedule-out under CONFIG_PREEMPT.
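
To make the difference between cases 1) and 3) concrete, here is a minimal user-space sketch of the ordering problem. It is a toy model only, not kernel or Xenomai code: every name in it (toy_task, toy_schedule, toy_preempt, toy_gatekeeper, toy_harden) is hypothetical, and it assumes the two behaviours stated above, namely that only an explicit schedule() dequeues a sleeping task and that preemption leaves it queued.

```c
#include <stdbool.h>

/* Toy model of the hardening race. All names are hypothetical. */
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
    bool on_runqueue;  /* Linux may still pick this task */
    bool rt_active;    /* RT side owns the single stack */
};

/* Models an explicit schedule(): a non-running task is dequeued. */
static void toy_schedule(struct toy_task *t)
{
    if (t->state != TOY_RUNNING)
        t->on_runqueue = false;
}

/* Models preemption on Linux IRQ exit: switched out, but still queued. */
static void toy_preempt(struct toy_task *t)
{
    (void)t; /* run-queue membership is deliberately left unchanged */
}

/* Models the gatekeeper kicking off the RT part of the shadow thread. */
static void toy_gatekeeper(struct toy_task *t)
{
    t->rt_active = true;
}

/* Returns true if both sides could end up using the same stack. */
bool toy_harden(bool linux_irq_before_schedule)
{
    struct toy_task t = { TOY_RUNNING, true, false };

    t.state = TOY_INTERRUPTIBLE;  /* stands in for set_current_state() */
    /* The _sync wake-up makes the gatekeeper runnable without a resched. */

    if (linux_irq_before_schedule)
        toy_preempt(&t);          /* case 3: preempted before schedule() */
    else
        toy_schedule(&t);         /* case 1: dequeued by our own schedule() */

    toy_gatekeeper(&t);           /* highest prio, so it runs next */

    /* Conflict: RT part is active while Linux can still resume the task. */
    return t.rt_active && t.on_runqueue;
}
```

In this model toy_harden(false) reports no conflict, while toy_harden(true) does, because the task is still on the run-queue when the RT part takes over the stack. The model says nothing about timing or probabilities; it only encodes the ordering argument.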
I'm currently considering a hack to remove the migrating Linux thread manually from the run-queue, but this could easily break the Linux scheduler.

Jan

PS: Out of curiosity I also checked RTAI's migration mechanism in this regard. It's similar except for the fact that it does the gatekeeper's work in the Linux scheduler's tail (i.e. after the next context switch). And RTAI seems to suffer from the very same race. So this is either a fundamental issue - or I'm fundamentally wrong.

[1] http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L1573
[2] http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L461
[3] http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L481