From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <4651C70E.8080908@domain.hid> Date: Mon, 21 May 2007 18:21:34 +0200 From: Jan Kiszka MIME-Version: 1.0 References: <465099CE.2010004@domain.hid> <4651AFF9.5030402@domain.hid> <1179763621.473.145.camel@domain.hid> In-Reply-To: <1179763621.473.145.camel@domain.hid> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig5ACF6E0691ED1360EBDF78C0" Sender: jan.kiszka@domain.hid Subject: Re: [Xenomai-core] [BUG] module_put called over non-root domain List-Id: "Xenomai life and development \(bug reports, patches, discussions\)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: rpm@xenomai.org Cc: xenomai-core This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig5ACF6E0691ED1360EBDF78C0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: quoted-printable Philippe Gerum wrote: > On Mon, 2007-05-21 at 16:43 +0200, Jan Kiszka wrote: >> Hi Philippe, >> >> Jan Kiszka wrote: >>> Just ran into this with CONFIG_IPIPE_DEBUG_CONTEXT (maybe due to some= >>> bug of my own): >> Here is some code to trigger the issue reliably: >> >> #include >> #include >> >> void task_fnct(void *arg) >> { >> rt_task_delete(NULL); >> } >> >> main() >> { >> RT_TASK task; >> mlockall(MCL_CURRENT|MCL_FUTURE); >> rt_task_spawn(&task, "task", 0, 10, 0, task_fnct, NULL); >> } >> >> >>> [ 102.616000] I-pipe: Detected illicit call from domain 'Xenomai' >>> [ 102.616000] into a service reserved for domain 'Linux' and= below. >>> [ 102.616000] c741bdc8 00000000 00000000 c8860ef8 c741bdec c0= 105683 c032c200 c13fe22c >>> [ 102.616000] c0361f00 c741be08 c01519ed c032f5b8 c032c742 c0= 3380b3 c8885100 c78beac0 >>> [ 102.616000] c741be14 c0142ce9 c7a80b30 c741be3c c884d075 c8= 85f150 c8860ef8 c741be3c >>> [ 102.616000] Call Trace: >>> [ 102.616000] [] show_trace_log_lvl+0x1f/0x40 >>> [ 102.616000] [] show_stack_log_lvl+0xb1/0xe0 >>> [ 102.616000] [] show_stack+0x33/0x40 >>> [ 102.616000] [] ipipe_check_context+0xad/0xc0 >>> [ 102.616000] [] module_put+0x19/0x90 >>> [ 102.616000] [] xnshadow_unmap+0xb5/0x130 [xeno_nucleus]= >>> [ 102.616000] [] __shadow_delete_hook+0x25/0x30 [xeno_nat= ive] >>> [ 102.616000] [] xnpod_schedule+0xb58/0x12f0 [xeno_nucleu= s] >>> [ 102.616000] [] xnpod_delete_thread+0x2cb/0x3d0 [xeno_nu= cleus] >>> [ 102.616000] [] rt_task_delete+0x20d/0x220 [xeno_native]= >>> >>> I would dare to say that module_put in xnshadow_unmap is not well pla= ced >>> as it can wakeup a Linux process. The module ref-counter maintenance >>> needs some postponing, I guess. >=20 > Mm, I should definitely read mails entirely. >=20 >> Attached is a patch proposal. It solves the issue by postponing the >> module_put via a new schedule_linux_call. Note that this approach issu= es >> LO_WAKEUP_REQ where the old test (p->state !=3D TASK_RUNNING) would no= t >> have done so. I don't see negative side effects yet, >=20 > This code has to work with an awful lot of runtime situations, includin= g > those raised by ptracing/GDB, and thread signaling in general, so test > harnessing is key here. Well, what about listing those scenarios and their requirements in a comment if this is so critical? That may make /me feel better as I could more easily go through them and check changes like those of today against them. >=20 >> and I'm furthermore >> not sure of the old code was handling SMP scenarios safely (What if th= e >> thread to be unmapped was running on different CPU than xnshadow_unmap= ? >> How to ensure test-atomicity then?). >> >=20 > This wakeup call is there to release emerging shadows waiting on their > start barrier that get killed before start. Other regular cases imply > that either the current thread is exiting - in which case this is a no > brainer, or it has been sent a group kill signal, in which case it has > to wake up anyway. I didn't _removed_ cases where the wakeup takes place, I _added_ some more my removing the test. >=20 >> Well, this thing seems to work, >=20 > It does, actually. >=20 >> but it doesn't leave me with a good >> feeling. IMHO, there are far too many cleanup cases with all the >> combinations of non-shadow termination, shadow self-termination, >> termination on task exit, various SMP scenarios, etc. >> Moreover, quite a >> lot of stuff is executed in atomic scheduling hooks (namely on cleanup= ). >> At least for shadow threads, I don't think all of this is necessary. >> Actually, deferring most of the uncritical cleanup work into Linux >> context would make things more regular, easier to review, and likely >> less disturbing for overall system latencies. I know that atomic >> scheduling hooks are required to implement certain skins, but I don't >> see why janitor work should be done there as well. And we also still >> have that tiny uncleanness on shadow threads termination around TCB >> release vs. hook execution. >=20 > Talking about xnshadow_unmap, and as your patch illustrates, what you > could postpone is basically the first part of the routine, which update= s > the reference counts. The rest _must_ run over the deletion hook in > atomic context (already runs fine all Linux debug switches on). The rpi_pop()? No chance to call it before entering the scheduler and then the hooks? > latency hit is certainly not seen there; the real issues are left in th= e > native and POSIX skin thread deletion routines (heap management and > other housekeeping calls under nklock and so on). That's surely true. Besides RPI manipulation, there is really no problematic code left here latency-wise. Other hooks are trickier. But my point is that we first need some infrastructure to provide an alternative cleanup context before we can start moving things. My patch today is a hack...err...workaround from this perspective. >=20 >> As we are already at it: LO_MAX_REQUESTS is a hidden limitation that >> may (theoretically) bite large setups when e.g. too many threads gets >> killed from the "wrong" context. At least some overflow detection is >> required here to warn the unfortunate user that he reached hard-coded >> scalability limits. >=20 > Yes. CONFIG_XENO_OPT_DEBUG[_NUCLEUS] or unconditionally (error return code)? Jan --------------enig5ACF6E0691ED1360EBDF78C0 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGUccOniDOoMHTA+kRAif9AJ909kujjNzF75UzBAqssZZQUlH8BQCdGik+ ZkKa7pWRRBjrNtIFxcdAA3w= =USQh -----END PGP SIGNATURE----- --------------enig5ACF6E0691ED1360EBDF78C0--