From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============9192889811289744662==" MIME-Version: 1.0 From: Andriy Gapon Subject: Re: [Devel] ref-counting races? Date: Thu, 21 Mar 2013 20:07:44 +0200 Message-ID: <514B4C70.5050305@FreeBSD.org> In-Reply-To: 94F2FBAB4432B54E8AACC7DFDE6C92E346BDDCF3@ORSMSX101.amr.corp.intel.com List-ID: To: devel@acpica.org --===============9192889811289744662== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable on 21/11/2012 17:18 Moore, Robert said the following: >> -----Original Message----- >> From: Andriy Gapon [mailto:avg(a)FreeBSD.org] [snip] >> AcpiUtUpdateRefCount handling of REF_DECREMENT && Count < 1 looks >> worrying. >> Not sure if that ever happens in practice, but since the code exists... >> The case is being treated as a minor event (judging from the debugging >> print), but we are already lucky if we see Count =3D=3D 0 there instead = of >> some garbage or just plain crashing because the memory is already freed. >> Nevertheless we discard even that luck by happily continuing into >> potentially a repeated call to AcpiUtDeleteInternalObj. > = > This code appears in order to essentially catch a case where an attempt i= s made to > delete an object twice. Admittedly, it doesn't serve much purpose these d= ays and > should probably be removed. In practice, we never, ever hit the (Count < = 1) case > that I know of. In other words, I don't know of any cases where the acpic= a code > ever attempts to delete an object twice. With a help from a skilled user we've been able to catch a likely culprit o= f the FreeBSD + ACPICA heisenbugs, which all showed some relation to the FreeBSD = battery driver. Probably unlike other OSes FreeBSD liberally uses AcpiGetObjectInfo in the = code that handles battery status queries from userland. What's more, FreeBSD al= lows such queries to run in parallel and, thus, AcpiGetObjectInfo may be executed concurrently by more than one thread. In turn, AcpiGetObjectInfo calls fun= ctions like AcpiUtExecute_HID, AcpiUtExecute_UID, etc. These functions are called without holding any lock and thus, for example, AcpiUtExecute_HID may run i= n parallel. The above mentioned functions all follow the same template: { Status =3D AcpiUtEvaluateObject (DeviceNode, METHOD_NAME__HID, ACPI_BTYPE_INTEGER | ACPI_BTYPE_STRING, &ObjDesc); ... do useful work ... Cleanup: /* On exit, we must delete the return object */ AcpiUtRemoveReference (ObjDesc); return_ACPI_STATUS (Status); } It can be clearly seen that AcpiUtRemoveReference is called without protect= ion of any lock. Neither of AcpiUtRemoveReference -> AcpiUtUpdateObjectReference = -> AcpiUtUpdateRefCount use any lock. So, in the above code path ReferenceCount is decremented without holding an= y lock. Given that the decrement is performed in "highly" non-atomic fashion, it is= open to races. E.g. let's assume that we have some like this in ASL: Name (_HID, EisaId ("PNP0C0E")) Creation of the _HID node would create a value object with reference count = of 1. Let's assume that thread T1 executes AcpiUtExecute_HID. Then AcpiUtEvaluateObject would increment the count to 2. Let's assume that thread T2 then concurrently executes AcpiUtExecute_HID. It is possible that T1 and T2 would then concurrently call AcpiUtUpdateRefC= ount: T1 with REF_DECREMENT (via AcpiUtRemoveReference) and T2 with REF_INCREMENT= (via AcpiUtEvaluateObject). Both would see the same reference count of 2 as a starting value. If T2 is quicker to write 3 into ReferenceCount, then T1 would overwrite it= with 1. Then, when T2 calls AcpiUtRemoveReference, it would decrement ReferenceCoun= t to zero and thus the value object would be freed while the node still has a po= inter to its address. Subsequent accesses to the _HID node would be accessing freed memory. This description is very consistent with the problem reports from users. Taking the most recent example: http://thread.gmane.org/gmane.os.freebsd.devel.acpi/7707 The panic was preceded by the following messages: ACPI Error: No object attached to node 0xfffffe00094a51c0 (20110527/exresnt= e-138) ACPI Error: Method execution failed [\_SB_.BAT0._UID] (Node 0xfffffe00094a5= 1c0), AE_AML_NO_OPERAND (20110527/uteval-113) The panic itself is consistent with corruption of ACPICA object cache: http://thread.gmane.org/gmane.os.freebsd.devel.acpi/7707/focus=3D7730 All the reports (some of which I mentioned at the start of this thread) fol= low the same pattern: a panic in object cache code, frequently preceded by various = error messages related to battery device _UID/_HID/etc (such as missing object or= wrong object type, etc). I see several possibilities of fixing this problem. >From worst to best: 1. Ensure that AcpiGetObjectInfo/AcpiUtExecute_HID/etc are never executed concurrently on the same object at the OS level code (primarily in the batt= ery driver). 2. Ensure that AcpiUtRemoveReference is always invoked under some lock. No= t sure if it should be the Interpreter mutex or the Namespace mutex. I believe th= at AcpiUtAddReference are always consistently called under a lock, but I haven= 't verified all the code paths. On the other hand, it is obvious that there a= re many places where AcpiUtRemoveReference is called without any protection, but it= is hard to conclusively enumerate _all_ of them. 3. Introduce locking to AcpiUtUpdateRefCount to ensure that ReferenceCount = is never modified concurrently. Essentially this should give an equivalence of atomicity for ReferenceCount manipulations. Additionally, it would be a good idea to make ReferenceCount =3D=3D 0 check= more strict (if not fatal). Also, unused and dangerous REF_FORCE_DELETE should = be removed. I like #3 the best because it provides the most guarantees while requiring = the least code inspection/verification efforts. Of course, you know the code much better, so your assessment of the problem= and possible solutions could be very different. Here is a quickly hacked together prototype patch that seems to have helped= the user: http://people.freebsd.org/~avg/acpi-ref-count-2.diff The panic calls are to demonstrate various dangerous places in the current = code. They were actually hit in testing before the locking was added. -- = Andriy Gapon --===============9192889811289744662==--