From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Siluvery, Arun"
Subject: possible struct_mutex race, gtt_space becomes invalid in execbuf under memory pressure
Date: Mon, 18 Nov 2013 15:51:30 +0000
Message-ID: <1384789888.6058.5.camel@asiluver-linux.isw.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by gabe.freedesktop.org (Postfix) with ESMTP id ABA2BFB117 for ; Mon, 18 Nov 2013 07:52:09 -0800 (PST)
Content-Language: en-US
Content-ID: <207279AAA8EB5740842BCF649EB4BADA@intel.com>
Sender: intel-gfx-bounces@lists.freedesktop.org
Errors-To: intel-gfx-bounces@lists.freedesktop.org
To: "intel-gfx@lists.freedesktop.org"
List-Id: intel-gfx@lists.freedesktop.org

Hi All,

I am running a repetitive test on HSW with the maximum available RAM limited to 1GB (max TOLUD is 1GB), and it fails with a NULL pointer dereference in the execbuf ioctl. Debugging showed that batch_obj->gtt_space, which was valid, becomes NULL before the batch is dispatched.

To track this down, I stored the batch_obj->gtt_space address in execbuf and compared it against every object freed in i915_gem_object_unbind(). The matching unbind is triggered by i915_gem_fault(). The object can be freed because it is not yet pinned at that point. As a workaround I artificially incremented the pin_count of this bo to see if it helps, but now I am seeing a kernel panic with "general protection fault".

It is not clear to me how i915_gem_fault() is able to acquire struct_mutex while it is held by the execbuf ioctl. The mutex is released if relocation is done via the slow path, but that is not the case here.

There are page allocation failures of different orders during the test. As the system runs out of memory, the low memory killer starts killing processes to free up space, and i915_gem_evict_everything() is also called to free space. The system recovers from this, but the test still fails randomly.

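The address watch described above (record batch_obj->gtt_space in execbuf, then compare on every unbind) can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions, not the actual debug patch: struct model_obj, model_watch() and model_unbind() are hypothetical names, and the pin_count check reflects the assumption that the unbind path refuses a pinned object.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's GEM object; only the two
 * fields discussed in this mail are modelled. */
struct model_obj {
    void *gtt_space;   /* non-NULL while the object is bound */
    int pin_count;     /* > 0 means the object must not be unbound */
};

/* Address recorded at "execbuf" time so a later unbind can be matched. */
static const void *watched_gtt_space;
static int watch_hits;

static void model_watch(const struct model_obj *obj)
{
    assert(obj != NULL);
    watched_gtt_space = obj->gtt_space;
    watch_hits = 0;
}

/* Model of an unbind path: refuse pinned objects (assumption), otherwise
 * drop the binding and report when it was the watched one. */
static int model_unbind(struct model_obj *obj, const char *caller)
{
    if (obj->gtt_space == NULL)
        return 0;               /* nothing bound, nothing to do */
    if (obj->pin_count > 0)
        return -EBUSY;          /* pinned: binding must survive */
    if (obj->gtt_space == watched_gtt_space) {
        watch_hits++;
        fprintf(stderr, "watched gtt_space %p unbound via %s\n",
                obj->gtt_space, caller);
    }
    obj->gtt_space = NULL;
    return 0;
}
```

In the model, calling model_watch() on the batch object and then model_unbind() with pin_count at zero clears gtt_space and records a hit, which is the window in question. In the kernel itself, a WARN_ON() at the match point would additionally print the call trace of whoever reached the unbind.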
It looks like there is some path where struct_mutex is released during execbuf, and the fault handler is then able to free the valid bo because of memory pressure. Is it possible for this to happen? I would really appreciate any suggestions on how to debug this further.

regards
Arun