public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	Matthew Auld <matthew.auld@intel.com>,
	intel-gfx@lists.freedesktop.org
Cc: "Christian König" <christian.koenig@amd.com>,
	dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH v9 2/8] drm/i915/ttm: add tt shmem backend
Date: Wed, 8 Dec 2021 10:32:42 +0100	[thread overview]
Message-ID: <e7e5035c-980b-6ca9-d119-b6f3810d8e13@linux.intel.com> (raw)
In-Reply-To: <ca5587dc-e3cc-c5b4-6034-127d20a3677b@linux.intel.com>


On 12/8/21 10:24, Tvrtko Ursulin wrote:
>
> On 08/12/2021 08:39, Thomas Hellström wrote:
>>
>> On 12/8/21 09:30, Tvrtko Ursulin wrote:
>>
>>
>> ...
>>
>>
>>>>> Apart from the code organisation questions, on the practical level 
>>>>> - do you need writeback from the TTM backend or while I am 
>>>>> proposing to remove it from the "legacy" paths, I can propose 
>>>>> removing it from the TTM flow as well?
>>>>
>>>> Yeah, if that is somehow busted then we should remove from TTM 
>>>> backend also.
>>>
>>> Okay thanks, I wanted to check in case there was an extra need in 
>>> TTM. I will float a patch soon hopefully but testing will be a 
>>> problem since it seems very hard to repro at the moment.
>>
>> Do we have some information about what's causing the deadlock or a 
>> signature? I'm asking because if some sort of shrinker was added to TTM 
>
> Yes, signature is hung task detector kicking in and pretty much system 
> standstill ensues. You will find a bunch of tasks stuck like this:
>
> [  247.165578] chrome          D    0  1791   1785 0x00004080
> [  247.171732] Call Trace:
> [  247.174492]  __schedule+0x57e/0x10d2
> [  247.178514]  ? pcpu_alloc_area+0x25d/0x273
> [  247.183115]  schedule+0x7c/0x9f
> [  247.186647]  rwsem_down_write_slowpath+0x2ae/0x494
> [  247.192025]  register_shrinker_prepared+0x19/0x48
> [  247.197310]  ? test_single_super+0x10/0x10
> [  247.201910]  sget_fc+0x1fc/0x20e
> [  247.205540]  ? kill_litter_super+0x40/0x40
> [  247.210139]  ? proc_apply_options+0x42/0x42
> [  247.214838]  vfs_get_super+0x3a/0xdf
> [  247.218855]  vfs_get_tree+0x2b/0xc3
> [  247.222911]  fc_mount+0x12/0x39
> [  247.226814]  pid_ns_prepare_proc+0x9d/0xc5
> [  247.232468]  alloc_pid+0x275/0x289
> [  247.236432]  copy_process+0x5e5/0xeea
> [  247.240640]  _do_fork+0x95/0x303
> [  247.244261]  __se_sys_clone+0x65/0x7f
> [  247.248366]  do_syscall_64+0x54/0x7e
> [  247.252365]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Or this:
>
> [  247.030274] minijail-init   D    0  1773   1770 0x80004082
> [  247.036419] Call Trace:
> [  247.039167]  __schedule+0x57e/0x10d2
> [  247.043175]  ? __schedule+0x586/0x10d2
> [  247.047381]  ? _raw_spin_unlock+0xe/0x20
> [  247.051779]  ? __queue_work+0x316/0x371
> [  247.056079]  schedule+0x7c/0x9f
> [  247.059602]  rwsem_down_write_slowpath+0x2ae/0x494
> [  247.064971]  unregister_shrinker+0x20/0x65
> [  247.069562]  deactivate_locked_super+0x38/0x88
> [  247.074538]  cleanup_mnt+0xcc/0x10e
> [  247.078447]  task_work_run+0x7d/0xa6
> [  247.082459]  do_exit+0x23d/0x831
> [  247.086079]  ? syscall_trace_enter+0x207/0x20e
> [  247.091055]  do_group_exit+0x8d/0x9d
> [  247.095062]  __x64_sys_exit_group+0x17/0x17
> [  247.099750]  do_syscall_64+0x54/0x7e
> [  247.103758]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> And smoking gun is:
>
> [  247.383338] CPU: 3 PID: 88 Comm: kswapd0 Tainted: G U            
> 5.4.154 #36
> [  247.383338] Hardware name: Google Delbin/Delbin, BIOS 
> Google_Delbin.13672.57.0 02/09/2021
> [  247.383339] RIP: 0010:__rcu_read_lock+0x0/0x1a
> [  247.383339] Code: ff ff 0f 0b e9 61 fe ff ff 0f 0b e9 76 fe ff ff 
> 0f 0b 49 8b 44 24 20 e9 59 ff ff ff 0f 0b e9 67 ff ff ff 0f 0b e9 1b 
> ff ff ff <0f> 1f 44 00 00 55 48 89 e5 65 48 8b 04 25 80 5d 01 00 ff 80 
> f8 03
> [  247.383340] RSP: 0018:ffffb0aa0031b978 EFLAGS: 00000286
> [  247.383340] RAX: 0000000000000000 RBX: fffff6b944ca8040 RCX: 
> fffff6b944ca8001
> [  247.383341] RDX: 0000000000000028 RSI: 0000000000000001 RDI: 
> ffff8b52bc618c18
> [  247.383341] RBP: ffffb0aa0031b9d0 R08: 0000000000000000 R09: 
> ffff8b52fb5f00d8
> [  247.383341] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [  247.383342] R13: 61c8864680b583eb R14: 0000000000000001 R15: 
> ffffb0aa0031b980
> [  247.383342] FS:  0000000000000000(0000) GS:ffff8b52fbf80000(0000) 
> knlGS:0000000000000000
> [  247.383343] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  247.383343] CR2: 00007c78a400d680 CR3: 0000000120f46006 CR4: 
> 0000000000762ee0
> [  247.383344] PKRU: 55555554
> [  247.383344] Call Trace:
> [  247.383345]  find_get_entry+0x4c/0x116
> [  247.383345]  find_lock_entry+0xc8/0xec
> [  247.383346]  shmem_writeback+0x7b/0x163
> [  247.383346]  i915_gem_shrink+0x302/0x40b
> [  247.383347]  ? __intel_runtime_pm_get+0x22/0x82
> [  247.383347]  i915_gem_shrinker_scan+0x86/0xa8
> [  247.383348]  shrink_slab+0x272/0x48b
> [  247.383348]  shrink_node+0x784/0xbea
> [  247.383348]  ? rcu_read_unlock_special+0x66/0x15f
> [  247.383349]  ? update_batch_size+0x78/0x78
> [  247.383349]  kswapd+0x75c/0xa56
> [  247.383350]  kthread+0x147/0x156
> [  247.383350]  ? kswapd_run+0xb6/0xb6
> [  247.383351]  ? kthread_blkcg+0x2e/0x2e
> [  247.383351]  ret_from_fork+0x1f/0x40
>
> Sadly getting logs or repro from 5.16-rc is more difficult due other 
> issues, or altogether gone, which is also a possibility. It is also 
> possible that transparent hugepages either enable the hang, or just 
> make it more likely.
>
> However due history of writeback I think it sounds plausible that it 
> is indeed unsafe. I will try to dig out a reply from Hugh Dickins who 
> advised against doing it and I think that advice did not change, or I 
> failed to find a later thread. There is at least a mention of that 
> discussion in the patch which added writeback.
>
>> itself, for the TTM page vectors, it would need to allocate shmem 
>> pages at shrink time rather than to unpin them at shrink time as we 
>> do here. And for that to have any chance of working sort of reliably, 
>> I think writeback is needed.
>
> I don't claim to understand it fully, but won't the system take care 
> of that, with the only difference being that allocation might work a 
> little less reliably under extreme memory pressure? 

Yes, IIRC it's exactly this that made the previous attempt of a generic 
TTM shrinker fail.


> And I did not find other drivers use it at least which may be and 
> indication we should indeed steer clear of it.

You're probably right.

>
> Regards,
>
> Tvrtko

Thanks,

Thomas




>
>> But I agree for this implementation, the need for writeback isn't 
>> different than for the non-TTM shmem objects> Thanks,
>>
>> Thomas
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>

  reply	other threads:[~2021-12-08  9:32 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-18  9:10 [Intel-gfx] [PATCH v9 1/8] drm/i915/gem: Break out some shmem backend utils Matthew Auld
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 2/8] drm/i915/ttm: add tt shmem backend Matthew Auld
2021-12-07 13:10   ` Tvrtko Ursulin
2021-12-07 14:05     ` Matthew Auld
2021-12-08  8:30       ` Tvrtko Ursulin
2021-12-08  8:39         ` Thomas Hellström
2021-12-08  9:24           ` Tvrtko Ursulin
2021-12-08  9:32             ` Thomas Hellström [this message]
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 3/8] drm/i915/gtt: drop unneeded make_unshrinkable Matthew Auld
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 4/8] drm/i915: drop unneeded make_unshrinkable in free_object Matthew Auld
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 5/8] drm/i915: add some kernel-doc for shrink_pin and friends Matthew Auld
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 6/8] drm/i915/ttm: move shrinker management into adjust_lru Matthew Auld
2021-10-20 14:32   ` Thomas Hellström
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 7/8] drm/i915/ttm: use cached system pages when evicting lmem Matthew Auld
2021-10-18  9:10 ` [Intel-gfx] [PATCH v9 8/8] drm/i915/ttm: enable shmem tt backend Matthew Auld
2021-10-18 11:57 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [v9,1/8] drm/i915/gem: Break out some shmem backend utils Patchwork
2021-10-18 11:58 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-10-18 12:29 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-10-18 15:48 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e7e5035c-980b-6ca9-d119-b6f3810d8e13@linux.intel.com \
    --to=thomas.hellstrom@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox