Re: [PATCH] drm/amdgpu: fix stall on CPU when allocate large system memory


From: James Zhu <jamesz@amd.com>
To: Felix Kuehling <felix.kuehling@amd.com>,
	James Zhu <James.Zhu@amd.com>,
	amd-gfx@lists.freedesktop.org, christian.koenig@amd.com,
	philip.yang@amd.com
Subject: Re: [PATCH] drm/amdgpu: fix stall on CPU when allocate large system memory
Date: Fri, 18 Nov 2022 07:22:51 -0500
Message-ID: <a8e115cd-acba-29a1-1020-4bb653d03bbc@amd.com>
In-Reply-To: <59748ca8-9dac-e983-95a8-5b5b4c0a3946@amd.com>


On 2022-11-17 17:03, Felix Kuehling wrote:
> On 2022-11-17 at 16:38, James Zhu wrote:
>> When applications try to allocate large system memory (more than 128GB),
>> an RCU "stall on CPU" warning is reported.
>>
>> For such large system memory, walk_page_range usually takes more than
>> 20s.
>> The warning can be avoided by splitting the hmm range into smaller
>> ranges of no more than 64GB for each walk_page_range call.
>>
>> [  164.437617] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1753: 
>> amdgpu: create BO VA 0x7f63c7a00000 size 0x2f16000000 domain CPU
>> [  164.488847] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1785: 
>> amdgpu: creating userptr BO for user_addr = 7f63c7a00000
>> [  185.439116] rcu: INFO: rcu_sched self-detected stall on CPU
>> [  185.439125] rcu:     8-....: (20999 ticks this GP) 
>> idle=e22/1/0x4000000000000000 softirq=2242/2242 fqs=5249
>> [  185.439137]     (t=21000 jiffies g=6325 q=1215)
>> [  185.439141] NMI backtrace for cpu 8
>> [  185.439143] CPU: 8 PID: 3470 Comm: kfdtest Kdump: loaded Tainted: 
>> G           O 5.12.0-0_fbk5_zion_rc1_5697_g2c723fb88626 #1
>> [  185.439147] Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant 
>> XL675d Gen10 Plus, BIOS A47 11/06/2020
>> [  185.439150] Call Trace:
>> [  185.439153]  <IRQ>
>> [  185.439157]  dump_stack+0x64/0x7c
>> [  185.439163]  nmi_cpu_backtrace.cold.7+0x30/0x65
>> [  185.439165]  ? lapic_can_unplug_cpu+0x70/0x70
>> [  185.439170]  nmi_trigger_cpumask_backtrace+0xf9/0x100
>> [  185.439174]  rcu_dump_cpu_stacks+0xc5/0xf5
>> [  185.439178]  rcu_sched_clock_irq.cold.97+0x112/0x38c
>> [  185.439182]  ? tick_sched_handle.isra.21+0x50/0x50
>> [  185.439185]  update_process_times+0x8c/0xc0
>> [  185.439189]  tick_sched_timer+0x63/0x70
>> [  185.439192]  __hrtimer_run_queues+0xff/0x250
>> [  185.439195]  hrtimer_interrupt+0xf4/0x200
>> [  185.439199]  __sysvec_apic_timer_interrupt+0x51/0xd0
>> [  185.439201]  sysvec_apic_timer_interrupt+0x69/0x90
>> [  185.439206]  </IRQ>
>> [  185.439207]  asm_sysvec_apic_timer_interrupt+0x12/0x20
>> [  185.439211] RIP: 0010:clear_page_rep+0x7/0x10
>> [  185.439214] Code: e8 fe 7c 51 00 44 89 e2 48 89 ee 48 89 df e8 60 
>> ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 
>> 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 
>> 84 00 00
>> [  185.439218] RSP: 0018:ffffc9000f58f818 EFLAGS: 00000246
>> [  185.439220] RAX: 0000000000000000 RBX: 0000000000000881 RCX: 
>> 000000000000005c
>> [  185.439223] RDX: 0000000000100dca RSI: 0000000000000000 RDI: 
>> ffff88a59e0e5d20
>> [  185.439225] RBP: ffffea0096783940 R08: ffff888118c35280 R09: 
>> ffffea0096783940
>> [  185.439227] R10: ffff888000000000 R11: 0000160000000000 R12: 
>> ffffea0096783980
>> [  185.439228] R13: ffffea0096783940 R14: ffff88b07fdfdd00 R15: 
>> 0000000000000000
>> [  185.439232]  prep_new_page+0x81/0xc0
>> [  185.439236]  get_page_from_freelist+0x13be/0x16f0
>> [  185.439240]  ? release_pages+0x16a/0x4a0
>> [  185.439244]  __alloc_pages_nodemask+0x1ae/0x340
>> [  185.439247]  alloc_pages_vma+0x74/0x1e0
>> [  185.439251]  __handle_mm_fault+0xafe/0x1360
>> [  185.439255]  handle_mm_fault+0xc3/0x280
>> [  185.439257]  hmm_vma_fault.isra.22+0x49/0x90
>> [  185.439261]  __walk_page_range+0x692/0x9b0
>> [  185.439265]  walk_page_range+0x9b/0x120
>> [  185.439269]  hmm_range_fault+0x4f/0x90
>> [  185.439274]  amdgpu_hmm_range_get_pages+0x24f/0x260 [amdgpu]
>> [  185.439463]  amdgpu_ttm_tt_get_user_pages+0xc2/0x190 [amdgpu]
>> [  185.439603] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x49f/0x7a0 
>> [amdgpu]
>> [  185.439774]  kfd_ioctl_alloc_memory_of_gpu+0xfb/0x410 [amdgpu]
>>
>> Signed-off-by: James Zhu <James.Zhu@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 47 +++++++++++++++++--------
>>   1 file changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> index a48ea62b12b0..0425fc6a49aa 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
>> @@ -163,6 +163,7 @@ int amdgpu_hmm_range_get_pages(struct 
>> mmu_interval_notifier *notifier,
>>                      struct hmm_range **phmm_range)
>>   {
>>       struct hmm_range *hmm_range;
>> +    unsigned long hmm_range_end;
>>       unsigned long timeout;
>>       unsigned long i;
>>       unsigned long *pfns;
>> @@ -184,25 +185,41 @@ int amdgpu_hmm_range_get_pages(struct 
>> mmu_interval_notifier *notifier,
>>           hmm_range->default_flags |= HMM_PFN_REQ_WRITE;
>>       hmm_range->hmm_pfns = pfns;
>>       hmm_range->start = start;
>> -    hmm_range->end = start + npages * PAGE_SIZE;
>> +    hmm_range_end = start + npages * PAGE_SIZE;
>
> This variable name is too easy to confuse with hmm_range->end. I would 
> suggest calling it "end", analogous to "start".
[JZ] Sure
>
>
>>       hmm_range->dev_private_owner = owner;
>>   -    /* Assuming 512MB takes maxmium 1 second to fault page address */
>> -    timeout = max(npages >> 17, 1ULL) * HMM_RANGE_DEFAULT_TIMEOUT;
>> -    timeout = jiffies + msecs_to_jiffies(timeout);
>> +#define MAX_WALK_BYTE    (64ULL<<30)
>> +    do {
>> +        hmm_range->end = min(hmm_range->start + MAX_WALK_BYTE, 
>> hmm_range_end);
>> +
>> +        pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
>> +            hmm_range->start, hmm_range->end);
>> +
>> +        /* Assuming 512MB takes maxmium 1 second to fault page 
>> address */
>> +        timeout = max((hmm_range->end - hmm_range->start) >> 29, 
>> 1ULL) *
>> +            HMM_RANGE_DEFAULT_TIMEOUT;
>> +        timeout = jiffies + msecs_to_jiffies(timeout);
>>     retry:
>> -    hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>> -    r = hmm_range_fault(hmm_range);
>> -    if (unlikely(r)) {
>> -        /*
>> -         * FIXME: This timeout should encompass the retry from
>> -         * mmu_interval_read_retry() as well.
>> -         */
>> -        if (r == -EBUSY && !time_after(jiffies, timeout))
>> -            goto retry;
>> -        goto out_free_pfns;
>> -    }
>> +        hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>> +        r = hmm_range_fault(hmm_range);
>> +        if (unlikely(r)) {
>> +            /*
>> +             * FIXME: This timeout should encompass the retry from
>> +             * mmu_interval_read_retry() as well.
>> +             */
>> +            if (r == -EBUSY && !time_after(jiffies, timeout))
>> +                goto retry;
>> +            goto out_free_pfns;
>> +        }
>> +
>> +        hmm_range->hmm_pfns += MAX_WALK_BYTE >> PAGE_SHIFT;
>> +        hmm_range->start = hmm_range->end;
>> +        schedule();
>
> We don't need to schedule for ranges no larger than MAX_WALK_BYTE. I
> guess these three lines could be made conditional by breaking out first:
>
> +        if (hmm_range->end == end)
> +            break;
>
>         hmm_range->hmm_pfns += MAX_WALK_BYTE >> PAGE_SHIFT;
>         hmm_range->start = hmm_range->end;
>         schedule();
[JZ] Yes, it will be more accurate. I will update it. Thanks!
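
For reference, the loop shape with the early break can be modelled in
user space. This is only a sketch under stated assumptions: chunked_walk()
and struct walk_stats are hypothetical names, the walk and the schedule()
call are replaced by counters, and no kernel API is involved. Only
MAX_WALK_BYTE and the 512MB (>> 29) timeout unit are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WALK_BYTE      (64ULL << 30) /* 64GB per walk, as in the patch */
#define TIMEOUT_UNIT_SHIFT 29            /* 512MB units, as in the patch */

/* Hypothetical bookkeeping for the model; not part of the driver. */
struct walk_stats {
	unsigned int walks;      /* stand-in for hmm_range_fault() calls */
	unsigned int yields;     /* stand-in for schedule() calls */
	uint64_t timeout_units;  /* sum of per-chunk timeout multipliers */
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Walk [start, end) in chunks of at most MAX_WALK_BYTE bytes. */
static struct walk_stats chunked_walk(uint64_t start, uint64_t end)
{
	struct walk_stats st = { 0, 0, 0 };
	uint64_t chunk_start = start;
	uint64_t chunk_end;

	assert(start < end);

	for (;;) {
		chunk_end = min_u64(chunk_start + MAX_WALK_BYTE, end);

		/* Per-chunk timeout multiplier: one unit per 512MB, minimum 1. */
		st.timeout_units += max_u64((chunk_end - chunk_start)
					    >> TIMEOUT_UNIT_SHIFT, 1);
		st.walks++;

		/* Last chunk: stop before advancing or yielding. */
		if (chunk_end == end)
			break;

		chunk_start = chunk_end;
		st.yields++;
	}

	return st;
}
```

In this model, the allocation from the log above (size 0x2f16000000,
about 188GB) is covered by 3 walks with 2 yields in between, while any
range of 64GB or less takes a single walk and never yields.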
>
> Regards,
>   Felix
>
>
>> +    } while (hmm_range->end < hmm_range_end);
>> +
>> +    hmm_range->start = start;
>> +    hmm_range->hmm_pfns = pfns;
>>         /*
>>        * Due to default_flags, all pages are HMM_PFN_VALID or

Thread overview: 9+ messages
2022-11-17 21:38 [PATCH] drm/amdgpu: fix stall on CPU when allocate large system memory James Zhu
2022-11-17 22:03 ` Felix Kuehling
2022-11-18 12:22   ` James Zhu [this message]
2022-11-21 13:13 ` [PATCH v2] " James Zhu
2022-11-21 13:18   ` Christian König
2022-11-21 14:46     ` James Zhu
2022-11-21 14:53 ` [PATCH v3] " James Zhu
2022-11-22 15:12   ` James Zhu
2022-11-22 16:21   ` Felix Kuehling
