Subject: Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
To: Takashi Iwai
Cc: tiwai@suse.com, alsa-devel@alsa-project.org, linux-kernel@vger.kernel.org
References: <20211102033222.3849-1-qiang.zhang1211@gmail.com> <2d05ceab-b8b7-0c7b-f847-69950c6db14e@gmail.com> <47f05b3a-811b-e64c-0366-3aebaece6c8e@gmail.com>
From: Zqiang
Date: Wed, 3 Nov 2021 13:41:47 +0800
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On 2021/11/2 8:27 PM, Takashi Iwai wrote:
> On Tue, 02 Nov 2021 12:20:32 +0100,
> Zqiang wrote:
>>
>> On 2021/11/2 6:31 PM, Takashi Iwai wrote:
>>> On Tue, 02 Nov 2021 10:41:57 +0100,
>>> Zqiang wrote:
>>>> On 2021/11/2 4:33 PM, Takashi Iwai wrote:
>>>>> On Tue, 02 Nov 2021 04:32:22 +0100,
>>>>> Zqiang wrote:
>>>>>> If we have a lot of cell objects, this cycle
>>>>>> may take a long time and trigger an RCU stall. Insert a
>>>>>> conditional reschedule point to fix it.
>>>>>>
>>>>>> rcu: INFO: rcu_preempt self-detected stall on CPU
>>>>>> rcu:   1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
>>>>>>        softirq=16474/16475 fqs=4916
>>>>>>  (t=10500 jiffies g=19249 q=192515)
>>>>>> NMI backtrace for cpu 1
>>>>>> ......
>>>>>> asm_sysvec_apic_timer_interrupt
>>>>>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
>>>>>> spin_unlock_irqrestore
>>>>>> snd_seq_prioq_cell_out+0x1dc/0x360
>>>>>> snd_seq_check_queue+0x1a6/0x3f0
>>>>>> snd_seq_enqueue_event+0x1ed/0x3e0
>>>>>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
>>>>>> snd_seq_write+0x2db/0x510
>>>>>> vfs_write+0x1c4/0x900
>>>>>> ksys_write+0x171/0x1d0
>>>>>> do_syscall_64+0x35/0xb0
>>>>>>
>>>>>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
>>>>>> Signed-off-by: Zqiang
>>>>>> ---
>>>>>>  sound/core/seq/seq_queue.c | 2 ++
>>>>>>  1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
>>>>>> index d6c02dea976c..f5b1e4562a64 100644
>>>>>> --- a/sound/core/seq/seq_queue.c
>>>>>> +++ b/sound/core/seq/seq_queue.c
>>>>>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>>>  		if (!cell)
>>>>>>  			break;
>>>>>>  		snd_seq_dispatch_event(cell, atomic, hop);
>>>>>> +		cond_resched();
>>>>>>  	}
>>>>>>
>>>>>>  	/* Process time queue... */
>>>>>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>>>  		if (!cell)
>>>>>>  			break;
>>>>>>  		snd_seq_dispatch_event(cell, atomic, hop);
>>>>>> +		cond_resched();
>>>>>
>>>>> It's good to have cond_resched() in those places, but it must be done
>>>>> more carefully, as the code path may be called from the non-atomic
>>>>> context, too. That is, it must have a check of the atomic argument,
>>>>> and cond_resched() is applied only when atomic==false.
>>>>>
>>>>> But I still wonder how this gets an RCU stall all of a sudden.
>>>>> Looking through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
>>>>> it's triggered by many cases since the end of September...
>>>>
>>>> I did not find useful information in the log. From the call trace, I
>>>> guess it may be triggered by the long cycle time, which caused the
>>>> quiescent state of the RCU not to be reported in time.
>>>
>>> Yes, I understand that logic. But I wonder why this gets triggered
>>> *now* all of a sudden. The code has been present for decades, and I
>>> doubt that a similar test case hasn't been performed by the fuzzer
>>> before.
>>>
>>>> I ignored the atomic parameter check; I will resend a v2. In the
>>>> non-atomic context, we can insert cond_resched() to avoid this
>>>> situation, but in the atomic context, the RCU stall may still
>>>> trigger.
>>>
>>> Right, so maybe it's better to have an upper limit for the processed
>>> cells, something like below (totally untested).
>>>
>>> Could you reproduce the problem locally? Otherwise it's all nothing
>>> but a guess...
>>
>> Yes, this is just a guess. I haven't reproduced it locally. Limiting
>> the number of cycles is a suitable modification, but the
>> MAX_CELL_PROCESSES_IN_QUEUE is an empirical value.
>
> Yes, that's why we need the reproducer anyway before moving forward.
> The problem is that the patch looks as if it were fixing the RCU
> stall, but we haven't verified at all that it is really the cause.
> We haven't even checked whether it's really too many queued cells, or
> just the concurrent queuing making the function re-run.

Thanks for your explanation. I think we can send your changes out and
wait for syzbot to test them.

Thanks,
Zqiang

>
> Takashi