From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9139EC433FE
	for <linux-kernel@archiver.kernel.org>; Tue,  2 Nov 2021 12:27:42 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 790FF60EDF
	for <linux-kernel@archiver.kernel.org>; Tue,  2 Nov 2021 12:27:42 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S231283AbhKBMaQ (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Tue, 2 Nov 2021 08:30:16 -0400
Received: from smtp-out2.suse.de ([195.135.220.29]:37694 "EHLO
        smtp-out2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S230349AbhKBMaM (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 2 Nov 2021 08:30:12 -0400
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
        by smtp-out2.suse.de (Postfix) with ESMTP id 6F8FA1FD4C;
        Tue,  2 Nov 2021 12:27:37 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
        t=1635856057; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
         mime-version:mime-version:content-type:content-type:
         content-transfer-encoding:content-transfer-encoding:
         in-reply-to:in-reply-to:references:references;
        bh=6nE7xU/FFP5yqVCM/ob/RuNgtbC6lpn41P9w84srlH8=;
        b=0pdHwkfXf5YZ4StFJg6TqAi1CeKQGLIQU2mBgHjm704zNZbIZptPz9stTyorKFWoXDUQ2j
        lpriAimphFUlhpu2eF5oFIfb9QZhn3bc8UCcp/fz34mEDh+dILuys/73Q07RnAWUt8pKct
        I3LVVqDoTwRowqR/1QCU2ce3XEKrMeI=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
        s=susede2_ed25519; t=1635856057;
        h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
         mime-version:mime-version:content-type:content-type:
         content-transfer-encoding:content-transfer-encoding:
         in-reply-to:in-reply-to:references:references;
        bh=6nE7xU/FFP5yqVCM/ob/RuNgtbC6lpn41P9w84srlH8=;
        b=yhJicjYRrkKHRgsrsDY+4ApWmkaVIvtAaTstfy1FR3JmHxZXzBfr6T3rPWmbEXyzEM2+ka
        ClFe9L1hdKK/RrDg==
Received: from alsa1.suse.de (alsa1.suse.de [10.160.4.42])
        by relay2.suse.de (Postfix) with ESMTP id 59436A3B88;
        Tue,  2 Nov 2021 12:27:37 +0000 (UTC)
Date:   Tue, 02 Nov 2021 13:27:37 +0100
Message-ID: <s5hcznipwsm.wl-tiwai@suse.de>
From:   Takashi Iwai <tiwai@suse.de>
To:     Zqiang <qiang.zhang1211@gmail.com>
Cc:     tiwai@suse.com, alsa-devel@alsa-project.org,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
In-Reply-To: <47f05b3a-811b-e64c-0366-3aebaece6c8e@gmail.com>
References: <20211102033222.3849-1-qiang.zhang1211@gmail.com>
        <s5hy267ot27.wl-tiwai@suse.de>
        <2d05ceab-b8b7-0c7b-f847-69950c6db14e@gmail.com>
        <s5ho872q25p.wl-tiwai@suse.de>
        <47f05b3a-811b-e64c-0366-3aebaece6c8e@gmail.com>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI/1.14.6 (Maruoka)
 FLIM/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL/10.8 Emacs/25.3
 (x86_64-suse-linux-gnu) MULE/6.0 (HANACHIRUSATO)
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 02 Nov 2021 12:20:32 +0100,
Zqiang wrote:
> 
> 
> On 2021/11/2 下午6:31, Takashi Iwai wrote:
> > On Tue, 02 Nov 2021 10:41:57 +0100,
> > Zqiang wrote:
> >>
> >> On 2021/11/2 下午4:33, Takashi Iwai wrote:
> >>> On Tue, 02 Nov 2021 04:32:22 +0100,
> >>> Zqiang wrote:
> >>>> If we have a lot of cell object, this cycle may take a long time, and
> >>>> trigger RCU stall. insert a conditional reschedule point to fix it.
> >>>>
> >>>> rcu: INFO: rcu_preempt self-detected stall on CPU
> >>>> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
> >>>> 	softirq=16474/16475 fqs=4916
> >>>> 	(t=10500 jiffies g=19249 q=192515)
> >>>> NMI backtrace for cpu 1
> >>>> ......
> >>>> asm_sysvec_apic_timer_interrupt
> >>>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
> >>>> spin_unlock_irqrestore
> >>>> snd_seq_prioq_cell_out+0x1dc/0x360
> >>>> snd_seq_check_queue+0x1a6/0x3f0
> >>>> snd_seq_enqueue_event+0x1ed/0x3e0
> >>>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
> >>>> snd_seq_write+0x2db/0x510
> >>>> vfs_write+0x1c4/0x900
> >>>> ksys_write+0x171/0x1d0
> >>>> do_syscall_64+0x35/0xb0
> >>>>
> >>>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
> >>>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> >>>> ---
> >>>>    sound/core/seq/seq_queue.c | 2 ++
> >>>>    1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
> >>>> index d6c02dea976c..f5b1e4562a64 100644
> >>>> --- a/sound/core/seq/seq_queue.c
> >>>> +++ b/sound/core/seq/seq_queue.c
> >>>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
> >>>>    		if (!cell)
> >>>>    			break;
> >>>>    		snd_seq_dispatch_event(cell, atomic, hop);
> >>>> +		cond_resched();
> >>>>    	}
> >>>>      	/* Process time queue... */
> >>>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
> >>>>    		if (!cell)
> >>>>    			break;
> >>>>    		snd_seq_dispatch_event(cell, atomic, hop);
> >>>> +		cond_resched();
> >>> It's good to have cond_resched() in those places but it must be done
> >>> more carefully, as the code path may be called from the non-atomic
> >>> context, too.  That is, it must have a check of atomic argument, and
> >>> cond_resched() is applied only when atomic==false.
> >>>
> >>> But I still wonder how this gets a RCU stall out of sudden.  Looking
> >>> through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
> >>> it's triggered by many cases since the end of September...
> >> I did not find useful information from the log,  through calltrace, I
> >> guess it may be triggered by the long cycle time, which caused the
> >> static state of the RCU to
> >>
> >> not be reported in time.
> > Yes, I understand that logic.  But I wonder why this gets triggered
> > *now* out of sudden.  The code has been present over decades, and I
> > don't think the similar test case must have been performed by fuzzer.
> >
> >> I ignore the atomic parameter check,  I will resend v2 .   in
> >> no-atomic context, we can insert
> >>
> >> cond_resched() to avoid this situation, but in atomic context,
> >>
> >> the RCU stall maybe still trigger.
> > Right, so maybe it's better to have an upper limit for the processed
> > cells, something like below (totally untested).
> >
> > Could you reproduce the problem locally?  Otherwise it's all nothing
> > but a guess...
> 
> yes, this is just a guess.  I haven't reproduced locally, limiting the
> number of cycles is a suitable modification,
> 
> but the  MAX_CELL_PROCESSES_IN_QUEUE is an experience value.

Yes, that's why we need the reproducer in anyway before moving
forward.  The problem is that the patch looks as if it were fixing the
RCU stall, but we haven't verified it at all that it is really the
cause.  Even we haven't checked whether it's really the too many cells
queued, or just because the concurrent queuing made the function
re-running.


Takashi