From: Philippe Gerum
To: Dave Rolenc
Cc: "xenomai@lists.linux.dev", Russell Johnson
Subject: Re: [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule()
Date: Wed, 31 May 2023 18:55:57 +0200
Message-ID: <877csoxvle.fsf@xenomai.org>
References: <87jzwpxa8l.fsf@xenomai.org>

Dave Rolenc writes:

>> What's even worse and the root cause of that issue is that no task on
>> the oob stage should ever run
>> rwsem_down_write_slowpath() in the first
>> place.
> [snip]
>
> Assuming you're correct that this code never runs out of band, that
> just means that the schedule call will never get skipped by the goto as
> the running_oob() always returns false. The important part I observed is
> that if that goto trylock_again gets hit the task never gets out of that
> loop. If we make it so the schedule() chunk never gets skipped, things
> work fine.
>
> Empirical evidence shows that never skipping that schedule() call makes
> the problem go away. My test scenario, which is way too involved to
> package up for you and involves custom hardware, will run a couple hours
> max before getting a stuck CPU. With the patch it ran over 4 days
> without issue. Assuming running_oob always returns false, the code
> should be roughly equivalent to commenting out the lines as I did in my
> first attempt. My first attempt at commenting out the lines also worked
> fine for over 24 hours. I wish I had more of a definitive answer to the
> other task involved, but the stack traces didn't really help there. Not
> having a fully working kernel debugger kind of limited what I could see,
> so I had to sample stack traces and figure out which code path it was
> taking by piecing together the information I had.
>
> I can definitely run your suggested test with the WARN_ONCE if you
> really want, but I don't think the cause is some oob context running
> this code. It was just my misunderstanding that this code could run in
> oob coupled with my desire to not delete those lines if at all possible.
>

Ok, got it. Adding running_oob(), which should always evaluate to false
there, only prevents the code from spinning, which papers over the
issue. Replacing running_oob() with dovetailing() would achieve the same
purpose. If so, my patch does not bring anything valuable.
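
To make that point concrete, here is a toy userspace model of the retry loop. It is not kernel code: model_slowpath(), struct loop_stats and all parameters are hypothetical names made up for illustration. It assumes the pathological case described above, where the retry-without-sleeping shortcut would be taken on every pass, and where the lock can only be released once the waiter has yielded the CPU a few times.

```c
#include <stdbool.h>

/* Hypothetical userspace model for illustration only, not kernel code. */
struct loop_stats {
    int attempts;   /* lock attempts made */
    int sleeps;     /* times the modelled schedule() ran */
    bool acquired;  /* whether the lock was eventually taken */
};

/*
 * Model of a write-lock slow path.
 *
 * Assumption: the lock becomes free only after the waiter has slept
 * `sleeps_needed` times, standing in for "another task must get the
 * CPU before the lock can be released".
 *
 * `take_shortcut` stands in for the condition guarding the
 * retry-without-sleeping branch. A guard that always evaluates to
 * false corresponds to take_shortcut == false.
 *
 * `max_attempts` bounds the loop so the model terminates even when
 * the real code would spin forever.
 */
static struct loop_stats model_slowpath(bool take_shortcut,
                                        int sleeps_needed,
                                        int max_attempts)
{
    struct loop_stats st = { 0, 0, false };

    for (;;) {
        st.attempts++;

        /* Modelled trylock: succeeds once we have yielded enough. */
        if (st.sleeps >= sleeps_needed) {
            st.acquired = true;
            break;
        }

        /* Give up: the real loop would keep spinning here. */
        if (st.attempts >= max_attempts)
            break;

        /* Shortcut: retry immediately, skipping the sleep. */
        if (take_shortcut)
            goto trylock_again;

        st.sleeps++;    /* modelled schedule() */
trylock_again:
        ;
    }

    return st;
}
```

With take_shortcut set to false, the model sleeps on every pass and acquires the lock after sleeps_needed + 1 attempts. With take_shortcut set to true, it never sleeps and exhausts max_attempts without acquiring the lock. In both cases the guard only decides whether the loop yields, it says nothing about why the lock holder needs the CPU in the first place.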
Besides, if the Dovetail debug is compiled in, a bad (oob) context
running strictly in-band code would most certainly have caused some
existing assertions to trigger anyway.

> Do you have any thoughts on how to proceed?
>

First thing would be to enable CONFIG_DEBUG_RWSEMS and re-run an
overnight test without any patch in, hoping for the native debug
infrastructure to give us some hint.

--
Philippe.