From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bootlin.com header.i=@bootlin.com header.b="kx7uiucP" Received: from relay8-d.mail.gandi.net (relay8-d.mail.gandi.net [217.70.183.201]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7FE931985; Thu, 30 Nov 2023 12:15:52 -0800 (PST) Received: by mail.gandi.net (Postfix) with ESMTPSA id 379661BF20B; Thu, 30 Nov 2023 20:15:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bootlin.com; s=gm1; t=1701375350; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=mIbIFfrn5l/35BhQp4MbDdbDxlcbSPBpb9WWg24uDPY=; b=kx7uiucPrUVYV+YkZUGY0OF7p0jdMHfxgJ4v75gTrlIn+8GniiSyfnWoFXjww2F9llLD8T ugapChnaDaqMxKvJbKINlumrAY7hGm0dlDLYcBkeZkB8d+Gaf2GYhFPadUbSCfdgvDn12g mJhNK9ge/tJNDFXjsA175ZeZwrm2LVtsEWakxitVwaGfNurHrkfY7jTXU2QllAHDbVWGWH ZhqHpHuNe8rLwMCD5hBISv3ivM9SRbmWFx+1B9+v0vW/D/5utQOQjjOzwUnqTQx7QaGHv0 kS7p8N3oqMJ2BtFqdCOafwgoA0C/Szdj+DsmfxrriEvVsO+u+p2oyCQcc/74MQ== Date: Thu, 30 Nov 2023 21:15:43 +0100 From: Miquel Raynal To: Richard Weinberger Cc: Ronald Wahl , Mark Brown , linux-spi , Thomas Petazzoni , Ryan Wanner , stable , Richard Weinberger , David Woodhouse Subject: Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any signal Message-ID: <20231130211543.2801a55b@xps-13> In-Reply-To: <1192504136.46091.1701368767836.JavaMail.zimbra@nod.at> References: <20231127095842.389631-1-miquel.raynal@bootlin.com> <0ce4c673-5c0b-4181-9d8b-53bcb0521f3e@raritan.com> <20231129094932.2639ca49@xps-13> <723263313.45007.1701348374765.JavaMail.zimbra@nod.at> <1192504136.46091.1701368767836.JavaMail.zimbra@nod.at> Organization: Bootlin X-Mailer: Claws Mail 4.0.0 (GTK+ 3.24.33; x86_64-pc-linux-gnu) Precedence: bulk X-Mailing-List: linux-spi@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-GND-Sasl: miquel.raynal@bootlin.com Hi Richard, richard@nod.at wrote on Thu, 30 Nov 2023 19:26:07 +0100 (CET): > ----- Urspr=C3=BCngliche Mail ----- > > Von: "richard" > > An: "Miquel Raynal" > > CC: "Ronald Wahl" , "Mark Brown" , "linux-spi" , > > "Thomas Petazzoni" , "Ryan Wanner" , "stable" > > , "Richard Weinberger" > > Gesendet: Donnerstag, 30. November 2023 13:46:14 > > Betreff: Re: [PATCH 1/2] spi: atmel: Do not cancel a transfer upon any = signal =20 >=20 > > ----- Urspr=C3=BCngliche Mail ----- =20 > >> Von: "Miquel Raynal" > >> + Richard, my dear jffs2 expert ;) =20 > >=20 > > :-S > > =20 > >>=20 > >> ronald.wahl@raritan.com wrote on Mon, 27 Nov 2023 18:54:40 +0100: > >> =20 > >>> On 27.11.23 16:10, Ronald Wahl wrote: =20 > >>> > On 27.11.23 10:58, Miquel Raynal wrote: =20 > >>> >> The intended move from wait_for_completion_*() to > >>> >> wait_for_completion_interruptible_*() was to allow (very) long spi= memor =20 > >>> y =20 > >>> >> transfers to be stopped upon user request instead of freezing the > >>> >> machine forever as the timeout value could now be significantly bi= gger. > >>> >> > >>> >> However, depending on the user logic, applications can receive many > >>> >> signals for their own "internal" purpose and have nothing to do wi= th the > >>> >> requested kernel operations, hence interrupting spi transfers upon= any > >>> >> signal is probably not a wise choice. Instead, let's switch to > >>> >> wait_for_completion_killable_*() to only catch the "important" > >>> >> signals. This was likely the intended behavior anyway. =20 > >>> > > >>> > Actually this seems to work. But aborting a process that has a SPI > >>> > transfer running causes ugly messages from kernel. This is somehow > >>> > unexpected: > >>> > > >>> > # dd if=3D/dev/urandom of=3D/flashdisk/testfile bs=3D1024 count=3D5= 12 > >>> > ^C[=C2=A0 380.726760] spi-nor spi0.0: spi transfer canceled > >>> > [=C2=A0 380.731688] spi-nor spi0.0: SPI transfer failed: -512 > >>> > [=C2=A0 380.737141] spi_master spi0: failed to transfer one message= from queue > >>> > [=C2=A0 380.746495] spi-nor spi0.0: spi transfer canceled > >>> > [=C2=A0 380.751549] spi-nor spi0.0: SPI transfer failed: -512 > >>> > [=C2=A0 380.756844] spi_master spi0: failed to transfer one message= from queue > >>> > > >>> > JFFS2 also logs an informational message which is less visible but = also > >>> > may rise eyebrows: > >>> > [=C2=A0 380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed= . retu =20 > >>> rned =20 > >>> > -512, retlen 68 =20 > >=20 > > Ugly kernel messages are a normal consequence of killing an IO. > > Chances are good that we'll find bugs in the upper layers. > > =20 > >>> > Killing a process is something to expect in certain cases and it sh= ould > >>> > not cause such messages which may create some anxiety that somethin= g bad > >>> > had happened. So maybe the "kill" case should be silent (e.g. level > >>> > "debug") > >>> > but without out hiding real errors. But even when hiding the messag= e in t =20 > >>> he =20 > >>> > SPI framework it may cause additional messages in upper layers like= JFFS2 =20 > >>> . =20 > >>> > I'm not sure whether all of this is a good idea. This is something = others > >>> > have to decide. =20 > >>>=20 > >>> ... and now I just got a crash when unmounting and remounting jffs2: > >>>=20 > >>> unmount: > >>> [ 8245.821105] spi-nor spi0.0: spi transfer canceled > >>> [ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512 > >>> [ 8245.831508] spi_master spi0: failed to transfer one message from q= ueue > >>> [ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. retur= ned -5 > >>> 12, retlen 68 > >>> [ 8245.839786] spi-nor spi0.0: spi transfer canceled > >>> [ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512 > >>> [ 8245.850145] spi_master spi0: failed to transfer one message from q= ueue > >>> [ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. retur= ned -5 > >>> 12, retlen 0 > >>> [ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty be= cause the > >>> flash driver returned retlen zero =20 > >=20 > > jffs2 has a garbage collect thread which can be controlled using various > > signals. > > To terminate the thread, jffs2 sends SIGKILL upon umount. > > If the gc thread does IO while that, you gonna kill the IO too. > > =20 > >>> mount: > >>> [ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new r= ef 28b > >>> d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x00= 0ad578 > >>> -0x000ad578) > >>> [ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] T= HUMB2 =20 > >=20 > >=20 > > I fear this is a jffs2 (summary feature) bug. Chances are great that yo= u're able > > to trigger the very same using a sudden loss of power. > > =20 > >> It's not just spi-atmel, any spi-mem controller might be tempted to use > >> interruptible^Wkillable transfers just because the timeout values can > >> be really big as the memory sizes increase. > >>=20 > >> One solution is to change the completion helpers back to something > >> non-killable/non-interruptible, but the user experience will be > >> slightly degraded. The other would be to look into jffs2 (if it's the > >> only filesystem playing with signals during unmount, tbh I don't know). > >> But maybe this signaling mechanism can't be hacked for compatibility > >> reasons. Handling this at the spi level would be a mix of layers, I'm > >> not ready for that. > >>=20 > >> Richard, Mark, what's your opinion here? =20 > >=20 > > I *think* we can remove the signal handling code from jffs2 since it ma= kes > > already use of the kthread_should_stop() API. > > That way we can keep the SPI transfer interruptible by signals. > > ...reading right now into the history to figure better. =20 >=20 > After a brief discussion with dwmw2 another question came up, if an spi t= ransfer > is cancelled, *all* other IO do the filesystem has to stop too. > IO can happen concurrently, if only one IO path dies but the other ones c= an > make progress, the filesystem becomes inconsistent and all hope is lost. >=20 > Miquel, is this guaranteed by your changes? Absolutely not, the changes are in a spi controller, there is nothing specific to the user there. If a filesystem transfer get interrupted, it's the filesystem responsibility to cancel the other IOs if that's relevant for its own consistency? Thanks, Miqu=C3=A8l