Linux cryptographic layer development
From: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
To: Kyle Sanderson <kyle.leet@gmail.com>, herbert@gondor.apana.org.au
Cc: Dave Chinner <david@fromorbit.com>,
	qat-linux@intel.com, Linux-Kernal <linux-kernel@vger.kernel.org>,
	linux-xfs@vger.kernel.org, linux-crypto@vger.kernel.org,
	dm-devel@redhat.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs
Date: Mon, 21 Feb 2022 11:47:55 +0000
Message-ID: <YhN76/ONC9qgIKQc@silpixa00400314>
In-Reply-To: <CACsaVZ+LZUebtsGuiKhNV_No8fNLTv5kJywFKOigieB1cZcKUw@mail.gmail.com>

Hi Kyle,

The issue is that the aead and skcipher implementations in the QAT
driver do not properly support requests with the
CRYPTO_TFM_REQ_MAY_BACKLOG flag set.
If the HW queue is full, the driver returns -EBUSY [1] but does not
enqueue the request as dm-crypt expects [2]. Dm-crypt ends up waiting
indefinitely for the completion of a request that was never submitted,
hence the stall.
This is not related to QATE-7495 'An incorrectly formatted request to
QAT can hang the entire QAT endpoint' [3], which occurs when a malformed
request is sent to the device.
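
To make the backlog contract described above concrete, here is a minimal
user-space sketch. It is a simulation, not QAT driver or crypto API code:
every sim_* name, the SIM_* constants and the queue depths are hypothetical,
and the -ENOSPC return for requests without the backlog flag is an
assumption modelled on the generic crypto queue helper. It contrasts a
submit path that backlogs a request before returning -EBUSY with one that
returns -EBUSY and drops the request.

```c
/* User-space simulation of the MAY_BACKLOG contract. NOT kernel code.
 * All sim_* names and SIM_* values are hypothetical illustrations. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SIM_REQ_MAY_BACKLOG 0x1u /* stands in for CRYPTO_TFM_REQ_MAY_BACKLOG */
#define SIM_HW_SLOTS        2    /* illustrative hardware ring depth */
#define SIM_BACKLOG_SLOTS   8    /* illustrative software backlog depth */

struct sim_req {
    unsigned int flags;
    int completed;               /* set to 1 once the request is processed */
};

struct sim_dev {
    struct sim_req *ring[SIM_HW_SLOTS];
    size_t ring_len;
    struct sim_req *backlog[SIM_BACKLOG_SLOTS];
    size_t backlog_len;
};

/* Contract-respecting submit: -EBUSY means "queued on the backlog,
 * a completion will follow later". */
static int sim_submit_ok(struct sim_dev *d, struct sim_req *r)
{
    if (d->ring_len < SIM_HW_SLOTS) {
        d->ring[d->ring_len++] = r;
        return -EINPROGRESS;
    }
    if (!(r->flags & SIM_REQ_MAY_BACKLOG))
        return -ENOSPC;          /* assumption: not queued, caller must retry */
    assert(d->backlog_len < SIM_BACKLOG_SLOTS); /* simulation limit only */
    d->backlog[d->backlog_len++] = r;
    return -EBUSY;
}

/* Behaviour described in the message: -EBUSY on a full ring,
 * but the request is not stored anywhere. */
static int sim_submit_buggy(struct sim_dev *d, struct sim_req *r)
{
    if (d->ring_len < SIM_HW_SLOTS) {
        d->ring[d->ring_len++] = r;
        return -EINPROGRESS;
    }
    return -EBUSY;
}

/* Complete everything on the ring, refill it from the backlog, repeat. */
static size_t sim_drain(struct sim_dev *d)
{
    size_t done = 0;

    while (d->ring_len > 0 || d->backlog_len > 0) {
        for (size_t i = 0; i < d->ring_len; i++) {
            d->ring[i]->completed = 1;
            done++;
        }
        d->ring_len = 0;

        size_t moved = 0;
        while (moved < d->backlog_len && d->ring_len < SIM_HW_SLOTS)
            d->ring[d->ring_len++] = d->backlog[moved++];
        for (size_t i = moved; i < d->backlog_len; i++)
            d->backlog[i - moved] = d->backlog[i];
        d->backlog_len -= moved;
    }
    return done;
}

/* Submit n backlog-capable requests, drain the device, and count the
 * requests whose completion never arrives. */
static size_t sim_count_stalled(int (*submit)(struct sim_dev *,
                                              struct sim_req *), size_t n)
{
    struct sim_dev dev = {0};
    struct sim_req reqs[SIM_HW_SLOTS + SIM_BACKLOG_SLOTS] = {0};
    size_t stalled = 0;

    assert(n <= SIM_HW_SLOTS + SIM_BACKLOG_SLOTS);
    for (size_t i = 0; i < n; i++) {
        reqs[i].flags = SIM_REQ_MAY_BACKLOG;
        (void)submit(&dev, &reqs[i]); /* caller treats -EBUSY as "queued" */
    }
    sim_drain(&dev);
    for (size_t i = 0; i < n; i++)
        if (!reqs[i].completed)
            stalled++;
    return stalled;
}
```

Calling sim_count_stalled(sim_submit_ok, 5) leaves no request without a
completion, while sim_count_stalled(sim_submit_buggy, 5) leaves every
request beyond the ring depth waiting forever, which is the kind of wait
a caller relying on the backlog semantics would then be stuck in.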

I'm working on a patch that resolves this problem. In the meantime, a
workaround is to blacklist the qat_c3xxx.ko driver.

Regarding avoiding this issue on stable kernels: the usage of QAT with
dm-crypt was already disabled in kernel 5.10 for a different issue
(the driver allocates memory in the datapath).
The following patches implement the change:
    7bcb2c99f8ed crypto: algapi - use common mechanism for inheriting flags
    2eb27c11937e crypto: algapi - add NEED_FALLBACK to INHERITED_FLAGS
    fbb6cda44190 crypto: algapi - introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
    b8aa7dc5c753 crypto: drivers - set the flag CRYPTO_ALG_ALLOCATES_MEMORY
    cd74693870fb dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
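
For readers unfamiliar with how a flag can keep a driver away from
dm-crypt, the following is a rough user-space sketch of a type/mask
algorithm lookup. It is an illustration under assumptions, not the kernel
implementation: the sim_* names, the driver names, the priorities and the
flag value are all hypothetical. The idea shown is that a caller passing
the "allocates memory" bit in the mask (with the bit clear in the type)
only matches implementations that do not set it.

```c
/* User-space sketch of flag-based algorithm selection. NOT kernel code.
 * Names, priorities and the flag value below are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_ALG_ALLOCATES_MEMORY 0x1u /* stands in for CRYPTO_ALG_ALLOCATES_MEMORY */

struct sim_alg {
    const char *name;     /* generic algorithm name, e.g. "xts(aes)" */
    const char *driver;   /* implementation name (illustrative) */
    int priority;         /* higher wins among matching candidates */
    unsigned int flags;
};

/* Two made-up implementations of the same algorithm: an offload engine
 * that allocates memory in its datapath, and a CPU implementation. */
static const struct sim_alg sim_algs[] = {
    { "xts(aes)", "sim-xts-aes-offload", 4000, SIM_ALG_ALLOCATES_MEMORY },
    { "xts(aes)", "sim-xts-aes-cpu",      400, 0 },
};

/* Return the highest-priority implementation of `name` whose flags agree
 * with `type` on every bit selected by `mask`, or NULL if none does. */
static const struct sim_alg *sim_lookup(const struct sim_alg *algs, size_t n,
                                        const char *name,
                                        unsigned int type, unsigned int mask)
{
    const struct sim_alg *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(algs[i].name, name) != 0)
            continue;
        if ((algs[i].flags ^ type) & mask)
            continue;     /* a masked bit differs from the requested type */
        if (best == NULL || algs[i].priority > best->priority)
            best = &algs[i];
    }
    return best;
}
```

With mask 0 the sketch picks the higher-priority offload entry; with
SIM_ALG_ALLOCATES_MEMORY in the mask and 0 as the type, the offload entry
is filtered out and the CPU entry is returned instead.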
One option would be to send the patches above to stable; another is to
wait for a patch that fixes the problems in the QAT driver and send that
to stable.
@Herbert, what is the preferred approach here?

Thanks,

[1] https://elixir.bootlin.com/linux/latest/source/drivers/crypto/qat/qat_common/qat_algs.c#L1022
[2] https://elixir.bootlin.com/linux/latest/source/drivers/md/dm-crypt.c#L1584
[3] https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf - page 25

-- 
Giovanni


On Sat, Feb 19, 2022 at 03:00:51PM -0800, Kyle Sanderson wrote:
> hi Dave,
> 
> > This really sounds like broken hardware, not a kernel problem.
> 
> It is indeed a hardware issue, specifically the intel qat crypto
> driver that's in-tree - the hardware is fine (see below). The IQAT
> errata documentation states that if a request is not submitted
> properly it can stall the entire device. The remediation guidance from
> 2020 was "don't do that" and "don't allow unprivileged users access to
> the device". The in-tree driver is not implemented properly either for
> this SoC or board - I'm thinking it's related to QATE-7495.
> 
> https://01.org/sites/default/files/downloads//336211qatsoftwareforlinux-rn-hwversion1.7021.pdf
> 
> > This implies a dmcrypt level problem - XFS can't make progress if dmcrypt is not completing IOs.
> 
> That's the weird part about it. Some bio's are completing, others are
> completely dropped, with some stalling forever. I had to use
> xfs_repair to get the volumes operational again. I lost a good deal of
> files and had to recover from backup after toggling the device back on
> on a production system (silly, I know).
> 
> > Where are the XFS corruption reports that the subject implies is occurring?
> 
> I think you're right, it's dm-crypt that's broken here, with
> ultimately the crypto driver causing this corruption. XFS being the
> edge to the end-user is taking the brunt of it. There's reports going
> back to late 2017 of significant issues with this mainlined stable
> driver.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1522962
> https://serverfault.com/questions/1010108/luks-hangs-on-centos-running-on-atom-c3758-cpu
> https://www.phoronix.com/forums/forum/software/distributions/1172231-fedora-33-s-enterprise-linux-next-effort-approved-testbed-for-raising-cpu-requirements-etc?p=1174560#post1174560
> 
> Any guidance would be appreciated.
> Kyle.
> On Sat, Feb 19, 2022 at 1:03 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Feb 18, 2022 at 09:02:28PM -0800, Kyle Sanderson wrote:
> > > A2SDi-8C-HLN4F has IQAT enabled by default, when this device is
> > > attempted to be used by xfs (through dm-crypt) the entire kernel
> > > thread stalls forever. Multiple users have hit this over the years
> > > (through sporadic reporting) - I ended up trying ZFS and encryption
> > > wasn't an issue there at all because I guess they don't use this
> > > device. Returning to sanity (xfs), I was able to provision a dm-crypt
> > > volume no problem on the disk, however when running mkfs.xfs on the
> > > volume is what triggers the cascading failure (each request kills a
> > > kthread).
> >
> > Can you provide the full stack traces for these errors so we can see
> > exactly what this cascading failure looks like, please? In reality,
> > the stall messages some time after this are not interesting - it's
> > the first errors that cause the stall that need to be investigated.
> >
> > A good idea would be to provide the full storage stack description
> > and hardware in use, as per:
> >
> > https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> > > Disabling IQAT on the south bridge results in a working
> > > system, however this is not the default configuration for the
> > > distribution of choice (Ubuntu 20.04.3 LTS), nor the motherboard. I'm
> > > convinced this never worked properly based on the lack of popularity
> > > for kernel encryption (crypto), and the embedded nature that
> > > SuperMicro has integrated this device in collaboration with intel as
> > > it looks like the primary usage is through external accelerator cards.
> >
> > This really sounds like broken hardware, not a kernel problem.
> >
> > > Kernels tried were from RHEL8 over a year ago, and this impacts the
> > > entirety of the 5.4 series on Ubuntu.
> > > Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.
> >
> > [snip stalled kcryptd worker threads]
> >
> > This implies a dmcrypt level problem - XFS can't make progress if
> > dmcrypt is not completing IOs.
> >
> > Where are the XFS corruption reports that the subject implies is
> > occurring?
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com

Thread overview: 25+ messages
2022-02-19  5:02 Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with dm-crypt + xfs Kyle Sanderson
2022-02-19 21:03 ` Dave Chinner
2022-02-19 23:00   ` Kyle Sanderson
2022-02-21 11:47     ` Giovanni Cabiddu [this message]
2022-02-28  8:18       ` Kyle Sanderson
2022-02-28 19:25         ` Linus Torvalds
2022-02-28 20:39           ` Giovanni Cabiddu
2022-02-28 20:59             ` Greg KH
2022-02-28 23:26             ` Herbert Xu
2022-03-01  1:12               ` Linus Torvalds
2022-03-01  4:11                 ` Herbert Xu
2022-03-02 10:29                   ` Greg KH
2022-03-02 11:49                     ` Giovanni Cabiddu
2022-03-02 14:56                       ` Greg KH
2022-03-02 22:27                         ` Herbert Xu
2022-03-02 22:42                           ` Giovanni Cabiddu
2022-03-02 22:45                             ` Herbert Xu
2022-03-03 13:49                               ` Giovanni Cabiddu
2022-03-03 19:21                                 ` Eric Biggers
2022-03-03 21:24                                   ` Giovanni Cabiddu
2022-03-03 21:44                                     ` Eric Biggers
2022-03-04 17:50                                       ` Giovanni Cabiddu
2022-03-16 21:38                                         ` Kyle Sanderson
2022-03-16 22:13                                           ` Herbert Xu
2022-02-28 21:13           ` [dm-devel] " Milan Broz
