From: Eric Farman <farman@linux.ibm.com>
To: Cornelia Huck <cohuck@redhat.com>
Cc: Jared Rossi <jrossi@linux.ibm.com>,
Halil Pasic <pasic@linux.ibm.com>,
linux-s390@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [RFC PATCH v3 0/3] vfio-ccw: Fix interrupt handling for HALT/CLEAR
Date: Wed, 17 Jun 2020 07:24:17 -0400 [thread overview]
Message-ID: <5ae6151b-31de-eca6-2917-4e23ecd4f0df@linux.ibm.com> (raw)
In-Reply-To: <20200616195053.99253-1-farman@linux.ibm.com>
On 6/16/20 3:50 PM, Eric Farman wrote:
> Let's continue our discussion of the handling of vfio-ccw interrupts.
>
> The initial fix [1] relied upon the interrupt path's examination of the
> FSM state, and freeing all resources if it were CP_PENDING. But the
> interface used by HALT/CLEAR SUBCHANNEL doesn't affect the FSM state.
> Consider this sequence:
>
> CPU 1 CPU 2
> CLEAR (state=IDLE/no change)
> START [2]
> INTERRUPT (set state=IDLE)
> INTERRUPT (set state=IDLE)
>
> This translates to a couple of possible scenarios:
>
> A) The START gets a cc2 because of the outstanding CLEAR, -EBUSY is
> returned, resources are freed, and state remains IDLE
> B) The START gets a cc0 because the CLEAR has already presented an
> interrupt, and state is set to CP_PENDING
>
> If the START gets a cc0 before the CLEAR INTERRUPT (stacked onto a
> workqueue by the IRQ context) gets a chance to run, then the INTERRUPT
> will release the channel program memory prematurely. If the two
> operations run concurrently, then the FSM state set to CP_PROCESSING
> will prevent the cp_free() from being invoked. But the io_mutex
> boundary on that path will pause itself until the START completes,
> and then allow the FSM to be reset to IDLE without considering the
> outstanding START. Neither scenario would be considered good.
>
> Having said all of that, in v2 Conny suggested [3] the following:
>
>> - Detach the cp from the subchannel (or better, remove the 1:1
>> relationship). By that I mean building the cp as a separately
>> allocated structure (maybe embedding a kref, but that might not be
>> needed), and appending it to a list after SSCH with cc=0. Discard it
>> if cc!=0.
>> - Remove the CP_PENDING state. The state is either IDLE after any
>> successful SSCH/HSCH/CSCH, or a new state in that case. But no
>> special state for SSCH.
>> - A successful CSCH removes the first queued request, if any.
>> - A final interrupt removes the first queued request, if any.
>
> What I have implemented here is basically this, with a few changes:
>
> - I don't queue cp's. Since there should only be one START in process
> at a time, and HALT/CLEAR doesn't build a cp, I didn't see a pressing
> need to introduce that complexity.
> - Furthermore, while I initially made a separately allocated cp, adding
> an alloc for a cp on each I/O AND moving the guest_cp alloc from the
> probe path to the I/O path seems excessive. So I implemented a
> "started" flag to the cp, set after a cc0 from the START, and examine
> that on the interrupt path to determine whether cp_free() is needed.
FYI... After a day or two of running, I sprung a kernel debug oops for
list corruption in ccwchain_free(). I'm going to blame this piece, since
it was the last thing I changed and I hadn't come across any such damage
since v2. So either "started" is a bad idea, or a broken one. Or both. :)
> - I opted against a "SOMETHING_PENDING" state if START/HALT/CLEAR
> got a cc0, and just put the FSM back to IDLE. It becomes too unwieldy
> to discern which operation an interrupt is completing, and whether
> more interrupts are expected, to be worth the additional state.
> - A successful CSCH doesn't do anything special, and cp_free()
> is only performed on the interrupt path. Part of me wrestled with
> how a HALT fits into that, but mostly it was that a cc0 on any
> of the instructions indicated the "channel subsystem is signaled
> to asynchronously perform the [START/HALT/CLEAR] function."
> This means that an in-flight START could still receive data from the
> device/subchannel, so not a good idea to release memory at that point.
>
> Separate from all that, I added a small check of the io_work queue to
> the FSM START path. Part of the problems I've seen was that an interrupt
> is presented by a CPU, but not yet processed by vfio-ccw. Some of the
> problems seen thus far is because of this gap, and the above changes
> don't address that either. Whether this is appropriate or ridiculous
> would be a welcome discussion.
>
> Previous versions:
> v2: https://lore.kernel.org/kvm/20200513142934.28788-1-farman@linux.ibm.com/
> v1: https://lore.kernel.org/kvm/20200124145455.51181-1-farman@linux.ibm.com/
>
> Footnotes:
> [1] https://lore.kernel.org/kvm/62e87bf67b38dc8d5760586e7c96d400db854ebe.1562854091.git.alifm@linux.ibm.com/
> [2] Halil has pointed out that QEMU should prohibit this, based on the
> rules set forth by the POPs. This is true, but we should not rely on
> it behaving properly without addressing this scenario that is visible
> today. Once I get this behaving correctly, I'll spend some time
> seeing if QEMU is misbehaving somehow.
> [3] https://lore.kernel.org/kvm/20200518180903.7cb21dd8.cohuck@redhat.com/
> [4] https://lore.kernel.org/kvm/a52368d3-8cec-7b99-1587-25e055228b62@linux.ibm.com/
>
> Eric Farman (3):
> vfio-ccw: Indicate if a channel_program is started
> vfio-ccw: Remove the CP_PENDING FSM state
> vfio-ccw: Check workqueue before doing START
>
> drivers/s390/cio/vfio_ccw_cp.c | 2 ++
> drivers/s390/cio/vfio_ccw_cp.h | 1 +
> drivers/s390/cio/vfio_ccw_drv.c | 5 +----
> drivers/s390/cio/vfio_ccw_fsm.c | 32 +++++++++++++++++------------
> drivers/s390/cio/vfio_ccw_ops.c | 3 +--
> drivers/s390/cio/vfio_ccw_private.h | 1 -
> 6 files changed, 24 insertions(+), 20 deletions(-)
>
next prev parent reply other threads:[~2020-06-17 11:24 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-16 19:50 [RFC PATCH v3 0/3] vfio-ccw: Fix interrupt handling for HALT/CLEAR Eric Farman
2020-06-16 19:50 ` [RFC PATCH v3 1/3] vfio-ccw: Indicate if a channel_program is started Eric Farman
2020-06-17 23:11 ` Halil Pasic
2020-06-18 11:47 ` Eric Farman
2020-06-16 19:50 ` [RFC PATCH v3 2/3] vfio-ccw: Remove the CP_PENDING FSM state Eric Farman
2020-06-16 19:50 ` [RFC PATCH v3 3/3] vfio-ccw: Check workqueue before doing START Eric Farman
2020-06-19 11:40 ` Cornelia Huck
2020-06-17 11:24 ` Eric Farman [this message]
2020-06-29 14:56 ` [RFC PATCH v3 0/3] vfio-ccw: Fix interrupt handling for HALT/CLEAR Cornelia Huck
2020-06-30 19:10 ` Eric Farman
2020-06-19 11:21 ` Cornelia Huck
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5ae6151b-31de-eca6-2917-4e23ecd4f0df@linux.ibm.com \
--to=farman@linux.ibm.com \
--cc=cohuck@redhat.com \
--cc=jrossi@linux.ibm.com \
--cc=kvm@vger.kernel.org \
--cc=linux-s390@vger.kernel.org \
--cc=pasic@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox