From: Yongji Xie <elohimes@gmail.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Stefan Hajnoczi" <stefanha@gmail.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Daniel P. Berrangé" <berrange@redhat.com>,
"Jason Wang" <jasowang@redhat.com>,
"Coquelin, Maxime" <maxime.coquelin@redhat.com>,
"Yury Kotov" <yury-kotov@yandex-team.ru>,
"Евгений Яковлев" <wrfsh@yandex-team.ru>,
qemu-devel <qemu-devel@nongnu.org>,
zhangyu31@baidu.com, chaiwen@baidu.com, nixun@baidu.com,
lilin24@baidu.com, "Xie Yongji" <xieyongji@baidu.com>
Subject: Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
Date: Sun, 24 Feb 2019 16:02:42 +0800 [thread overview]
Message-ID: <CAONzpcbBkiN6CePxDPjEK4rv_K04UeCWcm1QFQCw_PdamwUAxA@mail.gmail.com> (raw)
In-Reply-To: <20190223191037-mutt-send-email-mst@kernel.org>
On Sun, 24 Feb 2019 at 08:14, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sat, Feb 23, 2019 at 09:10:01PM +0800, Yongji Xie wrote:
> > On Fri, 22 Feb 2019 at 22:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote:
> > > > On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > > > > > +
> > > > > > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > > > > > +
> > > > > > > > +When receiving available buffers from the driver:
> > > > > > > > +
> > > > > > > > + 1. Get the next available head-descriptor index from available ring, i
> > > > > > > > +
> > > > > > > > + 2. Set desc[i].inflight to 1
> > > > > > > > +
> > > > > > > > +When supplying used buffers to the driver:
> > > > > > > > +
> > > > > > > > + 1. Get corresponding used head-descriptor index, i
> > > > > > > > +
> > > > > > > > + 2. Set desc[i].next to process_head
> > > > > > > > +
> > > > > > > > + 3. Set process_head to i
> > > > > > > > +
> > > > > > > > + 4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > > > > > +
> > > > > > > > + 5. Increase the idx value of used ring by the size of the batch
> > > > > > > > +
> > > > > > > > + 6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > > > > > +
> > > > > > > > + 7. Set used_idx to the idx value of used ring
> > > > > > > > +
> > > > > > > > +When reconnecting:
> > > > > > > > +
> > > > > > > > + 1. If the value of used_idx does not match the idx value of used ring,
> > > > > > > > +
> > > > > > > > + (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > > > > > + the number of in-progress DescStateSplit entries
> > > > > > > > +
> > > > > > > > + (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > > > > > + start from process_head to 0
> > > > > > > > +
> > > > > > > > + (c) Set used_idx to the idx value of used ring
> > > > > > > > +
> > > > > > > > + 2. Resubmit each inflight DescStateSplit entry
> > > > > > >
> > > > > > > I re-read a couple of time and I still don't understand what it says.
> > > > > > >
> > > > > > > For simplicity consider split ring. So we want a list of heads that are
> > > > > > > outstanding. Fair enough. Now device finishes a head. What now? I needs
> > > > > > > to drop head from the list. But list is unidirectional (just next, no
> > > > > > > prev). So how can you drop an entry from the middle?
> > > > > > >
> > > > > >
> > > > > > The process_head is only used when slave crash between increasing the
> > > > > > idx value of used ring and updating used_idx. We use it to find the
> > > > > > in-progress DescStateSplit entries before the crash and complete them
> > > > > > when reconnecting. Make sure guest and slave have the same view for
> > > > > > inflight I/Os.
> > > > > >
> > > > >
> > > > > But I don't understand how does the described process help do it?
> > > > >
> > > >
> > > > For example, we need to submit descriptors A, B, C to driver in a batch.
> > > >
> > > > Firstly, we will link those descriptors like:
> > > >
> > > > process_head->A->B->C (A)
> > > >
> > > > Then, we need to update idx value of used vring to mark those
> > > > descriptors as used:
> > > >
> > > > _vring.used->idx += 3 (B)
> > > >
> > > > At last, clear the inflight field of those descriptors and update
> > > > used_idx field:
> > > >
> > > > A.inflight = 0; B.inflight = 0; C.inflight = 0; (C)
> > > >
> > > > used_idx = _vring.used->idx; (D)
> > > >
> > > > After (B), guest can consume the descriptors A,B,C. So we must make
> > > > sure the inflight field of A,B,C is cleared when reconnecting to avoid
> > > > re-submitting used descriptor. If slave crash during (C), the inflight
> > > > field of A,B,C may be incorrect. To detect that case, we can see
> > > > whether used_idx matches _vring.used->idx. And through process_head,
> > > > we can get the in-progress descriptors A,B,C and clear their inflight
> > > > field again when reconnecting.
> > > >
> > > > >
> > > > > > In other case, the inflight field is enough to track inflight I/O.
> > > > > > When reconnecting, we go through all DescStateSplit entries and
> > > > > > re-submit the entry whose inflight field is equal to 1.
> > > > >
> > > > > What I don't understand is how do we know the order
> > > > > in which they have to be resubmitted. Reordering
> > > > > operations would be a big problem, won't it?
> > > > >
> > > >
> > > > In previous patch, I record avail_idx for each DescStateSplit entry to
> > > > preserve the order. Is it useful to fix this?
> > > >
> > > > >
> > > > > Let's say I fetch descriptors A, B, C and start
> > > > > processing. how does memory look?
> > > >
> > > > A.inflight = 1, C.inflight = 1, B.inflight = 1
> > > >
> > > > > Now I finished B and marked it used. How does
> > > > > memory look?
> > > > >
> > > >
> > > > A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B
> > >
> > > OK. And we have
> > >
> > > process_head->B->process_head
> > >
> > > ?
> > >
> > > Now if there is a reconnect, I want to submit A and then C,
> > > correct? How do I know that from this picture? How do I
> > > know to start with A? It's not on the list anymore...
> > >
> >
> > We can go through all DescStateSplit entries (track all descriptors in
> > Descriptor Table), then we can find A and C are inflight entry by its
> > inflight field. And if we want to resubmit them in order (submit A and
> > then C), we need to introduce a timestamp for each DescStateSplit
> > entry to preserve the order when we fetch them from driver. Something
> > like:
> >
> > When receiving available buffers from the driver:
> >
> > 1. Get the next available head-descriptor index from available ring, i
> >
> > 2. desc[i].timestamp = avail_idx++;
> >
> > 3. Set desc[i].inflight to 1
> >
> > Thanks,
> > Yongji
>
> OK I guess a 64 bit counter would be fine for that.
>
> In order seems critical for storage right?
> Reordering write would seem to lead to data corruption.
>
Actually I'm not sure. If we care about the order, we should not do
access to one block until another access to the same block is
completed?
> But now I don't understand what does the next
> field do. So it so you can maintain a freelist
> within a statically allocated array?
>
Yes, we can use it to maintain a list. The head of the list is
process_head. This list is only used when we want to submit
descriptors in a batch. All descriptors in this batch are linked to
the list. Then if we crash between marking those descriptors used and
clearing their inflight field. We need to find those in-progress
descriptors. The list will be helpful to achieve that. If no batch for
submitting, the next field can be removed.
Thanks,
Yongji
next prev parent reply other threads:[~2019-02-24 8:03 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
2019-02-21 17:27 ` Michael S. Tsirkin
2019-02-22 2:47 ` Yongji Xie
2019-02-22 6:21 ` Michael S. Tsirkin
2019-02-22 7:05 ` Yongji Xie
2019-02-22 14:54 ` Michael S. Tsirkin
2019-02-23 13:10 ` Yongji Xie
2019-02-24 0:14 ` Michael S. Tsirkin
2019-02-24 8:02 ` Yongji Xie [this message]
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 2/7] libvhost-user: Remove unnecessary FD flag check for event file descriptors elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 6/7] vhost-user-blk: Add support to reconnect backend elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
2019-02-20 19:59 ` [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting Michael S. Tsirkin
2019-02-21 1:30 ` Yongji Xie
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAONzpcbBkiN6CePxDPjEK4rv_K04UeCWcm1QFQCw_PdamwUAxA@mail.gmail.com \
--to=elohimes@gmail.com \
--cc=berrange@redhat.com \
--cc=chaiwen@baidu.com \
--cc=jasowang@redhat.com \
--cc=lilin24@baidu.com \
--cc=marcandre.lureau@redhat.com \
--cc=maxime.coquelin@redhat.com \
--cc=mst@redhat.com \
--cc=nixun@baidu.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@gmail.com \
--cc=wrfsh@yandex-team.ru \
--cc=xieyongji@baidu.com \
--cc=yury-kotov@yandex-team.ru \
--cc=zhangyu31@baidu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).