From: "Michael S. Tsirkin" <mst@redhat.com>
To: Yongji Xie <elohimes@gmail.com>
Cc: "Stefan Hajnoczi" <stefanha@gmail.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Daniel P. Berrangé" <berrange@redhat.com>,
"Jason Wang" <jasowang@redhat.com>,
"Coquelin, Maxime" <maxime.coquelin@redhat.com>,
"Yury Kotov" <yury-kotov@yandex-team.ru>,
"Евгений Яковлев" <wrfsh@yandex-team.ru>,
qemu-devel <qemu-devel@nongnu.org>,
zhangyu31@baidu.com, chaiwen@baidu.com, nixun@baidu.com,
lilin24@baidu.com, "Xie Yongji" <xieyongji@baidu.com>
Subject: Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
Date: Sat, 23 Feb 2019 19:14:32 -0500 [thread overview]
Message-ID: <20190223191037-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <CAONzpcZ_4e5xYonBhXinkQC8E5UCsko3iLY8_8ykaVwdU0Aa4Q@mail.gmail.com>
On Sat, Feb 23, 2019 at 09:10:01PM +0800, Yongji Xie wrote:
> On Fri, 22 Feb 2019 at 22:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote:
> > > On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > > > > +
> > > > > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > > > > +
> > > > > > > +When receiving available buffers from the driver:
> > > > > > > +
> > > > > > > + 1. Get the next available head-descriptor index from available ring, i
> > > > > > > +
> > > > > > > + 2. Set desc[i].inflight to 1
> > > > > > > +
> > > > > > > +When supplying used buffers to the driver:
> > > > > > > +
> > > > > > > + 1. Get corresponding used head-descriptor index, i
> > > > > > > +
> > > > > > > + 2. Set desc[i].next to process_head
> > > > > > > +
> > > > > > > + 3. Set process_head to i
> > > > > > > +
> > > > > > > + 4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > > > > +
> > > > > > > + 5. Increase the idx value of used ring by the size of the batch
> > > > > > > +
> > > > > > > + 6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > > > > +
> > > > > > > + 7. Set used_idx to the idx value of used ring
> > > > > > > +
> > > > > > > +When reconnecting:
> > > > > > > +
> > > > > > > + 1. If the value of used_idx does not match the idx value of used ring,
> > > > > > > +
> > > > > > > + (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > > > > + the number of in-progress DescStateSplit entries
> > > > > > > +
> > > > > > > + (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > > > > + start from process_head to 0
> > > > > > > +
> > > > > > > + (c) Set used_idx to the idx value of used ring
> > > > > > > +
> > > > > > > + 2. Resubmit each inflight DescStateSplit entry
> > > > > >
> > > > > > I re-read a couple of time and I still don't understand what it says.
> > > > > >
> > > > > > For simplicity consider split ring. So we want a list of heads that are
> > > > > > outstanding. Fair enough. Now device finishes a head. What now? I needs
> > > > > > to drop head from the list. But list is unidirectional (just next, no
> > > > > > prev). So how can you drop an entry from the middle?
> > > > > >
> > > > >
> > > > > The process_head is only used when slave crash between increasing the
> > > > > idx value of used ring and updating used_idx. We use it to find the
> > > > > in-progress DescStateSplit entries before the crash and complete them
> > > > > when reconnecting. Make sure guest and slave have the same view for
> > > > > inflight I/Os.
> > > > >
> > > >
> > > > But I don't understand how does the described process help do it?
> > > >
> > >
> > > For example, we need to submit descriptors A, B, C to driver in a batch.
> > >
> > > Firstly, we will link those descriptors like:
> > >
> > > process_head->A->B->C (A)
> > >
> > > Then, we need to update idx value of used vring to mark those
> > > descriptors as used:
> > >
> > > _vring.used->idx += 3 (B)
> > >
> > > At last, clear the inflight field of those descriptors and update
> > > used_idx field:
> > >
> > > A.inflight = 0; B.inflight = 0; C.inflight = 0; (C)
> > >
> > > used_idx = _vring.used->idx; (D)
> > >
> > > After (B), guest can consume the descriptors A,B,C. So we must make
> > > sure the inflight field of A,B,C is cleared when reconnecting to avoid
> > > re-submitting used descriptor. If slave crash during (C), the inflight
> > > field of A,B,C may be incorrect. To detect that case, we can see
> > > whether used_idx matches _vring.used->idx. And through process_head,
> > > we can get the in-progress descriptors A,B,C and clear their inflight
> > > field again when reconnecting.
> > >
> > > >
> > > > > In other case, the inflight field is enough to track inflight I/O.
> > > > > When reconnecting, we go through all DescStateSplit entries and
> > > > > re-submit the entry whose inflight field is equal to 1.
> > > >
> > > > What I don't understand is how do we know the order
> > > > in which they have to be resubmitted. Reordering
> > > > operations would be a big problem, won't it?
> > > >
> > >
> > > In previous patch, I record avail_idx for each DescStateSplit entry to
> > > preserve the order. Is it useful to fix this?
> > >
> > > >
> > > > Let's say I fetch descriptors A, B, C and start
> > > > processing. how does memory look?
> > >
> > > A.inflight = 1, C.inflight = 1, B.inflight = 1
> > >
> > > > Now I finished B and marked it used. How does
> > > > memory look?
> > > >
> > >
> > > A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B
> >
> > OK. And we have
> >
> > process_head->B->process_head
> >
> > ?
> >
> > Now if there is a reconnect, I want to submit A and then C,
> > correct? How do I know that from this picture? How do I
> > know to start with A? It's not on the list anymore...
> >
>
> We can go through all DescStateSplit entries (track all descriptors in
> Descriptor Table), then we can find A and C are inflight entry by its
> inflight field. And if we want to resubmit them in order (submit A and
> then C), we need to introduce a timestamp for each DescStateSplit
> entry to preserve the order when we fetch them from driver. Something
> like:
>
> When receiving available buffers from the driver:
>
> 1. Get the next available head-descriptor index from available ring, i
>
> 2. desc[i].timestamp = avail_idx++;
>
> 3. Set desc[i].inflight to 1
>
> Thanks,
> Yongji
OK I guess a 64 bit counter would be fine for that.
In order seems critical for storage right?
Reordering write would seem to lead to data corruption.
But now I don't understand what does the next
field do. So it so you can maintain a freelist
within a statically allocated array?
--
MST
next prev parent reply other threads:[~2019-02-24 0:14 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
2019-02-21 17:27 ` Michael S. Tsirkin
2019-02-22 2:47 ` Yongji Xie
2019-02-22 6:21 ` Michael S. Tsirkin
2019-02-22 7:05 ` Yongji Xie
2019-02-22 14:54 ` Michael S. Tsirkin
2019-02-23 13:10 ` Yongji Xie
2019-02-24 0:14 ` Michael S. Tsirkin [this message]
2019-02-24 8:02 ` Yongji Xie
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 2/7] libvhost-user: Remove unnecessary FD flag check for event file descriptors elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 6/7] vhost-user-blk: Add support to reconnect backend elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
2019-02-20 19:59 ` [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting Michael S. Tsirkin
2019-02-21 1:30 ` Yongji Xie
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190223191037-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=berrange@redhat.com \
--cc=chaiwen@baidu.com \
--cc=elohimes@gmail.com \
--cc=jasowang@redhat.com \
--cc=lilin24@baidu.com \
--cc=marcandre.lureau@redhat.com \
--cc=maxime.coquelin@redhat.com \
--cc=nixun@baidu.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@gmail.com \
--cc=wrfsh@yandex-team.ru \
--cc=xieyongji@baidu.com \
--cc=yury-kotov@yandex-team.ru \
--cc=zhangyu31@baidu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.