From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([209.51.188.92]:55711) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gxhRG-0002ST-T9 for qemu-devel@nongnu.org; Sat, 23 Feb 2019 19:14:39 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gxhRF-0001CO-SJ for qemu-devel@nongnu.org; Sat, 23 Feb 2019 19:14:38 -0500 Received: from mail-qt1-f194.google.com ([209.85.160.194]:33181) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1gxhRF-0001Ac-Mu for qemu-devel@nongnu.org; Sat, 23 Feb 2019 19:14:37 -0500 Received: by mail-qt1-f194.google.com with SMTP id z39so6757988qtz.0 for ; Sat, 23 Feb 2019 16:14:36 -0800 (PST) Date: Sat, 23 Feb 2019 19:14:32 -0500 From: "Michael S. Tsirkin" Message-ID: <20190223191037-mutt-send-email-mst@kernel.org> References: <20190218102748.2242-1-xieyongji@baidu.com> <20190218102748.2242-2-xieyongji@baidu.com> <20190221120121-mutt-send-email-mst@kernel.org> <20190222011631-mutt-send-email-mst@kernel.org> <20190222095009-mutt-send-email-mst@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Subject: Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Yongji Xie Cc: Stefan Hajnoczi , =?iso-8859-1?Q?Marc-Andr=E9?= Lureau , Daniel =?iso-8859-1?Q?P=2E_Berrang=E9?= , Jason Wang , "Coquelin, Maxime" , Yury Kotov , =?utf-8?B?0JXQstCz0LXQvdC40Lkg0K/QutC+0LLQu9C10LI=?= , qemu-devel , zhangyu31@baidu.com, chaiwen@baidu.com, nixun@baidu.com, lilin24@baidu.com, Xie Yongji On Sat, Feb 23, 2019 at 09:10:01PM +0800, Yongji Xie wrote: > On Fri, 22 Feb 2019 at 22:54, Michael S. Tsirkin wrote: > > > > On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote: > > > On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin wrote: > > > > > > > > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote: > > > > > > > + > > > > > > > +To track inflight I/O, the queue region should be processed as follows: > > > > > > > + > > > > > > > +When receiving available buffers from the driver: > > > > > > > + > > > > > > > + 1. Get the next available head-descriptor index from available ring, i > > > > > > > + > > > > > > > + 2. Set desc[i].inflight to 1 > > > > > > > + > > > > > > > +When supplying used buffers to the driver: > > > > > > > + > > > > > > > + 1. Get corresponding used head-descriptor index, i > > > > > > > + > > > > > > > + 2. Set desc[i].next to process_head > > > > > > > + > > > > > > > + 3. Set process_head to i > > > > > > > + > > > > > > > + 4. Steps 1,2,3 may be performed repeatedly if batching is possible > > > > > > > + > > > > > > > + 5. Increase the idx value of used ring by the size of the batch > > > > > > > + > > > > > > > + 6. Set the inflight field of each DescStateSplit entry in the batch to 0 > > > > > > > + > > > > > > > + 7. Set used_idx to the idx value of used ring > > > > > > > + > > > > > > > +When reconnecting: > > > > > > > + > > > > > > > + 1. If the value of used_idx does not match the idx value of used ring, > > > > > > > + > > > > > > > + (a) Subtract the value of used_idx from the idx value of used ring to get > > > > > > > + the number of in-progress DescStateSplit entries > > > > > > > + > > > > > > > + (b) Set the inflight field of the in-progress DescStateSplit entries which > > > > > > > + start from process_head to 0 > > > > > > > + > > > > > > > + (c) Set used_idx to the idx value of used ring > > > > > > > + > > > > > > > + 2. Resubmit each inflight DescStateSplit entry > > > > > > > > > > > > I re-read a couple of time and I still don't understand what it says. > > > > > > > > > > > > For simplicity consider split ring. So we want a list of heads that are > > > > > > outstanding. Fair enough. Now device finishes a head. What now? I needs > > > > > > to drop head from the list. But list is unidirectional (just next, no > > > > > > prev). So how can you drop an entry from the middle? > > > > > > > > > > > > > > > > The process_head is only used when slave crash between increasing the > > > > > idx value of used ring and updating used_idx. We use it to find the > > > > > in-progress DescStateSplit entries before the crash and complete them > > > > > when reconnecting. Make sure guest and slave have the same view for > > > > > inflight I/Os. > > > > > > > > > > > > > But I don't understand how does the described process help do it? > > > > > > > > > > For example, we need to submit descriptors A, B, C to driver in a batch. > > > > > > Firstly, we will link those descriptors like: > > > > > > process_head->A->B->C (A) > > > > > > Then, we need to update idx value of used vring to mark those > > > descriptors as used: > > > > > > _vring.used->idx += 3 (B) > > > > > > At last, clear the inflight field of those descriptors and update > > > used_idx field: > > > > > > A.inflight = 0; B.inflight = 0; C.inflight = 0; (C) > > > > > > used_idx = _vring.used->idx; (D) > > > > > > After (B), guest can consume the descriptors A,B,C. So we must make > > > sure the inflight field of A,B,C is cleared when reconnecting to avoid > > > re-submitting used descriptor. If slave crash during (C), the inflight > > > field of A,B,C may be incorrect. To detect that case, we can see > > > whether used_idx matches _vring.used->idx. And through process_head, > > > we can get the in-progress descriptors A,B,C and clear their inflight > > > field again when reconnecting. > > > > > > > > > > > > In other case, the inflight field is enough to track inflight I/O. > > > > > When reconnecting, we go through all DescStateSplit entries and > > > > > re-submit the entry whose inflight field is equal to 1. > > > > > > > > What I don't understand is how do we know the order > > > > in which they have to be resubmitted. Reordering > > > > operations would be a big problem, won't it? > > > > > > > > > > In previous patch, I record avail_idx for each DescStateSplit entry to > > > preserve the order. Is it useful to fix this? > > > > > > > > > > > Let's say I fetch descriptors A, B, C and start > > > > processing. how does memory look? > > > > > > A.inflight = 1, C.inflight = 1, B.inflight = 1 > > > > > > > Now I finished B and marked it used. How does > > > > memory look? > > > > > > > > > > A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B > > > > OK. And we have > > > > process_head->B->process_head > > > > ? > > > > Now if there is a reconnect, I want to submit A and then C, > > correct? How do I know that from this picture? How do I > > know to start with A? It's not on the list anymore... > > > > We can go through all DescStateSplit entries (track all descriptors in > Descriptor Table), then we can find A and C are inflight entry by its > inflight field. And if we want to resubmit them in order (submit A and > then C), we need to introduce a timestamp for each DescStateSplit > entry to preserve the order when we fetch them from driver. Something > like: > > When receiving available buffers from the driver: > > 1. Get the next available head-descriptor index from available ring, i > > 2. desc[i].timestamp = avail_idx++; > > 3. Set desc[i].inflight to 1 > > Thanks, > Yongji OK I guess a 64 bit counter would be fine for that. In order seems critical for storage right? Reordering write would seem to lead to data corruption. But now I don't understand what does the next field do. So it so you can maintain a freelist within a statically allocated array? -- MST