From: Pavel Butsykin
Date: Tue, 6 Sep 2016 15:40:55 +0300
Subject: Re: [Qemu-devel] [PATCH RFC v2 00/22] I/O prefetch cache
To: Avi Kivity, qemu-block@nongnu.org, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, mreitz@redhat.com,
    stefanha@redhat.com, den@openvz.org, jsnow@redhat.com
Message-ID: <57CEB957.5050009@virtuozzo.com>
In-Reply-To: <83595cde-6b37-20c2-a37d-e6b030a005a6@scylladb.com>
References: <20160829171021.4902-1-pbutsykin@virtuozzo.com>
    <83595cde-6b37-20c2-a37d-e6b030a005a6@scylladb.com>

On 01.09.2016 18:26, Avi Kivity wrote:
> On 08/29/2016 08:09 PM, Pavel Butsykin wrote:
>> The prefetch cache aims to improve the performance of sequential
>> reads. Of most interest here are small sequential read requests; such
>> requests can be optimized by extending them and moving the data into
>> the prefetch cache. However, there are two issues:
>> - In aggregate, only a small portion of requests is sequential, so
>>   the delays caused by reading larger volumes of data lead to an
>>   overall decrease in performance.
>> - With a large number of random requests, the cache memory fills
>>   with redundant data.
>> This pcache implementation solves these and other prefetching
>> problems. The pcache algorithm can be summarised by the following
>> main steps.
>>
>> 1. Monitor I/O requests to identify typical sequences.
>> This implementation of the prefetch cache works at the storage system
>> level and has information only about the physical block addresses of
>> I/O requests. Statistics are collected only from read requests up to
>> a maximum size of 32 KB (by default); each request that matches the
>> criteria goes into a pool of requests. The request statistics are
>> stored in an rb-tree (lreq.tree), a simple but, for this purpose,
>> quite efficient data structure.
>>
>> 2. Identify sequential I/O streams.
>> For each read request, pcache attempts to find in lreq.tree a chain
>> of requests in which this request is the next element of a sequential
>> chain. The search key for consecutive requests is the area of sectors
>> preceding the current request. This area should not be too small, to
>> avoid false readahead. Sequential streams can be identified even
>> among a large number of random requests. For example, if blocks
>> 100, 1157, 27520, 4, 101, 312, 1337, 102 are accessed, then while
>> processing request 102 the chain of sequential requests 100, 101, 102
>> is identified and a decision to do readahead is made. A situation may
>> also arise in which multiple applications A, B, C simultaneously
>> perform sequential reads. For each application the reads are
>> sequential: A(100, 101, 102), B(300, 301, 302), C(700, 701, 702), but
>> to the block device they may look like random reads:
>> 100, 300, 700, 101, 301, 701, 102, 302, 702.
>> In this case the sequential streams are still recognised, because the
>> position of the requests in the rb-tree makes it possible to separate
>> the sequential I/O streams.
>>
>> 3. Do the readahead into the cache for recognized sequential data
>> streams.
>> Once a pcache case has been detected, larger requests are needed to
>> bring data into the cache. In this implementation pcache uses
>> readahead instead of extending the request, so the original request
>> goes through as is. There is no reason to put data into the cache
>> that will never be picked up, but this always happens when requests
>> are extended. The areas of cached blocks are also stored in an
>> rb-tree (pcache.tree), a simple but, for this purpose, quite
>> efficient data structure.
>>
>> 4. Control the size of the prefetch cache pool and the request
>> statistics pool.
>> To bound the request statistics pool, request data is inserted and
>> replaced on a FIFO basis; nothing more is needed. To bound the cache
>> memory, an LRU list is used; it limits the maximum amount of memory
>> that can be allocated for pcache. But the LRU is there mainly to
>> prevent displacement of cache blocks that have been read only
>> partially. The main way memory is released is immediately after use:
>> as soon as a chunk of cache memory has been read completely it is
>> dropped, since the probability of the request being repeated is very
>> low. Cases where the same portion of cache memory is read several
>> times are not optimized and are not among the cases pcache can
>> optimize. Thus, with a small cache, by optimizing the readahead and
>> memory release operations, we can read entire volumes of data with a
>> 100% cache hit rate. The effectiveness of random read requests is
>> also not decreased.
>>
>> PCache is implemented as a qemu block filter driver and has some
>> configurable parameters, such as total cache size, readahead size,
>> and the maximum size of a block that can be processed.
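
To make step 2 above more concrete, here is a minimal sketch in Python of how a new request can be matched against the area preceding it. It is not the code from block/pcache.c: the class name, the pool_size and lookback parameters, and the one-block-per-request simplification are all assumptions made for illustration, and a sorted list stands in for the rb-tree (lreq.tree).

```python
import bisect
from collections import deque


class RequestStats:
    """Bounded pool of recent read requests, ordered by block number.

    Hypothetical stand-in for lreq.tree: a sorted list replaces the rb-tree,
    and a deque gives the FIFO replacement mentioned in step 4.
    """

    def __init__(self, pool_size=64, lookback=2):
        self.pool_size = pool_size  # assumed limit on remembered requests
        self.lookback = lookback    # assumed size of the preceding area, in blocks
        self.fifo = deque()         # arrival order, oldest first
        self.sorted_blocks = []     # the same blocks, kept sorted for range lookups

    def _remember(self, block):
        # Drop the oldest request once the pool is full (FIFO replacement).
        if len(self.fifo) == self.pool_size:
            oldest = self.fifo.popleft()
            self.sorted_blocks.pop(bisect.bisect_left(self.sorted_blocks, oldest))
        self.fifo.append(block)
        bisect.insort(self.sorted_blocks, block)

    def is_sequential(self, block):
        """Record a one-block read and report whether every block in the
        `lookback` area immediately before it was read recently."""
        lo = bisect.bisect_left(self.sorted_blocks, block - self.lookback)
        hi = bisect.bisect_left(self.sorted_blocks, block)
        preceding = set(self.sorted_blocks[lo:hi])
        self._remember(block)
        return preceding == set(range(block - self.lookback, block))


# The access pattern from the cover letter: only the request for block 102
# completes the chain 100, 101, 102.
stats = RequestStats()
stream = [100, 1157, 27520, 4, 101, 312, 1337, 102]
print([b for b in stream if stats.is_sequential(b)])  # [102]
```

Because the lookup is by block position rather than by arrival order, the interleaved pattern 100, 300, 700, 101, 301, 701, 102, 302, 702 is separated into three streams in the same way; with lookback=2 the sketch flags 102, 302 and 702.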
>>
>> For performance evaluation, several test cases with different
>> sequential and random read patterns were run on an SSD. Here are the
>> test results and the qemu parameters:
>>
>> qemu parameters:
>> -M pc-i440fx-2.4 --enable-kvm -smp 4 -m 1024
>> -drive file=centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,
>>        aio=native,pcache-full-size=4MB,pcache-readahead-size=128KB,
>>        pcache-max-aio-size=32KB
>> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,
>>         id=virtio-disk0
>> (-set device.virtio-disk0.x-data-plane=on)
>>
>> ******************************************************************************
>> * Testcase                        * Results in iops                          *
>> *                                 ********************************************
>> *                                 * clean qemu  * pcache      * x-data-plane *
>> ******************************************************************************
>> * Create/open 16 file(s) of total * 25514 req/s * 85659 req/s * 28249 req/s  *
>> * size 2048.00 MB named           * 25692 req/s * 89064 req/s * 27950 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 25836 req/s * 84142 req/s * 28120 req/s  *
>> * and do uncached sequential read *             *             *              *
>> * by 4KB blocks                   *             *             *              *
>> ******************************************************************************
>> * Create/open 16 file(s) of total * 56006 req/s * 92137 req/s * 56992 req/s  *
>> * size 2048.00 MB named           * 55335 req/s * 92269 req/s * 57023 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 55731 req/s * 98722 req/s * 56593 req/s  *
>> * and do uncached sequential read *             *             *              *
>> * by 4KB blocks with constant     *             *             *              *
>> ******************************************************************************
>> * Create/open 16 file(s) of total * 14104 req/s * 14164 req/s * 13914 req/s  *
>> * size 2048.00 MB named           * 14130 req/s * 14232 req/s * 13613 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 14183 req/s * 14080 req/s * 13374 req/s  *
>> * and do uncached random read by  *             *             *              *
>> * 4KB blocks                      *             *             *              *
>> ******************************************************************************
>> * Create/open 16 file(s) of total * 23480 req/s * 23483 req/s * 20887 req/s  *
>> * size 2048.00 MB named           * 23070 req/s * 22432 req/s * 21127 req/s  *
>> * /tmp/tmp.tmp, start 4 thread(s) * 24090 req/s * 23499 req/s * 23415 req/s  *
>> * and do uncached random read by  *             *             *              *
>> * 4KB blocks with constant queue  *             *             *              *
>> * len 32                          *             *             *              *
>> ******************************************************************************
>>
>
> I note, in your tests, you use uncached sequential reads. But are
> uncached sequential reads with a small block size common?
>
> Consider the case of cached sequential reads. Here, the guest OS will
> issue read-aheads. pcache will detect them and issue its own
> read-aheads, both layers will read ahead more than necessary, so pcache
> is adding extra I/O and memory copies here.
>

Yes, guests can have their own read-ahead cache, but in this case pcache
doesn't lead to excessive activity, because the first guest read-ahead
request hits in pcache memory, and the subsequent read-ahead requests are
filtered out on the pcache side. This holds only when the window sizes
are the same; if the window sizes differ, concurrent read-ahead requests
never happen. Even if simultaneous read-ahead requests can lead to extra
I/O, that is only a problem of the pcache implementation.

> So I'm wondering about the use case. Guest userspace applications which
> do uncached reads will typically manage their own read-ahead; and cached
> reads have the kernel reading ahead for them, with the benefit of
> knowing the file layout. That leaves dd iflag=direct, but is it such an
> important application?
>

It helps with live loads on Windows. A simple example: a Windows boot
(win8.1, 1024-RAM), even with the Windows Prefetcher enabled, leads to
reading about 300MB from pcache memory.
It should be understood that pcache is designed to optimize the guest's
behaviour as a whole rather than any particular application inside it.
Guest read-ahead is tied to an fd and aimed at optimizing a userspace
application, whereas pcache is several levels above, which allows us to
cover other cases. Another example is walking a directory tree: when
traversing a directory tree, there is a good chance that some fs blocks
are placed sequentially. In general, though, pcache helps to reduce
latency under high load for Windows VMs.

>> TODO list:
>> - add tracepoints
>> - add migration support
>> - add more explanations in the commit messages
>> - get rid of the additional allocation in
>>   pcache_node_find_and_create() and pcache_aio_readv()
>>
>> Changes from v1:
>> - Fix failed automatic build test (11)
>>
>> Pavel Butsykin (22):
>>   block/pcache: empty pcache driver filter
>>   block/pcache: add own AIOCB block
>>   util/rbtree: add rbtree from linux kernel
>>   block/pcache: add pcache debug build
>>   block/pcache: add aio requests into cache
>>   block/pcache: restrict cache size
>>   block/pcache: introduce LRU as method of memory
>>   block/pcache: implement pickup parts of the cache
>>   block/pcache: separation AIOCB on requests
>>   block/pcache: add check node leak
>>   add QEMU style defines for __sync_add_and_fetch
>>   block/pcache: implement read cache to qiov and drop node during aio write
>>   block/pcache: add generic request complete
>>   block/pcache: add support for rescheduling requests
>>   block/pcache: simple readahead one chunk forward
>>   block/pcache: pcache readahead node around
>>   block/pcache: skip readahead for non-sequential requests
>>   block/pcache: add pcache skip large aio read
>>   block/pcache: add pcache node assert
>>   block/pcache: implement pcache error handling of aio cb
>>   block/pcache: add write through node
>>   block/pcache: drop used pcache node
>>
>>  block/Makefile.objs             |    1 +
>>  block/pcache.c                  | 1224 +++++++++++++++++++++++++++++++++++++++
>>  include/qemu/atomic.h           |    8 +
>>  include/qemu/rbtree.h           |  109 ++++
>>  include/qemu/rbtree_augmented.h |  237 ++++++++
>>  util/Makefile.objs              |    1 +
>>  util/rbtree.c                   |  570 ++++++++++++++++++
>>  7 files changed, 2150 insertions(+)
>>  create mode 100644 block/pcache.c
>>  create mode 100644 include/qemu/rbtree.h
>>  create mode 100644 include/qemu/rbtree_augmented.h
>>  create mode 100644 util/rbtree.c
>>
>
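
For reference, the size control from step 4 of the quoted cover letter (drop a chunk as soon as it has been read completely, keep an LRU bound for chunks that were read only partially) can be modelled in a few lines of Python. This is only a toy model under stated assumptions, not the pcache code: the class and method names are hypothetical, chunks are identified by their start offset, and reads inside a chunk are assumed not to overlap, so a simple byte counter is enough.

```python
from collections import OrderedDict


class PrefetchCache:
    """Toy model: chunks are freed once fully read; LRU bounds the rest."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        # offset -> [size, bytes_read]; ordered from least to most recently used
        self.chunks = OrderedDict()

    def insert(self, offset, size):
        """Add a prefetched chunk, evicting least recently used chunks if needed."""
        if size > self.max_bytes:
            return  # would never fit; skip it in this simplified model
        old = self.chunks.pop(offset, None)
        if old is not None:
            self.used -= old[0]
        while self.chunks and self.used + size > self.max_bytes:
            _, evicted = self.chunks.popitem(last=False)
            self.used -= evicted[0]
        self.chunks[offset] = [size, 0]
        self.used += size

    def read(self, offset, nbytes):
        """Account for a read of nbytes from the chunk starting at offset.
        Returns True on a cache hit, False on a miss."""
        chunk = self.chunks.get(offset)
        if chunk is None:
            return False
        chunk[1] += nbytes
        if chunk[1] >= chunk[0]:
            # Fully consumed: a repeated read is unlikely, so free it right away.
            del self.chunks[offset]
            self.used -= chunk[0]
        else:
            # Partially read: keep it and mark it as recently used.
            self.chunks.move_to_end(offset)
        return True


# Sizes mirror pcache-full-size=4MB and pcache-readahead-size=128KB from the
# quoted qemu parameters; the guest reads the chunk in 4KB pieces.
cache = PrefetchCache(max_bytes=4 * 1024 * 1024)
cache.insert(0, 128 * 1024)
hits = sum(cache.read(0, 4096) for _ in range(32))
print(hits, cache.used)  # 32 0
```

In this model the memory of a fully read chunk is released at once, so the LRU list only ever has to arbitrate between chunks that are still partially unread.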