Date: Mon, 03 May 2004 21:14:45 +1000
From: Nick Piggin
To: Alexey Kopytov
Cc: linux-kernel@vger.kernel.org, Jens Axboe, Andrew Morton
Subject: Re: Random file I/O regressions in 2.6

Alexey Kopytov wrote:

> Hello!
>
> I tried to compare random file I/O performance in the 2.4 and 2.6
> kernels and found some regressions that I failed to explain. I tested
> 2.4.25, 2.6.5-bk2 and 2.6.6-rc3 with my own utility SysBench, which
> was written to generate workloads similar to a database under
> intensive load.
>
> For the 2.6.x kernels, the anticipatory, deadline, CFQ and noop I/O
> schedulers were tested, with AS giving the best results for this
> workload; but it is still about 1.5 times worse than the result for
> the 2.4.25 kernel.
>
> The SysBench 'fileio' test was configured to generate the following
> workload: 16 worker threads are created, each issuing random
> read/write file requests in blocks of 16 KB with a read/write ratio
> of 1.5. All I/O operations are evenly distributed over 128 files with
> a total size of 3 GB. Every 100 requests, an fsync() operation is
> performed sequentially on each file. The total number of requests is
> limited to 10,000.
>
> The FS used for the test was ext3 with data=ordered.

I am able to reproduce this here. 2.6 isn't improved by increasing
nr_requests, relaxing IO scheduler deadlines, or turning off readahead.
It looks like 2.6 is submitting a lot of the IO in 4KB sized requests...

Hmm, oh dear. It looks like the readahead logic shat itself and/or
do_generic_mapping_read doesn't know how to handle multipage reads
properly.

What ends up happening is that readahead gets turned off, and the 16K
read then gets done as 4 synchronous 4K chunks. Because they are
synchronous, they have no chance of being merged with one another
either.

I have attached a proof of concept hack... I think what should really
happen is that page_cache_readahead should be taught the size of the
requested read, and should ensure that a decent amount of reading is
done while within the read request window, even if
beyond-request-window readahead has previously been unsuccessful.

Numbers with an IDE disk, 256MB RAM:

2.4.24:        81s
2.6.6-rc3-mm1: 126s
rc3-mm1+patch: 87s

The small remaining regression might be explained by 2.6's smaller
nr_requests, the IDE driver, io scheduler tuning, etc.
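If you want to watch the request sizes yourself without setting up
SysBench, a quick approximation of the read side of the workload is
below. This is my own sketch, not SysBench; the test_file.* names and
the per-file size are placeholders for whatever file set you prepare.
Run it while watching the average request size (e.g. with iostat -x)
and see whether the 16K reads go out as one request or as four 4K ones:

#define _XOPEN_SOURCE 500	/* for pread() */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define NFILES    128			/* 128 files, 3GB total */
#define FILESIZE  (3UL*1024*1024*1024 / NFILES)
#define BLOCK     (16*1024)		/* 16KB request size */
#define NREQUESTS 10000

int main(void)
{
	char name[64];
	char *buf = malloc(BLOCK);
	int fd[NFILES];
	int i;

	if (!buf)
		return 1;

	for (i = 0; i < NFILES; i++) {
		snprintf(name, sizeof(name), "test_file.%d", i);
		fd[i] = open(name, O_RDONLY);	/* create these beforehand */
		if (fd[i] < 0) {
			perror(name);
			return 1;
		}
	}

	for (i = 0; i < NREQUESTS; i++) {
		int f = rand() % NFILES;
		off_t off = ((off_t)(rand() % (FILESIZE / BLOCK))) * BLOCK;

		/* one 16K read; the question is how many requests
		 * the kernel turns it into */
		if (pread(fd[f], buf, BLOCK, off) != BLOCK) {
			perror("pread");
			return 1;
		}
	}
	return 0;
}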
> Here are the results (values are number of seconds to complete the
> test):
>
> 2.4.25:                  77.5377
>
> 2.6.5-bk2(noop):         165.3393
> 2.6.5-bk2(anticipatory): 118.7450
> 2.6.5-bk2(deadline):     130.3254
> 2.6.5-bk2(CFQ):          146.4286
>
> 2.6.6-rc3(noop):         164.9486
> 2.6.6-rc3(anticipatory): 125.1776
> 2.6.6-rc3(deadline):     131.8903
> 2.6.6-rc3(CFQ):          152.9280
>
> I have published the results as well as the hardware and kernel setups
> at the SysBench home page: http://sysbench.sourceforge.net/results/fileio/
>
> Any comments or suggestions would be highly appreciated.

From your website:

"Another interesting fact is that AS gives the best results for this
workload, though it's believed to give worse results for this kind of
workloads as compared to other I/O schedulers available in 2.6.x
kernels."

The anticipatory scheduler is actually in a fairly good state of tune,
and can often beat deadline even for random read/write/fsync tests.
The infamous database regression problem is when this sort of workload
is combined with TCQ disk drives.

Nick

[attachment: read-populate.patch]

 include/linux/mm.h             |    0
 linux-2.6-npiggin/mm/filemap.c |    5 ++++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff -puN mm/readahead.c~read-populate mm/readahead.c
diff -puN mm/filemap.c~read-populate mm/filemap.c
--- linux-2.6/mm/filemap.c~read-populate	2004-05-03 19:56:00.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-05-03 20:51:37.000000000 +1000
@@ -627,6 +627,9 @@ void do_generic_mapping_read(struct addr
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	force_page_cache_readahead(mapping, filp, index,
+		max_sane_readahead(desc->count >> PAGE_CACHE_SHIFT));
+
 	for (;;) {
 		struct page *page;
 		unsigned long end_index, nr, ret;
@@ -644,7 +647,7 @@ void do_generic_mapping_read(struct addr
 		}
 
 		cond_resched();
-		page_cache_readahead(mapping, ra, filp, index);
+		page_cache_readahead(mapping, ra, filp, index + desc->count);
 
 		nr = nr - offset;
 find_page:
diff -puN include/linux/mm.h~read-populate include/linux/mm.h
_
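For readers who don't spend much time in mm/filemap.c, here is the
patched region of do_generic_mapping_read() again with annotations.
The comments are mine, not part of the patch, and this is a reading
aid rather than the exact kernel source:

index  = *ppos >> PAGE_CACHE_SHIFT;	/* first page of the request */
offset = *ppos & ~PAGE_CACHE_MASK;	/* byte offset into that page */

/*
 * New: pull in the whole requested window up front. desc->count is
 * the request size in bytes, so it is shifted down to a page count;
 * a 16K read thus gets submitted as one 4-page batch rather than as
 * four synchronous 4K reads, giving the block layer a chance to
 * merge them.
 */
force_page_cache_readahead(mapping, filp, index,
		max_sane_readahead(desc->count >> PAGE_CACHE_SHIFT));

for (;;) {
	...
	cond_resched();
	/*
	 * The hacky bit: page_cache_readahead() takes a page index,
	 * while desc->count is a byte count, so "index + desc->count"
	 * appears to mix the two units. It errs on the side of
	 * reading further ahead, which is fine for a proof of concept
	 * but not for a real fix.
	 */
	page_cache_readahead(mapping, ra, filp, index + desc->count);

	nr = nr - offset;
	...
}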