From: Li Wang
Subject: Read ahead affect Ceph read performance much
Date: Mon, 29 Jul 2013 18:24:34 +0800
Message-ID: <51F642E2.3090201@ubuntukylin.com>
To: "ceph-devel@vger.kernel.org"
Cc: Sage Weil

We performed an Iozone read test on a 32-node HPC server. On each node,
the CPU and the network are very fast (network bandwidth > 1.5 GB/s) and
there is 64 GB of memory, but the disk I/O is relatively slow: the
throughput measured locally with 'dd' is around 70 MB/s. We configured a
Ceph cluster with 24 OSDs on 24 nodes, one MDS, and one to four clients,
one client per node. The performance is as follows:

Iozone sequential read throughput (MB/s)

Number of clients    1          2          4
Default resize       180.0954   324.4836   591.5851
Resize: 256MB        645.3347   1022.998   1267.631

The complete iozone command line for one client is

    iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output

On each client node, only one thread is started. For two clients, it is

    iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output

As the data show, a larger read-ahead window can raise throughput to
about 3.6 times the default (645.33 vs. 180.10 MB/s with one client)!

Besides, since the backend of Ceph is not a traditional hard disk, it is
also beneficial to prefetch for stride reads.
To prove this, we tested stride reads with the program below. As we
know, the generic read-ahead algorithm of the Linux kernel does not
detect stride-read patterns, so we use posix_fadvise() to force
prefetching manually. The record size is 4 MB. The result is even more
surprising:

Stride read throughput (MB/s)

Number of records prefetched   0       1        4        16       64       128
Throughput                     42.82   100.74   217.41   497.73   854.48   950.18

As the data show, with a read-ahead size of 128 * 4 MB, the speedup over
no read ahead is about 22x (950.18 / 42.82), i.e. more than 2000%!

The core logic of the test program is below:

    stride = 17
    recordsize = 4MB    /* 'block' below is this record size in bytes */

    for (;;) {
        /* Hint the next 'count' records of the stride pattern. */
        for (i = 0; i < count; ++i) {
            long long start = pos + (i + 1) * stride * recordsize;
            printf("PRE READ %lld %lld\n", start, start + block);
            posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
        }

        /* Read one record at the current position. */
        len = read(fd, buf, block);
        if (len <= 0)
            break;      /* stop at EOF or on error */
        total += len;
        printf("READ %lld %lld\n", pos, (pos + len));
        pos += len;

        /* Skip ahead to the start of the next record in the pattern. */
        lseek(fd, (stride - 1) * block, SEEK_CUR);
        pos += (stride - 1) * block;
    }

Given the above results and some more, we plan to submit a blueprint to
discuss prefetching optimizations for Ceph.

Cheers,
Li Wang
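The stride-read test described above can be approximated by the following self-contained sketch. It is not the original test program: the function names stride_offset() and stride_read() are hypothetical, pread() is used instead of the read()/lseek() pair, and timing, logging, and cache dropping are left out. It only illustrates the idea of hinting the next few records of a stride pattern with posix_fadvise(POSIX_FADV_WILLNEED) before each read.

```c
#define _POSIX_C_SOURCE 200809L  /* posix_fadvise, pread */
#define _FILE_OFFSET_BITS 64     /* 64-bit off_t on 32-bit systems */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Byte offset of the k-th record visited by the stride pattern. */
off_t stride_offset(long long k, int stride, size_t record_size)
{
    return (off_t)k * (off_t)stride * (off_t)record_size;
}

/*
 * Read every stride-th record of fd until EOF. Before each read, hint the
 * next `prefetch` records of the pattern with POSIX_FADV_WILLNEED.
 * Returns the total number of bytes read, or -1 on error.
 */
long long stride_read(int fd, size_t record_size, int stride, int prefetch)
{
    assert(record_size > 0 && stride >= 1 && prefetch >= 0);

    char *buf = malloc(record_size);
    if (buf == NULL)
        return -1;

    long long total = 0;
    for (long long k = 0; ; ++k) {
        /* The hint is advisory, so its return value is deliberately ignored. */
        for (int i = 1; i <= prefetch; ++i)
            (void)posix_fadvise(fd, stride_offset(k + i, stride, record_size),
                                (off_t)record_size, POSIX_FADV_WILLNEED);

        ssize_t len = pread(fd, buf, record_size,
                            stride_offset(k, stride, record_size));
        if (len < 0) {
            free(buf);
            return -1;
        }
        if (len == 0)   /* end of file */
            break;
        total += len;
    }
    free(buf);
    return total;
}
```

Under these assumptions, calling stride_read(fd, 4 << 20, 17, 16) on an open file would correspond to the "16 records prefetched" column of the table, and a prefetch argument of 0 to the baseline without read ahead. As in the original loop, the hints for records already requested are re-issued on every iteration; that keeps the sketch short and is harmless, since pages already in the cache are not fetched again.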