Flexible I/O Tester development
* fio rbd hang for block sizes > 1M
@ 2014-10-24  2:38 Mark Kirkwood
  2014-10-24  5:35 ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-24  2:38 UTC
  To: fio

[-- Attachment #1: Type: text/plain, Size: 10899 bytes --]

I stumbled across this while performance testing a new Ceph cluster:

Env:

Ceph 0.86-467-g317b83d (317b83dddd1a917f70838870b31931a79bdd4dd0)
Ubuntu 14.04 (3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 
2014 x86_64 x86_64 x86_64 GNU/Linux)
Fio fio-2.1.13-88-gb2ee7

Cmd:

$ rbd ls -l
NAME           SIZE PARENT FMT PROT LOCK
vol0          4096M          1

$ fio read-test.fio     # attached
rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
1158050441d:06h:59m:33s]

Block sizes of 1M usually work; 2M and 4M always fail. The rbd volume should 
be written to first (just change read to write in the workload file). Note 
that a 2-4M block size is fine for writes!
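
The write-then-read procedure described above could be scripted along these lines. This is only a sketch: the helper names (`render_job`, `job_matrix`) and the output file names are made up for illustration, while the option names and values are copied from the attached read-test.fio.

```python
# Hypothetical helper: render one fio job file per (rw, bs) combination.
# Option names and values are taken from the attached read-test.fio.

JOB_TEMPLATE = """\
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname={rbdname}
invalidate=0
rw={rw}
bs={bs}

[rbd_thread]
iodepth=32
"""


def render_job(rw, bs, rbdname="vol0"):
    """Return the text of a fio job file for one access mode and block size."""
    return JOB_TEMPLATE.format(rw=rw, bs=bs, rbdname=rbdname)


def job_matrix(block_sizes=("1M", "2M", "4M")):
    """Yield (filename, text) pairs: the write job first, then the read job."""
    for bs in block_sizes:
        for rw in ("write", "read"):  # populate the volume before reading it
            yield f"{rw}-{bs}.fio", render_job(rw, bs)


if __name__ == "__main__":
    # Only list the jobs here; writing them to disk is left to the caller.
    for name, _text in job_matrix():
        print(name)
```

Each generated file can then be passed to fio in order, for example `fio write-2M.fio` followed by `fio read-2M.fio`.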

Running the read variant under valgrind shows several invalid reads - 
only for these bigger block sizes, so I'm guessing they are the problem:

$ valgrind fio read-test.fio
==12519== Memcheck, a memory error detector
==12519== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12519== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for 
copyright info
==12519== Command: fio read-test.fio
==12519==
rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
==12519== Thread 6:
==12519== Invalid read of size 8
==12519==    at 0x4EFA7B3: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.cc:1158)
==12519==    by 0x4E965A7: 
librbd::ImageCtx::aio_read_from_cache(object_t, ceph::buffer::list*, 
unsigned long, unsigned long, Context*) (ImageCtx.cc:484)
==12519==    by 0x4EAA9FA: librbd::aio_read(librbd::ImageCtx*, 
std::vector<std::pair<unsigned long, unsigned long>, 
std::allocator<std::pair<unsigned long, unsigned long> > > const&, 
char*, ceph::buffer::list*, librbd::AioCompletion*) (internal.cc:3262)
==12519==    by 0x4EAB872: librbd::aio_read(librbd::ImageCtx*, unsigned 
long, unsigned long, char*, ceph::buffer::list*, librbd::AioCompletion*) 
(internal.cc:3135)
==12519==    by 0x4E8B737: rbd_aio_read (librbd.cc:1518)
==12519==    by 0x459D92: fio_rbd_queue (rbd.c:294)
==12519==    by 0x40D379: td_io_queue (ioengines.c:300)
==12519==    by 0x44B77E: thread_main (backend.c:781)
==12519==    by 0x81F6181: start_thread (pthread_create.c:312)
==12519==    by 0x870AFBC: clone (clone.S:111)
==12519==  Address 0x197b6fe0 is 48 bytes inside a block of size 264 free'd
==12519==    at 0x4C2C2BC: operator delete(void*) (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12519==    by 0x4EFA7AE: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.cc:1149)
==12519==    by 0x4E965A7: 
librbd::ImageCtx::aio_read_from_cache(object_t, ceph::buffer::list*, 
unsigned long, unsigned long, Context*) (ImageCtx.cc:484)
==12519==    by 0x4EAA9FA: librbd::aio_read(librbd::ImageCtx*, 
std::vector<std::pair<unsigned long, unsigned long>, 
std::allocator<std::pair<unsigned long, unsigned long> > > const&, 
char*, ceph::buffer::list*, librbd::AioCompletion*) (internal.cc:3262)
==12519==    by 0x4EAB872: librbd::aio_read(librbd::ImageCtx*, unsigned 
long, unsigned long, char*, ceph::buffer::list*, librbd::AioCompletion*) 
(internal.cc:3135)
==12519==    by 0x4E8B737: rbd_aio_read (librbd.cc:1518)
==12519==    by 0x459D92: fio_rbd_queue (rbd.c:294)
==12519==    by 0x40D379: td_io_queue (ioengines.c:300)
==12519==    by 0x44B77E: thread_main (backend.c:781)
==12519==    by 0x81F6181: start_thread (pthread_create.c:312)
==12519==    by 0x870AFBC: clone (clone.S:111)
==12519==
==12519== Invalid read of size 8
==12519==    at 0x4EFA7CD: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.h:170)
==12519==    by 0x4E965A7: 
librbd::ImageCtx::aio_read_from_cache(object_t, ceph::buffer::list*, 
unsigned long, unsigned long, Context*) (ImageCtx.cc:484)
==12519==    by 0x4EAA9FA: librbd::aio_read(librbd::ImageCtx*, 
std::vector<std::pair<unsigned long, unsigned long>, 
std::allocator<std::pair<unsigned long, unsigned long> > > const&, 
char*, ceph::buffer::list*, librbd::AioCompletion*) (internal.cc:3262)
==12519==    by 0x4EAB872: librbd::aio_read(librbd::ImageCtx*, unsigned 
long, unsigned long, char*, ceph::buffer::list*, librbd::AioCompletion*) 
(internal.cc:3135)
==12519==    by 0x4E8B737: rbd_aio_read (librbd.cc:1518)
==12519==    by 0x459D92: fio_rbd_queue (rbd.c:294)
==12519==    by 0x40D379: td_io_queue (ioengines.c:300)
==12519==    by 0x44B77E: thread_main (backend.c:781)
==12519==    by 0x81F6181: start_thread (pthread_create.c:312)
==12519==    by 0x870AFBC: clone (clone.S:111)
==12519==  Address 0x197b6fe8 is 56 bytes inside a block of size 264 free'd
==12519==    at 0x4C2C2BC: operator delete(void*) (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12519==    by 0x4EFA7AE: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.cc:1149)
==12519==    by 0x4E965A7: 
librbd::ImageCtx::aio_read_from_cache(object_t, ceph::buffer::list*, 
unsigned long, unsigned long, Context*) (ImageCtx.cc:484)
==12519==    by 0x4EAA9FA: librbd::aio_read(librbd::ImageCtx*, 
std::vector<std::pair<unsigned long, unsigned long>, 
std::allocator<std::pair<unsigned long, unsigned long> > > const&, 
char*, ceph::buffer::list*, librbd::AioCompletion*) (internal.cc:3262)
==12519==    by 0x4EAB872: librbd::aio_read(librbd::ImageCtx*, unsigned 
long, unsigned long, char*, ceph::buffer::list*, librbd::AioCompletion*) 
(internal.cc:3135)
==12519==    by 0x4E8B737: rbd_aio_read (librbd.cc:1518)
==12519==    by 0x459D92: fio_rbd_queue (rbd.c:294)
==12519==    by 0x40D379: td_io_queue (ioengines.c:300)
==12519==    by 0x44B77E: thread_main (backend.c:781)
==12519==    by 0x81F6181: start_thread (pthread_create.c:312)
==12519==    by 0x870AFBC: clone (clone.S:111)
==12519==
==12519== Thread 18:
==12519== Invalid read of size 8
==12519==    at 0x4EFA7B3: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.cc:1158)
==12519==    by 0x4F027BF: ObjectCacher::C_RetryRead::finish(int) 
(ObjectCacher.h:581)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EFF083: void finish_contexts<Context>(CephContext*, 
std::list<Context*, std::allocator<Context*> >&, int) (Context.h:120)
==12519==    by 0x4EF489C: ObjectCacher::bh_read_finish(long, sobject_t, 
unsigned long, long, unsigned long, ceph::buffer::list&, int, bool) 
(ObjectCacher.cc:805)
==12519==    by 0x4F01590: ObjectCacher::C_ReadFinish::finish(int) 
(ObjectCacher.h:504)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EB9BBC: librbd::C_Request::finish(int) 
(LibrbdWriteback.cc:54)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x53B64FC: librados::C_AioComplete::finish(int) 
(AioCompletionImpl.h:180)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x5452397: Finisher::finisher_thread_entry() 
(Finisher.cc:59)
==12519==  Address 0x1a299710 is 48 bytes inside a block of size 264 free'd
==12519==    at 0x4C2C2BC: operator delete(void*) (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12519==    by 0x4EFA7AE: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.cc:1149)
==12519==    by 0x4F027BF: ObjectCacher::C_RetryRead::finish(int) 
(ObjectCacher.h:581)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EFF083: void finish_contexts<Context>(CephContext*, 
std::list<Context*, std::allocator<Context*> >&, int) (Context.h:120)
==12519==    by 0x4EF489C: ObjectCacher::bh_read_finish(long, sobject_t, 
unsigned long, long, unsigned long, ceph::buffer::list&, int, bool) 
(ObjectCacher.cc:805)
==12519==    by 0x4F01590: ObjectCacher::C_ReadFinish::finish(int) 
(ObjectCacher.h:504)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EB9BBC: librbd::C_Request::finish(int) 
(LibrbdWriteback.cc:54)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x53B64FC: librados::C_AioComplete::finish(int) 
(AioCompletionImpl.h:180)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==
==12519== Invalid read of size 8
==12519==    at 0x4EFA7CD: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.h:170)
==12519==    by 0x4F027BF: ObjectCacher::C_RetryRead::finish(int) 
(ObjectCacher.h:581)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EFF083: void finish_contexts<Context>(CephContext*, 
std::list<Context*, std::allocator<Context*> >&, int) (Context.h:120)
==12519==    by 0x4EF489C: ObjectCacher::bh_read_finish(long, sobject_t, 
unsigned long, long, unsigned long, ceph::buffer::list&, int, bool) 
(ObjectCacher.cc:805)
==12519==    by 0x4F01590: ObjectCacher::C_ReadFinish::finish(int) 
(ObjectCacher.h:504)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EB9BBC: librbd::C_Request::finish(int) 
(LibrbdWriteback.cc:54)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x53B64FC: librados::C_AioComplete::finish(int) 
(AioCompletionImpl.h:180)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x5452397: Finisher::finisher_thread_entry() 
(Finisher.cc:59)
==12519==  Address 0x1a299718 is 56 bytes inside a block of size 264 free'd
==12519==    at 0x4C2C2BC: operator delete(void*) (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12519==    by 0x4EFA7AE: ObjectCacher::_readx(ObjectCacher::OSDRead*, 
ObjectCacher::ObjectSet*, Context*, bool) (ObjectCacher.cc:1149)
==12519==    by 0x4F027BF: ObjectCacher::C_RetryRead::finish(int) 
(ObjectCacher.h:581)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EFF083: void finish_contexts<Context>(CephContext*, 
std::list<Context*, std::allocator<Context*> >&, int) (Context.h:120)
==12519==    by 0x4EF489C: ObjectCacher::bh_read_finish(long, sobject_t, 
unsigned long, long, unsigned long, ceph::buffer::list&, int, bool) 
(ObjectCacher.cc:805)
==12519==    by 0x4F01590: ObjectCacher::C_ReadFinish::finish(int) 
(ObjectCacher.h:504)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x4EB9BBC: librbd::C_Request::finish(int) 
(LibrbdWriteback.cc:54)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==    by 0x53B64FC: librados::C_AioComplete::finish(int) 
(AioCompletionImpl.h:180)
==12519==    by 0x4E8EBE8: Context::complete(int) (Context.h:64)
==12519==
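
Every record in the trace above reports a read inside ObjectCacher::_readx of a block that the same function had already freed at ObjectCacher.cc:1149. A log like this can be condensed mechanically; the following is a rough sketch, not a general valgrind parser. The function name `summarize_invalid_reads` is made up, and the code assumes the wrapped line layout seen in this message.

```python
import re
from collections import Counter

# Matches a source location such as "(ObjectCacher.cc:1158)".
LOCATION = re.compile(r"\(([\w.]+\.(?:cc|h|c|S)):(\d+)\)")


def summarize_invalid_reads(log_text):
    """Count (read site, free site) pairs among memcheck 'Invalid read' records."""
    pairs = Counter()
    # Each record starts at "Invalid read of size"; split the log there.
    for record in log_text.split("Invalid read of size")[1:]:
        accessed, sep, freed = record.partition("free'd")
        # First source location after the header: where the read happened.
        read_site = LOCATION.search(accessed)
        # First source location after "free'd": the caller of operator delete.
        free_site = LOCATION.search(freed) if sep else None
        if read_site and free_site:
            pairs[(read_site.group(0), free_site.group(0))] += 1
    return pairs
```

Valgrind can write its report to a file with `--log-file=`, and the contents of that file can be passed to the function as a single string.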

[-- Attachment #2: read-test.fio --]
[-- Type: text/plain, Size: 650 bytes --]

######################################################################
# Example test for the RBD engine.
#
# From http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
#
# Runs a 2M sequential read test against an RBD via librbd
#
# NOTE: Make sure you have either an RBD named 'vol0' or change
#       the rbdname parameter.
######################################################################
[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd
rbdname=vol0
invalidate=0    # mandatory
rw=read
bs=2M

[rbd_thread]
iodepth=32


* Re: fio rbd hang for block sizes > 1M
  2014-10-24  2:38 fio rbd hang for block sizes > 1M Mark Kirkwood
@ 2014-10-24  5:35 ` Jens Axboe
  2014-10-24  6:17   ` Mark Kirkwood
  2014-10-24 14:11   ` Danny Al-Gaaf
  0 siblings, 2 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-24  5:35 UTC
  To: Mark Kirkwood, fio; +Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng

CC'ing relevant parties, leaving email intact.


-- 
Jens Axboe




* Re: fio rbd hang for block sizes > 1M
  2014-10-24  5:35 ` Jens Axboe
@ 2014-10-24  6:17   ` Mark Kirkwood
  2014-10-24 13:19     ` Mark Nelson
  2014-10-24 14:11   ` Danny Al-Gaaf
  1 sibling, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-24  6:17 UTC
  To: Jens Axboe, fio; +Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng

On 24/10/14 18:35, Jens Axboe wrote:
> CC'ing relevant parties, leaving email intact.
>

Note that the 'Killed' is because I killed the run - it hangs and 
appears to be non-interruptible. I missed that when pasting, sorry!
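
When scripting runs that may hang like this, a timeout wrapper avoids having to kill fio by hand. This is a minimal sketch: `run_with_timeout` is a made-up name, and it assumes the hung process still responds to SIGKILL, which the 'Killed' message above suggests it does.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it had to be killed."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None
```

A call such as `run_with_timeout(["fio", "read-test.fio"], 600)` would then return None instead of blocking forever; the 600 second limit is an arbitrary example and should be sized to the volume under test.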

>> $ fio read-test.fio     # attached
>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>> iodepth=32
>> fio-2.1.13-88-gb2ee7
>> Starting 1 process
>> rbd engine: RBD version: 0.1.8
>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> 1158050441d:06h:59m:33s]




* Re: fio rbd hang for block sizes > 1M
  2014-10-24  6:17   ` Mark Kirkwood
@ 2014-10-24 13:19     ` Mark Nelson
  2014-10-24 14:09       ` Mark Nelson
  2014-10-24 22:30       ` fio rbd hang for block sizes > 1M Mark Kirkwood
  0 siblings, 2 replies; 52+ messages in thread
From: Mark Nelson @ 2014-10-24 13:19 UTC
  To: Mark Kirkwood, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng

FWIW we are seeing this at Red Hat/Inktank with recent fio from master 
and the ceph giant branch as well.

Mark

On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
> On 24/10/14 18:35, Jens Axboe wrote:
>> CC'ing relevant parties, leaving email intact.
>>
>
> Note that the 'Killed' is because I killed the run - it hangs and
> appears to be non-interruptible. I missed that when pasting, sorry!
>
>>> $ fio read-test.fio     # attached
>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>> iodepth=32
>>> fio-2.1.13-88-gb2ee7
>>> Starting 1 process
>>> rbd engine: RBD version: 0.1.8
>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>> 1158050441d:06h:59m:33s]
>
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




* Re: fio rbd hang for block sizes > 1M
  2014-10-24 13:19     ` Mark Nelson
@ 2014-10-24 14:09       ` Mark Nelson
  2014-10-24 14:30         ` Jens Axboe
  2014-10-24 22:45         ` Mark Kirkwood
  2014-10-24 22:30       ` fio rbd hang for block sizes > 1M Mark Kirkwood
  1 sibling, 2 replies; 52+ messages in thread
From: Mark Nelson @ 2014-10-24 14:09 UTC
  To: Mark Kirkwood, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

More info:

I went back and tested fio versions back to 2.1.10 and still encountered 
the issue.  I then went back and tested the v0.86 release versus giant 
and was able to get through a 4MB read test without error.  I suspect 
this is not an fio problem.  I'll try to narrow down the commit after 
0.86 that is causing this.
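
Narrowing down the offending commit between a known-good and a known-bad build is a binary search. The sketch below shows the idea only; in practice `git bisect` automates it. The names `first_bad` and `is_bad` are made up, and `is_bad` stands in for "build this commit and run the 4MB read test".

```python
def first_bad(candidates, is_bad):
    """Return the index of the first candidate for which is_bad() is true.

    Assumes candidates are ordered oldest to newest, that the last one is
    bad, and that every candidate after the first bad one is bad as well.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid          # the first bad one is at mid or earlier
        else:
            lo = mid + 1      # the first bad one is after mid
    return lo
```

Each step halves the remaining range, so a range of a few hundred commits needs fewer than ten test runs.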

Mark

On 10/24/2014 08:19 AM, Mark Nelson wrote:
> FWIW we are seeing this at Red Hat/Inktank with recent fio from master
> and the ceph giant branch as well.
>
> Mark
>
> On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
>> On 24/10/14 18:35, Jens Axboe wrote:
>>> CC'ing relevant parties, leaving email intact.
>>>
>>
>> Note that the 'Killed' is because I killed the run - it hangs and
>> appears to be non-interruptible. I missed that when pasting, sorry!
>>
>>>> $ fio read-test.fio     # attached
>>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>>> iodepth=32
>>>> fio-2.1.13-88-gb2ee7
>>>> Starting 1 process
>>>> rbd engine: RBD version: 0.1.8
>>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>> 1158050441d:06h:59m:33s]
>>
>




* Re: fio rbd hang for block sizes > 1M
  2014-10-24  5:35 ` Jens Axboe
  2014-10-24  6:17   ` Mark Kirkwood
@ 2014-10-24 14:11   ` Danny Al-Gaaf
  2014-10-24 14:31     ` Jens Axboe
  1 sibling, 1 reply; 52+ messages in thread
From: Danny Al-Gaaf @ 2014-10-24 14:11 UTC
  To: Jens Axboe, Mark Kirkwood, fio; +Cc: xan.peng

On 24.10.2014 at 07:35, Jens Axboe wrote:
> CC'ing relevant parties, leaving email intact.
> 

I'll take a look at it.

@Jens: I removed Daniel from the thread since his email is no longer valid.

Regards,

Danny



* Re: fio rbd hang for block sizes > 1M
  2014-10-24 14:09       ` Mark Nelson
@ 2014-10-24 14:30         ` Jens Axboe
  2014-10-24 22:45         ` Mark Kirkwood
  1 sibling, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-24 14:30 UTC
  To: Mark Nelson, Mark Kirkwood, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

On 2014-10-24 08:09, Mark Nelson wrote:
> More info:
>
> I went back and tested fio versions back to 2.1.10 and still encountered
> the issue.  I then went back and tested the v0.86 release versus giant
> and was able to get through a 4MB read test without error.  I suspect
> this is not an fio problem.  I'll try to narrow down the commit after
> 0.86 that is causing this.

Thanks, it doesn't look like a fio problem if it's dependent on the 
block size used. Might warrant a check in the fio configure script, so 
we can fail (or limit) read sizes on the problematic versions.

-- 
Jens Axboe




* Re: fio rbd hang for block sizes > 1M
  2014-10-24 14:11   ` Danny Al-Gaaf
@ 2014-10-24 14:31     ` Jens Axboe
  0 siblings, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-24 14:31 UTC
  To: Danny Al-Gaaf, Mark Kirkwood, fio; +Cc: xan.peng

On 2014-10-24 08:11, Danny Al-Gaaf wrote:
> On 24.10.2014 at 07:35, Jens Axboe wrote:
>> CC'ing relevant parties, leaving email intact.
>>
>
> I'll take a look at it.

Thanks!

> @Jens: I removed Daniel from the thread since his email is no longer valid.

Yeah, forgot to remove him on subsequent emails, sorry.

-- 
Jens Axboe




* Re: fio rbd hang for block sizes > 1M
  2014-10-24 13:19     ` Mark Nelson
  2014-10-24 14:09       ` Mark Nelson
@ 2014-10-24 22:30       ` Mark Kirkwood
  2014-10-24 22:38         ` Mark Nelson
  1 sibling, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-24 22:30 UTC
  To: Mark Nelson, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng

It looks like it is an rbd cache issue:

http://tracker.ceph.com/issues/9854

If I disable the rbd cache:

$ tail /etc/ceph/ceph.conf
...
[client]
rbd cache = false

then the 2-4M reads work fine (no invalid reads in valgrind either).
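
Before trusting a benchmark result it can help to confirm which cache setting a client is actually running with. The following is only a sketch: `rbd_cache_disabled` is a made-up helper, the set of values treated as false is an assumption, and Python's configparser is stricter than Ceph's own parser, so it may reject some valid ceph.conf files.

```python
import configparser


def rbd_cache_disabled(conf_text):
    """Return True if the [client] section sets rbd cache to a false value."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("client"):
        return False
    for key, value in parser.items("client"):
        # Treat "rbd cache" and "rbd_cache" as the same option name.
        if key.replace("_", " ").strip() == "rbd cache":
            # Assumed spellings of "false"; adjust to local conventions.
            return value.strip().lower() in ("false", "0", "no", "off")
    return False
```

The function takes the file contents as a string, so the text of /etc/ceph/ceph.conf can be read and passed in directly.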

Regards

Mark

On 25/10/14 02:19, Mark Nelson wrote:
> FWIW we are seeing this at Red Hat/Inktank with recent fio from master
> and the ceph giant branch as well.
>
> Mark
>
> On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
>> On 24/10/14 18:35, Jens Axboe wrote:
>>> CC'ing relevant parties, leaving email intact.
>>>
>>
>> Note that the 'Killed' is because I killed the run - it hangs and
>> appears to be non-interruptible. I missed that when pasting, sorry!
>>
>>>> $ fio read-test.fio     # attached
>>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>>> iodepth=32
>>>> fio-2.1.13-88-gb2ee7
>>>> Starting 1 process
>>>> rbd engine: RBD version: 0.1.8
>>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>> 1158050441d:06h:59m:33s]
>>
>




* Re: fio rbd hang for block sizes > 1M
  2014-10-24 22:30       ` fio rbd hang for block sizes > 1M Mark Kirkwood
@ 2014-10-24 22:38         ` Mark Nelson
  0 siblings, 0 replies; 52+ messages in thread
From: Mark Nelson @ 2014-10-24 22:38 UTC (permalink / raw)
  To: Mark Kirkwood, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng

Yeah, earlier today we reverted the commit that we think was causing it.
We should be able to confirm things are working again in the next hour or
two.

Mark

On 10/24/2014 05:30 PM, Mark Kirkwood wrote:
> It looks like it is an rbd cache issue:
>
> http://tracker.ceph.com/issues/9854
>
> If I disable the rbd cache:
>
> $ tail /etc/ceph/ceph.conf
> ...
> [client]
> rbd cache = false
>
> then the 2-4M reads work fine (no invalid reads in valgrind either).
>
> Regards
>
> Mark
>
> On 25/10/14 02:19, Mark Nelson wrote:
>> FWIW we are seeing this at Redhat/Inktank with recent fio from master
>> and ceph giant branch as well.
>>
>> Mark
>>
>> On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
>>> On 24/10/14 18:35, Jens Axboe wrote:
>>>> CC'ing relevant parties, leaving email intact.
>>>>
>>>
>>> Note that the 'Killed' is because I killed the run - it hangs and
>>> appears to be uninterruptible. I missed that when pasting, sorry!
>>>
>>>>> $ fio read-test.fio     # attached
>>>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>>>> iodepth=32
>>>>> fio-2.1.13-88-gb2ee7
>>>>> Starting 1 process
>>>>> rbd engine: RBD version: 0.1.8
>>>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>>> 1158050441d:06h:59m:33s]
>>>
>>
>




* Re: fio rbd hang for block sizes > 1M
  2014-10-24 14:09       ` Mark Nelson
  2014-10-24 14:30         ` Jens Axboe
@ 2014-10-24 22:45         ` Mark Kirkwood
  2014-10-25  0:12           ` Mark Nelson
  1 sibling, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-24 22:45 UTC (permalink / raw)
  To: Mark Nelson, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

Interestingly, I first encountered this on (what I think is) 0.86 
release (0.86-1precise). I wonder if you had a bigger rbd cache on the 
release cluster you tested?

As mentioned in the thread of the same name on -users, disabling the rbd
cache stops the hang.

Regards

Mark

On 25/10/14 03:09, Mark Nelson wrote:
> More info:
>
> I went back and tested fio versions back to 2.1.10 and still encountered
> the issue.  I then went back and tested the v0.86 release versus giant
> and was able to get through a 4MB read test without error.  I suspect
> this is not an fio problem.  I'll try to narrow down the commit after
> 0.86 that is causing this.
>
> Mark
>
> On 10/24/2014 08:19 AM, Mark Nelson wrote:
>> FWIW we are seeing this at Redhat/Inktank with recent fio from master
>> and ceph giant branch as well.
>>
>> Mark
>>
>> On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
>>> On 24/10/14 18:35, Jens Axboe wrote:
>>>> CC'ing relevant parties, leaving email intact.
>>>>
>>>
>>> Note that the 'Killed' is because I killed the run - it hangs and
>>> appears to be uninterruptible. I missed that when pasting, sorry!
>>>
>>>>> $ fio read-test.fio     # attached
>>>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>>>> iodepth=32
>>>>> fio-2.1.13-88-gb2ee7
>>>>> Starting 1 process
>>>>> rbd engine: RBD version: 0.1.8
>>>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>>> 1158050441d:06h:59m:33s]
>>>
>>
>




* Re: fio rbd hang for block sizes > 1M
  2014-10-24 22:45         ` Mark Kirkwood
@ 2014-10-25  0:12           ` Mark Nelson
  2014-10-25  0:37             ` Mark Kirkwood
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Nelson @ 2014-10-25  0:12 UTC (permalink / raw)
  To: Mark Kirkwood, Mark Nelson, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

Hi Mark,

Try the latest giant branch.  I believe we've fixed this with 7272bb8. 
My test cluster is passing read tests now.

Mark

On 10/24/2014 05:45 PM, Mark Kirkwood wrote:
> Interestingly, I first encountered this on (what I think is) 0.86
> release (0.86-1precise). I wonder if you had a bigger rbd cache on the
> release cluster you tested?
>
> As mentioned in the same named thread on -users, disabling the rbd cache
> stops the hang.
>
> Regards
>
> Mark
>
> On 25/10/14 03:09, Mark Nelson wrote:
>> More info:
>>
>> I went back and tested fio versions back to 2.1.10 and still encountered
>> the issue.  I then went back and tested the v0.86 release versus giant
>> and was able to get through a 4MB read test without error.  I suspect
>> this is not an fio problem.  I'll try to narrow down the commit after
>> 0.86 that is causing this.
>>
>> Mark
>>
>> On 10/24/2014 08:19 AM, Mark Nelson wrote:
>>> FWIW we are seeing this at Redhat/Inktank with recent fio from master
>>> and ceph giant branch as well.
>>>
>>> Mark
>>>
>>> On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
>>>> On 24/10/14 18:35, Jens Axboe wrote:
>>>>> CC'ing relevant parties, leaving email intact.
>>>>>
>>>>
>>>> Note that the 'Killed' is because I killed the run - it hangs and
>>>> appears to be uninterruptible. I missed that when pasting, sorry!
>>>>
>>>>>> $ fio read-test.fio     # attached
>>>>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>>>>> iodepth=32
>>>>>> fio-2.1.13-88-gb2ee7
>>>>>> Starting 1 process
>>>>>> rbd engine: RBD version: 0.1.8
>>>>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>>>> 1158050441d:06h:59m:33s]
>>>>
>>>
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




* Re: fio rbd hang for block sizes > 1M
  2014-10-25  0:12           ` Mark Nelson
@ 2014-10-25  0:37             ` Mark Kirkwood
  2014-10-25  2:35               ` Mark Kirkwood
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-25  0:37 UTC (permalink / raw)
  To: Mark Nelson, Mark Nelson, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

Righty, building now.

On 25/10/14 13:12, Mark Nelson wrote:
> Hi Mark,
>
> Try the latest giant branch.  I believe we've fixed this with 7272bb8.
> My test cluster is passing read tests now.
>
> Mark
>
> On 10/24/2014 05:45 PM, Mark Kirkwood wrote:
>> Interestingly, I first encountered this on (what I think is) 0.86
>> release (0.86-1precise). I wonder if you had a bigger rbd cache on the
>> release cluster you tested?
>>
>> As mentioned in the same named thread on -users, disabling the rbd cache
>> stops the hang.
>>
>> Regards
>>
>> Mark
>>
>> On 25/10/14 03:09, Mark Nelson wrote:
>>> More info:
>>>
>>> I went back and tested fio versions back to 2.1.10 and still encountered
>>> the issue.  I then went back and tested the v0.86 release versus giant
>>> and was able to get through a 4MB read test without error.  I suspect
>>> this is not an fio problem.  I'll try to narrow down the commit after
>>> 0.86 that is causing this.
>>>
>>> Mark
>>>
>>> On 10/24/2014 08:19 AM, Mark Nelson wrote:
>>>> FWIW we are seeing this at Redhat/Inktank with recent fio from master
>>>> and ceph giant branch as well.
>>>>
>>>> Mark
>>>>
>>>> On 10/24/2014 01:17 AM, Mark Kirkwood wrote:
>>>>> On 24/10/14 18:35, Jens Axboe wrote:
>>>>>> CC'ing relevant parties, leaving email intact.
>>>>>>
>>>>>
>>>>> Note that the 'Killed' is because I killed the run - it hangs and
>>>>> appears to be uninterruptible. I missed that when pasting, sorry!
>>>>>
>>>>>>> $ fio read-test.fio     # attached
>>>>>>> rbd_thread: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=rbd,
>>>>>>> iodepth=32
>>>>>>> fio-2.1.13-88-gb2ee7
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 0.1.8
>>>>>>> Killed1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>>>>> 1158050441d:06h:59m:33s]
>>>>>
>>>>
>>>
>>
>




* Re: fio rbd hang for block sizes > 1M
  2014-10-25  0:37             ` Mark Kirkwood
@ 2014-10-25  2:35               ` Mark Kirkwood
  2014-10-25  3:47                 ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-25  2:35 UTC (permalink / raw)
  To: Mark Nelson, Mark Nelson, Jens Axboe, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

I patched the client machine *only*. Re-running fio from there works fine
with the default cache settings (i.e. no [client] section at all):

$ fio read-test.fio
rbd_thread: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [R(1)] [75.0% done] [1165MB/0KB/0KB /s] [291/0/0 iops] [eta 00m:0
Jobs: 1 (f=1): [R(1)] [83.3% done] [447.4MB/0KB/0KB /s] [111/0/0 iops] [eta 00m:
Jobs: 1 (f=1): [R(1)] [100.0% done] [268.0MB/0KB/0KB /s] [67/0/0 iops] [eta 00m:
Jobs: 1 (f=1): [R(1)] [100.0% done] [336.1MB/0KB/0KB /s] [84/0/0 iops] [eta 00m:00s]
rbd_thread: (groupid=0, jobs=1): err= 0: pid=5980: Sat Oct 25 15:32:16 2014
   read : io=4096.0MB, bw=623410KB/s, iops=152, runt=  6728msec
     slat (usec): min=7, max=230691, avg=5664.46, stdev=14434.46
     clat (msec): min=11, max=1589, avg=193.03, stdev=246.84
      lat (msec): min=13, max=1606, avg=198.70, stdev=248.62
     clat percentiles (msec):
      |  1.00th=[   17],  5.00th=[   30], 10.00th=[   43], 20.00th=[   60],
      | 30.00th=[   78], 40.00th=[   93], 50.00th=[  109], 60.00th=[  124],
      | 70.00th=[  147], 80.00th=[  210], 90.00th=[  498], 95.00th=[  758],
      | 99.00th=[ 1237], 99.50th=[ 1467], 99.90th=[ 1565], 99.95th=[ 1598],
      | 99.99th=[ 1598]
     bw (KB  /s): min=178086, max=1193644, per=100.00%, avg=637349.58, 
stdev=397329.85
     lat (msec) : 20=2.15%, 50=12.11%, 100=30.08%, 250=38.09%, 500=7.62%
     lat (msec) : 750=4.79%, 1000=2.64%, 2000=2.54%
   cpu          : usr=1.69%, sys=0.28%, ctx=6234, majf=0, minf=78
   IO depths    : 1=0.1%, 2=0.2%, 4=0.4%, 8=1.7%, 16=58.6%, 32=39.1%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=94.3%, 8=5.0%, 16=0.4%, 32=0.3%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=1024/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
    READ: io=4096.0MB, aggrb=623410KB/s, minb=623410KB/s, 
maxb=623410KB/s, mint=6728msec, maxt=6728msec


On 25/10/14 13:37, Mark Kirkwood wrote:
> Righty, building now.
>
> On 25/10/14 13:12, Mark Nelson wrote:
>> Hi Mark,
>>
>> Try the latest giant branch.  I believe we've fixed this with 7272bb8.
>> My test cluster is passing read tests now.




* Re: fio rbd hang for block sizes > 1M
  2014-10-25  2:35               ` Mark Kirkwood
@ 2014-10-25  3:47                 ` Jens Axboe
  2014-10-25  4:50                   ` fio rbd completions (Was: fio rbd hang for block sizes > 1M) Mark Kirkwood
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-25  3:47 UTC (permalink / raw)
  To: Mark Kirkwood, Mark Nelson, Mark Nelson, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2460 bytes --]

On 2014-10-24 20:35, Mark Kirkwood wrote:
> I patched the client machine *only*. Re-running fio from there works fine
> with the default cache settings (i.e. no [client] section at all):
>
> $ fio read-test.fio
> rbd_thread: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=32
> fio-2.1.13-88-gb2ee7
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [R(1)] [75.0% done] [1165MB/0KB/0KB /s] [291/0/0 iops] [eta 00m:0
> Jobs: 1 (f=1): [R(1)] [83.3% done] [447.4MB/0KB/0KB /s] [111/0/0 iops] [eta 00m:
> Jobs: 1 (f=1): [R(1)] [100.0% done] [268.0MB/0KB/0KB /s] [67/0/0 iops] [eta 00m:
> Jobs: 1 (f=1): [R(1)] [100.0% done] [336.1MB/0KB/0KB /s] [84/0/0 iops] [eta 00m:00s]
> rbd_thread: (groupid=0, jobs=1): err= 0: pid=5980: Sat Oct 25 15:32:16 2014
>    read : io=4096.0MB, bw=623410KB/s, iops=152, runt=  6728msec
>      slat (usec): min=7, max=230691, avg=5664.46, stdev=14434.46
>      clat (msec): min=11, max=1589, avg=193.03, stdev=246.84
>       lat (msec): min=13, max=1606, avg=198.70, stdev=248.62
>      clat percentiles (msec):
>       |  1.00th=[   17],  5.00th=[   30], 10.00th=[   43], 20.00th=[   60],
>       | 30.00th=[   78], 40.00th=[   93], 50.00th=[  109], 60.00th=[  124],
>       | 70.00th=[  147], 80.00th=[  210], 90.00th=[  498], 95.00th=[  758],
>       | 99.00th=[ 1237], 99.50th=[ 1467], 99.90th=[ 1565], 99.95th=[ 1598],
>       | 99.99th=[ 1598]
>      bw (KB  /s): min=178086, max=1193644, per=100.00%, avg=637349.58,
> stdev=397329.85
>      lat (msec) : 20=2.15%, 50=12.11%, 100=30.08%, 250=38.09%, 500=7.62%
>      lat (msec) : 750=4.79%, 1000=2.64%, 2000=2.54%
>    cpu          : usr=1.69%, sys=0.28%, ctx=6234, majf=0, minf=78
>    IO depths    : 1=0.1%, 2=0.2%, 4=0.4%, 8=1.7%, 16=58.6%, 32=39.1%,
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>  >=64=0.0%
>       complete  : 0=0.0%, 4=94.3%, 8=5.0%, 16=0.4%, 32=0.3%, 64=0.0%,
>  >=64=0.0%
>       issued    : total=r=1024/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>       latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>     READ: io=4096.0MB, aggrb=623410KB/s, minb=623410KB/s,
> maxb=623410KB/s, mint=6728msec, maxt=6728msec

Since you're running rbd tests... Mind giving this patch a go? I don't 
have an easy way to test it myself. It has nothing to do with this 
issue, it's just a potentially faster way to do the rbd completions.

-- 
Jens Axboe


[-- Attachment #2: rbd-complete-v2.patch --]
[-- Type: text/x-patch, Size: 4345 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..6aa96a5ff550 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,6 +11,7 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
 };
 
@@ -221,34 +222,66 @@ static struct io_u *fio_rbd_event(struct thread_data *td, int event)
 	return rbd_data->aio_events[event];
 }
 
-static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
-			     unsigned int max, const struct timespec *t)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
+{
+	struct fio_rbd_iou *fri = io_u->engine_data;
+
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
 	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
+	unsigned int this_events = 0;
 	struct io_u *io_u;
 	int i;
-	struct fio_rbd_iou *fov;
 
-	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
 
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			struct fio_rbd_iou *fri = io_u->engine_data;
 
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
+			rbd_aio_wait_for_complete(fri->completion);
 
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
 		}
-		if (events < min)
-			usleep(100);
-		else
+		if (*events >= min_evts)
+			break;
+	}
+
+	return this_events;
+}
+
+static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
+			     unsigned int max, const struct timespec *t)
+{
+	unsigned int this_events, events = 0;
+	int wait = 0;
+
+	do {
+		this_events = rbd_iter_events(td, &events, min, wait);
+
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -258,7 +291,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
 	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
 	fio_ro_check(td, io_u);
 
@@ -266,7 +299,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,7 +307,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
 			goto failed;
@@ -284,7 +318,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,7 +326,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
@@ -303,14 +338,14 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
 			goto failed;


* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-25  3:47                 ` Jens Axboe
@ 2014-10-25  4:50                   ` Mark Kirkwood
  2014-10-25 19:20                     ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-25  4:50 UTC (permalink / raw)
  To: Jens Axboe, Mark Nelson, Mark Nelson, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

On 25/10/14 16:47, Jens Axboe wrote:
>
> Since you're running rbd tests... Mind giving this patch a go? I don't
> have an easy way to test it myself. It has nothing to do with this
> issue, it's just a potentially faster way to do the rbd completions.
>

Sure - but note I'm testing this on my i7 workstation (4x OSDs running
on 2x Crucial M550), so it is not exactly server grade :-)

With that in mind, I'm seeing slightly *slower* performance with the
patch applied, e.g. for 128k blocks. Each case below is 2 runs: the first
uncached and the next cached.

Unpatched:

$ fio read-test.fio
rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, 
ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [R(1)] [100.0% done] [588.5MB/0KB/0KB /s] [4707/0/0 iops] 
[eta 00m:00s]
rbd_thread: (groupid=0, jobs=1): err= 0: pid=4305: Sat Oct 25 17:39:32 2014
   read : io=4096.0MB, bw=596205KB/s, iops=4657, runt=  7035msec
     slat (usec): min=2, max=2967, avg=36.67, stdev=58.70
     clat (usec): min=1, max=28305, avg=6812.05, stdev=3062.44
      lat (usec): min=24, max=28330, avg=6848.72, stdev=3061.25
     clat percentiles (usec):
      |  1.00th=[ 2008],  5.00th=[ 2544], 10.00th=[ 3024], 20.00th=[ 3952],
      | 30.00th=[ 4832], 40.00th=[ 5664], 50.00th=[ 6560], 60.00th=[ 7456],
      | 70.00th=[ 8384], 80.00th=[ 9280], 90.00th=[10816], 95.00th=[11968],
      | 99.00th=[14912], 99.50th=[16512], 99.90th=[24192], 99.95th=[26496],
      | 99.99th=[28032]
     bw (KB  /s): min=568064, max=620288, per=100.00%, avg=596434.86, 
stdev=18741.30
     lat (usec) : 2=0.01%, 50=0.01%, 750=0.01%, 1000=0.01%
     lat (msec) : 2=0.94%, 4=19.48%, 10=65.18%, 20=14.16%, 50=0.22%
   cpu          : usr=12.84%, sys=1.96%, ctx=52370, majf=0, minf=78
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=25.6%, 32=74.3%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32


$ fio read-test.fio
rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, 
ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [R(1)] [100.0% done] [843.8MB/0KB/0KB /s] [6750/0/0 iops] 
[eta 00m:00s]
rbd_thread: (groupid=0, jobs=1): err= 0: pid=4393: Sat Oct 25 17:39:50 2014
   read : io=4096.0MB, bw=847163KB/s, iops=6618, runt=  4951msec
     slat (usec): min=2, max=3996, avg=46.39, stdev=106.38
     clat (usec): min=1, max=19652, avg=4699.45, stdev=2251.49
      lat (usec): min=14, max=19726, avg=4745.83, stdev=2244.04
     clat percentiles (usec):
      |  1.00th=[  916],  5.00th=[ 1400], 10.00th=[ 1864], 20.00th=[ 2704],
      | 30.00th=[ 3408], 40.00th=[ 3984], 50.00th=[ 4512], 60.00th=[ 5088],
      | 70.00th=[ 5664], 80.00th=[ 6432], 90.00th=[ 7584], 95.00th=[ 8640],
      | 99.00th=[11328], 99.50th=[11968], 99.90th=[14016], 99.95th=[14784],
      | 99.99th=[16320]
     bw (KB  /s): min=823808, max=885760, per=100.00%, avg=847975.33, 
stdev=24137.14
     lat (usec) : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 500=0.03%
     lat (usec) : 750=0.32%, 1000=1.15%
     lat (msec) : 2=10.05%, 4=28.67%, 10=57.42%, 20=2.34%
   cpu          : usr=15.31%, sys=3.15%, ctx=48359, majf=0, minf=82
   IO depths    : 1=0.1%, 2=0.1%, 4=0.5%, 8=2.3%, 16=43.4%, 32=53.7%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=98.3%, 8=1.0%, 16=0.4%, 32=0.3%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32


Patched:

$ fio read-test.fio
rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, 
ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [R(1)] [100.0% done] [424.9MB/0KB/0KB /s] [3399/0/0 iops] 
[eta 00m:00s]
rbd_thread: (groupid=0, jobs=1): err= 0: pid=4528: Sat Oct 25 17:40:31 2014
   read : io=4096.0MB, bw=429744KB/s, iops=3357, runt=  9760msec
     slat (usec): min=2, max=1450, avg=24.89, stdev=28.80
     clat (usec): min=0, max=29343, avg=9504.27, stdev=3355.50
      lat (usec): min=14, max=29352, avg=9529.17, stdev=3351.45
     clat percentiles (usec):
      |  1.00th=[  852],  5.00th=[ 2960], 10.00th=[ 4512], 20.00th=[ 6688],
      | 30.00th=[ 8512], 40.00th=[ 9408], 50.00th=[10304], 60.00th=[10944],
      | 70.00th=[11456], 80.00th=[11968], 90.00th=[12480], 95.00th=[13632],
      | 99.00th=[18048], 99.50th=[19072], 99.90th=[21376], 99.95th=[21888],
      | 99.99th=[22400]
     bw (KB  /s): min=400606, max=463141, per=100.00%, avg=429940.42, 
stdev=19324.84
     lat (usec) : 2=0.07%, 500=0.01%, 750=0.56%, 1000=0.78%
     lat (msec) : 2=1.70%, 4=5.10%, 10=38.37%, 20=53.20%, 50=0.21%
   cpu          : usr=6.36%, sys=0.79%, ctx=18607, majf=0, minf=81
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32

$ fio read-test.fio
rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, 
ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=0): [R(1)] [100.0% done] [711.9MB/0KB/0KB /s] [5695/0/0 iops] 
[eta 00m:00s]
rbd_thread: (groupid=0, jobs=1): err= 0: pid=4594: Sat Oct 25 17:40:43 2014
   read : io=4096.0MB, bw=719311KB/s, iops=5619, runt=  5831msec
     slat (usec): min=2, max=3965, avg=32.65, stdev=86.47
     clat (usec): min=0, max=16050, avg=5658.86, stdev=2230.99
      lat (usec): min=17, max=16074, avg=5691.51, stdev=2222.24
     clat percentiles (usec):
      |  1.00th=[  796],  5.00th=[ 1880], 10.00th=[ 2864], 20.00th=[ 3888],
      | 30.00th=[ 4576], 40.00th=[ 5088], 50.00th=[ 5536], 60.00th=[ 6112],
      | 70.00th=[ 6624], 80.00th=[ 7328], 90.00th=[ 8384], 95.00th=[ 9408],
      | 99.00th=[11968], 99.50th=[12864], 99.90th=[15552], 99.95th=[15552],
      | 99.99th=[15680]
     bw (KB  /s): min=631040, max=795904, per=100.00%, avg=719788.73, 
stdev=49266.37
     lat (usec) : 2=0.03%, 250=0.01%, 500=0.08%, 750=0.69%, 1000=0.99%
     lat (msec) : 2=3.76%, 4=15.47%, 10=75.63%, 20=3.35%
   cpu          : usr=11.17%, sys=1.22%, ctx=22614, majf=0, minf=83
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, 
 >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
 >=64=0.0%
      issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=32
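
For scale, the bw= figures reported above work out to a drop of roughly
28% on the first (uncached) run and roughly 15% on the second (cached)
run. A quick sketch of that arithmetic follows; pct_drop() is a
hypothetical helper name used only for illustration, and the inputs are
the bw= values in KB/s copied from the output above.

```c
#include <assert.h>

/*
 * Percentage by which `after` falls short of `before`.
 * Hypothetical helper for illustration; not part of fio.
 */
static double pct_drop(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Calling pct_drop(596205, 429744) covers the uncached pair of runs and
pct_drop(847163, 719311) the cached pair.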


I'll try it out next week on our real cluster (3x hosts, 24x OSDs on
spinners + SSD journals). Mark Nelson will probably beat me to it, mind you!

Cheers

Mark



* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-25  4:50                   ` fio rbd completions (Was: fio rbd hang for block sizes > 1M) Mark Kirkwood
@ 2014-10-25 19:20                     ` Jens Axboe
  2014-10-25 22:25                       ` Mark Kirkwood
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-25 19:20 UTC (permalink / raw)
  To: Mark Kirkwood, Mark Nelson, Mark Nelson, fio
  Cc: d.gollub@telekom.de >> Daniel Gollub, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 871 bytes --]

On 10/24/2014 10:50 PM, Mark Kirkwood wrote:
> On 25/10/14 16:47, Jens Axboe wrote:
>>
>> Since you're running rbd tests... Mind giving this patch a go? I don't
>> have an easy way to test it myself. It has nothing to do with this
>> issue, it's just a potentially faster way to do the rbd completions.
>>
> 
> Sure - but note I'm testing this on my i7 workstation (4x OSDs running
> on 2x Crucial M550), so it is not exactly server grade :-)
> 
> With that in mind, I'm seeing slightly *slower* performance with the
> patch applied, e.g. for 128k blocks. Each case below is 2 runs: the first
> uncached and the next cached.

Yeah, that doesn't look good. Mind trying this one out? I wonder if we
doubly wait on them - or perhaps rbd_aio_wait_for_complete() isn't
working correctly. If you try this one, we should know more...

Goal is, I want to get rid of that usleep() in getevents.
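
To make the intended control flow concrete, here is a minimal standalone
sketch of the loop, under stated assumptions. struct sim_io, sim_wait(),
check_complete(), iter_events() and get_events() are hypothetical
stand-ins, not fio or librbd code, and sim_wait() simply marks the I/O
complete where rbd_aio_wait_for_complete() would block. The idea it
illustrates: poll first, and block on a specific I/O only when a polling
pass reaps nothing, so no usleep() is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one I/O unit; not fio's struct io_u. */
struct sim_io {
	int in_flight;	/* submitted and not yet returned to the caller */
	int complete;	/* would be set by the completion callback */
	int seen;	/* already reaped; assumed cleared on resubmit */
};

/* Stand-in for rbd_aio_wait_for_complete(): the real call blocks until
 * librbd finishes the I/O, this one just marks it complete. */
static void sim_wait(struct sim_io *io)
{
	io->complete = 1;
}

/* Reap io if it has completed; returns 1 when an event was recorded. */
static int check_complete(struct sim_io *io, struct sim_io **out,
			  unsigned int *events)
{
	if (!io->complete)
		return 0;
	io->complete = 0;
	io->seen = 1;
	out[(*events)++] = io;
	return 1;
}

/* One pass over all I/Os; blocks only when wait is non-zero. */
static unsigned int iter_events(struct sim_io *ios, size_t n,
				struct sim_io **out, unsigned int *events,
				unsigned int min_evts, int wait)
{
	unsigned int this_events = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct sim_io *io = &ios[i];

		if (!io->in_flight || io->seen)
			continue;
		if (check_complete(io, out, events)) {
			this_events++;
		} else if (wait) {
			sim_wait(io);
			if (check_complete(io, out, events))
				this_events++;
		}
		if (*events >= min_evts)
			break;
	}
	return this_events;
}

/*
 * Poll first; switch to waiting only after a pass that reaped nothing.
 * The caller must have at least min_evts unreaped I/Os in flight
 * (otherwise this loops forever), and out must have room for
 * min_evts entries, or one entry when min_evts is 0.
 */
static unsigned int get_events(struct sim_io *ios, size_t n,
			       struct sim_io **out, unsigned int min_evts)
{
	unsigned int events = 0;
	int wait = 0;

	assert(min_evts <= n);
	for (;;) {
		unsigned int got = iter_events(ios, n, out, &events,
					       min_evts, wait);
		if (events >= min_evts)
			break;
		if (got)
			continue;
		wait = 1;
	}
	return events;
}
```

With several I/Os in flight and only the first one complete, asking
get_events() for two events reaps the first by polling, then waits on
the next in-flight I/O; any remaining in-flight I/O is left untouched.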

-- 
Jens Axboe


[-- Attachment #2: rbd-comp-v3.patch --]
[-- Type: text/x-patch, Size: 5109 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..2353b1f11caf 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -221,34 +223,69 @@ static struct io_u *fio_rbd_event(struct thread_data *td, int event)
 	return rbd_data->aio_events[event];
 }
 
-static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
-			     unsigned int max, const struct timespec *t)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
+{
+	struct fio_rbd_iou *fri = io_u->engine_data;
+
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
 	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
+	unsigned int this_events = 0;
 	struct io_u *io_u;
 	int i;
-	struct fio_rbd_iou *fov;
 
-	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
 		}
-		if (events < min)
-			usleep(100);
-		else
+		if (*events >= min_evts)
+			break;
+	}
+
+	return this_events;
+}
+
+static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
+			     unsigned int max, const struct timespec *t)
+{
+	unsigned int this_events, events = 0;
+	int wait = 0;
+
+	do {
+		this_events = rbd_iter_events(td, &events, min, wait);
+
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -258,7 +295,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
 	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
 	fio_ro_check(td, io_u);
 
@@ -266,7 +303,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,7 +311,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
 			goto failed;
@@ -284,7 +322,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,7 +330,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
@@ -303,14 +342,14 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
 			goto failed;
@@ -439,22 +478,21 @@ static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }
 


* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-25 19:20                     ` Jens Axboe
@ 2014-10-25 22:25                       ` Mark Kirkwood
  2014-10-27  9:27                         ` Ketor D
  2014-10-27 14:19                         ` Jens Axboe
  0 siblings, 2 replies; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-25 22:25 UTC (permalink / raw)
  To: Jens Axboe, Mark Nelson, Mark Nelson, fio
  Cc: xan.peng, ceph-devel@vger.kernel.org

On 26/10/14 08:20, Jens Axboe wrote:
> On 10/24/2014 10:50 PM, Mark Kirkwood wrote:
>> On 25/10/14 16:47, Jens Axboe wrote:
>>>
>>> Since you're running rbd tests... Mind giving this patch a go? I don't
>>> have an easy way to test it myself. It has nothing to do with this
>>> issue, it's just a potentially faster way to do the rbd completions.
>>>
>>
>> Sure - but note I'm testing this on my i7 workstation (4x osd's running
>> on 2x Crucial M550) so not exactly server grade :-)
>>
>> With that in mind, I'm seeing slightly *slower* performance with the
>> patch applied: e.g: for 128k blocks - 2 runs, 1 uncached and the next
>> cached.
>
> Yeah, that doesn't look good. Mind trying this one out? I wonder if we
> doubly wait on them - or perhaps rbd_aio_wait_for_complete() isn't
> working correctly. If you try this one, we should know more...
>
> Goal is, I want to get rid of that usleep() in getevents.
>

Testing with v3 patch applied hangs. I did wonder if we had somehow hit 
a new variant of the cache issue - so reran with it disabled in 
ceph.conf. Result is the same:

$ fio read-test.fio
rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, 
ioengine=rbd, iodepth=32
fio-2.1.13-88-gb2ee7
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [R(1)] [0.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
01h:25m:15s]






* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-25 22:25                       ` Mark Kirkwood
@ 2014-10-27  9:27                         ` Ketor D
  2014-10-27 10:25                           ` Ketor D
  2014-10-27 14:15                           ` Jens Axboe
  2014-10-27 14:19                         ` Jens Axboe
  1 sibling, 2 replies; 52+ messages in thread
From: Ketor D @ 2014-10-27  9:27 UTC (permalink / raw)
  To: Mark Kirkwood
  Cc: Jens Axboe, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]

Hi, Jens:
          I have tested your v2 and v3 patches.
          The v2 patch gets SIGABRT and crashes. The v3 patch hangs.

         Why not simply comment out the usleep()?


2014-10-26 6:25 GMT+08:00 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>:

> On 26/10/14 08:20, Jens Axboe wrote:
>
>> On 10/24/2014 10:50 PM, Mark Kirkwood wrote:
>>
>>> On 25/10/14 16:47, Jens Axboe wrote:
>>>
>>>>
>>>> Since you're running rbd tests... Mind giving this patch a go? I don't
>>>> have an easy way to test it myself. It has nothing to do with this
>>>> issue, it's just a potentially faster way to do the rbd completions.
>>>>
>>>>
>>> Sure - but note I'm testing this on my i7 workstation (4x osd's running
>>> on 2x Crucial M550) so not exactly server grade :-)
>>>
>>> With that in mind, I'm seeing slightly *slower* performance with the
>>> patch applied: e.g: for 128k blocks - 2 runs, 1 uncached and the next
>>> cached.
>>>
>>
>> Yeah, that doesn't look good. Mind trying this one out? I wonder if we
>> doubly wait on them - or perhaps rbd_aio_wait_for_complete() isn't
>> working correctly. If you try this one, we should know more...
>>
>> Goal is, I want to get rid of that usleep() in getevents.
>>
>>
> Testing with v3 patch applied hangs. I did wonder if we had somehow hit a
> new variant of the cache issue - so reran with it disabled in ceph.conf.
> Result is the same:
>
> $ fio read-test.fio
> rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K,
> ioengine=rbd, iodepth=32
> fio-2.1.13-88-gb2ee7
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [R(1)] [0.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> 01h:25m:15s]
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27  9:27                         ` Ketor D
@ 2014-10-27 10:25                           ` Ketor D
  2014-10-27 14:19                             ` Jens Axboe
  2014-10-27 14:15                           ` Jens Axboe
  1 sibling, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-27 10:25 UTC (permalink / raw)
  To: Mark Kirkwood
  Cc: Jens Axboe, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2684 bytes --]

Hi Jens:
      After debugging the v3 patch, I found a bug in it.
      On the first fio_rbd_getevents loop, fri->io_seen is set to
1, and this variable is never set back to 0. So the program gets into
an endless loop in this code:

do {
	this_events = rbd_iter_events(td, &events, min, wait);

	if (events >= min)
		break;
	if (this_events)
		continue;

	wait = 1;
} while (1);

this_events and events are always 0, because fri->io_seen is always
1, so no events can ever be reaped.

The bug fix is:
in the functions _fio_rbd_finish_read_aiocb,
_fio_rbd_finish_write_aiocb and _fio_rbd_finish_sync_aiocb, add
"fio_rbd_iou->io_seen = 0;" after "fio_rbd_iou->io_complete = 1;".


The attachment is the new patch.
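
The hang described above can be reproduced in miniature without librbd. The sketch below is only a model under stated assumptions: struct mock_iou and the mock_* functions are hypothetical stand-ins, not fio or librbd code. It reuses one IO slot twice and shows that the second completion is never reaped unless io_seen is cleared again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fio_rbd_iou; not fio's real type. */
struct mock_iou {
	int in_flight;   /* models IO_U_F_FLIGHT */
	int io_complete; /* set by the completion callback */
	int io_seen;     /* set once getevents has reaped the IO */
};

/* Models the completion callback; reset_seen = 0 is v3, 1 is the proposed fix. */
static void mock_finish(struct mock_iou *iou, int reset_seen)
{
	iou->io_complete = 1;
	if (reset_seen)
		iou->io_seen = 0;
}

/* Models one pass of rbd_iter_events(); returns how many IOs were reaped. */
static unsigned int mock_iter_events(struct mock_iou *ious, size_t n)
{
	unsigned int reaped = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct mock_iou *iou = &ious[i];

		if (!iou->in_flight || iou->io_seen)
			continue;
		if (iou->io_complete) {
			iou->io_complete = 0;
			iou->io_seen = 1;
			reaped++;
		}
	}
	return reaped;
}

/* Uses the same slot for two IOs; returns the events reaped for the second. */
unsigned int second_pass_events(int reset_seen)
{
	struct mock_iou iou = { 1, 0, 0 };

	mock_finish(&iou, reset_seen);
	mock_iter_events(&iou, 1);      /* first IO: reaped, io_seen becomes 1 */

	mock_finish(&iou, reset_seen);  /* slot reused for a second IO */
	return mock_iter_events(&iou, 1);
}
```

With reset_seen set to 0 the second pass skips the slot because io_seen is still 1, which is the state the outer do/while loop can never leave.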

2014-10-27 17:27 GMT+08:00 Ketor D <d.ketor@gmail.com>:
> Hi, Jens:
>           I have tested your v2 and v3 patches.
>           The v2 patch gets SIGABRT and crashes. The v3 patch hangs.
>
>          Why not simply comment out the usleep()?
>
>
> 2014-10-26 6:25 GMT+08:00 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>:
>>
>> On 26/10/14 08:20, Jens Axboe wrote:
>>>
>>> On 10/24/2014 10:50 PM, Mark Kirkwood wrote:
>>>>
>>>> On 25/10/14 16:47, Jens Axboe wrote:
>>>>>
>>>>>
>>>>> Since you're running rbd tests... Mind giving this patch a go? I don't
>>>>> have an easy way to test it myself. It has nothing to do with this
>>>>> issue, it's just a potentially faster way to do the rbd completions.
>>>>>
>>>>
>>>> Sure - but note I'm testing this on my i7 workstation (4x osd's running
>>>> on 2x Crucial M550) so not exactly server grade :-)
>>>>
>>>> With that in mind, I'm seeing slightly *slower* performance with the
>>>> patch applied: e.g: for 128k blocks - 2 runs, 1 uncached and the next
>>>> cached.
>>>
>>>
>>> Yeah, that doesn't look good. Mind trying this one out? I wonder if we
>>> doubly wait on them - or perhaps rbd_aio_wait_for_complete() isn't
>>> working correctly. If you try this one, we should know more...
>>>
>>> Goal is, I want to get rid of that usleep() in getevents.
>>>
>>
>> Testing with v3 patch applied hangs. I did wonder if we had somehow hit a
>> new variant of the cache issue - so reran with it disabled in ceph.conf.
>> Result is the same:
>>
>> $ fio read-test.fio
>> rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K,
>> ioengine=rbd, iodepth=32
>> fio-2.1.13-88-gb2ee7
>> Starting 1 process
>> rbd engine: RBD version: 0.1.8
>> Jobs: 1 (f=1): [R(1)] [0.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> 01h:25m:15s]
>>
>>
>>
>>
>
>

[-- Attachment #2: rbd-comp-v4.patch --]
[-- Type: application/octet-stream, Size: 6034 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8..3fd815c 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -170,6 +172,7 @@ static void _fio_rbd_finish_write_aiocb(rbd_completion_t comp, void *data)
 	    (struct fio_rbd_iou *)io_u->engine_data;
 
 	fio_rbd_iou->io_complete = 1;
+	fio_rbd_iou->io_seen = 0;
 
 	/* if write needs to be verified - we should not release comp here
 	   without fetching the result */
@@ -187,6 +190,7 @@ static void _fio_rbd_finish_read_aiocb(rbd_completion_t comp, void *data)
 	    (struct fio_rbd_iou *)io_u->engine_data;
 
 	fio_rbd_iou->io_complete = 1;
+	fio_rbd_iou->io_seen = 0;
 
 	/* if read needs to be verified - we should not release comp here
 	   without fetching the result */
@@ -204,6 +208,7 @@ static void _fio_rbd_finish_sync_aiocb(rbd_completion_t comp, void *data)
 	    (struct fio_rbd_iou *)io_u->engine_data;
 
 	fio_rbd_iou->io_complete = 1;
+	fio_rbd_iou->io_seen = 0;
 
 	/* if sync needs to be verified - we should not release comp here
 	   without fetching the result */
@@ -221,34 +226,70 @@ static struct io_u *fio_rbd_event(struct thread_data *td, int event)
 	return rbd_data->aio_events[event];
 }
 
-static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
-			     unsigned int max, const struct timespec *t)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
+{
+	struct fio_rbd_iou *fri = io_u->engine_data;
+
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
 	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
+	unsigned int this_events = 0;
 	struct io_u *io_u;
 	int i;
-	struct fio_rbd_iou *fov;
 
-	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
+		if (fri_check_complete(rbd_data, io_u, events)){
+			this_events++;
+		}
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
 		}
-		if (events < min)
-			usleep(100);
-		else
+		if (*events >= min_evts)
+			break;
+	}
+
+	return this_events;
+}
+
+static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
+			     unsigned int max, const struct timespec *t)
+{
+	unsigned int this_events, events = 0;
+	int wait = 0;
+
+	do {
+		this_events = rbd_iter_events(td, &events, min, wait);
+
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -258,7 +299,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
 	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
 	fio_ro_check(td, io_u);
 
@@ -266,7 +307,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,7 +315,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
 			goto failed;
@@ -284,7 +326,7 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,7 +334,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
@@ -303,14 +346,14 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		r = rbd_aio_create_completion(io_u,
 					      (rbd_callback_t)
 					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+					      &fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
 			goto failed;
@@ -439,22 +482,21 @@ static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }
 


* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27  9:27                         ` Ketor D
  2014-10-27 10:25                           ` Ketor D
@ 2014-10-27 14:15                           ` Jens Axboe
  1 sibling, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 14:15 UTC (permalink / raw)
  To: Ketor D, Mark Kirkwood
  Cc: Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 10/27/2014 03:27 AM, Ketor D wrote:
> Hi, Jens:
>           I have tested your v2 and v3 patches.
>           The v2 patch gets SIGABRT and crashes. The v3 patch hangs.
> 
>          Why not simply comment out the usleep()?

Because that is very inefficient as well: fio would then basically be
busy looping while waiting for IO to finish.

-- 
Jens Axboe




* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 10:25                           ` Ketor D
@ 2014-10-27 14:19                             ` Jens Axboe
  0 siblings, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 14:19 UTC (permalink / raw)
  To: Ketor D, Mark Kirkwood
  Cc: Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1425 bytes --]

On 10/27/2014 04:25 AM, Ketor D wrote:
> Hi Jens:
>       After debugging the v3 patch, I found a bug in it.
>       On the first fio_rbd_getevents loop, fri->io_seen is set to
> 1, and this variable is never set back to 0. So the program gets into
> an endless loop in this code:
> 
> do {
> 	this_events = rbd_iter_events(td, &events, min, wait);
> 
> 	if (events >= min)
> 		break;
> 	if (this_events)
> 		continue;
> 
> 	wait = 1;
> } while (1);
> 
> this_events and events are always 0, because fri->io_seen is always
> 1, so no events can ever be reaped.
> 
> The bug fix is:
> in the functions _fio_rbd_finish_read_aiocb,
> _fio_rbd_finish_write_aiocb and _fio_rbd_finish_sync_aiocb, add
> "fio_rbd_iou->io_seen = 0;" after "fio_rbd_iou->io_complete = 1;".

So there are two issues. One is that ->io_seen should be reset in the
->queue() ops, before issuing the IO. The second is that the comp is
released in a racy way, so we can't use it in getevents() reliably.

The new patch moves the comp release to when we reap the event, and
cleans up the ->io_seen setting as well. As far as I can tell, this
should fix all cases.
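
To make the second issue concrete, here is a small deterministic model of one bad interleaving: the callback fires (and, in the old scheme, releases the completion) before getevents waits on it. Everything here is a hypothetical stand-in (mock_completion, mock_iou and the mock_* helpers); it is a sketch of the ordering only, not the real librbd handle semantics.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are librbd or fio types. */
struct mock_completion {
	int released;            /* 1 once the handle has been released */
	int waits_after_release; /* counts waits on an already released handle */
};

struct mock_iou {
	struct mock_completion comp;
	int io_complete;
	int io_seen;
};

static void mock_release(struct mock_completion *c)
{
	c->released = 1;
}

static void mock_wait(struct mock_completion *c)
{
	/* With a real handle this would be a use-after-release. */
	if (c->released)
		c->waits_after_release++;
}

/* Models ->queue(): start from a clean per-IO state before issuing. */
static void mock_queue(struct mock_iou *iou)
{
	memset(iou, 0, sizeof(*iou));
}

/* Models the completion callback firing on the library's thread. */
static void mock_callback(struct mock_iou *iou, int release_in_callback)
{
	iou->io_complete = 1;
	if (release_in_callback)
		mock_release(&iou->comp);
}

/* Models getevents waiting on, then reaping, one IO. */
static int mock_reap(struct mock_iou *iou, int release_in_callback)
{
	mock_wait(&iou->comp);
	if (!iou->io_complete)
		return 0;
	iou->io_complete = 0;
	iou->io_seen = 1;
	if (!release_in_callback)
		mock_release(&iou->comp);
	return 1;
}

/* Runs queue -> callback -> reap and reports waits on a released handle. */
int stale_waits(int release_in_callback)
{
	struct mock_iou iou;

	mock_queue(&iou);
	mock_callback(&iou, release_in_callback);
	mock_reap(&iou, release_in_callback);
	return iou.comp.waits_after_release;
}
```

Releasing at reap time means the handle is guaranteed to outlive any wait issued from getevents, whichever thread gets there first.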

Additionally, it now actually checks for IO errors and handles those
correctly. They were just ignored before. Gets rid of some useless
casting as well, and lots of duplicated IO comp functions.
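
The error handling follows the usual short-transfer pattern. A minimal standalone sketch of that mapping, assuming a hypothetical struct mock_io_u in place of fio's struct io_u and taking the return value as a plain argument rather than fetching it from librbd:

```c
#include <assert.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical subset of the fields used from struct io_u. */
struct mock_io_u {
	unsigned long xfer_buflen; /* bytes requested */
	unsigned long resid;       /* bytes not transferred */
	int error;                 /* 0 on success, else the raw return code */
};

/*
 * Maps a completion return value onto resid/error:
 *   ret <  0           -> hard error, stored as-is (as the v5 patch does)
 *   0 <= ret < buflen  -> short transfer, remainder goes into resid
 *   ret >= buflen      -> full transfer, nothing to record
 */
void mock_io_finish(struct mock_io_u *io_u, ssize_t ret)
{
	io_u->resid = 0;
	io_u->error = 0;

	if (ret < 0)
		io_u->error = (int) ret;
	else if ((unsigned long) ret < io_u->xfer_buflen)
		io_u->resid = io_u->xfer_buflen - (unsigned long) ret;
}
```

Unlike the patch, the sketch clears resid and error up front so a reused structure never carries a stale result; that is a choice made for the example, not something taken from the patch.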

If everybody involved (Mark, you) could try this one out, then I'd
appreciate it.

-- 
Jens Axboe


[-- Attachment #2: rbd-comp-v5.patch --]
[-- Type: text/x-patch, Size: 7347 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..89344033f894 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -163,92 +165,102 @@ static void _fio_rbd_disconnect(struct rbd_data *rbd_data)
 	}
 }
 
-static void _fio_rbd_finish_write_aiocb(rbd_completion_t comp, void *data)
+static void _fio_rbd_io_finish(struct io_u *io_u)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	ssize_t ret;
+
+	fri->io_complete = 1;
+
+	ret = rbd_aio_get_return_value(&fri->completion);
+	if (ret != (int) io_u->xfer_buflen) {
+		if (ret >= 0) {
+			io_u->resid = io_u->xfer_buflen - ret;
+			io_u->error = 0;
+		} else
+			io_u->error = ret;
+	}
+}
 
-	fio_rbd_iou->io_complete = 1;
+static void _fio_rbd_finish_aiocb(rbd_completion_t comp, void *data)
+{
+	struct io_u *io_u = data;
 
-	/* if write needs to be verified - we should not release comp here
-	   without fetching the result */
+	_fio_rbd_io_finish(io_u);
+}
 
-	rbd_aio_release(comp);
-	/* TODO handle error */
+static struct io_u *fio_rbd_event(struct thread_data *td, int event)
+{
+	struct rbd_data *rbd_data = td->io_ops->data;
 
-	return;
+	return rbd_data->aio_events[event];
 }
 
-static void _fio_rbd_finish_read_aiocb(rbd_completion_t comp, void *data)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	fio_rbd_iou->io_complete = 1;
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
 
-	/* if read needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
-
-	/* TODO handle error */
+		rbd_aio_release(&fri->completion);
+		return 1;
+	}
 
-	return;
+	return 0;
 }
 
-static void _fio_rbd_finish_sync_aiocb(rbd_completion_t comp, void *data)
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct rbd_data *rbd_data = td->io_ops->data;
+	unsigned int this_events = 0;
+	struct io_u *io_u;
+	int i;
 
-	/* if sync needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* TODO handle error */
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-	return;
-}
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
-static struct io_u *fio_rbd_event(struct thread_data *td, int event)
-{
-	struct rbd_data *rbd_data = td->io_ops->data;
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
+		}
+		if (*events >= min_evts)
+			break;
+	}
 
-	return rbd_data->aio_events[event];
+	return this_events;
 }
 
 static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 			     unsigned int max, const struct timespec *t)
 {
-	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
-	struct io_u *io_u;
-	int i;
-	struct fio_rbd_iou *fov;
+	unsigned int this_events, events = 0;
+	int wait = 0;
 
 	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
-
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
-
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
+		this_events = rbd_iter_events(td, &events, min, wait);
 
-		}
-		if (events < min)
-			usleep(100);
-		else
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -256,17 +268,18 @@ static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 
 static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
-	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	int r = -1;
 
 	fio_ro_check(td, io_u);
 
+	fri->io_complete = 0;
+	fri->io_seen = 0;
+
 	if (io_u->ddir == DDIR_WRITE) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,17 +287,16 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_READ) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,7 +304,8 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
@@ -300,17 +313,15 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 	} else if (io_u->ddir == DDIR_SYNC) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
 			goto failed;
@@ -439,22 +450,21 @@ static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }
 


* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-25 22:25                       ` Mark Kirkwood
  2014-10-27  9:27                         ` Ketor D
@ 2014-10-27 14:19                         ` Jens Axboe
  2014-10-27 15:12                           ` Ketor D
  1 sibling, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 14:19 UTC (permalink / raw)
  To: Mark Kirkwood, Mark Nelson, Mark Nelson, fio
  Cc: xan.peng, ceph-devel@vger.kernel.org

On 10/25/2014 04:25 PM, Mark Kirkwood wrote:
> On 26/10/14 08:20, Jens Axboe wrote:
>> On 10/24/2014 10:50 PM, Mark Kirkwood wrote:
>>> On 25/10/14 16:47, Jens Axboe wrote:
>>>>
>>>> Since you're running rbd tests... Mind giving this patch a go? I don't
>>>> have an easy way to test it myself. It has nothing to do with this
>>>> issue, it's just a potentially faster way to do the rbd completions.
>>>>
>>>
>>> Sure - but note I'm testing this on my i7 workstation (4x osd's running
>>> on 2x Crucial M550) so not exactly server grade :-)
>>>
>>> With that in mind, I'm seeing slightly *slower* performance with the
>>> patch applied: e.g: for 128k blocks - 2 runs, 1 uncached and the next
>>> cached.
>>
>> Yeah, that doesn't look good. Mind trying this one out? I wonder if we
>> doubly wait on them - or perhaps rbd_aio_wait_for_complete() isn't
>> working correctly. If you try this one, we should know more...
>>
>> Goal is, I want to get rid of that usleep() in getevents.
>>
> 
> Testing with v3 patch applied hangs. I did wonder if we had somehow hit
> a new variant of the cache issue - so reran with it disabled in
> ceph.conf. Result is the same:
> 
> $ fio read-test.fio
> rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K,
> ioengine=rbd, iodepth=32
> fio-2.1.13-88-gb2ee7
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [R(1)] [0.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> 01h:25m:15s]

There were, unfortunately, still two bugs left in -v3. I just posted an
updated one, please try that and see if it works for you.

-- 
Jens Axboe




* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 14:19                         ` Jens Axboe
@ 2014-10-27 15:12                           ` Ketor D
  2014-10-27 15:22                             ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-27 15:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

The v5 patch does not work.

Out of 5 runs:
3 ended in SIGSEGV
2 showed no IOPS (endless loop)



2014-10-27 22:19 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 10/25/2014 04:25 PM, Mark Kirkwood wrote:
>> On 26/10/14 08:20, Jens Axboe wrote:
>>> On 10/24/2014 10:50 PM, Mark Kirkwood wrote:
>>>> On 25/10/14 16:47, Jens Axboe wrote:
>>>>>
>>>>> Since you're running rbd tests... Mind giving this patch a go? I don't
>>>>> have an easy way to test it myself. It has nothing to do with this
>>>>> issue, it's just a potentially faster way to do the rbd completions.
>>>>>
>>>>
>>>> Sure - but note I'm testing this on my i7 workstation (4x osd's running
>>>> on 2x Crucial M550) so not exactly server grade :-)
>>>>
>>>> With that in mind, I'm seeing slightly *slower* performance with the
>>>> patch applied: e.g: for 128k blocks - 2 runs, 1 uncached and the next
>>>> cached.
>>>
>>> Yeah, that doesn't look good. Mind trying this one out? I wonder if we
>>> doubly wait on them - or perhaps rbd_aio_wait_for_complete() isn't
>>> working correctly. If you try this one, we should know more...
>>>
>>> Goal is, I want to get rid of that usleep() in getevents.
>>>
>>
>> Testing with v3 patch applied hangs. I did wonder if we had somehow hit
>> a new variant of the cache issue - so reran with it disabled in
>> ceph.conf. Result is the same:
>>
>> $ fio read-test.fio
>> rbd_thread: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K,
>> ioengine=rbd, iodepth=32
>> fio-2.1.13-88-gb2ee7
>> Starting 1 process
>> rbd engine: RBD version: 0.1.8
>> Jobs: 1 (f=1): [R(1)] [0.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> 01h:25m:15s]
>
> There were, unfortunately, still two bugs left in -v3. I just posted an
> updated one, please try that and see if it works for you.
>
> --
> Jens Axboe
>



* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:12                           ` Ketor D
@ 2014-10-27 15:22                             ` Jens Axboe
  2014-10-27 15:25                               ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 15:22 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 305 bytes --]

On 10/27/2014 09:12 AM, Ketor D wrote:
> The v5 patch does not work.
> 
> Out of 5 runs:
> 3 ended in SIGSEGV
> 2 showed no IOPS (endless loop)

Try this one, perhaps it's the wrong type passed for release. Typedefs
for the win (or not).

This also fixes comp leaks, if the read/write/sync fails.
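
For anyone wondering how a wrong type can slip through silently: when an opaque handle is a typedef of void *, the compiler accepts the address of the handle just as happily as the handle itself. The sketch below assumes a hypothetical mock_completion_t and mock_release(); as far as I know librbd's rbd_completion_t is likewise a void * typedef, but treat that as an assumption here.

```c
#include <assert.h>

/* Hypothetical opaque handle, declared the same way as a void * typedef. */
typedef void *mock_completion_t;

struct mock_iou {
	mock_completion_t completion;
};

/* Records the pointer it was handed so the caller can inspect it. */
static const void *last_released;

static void mock_release(mock_completion_t c)
{
	last_released = c;
}

/* Passes the handle itself: the release sees the real object. */
int release_by_value_matches(void)
{
	int object;
	struct mock_iou iou = { &object };

	mock_release(iou.completion);
	return last_released == (void *) &object;
}

/*
 * Passes the address of the handle. This still compiles without a
 * diagnostic, because any object pointer converts to void * implicitly,
 * but the release now sees the struct field instead of the object.
 */
int release_by_address_matches(void)
{
	int object;
	struct mock_iou iou = { &object };

	mock_release(&iou.completion);
	return last_released == (void *) &object;
}
```

A handle declared as a pointer to an incomplete struct type would have turned the second call into a compile-time diagnostic, which is the cost of the void * typedef.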

-- 
Jens Axboe


[-- Attachment #2: rbd-comp-v6.patch --]
[-- Type: text/x-patch, Size: 9805 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..a6e5dafb87fd 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -30,35 +32,35 @@ struct rbd_options {
 
 static struct fio_option options[] = {
 	{
-	 .name     = "rbdname",
-	 .lname    = "rbd engine rbdname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "RBD name for RBD engine",
-	 .off1     = offsetof(struct rbd_options, rbd_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name		= "rbdname",
+		.lname		= "rbd engine rbdname",
+		.type		= FIO_OPT_STR_STORE,
+		.help		= "RBD name for RBD engine",
+		.off1		= offsetof(struct rbd_options, rbd_name),
+		.category	= FIO_OPT_C_ENGINE,
+		.group		= FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "pool",
-	 .lname    = "rbd engine pool",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the pool hosting the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, pool_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "pool",
+		.lname    = "rbd engine pool",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the pool hosting the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, pool_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "clientname",
-	 .lname    = "rbd engine clientname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the ceph client to access the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, client_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "clientname",
+		.lname    = "rbd engine clientname",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the ceph client to access the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, client_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name = NULL,
-	 },
+		.name = NULL,
+	},
 };
 
 static int _fio_setup_rbd_data(struct thread_data *td,
@@ -163,92 +165,96 @@ static void _fio_rbd_disconnect(struct rbd_data *rbd_data)
 	}
 }
 
-static void _fio_rbd_finish_write_aiocb(rbd_completion_t comp, void *data)
+static void _fio_rbd_finish_aiocb(rbd_completion_t comp, void *data)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
-
-	/* if write needs to be verified - we should not release comp here
-	   without fetching the result */
+	struct io_u *io_u = data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	ssize_t ret;
+
+	fri->io_complete = 1;
+
+	ret = rbd_aio_get_return_value(&fri->completion);
+	if (ret != (int) io_u->xfer_buflen) {
+		if (ret >= 0) {
+			io_u->resid = io_u->xfer_buflen - ret;
+			io_u->error = 0;
+		} else
+			io_u->error = ret;
+	}
+}
 
-	rbd_aio_release(comp);
-	/* TODO handle error */
+static struct io_u *fio_rbd_event(struct thread_data *td, int event)
+{
+	struct rbd_data *rbd_data = td->io_ops->data;
 
-	return;
+	return rbd_data->aio_events[event];
 }
 
-static void _fio_rbd_finish_read_aiocb(rbd_completion_t comp, void *data)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* if read needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
 
-	/* TODO handle error */
+		rbd_aio_release(fri->completion);
+		return 1;
+	}
 
-	return;
+	return 0;
 }
 
-static void _fio_rbd_finish_sync_aiocb(rbd_completion_t comp, void *data)
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct rbd_data *rbd_data = td->io_ops->data;
+	unsigned int this_events = 0;
+	struct io_u *io_u;
+	int i;
 
-	/* if sync needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* TODO handle error */
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-	return;
-}
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
-static struct io_u *fio_rbd_event(struct thread_data *td, int event)
-{
-	struct rbd_data *rbd_data = td->io_ops->data;
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
+		}
+		if (*events >= min_evts)
+			break;
+	}
 
-	return rbd_data->aio_events[event];
+	return this_events;
 }
 
 static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 			     unsigned int max, const struct timespec *t)
 {
-	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
-	struct io_u *io_u;
-	int i;
-	struct fio_rbd_iou *fov;
+	unsigned int this_events, events = 0;
+	int wait = 0;
 
 	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
-
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
+		this_events = rbd_iter_events(td, &events, min, wait);
 
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
-
-		}
-		if (events < min)
-			usleep(100);
-		else
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -256,17 +262,18 @@ static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 
 static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
-	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	int r = -1;
 
 	fio_ro_check(td, io_u);
 
+	fri->io_complete = 0;
+	fri->io_seen = 0;
+
 	if (io_u->ddir == DDIR_WRITE) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,17 +281,17 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_READ) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,27 +299,28 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_SYNC) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
@@ -344,7 +352,6 @@ static int fio_rbd_init(struct thread_data *td)
 
 failed:
 	return 1;
-
 }
 
 static void fio_rbd_cleanup(struct thread_data *td)
@@ -379,8 +386,9 @@ static int fio_rbd_setup(struct thread_data *td)
 	}
 	td->io_ops->data = rbd_data;
 
-	/* librbd does not allow us to run first in the main thread and later in a
-	 * fork child. It needs to be the same process context all the time. 
+	/* librbd does not allow us to run first in the main thread and later
+	 * in a fork child. It needs to be the same process context all the
+	 * time. 
 	 */
 	td->o.use_thread = 1;
 
@@ -439,22 +447,21 @@ static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:22                             ` Jens Axboe
@ 2014-10-27 15:25                               ` Jens Axboe
  2014-10-27 15:29                                 ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 15:25 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 408 bytes --]

On 10/27/2014 09:22 AM, Jens Axboe wrote:
> On 10/27/2014 09:12 AM, Ketor D wrote:
>> The v5 patch does not work.
>>
>> Run 5 times:
>> 3 times SIGSEGV
>> 2 times NO IOPS, Endless loop
> 
> Try this one, perhaps it's the wrong type passed for release. Typedefs
> for the win (or not).
> 
> This also fixes comp leaks, if the read/write/sync fails.

The get_return was wrong too, here's -v7...

-- 
Jens Axboe


[-- Attachment #2: rbd-comp-v7.patch --]
[-- Type: text/x-patch, Size: 9804 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..0e04b610b3d9 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -30,35 +32,35 @@ struct rbd_options {
 
 static struct fio_option options[] = {
 	{
-	 .name     = "rbdname",
-	 .lname    = "rbd engine rbdname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "RBD name for RBD engine",
-	 .off1     = offsetof(struct rbd_options, rbd_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name		= "rbdname",
+		.lname		= "rbd engine rbdname",
+		.type		= FIO_OPT_STR_STORE,
+		.help		= "RBD name for RBD engine",
+		.off1		= offsetof(struct rbd_options, rbd_name),
+		.category	= FIO_OPT_C_ENGINE,
+		.group		= FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "pool",
-	 .lname    = "rbd engine pool",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the pool hosting the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, pool_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "pool",
+		.lname    = "rbd engine pool",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the pool hosting the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, pool_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "clientname",
-	 .lname    = "rbd engine clientname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the ceph client to access the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, client_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "clientname",
+		.lname    = "rbd engine clientname",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the ceph client to access the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, client_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name = NULL,
-	 },
+		.name = NULL,
+	},
 };
 
 static int _fio_setup_rbd_data(struct thread_data *td,
@@ -163,92 +165,96 @@ static void _fio_rbd_disconnect(struct rbd_data *rbd_data)
 	}
 }
 
-static void _fio_rbd_finish_write_aiocb(rbd_completion_t comp, void *data)
+static void _fio_rbd_finish_aiocb(rbd_completion_t comp, void *data)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
-
-	/* if write needs to be verified - we should not release comp here
-	   without fetching the result */
+	struct io_u *io_u = data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	ssize_t ret;
+
+	fri->io_complete = 1;
+
+	ret = rbd_aio_get_return_value(fri->completion);
+	if (ret != (int) io_u->xfer_buflen) {
+		if (ret >= 0) {
+			io_u->resid = io_u->xfer_buflen - ret;
+			io_u->error = 0;
+		} else
+			io_u->error = ret;
+	}
+}
 
-	rbd_aio_release(comp);
-	/* TODO handle error */
+static struct io_u *fio_rbd_event(struct thread_data *td, int event)
+{
+	struct rbd_data *rbd_data = td->io_ops->data;
 
-	return;
+	return rbd_data->aio_events[event];
 }
 
-static void _fio_rbd_finish_read_aiocb(rbd_completion_t comp, void *data)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* if read needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
 
-	/* TODO handle error */
+		rbd_aio_release(fri->completion);
+		return 1;
+	}
 
-	return;
+	return 0;
 }
 
-static void _fio_rbd_finish_sync_aiocb(rbd_completion_t comp, void *data)
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct rbd_data *rbd_data = td->io_ops->data;
+	unsigned int this_events = 0;
+	struct io_u *io_u;
+	int i;
 
-	/* if sync needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* TODO handle error */
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-	return;
-}
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
-static struct io_u *fio_rbd_event(struct thread_data *td, int event)
-{
-	struct rbd_data *rbd_data = td->io_ops->data;
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
+		}
+		if (*events >= min_evts)
+			break;
+	}
 
-	return rbd_data->aio_events[event];
+	return this_events;
 }
 
 static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 			     unsigned int max, const struct timespec *t)
 {
-	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
-	struct io_u *io_u;
-	int i;
-	struct fio_rbd_iou *fov;
+	unsigned int this_events, events = 0;
+	int wait = 0;
 
 	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
-
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
+		this_events = rbd_iter_events(td, &events, min, wait);
 
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
-
-		}
-		if (events < min)
-			usleep(100);
-		else
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -256,17 +262,18 @@ static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 
 static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
-	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	int r = -1;
 
 	fio_ro_check(td, io_u);
 
+	fri->io_complete = 0;
+	fri->io_seen = 0;
+
 	if (io_u->ddir == DDIR_WRITE) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,17 +281,17 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_READ) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,27 +299,28 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_SYNC) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
@@ -344,7 +352,6 @@ static int fio_rbd_init(struct thread_data *td)
 
 failed:
 	return 1;
-
 }
 
 static void fio_rbd_cleanup(struct thread_data *td)
@@ -379,8 +386,9 @@ static int fio_rbd_setup(struct thread_data *td)
 	}
 	td->io_ops->data = rbd_data;
 
-	/* librbd does not allow us to run first in the main thread and later in a
-	 * fork child. It needs to be the same process context all the time. 
+	/* librbd does not allow us to run first in the main thread and later
+	 * in a fork child. It needs to be the same process context all the
+	 * time. 
 	 */
 	td->o.use_thread = 1;
 
@@ -439,22 +447,21 @@ static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:25                               ` Jens Axboe
@ 2014-10-27 15:29                                 ` Ketor D
  2014-10-27 15:36                                   ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-27 15:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

I had just found the aio_get_return and aio_release bugs when you fixed them.
So fast!

But the test looks bad.

The number of bytes written is always zero.

2014-10-27 23:25 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 10/27/2014 09:22 AM, Jens Axboe wrote:
>> On 10/27/2014 09:12 AM, Ketor D wrote:
>>> The v5 patch does not work.
>>>
>>> Run 5 times:
>>> 3 times SIGSEGV
>>> 2 times NO IOPS, Endless loop
>>
>> Try this one, perhaps it's the wrong type passed for release. Typedefs
>> for the win (or not).
>>
>> This also fixes comp leaks, if the read/write/sync fails.
>
> The get_return was wrong too, here's -v7...
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:29                                 ` Ketor D
@ 2014-10-27 15:36                                   ` Jens Axboe
  2014-10-27 15:45                                     ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 15:36 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 10/27/2014 09:29 AM, Ketor D wrote:
> I just found the aio_get_return and aio_release bug, then you fix it. So fast!
> 
> But the test looks bad.
> 
> The write bytes is always zero..........

Looks like I need to set up a local test here, haven't run ceph/rbd
before... Can you put a debug printf() in _fio_rbd_finish_aiocb() and
dump what rbd_aio_get_return_value() returns?
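
A minimal sketch of such a debug dump, assuming the return value is held in an ssize_t as in the patch. The helper demo_format_ret() is hypothetical; it formats into a buffer so the result can be checked, and uses %zd, the printf conversion for the signed counterpart of size_t.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format a completion return value for debugging.
 * Returns the number of characters snprintf() would have written.
 */
static int demo_format_ret(char *buf, size_t len, ssize_t ret)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "ret=%zd", ret);
}
```

Inside the callback the equivalent one-liner would be a printf("ret=%zd\n", ret) placed right after the value is fetched.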

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:36                                   ` Jens Axboe
@ 2014-10-27 15:45                                     ` Ketor D
  2014-10-27 15:53                                       ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-27 15:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

The return code is 0 on success. I modified the code a bit and fio then runs
very well. I think if you fix this bug, the patch will be nearly perfect!!

	ret = rbd_aio_get_return_value(fri->completion);
	//printf("ret=%ld\n", ret);
	//if (ret != (int) io_u->xfer_buflen) {
	if (ret != 0) {
		if (ret >= 0) {
			io_u->resid = io_u->xfer_buflen - ret;
			io_u->error = 0;
		} else
			io_u->error = ret;
	}

2014-10-27 23:36 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 10/27/2014 09:29 AM, Ketor D wrote:
>> I just found the aio_get_return and aio_release bug, then you fix it. So fast!
>>
>> But the test looks bad.
>>
>> The write bytes is always zero..........
>
> Looks like I need to setup a local test here, haven't run ceph/rbd
> before... Can you put a debug printf() in _fio_rbd_finish_aiocb() and
> dump what rbd_aio_get_return_value() returns?
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:45                                     ` Ketor D
@ 2014-10-27 15:53                                       ` Jens Axboe
  2014-10-27 16:20                                         ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 15:53 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On 10/27/2014 09:45 AM, Ketor D wrote:
> The return code is 0 on success. I modified the code a bit and fio then runs
> very well. I think if you fix this bug, the patch will be nearly perfect!!
> 
> 	ret = rbd_aio_get_return_value(fri->completion);
> 	//printf("ret=%ld\n", ret);
> 	//if (ret != (int) io_u->xfer_buflen) {
> 	if (ret != 0) {
> 		if (ret >= 0) {
> 			io_u->resid = io_u->xfer_buflen - ret;
> 			io_u->error = 0;
> 		} else
> 			io_u->error = ret;
> 	}

Weird, so it does not do partial completions I assume. Modified -v8 to
take that into account, hopefully this just works out-of-the-box.
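
The mapping this implies can be sketched in isolation as follows. This is only an illustration under the assumption stated above (0 means success, a negative value is an error, no partial completions); struct demo_io_u is a hypothetical, trimmed-down stand-in for fio's struct io_u, and demo_apply_result() is not part of fio.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, trimmed-down stand-in for fio's struct io_u. */
struct demo_io_u {
	size_t xfer_buflen;	/* bytes requested */
	size_t resid;		/* bytes not transferred */
	int error;		/* 0 on success, otherwise the error code */
};

/*
 * Assumes the completion returns 0 on success and a negative value on
 * error, with no partial transfers: a failure leaves the whole buffer
 * as residual, a success clears the error and leaves resid untouched.
 */
static void demo_apply_result(struct demo_io_u *io_u, ssize_t ret)
{
	assert(io_u != NULL);

	if (ret < 0) {
		io_u->error = (int) ret;
		io_u->resid = io_u->xfer_buflen;
	} else
		io_u->error = 0;
}
```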

What do the performance numbers look like for your sync test with this?

-- 
Jens Axboe


[-- Attachment #2: rbd-comp-v8.patch --]
[-- Type: text/x-patch, Size: 9900 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..5160c32aedb0 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -30,35 +32,35 @@ struct rbd_options {
 
 static struct fio_option options[] = {
 	{
-	 .name     = "rbdname",
-	 .lname    = "rbd engine rbdname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "RBD name for RBD engine",
-	 .off1     = offsetof(struct rbd_options, rbd_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name		= "rbdname",
+		.lname		= "rbd engine rbdname",
+		.type		= FIO_OPT_STR_STORE,
+		.help		= "RBD name for RBD engine",
+		.off1		= offsetof(struct rbd_options, rbd_name),
+		.category	= FIO_OPT_C_ENGINE,
+		.group		= FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "pool",
-	 .lname    = "rbd engine pool",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the pool hosting the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, pool_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "pool",
+		.lname    = "rbd engine pool",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the pool hosting the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, pool_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "clientname",
-	 .lname    = "rbd engine clientname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the ceph client to access the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, client_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "clientname",
+		.lname    = "rbd engine clientname",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the ceph client to access the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, client_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name = NULL,
-	 },
+		.name = NULL,
+	},
 };
 
 static int _fio_setup_rbd_data(struct thread_data *td,
@@ -163,92 +165,99 @@ static void _fio_rbd_disconnect(struct rbd_data *rbd_data)
 	}
 }
 
-static void _fio_rbd_finish_write_aiocb(rbd_completion_t comp, void *data)
+static void _fio_rbd_finish_aiocb(rbd_completion_t comp, void *data)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
+	struct io_u *io_u = data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	ssize_t ret;
 
-	fio_rbd_iou->io_complete = 1;
+	fri->io_complete = 1;
 
-	/* if write needs to be verified - we should not release comp here
-	   without fetching the result */
+	/*
+	 * Looks like return value is 0 for success, or < 0 for
+	 * a specific error. So we have to assume that it can't do
+	 * partial completions.
+	 */
+	ret = rbd_aio_get_return_value(fri->completion);
+	if (ret < 0) {
+		io_u->error = ret;
+		io_u->resid = io_u->xfer_buflen;
+	} else
+		io_u->error = 0;
+}
 
-	rbd_aio_release(comp);
-	/* TODO handle error */
+static struct io_u *fio_rbd_event(struct thread_data *td, int event)
+{
+	struct rbd_data *rbd_data = td->io_ops->data;
 
-	return;
+	return rbd_data->aio_events[event];
 }
 
-static void _fio_rbd_finish_read_aiocb(rbd_completion_t comp, void *data)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	fio_rbd_iou->io_complete = 1;
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
 
-	/* if read needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
-
-	/* TODO handle error */
+		rbd_aio_release(fri->completion);
+		return 1;
+	}
 
-	return;
+	return 0;
 }
 
-static void _fio_rbd_finish_sync_aiocb(rbd_completion_t comp, void *data)
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct rbd_data *rbd_data = td->io_ops->data;
+	unsigned int this_events = 0;
+	struct io_u *io_u;
+	int i;
 
-	/* if sync needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* TODO handle error */
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-	return;
-}
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
-static struct io_u *fio_rbd_event(struct thread_data *td, int event)
-{
-	struct rbd_data *rbd_data = td->io_ops->data;
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
+		}
+		if (*events >= min_evts)
+			break;
+	}
 
-	return rbd_data->aio_events[event];
+	return this_events;
 }
 
 static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 			     unsigned int max, const struct timespec *t)
 {
-	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
-	struct io_u *io_u;
-	int i;
-	struct fio_rbd_iou *fov;
+	unsigned int this_events, events = 0;
+	int wait = 0;
 
 	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
+		this_events = rbd_iter_events(td, &events, min, wait);
 
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
-
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
-
-		}
-		if (events < min)
-			usleep(100);
-		else
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -256,17 +265,18 @@ static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 
 static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
-	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	int r = -1;
 
 	fio_ro_check(td, io_u);
 
+	fri->io_complete = 0;
+	fri->io_seen = 0;
+
 	if (io_u->ddir == DDIR_WRITE) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,17 +284,17 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_READ) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,27 +302,28 @@ static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_SYNC) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
@@ -344,7 +355,6 @@ static int fio_rbd_init(struct thread_data *td)
 
 failed:
 	return 1;
-
 }
 
 static void fio_rbd_cleanup(struct thread_data *td)
@@ -379,8 +389,9 @@ static int fio_rbd_setup(struct thread_data *td)
 	}
 	td->io_ops->data = rbd_data;
 
-	/* librbd does not allow us to run first in the main thread and later in a
-	 * fork child. It needs to be the same process context all the time. 
+	/* librbd does not allow us to run first in the main thread and later
+	 * in a fork child. It needs to be the same process context all the
+	 * time. 
 	 */
 	td->o.use_thread = 1;
 
@@ -439,22 +450,21 @@ static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 15:53                                       ` Jens Axboe
@ 2014-10-27 16:20                                         ` Ketor D
  2014-10-27 16:55                                           ` Jens Axboe
  2014-10-27 21:59                                           ` Mark Kirkwood
  0 siblings, 2 replies; 52+ messages in thread
From: Ketor D @ 2014-10-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

The v8 patch runs well.

IOPS is 33032. If I just comment out the usleep(100) in master, I
get 35245 IOPS.
CPU usage is the same for both tests, about 120%.
So maybe this patch could still be improved!

Compared with master, this patch is good enough!!


2014-10-27 23:53 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 10/27/2014 09:45 AM, Ketor D wrote:
>> The return code is 0 on success. I modified the code a bit and then fio runs very well.
>> I think if you fix this bug, the patch will be nearly perfect!!
>>
>> ret = rbd_aio_get_return_value(fri->completion);
>> //printf("ret=%ld\n", ret);
>> //if (ret != (int) io_u->xfer_buflen) {
>> if (ret != 0) {
>>         if (ret >= 0) {
>>                 io_u->resid = io_u->xfer_buflen - ret;
>>                 io_u->error = 0;
>>         } else
>>                 io_u->error = ret;
>> }
>
> Weird, so it does not do partial completions I assume. Modified -v8 to
> take that into account, hopefully this just works out-of-the-box.
>
> What do the performance numbers look like for your sync test with this?
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 52+ messages in thread
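
A minimal standalone sketch of the return-value handling discussed in the
message above. It assumes only what is reported in this thread: the
completion's return value is 0 on success, negative on error, and any
positive value is treated as the byte count of a short transfer. The names
struct io_result and classify_ret() are hypothetical and are not part of
fio or librbd.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two io_u fields touched in the snippet. */
struct io_result {
	size_t resid;	/* bytes not transferred */
	int error;	/* 0, or a negative errno-style code */
};

/*
 * Interpret a completion return value the way the quoted snippet does:
 * 0 means the whole buffer was transferred, a positive value is treated
 * as a short transfer of 'ret' bytes, a negative value is an error.
 */
static struct io_result classify_ret(long ret, size_t xfer_buflen)
{
	struct io_result res = { 0, 0 };

	if (ret > 0) {
		size_t done = (size_t) ret;

		/* Clamp so resid can never underflow. */
		res.resid = done < xfer_buflen ? xfer_buflen - done : 0;
	} else if (ret < 0) {
		res.error = (int) ret;
	}
	return res;
}
```

For a 4096-byte request, classify_ret(0, 4096) reports no residual and no
error, classify_ret(1024, 4096) reports 3072 residual bytes, and
classify_ret(-5, 4096) reports error -5.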

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 16:20                                         ` Ketor D
@ 2014-10-27 16:55                                           ` Jens Axboe
  2014-10-27 21:59                                           ` Mark Kirkwood
  1 sibling, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 16:55 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 10/27/2014 10:20 AM, Ketor D wrote:
> The v8 patch runs well.
> 
> IOPS is 33032. If I just comment out the usleep(100) in master, I
> get 35245 IOPS.
> CPU usage is the same for both tests, about 120%.
> So maybe this patch could still be improved!
> 
> Compared with master, this patch is good enough!!

Agreed, committed. I'll set up a local test here and see if we can't
recoup those last percentages. CPU usage may have been the same for your
test, but it will be more for others. A busy loop in there is not a good
idea.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread
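
A minimal sketch of the trade-off raised in the message above: a busy loop
versus a blocking wait. It is not the fio or librbd implementation; struct
completion and the completion_*() helpers are hypothetical names, built
here on plain POSIX mutexes and condition variables.

```c
#include <pthread.h>

/* Hypothetical completion object; not the librbd or fio type. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int done;
};

static void completion_init(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Called from the I/O callback thread when the request finishes. */
static void completion_signal(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Busy poll: lowest wakeup latency, but spins on a CPU while waiting. */
static void completion_wait_busy(struct completion *c)
{
	for (;;) {
		pthread_mutex_lock(&c->lock);
		int done = c->done;
		pthread_mutex_unlock(&c->lock);
		if (done)
			return;
	}
}

/* Blocking wait: the thread sleeps until completion_signal() runs. */
static void completion_wait_block(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```

Both waiters return once completion_signal() has been called. The blocking
variant pays for a sleep/wakeup cycle, while the busy variant keeps a core
occupied for the whole wait, which is the cost objected to above.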

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 16:20                                         ` Ketor D
  2014-10-27 16:55                                           ` Jens Axboe
@ 2014-10-27 21:59                                           ` Mark Kirkwood
  2014-10-27 22:32                                             ` Jens Axboe
  1 sibling, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-27 21:59 UTC (permalink / raw)
  To: Ketor D, Jens Axboe
  Cc: Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 28/10/14 05:20, Ketor D wrote:
> The v8 patch runs well.
>
> IOPS is 33032. If I just comment out the usleep(100) in master, I
> get 35245 IOPS.
> CPU usage is the same for both tests, about 120%.
> So maybe this patch could still be improved!
>

Yeah, v8 is working for me.

I'm seeing it a bit slower for some blocksizes, but faster (or perhaps 
about the same within repeat measurement error) for others:

blocksize k |  patched iops | orig iops
------------+---------------+-----------
4           | 12265         | 11930
128         |  5800         |  7100
1024        |  1193         |  1196

Regards

Mark


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 21:59                                           ` Mark Kirkwood
@ 2014-10-27 22:32                                             ` Jens Axboe
  2014-10-27 23:21                                               ` Mark Kirkwood
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-27 22:32 UTC (permalink / raw)
  To: Mark Kirkwood, Ketor D
  Cc: Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 10/27/2014 03:59 PM, Mark Kirkwood wrote:
> On 28/10/14 05:20, Ketor D wrote:
>> The v8 patch runs well.
>>
>> IOPS is 33032. If I just comment out the usleep(100) in master, I
>> get 35245 IOPS.
>> CPU usage is the same for both tests, about 120%.
>> So maybe this patch could still be improved!
>>
> 
> Yeah, v8 is working for me.
> 
> I'm seeing it a bit slower for some blocksizes, but faster (or perhaps
> about the same within repeat measurement error) for others:
> 
> blocksize k |  patched iops | orig iops
> ------------+---------------+-----------
> 4           | 12265         | 11930
> 128         |  5800         |  7100
> 1024        |  1193         |  1196

As for most things, the difference should be in IOPS, not bandwidth. So
I would assume that the above are within normal variance, since 4k
should show the biggest difference, then drop off after that and match
at 128/1024k.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread
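
A small worked conversion for the table quoted above, assuming the
"blocksize k" column is in KiB. The helper iops_to_mib_per_sec() is a
hypothetical name used only for this illustration. It shows why the large
block sizes are better read as bandwidth: 5800 IOPS at 128k is 725 MiB/s
and 1193 IOPS at 1024k is 1193 MiB/s, while 12265 IOPS at 4k is only about
47.9 MiB/s.

```c
/*
 * Convert an IOPS figure at a given block size (in KiB) to MiB/s.
 * Hypothetical helper for illustration; not part of fio.
 */
static double iops_to_mib_per_sec(unsigned int bs_kib, unsigned int iops)
{
	return (double) bs_kib * (double) iops / 1024.0;
}
```

At 4k the per-request overhead dominates, so changes in the completion
path should show up there first; at 128k and 1024k the same number of
requests moves far more data per second.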

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 22:32                                             ` Jens Axboe
@ 2014-10-27 23:21                                               ` Mark Kirkwood
  2014-10-28  3:23                                                 ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-27 23:21 UTC (permalink / raw)
  To: Jens Axboe, Ketor D
  Cc: Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 28/10/14 11:32, Jens Axboe wrote:
> On 10/27/2014 03:59 PM, Mark Kirkwood wrote:
>> On 28/10/14 05:20, Ketor D wrote:
>>> The v8 patch runs well.
>>>
>>> IOPS is 33032. If I just comment out the usleep(100) in master, I
>>> get 35245 IOPS.
>>> CPU usage is the same for both tests, about 120%.
>>> So maybe this patch could still be improved!
>>>
>>
>> Yeah, v8 is working for me.
>>
>> I'm seeing it a bit slower for some blocksizes, but faster (or perhaps
>> about the same within repeat measurement error) for others:
>>
>> blocksize k |  patched iops | orig iops
>> ------------+---------------+-----------
>> 4           | 12265         | 11930
>> 128         |  5800         |  7100
>> 1024        |  1193         |  1196
>
> As for most things, the difference should be in IOPS, not bandwidth. So
> I would assume that the above are within normal variance, since 4k
> should show the biggest difference, then drop off after that and match
> at 128/1024k.
>


Yeah, I suspect the 4K numbers are the same as we are bottlenecked by 
Ceph's small blocksize performance, not fio itself. If Ketor has a setup 
that can get higher IOPS @4K it would be interesting to see his numbers 
for patched vs orig!

Cheers

Mark


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-27 23:21                                               ` Mark Kirkwood
@ 2014-10-28  3:23                                                 ` Ketor D
  2014-10-28  4:01                                                   ` Mark Kirkwood
  2014-10-28  4:05                                                   ` Jens Axboe
  0 siblings, 2 replies; 52+ messages in thread
From: Ketor D @ 2014-10-28  3:23 UTC (permalink / raw)
  To: Mark Kirkwood
  Cc: Jens Axboe, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 209 bytes --]

Hi Mark,
      I hope you can test my patch. I get the best performance using it.


2014-10-28 7:21 GMT+08:00 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>:
> ng to see his numbers for patched vs orig!

[-- Attachment #2: rbd_no_usleep.patch --]
[-- Type: application/octet-stream, Size: 290 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8..e6f2dff 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -245,7 +245,7 @@ static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 
 		}
 		if (events < min)
-			usleep(100);
+			;//usleep(100);
 		else
 			break;
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28  3:23                                                 ` Ketor D
@ 2014-10-28  4:01                                                   ` Mark Kirkwood
  2014-10-28  4:05                                                   ` Jens Axboe
  1 sibling, 0 replies; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-28  4:01 UTC (permalink / raw)
  To: Ketor D
  Cc: Jens Axboe, Mark Nelson, Mark Nelson, fio, xan.peng,
	ceph-devel@vger.kernel.org

On 28/10/14 16:23, Ketor D wrote:
> Hi Mark,
>        I hope you can test my patch. I get the best performance using it.
>
>

It is not clear cut for me (tested reads only):

blocksize k | v8 patched iops | Ketor patch iops | orig iops
------------+-----------------+------------------+-----------
4           | 12265           | 11930            |  11516
128         |  5800           |  7100            |   6550
1024        |  1193           |  1196            |   1248

Cheers

Mark


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28  3:23                                                 ` Ketor D
  2014-10-28  4:01                                                   ` Mark Kirkwood
@ 2014-10-28  4:05                                                   ` Jens Axboe
  2014-10-28  4:49                                                     ` Ketor D
  1 sibling, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-28  4:05 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org


> On Oct 27, 2014, at 9:23 PM, Ketor D <d.ketor@gmail.com> wrote:
> 
> Hi Mark,
>      I hope you can test my patch. I get the best performance using it.

There's no way we're doing a busy loop, sorry. As mentioned in a previous email, it'd be great if you would work off current git and potentially improve that. 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28  4:05                                                   ` Jens Axboe
@ 2014-10-28  4:49                                                     ` Ketor D
  2014-10-28 15:14                                                       ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-28  4:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

Agreed. The busy loop is only for testing.
I will try the current git.

Thanks!

2014-10-28 12:05 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
>
>> On Oct 27, 2014, at 9:23 PM, Ketor D <d.ketor@gmail.com> wrote:
>>
>> Hi Mark,
>>      I hope you can test my patch. I get the best performance using it.
>
> There's no way we're doing a busy loop, sorry. As mentioned in a previous email, it'd be great if you would work off current git and potentially improve that.
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28  4:49                                                     ` Ketor D
@ 2014-10-28 15:14                                                       ` Jens Axboe
  2014-10-28 15:49                                                         ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-28 15:14 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

On 2014-10-27 22:49, Ketor D wrote:
> Agreed. The busy loop is only for testing.
> I will try the current git.

Committed two more rbd changes:

- Add support for rbd_invalidate_cache() (if it exists)
- Use rbd_aio_is_complete() instead of using fri->io_complete. The 
latter should have some locking to ensure it's always seen, so it's 
better to use the API provided function to determine whether this IO is 
done or not.

Unless we often hit the complete race, I would not expect this to make 
much of a difference. But it's worth testing in any case, especially 
since my two attempts at setting up ceph + rbd have failed miserably. So 
I still can't test myself.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread
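
A minimal sketch of why a plain completion flag "should have some
locking", as the message above puts it. The fix described in the thread is
to call rbd_aio_is_complete(); this example does not reproduce that and
instead illustrates the general problem with C11 atomics. struct io_state
and the io_state_*() helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-I/O state; stands in for an engine's private struct. */
struct io_state {
	long result;		/* written by the callback thread */
	atomic_bool complete;	/* published with release semantics */
};

static void io_state_init(struct io_state *s)
{
	s->result = 0;
	atomic_init(&s->complete, false);
}

/* Callback side: store the result first, then publish the flag. */
static void io_state_finish(struct io_state *s, long result)
{
	s->result = result;
	atomic_store_explicit(&s->complete, true, memory_order_release);
}

/* Poller side: read 'result' only after an acquire load sees the flag. */
static bool io_state_poll(struct io_state *s, long *result)
{
	if (!atomic_load_explicit(&s->complete, memory_order_acquire))
		return false;
	*result = s->result;
	return true;
}
```

With a plain int flag and no lock or atomic, nothing guarantees that the
polling thread sees the flag promptly or sees the result that was written
before it. The release/acquire pair above is one way to get that ordering
guarantee; a mutex is another.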

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28 15:14                                                       ` Jens Axboe
@ 2014-10-28 15:49                                                         ` Ketor D
  2014-10-28 15:53                                                           ` Jens Axboe
  2014-10-28 17:09                                                           ` Jens Axboe
  0 siblings, 2 replies; 52+ messages in thread
From: Ketor D @ 2014-10-28 15:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

I cannot get the newly committed code from GitHub yet.
I will test once I have the newest code.

2014-10-28 23:14 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 2014-10-27 22:49, Ketor D wrote:
>>
>> Agreed. The busy loop is only for testing.
>> I will try the current git.
>
>
> Committed two more rbd changes:
>
> - Add support for rbd_invalidate_cache() (if it exists)
> - Use rbd_aio_is_complete() instead of using fri->io_complete. The latter
> should have some locking to ensure it's always seen, so it's better to use
> the API provided function to determine whether this IO is done or not.
>
> Unless we often hit the complete race, I would not expect this to make much
> of a difference. But it's worth testing in any case, especially since my two
> attempts at setting up ceph + rbd have failed miserably. So I still can't
> test myself.
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28 15:49                                                         ` Ketor D
@ 2014-10-28 15:53                                                           ` Jens Axboe
  2014-10-28 17:09                                                           ` Jens Axboe
  1 sibling, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-28 15:53 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

On 2014-10-28 09:49, Ketor D wrote:
> I cannot get the newly committed code from GitHub yet.
> I will test once I have the newest code.

github is just a mirror, I push to:

git://git.kernel.dk/fio

Github is pushed automatically every hour, if there are changes. So it 
may lag an hour. Should be there now, though.


-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28 15:49                                                         ` Ketor D
  2014-10-28 15:53                                                           ` Jens Axboe
@ 2014-10-28 17:09                                                           ` Jens Axboe
  2014-10-28 18:43                                                             ` Ketor D
  1 sibling, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-28 17:09 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 610 bytes --]

On 2014-10-28 09:49, Ketor D wrote:
> I cannot get the newly committed code from GitHub yet.
> I will test once I have the newest code.

So here's another idea, applies on top of current -git. Basically it 
makes rbd wait for the oldest event, not just the first one in the array 
of all ios. This is the saner thing to do, as hopefully the oldest event 
will be the one to complete first. At least it has a much higher chance 
of being the right thing to do, than just waiting on a random event.

Completely untested, so you might have to fiddle a bit with it to ensure 
that it actually works...

-- 
Jens Axboe


[-- Attachment #2: rbd-time-sort.patch --]
[-- Type: text/x-patch, Size: 3060 bytes --]

diff --git a/engines/rbd.c b/engines/rbd.c
index cf7be0acd1e3..f3129044c430 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -20,6 +20,7 @@ struct rbd_data {
 	rados_ioctx_t io_ctx;
 	rbd_image_t image;
 	struct io_u **aio_events;
+	struct io_u **sort_events;
 };
 
 struct rbd_options {
@@ -80,20 +81,19 @@ static int _fio_setup_rbd_data(struct thread_data *td,
 	if (td->io_ops->data)
 		return 0;
 
-	rbd_data = malloc(sizeof(struct rbd_data));
+	rbd_data = calloc(1, sizeof(struct rbd_data));
 	if (!rbd_data)
 		goto failed;
 
-	memset(rbd_data, 0, sizeof(struct rbd_data));
-
-	rbd_data->aio_events = malloc(td->o.iodepth * sizeof(struct io_u *));
+	rbd_data->aio_events = calloc(td->o.iodepth, sizeof(struct io_u *));
 	if (!rbd_data->aio_events)
 		goto failed;
 
-	memset(rbd_data->aio_events, 0, td->o.iodepth * sizeof(struct io_u *));
+	rbd_data->sort_events = calloc(td->o.iodepth, sizeof(struct io_u *));
+	if (!rbd_data->sort_events)
+		goto failed;
 
 	*rbd_data_ptr = rbd_data;
-
 	return 0;
 
 failed:
@@ -218,14 +218,32 @@ static inline int fri_check_complete(struct rbd_data *rbd_data,
 	return 0;
 }
 
+static int rbd_io_u_cmp(const void *p1, const void *p2)
+{
+	const struct io_u **a = (const struct io_u **) p1;
+	const struct io_u **b = (const struct io_u **) p2;
+	uint64_t at, bt;
+
+	at = utime_since_now(&(*a)->start_time);
+	bt = utime_since_now(&(*b)->start_time);
+
+	if (at < bt)
+		return -1;
+	else if (at == bt)
+		return 0;
+	else
+		return 1;
+}
+
 static int rbd_iter_events(struct thread_data *td, unsigned int *events,
 			   unsigned int min_evts, int wait)
 {
 	struct rbd_data *rbd_data = td->io_ops->data;
 	unsigned int this_events = 0;
 	struct io_u *io_u;
-	int i;
+	int i, sort_idx;
 
+	sort_idx = 0;
 	io_u_qiter(&td->io_u_all, io_u, i) {
 		struct fio_rbd_iou *fri = io_u->engine_data;
 
@@ -236,16 +254,39 @@ static int rbd_iter_events(struct thread_data *td, unsigned int *events,
 
 		if (fri_check_complete(rbd_data, io_u, events))
 			this_events++;
-		else if (wait) {
-			rbd_aio_wait_for_complete(fri->completion);
+		else if (wait)
+			rbd_data->sort_events[sort_idx++] = io_u;
 
-			if (fri_check_complete(rbd_data, io_u, events))
-				this_events++;
-		}
 		if (*events >= min_evts)
 			break;
 	}
 
+	if (!wait || !sort_idx)
+		return this_events;
+
+	qsort(rbd_data->sort_events, sort_idx, sizeof(struct io_u *), rbd_io_u_cmp);
+	for (i = 0; i < sort_idx; i++) {
+		struct fio_rbd_iou *fri;
+
+		io_u = rbd_data->sort_events[i];
+		fri = io_u->engine_data;
+
+		if (fri_check_complete(rbd_data, io_u, events)) {
+			this_events++;
+			continue;
+		}
+		if (!wait)
+			continue;
+
+		rbd_aio_wait_for_complete(fri->completion);
+
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+
+		if (wait && *events >= min_evts)
+			wait = 0;
+	}
+
 	return this_events;
 }
 
@@ -359,6 +400,7 @@ static void fio_rbd_cleanup(struct thread_data *td)
 	if (rbd_data) {
 		_fio_rbd_disconnect(rbd_data);
 		free(rbd_data->aio_events);
+		free(rbd_data->sort_events);
 		free(rbd_data);
 	}
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread
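
A minimal standalone sketch of the "wait for the oldest event" idea from
the message above. It is not the attached patch: the patch compares
elapsed times obtained from utime_since_now(), whereas this sketch
compares start timestamps directly so that the earliest-issued request
sorts to index 0. struct pending_io, pending_io_cmp() and
sort_oldest_first() are hypothetical names.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical pending-request record; not fio's struct io_u. */
struct pending_io {
	int id;
	struct timespec start_time;	/* when the request was issued */
};

/* qsort comparator on an array of pointers: earliest start_time first. */
static int pending_io_cmp(const void *p1, const void *p2)
{
	const struct pending_io *a = *(const struct pending_io * const *) p1;
	const struct pending_io *b = *(const struct pending_io * const *) p2;

	if (a->start_time.tv_sec != b->start_time.tv_sec)
		return a->start_time.tv_sec < b->start_time.tv_sec ? -1 : 1;
	if (a->start_time.tv_nsec != b->start_time.tv_nsec)
		return a->start_time.tv_nsec < b->start_time.tv_nsec ? -1 : 1;
	return 0;
}

/* Sort the array in place so that index 0 is the oldest request. */
static void sort_oldest_first(struct pending_io **ios, size_t nr)
{
	qsort(ios, nr, sizeof(*ios), pending_io_cmp);
}
```

A caller would then walk the sorted array from index 0 and block on each
request in turn, stopping once enough completions have been collected.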

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28 17:09                                                           ` Jens Axboe
@ 2014-10-28 18:43                                                             ` Ketor D
  2014-10-29  7:15                                                               ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-28 18:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

Yeah, the new wait strategy looks better.

I will test the patch soon.

2014-10-29 1:09 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 2014-10-28 09:49, Ketor D wrote:
>>
>> I cannot get the newly committed code from GitHub yet.
>> I will test once I have the newest code.
>
>
> So here's another idea, applies on top of current -git. Basically it makes
> rbd wait for the oldest event, not just the first one in the array of all
> ios. This is the saner thing to do, as hopefully the oldest event will be
> the one to complete first. At least it has a much higher chance of being the
> right thing to do, than just waiting on a random event.
>
> Completely untested, so you might have to fiddle a bit with it to ensure
> that it actually works...
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-28 18:43                                                             ` Ketor D
@ 2014-10-29  7:15                                                               ` Ketor D
  2014-10-29 14:31                                                                 ` Jens Axboe
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-29  7:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

Hi, Jens,

There is a command-line parsing bug in the fio rbd test.

I have fixed this and created a pull request on GitHub.

Please review.

After fixing the bugs, the fio test can run.


2014-10-29 2:43 GMT+08:00 Ketor D <d.ketor@gmail.com>:
> Yeah, the new wait strategy looks better.
>
> I will test the patch soon.
>
> 2014-10-29 1:09 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
>> On 2014-10-28 09:49, Ketor D wrote:
>>>
>>> I cannot get the newly committed code from GitHub yet.
>>> I will test once I have the newest code.
>>
>>
>> So here's another idea, applies on top of current -git. Basically it makes
>> rbd wait for the oldest event, not just the first one in the array of all
>> ios. This is the saner thing to do, as hopefully the oldest event will be
>> the one to complete first. At least it has a much higher chance of being the
>> right thing to do, than just waiting on a random event.
>>
>> Completely untested, so you might have to fiddle a bit with it to ensure
>> that it actually works...
>>
>> --
>> Jens Axboe
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-29  7:15                                                               ` Ketor D
@ 2014-10-29 14:31                                                                 ` Jens Axboe
  2014-10-30  2:50                                                                   ` Ketor D
  2014-10-30  7:44                                                                   ` Mark Kirkwood
  0 siblings, 2 replies; 52+ messages in thread
From: Jens Axboe @ 2014-10-29 14:31 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

On 2014-10-29 01:15, Ketor D wrote:
> Hi, Jens,
>
> There is a command-line parsing bug in the fio rbd test.
>
> I have fixed this and created a pull request on GitHub.
>
> Please review.
>
> After fixing the bugs, the fio test can run.

I merged your two pull requests (thanks!) and committed a polished 
variant of the sort patch. Ketor and Mark, would you mind both running a 
quick benchmark on the current -git head?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-29 14:31                                                                 ` Jens Axboe
@ 2014-10-30  2:50                                                                   ` Ketor D
  2014-10-30  2:55                                                                     ` Jens Axboe
  2014-10-30  7:44                                                                   ` Mark Kirkwood
  1 sibling, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-30  2:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

Hi Jens,

The current code runs well!

Test Conf: jobs=1 iodepth=1 bs=4k

if busy_poll = 1, IOPS is 38989.
if busy_poll = 0, IOPS is 33031.

And the busy_poll=0 result looks no different from the old code that
does not have the sorted event wait.



2014-10-29 22:31 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 2014-10-29 01:15, Ketor D wrote:
>>
>> Hi, Jens,
>>
>> There is a command-line parsing bug in the fio rbd test.
>>
>> I have fixed this and created a pull request on GitHub.
>>
>> Please review.
>>
>> After fixing the bugs, the fio test can run.
>
>
> I merged your two pull requests (thanks!) and committed a polished variant
> of the sort patch. Ketor and Mark, would you mind both running a quick
> benchmark on the current -git head?
>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 52+ messages in thread
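
A small worked conversion for the busy_poll numbers in the message above.
It assumes that with jobs=1 and iodepth=1 each request completes before
the next one is issued, so the average per-request latency is roughly
1/IOPS. The helper usec_per_io() is a hypothetical name for this
illustration only.

```c
/*
 * Average microseconds per request at queue depth 1, derived from IOPS.
 * Hypothetical helper for illustration; not part of fio.
 */
static double usec_per_io(double iops)
{
	return 1000000.0 / iops;
}
```

Under that assumption, 38989 IOPS corresponds to about 25.6 usec per
request and 33031 IOPS to about 30.3 usec, so busy polling saves roughly
4.6 usec per request in this test.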

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-30  2:50                                                                   ` Ketor D
@ 2014-10-30  2:55                                                                     ` Jens Axboe
  2014-10-30  5:29                                                                       ` Ketor D
  0 siblings, 1 reply; 52+ messages in thread
From: Jens Axboe @ 2014-10-30  2:55 UTC (permalink / raw)
  To: Ketor D
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

On 2014-10-29 20:50, Ketor D wrote:
> Hi Jens,
>
> The current code runs well!
>
> Test Conf: jobs=1 iodepth=1 bs=4k
>
> if busy_poll = 1, IOPS is 38989.
> if busy_poll = 0, IOPS is 33031.
>
> And the busy_poll=0 result looks no different from the old code that
> does not have the sorted event wait.

Good to hear! I think we can safely say that we've pushed rbd as far as 
we can. At least on the fio side. Still appears to be some suboptimal 
parts of the librbd API. And the kernel rbd driver could definitely be 
improved a lot as well. Using busy_poll on the user side gets rid of a 
sleep/wakeup cycle at that end, but the kernel driver still punts and 
offloads any work item to a work queue...

Thanks for all your testing through this, really appreciated!

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-30  2:55                                                                     ` Jens Axboe
@ 2014-10-30  5:29                                                                       ` Ketor D
  0 siblings, 0 replies; 52+ messages in thread
From: Ketor D @ 2014-10-30  5:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mark Kirkwood, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

Thanks, I feel greatly honoured to have finished these tests and am happy
to help improve fio.
I am also trying to decrease the librbd latency, and I have made some small progress.

2014-10-30 10:55 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 2014-10-29 20:50, Ketor D wrote:
>>
>> Hi Jens,
>>
>> The current code runs well!
>>
>> Test Conf: jobs=1 iodepth=1 bs=4k
>>
>> if busy_poll = 1, IOPS is 38989.
>> if busy_poll = 0, IOPS is 33031.
>>
>> And the busy_poll=0 result looks no different from the old code that
>> does not have the sorted event wait.
>
>
> Good to hear! I think we can safely say that we've pushed rbd as far as we
> can. At least on the fio side. Still appears to be some suboptimal parts of
> the librbd API. And the kernel rbd driver could definitely be improved a lot
> as well. Using busy_poll on the user side gets rid of a sleep/wakeup cycle
> at that end, but the kernel driver still punts and offloads any work item to
> a work queue...
>
> Thanks for all your testing through this, really appreciated!
>
> --
> Jens Axboe
>



* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-29 14:31                                                                 ` Jens Axboe
  2014-10-30  2:50                                                                   ` Ketor D
@ 2014-10-30  7:44                                                                   ` Mark Kirkwood
  2014-10-30  8:04                                                                     ` Ketor D
  1 sibling, 1 reply; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-30  7:44 UTC (permalink / raw)
  To: Jens Axboe, Ketor D
  Cc: Mark Nelson, Mark Nelson, fio@vger.kernel.org, xan.peng,
	ceph-devel@vger.kernel.org

On 30/10/14 03:31, Jens Axboe wrote:
> On 2014-10-29 01:15, Ketor D wrote:
>> Hi, Jens,
>>
>> There is a cmdline parsing bug in the fio rbd test.
>>
>> I have fixed it and created a pull request on GitHub.
>>
>> Please review.
>>
>> After fixing the bugs, the fio test can run.
>
> I merged your two pull requests (thanks!) and committed a polished
> variant of the sort patch. Ketor and Mark, would you mind both running a
> quick benchmark on the current -git head?
>

Better late than never (sorry), comparing with the 'original' fio code 
containing the usleep(100):

blocksize k | head iops     | orig iops
------------+---------------+--------------
4           |  11114        |  11516
128         |   4551        |   6550
1024        |   1195        |   1248

So we do pretty much the same except in the middle blocksize range (I 
checked again with the old binary just to rule out any other changes on 
the ceph end...).
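
For reference, a small sketch of the arithmetic behind that comparison, using only the IOPS values from the table above. The variable and function names are illustrative.

```python
# IOPS from the table above: blocksize in KB -> (head, orig).
results = {4: (11114, 11516), 128: (4551, 6550), 1024: (1195, 1248)}

def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0

changes = {bs: pct_change(head, orig) for bs, (head, orig) in results.items()}
for bs, delta in changes.items():
    print(f"{bs:>5}k: head vs orig {delta:+.1f}%")
```

This gives about -3.5% at 4k and -4.2% at 1024k, against roughly -30.5% at 128k, which is the mid-range gap described above.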

Regards

Mark



* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-30  7:44                                                                   ` Mark Kirkwood
@ 2014-10-30  8:04                                                                     ` Ketor D
  2014-10-31  8:54                                                                       ` Mark Kirkwood
  0 siblings, 1 reply; 52+ messages in thread
From: Ketor D @ 2014-10-30  8:04 UTC (permalink / raw)
  To: Mark Kirkwood
  Cc: Jens Axboe, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

Hi Mark,
      Could you run a fio test in your environment with busy_poll=1?
I am very interested in the busy_poll result. Thanks!

2014-10-30 15:44 GMT+08:00 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>:
> On 30/10/14 03:31, Jens Axboe wrote:
>>
>> On 2014-10-29 01:15, Ketor D wrote:
>>>
>>> Hi, Jens,
>>>
>>> There is a cmdline parsing bug in the fio rbd test.
>>>
>>> I have fixed it and created a pull request on GitHub.
>>>
>>> Please review.
>>>
>>> After fixing the bugs, the fio test can run.
>>
>>
>> I merged your two pull requests (thanks!) and committed a polished
>> variant of the sort patch. Ketor and Mark, would you mind both running a
>> quick benchmark on the current -git head?
>>
>
> Better late than never (sorry), comparing with the 'original' fio code
> containing the usleep(100):
>
> blocksize k | head iops     | orig iops
> ------------+---------------+--------------
> 4           |  11114        |  11516
> 128         |   4551        |   6550
> 1024        |   1195        |   1248
>
> So we do pretty much the same except in the middle blocksize range (I
> checked again with the old binary just to rule out any other changes on the
> ceph end...).
>
> Regards
>
> Mark



* Re: fio rbd completions (Was: fio rbd hang for block sizes > 1M)
  2014-10-30  8:04                                                                     ` Ketor D
@ 2014-10-31  8:54                                                                       ` Mark Kirkwood
  0 siblings, 0 replies; 52+ messages in thread
From: Mark Kirkwood @ 2014-10-31  8:54 UTC (permalink / raw)
  To: Ketor D
  Cc: Jens Axboe, Mark Nelson, Mark Nelson, fio@vger.kernel.org,
	xan.peng, ceph-devel@vger.kernel.org

On 30/10/14 21:04, Ketor D wrote:
> Hi Mark,
>        Could you run a fio test in your environment with busy_poll=1?
> I am very interested in the busy_poll result. Thanks!
>

Sure:

blocksize k | head iops     |  head iops (busy_poll=1)
------------+---------------+--------------------------
4           |  11114        |  12608
128         |   4551        |   6422
1024        |   1195        |   1175
4096        |    320        |    316

So it looks like busy_poll=1 improves performance for small and mid 
range blocksizes but is a little slower at the larger end.

However, there are a lot of variables here. I'm using iodepth=32, for 
instance, and altering that may change the pattern I'm seeing. A system 
with more OSDs may also bring out different behaviours, as it runs the 
fio client out of available CPU power at the smaller block sizes.
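
A quick sketch of how the table translates into relative gain and approximate bandwidth, assuming the blocksize column is in KiB (an assumption on my part; the table only says "k"). Names are illustrative.

```python
# IOPS from the table above: blocksize in KB -> (head, head with busy_poll=1).
results = {
    4: (11114, 12608),
    128: (4551, 6422),
    1024: (1195, 1175),
    4096: (320, 316),
}

def bandwidth_mib_s(iops, bs_kb):
    """Approximate bandwidth in MiB/s, assuming bs_kb is in KiB."""
    return iops * bs_kb / 1024

rows = []
for bs_kb, (head, busy) in results.items():
    gain_pct = (busy - head) / head * 100.0
    rows.append((bs_kb, gain_pct,
                 bandwidth_mib_s(head, bs_kb),
                 bandwidth_mib_s(busy, bs_kb)))

for bs_kb, gain_pct, bw_head, bw_busy in rows:
    print(f"{bs_kb:>5}k: busy_poll gain {gain_pct:+.1f}%, "
          f"{bw_head:.0f} -> {bw_busy:.0f} MiB/s")
```

On these numbers busy_poll=1 is worth roughly +13% at 4k and +41% at 128k, and costs a percent or two at 1024k and 4096k, where the run is already moving on the order of 1.2 GiB/s.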

Regards

Mark





Thread overview: 52+ messages
2014-10-24  2:38 fio rbd hang for block sizes > 1M Mark Kirkwood
2014-10-24  5:35 ` Jens Axboe
2014-10-24  6:17   ` Mark Kirkwood
2014-10-24 13:19     ` Mark Nelson
2014-10-24 14:09       ` Mark Nelson
2014-10-24 14:30         ` Jens Axboe
2014-10-24 22:45         ` Mark Kirkwood
2014-10-25  0:12           ` Mark Nelson
2014-10-25  0:37             ` Mark Kirkwood
2014-10-25  2:35               ` Mark Kirkwood
2014-10-25  3:47                 ` Jens Axboe
2014-10-25  4:50                   ` fio rbd completions (Was: fio rbd hang for block sizes > 1M) Mark Kirkwood
2014-10-25 19:20                     ` Jens Axboe
2014-10-25 22:25                       ` Mark Kirkwood
2014-10-27  9:27                         ` Ketor D
2014-10-27 10:25                           ` Ketor D
2014-10-27 14:19                             ` Jens Axboe
2014-10-27 14:15                           ` Jens Axboe
2014-10-27 14:19                         ` Jens Axboe
2014-10-27 15:12                           ` Ketor D
2014-10-27 15:22                             ` Jens Axboe
2014-10-27 15:25                               ` Jens Axboe
2014-10-27 15:29                                 ` Ketor D
2014-10-27 15:36                                   ` Jens Axboe
2014-10-27 15:45                                     ` Ketor D
2014-10-27 15:53                                       ` Jens Axboe
2014-10-27 16:20                                         ` Ketor D
2014-10-27 16:55                                           ` Jens Axboe
2014-10-27 21:59                                           ` Mark Kirkwood
2014-10-27 22:32                                             ` Jens Axboe
2014-10-27 23:21                                               ` Mark Kirkwood
2014-10-28  3:23                                                 ` Ketor D
2014-10-28  4:01                                                   ` Mark Kirkwood
2014-10-28  4:05                                                   ` Jens Axboe
2014-10-28  4:49                                                     ` Ketor D
2014-10-28 15:14                                                       ` Jens Axboe
2014-10-28 15:49                                                         ` Ketor D
2014-10-28 15:53                                                           ` Jens Axboe
2014-10-28 17:09                                                           ` Jens Axboe
2014-10-28 18:43                                                             ` Ketor D
2014-10-29  7:15                                                               ` Ketor D
2014-10-29 14:31                                                                 ` Jens Axboe
2014-10-30  2:50                                                                   ` Ketor D
2014-10-30  2:55                                                                     ` Jens Axboe
2014-10-30  5:29                                                                       ` Ketor D
2014-10-30  7:44                                                                   ` Mark Kirkwood
2014-10-30  8:04                                                                     ` Ketor D
2014-10-31  8:54                                                                       ` Mark Kirkwood
2014-10-24 22:30       ` fio rbd hang for block sizes > 1M Mark Kirkwood
2014-10-24 22:38         ` Mark Nelson
2014-10-24 14:11   ` Danny Al-Gaaf
2014-10-24 14:31     ` Jens Axboe
