CFQ slower than NOOP with pgbench

* CFQ slower than NOOP with pgbench
@ 2010-02-10 22:32 Jan Kara
  2010-02-11  4:10 ` Nikanth Karthikesan
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2010-02-10 22:32 UTC
  To: LKML; +Cc: jens.axboe, jmoyer

  Hi,

  I was playing with a pgbench benchmark - it runs a series of operations
on top of PostgreSQL database. I was using:
  pgbench -c 8 -t 2000 pgbench
which runs 8 threads and each thread does 2000 transactions over the
database. The funny thing is that the benchmark does ~70 tps (transactions
per second) with CFQ and ~90 tps with a NOOP io scheduler. This is with
2.6.32 kernel.
  The load on the IO subsystem basically looks like lots of random reads
interleaved with occasional short synchronous sequential writes (the
database does write immediately followed by fdatasync) to the database
logs. I was pondering for quite some time why CFQ is slower and I've tried
tuning it in various ways without success. What I found is that with NOOP
scheduler, the fdatasync is like 20-times faster on average than with CFQ.
Looking at the block traces (available on request) this is usually because
when fdatasync is called, it takes time before the timeslice of the process
doing the sync comes (other processes are using their timeslices for reads)
and writes are dispatched... The question is: Can we do something about
that? Because I'm currently out of ideas except for hacks like "run this
queue immediately if it's fsync" or such...
  The config of the database is attached (it actually influences the
performance and the visibility of the problem noticeably). The machine
is just Core 2 Duo with 3.7 GB of memory and a plain SATA drive.
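
  The write-then-fdatasync pattern described above can be timed in isolation.
What follows is only a minimal sketch, not the method used for the numbers in
this mail: the function name mean_fdatasync_latency(), the 8 KB record limit
and the use of CLOCK_MONOTONIC are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: append `count` records of `size` bytes to `path`,
 * call fdatasync() after every write, and return the mean time spent in
 * fdatasync() in seconds. Returns -1.0 on any error or bad argument.
 */
double mean_fdatasync_latency(const char *path, int count, size_t size)
{
    char buf[8192];
    struct timespec t0, t1;
    double total = 0.0;
    int fd, i;

    if (count <= 0 || size == 0 || size > sizeof(buf))
        return -1.0;

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1.0;

    memset(buf, 'x', size);
    for (i = 0; i < count; i++) {
        /* Short sequential write, like a log record. */
        if (write(fd, buf, size) != (ssize_t)size)
            goto fail;

        /* Only the sync itself is timed. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fdatasync(fd) != 0)
            goto fail;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        total += (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
    close(fd);
    return total / count;

fail:
    close(fd);
    return -1.0;
}
```

  To be meaningful it would have to run while other processes issue random
reads on the same disk, once per IO scheduler, since the delay described above
only shows up when the syncing process competes for timeslices.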

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: postgresql.conf --]
[-- Type: text/plain, Size: 593 bytes --]

shared_buffers = 1GB
temp_buffers = 256MB
work_mem = 256MB
maintenance_work_mem = 1GB
effective_io_concurrency = 0
wal_buffers = 1MB
checkpoint_segments = 2048
random_page_cost = 6.0
effective_cache_size = 2GB
synchronous_commit = on
#commit_delay = 1000
#wal_writer_delay = 100
#default_statistics_target = 1000
bgwriter_lru_maxpages = 1000

log_destination = 'stderr'
logging_collector = on
#log_checkpoints = on
#log_connections = on
#log_disconnections = on
#log_lock_waits = on
#log_statement = 'none'
#log_statement_stats=1
#log_planner_stats=1
#log_parser_stats=1
#log_executor_stats=1


* Re: CFQ slower than NOOP with pgbench
  2010-02-10 22:32 CFQ slower than NOOP with pgbench Jan Kara
@ 2010-02-11  4:10 ` Nikanth Karthikesan
  2010-02-11 13:14   ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Nikanth Karthikesan @ 2010-02-11  4:10 UTC
  To: Jan Kara; +Cc: LKML, jens.axboe, jmoyer

On Thursday 11 February 2010 04:02:55 Jan Kara wrote:
>   Hi,
> 
>   I was playing with a pgbench benchmark - it runs a series of operations
> on top of PostgreSQL database. I was using:
>   pgbench -c 8 -t 2000 pgbench
> which runs 8 threads and each thread does 2000 transactions over the
> database. The funny thing is that the benchmark does ~70 tps (transactions
> per second) with CFQ and ~90 tps with a NOOP io scheduler. This is with
> 2.6.32 kernel.
>   The load on the IO subsystem basically looks like lots of random reads
> interleaved with occasional short synchronous sequential writes (the
> database does write immediately followed by fdatasync) to the database
> logs. I was pondering for quite some time why CFQ is slower and I've tried
> tuning it in various ways without success. What I found is that with NOOP
> scheduler, the fdatasync is like 20-times faster on average than with CFQ.
> Looking at the block traces (available on request) this is usually because
> when fdatasync is called, it takes time before the timeslice of the process
> doing the sync comes (other processes are using their timeslices for reads)
> and writes are dispatched... The question is: Can we do something about
> that? Because I'm currently out of ideas except for hacks like "run this
> queue immediately if it's fsync" or such...

I guess noop would be hurting those reads, which are also synchronous
operations like fsync. But it doesn't seem to have a huge negative impact on
pgbench. Is it because reads are random in this benchmark and delaying
them might even help by getting new requests for sectors in between two random
reads? If that is the case, I don't think fsync should be given higher priority
than reads based on this benchmark.

Can you make the blktrace available?

Thanks
Nikanth


* Re: CFQ slower than NOOP with pgbench
  2010-02-11  4:10 ` Nikanth Karthikesan
@ 2010-02-11 13:14   ` Jan Kara
  2010-02-11 19:30     ` Vivek Goyal
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2010-02-11 13:14 UTC
  To: Nikanth Karthikesan; +Cc: Jan Kara, LKML, jens.axboe, jmoyer

On Thu 11-02-10 09:40:33, Nikanth Karthikesan wrote:
> On Thursday 11 February 2010 04:02:55 Jan Kara wrote:
> >   Hi,
> > 
> >   I was playing with a pgbench benchmark - it runs a series of operations
> > on top of PostgreSQL database. I was using:
> >   pgbench -c 8 -t 2000 pgbench
> > which runs 8 threads and each thread does 2000 transactions over the
> > database. The funny thing is that the benchmark does ~70 tps (transactions
> > per second) with CFQ and ~90 tps with a NOOP io scheduler. This is with
> > 2.6.32 kernel.
> >   The load on the IO subsystem basically looks like lots of random reads
> > interleaved with occasional short synchronous sequential writes (the
> > database does write immediately followed by fdatasync) to the database
> > logs. I was pondering for quite some time why CFQ is slower and I've tried
> > tuning it in various ways without success. What I found is that with NOOP
> > scheduler, the fdatasync is like 20-times faster on average than with CFQ.
> > Looking at the block traces (available on request) this is usually because
> > when fdatasync is called, it takes time before the timeslice of the process
> > doing the sync comes (other processes are using their timeslices for reads)
> > and writes are dispatched... The question is: Can we do something about
> > that? Because I'm currently out of ideas except for hacks like "run this
> > queue immediately if it's fsync" or such...
> 
> I guess noop would be hurting those reads, which are also synchronous
> operations like fsync. But it doesn't seem to have a huge negative impact on
> pgbench. Is it because reads are random in this benchmark and delaying
> them might even help by getting new requests for sectors in between two random
> reads? If that is the case, I don't think fsync should be given higher priority
> than reads based on this benchmark.
> 
> Can you make the blktrace available?
  OK, traces are available from:
http://beta.suse.com/private/jack/pgbench-cfq-noop/pgbench-blktrace.tar.gz

  I've tried also two tests: I've run the database with LD_PRELOAD so that
fdatasync does
a) nothing
b) calls sync_file_range(fd, 0, LLONG_MAX, SYNC_FILE_RANGE_WRITE)
c) calls posix_fadvise(fd, 0, LLONG_MAX, POSIX_FADV_DONTNEED)
   - it does filemap_flush() which was my main aim..

  The results (CFQ as an IO scheduler) are interesting. In a) the performance
was slightly higher than with NOOP scheduler and fully functional fdatasync.
Not surprising - we spend only like 2 s (out of ~200) in fdatasync with NOOP
scheduler.
  In b) the performance was only about 2% better than with full fdatasync
(with NOOP scheduler, it's ~20% better). Looking at the strace
output, it seems sync_file_range() takes as long as fdatasync() took -
probably because we are waiting for PageWriteback or lock_page.
  In c) the performance was ~11% better - fadvise calls seem to be quite
quick - comparable times between CFQ and NOOP. So higher latency of fdatasync
seems to be at least part of a problem...
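
  For reference, an LD_PRELOAD shim of the kind described in a) to c) could
look roughly like the sketch below. This is not the code used for the tests
above; the environment variable FDATASYNC_SHIM and its values "none", "range"
and "fadvise" are hypothetical names picked for the example, and an unset
variable is treated as "none".

```c
#define _GNU_SOURCE            /* for sync_file_range() */
#define _FILE_OFFSET_BITS 64   /* so off_t can hold LLONG_MAX on 32-bit */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Replacement for fdatasync(). When this object is preloaded, the dynamic
 * linker resolves the application's fdatasync() calls to this function
 * instead of the libc one.
 */
int fdatasync(int fd)
{
    const char *mode = getenv("FDATASYNC_SHIM");   /* hypothetical knob */
    int err;

    /* a) do nothing at all */
    if (mode == NULL || strcmp(mode, "none") == 0)
        return 0;

    /* b) start writeback of the dirty pages */
    if (strcmp(mode, "range") == 0)
        return sync_file_range(fd, 0, LLONG_MAX, SYNC_FILE_RANGE_WRITE);

    /* c) posix_fadvise() returns the error number instead of setting errno */
    if (strcmp(mode, "fadvise") == 0) {
        err = posix_fadvise(fd, 0, LLONG_MAX, POSIX_FADV_DONTNEED);
        if (err != 0) {
            errno = err;
            return -1;
        }
        return 0;
    }

    errno = EINVAL;
    return -1;
}
```

  Assuming the file is saved as fdatasync_shim.c, it can be built with
"gcc -shared -fPIC -o fdatasync_shim.so fdatasync_shim.c" and loaded by
starting the database server with LD_PRELOAD=./fdatasync_shim.so and
FDATASYNC_SHIM set in its environment. None of the three variants gives the
durability guarantee of a real fdatasync, so this is only suitable for
measurements on throwaway data.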

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: CFQ slower than NOOP with pgbench
  2010-02-11 13:14   ` Jan Kara
@ 2010-02-11 19:30     ` Vivek Goyal
  2010-02-18 18:56       ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Vivek Goyal @ 2010-02-11 19:30 UTC
  To: Jan Kara; +Cc: Nikanth Karthikesan, LKML, jens.axboe, jmoyer

On Thu, Feb 11, 2010 at 02:14:17PM +0100, Jan Kara wrote:
> On Thu 11-02-10 09:40:33, Nikanth Karthikesan wrote:
> > On Thursday 11 February 2010 04:02:55 Jan Kara wrote:
> > >   Hi,
> > > 
> > >   I was playing with a pgbench benchmark - it runs a series of operations
> > > on top of PostgreSQL database. I was using:
> > >   pgbench -c 8 -t 2000 pgbench
> > > which runs 8 threads and each thread does 2000 transactions over the
> > > database. The funny thing is that the benchmark does ~70 tps (transactions
> > > per second) with CFQ and ~90 tps with a NOOP io scheduler. This is with
> > > 2.6.32 kernel.
> > >   The load on the IO subsystem basically looks like lots of random reads
> > > interleaved with occasional short synchronous sequential writes (the
> > > database does write immediately followed by fdatasync) to the database
> > > logs. I was pondering for quite some time why CFQ is slower and I've tried
> > > tuning it in various ways without success. What I found is that with NOOP
> > > scheduler, the fdatasync is like 20-times faster on average than with CFQ.
> > > Looking at the block traces (available on request) this is usually because
> > > when fdatasync is called, it takes time before the timeslice of the process
> > > doing the sync comes (other processes are using their timeslices for reads)
> > > and writes are dispatched... The question is: Can we do something about
> > > that? Because I'm currently out of ideas except for hacks like "run this
> > > queue immediately if it's fsync" or such...
> > 
> > I guess noop would be hurting those reads, which are also synchronous
> > operations like fsync. But it doesn't seem to have a huge negative impact on
> > pgbench. Is it because reads are random in this benchmark and delaying
> > them might even help by getting new requests for sectors in between two random
> > reads? If that is the case, I don't think fsync should be given higher priority
> > than reads based on this benchmark.
> > 
> > Can you make the blktrace available?
>   OK, traces are available from:
> http://beta.suse.com/private/jack/pgbench-cfq-noop/pgbench-blktrace.tar.gz
> 

I had a quick look at the blktrace of cfq. Looks like CFQ is idling on 
random read sync queues also and that could be one reason contributing to
reduced throughput of pgbench. This helped in reducing random workload
latencies in the presence of other sequential read or write going on.

Later Corrado changed the logic to do group wait on all random readers
instead of individual queue.

Can you please try latest kernel 2.6.33-rc7 and see if you still see the
issue. This version does group wait on random readers as well as drives
deeper queue depths for writers. (deeper queue depth might not help on 
SATA but does help if multiple spindles are behind RAID card).

Or, if your SATA disk supports NCQ, then just set low_latency=0 on
2.6.32 kernel. Looking at 2.6.32 code, it looks like that will also
disable idling on random reader queues.
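
Setting the tunable means writing to the CFQ directory in sysfs. A minimal
sketch of doing that from C follows; the helper names write_tunable() and
cfq_set_low_latency() are made up for this example, and the path assumes
sysfs is mounted on /sys and CFQ is the active scheduler for the disk.

```c
#include <stdio.h>

/* Write a short string to a sysfs-style file; 0 on success, -1 on error. */
int write_tunable(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    /* sysfs reports a rejected value when the buffer is flushed. */
    return fclose(f) == 0 ? 0 : -1;
}

/*
 * Hypothetical wrapper: set CFQ's low_latency for one disk, where `disk`
 * is the kernel name of the block device as it appears under /sys/block.
 */
int cfq_set_low_latency(const char *disk, int enabled)
{
    char path[256];
    int n = snprintf(path, sizeof(path),
                     "/sys/block/%s/queue/iosched/low_latency", disk);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    return write_tunable(path, enabled ? "1" : "0");
}
```

Run as root, cfq_set_low_latency("sda", 0) would correspond to low_latency=0,
where "sda" is only an assumed device name. The same write_tunable() can
switch the scheduler itself by writing "noop" or "cfq" to
/sys/block/<disk>/queue/scheduler.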

Thanks
Vivek


>   I've tried also two tests: I've run the database with LD_PRELOAD so that
> fdatasync does
> a) nothing
> b) calls sync_file_range(fd, 0, LLONG_MAX, SYNC_FILE_RANGE_WRITE)
> c) calls posix_fadvise(fd, 0, LLONG_MAX, POSIX_FADV_DONTNEED)
>    - it does filemap_flush() which was my main aim..
> 
>   The results (CFQ as an IO scheduler) are interesting. In a) the performance
> was slightly higher than with NOOP scheduler and fully functional fdatasync.
> Not surprising - we spend only like 2 s (out of ~200) in fdatasync with NOOP
> scheduler.
>   In b) the performance was only about 2% better than with full fdatasync
> (with NOOP scheduler, it's ~20% better). Looking at the strace
> output, it seems sync_file_range() takes as long as fdatasync() took -
> probably because we are waiting for PageWriteback or lock_page.
>   In c) the performance was ~11% better - fadvise calls seem to be quite
> quick - comparable times between CFQ and NOOP. So higher latency of fdatasync
> seems to be at least part of a problem...
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


* Re: CFQ slower than NOOP with pgbench
  2010-02-11 19:30     ` Vivek Goyal
@ 2010-02-18 18:56       ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2010-02-18 18:56 UTC
  To: Vivek Goyal; +Cc: Jan Kara, Nikanth Karthikesan, LKML, jens.axboe, jmoyer

On Thu 11-02-10 14:30:41, Vivek Goyal wrote:
> On Thu, Feb 11, 2010 at 02:14:17PM +0100, Jan Kara wrote:
> > On Thu 11-02-10 09:40:33, Nikanth Karthikesan wrote:
> > > On Thursday 11 February 2010 04:02:55 Jan Kara wrote:
> > > >   Hi,
> > > > 
> > > >   I was playing with a pgbench benchmark - it runs a series of operations
> > > > on top of PostgreSQL database. I was using:
> > > >   pgbench -c 8 -t 2000 pgbench
> > > > which runs 8 threads and each thread does 2000 transactions over the
> > > > database. The funny thing is that the benchmark does ~70 tps (transactions
> > > > per second) with CFQ and ~90 tps with a NOOP io scheduler. This is with
> > > > 2.6.32 kernel.
> > > >   The load on the IO subsystem basically looks like lots of random reads
> > > > interleaved with occasional short synchronous sequential writes (the
> > > > database does write immediately followed by fdatasync) to the database
> > > > logs. I was pondering for quite some time why CFQ is slower and I've tried
> > > > tuning it in various ways without success. What I found is that with NOOP
> > > > scheduler, the fdatasync is like 20-times faster on average than with CFQ.
> > > > Looking at the block traces (available on request) this is usually because
> > > > when fdatasync is called, it takes time before the timeslice of the process
> > > > doing the sync comes (other processes are using their timeslices for reads)
> > > > and writes are dispatched... The question is: Can we do something about
> > > > that? Because I'm currently out of ideas except for hacks like "run this
> > > > queue immediately if it's fsync" or such...
> > > 
> > > I guess noop would be hurting those reads, which are also synchronous
> > > operations like fsync. But it doesn't seem to have a huge negative impact on
> > > pgbench. Is it because reads are random in this benchmark and delaying
> > > them might even help by getting new requests for sectors in between two random
> > > reads? If that is the case, I don't think fsync should be given higher priority
> > > than reads based on this benchmark.
> > > 
> > > Can you make the blktrace available?
> >   OK, traces are available from:
> > http://beta.suse.com/private/jack/pgbench-cfq-noop/pgbench-blktrace.tar.gz
> > 
> 
> I had a quick look at the blktrace of cfq. Looks like CFQ is idling on 
> random read sync queues also and that could be one reason contributing to
> reduced throughput of pgbench. This helped in reducing random workload
> latencies in the presence of other sequential read or write going on.
> 
> Later Corrado changed the logic to do group wait on all random readers
> instead of individual queue.
> 
> Can you please try latest kernel 2.6.33-rc7 and see if you still see the
> issue. This version does group wait on random readers as well as drives
> deeper queue depths for writers. (deeper queue depth might not help on 
> SATA but does help if multiple spindles are behind RAID card).
> 
> Or, if your SATA disk supports NCQ, then just set low_latency=0 on
> 2.6.32 kernel. Looking at 2.6.32 code, it looks like that will also
> disable idling on random reader queues.
  Thanks for suggestions! I've now got to do the testing and the latest
upstream kernel has the same CFQ performance as NOOP for pgbench. Also
setting low_latency to 0 helped in 2.6.32 kernel...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
