* Losing transactions
@ 2013-01-23 20:14 Pierre Beck
[not found] ` <51004490.704-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Pierre Beck @ 2013-01-23 20:14 UTC (permalink / raw)
To: linux-bcache-u79uwXL29TY76Z2rM5mHXA
Hi,
something is not working as advertised :-)
I have a test setup for power-loss behaviour evaluation. Recently a
batch of SSDs was of interest, and following them, naturally, bcache.
The test is simple: format an ext4 fs on the target device, copy over an
empty MySQL db and server with an ACID-compliant config (defaults, InnoDB
table), then write inserts with a Python script and output the latest
insert id. Watch via SSH, then cut power. I was positively surprised
that the consumer SSDs obey flushes and don't lose transactions (the
stored transaction was in fact always one or two ahead of the output).
Intel 520 series, Samsung 840 Pro and Corsair Neutron GTX, all 256 GB, in
case you're wondering. The Intel 520 was a lot faster, by the way; I
think SandForce did a really good job performance-wise. Testing an OCZ
Vector failed with a BIOS hang during detection.
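A minimal shell sketch of that loop (the actual test used a Python
script; the database and table names here are made up, and the mysql
client is a parameter so the loop can be exercised against a stub):

```shell
# Hypothetical sketch of the power-loss test loop. The real test used a
# Python script; db/table names are invented. MYSQL defaults to the real
# client but can be overridden (e.g. with a stub) for a dry run.
: "${MYSQL:=mysql}"

insert_loop() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    # one committed InnoDB transaction per INSERT, then report its id
    id=$("$MYSQL" -N test -e "INSERT INTO t (v) VALUES ($i); SELECT LAST_INSERT_ID();")
    echo "last committed insert id: $id"
    i=$((i + 1))
  done
}
```

After power loss, the id reported by the recovered database should be at
or ahead of the last id printed over SSH.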
Using an external ext4 journal with data=journal yielded SSD-like write
performance, with writebacks to an ST3000DM001 keeping the same level
thanks to re-ordering, and it did not lose transactions either.
Adding bcache, tests immediately failed, in both writeback and
writethrough modes. Watching writethrough mode, the performance of the
HDD looked odd: waiting on cache flushes, it should not have exceeded 1
MiB/s, yet I saw 30 MiB/s. So cache flushes are simply being eaten
somewhere.
dmesg says this at boot time:
Jan 23 19:23:37 dr-nick kernel: [ 2.948131] sd 2:0:0:0: [sdb]
5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
Jan 23 19:23:37 dr-nick kernel: [ 2.948135] sd 2:0:0:0: [sdb]
4096-byte physical blocks
Jan 23 19:23:37 dr-nick kernel: [ 2.948185] sd 2:0:0:0: [sdb] Write
Protect is off
Jan 23 19:23:37 dr-nick kernel: [ 2.948189] sd 2:0:0:0: [sdb] Mode
Sense: 00 3a 00 00
Jan 23 19:23:37 dr-nick kernel: [ 2.948212] sd 2:0:0:0: [sdb] Write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 23 19:23:37 dr-nick kernel: [ 2.948914] sd 3:0:0:0: [sdc]
468862128 512-byte logical blocks: (240 GB/223 GiB)
Jan 23 19:23:37 dr-nick kernel: [ 2.948986] sd 3:0:0:0: [sdc] Write
Protect is off
Jan 23 19:23:37 dr-nick kernel: [ 2.948990] sd 3:0:0:0: [sdc] Mode
Sense: 00 3a 00 00
Jan 23 19:23:37 dr-nick kernel: [ 2.949013] sd 3:0:0:0: [sdc] Write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
and bcache journal recovery looks like this:
Jan 23 19:24:58 dr-nick kernel: [ 96.909115] bcache:
btree_journal_read() done
Jan 23 19:24:58 dr-nick kernel: [ 97.112616] bcache: btree_check() done
Jan 23 19:24:58 dr-nick kernel: [ 97.113322] bcache: journal replay
done, 103 keys in 2 entries, seq 6175-6176
Jan 23 19:24:58 dr-nick kernel: [ 97.118998] bcache: Caching sdb as
bcache0 on set f5f0cd6d-0f77-49d3-ab2d-2203ffff1668
Jan 23 19:24:58 dr-nick kernel: [ 97.119125] bcache: registered cache
device sdc
I wonder if there's some cache-flushing method missing in bcache that
other device-mapper targets use to work around the missing support for
FUA (queue draining?).
Any ideas where to start debugging?
Greetings,
Pierre Beck
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Losing transactions
[not found] ` <51004490.704-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
@ 2013-01-24 23:35 ` Kent Overstreet
[not found] ` <20130124233559.GO26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2013-01-24 23:35 UTC (permalink / raw)
To: Pierre Beck; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On Wed, Jan 23, 2013 at 09:14:09PM +0100, Pierre Beck wrote:
> Hi,
>
> something is not working as advertised :-)
>
> I have a test setup for power loss behaviour evaluation. Recently a
> batch of SSDs was of interest and following them, naturally, bcache.
>
> The test is simple: format an ext4 fs on the target device, copy
> over an empty mysql db and server with ACID compliant config
> (defaults, innodb table), then write inserts with a python script
> and output the latest insert id. Watch via SSH, then cut power. I
> was positively surprised that the consumer SSDs obey flushes and
> don't lose transactions (stored transaction was in fact always one
> or two ahead of output). Intel 520 series, Samsung 840 Pro and
> Corsair Neutron GTX, all 256 GB, in case you're wondering. The Intel
> 520 was a lot faster btw., I think SandForce did a really good job
> performance-wise. Testing an OCZ Vector failed, BIOS hang during
> detection.
Ok, sounds reasonable
> Using an external Ext4 Journal with data=journal yielded SSD-like
> write performance with writebacks to an ST3000DM001 at the same
> level thanks to re-ordering, not losing transactions as well.
>
> Adding bcache, tests immediately failed, in both writeback and
> writethrough modes. Watching writethrough mode, the performance of
> the HDD looked odd, because waiting for cache flushes it should not
> exceed 1 MiB/s, yet I saw 30 MiB/s. So cache flushes are simply
> eaten somewhere.
Hmm.
So when you say the test failed - were there any inconsistencies after
you rebooted, or was it just that the most recent transactions didn't
make it down?
> dmesg says this at boot time:
>
> Jan 23 19:23:37 dr-nick kernel: [ 2.948131] sd 2:0:0:0: [sdb]
> 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
> Jan 23 19:23:37 dr-nick kernel: [ 2.948135] sd 2:0:0:0: [sdb]
> 4096-byte physical blocks
> Jan 23 19:23:37 dr-nick kernel: [ 2.948185] sd 2:0:0:0: [sdb]
> Write Protect is off
> Jan 23 19:23:37 dr-nick kernel: [ 2.948189] sd 2:0:0:0: [sdb]
> Mode Sense: 00 3a 00 00
> Jan 23 19:23:37 dr-nick kernel: [ 2.948212] sd 2:0:0:0: [sdb]
> Write cache: enabled, read cache: enabled, doesn't support DPO or
> FUA
> Jan 23 19:23:37 dr-nick kernel: [ 2.948914] sd 3:0:0:0: [sdc]
> 468862128 512-byte logical blocks: (240 GB/223 GiB)
> Jan 23 19:23:37 dr-nick kernel: [ 2.948986] sd 3:0:0:0: [sdc]
> Write Protect is off
> Jan 23 19:23:37 dr-nick kernel: [ 2.948990] sd 3:0:0:0: [sdc]
> Mode Sense: 00 3a 00 00
> Jan 23 19:23:37 dr-nick kernel: [ 2.949013] sd 3:0:0:0: [sdc]
> Write cache: enabled, read cache: enabled, doesn't support DPO or
> FUA
>
> and bcache journal recovery looks like this:
>
> Jan 23 19:24:58 dr-nick kernel: [ 96.909115] bcache:
> btree_journal_read() done
> Jan 23 19:24:58 dr-nick kernel: [ 97.112616] bcache: btree_check() done
> Jan 23 19:24:58 dr-nick kernel: [ 97.113322] bcache: journal
> replay done, 103 keys in 2 entries, seq 6175-6176
> Jan 23 19:24:58 dr-nick kernel: [ 97.118998] bcache: Caching sdb
> as bcache0 on set f5f0cd6d-0f77-49d3-ab2d-2203ffff1668
> Jan 23 19:24:58 dr-nick kernel: [ 97.119125] bcache: registered
> cache device sdc
>
> I wonder if there's some cache flushing method missing in bcache
> that other device mappers use to work around the missing support for
> FUA (queue draining?).
>
> Any ideas where to start debugging?
We probably want to start by simplifying/narrowing it down a bit - we
can eliminate the possibility of the disk having anything to do with it
and just use the SSD by forcing everything to writeback mode:
For that you'll want to disable both sequential bypass (echo 0 >
/sys/block/bcache/bcacheN/sequential_cutoff) and the congested
thresholds -
echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us,
echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
After that (assuming you're also in writeback mode) all writes will be
writeback writes until the device is more than half full of dirty data.
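Collected into a small sketch (the sysfs locations vary with the bcache
device number and cache-set UUID, so they're parameters here):

```shell
# Sketch of the tuning steps above. Paths are parameters because the
# bcacheN device number and the cache-set UUID differ per setup.
force_writeback() {
  dev_sysfs=$1   # sysfs dir of the bcache device
  set_sysfs=$2   # sysfs dir of the cache set
  echo 0 > "$dev_sysfs/sequential_cutoff"             # never bypass the cache for sequential I/O
  echo 0 > "$set_sysfs/congested_read_threshold_us"   # never bypass on read congestion
  echo 0 > "$set_sysfs/congested_write_threshold_us"  # never bypass on write congestion
}
```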
Can you check if transactions are still getting lost in that setup? If
so (I kind of expect they will be) we may have to do a bit of
blktracing, but that'll really narrow down the possibilities.
* Re: Losing transactions
[not found] ` <20130124233559.GO26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2013-01-28 14:45 ` Pierre Beck
[not found] ` <51068F01.9060000-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Pierre Beck @ 2013-01-28 14:45 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
> We probably want to start by simplifying/narrowing it down a bit - we
> can eliminate the possibility of the disk having anything to do with it
> and just use the SSD by forcing everything to writeback mode:
>
> For that you'll want to disable both sequential bypass (echo 0 >
> /sys/block/bcache/bcacheN/sequential_cutoff) and the congested
> thresholds -
> echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us,
> echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
>
> After that (assuming you're also in writeback mode) all writes will be
> writeback writes until the device is more than half full of dirty data.
>
> Can you check if transactions are still getting lost in that setup? If
> so (I kind of expect they will be) we may have to do a bit of
> blktracing, but that'll really narrow down the possibilities.
>
Yes, the most recent transactions are still lost.
Greetings,
Pierre Beck
* Re: Losing transactions
[not found] ` <51068F01.9060000-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
@ 2013-01-29 19:01 ` Kent Overstreet
[not found] ` <20130129190133.GL26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2013-01-29 19:01 UTC (permalink / raw)
To: Pierre Beck; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On Mon, Jan 28, 2013 at 03:45:23PM +0100, Pierre Beck wrote:
>
> >We probably want to start by simplifying/narrowing it down a bit - we
> >can eliminate the possibility of the disk having anything to do with it
> >and just use the SSD by forcing everything to writeback mode:
> >
> >For that you'll want to disable both sequential bypass (echo 0 >
> >/sys/block/bcache/bcacheN/sequential_cutoff) and the congested
> >thresholds -
> >echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us,
> >echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> >
> >After that (assuming you're also in writeback mode) all writes will be
> >writeback writes until the device is more than half full of dirty data.
> >
> >Can you check if transactions are still getting lost in that setup? If
> >so (I kind of expect they will be) we may have to do a bit of
> >blktracing, but that'll really narrow down the possibilities.
> >
>
> Yes, the most recent transactions are still lost.
Think I figured out what's going on. Just had a chat with another kernel
dev and figured out the flaw in my logic :P
This is going to take some thought to fix, though it shouldn't be much
code. I'll let you know when I think I have a fix.
* Re: Losing transactions
[not found] ` <20130129190133.GL26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2013-01-29 19:09 ` Kent Overstreet
[not found] ` <20130129190942.GM26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2013-01-29 19:09 UTC (permalink / raw)
To: Pierre Beck; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On Tue, Jan 29, 2013 at 11:01:33AM -0800, Kent Overstreet wrote:
> On Mon, Jan 28, 2013 at 03:45:23PM +0100, Pierre Beck wrote:
> >
> > >We probably want to start by simplifying/narrowing it down a bit - we
> > >can eliminate the possibility of the disk having anything to do with it
> > >and just use the SSD by forcing everything to writeback mode:
> > >
> > >For that you'll want to disable both sequential bypass (echo 0 >
> > >/sys/block/bcache/bcacheN/sequential_cutoff) and the congested
> > >thresholds -
> > >echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us,
> > >echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> > >
> > >After that (assuming you're also in writeback mode) all writes will be
> > >writeback writes until the device is more than half full of dirty data.
> > >
> > >Can you check if transactions are still getting lost in that setup? If
> > >so (I kind of expect they will be) we may have to do a bit of
> > >blktracing, but that'll really narrow down the possibilities.
> > >
> >
> > Yes, the most recent transactions are still lost.
>
> Think I figured out what's going on. Just had a chat with another kernel
> dev and figured out the flaw in my logic :P
I lied, that idea turned out to be wrong.
Though - data=journal isn't that commonly used, so an ext4 bug is an
outside possibility (if it assumed the pre-2.6.38 semantics of
barriers, that could cause this). Could you test with data=ordered and
tell me what happens?
* Re: Losing transactions
[not found] ` <20130129190942.GM26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2013-01-29 20:16 ` Pierre Beck
[not found] ` <51082E02.7000908-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Pierre Beck @ 2013-01-29 20:16 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On 29.01.2013 20:09, Kent Overstreet wrote:
> I lied, that idea turned out to be wrong.
>
> Though - data=journal isn't that commonly used, so an ext4 bug is an
> outside possibility (if it assumed the pre-2.6.38 semantics of
> barriers, that could cause this). Could you test with data=ordered and
> tell me what happens?
>
Sorry I didn't state that clearly. The ext4 data=journal test was for
comparison only. All bcache tests were done with defaults, which is
data=ordered.
# mkfs.ext4 -K -b 4096
# mount -o noatime
Disabling all hardware caches with hdparm AND putting bcache in
writethrough mode made it pass the test, btw., with severe performance
loss of course. What's really alarming is that bcache in writeback with
hw caches deactivated still loses transactions. This points to the
journal logic. Maybe the last element of the journal is simply not iterated?
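For reference, the hardware-cache step as a sketch (device names are
assumptions from the dmesg above; HDPARM is a parameter so the commands
can be dry-run, e.g. with HDPARM=echo):

```shell
# Sketch, assuming /dev/sdb is the backing HDD and /dev/sdc the caching
# SSD. HDPARM is overridable so the sequence can be dry-run.
: "${HDPARM:=hdparm}"

disable_write_caches() {
  for dev in "$@"; do
    "$HDPARM" -W 0 "$dev"   # -W 0: turn the drive's volatile write cache off
  done
}
```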
Greetings,
Pierre Beck
* Re: Losing transactions
[not found] ` <51082E02.7000908-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
@ 2013-01-30 19:02 ` Kent Overstreet
[not found] ` <20130130190220.GS26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2013-01-30 19:02 UTC (permalink / raw)
To: Pierre Beck; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On Tue, Jan 29, 2013 at 09:16:02PM +0100, Pierre Beck wrote:
> On 29.01.2013 20:09, Kent Overstreet wrote:
> >I lied, that idea turned out to be wrong.
> >
> >Though - data=journal isn't that commonly used, so an ext4 bug is an
> >outside possibility (if it assumed the pre-2.6.38 semantics of
> >barriers, that could cause this). Could you test with data=ordered and
> >tell me what happens?
> >
>
> Sorry I didn't state that clearly. The ext4 data=journal test was
> for comparison only. All bcache tests were done with defaults, which
> is data=ordered.
>
> # mkfs.ext4 -K -b 4096
> # mount -o noatime
Ok, good - so it's pretty definitively a bcache bug then.
> Disabling all hardware caches with hdparm AND putting bcache in
> writethrough mode made it pass the test, btw., with severe
> performance loss of course. What's really alarming is that bcache in
> writeback with hw caches deactivated still loses transactions. This
> points to the journal logic. Maybe the last element of the journal
> is simply not iterated?
I was thinking more of the cache-flush logic, but you may be right -
there's not much logic involved in handling cache flushes; there's
somewhat more to screw up in the journalling code itself.
I'll start reading through the journalling code. Can you try another
setting and tell me what happens?
For normal (non-flush) data writes, bcache adds the keys to the next
journal write that's going to go down, but it doesn't do the journal
write right away - it delays by up to 100 ms by default, unless either
the journal write fills up or it sees a flush (only flushes actually
wait on journal writes).
That delay is a parameter - /sys/fs/bcache/<cache set>/journal_delay_ms.
Can you set it to 0 and see what happens? It might reduce the number of
missing transactions, eliminate the missing transactions... or have no
effect at all.
And thanks for all the testing you've been doing! It's greatly
appreciated.
* Re: Losing transactions
[not found] ` <20130130190220.GS26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2013-01-30 20:05 ` Pierre Beck
[not found] ` <51097D0A.6040204-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Pierre Beck @ 2013-01-30 20:05 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On 30.01.2013 20:02, Kent Overstreet wrote:
> That delay is a parameter - /sys/fs/bcache/<cache set>/journal_delay_ms.
> Can you set it to 0 and see what happens? It might reduce the number of
> missing transactions, eliminate the missing transactions... or have no
> effect at all.
>
Test passed. Double-checked. No transactions lost.
I really want to see bcache in the mainline kernel soon. It's such a great
addition for storage arrays. Let's polish it up. When you have a fix,
I'll test a full storage stack on it. Ext4 alone is boring :-)
Greetings,
Pierre Beck
* Re: Losing transactions
[not found] ` <51097D0A.6040204-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
@ 2013-01-30 20:18 ` Kent Overstreet
0 siblings, 0 replies; 9+ messages in thread
From: Kent Overstreet @ 2013-01-30 20:18 UTC (permalink / raw)
To: Pierre Beck; +Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA
On Wed, Jan 30, 2013 at 09:05:31PM +0100, Pierre Beck wrote:
> On 30.01.2013 20:02, Kent Overstreet wrote:
> >That delay is a parameter - /sys/fs/bcache/<cache set>/journal_delay_ms.
> >Can you set it to 0 and see what happens? It might reduce the number of
> >missing transactions, eliminate the missing transactions... or have no
> >effect at all.
> >
>
> Test passed. Double-checked. No transactions lost.
None at all, ever?
That's actually kind of annoying, because we can't definitively
distinguish between "fixes bug completely" and "makes bug a lot harder
to actually detect". Bah :p
Can you run blktrace on the bcache device and the SSD for a second or
two while your MySQL test is running? Maybe if I stare at the trace I'll
get lucky and see something wrong.
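Something like this (a sketch; device names are assumptions from earlier
in the thread, and the tools are parameters so the sequence can be
dry-run - the real run needs root):

```shell
# Sketch of a short capture on the bcache device and the SSD while the
# test is writing. BLKTRACE/BLKPARSE are overridable for a dry run.
: "${BLKTRACE:=blktrace}" "${BLKPARSE:=blkparse}"

capture_trace() {
  secs=${1:-2}
  # record $secs seconds of traffic on both devices
  "$BLKTRACE" -d /dev/bcache0 -d /dev/sdc -w "$secs"
  # merge the per-CPU trace files into one human-readable log
  "$BLKPARSE" bcache0 sdc > trace.txt
}
```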
> I really want to see bcache in the mainline kernel soon. It's such a
> great addition for storage arrays. Let's polish it up. When you have
> a fix, I'll test a full storage stack on it. Ext4 alone is boring
> :-)
Working on it :D
end of thread, other threads: [~2013-01-30 20:18 UTC | newest]
Thread overview: 9+ messages
2013-01-23 20:14 Losing transactions Pierre Beck
[not found] ` <51004490.704-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
2013-01-24 23:35 ` Kent Overstreet
[not found] ` <20130124233559.GO26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2013-01-28 14:45 ` Pierre Beck
[not found] ` <51068F01.9060000-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
2013-01-29 19:01 ` Kent Overstreet
[not found] ` <20130129190133.GL26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2013-01-29 19:09 ` Kent Overstreet
[not found] ` <20130129190942.GM26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2013-01-29 20:16 ` Pierre Beck
[not found] ` <51082E02.7000908-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
2013-01-30 19:02 ` Kent Overstreet
[not found] ` <20130130190220.GS26407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2013-01-30 20:05 ` Pierre Beck
[not found] ` <51097D0A.6040204-MZZvbRqs/9F0RdzJJlgK+g@public.gmane.org>
2013-01-30 20:18 ` Kent Overstreet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox