public inbox for linux-bcache@vger.kernel.org
* md-raid5 with bcache member devices => kernel panic
@ 2013-12-05 21:29 Matthias Ferdinand
  2013-12-05 22:52 ` Kent Overstreet
  0 siblings, 1 reply; 7+ messages in thread
From: Matthias Ferdinand @ 2013-12-05 21:29 UTC (permalink / raw)
  To: linux-bcache

Hi,

I am currently experimenting with bcache. The hardware is rather old:
Intel Core2 6600, 2.4GHz, 8GB RAM. I intend to use it as a KVM host. OS
is Ubuntu 13.10 amd64.

SSD: single Intel 530 series 120G (SSDSC2BW120A4), i.e. the same cache
device for all backing devices

My test procedure:

  - prepare VMs:
       foreach vm in x y z
         copy VM image to volume (dd_rescue)
           (Ubuntu 12.04 amd64 Webservers, XFS filesystem)
         start up VM

  - wait at least 5 min so any writeback can settle down

  - synchronously start "apt-get dist-upgrade" inside the
    VMs (includes linux-image-... and linux-headers-...,
    which makes for a lot of small files)

I did this with various values of bcache cache_mode and KVM virtual
disk caching options.

Bcache's 'writeback' cache_mode is fastest, of course. But now the KVM
setting "writeback" is slower than "writethrough" - in the same setup
with no bcache involved, KVM's "writeback" would be far ahead.

Storage setup:

  LVM
   |
  md-raid5 (chunksize 512k)
   |
  3x SATA 2TB Seagate ST2000DM001-1CH1; Partition 6


I tried putting bcache at different levels in the storage stack:

  [bcache <1a> <1b> <1c>]
   |
  LVM
   |
  [bcache <2>]
   |
  md-raid5 (chunksize 512k)
   |
  [bcache <3a> <3b> <3c>]
   |
  3x SATA 2TB Seagate ST2000DM001-1CH1; Partition 6


1) 3 bcache devices on top of LVs (<1a>, <1b>, <1c>).

2) 1 bcache device above the md-raid5 (<2>), used as LVM PV.

3) 3 bcache devices on top of the partitions (<3a>, <3b>, <3c>),
      used as member devices for md-raid5.

The higher bcache sat in this hierarchy, the better the performance.
An md-raid5 made of bcaches (that share the same cache device) is
horribly slow.

But not only is it rather slow, it reliably (but nondeterministically)
produces kernel panics. It might panic while copying the first VM image
(dd_rescue), or during startup of the first VM, while the copy process
for the second VM image (dd_rescue) is already running.

Tried with different kernels, all produce the panics:
  - Ubuntu 3.11.0-13.20
  - kernel.org 3.12.2
  - kernel.org 3.13-rc2

Having so many layers on top of bcache may be stupid, but surely it
should not panic :-)

You can find the complete serial console output of those crashing runs
at http://dl.mfedv.net/md5raid_on_bcache_panic/

I can't see bcache mentioned in those kernel backtraces - perhaps it's
not really bcache's fault. (There is a single bcache line in the 3.12.2
trace, though.)

Any ideas?

Regards
Matthias

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: md-raid5 with bcache member devices => kernel panic
  2013-12-05 21:29 md-raid5 with bcache member devices => kernel panic Matthias Ferdinand
@ 2013-12-05 22:52 ` Kent Overstreet
  2013-12-05 23:08   ` Matthias Ferdinand
  2013-12-08 23:53   ` Matthias Ferdinand
  0 siblings, 2 replies; 7+ messages in thread
From: Kent Overstreet @ 2013-12-05 22:52 UTC (permalink / raw)
  To: Matthias Ferdinand, nks; +Cc: linux-bcache

On Thu, Dec 05, 2013 at 10:29:13PM +0100, Matthias Ferdinand wrote:
> Hi,
> 
> I am currently experimenting with bcache. The hardware is rather old:
> Intel Core2 6600, 2.4GHz, 8GB RAM. I intend to use it as a KVM host. OS
> is Ubuntu 13.10 amd64.
> 
> SSD: single Intel 530 series 120G (SSDSC2BW120A4), i.e. same cache
> device for all backing devices
> 
> But not only is it rather slow, it reliably (but nondeterministically)
> produces kernel panics. It might panic while copying the first VM image
> (dd_rescue), or during startup of the first VM, while the copy process
> for the second VM image (dd_rescue) is already running.
> 
> Tried with different kernels, all produce the panics:
>   - Ubuntu 3.11.0-13.20
>   - kernel.org 3.12.2
>   - kernel.org 3.13-rc2
> 
> Having so many layers on top of bcache may be stupid, but surely it
> should not panic :-)
> 
> You can find the complete serial console output of those crashing runs
> at http://dl.mfedv.net/md5raid_on_bcache_panic/
> 
> I can't see bcache mentioned in those kernel backtraces - perhaps it's
> not really bcache's fault. (There is a single bcache line in the 3.12.2
> trace, though.)

Erk. I thought I was done with these bugs. Nick, do you think you could try and
track this down?

Looking at this:
http://dl.mfedv.net/md5raid_on_bcache_panic/mdraid5_on_bcache_panic_3.12.2.txt

that's a null pointer deref; if Matthias could get the exact line number it
happened on, we could tell which variable was null. I _think_ it's *sg because
it's running off the end of the scatterlist; if that's the case (and you should
verify that that is what's happening), then what's going on is bcache is sending
down a bio larger than what the device expects.

Assuming that's the case, the bug would be in bch_bio_max_sectors(), which is in
drivers/md/bcache/io.c. Backstory:

In current kernels, the way the block layer does an I/O is that you fill out a
struct bio and pass it down: BUT, the bio is not allowed to be bigger than
whatever the device can handle atomically as a single request.

The way this is normally done is that a filesystem will add data to a bio, a
page at a time, with bio_add_page() (in fs/bio.c); bio_add_page() checks all
the device constraints and may fail, and then the filesystem sends the bio down
and starts building a new one.

Anyways, bcache doesn't do things this way, because that approach is braindead
and gets obscenely complicated when you have stacked block devices - instead,
when bcache goes to submit a bio, it first splits the bio if necessary. That
bch_bio_max_sectors() is supposed to check all the same constraints and
essentially replicate the behaviour of building up a bio with bio_add_page().

Matthias - I'm running bcache on top of a raid6 at home and I've never seen
this, so there's probably something unusual about your setup that's required to
trigger this. Can you help Nick out with reproducing the bug and/or getting him
more information?

* Re: md-raid5 with bcache member devices => kernel panic
  2013-12-05 22:52 ` Kent Overstreet
@ 2013-12-05 23:08   ` Matthias Ferdinand
  2013-12-05 23:15     ` Kent Overstreet
  2013-12-08 23:53   ` Matthias Ferdinand
  1 sibling, 1 reply; 7+ messages in thread
From: Matthias Ferdinand @ 2013-12-05 23:08 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: nks, linux-bcache

On Thu, Dec 05, 2013 at 02:52:34PM -0800, Kent Overstreet wrote:
> On Thu, Dec 05, 2013 at 10:29:13PM +0100, Matthias Ferdinand wrote:
> Erk. I thought I was done with these bugs. Nick, do you think you could try and
> track this down?
> 
> Looking at this:
> http://dl.mfedv.net/md5raid_on_bcache_panic/mdraid5_on_bcache_panic_3.12.2.txt
> 
> that's a null pointer deref; if Matthias could get the exact line number it
I have no idea how to do that - can you help me out on that?

> Matthias - I'm running bcache on top of a raid6 at home and I've never seen
I only get panics when I use md-raid on top of bcaches:

              LVM
               |
           md-raid5
         /     |     \
  bcache0  bcache1  bcache2
   |           |       |
  sdb6       sdc6     sdd6

(probably an unusual setup; just playing around...)

To reproduce, just writing anything to an LV on top of this should do
(linear writes). If some random read/write I/O is then added (start of a
VM in my setup), it breaks.

Regards
Matthias

* Re: md-raid5 with bcache member devices => kernel panic
  2013-12-05 23:08   ` Matthias Ferdinand
@ 2013-12-05 23:15     ` Kent Overstreet
  2013-12-06  0:00       ` Matthias Ferdinand
  2013-12-06  3:22       ` Paul B. Henson
  0 siblings, 2 replies; 7+ messages in thread
From: Kent Overstreet @ 2013-12-05 23:15 UTC (permalink / raw)
  To: Matthias Ferdinand; +Cc: nks, linux-bcache

On Fri, Dec 06, 2013 at 12:08:13AM +0100, Matthias Ferdinand wrote:
> On Thu, Dec 05, 2013 at 02:52:34PM -0800, Kent Overstreet wrote:
> > On Thu, Dec 05, 2013 at 10:29:13PM +0100, Matthias Ferdinand wrote:
> > Erk. I thought I was done with these bugs. Nick, do you think you could try and
> > track this down?
> > 
> > Looking at this:
> > http://dl.mfedv.net/md5raid_on_bcache_panic/mdraid5_on_bcache_panic_3.12.2.txt
> > 
> > that's a null pointer deref; if Matthias could get the exact line number it
> I have no idea how to do that - can you help me out on that?
> 
> > Matthias - I'm running bcache on top of a raid6 at home and I've never seen
> I only get panics when I use md-raid on top of bcaches:
> 
>               LVM
>                |
>            md-raid5
>          /     |     \
>   bcache0  bcache1  bcache2
>    |           |       |
>   sdb6       sdc6     sdd6
> 
> (probably an unusual setup; just playing around...)

Yeah, that is unusual. Very odd, though - that is a setup I would definitely
expect to work.

You mentioned it's faster with bcache higher in the stack - do you have any
issues with your preferred setup?

If the bug isn't actually affecting real/preferred use cases then I'll be a lot
less concerned about it - I'm reworking how all this crap works in mainline,
maybe by 3.14 generic_make_request() will be accepting arbitrary size bios and
the code that's probably buggy here will be gone.

* Re: md-raid5 with bcache member devices => kernel panic
  2013-12-05 23:15     ` Kent Overstreet
@ 2013-12-06  0:00       ` Matthias Ferdinand
  2013-12-06  3:22       ` Paul B. Henson
  1 sibling, 0 replies; 7+ messages in thread
From: Matthias Ferdinand @ 2013-12-06  0:00 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: nks, linux-bcache

On Thu, Dec 05, 2013 at 03:15:47PM -0800, Kent Overstreet wrote:
> On Fri, Dec 06, 2013 at 12:08:13AM +0100, Matthias Ferdinand wrote:
> > I only get panics when I use md-raid on top of bcaches:
> > 
> >               LVM
> >                |
> >            md-raid5
> >          /     |     \
> >   bcache0  bcache1  bcache2
> >    |           |       |
> >   sdb6       sdc6     sdd6
> > 
> > (probably an unusual setup; just playing around...)
> 
> Yeah, that is unusual. Very odd though, that setup I would definitely expect to
> work.
> 
> You mentioned it's faster with bcache higher in the stack - do you have any
> issues with your preferred setup?

For simpler administration, I would prefer to have LVM on top. Where
exactly bcache sits below that is not really important (except for the
performance gains, which are the whole point of it all :-).

My preferred layout would be like this:

         LVM
          |
       bcache0
          |
       md-raid5
      /   |    \
  sdb6   sdc6   sdd6

but it is slightly slower compared to having bcache{0,1,2} devices on
top of LVs for the virtual disks.

I only had the idea to RAID bcaches because the default chunk size of
512k equals the erase block size of the SSD (I assume it is 512k, though
I could not find hard evidence for it). And since small writes on RAID5
create tremendous overhead, I hoped to benefit from bcache having already
cached some full chunks.

> If the bug isn't actually affecting real/preferred use cases then I'll be a lot
> less concerned about it - I'm reworking how all this crap works in mainline,
> maybe by 3.14 generic_make_request() will be accepting arbitrary size bios and
> the code that's probably buggy here will be gone.
It is not yet in production use; I am still playing with the setup.

Regards
Matthias

* RE: md-raid5 with bcache member devices => kernel panic
  2013-12-05 23:15     ` Kent Overstreet
  2013-12-06  0:00       ` Matthias Ferdinand
@ 2013-12-06  3:22       ` Paul B. Henson
  1 sibling, 0 replies; 7+ messages in thread
From: Paul B. Henson @ 2013-12-06  3:22 UTC (permalink / raw)
  To: 'Kent Overstreet'; +Cc: linux-bcache

> From: Kent Overstreet
> Sent: Thursday, December 05, 2013 3:16 PM
>
> I'm reworking how all this crap works in mainline,
> maybe by 3.14 generic_make_request() will be accepting arbitrary size bios
> and the code that's probably buggy here will be gone.

Hmm, calling the current code crap is not exactly confidence inspiring for
running 3.11 or 3.12 in production ;).

If you're in an email-answering mood, any chance I could trouble you to
answer the question I posed earlier this week about the -w option to
make-bcache, its relation (or not) to SSD page size, and how to best
configure an SSD with a page size of 8k as a cache device :)?

Thanks.

* Re: md-raid5 with bcache member devices => kernel panic
  2013-12-05 22:52 ` Kent Overstreet
  2013-12-05 23:08   ` Matthias Ferdinand
@ 2013-12-08 23:53   ` Matthias Ferdinand
  1 sibling, 0 replies; 7+ messages in thread
From: Matthias Ferdinand @ 2013-12-08 23:53 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: nks, linux-bcache

On Thu, Dec 05, 2013 at 02:52:34PM -0800, Kent Overstreet wrote:
> that's a null pointer deref; if Matthias could get the exact line number it
> happened on we could tell what variable was null. I _think_ it's *sg because
> it's running off the end of the scatterlist; if that's the case (and you should
> verify that that is what's happening), then what's going on is bcache is sending
> down a bio larger than what the device expects.

I found the kernel config option CONFIG_DEBUG_BUGVERBOSE and tried again
with 3.12.3 and 3.13-rc3. The backtrace now spells out the line number:

    kernel BUG at drivers/scsi/scsi_lib.c:1048!

1028 static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
1029                              gfp_t gfp_mask)
1030 {
1031         int count;
1032
1033         /*
1034          * If sg table allocation fails, requeue request later.
1035          */
1036         if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
1037                                         gfp_mask))) {
1038                 return BLKPREP_DEFER;
1039         }
1040
1041         req->buffer = NULL;
1042
1043         /*
1044          * Next, walk the list, and fill in the addresses and sizes of
1045          * each segment.
1046          */
1047         count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
1048         BUG_ON(count > sdb->table.nents);
1049         sdb->table.nents = count;
1050         sdb->length = blk_rq_bytes(req);
1051         return BLKPREP_OK;
1052 }

Regards
Matthias
