* next bio iters break discard?
@ 2014-01-13  3:52 Hugh Dickins
  2014-01-14  2:33 ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2014-01-13 3:52 UTC (permalink / raw)
To: Jens Axboe; +Cc: Kent Overstreet, Shaohua Li, Andrew Morton, linux-kernel

When I try to exercise heavy swapping with discard on mmotm 2014-01-09,
I soon hit a NULL pointer dereference in __blk_recalc_rq_segments():

__blk_recalc_rq_segments
blk_recount_segments
ll_back_merge_fn
bio_attempt_back_merge
blk_queue_bio
generic_make_request
submit_bio
blkdev_issue_discard
swap_do_scheduled_discard
scan_swap_map_try_ssd_cluster
scan_swap_map
get_swap_page
add_to_swap
shrink_page_list
etc. etc.

The crash is on the NULL struct page pointer in page_to_pfn(bv.bv_page)
on line 35 of block/blk-merge.c.

The code around there is not very different from 3.13-rc8 (which doesn't
crash), and I didn't notice REQ_DISCARD or bio_has_data() checks removed.

I think it worked before because the old bio_for_each_segment()
iterator was a straightforward "i < bio->bi_vcnt" loop which would
do nothing when bi_vcnt is 0; but the new iterators are relying
(perhaps) on bio->bi_iter.bi_size which is non-0 despite no data?

I expect it would crash in the same way on other recent nexts and
mmotms, I've not tried.

Hugh

^ permalink raw reply	[flat|nested] 15+ messages in thread
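[Hugh's hypothesis above can be sketched in plain userspace C. This is a simplified model, not the real kernel iterators: the old walk is bounded by bi_vcnt and so never touches an empty biovec, while a walk driven by the remaining byte count would try to read a bvec that was never filled in.]

```c
#include <assert.h>
#include <stddef.h>

struct bio_vec { void *bv_page; unsigned int bv_len; };

/* Simplified bio: a discard carries a byte count but no biovec entries. */
struct bio {
	struct bio_vec *bi_io_vec;
	unsigned short bi_vcnt;  /* 0 for a discard */
	unsigned int bi_size;    /* bytes covered - non-zero for a discard */
};

/* Old-style iteration: "i < bio->bi_vcnt" - does nothing when bi_vcnt == 0. */
static int old_count_segments(const struct bio *bio)
{
	int i, n = 0;
	for (i = 0; i < bio->bi_vcnt; i++)
		n++;
	return n;
}

/* New-style iteration keyed off bi_size: with no biovec entries it would
 * dereference a bvec that does not exist (a NULL bv_page for a discard). */
static int new_iter_touches_missing_bvec(const struct bio *bio)
{
	return bio->bi_size > 0 && bio->bi_vcnt == 0;
}
```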
* Re: next bio iters break discard?
  2014-01-13  3:52 next bio iters break discard? Hugh Dickins
@ 2014-01-14  2:33 ` Kent Overstreet
  2014-01-14  4:06   ` Martin K. Petersen
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2014-01-14 2:33 UTC (permalink / raw)
To: Hugh Dickins
Cc: Jens Axboe, Shaohua Li, Andrew Morton, Martin K. Petersen, linux-kernel

On Sun, Jan 12, 2014 at 07:52:40PM -0800, Hugh Dickins wrote:
> When I try to exercise heavy swapping with discard on mmotm 2014-01-09,
> I soon hit a NULL pointer dereference in __blk_recalc_rq_segments():
> 
> __blk_recalc_rq_segments
> blk_recount_segments
> ll_back_merge_fn
> bio_attempt_back_merge
> blk_queue_bio
> generic_make_request
> submit_bio
> blkdev_issue_discard
> swap_do_scheduled_discard
> scan_swap_map_try_ssd_cluster
> scan_swap_map
> get_swap_page
> add_to_swap
> shrink_page_list
> etc. etc.
> 
> The crash is on the NULL struct page pointer in page_to_pfn(bv.bv_page)
> on line 35 of block/blk-merge.c.
> 
> The code around there is not very different from 3.13-rc8 (which doesn't
> crash), and I didn't notice REQ_DISCARD or bio_has_data() checks removed.
> 
> I think it worked before because the old bio_for_each_segment()
> iterator was a straightforward "i < bio->bi_vcnt" loop which would
> do nothing when bi_vcnt is 0; but the new iterators are relying
> (perhaps) on bio->bi_iter.bi_size which is non-0 despite no data?
> 
> I expect it would crash in the same way on other recent nexts and
> mmotms, I've not tried.
> 
> Hugh

Ugh, discards. Wonder why this wasn't seen sooner - I can't figure out
what the null pointer deref actually was, but if it was
__blk_recalc_rq_segments() blowing up, that shouldn't have had to wait
for two discards to get merged to get called.

(Calling bio_for_each_segment() on REQ_DISCARD/REQ_WRITE_SAME bios
should in general work; bio_advance_iter() checks against
BIO_NO_ADVANCE_ITER_MASK to determine whether to advance down the bvec
or just decrement bi_size. But for counting segments,
bio_for_each_segment() is definitely not what we want.)

I think for discards we can deal with this easily enough -
__blk_recalc_rq_segments() will have to special case them - but there's
a similar (but worse) issue with WRITE_SAME, and looking at the code it
does attempt to merge WRITE_SAME requests too.

Jens, Martin - are we sure we want to merge WRITE_SAME requests? I'm
not sure I even see how that makes sense; at the very least it's a
potential minefield.

^ permalink raw reply	[flat|nested] 15+ messages in thread
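[Kent's point about bio_advance_iter() can be modeled in userspace C. A sketch only - the flag values are illustrative, not the kernel's, and bvec walking is reduced to an index bump: for bios whose flags fall in BIO_NO_ADVANCE_ITER_MASK, advancing the iterator only shrinks the remaining byte count; for ordinary bios it also steps through the biovec.]

```c
#include <assert.h>

/* Illustrative flag bits - the real values live in the block headers. */
#define REQ_WRITE_SAME (1u << 0)
#define REQ_DISCARD    (1u << 1)
#define BIO_NO_ADVANCE_ITER_MASK (REQ_DISCARD | REQ_WRITE_SAME)

struct bvec_iter {
	unsigned int bi_sector; /* in 512-byte sectors */
	unsigned int bi_size;   /* bytes remaining */
	unsigned int bi_idx;    /* current index into the biovec */
};

/* Modeled on bio_advance_iter(): discard/write-same bios have no biovec
 * to walk, so only the sector and the remaining byte count are updated;
 * ordinary bios additionally advance through the bvec array. */
static void advance_iter(unsigned int bi_rw, struct bvec_iter *iter,
			 unsigned int bytes)
{
	iter->bi_sector += bytes >> 9;
	iter->bi_size -= bytes;

	if (!(bi_rw & BIO_NO_ADVANCE_ITER_MASK))
		iter->bi_idx++; /* stand-in for stepping across bvec boundaries */
}
```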
* Re: next bio iters break discard?
  2014-01-14  2:33 ` Kent Overstreet
@ 2014-01-14  4:06   ` Martin K. Petersen
  2014-01-14  4:48     ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2014-01-14 4:06 UTC (permalink / raw)
To: Kent Overstreet
Cc: Hugh Dickins, Jens Axboe, Shaohua Li, Andrew Morton, Martin K. Petersen, linux-kernel

>>>>> "Kent" == Kent Overstreet <kmo@daterainc.com> writes:

Kent,

Kent> I think for discards we can deal with this easily enough -
Kent> __blk_recalc_rq_segments() will have to special case them - but
Kent> there's a similar (but worse) issue with WRITE_SAME, and looking
Kent> at the code it does attempt to merge WRITE_SAME requests too.

DISCARD bios have no payload going down the stack. They get a payload
attached in the sd driver and will therefore have a single bvec at
completion time.

WRITE_SAME bios have a single bvec payload throughout their lifetime.

For both these types of requests we never attempt to merge the actual
payloads. But the block range worked on may shrink or grow as the bio is
split or merged going down the stack.

IOW, DISCARD, WRITE SAME and the impending COPY requests do not have a
1:1 mapping between the block range worked on and the size of any bvecs
attached. Your recent changes must have changed the way we handled that
in the past.

I'll take a look tomorrow...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-14  4:06 ` Martin K. Petersen
@ 2014-01-14  4:48   ` Kent Overstreet
  2014-01-14 20:17     ` Martin K. Petersen
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2014-01-14 4:48 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Hugh Dickins, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Mon, Jan 13, 2014 at 11:06:33PM -0500, Martin K. Petersen wrote:
> >>>>> "Kent" == Kent Overstreet <kmo@daterainc.com> writes:
> 
> Kent,
> 
> Kent> I think for discards we can deal with this easily enough -
> Kent> __blk_recalc_rq_segments() will have to special case them - but
> Kent> there's a similar (but worse) issue with WRITE_SAME, and looking
> Kent> at the code it does attempt to merge WRITE_SAME requests too.
> 
> DISCARD bios have no payload going down the stack. They get a payload
> attached in the sd driver and will therefore have a single bvec at
> completion time.
> 
> WRITE_SAME bios have a single bvec payload throughout their lifetime.
> 
> For both these types of requests we never attempt to merge the actual
> payloads. But the block range worked on may shrink or grow as the bio is
> split or merged going down the stack.
> 
> IOW, DISCARD, WRITE SAME and the impending COPY requests do not have a
> 1:1 mapping between the block range worked on and the size of any bvecs
> attached. Your recent changes must have changed the way we handled that
> in the past.

Yeah - but with WRITE_SAME bios, wouldn't we at least have to check that
they're writing the same data to merge them?

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-14  4:48   ` Kent Overstreet
@ 2014-01-14 20:17     ` Martin K. Petersen
  2014-01-14 22:24       ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2014-01-14 20:17 UTC (permalink / raw)
To: Kent Overstreet
Cc: Martin K. Petersen, Hugh Dickins, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

>>>>> "Kent" == Kent Overstreet <kmo@daterainc.com> writes:

>> IOW, DISCARD, WRITE SAME and the impending COPY requests do not have
>> a 1:1 mapping between the block range worked on and the size of any
>> bvecs attached. Your recent changes must have changed the way we
>> handled that in the past.

Kent> Yeah - but with WRITE_SAME bios, wouldn't we at least have to
Kent> check that they're writing the same data to merge them?

We do. Check blk_write_same_mergeable().

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread
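[The check Martin points at amounts to comparing the single payload bvec of the two bios. The userspace sketch below is an approximation from memory of the era's helper in block/blk.h, not the literal kernel code - the struct and field names here are stand-ins.]

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in: a WRITE SAME bio carries exactly one payload buffer
 * holding the block to be repeated across the target range. */
struct bio { void *payload; };

/* Modeled on blk_write_same_mergeable(): two WRITE SAME bios are merge
 * candidates only when they would write the very same data, which the
 * block layer approximates by comparing the payload they point at. */
static bool write_same_mergeable(const struct bio *a, const struct bio *b)
{
	return a->payload == b->payload;
}
```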
* Re: next bio iters break discard?
  2014-01-14 20:17     ` Martin K. Petersen
@ 2014-01-14 22:24       ` Kent Overstreet
  2014-01-16  1:39         ` Martin K. Petersen
  2014-01-16 20:21         ` Hugh Dickins
  0 siblings, 2 replies; 15+ messages in thread
From: Kent Overstreet @ 2014-01-14 22:24 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Hugh Dickins, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Tue, Jan 14, 2014 at 03:17:32PM -0500, Martin K. Petersen wrote:
> >>>>> "Kent" == Kent Overstreet <kmo@daterainc.com> writes:
> 
> >> IOW, DISCARD, WRITE SAME and the impending COPY requests do not have
> >> a 1:1 mapping between the block range worked on and the size of any
> >> bvecs attached. Your recent changes must have changed the way we
> >> handled that in the past.
> 
> Kent> Yeah - but with WRITE_SAME bios, wouldn't we at least have to
> Kent> check that they're writing the same data to merge them?
> 
> We do. Check blk_write_same_mergeable().

Aha, I missed that.

Side note: one of the things on my todo list/wishlist is to make a
separate enum for bio type - bare flush, normal read/write, scsi
command, discard and write same. In particular, with bi_size meaning
different things depending on the type of the command, I think having an
enum we can switch off of where appropriate, instead of a bunch of
random flags, will make things a hell of a lot saner.

Looking more at this code, I'm not sure it did what we wanted before for
REQ_DISCARD and REQ_WRITE_SAME bios/requests - I suspect for
REQ_WRITE_SAME requests that had been merged it overcounted.

There's also that horrible hack in the scsi (?) code Christoph added for
discards - where the driver adds a page to the bio - if the driver is
counting the number of segments _after_ that, god only knows what is
supposed to happen.

Does the below patch look like what we want? I'm assuming that if
multiple WRITE_SAME bios are merged, since they're all writing the same
data we can consider the entire request to be a single segment.

commit 1755e7ffc5745591d37b8956ce2512f4052a104a
Author: Kent Overstreet <kmo@daterainc.com>
Date:   Tue Jan 14 14:22:01 2014 -0800

    block: Explicitly handle discard/write same when counting segments

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8f8adaa..7d977f8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -21,6 +21,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	if (!bio)
 		return 0;
 
+	if (bio->bi_rw & REQ_DISCARD)
+		return 0;
+
+	if (bio->bi_rw & REQ_WRITE_SAME)
+		return 1;
+
 	fbio = bio;
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-14 22:24       ` Kent Overstreet
@ 2014-01-16  1:39         ` Martin K. Petersen
  1 sibling, 0 replies; 15+ messages in thread
From: Martin K. Petersen @ 2014-01-16 1:39 UTC (permalink / raw)
To: Kent Overstreet
Cc: Martin K. Petersen, Hugh Dickins, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

>>>>> "Kent" == Kent Overstreet <kmo@daterainc.com> writes:

Kent> Side note, one of the things on my todo list/wishlist is make a
Kent> separate enum for bio type - bare flush, normal read/write, scsi
Kent> command, discard and write same. In particular with bi_size
Kent> meaning different things depending on the type of the command, I
Kent> think having an enum we can switch off of where appropriate
Kent> instead of a bunch of random flags will make things a hell of a
Kent> lot saner.

Yeah. That's one of the reasons we cleaned up the merge flags and
introduced bio_has_data() and bio_is_rw().

Kent> There's also that horrible hack in the scsi (?) code Christoph
Kent> added for discards - where the driver adds a page to the bio - if
Kent> the driver is counting the number of segments _after_ that god
Kent> only knows what is supposed to happen.

I manually do the bookkeeping in the SCSI disk driver. I.e. LLDs are
explicitly told to perform a 512-byte transfer. And when they're done I
complete bi_size bytes to the block layer.

Kind of ugly, but it's an unfortunate side effect of bi_size being both
the size of the scatterlist and the block count. I did consider making
them separate entities but that means struct bio will grow for no reason
for the common case of read/write. Didn't seem worth the hassle...

Kent> Does the below patch look like what we want? I'm assuming that if
Kent> multiple WRITE_SAME bios are merged, since they're all writing the
Kent> same data we can consider the entire request to be a single
Kent> segment.

Looks OK to me.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-14 22:24       ` Kent Overstreet
  2014-01-16  1:39         ` Martin K. Petersen
@ 2014-01-16 20:21         ` Hugh Dickins
  2014-01-17  1:06           ` Kent Overstreet
  2014-01-17  1:21           ` Kent Overstreet
  1 sibling, 2 replies; 15+ messages in thread
From: Hugh Dickins @ 2014-01-16 20:21 UTC (permalink / raw)
To: Kent Overstreet
Cc: Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Tue, 14 Jan 2014, Kent Overstreet wrote:
> 
> Does the below patch look like what we want? I'm assuming that if

You don't fill me with confidence ;)

> multiple WRITE_SAME bios are merged, since they're all writing the same
> data we can consider the entire request to be a single segment.
> 
> commit 1755e7ffc5745591d37b8956ce2512f4052a104a
> Author: Kent Overstreet <kmo@daterainc.com>
> Date:   Tue Jan 14 14:22:01 2014 -0800
> 
>     block: Explicitly handle discard/write same when counting segments
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8f8adaa..7d977f8 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -21,6 +21,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
>  	if (!bio)
>  		return 0;
>  
> +	if (bio->bi_rw & REQ_DISCARD)
> +		return 0;
> +
> +	if (bio->bi_rw & REQ_WRITE_SAME)
> +		return 1;
> +
>  	fbio = bio;
>  	cluster = blk_queue_cluster(q);
>  	seg_size = 0;

For me this just shifts the crash,
from __blk_recalc_rq_segments() to blk_rq_map_sg():

blk_rq_map_sg
scsi_init_sgtable
scsi_init_io
scsi_setup_blk_pc_cmnd
sd_prep_fn
blk_peek_request
scsi_request_fn
__blk_run_queue
blk_run_queue
scsi_run_queue
scsi_next_command
scsi_io_completion
scsi_finish_command
scsi_softirq_done
blk_done_softirq
__do_softirq
irq_exit
do_IRQ
common_interrupt
<EOI> cpuidle_idle_call
arch_cpu_idle
cpu_startup_entry
start_secondary

It's GPF'ing on struct scatterlist *sg 0x800000001473e064 in

static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
	unsigned long page_link = sg->page_link & 0x3;

It appears to be in the static inline __blk_segment_map_sg(),
and that GPF'ing address is what it just got from sg_next().

Sorry, this isn't the kind of dump you'll be used to, but it's the
best I can do at the moment, and I've just had to reboot the machine.

O, tried again and it hit the BUG_ON(count > sdb->table.nents)
on line 1048 of drivers/scsi/scsi_lib.c:

scsi_init_sgtable
<IRQ> scsi_init_io
scsi_setup_blk_pc_cmnd
sd_setup_discard_cmnd
sd_prep_fn
blk_peek_request
etc. as before

I'll have to leave the machine shortly - I'm rather hoping
you can do your own discard testing to see such crashes.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-16 20:21         ` Hugh Dickins
@ 2014-01-17  1:06           ` Kent Overstreet
  2014-01-17  1:21           ` Kent Overstreet
  1 sibling, 0 replies; 15+ messages in thread
From: Kent Overstreet @ 2014-01-17 1:06 UTC (permalink / raw)
To: Hugh Dickins
Cc: Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Thu, Jan 16, 2014 at 12:21:10PM -0800, Hugh Dickins wrote:
> On Tue, 14 Jan 2014, Kent Overstreet wrote:
> > 
> > Does the below patch look like what we want? I'm assuming that if
> 
> You don't fill me with confidence ;)
> 
> > multiple WRITE_SAME bios are merged, since they're all writing the same
> > data we can consider the entire request to be a single segment.
> > 
> > commit 1755e7ffc5745591d37b8956ce2512f4052a104a
> > Author: Kent Overstreet <kmo@daterainc.com>
> > Date:   Tue Jan 14 14:22:01 2014 -0800
> > 
> >     block: Explicitly handle discard/write same when counting segments
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 8f8adaa..7d977f8 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -21,6 +21,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> >  	if (!bio)
> >  		return 0;
> >  
> > +	if (bio->bi_rw & REQ_DISCARD)
> > +		return 0;
> > +
> > +	if (bio->bi_rw & REQ_WRITE_SAME)
> > +		return 1;
> > +
> >  	fbio = bio;
> >  	cluster = blk_queue_cluster(q);
> >  	seg_size = 0;
> 
> For me this just shifts the crash,
> from __blk_recalc_rq_segments() to blk_rq_map_sg():
> 
> blk_rq_map_sg
> scsi_init_sgtable
> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_prep_fn
> blk_peek_request
> scsi_request_fn
> __blk_run_queue
> blk_run_queue
> scsi_run_queue
> scsi_next_command
> scsi_io_completion
> scsi_finish_command
> scsi_softirq_done
> blk_done_softirq
> __do_softirq
> irq_exit
> do_IRQ
> common_interrupt
> <EOI> cpuidle_idle_call
> arch_cpu_idle
> cpu_startup_entry
> start_secondary
> 
> It's GPF'ing on struct scatterlist *sg 0x800000001473e064 in
> 
> static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
> {
> 	unsigned long page_link = sg->page_link & 0x3;
> 
> It appears to be in the static inline __blk_segment_map_sg(),
> and that GPF'ing address is what it just got from sg_next().
> 
> Sorry, this isn't the kind of dump you'll be used to, but it's the
> best I can do at the moment, and I've just had to reboot the machine.
> 
> O, tried again and it hit the BUG_ON(count > sdb->table.nents)
> on line 1048 of drivers/scsi/scsi_lib.c:
> 
> scsi_init_sgtable
> <IRQ> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_setup_discard_cmnd
> sd_prep_fn
> blk_peek_request
> etc. as before
> 
> I'll have to leave the machine shortly - I'm rather hoping
> you can do your own discard testing to see such crashes.

My simple hdparm/scsi_debug based test isn't hitting it - any
suggestions on how to reproduce it?

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-16 20:21         ` Hugh Dickins
  2014-01-17  1:06           ` Kent Overstreet
@ 2014-01-17  1:21           ` Kent Overstreet
  2014-01-31 17:17             ` Hugh Dickins
  1 sibling, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2014-01-17 1:21 UTC (permalink / raw)
To: Hugh Dickins
Cc: Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Thu, Jan 16, 2014 at 12:21:10PM -0800, Hugh Dickins wrote:
> For me this just shifts the crash,
> from __blk_recalc_rq_segments() to blk_rq_map_sg():
> 
> blk_rq_map_sg
> scsi_init_sgtable
> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_prep_fn
> blk_peek_request
> scsi_request_fn
> __blk_run_queue
> blk_run_queue
> scsi_run_queue
> scsi_next_command
> scsi_io_completion
> scsi_finish_command
> scsi_softirq_done
> blk_done_softirq
> __do_softirq
> irq_exit
> do_IRQ
> common_interrupt
> <EOI> cpuidle_idle_call
> arch_cpu_idle
> cpu_startup_entry
> start_secondary
> 
> It's GPF'ing on struct scatterlist *sg 0x800000001473e064 in
> 
> static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
> {
> 	unsigned long page_link = sg->page_link & 0x3;
> 
> It appears to be in the static inline __blk_segment_map_sg(),
> and that GPF'ing address is what it just got from sg_next().
> 
> Sorry, this isn't the kind of dump you'll be used to, but it's the
> best I can do at the moment, and I've just had to reboot the machine.
> 
> O, tried again and it hit the BUG_ON(count > sdb->table.nents)
> on line 1048 of drivers/scsi/scsi_lib.c:
> 
> scsi_init_sgtable
> <IRQ> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_setup_discard_cmnd
> sd_prep_fn
> blk_peek_request
> etc. as before

Ok, I reread the code and figured it out - the analogous change also has
to be made in __blk_segment_map_sg(). I'll mail out a patch for this
tomorrow after I've stared at the code more and had less beer.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-17  1:21           ` Kent Overstreet
@ 2014-01-31 17:17             ` Hugh Dickins
  2014-01-31 21:58               ` Jens Axboe
  2014-02-04 10:17               ` [PATCH] block: Explicitly handle discard/write same segments Kent Overstreet
  0 siblings, 2 replies; 15+ messages in thread
From: Hugh Dickins @ 2014-01-31 17:17 UTC (permalink / raw)
To: Kent Overstreet
Cc: Hugh Dickins, Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Thu, 16 Jan 2014, Kent Overstreet wrote:
> 
> Ok, I reread the code and figured it out - the analogous change also has to be
> made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> I've stared at the code more and had less beer.

I'd been hoping for a patch to try, but now your changes have hit Linus's
tree: so today we have discard broken there too, crashing as originally
reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
page_to_pfn(bv.bv_page).

How to reproduce it? I hope you'll find easier ways, but I get it with
swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
little RAM for a general config of current tree), to get plenty of fairly
chaotic swapping but good forward progress nonetheless (if the sizes are
too small, then it'll just thrash abysmally or be OOM-killed).

But please do send me a patch and I'll give it a try - thanks.

Hugh

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: next bio iters break discard?
  2014-01-31 17:17             ` Hugh Dickins
@ 2014-01-31 21:58               ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2014-01-31 21:58 UTC (permalink / raw)
To: Hugh Dickins
Cc: Kent Overstreet, Martin K. Petersen, Shaohua Li, Andrew Morton, linux-kernel

On Fri, Jan 31 2014, Hugh Dickins wrote:
> On Thu, 16 Jan 2014, Kent Overstreet wrote:
> > 
> > Ok, I reread the code and figured it out - the analogous change also has to be
> > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> > I've stared at the code more and had less beer.
> 
> I'd been hoping for a patch to try, but now your changes have hit Linus's
> tree: so today we have discard broken there too, crashing as originally
> reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
> page_to_pfn(bv.bv_page).
> 
> How to reproduce it? I hope you'll find easier ways, but I get it with
> swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
> what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
> limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
> little RAM for a general config of current tree), to get plenty of fairly
> chaotic swapping but good forward progress nonetheless (if the sizes are
> too small, then it'll just thrash abysmally or be OOM-killed).
> 
> But please do send me a patch and I'll give it a try - thanks.

Hugh, sorry about that, this particular issue slipped my mind. Will get
this fixed up as soon as possible, hopefully before -rc1...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH] block: Explicitly handle discard/write same segments
  2014-01-31 17:17             ` Hugh Dickins
  2014-01-31 21:58               ` Jens Axboe
@ 2014-02-04 10:17               ` Kent Overstreet
  2014-02-04 12:25                 ` Hugh Dickins
  1 sibling, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2014-02-04 10:17 UTC (permalink / raw)
To: Hugh Dickins
Cc: Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

Immutable biovecs changed the way biovecs are interpreted - drivers no
longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
using part of an existing segment without modifying it).

This breaks with discards and write_same bios, since for those bi_size
has nothing to do with segments in the biovec. So for now, we need a
fairly gross hack - we fortunately know that there will never be more
than one segment for the entire request, so we can special case
discard/write_same.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
---
On Fri, Jan 31, 2014 at 09:17:25AM -0800, Hugh Dickins wrote:
> On Thu, 16 Jan 2014, Kent Overstreet wrote:
> > 
> > Ok, I reread the code and figured it out - the analogous change also has to be
> > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> > I've stared at the code more and had less beer.
> 
> I'd been hoping for a patch to try, but now your changes have hit Linus's
> tree: so today we have discard broken there too, crashing as originally
> reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
> page_to_pfn(bv.bv_page).
> 
> How to reproduce it? I hope you'll find easier ways, but I get it with
> swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
> what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
> limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
> little RAM for a general config of current tree), to get plenty of fairly
> chaotic swapping but good forward progress nonetheless (if the sizes are
> too small, then it'll just thrash abysmally or be OOM-killed).
> 
> But please do send me a patch and I'll give it a try - thanks.

Hugh - can you give this patch a try? Passes my tests but I was never
able to reproduce your crash, unfortunately.

 block/blk-merge.c | 91 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 29 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8f8adaa954..6c583f9c5b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -21,6 +21,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	if (!bio)
 		return 0;
 
+	/*
+	 * This should probably be returning 0, but blk_add_request_payload()
+	 * (Christoph!!!!)
+	 */
+	if (bio->bi_rw & REQ_DISCARD)
+		return 1;
+
+	if (bio->bi_rw & REQ_WRITE_SAME)
+		return 1;
+
 	fbio = bio;
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;
@@ -161,30 +171,60 @@ new_segment:
 	*bvprv = *bvec;
 }
 
-/*
- * map a request to scatterlist, return number of sg entries setup. Caller
- * must make sure sg can hold rq->nr_phys_segments entries
- */
-int blk_rq_map_sg(struct request_queue *q, struct request *rq,
-		  struct scatterlist *sglist)
+static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
+			     struct scatterlist *sglist,
+			     struct scatterlist **sg)
 {
 	struct bio_vec bvec, bvprv = { NULL };
-	struct req_iterator iter;
-	struct scatterlist *sg;
+	struct bvec_iter iter;
 	int nsegs, cluster;
 
 	nsegs = 0;
 	cluster = blk_queue_cluster(q);
 
-	/*
-	 * for each bio in rq
-	 */
-	sg = NULL;
-	rq_for_each_segment(bvec, rq, iter) {
-		__blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
-				     &nsegs, &cluster);
-	} /* segments in rq */
+	if (bio->bi_rw & REQ_DISCARD) {
+		/*
+		 * This is a hack - drivers should be neither modifying the
+		 * biovec, nor relying on bi_vcnt - but because of
+		 * blk_add_request_payload(), a discard bio may or may not have
+		 * a payload we need to set up here (thank you Christoph) and
+		 * bi_vcnt is really the only way of telling if we need to.
+		 */
+
+		if (bio->bi_vcnt)
+			goto single_segment;
+
+		return 0;
+	}
+
+	if (bio->bi_rw & REQ_WRITE_SAME) {
+single_segment:
+		*sg = sglist;
+		bvec = bio_iovec(bio);
+		sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
+		return 1;
+	}
+
+	for_each_bio(bio)
+		bio_for_each_segment(bvec, bio, iter)
+			__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
+					     &nsegs, &cluster);
+
+	return nsegs;
+}
+
+/*
+ * map a request to scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold rq->nr_phys_segments entries
+ */
+int blk_rq_map_sg(struct request_queue *q, struct request *rq,
+		  struct scatterlist *sglist)
+{
+	struct scatterlist *sg = NULL;
+	int nsegs = 0;
+
+	if (rq->bio)
+		nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
 
 	if (unlikely(rq->cmd_flags & REQ_COPY_USER) &&
 	    (blk_rq_bytes(rq) & q->dma_pad_mask)) {
@@ -230,20 +270,13 @@ EXPORT_SYMBOL(blk_rq_map_sg);
 int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
 		   struct scatterlist *sglist)
 {
-	struct bio_vec bvec, bvprv = { NULL };
-	struct scatterlist *sg;
-	int nsegs, cluster;
-	struct bvec_iter iter;
-
-	nsegs = 0;
-	cluster = blk_queue_cluster(q);
-
-	sg = NULL;
-	bio_for_each_segment(bvec, bio, iter) {
-		__blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
-				     &nsegs, &cluster);
-	} /* segments in bio */
+	struct scatterlist *sg = NULL;
+	int nsegs;
+	struct bio *next = bio->bi_next;
+	bio->bi_next = NULL;
 
+	nsegs = __blk_bios_map_sg(q, bio, sglist, &sg);
+	bio->bi_next = next;
 	if (sg)
 		sg_mark_end(sg);

-- 
1.9.rc1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH] block: Explicitly handle discard/write same segments
  2014-02-04 10:17               ` [PATCH] block: Explicitly handle discard/write same segments Kent Overstreet
@ 2014-02-04 12:25                 ` Hugh Dickins
  2014-02-04 12:35                   ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2014-02-04 12:25 UTC (permalink / raw)
To: Kent Overstreet
Cc: Hugh Dickins, Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel

On Tue, 4 Feb 2014, Kent Overstreet wrote:
> Immutable biovecs changed the way biovecs are interpreted - drivers no
> longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
> using part of an existing segment without modifying it).
> 
> This breaks with discards and write_same bios, since for those bi_size
> has nothing to do with segments in the biovec. So for now, we need a
> fairly gross hack - we fortunately know that there will never be more
> than one segment for the entire request, so we can special case
> discard/write_same.
> 
> Signed-off-by: Kent Overstreet <kmo@daterainc.com>
> ---
> On Fri, Jan 31, 2014 at 09:17:25AM -0800, Hugh Dickins wrote:
> > On Thu, 16 Jan 2014, Kent Overstreet wrote:
> > > 
> > > Ok, I reread the code and figured it out - the analogous change also has to be
> > > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> > > I've stared at the code more and had less beer.
> > 
> > I'd been hoping for a patch to try, but now your changes have hit Linus's
> > tree: so today we have discard broken there too, crashing as originally
> > reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
> > page_to_pfn(bv.bv_page).
> > 
> > How to reproduce it? I hope you'll find easier ways, but I get it with
> > swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
> > what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
> > limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
> > little RAM for a general config of current tree), to get plenty of fairly
> > chaotic swapping but good forward progress nonetheless (if the sizes are
> > too small, then it'll just thrash abysmally or be OOM-killed).
> > 
> > But please do send me a patch and I'll give it a try - thanks.
> 
> Hugh - can you give this patch a try? Passes my tests but I was never
> able to reproduce your crash, unfortunately.

Thanks a lot, Kent: I'm glad to report, this is working fine for me.

Hugh

> 
>  block/blk-merge.c | 91 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 62 insertions(+), 29 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8f8adaa954..6c583f9c5b 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -21,6 +21,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
>  	if (!bio)
>  		return 0;
>  
> +	/*
> +	 * This should probably be returning 0, but blk_add_request_payload()
> +	 * (Christoph!!!!)
> +	 */
> +	if (bio->bi_rw & REQ_DISCARD)
> +		return 1;
> +
> +	if (bio->bi_rw & REQ_WRITE_SAME)
> +		return 1;
> +
>  	fbio = bio;
>  	cluster = blk_queue_cluster(q);
>  	seg_size = 0;
> @@ -161,30 +171,60 @@ new_segment:
>  	*bvprv = *bvec;
>  }
>  
> -/*
> - * map a request to scatterlist, return number of sg entries setup. Caller
> - * must make sure sg can hold rq->nr_phys_segments entries
> - */
> -int blk_rq_map_sg(struct request_queue *q, struct request *rq,
> -		  struct scatterlist *sglist)
> +static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
> +			     struct scatterlist *sglist,
> +			     struct scatterlist **sg)
>  {
>  	struct bio_vec bvec, bvprv = { NULL };
> -	struct req_iterator iter;
> -	struct scatterlist *sg;
> +	struct bvec_iter iter;
>  	int nsegs, cluster;
>  
>  	nsegs = 0;
>  	cluster = blk_queue_cluster(q);
>  
> -	/*
> -	 * for each bio in rq
> -	 */
> -	sg = NULL;
> -	rq_for_each_segment(bvec, rq, iter) {
> -		__blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
> -				     &nsegs, &cluster);
> -	} /* segments in rq */
> +	if (bio->bi_rw & REQ_DISCARD) {
> +		/*
> +		 * This is a hack - drivers should be neither modifying the
> +		 * biovec, nor relying on bi_vcnt - but because of
> +		 * blk_add_request_payload(), a discard bio may or may not have
> +		 * a payload we need to set up here (thank you Christoph) and
> +		 * bi_vcnt is really the only way of telling if we need to.
> +		 */
> +
> +		if (bio->bi_vcnt)
> +			goto single_segment;
> +
> +		return 0;
> +	}
> +
> +	if (bio->bi_rw & REQ_WRITE_SAME) {
> +single_segment:
> +		*sg = sglist;
> +		bvec = bio_iovec(bio);
> +		sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
> +		return 1;
> +	}
> +
> +	for_each_bio(bio)
> +		bio_for_each_segment(bvec, bio, iter)
> +			__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
> +					     &nsegs, &cluster);
> +
> +	return nsegs;
> +}
> +
> +/*
> + * map a request to scatterlist, return number of sg entries setup. Caller
> + * must make sure sg can hold rq->nr_phys_segments entries
> + */
> +int blk_rq_map_sg(struct request_queue *q, struct request *rq,
> +		  struct scatterlist *sglist)
> +{
> +	struct scatterlist *sg = NULL;
> +	int nsegs = 0;
> +
> +	if (rq->bio)
> +		nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
>  
>  	if (unlikely(rq->cmd_flags & REQ_COPY_USER) &&
>  	    (blk_rq_bytes(rq) & q->dma_pad_mask)) {
> @@ -230,20 +270,13 @@ EXPORT_SYMBOL(blk_rq_map_sg);
>  int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
>  		   struct scatterlist *sglist)
>  {
> -	struct bio_vec bvec, bvprv = { NULL };
> -	struct scatterlist *sg;
> -	int nsegs, cluster;
> -	struct bvec_iter iter;
> -
> -	nsegs = 0;
> -	cluster = blk_queue_cluster(q);
> -
> -	sg = NULL;
> -	bio_for_each_segment(bvec, bio, iter) {
> -		__blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
> -				     &nsegs, &cluster);
> -	} /* segments in bio */
> +	struct scatterlist *sg = NULL;
> +	int nsegs;
> +	struct bio *next = bio->bi_next;
> +	bio->bi_next = NULL;
>  
> +	nsegs = __blk_bios_map_sg(q, bio, sglist, &sg);
> +	bio->bi_next = next;
> 	if (sg)
> 		sg_mark_end(sg);
> 
> -- 
> 1.9.rc1

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH] block: Explicitly handle discard/write same segments
From: Kent Overstreet @ 2014-02-04 12:35 UTC (permalink / raw)
To: Hugh Dickins
Cc: Martin K. Petersen, Jens Axboe, Shaohua Li, Andrew Morton, linux-kernel@vger.kernel.org

Thanks!

On Tue, Feb 4, 2014 at 4:25 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 4 Feb 2014, Kent Overstreet wrote:
>
>> Immutable biovecs changed the way biovecs are interpreted - drivers no
>> longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
>> using part of an existing segment without modifying it).
>>
>> This breaks with discards and write_same bios, since for those bi_size
>> has nothing to do with segments in the biovec. So for now, we need a
>> fairly gross hack - we fortunately know that there will never be more
>> than one segment for the entire request, so we can special case
>> discard/write_same.
>>
>> Signed-off-by: Kent Overstreet <kmo@daterainc.com>
>> ---
>> Hugh - can you give this patch a try? Passes my tests but I was never
>> able to reproduce your crash, unfortunately.
>
> Thanks a lot, Kent: I'm glad to report, this is working fine for me.
>
> Hugh
end of thread, other threads:[~2014-02-04 12:35 UTC | newest]

Thread overview: 15+ messages:
2014-01-13  3:52 next bio iters break discard? Hugh Dickins
2014-01-14  2:33 ` Kent Overstreet
2014-01-14  4:06   ` Martin K. Petersen
2014-01-14  4:48     ` Kent Overstreet
2014-01-14 20:17       ` Martin K. Petersen
2014-01-14 22:24         ` Kent Overstreet
2014-01-16  1:39           ` Martin K. Petersen
2014-01-16 20:21 ` Hugh Dickins
2014-01-17  1:06   ` Kent Overstreet
2014-01-17  1:21     ` Kent Overstreet
2014-01-31 17:17       ` Hugh Dickins
2014-01-31 21:58         ` Jens Axboe
2014-02-04 10:17           ` [PATCH] block: Explicitly handle discard/write same segments Kent Overstreet
2014-02-04 12:25             ` Hugh Dickins
2014-02-04 12:35               ` Kent Overstreet