From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: NeilBrown To: Mikulas Patocka Date: Thu, 23 Nov 2017 16:12:46 +1100 Cc: Mike Snitzer , Jens Axboe , "linux-kernel\@vger.kernel.org" , linux-block@vger.kernel.org, device-mapper development , Zdenek Kabelac Subject: Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER In-Reply-To: References: <149776047907.23258.8058071140236879834.stgit@noble> <20170618184143.GA10920@kernel.dk> <87poe13rmm.fsf@notabene.neil.brown.name> <87a7zg31vx.fsf@notabene.neil.brown.name> <20171121013533.GA14520@redhat.com> <20171121121049.GA17014@redhat.com> <20171121124311.GA17243@redhat.com> <20171121194709.GA18903@redhat.com> <20171121225119.GA19630@redhat.com> <87bmjv0xos.fsf@notabene.neil.brown.name> Message-ID: <87o9ntzigh.fsf@notabene.neil.brown.name> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" List-ID: --=-=-= Content-Type: text/plain Content-Transfer-Encoding: quoted-printable On Wed, Nov 22 2017, Mikulas Patocka wrote: > On Wed, 22 Nov 2017, NeilBrown wrote: > >> On Tue, Nov 21 2017, Mikulas Patocka wrote: >>=20 >> > On Tue, 21 Nov 2017, Mike Snitzer wrote: >> > >> >> On Tue, Nov 21 2017 at 4:23pm -0500, >> >> Mikulas Patocka wrote: >> >>=20 >> >> > This is not correct: >> >> >=20 >> >> > 2206 static void dm_wq_work(struct work_struct *work) >> >> > 2207 { >> >> > 2208 struct mapped_device *md =3D container_of(work, str= uct mapped_device, work); >> >> > 2209 struct bio *bio; >> >> > 2210 int srcu_idx; >> >> > 2211 struct dm_table *map; >> >> > 2212 >> >> > 2213 if (!bio_list_empty(&md->rescued)) { >> >> > 2214 struct bio_list list; >> >> > 2215 spin_lock_irq(&md->deferred_lock); >> >> > 2216 list =3D md->rescued; >> >> > 2217 bio_list_init(&md->rescued); >> >> > 2218 spin_unlock_irq(&md->deferred_lock); >> >> > 2219 while ((bio =3D bio_list_pop(&list))) >> >> > 2220 generic_make_request(bio); >> >> > 2221 } >> >> > 2222 >> >> > 2223 map =3D dm_get_live_table(md, &srcu_idx); >> >> > 2224 >> >> > 2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->fla= gs)) { >> >> > 2226 spin_lock_irq(&md->deferred_lock); >> >> > 2227 bio =3D bio_list_pop(&md->deferred); >> >> > 2228 spin_unlock_irq(&md->deferred_lock); >> >> > 2229 >> >> > 2230 if (!bio) >> >> > 2231 break; >> >> > 2232 >> >> > 2233 if (dm_request_based(md)) >> >> > 2234 generic_make_request(bio); >> >> > 2235 else >> >> > 2236 __split_and_process_bio(md, map, bi= o); >> >> > 2237 } >> >> > 2238 >> >> > 2239 dm_put_live_table(md, srcu_idx); >> >> > 2240 } >> >> >=20 >> >> > You can see that if we are in dm_wq_work in __split_and_process_bio= , we=20 >> >> > will not process md->rescued list. >> >>=20 >> >> Can you elaborate further? We cannot be "in dm_wq_work in >> >> __split_and_process_bio" simultaneously. Do you mean as a side-effect >> >> of scheduling away from __split_and_process_bio? >> >>=20 >> >> The more detail you can share the better. >> > >> > Suppose this scenario: >> > >> > * dm_wq_work calls __split_and_process_bio >> > * __split_and_process_bio eventually reaches the function snapshot_map >> > * snapshot_map attempts to take the snapshot lock >> > >> > * the snapshot lock could be released only if some bios submitted by t= he=20 >> > snapshot driver to the underlying device complete >> > * the bios submitted to the underlying device were already offloaded b= y=20 >> > some other task and they are waiting on the list md->rescued >> > * the bios waiting on md->rescued are not processed, because dm_wq_wor= k is=20 >> > blocked in snapshot_map (called from __split_and_process_bio) >>=20 >> Yes, I think you are right.=20 >>=20 >> I think the solution is to get rid of the dm_offload() infrastructure >> and make it not necessary. >> i.e. discard my patches >> dm: prepare to discontinue use of BIOSET_NEED_RESCUER >> and >> dm: revise 'rescue' strategy for bio-based bioset allocations >>=20 >> And build on "dm: ensure bio submission follows a depth-first tree walk" >> which was written after those and already makes dm_offload() less >> important. >>=20 >> Since that "depth-first" patch, every request to the dm device, after >> the initial splitting, allocates just one dm_target_io structure, and >> makes just one __map_bio() call, and so will behave exactly the way >> generic_make_request() expects and copes with - thus avoiding awkward >> dependencies and deadlocks. Except.... >>=20 >> a/ If any target defines ->num_write_bios() to return >1, >> __clone_and_map_data_bio() will make multiple calls to alloc_tio() >> and __map_bio(), which might need rescuing. >> But no target defines num_write_bios, and none have since it was >> removed from dm-cache 4.5 years ago. >> Can we discard num_write_bios?? >>=20 >> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}= _bios >> to a value > 1, then __send_duplicate_bios() will also make multiple >> calls to alloc_tio() and __map_bio(). >> Some do. >> dm-cache-target: flush=3D2 >> dm-snap: flush=3D2 >> dm-stripe: discard, write_same, write_zeroes all set to 'stripes'. >>=20 >> These will only be a problem if the second (or subsequent) alloc_tio() >> blocks waiting for an earlier allocation to complete. This will only >> be a problem if multiple threads are each trying to allocate multiple >> dm_target_io from the same bioset at the same time. >> This is rare and should be easier to address than the current >> dm_offload() approach.=20 >> One possibility would be to copy the approach taken by >> crypt_alloc_buffer() which needs to allocate multiple entries from a >> mempool. >> It first tries the with GFP_NOWAIT. If that fails it take a mutex and >> tries with GFP_NOIO. This mean only one thread will try to allocate >> multiple bios at once, so there can be no deadlock. >>=20 >> Below are two RFC patches. The first removes num_write_bios. >> The second is incomplete and makes a stab are allocating multiple bios >> at once safely. >> A third would be needed to remove dm_offload() etc... but I cannot quite >> fit that in today - must be off. >>=20 >> Thanks, >> NeilBrown > > Another problem is this: > > struct bio *b =3D bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split); > bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9); > bio_chain(b, bio); > > What if it blocks because the bioset is exhausted? > > The code basically builds a chain of bios of unlimited length (suppose fo= r=20 > example a case when we are splitting on every sector boundary, so there=20 > will be one bio for every sector in the original bio), it could exhaust=20 > the bioset easily. > > It would be better to use mechanism from md-raid that chains all the=20 > sub-bios to the same master bio and doesn't create long chains of bios: > > if (max_sectors < bio_sectors(bio)) { > struct bio *split =3D bio_split(bio, max_sectors, > gfp, conf->bio_split); > bio_chain(split, bio); > generic_make_request(bio); > bio =3D split; > r1_bio->master_bio =3D bio; > r1_bio->sectors =3D max_sectors; > } > > Mikulas Yes, you are right something like that would be better. Also send_changing_extent_only allocates bios in a loop which can cause problems. I think we need to get __split_and_process_non_flush(), in all its branches, to check if len is too large for a single request, and if it is, create a clone for the prefix, attached that to the ci and map it, advance the original bio, and call generic_make_request on it. That shouldn't be too hard, but it is a change that would touch a few places. I'll see if I can write something. Thanks, NeilBrown --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEG8Yp69OQ2HB7X0l6Oeye3VZigbkFAloWWM4ACgkQOeye3VZi gblQNhAAg2/8vR6ShfYstyPuctY9sz/4fdkbJfpLJ7qRjifvV42rseStJftUbn2K tM0wiyZS3rhGapOJAXyVdstaFitFIFSynz53LQnW92Am3fVnuyTAoS/JTjB8hXkV nqRwZFfGM29sPi7RlON6xfDNMiBY79AczstneaSQvPXYhqblRJ3eRwPWwSHTON+w KYWd3ZcsSWAVjVLBDV423fIUCv206A1Khpke3sjq31fB14NL6p/5hcn6Qp4oXG+2 3T8CQ5c7ERpCl0UFftj9jD+8YooK81YZxfeFN9lPcMKocMI8nBMA6IOuxHLffDsf 7l+ZlN6FWEZfhdOCa9RqecoxDOJSp1+kCCjted3rFQOXHiienaJRVv5CI4jIcAk2 SKwHehgH1OUmd05lKGzDQ9WS0Y1NEo/8Am+tS/a3BgzX1R7YKeWJZiLMGbzwDYwy zqMcN67N5Rapd/isaF+yCJUle+FSQ09yCWngOmhBjrJzjblOHWOQUN567Vxvme5Z yOuR2N451rlXDX3xYVK26umsCQUQUB8Z/YujUVAOJRaOjcJqYag6IMulvId5jmzW 3rcxS56D88qNimPqoecAd7pXVy0YKjdatNA1wCC22uiVrcehhF4swNQoGAoqI4Aq Vz8RWUNHID4DDL6pZce+F7ZqahwD/YonHwA4vPXCcpEmWB5HwXA= =yyAX -----END PGP SIGNATURE----- --=-=-=--