Date: Tue, 4 Apr 2023 20:29:00 +0000 (UTC)
From: Adriano Silva
To: Eric Wheeler, Coly Li
Cc: Bcache Linux, Martin McClure
Subject: Re: Writeback cache all used.

Hello,

> It sounds like a large cache with limited memory for caching B+tree
> nodes?
> If memory is limited and not all B+tree nodes on the hot I/O paths can
> stay in memory, such behavior is possible. In this case, shrinking the
> cached data may reduce the metadata, and consequently the in-memory
> B+tree nodes, as well. Yes, it may be helpful for such a scenario.

There are several servers (ten), all with 128 GB of RAM, of which around 100 GB (on average) is reported by the OS as free. The cache is 594 GB on enterprise NVMe; mass storage is 6 TB. The configuration is the same on all of them. They run Ceph OSDs to serve a pool of disks accessed by servers (others, including themselves). All show the same behavior.

When they were installed, they did not occupy the entire cache. Over time the cache gradually filled up and never decreased in size. I have another five servers in another cluster going the same way. During the night their workload is reduced.

> But what is the I/O pattern here? If all the cache space is occupied by
> clean data from read requests, and write performance is what is cared
> about, is this a write-heavy, read-heavy, or mixed workload?

The workload is heavier on writes. Both reads and writes are important, but write latency is critical. These are virtual machine disks stored on Ceph. Inside them we have mixed loads: Windows with terminal services, Linux, and a database where direct write latency is critical.
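For reference, here is a minimal Python sketch of how the occupancy and btree metadata footprint described above could be read from sysfs. It assumes the cache_available_percent, btree_cache_size and per-cache priority_stats attributes are present under /sys/fs/bcache/<cset-uuid>/; none of these paths is confirmed in this thread and the layout may vary by kernel version, so treat it as illustrative only:

#!/usr/bin/env python3
"""Sketch: print bcache cache-set occupancy and btree metadata size via sysfs.

Assumes the standard attributes (cache_available_percent, btree_cache_size,
cacheN/priority_stats) exist under /sys/fs/bcache/<cset-uuid>/ on this kernel;
names and layout can differ between versions, so verify before relying on it.
"""
import glob
import os


def read_attr(base, name):
    """Return the contents of a sysfs attribute, or 'n/a' if unreadable."""
    try:
        with open(os.path.join(base, name)) as f:
            return f.read().strip()
    except OSError:
        return "n/a"


for cset in glob.glob("/sys/fs/bcache/*-*-*-*-*"):  # cache-set UUID directories
    print("cache set:", os.path.basename(cset))
    print("  cache_available_percent:", read_attr(cset, "cache_available_percent"))
    print("  btree_cache_size:", read_attr(cset, "btree_cache_size"))
    # priority_stats (working set, unused %, metadata %) lives per cache device
    for cache in sorted(glob.glob(os.path.join(cset, "cache[0-9]*"))):
        print("  %s priority_stats:" % os.path.basename(cache))
        for line in read_attr(cache, "priority_stats").splitlines():
            print("    " + line)

On one of the servers above, something like this would show how much of the 594 GB cache set bcache still considers available and how much memory the in-memory B+tree cache is actually using.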
> As I explained, the re-reclaim is already there. But it cannot help too
> much if busy I/O requests keep coming and the writeback and gc threads
> have no spare time to run.
> If the incoming I/Os exceed the service capacity of the cache service
> window, disappointed requesters can be expected.

Today the ten servers have been without I/O for at least 24 hours. Nothing has changed; they remain at 100% cache occupancy. That should have given the GC enough time, no?

> Let's check whether it is just because of insufficient memory to hold
> the hot B+tree nodes in memory.

Does the bcache configuration have any option to reserve RAM? Or would the 100 GB of RAM be insufficient for the 594 GB NVMe cache? For that amount of cache, how much RAM should I reserve for bcache? Is there a command or parameter I should use to tell bcache to reserve that RAM? I haven't done anything about this so far. How would I do it?

Another question: how do I know whether I should trigger a TRIM (discard) on my NVMe with bcache?

Best regards,

On Tuesday, April 4, 2023 at 05:20:06 BRT, Coly Li wrote:

> On Apr 4, 2023, at 03:27, Eric Wheeler wrote:
>
> On Mon, 3 Apr 2023, Coly Li wrote:
>>> On Apr 2, 2023, at 08:01, Eric Wheeler wrote:
>>>
>>> On Fri, 31 Mar 2023, Adriano Silva wrote:
>>>> Thank you very much!
>>>>
>>>>> I don't know for sure, but I'd think that since 91% of the cache is
>>>>> evictable, writing would just evict some data from the cache (without
>>>>> writing to the HDD, since it's not dirty data) and write to that area of
>>>>> the cache, *not* to the HDD. It wouldn't make sense in many cases to
>>>>> actually remove data from the cache, because then any reads of that data
>>>>> would have to read from the HDD; leaving it in the cache has very little
>>>>> cost and would speed up any reads of that data.
>>>>
>>>> Maybe you're right; it seems to be writing to the cache, despite
>>>> indicating that the cache is 100% full.
>>>>
>>>> I noticed that it still has excellent read performance, but write
>>>> performance dropped a lot when the cache was full. It is still higher
>>>> than the HDD, but much lower than when the cache is half full or empty.
>>>>
>>>> Sequential write tests with fio now show me 240MB/s, where it was
>>>> 900MB/s when the cache was still half full. Write latency has also
>>>> increased. IOPS on random 4K writes are now in the 5K range; it was
>>>> 16K with the cache half used. With random 4K in ioping, latency went
>>>> up: with the cache half full it was 500us, it is now 945us.
>>>>
>>>> For reading, nothing has changed.
>>>>
>>>> However, for systems where write time is critical, it makes a
>>>> significant difference. If possible I would like to always keep a
>>>> reasonable amount of empty space, to improve write responses and
>>>> reduce 4K latency, mostly. Even if it meant programming a script in
>>>> crontab or something like that, so that during the night the system
>>>> runs a command to clear a percentage of the cache (about 30%, for
>>>> example) that has been unused for the longest time. This would
>>>> possibly make the cache more efficient on writes as well.
>>>
>>> That is an interesting idea since it saves latency. Keeping a few unused
>>> buckets ready to go would prevent GC during a cached write.
>>>
>>
>> Currently there are around 10% reserved already; if dirty data exceeds
>> the threshold, further writes will go to the backing device directly.
>
> It doesn't sound like he is referring to dirty data. If I understand
> correctly, he means that when the cache is 100% allocated (whether or not
> anything is dirty) the latency is 2x what it could be compared to when
> there are unallocated buckets ready for writing (i.e., after formatting
> the cache, but before it fully allocates). His sequential throughput was
> 4x slower with a 100% allocated cache: 900MB/s at 50% full after a
> format, but only 240MB/s when the cache buckets are 100% allocated.
>

It sounds like a large cache with limited memory for caching B+tree nodes? If memory is limited and not all B+tree nodes on the hot I/O paths can stay in memory, such behavior is possible. In this case, shrinking the cached data may reduce the metadata, and consequently the in-memory B+tree nodes, as well. Yes, it may be helpful for such a scenario.

But what is the I/O pattern here? If all the cache space is occupied by clean data from read requests, and write performance is what is cared about, is this a write-heavy, read-heavy, or mixed workload?

It is suspicious that the metadata I/O needed to read in missing B+tree nodes contributes to the I/O performance degradation. But this is only a guess from me.

>> Reserving more space doesn't change too much if busy write requests
>> keep arriving. For occupied clean cache space, I tested years ago: the
>> space can be shrunk very fast and it won't be a performance bottleneck.
>> If the situation has changed now, please inform me.
>
> His performance specs above indicate that a 100% occupied but clean
> cache increases latency (due to release/re-allocate overhead). The
> increased latency reduces effective throughput.
>

Maybe increasing memory a bit may help a lot. But if this is not a server machine, e.g. a laptop, increasing DIMM size might be unfeasible on current models.

Let's check first whether the memory is insufficient in this case.

>>> Coly, would this be an easy feature to add?
>>
>> To make it, the change wouldn't be complex. But I don't feel it would
>> solve the original write performance issue when space is almost full.
>> In the code we already have similar lists to hold available buckets for
>> future data/metadata allocation.
>
> Questions:
>
> - Right after a cache is formatted and attached to a bcache volume, which
>   list contains the buckets that have never been used?

All buckets in the cache are allocated in a sequential-like order; there is no dedicated list to track never-used buckets. There are arrays to track the buckets' dirty state and gen numbers, but not for this purpose. Maintaining such a list is expensive when the cache is large.

>
> - Where does bcache insert back onto that list? Does it?

Even for synchronous bucket invalidation by fifo/lru/random, the selection is decided at run time by a heap sort, which is still quite slow. But this is on the very slow code path, so it doesn't hurt too much.

>
>> But if the lists are empty, there is still time required to do the dirty
>> writeback and garbage collection if necessary.
>
> True, that code would always remain, no change there.
>
>>> Bcache would need a `cache_min_free` tunable that would (asynchronously)
>>> free the least recently used buckets that are not dirty.
>>
>> For clean cache space, it has been already.
>
> I'm not sure what you mean by "it has been already" - do you mean this is
> already implemented? If so, what/where is the sysfs tunable (or
> hard-coded min-free-buckets value)?

For read requests, the gc thread is still woken up from time to time. See the code path:

cached_dev_read_done_bh() ==> cached_dev_read_done() ==> bch_data_insert() ==> bch_data_insert_start() ==> wake_up_gc()

When a read misses in the cache, writing the clean data from the backing device into the cache device still occupies cache space, and by default, for every 1/16 of the space allocated/occupied, the gc thread is woken up asynchronously.

>
>> Shrinking clean cache space is very fast. I did a test 2 years ago: it
>> took no more than 10 seconds to reclaim around 1TB+ of clean cache
>> space. I guess the time might be even less, because reading the
>> information from the priorities file also takes time.
>
> Reclaiming large chunks of cache is probably fast in one shot, but
> reclaiming one "clean but allocated" bucket (or even a few buckets) with
> each WRITE has latency overhead associated with it. Early reclaim to
> some reasonable (configurable) minimum free-space value could hide that
> latency in many workloads.
>

As I explained, the re-reclaim is already there. But it cannot help too much if busy I/O requests keep coming and the writeback and gc threads have no spare time to run.

If the incoming I/Os exceed the service capacity of the cache service window, disappointed requesters can be expected.

> Long-running bcache volumes are usually 100% allocated, and if freeing
> batches of clean buckets is fast, then doing it early would save metadata
> handling during clean bucket re-allocation for new writes (and maybe
> read-promotion, too).

Let's check whether it is just because of insufficient memory to hold the hot B+tree nodes in memory.

Thanks.

Coly Li
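As a footnote to the gc discussion above, here is a minimal sketch of waking the gc thread manually from user space and checking the effect. It assumes the internal/trigger_gc and cache_available_percent sysfs attributes exist on the running kernel; neither path is confirmed in this thread, so treat both as assumptions to verify, and note that writing the trigger requires root:

#!/usr/bin/env python3
"""Sketch: force a bcache gc run and observe cache_available_percent.

Assumes /sys/fs/bcache/<cset-uuid>/internal/trigger_gc and
/sys/fs/bcache/<cset-uuid>/cache_available_percent exist on this kernel;
both paths are assumptions, not confirmed in the thread.
"""
import glob
import time


def read(path):
    with open(path) as f:
        return f.read().strip()


for cset in glob.glob("/sys/fs/bcache/*-*-*-*-*"):
    avail = cset + "/cache_available_percent"
    print(cset, "available before gc:", read(avail), "%")

    # Writing any value is expected to wake the garbage-collection thread.
    with open(cset + "/internal/trigger_gc", "w") as f:
        f.write("1\n")

    time.sleep(60)  # gc on a large cache set may take a while
    print(cset, "available after gc: ", read(avail), "%")

Watching whether cache_available_percent (or the Unused figure in priority_stats) moves after such a forced run is one way to tell whether gc alone releases any space, or whether the behavior discussed above is purely about allocated-but-clean buckets.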