Date: Mon, 6 Jan 2025 10:53:58 +0900
From: Damien Le Moal
Organization: Western Digital Research
Subject: Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
To: Vishnu ks, Song Liu, hch@infradead.org, yanjun.zhu@linux.dev
Cc: lsf-pc@lists.linux-foundation.org, linux-block@vger.kernel.org, bpf@vger.kernel.org, linux-nvme@lists.infradead.org

On 1/5/25 2:52 AM, Vishnu ks wrote:
> Thank you all for your valuable feedback. I'd like to provide more
> technical context about our implementation and the specific challenges
> we're facing.
>
> System Architecture:
> We've built a block-level continuous data protection system that:
> 1. Uses eBPF to monitor the block_rq_complete tracepoint to track modified sectors
> 2. Captures sector numbers (not data) of changed blocks in real time
> 3.
>    Periodically syncs the actual data from these sectors based on a
>    configurable RPO
> 4. Layers these incremental changes on top of base snapshots
>
> Current Implementation:
> - eBPF program attached to block_rq_complete tracks sector ranges from
>   bio requests
> - Changed sector numbers are transmitted to a central dispatcher via websocket
> - Dispatcher initiates periodic data sync (1-2 min intervals)
>   requesting data from tracked sectors
> - Base snapshot + incremental changes provide point-in-time recovery capability
>
> @Christoph: Regarding stability concerns - we're not using tracepoints
> for data integrity, but rather for change detection. The actual data
> synchronization happens through standard block device reads.
>
> Technical Challenge:
> The core issue we've identified is the gap between write completion
> notification and data availability:
> - block_rq_complete tracepoint triggers before data is actually
>   persisted to disk

Then do a flush, or disable the write cache on the device (which can
totally kill write performance depending on the device). Nothing new
here. File systems have journaling for this reason (among others).

> - Reading sectors immediately after block_rq_complete often returns stale data

That is what POSIX mandates and also what most storage protocols specify
(SCSI, ATA, NVMe): reading sectors that were just written gives you back
what you just wrote, regardless of the actual location of the data on
the device (persisted to non-volatile media or not).

> - Observed delay between completion and actual disk persistence ranges
>   from 3-7 minutes

That depends on how often/when/how the drive flushes its write cache,
which you cannot know from the host. If you want to reduce this,
explicitly flush the device write cache more often (execute
blkdev_issue_flush() or similar).

> - Data becomes immediately available only after unmount/sync/reboot

?? You can read data that was written even without a sync/flush.
> Proposed Enhancement:
> We're looking for ways to:
> 1. Detect when data is actually flushed to disk

If you have the write cache enabled on the device, there is no device
interface that notifies this. This simply does not exist. If you want to
guarantee data persistence to non-volatile media on the device, issue a
synchronize cache command (which blkdev_issue_flush() does), or sync
your file system if you are using one. Or, as mentioned already, disable
the device write cache.

> 2. Track the relationship between bio requests and cache flushes

That is up to you to do. File systems do so for sync()/fsync(). Note
that data persistence guarantees are always for write requests that
have already completed.

> 3. Potentially add tracepoints around such operations

As Christoph said, tracepoints are not a stable ABI. So relying on
tracepoints for tracking data persistence is really not a good idea.

-- 
Damien Le Moal
Western Digital Research