* [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
@ 2025-01-01 6:34 Vishnu ks
2025-01-03 9:26 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Vishnu ks @ 2025-01-01 6:34 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-block, bpf, linux-nvme
Dear Community,
I would like to propose a discussion topic regarding the enhancement
of block layer tracepoints, which could fundamentally transform how
backup and recovery systems operate on Linux.
Current Scenario:
- I'm developing a continuous data protection system using eBPF to
monitor block request completions
- The system aims to achieve reliable live data replication for block devices
- Current tracepoints present challenges in capturing the complete
lifecycle of write operations
Potential Impact:
- Transform Linux Backup Systems:
- Enable true continuous data protection at block level
- Eliminate backup windows by capturing changes in real-time
- Reduce recovery point objectives (RPO) to near-zero
- Allow point-in-time recovery at block granularity
Current Technical Limitations:
- Inconsistent visibility into write operation completion
- Gaps between write operations and actual data flushes
- Potential missing instrumentation points
- Challenges in ensuring data consistency across replicated volumes
Proposed Improvements:
- Additional tracepoints for better write operation visibility
- Optimal placement of existing tracepoints
- New instrumentation points for reliable block-level monitoring
Implementation Considerations:
- Performance impact of additional tracepoints
- Integration with existing block layer infrastructure
- Compatibility with various storage backends
- Requirements for consistent backup state
These improvements could revolutionize how we approach backup and
recovery on Linux systems:
- Move from periodic snapshots to continuous data protection
- Enable more granular recovery options
- Reduce system overhead during backup operations
- Improve reliability of backup systems
- Enhance disaster recovery capabilities
This discussion would benefit both the block layer and BPF
communities, as well as the broader Linux ecosystem, particularly
enterprises requiring robust backup and recovery solutions.
Looking forward to the community's thoughts and feedback.
Best regards,
--
Vishnu KS,
Opensource contributor and researcher,
https://xmigrate.cloud
https://iamvishnuks.com
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-01 6:34 [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems Vishnu ks
@ 2025-01-03 9:26 ` Christoph Hellwig
2025-01-04 0:47 ` Zhu Yanjun
2025-01-04 1:11 ` Song Liu
2 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-01-03 9:26 UTC (permalink / raw)
To: Vishnu ks; +Cc: lsf-pc, linux-block, bpf, linux-nvme
On Wed, Jan 01, 2025 at 12:04:56PM +0530, Vishnu ks wrote:
> - I'm developing a continuous data protection system using eBPF to
> monitor block request completions
> - The system aims to achieve reliable live data replication for block devices
> - Current tracepoints present challenges in capturing the complete
> lifecycle of write operations
This is nuts. No, we don't guarantee any stability in the trace points,
and certainly not for data integrity operations.
Please make sure this never gets near any production system.
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-01 6:34 [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems Vishnu ks
2025-01-03 9:26 ` Christoph Hellwig
@ 2025-01-04 0:47 ` Zhu Yanjun
2025-01-04 1:11 ` Song Liu
2 siblings, 0 replies; 16+ messages in thread
From: Zhu Yanjun @ 2025-01-04 0:47 UTC (permalink / raw)
To: Vishnu ks, lsf-pc; +Cc: linux-block, bpf, linux-nvme
On 2025/1/1 7:34, Vishnu ks wrote:
> Dear Community,
>
> I would like to propose a discussion topic regarding the enhancement
> of block layer tracepoints, which could fundamentally transform how
> backup and recovery systems operate on Linux.
>
> Current Scenario:
>
> - I'm developing a continuous data protection system using eBPF to
> monitor block request completions
I am interested in this "eBPF to monitor block request" idea. Will this
eBPF make a difference to the performance of the whole system? And how
would eBPF be used to implement this feature? I hope to join the meeting
to listen to this topic.
Best Regards,
Zhu Yanjun
> - The system aims to achieve reliable live data replication for block devices
> - Current tracepoints present challenges in capturing the complete
> lifecycle of write operations
>
> Potential Impact:
>
> - Transform Linux Backup Systems:
> - Enable true continuous data protection at block level
> - Eliminate backup windows by capturing changes in real-time
> - Reduce recovery point objectives (RPO) to near-zero
> - Allow point-in-time recovery at block granularity
>
> Current Technical Limitations:
>
> - Inconsistent visibility into write operation completion
> - Gaps between write operations and actual data flushes
> - Potential missing instrumentation points
> - Challenges in ensuring data consistency across replicated volumes
>
> Proposed Improvements:
>
> - Additional tracepoints for better write operation visibility
> - Optimal placement of existing tracepoints
> - New instrumentation points for reliable block-level monitoring
>
> Implementation Considerations:
>
> - Performance impact of additional tracepoints
> - Integration with existing block layer infrastructure
> - Compatibility with various storage backends
> - Requirements for consistent backup state
>
> These improvements could revolutionize how we approach backup and
> recovery on Linux systems:
>
> - Move from periodic snapshots to continuous data protection
> - Enable more granular recovery options
> - Reduce system overhead during backup operations
> - Improve reliability of backup systems
> - Enhance disaster recovery capabilities
>
> This discussion would benefit both the block layer and BPF
> communities, as well as the broader Linux ecosystem, particularly
> enterprises requiring robust backup and recovery solutions.
>
> Looking forward to the community's thoughts and feedback.
>
> Best regards,
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-01 6:34 [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems Vishnu ks
2025-01-03 9:26 ` Christoph Hellwig
2025-01-04 0:47 ` Zhu Yanjun
@ 2025-01-04 1:11 ` Song Liu
2025-01-04 17:52 ` Vishnu ks
2 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2025-01-04 1:11 UTC (permalink / raw)
To: Vishnu ks; +Cc: lsf-pc, linux-block, bpf, linux-nvme
Hi Vishnu,
On Tue, Dec 31, 2024 at 10:35 PM Vishnu ks <ksvishnu56@gmail.com> wrote:
>
> Dear Community,
>
> I would like to propose a discussion topic regarding the enhancement
> of block layer tracepoints, which could fundamentally transform how
> backup and recovery systems operate on Linux.
>
> Current Scenario:
>
> - I'm developing a continuous data protection system using eBPF to
> monitor block request completions
This makes little sense. It is not clear how this works.
> - The system aims to achieve reliable live data replication for block devices
> - Current tracepoints present challenges in capturing the complete
> lifecycle of write operations
What's the difference between this approach and existing data
replication solutions, such as md/raid?
>
> Potential Impact:
>
> - Transform Linux Backup Systems:
> - Enable true continuous data protection at block level
> - Eliminate backup windows by capturing changes in real-time
> - Reduce recovery point objectives (RPO) to near-zero
> - Allow point-in-time recovery at block granularity
>
> Current Technical Limitations:
>
> - Inconsistent visibility into write operation completion
> - Gaps between write operations and actual data flushes
> - Potential missing instrumentation points
If a tracepoint is missing or misplaced, we can fix it in a patch.
> - Challenges in ensuring data consistency across replicated volumes
>
> Proposed Improvements:
>
> - Additional tracepoints for better write operation visibility
> - Optimal placement of existing tracepoints
> - New instrumentation points for reliable block-level monitoring
Some details in these would help this topic proposal.
Thanks,
Song
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-04 1:11 ` Song Liu
@ 2025-01-04 17:52 ` Vishnu ks
2025-01-06 1:53 ` Damien Le Moal
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Vishnu ks @ 2025-01-04 17:52 UTC (permalink / raw)
To: Song Liu, hch, yanjun.zhu; +Cc: lsf-pc, linux-block, bpf, linux-nvme
Thank you all for your valuable feedback. I'd like to provide more
technical context about our implementation and the specific challenges
we're facing.
System Architecture:
We've built a block-level continuous data protection system that:
1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
2. Captures sector numbers (not data) of changed blocks in real-time
3. Periodically syncs the actual data from these sectors based on
configurable RPO
4. Layers these incremental changes on top of base snapshots
Current Implementation:
- eBPF program attached to block_rq_complete tracks sector ranges from
bio requests
- Changed sector numbers are transmitted to a central dispatcher via websocket
- Dispatcher initiates periodic data sync (1-2 min intervals)
requesting data from tracked sectors
- Base snapshot + incremental changes provide point-in-time recovery capability
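For concreteness, here is a minimal sketch of the eBPF side (libbpf-style
C; simplified, not our exact program). The context struct is a hand-written
mirror of /sys/kernel/debug/tracing/events/block/block_rq_complete/format
on one kernel we tested; tracepoint layouts are not a stable ABI, so a real
deployment should use vmlinux.h and CO-RE rather than fixed offsets, and
the map/struct names here are ours, not kernel API:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hand-written mirror of the block_rq_complete event format on the
 * kernel we tested; NOT a stable ABI, offsets can change. */
struct block_rq_complete_args {
        __u64 __unused;         /* common trace entry header */
        __u32 dev;              /* dev_t of the disk */
        __u32 __pad;
        __u64 sector;           /* start sector of the request */
        __u32 nr_sector;        /* length in 512-byte sectors */
        __s32 error;            /* 0 on success */
        char rwbs[8];           /* op string, e.g. "W", "WS", "F" */
};

struct change_event {
        __u32 dev;
        __u32 nr_sector;
        __u64 sector;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 20);
} changed_sectors SEC(".maps");

SEC("tracepoint/block/block_rq_complete")
int trace_rq_complete(struct block_rq_complete_args *ctx)
{
        struct change_event *e;
        int i, is_write = 0;

        if (ctx->error)
                return 0;       /* ignore failed requests */

        /* keep writes only: blk_fill_rwbs() puts 'W' in rwbs for them */
        for (i = 0; i < 8; i++) {
                if (ctx->rwbs[i] == 'W') {
                        is_write = 1;
                        break;
                }
        }
        if (!is_write)
                return 0;

        e = bpf_ringbuf_reserve(&changed_sectors, sizeof(*e), 0);
        if (!e)
                return 0;       /* buffer full: this change would be lost */
        e->dev = ctx->dev;
        e->sector = ctx->sector;
        e->nr_sector = ctx->nr_sector;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

User space drains the ring buffer and forwards the (dev, sector,
nr_sector) tuples to the dispatcher over the websocket.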
@Christoph: Regarding stability concerns - we're not using tracepoints
for data integrity, but rather for change detection. The actual data
synchronization happens through standard block device reads.
Technical Challenge:
The core issue we've identified is the gap between write completion
notification and data availability:
- block_rq_complete tracepoint triggers before data is actually
persisted to disk
- Reading sectors immediately after block_rq_complete often returns stale data
- Observed delay between completion and actual disk persistence ranges
from 3-7 minutes
- Data becomes immediately available only after unmount/sync/reboot
@Song: Our approach fundamentally differs from md/raid in several ways:
1. Network-based vs Local:
- Our system operates over network, allowing replication across
geographically distributed systems
- md/raid works only with locally attached storage devices
2. Replication Model:
- We use asynchronous replication with configurable RPO windows
- md/raid requires synchronous, immediate mirroring of data
3. Recovery Capabilities:
- We provide point-in-time recovery through incremental sector tracking
- md/raid focuses on immediate redundancy without historical state
@Zhu: The eBPF performance impact is minimal as we're only tracking
sector numbers, not actual data. The main overhead comes from the
periodic data sync operations.
Proposed Enhancement:
We're looking for ways to:
1. Detect when data is actually flushed to disk
2. Track the relationship between bio requests and cache flushes
3. Potentially add tracepoints around such operations
Questions for the community:
1. Are there existing tracepoints that could help track actual disk persistence?
2. Would adding tracepoints in the page cache writeback path be feasible?
3. Are there alternative approaches to detecting when data is actually
persisted?
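One partial signal that may already exist, if we read blk_fill_rwbs()
correctly: flush operations appear at block_rq_complete with 'F' as the
first character of rwbs (REQ_OP_FLUSH itself, and REQ_PREFLUSH as a
prefix on writes). A hedged fragment, reusing the hand-declared context
struct from the sketch above:

SEC("tracepoint/block/block_rq_complete")
int trace_flush_complete(struct block_rq_complete_args *ctx)
{
        /* a completed flush means writes that finished before it was
         * issued should now be on stable media; flushes the drive
         * performs on its own are still invisible from here */
        if (ctx->rwbs[0] == 'F' && !ctx->error)
                bpf_printk("flush completed on dev %u", ctx->dev);
        return 0;
}

This narrows the gap rather than closing it, hence the questions above.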
Would love to hear the community's thoughts on this specific challenge
and potential approaches to addressing it.
Best regards,
Vishnu KS
On Sat, 4 Jan 2025 at 06:41, Song Liu <song@kernel.org> wrote:
>
> Hi Vishnu,
>
> On Tue, Dec 31, 2024 at 10:35 PM Vishnu ks <ksvishnu56@gmail.com> wrote:
> >
> > Dear Community,
> >
> > I would like to propose a discussion topic regarding the enhancement
> > of block layer tracepoints, which could fundamentally transform how
> > backup and recovery systems operate on Linux.
> >
> > Current Scenario:
> >
> > - I'm developing a continuous data protection system using eBPF to
> > monitor block request completions
>
> This makes little sense. It is not clear how this works.
>
> > - The system aims to achieve reliable live data replication for block devices
> > - Current tracepoints present challenges in capturing the complete
> > lifecycle of write operations
>
> What's the difference between this approach and existing data
> replication solutions, such as md/raid?
>
> >
> > Potential Impact:
> >
> > - Transform Linux Backup Systems:
> > - Enable true continuous data protection at block level
> > - Eliminate backup windows by capturing changes in real-time
> > - Reduce recovery point objectives (RPO) to near-zero
> > - Allow point-in-time recovery at block granularity
> >
> > Current Technical Limitations:
> >
> > - Inconsistent visibility into write operation completion
> > - Gaps between write operations and actual data flushes
> > - Potential missing instrumentation points
>
> If a tracepoint is missing or misplaced, we can fix it in a patch.
>
> > - Challenges in ensuring data consistency across replicated volumes
> >
> > Proposed Improvements:
> >
> > - Additional tracepoints for better write operation visibility
> > - Optimal placement of existing tracepoints
> > - New instrumentation points for reliable block-level monitoring
>
> Some details in these would help this topic proposal.
>
> Thanks,
> Song
--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-04 17:52 ` Vishnu ks
@ 2025-01-06 1:53 ` Damien Le Moal
2025-01-06 18:28 ` Vishnu ks
2025-01-06 7:37 ` Christoph Hellwig
2025-01-06 21:19 ` Song Liu
2 siblings, 1 reply; 16+ messages in thread
From: Damien Le Moal @ 2025-01-06 1:53 UTC (permalink / raw)
To: Vishnu ks, Song Liu, hch, yanjun.zhu; +Cc: lsf-pc, linux-block, bpf, linux-nvme
On 1/5/25 2:52 AM, Vishnu ks wrote:
> Thank you all for your valuable feedback. I'd like to provide more
> technical context about our implementation and the specific challenges
> we're facing.
>
> System Architecture:
> We've built a block-level continuous data protection system that:
> 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
> 2. Captures sector numbers (not data) of changed blocks in real-time
> 3. Periodically syncs the actual data from these sectors based on
> configurable RPO
> 4. Layers these incremental changes on top of base snapshots
>
> Current Implementation:
> - eBPF program attached to block_rq_complete tracks sector ranges from
> bio requests
> - Changed sector numbers are transmitted to a central dispatcher via websocket
> - Dispatcher initiates periodic data sync (1-2 min intervals)
> requesting data from tracked sectors
> - Base snapshot + incremental changes provide point-in-time recovery capability
>
> @Christoph: Regarding stability concerns - we're not using tracepoints
> for data integrity, but rather for change detection. The actual data
> synchronization happens through standard block device reads.
>
> Technical Challenge:
> The core issue we've identified is the gap between write completion
> notification and data availability:
> - block_rq_complete tracepoint triggers before data is actually
> persisted to disk
Then do a flush, or disable the write cache on the device (which can totally
kill write performance depending on the device). Nothing new here. File systems
have journaling for this reason (among others).
> - Reading sectors immediately after block_rq_complete often returns stale data
That is what POSIX mandates and also what most storage protocols specify (SCSI,
ATA, NVMe): reading sectors that were just written gives you back what you just
wrote, regardless of the actual location of the data on the device (persisted
to non volatile media or not).
> - Observed delay between completion and actual disk persistence ranges
> from 3-7 minutes
That depends on how often/when/how the drive flushes its write cache, which you
cannot know from the host. If you want to reduce this, explicitly flush the
device write cache more often (execute blkdev_issue_flush() or similar).
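If you need to do this from user space, a minimal sketch: fsync() on the
block device node ends up in blkdev_issue_flush() via blkdev_fsync().

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* "/dev/sdX" is a placeholder; pass the real device node */
        const char *dev = argc > 1 ? argv[1] : "/dev/sdX";
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* issues SYNCHRONIZE CACHE (SCSI/ATA) or Flush (NVMe) */
        if (fsync(fd) < 0) {
                perror("fsync");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}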
> - Data becomes immediately available only after unmount/sync/reboot
??
You can read data that was written even without a sync/flush.
> Proposed Enhancement:
> We're looking for ways to:
> 1. Detect when data is actually flushed to disk
If you have the write cache enabled on the device, there is no device interface
that notifies this. This simply does not exist. If you want to guarantee data
persistence to non-volatile media on the device, issue a synchronize cache
command (which blkdev_issue_flush() does), or sync your file system if you are
using one. Or as mentioned already, disable the device write cache.
> 2. Track the relationship between bio requests and cache flushes
That is up to you to track. File systems do so for sync()/fsync(). Note that
data persistence guarantees are always for write requests that have already
completed.
> 3. Potentially add tracepoints around such operations
As Christoph said, tracepoints are not a stable ABI. So relying on tracepoints
for tracking data persistence is really not a good idea.
--
Damien Le Moal
Western Digital Research
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-04 17:52 ` Vishnu ks
2025-01-06 1:53 ` Damien Le Moal
@ 2025-01-06 7:37 ` Christoph Hellwig
2025-01-06 14:39 ` Zhu Yanjun
2025-01-06 18:31 ` Vishnu ks
2025-01-06 21:19 ` Song Liu
2 siblings, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-01-06 7:37 UTC (permalink / raw)
To: Vishnu ks; +Cc: Song Liu, hch, yanjun.zhu, lsf-pc, linux-block, bpf, linux-nvme
On Sat, Jan 04, 2025 at 11:22:40PM +0530, Vishnu ks wrote:
> 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
You can't. Drivers can and often do change the sector during submission
processing.
> 2. Captures sector numbers (not data) of changed blocks in real-time
> 3. Periodically syncs the actual data from these sectors based on
> configurable RPO
> 4. Layers these incremental changes on top of base snapshots
And all of that is broken. If you are interested in this kind of
mechanism, help upstream the blk-filter work, which has been
explicitly designed to support that.
Before that you should really understand how block devices and
file systems work, as the rest of the mail suggested a very dangerous
misunderstanding of the basic principles.
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-06 7:37 ` Christoph Hellwig
@ 2025-01-06 14:39 ` Zhu Yanjun
2025-01-06 18:36 ` Vishnu ks
2025-01-06 18:31 ` Vishnu ks
1 sibling, 1 reply; 16+ messages in thread
From: Zhu Yanjun @ 2025-01-06 14:39 UTC (permalink / raw)
To: Christoph Hellwig, Vishnu ks
Cc: Song Liu, lsf-pc, linux-block, bpf, linux-nvme
On 06.01.25 08:37, Christoph Hellwig wrote:
> On Sat, Jan 04, 2025 at 11:22:40PM +0530, Vishnu ks wrote:
>> 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
>
> You can't. Drivers can and often do change the sector during submission
> processing.
If I understand you correctly, you mean that the action of **drivers
often changing the sector during submission processing** will generate
a lot of tracepoint events, and thus will affect the performance of the
whole system.
If yes, can we monitor only the fentry/fexit of some_important_key_function
to reduce the number of eBPF events? That way we would not generate too
many events and hurt the performance.
Zhu Yanjun
>
>> 2. Captures sector numbers (not data) of changed blocks in real-time
>> 3. Periodically syncs the actual data from these sectors based on
>> configurable RPO
>> 4. Layers these incremental changes on top of base snapshots
>
> And all of that is broken. If you are interested in this kind of
> mechanism help upstreaming the blk-filter work, which has been
> explicitly designed to support that.
>
> Before that you should really undestand how block devices and
> file systems work, as the rest of the mail suggested a very dangerous
> misunderstanding of the basic principles.
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-06 1:53 ` Damien Le Moal
@ 2025-01-06 18:28 ` Vishnu ks
0 siblings, 0 replies; 16+ messages in thread
From: Vishnu ks @ 2025-01-06 18:28 UTC (permalink / raw)
To: Damien Le Moal
Cc: Song Liu, hch, yanjun.zhu, lsf-pc, linux-block, bpf, linux-nvme
Thank you for the detailed explanation about write cache behavior and
data persistence.
I understand now that:
1. Without explicit flush commands, there's no reliable way to know
when data is actually persisted
2. The behavior we observed (3-7 minutes delay) is due to the device's
write cache policy
3. For guaranteed persistence, we need one of the following:
- Use explicit flush commands (though this impacts performance)
- Disable write cache (with significant performance impact)
- Rely on filesystem-level journaling
We'll explore using filesystem sync operations for critical
consistency points while maintaining the write cache for general
operations.
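Roughly what we have in mind for marking a consistency point (a sketch
only, assuming the protected volume carries a filesystem mounted at a
known path):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sync one filesystem before reading its tracked sectors. syncfs()
 * writes back dirty data and metadata for the filesystem containing
 * fd; journaling filesystems typically finish with a device cache
 * flush. */
int mark_consistency_point(const char *mountpoint)
{
        int fd = open(mountpoint, O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        if (syncfs(fd) < 0) {
                perror("syncfs");
                close(fd);
                return -1;
        }
        close(fd);
        return 0;       /* sectors read after this reflect a synced state */
}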
On Mon, 6 Jan 2025 at 07:24, Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 1/5/25 2:52 AM, Vishnu ks wrote:
> > Thank you all for your valuable feedback. I'd like to provide more
> > technical context about our implementation and the specific challenges
> > we're facing.
> >
> > System Architecture:
> > We've built a block-level continuous data protection system that:
> > 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
> > 2. Captures sector numbers (not data) of changed blocks in real-time
> > 3. Periodically syncs the actual data from these sectors based on
> > configurable RPO
> > 4. Layers these incremental changes on top of base snapshots
> >
> > Current Implementation:
> > - eBPF program attached to block_rq_complete tracks sector ranges from
> > bio requests
> > - Changed sector numbers are transmitted to a central dispatcher via websocket
> > - Dispatcher initiates periodic data sync (1-2 min intervals)
> > requesting data from tracked sectors
> > - Base snapshot + incremental changes provide point-in-time recovery capability
> >
> > @Christoph: Regarding stability concerns - we're not using tracepoints
> > for data integrity, but rather for change detection. The actual data
> > synchronization happens through standard block device reads.
> >
> > Technical Challenge:
> > The core issue we've identified is the gap between write completion
> > notification and data availability:
> > - block_rq_complete tracepoint triggers before data is actually
> > persisted to disk
>
> Then do a flush, or disable the write cache on the device (which can totally
> kill write performance depending on the device). Nothing new here. File systems
> have journaling for this reason (among others).
>
> > - Reading sectors immediately after block_rq_complete often returns stale data
>
> That is what POSIX mandates and also what most storage protocols specify (SCSI,
> ATA, NVMe): reading sectors that were just written gives you back what you just
> wrote, regardless of the actual location of the data on the device (persisted
> to non volatile media or not).
>
> > - Observed delay between completion and actual disk persistence ranges
> > from 3-7 minutes
>
> That depends on how often/when/how the drive flushes its write cache, which you
> cannot know from the host. If you want to reduce this, explicitly flush the
> device write cache more often (execute blkdev_issue_flush() or similar).
>
> > - Data becomes immediately available only after unmount/sync/reboot
>
> ??
>
> You can read data that was written even without a sync/flush.
>
> > Proposed Enhancement:
> > We're looking for ways to:
> > 1. Detect when data is actually flushed to disk
>
> If you have the write cache enabled on the device, there is no device interface
> that notifies this. This simply does not exist. If you want to guarantee data
> persistence to non-volatile media on the device, issue a synchronize cache
> command (which blkdev_issue_flush() does), or sync your file system if you are
> using one. Or as mentioned already, disable the device write cache.
>
> > 2. Track the relationship between bio requests and cache flushes
>
> That is up to you to track. File systems do so for sync()/fsync(). Note that
> data persistence guarantees are always for write requests that have already
> completed.
>
> > 3. Potentially add tracepoints around such operations
>
> As Christoph said, tracepoints are not a stable ABI. So relying on tracepoints
> for tracking data persistence is really not a good idea.
>
>
> --
> Damien Le Moal
> Western Digital Research
--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-06 7:37 ` Christoph Hellwig
2025-01-06 14:39 ` Zhu Yanjun
@ 2025-01-06 18:31 ` Vishnu ks
1 sibling, 0 replies; 16+ messages in thread
From: Vishnu ks @ 2025-01-06 18:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Song Liu, yanjun.zhu, lsf-pc, linux-block, bpf, linux-nvme
Thank you for pointing out these critical issues:
1. The sector tracking approach is fundamentally flawed because
drivers can modify sectors during submission
2. I'll look into the blk-filter work as it seems to be designed
specifically for this use case
Could you point me to resources about the blk-filter work? I'd like to
understand it better and potentially contribute to its upstream
efforts.
You're right that I need a better understanding of block devices and
filesystem fundamentals. Could you recommend any specific
documentation or reading materials on these topics?
On Mon, 6 Jan 2025 at 13:07, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Jan 04, 2025 at 11:22:40PM +0530, Vishnu ks wrote:
> > 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
>
> You can't. Drivers can and often do change the sector during submission
> processing.
>
> > 2. Captures sector numbers (not data) of changed blocks in real-time
> > 3. Periodically syncs the actual data from these sectors based on
> > configurable RPO
> > 4. Layers these incremental changes on top of base snapshots
>
> And all of that is broken. If you are interested in this kind of
> mechanism, help upstream the blk-filter work, which has been
> explicitly designed to support that.
>
> Before that you should really understand how block devices and
> file systems work, as the rest of the mail suggested a very dangerous
> misunderstanding of the basic principles.
--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-06 14:39 ` Zhu Yanjun
@ 2025-01-06 18:36 ` Vishnu ks
0 siblings, 0 replies; 16+ messages in thread
From: Vishnu ks @ 2025-01-06 18:36 UTC (permalink / raw)
To: Zhu Yanjun
Cc: Christoph Hellwig, Song Liu, lsf-pc, linux-block, bpf, linux-nvme
Thank you for the suggestion about fentry/fexit monitoring. However,
as Christoph pointed out, the fundamental issue isn't the performance
or the number of events - it's that the sector numbers themselves can
be modified by drivers during submission, and I am not sure whether
such remapping is made visible anywhere we could observe it.
On Mon, 6 Jan 2025 at 20:09, Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
> On 06.01.25 08:37, Christoph Hellwig wrote:
> > On Sat, Jan 04, 2025 at 11:22:40PM +0530, Vishnu ks wrote:
> >> 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
> >
> > You can't. Drivers can and often do change the sector during submission
> > processing.
>
> If I understand you correctly, you mean that the action of **drivers
> often changing the sector during submission processing** will generate
> a lot of tracepoint events, and thus will affect the performance of the
> whole system.
>
> If yes, can we monitor only the fentry/fexit of some_important_key_function
> to reduce the number of eBPF events? That way we would not generate too
> many events and hurt the performance.
>
> Zhu Yanjun
>
> >
> >> 2. Captures sector numbers (not data) of changed blocks in real-time
> >> 3. Periodically syncs the actual data from these sectors based on
> >> configurable RPO
> >> 4. Layers these incremental changes on top of base snapshots
> >
> > And all of that is broken. If you are interested in this kind of
> > mechanism, help upstream the blk-filter work, which has been
> > explicitly designed to support that.
> >
> > Before that you should really understand how block devices and
> > file systems work, as the rest of the mail suggested a very dangerous
> > misunderstanding of the basic principles.
>
--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com
* Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-04 17:52 ` Vishnu ks
2025-01-06 1:53 ` Damien Le Moal
2025-01-06 7:37 ` Christoph Hellwig
@ 2025-01-06 21:19 ` Song Liu
2025-01-06 22:18 ` [Lsf-pc] " Dan Williams
2 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2025-01-06 21:19 UTC (permalink / raw)
To: Vishnu ks; +Cc: hch, yanjun.zhu, lsf-pc, linux-block, bpf, linux-nvme
On Sat, Jan 4, 2025 at 9:52 AM Vishnu ks <ksvishnu56@gmail.com> wrote:
>
[...]
>
> @Song: Our approach fundamentally differs from md/raid in several ways:
>
> 1. Network-based vs Local:
> - Our system operates over network, allowing replication across
> geographically distributed systems
> - md/raid works only with locally attached storage devices
md-cluster (https://docs.kernel.org/driver-api/md/md-cluster.html)
does support RAID in a cluster.
>
> 2. Replication Model:
> - We use asynchronous replication with configurable RPO windows
> - md/raid requires synchronous, immediate mirroring of data
immediate mirroring is probably more efficient, as the system doesn't
need to read the data from the device.
> 3. Recovery Capabilities:
> - We provide point-in-time recovery through incremental sector tracking
> - md/raid focuses on immediate redundancy without historical state
IIUC, the idea is to build a block level remote full journal. By "full" journal,
I mean the journal contains all the actual data in addition to metadata.
I think the consistency can be really tricky with write cache etc.
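Purely as an illustration of what "full" would mean here (this is not an
existing format), each journal record would need to carry something like:

#include <linux/types.h>

/* illustrative only; no such on-wire format exists in the kernel */
struct full_journal_record {
        __u64 seq;              /* total order over completed writes */
        __u64 sector;           /* start sector on the origin device */
        __u32 nr_sector;        /* length in 512-byte sectors */
        __u32 flags;            /* e.g. "a cache flush completed before this" */
        __u8  data[];           /* the written payload itself */
};

Replaying up to any seq gives a point in time, but only records at flush
boundaries give points that are actually crash consistent.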
Thanks,
Song
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-06 21:19 ` Song Liu
@ 2025-01-06 22:18 ` Dan Williams
2025-01-13 17:31 ` Vishnu ks
0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2025-01-06 22:18 UTC (permalink / raw)
To: Song Liu via Lsf-pc, Vishnu ks
Cc: hch, yanjun.zhu, lsf-pc, linux-block, bpf, linux-nvme
Song Liu via Lsf-pc wrote:
> On Sat, Jan 4, 2025 at 9:52 AM Vishnu ks <ksvishnu56@gmail.com> wrote:
> >
> [...]
> >
> > @Song: Our approach fundamentally differs from md/raid in several ways:
> >
> > 1. Network-based vs Local:
> > - Our system operates over network, allowing replication across
> > geographically distributed systems
> > - md/raid works only with locally attached storage devices
>
> md-cluster (https://docs.kernel.org/driver-api/md/md-cluster.html)
> does support RAID in a cluster.
Also,
https://docs.kernel.org/admin-guide/blockdev/drbd/index.html
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-06 22:18 ` [Lsf-pc] " Dan Williams
@ 2025-01-13 17:31 ` Vishnu ks
2025-02-07 2:06 ` Ming Lei
0 siblings, 1 reply; 16+ messages in thread
From: Vishnu ks @ 2025-01-13 17:31 UTC (permalink / raw)
To: Dan Williams
Cc: Song Liu via Lsf-pc, hch, yanjun.zhu, linux-block, bpf,
linux-nvme
Thanks everyone for the detailed technical feedback and clarifications
- they've been extremely valuable in understanding the fundamental
challenges and existing solutions.
I appreciate the points about md-cluster and DRBD's network RAID
capabilities. While these are robust solutions for network-based
replication, I'm particularly interested in the point-in-time recovery
capability for scenarios like ransomware recovery, where being able to
roll back to a specific point before encryption occurred would be
valuable.
Regarding blk_filter - I've been exploring it since it was mentioned,
and it indeed seems to be the right approach for what we're trying to
achieve. However, I've found that many of our current requirements can
actually be implemented using eBPF without additional kernel modules.
I plan to create a detailed demonstration video to share my findings
with this thread. Additionally, I'll be cleaning up and open-sourcing
our replicator utility implementation for community feedback.
I would very much like to attend the LSF/MM/BPF summit to discuss
these ideas in person and learn more about blk_filter and proper block
layer fundamentals. Would it be possible for someone to help me with
an invitation?
Thanks again to everyone who took the time to explain the intricacies
of write caching, sector tracking limitations, and data persistence
guarantees. This discussion has been incredibly educational.
Thanks and regards,
Vishnu KS
On Tue, 7 Jan 2025 at 03:48, Dan Williams <dan.j.williams@intel.com> wrote:
>
> Song Liu via Lsf-pc wrote:
> > On Sat, Jan 4, 2025 at 9:52 AM Vishnu ks <ksvishnu56@gmail.com> wrote:
> > >
> > [...]
> > >
> > > @Song: Our approach fundamentally differs from md/raid in several ways:
> > >
> > > 1. Network-based vs Local:
> > > - Our system operates over network, allowing replication across
> > > geographically distributed systems
> > > - md/raid works only with locally attached storage devices
> >
> > md-cluster (https://docs.kernel.org/driver-api/md/md-cluster.html)
> > does support RAID in a cluster.
>
> Also,
>
> https://docs.kernel.org/admin-guide/blockdev/drbd/index.html
--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-01-13 17:31 ` Vishnu ks
@ 2025-02-07 2:06 ` Ming Lei
2025-02-07 11:15 ` Vishnu ks
0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2025-02-07 2:06 UTC (permalink / raw)
To: Vishnu ks
Cc: Dan Williams, Song Liu via Lsf-pc, hch, yanjun.zhu, linux-block,
bpf, linux-nvme
On Mon, Jan 13, 2025 at 11:01:30PM +0530, Vishnu ks wrote:
> Thanks everyone for the detailed technical feedback and clarifications
> - they've been extremely valuable in understanding the fundamental
> challenges and existing solutions.
>
> I appreciate the points about md-cluster and DRBD's network RAID
> capabilities. While these are robust solutions for network-based
> replication, I'm particularly interested in the point-in-time recovery
> capability for scenarios like ransomware recovery, where being able to
> roll back to a specific point before encryption occurred would be
> valuable.
>
> Regarding blk_filter - I've been exploring it since it was mentioned,
> and it indeed seems to be the right approach for what we're trying to
> achieve. However, I've found that many of our current requirements can
> actually be implemented using eBPF without additional kernel modules.
> I plan to create a detailed demonstration video to share my findings
> with this thread. Additionally, I'll be cleaning up and open-sourcing
> our replicator utility implementation for community feedback.
>
> I would very much like to attend the LSF/MM/BPF summit to discuss
> these ideas in person and learn more about blk_filter and proper block
> layer fundamentals. Would it be possible for someone to help me with
> an invitation?
If one pair of bpf struct_ops is added for attaching to submit_bio()
and ->bi_end_io() in bio_endio(), lots of cases can be covered:
- blk filter
- bio interposer
- blk-snap
- easier IO trace
...
Then both bio and request based devices can be covered.
It shouldn't be hard to figure out generic bio/bvec kfuncs that help
block IO bpf progs do more valuable and fun things.
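Purely as a strawman (nothing below exists upstream, every name is made
up), the shape might be:

#include <linux/blk_types.h>    /* struct bio, blk_status_t */

/* hypothetical struct_ops: one hook at submission, one at completion */
struct bio_interposer_ops {
        /* called before the bio is queued, while sector/size are still
         * the caller's values, not the driver-remapped ones */
        int (*submit_prepare)(struct bio *bio);
        /* called from bio_endio() when the bio truly completes */
        void (*bio_complete)(struct bio *bio, blk_status_t status);
};

A blk-snap style copy-on-write engine would live in submit_prepare();
a change tracker like the one discussed in this thread would only need
bio_complete().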
Thanks,
Ming
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems
2025-02-07 2:06 ` Ming Lei
@ 2025-02-07 11:15 ` Vishnu ks
0 siblings, 0 replies; 16+ messages in thread
From: Vishnu ks @ 2025-02-07 11:15 UTC (permalink / raw)
To: Ming Lei
Cc: Dan Williams, Song Liu via Lsf-pc, hch, yanjun.zhu, linux-block,
bpf, linux-nvme
Thanks Ming for the insightful suggestion about struct_ops pairs for
bio handling. Moving these operations to eBPF aligns perfectly with
the goal of keeping non-essential business logic outside the kernel.
As mentioned previously, I've open-sourced our implementation (blxrep)
which demonstrates this approach. While simple in implementation, it has
proven effective for capturing incremental changes on data disks with
sync frequencies above 3 minutes:
BPF implementation:
https://github.com/xmigrate/blxrep/blob/main/bpf/trace-blocks.c
Documentation: https://blxrep.xmigrate.cloud/
Your suggestion about generic bio/bvec kfuncs is particularly
interesting. Would you be open to providing feedback on our current
BPF program structure, particularly regarding how we could better
leverage these proposed bio handling capabilities? I would also like
to hear what others think about this approach.
Thanks,
Vishnu KS
On Fri, 7 Feb 2025 at 07:36, Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Jan 13, 2025 at 11:01:30PM +0530, Vishnu ks wrote:
> > Thanks everyone for the detailed technical feedback and clarifications
> > - they've been extremely valuable in understanding the fundamental
> > challenges and existing solutions.
> >
> > I appreciate the points about md-cluster and DRBD's network RAID
> > capabilities. While these are robust solutions for network-based
> > replication, I'm particularly interested in the point-in-time recovery
> > capability for scenarios like ransomware recovery, where being able to
> > roll back to a specific point before encryption occurred would be
> > valuable.
> >
> > Regarding blk_filter - I've been exploring it since it was mentioned,
> > and it indeed seems to be the right approach for what we're trying to
> > achieve. However, I've found that many of our current requirements can
> > actually be implemented using eBPF without additional kernel modules.
> > I plan to create a detailed demonstration video to share my findings
> > with this thread. Additionally, I'll be cleaning up and open-sourcing
> > our replicator utility implementation for community feedback.
> >
> > I would very much like to attend the LSF/MM/BPF summit to discuss
> > these ideas in person and learn more about blk_filter and proper block
> > layer fundamentals. Would it be possible for someone to help me with
> > an invitation?
>
> If one pair of bpf struct_ops is added for attaching to submit_bio()
> and ->bi_end_io() in bio_endio(), lots of cases can be covered:
>
> - blk filter
>
> - bio interposer
>
> - blk-snap
>
> - easier IO trace
>
> ...
>
> Then both bio and request based devices can be covered.
>
> It shouldn't be hard to figure out generic bio/bvec kfuncs that help
> block IO bpf progs do more valuable and fun things.
>
> Thanks,
> Ming
>
--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com