From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: "Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Matthew Brost" <matthew.brost@intel.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Huang Rui <ray.huang@amd.com>,
	 intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	 matthew.auld@intel.com, David Airlie <airlied@gmail.com>,
	Simona Vetter <simona@ffwll.ch>
Subject: Re: [PATCH v6 2/8] drm/ttm: Add ttm_bo_access
Date: Tue, 12 Nov 2024 17:22:07 +0100
Message-ID: <27f8bd540ac1f04daf8a7786f4ae7828017d061b.camel@linux.intel.com>
In-Reply-To: <173141886970.132411.13400253861916730373@jlahtine-mobl.ger.corp.intel.com>

On Tue, 2024-11-12 at 15:41 +0200, Joonas Lahtinen wrote:
> (+ Thomas)
> 
> Quoting Christian König (2024-11-12 11:23:36)
> > Am 11.11.24 um 23:45 schrieb Matthew Brost:
> > 
> >     [SNIP]
> > 
> >             So I think only way to allow interactive debugging is
> > to avoid the
> >             dma_fences. Curious to hear if there are ideas for
> > otherwise.
> > 
> >         You need to guarantee somehow that the process is taken
> > from the hardware so
> >         that the preemption fence can signal.
> > 
> > 
> >     Our preemption fences have this functionality.
> > 
> >     A preemption fence issues a suspend execution command to the
> > firmware. The
> >     firmware, in turn, attempts to preempt the workload. If it
> > doesn't respond
> >     within a specified period, it resets the hardware queue, sends
> > a message to KMD,
> >     bans the software queue, and signals the preemption fence.
> > 
> >     We provide even more protection than that. If, for some reason,
> > the firmware
> >     doesn't respond within a longer timeout period, the KMD
> > performs a device reset,
> >     ban the offending software queue(s), and will signal the
> > preemption fences.
> > 
> >     This flow remains the same whether a debugger is attached or,
> > for example, a
> >     user submits a 10-minute non-preemptable workload. In either
> > case, other
> >     processes are guaranteed to make forward progress.
> > 
> > 
> > Yeah that is pretty much the same argumentation I have heard before
> > and it
> > turned out to not be working.
> > 
> > 
> >     The example above illustrates the memory oversubscription case,
> > where two
> >     processes are using 51% of the memory.
> > 
> > 
> > That isn't even necessary. We have seen applications dying just
> > because the
> > core memory management tried to join back small pages into huge
> > > pages in a
> > userptr.
> > 
> > That the core memory management jumps in and requests that the pre-
> > emption
> > fence signals can happen all the time.
> 
> Ouch. Does there happen to be a known reproducer for this behavior or
> maybe
> bug report?
> 
> > > You can mitigate that a bit. Fedora, for example, disables joining
> > > back small
> > > pages into huge pages by default, and we even had people
> > > suggesting
> > > to use mprotect() so that userptr VMAs don't fork() any more
> > > (which is of
> > > course completely illegal).
> > 
> > But my long term take away is that you can't block all causes of
> > sudden
> > requests to let a pre-emption fence signal.
> 
> I think this problem equally applies to the LR-workloads like the EU
> debugging ones.
> 
> >     Another preemption scenario involves two processes sharing
> > hardware resources.
> >     Our firmware follows the same flow here. If an LR workload is
> > using a hardware
> >     resource and a DMA-fence workload is waiting, and if the LR
> > workload doesn't
> >     preempt the in a timely manner, the firmware issues a hardware
> > reset, notifies
> >     KMD, and bans the LR software queue. The DMA-fence workload
> > then can make
> >     forward progress
> > 
> >     With the above in mind, this is why I say that if a user tries
> > to run a game and
> >     a non-preemptable LR workload, either oversubscribing memory or
> > sharing hardware
> >     resources, it is unlikely to work well. However, I don't think
> > this is a common
> >     use case. I would expect that when a debugger is open, it is
> > typically by a
> >     power user who knows how to disable other GPU tasks (e.g., by
> > enabling software
> >     rendering or using a machine without any display).
> > 
> >     Given this, please to reconsider your position.
> > 
> > 
> > The key point here is that this isn't stable, you can do that as a
> > tech demo
> > but it can always be that debugging an application just randomly
> > dies. And
> > > believe me AMD has tried this to a rather extreme extent as well.
> 
> It's not really only limited to the debuggable applications at all,
> the
> normal LR workloads are equally impacted as far as I understand. Just
> harder to catch the issue with LR-workloads if the pre-emption fence
> signaling is sporadic.
> 
> > > What could potentially work is to taint the kernel and make
> > > sure that this
> > > function is only available to users who absolutely know what they
> > > are doing.
> > 
> > But I would say we can only allow that if all other options have
> > been exercised
> > and doing it like this is really the only option left.
> 
> It sounds like servicing the memory pre-empt fence by stealing the
> pages from underneath the workload would be the way to resolve this
> issue.
> 
> This has been extensively discussed already, but was expected to
> really
> only be needed for low-on-memory scenarios. However it now seems like
> the need is much earlier due to the random userptr page joining by
> core
> mm.

Just to clarify here:
 
In Long-Running mode with recoverable pagefaults enabled we don't have
any preempt-fences, but rather just zap the PTEs pointing to the
affected memory and flush the TLB. So from a memory resource POV a
breakpoint should be safe, and neither an mmu notifier nor a shrinker
will be blocked.

Nor will there be any jobs with published dma-fences depending on the
job blocked either temporarily by a pagefault or long-term by a
debugger breakpoint.
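
As a rough illustration of the two paragraphs above, here is a toy
userspace model in C. It is not Xe or TTM code, and every name in it
(toy_vm, toy_vm_invalidate, toy_vm_gpu_access) is made up for the
example. It only sketches the claim that, in fault mode, invalidation
zaps PTEs and flushes the TLB without waiting on the GPU thread, so a
thread held at a breakpoint does not block it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in Xe or TTM. */
#define TOY_NUM_PAGES 8u

struct toy_vm {
	bool pte_valid[TOY_NUM_PAGES];  /* true while the GPU mapping is present */
	unsigned int tlb_flushes;       /* number of TLB flushes issued */
	unsigned int faults_serviced;   /* recoverable faults handled so far */
	bool stopped_at_breakpoint;     /* debugger holds the GPU thread */
};

/* Map every page, as if the workload had already touched all of them. */
static void toy_vm_init(struct toy_vm *vm)
{
	for (size_t i = 0; i < TOY_NUM_PAGES; i++)
		vm->pte_valid[i] = true;
	vm->tlb_flushes = 0;
	vm->faults_serviced = 0;
	vm->stopped_at_breakpoint = false;
}

/*
 * Invalidation path (what an mmu notifier or shrinker would trigger in
 * this model): zap the PTEs and flush the TLB. It never waits for the
 * GPU thread, so it completes even while that thread sits at a
 * breakpoint. Returns the number of PTEs zapped.
 */
static size_t toy_vm_invalidate(struct toy_vm *vm, size_t first, size_t count)
{
	size_t zapped = 0;

	assert(first <= TOY_NUM_PAGES && count <= TOY_NUM_PAGES - first);
	for (size_t i = first; i < first + count; i++) {
		if (vm->pte_valid[i]) {
			vm->pte_valid[i] = false;
			zapped++;
		}
	}
	vm->tlb_flushes++;
	return zapped;
}

/*
 * GPU-side access. Returns false if the thread is held by the debugger
 * (no access happens). Otherwise a missing PTE raises a recoverable
 * fault that re-establishes the mapping before the access proceeds.
 */
static bool toy_vm_gpu_access(struct toy_vm *vm, size_t page)
{
	assert(page < TOY_NUM_PAGES);
	if (vm->stopped_at_breakpoint)
		return false;
	if (!vm->pte_valid[page]) {
		vm->pte_valid[page] = true;
		vm->faults_serviced++;
	}
	return true;
}
```

In this model, calling toy_vm_invalidate() while stopped_at_breakpoint
is set still returns immediately, and the zapped pages are simply
faulted back in by toy_vm_gpu_access() once the thread resumes. A
preempt-fence based design would instead need the thread off the
hardware before the invalidation could finish.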

/Thomas


> 
> If that is done and the memory pre-empt fence is serviced even for
> debuggable contexts, do you have further concerns with the presented
> approach
> from dma-buf and drm/sched perspective?
> 
> Regards, Joonas
> 
> > 
> > Regards,
> > Christian.
> > 
> > 
> >         This means that a breakpoint or core dump doesn't halt GPU
> > threads, but
> >         rather suspends them. E.g. all running wave data is
> > collected into a state
> >         bag which can be restored later on.
> > 
> >         I was under the impression that those long running compute
> > threads do
> >         exactly that, but when the hardware can't switch out the
> > GPU thread/process
> >         while in a break then that isn't the case.
> > 
> >         As long as you don't find a way to avoid that this patch
> > set is a pretty
> >         clear NAK from my side as DMA-buf and TTM maintainer.
> > 
> > 
> >     I believe this is addressed above.
> > 
> >     Matt
> > 
> > 
> >         What might work is to keep the submission on the hardware
> > in the break state
> >         but forbid any memory access. This way you can signal your
> > preemption fence
> >         even when the hardware isn't made available.
> > 
> >         Before you continue XE setups a new pre-emption fence and
> > makes sure that
> >         all page tables etc... are up to date.
> > 
> >         Could be tricky to get this right if completion fence based
> > submissions are
> >         mixed in as well, but that gives you at least a direction
> > you could
> >         potentially go.
> > 
> >         Regards,
> >         Christian.
> > 
> > 
> >             Regards, Joonas
> > 
> > 
> >                 Regards,
> >                 Christian.
> > 
> > 
> >                     Some wash-up thoughts from me below, but
> > consider them fairly irrelevant
> >                     since I think the main driver for these big
> > questions here should be
> >                     gdb/userspace.
> > 
> > 
> >                         Quoting Christian König (2024-11-07
> > 11:44:33)
> > 
> >                             Am 06.11.24 um 18:00 schrieb Matthew
> > Brost:
> > 
> >                                   [SNIP]
> > 
> >                                   This is not a generic interface
> > that anyone can freely access. The same
> >                                   permissions used by ptrace are
> > checked when opening such an interface.
> >                                   See [1] [2].
> > 
> >                                  
> > > [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
> >                                  
> > > [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
> > 
> > 
> >                             Thanks a lot for those pointers, that
> > is exactly what I was looking for.
> > 
> >                             And yeah, it is what I feared. You are
> > re-implementing existing functionality,
> >                             but see below.
> > 
> >                         Could you elaborate on what this "existing
> > functionality" exactly is?
> >                         I do not think this functionality exists at
> > this time.
> > 
> >                         The EU debugging architecture for Xe
> > specifically avoids the need for GDB
> >                         to attach with ptrace to the CPU process or
> > interfere with the CPU process for
> >                         the debugging via parasitic threads or so.
> > 
> >                         Debugger connection is opened to the DRM
> > driver for given PID (which uses the
> >                         ptrace may access check for now) after
> > > which all DRM clients of that
> >                         PID are exposed to the debugger process.
> > 
> >                         What we want to expose via that debugger
> > connection is the ability for GDB to
> >                         read/write the different GPU VM address
> > spaces (ppGTT for Intel GPUs) just like
> >                         the EU threads would see them. Note that
> > the layout of the ppGTT is
> >                         completely up to the userspace driver to
> > setup and is mostly only partially
> >                         equal to the CPU address space.
> > 
> >                         Specifically as part of reading/writing the
> > ppGTT for debugging purposes,
> >                         there are deep flushes needed: for example
> > flushing instruction cache
> >                         when adding/removing breakpoints.
> > 
> >                         Maybe that will explain the background. I
> > elaborate on this at the end some more.
> > 
> > 
> >                                           kmap/vmap are used
> > everywhere in the DRM subsystem to access BOs, so I’m
> >                                           failing to see the
> > problem with adding a simple helper based on existing
> >                                           code.
> > 
> >                                       What#s possible and often
> > done is to do kmap/vmap if you need to implement a
> >                                       CPU copy for scanout for
> > example or for copying/validating command buffers.
> >                                       But that usually requires
> > accessing the whole BO and has separate security
> >                                       checks.
> > 
> >                                       When you want to access only
> > a few bytes of a BO that sounds massively like
> >                                       a peek/poke like interface
> > and we have already rejected that more than once.
> >                                       There even used to be
> > standardized GEM IOCTLs for that which have been
> >                                       removed by now.
> > 
> >                         Referring to the explanation at top: These
> > IOCTL are not for the debugging target
> >                         process to issue. The peek/poke interface
> > is specifically for GDB only
> >                         to facilitate the emulation of memory
> > reads/writes on the GPU address
> >                         space as they were done by EUs themselves.
> > And to recap: for modifying
> >                         instructions for example (add/remove
> > breakpoint), extra level of cache flushing is
> >                         needed which is not available to regular
> > userspace.
> > 
> >                         I specifically discussed with Sima on the
> > difference before moving forward with this
> >                         design originally. If something has changed
> > since then, I'm of course happy to rediscuss.
> > 
> >                         However, if this code can't be added, not
> > sure how we would ever be able
> >                         to implement core dumps for GPU
> > threads/memory?
> > 
> > 
> >                                       If you need to access BOs
> > which are placed in not CPU accessible memory then
> >                                       implement the access callback
> > for ptrace, see amdgpu_ttm_access_memory for
> >                                       an example how to do this.
> > 
> >                         As also mentioned above, we don't work via
> > ptrace at all when it comes
> >                         to debugging the EUs. The only thing used
> > for now is the ptrace_may_access to
> >                         implement similar access restrictions as
> > ptrace has. This can be changed
> >                         to something else if needed.
> > 
> > 
> >                                   Ptrace access via
> > vm_operations_struct.access → ttm_bo_vm_access.
> > 
> >                                   This series renames
> > ttm_bo_vm_access to ttm_bo_access, with no code changes.
> > 
> >                                   The above function accesses a BO
> > via kmap if it is in SYSTEM / TT,
> >                                   which is existing code.
> > 
> >                                   This function is only exposed to
> > user space via ptrace permissions.
> > 
> >                         Maybe this sentence is what caused the
> > confusion.
> > 
> >                         Userspace is never exposed with peek/poke
> > interface, only the debugger
> >                         connection which is its own FD.
> > 
> > 
> >                                   In this series, we implement a
> > function [3] similar to
> >                                   amdgpu_ttm_access_memory for the
> > TTM vfunc access_memory. What is
> >                                   missing is non-visible CPU memory
> > access, similar to
> >                                   amdgpu_ttm_access_memory_sdma.
> > This will be addressed in a follow-up and
> >                                   was omitted in this series given
> > its complexity.
> > 
> >                                   So, this looks more or less
> > identical to AMD's ptrace implementation,
> >                                   but in GPU address space. Again,
> > I fail to see what the problem is here.
> >                                   What am I missing?
> > 
> > 
> >                             The main question is why can't you use
> > the existing interfaces directly?
> > 
> >                         We're not working on the CPU address space
> > or BOs. We're working
> >                         strictly on the GPU address space as would
> > be seen by an EU thread if it
> >                         accessed address X.
> > 
> > 
> >                             Additional to the peek/poke interface
> > of ptrace Linux has the pidfd_getfd
> >                             system call, see
> > herehttps://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
> > 
> >                             The pidfd_getfd() allows to dup() the
> > render node file descriptor into your gdb
> >                             process. That in turn gives you all the
> > access you need from gdb, including
> >                             mapping BOs and command submission on
> > behalf of the application.
> > 
> >                         We're not operating on the CPU address
> > space nor are we operating on BOs
> >                         (there is no concept of BO in the EU debug
> > interface). Each VMA in the VM
> >                         could come from anywhere, only the start
> > address and size matter. And
> >                         neither do we need to interfere with the
> > command submission of the
> >                         process under debug.
> > 
> > 
> >                             As far as I can see that allows for the
> > same functionality as the eudebug
> >                             interface, just without any driver
> > specific code messing with ptrace
> >                             permissions and peek/poke interfaces.
> > 
> >                             So the question is still why do you
> > need the whole eudebug interface in the
> >                             first place? I might be missing
> > something, but that seems to be superfluous
> >                             from a high level view.
> > 
> >                         Recapping from above. It is to allow the
> > debugging of EU threads per DRM
> >                         client, completely independent of the CPU
> > > process. If ptrace_may_access
> >                         is the sore point, we could consider other
> > permission checks, too. There
> >                         is no other connection to ptrace in this
> > > architecture than a single
> >                         permission check to know if PID is fair
> > game to access by debugger
> >                         process.
> > 
> >                         Why no parasitic thread or ptrace: Going
> > forward, binding the EU debugging to
> >                         the DRM client would also pave way for
> > being able to extend core kernel generated
> >                         core dump with each DRM client's EU
> > thread/memory dump. We have similar
> >                         feature called "Offline core dump" enabled
> > in the downstream public
> >                         trees for i915, where we currently attach
> > the EU thread dump to i915 error state
> >                         and then later combine i915 error state
> > with CPU core dump file with a
> >                         tool.
> > 
> >                         This is relatively little amount of extra
> > code, as this baseline series
> >                         already introduces GDB the ability to
> > perform the necessary actions.
> >                         It's just the matter of kernel driver
> > calling: "stop all threads", then
> >                         copying the memory map and memory contents
> > for GPU threads, just like is
> >                         done for CPU threads.
> > 
> >                         With parasitic thread injection, not sure
> > if there is such way forward,
> >                         as it would seem to require to inject quite
> > > a bit more logic to core kernel?
> > 
> > 
> >                             It's true that the AMD KFD part has
> > still similar functionality, but that is
> >                             because of the broken KFD design of
> > tying driver state to the CPU process
> >                             (which makes it inaccessible for gdb
> > even with imported render node fd).
> > 
> >                             Both Sima and I (and partially Dave as
> > well) have pushed back on the KFD
> >                             approach. And the long term plan is to
> > get rid of such device driver specific
> >                             interface which re-implement existing
> > functionality just differently.
> > 
> >                         Recapping, this series is not adding it
> > back. The debugger connection
> >                         is a separate FD from the DRM one, with
> > separate IOCTL set. We don't allow
> >                         the DRM FD any new operations based on
> > > whether ptrace is attached or not. We
> >                         don't ever do that check even.
> > 
> >                         We only restrict the opening of the
> > debugger connection to given PID with
> >                         ptrace_may_access check for now. That can
> > be changed to something else,
> >                         if necessary.
> > 
> >                     Yeah I think unnecessarily tying gpu processes
> > to cpu processes is a bad
> >                     thing, least because even today all the svm
> > discussions we have still hit
> >                     clear use-cases, where a 1:1 match is not
> > wanted (like multiple gpu svm
> >                     sections with offsets). Not even speaking of
> > all the gpu usecases where
> >                     the gpu vm space is still entirely independent
> > of the cpu side.
> > 
> >                     So that's why I think this entirely separate
> > approach looks like the right
> >                     one, with ptrace_may_access as the access
> > control check to make sure we
> >                     match ptrace on the cpu side.
> > 
> >                     But there's very obviously a bikeshed to be had
> > on what the actual uapi
> >                     should look like, especially how gdb opens up a
> > gpu debug access fd. But I
> >                     also think that's not much on drm to decide,
> > but whatever gdb wants. And
> >                     then we aim for some consistency on that
> > lookup/access control part
> >                     (ideally, I might be missing some reasons why
> > this is a bad idea) across
> >                     drm drivers.
> > 
> > 
> >                             So you need to have a really really
> > good explanation why the eudebug interface
> >                             is actually necessary.
> > 
> >                         TL;DR The main point is to decouple the
> > debugging of the EU workloads from the
> >                         debugging of the CPU process. This avoids
> > the interference with the CPU process with
> >                         parasitic thread injection. Further this
> > also allows generating a core dump
> >                         without any GDB connected. There are also
> > many other smaller pros/cons
> >                         which can be discussed but for the context
> > of this patch, this is the
> >                         main one.
> > 
> >                         So unlike parasitic thread injection, we
> > don't unlock any special IOCTL for
> >                         the process under debug to be performed by
> > the parasitic thread, but we
> >                         allow the minimal set of operations to be
> > performed by GDB as if those were
> >                         done on the EUs themselves.
> > 
> >                         One can think of it like the minimal subset
> > of ptrace but for EU threads,
> >                         not the CPU threads. And thus, building on
> > this it's possible to extend
> >                         the core kernel generated core dumps with
> > DRM specific extension which
> >                         would contain the EU thread/memory dump.
> > 
> >                     It might be good to document (in that debugging
> > doc patch probably) why
> >                     thread injection is not a great option, and why
> > the tradeoffs for
> >                     debugging are different than for for
> > checkpoint/restore, where with CRIU
> >                     we landed on doing most of this in userspace,
> > and often requiring
> >                     injection threads to make it all work.
> > 
> >                     Cheers, Sima
> > 
> > 
> >                         Regards, Joonas
> > 
> > 
> >                             Regards,
> >                             Christian.
> > 
> > 
> > 
> >                                   Matt
> > 
> >                                  
> > > [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
> > 
> > 
> >                                       Regards,
> >                                       Christian.
> > 
> > 
> >                                           Matt
> > 
> > 
> >                                               Regards,
> >                                               Christian.
> > 
> > 
> > 



Thread overview: 56+ messages
2024-10-31 18:10 [PATCH v6 0/8] Fix non-contiguous VRAM BO access in Xe Matthew Brost
2024-10-31 18:10 ` [PATCH v6 1/8] drm/xe: Add xe_bo_vm_access Matthew Brost
2024-10-31 18:10 ` [PATCH v6 2/8] drm/ttm: Add ttm_bo_access Matthew Brost
2024-10-31 23:43   ` Matthew Brost
2024-11-04 17:34     ` Rodrigo Vivi
2024-11-04 19:28       ` Christian König
2024-11-04 21:49         ` Matthew Brost
2024-11-05  7:41           ` Christian König
2024-11-05 18:35             ` Matthew Brost
2024-11-06  9:48               ` Christian König
2024-11-06 15:25                 ` Matthew Brost
2024-11-06 15:44                   ` Christian König
2024-11-06 17:00                     ` Matthew Brost
2024-11-07  9:44                       ` Christian König
2024-11-11  8:00                         ` Joonas Lahtinen
2024-11-11 10:10                           ` Simona Vetter
2024-11-11 11:34                             ` Christian König
2024-11-11 14:00                               ` Joonas Lahtinen
2024-11-11 15:54                                 ` Christian König
2024-11-11 22:45                                   ` Matthew Brost
2024-11-12  9:23                                     ` Christian König
2024-11-12 13:41                                       ` Joonas Lahtinen
2024-11-12 16:22                                         ` Thomas Hellström [this message]
2024-11-12 16:25                                           ` Christian König
2024-11-12 16:33                                             ` Thomas Hellström
2024-11-13  8:37                                               ` Christian König
2024-11-13 10:44                                                 ` Thomas Hellström
2024-11-13 11:42                                                   ` Christian König
2024-11-15 18:27                                                     ` Matthew Brost
2024-11-25 15:29                                                       ` Matthew Brost
2024-11-25 16:19                                                         ` Christian König
2024-11-25 17:27                                                           ` Matthew Brost
2024-11-26  8:19                                                             ` Christian König
2024-11-26 17:49                                                               ` Matthew Brost
2024-11-27 13:21                                                                 ` Christian König
2024-11-12  8:28                                 ` Simona Vetter
2024-11-12  8:58                                   ` Christian König
2024-11-12 13:30                                     ` Joonas Lahtinen
2024-11-11 11:27                           ` Christian König
2024-11-04 19:47     ` Christian König
2024-11-04 21:30       ` Matthew Brost
2024-11-04 22:26         ` Rodrigo Vivi
2024-10-31 18:10 ` [PATCH v6 3/8] drm/xe: Add xe_ttm_access_memory Matthew Brost
2024-10-31 18:10 ` [PATCH v6 4/8] drm/xe: Take PM ref in delayed snapshot capture worker Matthew Brost
2024-10-31 18:10 ` [PATCH v6 5/8] drm/xe/display: Update intel_bo_read_from_page to use ttm_bo_access Matthew Brost
2024-10-31 18:10 ` [PATCH v6 6/8] drm/xe: Use ttm_bo_access in xe_vm_snapshot_capture_delayed Matthew Brost
2024-10-31 18:10 ` [PATCH v6 7/8] drm/xe: Set XE_BO_FLAG_PINNED in migrate selftest BOs Matthew Brost
2024-10-31 18:10 ` [PATCH v6 8/8] drm/xe: Only allow contiguous BOs to use xe_bo_vmap Matthew Brost
2024-10-31 18:15 ` ✓ CI.Patch_applied: success for Fix non-contiguous VRAM BO access in Xe (rev6) Patchwork
2024-10-31 18:15 ` ✗ CI.checkpatch: warning " Patchwork
2024-10-31 18:17 ` ✓ CI.KUnit: success " Patchwork
2024-10-31 18:28 ` ✓ CI.Build: " Patchwork
2024-10-31 18:31 ` ✓ CI.Hooks: " Patchwork
2024-10-31 18:32 ` ✗ CI.checksparse: warning " Patchwork
2024-10-31 18:57 ` ✓ CI.BAT: success " Patchwork
2024-10-31 21:27 ` ✗ CI.FULL: failure " Patchwork
