* [PATCH RFC 01/13] mm/rmap: remove folio->_nr_pages_mapped
From: David Hildenbrand (Arm) @ 2026-04-12 18:59 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Rik van Riel, Harry Yoo, Jann Horn, Brendan Jackman, Zi Yan,
Pedro Falcato, Matthew Wilcox
Cc: cgroups, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
David Hildenbrand (Arm)
In-Reply-To: <20260412-mapcount-v1-0-05e8dfab52e0@kernel.org>
In preparation for removing CONFIG_PAGE_MAPCOUNT, let's stop updating
a folio's _nr_pages_mapped and remove it.
This will make CONFIG_PAGE_MAPCOUNT behave just like
CONFIG_NO_PAGE_MAPCOUNT, in particular:
(1) We account folios as fully mapped as soon as a single page is
mapped. That is visible through:
(1) Memcg stats (e.g., "anon" and "file_mapped" in cgroup v2)
(2) System stats (e.g., "AnonPages:", "Mapped:" and "Shmem"
in /proc/meminfo)
(3) Per-node stats (e.g., "AnonPages:", "Mapped:" and "Shmem")
in /sys/devices/system/node/nodeX/meminfo
Especially for anonymous memory, that memory consumption is now visible
for partially-mapped folios until actually split and the unmapped pages
are reclaimed.
(2) We do not detect partially-mapped anonymous folios in all cases
We now detect partial mappings based on the average per-page mapcount in a
folio: if it is < 1, at least one page is not mapped.
In the most common case (exclusive anonymous folios), we always detect
partial mappings this way reliably.
Example scenarios where we will not detect partial mappings:
(A) Allocate a THP and fork a child process. Then, unmap up to half of the
THP in the parent *and* the child. Once the child quits, we detect
the partial mapping.
The folio mapcount will be >= 512 -> Average >= 1.
(B) Allocate a THP and fork 511 child processes. Then, unmap all but one
page in *all* processes.
The folio mapcount will be 512 -> Average == 1.
There are two main ideas on how to detect these cases as well, if we
ever get a real indication that this is problematic:
* Let memory reclaim scan candidates (shared anonymous folios) to detect
partial mappings.
* Add candidate folios to the deferred split queue and let the deferred
shrinker detect partial mappings.
More code cleanups are possible, but we'll defer that and focus on the
core change here.
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 6 +-
Documentation/admin-guide/cgroup-v2.rst | 13 +-
Documentation/mm/transhuge.rst | 23 ++--
include/linux/mm_types.h | 4 +-
include/linux/rmap.h | 4 +-
mm/debug.c | 3 +-
mm/internal.h | 24 ----
mm/page_alloc.c | 5 -
mm/rmap.c | 159 ++++++++-----------------
9 files changed, 69 insertions(+), 172 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 7db63c002922..ddb5ff5cee15 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -609,9 +609,9 @@ memory.stat file includes following statistics:
'rss + mapped_file" will give you resident set size of cgroup.
- Note that some kernel configurations might account complete larger
- allocations (e.g., THP) towards 'rss' and 'mapped_file', even if
- only some, but not all that memory is mapped.
+ Note that the kernel accounts entire larger allocations (e.g., THP)
+ towards 'rss' and 'mapped_file' if any part of such an allocation
+ is mapped.
(Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is owner of page
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8ad0b2781317..aa703ec89e29 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1538,10 +1538,9 @@ The following nested keys are defined.
anon
Amount of memory used in anonymous mappings such as
- brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
- some kernel configurations might account complete larger
- allocations (e.g., THP) if only some, but not all the
- memory of such an allocation is mapped anymore.
+ brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that the
+ kernel accounts entire larger allocations (e.g., THP) towards
+ "anon" if any part of such an allocation is mapped.
file
Amount of memory used to cache filesystem data,
@@ -1585,9 +1584,9 @@ The following nested keys are defined.
file_mapped
Amount of cached filesystem data mapped with mmap(). Note
- that some kernel configurations might account complete
- larger allocations (e.g., THP) if only some, but not
- not all the memory of such an allocation is mapped.
+ that the kernel accounts entire larger allocations
+ (e.g., THP) towards "file_mapped" if any part of such an
+ allocation is mapped.
file_dirty
Amount of cached filesystem data that was modified but
diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 0e7f8e4cd2e3..f200c1ac19cb 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -122,10 +122,6 @@ pages:
corresponding mapcount), and the current status ("maybe mapped shared" vs.
"mapped exclusively").
- With CONFIG_PAGE_MAPCOUNT, we also increment/decrement
- folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount goes
- from -1 to 0 or 0 to -1.
-
- map/unmap of individual pages with PTE entry increment/decrement
folio->_large_mapcount.
@@ -134,9 +130,7 @@ pages:
"mapped exclusively").
With CONFIG_PAGE_MAPCOUNT, we also increment/decrement
- page->_mapcount and increment/decrement folio->_nr_pages_mapped when
- page->_mapcount goes from -1 to 0 or 0 to -1 as this counts the number
- of pages mapped by PTE.
+ page->_mapcount.
split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
@@ -181,12 +175,9 @@ The function deferred_split_folio() is used to queue a folio for splitting.
The splitting itself will happen when we get memory pressure via shrinker
interface.
-With CONFIG_PAGE_MAPCOUNT, we reliably detect partial mappings based on
-folio->_nr_pages_mapped.
-
-With CONFIG_NO_PAGE_MAPCOUNT, we detect partial mappings based on the
-average per-page mapcount in a THP: if the average is < 1, an anon THP is
-certainly partially mapped. As long as only a single process maps a THP,
-this detection is reliable. With long-running child processes, there can
-be scenarios where partial mappings can currently not be detected, and
-might need asynchronous detection during memory reclaim in the future.
+We detect partial mappings based on the average per-page mapcount in a THP: if
+the average is < 1, an anon THP is certainly partially mapped. As long as
+only a single process maps a THP, this detection is reliable. With
+long-running child processes, there can be scenarios where partial mappings
+can currently not be detected, and might need asynchronous detection during
+memory reclaim in the future.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a308e2c23b82..47b2c3d05f41 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -377,7 +377,7 @@ typedef unsigned short mm_id_t;
* @_last_cpupid: IDs of last CPU and last process that accessed the folio.
* @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
* @_large_mapcount: Do not use directly, call folio_mapcount().
- * @_nr_pages_mapped: Do not use outside of rmap and debug code.
+ * @_unused_1: Temporary placeholder.
* @_pincount: Do not use directly, call folio_maybe_dma_pinned().
* @_nr_pages: Do not use directly, call folio_nr_pages().
* @_mm_id: Do not use outside of rmap code.
@@ -452,7 +452,7 @@ struct folio {
struct {
/* public: */
atomic_t _large_mapcount;
- atomic_t _nr_pages_mapped;
+ unsigned int _unused_1;
#ifdef CONFIG_64BIT
atomic_t _entire_mapcount;
atomic_t _pincount;
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8dc0871e5f00..e5569f5fdaec 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -291,7 +291,7 @@ static inline void folio_add_large_mapcount(struct folio *folio,
static inline int folio_add_return_large_mapcount(struct folio *folio,
int diff, struct vm_area_struct *vma)
{
- BUILD_BUG();
+ return atomic_add_return(diff, &folio->_large_mapcount) + 1;
}
static inline void folio_sub_large_mapcount(struct folio *folio,
@@ -303,7 +303,7 @@ static inline void folio_sub_large_mapcount(struct folio *folio,
static inline int folio_sub_return_large_mapcount(struct folio *folio,
int diff, struct vm_area_struct *vma)
{
- BUILD_BUG();
+ return atomic_sub_return(diff, &folio->_large_mapcount) + 1;
}
#endif /* CONFIG_MM_ID */
diff --git a/mm/debug.c b/mm/debug.c
index 77fa8fe1d641..bfb41ef17a5e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -86,11 +86,10 @@ static void __dump_folio(const struct folio *folio, const struct page *page,
if (folio_has_pincount(folio))
pincount = atomic_read(&folio->_pincount);
- pr_warn("head: order:%u mapcount:%d entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
+ pr_warn("head: order:%u mapcount:%d entire_mapcount:%d pincount:%d\n",
folio_order(folio),
folio_mapcount(folio),
folio_entire_mapcount(folio),
- folio_nr_pages_mapped(folio),
pincount);
}
diff --git a/mm/internal.h b/mm/internal.h
index c693646e5b3f..30e48f39d2de 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -103,34 +103,12 @@ struct pagetable_move_control {
void page_writeback_init(void);
-/*
- * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
- * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
- * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Hugetlb currently
- * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
- */
-#define ENTIRELY_MAPPED 0x800000
-#define FOLIO_PAGES_MAPPED (ENTIRELY_MAPPED - 1)
-
/*
* Flags passed to __show_mem() and show_free_areas() to suppress output in
* various contexts.
*/
#define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */
-/*
- * How many individual pages have an elevated _mapcount. Excludes
- * the folio's entire_mapcount.
- *
- * Don't use this function outside of debugging code.
- */
-static inline int folio_nr_pages_mapped(const struct folio *folio)
-{
- if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT))
- return -1;
- return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
-}
-
/*
* Retrieve the first entry of a folio based on a provided entry within the
* folio. We cannot rely on folio->swap as there is no guarantee that it has
@@ -885,8 +863,6 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
folio_set_order(folio, order);
atomic_set(&folio->_large_mapcount, -1);
- if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT))
- atomic_set(&folio->_nr_pages_mapped, 0);
if (IS_ENABLED(CONFIG_MM_ID)) {
folio->_mm_ids = 0;
folio->_mm_id_mapcount[0] = -1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1c5430cad4e..8888f31aca49 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1111,11 +1111,6 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
bad_page(page, "nonzero large_mapcount");
goto out;
}
- if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT) &&
- unlikely(atomic_read(&folio->_nr_pages_mapped))) {
- bad_page(page, "nonzero nr_pages_mapped");
- goto out;
- }
if (IS_ENABLED(CONFIG_MM_ID)) {
if (unlikely(folio->_mm_id_mapcount[0] != -1)) {
bad_page(page, "nonzero mm mapcount 0");
diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367c..df42c38fe387 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1353,9 +1353,8 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
enum pgtable_level level)
{
- atomic_t *mapped = &folio->_nr_pages_mapped;
+ int nr = 0, nr_pmdmapped = 0, mapcount;
const int orig_nr_pages = nr_pages;
- int first = 0, nr = 0, nr_pmdmapped = 0;
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1366,61 +1365,25 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
break;
}
- if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
- nr = folio_add_return_large_mapcount(folio, orig_nr_pages, vma);
- if (nr == orig_nr_pages)
- /* Was completely unmapped. */
- nr = folio_large_nr_pages(folio);
- else
- nr = 0;
- break;
+ if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
+ do {
+ atomic_inc(&page->_mapcount);
+ } while (page++, --nr_pages > 0);
}
- do {
- first += atomic_inc_and_test(&page->_mapcount);
- } while (page++, --nr_pages > 0);
-
- if (first &&
- atomic_add_return_relaxed(first, mapped) < ENTIRELY_MAPPED)
- nr = first;
-
- folio_add_large_mapcount(folio, orig_nr_pages, vma);
+ mapcount = folio_add_return_large_mapcount(folio, orig_nr_pages, vma);
+ if (mapcount == orig_nr_pages)
+ nr = folio_large_nr_pages(folio);
break;
case PGTABLE_LEVEL_PMD:
case PGTABLE_LEVEL_PUD:
- first = atomic_inc_and_test(&folio->_entire_mapcount);
- if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
- if (level == PGTABLE_LEVEL_PMD && first)
- nr_pmdmapped = folio_large_nr_pages(folio);
- nr = folio_inc_return_large_mapcount(folio, vma);
- if (nr == 1)
- /* Was completely unmapped. */
- nr = folio_large_nr_pages(folio);
- else
- nr = 0;
- break;
- }
+ if (atomic_inc_and_test(&folio->_entire_mapcount) &&
+ level == PGTABLE_LEVEL_PMD)
+ nr_pmdmapped = HPAGE_PMD_NR;
- if (first) {
- nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
- if (likely(nr < ENTIRELY_MAPPED + ENTIRELY_MAPPED)) {
- nr_pages = folio_large_nr_pages(folio);
- /*
- * We only track PMD mappings of PMD-sized
- * folios separately.
- */
- if (level == PGTABLE_LEVEL_PMD)
- nr_pmdmapped = nr_pages;
- nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
- /* Raced ahead of a remove and another add? */
- if (unlikely(nr < 0))
- nr = 0;
- } else {
- /* Raced ahead of a remove of ENTIRELY_MAPPED */
- nr = 0;
- }
- }
- folio_inc_large_mapcount(folio, vma);
+ mapcount = folio_inc_return_large_mapcount(folio, vma);
+ if (mapcount == 1)
+ nr = folio_large_nr_pages(folio);
break;
default:
BUILD_BUG();
@@ -1676,15 +1639,11 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
}
folio_set_large_mapcount(folio, nr, vma);
- if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT))
- atomic_set(&folio->_nr_pages_mapped, nr);
} else {
nr = folio_large_nr_pages(folio);
/* increment count (starts at -1) */
atomic_set(&folio->_entire_mapcount, 0);
folio_set_large_mapcount(folio, 1, vma);
- if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT))
- atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED);
if (exclusive)
SetPageAnonExclusive(&folio->page);
nr_pmdmapped = nr;
@@ -1773,12 +1732,28 @@ void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
#endif
}
+static bool __folio_certainly_partially_mapped(struct folio *folio, int mapcount)
+{
+ /*
+ * This is a best-effort check only: if the average per-page
+ * mapcount in the folio is smaller than 1, at least one page is not
+ * mapped -> partially mapped. This is always reliable for exclusive
+ * folios.
+ *
+ * We will not detect partial mappings in all scenarios:
+ * when a folio becomes partially mapped while shared and the
+ * average per-page mapcount is >= 1. However, we will detect the
+ * partial mapping once it becomes exclusively mapped again.
+ */
+ return mapcount && !folio_entire_mapcount(folio) &&
+ mapcount < folio_large_nr_pages(folio);
+}
+
static __always_inline void __folio_remove_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
enum pgtable_level level)
{
- atomic_t *mapped = &folio->_nr_pages_mapped;
- int last = 0, nr = 0, nr_pmdmapped = 0;
+ int nr = 0, nr_pmdmapped = 0, mapcount;
bool partially_mapped = false;
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1790,67 +1765,29 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
break;
}
- if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
- nr = folio_sub_return_large_mapcount(folio, nr_pages, vma);
- if (!nr) {
- /* Now completely unmapped. */
- nr = folio_large_nr_pages(folio);
- } else {
- partially_mapped = nr < folio_large_nr_pages(folio) &&
- !folio_entire_mapcount(folio);
- nr = 0;
- }
- break;
- }
-
- folio_sub_large_mapcount(folio, nr_pages, vma);
- do {
- last += atomic_add_negative(-1, &page->_mapcount);
- } while (page++, --nr_pages > 0);
+ mapcount = folio_sub_return_large_mapcount(folio, nr_pages, vma);
+ if (!mapcount)
+ nr = folio_large_nr_pages(folio);
- if (last &&
- atomic_sub_return_relaxed(last, mapped) < ENTIRELY_MAPPED)
- nr = last;
+ if (IS_ENABLED(CONFIG_PAGE_MAPCOUNT)) {
+ do {
+ atomic_dec(&page->_mapcount);
+ } while (page++, --nr_pages > 0);
+ }
- partially_mapped = nr && atomic_read(mapped);
+ partially_mapped = __folio_certainly_partially_mapped(folio, mapcount);
break;
case PGTABLE_LEVEL_PMD:
case PGTABLE_LEVEL_PUD:
- if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
- last = atomic_add_negative(-1, &folio->_entire_mapcount);
- if (level == PGTABLE_LEVEL_PMD && last)
- nr_pmdmapped = folio_large_nr_pages(folio);
- nr = folio_dec_return_large_mapcount(folio, vma);
- if (!nr) {
- /* Now completely unmapped. */
- nr = folio_large_nr_pages(folio);
- } else {
- partially_mapped = last &&
- nr < folio_large_nr_pages(folio);
- nr = 0;
- }
- break;
- }
+ mapcount = folio_dec_return_large_mapcount(folio, vma);
+ if (!mapcount)
+ nr = folio_large_nr_pages(folio);
- folio_dec_large_mapcount(folio, vma);
- last = atomic_add_negative(-1, &folio->_entire_mapcount);
- if (last) {
- nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);
- if (likely(nr < ENTIRELY_MAPPED)) {
- nr_pages = folio_large_nr_pages(folio);
- if (level == PGTABLE_LEVEL_PMD)
- nr_pmdmapped = nr_pages;
- nr = nr_pages - nr;
- /* Raced ahead of another remove and an add? */
- if (unlikely(nr < 0))
- nr = 0;
- } else {
- /* An add of ENTIRELY_MAPPED raced ahead */
- nr = 0;
- }
- }
+ if (atomic_add_negative(-1, &folio->_entire_mapcount) &&
+ level == PGTABLE_LEVEL_PMD)
+ nr_pmdmapped = HPAGE_PMD_NR;
- partially_mapped = nr && nr < nr_pmdmapped;
+ partially_mapped = __folio_certainly_partially_mapped(folio, mapcount);
break;
default:
BUILD_BUG();
--
2.43.0
^ permalink raw reply related
* [PATCH RFC 00/13] mm/rmap: support arbitrary folio mappings
From: David Hildenbrand (Arm) @ 2026-04-12 18:59 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Rik van Riel, Harry Yoo, Jann Horn, Brendan Jackman, Zi Yan,
Pedro Falcato, Matthew Wilcox
Cc: cgroups, linux-doc, linux-kernel, linux-mm, linux-fsdevel,
David Hildenbrand (Arm)
This series is related to my LSF/MM/BPF topic:
[LSF/MM/BPF TOPIC] Towards removing CONFIG_PAGE_MAPCOUNT [1]
And does the following things:
(a) Gets rid of CONFIG_PAGE_MAPCOUNT, stopping rmap-related code to no
longer use page->_mapcount.
(b) Converts the entire mapcount to a "total mapped pages" counter, that
can trivially be used to calculate the per-page average mapcount in
a folio.
(c) Cleans up the code heavily,
(d) Teaches RMAP code to support arbitrary folio mappings: For example,
supporting PMD-mapping of folios that span multiple PMDs.
Initially, I wanted to use a PMD + PUD mapcount, but once I realized that
we can do the same thing much easier with a "total mapped pages" counters,
I tried that. And was surprised how clean it looks.
More details in the last patch.
Functional Changes
------------------
The kernel now always behaves like CONFIG_PAGE_NO_MAPCOUNT currently
does, in particular:
(1) System/node/memcg stats account large folios as fully mapped as soon
as a single page is mapped, instead of the precise number of pages
a partially-mapped folio has mapped. For example, this affects
"AnonPages:", "Mapped:" and "Shmem" in /proc/meminfo.
(2) "mapmax" part of /proc/$PID/numa_maps uses the average page mapcount
in a folio instead of the effective page mapcount.
(3) Determining the PM_MMAP_EXCLUSIVE flag for /proc/$PID/pagemap is based on
folio_maybe_mapped_shared() instead of the effective page mapcount.
(4) /proc/kpagecount exposes the average page mapcount in a folio
instead of the effective page mapcount.
(5) Calculating the Pss for /proc/$PID/smaps and /proc/$PID/smaps_rollup
uses the average page mapcount in a folio instead of the effective
page mapcount.
(6) Calculating the Uss for /proc/$PID/smaps and /proc/$PID/smaps_rollup
uses folio_maybe_mapped_shared() instead of the effective page
mapcount.
(7) Detecting partially-mapped anonymous folios uses the average
page-page mapcount. This implies that we cannot detect partial
mappings of shared anonymous folios in all cases.
TODOs
-----
Partially-mapped folios:
If deemed relevant, we could detect more partially-mapped shared
anonymous folios on the memory reclaim path (e.g., during access-bit
harvesting) and flag them accordingly, so they can get deferred-split.
We might also just let the deferred splitting logic perform more such
scanning of possible candidates.
Mapcount overflows:
It may already be possible to overflow a large folio's mapcount
(+refcount). With this series, it may be possible to overflow
"total mapped pages" on 32bit; and I'd like to avoid making it an
unsigned long long on 32bit.
In a distant future, we may want a 64bit mapcountv value, but for
the time being (no relevant use cases), we should likely reject new
folio mappings if there is the possibility for mapcount +
"total mapped pages" overflows early. I assume doing some basic checks
during fork() + file folio mapping should be good enough (e.g., stop
once it would turn negative).
This series saw only very basic testing on 64bit and no performance
fine-tuning yet.
[1] https://lore.kernel.org/all/fe6afcc3-7539-4650-863b-04d971e89cfb@kernel.org/
---
David Hildenbrand (Arm) (13):
mm/rmap: remove folio->_nr_pages_mapped
fs/proc/task_mmu: remove CONFIG_PAGE_MAPCOUNT handling for "mapmax"
fs/proc/page: remove CONFIG_PAGE_MAPCOUNT handling for kpagecount
fs/proc/task_mmu: remove CONFIG_PAGE_MAPCOUNT handling for PM_MMAP_EXCLUSIVE
fs/proc/task_mmu: remove mapcount comment in smaps_account()
fs/proc/task_mmu: remove CONFIG_PAGE_MAPCOUNT handling in smaps_account()
mm/rmap: remove CONFIG_PAGE_MAPCOUNT
mm: re-consolidate folio->_entire_mapcount
mm: move _large_mapcount to _mapcount in page[1] of a large folio
mm: re-consolidate folio->_pincount
mm/rmap: stop using the entire mapcount for hugetlb folios
mm/rmap: large mapcount interface cleanups
mm/rmap: support arbitrary folio mappings
Documentation/admin-guide/cgroup-v1/memory.rst | 6 +-
Documentation/admin-guide/cgroup-v2.rst | 13 +-
Documentation/admin-guide/mm/pagemap.rst | 30 ++-
Documentation/filesystems/proc.rst | 41 ++--
Documentation/mm/transhuge.rst | 29 +--
fs/proc/internal.h | 58 +----
fs/proc/page.c | 10 +-
fs/proc/task_mmu.c | 69 ++----
include/linux/mm.h | 37 +--
include/linux/mm_types.h | 22 +-
include/linux/pgtable.h | 22 ++
include/linux/rmap.h | 221 ++++++++----------
mm/Kconfig | 17 --
mm/debug.c | 10 +-
mm/internal.h | 30 +--
mm/memory.c | 3 +-
mm/page_alloc.c | 31 +--
mm/rmap.c | 302 ++++++++-----------------
18 files changed, 325 insertions(+), 626 deletions(-)
---
base-commit: 196ab4af58d724f24335fed3da62920c3cea945f
change-id: 20260330-mapcount-32066c687010
Best regards,
--
David Hildenbrand (Arm) <david@kernel.org>
^ permalink raw reply
* Re: [PATCH RFC v2 8/9] Documentation: ABI: testing: add docs for ad9910 sysfs entries
From: David Lechner @ 2026-04-12 18:45 UTC (permalink / raw)
To: Jonathan Cameron, Rodrigo Alencar
Cc: Rodrigo Alencar via B4 Relay, rodrigo.alencar, linux-iio,
devicetree, linux-kernel, linux-doc, Lars-Peter Clausen,
Michael Hennerich, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
Shuah Khan
In-Reply-To: <20260412155115.2f7a83bf@jic23-huawei>
On 4/12/26 9:51 AM, Jonathan Cameron wrote:
> On Mon, 23 Mar 2026 11:36:08 +0000
> Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
>
>> On 26/03/22 05:22PM, Jonathan Cameron wrote:
>>> On Wed, 18 Mar 2026 17:56:08 +0000
>>> Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
>>>
>>>> From: Rodrigo Alencar <rodrigo.alencar@analog.com>
>>>>
>>>> Add ABI documentation file for the DDS AD9910 with sysfs entries to
>>>> control Parallel Port, Digital Ramp Generator, RAM and OSK parameters.
>>>>
>>>> Signed-off-by: Rodrigo Alencar <rodrigo.alencar@analog.com>
>>>> ---
>>
...
>>>
>>>> + - "ramp_down": No-dwell low; the ramp resets to upper
>>>> + limit upon reaching the lower limit.
>>>> + - "ramp_up": No-dwell high; the ramp resets to lower
>>>> + limit upon reaching the upper limit.
>>>> + - "bidirectional_continuous": Both no-dwell high and low;
>>>> + the ramp continuously sweeps without dwelling.
>>>
>>> Triangle wave? bidirectional continuous is a rather confusing term so maybe
>>> we should rethink this one.
>>
>> Mostly yes, but not only that. Sawtooth can be achieved as well by changing
>> the step sizes, also other weird patterns can be achieved by toggling DRCTL pin.
>
> Sawtooth is kind of a special triangle wave with one very steep side.
> Wikipedia even has: "It can also be considered the extreme case of an asymmetric triangle wave"
> https://en.wikipedia.org/wiki/Sawtooth_wave
>
>> This mode is the most useful when one does not have an FPGA and want to save
>> resources on controlling the DRCTL pin. That mode name comes from the datasheet,
>> so I suppose it was fine.
>
> Let us see if we can get more opinions on this. Whilst I can see the logic of
> the datasheet naming, it's a bit obscure.
>
It is the same as ramp_up and ramp_down other than what happens when it hits
the limit? If so, I would call it ramp_up_down.
^ permalink raw reply
* Re: (sashiko status) [RFC PATCH v5.2 00/11] mm/damon: introduce DAMOS failed region quota charge ratio
From: SeongJae Park @ 2026-04-12 18:14 UTC (permalink / raw)
To: SeongJae Park
Cc: damon, kunit-dev, linux-doc, linux-kernel, linux-kselftest,
linux-mm
In-Reply-To: <20260412161957.82835-1-sj@kernel.org>
TL; DR: Seems Sashiko is finally convinced. I will drop RFC tag from the next
spin.
Forwarding sashiko.dev review status for this thread, with my short comments
for issues-may-found reviews.
# review url: https://sashiko.dev/#/patchset/20260412161957.82835-1-sj@kernel.org
- [RFC PATCH v5.2 01/11] mm/damon/core: handle <min_region_sz remaining quota as empty
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 02/11] mm/damon/core: merge regions after applying DAMOS schemes
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 03/11] mm/damon/core: introduce failed region quota charge ratio
- status: Reviewed
- review: ISSUES MAY FOUND
Sashiko is asking a same question that I already decided to ignore.
- [RFC PATCH v5.2 04/11] mm/damon/sysfs-schemes: implement fail_charge_{num,denom} files
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 05/11] Docs/mm/damon/design: document fail_charge_{num,denom}
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 06/11] Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom} files
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 07/11] Docs/ABI/damon: document fail_charge_{num,denom}
- status: Reviewed
- review: ISSUES MAY FOUND
For this review, Sashiko seems just hallucinated.
- [RFC PATCH v5.2 08/11] mm/damon/tests/core-kunit: test fail_charge_{num,denom} committing
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 09/11] selftests/damon/_damon_sysfs: support failed region quota charge ratio
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 10/11] selftests/damon/drgn_dump_damon_status: support failed region quota charge ratio
- status: Reviewed
- review: No issues found.
- [RFC PATCH v5.2 11/11] selftests/damon/sysfs.py: test failed region quota charge ratio
- status: Reviewed
- review: No issues found.
Thanks,
SJ
# hkml [1] generated a draft of this mail. It can be regenerated
# using below command:
#
# hkml patch sashiko_dev --thread_status --for_forwarding \
# 20260412161957.82835-1-sj@kernel.org
#
# [1] https://github.com/sjp38/hackermail
^ permalink raw reply
* Re: [PATCH v7 5/6] iio: adc: ad4691: add oversampling support
From: Jonathan Cameron @ 2026-04-12 17:58 UTC (permalink / raw)
To: David Lechner
Cc: radu.sabau, Lars-Peter Clausen, Michael Hennerich, Nuno Sá,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan,
linux-iio, devicetree, linux-kernel, linux-pwm, linux-gpio,
linux-doc
In-Reply-To: <742b1821-9103-414e-a860-c2e8d5406e35@baylibre.com>
On Fri, 10 Apr 2026 16:15:20 -0500
David Lechner <dlechner@baylibre.com> wrote:
> On 4/9/26 10:28 AM, Radu Sabau via B4 Relay wrote:
> > From: Radu Sabau <radu.sabau@analog.com>
> >
> > Add per-channel oversampling ratio (OSR) support for CNV burst mode.
> > The accumulator depth register (ACC_DEPTH_IN) is programmed with the
> > selected OSR at buffer enable time and before each single-shot read.
> >
> > Supported OSR values: 1, 2, 4, 8, 16, 32.
> >
> > Introduce AD4691_MANUAL_CHANNEL() for manual mode channels, which do
> > not expose the oversampling ratio attribute since OSR is not applicable
> > in that mode. A separate manual_channels array is added to
> > struct ad4691_channel_info and selected at probe time; offload paths
> > reuse the same arrays with num_channels capping access before the soft
> > timestamp entry.
> >
> > The reported sampling frequency accounts for the active OSR:
> > effective_freq = oscillator_freq / osr
>
> Technically, the way this is implemented is fine according to IIO ABI
> rules. Writing any attribute can cause others to change. It does
> introduce a potential pitfall though. Currently, changing the OSR will
> change the sampling frequency, so you have to always write oversampling_ratio
> first, then write sampling_frequency to get what you asked for. If you want
> to change the OSR and keep the same sample rate, you still have to write both
> attributes again.
>
> In other drivers, I've implemented it so that the requested sampling frequency
> is stored any you always get the closest sampling frequency available based on
> the oversampling ratio. This way, it doesn't matter which order you write
> the attributes. In that case, the actual periodic trigger source isn't set up
> until we actually start sampling.
>
Agreed. This is more intuitive. Now generally the userspace should
be sanity checking the value anyway as limitations may mean the new
sampling frequency is not particularly close to the original one but
at least it increases the chances of getting the expected value somewhat!
So to me this is a nice useability improvement given the code to implement
it tends not to be too complex.
Thanks,
J
^ permalink raw reply
* Re: [PATCH v7 4/6] iio: adc: ad4691: add SPI offload support
From: Jonathan Cameron @ 2026-04-12 17:56 UTC (permalink / raw)
To: Radu Sabau via B4 Relay
Cc: radu.sabau, Lars-Peter Clausen, Michael Hennerich, David Lechner,
Nuno Sá, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Uwe Kleine-König, Liam Girdwood, Mark Brown,
Linus Walleij, Bartosz Golaszewski, Philipp Zabel,
Jonathan Corbet, Shuah Khan, linux-iio, devicetree, linux-kernel,
linux-pwm, linux-gpio, linux-doc
In-Reply-To: <20260409-ad4692-multichannel-sar-adc-driver-v7-4-be375d4df2c5@analog.com>
On Thu, 09 Apr 2026 18:28:25 +0300
Radu Sabau via B4 Relay <devnull+radu.sabau.analog.com@kernel.org> wrote:
> From: Radu Sabau <radu.sabau@analog.com>
>
> Add SPI offload support to enable DMA-based, CPU-independent data
> acquisition using the SPI Engine offload framework.
>
> When an SPI offload is available (devm_spi_offload_get() succeeds),
> the driver registers a DMA engine IIO buffer and uses dedicated buffer
> setup operations. If no offload is available the existing software
> triggered buffer path is used unchanged.
>
> Both CNV Burst Mode and Manual Mode support offload, but use different
> trigger mechanisms:
>
> CNV Burst Mode: the SPI Engine is triggered by the ADC's DATA_READY
> signal on the GP pin specified by the trigger-source consumer reference
> in the device tree (one cell = GP pin number 0-3). For this mode the
> driver acts as both an SPI offload consumer (DMA RX stream, message
> optimization) and a trigger source provider: it registers the
> GP/DATA_READY output via devm_spi_offload_trigger_register() so the
> offload framework can match the '#trigger-source-cells' phandle and
> automatically fire the SPI Engine DMA transfer at end-of-conversion.
>
> Manual Mode: the SPI Engine is triggered by a periodic trigger at
> the configured sampling frequency. The pre-built SPI message uses
> the pipelined CNV-on-CS protocol: N+1 16-bit transfers are issued
> for N active channels (the first result is discarded as garbage from
> the pipeline flush) and the remaining N results are captured by DMA.
>
> All offload transfers use 16-bit frames (bits_per_word=16, len=2).
> The channel scan_type (storagebits=16, shift=0, IIO_BE) is shared
> between the software triggered-buffer and offload paths; no separate
> scan_type or channel array is needed for the offload case. The
> ad4691_manual_channels[] array introduced in the triggered-buffer
> commit is reused here: it hides the IIO_CHAN_INFO_OVERSAMPLING_RATIO
> attribute, which is not applicable in Manual Mode.
>
> Kconfig gains a dependency on IIO_BUFFER_DMAENGINE.
>
> Signed-off-by: Radu Sabau <radu.sabau@analog.com>
A few comments inline.
> diff --git a/drivers/iio/adc/ad4691.c b/drivers/iio/adc/ad4691.c
> index 3e5caa0972eb..839ea7f44c78 100644
> --- a/drivers/iio/adc/ad4691.c
> +++ b/drivers/iio/adc/ad4691.c
> +
> +static int ad4691_cnv_burst_offload_buffer_postenable(struct iio_dev *indio_dev)
> +{
> + struct ad4691_state *st = iio_priv(indio_dev);
> + struct ad4691_offload_state *offload = st->offload;
> + struct device *dev = regmap_get_device(st->regmap);
> + struct spi_device *spi = to_spi_device(dev);
> + struct spi_offload_trigger_config config = {
> + .type = SPI_OFFLOAD_TRIGGER_DATA_READY,
> + };
> + unsigned int n_active;
> + unsigned int bit, k;
> + int ret;
> +
> + n_active = bitmap_weight(indio_dev->active_scan_mask, iio_get_masklength(indio_dev));
> +
> + ret = regmap_write(st->regmap, AD4691_STD_SEQ_CONFIG,
> + bitmap_read(indio_dev->active_scan_mask, 0,
> + iio_get_masklength(indio_dev)));
> + if (ret)
> + return ret;
> +
> + ret = regmap_write(st->regmap, AD4691_ACC_MASK_REG,
> + ~bitmap_read(indio_dev->active_scan_mask, 0,
> + iio_get_masklength(indio_dev)) & GENMASK(15, 0));
This indent is hard to read. I would either use a local variable, or do it as
ret = regmap_write(st->regmap, AD4691_ACC_MASK_REG,
~bitmap_read(indio_dev->active_scan_mask, 0,
iio_get_masklength(indio_dev)) &
GENMASK(15, 0));
> + if (ret)
> + return ret;
> +
> + ret = ad4691_enter_conversion_mode(st);
> + if (ret)
> + return ret;
> +
> + memset(st->scan_xfers, 0, sizeof(st->scan_xfers));
> +
> + /*
> + * Each AVG_IN register read uses two 16-bit transfers:
> + * TX: [reg_hi | 0x80, reg_lo] (address, CS stays asserted)
> + * RX: [data_hi, data_lo] (data, storagebits=16, shift=0)
> + * The state reset is also split into two 16-bit transfers
> + * (address then value) to keep bits_per_word uniform throughout.
> + */
> + k = 0;
> + iio_for_each_active_channel(indio_dev, bit) {
> + put_unaligned_be16(0x8000 | AD4691_AVG_IN(bit), offload->tx_cmd[k]);
> +
> + /* TX: address phase, CS stays asserted into data phase */
> + st->scan_xfers[2 * k].tx_buf = offload->tx_cmd[k];
> + st->scan_xfers[2 * k].len = sizeof(offload->tx_cmd[k]);
> + st->scan_xfers[2 * k].bits_per_word = AD4691_OFFLOAD_BITS_PER_WORD;
> +
> + /* RX: data phase, CS toggles after to delimit the next register op */
> + st->scan_xfers[2 * k + 1].len = sizeof(offload->tx_cmd[k]);
> + st->scan_xfers[2 * k + 1].bits_per_word = AD4691_OFFLOAD_BITS_PER_WORD;
> + st->scan_xfers[2 * k + 1].offload_flags = SPI_OFFLOAD_XFER_RX_STREAM;
> + st->scan_xfers[2 * k + 1].cs_change = 1;
> + k++;
> + }
> +
> + /* State reset to re-arm DATA_READY for the next scan. */
> + put_unaligned_be16(AD4691_STATE_RESET_REG, offload->tx_reset);
> + offload->tx_reset[2] = AD4691_STATE_RESET_ALL;
> +
> + st->scan_xfers[2 * k].tx_buf = offload->tx_reset;
> + st->scan_xfers[2 * k].len = sizeof(offload->tx_cmd[k]);
> + st->scan_xfers[2 * k].bits_per_word = AD4691_OFFLOAD_BITS_PER_WORD;
> +
> + st->scan_xfers[2 * k + 1].tx_buf = &offload->tx_reset[2];
> + st->scan_xfers[2 * k + 1].len = sizeof(offload->tx_cmd[k]);
> + st->scan_xfers[2 * k + 1].bits_per_word = AD4691_OFFLOAD_BITS_PER_WORD;
> + st->scan_xfers[2 * k + 1].cs_change = 1;
> +
> + spi_message_init_with_transfers(&st->scan_msg, st->scan_xfers, 2 * k + 2);
> + st->scan_msg.offload = offload->spi;
> +
> + ret = spi_optimize_message(spi, &st->scan_msg);
> + if (ret)
> + goto err_exit_conversion;
> +
> + ret = ad4691_sampling_enable(st, true);
> + if (ret)
> + goto err_unoptimize;
> +
> + ret = spi_offload_trigger_enable(offload->spi, offload->trigger, &config);
> + if (ret)
> + goto err_sampling_disable;
> +
> + return 0;
> +
> +err_sampling_disable:
> + ad4691_sampling_enable(st, false);
> +err_unoptimize:
> + spi_unoptimize_message(&st->scan_msg);
> +err_exit_conversion:
> + ad4691_exit_conversion_mode(st);
> + return ret;
> +}
>
> static ssize_t sampling_frequency_store(struct device *dev,
> @@ -833,6 +1123,23 @@ static ssize_t sampling_frequency_store(struct device *dev,
> if (ret)
> return ret;
>
> + if (st->manual_mode && st->offload) {
> + struct spi_offload_trigger_config config = {
> + .type = SPI_OFFLOAD_TRIGGER_PERIODIC,
> + .periodic = { .frequency_hz = freq },
> + };
> +
> + ret = spi_offload_trigger_validate(st->offload->trigger, &config);
> + if (ret) {
> + iio_device_release_direct(indio_dev);
> + return ret;
> + }
> +
> + st->offload->trigger_hz = config.periodic.frequency_hz;
> + iio_device_release_direct(indio_dev);
This release in a different scope is a bit ugly.
Look at whether the auto cleanup approach works well here.
https://elixir.bootlin.com/linux/v7.0-rc7/source/include/linux/iio/iio.h#L767
> + return len;
> + }
> +
^ permalink raw reply
* Re: [PATCH v7 3/6] iio: adc: ad4691: add triggered buffer support
From: Jonathan Cameron @ 2026-04-12 17:47 UTC (permalink / raw)
To: Radu Sabau via B4 Relay
Cc: radu.sabau, Lars-Peter Clausen, Michael Hennerich, David Lechner,
Nuno Sá, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Uwe Kleine-König, Liam Girdwood, Mark Brown,
Linus Walleij, Bartosz Golaszewski, Philipp Zabel,
Jonathan Corbet, Shuah Khan, linux-iio, devicetree, linux-kernel,
linux-pwm, linux-gpio, linux-doc
In-Reply-To: <20260409-ad4692-multichannel-sar-adc-driver-v7-3-be375d4df2c5@analog.com>
On Thu, 09 Apr 2026 18:28:24 +0300
Radu Sabau via B4 Relay <devnull+radu.sabau.analog.com@kernel.org> wrote:
> From: Radu Sabau <radu.sabau@analog.com>
>
> Add buffered capture support using the IIO triggered buffer framework.
>
> CNV Burst Mode: the GP pin identified by interrupt-names in the device
> tree is configured as DATA_READY output. The IRQ handler stops
> conversions and fires the IIO trigger; the trigger handler executes a
> pre-built SPI message that reads all active channels from the AVG_IN
> accumulator registers and then resets accumulator state and restarts
> conversions for the next cycle.
>
> Manual Mode: CNV is tied to SPI CS so each transfer simultaneously
> reads the previous result and starts the next conversion (pipelined
> N+1 scheme). At preenable time a pre-built, optimised SPI message of
> N+1 transfers is constructed (N channel reads plus one NOOP to drain
> the pipeline). The trigger handler executes the message in a single
> spi_sync() call and collects the results. An external trigger (e.g.
> iio-trig-hrtimer) is required to drive the trigger at the desired
> sample rate.
>
> Both modes share the same trigger handler and push a complete scan —
> one u16 slot per channel at its scan_index position, followed by a
> timestamp — to the IIO buffer via iio_push_to_buffers_with_ts().
>
> The CNV Burst Mode sampling frequency (PWM period) is exposed as a
> buffer-level attribute via IIO_DEVICE_ATTR.
>
> Signed-off-by: Radu Sabau <radu.sabau@analog.com>
A couple of minor things inline.
Thanks,
Jonathan
> diff --git a/drivers/iio/adc/ad4691.c b/drivers/iio/adc/ad4691.c
> index 43bd408c3d11..3e5caa0972eb 100644
> --- a/drivers/iio/adc/ad4691.c
> +++ b/drivers/iio/adc/ad4691.c
> @@ -5,15 +5,19 @@
> */
> #include <linux/array_size.h>
> #include <linux/bitfield.h>
> -#include <linux/bitops.h>
> +#include <linux/bitmap.h>
> #include <linux/cleanup.h>
> #include <linux/delay.h>
> #include <linux/dev_printk.h>
> #include <linux/device/devres.h>
> +#include <linux/dmaengine.h>
> #include <linux/err.h>
> +#include <linux/interrupt.h>
> #include <linux/math.h>
> #include <linux/module.h>
> #include <linux/mod_devicetable.h>
> +#include <linux/property.h>
> +#include <linux/pwm.h>
> #include <linux/regmap.h>
> #include <linux/regulator/consumer.h>
> #include <linux/reset.h>
> @@ -21,7 +25,14 @@
> #include <linux/units.h>
> #include <linux/unaligned.h>
>
> +#include <linux/iio/buffer.h>
> +#include <linux/iio/buffer-dma.h>
> +#include <linux/iio/buffer-dmaengine.h>
Not yet... Only bring these headers in when you need them.
So far this is just doing normal SPI stuff.
> #include <linux/iio/iio.h>
> +#include <linux/iio/sysfs.h>
> +#include <linux/iio/trigger.h>
> +#include <linux/iio/triggered_buffer.h>
> +#include <linux/iio/trigger_consumer.h>
> @@ -201,8 +245,45 @@ struct ad4691_state {
> * atomicity of consecutive SPI operations.
> */
> struct mutex lock;
> + /*
> + * Per-buffer-enable lifetime resources:
> + * Manual Mode - a pre-built SPI message that clocks out N+1
> + * transfers in one go.
> + * CNV Burst Mode - a pre-built SPI message that clocks out 2*N
> + * transfers in one go.
> + */
> + struct spi_message scan_msg;
> + /* max 16 + 1 NOOP (manual) or 2*16 + 2 (CNV burst). */
> + struct spi_transfer scan_xfers[34];
> + /*
> + * CNV burst: 16 AVG_IN addresses + state-reset address + state-reset
> + * value = 18. Manual: 16 channel cmds + 1 NOOP = 17.
> + */
> + __be16 scan_tx[18];
David raised this. As these aren't going through the regmap and are using
spi_transfers directly they need to be using DMA safe buffers.
> + /* Scan buffer: one BE16 slot per channel (rx'd directly), plus timestamp */
> + struct {
> + __be16 vals[16];
> + aligned_s64 ts;
> + } scan;
> };
> +static int ad4691_manual_buffer_preenable(struct iio_dev *indio_dev)
> +{
> + struct ad4691_state *st = iio_priv(indio_dev);
> + unsigned int n_active;
> + unsigned int n_xfers;
> + unsigned int prev_i, k, i;
> + bool first;
> + int ret;
> +
> + n_active = bitmap_weight(indio_dev->active_scan_mask, iio_get_masklength(indio_dev));
> + n_xfers = n_active + 1;
> +
> + memset(st->scan_xfers, 0, n_xfers * sizeof(st->scan_xfers[0]));
> + memset(st->scan_tx, 0, n_xfers * sizeof(st->scan_tx[0]));
> +
> + spi_message_init(&st->scan_msg);
> +
> + first = true;
> + prev_i = 0;
> + k = 0;
> + iio_for_each_active_channel(indio_dev, i) {
> + st->scan_tx[k] = cpu_to_be16(AD4691_ADC_CHAN(i));
> + st->scan_xfers[k].tx_buf = &st->scan_tx[k];
> + /*
> + * The pipeline means xfer[0] receives the residual from the
> + * previous sequence, not a valid sample for channel i. Point
> + * it at vals[i] anyway; xfer[1] (or the NOOP when only one
> + * channel is active) will overwrite that slot with the real
> + * result, so no separate dummy buffer is needed.
> + */
> + if (first) {
> + st->scan_xfers[k].rx_buf = &st->scan.vals[i];
> + first = false;
> + } else {
> + st->scan_xfers[k].rx_buf = &st->scan.vals[prev_i];
> + }
> + st->scan_xfers[k].len = sizeof(__be16);
> + st->scan_xfers[k].cs_change = 1;
> + spi_message_add_tail(&st->scan_xfers[k], &st->scan_msg);
> + prev_i = i;
> + k++;
> + }
> +
> + /* Final NOOP transfer retrieves the last channel's result. */
> + st->scan_xfers[k].tx_buf = &st->scan_tx[k]; /* scan_tx[k] == 0 == NOOP */
> + st->scan_xfers[k].rx_buf = &st->scan.vals[prev_i];
> + st->scan_xfers[k].len = sizeof(__be16);
> + spi_message_add_tail(&st->scan_xfers[k], &st->scan_msg);
> +
> + ret = spi_optimize_message(st->spi, &st->scan_msg);
> + if (ret)
See below. This matches my expectations for what to do if spi_optimize_message()
fails.
> + return ret;
> +
> + ret = ad4691_enter_conversion_mode(st);
> + if (ret) {
> + spi_unoptimize_message(&st->scan_msg);
> + return ret;
> + }
> +
> + return 0;
> +}
> +static int ad4691_cnv_burst_buffer_preenable(struct iio_dev *indio_dev)
> +{
> + struct ad4691_state *st = iio_priv(indio_dev);
> + unsigned int n_active;
> + unsigned int k, i;
> + int ret;
> +
> + n_active = bitmap_weight(indio_dev->active_scan_mask, iio_get_masklength(indio_dev));
> +
> + memset(st->scan_xfers, 0, (2 * n_active + 2) * sizeof(st->scan_xfers[0]));
> + memset(st->scan_tx, 0, (n_active + 2) * sizeof(st->scan_tx[0]));
> +
> + spi_message_init(&st->scan_msg);
> +
> + /*
> + * Each AVG_IN read needs two transfers: a 2-byte address write phase
> + * followed by a 2-byte data read phase. CS toggles between channels
> + * (cs_change=1 on the read phase of all but the last channel).
> + */
> + k = 0;
> + iio_for_each_active_channel(indio_dev, i) {
> + st->scan_tx[k] = cpu_to_be16(0x8000 | AD4691_AVG_IN(i));
> + st->scan_xfers[2 * k].tx_buf = &st->scan_tx[k];
> + st->scan_xfers[2 * k].len = sizeof(__be16);
> + spi_message_add_tail(&st->scan_xfers[2 * k], &st->scan_msg);
> + st->scan_xfers[2 * k + 1].rx_buf = &st->scan.vals[i];
> + st->scan_xfers[2 * k + 1].len = sizeof(__be16);
> + st->scan_xfers[2 * k + 1].cs_change = 1;
> + spi_message_add_tail(&st->scan_xfers[2 * k + 1], &st->scan_msg);
> + k++;
> + }
> +
> + st->scan_tx[k] = cpu_to_be16(AD4691_STATE_RESET_REG);
> + st->scan_xfers[2 * k].tx_buf = &st->scan_tx[k];
> + st->scan_xfers[2 * k].len = sizeof(__be16);
> + spi_message_add_tail(&st->scan_xfers[2 * k], &st->scan_msg);
> + st->scan_tx[k + 1] = cpu_to_be16(AD4691_STATE_RESET_ALL << 8);
> + st->scan_xfers[2 * k + 1].tx_buf = &st->scan_tx[k + 1];
> + st->scan_xfers[2 * k + 1].len = sizeof(__be16);
> + st->scan_xfers[2 * k + 1].cs_change = 1;
> + spi_message_add_tail(&st->scan_xfers[2 * k + 1], &st->scan_msg);
> +
> + ret = spi_optimize_message(st->spi, &st->scan_msg);
> + if (ret)
> + goto err_unoptimize;
I'd expect spi_optimize_message() to be side effect free if it fails.
I took a quick look at the implementation and it looks like it is..
So probably just return ret; here
That matches with the other similar flow above.
> +
> + ret = regmap_write(st->regmap, AD4691_STD_SEQ_CONFIG,
> + bitmap_read(indio_dev->active_scan_mask, 0,
> + iio_get_masklength(indio_dev)));
> + if (ret)
> + goto err_unoptimize;
> +
> + ret = regmap_write(st->regmap, AD4691_ACC_MASK_REG,
> + ~bitmap_read(indio_dev->active_scan_mask, 0,
> + iio_get_masklength(indio_dev)) & GENMASK(15, 0));
> + if (ret)
> + goto err_unoptimize;
> +
> + ret = ad4691_enter_conversion_mode(st);
> + if (ret)
> + goto err_unoptimize;
> +
> + ret = ad4691_sampling_enable(st, true);
> + if (ret)
> + goto err_exit_conv;
> +
> + enable_irq(st->irq);
> + return 0;
> +
> +err_exit_conv:
> + ad4691_exit_conversion_mode(st);
> +err_unoptimize:
> + spi_unoptimize_message(&st->scan_msg);
> + return ret;
> +}
> +static int ad4691_setup_triggered_buffer(struct iio_dev *indio_dev,
> + struct ad4691_state *st)
> +{
> + struct device *dev = regmap_get_device(st->regmap);
> + struct iio_trigger *trig;
> + unsigned int i;
> + int irq, ret;
> +
> + trig = devm_iio_trigger_alloc(dev, "%s-dev%d",
> + indio_dev->name,
> + iio_device_id(indio_dev));
Might as well wrap this less.
> + if (!trig)
> + return -ENOMEM;
> +
^ permalink raw reply
* Re: [PATCH v7 3/6] iio: adc: ad4691: add triggered buffer support
From: Jonathan Cameron @ 2026-04-12 17:44 UTC (permalink / raw)
To: David Lechner
Cc: radu.sabau, Lars-Peter Clausen, Michael Hennerich, Nuno Sá,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan,
linux-iio, devicetree, linux-kernel, linux-pwm, linux-gpio,
linux-doc
In-Reply-To: <20260412184301.05544396@jic23-huawei>
On Sun, 12 Apr 2026 18:43:01 +0100
Jonathan Cameron <jic23@kernel.org> wrote:
> On Fri, 10 Apr 2026 15:46:36 -0500
> David Lechner <dlechner@baylibre.com> wrote:
>
> > On 4/9/26 10:28 AM, Radu Sabau via B4 Relay wrote:
> > > From: Radu Sabau <radu.sabau@analog.com>
> > >
> > > Add buffered capture support using the IIO triggered buffer framework.
> > >
> >
> > ...
> >
> > > @@ -201,8 +245,45 @@ struct ad4691_state {
> > > * atomicity of consecutive SPI operations.
> > > */
> > > struct mutex lock;
> > > + /*
> > > + * Per-buffer-enable lifetime resources:
> > > + * Manual Mode - a pre-built SPI message that clocks out N+1
> > > + * transfers in one go.
> > > + * CNV Burst Mode - a pre-built SPI message that clocks out 2*N
> > > + * transfers in one go.
> > > + */
> > > + struct spi_message scan_msg;
> > > + /* max 16 + 1 NOOP (manual) or 2*16 + 2 (CNV burst). */
> > > + struct spi_transfer scan_xfers[34];
> > > + /*
> > > + * CNV burst: 16 AVG_IN addresses + state-reset address + state-reset
> > > + * value = 18. Manual: 16 channel cmds + 1 NOOP = 17.
> > > + */
> > > + __be16 scan_tx[18];
> >
> > Needs __aligned(IIO_DMA_MINALIGN) since it is used with SPI.
> As below.
> >
> > > + /* Scan buffer: one BE16 slot per channel (rx'd directly), plus timestamp */
> > > + struct {
> > > + __be16 vals[16];
> > > + aligned_s64 ts;
> > > + } scan;
> >
> > Unless it is required that all channels are always enabled:
> >
> > IIO_DECLARE_BUFFER_WITH_TS(__be16, scan_rx, 16);
> >
> > In any case, needs to be DMA-safe for SPI.
> Custom regmap that uses spi_write_then_read() for all cases so it
> should be bounced anyway.
>
> Perhaps a comment to that affect would be useful.
>
Ignore that. It's an optimized spi message. oops.
> J
>
^ permalink raw reply
* Re: [PATCH v7 3/6] iio: adc: ad4691: add triggered buffer support
From: Jonathan Cameron @ 2026-04-12 17:43 UTC (permalink / raw)
To: David Lechner
Cc: radu.sabau, Lars-Peter Clausen, Michael Hennerich, Nuno Sá,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan,
linux-iio, devicetree, linux-kernel, linux-pwm, linux-gpio,
linux-doc
In-Reply-To: <0f05add7-96c0-4eee-b396-d6e1be904c09@baylibre.com>
On Fri, 10 Apr 2026 15:46:36 -0500
David Lechner <dlechner@baylibre.com> wrote:
> On 4/9/26 10:28 AM, Radu Sabau via B4 Relay wrote:
> > From: Radu Sabau <radu.sabau@analog.com>
> >
> > Add buffered capture support using the IIO triggered buffer framework.
> >
>
> ...
>
> > @@ -201,8 +245,45 @@ struct ad4691_state {
> > * atomicity of consecutive SPI operations.
> > */
> > struct mutex lock;
> > + /*
> > + * Per-buffer-enable lifetime resources:
> > + * Manual Mode - a pre-built SPI message that clocks out N+1
> > + * transfers in one go.
> > + * CNV Burst Mode - a pre-built SPI message that clocks out 2*N
> > + * transfers in one go.
> > + */
> > + struct spi_message scan_msg;
> > + /* max 16 + 1 NOOP (manual) or 2*16 + 2 (CNV burst). */
> > + struct spi_transfer scan_xfers[34];
> > + /*
> > + * CNV burst: 16 AVG_IN addresses + state-reset address + state-reset
> > + * value = 18. Manual: 16 channel cmds + 1 NOOP = 17.
> > + */
> > + __be16 scan_tx[18];
>
> Needs __aligned(IIO_DMA_MINALIGN) since it is used with SPI.
As below.
>
> > + /* Scan buffer: one BE16 slot per channel (rx'd directly), plus timestamp */
> > + struct {
> > + __be16 vals[16];
> > + aligned_s64 ts;
> > + } scan;
>
> Unless it is required that all channels are always enabled:
>
> IIO_DECLARE_BUFFER_WITH_TS(__be16, scan_rx, 16);
>
> In any case, needs to be DMA-safe for SPI.
Custom regmap that uses spi_write_then_read() for all cases so it
should be bounced anyway.
Perhaps a comment to that affect would be useful.
J
^ permalink raw reply
* Re: [PATCH v7 2/6] iio: adc: ad4691: add initial driver for AD4691 family
From: Jonathan Cameron @ 2026-04-12 17:24 UTC (permalink / raw)
To: Radu Sabau via B4 Relay
Cc: radu.sabau, Lars-Peter Clausen, Michael Hennerich, David Lechner,
Nuno Sá, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Uwe Kleine-König, Liam Girdwood, Mark Brown,
Linus Walleij, Bartosz Golaszewski, Philipp Zabel,
Jonathan Corbet, Shuah Khan, linux-iio, devicetree, linux-kernel,
linux-pwm, linux-gpio, linux-doc
In-Reply-To: <20260409-ad4692-multichannel-sar-adc-driver-v7-2-be375d4df2c5@analog.com>
On Thu, 09 Apr 2026 18:28:23 +0300
Radu Sabau via B4 Relay <devnull+radu.sabau.analog.com@kernel.org> wrote:
> From: Radu Sabau <radu.sabau@analog.com>
>
> Add support for the Analog Devices AD4691 family of high-speed,
> low-power multichannel SAR ADCs: AD4691 (16-ch, 500 kSPS),
> AD4692 (16-ch, 1 MSPS), AD4693 (8-ch, 500 kSPS) and
> AD4694 (8-ch, 1 MSPS).
>
> The driver implements a custom regmap layer over raw SPI to handle the
> device's mixed 1/2/3/4-byte register widths and uses the standard IIO
> read_raw/write_raw interface for single-channel reads.
>
> The chip idles in Autonomous Mode so that single-shot read_raw can use
> the internal oscillator without disturbing the hardware configuration.
>
> Three voltage supply domains are managed: avdd (required), vio, and a
> reference supply on either the REF pin (ref-supply, external buffer)
> or the REFIN pin (refin-supply, uses the on-chip reference buffer;
> REFBUF_EN is set accordingly). Hardware reset is performed via
> the reset controller framework; a software reset through SPI_CONFIG_A
> is used as fallback when no hardware reset is available.
>
> Accumulator channel masking for single-shot reads uses ACC_MASK_REG via
> an ADDR_DESCENDING SPI write, which covers both mask bytes in a single
> 16-bit transfer.
>
> Reviewed-by: David Lechner <dlechner@baylibre.com>
> Signed-off-by: Radu Sabau <radu.sabau@analog.com>
Mostly minor stuff but the regulator bit needs another look.
> diff --git a/drivers/iio/adc/ad4691.c b/drivers/iio/adc/ad4691.c
> new file mode 100644
> index 000000000000..43bd408c3d11
> --- /dev/null
> +++ b/drivers/iio/adc/ad4691.c
> +static int ad4691_reg_read(void *context, unsigned int reg, unsigned int *val)
> +{
> + struct spi_device *spi = context;
> + u8 tx[2], rx[4];
> + int ret;
> +
> + /* Set bit 15 to mark the operation as READ. */
> + put_unaligned_be16(0x8000 | reg, tx);
> +
> + switch (reg) {
> + case 0 ... AD4691_OSC_FREQ_REG:
> + case AD4691_SPARE_CONTROL ... AD4691_ACC_SAT_OVR_REG(15):
> + ret = spi_write_then_read(spi, tx, 2, rx, 1);
for tx size can use sizeof(tx)
> + if (ret)
> + return ret;
> + *val = rx[0];
> + return 0;
> + case AD4691_STD_SEQ_CONFIG:
> + case AD4691_AVG_IN(0) ... AD4691_AVG_IN(15):
> + ret = spi_write_then_read(spi, tx, 2, rx, 2);
> + if (ret)
> + return ret;
> + *val = get_unaligned_be16(rx);
> + return 0;
> + case AD4691_AVG_STS_IN(0) ... AD4691_AVG_STS_IN(15):
> + case AD4691_ACC_IN(0) ... AD4691_ACC_IN(15):
> + ret = spi_write_then_read(spi, tx, 2, rx, 3);
> + if (ret)
> + return ret;
> + *val = get_unaligned_be24(rx);
> + return 0;
> + case AD4691_ACC_STS_DATA(0) ... AD4691_ACC_STS_DATA(15):
> + ret = spi_write_then_read(spi, tx, 2, rx, 4);
> + if (ret)
> + return ret;
> + *val = get_unaligned_be32(rx);
> + return 0;
> + default:
> + return -EINVAL;
> + }
> +}
> +
> +static int ad4691_reg_write(void *context, unsigned int reg, unsigned int val)
> +{
> + struct spi_device *spi = context;
> + u8 tx[4];
> +
> + put_unaligned_be16(reg, tx);
> +
> + switch (reg) {
> + case 0 ... AD4691_OSC_FREQ_REG:
> + case AD4691_SPARE_CONTROL ... AD4691_ACC_MASK_REG - 1:
> + case AD4691_ACC_MASK_REG + 1 ... AD4691_GPIO_MODE2_REG:
> + if (val > 0xFF)
U8_MAX from limits.h
> + return -EINVAL;
> + tx[2] = val;
> + return spi_write_then_read(spi, tx, 3, NULL, 0);
> + case AD4691_ACC_MASK_REG:
> + case AD4691_STD_SEQ_CONFIG:
> + if (val > 0xFFFF)
U16_MAX
> + return -EINVAL;
> + put_unaligned_be16(val, &tx[2]);
> + return spi_write_then_read(spi, tx, 4, NULL, 0);
> + default:
> + return -EINVAL;
> + }
> +}
> +static int ad4691_set_sampling_freq(struct iio_dev *indio_dev, int freq)
> +{
> + struct ad4691_state *st = iio_priv(indio_dev);
> + unsigned int start = (st->info->max_rate == 1 * HZ_PER_MHZ) ? 0 : 1;
This appears in a couple of places. Maybe a little helper for
ad4691_samp_freq_array_start() would let you describe why in a comment
in a single location.
> +
> + IIO_DEV_ACQUIRE_DIRECT_MODE(indio_dev, claim);
> + if (IIO_DEV_ACQUIRE_FAILED(claim))
> + return -EBUSY;
> +
> + for (unsigned int i = start; i < ARRAY_SIZE(ad4691_osc_freqs_Hz); i++) {
> + if (ad4691_osc_freqs_Hz[i] != freq)
> + continue;
> + return regmap_update_bits(st->regmap, AD4691_OSC_FREQ_REG,
> + AD4691_OSC_FREQ_MASK, i);
> + }
> +
> + return -EINVAL;
> +}
> +
> +static int ad4691_read_avail(struct iio_dev *indio_dev,
> + struct iio_chan_spec const *chan,
> + const int **vals, int *type,
> + int *length, long mask)
> +{
> + struct ad4691_state *st = iio_priv(indio_dev);
> + unsigned int start = (st->info->max_rate == 1 * HZ_PER_MHZ) ? 0 : 1;
> +
> + switch (mask) {
> + case IIO_CHAN_INFO_SAMP_FREQ:
> + *vals = &ad4691_osc_freqs_Hz[start];
> + *type = IIO_VAL_INT;
> + *length = ARRAY_SIZE(ad4691_osc_freqs_Hz) - start;
> + return IIO_AVAIL_LIST;
> + default:
> + return -EINVAL;
> + }
> +}
> +
> +static int ad4691_regulator_setup(struct ad4691_state *st)
> +{
> + struct device *dev = regmap_get_device(st->regmap);
> + int ret;
> +
> + ret = devm_regulator_bulk_get_enable(dev, ARRAY_SIZE(ad4691_supplies),
> + ad4691_supplies);
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to get and enable supplies\n");
> +
> + ret = devm_regulator_get_enable(dev, "ldo-in");
> + if (ret == -ENODEV)
> + st->ldo_en = true;
This seems odd. ldo_en to me implies enabling the internal LDO, which I'd expect
to need the ldo-in supply. Anyhow - that oddity made me dig...
The datasheet is rather confusingly worded but the diagrams seem clear.
I 'think' the only time the ldo should not be enabled is when VDD = 1.8V and
LDO_IN is then tied to 0 (which maybe we represent at no ldo-in regulator?)
When ldo-in is supplied and the LDO enabled, the LDO output is tied to the VDD pin
internally. So you'd should not be feeding VDD in that case. There is a note
that says that it'll survive being also powered to 1.8V or grounded, but seems
to strongly advise against doing that.
Anyhow, what you have here and the dt-binding don't seem to align with the datasheet.
Figure 54 shows this most clearly. Your dt binding doesn't include vdd which
is also not aligning with the ad4691 datasheet on the website.
> + else if (ret)
> + return dev_err_probe(dev, ret, "Failed to get and enable LDO-IN\n");
> +
> + st->vref_uV = devm_regulator_get_enable_read_voltage(dev, "ref");
> + if (st->vref_uV == -ENODEV) {
> + st->vref_uV = devm_regulator_get_enable_read_voltage(dev, "refin");
> + st->refbuf_en = true;
> + }
> + if (st->vref_uV < 0)
> + return dev_err_probe(dev, st->vref_uV,
> + "Failed to get reference supply\n");
> +
> + if (st->vref_uV < AD4691_VREF_uV_MIN || st->vref_uV > AD4691_VREF_uV_MAX)
> + return dev_err_probe(dev, -EINVAL,
> + "vref(%d) must be in the range [%u...%u]\n",
> + st->vref_uV, AD4691_VREF_uV_MIN,
> + AD4691_VREF_uV_MAX);
> +
> + return 0;
> +}
> +
> +static int ad4691_reset(struct ad4691_state *st)
> +{
> + struct device *dev = regmap_get_device(st->regmap);
> + struct reset_control *rst;
> +
> + rst = devm_reset_control_get_optional_exclusive(dev, NULL);
> + if (IS_ERR(rst))
> + return dev_err_probe(dev, PTR_ERR(rst), "Failed to get reset\n");
> +
> + if (rst) {
> + /*
> + * The GPIO is already asserted by reset_gpio_probe().
I thought assumption was that firmware (or driver going down) should always
leave the device in reset and hence could safely do
devm_reset_control_get_optional_exclusive_deasserted() without
the need for the dance with asserting / sleep deasserting.
More than possible I've misunderstood this bit (or we have known firmware
out there that doesn't leave this in reset - in which case good to document
that here).
I don't mind the explicit code you have, just want to understand the reasoning
a little more.
FWIW, if anyone fancies taking a look, a few of the IIO drivers could
do to use the various 'extended' devm calls rather than doing their own
handling off assert and deassert via devm_add_action_or_rest()
> + * Wait for the reset pulse width required by the chip.
> + * See datasheet Table 5.
> + */
> + fsleep(300);
> + return reset_control_deassert(rst);
> + }
> +
> + /* No hardware reset available, fall back to software reset. */
> + return regmap_write(st->regmap, AD4691_SPI_CONFIG_A_REG,
> + AD4691_SW_RESET);
> +}
> +
> +static int ad4691_config(struct ad4691_state *st)
> +{
...
> + val = FIELD_PREP(AD4691_REF_CTRL_MASK, ref_val);
> + if (st->refbuf_en)
> + val |= AD4691_REFBUF_EN;
> +
> + ret = regmap_update_bits(st->regmap, AD4691_REF_CTRL,
> + AD4691_REF_CTRL_MASK | AD4691_REFBUF_EN, val);
It doesn't hugely matter but given (at least for the AD4691) the other bits
are reserved and reset to 0 you could just write the whole thing and avoid
a bus read. Doing it like this kind of implies there is something to
preserve. I was curious what hence looked it up.
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to write REF_CTRL\n");
> +
> + ret = regmap_assign_bits(st->regmap, AD4691_DEVICE_SETUP,
> + AD4691_LDO_EN, st->ldo_en);
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to write DEVICE_SETUP\n");
> +
> + /*
> + * Set the internal oscillator to the highest rate this chip supports.
> + * Index 0 (1 MHz) exceeds the 500 kHz max of AD4691/AD4693, so those
> + * chips start at index 1 (500 kHz).
> + */
> + ret = regmap_assign_bits(st->regmap, AD4691_OSC_FREQ_REG,
> + AD4691_OSC_FREQ_MASK,
Similar to above. You could simplify to a write only given rest are reserved 0 bits.
Maybe it's simpler to keep them all the same though. Up to you.
> + (st->info->max_rate == 1 * HZ_PER_MHZ) ? 0 : 1);
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to write OSC_FREQ\n");
> +
> + ret = regmap_update_bits(st->regmap, AD4691_ADC_SETUP,
> + AD4691_ADC_MODE_MASK, AD4691_AUTONOMOUS_MODE);
> + if (ret)
> + return dev_err_probe(dev, ret, "Failed to write ADC_SETUP\n");
> +
> + return 0;
> +}
> +
> +static int ad4691_probe(struct spi_device *spi)
> +{
> + struct device *dev = &spi->dev;
> + struct iio_dev *indio_dev;
> + struct ad4691_state *st;
> + int ret;
> +
> + indio_dev = devm_iio_device_alloc(&spi->dev, sizeof(*st));
indio_dev = devm_iio_device_alloc(dev, sizeof(*st));
> + if (!indio_dev)
> + return -ENOMEM;
> +
> + st = iio_priv(indio_dev);
> + st->info = spi_get_device_match_data(spi);
> +
> + ret = devm_mutex_init(dev, &st->lock);
...
thanks
Jonathan
^ permalink raw reply
* [RFC PATCH v5.2 06/11] Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom} files
From: SeongJae Park @ 2026-04-12 16:19 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260412161957.82835-1-sj@kernel.org>
Update DAMON usage document for the DAMOS action failed regions quota
charge ratio control sysfs files.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Documentation/admin-guide/mm/damon/usage.rst | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index bfdb717441f05..d5548e460857c 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -84,7 +84,9 @@ comma (",").
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
│ │ │ │ │ │ │ │ age/min,max
- │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes,goal_tuner
+ │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,
+ │ │ │ │ │ │ │ effective_bytes,goal_tuner,
+ │ │ │ │ │ │ │ fail_charge_num,fail_charge_denom
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path
@@ -381,9 +383,10 @@ schemes/<N>/quotas/
The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
DAMON-based operation scheme.
-Under ``quotas`` directory, five files (``ms``, ``bytes``,
-``reset_interval_ms``, ``effective_bytes`` and ``goal_tuner``) and two
-directories (``weights`` and ``goals``) exist.
+Under ``quotas`` directory, seven files (``ms``, ``bytes``,
+``reset_interval_ms``, ``effective_bytes``, ``goal_tuner``, ``fail_charge_num``
+and ``fail_charge_denom``) and two directories (``weights`` and ``goals``)
+exist.
You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
``reset interval`` in milliseconds by writing the values to the three files,
@@ -402,6 +405,13 @@ the background design of the feature and the name of the selectable algorithms.
Refer to :ref:`goals directory <sysfs_schemes_quota_goals>` for the goals
setup.
+You can set the action-failed memory quota charging ratio by writing the
+numerator and the denominator for the ratio to ``fail_charge_num`` and
+``fail_charge_denom`` files, respectively. Reading those files will return the
+current set values. Refer to :ref:`design
+<damon_design_damos_quotas_failed_memory_charging_ratio>` for more details of
+the ratio feature.
+
The time quota is internally transformed to a size quota. Between the
transformed size quota and user-specified size quota, smaller one is applied.
Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v5.2 05/11] Docs/mm/damon/design: document fail_charge_{num,denom}
From: SeongJae Park @ 2026-04-12 16:19 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
Jonathan Corbet, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, damon, linux-doc,
linux-kernel, linux-mm
In-Reply-To: <20260412161957.82835-1-sj@kernel.org>
Update DAMON design document for the DAMOS action failed region quota
charge ratio.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Documentation/mm/damon/design.rst | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 622d24e35961e..fa7392b5a331d 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -576,6 +576,28 @@ interface <sysfs_interface>`, refer to :ref:`weights <sysfs_quotas>` part of
the documentation.
+.. _damon_design_damos_quotas_failed_memory_charging_ratio:
+
+Action-failed Memory Charging Ratio
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+DAMOS action to a given region can fail for some subsets of the memory of the
+region. For example, if the action is ``pageout`` and the region has some
+unreclaimable pages, applying the action to the pages will fail. The amount of
+system resource that is taken for such failed action applications is usually
+different from that for successful action applications. For such cases, users
+can set different charging ratio for such failed memory. The ratio can be
+specified using ``fail_charge_num`` and ``fail_charge_denom`` parameters. The
+two parameters represent the numerator and denominator of the ratio. The
+feature is enabled only if ``fail_charge_denom`` is not zero.
+
+For example, let's suppose a DAMOS action is applied to a region of 1,000 MiB
+size. The action is successfully applied to only 700 MiB of the region.
+``fail_charge_num`` and ``fail_charge_denom`` are set to ``1`` and ``1024``,
+respectively. Then only 700 MiB and 300 KiB of size (``700 MiB + 300 MiB * 1 /
+1024``) will be charged.
+
+
.. _damon_design_damos_quotas_auto_tuning:
Aim-oriented Feedback-driven Auto-tuning
--
2.47.3
^ permalink raw reply related
* [RFC PATCH v5.2 00/11] mm/damon: introduce DAMOS failed region quota charge ratio
From: SeongJae Park @ 2026-04-12 16:19 UTC (permalink / raw)
Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, Brendan Higgins,
David Gow, David Hildenbrand, Jonathan Corbet, Lorenzo Stoakes,
Michal Hocko, Mike Rapoport, Shuah Khan, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, damon, kunit-dev, linux-doc,
linux-kernel, linux-kselftest, linux-mm
TL; DR: Let users set different DAMOS quota charge ratios for DAMOS
action failed regions, for deterministic and consistent DAMOS action
progress.
Common Reports: Unexpectedly Slow DAMOS
=======================================
One common issue report that we get from DAMON users is that DAMOS
action applying progress speed is sometimes much slower than expected.
And one common root cause is that the DAMOS quota is exceeded by the
action applying failed memory regions.
For example, a group of users tried to run DAMOS-based proactive memory
reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota. They
ran it on a system having no active workload which means all memory of
the system is cold. The expectation was that the system will show 100
MiB per second reclamation until (nearly) all memory is reclaimed. But
what they found is that the speed is quite inconsistent and sometimes it
becomes very slower than the expectation, sometimes even no reclamation
at all for about tens of seconds. The upper limit of the speed (100 MiB
per second) was being kept as expected, though.
By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS
stat, we found DAMOS quota is always exceeded when the speed is slow. By
monitoring sz_tried and sz_applied (the total amount of DAMOS action
tried memory and succeeded memory) DAMOS stats together, we found the
reclamation attempts nearly always failed when the speed is slow.
DAMOS quota charges DAMOS action tried regions regardless of the
successfulness of the try. Hence in the example reported case, there
was unreclaimable memory spread around the system memory. Sometimes
nearly 100 MiB of memory that DAMOS tried to reclaim in the given quota
interval was reclaimable, and therefore showed nearly 100 MiB per second
speed. Sometimes nearly 99 MiB of memory that DAMOS was trying to
reclaim in the given quota interval was unreclaimable, and therefore
showing only about 1 MiB per second reclaim speed.
We explained it is an expected behavior of the feature rather than a
bug, as DAMOS quota is there for only the upper-limit of the speed. The
users agreed and later reported a huge win from the adoption of
DAMON_RECLAIM on their products.
It is Not a Bug but a Feature; But...
=====================================
So nothing is broken. DAMOS quota is working as intended, as the upper
limit of the speed. It also provides its behavior observability via
DAMOS stat. In the real world production environment that runs long
term active workloads and matters stability, the speed sometimes being
slow is not a real problem.
But, the non-deterministic behavior is sometimes annoying, especially in
lab environments. Even in a realistic production environment, when
there is a huge amount of DAMOS action unapplicable memory, the speed
could be problematically slow. Let's suppose a virtual machines
provider that setup 99% of the host memory as hugetlb pages that cannot
be reclaimed, to give it to virtual machines. Also, when aim-oriented
DAMOS auto-tuning is applied, this could also make the internal feedback
loop confused.
The intention of the current behavior was that trying DAMOS action to
regions would anyway impose some overhead, and therefore somehow be
charged. But in the real world, the overhead for failed action is much
lighter than successful action. Charging those at the same ratio may be
unfair, or at least suboptimum in some environments.
DAMOS Action Failed Region Quota Charge Ratio
=============================================
Let users set the charge ratio for the action-failed memory, for more
optimal and deterministic use of DAMOS. It allows users to specify the
numerator and the denominator of the ratio for flexible setup. For
example, let's suppose the numerator and the denominator are set to 1
and 4,096, respectively. The ratio is 1 / 4,096. A DAMOS scheme action
is applied to 5 GiB memory. For 1 GiB of the memory, the action is
succeeded. For the rest (4 GiB), the action is failed. Then, only 1
GiB and 1 MiB quota is charged.
The optimal charge ratio will depend on the use case and
system/workload. I'd recommend starting from setting the nominator as 1
and the denominator as PAGE_SIZE and tune based on the results, because
many DAMOS actions are applied at page level.
Tests
=====
I tested this feature in the steps below.
1. Allocate 50% of system memory and mlock() it using a test program.
2. Fill up the page cache to exhaust nearly all free memory.
3. Start DAMON-based proactive reclamation with 100 MiB/second DAMOS
hard-quota. Auto-tune the DAMOS soft-quota under the hard-quota for
achieving 40% free memory of the system with 'temporal' tuner.
For step 1, I run a simple C program that is written by Gemini. It is
quite straightforward, so I'm not sharing the code here.
For step 2, I use dd command like below:
dd if=/dev/zero of=foo bs=1M count=$50_percent_of_system_memory
For step 3, I use the latest version of DAMON user-space tool (damo)
like below.
sudo damo start --damos_action pageout \
` # Do the pageout only up to 100 MiB per second ` \
--damos_quota_space 100M --damos_quota_interval 1s \
` # Auto-tune the quota below the hard quota aiming` \
` # 40% free memory of the node 0 ` \
` # (entire node of the test system)` \
--damos_quota_goal node_mem_free_bp 40% 0 \
` # use temporal tuner, which is easy to understnd ` \
--damos_quota_goal_tuner temporal
As expected, the progress of the reclamation is not consistent, because
the quota is exceeded for the failed reclamation of the unreclaimable
memory.
I do this again, but with the failed region charge ratio feature. For
this, the above 'damo' command is used, after appending command line
option for setup of the charge ratio like below. Note that the option
was added to 'damo' after v3.1.9.
sudo ./damo start --damos_action pageout \
[...]
` # quota-charge only 1/4096 for pageout-failed regions ` \
--damos_quota_fail_charge_ratio 1 4096
The progress of the reclamation was nearly 100 MiB per second until the
goal was achieved, meeting the expectation.
Patches Sequence
================
The first two patches make preparational changes. Patch 1 updates fully
charged quota check to handle <min_region_sz remaining quota, which will
be able to exist after this series is applied. Patch 2 merges regions
after applying schemes is done as long as it is ok to do, since regions
split operations for quota could happen much more frequently under a
corner case that this series will make available.
Patch 3 implements the feature and exposes it via DAMON core API. Patch
4 implements DAMON sysfs ABI for the feature. Three following patches
(5-7) document the feature and ABI on design, usage, and ABI documents,
respectively. Four patches for testing of the new feature follow.
Patch 8 implements a kunit test for the feature. Patches 9 and 10
extend DAMON selftest helpers for DAMON sysfs control and internal state
dumping for adding a new selftest for the feature. Patch 11 extends
existing DAMON sysfs interface selftest to test the new feature using
the extended helper scripts.
Changelog
=========
Changes from RFC v5.1
(https://lore.kernel.org/20260411164908.77189-1-sj@kernel.org)
- Add missed mergeback fixup.
Changes from RFC v5
(https://lore.kernel.org/20260410142034.83798-1-sj@kernel.org)
- Merge back: merge whatever if it doesn't lose monitoring infomration
and not violating min_nr_regions.
Changes from RFC v4
(https://lore.kernel.org/20260409142148.60652-1-sj@kernel.org)
- Fix quota-sliced region merge-back issues.
- Use damon_for_each_region() instead of damon_for_each_region_safe().
- Avoid merging back of sliced but scheme unapplied regions, to keep
the monitoring information.
Changes from RFC v3
(https://lore.kernel.org/20260407010536.83603-1-sj@kernel.org)
- Make damos_quota_is_full() safe from overflow and easier to read.
- Avoid quota-based region split making too many new regions.
Changes from RFC v2
(https://lore.kernel.org/20260405151232.102690-1-sj@kernel.org)
- Handle <min_region_sz remaining quota.
- Document zero denum behavior.
- Fix typos: s/selftets/selftests/
Changes from RFC v1
(https://lore.kernel.org/20260404163943.89278-1-sj@kernel.org)
- Avoid overflows in charge amount calculation.
- Fix/wordsmith documentation for grammar, typo, and wrong examples.
- Improve unit test for more consistent comparison source use.
SeongJae Park (11):
mm/damon/core: handle <min_region_sz remaining quota as empty
mm/damon/core: merge regions after applying DAMOS schemes
mm/damon/core: introduce failed region quota charge ratio
mm/damon/sysfs-schemes: implement fail_charge_{num,denom} files
Docs/mm/damon/design: document fail_charge_{num,denom}
Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom}
files
Docs/ABI/damon: document fail_charge_{num,denom}
mm/damon/tests/core-kunit: test fail_charge_{num,denom} committing
selftests/damon/_damon_sysfs: support failed region quota charge ratio
selftests/damon/drgn_dump_damon_status: support failed region quota
charge ratio
selftests/damon/sysfs.py: test failed region quota charge ratio
.../ABI/testing/sysfs-kernel-mm-damon | 12 ++
Documentation/admin-guide/mm/damon/usage.rst | 18 ++-
Documentation/mm/damon/design.rst | 22 ++++
include/linux/damon.h | 9 ++
mm/damon/core.c | 103 ++++++++++++++++--
mm/damon/sysfs-schemes.c | 54 +++++++++
mm/damon/tests/core-kunit.h | 6 +
tools/testing/selftests/damon/_damon_sysfs.py | 21 +++-
.../selftests/damon/drgn_dump_damon_status.py | 2 +
tools/testing/selftests/damon/sysfs.py | 6 +
10 files changed, 236 insertions(+), 17 deletions(-)
base-commit: 8419cb46e7cc36ea746f259cfcbda7166a9a1fd5
--
2.47.3
^ permalink raw reply
* Re: [PATCH v9 2/9] lib: vsprintf: export simple_strntoull() in a safe prototype
From: Jonathan Cameron @ 2026-04-12 15:05 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Andy Shevchenko, Petr Mladek, rodrigo.alencar, linux-kernel,
linux-iio, devicetree, linux-doc, David Lechner, Andy Shevchenko,
Lars-Peter Clausen, Michael Hennerich, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Andrew Morton,
Steven Rostedt, Rasmus Villemoes, Sergey Senozhatsky, Shuah Khan
In-Reply-To: <mnz7d2zd27x6h2qa24rajgrbhkhsypybadkqz2fi43rg7bvjvj@oufys7xs25t4>
On Tue, 31 Mar 2026 14:01:04 +0100
Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> On 26/03/30 01:49PM, Rodrigo Alencar wrote:
> > On 26/03/27 03:17PM, Rodrigo Alencar wrote:
> > > On 26/03/27 12:21PM, Andy Shevchenko wrote:
> > > > On Fri, Mar 27, 2026 at 10:11:56AM +0000, Rodrigo Alencar wrote:
> > > > > On 26/03/27 11:17AM, Andy Shevchenko wrote:
> > > > > > On Fri, Mar 27, 2026 at 09:45:17AM +0100, Petr Mladek wrote:
> > > > > > > On Fri 2026-03-20 16:27:27, Rodrigo Alencar via B4 Relay wrote:
> >
> > ...
> >
> > > > > > Maybe we want to have kstrtof32() and kstrtof64() for these two cases?
> > > > > >
> > > > > > With that we will always consider the fraction part as 32- or 64-bit,
> > > > > > imply floor() on the fraction for the sake of simplicity and require
> > > > > > it to be NUL-terminated with possible trailing '\n'.
> > > > >
> > > > > I think this is a good idea, but calling it float or fixed point itself
> > > > > is a bit confusing as float often refers to the IEEE 754 standard and
> > > > > fixed point types is often expressed in Q-format.
> > > >
> > > > Yeah... I am lack of better naming.
> > >
> > > decimals is the name, but they are often represented as:
> > >
> > > DECIMAL = INT * 10^X + FRAC
> > >
> > > in a single 64-bit number, which would be fine for my end use case.
> > > However IIO decimal fixed point parsing is out there for quite some time a
> > > lot of drivers use that. The interface often relies on breaking parsed values
> > > into an integer array (for standard attributes int val and int val2 are expected).
> >
> > Thinking about this again and in IIO drivers we end up doing something like:
> >
> > val64 = (u64)val * MICRO + val2;
> >
> > so that drivers often work with scaled versions of the decimal value.
> > then, would it make sense to have a function that already outputs such value?
> > That would allow to have more freedom over the 64-bit split between integer
> > and fractional parts.
> > As a draft:
> >
> > static int _kstrtodec64(const char *s, unsigned int scale, u64 *res)
> > {
> > u64 _res = 0, _frac = 0;
> > unsigned int rv;
> >
> > if (*s != '.') {
> > rv = _parse_integer(s, 10, &_res);
> > if (rv & KSTRTOX_OVERFLOW)
> > return -ERANGE;
> > if (rv == 0)
> > return -EINVAL;
> > s += rv;
> > }
> >
> > if (*s == '.') {
> > s++;
> > rv = _parse_integer_limit(s, 10, &_frac, scale);
> > if (rv & KSTRTOX_OVERFLOW)
> > return -ERANGE;
> > if (rv == 0)
> > return -EINVAL;
> > s += rv;
> > if (rv < scale)
> > _frac *= int_pow(10, scale - rv);
> > while (isdigit(*s)) /* truncate */
> > s++;
> > }
> >
> > if (*s == '\n')
> > s++;
> > if (*s)
> > return -EINVAL;
> >
> > if (check_mul_overflow(_res, int_pow(10, scale), &_res) ||
> > check_add_overflow(_res, _frac, &_res))
> > return -ERANGE;
> >
> > *res = _res;
> > return 0;
> > }
> >
> > noinline
> > int kstrtoudec64(const char *s, unsigned int scale, u64 *res)
> > {
> > if (s[0] == '+')
> > s++;
> > return _kstrtodec64(s, scale, res);
> > }
> > EXPORT_SYMBOL(kstrtoudec64);
> >
> > noinline
> > int kstrtosdec64(const char *s, unsigned int scale, s64 *res)
> > {
> > u64 tmp;
> > int rv;
> >
> > if (s[0] == '-') {
> > rv = _kstrtodec64(s + 1, scale, &tmp);
> > if (rv < 0)
> > return rv;
> > if ((s64)-tmp > 0)
> > return -ERANGE;
> > *res = -tmp;
> > } else {
> > rv = kstrtoudec64(s, scale, &tmp);
> > if (rv < 0)
> > return rv;
> > if ((s64)tmp < 0)
> > return -ERANGE;
> > *res = tmp;
> > }
> > return 0;
> > }
> > EXPORT_SYMBOL(kstrtosdec64);
> >
> > e.g., kstrtosdec64() or kstrtoudec64() parses "3.1415" with scale 3 into 3141
>
> Hi Jonathan,
>
> developing more on that, I wouldn't need to create a iio_str_to_fixpoint64(),
> what do you think on new format types:
>
> #define IIO_VAL_DECIMAL64_1 101
> #define IIO_VAL_DECIMAL64_2 102
> #define IIO_VAL_DECIMAL64_3 103
> #define IIO_VAL_DECIMAL64_4 104
> #define IIO_VAL_DECIMAL64_5 105
> #define IIO_VAL_DECIMAL64_6 106
> #define IIO_VAL_DECIMAL64_7 107
> #define IIO_VAL_DECIMAL64_8 108
> #define IIO_VAL_DECIMAL64_9 109
> #define IIO_VAL_DECIMAL64_10 110
> #define IIO_VAL_DECIMAL64_11 111
> #define IIO_VAL_DECIMAL64_12 112
> #define IIO_VAL_DECIMAL64_13 113
> #define IIO_VAL_DECIMAL64_14 114
> #define IIO_VAL_DECIMAL64_15 115
Seems unlikely more than a few of these would ever be used.
If you want to keep the offsets define them in terms
of first one + whatever makes sense (maybe with a base value
to make that easier - then MICRO is BASE + 6 etc)
>
> #define IIO_VAL_DECIMAL64_MILLI IIO_VAL_DECIMAL64_3
> #define IIO_VAL_DECIMAL64_MICRO IIO_VAL_DECIMAL64_6
> #define IIO_VAL_DECIMAL64_NANO IIO_VAL_DECIMAL64_9
> #define IIO_VAL_DECIMAL64_PICO IIO_VAL_DECIMAL64_12
> #define IIO_VAL_DECIMAL64_FEMTO IIO_VAL_DECIMAL64_15
>
> which gets stored as 64-bit, and represent the decimal scaled value.
> That would also work for the PLL driver (using IIO_VAL_DECIMAL64_MICRO):
> - It supports frequency range from 1 to 26 GHz with micro Hz resolution
> - In the driver a 64-bit value: (val * MICRO + val2) is already created
> anyways.
> I would leverage something like kstrtodec64() in iio_write_channel_info().
>
> That way, I would drop the changes on the iio fixpoint parse, which I think
> it would do better with something like kstrntoull() to be able to handle that
> "dB" suffix.
>
> So for now, I may have the following approaches:
> - new kstrntoull() function: to have control over the parsing, whithout
> requiring NUL-termination, avoiding unecessary string scanning or copying.
> covered in v8.
> - expose a "safe" simple_strntoull(): minimal changes to vsprintf.c, this
> is covered by this patch series (v9), and it similar solution to kstrntoull().
> - new kstrtodec64() function: parse decimal numbers as 64-bit with NUL-termination.
> Might be covered in a v10, if it is a good idea.
>
> let me know your thoughts.
For string parsing I'm not a particular expert, so I only really care
about useability of the end result. Having types would work, but I'd kind
of want an 'odd ones' to stand out. Hence only defining them for parsing
cases we already support (but with 64 bit values) would be my preference
if you got ahead with this new approach.
Jonathan
>
^ permalink raw reply
* Re: [PATCH RFC v2 9/9] docs: iio: add documentation for ad9910 driver
From: Jonathan Cameron @ 2026-04-12 14:54 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Rodrigo Alencar via B4 Relay, rodrigo.alencar, linux-iio,
devicetree, linux-kernel, linux-doc, Lars-Peter Clausen,
Michael Hennerich, David Lechner, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
Shuah Khan
In-Reply-To: <g57xnbhtvu25dvzy5b5pz76yzalpcjpnyclob6dy6j7fphxqle@5vww7psogbv3>
On Mon, 23 Mar 2026 11:58:53 +0000
Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> On 26/03/22 05:34PM, Jonathan Cameron wrote:
> > On Wed, 18 Mar 2026 17:56:09 +0000
> > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> >
> > > From: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > >
> > > Add documentation for the AD9910 DDS IIO driver, which describes channels,
> > > DDS modes, attributes and ABI usage examples.
> > >
> > > Signed-off-by: Rodrigo Alencar <rodrigo.alencar@analog.com>
>
> ...
>
> > > +Channel hierarchy
> > > +=================
> > > +
> > > +The driver exposes the following IIO output channels, each identified by a
> > > +unique channel number and a human-readable label:
> > > +
> > > +* ``out_altvoltage100``: ``phy``: Physical output: system clock and profile control
> > > +
> > > + * ``out_altvoltage110``: ``single_tone``: Single tone mode: per-profile
> > > + frequency, phase, amplitude
> > > +
> > > + * ``out_altvoltage120``: ``parallel_port``: Parallel port modulation: enable
> > > + and offset/scale parameters
> > > +
> > > + * ``out_altvoltage130``: ``digital_ramp_generator``: DRG control: enable,
> > > + destination, operating mode
> > > +
> > > + * ``out_altvoltage131``: ``digital_ramp_up``: DRG ramp-up parameters:
> > > + limits, step sizes, ramp rate
> > > + * ``out_altvoltage132``: ``digital_ramp_down``: DRG ramp-down parameters:
> > > + limits, step sizes, ramp rate
> > > +
> > > + * ``out_altvoltage140``: ``ram_control``: RAM playback: enable, destination,
> > > + operating mode, address range
> > > +
> > > + * ``out_altvoltage150``: ``output_shift_keying``: OSK: enable, amplitude
> > > + scale, ramp rate, auto/manual control
> > > +
> > > +The ``phy`` channel is the root of the hierarchy. Changing its
> > > +``sampling_frequency`` reconfigures the system clock (SYSCLK) which affects all
> > > +other channels. The ``profile`` attribute on this channel selects the active
> > > +hardware profile (0-7) used by the single tone and RAM channels.
> > I asked out this profile thing in one of the other patches. Key to me is
> > that how we write non active profiles? The most similar thing we've seen
> > in the past has been setting other frequencies for FSK or phases for PSK or
> > more mundane DC DAC output that are symbol based. (often an external signal)
>
> Yes, not allowing to configure a non-active profile at this point.
> Initially was not seeing this as a problem, but would you think different
> channels for each profiles should be created? e.g.:
>
> - out_altvoltage111 ... out_altvoltage118 for single tone profiles; and
> - out_altvoltage141 .. out_altvoltage148 for RAM profiles
>
> That would not remove the need for something like the profile attribute.
Yes. I'm thinking it would be something along those lines.
I think the tone and RAM profiles end up with totally different controls
so this should make it clear they aren't related.
>
> > For those we have added an additional index so we can see which symbol we
> > are changing parameters for. Here it might need to be done in the channel
> > numbering. I'm not sure.
> >
> > > +
> > > +All mode-specific channels (parallel port, DRG, RAM, OSK) have an ``enable``
> > > +attribute. The DRG and RAM channels additionally have ``destination`` and
> > > +``operating_mode`` attributes that configure which DDS core parameter is
> > > +modulated and how.
> >
> > I wonder if we flatten things out and have separate channels for each type
> > of modulation. Might lead to a more standard looking interfaces. We don't really
> > have a standard path to control one type of thing feeding another, whereas
> > we do have simple 'enable' interfaces.
>
> ...
>
> > > +RAM mode
> > > +--------
> > > +
> > > +The AD9910 contains a 1024 x 32-bit RAM that can be loaded with waveform data
> > > +and played back to modulate frequency, phase, amplitude, or polar (phase +
> > > +amplitude) parameters.
> > > +
> > > +RAM control channel attributes
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +.. flat-table::
> > > + :header-rows: 1
> > > +
> > > + * - Attribute
> > > + - Unit
> > > + - Description
> > > +
> > > + * - ``out_altvoltage140_en``
> > > + - boolean
> > > + - Enable/disable RAM playback. Toggling swaps profile registers between
> > > + single tone and RAM configurations across all 8 profiles.
>
> ...
>
> > > + * - ``out_altvoltage140_address_start``
> > Do we need this flexibility to set the start? We needed a length, but
> > if we want different effective start can just load a different image.
>
> There is one image being loaded into the entire RAM and each profile may choose
> from wich sample it starts and ends its address ramp.
> The profiles can have those address ranges to overlap or not, then I suppose
> considering different images would just complicate things.
In the ABI docs discussion I raised the question on whether it makes more
sense to push this 'meta data' into the firmware image. Depends a bit
on how costly loading the image is vs just updating this properties and whether
there are significant real use cases where tweaking what is run from the
image apply. I can easily conjecture some. I just have no idea if they are
real!
Thanks,
Jonathan
>
> > > + - integer
> > > + - Start address for the active profile. Range [0, 1023]. Cannot be
> > > + changed while RAM mode is enabled. If set above current end address,
> > > + end address is automatically adjusted.
> > > +
>
> ...
>
^ permalink raw reply
* Re: [PATCH RFC v2 8/9] Documentation: ABI: testing: add docs for ad9910 sysfs entries
From: Jonathan Cameron @ 2026-04-12 14:51 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Rodrigo Alencar via B4 Relay, rodrigo.alencar, linux-iio,
devicetree, linux-kernel, linux-doc, Lars-Peter Clausen,
Michael Hennerich, David Lechner, Andy Shevchenko, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Philipp Zabel, Jonathan Corbet,
Shuah Khan
In-Reply-To: <mtqjtmsysz6ywvybeut6qzhee2o4qedwgvr5isbn4um7bwhjbe@sg2b7hwlszwd>
On Mon, 23 Mar 2026 11:36:08 +0000
Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> On 26/03/22 05:22PM, Jonathan Cameron wrote:
> > On Wed, 18 Mar 2026 17:56:08 +0000
> > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> >
> > > From: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > >
> > > Add ABI documentation file for the DDS AD9910 with sysfs entries to
> > > control Parallel Port, Digital Ramp Generator, RAM and OSK parameters.
> > >
> > > Signed-off-by: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > > ---
>
> ...
>
> > > +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_profile
> > > +KernelVersion:
> > > +Contact: linux-iio@vger.kernel.org
> > > +Description:
> > > + Read/write the active profile index [0, 7] from/to the physical
> > > + channel. The AD9910 supports 8 profiles, each storing a complete
> > > + set of single tone (frequency, phase, amplitude) and RAM playback
> > > + parameters.
> >
> > This one is interesting. Can we treat them as symbols that we are picking
> > between? We have similar DAC ABIs for that already.
>
> The profile concept comes from the datasheet and defines sets of configuration
> for single tone and RAM control mode. I am not sure how we fit this idea into a
> "symbol"
Think of those tones as just different frequencies (the other stuff could be the same)
then this is FSK with 8 symbols.
>
> > Is this picking between them for purposes of configuration or setting which one is
> > in being output currently?
>
> Well, this is being used for configuration and activating, then, yes, you can only configure
> an active profile, but I was not seeing that as an issue. I suppose that simplifies the
> ABI a bit.
If you are only configuring the one that is active, short of a small overhead
of having to configure more than one property why have this at all?
Just leave it in a profile and have userspace reconfigure everything it wants to.
However, I see that in some of them modes below this is more complex
as the profiles are cycled through - for ram playback anyway.
It may make sense to separate the ram case from tone ones.
>
> ...
>
> > > +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_destination
> > > +KernelVersion:
> > > +Contact: linux-iio@vger.kernel.org
> > > +Description:
> > > + Read/write the digital ramp generator (DRG) or the RAM control
> > > + destination parameter. Determines which DDS core parameter is to
> > > + be modulated when the child mode channel is enabled.
> > > +
> > > + Available values can be read from the corresponding
> > > + out_altvoltageY_destination_available attribute.
> > > +
> > > + Valid values: "polar" (only for RAM control), "frequency", "phase"
> > > + and "amplitude"
> >
> > This is very device specific. Maybe we are better representing these as separate
> > channels each with their own controls for DRG. No problem if changing one changes
> > another.
>
> You mean removing this generic Y there? Indeed, there are separate configs for each one.
No, I mean having more channels. One for each of polar, frequency, phase and frequency for
each channel. Then enables for which channel is turned on. Might not work out,
but I'd like you to explore what problems that type of interface would bring.
The aim here is to add as little new ABI as possible as custom ABI is a real
pain for generic userspace.
>
> ...
>
> > > +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_operating_mode
> > > +KernelVersion:
> > > +Contact: linux-iio@vger.kernel.org
> > > +Description:
> > > + Read/write the DRG or RAM control operating mode. For the DRG
> > > + channel it controls the no-dwell behavior of the ramp.
> > > +
> > > + Available values can be read from the corresponding
> > > + out_altvoltageY_operating_mode_available attribute.
> > > +
> > > + Valid values for DRG channel:
> > > +
> > > + - "bidirectional": Normal ramp generation (ramp up then
> > > + down, dwelling at limits).
> >
> > Some sort of trapezium wave? Maybe this and continuous forms are combined
> > and we have a separate dwell time control?
>
> Yes, sort of. Dwell control is made by an external pin (DRCTL), often controlled
> by an FPGA to achieve certain required timings. I can say that software control
> is not really recommended, unless only a one-shot ramp is necessary.
So is it worth exposing this software control at all? Maybe just make it a firmware
description problem as to whether that DRCTL is connected or not.
>
> When adding the IIO backend support, extendend attributes will be added to
> support control of dwell times.
>
> >
> > > + - "ramp_down": No-dwell low; the ramp resets to upper
> > > + limit upon reaching the lower limit.
> > > + - "ramp_up": No-dwell high; the ramp resets to lower
> > > + limit upon reaching the upper limit.
> > > + - "bidirectional_continuous": Both no-dwell high and low;
> > > + the ramp continuously sweeps without dwelling.
> >
> > Triangle wave? bidirectional continuous is a rather confusing term so maybe
> > we should rethink this one.
>
> Mostly yes, but not only that. Sawtooth can be achieved as well by changing
> the step sizes, also other weird patterns can be achieved by toggling DRCTL pin.
Sawtooth is kind of a special triangle wave with one very steep side.
Wikipedia even has: "It can also be considered the extreme case of an asymmetric triangle wave"
https://en.wikipedia.org/wiki/Sawtooth_wave
> This mode is the most useful when one does not have an FPGA and want to save
> resources on controlling the DRCTL pin. That mode name comes from the datasheet,
> so I suppose it was fine.
Let us see if we can get more opinions on this. Whilst I can see the logic of
the datasheet naming, it's a bit obscure.
>
> > > +
> > > + Valid values for RAM control channel:
> > > +
> > > + - "direct_switch": start address defines fixed word to be used
> > > + by the selected profile.
> > > + - "ramp_up": One-shot ramp up through current profile's address
> > > + range.
> > > + - "bidirectional": Ramp up then down through PROFILE0 pin.
> >
> > Avoid specifics like this. Can we call external control pin or something like that?
> >
> > > + - "bidirectional_continuous": Continuous ramp up/down
> > > + through current profile's address range.
> > > + - "ramp_up_continuous": Continuous ramp up through
> > > + current profile's address range.
> >
> > I guess this goes back to start on finishing ramping up?
>
> Yes, any dwell time should be considered when loading the "waveform" into the RAM
>
> >
> > > + - "sequenced": Sequenced playback of RAM profiles up to
> > > + the active profile. Requires active profile > 0.
> > Is this just running through each profile one after another? (other than profile 0)?
>
> for this, this would be the actions:
> - configure all desired profiles (0 up to X)
> - Set the operating mode to sequenced (profile X would be active at this point)
> - Enable RAM mode
Ah. This reflects on the profile control above. As I mention there I'm not sure we
shouldn't separate the use of profile for tones (where it is symbol like) to that
for RAM addresses.
For the ram addresses I think I'd expose separate attributes for each (there aren't
that many) rather than a selector + controls.
Another option would be to push the control of this into the firmware files
(some sort of header). Whether that is sufficient would depend on the usecases
for the device. It would definitely make for an easier runtime configuration
though!
>
> When RAM mode is enabled, it would trigger the execution of profiles 0 up to X,
> in sequence, according to the configured address range and sample rate for each profile.
> When one profile ends the next starts until profile X finishes.
>
> > > + - "sequenced_continuous": Continuous sequenced playback
> > > + of RAM profiles up to the active profile. Requires
> > > + active profile > 0.
> > Similar to above, maybe separate out dwell time if that's the difference between
> > sequenced and sequenced_continuous.
>
> The difference here is that when Profile X finishes, Profile 0 starts again.
> So the previous one is kind of an one-shot mode of multiple profiles in sequence
They had fun designing this didn't they!
Anyhow, I think we'll need to work through a few more versions of this to get
as extensible an interface as possible.
>
> ...
>
^ permalink raw reply
* Re: [PATCH v2] Docs: iio: ad7191 Correct clock configuration
From: Jonathan Cameron @ 2026-04-12 13:40 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Andy Shevchenko, Ammar Mustafa, Alisa-Dariana Roman,
David Lechner, Nuno Sá, Andy Shevchenko, Jonathan Corbet,
Shuah Khan, linux-iio, linux-doc, linux-kernel
In-Reply-To: <CAMuHMdUcDrDR-1f5eXjdyDR3-HcTFFeGsmYU2whV=fQhzsCNtg@mail.gmail.com>
On Mon, 30 Mar 2026 10:09:28 +0200
Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Sat, 28 Feb 2026 at 11:51, Andy Shevchenko
> <andriy.shevchenko@intel.com> wrote:
> > On Fri, Feb 27, 2026 at 02:08:33PM -0500, Ammar Mustafa wrote:
> > > Correct the ad7191 documentation to match the datasheet:
> > > - Fix inverted CLKSEL pin logic: device uses external clock when pin is
> > > inactive, and internal CMOS/crystal when high.
> >
> > high --> active
>
> Thanks for your patch, which is now commit d2a4ec19d2a2e54c ("Docs:
> iio: ad7191 Correct clock configuration") in char-misc-next and
> iio/togreg.
>
> That commit message still says "inactive" and "high", thus adding to
> the confustion.
>
> Gr{oetje,eeting}s,
>
> Geert
>
Ah. Sorry, I missed that and can't really do anything to fix it up
now :( All got a bit rushed as I was off on holiday.
Jonathan
^ permalink raw reply
* Re: [PATCH v11 net-next 09/11] net/nebula-matrix: add Dispatch layer implementation
From: Simon Horman @ 2026-04-12 12:59 UTC (permalink / raw)
To: illusion.wang
Cc: Simon Horman, alvin.wang, sam.chen, netdev, skhan, lorenzo,
open list, edumazet, linux-doc, enelsonmoore, andrew+netdev,
pabeni, lukas.bulwahn, kuba, hkallweit1, dimon.zhao, corbet,
vadim.fedorenko
In-Reply-To: <20260408093739.56001-10-illusion.wang@nebula-matrix.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net/nebula-matrix: add Dispatch layer implementation
This patch introduces a control-level routing mechanism for the Dispatch
layer with two routing paths: a direct path (Dispatch -> Resource -> HW)
and a channel path (Dispatch -> Channel). The implementation uses
interface-declared control levels and upper-layer configured control
levels to determine the routing.
> diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dispatch.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dispatch.c
> index 8b031f6e963f..eee41d10ad63 100644
> --- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dispatch.c
> +++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dispatch.c
> @@ -6,6 +6,396 @@
> #include <linux/pci.h>
> #include "nbl_dispatch.h"
>
> +static u16 nbl_disp_chan_get_vsi_id_req(struct nbl_dispatch_mgt *disp_mgt,
> + u16 func_id, u16 type)
> +{
> + struct nbl_channel_ops *chan_ops = disp_mgt->chan_ops_tbl->ops;
> + struct nbl_common_info *common = disp_mgt->common;
> + struct nbl_chan_param_get_vsi_id result = { 0 };
> + struct nbl_chan_param_get_vsi_id param = { 0 };
> + struct nbl_chan_send_info chan_send;
> +
> + param.type = type;
> +
> + NBL_CHAN_SEND(chan_send, common->mgt_pf, NBL_CHAN_MSG_GET_VSI_ID,
> + ¶m, sizeof(param), &result, sizeof(result), 1);
> + chan_ops->send_msg(disp_mgt->chan_ops_tbl->priv, &chan_send);
> +
> + return result.vsi_id;
> +}
Does nbl_disp_chan_get_vsi_id_req() ignore the func_id parameter?
The function accepts func_id but only sends param.type in the channel
message. The nbl_chan_param_get_vsi_id struct appears to have only
{vsi_id, type} fields, with no func_id field.
On the responder side:
> +static void nbl_disp_chan_get_vsi_id_resp(void *priv, u16 src_id, u16 msg_id,
> + void *data, u32 data_len)
> +{
> + struct nbl_dispatch_mgt *disp_mgt = (struct nbl_dispatch_mgt *)priv;
> + struct nbl_channel_ops *chan_ops = disp_mgt->chan_ops_tbl->ops;
> + struct nbl_resource_ops *res_ops = disp_mgt->res_ops_tbl->ops;
> + struct nbl_resource_mgt *p = disp_mgt->res_ops_tbl->priv;
> + struct device *dev = disp_mgt->common->dev;
> + struct nbl_chan_param_get_vsi_id *param;
> + struct nbl_chan_param_get_vsi_id result = { 0 };
> + struct nbl_chan_ack_info chan_ack;
> + int err = NBL_CHAN_RESP_OK;
> + int ret;
> +
> + param = (struct nbl_chan_param_get_vsi_id *)data;
> +
> + result.vsi_id =
> + NBL_OPS_CALL_RET(res_ops->get_vsi_id, (p, src_id, param->type));
nbl_disp_chan_get_vsi_id_resp() uses src_id (the channel source) instead
of func_id. This creates a behavioral inconsistency with the direct path:
> +static u16 nbl_disp_get_vsi_id(struct nbl_dispatch_mgt *disp_mgt, u16 func_id,
> + u16 type)
> +{
> + struct nbl_resource_ops *res_ops = disp_mgt->res_ops_tbl->ops;
> + struct nbl_resource_mgt *p = disp_mgt->res_ops_tbl->priv;
> +
> + return NBL_OPS_CALL_RET(res_ops->get_vsi_id, (p, func_id, type));
> +}
The direct path correctly passes func_id to res_ops->get_vsi_id(p, func_id,
type).
When the channel path is active, callers cannot query VSI IDs for
functions other than the sender. Is the func_id parameter intended to be
used in nbl_disp_chan_get_vsi_id_req()?
[ ... ]
^ permalink raw reply
* Re: [PATCH v11 net-next 05/11] net/nebula-matrix: add channel layer
From: Simon Horman @ 2026-04-12 12:59 UTC (permalink / raw)
To: illusion.wang
Cc: Simon Horman, dimon.zhao, andrew+netdev, alvin.wang, corbet,
enelsonmoore, netdev, kuba, sam.chen, vadim.fedorenko, edumazet,
skhan, hkallweit1, lukas.bulwahn, lorenzo, open list, linux-doc,
pabeni
In-Reply-To: <20260408093739.56001-6-illusion.wang@nebula-matrix.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net/nebula-matrix: add channel layer
This commit adds a channel management layer for the nebula-matrix
network driver, providing mailbox-based communication between PF0
and other PFs with message send/receive, ACK handling, and
hash-based message handler lookup.
> diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_channel/nbl_channel.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_channel/nbl_channel.c
> --- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_channel/nbl_channel.c
> +++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_channel/nbl_channel.c
[ ... ]
> +static int nbl_chan_send_msg(struct nbl_channel_mgt *chan_mgt,
> + struct nbl_chan_send_info *chan_send)
> +{
[ ... ]
> + if (test_bit(NBL_CHAN_INTERRUPT_READY, chan_info->state)) {
> + ret = wait_event_timeout(wait_head->wait_queue,
> + wait_head->acked,
> + NBL_CHAN_ACK_WAIT_TIME);
> + if (!ret) {
> + wait_head->status = NBL_MBX_STATUS_TIMEOUT;
> + dev_err(dev,
> + "Channel waiting ack failed, message type: %d, msg id: %u\n",
> + chan_send->msg_type, msgid.id);
> + return -ETIMEDOUT;
> + }
[ ... ]
> + /*polling wait mailbox ack*/
> + while (i--) {
> + nbl_chan_clean_queue(chan_mgt, chan_info);
> +
> + if (wait_head->acked) {
> + chan_send->ack_len = wait_head->ack_data_len;
> + wait_head->status = NBL_MBX_STATUS_IDLE;
> + return wait_head->ack_err;
> + }
> + usleep_range(NBL_CHAN_TX_WAIT_ACK_US_MIN,
> + NBL_CHAN_TX_WAIT_ACK_US_MAX);
> + }
> +
> + wait_head->status = NBL_MBX_STATUS_TIMEOUT;
> + dev_err(dev,
> + "Channel polling ack failed, message type: %d msg id: %u\n",
> + chan_send->msg_type, msgid.id);
> +
> + return -EFAULT;
^^^^^^
The interrupt-based ACK wait path above returns -ETIMEDOUT on timeout,
but this polling-based ACK wait path returns -EFAULT. -EFAULT ("bad
address") is not the right error code for a timeout condition.
Should this also return -ETIMEDOUT for consistency? Any caller that
checks specifically for -ETIMEDOUT to handle timeouts would miss
this polling timeout case.
> +}
[ ... ]
> diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_common/nbl_common.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_common/nbl_common.c
> --- /dev/null
> +++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_common/nbl_common.c
[ ... ]
> +int nbl_common_create_wq(void)
> +{
> + wq_mgt = kzalloc_obj(*wq_mgt);
> + if (!wq_mgt)
> + return -ENOMEM;
> +
> + wq_mgt->ctrl_dev_wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_UNBOUND,
> + 0, "nbl_ctrldev_wq");
> + if (!wq_mgt->ctrl_dev_wq) {
> + pr_err("Failed to create workqueue nbl_ctrldev_wq\n");
> + goto alloc_ctrl_dev_wq_failed;
> + }
> +
> + return 0;
> +alloc_ctrl_dev_wq_failed:
> + kfree(wq_mgt);
> + return -ENOMEM;
> +}
When alloc_workqueue() fails, wq_mgt is freed but the static pointer
is not set to NULL. After this error path, wq_mgt becomes a dangling
pointer.
If nbl_common_queue_work() or nbl_common_destroy_wq() is called
afterwards, they dereference wq_mgt unconditionally:
nbl_common_queue_work():
queue_work(wq_mgt->ctrl_dev_wq, task);
nbl_common_destroy_wq():
destroy_workqueue(wq_mgt->ctrl_dev_wq);
kfree(wq_mgt);
Would adding "wq_mgt = NULL;" after kfree(wq_mgt) on the error path
prevent this?
[ ... ]
> diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_common/nbl_common.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_common/nbl_common.h
> --- /dev/null
> +++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_common/nbl_common.h
[ ... ]
> +/* it used for y_axis no necessay */
> +struct nbl_hash_entry_node {
> + struct hlist_node node;
> + void *key;
> + void *data;
> +};
This isn't a bug, but the comment "it used for y_axis no necessay"
appears garbled and "necessay" is misspelled. Could the comment be
rewritten to explain the purpose of this struct, or removed if it
is no longer relevant?
^ permalink raw reply
* Re: [PATCH v11 net-next 02/11] net/nebula-matrix: add our driver architecture
From: Simon Horman @ 2026-04-12 12:57 UTC (permalink / raw)
To: illusion.wang
Cc: Simon Horman, enelsonmoore, alvin.wang, linux-doc, hkallweit1,
edumazet, lukas.bulwahn, kuba, open list, pabeni, netdev,
sam.chen, corbet, vadim.fedorenko, lorenzo, dimon.zhao, skhan,
andrew+netdev
In-Reply-To: <20260408093739.56001-3-illusion.wang@nebula-matrix.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net/nebula-matrix: add our driver architecture
This commit introduces a layered architecture for the nebula-matrix
driver, consisting of Device, Dispatch, Resource, Hardware, Channel, and
Common layers. It adds initialization and cleanup functions for each layer
with struct-based operation registration.
Signed-off-by: illusion.wang <illusion.wang@nebula-matrix.com>
> diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c
> index 10c3536b327b..db035136e09c 100644
> --- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c
> +++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c
[ ... ]
> +struct nbl_adapter *nbl_core_init(struct pci_dev *pdev,
> + struct nbl_init_param *param)
> +{
> + struct nbl_product_base_ops *product_base_ops;
> + struct nbl_common_info *common;
> + struct nbl_adapter *adapter;
> + int ret;
> +
> + adapter = devm_kzalloc(&pdev->dev, sizeof(*adapter), GFP_KERNEL);
> + if (!adapter)
> + return NULL;
> +
> + adapter->pdev = pdev;
> + common = &adapter->common;
> +
> + common->pdev = pdev;
> + common->dev = &pdev->dev;
> + common->has_ctrl = param->caps.has_ctrl;
> + common->has_net = param->caps.has_net;
> + common->pci_using_dac = param->pci_using_dac;
> + common->function = PCI_FUNC(pdev->devfn);
> + common->devid = PCI_SLOT(pdev->devfn);
> + common->bus = pdev->bus->number;
> + common->product_type = param->product_type;
> +
> + product_base_ops = nbl_core_setup_product_ops(adapter, param);
> +
> + /*
> + *every product's hw/chan/res layer has a great difference,
> + *so call their own init ops
> + */
> + ret = product_base_ops->hw_init(adapter);
> + if (ret)
> + goto hw_init_fail;
> +
> + ret = product_base_ops->chan_init(adapter);
> + if (ret)
> + goto chan_init_fail;
> +
> + ret = product_base_ops->res_init(adapter);
> + if (ret)
> + goto res_init_fail;
> +
> + ret = nbl_disp_init(adapter);
> + if (ret)
> + goto res_init_fail;
> +
> + ret = nbl_dev_init(adapter);
> + if (ret)
> + goto res_init_fail;
Does the error path correctly clean up all initialized layers?
The initialization order is:
1. hw_init
2. chan_init
3. res_init
4. nbl_disp_init
5. nbl_dev_init
When nbl_disp_init() or nbl_dev_init() fails, control jumps to the
'res_init_fail' label, which only tears down the channel and hardware
layers:
> +res_init_fail:
> + product_base_ops->chan_remove(adapter);
> +chan_init_fail:
> + product_base_ops->hw_remove(adapter);
However, res_init has already succeeded at this point and is never cleaned
up. Looking at nbl_product_base_ops, there is no res_remove callback
defined, and nbl_core_remove() also shows the same asymmetry:
> +void nbl_core_remove(struct nbl_adapter *adapter)
> +{
> + struct nbl_product_base_ops *product_base_ops;
> +
> + product_base_ops = adapter->product_base_ops;
> + nbl_dev_remove(adapter);
> + product_base_ops->chan_remove(adapter);
> + product_base_ops->hw_remove(adapter);
> +}
This function calls nbl_dev_remove(), chan_remove(), and hw_remove(), but
never tears down the resource layer or the dispatch layer.
Currently, all allocations in res_init (nbl_res_init_leonis) and
nbl_disp_init use devm_kzalloc, so resources are auto-freed. However, the
error path structure is concretely wrong: if res_init or disp_init ever
needs non-devm resources (as nbl_hw_init_leonis already uses ioremap and
pci_request_selected_regions), the missing cleanup will cause resource
leaks.
Should the init/remove paths be symmetric, with every successful init
having a corresponding remove in both the error cleanup path and
nbl_core_remove()?
^ permalink raw reply
* Re: [PATCH v4] Documentation: Refactored watchdog old doc
From: Guenter Roeck @ 2026-04-12 12:36 UTC (permalink / raw)
To: Sunny Patel, linux-doc; +Cc: linux-watchdog, linux-kernel, corbet, wim, rdunlap
In-Reply-To: <20260412095338.52271-1-nueralspacetech@gmail.com>
On 4/12/26 02:53, Sunny Patel wrote:
> Mark WDIOC_GETTEMP and WDIOS_TEMPPANIC as deprecated since
> neither is implemented by the watchdog core and both are only
> present in a small number of legacy drivers.
>
> Add documentation for previously undocumented status bits
> WDIOF_MAGICCLOSE and WDIOF_ALARMONLY in the options field.
>
> Add documentation for WDIOF_PRETIMEOUT and WDIOF_SETTIMEOUT
> status bits describing their respective ioctls.
>
> Fix the following issues in existing documentation:
> - Remove version-specific reference to Linux 2.4.18 from
> the GETTIMEOUT ioctl description
> - Fix duplicate "was is" in printf format strings
> - Replace [FIXME] placeholder with proper descriptions for
> WDIOS_DISABLECARD, WDIOS_ENABLECARD and WDIOS_TEMPPANIC
>
> Signed-off-by: Sunny Patel <nueralspacetech@gmail.com>
> ---
>
> Changes in v4:
> - Fixed WDIOS_DISABLECARD description: corrected inverted logic —
> the ioctl disables the hardware timer entirely rather than
> stopping pings. Clarified that userspace, not the kernel driver,
> is primarily responsible for pinging under normal operation.
>
> Apologies for the broken mail threading on v2 and v3 as well.
>
> Documentation/watchdog/watchdog-api.rst | 65 +++++++++++++++++++++----
> 1 file changed, 55 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/watchdog/watchdog-api.rst b/Documentation/watchdog/watchdog-api.rst
> index 78e228c272cf..43ca6b2bbeff 100644
> --- a/Documentation/watchdog/watchdog-api.rst
> +++ b/Documentation/watchdog/watchdog-api.rst
> @@ -2,7 +2,7 @@
> The Linux Watchdog driver API
> =============================
>
> -Last reviewed: 10/05/2007
> +Last reviewed: 04/08/2026
>
>
>
> @@ -42,7 +42,7 @@ activates as soon as /dev/watchdog is opened and will reboot unless
> the watchdog is pinged within a certain time, this time is called the
> timeout or margin. The simplest way to ping the watchdog is to write
> some data to the device. So a very simple watchdog daemon would look
> -like this source file: see samples/watchdog/watchdog-simple.c
> +like this source file: see samples/watchdog/watchdog-simple.c
>
> A more advanced driver could for example check that a HTTP server is
> still responding before doing the write call to ping the watchdog.
> @@ -106,11 +106,10 @@ the requested one due to limitation of the hardware::
> This example might actually print "The timeout was set to 60 seconds"
> if the device has a granularity of minutes for its timeout.
>
> -Starting with the Linux 2.4.18 kernel, it is possible to query the
> -current timeout using the GETTIMEOUT ioctl::
> +It is also possible to get the current timeout with the GETTIMEOUT ioctl::
>
> ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
> - printf("The timeout was is %d seconds\n", timeout);
> + printf("The timeout is %d seconds\n", timeout);
>
> Pretimeouts
> ===========
> @@ -133,7 +132,7 @@ seconds. Setting a pretimeout to zero disables it.
> There is also a get function for getting the pretimeout::
>
> ioctl(fd, WDIOC_GETPRETIMEOUT, &timeout);
> - printf("The pretimeout was is %d seconds\n", timeout);
> + printf("The pretimeout is %d seconds\n", timeout);
>
> Not all watchdog drivers will support a pretimeout.
>
> @@ -145,12 +144,12 @@ before the system will reboot. The WDIOC_GETTIMELEFT is the ioctl
> that returns the number of seconds before reboot::
>
> ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);
> - printf("The timeout was is %d seconds\n", timeleft);
> + printf("The timeout is %d seconds\n", timeleft);
Sashiko:
This isn't a bug, but should this printf example use "time left" instead of
"timeout" since it is printing the result of the WDIOC_GETTIMELEFT ioctl?
It has a point. Please change.
Thanks,
Guenter
>
> Environmental monitoring
> ========================
>
> -All watchdog drivers are required return more information about the system,
> +All watchdog drivers are required to return more information about the system,
> some do temperature, fan and power level monitoring, some can tell you
> the reason for the last reboot of the system. The GETSUPPORT ioctl is
> available to ask what the device can do::
> @@ -227,12 +226,33 @@ The watchdog saw a keepalive ping since it was last queried.
> WDIOF_SETTIMEOUT Can set/get the timeout
> ================ =======================
>
> -The watchdog can do pretimeouts.
> +The watchdog supports timeout set/get via the WDIOC_SETTIMEOUT and
> +WDIOC_GETTIMEOUT ioctls.
>
> ================ ================================
> WDIOF_PRETIMEOUT Pretimeout (in seconds), get/set
> ================ ================================
>
> +The watchdog supports a pretimeout, a warning interrupt that fires before
> +the actual reboot timeout. Use WDIOC_SETPRETIMEOUT and WDIOC_GETPRETIMEOUT
> +to set/get the pretimeout.
> +
> + ================ ================================
> + WDIOF_MAGICCLOSE Supports magic close char
> + ================ ================================
> +
> +The driver supports the Magic Close feature. The watchdog is only disabled
> +if the character 'V' is written to /dev/watchdog before the file descriptor
> +is closed. Without writing 'V' before closing, the watchdog remains active
> +and will trigger a reboot after the timeout expires.
> +
> + ================ ================================
> + WDIOF_ALARMONLY Not a reboot watchdog
> + ================ ================================
> +
> +The watchdog will not reboot the system when it expires. Instead it
> +triggers a management or other external alarm. Userspace should not
> +rely on a system reboot occurring.
>
> For those drivers that return any bits set in the option field, the
> GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current
> @@ -254,6 +274,11 @@ returned value is the temperature in degrees Fahrenheit::
> int temperature;
> ioctl(fd, WDIOC_GETTEMP, &temperature);
>
> +.. note::
> + ``WDIOC_GETTEMP`` is not implemented by the watchdog core and is
> + considered deprecated. It is only supported by a small number of
> + legacy drivers. New drivers should not implement it.
> +
> Finally the SETOPTIONS ioctl can be used to control some aspects of
> the cards operation::
>
> @@ -268,4 +293,24 @@ The following options are available:
> WDIOS_TEMPPANIC Kernel panic on temperature trip
> ================= ================================
>
> -[FIXME -- better explanations]
> +``WDIOS_DISABLECARD`` disables the hardware watchdog timer entirely,
> +allowing a controlled system shutdown without triggering a reboot.
> +Userspace is responsible for pinging the watchdog under normal
> +operation; this ioctl stops the underlying hardware timer so that
> +the absence of pings no longer causes a system reset.
> +
> +``WDIOS_ENABLECARD`` starts the watchdog timer. If the watchdog was
> +previously stopped via ``WDIOS_DISABLECARD``, this will re-enable it. The
> +hardware watchdog will begin counting down from the configured timeout.
> +
> +``WDIOS_TEMPPANIC`` enables temperature-based kernel panic. When set,
> +the driver will call ``panic()`` (or ``kernel_power_off()`` on some
> +drivers) if the hardware temperature sensor exceeds its threshold,
> +rather than only setting the ``WDIOF_OVERHEAT`` status bit. Support
> +for this option is driver-specific; not all watchdog drivers implement
> +temperature monitoring.
> +
> +.. note::
> + ``WDIOS_TEMPPANIC`` is not implemented by the watchdog core and is
> + considered deprecated. It is only present in a small number of
> + legacy drivers. New drivers should not implement it.
^ permalink raw reply
* [PATCH ipsec-next v7 00/14] xfrm: XFRM_MSG_MIGRATE_STATE new netlink message
From: Antony Antony @ 2026-04-12 11:13 UTC (permalink / raw)
To: Antony Antony, Steffen Klassert, Herbert Xu, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
David Ahern, Masahide NAKAMURA, Paul Moore, Stephen Smalley,
Ondrej Mosnacek, Jonathan Corbet, Shuah Khan
Cc: netdev, linux-kernel, selinux, linux-doc, Chiachang Wang, Yan Yan,
devel
The current XFRM_MSG_MIGRATE interface is tightly coupled to policy and
SA migration, and it lacks the information required to reliably migrate
individual SAs. This makes it unsuitable for IKEv2 deployments,
dual-stack setups (IPv4/IPv6), and scenarios where policies are managed
externally (e.g., by daemons other than the IKE daemon).
Mandatory SA selector list
The current API requires a non-empty SA selector list, which does not
reflect the IKEv2 use case.
A single Child SA may correspond to multiple policies,
and SA discovery already occurs via address and reqid matching. With
dual-stack Child SAs this leads to excessive churn: the current method
would have to be called up to six times (in/out/fwd × v4/v6) on SA,
while the new method only requires two calls.
Selectors lack SPI (and marks)
XFRM_MSG_MIGRATE cannot uniquely identify an SA when multiple SAs share
the same policies (per-CPU SAs, SELinux label-based SAs, etc.). Without
the SPI, the kernel may update the wrong SA instance.
Reqid cannot be changed
Some implementations allocate reqids based on traffic selectors. In
host-to-host or selector-changing scenarios, the reqid must change,
which the current API cannot express.
Because strongSwan and other implementations manage policies
independently of the kernel, an interface that updates only a specific
SA - with complete and unambiguous identification - is required.
SA Selector, x->sel, can't be changed, especially Transport mode.
XFRM_MSG_MIGRATE_STATE provides that interface. It supports migration
of a single SA via xfrm_usersa_id (including SPI) and we fix
encap removal in this patch set, reqid updates, address changes,
and other SA-specific parameters. It avoids the structural limitations
of XFRM_MSG_MIGRATE and provides a simpler, extensible mechanism for
precise per-SA migration without involving policies.
This method also allows migtrating SA selectors typically used with
host-to-host in Transport mode.
New migration steps: first install block policy, remove the old policy,
call XFRM_MSG_MIGRATE_STATE for each state, then re-install the
policies and remove the block policy.
If the target SA tuple (daddr, SPI, proto, family) is already
occupied, the operation returns -EEXIST. In this case the original
SA is not preserved. Userspace must handle -EEXIST by
re-establishing the SA at the IKE level and manage policies.
---
v6->v7: - add SA selectoor migration
- fixes to commit messages
- white space removal
Link to v6: https://lore.kernel.org/r/migrate-state-v6-0-9df9764ddb9e@secunet.com
v5->v6: - add mark to look up SA.
- restrict netlink attributes in new method
- address review feedback from Sabrina
- add new patch to fix existing inter-family address comparison
- add extack xfrm_state_init()
- Feedback from Yan : omit-to-inherit add migrating marks
- Drop missing __rcu annotation on nlsk, Sabrina has a better patch
Link to v5: https://lore.kernel.org/all/cover.1769509130.git.antony.antony@secunet.com/
v4->v5: add synchronize after migrate and delete it inside a lock
- split xfrm_state_migrate into create and install functions
Link to v4: https://lore.kernel.org/all/cover.1768811736.git.antony.antony@secunet.com/
v3->v4: add patch to fix pre-existing missing __rcu annotation on nlsk
v2->v3: - fix commit message formatting
v1->v2: dropped 6/6. That check is already there where the func is called
- merged patch 4/6 and 5/6, to fix use uninitialized value
- fix commit messages
---
Antony Antony (14):
xfrm: remove redundant assignments
xfrm: add extack to xfrm_init_state
xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
xfrm: fix NAT-related field inheritance in SA migration
xfrm: rename reqid in xfrm_migrate
xfrm: split xfrm_state_migrate into create and install functions
xfrm: check family before comparing addresses in migrate
xfrm: add state synchronization after migration
xfrm: add error messages to state migration
xfrm: move encap and xuo into struct xfrm_migrate
xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
Documentation/networking/xfrm/index.rst | 1 +
.../networking/xfrm/xfrm_migrate_state.rst | 230 ++++++++++++++
include/net/xfrm.h | 78 ++++-
include/uapi/linux/xfrm.h | 21 ++
net/ipv4/ipcomp.c | 2 +-
net/ipv6/ipcomp6.c | 2 +-
net/key/af_key.c | 12 +-
net/xfrm/xfrm_device.c | 2 +-
net/xfrm/xfrm_policy.c | 27 +-
net/xfrm/xfrm_state.c | 144 +++++----
net/xfrm/xfrm_user.c | 344 ++++++++++++++++++++-
security/selinux/nlmsgtab.c | 3 +-
12 files changed, 769 insertions(+), 97 deletions(-)
---
base-commit: be14d13625c9b070c33c423026b598ed65695225
change-id: migrate-state-063ee0342680
Best regards,
--
Antony Antony <antony.antony@secunet.com>
^ permalink raw reply
* [PATCH ipsec-next v7 14/14] xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
From: Antony Antony @ 2026-04-12 11:16 UTC (permalink / raw)
To: Antony Antony, Steffen Klassert, Herbert Xu, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
David Ahern, Masahide NAKAMURA, Paul Moore, Stephen Smalley,
Ondrej Mosnacek, Jonathan Corbet, Shuah Khan
Cc: netdev, linux-kernel, selinux, linux-doc, Chiachang Wang, Yan Yan,
devel
In-Reply-To: <migrate-state-v7-0-44eb2440b91c@secunet.com>
Add documentation for the new XFRM_MSG_MIGRATE_STATE netlink message,
which migrates a single SA identified by SPI and mark without involving
policies.
The document covers the motivation and design differences from the
existing XFRM_MSG_MIGRATE, the SA lookup mechanism, supported attributes
with their omit-to-inherit semantics, and usage examples.
Signed-off-by: Antony Antony <antony.antony@secunet.com>
---
v6->v7: update docs to reflect the flags
v5->v6: added this patch
---
Documentation/networking/xfrm/index.rst | 1 +
.../networking/xfrm/xfrm_migrate_state.rst | 230 +++++++++++++++++++++
2 files changed, 231 insertions(+)
diff --git a/Documentation/networking/xfrm/index.rst b/Documentation/networking/xfrm/index.rst
index 7d866da836fe..90191848f8db 100644
--- a/Documentation/networking/xfrm/index.rst
+++ b/Documentation/networking/xfrm/index.rst
@@ -9,5 +9,6 @@ XFRM Framework
xfrm_device
xfrm_proc
+ xfrm_migrate_state
xfrm_sync
xfrm_sysctl
diff --git a/Documentation/networking/xfrm/xfrm_migrate_state.rst b/Documentation/networking/xfrm/xfrm_migrate_state.rst
new file mode 100644
index 000000000000..1e0d77f0e043
--- /dev/null
+++ b/Documentation/networking/xfrm/xfrm_migrate_state.rst
@@ -0,0 +1,230 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+XFRM SA Migrate State
+=====================
+
+Overview
+========
+
+``XFRM_MSG_MIGRATE_STATE`` migrates a single SA, looked up using SPI and
+mark, without involving policies. Unlike ``XFRM_MSG_MIGRATE``, which couples
+SA and policy migration and allows migrating multiple SAs in one call, this
+interface identifies the SA unambiguously via SPI and supports changing
+the reqid, addresses, encapsulation, selector, and offload.
+
+Because IKE daemons such as *wan manage policies independently of
+the kernel, this interface allows precise per-SA migration without
+requiring policy involvement. Optional XFRM attributes follow an
+omit-to-inherit model: omitting an attribute preserves the value from
+the old SA. Hardware offload is an exception. It is inherited by default
+but can be disabled with the ``XFRM_MIGRATE_STATE_NO_OFFLOAD``
+flag or set to a new offload configuration with the
+``XFRMA_OFFLOAD_DEV`` attribute.
+
+SA Identification
+=================
+
+The struct is defined in ``include/uapi/linux/xfrm.h``. The SA is looked
+up using ``xfrm_state_lookup()`` with ``id.spi``,
+``id.daddr``, ``id.proto``, ``id.family``, and
+``old_mark.v & old_mark.m`` as the mark key::
+
+ struct xfrm_user_migrate_state {
+ struct xfrm_usersa_id id; /* spi, daddr, proto, family */
+ xfrm_address_t new_daddr;
+ xfrm_address_t new_saddr;
+ struct xfrm_mark old_mark; /* SA lookup: key = v & m */
+ struct xfrm_selector new_sel; /* new selector (see Flags) */
+ __u32 new_reqid;
+ __u32 flags; /* XFRM_MIGRATE_STATE_* */
+ __u16 new_family;
+ __u16 reserved;
+ };
+
+Supported Attributes
+====================
+
+The following fields in ``xfrm_user_migrate_state`` are always explicit
+and are not inherited from the existing SA. Passing zero is not equivalent
+to "keep unchanged" — zero is used as-is:
+
+- ``new_daddr`` - new destination address
+- ``new_saddr`` - new source address
+- ``new_family`` - new address family
+- ``new_reqid`` - new reqid (0 = no reqid)
+- ``new_sel`` - new selector; used when ``XFRM_MIGRATE_STATE_UPDATE_SEL`` is
+ not set (see `Flags`_ below)
+- ``flags`` - bitmask of ``XFRM_MIGRATE_STATE_*`` flags (see `Flags`_ below)
+
+The following netlink attributes are also accepted. Omitting an attribute
+inherits the value from the existing SA (omit-to-inherit).
+
+.. list-table::
+ :widths: 30 70
+ :header-rows: 1
+
+ * - Attribute
+ - Description
+ * - ``XFRMA_MARK``
+ - Mark on the migrated SA (``struct xfrm_mark``). Absent inherits
+ ``old_mark``. To use no mark on the new SA, send ``XFRMA_MARK``
+ with ``{0, 0}``.
+ * - ``XFRMA_ENCAP``
+ - UDP encapsulation template; only ``UDP_ENCAP_ESPINUDP`` is supported.
+ Set ``encap_type=0`` to remove encap.
+ * - ``XFRMA_OFFLOAD_DEV``
+ - Hardware offload configuration (``struct xfrm_user_offload``). Absent
+ copies offload from the existing SA. When
+ ``XFRM_MIGRATE_STATE_NO_OFFLOAD`` is set in ``flags``, the new SA has
+ no offload; this flag is mutually exclusive with ``XFRMA_OFFLOAD_DEV``
+ and sending both returns ``-EINVAL``.
+ * - ``XFRMA_SET_MARK``
+ - Output mark on the migrated SA; pair with ``XFRMA_SET_MARK_MASK``.
+ Send 0 to clear.
+ * - ``XFRMA_NAT_KEEPALIVE_INTERVAL``
+ - NAT keepalive interval in seconds. Requires encap. Send 0 to clear.
+ Automatically cleared when encap is removed; setting a non-zero
+ value without encap returns ``-EINVAL``.
+ * - ``XFRMA_MTIMER_THRESH``
+ - Mapping maxage threshold. Requires encap. Send 0 to clear.
+ Automatically cleared when encap is removed; setting a non-zero
+ value without encap returns ``-EINVAL``.
+
+The following SA properties are immutable and cannot be changed via
+``XFRM_MSG_MIGRATE_STATE``: algorithms (``XFRMA_ALG_*``), replay state,
+direction (``XFRMA_SA_DIR``), and security context (``XFRMA_SEC_CTX``).
+
+Flags
+=====
+
+The ``flags`` field in ``xfrm_user_migrate_state`` controls optional
+migration behaviour. Unknown flag bits are rejected with ``-EINVAL``.
+
+.. list-table::
+ :widths: 40 60
+ :header-rows: 1
+
+ * - Flag
+ - Description
+ * - ``XFRM_MIGRATE_STATE_NO_OFFLOAD``
+ - When set, the new SA has no hardware offload even when
+ ``XFRMA_OFFLOAD_DEV`` is absent. Without this flag, omitting
+ ``XFRMA_OFFLOAD_DEV`` copies the existing offload to the new SA.
+ Mutually exclusive with ``XFRMA_OFFLOAD_DEV``; sending both
+ returns ``-EINVAL``.
+ * - ``XFRM_MIGRATE_STATE_UPDATE_SEL``
+ - When set, the kernel validates that the existing SA selector is a
+ single-host entry matching the SA addresses (``prefixlen_s ==
+ prefixlen_d`` equal to 32 for IPv4 or 128 for IPv6, and addresses
+ matching ``id.daddr`` and ``props.saddr``). If the check passes,
+ the new selector is derived from ``new_daddr`` and ``new_saddr``
+ with the single-host mask for ``new_family``. A mismatch returns
+ ``-EINVAL``. When this flag is not set, ``new_sel`` is used as-is
+ for the migrated SA.
+
+Migration Steps
+===============
+
+#. Install a block policy to drop traffic on the affected selector.
+#. Remove the old policy.
+#. Call ``XFRM_MSG_MIGRATE_STATE`` for each SA.
+#. Reinstall the policies.
+#. Remove the block policy.
+
+Block Policy and IV Safety
+--------------------------
+
+Installing a block policy before migration is required to prevent
+traffic leaks and IV reuse.
+
+AES-GCM IV uniqueness is critical: reusing a (key, IV) pair allows
+an attacker to recover the authentication subkey and forge
+authentication tags, breaking both confidentiality and integrity.
+
+``XFRM_MSG_MIGRATE_STATE`` atomically copies the sequence number and
+replay window from the old SA to the new SA and deletes the old SA.
+The block policy ensures no outgoing packets are sent in the migration
+window, preventing IV reuse under the same key.
+
+Feature Detection
+=================
+
+Userspace can probe for kernel support by sending a minimal
+``XFRM_MSG_MIGRATE_STATE`` message with a non-existent SPI:
+
+- ``-ENOPROTOOPT``: not supported (``CONFIG_XFRM_MIGRATE`` not enabled)
+- any other error: supported
+
+Userspace Notification on Success
+=================================
+
+On successful migration the kernel multicasts an
+``XFRM_MSG_MIGRATE_STATE`` message to the ``XFRMNLGRP_MIGRATE`` group.
+The fixed header is ``struct xfrm_user_migrate_state`` copied from the
+request, followed by the same set of netlink attributes that are
+accepted as input, with the differences noted below.
+
+Differences from the request
+-----------------------------
+
+.. list-table::
+ :widths: 25 75
+ :header-rows: 1
+
+ * - Field / Attribute
+ - Difference
+ * - ``new_sel``
+ - Contains the actual selector of the newly installed SA, not the
+ ``new_sel`` from the request. When
+ ``XFRM_MIGRATE_STATE_UPDATE_SEL`` is set the kernel derives the
+ selector from ``new_daddr`` / ``new_saddr``; the caller's
+ ``new_sel`` field is ignored in that case. The notification
+ always carries the real selector of the new SA.
+ * - ``XFRMA_SA_DIR``
+ - Present in the notification (set from the direction of the new
+ SA) but **not accepted as input** — direction is immutable.
+ * - ``flags``
+ - Echoed back as-is. ``XFRM_MIGRATE_STATE_NO_OFFLOAD`` and
+ ``XFRM_MIGRATE_STATE_UPDATE_SEL`` describe the request that was
+ made, not a property of the resulting SA.
+
+Attributes in the notification
+-------------------------------
+
+.. list-table::
+ :widths: 30 70
+ :header-rows: 1
+
+ * - Attribute
+ - Description
+ * - ``XFRMA_ENCAP``
+ - UDP encapsulation template, if configured on the new SA.
+ * - ``XFRMA_OFFLOAD_DEV``
+ - Hardware offload configuration, if active on the new SA.
+ * - ``XFRMA_MARK``
+ - Mark on the new SA, if set.
+ * - ``XFRMA_SET_MARK``
+ - Output mark on the new SA, if set.
+ * - ``XFRMA_SET_MARK_MASK``
+ - Output mark mask, present together with ``XFRMA_SET_MARK``.
+ * - ``XFRMA_MTIMER_THRESH``
+ - Mapping maxage threshold, if non-zero.
+ * - ``XFRMA_NAT_KEEPALIVE_INTERVAL``
+ - NAT keepalive interval, if non-zero.
+ * - ``XFRMA_SA_DIR``
+ - Direction of the new SA.
+
+Error Handling
+==============
+
+If the target SA tuple (daddr, SPI, proto, family) is occupied by an existing
+unrelated SA, the operation returns ``-EEXIST``. In this case both the old and
+the new SA are gone. The old SA cannot be restored as doing so would risk
+duplicate sequence number and IV reuse, which must not occur. Userspace should
+handle ``-EEXIST``, for example by re-establishing the SA at the IKE level.
+
+If the multicast notification (``XFRMNLGRP_MIGRATE``) fails to send,
+the migration itself has already completed successfully and the new SA
+is installed. The operation returns success, 0, with an extack warning,
+but listeners will not receive the migration event.
--
2.47.3
^ permalink raw reply related
* [PATCH ipsec-next v7 13/14] xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
From: Antony Antony @ 2026-04-12 11:16 UTC (permalink / raw)
To: Antony Antony, Steffen Klassert, Herbert Xu, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
David Ahern, Masahide NAKAMURA, Paul Moore, Stephen Smalley,
Ondrej Mosnacek, Jonathan Corbet, Shuah Khan
Cc: netdev, linux-kernel, selinux, linux-doc, Chiachang Wang, Yan Yan,
devel
In-Reply-To: <migrate-state-v7-0-44eb2440b91c@secunet.com>
Only accept XFRMA used in this method, reject the rest.
Signed-off-by: Antony Antony <antony.antony@secunet.com>
---
v5->v6: added this patch
---
net/xfrm/xfrm_user.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 46e506548122..441e6b1fed10 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -3721,6 +3721,30 @@ static int xfrm_reject_unused_attr(int type, struct nlattr **attrs,
}
}
+ if (type == XFRM_MSG_MIGRATE_STATE) {
+ int i;
+
+ for (i = 0; i <= XFRMA_MAX; i++) {
+ if (!attrs[i])
+ continue;
+
+ switch (i) {
+ case XFRMA_MARK:
+ case XFRMA_ENCAP:
+ case XFRMA_OFFLOAD_DEV:
+ case XFRMA_SET_MARK:
+ case XFRMA_SET_MARK_MASK:
+ case XFRMA_MTIMER_THRESH:
+ case XFRMA_NAT_KEEPALIVE_INTERVAL:
+ break;
+ default:
+ NL_SET_ERR_MSG_ATTR(extack, attrs[i],
+ "Unsupported attribute in XFRM_MSG_MIGRATE_STATE");
+ return -EINVAL;
+ }
+ }
+ }
+
return 0;
}
--
2.47.3
^ permalink raw reply related
* [PATCH ipsec-next v7 12/14] xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
From: Antony Antony @ 2026-04-12 11:16 UTC (permalink / raw)
To: Antony Antony, Steffen Klassert, Herbert Xu, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
David Ahern, Masahide NAKAMURA, Paul Moore, Stephen Smalley,
Ondrej Mosnacek, Jonathan Corbet, Shuah Khan
Cc: netdev, linux-kernel, selinux, linux-doc, Chiachang Wang, Yan Yan,
devel
In-Reply-To: <migrate-state-v7-0-44eb2440b91c@secunet.com>
Add a new netlink method to migrate a single xfrm_state.
Unlike the existing migration mechanism (SA + policy), this
supports migrating only the SA and allows changing the reqid.
The SA is looked up via xfrm_usersa_id, which uniquely
identifies it, so old_saddr is not needed. old_daddr is carried in
xfrm_usersa_id.daddr.
The reqid is invariant in the old migration.
Signed-off-by: Antony Antony <antony.antony@secunet.com>
---
v6->v7: - add flags field to xfrm_user_migrate_state (based on Sabrina's feedback)
- add XFRM_MIGRATE_STATE_NO_OFFLOAD (bit 0): suppresses offload
- omit-to-inherit; mutually exclusive with XFRMA_OFFLOAD_DEV
- zero-initialize struct xfrm_migrate m[XFRM_MAX_DEPTH]
- add struct xfrm_selector new_sel to xfrm_user_migrate_state
- add XFRM_MIGRATE_STATE_UPDATE_SEL: derive new selector
from SA addresses when old selector is a single-host match
v5->v6: - (Feedback from Sabrina's review)
- reqid change: use xfrm_state_add, not xfrm_state_insert
- encap and xuo: use nla_data() directly, no kmemdup needed
- notification failure is non-fatal: set extack warning, return 0
- drop state direction, x->dir, check, not required
- reverse xmas tree local variable ordering
- use NL_SET_ERR_MSG_WEAK for clone failure message
- fix implicit padding in xfrm_user_migrate_state uapi struct
- support XFRMA_SET_MARK/XFRMA_SET_MARK_MASK in XFRM_MSG_MIGRATE_STATE
v4->v5: - set portid, seq in XFRM_MSG_MIGRATE_STATE netlink notification
- rename error label to out for clarity
- add locking and synchronize after cloning
- change some if(x) to if(!x) for clarity
- call __xfrm_state_delete() inside the lock
- return error from xfrm_send_migrate_state() instead of always returning 0
v3->v4: preserve reqid invariant for each state migrated
v2->v3: free the skb on the error path
v1->v2: merged next patch here to fix use uninitialized value
- removed unnecessary inline
- added const when possible
---
include/net/xfrm.h | 16 ++-
include/uapi/linux/xfrm.h | 21 ++++
net/xfrm/xfrm_device.c | 2 +-
net/xfrm/xfrm_policy.c | 19 +++
net/xfrm/xfrm_state.c | 29 +++--
net/xfrm/xfrm_user.c | 287 +++++++++++++++++++++++++++++++++++++++++++-
security/selinux/nlmsgtab.c | 3 +-
7 files changed, 363 insertions(+), 14 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 4b29ab92c2a7..e33e524cd909 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -684,12 +684,20 @@ struct xfrm_migrate {
xfrm_address_t new_saddr;
struct xfrm_encap_tmpl *encap;
struct xfrm_user_offload *xuo;
+ struct xfrm_mark old_mark;
+ struct xfrm_mark *new_mark;
+ struct xfrm_mark smark;
u8 proto;
u8 mode;
- u16 reserved;
+ u16 msg_type; /* XFRM_MSG_MIGRATE or XFRM_MSG_MIGRATE_STATE */
+ u32 flags;
u32 old_reqid;
+ u32 new_reqid;
+ u32 nat_keepalive_interval;
+ u32 mapping_maxage;
u16 old_family;
u16 new_family;
+ const struct xfrm_selector *new_sel;
};
#define XFRM_KM_TIMEOUT 30
@@ -2104,7 +2112,7 @@ void xfrm_dev_resume(struct sk_buff *skb);
void xfrm_dev_backlog(struct softnet_data *sd);
struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features, bool *again);
int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
- struct xfrm_user_offload *xuo,
+ const struct xfrm_user_offload *xuo,
struct netlink_ext_ack *extack);
int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
struct xfrm_user_offload *xuo, u8 dir,
@@ -2175,7 +2183,9 @@ static inline struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_fea
return skb;
}
-static inline int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, struct xfrm_user_offload *xuo, struct netlink_ext_ack *extack)
+static inline int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
+ const struct xfrm_user_offload *xuo,
+ struct netlink_ext_ack *extack)
{
return 0;
}
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index a23495c0e0a1..34d8ad5c4818 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -227,6 +227,9 @@ enum {
#define XFRM_MSG_SETDEFAULT XFRM_MSG_SETDEFAULT
XFRM_MSG_GETDEFAULT,
#define XFRM_MSG_GETDEFAULT XFRM_MSG_GETDEFAULT
+
+ XFRM_MSG_MIGRATE_STATE,
+#define XFRM_MSG_MIGRATE_STATE XFRM_MSG_MIGRATE_STATE
__XFRM_MSG_MAX
};
#define XFRM_MSG_MAX (__XFRM_MSG_MAX - 1)
@@ -507,6 +510,24 @@ struct xfrm_user_migrate {
__u16 new_family;
};
+struct xfrm_user_migrate_state {
+ struct xfrm_usersa_id id;
+ xfrm_address_t new_daddr;
+ xfrm_address_t new_saddr;
+ struct xfrm_mark old_mark;
+ struct xfrm_selector new_sel;
+ __u32 new_reqid;
+ __u32 flags;
+ __u16 new_family;
+ __u16 reserved;
+};
+
+/* Flags for xfrm_user_migrate_state.flags */
+enum xfrm_migrate_state_flags {
+ XFRM_MIGRATE_STATE_NO_OFFLOAD = 1, /* do not inherit offload from existing SA */
+ XFRM_MIGRATE_STATE_UPDATE_SEL = 2, /* update host-to-host selector from saddr and daddr */
+};
+
struct xfrm_user_mapping {
struct xfrm_usersa_id id;
__u32 reqid;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 52ae0e034d29..9d4c1addb98f 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -229,7 +229,7 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
EXPORT_SYMBOL_GPL(validate_xmit_xfrm);
int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
- struct xfrm_user_offload *xuo,
+ const struct xfrm_user_offload *xuo,
struct netlink_ext_ack *extack)
{
int err;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 0b5c7b51183a..3d6c778d8645 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -4635,6 +4635,22 @@ static int xfrm_migrate_check(const struct xfrm_migrate *m, int num_migrate,
return 0;
}
+/*
+ * Fill migrate fields that are invariant in XFRM_MSG_MIGRATE: inherited
+ * from the existing SA unchanged. XFRM_MSG_MIGRATE_STATE can update these.
+ */
+static void xfrm_migrate_copy_old(struct xfrm_migrate *mp,
+ const struct xfrm_state *x,
+ struct xfrm_mark *new_mark_buf)
+{
+ mp->smark = x->props.smark;
+ mp->new_reqid = x->props.reqid;
+ mp->nat_keepalive_interval = x->nat_keepalive_interval;
+ mp->mapping_maxage = x->mapping_maxage;
+ *new_mark_buf = x->mark;
+ mp->new_mark = new_mark_buf;
+}
+
int xfrm_migrate(const struct xfrm_selector *sel, u8 dir, u8 type,
struct xfrm_migrate *m, int num_migrate,
struct xfrm_kmaddress *k, struct net *net,
@@ -4642,6 +4658,7 @@ int xfrm_migrate(const struct xfrm_selector *sel, u8 dir, u8 type,
struct netlink_ext_ack *extack, struct xfrm_user_offload *xuo)
{
int i, err, nx_cur = 0, nx_new = 0;
+ struct xfrm_mark new_marks[XFRM_MAX_DEPTH] = {};
struct xfrm_policy *pol = NULL;
struct xfrm_state *x, *xc;
struct xfrm_state *x_cur[XFRM_MAX_DEPTH];
@@ -4674,6 +4691,8 @@ int xfrm_migrate(const struct xfrm_selector *sel, u8 dir, u8 type,
nx_cur++;
mp->encap = encap;
mp->xuo = xuo;
+ xfrm_migrate_copy_old(mp, x, &new_marks[i]);
+
xc = xfrm_state_migrate(x, mp, net, extack);
if (xc) {
x_new[nx_new] = xc;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 1ee114f8515d..25d54c44fd94 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1974,11 +1974,25 @@ static struct xfrm_state *xfrm_state_clone_and_setup(struct xfrm_state *orig,
goto out;
memcpy(&x->id, &orig->id, sizeof(x->id));
- memcpy(&x->sel, &orig->sel, sizeof(x->sel));
+ if (m->msg_type == XFRM_MSG_MIGRATE_STATE) {
+ if (m->flags & XFRM_MIGRATE_STATE_UPDATE_SEL) {
+ u8 prefixlen = (m->new_family == AF_INET6) ? 128 : 32;
+
+ memcpy(&x->sel, &orig->sel, sizeof(x->sel));
+ x->sel.family = m->new_family;
+ x->sel.prefixlen_d = prefixlen;
+ x->sel.prefixlen_s = prefixlen;
+ memcpy(&x->sel.daddr, &m->new_daddr, sizeof(x->sel.daddr));
+ memcpy(&x->sel.saddr, &m->new_saddr, sizeof(x->sel.saddr));
+ } else {
+ x->sel = *m->new_sel;
+ }
+ } else {
+ memcpy(&x->sel, &orig->sel, sizeof(x->sel));
+ }
memcpy(&x->lft, &orig->lft, sizeof(x->lft));
x->props.mode = orig->props.mode;
x->props.replay_window = orig->props.replay_window;
- x->props.reqid = orig->props.reqid;
if (orig->aalg) {
x->aalg = xfrm_algo_auth_clone(orig->aalg);
@@ -2011,8 +2025,8 @@ static struct xfrm_state *xfrm_state_clone_and_setup(struct xfrm_state *orig,
x->encap = kmemdup(m->encap, sizeof(*x->encap), GFP_KERNEL);
if (!x->encap)
goto error;
- x->mapping_maxage = orig->mapping_maxage;
- x->nat_keepalive_interval = orig->nat_keepalive_interval;
+ x->mapping_maxage = m->mapping_maxage;
+ x->nat_keepalive_interval = m->nat_keepalive_interval;
}
if (orig->security)
@@ -2029,8 +2043,9 @@ static struct xfrm_state *xfrm_state_clone_and_setup(struct xfrm_state *orig,
if (xfrm_replay_clone(x, orig))
goto error;
- memcpy(&x->mark, &orig->mark, sizeof(x->mark));
- memcpy(&x->props.smark, &orig->props.smark, sizeof(x->props.smark));
+ x->mark = m->new_mark ? *m->new_mark : m->old_mark;
+
+ x->props.smark = m->smark;
x->props.flags = orig->props.flags;
x->props.extra_flags = orig->props.extra_flags;
@@ -2053,7 +2068,7 @@ static struct xfrm_state *xfrm_state_clone_and_setup(struct xfrm_state *orig,
goto error;
}
-
+ x->props.reqid = m->new_reqid;
x->props.family = m->new_family;
memcpy(&x->id.daddr, &m->new_daddr, sizeof(x->id.daddr));
memcpy(&x->props.saddr, &m->new_saddr, sizeof(x->props.saddr));
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index fe0cf824f072..46e506548122 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1318,7 +1318,7 @@ static int copy_to_user_encap(struct xfrm_encap_tmpl *ep, struct sk_buff *skb)
return 0;
}
-static int xfrm_smark_put(struct sk_buff *skb, struct xfrm_mark *m)
+static int xfrm_smark_put(struct sk_buff *skb, const struct xfrm_mark *m)
{
int ret = 0;
@@ -3059,6 +3059,25 @@ static int xfrm_add_acquire(struct sk_buff *skb, struct nlmsghdr *nlh,
}
#ifdef CONFIG_XFRM_MIGRATE
+static void copy_from_user_migrate_state(struct xfrm_migrate *ma,
+ const struct xfrm_user_migrate_state *um)
+{
+ memcpy(&ma->old_daddr, &um->id.daddr, sizeof(ma->old_daddr));
+ memcpy(&ma->new_daddr, &um->new_daddr, sizeof(ma->new_daddr));
+ memcpy(&ma->new_saddr, &um->new_saddr, sizeof(ma->new_saddr));
+
+ ma->proto = um->id.proto;
+ ma->new_reqid = um->new_reqid;
+
+ ma->old_family = um->id.family;
+ ma->new_family = um->new_family;
+
+ ma->old_mark = um->old_mark;
+ ma->flags = um->flags;
+ ma->new_sel = &um->new_sel;
+ ma->msg_type = XFRM_MSG_MIGRATE_STATE;
+}
+
static int copy_from_user_migrate(struct xfrm_migrate *ma,
struct xfrm_kmaddress *k,
struct nlattr **attrs, int *num,
@@ -3098,6 +3117,7 @@ static int copy_from_user_migrate(struct xfrm_migrate *ma,
ma->old_family = um->old_family;
ma->new_family = um->new_family;
+ ma->msg_type = XFRM_MSG_MIGRATE;
}
*num = i;
@@ -3108,7 +3128,7 @@ static int xfrm_do_migrate(struct sk_buff *skb, struct nlmsghdr *nlh,
struct nlattr **attrs, struct netlink_ext_ack *extack)
{
struct xfrm_userpolicy_id *pi = nlmsg_data(nlh);
- struct xfrm_migrate m[XFRM_MAX_DEPTH];
+ struct xfrm_migrate m[XFRM_MAX_DEPTH] = {};
struct xfrm_kmaddress km, *kmp;
u8 type;
int err;
@@ -3161,7 +3181,268 @@ static int xfrm_do_migrate(struct sk_buff *skb, struct nlmsghdr *nlh,
kfree(xuo);
return err;
}
+
+static int build_migrate_state(struct sk_buff *skb,
+ const struct xfrm_user_migrate_state *um,
+ const struct xfrm_migrate *m,
+ u8 dir, u32 portid, u32 seq)
+{
+ int err;
+ struct nlmsghdr *nlh;
+ struct xfrm_user_migrate_state *hdr;
+
+ nlh = nlmsg_put(skb, portid, seq, XFRM_MSG_MIGRATE_STATE,
+ sizeof(struct xfrm_user_migrate_state), 0);
+ if (!nlh)
+ return -EMSGSIZE;
+
+ hdr = nlmsg_data(nlh);
+ *hdr = *um;
+ hdr->new_sel = *m->new_sel;
+
+ if (m->encap) {
+ err = nla_put(skb, XFRMA_ENCAP, sizeof(*m->encap), m->encap);
+ if (err)
+ goto out_cancel;
+ }
+
+ if (m->xuo) {
+ err = nla_put(skb, XFRMA_OFFLOAD_DEV, sizeof(*m->xuo), m->xuo);
+ if (err)
+ goto out_cancel;
+ }
+
+ if (m->new_mark) {
+ err = nla_put(skb, XFRMA_MARK, sizeof(*m->new_mark),
+ m->new_mark);
+ if (err)
+ goto out_cancel;
+ }
+
+ err = xfrm_smark_put(skb, &m->smark);
+ if (err)
+ goto out_cancel;
+
+ if (m->mapping_maxage) {
+ err = nla_put_u32(skb, XFRMA_MTIMER_THRESH, m->mapping_maxage);
+ if (err)
+ goto out_cancel;
+ }
+
+ if (m->nat_keepalive_interval) {
+ err = nla_put_u32(skb, XFRMA_NAT_KEEPALIVE_INTERVAL,
+ m->nat_keepalive_interval);
+ if (err)
+ goto out_cancel;
+ }
+
+ if (dir) {
+ err = nla_put_u8(skb, XFRMA_SA_DIR, dir);
+ if (err)
+ goto out_cancel;
+ }
+
+ nlmsg_end(skb, nlh);
+ return 0;
+
+out_cancel:
+ nlmsg_cancel(skb, nlh);
+ return err;
+}
+
+static unsigned int xfrm_migrate_state_msgsize(const struct xfrm_migrate *m,
+ u8 dir)
+{
+ return NLMSG_ALIGN(sizeof(struct xfrm_user_migrate_state)) +
+ (m->encap ? nla_total_size(sizeof(struct xfrm_encap_tmpl)) : 0) +
+ (m->xuo ? nla_total_size(sizeof(struct xfrm_user_offload)) : 0) +
+ (m->new_mark ? nla_total_size(sizeof(struct xfrm_mark)) : 0) +
+ (m->smark.v ? nla_total_size(sizeof(u32)) * 2 : 0) + /* SET_MARK + SET_MARK_MASK */
+ (m->mapping_maxage ? nla_total_size(sizeof(u32)) : 0) +
+ (m->nat_keepalive_interval ? nla_total_size(sizeof(u32)) : 0) +
+ (dir ? nla_total_size(sizeof(u8)) : 0); /* XFRMA_SA_DIR */
+}
+
+static int xfrm_send_migrate_state(const struct xfrm_user_migrate_state *um,
+ const struct xfrm_migrate *m,
+ u8 dir, u32 portid, u32 seq)
+{
+ int err;
+ struct sk_buff *skb;
+ struct net *net = &init_net;
+
+ skb = nlmsg_new(xfrm_migrate_state_msgsize(m, dir), GFP_ATOMIC);
+ if (!skb)
+ return -ENOMEM;
+
+ err = build_migrate_state(skb, um, m, dir, portid, seq);
+ if (err < 0) {
+ kfree_skb(skb);
+ return err;
+ }
+
+ return xfrm_nlmsg_multicast(net, skb, 0, XFRMNLGRP_MIGRATE);
+}
+
+static int xfrm_do_migrate_state(struct sk_buff *skb, struct nlmsghdr *nlh,
+ struct nlattr **attrs, struct netlink_ext_ack *extack)
+{
+ struct xfrm_user_migrate_state *um = nlmsg_data(nlh);
+ struct net *net = sock_net(skb->sk);
+ struct xfrm_user_offload xuo = {};
+ struct xfrm_migrate m = {};
+ struct xfrm_state *xc;
+ struct xfrm_state *x;
+ int err;
+
+ if (!um->id.spi) {
+ NL_SET_ERR_MSG(extack, "Invalid SPI 0x0");
+ return -EINVAL;
+ }
+
+ if (um->reserved) {
+ NL_SET_ERR_MSG(extack, "Reserved field must be zero");
+ return -EINVAL;
+ }
+
+ if (um->flags & ~(XFRM_MIGRATE_STATE_NO_OFFLOAD |
+ XFRM_MIGRATE_STATE_UPDATE_SEL)) {
+ NL_SET_ERR_MSG(extack, "Unknown flags in XFRM_MSG_MIGRATE_STATE");
+ return -EINVAL;
+ }
+
+ if ((um->flags & XFRM_MIGRATE_STATE_NO_OFFLOAD) &&
+ attrs[XFRMA_OFFLOAD_DEV]) {
+ NL_SET_ERR_MSG(extack,
+ "XFRM_MIGRATE_STATE_NO_OFFLOAD and XFRMA_OFFLOAD_DEV are mutually exclusive");
+ return -EINVAL;
+ }
+
+ copy_from_user_migrate_state(&m, um);
+
+ x = xfrm_state_lookup(net, m.old_mark.v & m.old_mark.m,
+ &um->id.daddr, um->id.spi,
+ um->id.proto, um->id.family);
+ if (!x) {
+ NL_SET_ERR_MSG(extack, "Can not find state");
+ return -ESRCH;
+ }
+
+ if (um->flags & XFRM_MIGRATE_STATE_UPDATE_SEL) {
+ u8 prefixlen = (x->sel.family == AF_INET6) ? 128 : 32;
+
+ if (x->sel.prefixlen_s != x->sel.prefixlen_d ||
+ x->sel.prefixlen_d != prefixlen ||
+ !xfrm_addr_equal(&x->sel.daddr, &x->id.daddr, x->sel.family) ||
+ !xfrm_addr_equal(&x->sel.saddr, &x->props.saddr, x->sel.family)) {
+ NL_SET_ERR_MSG(extack,
+ "SA selector is not a single-host match for SA addresses");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ if (attrs[XFRMA_ENCAP]) {
+ m.encap = nla_data(attrs[XFRMA_ENCAP]);
+ if (m.encap->encap_type == 0) {
+ m.encap = NULL; /* sentinel: remove encap */
+ } else if (m.encap->encap_type != UDP_ENCAP_ESPINUDP) {
+ NL_SET_ERR_MSG(extack, "Unsupported encapsulation type");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ m.encap = x->encap; /* omit-to-inherit */
+ }
+
+ if (attrs[XFRMA_MTIMER_THRESH]) {
+ err = verify_mtimer_thresh(!!m.encap, x->dir, extack);
+ if (err)
+ goto out;
+ }
+
+ if (attrs[XFRMA_NAT_KEEPALIVE_INTERVAL] &&
+ nla_get_u32(attrs[XFRMA_NAT_KEEPALIVE_INTERVAL]) && !m.encap) {
+ NL_SET_ERR_MSG(extack,
+ "NAT_KEEPALIVE_INTERVAL requires encapsulation");
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (attrs[XFRMA_OFFLOAD_DEV]) {
+ m.xuo = nla_data(attrs[XFRMA_OFFLOAD_DEV]);
+ } else if (!(um->flags & XFRM_MIGRATE_STATE_NO_OFFLOAD) && x->xso.dev) {
+ xuo.ifindex = x->xso.dev->ifindex;
+ if (x->xso.dir == XFRM_DEV_OFFLOAD_IN)
+ xuo.flags = XFRM_OFFLOAD_INBOUND;
+ if (x->xso.type == XFRM_DEV_OFFLOAD_PACKET)
+ xuo.flags |= XFRM_OFFLOAD_PACKET;
+ m.xuo = &xuo;
+ }
+
+ if (attrs[XFRMA_MARK])
+ m.new_mark = nla_data(attrs[XFRMA_MARK]);
+
+ if (attrs[XFRMA_SET_MARK])
+ xfrm_smark_init(attrs, &m.smark);
+ else
+ m.smark = x->props.smark;
+
+ m.mapping_maxage = attrs[XFRMA_MTIMER_THRESH] ?
+ nla_get_u32(attrs[XFRMA_MTIMER_THRESH]) : x->mapping_maxage;
+ m.nat_keepalive_interval = attrs[XFRMA_NAT_KEEPALIVE_INTERVAL] ?
+ nla_get_u32(attrs[XFRMA_NAT_KEEPALIVE_INTERVAL]) :
+ x->nat_keepalive_interval;
+
+ xc = xfrm_state_migrate_create(x, &m, net, extack);
+ if (!xc) {
+ NL_SET_ERR_MSG_WEAK(extack, "State migration clone failed");
+ err = -EINVAL;
+ goto out;
+ }
+
+ spin_lock_bh(&x->lock);
+ xfrm_migrate_sync(xc, x); /* to prevent SN/IV reuse */
+ __xfrm_state_delete(x);
+ spin_unlock_bh(&x->lock);
+
+ err = xfrm_state_migrate_install(x, xc, &m, extack);
+ if (err < 0) {
+ /*
+ * In this rare case both the old SA and the new SA
+ * will disappear.
+ * Alternatives risk duplicate SN/IV usage which must not occur.
+ * Userspace must handle this error, -EEXIST.
+ */
+ goto out;
+ }
+
+ /* Restore encap cleared by sentinel (type=0) during migration. */
+ if (attrs[XFRMA_ENCAP])
+ m.encap = nla_data(attrs[XFRMA_ENCAP]);
+
+ m.new_sel = &xc->sel;
+
+ err = xfrm_send_migrate_state(um, &m, xc->dir,
+ nlh->nlmsg_pid, nlh->nlmsg_seq);
+ if (err < 0) {
+ NL_SET_ERR_MSG(extack, "Failed to send migration notification");
+ err = 0;
+ }
+
+out:
+ xfrm_state_put(x);
+ return err;
+}
+
#else
+static int xfrm_do_migrate_state(struct sk_buff *skb, struct nlmsghdr *nlh,
+ struct nlattr **attrs, struct netlink_ext_ack *extack)
+{
+ NL_SET_ERR_MSG(extack, "XFRM_MSG_MIGRATE_STATE is not supported");
+ return -ENOPROTOOPT;
+}
+
static int xfrm_do_migrate(struct sk_buff *skb, struct nlmsghdr *nlh,
struct nlattr **attrs, struct netlink_ext_ack *extack)
{
@@ -3314,6 +3595,7 @@ const int xfrm_msg_min[XFRM_NR_MSGTYPES] = {
[XFRM_MSG_GETSPDINFO - XFRM_MSG_BASE] = sizeof(u32),
[XFRM_MSG_SETDEFAULT - XFRM_MSG_BASE] = XMSGSIZE(xfrm_userpolicy_default),
[XFRM_MSG_GETDEFAULT - XFRM_MSG_BASE] = XMSGSIZE(xfrm_userpolicy_default),
+ [XFRM_MSG_MIGRATE_STATE - XFRM_MSG_BASE] = XMSGSIZE(xfrm_user_migrate_state),
};
EXPORT_SYMBOL_GPL(xfrm_msg_min);
@@ -3407,6 +3689,7 @@ static const struct xfrm_link {
[XFRM_MSG_GETSPDINFO - XFRM_MSG_BASE] = { .doit = xfrm_get_spdinfo },
[XFRM_MSG_SETDEFAULT - XFRM_MSG_BASE] = { .doit = xfrm_set_default },
[XFRM_MSG_GETDEFAULT - XFRM_MSG_BASE] = { .doit = xfrm_get_default },
+ [XFRM_MSG_MIGRATE_STATE - XFRM_MSG_BASE] = { .doit = xfrm_do_migrate_state },
};
static int xfrm_reject_unused_attr(int type, struct nlattr **attrs,
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 2c0b07f9fbbd..655d2616c9d2 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -128,6 +128,7 @@ static const struct nlmsg_perm nlmsg_xfrm_perms[] = {
{ XFRM_MSG_MAPPING, NETLINK_XFRM_SOCKET__NLMSG_READ },
{ XFRM_MSG_SETDEFAULT, NETLINK_XFRM_SOCKET__NLMSG_WRITE },
{ XFRM_MSG_GETDEFAULT, NETLINK_XFRM_SOCKET__NLMSG_READ },
+ { XFRM_MSG_MIGRATE_STATE, NETLINK_XFRM_SOCKET__NLMSG_WRITE },
};
static const struct nlmsg_perm nlmsg_audit_perms[] = {
@@ -203,7 +204,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
* structures at the top of this file with the new mappings
* before updating the BUILD_BUG_ON() macro!
*/
- BUILD_BUG_ON(XFRM_MSG_MAX != XFRM_MSG_GETDEFAULT);
+ BUILD_BUG_ON(XFRM_MSG_MAX != XFRM_MSG_MIGRATE_STATE);
if (selinux_policycap_netlink_xperm()) {
*perm = NETLINK_XFRM_SOCKET__NLMSG;
--
2.47.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox