Commit ecae0bd

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are included in this merge do the following:

- Kemeng Shi has contributed some compaction maintenance work in the series 'Fixes and cleanups to compaction'

- Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested

- More DAMON/DAMOS maintenance and feature work from SeongJae Park in the following patch series:

    mm/damon: misc fixups for documents, comments and its tracepoint
    mm/damon: add a tracepoint for damos apply target regions
    mm/damon: provide pseudo-moving sum based access rate
    mm/damon: implement DAMOS apply intervals
    mm/damon/core-test: Fix memory leaks in core-test
    mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

- In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature, to increase the feature's checking coverage: 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory'

- In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code

- Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink'

- David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups'

- Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification'

- Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()'

- In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames

- In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmemmap optimization code. This provides significant boot time improvements when significant amounts of gigantic pages are in use

- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code

- Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock'

- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2

- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance'

- Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says

- In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes it possible for a process to propagate KSM treatment across exec()

- Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT'

- In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans

- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values'

- In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU

- Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code

- Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result

- In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions

- In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements

- Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things

- Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages'

- In the series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults

- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code

- Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting'

- Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()'

- Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps'

- Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings'

- Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations'

- Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition'

- As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning'

- Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark

- folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios'

- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak'

- Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately'

- khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/[email protected]/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2 parents bc3012f + 9732336 commit ecae0bd

281 files changed, +11683 -5277 lines changed

Documentation/ABI/testing/sysfs-kernel-mm-damon

+7
@@ -151,6 +151,13 @@ Contact: SeongJae Park <[email protected]>
 Description: Writing to and reading from this file sets and gets the action
 of the scheme.
 
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/apply_interval_us
+Date: Sep 2023
+Contact: SeongJae Park <[email protected]>
+Description: Writing a value to this file sets the action apply interval of
+the scheme in microseconds. Reading this file returns the
+value.
+
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/sz/min
 Date: Mar 2022
 Contact: SeongJae Park <[email protected]>
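
The new ABI entry above describes a plain read/write file, so driving it from a script is mostly a matter of building the path. The following is a minimal sketch, assuming the kdamond, context, and scheme directories have already been created through the usual `nr_*` files and that the caller has root permission. The helper names `apply_interval_path` and `set_apply_interval_us` are hypothetical and not part of the kernel or of any DAMON tool.

```python
from pathlib import Path

# Root of the DAMON sysfs tree, as named in the ABI entry above.
DAMON_SYSFS_ROOT = Path("/sys/kernel/mm/damon/admin")


def apply_interval_path(kdamond, context, scheme, root=DAMON_SYSFS_ROOT):
    """Build the path of a scheme's apply_interval_us file (hypothetical helper)."""
    return (Path(root) / "kdamonds" / str(kdamond) / "contexts" / str(context)
            / "schemes" / str(scheme) / "apply_interval_us")


def set_apply_interval_us(kdamond, context, scheme, interval_us,
                          root=DAMON_SYSFS_ROOT):
    """Write the interval in microseconds, then return the value read back."""
    path = apply_interval_path(kdamond, context, scheme, root)
    path.write_text(f"{interval_us}\n")
    return int(path.read_text().strip())
```

Passing a different `root` makes it possible to exercise the helper against a scratch directory instead of the live sysfs tree.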

Documentation/admin-guide/cgroup-v1/memory.rst

+1
@@ -551,6 +551,7 @@ memory.stat file includes following statistics:
 event happens each time a page is unaccounted from the
 cgroup.
 swap # of bytes of swap usage
+swapcached # of bytes of swap cached in memory
 dirty # of bytes that are waiting to get written back to the disk.
 writeback # of bytes of file/anon cache that are queued for syncing to
 disk.

Documentation/admin-guide/cgroup-v2.rst

+38
@@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
 relying on the original semantics (e.g. specifying bogusly
 high 'bypass' protection values at higher tree levels).
 
+memory_hugetlb_accounting
+Count HugeTLB memory usage towards the cgroup's overall
+memory usage for the memory controller (for the purpose of
+statistics reporting and memory protetion). This is a new
+behavior that could regress existing setups, so it must be
+explicitly opted in with this mount option.
+
+A few caveats to keep in mind:
+
+* There is no HugeTLB pool management involved in the memory
+controller. The pre-allocated pool does not belong to anyone.
+Specifically, when a new HugeTLB folio is allocated to
+the pool, it is not accounted for from the perspective of the
+memory controller. It is only charged to a cgroup when it is
+actually used (for e.g at page fault time). Host memory
+overcommit management has to consider this when configuring
+hard limits. In general, HugeTLB pool management should be
+done via other mechanisms (such as the HugeTLB controller).
+* Failure to charge a HugeTLB folio to the memory controller
+results in SIGBUS. This could happen even if the HugeTLB pool
+still has pages available (but the cgroup limit is hit and
+reclaim attempt fails).
+* Charging HugeTLB memory towards the memory controller affects
+memory protection and reclaim dynamics. Any userspace tuning
+(of low, min limits for e.g) needs to take this into account.
+* HugeTLB pages utilized while this option is not selected
+will not be tracked by the memory controller (even if cgroup
+v2 is remounted later on).
+
 
 Organizing Processes and Threads
 --------------------------------
@@ -1539,6 +1568,15 @@ PAGE_SIZE multiple when read back.
 collapsing an existing range of pages. This counter is not
 present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
 
+thp_swpout (npn)
+Number of transparent hugepages which are swapout in one piece
+without splitting.
+
+thp_swpout_fallback (npn)
+Number of transparent hugepages which were split before swapout.
+Usually because failed to allocate some continuous swap space
+for the huge page.
+
 memory.numa_stat
 A read-only nested-keyed file which exists on non-root cgroups.
 
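
The two counters documented above are most useful as a pair: together they show how often a transparent hugepage had to be split before swapout. The following is a small sketch, assuming the counters are read from a flat-keyed `key value` file such as a cgroup's `memory.stat`. The function names `parse_flat_keyed` and `thp_swapout_split_ratio` are hypothetical.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines into a dict of ints, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def thp_swapout_split_ratio(stats):
    """Fraction of THP swapouts that fell back to splitting the huge page."""
    whole = stats.get("thp_swpout", 0)
    split = stats.get("thp_swpout_fallback", 0)
    total = whole + split
    return split / total if total else 0.0
```

A ratio close to 1.0 would suggest that contiguous swap space is rarely available for whole huge pages.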

Documentation/admin-guide/mm/damon/usage.rst

+81-43
@@ -20,18 +20,18 @@ DAMON provides below interfaces for different users.
 you can write and use your personalized DAMON sysfs wrapper programs that
 reads/writes the sysfs files instead of you. The `DAMON user space tool
 <https://github.com/awslabs/damo>`_ is one example of such programs.
-- *debugfs interface. (DEPRECATED!)*
-:ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
-<sysfs_interface>`. This is deprecated, so users should move to the
-:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
-move, please report your usecase to [email protected] and
-
 - *Kernel Space Programming Interface.*
 :doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
 users can utilize every feature of DAMON most flexibly and efficiently by
 writing kernel space DAMON application programs for you. You can even extend
 DAMON for various address spaces. For detail, please refer to the interface
 :doc:`document </mm/damon/api>`.
+- *debugfs interface. (DEPRECATED!)*
+:ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+<sysfs_interface>`. This is deprecated, so users should move to the
+:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+move, please report your usecase to [email protected] and
+
 
 .. _sysfs_interface:
 
@@ -76,7 +76,7 @@ comma (","). ::
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
 │ │ │ │ │ schemes/nr_schemes
-│ │ │ │ │ │ 0/action
+│ │ │ │ │ │ 0/action,apply_interval_us
 │ │ │ │ │ │ │ access_pattern/
 │ │ │ │ │ │ │ │ sz/min,max
 │ │ │ │ │ │ │ │ nr_accesses/min,max
@@ -105,14 +105,12 @@ having the root permission could use this directory.
 kdamonds/
 ---------
 
-The monitoring-related information including request specifications and results
-are called DAMON context. DAMON executes each context with a kernel thread
-called kdamond, and multiple kdamonds could run in parallel.
-
 Under the ``admin`` directory, one directory, ``kdamonds``, which has files for
-controlling the kdamonds exist. In the beginning, this directory has only one
-file, ``nr_kdamonds``. Writing a number (``N``) to the file creates the number
-of child directories named ``0`` to ``N-1``. Each directory represents each
+controlling the kdamonds (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details) exists. In the beginning, this directory has only one file,
+``nr_kdamonds``. Writing a number (``N``) to the file creates the number of
+child directories named ``0`` to ``N-1``. Each directory represents each
 kdamond.
 
 kdamonds/<N>/
@@ -150,9 +148,10 @@ kdamonds/<N>/contexts/
 
 In the beginning, this directory has only one file, ``nr_contexts``. Writing a
 number (``N``) to the file creates the number of child directories named as
-``0`` to ``N-1``. Each directory represents each monitoring context. At the
-moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
-be written to the file.
+``0`` to ``N-1``. Each directory represents each monitoring context (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details). At the moment, only one context per kdamond is supported, so only
+``0`` or ``1`` can be written to the file.
 
 .. _sysfs_contexts:
 
@@ -270,8 +269,8 @@ schemes/<N>/
 ------------
 
 In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
-(``action``) exist.
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
+(``action`` and ``apply_interval``) exist.
 
 The ``action`` file is for setting and getting the scheme's :ref:`action
 <damon_design_damos_action>`. The keywords that can be written to and read
@@ -297,6 +296,9 @@ Note that support of each action depends on the running DAMON operations set
 - ``stat``: Do nothing but count the statistics.
 Supported by all operations sets.
 
+The ``apply_interval_us`` file is for setting and getting the scheme's
+:ref:`apply_interval <damon_design_damos>` in microseconds.
+
 schemes/<N>/access_pattern/
 ---------------------------
 
@@ -392,7 +394,7 @@ pages of all memory cgroups except ``/having_care_already``.::
 echo N > 1/matching
 
 Note that ``anon`` and ``memcg`` filters are currently supported only when
-``paddr`` `implementation <sysfs_contexts>` is being used.
+``paddr`` :ref:`implementation <sysfs_contexts>` is being used.
 
 Also, memory regions that are filtered out by ``addr`` or ``target`` filters
 are not counted as the scheme has tried to those, while regions that filtered
@@ -430,9 +432,9 @@ that reading it returns the total size of the scheme tried regions, and creates
 directories named integer starting from ``0`` under this directory. Each
 directory contains files exposing detailed information about each of the memory
 region that the corresponding scheme's ``action`` has tried to be applied under
-this directory, during next :ref:`aggregation interval
-<sysfs_monitoring_attrs>`. The information includes address range,
-``nr_accesses``, and ``age`` of the region.
+this directory, during next :ref:`apply interval <damon_design_damos>` of the
+corresponding scheme. The information includes address range, ``nr_accesses``,
+and ``age`` of the region.
 
 Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
 file will only update the ``total_bytes`` file, and will not create the
@@ -495,6 +497,62 @@ Please note that it's highly recommended to use user space tools like `damo
 <https://github.com/awslabs/damo>`_ rather than manually reading and writing
 the files as above. Above is only for an example.
 
+.. _tracepoint:
+
+Tracepoints for Monitoring Results
+==================================
+
+Users can get the monitoring results via the :ref:`tried_regions
+<sysfs_schemes_tried_regions>`. The interface is useful for getting a
+snapshot, but it could be inefficient for fully recording all the monitoring
+results. For the purpose, two trace points, namely ``damon:damon_aggregated``
+and ``damon:damos_before_apply``, are provided. ``damon:damon_aggregated``
+provides the whole monitoring results, while ``damon:damos_before_apply``
+provides the monitoring results for regions that each DAMON-based Operation
+Scheme (:ref:`DAMOS <damon_design_damos>`) is gonna be applied. Hence,
+``damon:damos_before_apply`` is more useful for recording internal behavior of
+DAMOS, or DAMOS target access
+:ref:`pattern <damon_design_damos_access_pattern>` based query-like efficient
+monitoring results recording.
+
+While the monitoring is turned on, you could record the tracepoint events and
+show results using tracepoint supporting tools like ``perf``. For example::
+
+# echo on > monitor_on
+# perf record -e damon:damon_aggregated &
+# sleep 5
+# kill 9 $(pidof perf)
+# echo off > monitor_on
+# perf script
+kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
+[...]
+
+Each line of the perf script output represents each monitoring region. The
+first five fields are as usual other tracepoint outputs. The sixth field
+(``target_id=X``) shows the ide of the monitoring target of the region. The
+seventh field (``nr_regions=X``) shows the total number of monitoring regions
+for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
+(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
+``nr_accesses`` of the region (refer to
+:ref:`design <damon_design_region_based_sampling>` for more details of the
+counter). Finally the tenth field (``X``) shows the ``age`` of the region
+(refer to :ref:`design <damon_design_age_tracking>` for more details of the
+counter).
+
+If the event was ``damon:damos_beofre_apply``, the ``perf script`` output would
+be somewhat like below::
+
+kdamond.0 47293 [000] 80801.060214: damon:damos_before_apply: ctx_idx=0 scheme_idx=0 target_idx=0 nr_regions=11 121932607488-135128711168: 0 136
+[...]
+
+Each line of the output represents each monitoring region that each DAMON-based
+Operation Scheme was about to be applied at the traced time. The first five
+fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``)
+of the scheme in the list of the contexts of the context's kdamond, the index
+of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in
+addition to the output of ``damon_aggregated`` tracepoint.
+
+
 .. _debugfs_interface:
 
 debugfs Interface (DEPRECATED!)
@@ -790,23 +848,3 @@ directory by putting the name of the context to the ``rm_contexts`` file. ::
 
 Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
 root directory only.
-
-
-.. _tracepoint:
-
-Tracepoint for Monitoring Results
-=================================
-
-Users can get the monitoring results via the :ref:`tried_regions
-<sysfs_schemes_tried_regions>` or a tracepoint, ``damon:damon_aggregated``.
-While the tried regions directory is useful for getting a snapshot, the
-tracepoint is useful for getting a full record of the results. While the
-monitoring is turned on, you could record the tracepoint events and show
-results using tracepoint supporting tools like ``perf``. For example::
-
-# echo on > monitor_on
-# perf record -e damon:damon_aggregated &
-# sleep 5
-# kill 9 $(pidof perf)
-# echo off > monitor_on
-# perf script
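
The field-by-field description of the `damon:damon_aggregated` output added in this file maps naturally onto a small parser. The following sketch extracts the sixth to tenth fields from one `perf script` line, assuming the output format shown in the sample line of the added documentation. The names `AGGREGATED_RE` and `parse_damon_aggregated` are hypothetical.

```python
import re

# Matches the event-specific tail of a `perf script` line, for example:
# "damon:damon_aggregated: target_id=0 nr_regions=11 START-END: NR_ACCESSES AGE"
AGGREGATED_RE = re.compile(
    r"damon:damon_aggregated:\s+target_id=(?P<target_id>\d+)\s+"
    r"nr_regions=(?P<nr_regions>\d+)\s+"
    r"(?P<start>\d+)-(?P<end>\d+):\s+"
    r"(?P<nr_accesses>\d+)\s+(?P<age>\d+)"
)


def parse_damon_aggregated(line):
    """Return the region fields as a dict of ints, or None if the line differs."""
    match = AGGREGATED_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

Lines for other events, including `damon:damos_before_apply`, do not match this pattern and yield `None`. Handling them would need a second pattern with the extra `ctx_idx`, `scheme_idx`, and `target_idx` fields.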

Documentation/admin-guide/mm/ksm.rst

+11
@@ -155,6 +155,15 @@ stable_node_chains_prune_millisecs
 scan. It's a noop if not a single KSM page hit the
 ``max_page_sharing`` yet.
 
+smart_scan
+Historically KSM checked every candidate page for each scan. It did
+not take into account historic information. When smart scan is
+enabled, pages that have previously not been de-duplicated get
+skipped. How often these pages are skipped depends on how often
+de-duplication has already been tried and failed. By default this
+optimization is enabled. The ``pages_skipped`` metric shows how
+effective the setting is.
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
 
 general_profit
@@ -169,6 +178,8 @@ pages_unshared
 how many pages unique but repeatedly checked for merging
 pages_volatile
 how many pages changing too fast to be placed in a tree
+pages_skipped
+how many pages did the "smart" page scanning algorithm skip
 full_scans
 how many times all mergeable areas have been scanned
 stable_node_chains
stable_node_chains

Documentation/admin-guide/mm/pagemap.rst

+89
@@ -227,3 +227,92 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
 always 12 at most architectures). Since Linux 3.11 their meaning changes
 after first clear of soft-dirty bits. Since Linux 4.2 they are used for
 flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
+clear the info about page table entries. The following operations are supported
+in this IOCTL:
+
+- Scan the address range and get the memory ranges matching the provided criteria.
+This is performed when the output buffer is specified.
+- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect
+the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
+non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be
+used with or without ``PM_SCAN_CHECK_WPASYNC``.
+- Both of those operations can be combined into one atomic operation where we can
+get and write protect the pages as well.
+
+Following flags about pages are currently supported:
+
+- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
+- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
+- ``PAGE_IS_FILE`` - Page is file backed
+- ``PAGE_IS_PRESENT`` - Page is present in the memory
+- ``PAGE_IS_SWAPPED`` - Page is in swapped
+- ``PAGE_IS_PFNZERO`` - Page has zero PFN
+- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+
+The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
+
+1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
+field. This field will be helpful in recognizing the structure if extensions
+are done later.
+2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING``
+and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get
+operation is optionally performed depending upon if the output buffer is
+provided or not.
+3. The range is specified through ``start`` and ``end``.
+4. The walk can abort before visiting the complete range such as the user buffer
+can get full etc. The walk ending address is specified in``end_walk``.
+5. The output buffer of ``struct page_region`` array and size is specified in
+``vec`` and ``vec_len``.
+6. The optional maximum requested pages are specified in the ``max_pages``.
+7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
+``category_inverted`` and ``return_mask``.
+
+Find pages which have been written and WP them as well::
+
+struct pm_scan_arg arg = {
+.size = sizeof(arg),
+.flags = PM_SCAN_CHECK_WPASYNC | PM_SCAN_CHECK_WPASYNC,
+..
+.category_mask = PAGE_IS_WRITTEN,
+.return_mask = PAGE_IS_WRITTEN,
+};
+
+Find pages which have been written, are file backed, not swapped and either
+present or huge::
+
+struct pm_scan_arg arg = {
+.size = sizeof(arg),
+.flags = 0,
+..
+.category_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED,
+.category_inverted = PAGE_IS_SWAPPED,
+.category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
+.return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED |
+PAGE_IS_PRESENT | PAGE_IS_HUGE,
+};
+
+The ``PAGE_IS_WRITTEN`` flag can be considered as a better-performing alternative
+of soft-dirty flag. It doesn't get affected by VMA merging of the kernel and hence
+the user can find the true soft-dirty pages in case of normal pages. (There may
+still be extra dirty pages reported for THP or Hugetlb pages.)
+
+"PAGE_IS_WRITTEN" category is used with uffd write protect-enabled ranges to
+implement memory dirty tracking in userspace:
+
+1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
+are set by ``UFFDIO_API`` IOCTL.
+3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+through ``UFFDIO_REGISTER`` IOCTL.
+4. Then any part of the registered memory or the whole memory region must
+be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_WP_MATCHING``
+or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the
+same operation. The former is better in terms of performance.
+5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which
+have been written to since they were last marked and/or optionally write protect
+the pages as well.
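
The added pagemap text names four masks but does not spell out how they combine. The following is a pure-Python model of one plausible reading, under stated assumptions: `category_inverted` flips the listed bits before matching, every bit in `category_mask` must then be set, and at least one bit in `category_anyof_mask` must be set when that mask is non-zero. Both this matching rule and the bit values are assumptions made for illustration. Real code should take the constants from the kernel's UAPI header and issue the actual IOCTL. The function name `page_matches` is hypothetical.

```python
# Assumed bit positions, for illustration only.
PAGE_IS_WPALLOWED = 1 << 0
PAGE_IS_WRITTEN = 1 << 1
PAGE_IS_FILE = 1 << 2
PAGE_IS_PRESENT = 1 << 3
PAGE_IS_SWAPPED = 1 << 4
PAGE_IS_PFNZERO = 1 << 5
PAGE_IS_HUGE = 1 << 6


def page_matches(categories, category_mask=0, category_anyof_mask=0,
                 category_inverted=0):
    """Model of the assumed matching rule for one page's category bits."""
    # Flip the inverted bits so "must NOT be set" becomes "must be set".
    categories ^= category_inverted
    # Every bit in category_mask is required.
    if (categories & category_mask) != category_mask:
        return False
    # If an any-of mask is given, at least one of its bits is required.
    if category_anyof_mask and not (categories & category_anyof_mask):
        return False
    return True
```

Under this model, the masks from the second example in the added text select a page that is written, not swapped, and either present or huge. Adding `PAGE_IS_FILE` to `category_mask` would additionally require the page to be file backed.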
