Merge branch 'akpm' (patches from Andrew)

+274

Documentation/ABI/testing/sysfs-kernel-mm-damon

···
+What:           /sys/kernel/mm/damon/
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Interface for Data Access MONitoring (DAMON). Contains files
+                for controlling DAMON. For more details on DAMON itself,
+                please refer to Documentation/admin-guide/mm/damon/index.rst.
+
+What:           /sys/kernel/mm/damon/admin/
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Interface for privileged users of DAMON. Contains files for
+                controlling DAMON that are intended for privileged users.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a number 'N' to this file creates 'N' directories,
+                named '0' to 'N-1', under the kdamonds/ directory, one for
+                controlling each DAMON worker thread (kdamond).
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/state
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing 'on' or 'off' to this file starts or stops the
+                kdamond, respectively. Reading the file returns the keyword
+                for the current state. Writing 'update_schemes_stats' to
+                the file updates the contents of the schemes stats files of
+                the kdamond.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Reading this file returns the pid of the kdamond if it is
+                running.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/nr_contexts
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a number 'N' to this file creates 'N' directories,
+                named '0' to 'N-1', under the contexts/ directory, one for
+                controlling each DAMON context.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/operations
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a keyword for a monitoring operations set ('vaddr' for
+                virtual address spaces monitoring, and 'paddr' for the physical
+                address space monitoring) to this file makes the context use
+                that operations set. Reading the file returns the keyword for
+                the operations set the context is set to use.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/sample_us
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a value to this file sets the sampling interval of the
+                DAMON context, in microseconds. Reading this file returns the
+                value.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/aggr_us
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a value to this file sets the aggregation interval of
+                the DAMON context, in microseconds. Reading this file returns
+                the value.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/update_us
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a value to this file sets the update interval of the
+                DAMON context, in microseconds. Reading this file returns the
+                value.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/nr_regions/min
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a value to this file sets the minimum number of
+                monitoring regions of the DAMON context. Reading this file
+                returns the value.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/nr_regions/max
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a value to this file sets the maximum number of
+                monitoring regions of the DAMON context. Reading this file
+                returns the value.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/nr_targets
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a number 'N' to this file creates 'N' directories,
+                named '0' to 'N-1', under the targets/ directory, one for
+                controlling each DAMON target of the context.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/pid_target
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets,
+                respectively, the pid of the target process if the context is
+                for virtual address spaces monitoring.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/regions/nr_regions
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a number 'N' to this file creates 'N' directories,
+                named '0' to 'N-1', under the regions/ directory, one for
+                setting each DAMON target memory region of the context. In
+                case of the virtual address space monitoring, DAMON
+                automatically sets the target memory region based on the target
+                processes' mappings.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/regions/<R>/start
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the start
+                address of the monitoring region.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/regions/<R>/end
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the end
+                address of the monitoring region.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/nr_schemes
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing a number 'N' to this file creates 'N' directories,
+                named '0' to 'N-1', under the schemes/ directory, one for
+                controlling each DAMON-based operation scheme of the context.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/action
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the action
+                of the scheme.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/sz/min
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the minimum
+                size of the scheme's target regions in bytes.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/sz/max
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the maximum
+                size of the scheme's target regions in bytes.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/nr_accesses/min
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the minimum
+                'nr_accesses' of the scheme's target regions.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/nr_accesses/max
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the maximum
+                'nr_accesses' of the scheme's target regions.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/age/min
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the minimum
+                'age' of the scheme's target regions.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/age/max
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the maximum
+                'age' of the scheme's target regions.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/ms
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the time
+                quota of the scheme in milliseconds.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/bytes
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the size
+                quota of the scheme in bytes.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/reset_interval_ms
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the quotas
+                charge reset interval of the scheme in milliseconds.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the
+                under-quota limit regions prioritization weight for 'size' in
+                permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/nr_accesses_permil
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the
+                under-quota limit regions prioritization weight for
+                'nr_accesses' in permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/age_permil
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the
+                under-quota limit regions prioritization weight for 'age' in
+                permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/metric
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the metric
+                of the watermarks for the scheme. The writable/readable
+                keywords for this file are 'none' for disabling the watermarks
+                feature, or 'free_mem_rate' for the system's global free memory
+                rate in permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/interval_us
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the metric
+                check interval of the watermarks for the scheme in
+                microseconds.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/high
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the high
+                watermark of the scheme in permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/mid
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the mid
+                watermark of the scheme in permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/low
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Writing to and reading from this file sets and gets the low
+                watermark of the scheme in permil.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_tried
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Reading this file returns the number of regions that the
+                scheme has tried to apply its action to.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/sz_tried
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Reading this file returns the total size, in bytes, of regions
+                that the scheme has tried to apply its action to.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_applied
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Reading this file returns the number of regions that the action
+                of the scheme has been successfully applied to.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/sz_applied
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Reading this file returns the total size, in bytes, of regions
+                that the action of the scheme has been successfully applied to.
+
+What:           /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/qt_exceeds
+Date:           Mar 2022
+Contact:        SeongJae Park <sj@kernel.org>
+Description:    Reading this file returns the number of times the scheme's
+                quotas were exceeded.
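
The `state` and `stats/` entries above work as a pair: the stats files only change after `update_schemes_stats` is written to `state`. A minimal sketch of that sequence follows. The function name `damon_print_stats` and its argument order are made up for illustration, and it assumes the kdamond is already running and the caller has root permission.

```shell
#!/bin/sh
# Hypothetical helper, not part of the kernel tree.
# Print the five DAMOS stats of one scheme as "name value" lines.
# $1: admin directory (e.g. /sys/kernel/mm/damon/admin)
# $2: kdamond index, $3: context index, $4: scheme index
damon_print_stats() {
    kdamond_dir="$1/kdamonds/$2"
    stats_dir="$kdamond_dir/contexts/$3/schemes/$4/stats"

    # Ask the kdamond to refresh the stats files before reading them.
    echo update_schemes_stats > "$kdamond_dir/state" || return 1

    for name in nr_tried sz_tried nr_applied sz_applied qt_exceeds; do
        printf '%s %s\n' "$name" "$(cat "$stats_dir/$name")"
    done
}
```

On a live system this would be called as `damon_print_stats /sys/kernel/mm/damon/admin 0 0 0`. Taking the admin directory as a parameter keeps the helper usable against a scratch directory for testing.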

+2

Documentation/admin-guide/cgroup-v1/memory.rst

···
                                 threads
 cgroup.procs                    show list of processes
 cgroup.event_control            an interface for event_fd()
+                                This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.usage_in_bytes           show current usage for memory
                                 (See 5.5 for details)
 memory.memsw.usage_in_bytes     show current usage for memory+Swap
···
 memory.max_usage_in_bytes       show max memory usage recorded
 memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
 memory.soft_limit_in_bytes      set/show soft limit of memory usage
+                                This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.stat                     show various statistics
 memory.use_hierarchy            set/show hierarchical account enabled
                                 This knob is deprecated and shouldn't be

+5

Documentation/admin-guide/cgroup-v2.rst

···
           Amount of memory used to cache filesystem data,
           including tmpfs and shared memory.
 
+         kernel (npn)
+               Amount of total kernel memory, including
+               (kernel_stack, pagetables, percpu, vmalloc, slab) in
+               addition to other kernel memory use cases.
+
          kernel_stack
                Amount of memory allocated to kernel stacks.
 
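
The new `kernel` entry can be read like any other `memory.stat` key. The sketch below pulls one key out of a `memory.stat` file and subtracts the five components named in the hunk from `kernel`, to estimate the "other kernel memory use cases" part. Both function names are made up for illustration. The counters are sampled separately, so the remainder is only an approximation and is not guaranteed to be meaningful.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the kernel tree.

# Print the value of one key from a cgroup v2 memory.stat file.
# $1: path to memory.stat, $2: key name (e.g. kernel)
# Returns non-zero if the key is absent.
memstat_get() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}

# Print "kernel" minus the components listed in the documentation.
# A component missing from the file is counted as 0.
# $1: path to memory.stat
memstat_kernel_other() {
    total=$(memstat_get "$1" kernel) || return 1
    for key in kernel_stack pagetables percpu vmalloc slab; do
        part=$(memstat_get "$1" "$key") || part=0
        total=$((total - part))
    done
    echo "$total"
}
```

A call would look like `memstat_kernel_other /sys/fs/cgroup/<group>/memory.stat`, where `<group>` stands for a cgroup of interest on a kernel that includes this change.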

+1 -1

Documentation/admin-guide/kernel-parameters.txt

···
                         [KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
                         enabled.
                         Allows heavy hugetlb users to free up some more
-                        memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+                        memory (7 * PAGE_SIZE for each 2MB hugetlb page).
                         Format: { on | off (default) }
 
                         on:  enable the feature
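
The old and new figures in this hunk can be checked with a little arithmetic. The sketch below assumes 4 KiB base pages and a 64-byte `struct page` (common on x86_64, but configuration dependent), and treats the number of vmemmap pages kept per hugetlb page as a parameter. The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: number of vmemmap pages that can be freed per
# hugetlb page.
# $1: hugetlb page size in bytes
# $2: base page size in bytes
# $3: sizeof(struct page) in bytes (assumed, configuration dependent)
# $4: vmemmap pages that stay mapped for each hugetlb page
hugetlb_vmemmap_freed_pages() {
    nr_struct_pages=$(( $1 / $2 ))              # one struct page per base page
    vmemmap_pages=$(( nr_struct_pages * $3 / $2 ))
    echo $(( vmemmap_pages - $4 ))
}
```

Under those assumptions a 2MB hugetlb page has 512 `struct page` entries that occupy 8 vmemmap pages. `hugetlb_vmemmap_freed_pages 2097152 4096 64 2` then gives the old figure of 6, and keeping only one page (`... 64 1`) gives the new figure of 7.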

+361 -19

Documentation/admin-guide/mm/damon/usage.rst

···
 Detailed Usages
 ===============
 
-DAMON provides below three interfaces for different users.
+DAMON provides below interfaces for different users.
 
 - *DAMON user space tool.*
   `This <https://github.com/awslabs/damo>`_ is for privileged people such as
···
   virtual and physical address spaces monitoring. For more detail, please
   refer to its `usage document
   <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
-- *debugfs interface.*
-  :ref:`This <debugfs_interface>` is for privileged user space programmers who
+- *sysfs interface.*
+  :ref:`This <sysfs_interface>` is for privileged user space programmers who
   want more optimized use of DAMON. Using this, users can use DAMON’s major
-  features by reading from and writing to special debugfs files. Therefore,
-  you can write and use your personalized DAMON debugfs wrapper programs that
-  reads/writes the debugfs files instead of you. The `DAMON user space tool
+  features by reading from and writing to special sysfs files. Therefore,
+  you can write and use your personalized DAMON sysfs wrapper programs that
+  reads/writes the sysfs files instead of you. The `DAMON user space tool
   <https://github.com/awslabs/damo>`_ is one example of such programs. It
   supports both virtual and physical address spaces monitoring. Note that this
   interface provides only simple :ref:`statistics <damos_stats>` for the
   monitoring results. For detailed monitoring results, DAMON provides a
   :ref:`tracepoint <tracepoint>`.
+- *debugfs interface.*
+  :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+  <sysfs_interface>`. This will be removed after next LTS kernel is released,
+  so users should move to the :ref:`sysfs interface <sysfs_interface>`.
 - *Kernel Space Programming Interface.*
   :doc:`This </vm/damon/api>` is for kernel space programmers.
   Using this,
   users can utilize every feature of DAMON most flexibly and efficiently by
···
   DAMON for various address spaces. For detail, please refer to the interface
   :doc:`document </vm/damon/api>`.
 
+.. _sysfs_interface:
+
+sysfs Interface
+===============
+
+DAMON sysfs interface is built when ``CONFIG_DAMON_SYSFS`` is defined. It
+creates multiple directories and files under its sysfs directory,
+``<sysfs>/kernel/mm/damon/``. You can control DAMON by writing to and reading
+from the files under the directory.
+
+For a short example, users can monitor the virtual address space of a given
+workload as below. ::
+
+    # cd /sys/kernel/mm/damon/admin/
+    # echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts
+    # echo vaddr > kdamonds/0/contexts/0/operations
+    # echo 1 > kdamonds/0/contexts/0/targets/nr_targets
+    # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
+    # echo on > kdamonds/0/state
+
+Files Hierarchy
+---------------
+
+The files hierarchy of DAMON sysfs interface is shown below. In the below
+figure, parents-children relations are represented with indentations, each
+directory has a ``/`` suffix, and files in each directory are separated by
+comma (","). ::
+
+    /sys/kernel/mm/damon/admin
+    │ kdamonds/nr_kdamonds
+    │ │ 0/state,pid
+    │ │ │ contexts/nr_contexts
+    │ │ │ │ 0/operations
+    │ │ │ │ │ monitoring_attrs/
+    │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+    │ │ │ │ │ │ nr_regions/min,max
+    │ │ │ │ │ targets/nr_targets
+    │ │ │ │ │ │ 0/pid_target
+    │ │ │ │ │ │ │ regions/nr_regions
+    │ │ │ │ │ │ │ │ 0/start,end
+    │ │ │ │ │ │ │ │ ...
+    │ │ │ │ │ │ ...
+    │ │ │ │ │ schemes/nr_schemes
+    │ │ │ │ │ │ 0/action
+    │ │ │ │ │ │ │ access_pattern/
+    │ │ │ │ │ │ │ │ sz/min,max
+    │ │ │ │ │ │ │ │ nr_accesses/min,max
+    │ │ │ │ │ │ │ │ age/min,max
+    │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
+    │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
+    │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
+    │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+    │ │ │ │ │ │ ...
+    │ │ │ │ ...
+    │ │ ...
+
+Root
+----
+
+The root of the DAMON sysfs interface is ``<sysfs>/kernel/mm/damon/``, and it
+has one directory named ``admin``. The directory contains the files for
+privileged user space programs' control of DAMON. User space tools or daemons
+having the root permission could use this directory.
+
+kdamonds/
+---------
+
+The monitoring-related information including request specifications and results
+is called a DAMON context. DAMON executes each context with a kernel thread
+called kdamond, and multiple kdamonds could run in parallel.
+
+Under the ``admin`` directory, one directory, ``kdamonds``, which has files for
+controlling the kdamonds exists. In the beginning, this directory has only one
+file, ``nr_kdamonds``. Writing a number (``N``) to the file creates ``N``
+child directories named ``0`` to ``N-1``. Each directory represents each
+kdamond.
+
+kdamonds/<N>/
+-------------
+
+In each kdamond directory, two files (``state`` and ``pid``) and one directory
+(``contexts``) exist.
+
+Reading ``state`` returns ``on`` if the kdamond is currently running, or
+``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
+in the state. Writing ``update_schemes_stats`` to ``state`` file updates the
+contents of stats files for each DAMON-based operation scheme of the kdamond.
+For details of the stats, please refer to :ref:`stats section
+<sysfs_schemes_stats>`.
+
+If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
+
+``contexts`` directory contains files for controlling the monitoring contexts
+that this kdamond will execute.
+
+kdamonds/<N>/contexts/
+----------------------
+
+In the beginning, this directory has only one file, ``nr_contexts``. Writing a
+number (``N``) to the file creates ``N`` child directories named
+``0`` to ``N-1``. Each directory represents each monitoring context. At the
+moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
+be written to the file.
+
+contexts/<N>/
+-------------
+
+In each context directory, one file (``operations``) and three directories
+(``monitoring_attrs``, ``targets``, and ``schemes``) exist.
+
+DAMON supports multiple types of monitoring operations, including those for
+virtual address space and the physical address space. You can set and get what
+type of monitoring operations DAMON will use for the context by writing one of
+below keywords to, and reading from the file.
+
+ - vaddr: Monitor virtual address spaces of specific processes
+ - paddr: Monitor the physical address space of the system
+
+contexts/<N>/monitoring_attrs/
+------------------------------
+
+Files for specifying attributes of the monitoring including required quality
+and efficiency of the monitoring are in ``monitoring_attrs`` directory.
+Specifically, two directories, ``intervals`` and ``nr_regions`` exist in this
+directory.
+
+Under ``intervals`` directory, three files for DAMON's sampling interval
+(``sample_us``), aggregation interval (``aggr_us``), and update interval
+(``update_us``) exist.
+You can set and get the values in microseconds by
+writing to and reading from the files.
+
+Under ``nr_regions`` directory, two files for the lower-bound and upper-bound
+of DAMON's monitoring regions (``min`` and ``max``, respectively), which
+control the monitoring overhead, exist. You can set and get the values by
+writing to and reading from the files.
+
+For more details about the intervals and monitoring regions range, please refer
+to the Design document (:doc:`/vm/damon/design`).
+
+contexts/<N>/targets/
+---------------------
+
+In the beginning, this directory has only one file, ``nr_targets``. Writing a
+number (``N``) to the file creates ``N`` child directories named ``0``
+to ``N-1``. Each directory represents each monitoring target.
+
+targets/<N>/
+------------
+
+In each target directory, one file (``pid_target``) and one directory
+(``regions``) exist.
+
+If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
+be a process. You can specify the process to DAMON by writing the pid of the
+process to the ``pid_target`` file.
+
+targets/<N>/regions
+-------------------
+
+When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
+the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
+monitoring target regions so that entire memory mappings of target processes
+can be covered. However, users could want to set the initial monitoring region
+to specific address ranges.
+
+In contrast, DAMON does not automatically set and update the monitoring target
+regions when ``paddr`` monitoring operations set is being used (``paddr`` is
+written to the ``contexts/<N>/operations``). Therefore, users should set the
+monitoring target regions by themselves in the case.
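
Review note: in both of the cases just described (a `vaddr` user who wants specific initial ranges, and any `paddr` user), the region files under `regions/` have to be written by hand. A minimal sketch of setting a single region is shown here. The function name `damon_set_single_region` is made up for illustration, addresses are assumed to be passed as decimal numbers, and it relies on the `nr_regions`, `start` and `end` files listed in the ABI document of this series.

```shell
#!/bin/sh
# Hypothetical helper, not part of the kernel tree.
# Set one initial monitoring region for a target.
# $1: target directory (.../contexts/<C>/targets/<T>)
# $2: start address, $3: end address
damon_set_single_region() {
    regions_dir="$1/regions"

    # On the real sysfs, writing 1 makes the kernel create regions/0/.
    echo 1 > "$regions_dir/nr_regions" || return 1
    echo "$2" > "$regions_dir/0/start" || return 1
    echo "$3" > "$regions_dir/0/end"
}
```

The target directory is a parameter so the same function can be exercised against a scratch directory that already contains `regions/0/`.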
+
+For such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the files under this directory.
+
+In the beginning, this directory has only one file, ``nr_regions``. Writing a
+number (``N``) to the file creates ``N`` child directories named ``0``
+to ``N-1``. Each directory represents each initial monitoring target region.
+
+regions/<N>/
+------------
+
+In each region directory, you will find two files (``start`` and ``end``). You
+can set and get the start and end addresses of the initial monitoring target
+region by writing to and reading from the files, respectively.
+
+contexts/<N>/schemes/
+---------------------
+
+For usual DAMON-based data access aware memory management optimizations, users
+would normally want the system to apply a memory management action to a memory
+region of a specific access pattern. DAMON receives such formalized operation
+schemes from the user and applies those to the target memory regions. Users
+can get and set the schemes by reading from and writing to files under this
+directory.
+
+In the beginning, this directory has only one file, ``nr_schemes``. Writing a
+number (``N``) to the file creates ``N`` child directories named ``0``
+to ``N-1``. Each directory represents each DAMON-based operation scheme.
+
+schemes/<N>/
+------------
+
+In each scheme directory, four directories (``access_pattern``, ``quotas``,
+``watermarks``, and ``stats``) and one file (``action``) exist.
+
+The ``action`` file is for setting and getting what action you want to apply to
+memory regions having a specific access pattern of interest. The keywords
+that can be written to and read from the file and their meaning are as below.
+
+ - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``
+ - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``
+ - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
+ - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
+ - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
+ - ``stat``: Do nothing but count the statistics
+
+schemes/<N>/access_pattern/
+---------------------------
+
+The target access pattern of each DAMON-based operation scheme is constructed
+with three ranges including the size of the region in bytes, number of
+monitored accesses per aggregate interval, and number of aggregated intervals
+for the age of the region.
+
+Under the ``access_pattern`` directory, three directories (``sz``,
+``nr_accesses``, and ``age``) each having two files (``min`` and ``max``)
+exist. You can set and get the access pattern for the given scheme by writing
+to and reading from the ``min`` and ``max`` files under ``sz``,
+``nr_accesses``, and ``age`` directories, respectively.
+
+schemes/<N>/quotas/
+-------------------
+
+Optimal ``target access pattern`` for each ``action`` is workload dependent, so
+not easy to find. Worse yet, setting a scheme of some action too aggressive
+can cause severe overhead. To avoid such overhead, users can limit time and
+size quota for each scheme. In detail, users can ask DAMON to try to use only
+up to specific time (``time quota``) for applying the action, and to apply the
+action to only up to specific amount (``size quota``) of memory regions having
+the target access pattern within a given time interval (``reset interval``).
+
+When the quota limit is expected to be exceeded, DAMON prioritizes found memory
+regions of the ``target access pattern`` based on their size, access frequency,
+and age. For personalized prioritization, users can set the weights for the
+three properties.
+
+Under ``quotas`` directory, three files (``ms``, ``bytes``,
+``reset_interval_ms``) and one directory (``weights``) having three files
+(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist.
+
+You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
+``reset interval`` in milliseconds by writing the values to the three files,
+respectively. You can also set the prioritization weights for size, access
+frequency, and age in per-thousand unit by writing the values to the three
+files under the ``weights`` directory.
+
+schemes/<N>/watermarks/
+-----------------------
+
+To allow easy activation and deactivation of each scheme based on system
+status, DAMON provides a feature called watermarks. The feature receives five
+values called ``metric``, ``interval``, ``high``, ``mid``, and ``low``. The
+``metric`` is the system metric such as free memory ratio that can be measured.
+If the metric value of the system is higher than the value in ``high`` or lower
+than ``low`` at the moment, the scheme is deactivated. If the value is lower
+than ``mid``, the scheme is activated.
+
+Under the watermarks directory, five files (``metric``, ``interval_us``,
+``high``, ``mid``, and ``low``) for setting each value exist. You can set and
+get the five values by writing to and reading from the files, respectively.
+
+Keywords and meanings of those that can be written to the ``metric`` file are
+as below.
+
+ - none: Ignore the watermarks
+ - free_mem_rate: System's free memory rate (per thousand)
+
+The ``interval`` should be written in microseconds.
+
+.. _sysfs_schemes_stats:
+
+schemes/<N>/stats/
+------------------
+
+DAMON counts the total number and bytes of regions that each scheme was tried
+to be applied to, the same two numbers for the regions that each scheme was
+successfully applied to, and the total number of times the quota limits were
+exceeded. These statistics can be used for online analysis or tuning of the
+schemes.
+
+The statistics can be retrieved by reading the files under ``stats`` directory
+(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
+``qt_exceeds``), respectively. The files are not updated in real time, so you
+should ask DAMON sysfs interface to update the content of the files for the
+stats by writing a special keyword, ``update_schemes_stats`` to the relevant
+``kdamonds/<N>/state`` file.
+
+Example
+~~~~~~~
+
+Below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%".
:: 341 + 342 + # cd <sysfs>/kernel/mm/damon/admin 343 + # # populate directories 344 + # echo 1 > kdamonds/nr_kdamonds; echo 1 > kdamonds/0/contexts/nr_contexts; 345 + # echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes 346 + # cd kdamonds/0/contexts/0/schemes/0 347 + # # set the basic access pattern and the action 348 + # echo 4096 > access_patterns/sz/min 349 + # echo 8192 > access_patterns/sz/max 350 + # echo 0 > access_patterns/nr_accesses/min 351 + # echo 5 > access_patterns/nr_accesses/max 352 + # echo 10 > access_patterns/age/min 353 + # echo 20 > access_patterns/age/max 354 + # echo pageout > action 355 + # # set quotas 356 + # echo 10 > quotas/ms 357 + # echo $((1024*1024*1024)) > quotas/bytes 358 + # echo 1000 > quotas/reset_interval_ms 359 + # # set watermark 360 + # echo free_mem_rate > watermarks/metric 361 + # echo 5000000 > watermarks/interval_us 362 + # echo 600 > watermarks/high 363 + # echo 500 > watermarks/mid 364 + # echo 300 > watermarks/low 365 + 366 + Please note that it's highly recommended to use user space tools like `damo 367 + <https://github.com/awslabs/damo>`_ rather than manually reading and writing 368 + the files as above. Above is only for an example. 39 369 40 370 .. _debugfs_interface: 41 371 ··· 385 47 ---------- 386 48 387 49 Users can get and set the ``sampling interval``, ``aggregation interval``, 388 - ``regions update interval``, and min/max number of monitoring target regions by 50 + ``update interval``, and min/max number of monitoring target regions by 389 51 reading from and writing to the ``attrs`` file. To know about the monitoring 390 52 attributes in detail, please refer to the :doc:`/vm/damon/design`. For 391 53 example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and ··· 446 108 as they want, by writing proper values to the ``init_regions`` file. 
Each line 447 109 of the input should represent one region in below form.:: 448 110 449 - <target id> <start address> <end address> 111 + <target idx> <start address> <end address> 450 112 451 - The ``target id`` should already in ``target_ids`` file, and the regions should 452 - be passed in address order. For example, below commands will set a couple of 453 - address ranges, ``1-100`` and ``100-200`` as the initial monitoring target 454 - region of process 42, and another couple of address ranges, ``20-40`` and 455 - ``50-100`` as that of process 4242.:: 113 + The ``target idx`` should be the index of the target in ``target_ids`` file, 114 + starting from ``0``, and the regions should be passed in address order. For 115 + example, below commands will set a couple of address ranges, ``1-100`` and 116 + ``100-200`` as the initial monitoring target region of pid 42, which is the 117 + first one (index ``0``) in ``target_ids``, and another couple of address 118 + ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one 119 + (index ``1``) in ``target_ids``.:: 456 120 457 121 # cd <debugfs>/damon 458 - # echo "42 1 100 459 - 42 100 200 460 - 4242 20 40 461 - 4242 50 100" > init_regions 122 + # cat target_ids 123 + 42 4242 124 + # echo "0 1 100 125 + 0 100 200 126 + 1 20 40 127 + 1 50 100" > init_regions 462 128 463 129 Note that this sets the initial monitoring target regions only. In case of 464 130 virtual memory monitoring, DAMON will automatically updates the boundary of the 465 - regions after one ``regions update interval``. Therefore, users should set the 466 - ``regions update interval`` large enough in this case, if they don't want the 131 + regions after one ``update interval``. Therefore, users should set the 132 + ``update interval`` large enough in this case, if they don't want the 467 133 update. 468 134 469 135
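
The watermarks rule described in the usage document above (deactivate when the metric is above ``high`` or below ``low``, activate when it is below ``mid``) can be modeled in user space roughly as follows. This is only a sketch: the ``struct wmarks`` type and the ``scheme_active()`` function are hypothetical names, not DAMON kernel symbols, and keeping the previous state when the metric lies between ``mid`` and ``high`` is an assumption, since the text does not describe that range.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the watermarks rule; not kernel code. */
struct wmarks {
	unsigned long high;	/* per-thousand, e.g. 600 for 60% */
	unsigned long mid;
	unsigned long low;
};

/*
 * Decide whether a scheme should be active for the given metric value.
 * 'was_active' is returned unchanged when the value lies in [mid, high];
 * that behavior is an assumption of this sketch.
 */
bool scheme_active(const struct wmarks *w, unsigned long metric,
		   bool was_active)
{
	/* The sketch assumes sanely ordered watermarks. */
	assert(w->low <= w->mid && w->mid <= w->high);

	if (metric > w->high || metric < w->low)
		return false;		/* outside [low, high]: deactivate */
	if (metric < w->mid)
		return true;		/* within [low, mid): activate */
	return was_active;		/* within [mid, high]: keep state */
}
```

With the values from the example above (``high`` 600, ``mid`` 500, ``low`` 300), a free memory rate of 400 activates the scheme, while 700 or 200 deactivates it.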

+19 -3

Documentation/admin-guide/mm/zswap.rst

···
 	echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled

 When zswap same-filled page identification is disabled at runtime, it will stop
-checking for the same-value filled pages during store operation. However, the
-existing pages which are marked as same-value filled pages remain stored
-unchanged in zswap until they are either loaded or invalidated.
+checking for the same-value filled pages during store operation.
+In other words, every page will be then considered non-same-value filled.
+However, the existing pages which are marked as same-value filled pages remain
+stored unchanged in zswap until they are either loaded or invalidated.
+
+In some circumstances it might be advantageous to make use of just the zswap
+ability to efficiently store same-filled pages without enabling the whole
+compressed page storage.
+In this case the handling of non-same-value pages by zswap (enabled by default)
+can be disabled by setting the ``non_same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``.
+It can also be enabled and disabled at runtime using the sysfs
+``non_same_filled_pages_enabled`` attribute, e.g.::
+
+	echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled
+
+Disabling both ``zswap.same_filled_pages_enabled`` and
+``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new
+pages by zswap.

 To prevent zswap from shrinking pool when zswap is full and there's a high
 pressure on swap (this will result in flipping pages in and out zswap pool

+21 -9

Documentation/admin-guide/sysctl/kernel.rst

···
 numa_balancing
 ==============

-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
+Enables/disables and configures automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes that access it often.
+The value to set can be the result of ORing the following:

-Enables/disables automatic NUMA memory balancing. On NUMA machines, there
-is a performance penalty if remote memory is accessed by a CPU. When this
-feature is enabled the kernel samples what task thread is accessing memory
-by periodically unmapping pages and later trapping a page fault. At the
-time of the page fault, it is determined if the data being accessed should
-be migrated to a local memory node.
+= =================================
+0 NUMA_BALANCING_DISABLED
+1 NUMA_BALANCING_NORMAL
+2 NUMA_BALANCING_MEMORY_TIERING
+= =================================
+
+Or NUMA_BALANCING_NORMAL to optimize page placement among different
+NUMA nodes to reduce remote accessing. On NUMA machines, there is a
+performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing
+memory by periodically unmapping pages and later trapping a page
+fault. At the time of the page fault, it is determined if the data
+being accessed should be migrated to a local memory node.

 The unmapping of pages and trapping faults incur additional overhead that
 ideally is offset by improved memory locality but there is no universal
 guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled.
+
+Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among
+different types of memory (represented as different NUMA nodes) to
+place the hot pages in the fast memory. This is implemented based on
+unmapping and page fault too.

 oops_all_cpu_backtrace
 ======================
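
Because the documented ``numa_balancing`` value is a bitwise OR of the modes in the table above, a reader can test each mode with a mask. The sketch below is a user-space illustration only: the three constants mirror the values in the table, while ``numa_balancing_mode_enabled()`` is a hypothetical helper name that is not claimed to exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as listed in the sysctl document above. */
#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

/*
 * Hypothetical helper: report whether the sysctl value 'val' has the
 * given single mode bit set.
 */
bool numa_balancing_mode_enabled(int val, int mode)
{
	/* 'mode' must be exactly one of the two non-zero flags. */
	assert(mode == NUMA_BALANCING_NORMAL ||
	       mode == NUMA_BALANCING_MEMORY_TIERING);

	return (val & mode) != 0;
}
```

Under this reading, a value of 3 would request both normal balancing and memory tiering, and 0 would disable both.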

+17 -2

Documentation/core-api/mm-api.rst

···
 File Mapping and Page Cache
 ===========================

-.. kernel-doc:: mm/readahead.c
-   :export:
+Filemap
+-------

 .. kernel-doc:: mm/filemap.c
    :export:

+Readahead
+---------
+
+.. kernel-doc:: mm/readahead.c
+   :doc: Readahead Overview
+
+.. kernel-doc:: mm/readahead.c
+   :export:
+
+Writeback
+---------
+
 .. kernel-doc:: mm/page-writeback.c
    :export:
+
+Truncate
+--------

 .. kernel-doc:: mm/truncate.c
    :export:

+12

Documentation/dev-tools/kfence.rst

···
 ``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
 disables KFENCE.

+The sample interval controls a timer that sets up KFENCE allocations. By
+default, to keep the real sample interval predictable, the normal timer also
+causes CPU wake-ups when the system is completely idle. This may be undesirable
+on power-constrained systems. The boot parameter ``kfence.deferrable=1``
+instead switches to a "deferrable" timer which does not force CPU wake-ups on
+idle systems, at the risk of unpredictable sample intervals. The default is
+configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``.
+
+.. warning::
+   The KUnit test suite is very likely to fail when using a deferrable timer
+   since it currently causes very unpredictable sample intervals.
+
 The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
 further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
 255), the number of available guarded objects can be controlled. Each object

+6

Documentation/filesystems/porting.rst

···

 At some point that will become mandatory.

+**mandatory**
+
+The foo_inode_info should always be allocated through alloc_inode_sb() rather
+than kmem_cache_alloc() or kmalloc() related functions, so that the inode
+reclaim context is set up correctly.
+
 ---

 **mandatory**

+10 -6

Documentation/filesystems/vfs.rst

···
 	object. The pages are consecutive in the page cache and are
 	locked. The implementation should decrement the page refcount
 	after starting I/O on each page. Usually the page will be
-	unlocked by the I/O completion handler. If the filesystem decides
-	to stop attempting I/O before reaching the end of the readahead
-	window, it can simply return. The caller will decrement the page
-	refcount and unlock the remaining pages for you. Set PageUptodate
-	if the I/O completes successfully. Setting PageError on any page
-	will be ignored; simply unlock the page if an I/O error occurs.
+	unlocked by the I/O completion handler. The set of pages is
+	divided into some sync pages followed by some async pages;
+	rac->ra->async_size gives the number of async pages. The
+	filesystem should attempt to read all sync pages but may decide
+	to stop once it reaches the async pages. If it does decide to
+	stop attempting I/O, it can simply return. The caller will
+	remove the remaining pages from the address space, unlock them
+	and decrement the page refcount. Set PageUptodate if the I/O
+	completes successfully. Setting PageError on any page will be
+	ignored; simply unlock the page if an I/O error occurs.

 ``readpages``
 	called by the VM to read pages associated with the address_space
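
The sync/async split described in the ``readahead`` text above can be illustrated with a small user-space model. Everything here is hypothetical: ``struct ra_window`` is a stand-in, not the kernel's ``struct readahead_control``, and the sketch assumes the window covers the whole set of pages passed to the filesystem, so that the sync pages are simply those before the trailing ``async_size`` pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one readahead request; not a kernel structure. */
struct ra_window {
	unsigned int nr_pages;		/* pages handed to the filesystem */
	unsigned int async_size;	/* trailing async pages */
};

/* Number of leading sync pages the filesystem should attempt to read. */
unsigned int ra_sync_pages(const struct ra_window *w)
{
	assert(w != NULL);

	if (w->async_size >= w->nr_pages)
		return 0;	/* the whole request is async */
	return w->nr_pages - w->async_size;
}

/*
 * May the filesystem stop before reading page 'idx' (0-based)?  In this
 * model that is allowed only once the async part has been reached.
 */
bool ra_may_stop_at(const struct ra_window *w, unsigned int idx)
{
	return idx >= ra_sync_pages(w);
}
```

For a request of 32 pages with an ``async_size`` of 8, this model treats pages 0 to 23 as sync and allows stopping from page 24 onwards.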

+23 -20

Documentation/vm/damon/design.rst

··· 13 13 the other hand, the accuracy and overhead tradeoff mechanism, which is the core 14 14 of DAMON, is in the pure logic space. DAMON separates the two parts in 15 15 different layers and defines its interface to allow various low level 16 - primitives implementations configurable with the core logic. 16 + primitives implementations configurable with the core logic. We call the low 17 + level primitives implementations monitoring operations. 17 18 18 19 Due to this separated design and the configurable interface, users can extend 19 - DAMON for any address space by configuring the core logics with appropriate low 20 - level primitive implementations. If appropriate one is not provided, users can 21 - implement the primitives on their own. 20 + DAMON for any address space by configuring the core logics with appropriate 21 + monitoring operations. If appropriate one is not provided, users can implement 22 + the operations on their own. 22 23 23 24 For example, physical memory, virtual memory, swap space, those for specific 24 25 processes, NUMA nodes, files, and backing memory devices would be supportable. ··· 27 26 primitives, those will be easily configurable. 28 27 29 28 30 - Reference Implementations of Address Space Specific Primitives 31 - ============================================================== 29 + Reference Implementations of Address Space Specific Monitoring Operations 30 + ========================================================================= 32 31 33 - The low level primitives for the fundamental access monitoring are defined in 34 - two parts: 32 + The monitoring operations are defined in two parts: 35 33 36 34 1. Identification of the monitoring target address range for the address space. 37 35 2. Access check of specific address range in the target space. 
38 36 39 - DAMON currently provides the implementations of the primitives for the physical 37 + DAMON currently provides the implementations of the operations for the physical 40 38 and virtual address spaces. Below two subsections describe how those work. 41 39 42 40 43 41 VMA-based Target Address Range Construction 44 42 ------------------------------------------- 45 43 46 - This is only for the virtual address space primitives implementation. That for 47 - the physical address space simply asks users to manually set the monitoring 48 - target address ranges. 44 + This is only for the virtual address space monitoring operations 45 + implementation. That for the physical address space simply asks users to 46 + manually set the monitoring target address ranges. 49 47 50 48 Only small parts in the super-huge virtual address space of the processes are 51 49 mapped to the physical memory and accessed. Thus, tracking the unmapped ··· 84 84 and clear the bit(s) for next sampling target address and checks whether the 85 85 bit(s) set again after one sampling period. This could disturb other kernel 86 86 subsystems using the Accessed bits, namely Idle page tracking and the reclaim 87 - logic. To avoid such disturbances, DAMON makes it mutually exclusive with Idle 88 - page tracking and uses ``PG_idle`` and ``PG_young`` page flags to solve the 89 - conflict with the reclaim logic, as Idle page tracking does. 87 + logic. DAMON does nothing to avoid disturbing Idle page tracking, so handling 88 + the interference is the responsibility of sysadmins. However, it solves the 89 + conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags, 90 + as Idle page tracking does. 
90 91 91 92 92 93 Address Space Independent Core Mechanisms ··· 95 94 96 95 Below four sections describe each of the DAMON core mechanisms and the five 97 96 monitoring attributes, ``sampling interval``, ``aggregation interval``, 98 - ``regions update interval``, ``minimum number of regions``, and ``maximum 99 - number of regions``. 97 + ``update interval``, ``minimum number of regions``, and ``maximum number of 98 + regions``. 100 99 101 100 102 101 Access Frequency Monitoring ··· 169 168 virtual memory could be dynamically mapped and unmapped. Physical memory could 170 169 be hot-plugged. 171 170 172 - As the changes could be quite frequent in some cases, DAMON checks the dynamic 173 - memory mapping changes and applies it to the abstracted target area only for 174 - each of a user-specified time interval (``regions update interval``). 171 + As the changes could be quite frequent in some cases, DAMON allows the 172 + monitoring operations to check dynamic changes including memory mapping changes 173 + and applies it to monitoring operations-related data structures such as the 174 + abstracted monitoring target memory area only for each of a user-specified time 175 + interval (``update interval``).

+1 -1

Documentation/vm/damon/faq.rst

··· 31 31 ======================================= 32 32 33 33 No. The core of the DAMON is address space independent. The address space 34 - specific low level primitive parts including monitoring target regions 34 + specific monitoring operations including monitoring target regions 35 35 constructions and actual access checks can be implemented and configured on the 36 36 DAMON core by the users. In this way, DAMON users can monitor any address 37 37 space with any access check technique.

+1

MAINTAINERS

··· 5326 5326 M: SeongJae Park <sj@kernel.org> 5327 5327 L: linux-mm@kvack.org 5328 5328 S: Maintained 5329 + F: Documentation/ABI/testing/sysfs-kernel-mm-damon 5329 5330 F: Documentation/admin-guide/mm/damon/ 5330 5331 F: Documentation/vm/damon/ 5331 5332 F: include/linux/damon.h

+1 -3

arch/arm/Kconfig

··· 38 38 select ARCH_USE_CMPXCHG_LOCKREF 39 39 select ARCH_USE_MEMTEST 40 40 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU 41 + select ARCH_WANT_GENERAL_HUGETLB 41 42 select ARCH_WANT_IPC_PARSE_VERSION 42 43 select ARCH_WANT_LD_ORPHAN_WARN 43 44 select BINFMT_FLAT_ARGVP_ENVP_ON_STACK ··· 1509 1508 config HW_PERF_EVENTS 1510 1509 def_bool y 1511 1510 depends on ARM_PMU 1512 - 1513 - config ARCH_WANT_GENERAL_HUGETLB 1514 - def_bool y 1515 1511 1516 1512 config ARM_MODULE_PLTS 1517 1513 bool "Use PLTs to allow module memory to spill over into vmalloc area"

-3

arch/arm64/kernel/setup.c

··· 406 406 { 407 407 int i; 408 408 409 - for_each_online_node(i) 410 - register_one_node(i); 411 - 412 409 for_each_possible_cpu(i) { 413 410 struct cpu *cpu = &per_cpu(cpu_data.cpu, i); 414 411 cpu->hotpluggable = cpu_can_disable(i);

+1

arch/arm64/mm/hugetlbpage.c

··· 356 356 { 357 357 size_t pagesize = 1UL << shift; 358 358 359 + entry = pte_mkhuge(entry); 359 360 if (pagesize == CONT_PTE_SIZE) { 360 361 entry = pte_mkcont(entry); 361 362 } else if (pagesize == CONT_PMD_SIZE) {

-2

arch/hexagon/mm/init.c

··· 29 29 /* indicate pfn's of high memory */ 30 30 unsigned long highstart_pfn, highend_pfn; 31 31 32 - DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); 33 - 34 32 /* Default cache attribute for newly created page tables */ 35 33 unsigned long _dflt_cache_att = CACHEDEF; 36 34

-10

arch/ia64/kernel/topology.c

··· 70 70 { 71 71 int i, err = 0; 72 72 73 - #ifdef CONFIG_NUMA 74 - /* 75 - * MCD - Do we want to register all ONLINE nodes, or all POSSIBLE nodes? 76 - */ 77 - for_each_online_node(i) { 78 - if ((err = register_one_node(i))) 79 - goto out; 80 - } 81 - #endif 82 - 83 73 sysfs_cpus = kcalloc(NR_CPUS, sizeof(struct ia64_cpu), GFP_KERNEL); 84 74 if (!sysfs_cpus) 85 75 panic("kzalloc in topology_init failed - NR_CPUS too big?");

+2 -9

arch/ia64/mm/discontig.c

···
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }

-#ifdef CONFIG_MEMORY_HOTPLUG
-pg_data_t *arch_alloc_nodedata(int nid)
+pg_data_t * __init arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);

-	return kzalloc(size, GFP_KERNEL);
-}
-
-void arch_free_nodedata(pg_data_t *pgdat)
-{
-	kfree(pgdat);
+	return memblock_alloc(size, SMP_CACHE_BYTES);
 }

 void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
···
 	pgdat_list[update_node] = update_pgdat;
 	scatter_node_data();
 }
-#endif

 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,

-5

arch/mips/kernel/topology.c

··· 12 12 { 13 13 int i, ret; 14 14 15 - #ifdef CONFIG_NUMA 16 - for_each_online_node(i) 17 - register_one_node(i); 18 - #endif /* CONFIG_NUMA */ 19 - 20 15 for_each_present_cpu(i) { 21 16 struct cpu *c = &per_cpu(cpu_devices, i); 22 17

-1

arch/nds32/mm/init.c

··· 18 18 #include <asm/tlb.h> 19 19 #include <asm/page.h> 20 20 21 - DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); 22 21 DEFINE_SPINLOCK(anon_alias_lock); 23 22 extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; 24 23

-2

arch/openrisc/mm/init.c

··· 38 38 39 39 int mem_init_done; 40 40 41 - DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); 42 - 43 41 static void __init zone_sizes_init(void) 44 42 { 45 43 unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };

-5

arch/powerpc/include/asm/fadump-internal.h

··· 19 19 20 20 #define memblock_num_regions(memblock_type) (memblock.memblock_type.cnt) 21 21 22 - /* Alignment per CMA requirement. */ 23 - #define FADUMP_CMA_ALIGNMENT (PAGE_SIZE << \ 24 - max_t(unsigned long, MAX_ORDER - 1, \ 25 - pageblock_order)) 26 - 27 22 /* FAD commands */ 28 23 #define FADUMP_REGISTER 1 29 24 #define FADUMP_UNREGISTER 2

+2 -2

arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h

···
 	size_t size = 1UL << shift;

 	if (size == SZ_16K)
-		return __pte(pte_val(entry) & ~_PAGE_HUGE);
+		return __pte(pte_val(entry) | _PAGE_SPS);
 	else
-		return entry;
+		return __pte(pte_val(entry) | _PAGE_SPS | _PAGE_HUGE);
 }
 #define arch_make_huge_pte arch_make_huge_pte
 #endif

+7 -1

arch/powerpc/kernel/fadump.c

··· 113 113 } 114 114 115 115 /* 116 + * If CMA activation fails, keep the pages reserved, instead of 117 + * exposing them to buddy allocator. Same as 'fadump=nocma' case. 118 + */ 119 + cma_reserve_pages_on_error(fadump_cma); 120 + 121 + /* 116 122 * So we now have successfully initialized cma area for fadump. 117 123 */ 118 124 pr_info("Initialized 0x%lx bytes cma area at %ldMB from 0x%lx " ··· 550 544 if (!fw_dump.nocma) { 551 545 fw_dump.boot_memory_size = 552 546 ALIGN(fw_dump.boot_memory_size, 553 - FADUMP_CMA_ALIGNMENT); 547 + CMA_MIN_ALIGNMENT_BYTES); 554 548 } 555 549 #endif 556 550

-17

arch/powerpc/kernel/sysfs.c

··· 1110 1110 /* NUMA stuff */ 1111 1111 1112 1112 #ifdef CONFIG_NUMA 1113 - static void __init register_nodes(void) 1114 - { 1115 - int i; 1116 - 1117 - for (i = 0; i < MAX_NUMNODES; i++) 1118 - register_one_node(i); 1119 - } 1120 - 1121 1113 int sysfs_add_device_to_node(struct device *dev, int nid) 1122 1114 { 1123 1115 struct node *node = node_devices[nid]; ··· 1124 1132 sysfs_remove_link(&node->dev.kobj, kobject_name(&dev->kobj)); 1125 1133 } 1126 1134 EXPORT_SYMBOL_GPL(sysfs_remove_device_from_node); 1127 - 1128 - #else 1129 - static void __init register_nodes(void) 1130 - { 1131 - return; 1132 - } 1133 - 1134 1135 #endif 1135 1136 1136 1137 /* Only valid if CPU is present. */ ··· 1139 1154 static int __init topology_init(void) 1140 1155 { 1141 1156 int cpu, r; 1142 - 1143 - register_nodes(); 1144 1157 1145 1158 for_each_possible_cpu(cpu) { 1146 1159 struct cpu *c = &per_cpu(cpu_devices, cpu);

+1 -3

arch/riscv/Kconfig

··· 40 40 select ARCH_USE_MEMTEST 41 41 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU 42 42 select ARCH_WANT_FRAME_POINTERS 43 + select ARCH_WANT_GENERAL_HUGETLB 43 44 select ARCH_WANT_HUGE_PMD_SHARE if 64BIT 44 45 select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU 45 46 select BUILDTIME_TABLE_SORT if MMU ··· 171 170 172 171 config ARCH_SELECT_MEMORY_MODEL 173 172 def_bool ARCH_SPARSEMEM_ENABLE 174 - 175 - config ARCH_WANT_GENERAL_HUGETLB 176 - def_bool y 177 173 178 174 config ARCH_SUPPORTS_UPROBES 179 175 def_bool y

-3

arch/riscv/kernel/setup.c

··· 301 301 { 302 302 int i, ret; 303 303 304 - for_each_online_node(i) 305 - register_one_node(i); 306 - 307 304 for_each_possible_cpu(i) { 308 305 struct cpu *cpu = &per_cpu(cpu_devices, i); 309 306

-7

arch/s390/kernel/numa.c

··· 33 33 NODE_DATA(0)->node_spanned_pages = memblock_end_of_DRAM() >> PAGE_SHIFT; 34 34 NODE_DATA(0)->node_id = 0; 35 35 } 36 - 37 - static int __init numa_init_late(void) 38 - { 39 - register_one_node(0); 40 - return 0; 41 - } 42 - arch_initcall(numa_init_late);

-5

arch/sh/kernel/topology.c

··· 46 46 { 47 47 int i, ret; 48 48 49 - #ifdef CONFIG_NUMA 50 - for_each_online_node(i) 51 - register_one_node(i); 52 - #endif 53 - 54 49 for_each_present_cpu(i) { 55 50 struct cpu *c = &per_cpu(cpu_devices, i); 56 51

-12

arch/sparc/kernel/sysfs.c

··· 244 244 mmu_stats_supported = 1; 245 245 } 246 246 247 - static void register_nodes(void) 248 - { 249 - #ifdef CONFIG_NUMA 250 - int i; 251 - 252 - for (i = 0; i < MAX_NUMNODES; i++) 253 - register_one_node(i); 254 - #endif 255 - } 256 - 257 247 static int __init topology_init(void) 258 248 { 259 249 int cpu, ret; 260 - 261 - register_nodes(); 262 250 263 251 check_mmu_stats(); 264 252

+1

arch/sparc/mm/hugetlbpage.c

··· 181 181 { 182 182 pte_t pte; 183 183 184 + entry = pte_mkhuge(entry); 184 185 pte = hugepage_shift_to_tte(entry, shift); 185 186 186 187 #ifdef CONFIG_SPARC64

+1 -3

arch/x86/Kconfig

··· 119 119 select ARCH_WANT_DEFAULT_BPF_JIT if X86_64 120 120 select ARCH_WANTS_DYNAMIC_TASK_STRUCT 121 121 select ARCH_WANTS_NO_INSTR 122 + select ARCH_WANT_GENERAL_HUGETLB 122 123 select ARCH_WANT_HUGE_PMD_SHARE 123 124 select ARCH_WANT_LD_ORPHAN_WARN 124 125 select ARCH_WANTS_RT_DELAYED_SIGNALS ··· 348 347 default 512 349 348 350 349 config ARCH_SUSPEND_POSSIBLE 351 - def_bool y 352 - 353 - config ARCH_WANT_GENERAL_HUGETLB 354 350 def_bool y 355 351 356 352 config AUDIT_ARCH

+5 -3

arch/x86/kernel/cpu/mce/core.c

···

 	/*
 	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
-	 * to the current process with the proper error info, so no need to
-	 * send SIGBUS here again.
+	 * to the current process with the proper error info,
+	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+	 *
+	 * In both cases, no further processing is required.
 	 */
-	if (ret == -EHWPOISON)
+	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
 		return;

 	pr_err("Memory error not recovered");

-5

arch/x86/kernel/topology.c

··· 154 154 { 155 155 int i; 156 156 157 - #ifdef CONFIG_NUMA 158 - for_each_online_node(i) 159 - register_one_node(i); 160 - #endif 161 - 162 157 for_each_present_cpu(i) 163 158 arch_register_cpu(i); 164 159

+20 -13

arch/x86/mm/numa.c

···
 	numa_init(dummy_numa_init);
 }

-static void __init init_memory_less_node(int nid)
-{
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_memoryless_node(nid);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}

 /*
  * A node may exist which has one or more Generic Initiators but no CPUs and no
···
 {
 	int nid;

+	/*
+	 * Exclude this node from
+	 * bringup_nonboot_cpus
+	 * cpu_up
+	 * __try_online_node
+	 * register_one_node
+	 * because node_subsys is not initialized yet.
+	 * TODO remove dependency on node_online
+	 */
 	for_each_node_state(nid, N_GENERIC_INITIATOR)
 		if (!node_online(nid))
-			init_memory_less_node(nid);
+			node_set_online(nid);
 }

 /*
···
 		if (node == NUMA_NO_NODE)
 			continue;

+		/*
+		 * Exclude this node from
+		 * bringup_nonboot_cpus
+		 * cpu_up
+		 * __try_online_node
+		 * register_one_node
+		 * because node_subsys is not initialized yet.
+		 * TODO remove dependency on node_online
+		 */
 		if (!node_online(node))
-			init_memory_less_node(node);
+			node_set_online(node);

 		numa_set_node(cpu, node);
 	}

+1 -1

block/bdev.c

···

 static struct inode *bdev_alloc_inode(struct super_block *sb)
 {
-	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
+	struct bdev_inode *ei = alloc_inode_sb(sb, bdev_cachep, GFP_KERNEL);

 	if (!ei)
 		return NULL;

+1 -1

block/bfq-iosched.c

··· 5459 5459 bfqq = bic_to_bfqq(bic, false); 5460 5460 if (bfqq) { 5461 5461 bfq_release_process_ref(bfqd, bfqq); 5462 - bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic, true); 5462 + bfqq = bfq_get_queue(bfqd, bio, false, bic, true); 5463 5463 bic_set_bfqq(bic, bfqq, false); 5464 5464 } 5465 5465

+1

drivers/base/init.c

··· 35 35 auxiliary_bus_init(); 36 36 cpu_dev_init(); 37 37 memory_dev_init(); 38 + node_dev_init(); 38 39 container_dev_init(); 39 40 }

+122 -25

drivers/base/memory.c

··· 215 215 adjust_present_page_count(pfn_to_page(start_pfn), mem->group, 216 216 nr_vmemmap_pages); 217 217 218 + mem->zone = zone; 218 219 return ret; 219 220 } 220 221 ··· 226 225 unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; 227 226 int ret; 228 227 228 + if (!mem->zone) 229 + return -EINVAL; 230 + 229 231 /* 230 232 * Unaccount before offlining, such that unpopulated zone and kthreads 231 233 * can properly be torn down in offline_pages(). ··· 238 234 -nr_vmemmap_pages); 239 235 240 236 ret = offline_pages(start_pfn + nr_vmemmap_pages, 241 - nr_pages - nr_vmemmap_pages, mem->group); 237 + nr_pages - nr_vmemmap_pages, mem->zone, mem->group); 242 238 if (ret) { 243 239 /* offline_pages() failed. Account back. */ 244 240 if (nr_vmemmap_pages) ··· 250 246 if (nr_vmemmap_pages) 251 247 mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); 252 248 249 + mem->zone = NULL; 253 250 return ret; 254 251 } 255 252 ··· 416 411 */ 417 412 if (mem->state == MEM_ONLINE) { 418 413 /* 419 - * The block contains more than one zone can not be offlined. 420 - * This can happen e.g. for ZONE_DMA and ZONE_DMA32 414 + * If !mem->zone, the memory block spans multiple zones and 415 + * cannot get offlined. 421 416 */ 422 - default_zone = test_pages_in_a_zone(start_pfn, 423 - start_pfn + nr_pages); 417 + default_zone = mem->zone; 424 418 if (!default_zone) 425 419 return sysfs_emit(buf, "%s\n", "none"); 426 420 len += sysfs_emit_at(buf, len, "%s", default_zone->name); ··· 559 555 return -EINVAL; 560 556 pfn >>= PAGE_SHIFT; 561 557 ret = memory_failure(pfn, 0); 558 + if (ret == -EOPNOTSUPP) 559 + ret = 0; 562 560 return ret ? 
ret : count;
 }

···
 	NULL,
 };

-/*
- * register_memory - Setup a sysfs device for a memory block
- */
-static
-int register_memory(struct memory_block *memory)
+static int __add_memory_block(struct memory_block *memory)
 {
 	int ret;

···
 	return ret;
 }

-static int init_memory_block(unsigned long block_id, unsigned long state,
-			     unsigned long nr_vmemmap_pages,
-			     struct memory_group *group)
+static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
+						     int nid)
+{
+	const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+	const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	struct zone *zone, *matching_zone = NULL;
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int i;
+
+	/*
+	 * This logic only works for early memory, when the applicable zones
+	 * already span the memory block. We don't expect overlapping zones on
+	 * a single node for early memory. So if we're told that some PFNs
+	 * of a node fall into this memory block, we can assume that all node
+	 * zones that intersect with the memory block are actually applicable.
+	 * No need to look at the memmap.
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zone = pgdat->node_zones + i;
+		if (!populated_zone(zone))
+			continue;
+		if (!zone_intersects(zone, start_pfn, nr_pages))
+			continue;
+		if (!matching_zone) {
+			matching_zone = zone;
+			continue;
+		}
+		/* Spans multiple zones ... */
+		matching_zone = NULL;
+		break;
+	}
+	return matching_zone;
+}
+
+#ifdef CONFIG_NUMA
+/**
+ * memory_block_add_nid() - Indicate that system RAM falling into this memory
+ *			    block device (partially) belongs to the given node.
+ * @mem: The memory block device.
+ * @nid: The node id.
+ * @context: The memory initialization context.
+ *
+ * Indicate that system RAM falling into this memory block (partially) belongs
+ * to the given node. If the context indicates ("early") that we are adding the
+ * node during node device subsystem initialization, this will also properly
+ * set/adjust mem->zone based on the zone ranges of the given node.
+ */
+void memory_block_add_nid(struct memory_block *mem, int nid,
+			  enum meminit_context context)
+{
+	if (context == MEMINIT_EARLY && mem->nid != nid) {
+		/*
+		 * For early memory we have to determine the zone when setting
+		 * the node id and handle multiple nodes spanning a single
+		 * memory block by indicate via zone == NULL that we're not
+		 * dealing with a single zone. So if we're setting the node id
+		 * the first time, determine if there is a single zone. If we're
+		 * setting the node id a second time to a different node,
+		 * invalidate the single detected zone.
+		 */
+		if (mem->nid == NUMA_NO_NODE)
+			mem->zone = early_node_zone_for_memory_block(mem, nid);
+		else
+			mem->zone = NULL;
+	}
+
+	/*
+	 * If this memory block spans multiple nodes, we only indicate
+	 * the last processed node. If we span multiple nodes (not applicable
+	 * to hotplugged memory), zone == NULL will prohibit memory offlining
+	 * and consequently unplug.
+	 */
+	mem->nid = nid;
+}
+#endif
+
+static int add_memory_block(unsigned long block_id, unsigned long state,
+			    unsigned long nr_vmemmap_pages,
+			    struct memory_group *group)
 {
 	struct memory_block *mem;
 	int ret = 0;
···
 	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 	INIT_LIST_HEAD(&mem->group_next);

+#ifndef CONFIG_NUMA
+	if (state == MEM_ONLINE)
+		/*
+		 * MEM_ONLINE at this point implies early memory. With NUMA,
+		 * we'll determine the zone when setting the node id via
+		 * memory_block_add_nid(). Memory hotplug updated the zone
+		 * manually when memory onlining/offlining succeeds.
+		 */
+		mem->zone = early_node_zone_for_memory_block(mem, NUMA_NO_NODE);
+#endif /* CONFIG_NUMA */
+
+	ret = __add_memory_block(mem);
+	if (ret)
+		return ret;
+
 	if (group) {
 		mem->group = group;
 		list_add(&mem->group_next, &group->memory_blocks);
 	}

-	ret = register_memory(mem);
-
-	return ret;
+	return 0;
 }

-static int add_memory_block(unsigned long base_section_nr)
+static int __init add_boot_memory_block(unsigned long base_section_nr)
 {
 	int section_count = 0;
 	unsigned long nr;
···

 	if (section_count == 0)
 		return 0;
-	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE, 0, NULL);
+	return add_memory_block(memory_block_id(base_section_nr),
+				MEM_ONLINE, 0, NULL);
 }

-static void unregister_memory(struct memory_block *memory)
+static int add_hotplug_memory_block(unsigned long block_id,
+				    unsigned long nr_vmemmap_pages,
+				    struct memory_group *group)
+{
+	return add_memory_block(block_id, MEM_OFFLINE, nr_vmemmap_pages, group);
+}
+
+static void remove_memory_block(struct memory_block *memory)
 {
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
···
 		return -EINVAL;

 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages,
-					group);
+		ret = add_hotplug_memory_block(block_id, vmemmap_pages, group);
 		if (ret)
 			break;
 	}
···
 			mem = find_memory_block_by_id(block_id);
 			if (WARN_ON_ONCE(!mem))
 				continue;
-			unregister_memory(mem);
+			remove_memory_block(mem);
 		}
 	}
 	return ret;
···
 		if (WARN_ON_ONCE(!mem))
 			continue;
 		unregister_memory_block_under_nodes(mem);
-		unregister_memory(mem);
+		remove_memory_block(mem);
 	}
 }

···
 	 */
 	for (nr = 0; nr <= __highest_present_section_nr;
 	     nr += sections_per_block) {
-		ret = add_memory_block(nr);
+		ret = add_boot_memory_block(nr);
 		if (ret)
 			panic("%s() failed to add memory block: %d\n", __func__,
 			      ret);
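
The zone detection added in early_node_zone_for_memory_block() can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel code: struct zone_span, span_intersects() and single_zone_for_block() are hypothetical stand-ins for struct zone, zone_intersects() and the new helper, and a zone with nr_pages == 0 is assumed to model an unpopulated zone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone's PFN span; not the kernel's struct zone. */
struct zone_span {
	unsigned long start_pfn;
	unsigned long nr_pages;	/* 0 models an unpopulated zone */
};

/* True if [start_pfn, start_pfn + nr_pages) overlaps the zone's span. */
static bool span_intersects(const struct zone_span *z,
			    unsigned long start_pfn, unsigned long nr_pages)
{
	if (z->nr_pages == 0)
		return false;
	if (start_pfn + nr_pages <= z->start_pfn)
		return false;
	if (z->start_pfn + z->nr_pages <= start_pfn)
		return false;
	return true;
}

/*
 * Return the only zone intersecting the block, or NULL when no zone or
 * more than one zone intersects it.
 */
static const struct zone_span *single_zone_for_block(
		const struct zone_span *zones, size_t nr_zones,
		unsigned long start_pfn, unsigned long nr_pages)
{
	const struct zone_span *match = NULL;
	size_t i;

	assert(nr_pages > 0);
	for (i = 0; i < nr_zones; i++) {
		if (!span_intersects(&zones[i], start_pfn, nr_pages))
			continue;
		if (match)
			return NULL;	/* spans multiple zones */
		match = &zones[i];
	}
	return match;
}
```

A block that straddles the boundary between two populated spans yields NULL, which corresponds to the zone == NULL case that the comments in the hunk describe as prohibiting offlining.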

+25 -23

drivers/base/node.c

···
 }

 static void do_register_memory_block_under_node(int nid,
-						struct memory_block *mem_blk)
+						struct memory_block *mem_blk,
+						enum meminit_context context)
 {
 	int ret;

-	/*
-	 * If this memory block spans multiple nodes, we only indicate
-	 * the last processed node.
-	 */
-	mem_blk->nid = nid;
+	memory_block_add_nid(mem_blk, nid, context);

 	ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
 				       &mem_blk->dev.kobj,
···
 		if (page_nid != nid)
 			continue;

-		do_register_memory_block_under_node(nid, mem_blk);
+		do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
 		return 0;
 	}
 	/* mem section does not span the specified node */
···
 {
 	int nid = *(int *)arg;

-	do_register_memory_block_under_node(nid, mem_blk);
+	do_register_memory_block_under_node(nid, mem_blk, MEMINIT_HOTPLUG);
 	return 0;
 }

···
 			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
 }

-void link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn,
-		       enum meminit_context context)
+void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
+				       unsigned long end_pfn,
+				       enum meminit_context context)
 {
 	walk_memory_blocks_func_t func;

···
 };

 #define NODE_CALLBACK_PRI	2	/* lower than SLAB */
-static int __init register_node_type(void)
+void __init node_dev_init(void)
 {
-	int ret;
+	static struct notifier_block node_memory_callback_nb = {
+		.notifier_call = node_memory_callback,
+		.priority = NODE_CALLBACK_PRI,
+	};
+	int ret, i;

 	BUILD_BUG_ON(ARRAY_SIZE(node_state_attr) != NR_NODE_STATES);
 	BUILD_BUG_ON(ARRAY_SIZE(node_state_attrs)-1 != NR_NODE_STATES);

 	ret = subsys_system_register(&node_subsys, cpu_root_attr_groups);
-	if (!ret) {
-		static struct notifier_block node_memory_callback_nb = {
-			.notifier_call = node_memory_callback,
-			.priority = NODE_CALLBACK_PRI,
-		};
-		register_hotmemory_notifier(&node_memory_callback_nb);
-	}
+	if (ret)
+		panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+	register_hotmemory_notifier(&node_memory_callback_nb);

 	/*
-	 * Note: we're not going to unregister the node class if we fail
-	 * to register the node state class attribute files.
+	 * Create all node devices, which will properly link the node
+	 * to applicable memory block devices and already created cpu devices.
 	 */
-	return ret;
+	for_each_online_node(i) {
+		ret = register_one_node(i);
+		if (ret)
+			panic("%s() failed to add node: %d\n", __func__, ret);
+	}
 }
-postcore_initcall(register_node_type);

-3

drivers/block/drbd/drbd_int.h

···
 	STATE_SENT,		/* Do not change state/UUIDs while this is set */
 	CALLBACK_PENDING,	/* Whether we have a call_usermodehelper(, UMH_WAIT_PROC)
 				 * pending, from drbd worker context.
-				 * If set, bdi_write_congested() returns true,
-				 * so shrink_page_list() would not recurse into,
-				 * and potentially deadlock on, this drbd worker.
 				 */
 	DISCONNECT_SENT,

+1 -2

drivers/block/drbd/drbd_req.c

···

 	switch (rbm) {
 	case RB_CONGESTED_REMOTE:
-		return bdi_read_congested(
-			device->ldev->backing_bdev->bd_disk->bdi);
+		return 0;
 	case RB_LEAST_PENDING:
 		return atomic_read(&device->local_cnt) >
 			atomic_read(&device->ap_pending_cnt) + atomic_read(&device->rs_pending_cnt);

+1 -1

drivers/dax/super.c

···
 	struct dax_device *dax_dev;
 	struct inode *inode;

-	dax_dev = kmem_cache_alloc(dax_cache, GFP_KERNEL);
+	dax_dev = alloc_inode_sb(sb, dax_cache, GFP_KERNEL);
 	if (!dax_dev)
 		return NULL;

+3 -6

drivers/of/of_reserved_mem.c

···
 #include <linux/slab.h>
 #include <linux/memblock.h>
 #include <linux/kmemleak.h>
+#include <linux/cma.h>

 #include "of_private.h"

···
 	if (IS_ENABLED(CONFIG_CMA)
 	    && of_flat_dt_is_compatible(node, "shared-dma-pool")
 	    && of_get_flat_dt_prop(node, "reusable", NULL)
-	    && !nomap) {
-		unsigned long order =
-			max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
-
-		align = max(align, (phys_addr_t)PAGE_SIZE << order);
-	}
+	    && !nomap)
+		align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);

 	prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
 	if (prop) {
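
The removed lines in this hunk computed the minimum alignment as PAGE_SIZE shifted left by the larger of MAX_ORDER - 1 and pageblock_order. The sketch below works through that arithmetic in userspace. The DEMO_* values are assumptions chosen for illustration (4 KiB pages, MAX_ORDER of 11, pageblock_order of 9); the real values depend on architecture and configuration, and whether CMA_MIN_ALIGNMENT_BYTES evaluates to exactly the same quantity is not shown in this hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only; real ones depend on architecture and config. */
#define DEMO_PAGE_SIZE		4096ULL
#define DEMO_MAX_ORDER		11
#define DEMO_PAGEBLOCK_ORDER	9

/* Mirrors the removed lines: PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order). */
static uint64_t demo_min_alignment_bytes(void)
{
	unsigned int order = DEMO_MAX_ORDER - 1 > DEMO_PAGEBLOCK_ORDER ?
			     DEMO_MAX_ORDER - 1 : DEMO_PAGEBLOCK_ORDER;

	return DEMO_PAGE_SIZE << order;
}

/* Raise a requested alignment to at least the minimum. */
static uint64_t demo_effective_align(uint64_t requested)
{
	uint64_t min = demo_min_alignment_bytes();

	/* An alignment is expected to be zero or a power of two. */
	assert((requested & (requested - 1)) == 0);
	return requested > min ? requested : min;
}
```

With these assumed values the minimum is 4096 << 10, that is 4 MiB, so a 64 KiB request is raised to 4 MiB while an 8 MiB request is left unchanged.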

+1 -1

drivers/tty/tty_io.c

···
 {
 	struct tty_struct *tty;

-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;

+3 -6

drivers/virtio/virtio_mem.c

···
 				   VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);

 	/*
-	 * We want subblocks to span at least MAX_ORDER_NR_PAGES and
-	 * pageblock_nr_pages pages. This:
-	 * - Is required for now for alloc_contig_range() to work reliably -
-	 *   it doesn't properly handle smaller granularity on ZONE_NORMAL.
+	 * TODO: once alloc_contig_range() works reliably with pageblock
+	 * granularity on ZONE_NORMAL, use pageblock_nr_pages instead.
 	 */
-	sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
-			pageblock_nr_pages) * PAGE_SIZE;
+	sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES;
 	sb_size = max_t(uint64_t, vm->device_block_size, sb_size);

 	if (sb_size < memory_block_size_bytes() && !force_bbm) {

+1 -1

fs/9p/vfs_inode.c

···
 {
 	struct v9fs_inode *v9inode;

-	v9inode = kmem_cache_alloc(v9fs_inode_cache, GFP_KERNEL);
+	v9inode = alloc_inode_sb(sb, v9fs_inode_cache, GFP_KERNEL);
 	if (!v9inode)
 		return NULL;
 #ifdef CONFIG_9P_FSCACHE

+1 -1

fs/adfs/super.c

···
 static struct inode *adfs_alloc_inode(struct super_block *sb)
 {
 	struct adfs_inode_info *ei;
-	ei = kmem_cache_alloc(adfs_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, adfs_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;

+1 -1

fs/affs/super.c

···
 {
 	struct affs_inode_info *i;

-	i = kmem_cache_alloc(affs_inode_cachep, GFP_KERNEL);
+	i = alloc_inode_sb(sb, affs_inode_cachep, GFP_KERNEL);
 	if (!i)
 		return NULL;

+1 -1

fs/afs/super.c

···
 {
 	struct afs_vnode *vnode;

-	vnode = kmem_cache_alloc(afs_inode_cachep, GFP_KERNEL);
+	vnode = alloc_inode_sb(sb, afs_inode_cachep, GFP_KERNEL);
 	if (!vnode)
 		return NULL;

+1 -1

fs/befs/linuxvfs.c

···
 {
 	struct befs_inode_info *bi;

-	bi = kmem_cache_alloc(befs_inode_cachep, GFP_KERNEL);
+	bi = alloc_inode_sb(sb, befs_inode_cachep, GFP_KERNEL);
 	if (!bi)
 		return NULL;
 	return &bi->vfs_inode;

+1 -1

fs/bfs/inode.c

···
 static struct inode *bfs_alloc_inode(struct super_block *sb)
 {
 	struct bfs_inode_info *bi;
-	bi = kmem_cache_alloc(bfs_inode_cachep, GFP_KERNEL);
+	bi = alloc_inode_sb(sb, bfs_inode_cachep, GFP_KERNEL);
 	if (!bi)
 		return NULL;
 	return &bi->vfs_inode;

+1 -1

fs/btrfs/inode.c

···
 	struct btrfs_inode *ei;
 	struct inode *inode;

-	ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, btrfs_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;

+5 -3

fs/buffer.c

···
 	int i;

 	check_irqs_on();
+	bh_lru_lock();
+
 	/*
 	 * the refcount of buffer_head in bh_lru prevents dropping the
 	 * attached page(i.e., try_to_free_buffers) so it could cause
 	 * failing page migration.
 	 * Skip putting upcoming bh into bh_lru until migration is done.
 	 */
-	if (lru_cache_disabled())
+	if (lru_cache_disabled()) {
+		bh_lru_unlock();
 		return;
-
-	bh_lru_lock();
+	}

 	b = this_cpu_ptr(&bh_lrus);
 	for (i = 0; i < BH_LRU_SIZE; i++) {

+13 -9

fs/ceph/addr.c

···

 	if (atomic_long_inc_return(&fsc->writeback_count) >
 	    CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
-		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		fsc->write_congested = true;

 	req = ceph_osdc_new_request(osdc, &ci->i_layout, ceph_vino(inode), page_off, &len, 0, 1,
 				    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE, snapc,
···

 	if (atomic_long_dec_return(&fsc->writeback_count) <
 	    CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb))
-		clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		fsc->write_congested = false;

 	return err;
 }
···
 	struct inode *inode = page->mapping->host;
 	BUG_ON(!inode);
 	ihold(inode);
+
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    ceph_inode_to_client(inode)->write_congested)
+		return AOP_WRITEPAGE_ACTIVATE;

 	wait_on_page_fscache(page);

···
 			if (atomic_long_dec_return(&fsc->writeback_count) <
 			     CONGESTION_OFF_THRESH(
 					fsc->mount_options->congestion_kb))
-				clear_bdi_congested(inode_to_bdi(inode),
-						    BLK_RW_ASYNC);
+				fsc->write_congested = false;

 			ceph_put_snap_context(detach_page_private(page));
 			end_page_writeback(page);
···
 	bool should_loop, range_whole = false;
 	bool done = false;
 	bool caching = ceph_is_cache_enabled(inode);
+
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fsc->write_congested)
+		return 0;

 	dout("writepages_start %p (mode=%s)\n", inode,
 	     wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
···

 			if (atomic_long_inc_return(&fsc->writeback_count) >
 			    CONGESTION_ON_THRESH(
-				    fsc->mount_options->congestion_kb)) {
-				set_bdi_congested(inode_to_bdi(inode),
-						  BLK_RW_ASYNC);
-			}
-
+				    fsc->mount_options->congestion_kb))
+				fsc->write_congested = true;

 			pages[locked_pages++] = page;
 			pvec.pages[i] = NULL;
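
The ceph hunks above replace the bdi congestion bits with a per-filesystem flag that is set above one threshold and cleared below a lower one, and background (WB_SYNC_NONE) writeback backs off while the flag is set. A minimal userspace sketch of that hysteresis pattern follows. Everything prefixed demo_ or DEMO_ is hypothetical, including the threshold values; the kernel derives its thresholds from the congestion_kb mount option via macros not shown in this diff.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical thresholds in pages; ON sits above OFF to give hysteresis. */
#define DEMO_CONGESTION_ON_THRESH	256L
#define DEMO_CONGESTION_OFF_THRESH	192L

/* Hypothetical per-mount state, loosely modelled on the fields in the diff. */
struct demo_client {
	atomic_long writeback_count;
	bool write_congested;
};

static void demo_start_writeback(struct demo_client *c)
{
	/* atomic_fetch_add() returns the old value, so add 1 for the new one. */
	long now = atomic_fetch_add(&c->writeback_count, 1) + 1;

	if (now > DEMO_CONGESTION_ON_THRESH)
		c->write_congested = true;
}

static void demo_end_writeback(struct demo_client *c)
{
	long now = atomic_fetch_sub(&c->writeback_count, 1) - 1;

	assert(now >= 0);
	if (now < DEMO_CONGESTION_OFF_THRESH)
		c->write_congested = false;
}

/* Background writeback skips work while congested; sync writeback never does. */
static bool demo_should_skip_writepages(const struct demo_client *c,
					bool sync_all)
{
	return !sync_all && c->write_congested;
}
```

Between the two thresholds the flag keeps its previous value, so it does not flap when the in-flight count hovers near a single limit. The nfs and fuse hunks later in this diff gate writepage and writepages on the same kind of condition.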

+1 -1

fs/ceph/inode.c

···
 	struct ceph_inode_info *ci;
 	int i;

-	ci = kmem_cache_alloc(ceph_inode_cachep, GFP_NOFS);
+	ci = alloc_inode_sb(sb, ceph_inode_cachep, GFP_NOFS);
 	if (!ci)
 		return NULL;

+1

fs/ceph/super.c

···
 	fsc->have_copy_from2 = true;

 	atomic_long_set(&fsc->writeback_count, 0);
+	fsc->write_congested = false;

 	err = -ENOMEM;
 	/*

+1

fs/ceph/super.h

···
 	struct ceph_mds_client *mdsc;

 	atomic_long_t writeback_count;
+	bool write_congested;

 	struct workqueue_struct *inode_wq;
 	struct workqueue_struct *cap_wq;

+1 -1

fs/cifs/cifsfs.c

···
 cifs_alloc_inode(struct super_block *sb)
 {
 	struct cifsInodeInfo *cifs_inode;
-	cifs_inode = kmem_cache_alloc(cifs_inode_cachep, GFP_KERNEL);
+	cifs_inode = alloc_inode_sb(sb, cifs_inode_cachep, GFP_KERNEL);
 	if (!cifs_inode)
 		return NULL;
 	cifs_inode->cifsAttrs = 0x20;	/* default */

+1 -1

fs/coda/inode.c

···
 static struct inode *coda_alloc_inode(struct super_block *sb)
 {
 	struct coda_inode_info *ei;
-	ei = kmem_cache_alloc(coda_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, coda_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	memset(&ei->c_fid, 0, sizeof(struct CodaFid));

+2 -1

fs/dcache.c

···
 	char *dname;
 	int err;

-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+	dentry = kmem_cache_alloc_lru(dentry_cache, &sb->s_dentry_lru,
+				      GFP_KERNEL);
 	if (!dentry)
 		return NULL;

+1 -1

fs/ecryptfs/super.c

···
 	struct ecryptfs_inode_info *inode_info;
 	struct inode *inode = NULL;

-	inode_info = kmem_cache_alloc(ecryptfs_inode_info_cache, GFP_KERNEL);
+	inode_info = alloc_inode_sb(sb, ecryptfs_inode_info_cache, GFP_KERNEL);
 	if (unlikely(!inode_info))
 		goto out;
 	if (ecryptfs_init_crypt_stat(&inode_info->crypt_stat)) {

+1 -1

fs/efs/super.c

···
 static struct inode *efs_alloc_inode(struct super_block *sb)
 {
 	struct efs_inode_info *ei;
-	ei = kmem_cache_alloc(efs_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, efs_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;

+1 -1

fs/erofs/super.c

···
 static struct inode *erofs_alloc_inode(struct super_block *sb)
 {
 	struct erofs_inode *vi =
-		kmem_cache_alloc(erofs_inode_cachep, GFP_KERNEL);
+		alloc_inode_sb(sb, erofs_inode_cachep, GFP_KERNEL);

 	if (!vi)
 		return NULL;

+1 -1

fs/exfat/super.c

···
 {
 	struct exfat_inode_info *ei;

-	ei = kmem_cache_alloc(exfat_inode_cachep, GFP_NOFS);
+	ei = alloc_inode_sb(sb, exfat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;

-5

fs/ext2/ialloc.c

···
 	unsigned long offset;
 	unsigned long block;
 	struct ext2_group_desc * gdp;
-	struct backing_dev_info *bdi;
-
-	bdi = inode_to_bdi(inode);
-	if (bdi_rw_congested(bdi))
-		return;

 	block_group = (inode->i_ino - 1) / EXT2_INODES_PER_GROUP(inode->i_sb);
 	gdp = ext2_get_group_desc(inode->i_sb, block_group, NULL);

+1 -1

fs/ext2/super.c

···
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, ext2_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	ei->i_block_alloc_info = NULL;

+1 -1

fs/ext4/super.c

···
 {
 	struct ext4_inode_info *ei;

-	ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
+	ei = alloc_inode_sb(sb, ext4_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;

+1 -3

fs/f2fs/compress.c

···
 				if (IS_NOQUOTA(cc->inode))
 					return 0;
 				ret = 0;
-				cond_resched();
-				congestion_wait(BLK_RW_ASYNC,
-						DEFAULT_IO_TIMEOUT);
+				f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 				goto retry_write;
 			}
 			return ret;

+1 -2

fs/f2fs/data.c

···
 				} else if (ret == -EAGAIN) {
 					ret = 0;
 					if (wbc->sync_mode == WB_SYNC_ALL) {
-						cond_resched();
-						congestion_wait(BLK_RW_ASYNC,
+						f2fs_io_schedule_timeout(
 							DEFAULT_IO_TIMEOUT);
 						goto retry_write;
 					}

+6

fs/f2fs/f2fs.h

···
 	return F2FS_OPTION(sbi).discard_unit == DISCARD_UNIT_BLOCK;
 }

+static inline void f2fs_io_schedule_timeout(long timeout)
+{
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	io_schedule_timeout(timeout);
+}
+
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
 #define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */

+3 -5

fs/f2fs/segment.c

···
 skip:
 		iput(inode);
 	}
-	congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
-	cond_resched();
+	f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 	if (gc_failure) {
 		if (++looped >= count)
 			return;
···
 		do {
 			ret = __submit_flush_wait(sbi, FDEV(i).bdev);
 			if (ret)
-				congestion_wait(BLK_RW_ASYNC,
-						DEFAULT_IO_TIMEOUT);
+				f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 		} while (ret && --count);

 		if (ret) {
···
 		blk_finish_plug(&plug);
 		mutex_unlock(&dcc->cmd_lock);
 		trimmed += __wait_all_discard_cmd(sbi, NULL);
-		congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
+		f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 		goto next;
 	}
 skip:

+8 -6

fs/f2fs/super.c

···
 {
 	struct f2fs_inode_info *fi;

-	fi = f2fs_kmem_cache_alloc(f2fs_inode_cachep,
-				GFP_F2FS_ZERO, false, F2FS_SB(sb));
+	if (time_to_inject(F2FS_SB(sb), FAULT_SLAB_ALLOC)) {
+		f2fs_show_injection_info(F2FS_SB(sb), FAULT_SLAB_ALLOC);
+		return NULL;
+	}
+
+	fi = alloc_inode_sb(sb, f2fs_inode_cachep, GFP_F2FS_ZERO);
 	if (!fi)
 		return NULL;

···
 	/* we should flush all the data to keep data consistency */
 	do {
 		sync_inodes_sb(sbi->sb);
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
+		f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 	} while (get_pages(sbi, F2FS_DIRTY_DATA) && retry--);

 	if (unlikely(retry < 0))
···
 							&page, &fsdata);
 		if (unlikely(err)) {
 			if (err == -ENOMEM) {
-				congestion_wait(BLK_RW_ASYNC,
-						DEFAULT_IO_TIMEOUT);
+				f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 				goto retry;
 			}
 			set_sbi_flag(F2FS_SB(sb), SBI_QUOTA_NEED_REPAIR);

+1 -1

fs/fat/inode.c

···
 static struct inode *fat_alloc_inode(struct super_block *sb)
 {
 	struct msdos_inode_info *ei;
-	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
+	ei = alloc_inode_sb(sb, fat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;

+1 -1

fs/freevxfs/vxfs_super.c

···
 {
 	struct vxfs_inode_info *vi;

-	vi = kmem_cache_alloc(vxfs_inode_cachep, GFP_KERNEL);
+	vi = alloc_inode_sb(sb, vxfs_inode_cachep, GFP_KERNEL);
 	if (!vi)
 		return NULL;
 	inode_init_once(&vi->vfs_inode);

-40

fs/fs-writeback.c

···
 EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);

 /**
- * inode_congested - test whether an inode is congested
- * @inode: inode to test for congestion (may be NULL)
- * @cong_bits: mask of WB_[a]sync_congested bits to test
- *
- * Tests whether @inode is congested. @cong_bits is the mask of congestion
- * bits to test and the return value is the mask of set bits.
- *
- * If cgroup writeback is enabled for @inode, the congestion state is
- * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg
- * associated with @inode is congested; otherwise, the root wb's congestion
- * state is used.
- *
- * @inode is allowed to be NULL as this function is often called on
- * mapping->host which is NULL for the swapper space.
- */
-int inode_congested(struct inode *inode, int cong_bits)
-{
-	/*
-	 * Once set, ->i_wb never becomes NULL while the inode is alive.
-	 * Start transaction iff ->i_wb is visible.
-	 */
-	if (inode && inode_to_wb_is_valid(inode)) {
-		struct bdi_writeback *wb;
-		struct wb_lock_cookie lock_cookie = {};
-		bool congested;
-
-		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
-		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, &lock_cookie);
-		return congested;
-	}
-
-	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
-}
-EXPORT_SYMBOL_GPL(inode_congested);
-
-/**
  * wb_split_bdi_pages - split nr_pages to write according to bandwidth
  * @wb: target bdi_writeback to split @nr_pages to
  * @nr_pages: number of pages to write for the whole bdi
···
 	long pages_written;

 	set_worker_desc("flush-%s", bdi_dev_name(wb->bdi));
-	current->flags |= PF_SWAPWRITE;

 	if (likely(!current_is_workqueue_rescuer() ||
 		   !test_bit(WB_registered, &wb->state))) {
···
 		wb_wakeup(wb);
 	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
 		wb_wakeup_delayed(wb);
-
-	current->flags &= ~PF_SWAPWRITE;
 }

 /*

-17

fs/fuse/control.c

···
 {
 	unsigned val;
 	struct fuse_conn *fc;
-	struct fuse_mount *fm;
 	ssize_t ret;

 	ret = fuse_conn_limit_write(file, buf, count, ppos, &val,
···
 	down_read(&fc->killsb);
 	spin_lock(&fc->bg_lock);
 	fc->congestion_threshold = val;
-
-	/*
-	 * Get any fuse_mount belonging to this fuse_conn; s_bdi is
-	 * shared between all of them
-	 */
-
-	if (!list_empty(&fc->mounts)) {
-		fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
-		if (fc->num_background < fc->congestion_threshold) {
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		} else {
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
-	}
 	spin_unlock(&fc->bg_lock);
 	up_read(&fc->killsb);
 	fuse_conn_put(fc);

-8

fs/fuse/dev.c

···
 				wake_up(&fc->blocked_waitq);
 		}

-		if (fc->num_background == fc->congestion_threshold && fm->sb) {
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
 		fc->num_background--;
 		fc->active_background--;
 		flush_bg_queue(fc);
···
 		fc->num_background++;
 		if (fc->num_background == fc->max_background)
 			fc->blocked = 1;
-		if (fc->num_background == fc->congestion_threshold && fm->sb) {
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
 		list_add_tail(&req->list, &fc->bg_queue);
 		flush_bg_queue(fc);
 		queued = true;

+17

fs/fuse/file.c

···
 		struct fuse_io_args *ia;
 		struct fuse_args_pages *ap;

+		if (fc->num_background >= fc->congestion_threshold &&
+		    rac->ra->async_size >= readahead_count(rac))
+			/*
+			 * Congested and only async pages left, so skip the
+			 * rest.
+			 */
+			break;
+
 		nr_pages = readahead_count(rac) - nr_pages;
 		if (nr_pages > max_pages)
 			nr_pages = max_pages;
···

 static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct fuse_conn *fc = get_fuse_conn(page->mapping->host);
 	int err;

 	if (fuse_page_is_writeback(page->mapping->host, page->index)) {
···
 		unlock_page(page);
 		return 0;
 	}
+
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return AOP_WRITEPAGE_ACTIVATE;

 	err = fuse_writepage_locked(page);
 	unlock_page(page);
···
 	err = -EIO;
 	if (fuse_is_bad(inode))
 		goto out;
+
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return 0;

 	data.inode = inode;
 	data.wpa = NULL;

+1 -1

fs/fuse/inode.c

···
 {
 	struct fuse_inode *fi;

-	fi = kmem_cache_alloc(fuse_inode_cachep, GFP_KERNEL);
+	fi = alloc_inode_sb(sb, fuse_inode_cachep, GFP_KERNEL);
 	if (!fi)
 		return NULL;

+1 -1

fs/gfs2/super.c

···
 {
 	struct gfs2_inode *ip;

-	ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL);
+	ip = alloc_inode_sb(sb, gfs2_inode_cachep, GFP_KERNEL);
 	if (!ip)
 		return NULL;
 	ip->i_flags = 0;

+1 -1

fs/hfs/super.c

···
 {
 	struct hfs_inode_info *i;

-	i = kmem_cache_alloc(hfs_inode_cachep, GFP_KERNEL);
+	i = alloc_inode_sb(sb, hfs_inode_cachep, GFP_KERNEL);
 	return i ? &i->vfs_inode : NULL;
 }

+1 -1

fs/hfsplus/super.c

···
 {
 	struct hfsplus_inode_info *i;

-	i = kmem_cache_alloc(hfsplus_inode_cachep, GFP_KERNEL);
+	i = alloc_inode_sb(sb, hfsplus_inode_cachep, GFP_KERNEL);
 	return i ? &i->vfs_inode : NULL;
 }

+1 -1

fs/hostfs/hostfs_kern.c

···
 {
 	struct hostfs_inode_info *hi;

-	hi = kmem_cache_alloc(hostfs_inode_cache, GFP_KERNEL_ACCOUNT);
+	hi = alloc_inode_sb(sb, hostfs_inode_cache, GFP_KERNEL_ACCOUNT);
 	if (hi == NULL)
 		return NULL;
 	hi->fd = -1;

+1 -1

fs/hpfs/super.c

···
 static struct inode *hpfs_alloc_inode(struct super_block *sb)
 {
 	struct hpfs_inode_info *ei;
-	ei = kmem_cache_alloc(hpfs_inode_cachep, GFP_NOFS);
+	ei = alloc_inode_sb(sb, hpfs_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;

+1 -1

fs/hugetlbfs/inode.c

···

 	if (unlikely(!hugetlbfs_dec_free_inodes(sbinfo)))
 		return NULL;
-	p = kmem_cache_alloc(hugetlbfs_inode_cachep, GFP_KERNEL);
+	p = alloc_inode_sb(sb, hugetlbfs_inode_cachep, GFP_KERNEL);
 	if (unlikely(!p)) {
 		hugetlbfs_inc_free_inodes(sbinfo);
 		return NULL;

+1 -1

fs/inode.c

···
 	if (ops->alloc_inode)
 		inode = ops->alloc_inode(sb);
 	else
-		inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
+		inode = alloc_inode_sb(sb, inode_cachep, GFP_KERNEL);

 	if (!inode)
 		return NULL;

+1 -1

fs/isofs/inode.c

···
 static struct inode *isofs_alloc_inode(struct super_block *sb)
 {
 	struct iso_inode_info *ei;
-	ei = kmem_cache_alloc(isofs_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, isofs_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;

+1 -1

fs/jffs2/super.c

···
 {
 	struct jffs2_inode_info *f;

-	f = kmem_cache_alloc(jffs2_inode_cachep, GFP_KERNEL);
+	f = alloc_inode_sb(sb, jffs2_inode_cachep, GFP_KERNEL);
 	if (!f)
 		return NULL;
 	return &f->vfs_inode;

+1 -1

fs/jfs/super.c

···
 {
 	struct jfs_inode_info *jfs_inode;

-	jfs_inode = kmem_cache_alloc(jfs_inode_cachep, GFP_NOFS);
+	jfs_inode = alloc_inode_sb(sb, jfs_inode_cachep, GFP_NOFS);
 	if (!jfs_inode)
 		return NULL;
 #ifdef CONFIG_QUOTA

+1 -1

fs/minix/inode.c

···
 static struct inode *minix_alloc_inode(struct super_block *sb)
 {
 	struct minix_inode_info *ei;
-	ei = kmem_cache_alloc(minix_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, minix_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;

+2

fs/namespace.c

···
 	struct super_block *sb = mnt->mnt_sb;

 	if (!__mnt_is_readonly(mnt) &&
+	   (!(sb->s_iflags & SB_I_TS_EXPIRY_WARNED)) &&
 	   (ktime_get_real_seconds() + TIME_UPTIME_SEC_MAX > sb->s_time_max)) {
 		char *buf = (char *)__get_free_page(GFP_KERNEL);
 		char *mntpath = buf ? d_path(mountpoint, buf, PAGE_SIZE) : ERR_PTR(-ENOMEM);
···
 			tm.tm_year+1900, (unsigned long long)sb->s_time_max);

 		free_page((unsigned long)buf);
+		sb->s_iflags |= SB_I_TS_EXPIRY_WARNED;
 	}
 }

+1 -1

fs/nfs/inode.c

···
 struct inode *nfs_alloc_inode(struct super_block *sb)
 {
 	struct nfs_inode *nfsi;
-	nfsi = kmem_cache_alloc(nfs_inode_cachep, GFP_KERNEL);
+	nfsi = alloc_inode_sb(sb, nfs_inode_cachep, GFP_KERNEL);
 	if (!nfsi)
 		return NULL;
 	nfsi->flags = 0UL;

+11 -3

fs/nfs/write.c

···

 	if (atomic_long_inc_return(&nfss->writeback) >
 			NFS_CONGESTION_ON_THRESH)
-		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		nfss->write_congested = 1;
 }

 static void nfs_end_page_writeback(struct nfs_page *req)
···

 	end_page_writeback(req->wb_page);
 	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		nfss->write_congested = 0;
 }

 /*
···
 	struct inode *inode = page_file_mapping(page)->host;
 	int err;

+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    NFS_SERVER(inode)->write_congested)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
 	nfs_pageio_init_write(&pgio, inode, 0,
 				false, &nfs_async_write_completion_ops);
···
 	unsigned int mntflags = NFS_SERVER(inode)->flags;
 	int priority = 0;
 	int err;
+
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    NFS_SERVER(inode)->write_congested)
+		return 0;

 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);

···
 	}
 	nfss = NFS_SERVER(data->inode);
 	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(inode_to_bdi(data->inode), BLK_RW_ASYNC);
+		nfss->write_congested = 0;

 	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
 	nfs_commit_end(cinfo.mds);

-16

fs/nilfs2/segbuf.c

···
 				   struct nilfs_write_info *wi)
 {
 	struct bio *bio = wi->bio;
-	int err;
-
-	if (segbuf->sb_nbio > 0 &&
-	    bdi_write_congested(segbuf->sb_super->s_bdi)) {
-		wait_for_completion(&segbuf->sb_bio_event);
-		segbuf->sb_nbio--;
-		if (unlikely(atomic_read(&segbuf->sb_err))) {
-			bio_put(bio);
-			err = -EIO;
-			goto failed;
-		}
-	}

 	bio->bi_end_io = nilfs_end_bio_write;
 	bio->bi_private = segbuf;
···
 	wi->nr_vecs = min(wi->max_pages, wi->rest_blocks);
 	wi->start = wi->end;
 	return 0;
-
- failed:
-	wi->bio = NULL;
-	return err;
 }

 static void nilfs_segbuf_prepare_write(struct nilfs_segment_buffer *segbuf,

+1 -1

fs/nilfs2/super.c

··· 151 151 { 152 152 struct nilfs_inode_info *ii; 153 153 154 - ii = kmem_cache_alloc(nilfs_inode_cachep, GFP_NOFS); 154 + ii = alloc_inode_sb(sb, nilfs_inode_cachep, GFP_NOFS); 155 155 if (!ii) 156 156 return NULL; 157 157 ii->i_bh = NULL;

+5 -1

fs/ntfs/inode.c

··· 310 310 ntfs_inode *ni; 311 311 312 312 ntfs_debug("Entering."); 313 - ni = kmem_cache_alloc(ntfs_big_inode_cache, GFP_NOFS); 313 + ni = alloc_inode_sb(sb, ntfs_big_inode_cache, GFP_NOFS); 314 314 if (likely(ni != NULL)) { 315 315 ni->state = 0; 316 316 return VFS_I(ni); ··· 1881 1881 } 1882 1882 /* Now allocate memory for the attribute list. */ 1883 1883 ni->attr_list_size = (u32)ntfs_attr_size(a); 1884 + if (!ni->attr_list_size) { 1885 + ntfs_error(sb, "Attr_list_size is zero"); 1886 + goto put_err_out; 1887 + } 1884 1888 ni->attr_list = ntfs_malloc_nofs(ni->attr_list_size); 1885 1889 if (!ni->attr_list) { 1886 1890 ntfs_error(sb, "Not enough memory to allocate buffer "

+1 -1

fs/ntfs3/super.c

··· 399 399 400 400 static struct inode *ntfs_alloc_inode(struct super_block *sb) 401 401 { 402 - struct ntfs_inode *ni = kmem_cache_alloc(ntfs_inode_cachep, GFP_NOFS); 402 + struct ntfs_inode *ni = alloc_inode_sb(sb, ntfs_inode_cachep, GFP_NOFS); 403 403 404 404 if (!ni) 405 405 return NULL;

+1 -1

fs/ocfs2/alloc.c

··· 5981 5981 return status; 5982 5982 } 5983 5983 5984 - /* Expects you to already be holding tl_inode->i_mutex */ 5984 + /* Expects you to already be holding tl_inode->i_rwsem */ 5985 5985 int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) 5986 5986 { 5987 5987 int status;

+1 -1

fs/ocfs2/aops.c

··· 2311 2311 2312 2312 down_write(&oi->ip_alloc_sem); 2313 2313 2314 - /* Delete orphan before acquire i_mutex. */ 2314 + /* Delete orphan before acquire i_rwsem. */ 2315 2315 if (dwc->dw_orphaned) { 2316 2316 BUG_ON(dwc->dw_writer_pid != task_pid_nr(current)); 2317 2317

+1 -1

fs/ocfs2/cluster/nodemanager.c

··· 689 689 struct o2nm_node_group *ns = NULL; 690 690 struct config_group *o2hb_group = NULL, *ret = NULL; 691 691 692 - /* this runs under the parent dir's i_mutex; there can be only 692 + /* this runs under the parent dir's i_rwsem; there can be only 693 693 * one caller in here at a time */ 694 694 if (o2nm_single_cluster) 695 695 return ERR_PTR(-ENOSPC);

+2 -2

fs/ocfs2/dir.c

··· 1957 1957 } 1958 1958 1959 1959 /* 1960 - * NOTE: this should always be called with parent dir i_mutex taken. 1960 + * NOTE: this should always be called with parent dir i_rwsem taken. 1961 1961 */ 1962 1962 int ocfs2_find_files_on_disk(const char *name, 1963 1963 int namelen, ··· 2003 2003 * Return 0 if the name does not exist 2004 2004 * Return -EEXIST if the directory contains the name 2005 2005 * 2006 - * Callers should have i_mutex + a cluster lock on dir 2006 + * Callers should have i_rwsem + a cluster lock on dir 2007 2007 */ 2008 2008 int ocfs2_check_dir_for_entry(struct inode *dir, 2009 2009 const char *name,

+1 -1

fs/ocfs2/dlmfs/dlmfs.c

··· 280 280 { 281 281 struct dlmfs_inode_private *ip; 282 282 283 - ip = kmem_cache_alloc(dlmfs_inode_cache, GFP_NOFS); 283 + ip = alloc_inode_sb(sb, dlmfs_inode_cache, GFP_NOFS); 284 284 if (!ip) 285 285 return NULL; 286 286

+5 -8

fs/ocfs2/file.c

··· 270 270 271 271 /* 272 272 * Don't use ocfs2_mark_inode_dirty() here as we don't always 273 - * have i_mutex to guard against concurrent changes to other 273 + * have i_rwsem to guard against concurrent changes to other 274 274 * inode fields. 275 275 */ 276 276 inode->i_atime = current_time(inode); ··· 540 540 struct ocfs2_alloc_context *meta_ac, 541 541 enum ocfs2_alloc_restarted *reason_ret) 542 542 { 543 - int ret; 544 543 struct ocfs2_extent_tree et; 545 544 546 545 ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode), fe_bh); 547 - ret = ocfs2_add_clusters_in_btree(handle, &et, logical_offset, 548 - clusters_to_add, mark_unwritten, 549 - data_ac, meta_ac, reason_ret); 550 - 551 - return ret; 546 + return ocfs2_add_clusters_in_btree(handle, &et, logical_offset, 547 + clusters_to_add, mark_unwritten, 548 + data_ac, meta_ac, reason_ret); 552 549 } 553 550 554 551 static int ocfs2_extend_allocation(struct inode *inode, u32 logical_start, ··· 1065 1068 /* 1066 1069 * The alloc sem blocks people in read/write from reading our 1067 1070 * allocation until we're done changing it. We depend on 1068 - * i_mutex to block other extend/truncate calls while we're 1071 + * i_rwsem to block other extend/truncate calls while we're 1069 1072 * here. We even have to hold it for sparse files because there 1070 1073 * might be some tail zeroing. 1071 1074 */

+1 -1

fs/ocfs2/inode.c

··· 713 713 /* 714 714 * Serialize with orphan dir recovery. If the process doing 715 715 * recovery on this orphan dir does an iget() with the dir 716 - * i_mutex held, we'll deadlock here. Instead we detect this 716 + * i_rwsem held, we'll deadlock here. Instead we detect this 717 717 * and exit early - recovery will wipe this inode for us. 718 718 */ 719 719 static int ocfs2_check_orphan_recovery_state(struct ocfs2_super *osb,

+3 -3

fs/ocfs2/localalloc.c

··· 606 606 607 607 /* 608 608 * make sure we've got at least bits_wanted contiguous bits in the 609 - * local alloc. You lose them when you drop i_mutex. 609 + * local alloc. You lose them when you drop i_rwsem. 610 610 * 611 611 * We will add ourselves to the transaction passed in, but may start 612 612 * our own in order to shift windows. ··· 636 636 637 637 /* 638 638 * We must double check state and allocator bits because 639 - * another process may have changed them while holding i_mutex. 639 + * another process may have changed them while holding i_rwsem. 640 640 */ 641 641 spin_lock(&osb->osb_lock); 642 642 if (!ocfs2_la_state_enabled(osb) || ··· 1029 1029 /* 1030 1030 * Given an event, calculate the size of our next local alloc window. 1031 1031 * 1032 - * This should always be called under i_mutex of the local alloc inode 1032 + * This should always be called under i_rwsem of the local alloc inode 1033 1033 * so that local alloc disabling doesn't race with processes trying to 1034 1034 * use the allocator. 1035 1035 *

+1 -1

fs/ocfs2/namei.c

··· 476 476 ocfs2_free_alloc_context(meta_ac); 477 477 478 478 /* 479 - * We should call iput after the i_mutex of the bitmap been 479 + * We should call iput after the i_rwsem of the bitmap been 480 480 * unlocked in ocfs2_free_alloc_context, or the 481 481 * ocfs2_delete_inode will mutex_lock again. 482 482 */

+2 -2

fs/ocfs2/ocfs2.h

··· 355 355 struct delayed_work la_enable_wq; 356 356 357 357 /* 358 - * Must hold local alloc i_mutex and osb->osb_lock to change 358 + * Must hold local alloc i_rwsem and osb->osb_lock to change 359 359 * local_alloc_bits. Reads can be done under either lock. 360 360 */ 361 361 unsigned int local_alloc_bits; ··· 430 430 atomic_t osb_tl_disable; 431 431 /* 432 432 * How many clusters in our truncate log. 433 - * It must be protected by osb_tl_inode->i_mutex. 433 + * It must be protected by osb_tl_inode->i_rwsem. 434 434 */ 435 435 unsigned int truncated_clusters; 436 436

+1 -1

fs/ocfs2/quota_global.c

··· 36 36 * should be obeyed by all the functions: 37 37 * - any write of quota structure (either to local or global file) is protected 38 38 * by dqio_sem or dquot->dq_lock. 39 - * - any modification of global quota file holds inode cluster lock, i_mutex, 39 + * - any modification of global quota file holds inode cluster lock, i_rwsem, 40 40 * and ip_alloc_sem of the global quota file (achieved by 41 41 * ocfs2_lock_global_qf). It also has to hold qinfo_lock. 42 42 * - an allocation of new blocks for local quota file is protected by

+6 -12

fs/ocfs2/stack_user.c

··· 683 683 void *name, 684 684 unsigned int namelen) 685 685 { 686 - int ret; 687 - 688 686 if (!lksb->lksb_fsdlm.sb_lvbptr) 689 687 lksb->lksb_fsdlm.sb_lvbptr = (char *)lksb + 690 688 sizeof(struct dlm_lksb); 691 689 692 - ret = dlm_lock(conn->cc_lockspace, mode, &lksb->lksb_fsdlm, 693 - flags|DLM_LKF_NODLCKWT, name, namelen, 0, 694 - fsdlm_lock_ast_wrapper, lksb, 695 - fsdlm_blocking_ast_wrapper); 696 - return ret; 690 + return dlm_lock(conn->cc_lockspace, mode, &lksb->lksb_fsdlm, 691 + flags|DLM_LKF_NODLCKWT, name, namelen, 0, 692 + fsdlm_lock_ast_wrapper, lksb, 693 + fsdlm_blocking_ast_wrapper); 697 694 } 698 695 699 696 static int user_dlm_unlock(struct ocfs2_cluster_connection *conn, 700 697 struct ocfs2_dlm_lksb *lksb, 701 698 u32 flags) 702 699 { 703 - int ret; 704 - 705 - ret = dlm_unlock(conn->cc_lockspace, lksb->lksb_fsdlm.sb_lkid, 706 - flags, &lksb->lksb_fsdlm, lksb); 707 - return ret; 700 + return dlm_unlock(conn->cc_lockspace, lksb->lksb_fsdlm.sb_lkid, 701 + flags, &lksb->lksb_fsdlm, lksb); 708 702 } 709 703 710 704 static int user_dlm_lock_status(struct ocfs2_dlm_lksb *lksb)

+1 -1

fs/ocfs2/super.c

··· 548 548 { 549 549 struct ocfs2_inode_info *oi; 550 550 551 - oi = kmem_cache_alloc(ocfs2_inode_cachep, GFP_NOFS); 551 + oi = alloc_inode_sb(sb, ocfs2_inode_cachep, GFP_NOFS); 552 552 if (!oi) 553 553 return NULL; 554 554

+1 -1

fs/ocfs2/xattr.c

··· 7205 7205 * Used for reflink a non-preserve-security file. 7206 7206 * 7207 7207 * It uses common api like ocfs2_xattr_set, so the caller 7208 - * must not hold any lock expect i_mutex. 7208 + * must not hold any lock expect i_rwsem. 7209 7209 */ 7210 7210 int ocfs2_init_security_and_acl(struct inode *dir, 7211 7211 struct inode *inode,

+1 -1

fs/openpromfs/inode.c

··· 335 335 { 336 336 struct op_inode_info *oi; 337 337 338 - oi = kmem_cache_alloc(op_inode_cachep, GFP_KERNEL); 338 + oi = alloc_inode_sb(sb, op_inode_cachep, GFP_KERNEL); 339 339 if (!oi) 340 340 return NULL; 341 341

+1 -1

fs/orangefs/super.c

··· 107 107 { 108 108 struct orangefs_inode_s *orangefs_inode; 109 109 110 - orangefs_inode = kmem_cache_alloc(orangefs_inode_cache, GFP_KERNEL); 110 + orangefs_inode = alloc_inode_sb(sb, orangefs_inode_cache, GFP_KERNEL); 111 111 if (!orangefs_inode) 112 112 return NULL; 113 113

+1 -1

fs/overlayfs/super.c

··· 174 174 175 175 static struct inode *ovl_alloc_inode(struct super_block *sb) 176 176 { 177 - struct ovl_inode *oi = kmem_cache_alloc(ovl_inode_cachep, GFP_KERNEL); 177 + struct ovl_inode *oi = alloc_inode_sb(sb, ovl_inode_cachep, GFP_KERNEL); 178 178 179 179 if (!oi) 180 180 return NULL;

+1 -1

fs/proc/inode.c

··· 66 66 { 67 67 struct proc_inode *ei; 68 68 69 - ei = kmem_cache_alloc(proc_inode_cachep, GFP_KERNEL); 69 + ei = alloc_inode_sb(sb, proc_inode_cachep, GFP_KERNEL); 70 70 if (!ei) 71 71 return NULL; 72 72 ei->pid = NULL;

+1 -1

fs/qnx4/inode.c

··· 338 338 static struct inode *qnx4_alloc_inode(struct super_block *sb) 339 339 { 340 340 struct qnx4_inode_info *ei; 341 - ei = kmem_cache_alloc(qnx4_inode_cachep, GFP_KERNEL); 341 + ei = alloc_inode_sb(sb, qnx4_inode_cachep, GFP_KERNEL); 342 342 if (!ei) 343 343 return NULL; 344 344 return &ei->vfs_inode;

+1 -1

fs/qnx6/inode.c

··· 597 597 static struct inode *qnx6_alloc_inode(struct super_block *sb) 598 598 { 599 599 struct qnx6_inode_info *ei; 600 - ei = kmem_cache_alloc(qnx6_inode_cachep, GFP_KERNEL); 600 + ei = alloc_inode_sb(sb, qnx6_inode_cachep, GFP_KERNEL); 601 601 if (!ei) 602 602 return NULL; 603 603 return &ei->vfs_inode;

+1 -1

fs/reiserfs/super.c

··· 639 639 static struct inode *reiserfs_alloc_inode(struct super_block *sb) 640 640 { 641 641 struct reiserfs_inode_info *ei; 642 - ei = kmem_cache_alloc(reiserfs_inode_cachep, GFP_KERNEL); 642 + ei = alloc_inode_sb(sb, reiserfs_inode_cachep, GFP_KERNEL); 643 643 if (!ei) 644 644 return NULL; 645 645 atomic_set(&ei->openers, 0);

+1 -1

fs/romfs/super.c

··· 375 375 { 376 376 struct romfs_inode_info *inode; 377 377 378 - inode = kmem_cache_alloc(romfs_inode_cachep, GFP_KERNEL); 378 + inode = alloc_inode_sb(sb, romfs_inode_cachep, GFP_KERNEL); 379 379 return inode ? &inode->vfs_inode : NULL; 380 380 } 381 381

+1 -1

fs/squashfs/super.c

··· 584 584 static struct inode *squashfs_alloc_inode(struct super_block *sb) 585 585 { 586 586 struct squashfs_inode_info *ei = 587 - kmem_cache_alloc(squashfs_inode_cachep, GFP_KERNEL); 587 + alloc_inode_sb(sb, squashfs_inode_cachep, GFP_KERNEL); 588 588 589 589 return ei ? &ei->vfs_inode : NULL; 590 590 }

+1 -1

fs/sysv/inode.c

··· 306 306 { 307 307 struct sysv_inode_info *si; 308 308 309 - si = kmem_cache_alloc(sysv_inode_cachep, GFP_KERNEL); 309 + si = alloc_inode_sb(sb, sysv_inode_cachep, GFP_KERNEL); 310 310 if (!si) 311 311 return NULL; 312 312 return &si->vfs_inode;

+1 -1

fs/ubifs/super.c

··· 268 268 { 269 269 struct ubifs_inode *ui; 270 270 271 - ui = kmem_cache_alloc(ubifs_inode_slab, GFP_NOFS); 271 + ui = alloc_inode_sb(sb, ubifs_inode_slab, GFP_NOFS); 272 272 if (!ui) 273 273 return NULL; 274 274

+1 -1

fs/udf/super.c

··· 136 136 static struct inode *udf_alloc_inode(struct super_block *sb) 137 137 { 138 138 struct udf_inode_info *ei; 139 - ei = kmem_cache_alloc(udf_inode_cachep, GFP_KERNEL); 139 + ei = alloc_inode_sb(sb, udf_inode_cachep, GFP_KERNEL); 140 140 if (!ei) 141 141 return NULL; 142 142

+1 -1

fs/ufs/super.c

··· 1443 1443 { 1444 1444 struct ufs_inode_info *ei; 1445 1445 1446 - ei = kmem_cache_alloc(ufs_inode_cachep, GFP_NOFS); 1446 + ei = alloc_inode_sb(sb, ufs_inode_cachep, GFP_NOFS); 1447 1447 if (!ei) 1448 1448 return NULL; 1449 1449

+4 -1

fs/userfaultfd.c

··· 198 198 struct uffd_msg msg; 199 199 msg_init(&msg); 200 200 msg.event = UFFD_EVENT_PAGEFAULT; 201 + 202 + if (!(features & UFFD_FEATURE_EXACT_ADDRESS)) 203 + address &= PAGE_MASK; 201 204 msg.arg.pagefault.address = address; 202 205 /* 203 206 * These flags indicate why the userfault occurred: ··· 485 482 486 483 init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function); 487 484 uwq.wq.private = current; 488 - uwq.msg = userfault_msg(vmf->address, vmf->flags, reason, 485 + uwq.msg = userfault_msg(vmf->real_address, vmf->flags, reason, 489 486 ctx->features); 490 487 uwq.ctx = ctx; 491 488 uwq.waken = false;
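
The userfaultfd hunk above reports the exact faulting address only when the exact-address feature is enabled, and otherwise masks it down to the page boundary. A small sketch of that rule, assuming 4 KiB pages and a made-up feature bit (the real values come from the kernel and the userfaultfd UAPI header):

```c
/* Assumptions for illustration: 4 KiB pages and a hypothetical feature bit. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_FEATURE_EXACT_ADDRESS (1UL << 0)

/* Returns the address that would be placed in the pagefault message. */
static unsigned long sketch_reported_address(unsigned long address,
					     unsigned long features)
{
	/* Without the feature, userspace only sees the page-aligned address. */
	if (!(features & SKETCH_FEATURE_EXACT_ADDRESS))
		address &= SKETCH_PAGE_MASK;
	return address;
}
```

Masking by default keeps the previously visible behaviour for existing users, while callers that opt in receive the unmasked address.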

+1 -1

fs/vboxsf/super.c

··· 241 241 { 242 242 struct vboxsf_inode *sf_i; 243 243 244 - sf_i = kmem_cache_alloc(vboxsf_inode_cachep, GFP_NOFS); 244 + sf_i = alloc_inode_sb(sb, vboxsf_inode_cachep, GFP_NOFS); 245 245 if (!sf_i) 246 246 return NULL; 247 247

+1 -1

fs/xfs/libxfs/xfs_btree.c

··· 2818 2818 * in any way. 2819 2819 */ 2820 2820 if (args->kswapd) 2821 - new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; 2821 + new_pflags |= PF_MEMALLOC | PF_KSWAPD; 2822 2822 2823 2823 current_set_flags_nested(&pflags, new_pflags); 2824 2824 xfs_trans_set_context(args->cur->bc_tp);

-3

fs/xfs/xfs_buf.c

··· 843 843 { 844 844 struct xfs_buf *bp; 845 845 846 - if (bdi_read_congested(target->bt_bdev->bd_disk->bdi)) 847 - return; 848 - 849 846 xfs_buf_read_map(target, map, nmaps, 850 847 XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops, 851 848 __this_address);

+1 -1

fs/xfs/xfs_icache.c

··· 77 77 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL 78 78 * and return NULL here on ENOMEM. 79 79 */ 80 - ip = kmem_cache_alloc(xfs_inode_cache, GFP_KERNEL | __GFP_NOFAIL); 80 + ip = alloc_inode_sb(mp->m_super, xfs_inode_cache, GFP_KERNEL | __GFP_NOFAIL); 81 81 82 82 if (inode_init_always(mp->m_super, VFS_I(ip))) { 83 83 kmem_cache_free(xfs_inode_cache, ip);

+1 -1

fs/zonefs/super.c

··· 1136 1136 { 1137 1137 struct zonefs_inode_info *zi; 1138 1138 1139 - zi = kmem_cache_alloc(zonefs_inode_cachep, GFP_KERNEL); 1139 + zi = alloc_inode_sb(sb, zonefs_inode_cachep, GFP_KERNEL); 1140 1140 if (!zi) 1141 1141 return NULL; 1142 1142

-8

include/linux/backing-dev-defs.h

··· 207 207 #endif 208 208 }; 209 209 210 - enum { 211 - BLK_RW_ASYNC = 0, 212 - BLK_RW_SYNC = 1, 213 - }; 214 - 215 - void clear_bdi_congested(struct backing_dev_info *bdi, int sync); 216 - void set_bdi_congested(struct backing_dev_info *bdi, int sync); 217 - 218 210 struct wb_lock_cookie { 219 211 bool locked; 220 212 unsigned long flags;

-50

include/linux/backing-dev.h

··· 135 135 136 136 struct backing_dev_info *inode_to_bdi(struct inode *inode); 137 137 138 - static inline int wb_congested(struct bdi_writeback *wb, int cong_bits) 139 - { 140 - return wb->congested & cong_bits; 141 - } 142 - 143 - long congestion_wait(int sync, long timeout); 144 - 145 138 static inline bool mapping_can_writeback(struct address_space *mapping) 146 139 { 147 140 return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK; ··· 155 162 gfp_t gfp); 156 163 void wb_memcg_offline(struct mem_cgroup *memcg); 157 164 void wb_blkcg_offline(struct blkcg *blkcg); 158 - int inode_congested(struct inode *inode, int cong_bits); 159 165 160 166 /** 161 167 * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode ··· 382 390 { 383 391 } 384 392 385 - static inline int inode_congested(struct inode *inode, int cong_bits) 386 - { 387 - return wb_congested(&inode_to_bdi(inode)->wb, cong_bits); 388 - } 389 - 390 393 #endif /* CONFIG_CGROUP_WRITEBACK */ 391 - 392 - static inline int inode_read_congested(struct inode *inode) 393 - { 394 - return inode_congested(inode, 1 << WB_sync_congested); 395 - } 396 - 397 - static inline int inode_write_congested(struct inode *inode) 398 - { 399 - return inode_congested(inode, 1 << WB_async_congested); 400 - } 401 - 402 - static inline int inode_rw_congested(struct inode *inode) 403 - { 404 - return inode_congested(inode, (1 << WB_sync_congested) | 405 - (1 << WB_async_congested)); 406 - } 407 - 408 - static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits) 409 - { 410 - return wb_congested(&bdi->wb, cong_bits); 411 - } 412 - 413 - static inline int bdi_read_congested(struct backing_dev_info *bdi) 414 - { 415 - return bdi_congested(bdi, 1 << WB_sync_congested); 416 - } 417 - 418 - static inline int bdi_write_congested(struct backing_dev_info *bdi) 419 - { 420 - return bdi_congested(bdi, 1 << WB_async_congested); 421 - } 422 - 423 - static inline int bdi_rw_congested(struct 
backing_dev_info *bdi) 424 - { 425 - return bdi_congested(bdi, (1 << WB_sync_congested) | 426 - (1 << WB_async_congested)); 427 - } 428 394 429 395 const char *bdi_dev_name(struct backing_dev_info *bdi); 430 396

+10

include/linux/cma.h

··· 20 20 21 21 #define CMA_MAX_NAME 64 22 22 23 + /* 24 + * TODO: once the buddy -- especially pageblock merging and alloc_contig_range() 25 + * -- can deal with only some pageblocks of a higher-order page being 26 + * MIGRATE_CMA, we can use pageblock_nr_pages. 27 + */ 28 + #define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES 29 + #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES) 30 + 23 31 struct cma; 24 32 25 33 extern unsigned long totalcma_pages; ··· 58 50 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count); 59 51 60 52 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data); 53 + 54 + extern void cma_reserve_pages_on_error(struct cma *cma); 61 55 #endif
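
To make the new CMA alignment macros concrete, the following sketch works the arithmetic under an assumed configuration of 4 KiB pages and a MAX_ORDER of 11, so that a max-order block spans 1024 pages. Neither value comes from this patch, and the helper functions are hypothetical.

```c
/* Assumed configuration, not taken from the patch: 4 KiB pages and
 * MAX_ORDER 11, i.e. 2^(11 - 1) = 1024 pages per max-order block. */
#define SKETCH_PAGE_SIZE          4096UL
#define SKETCH_MAX_ORDER          11
#define SKETCH_MAX_ORDER_NR_PAGES (1UL << (SKETCH_MAX_ORDER - 1))

/* Mirrors the shape of the two macros added to include/linux/cma.h. */
#define SKETCH_CMA_MIN_ALIGNMENT_PAGES SKETCH_MAX_ORDER_NR_PAGES
#define SKETCH_CMA_MIN_ALIGNMENT_BYTES \
	(SKETCH_PAGE_SIZE * SKETCH_CMA_MIN_ALIGNMENT_PAGES)

/* Hypothetical helper: round a size up to the minimum CMA alignment. */
static unsigned long sketch_cma_align_up(unsigned long bytes)
{
	unsigned long a = SKETCH_CMA_MIN_ALIGNMENT_BYTES;

	return (bytes + a - 1) / a * a;
}

/* Hypothetical helper: check whether an address meets the alignment. */
static int sketch_cma_is_aligned(unsigned long addr)
{
	return (addr % SKETCH_CMA_MIN_ALIGNMENT_BYTES) == 0;
}
```

Under these assumptions the minimum alignment is 4 MiB; other page sizes or MAX_ORDER settings give different values.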

+48 -39

include/linux/damon.h

··· 60 60 61 61 /** 62 62 * struct damon_target - Represents a monitoring target. 63 - * @id: Unique identifier for this target. 63 + * @pid: The PID of the virtual address space to monitor. 64 64 * @nr_regions: Number of monitoring target regions of this target. 65 65 * @regions_list: Head of the monitoring target regions of this target. 66 66 * @list: List head for siblings. 67 67 * 68 68 * Each monitoring context could have multiple targets. For example, a context 69 69 * for virtual memory address spaces could have multiple target processes. The 70 - * @id of each target should be unique among the targets of the context. For 71 - * example, in the virtual address monitoring context, it could be a pidfd or 72 - * an address of an mm_struct. 70 + * @pid should be set for appropriate &struct damon_operations including the 71 + * virtual address spaces monitoring operations. 73 72 */ 74 73 struct damon_target { 75 - unsigned long id; 74 + struct pid *pid; 76 75 unsigned int nr_regions; 77 76 struct list_head regions_list; 78 77 struct list_head list; ··· 87 88 * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. 88 89 * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. 89 90 * @DAMOS_STAT: Do nothing but count the stat. 91 + * @NR_DAMOS_ACTIONS: Total number of DAMOS actions 90 92 */ 91 93 enum damos_action { 92 94 DAMOS_WILLNEED, ··· 96 96 DAMOS_HUGEPAGE, 97 97 DAMOS_NOHUGEPAGE, 98 98 DAMOS_STAT, /* Do nothing but only record the stat */ 99 + NR_DAMOS_ACTIONS, 99 100 }; 100 101 101 102 /** ··· 122 121 * uses smaller one as the effective quota. 123 122 * 124 123 * For selecting regions within the quota, DAMON prioritizes current scheme's 125 - * target memory regions using the &struct damon_primitive->get_scheme_score. 124 + * target memory regions using the &struct damon_operations->get_scheme_score. 
126 125 * You could customize the prioritization logic by setting &weight_sz, 127 - * &weight_nr_accesses, and &weight_age, because monitoring primitives are 126 + * &weight_nr_accesses, and &weight_age, because monitoring operations are 128 127 * encouraged to respect those. 129 128 */ 130 129 struct damos_quota { ··· 159 158 * 160 159 * @DAMOS_WMARK_NONE: Ignore the watermarks of the given scheme. 161 160 * @DAMOS_WMARK_FREE_MEM_RATE: Free memory rate of the system in [0,1000]. 161 + * @NR_DAMOS_WMARK_METRICS: Total number of DAMOS watermark metrics 162 162 */ 163 163 enum damos_wmark_metric { 164 164 DAMOS_WMARK_NONE, 165 165 DAMOS_WMARK_FREE_MEM_RATE, 166 + NR_DAMOS_WMARK_METRICS, 166 167 }; 167 168 168 169 /** ··· 257 254 struct list_head list; 258 255 }; 259 256 257 + /** 258 + * enum damon_ops_id - Identifier for each monitoring operations implementation 259 + * 260 + * @DAMON_OPS_VADDR: Monitoring operations for virtual address spaces 261 + * @DAMON_OPS_PADDR: Monitoring operations for the physical address space 262 + */ 263 + enum damon_ops_id { 264 + DAMON_OPS_VADDR, 265 + DAMON_OPS_PADDR, 266 + NR_DAMON_OPS, 267 + }; 268 + 260 269 struct damon_ctx; 261 270 262 271 /** 263 - * struct damon_primitive - Monitoring primitives for given use cases. 272 + * struct damon_operations - Monitoring operations for given use cases. 264 273 * 265 - * @init: Initialize primitive-internal data structures. 266 - * @update: Update primitive-internal data structures. 274 + * @id: Identifier of this operations set. 275 + * @init: Initialize operations-related data structures. 276 + * @update: Update operations-related data structures. 267 277 * @prepare_access_checks: Prepare next access check of target regions. 268 278 * @check_accesses: Check the accesses to target regions. 269 279 * @reset_aggregated: Reset aggregated accesses monitoring results. ··· 286 270 * @cleanup: Clean up the context. 287 271 * 288 272 * DAMON can be extended for various address spaces and usages. 
For this, 289 - * users should register the low level primitives for their target address 290 - * space and usecase via the &damon_ctx.primitive. Then, the monitoring thread 273 + * users should register the low level operations for their target address 274 + * space and usecase via the &damon_ctx.ops. Then, the monitoring thread 291 275 * (&damon_ctx.kdamond) calls @init and @prepare_access_checks before starting 292 - * the monitoring, @update after each &damon_ctx.primitive_update_interval, and 276 + * the monitoring, @update after each &damon_ctx.ops_update_interval, and 293 277 * @check_accesses, @target_valid and @prepare_access_checks after each 294 278 * &damon_ctx.sample_interval. Finally, @reset_aggregated is called after each 295 279 * &damon_ctx.aggr_interval. 296 280 * 297 - * @init should initialize primitive-internal data structures. For example, 281 + * Each &struct damon_operations instance having valid @id can be registered 282 + * via damon_register_ops() and selected by damon_select_ops() later. 283 + * @init should initialize operations-related data structures. For example, 298 284 * this could be used to construct proper monitoring target regions and link 299 285 * those to @damon_ctx.adaptive_targets. 300 - * @update should update the primitive-internal data structures. For example, 286 + * @update should update the operations-related data structures. For example, 301 287 * this could be used to update monitoring target regions for current status. 302 288 * @prepare_access_checks should manipulate the monitoring regions to be 303 289 * prepared for the next access check. ··· 319 301 * monitoring. 320 302 * @cleanup is called from @kdamond just before its termination. 
321 303 */ 322 - struct damon_primitive { 304 + struct damon_operations { 305 + enum damon_ops_id id; 323 306 void (*init)(struct damon_ctx *context); 324 307 void (*update)(struct damon_ctx *context); 325 308 void (*prepare_access_checks)(struct damon_ctx *context); ··· 374 355 * 375 356 * @sample_interval: The time between access samplings. 376 357 * @aggr_interval: The time between monitor results aggregations. 377 - * @primitive_update_interval: The time between monitoring primitive updates. 358 + * @ops_update_interval: The time between monitoring operations updates. 378 359 * 379 360 * For each @sample_interval, DAMON checks whether each region is accessed or 380 361 * not. It aggregates and keeps the access information (number of accesses to 381 362 * each region) for @aggr_interval time. DAMON also checks whether the target 382 363 * memory regions need update (e.g., by ``mmap()`` calls from the application, 383 364 * in case of virtual memory monitoring) and applies the changes for each 384 - * @primitive_update_interval. All time intervals are in micro-seconds. 385 - * Please refer to &struct damon_primitive and &struct damon_callback for more 365 + * @ops_update_interval. All time intervals are in micro-seconds. 366 + * Please refer to &struct damon_operations and &struct damon_callback for more 386 367 * detail. 387 368 * 388 369 * @kdamond: Kernel thread who does the monitoring. ··· 394 375 * 395 376 * Once started, the monitoring thread runs until explicitly required to be 396 377 * terminated or every monitoring target is invalid. The validity of the 397 - * targets is checked via the &damon_primitive.target_valid of @primitive. The 378 + * targets is checked via the &damon_operations.target_valid of @ops. The 398 379 * termination can also be explicitly requested by writing non-zero to 399 380 * @kdamond_stop. The thread sets @kdamond to NULL when it terminates. 
400 381 * Therefore, users can know whether the monitoring is ongoing or terminated by ··· 404 385 * Note that the monitoring thread protects only @kdamond and @kdamond_stop via 405 386 * @kdamond_lock. Accesses to other fields must be protected by themselves. 406 387 * 407 - * @primitive: Set of monitoring primitives for given use cases. 388 + * @ops: Set of monitoring operations for given use cases. 408 389 * @callback: Set of callbacks for monitoring events notifications. 409 390 * 410 391 * @min_nr_regions: The minimum number of adaptive monitoring regions. ··· 415 396 struct damon_ctx { 416 397 unsigned long sample_interval; 417 398 unsigned long aggr_interval; 418 - unsigned long primitive_update_interval; 399 + unsigned long ops_update_interval; 419 400 420 401 /* private: internal use only */ 421 402 struct timespec64 last_aggregation; 422 - struct timespec64 last_primitive_update; 403 + struct timespec64 last_ops_update; 423 404 424 405 /* public: */ 425 406 struct task_struct *kdamond; 426 407 struct mutex kdamond_lock; 427 408 428 - struct damon_primitive primitive; 409 + struct damon_operations ops; 429 410 struct damon_callback callback; 430 411 431 412 unsigned long min_nr_regions; ··· 494 475 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s); 495 476 void damon_destroy_scheme(struct damos *s); 496 477 497 - struct damon_target *damon_new_target(unsigned long id); 478 + struct damon_target *damon_new_target(void); 498 479 void damon_add_target(struct damon_ctx *ctx, struct damon_target *t); 499 480 bool damon_targets_empty(struct damon_ctx *ctx); 500 481 void damon_free_target(struct damon_target *t); ··· 503 484 504 485 struct damon_ctx *damon_new_ctx(void); 505 486 void damon_destroy_ctx(struct damon_ctx *ctx); 506 - int damon_set_targets(struct damon_ctx *ctx, 507 - unsigned long *ids, ssize_t nr_ids); 508 487 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, 509 - unsigned long aggr_int, unsigned long 
primitive_upd_int, 488 + unsigned long aggr_int, unsigned long ops_upd_int, 510 489 unsigned long min_nr_reg, unsigned long max_nr_reg); 511 490 int damon_set_schemes(struct damon_ctx *ctx, 512 491 struct damos **schemes, ssize_t nr_schemes); 513 492 int damon_nr_running_ctxs(void); 493 + int damon_register_ops(struct damon_operations *ops); 494 + int damon_select_ops(struct damon_ctx *ctx, enum damon_ops_id id); 514 495 515 - int damon_start(struct damon_ctx **ctxs, int nr_ctxs); 496 + int damon_start(struct damon_ctx **ctxs, int nr_ctxs, bool exclusive); 516 497 int damon_stop(struct damon_ctx **ctxs, int nr_ctxs); 517 498 518 499 #endif /* CONFIG_DAMON */ 519 - 520 - #ifdef CONFIG_DAMON_VADDR 521 - bool damon_va_target_valid(void *t); 522 - void damon_va_set_primitives(struct damon_ctx *ctx); 523 - #endif /* CONFIG_DAMON_VADDR */ 524 - 525 - #ifdef CONFIG_DAMON_PADDR 526 - bool damon_pa_target_valid(void *t); 527 - void damon_pa_set_primitives(struct damon_ctx *ctx); 528 - #endif /* CONFIG_DAMON_PADDR */ 529 500 530 501 #endif /* _DAMON_H */
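
The damon.h hunk above replaces the per-address-space setter functions with an operations table that is registered by id and later selected into a context. The following userspace sketch models that register-then-select pattern. All names are hypothetical stand-ins, the `name` field exists only for the demonstration, and the error codes are this sketch's choice and not a statement about the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for enum damon_ops_id. */
enum sketch_ops_id {
	SKETCH_OPS_VADDR,
	SKETCH_OPS_PADDR,
	NR_SKETCH_OPS,
};

/* Stand-in for struct damon_operations; callbacks are omitted. */
struct sketch_operations {
	enum sketch_ops_id id;
	const char *name;	/* hypothetical, for the demo only */
};

/* Stand-in for struct damon_ctx. */
struct sketch_ctx {
	struct sketch_operations ops;
};

static struct sketch_operations sketch_registered[NR_SKETCH_OPS];
static int sketch_is_registered[NR_SKETCH_OPS];

/* Register one operations set; each id may be registered once. */
static int sketch_register_ops(const struct sketch_operations *ops)
{
	if ((unsigned int)ops->id >= NR_SKETCH_OPS)
		return -EINVAL;
	if (sketch_is_registered[ops->id])
		return -EEXIST;
	sketch_registered[ops->id] = *ops;
	sketch_is_registered[ops->id] = 1;
	return 0;
}

/* Copy a previously registered operations set into the context. */
static int sketch_select_ops(struct sketch_ctx *ctx, enum sketch_ops_id id)
{
	if ((unsigned int)id >= NR_SKETCH_OPS || !sketch_is_registered[id])
		return -EINVAL;
	ctx->ops = sketch_registered[id];
	return 0;
}
```

Selecting by id means a caller only needs the enum value, not a direct reference to the implementing module's setter function.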

+2

include/linux/fault-inject.h

··· 64 64 65 65 struct kmem_cache; 66 66 67 + bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order); 68 + 67 69 int should_failslab(struct kmem_cache *s, gfp_t gfpflags); 68 70 #ifdef CONFIG_FAILSLAB 69 71 extern bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);

+19 -2

include/linux/fs.h

··· 42 42 #include <linux/mount.h> 43 43 #include <linux/cred.h> 44 44 #include <linux/mnt_idmapping.h> 45 + #include <linux/slab.h> 45 46 46 47 #include <asm/byteorder.h> 47 48 #include <uapi/linux/fs.h> ··· 931 930 * struct file_ra_state - Track a file's readahead state. 932 931 * @start: Where the most recent readahead started. 933 932 * @size: Number of pages read in the most recent readahead. 934 - * @async_size: Start next readahead when this many pages are left. 935 - * @ra_pages: Maximum size of a readahead request. 933 + * @async_size: Numer of pages that were/are not needed immediately 934 + * and so were/are genuinely "ahead". Start next readahead when 935 + * the first of these pages is accessed. 936 + * @ra_pages: Maximum size of a readahead request, copied from the bdi. 936 937 * @mmap_miss: How many mmap accesses missed in the page cache. 937 938 * @prev_pos: The last byte in the most recent read request. 939 + * 940 + * When this structure is passed to ->readahead(), the "most recent" 941 + * readahead means the current readahead. 938 942 */ 939 943 struct file_ra_state { 940 944 pgoff_t start; ··· 1441 1435 1442 1436 #define SB_I_SKIP_SYNC 0x00000100 /* Skip superblock at global sync */ 1443 1437 #define SB_I_PERSB_BDI 0x00000200 /* has a per-sb bdi */ 1438 + #define SB_I_TS_EXPIRY_WARNED 0x00000400 /* warned about timestamp range expiry */ 1444 1439 1445 1440 /* Possible states of 'frozen' field */ 1446 1441 enum { ··· 3114 3107 extern void free_inode_nonrcu(struct inode *inode); 3115 3108 extern int should_remove_suid(struct dentry *); 3116 3109 extern int file_remove_privs(struct file *); 3110 + 3111 + /* 3112 + * This must be used for allocating filesystems specific inodes to set 3113 + * up the inode reclaim context correctly. 
3114 + */ 3115 + static inline void * 3116 + alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp) 3117 + { 3118 + return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp); 3119 + } 3117 3120 3118 3121 extern void __insert_inode_hash(struct inode *, unsigned long hashval); 3119 3122 static inline void insert_inode_hash(struct inode *inode)

+5 -5

include/linux/gfp.h

···
  * DOC: Page mobility and placement hints
  *
  * Page mobility and placement hints
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * ---------------------------------
  *
  * These flags provide hints about how mobile the page is. Pages with similar
  * mobility are placed within the same pageblocks to minimise problems due
···
  * DOC: Watermark modifiers
  *
  * Watermark modifiers -- controls access to emergency reserves
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * ------------------------------------------------------------
  *
  * %__GFP_HIGH indicates that the caller is high-priority and that granting
  * the request is necessary before the system can make forward progress.
···
  * DOC: Reclaim modifiers
  *
  * Reclaim modifiers
- * ~~~~~~~~~~~~~~~~~
+ * -----------------
  * Please note that all the following flags are only applicable to sleepable
  * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them).
  *
···
  * DOC: Action modifiers
  *
  * Action modifiers
- * ~~~~~~~~~~~~~~~~
+ * ----------------
  *
  * %__GFP_NOWARN suppresses allocation failure reports.
  *
···
  * DOC: Useful GFP flag combinations
  *
  * Useful GFP flag combinations
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * ----------------------------
  *
  * Useful GFP flag combinations that are commonly used. It is recommended
  * that subsystems start with one of these combinations and then set/clear

+10

include/linux/highmem-internal.h

···
 	__kunmap_atomic(__addr);	\
 } while (0)

+/**
+ * kunmap_local - Unmap a page mapped via kmap_local_page().
+ * @__addr: An address within the page mapped
+ *
+ * @__addr can be any address within the mapped page. Commonly it is the
+ * address return from kmap_local_page(), but it can also include offsets.
+ *
+ * Unmapping should be done in the reverse order of the mapping. See
+ * kmap_local_page() for details.
+ */
 #define kunmap_local(__addr)	\
 do {	\
 	BUILD_BUG_ON(__same_type((__addr), struct page *));	\

+1 -7

include/linux/hugetlb.h

···
 static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
 				       vm_flags_t flags)
 {
-	return entry;
+	return pte_mkhuge(entry);
 }
 #endif

···
 {
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-
-#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
-extern bool hugetlb_free_vmemmap_enabled;
-#else
-#define hugetlb_free_vmemmap_enabled	false
-#endif

 static inline spinlock_t *huge_pte_lock(struct hstate *h,
 					   struct mm_struct *mm, pte_t *pte)

-22

include/linux/kthread.h

···
 	struct timer_list timer;
 };

-#define KTHREAD_WORKER_INIT(worker)	{	\
-	.lock = __RAW_SPIN_LOCK_UNLOCKED((worker).lock),	\
-	.work_list = LIST_HEAD_INIT((worker).work_list),	\
-	.delayed_work_list = LIST_HEAD_INIT((worker).delayed_work_list),\
-	}
-
 #define KTHREAD_WORK_INIT(work, fn)	{	\
 	.node = LIST_HEAD_INIT((work).node),	\
 	.func = (fn),	\
···
 				     TIMER_IRQSAFE),	\
 	}

-#define DEFINE_KTHREAD_WORKER(worker)	\
-	struct kthread_worker worker = KTHREAD_WORKER_INIT(worker)
-
 #define DEFINE_KTHREAD_WORK(work, fn)	\
 	struct kthread_work work = KTHREAD_WORK_INIT(work, fn)

 #define DEFINE_KTHREAD_DELAYED_WORK(dwork, fn)	\
 	struct kthread_delayed_work dwork =	\
 		KTHREAD_DELAYED_WORK_INIT(dwork, fn)
-
-/*
- * kthread_worker.lock needs its own lockdep class key when defined on
- * stack with lockdep enabled. Use the following macros in such cases.
- */
-#ifdef CONFIG_LOCKDEP
-# define KTHREAD_WORKER_INIT_ONSTACK(worker)	\
-	({ kthread_init_worker(&worker); worker; })
-# define DEFINE_KTHREAD_WORKER_ONSTACK(worker)	\
-	struct kthread_worker worker = KTHREAD_WORKER_INIT_ONSTACK(worker)
-#else
-# define DEFINE_KTHREAD_WORKER_ONSTACK(worker) DEFINE_KTHREAD_WORKER(worker)
-#endif

 extern void __kthread_init_worker(struct kthread_worker *worker,
 			const char *name, struct lock_class_key *key);

+8 -9

include/linux/list_lru.h

···
 #include <linux/list.h>
 #include <linux/nodemask.h>
 #include <linux/shrinker.h>
+#include <linux/xarray.h>

 struct mem_cgroup;

···

 struct list_lru_memcg {
 	struct rcu_head		rcu;
-	/* array of per cgroup lists, indexed by memcg_cache_id */
-	struct list_lru_one	*lru[];
+	/* array of per cgroup per node lists, indexed by node id */
+	struct list_lru_one	node[];
 };

 struct list_lru_node {
···
 	spinlock_t		lock;
 	/* global list, used for the root cgroup in cgroup aware lrus */
 	struct list_lru_one	lru;
-#ifdef CONFIG_MEMCG_KMEM
-	/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
-	struct list_lru_memcg	__rcu *memcg_lrus;
-#endif
-	long nr_items;
+	long			nr_items;
 } ____cacheline_aligned_in_smp;

 struct list_lru {
···
 	struct list_head	list;
 	int			shrinker_id;
 	bool			memcg_aware;
+	struct xarray		xa;
 #endif
 };

···
 #define list_lru_init_memcg(lru, shrinker)	\
 	__list_lru_init((lru), true, NULL, shrinker)

-int memcg_update_all_list_lrus(int num_memcgs);
-void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);
+int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
+			 gfp_t gfp);
+void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent);

 /**
  * list_lru_add: add an element to the lru list's tail

+18 -28

include/linux/memcontrol.h

···
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
 	MEMCG_VMALLOC,
+	MEMCG_KMEM,
 	MEMCG_NR_STAT,
 };

···
 	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }

+static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+retry:
+	memcg = obj_cgroup_memcg(objcg);
+	if (unlikely(!css_tryget(&memcg->css)))
+		goto retry;
+	rcu_read_unlock();
+
+	return memcg;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set.
···
  */
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
-	if (!memcg->memory.parent)
-		return NULL;
-	return mem_cgroup_from_counter(memcg->memory.parent, memory);
+	return mem_cgroup_from_css(memcg->css.parent);
 }

 static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
···

 extern struct static_key_false memcg_kmem_enabled_key;

-extern int memcg_nr_cache_ids;
-void memcg_get_cache_ids(void);
-void memcg_put_cache_ids(void);
-
-/*
- * Helper macro to loop through all memcg-specific caches. Callers must still
- * check if the cache is valid (it is either valid or NULL).
- * the slab_mutex must be held when looping through those caches
- */
-#define for_each_memcg_cache_index(_idx)	\
-	for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++)
-
 static inline bool memcg_kmem_enabled(void)
 {
 	return static_branch_likely(&memcg_kmem_enabled_key);
···
  * A helper for accessing memcg's kmem_id, used for getting
  * corresponding LRU lists.
  */
-static inline int memcg_cache_id(struct mem_cgroup *memcg)
+static inline int memcg_kmem_id(struct mem_cgroup *memcg)
 {
 	return memcg ? memcg->kmemcg_id : -1;
 }
···
 {
 }

-#define for_each_memcg_cache_index(_idx)	\
-	for (; NULL; )
-
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
 }

-static inline int memcg_cache_id(struct mem_cgroup *memcg)
+static inline int memcg_kmem_id(struct mem_cgroup *memcg)
 {
 	return -1;
-}
-
-static inline void memcg_get_cache_ids(void)
-{
-}
-
-static inline void memcg_put_cache_ids(void)
-{
 }

 static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)

+12

include/linux/memory.h

···
 	unsigned long state;		/* serialized by the dev->lock */
 	int online_type;		/* for passing data to online routine */
 	int nid;			/* NID for this memory block */
+	/*
+	 * The single zone of this memory block if all PFNs of this memory block
+	 * that are System RAM (not a memory hole, not ZONE_DEVICE ranges) are
+	 * managed by a single zone. NULL if multiple zones (including nodes)
+	 * apply.
+	 */
+	struct zone *zone;
 	struct device dev;
 	/*
 	 * Number of vmemmap pages. These pages
···
 })
 #define register_hotmemory_notifier(nb)		register_memory_notifier(nb)
 #define unregister_hotmemory_notifier(nb)	unregister_memory_notifier(nb)
+
+#ifdef CONFIG_NUMA
+void memory_block_add_nid(struct memory_block *mem, int nid,
+			  enum meminit_context context);
+#endif /* CONFIG_NUMA */
 #endif /* CONFIG_MEMORY_HOTPLUG */

 /*

+59 -65

include/linux/memory_hotplug.h

···
 struct resource;
 struct vmem_altmap;

+#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
+/*
+ * For supporting node-hotadd, we have to allocate a new pgdat.
+ *
+ * If an arch has generic style NODE_DATA(),
+ * node_data[nid] = kzalloc() works well. But it depends on the architecture.
+ *
+ * In general, generic_alloc_nodedata() is used.
+ *
+ */
+extern pg_data_t *arch_alloc_nodedata(int nid);
+extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
+
+#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
+#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
+
+#ifdef CONFIG_NUMA
+/*
+ * XXX: node aware allocation can't work well to get new node's memory at this time.
+ *	Because, pgdat for the new node is not allocated/initialized yet itself.
+ *	To use new node's memory, more consideration will be necessary.
+ */
+#define generic_alloc_nodedata(nid)	\
+({	\
+	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
+})
+/*
+ * This definition is just for error path in node hotadd.
+ * For node hotremove, we have to replace this.
+ */
+#define generic_free_nodedata(pgdat)	kfree(pgdat)
+
+extern pg_data_t *node_data[];
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+	node_data[nid] = pgdat;
+}
+
+#else /* !CONFIG_NUMA */
+
+/* never called */
+static inline pg_data_t *generic_alloc_nodedata(int nid)
+{
+	BUG();
+	return NULL;
+}
+static inline void generic_free_nodedata(pg_data_t *pgdat)
+{
+}
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);

···
 extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
 			struct zone *zone, struct memory_group *group);
-extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
-					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
 				     unsigned long end_pfn);

···
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	      struct mhp_params *params);
 #endif /* ARCH_HAS_ADD_PAGES */
-
-#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
-/*
- * For supporting node-hotadd, we have to allocate a new pgdat.
- *
- * If an arch has generic style NODE_DATA(),
- * node_data[nid] = kzalloc() works well. But it depends on the architecture.
- *
- * In general, generic_alloc_nodedata() is used.
- * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
- *
- */
-extern pg_data_t *arch_alloc_nodedata(int nid);
-extern void arch_free_nodedata(pg_data_t *pgdat);
-extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
-
-#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
-#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
-#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
-
-#ifdef CONFIG_NUMA
-/*
- * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
- * XXX: kmalloc_node() can't work well to get new node's memory at this time.
- *	Because, pgdat for the new node is not allocated/initialized yet itself.
- *	To use new node's memory, more consideration will be necessary.
- */
-#define generic_alloc_nodedata(nid)	\
-({	\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);	\
-})
-/*
- * This definition is just for error path in node hotadd.
- * For node hotremove, we have to replace this.
- */
-#define generic_free_nodedata(pgdat)	kfree(pgdat)
-
-extern pg_data_t *node_data[];
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-	node_data[nid] = pgdat;
-}
-
-#else /* !CONFIG_NUMA */
-
-/* never called */
-static inline pg_data_t *generic_alloc_nodedata(int nid)
-{
-	BUG();
-	return NULL;
-}
-static inline void generic_free_nodedata(pg_data_t *pgdat)
-{
-}
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-}
-#endif /* CONFIG_NUMA */
-#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */

 void get_online_mems(void);
 void put_online_mems(void);
···

 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			 struct memory_group *group);
+			 struct zone *zone, struct memory_group *group);
 extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
 extern int offline_and_remove_memory(u64 start, u64 size);
···
 static inline void try_offline_node(int nid) {}

 static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-				struct memory_group *group)
+				struct zone *zone, struct memory_group *group)
 {
 	return -EINVAL;
 }
···
 extern void clear_zone_contiguous(struct zone *zone);

 #ifdef CONFIG_MEMORY_HOTPLUG
-extern void __ref free_area_init_core_hotplug(int nid);
+extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
 extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory_resource(int nid, struct resource *resource,

+8

include/linux/migrate.h

···
 		struct folio *newfolio, struct folio *folio, int extra_count);

 extern bool numa_demotion_enabled;
+extern void migrate_on_reclaim_init(void);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void set_migration_target_nodes(void);
 #else
+static inline void set_migration_target_nodes(void) {}
+#endif
+#else
+
+static inline void set_migration_target_nodes(void) {}

 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,

+5 -6

include/linux/mm.h

···
 		struct vm_area_struct *vma;	/* Target VMA */
 		gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 		pgoff_t pgoff;			/* Logical page offset based on vma */
-		unsigned long address;		/* Faulting virtual address */
+		unsigned long address;		/* Faulting virtual address - masked */
+		unsigned long real_address;	/* Faulting virtual address - unmasked */
 	};
 	enum fault_flag flags;		/* FAULT_FLAG_xxx flags
 					 * XXX: should really be 'const' */
···
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
-long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
-		    unsigned int gup_flags, struct page **pages, int *locked);
-long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
-		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
···
 }

 extern void __init pagecache_init(void);
-extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);

 /*
···
 }
 #endif

+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 int vmemmap_remap_free(unsigned long start, unsigned long end,
 		       unsigned long reuse);
 int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 			unsigned long reuse, gfp_t gfp_mask);
+#endif

 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
···
 	MF_MSG_BUDDY,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
+	MF_MSG_DIFFERENT_PAGE_SIZE,
 	MF_MSG_UNKNOWN,
 };

+15 -7

include/linux/mmzone.h

···
 	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
 }

+/*
+ * Check whether a migratetype can be merged with another migratetype.
+ *
+ * It is only mergeable when it can fall back to other migratetypes for
+ * allocation. See fallbacks[MIGRATE_TYPES][3] in page_alloc.c.
+ */
+static inline bool migratetype_is_mergeable(int mt)
+{
+	return mt < MIGRATE_PCPTYPES;
+}
+
 #define for_each_migratetype_order(order, type) \
 	for (order = 0; order < MAX_ORDER; order++) \
 		for (type = 0; type < MIGRATE_TYPES; type++)
···
 	NR_PAGETABLE,		/* used for pagetables */
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	PGPROMOTE_SUCCESS,	/* promote successfully */
 #endif
 	NR_VM_NODE_STAT_ITEMS
 };
···
 	WMARK_MIN,
 	WMARK_LOW,
 	WMARK_HIGH,
+	WMARK_PROMO,
 	NR_WMARK
 };

···

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
 #define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
-#ifdef CONFIG_FLATMEM
-#define pgdat_page_nr(pgdat, pagenr)	((pgdat)->node_mem_map + (pagenr))
-#else
-#define pgdat_page_nr(pgdat, pagenr)	pfn_to_page((pgdat)->node_start_pfn + (pagenr))
-#endif
-#define nid_page_nr(nid, pagenr)	pgdat_page_nr(NODE_DATA(nid),(pagenr))

 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
···
 {
 	return &contig_page_data;
 }
-#define NODE_MEM_MAP(nid)	mem_map

 #else /* CONFIG_NUMA */
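
The new migratetype_is_mergeable() helper relies on the ordering of the migratetype enum: only the types below MIGRATE_PCPTYPES count as mergeable. A minimal userspace sketch of that check follows. The enum is reproduced here on the assumption that both CONFIG_CMA and CONFIG_MEMORY_ISOLATION are enabled, and migratetype_example() is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace copy of the migratetype ordering. ASSUMPTION: this mirrors
 * include/linux/mmzone.h with CONFIG_CMA and CONFIG_MEMORY_ISOLATION set.
 */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* number of types kept on per-cpu lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
	MIGRATE_CMA,
	MIGRATE_ISOLATE,
	MIGRATE_TYPES
};

/* Same test as the helper added in the hunk above. */
static inline bool migratetype_is_mergeable(int mt)
{
	return mt < MIGRATE_PCPTYPES;
}

/* Hypothetical usage: the three "ordinary" types pass, the rest do not. */
static void migratetype_example(void)
{
	assert(migratetype_is_mergeable(MIGRATE_UNMOVABLE));
	assert(migratetype_is_mergeable(MIGRATE_RECLAIMABLE));
	assert(!migratetype_is_mergeable(MIGRATE_HIGHATOMIC));
	assert(!migratetype_is_mergeable(MIGRATE_ISOLATE));
}
```

Because MIGRATE_HIGHATOMIC aliases MIGRATE_PCPTYPES, it is the first type excluded by the comparison.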

+1

include/linux/nfs_fs_sb.h

···
 	struct nlm_host		*nlm_host;	/* NLM client handle */
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	atomic_long_t		writeback;	/* number of writeback pages */
+	unsigned int		write_congested;/* flag set when writeback gets too high */
 	unsigned int		flags;		/* various flags */

 /* The following are for internal use only. Also see uapi/linux/nfs_mount.h */

+17 -8

include/linux/node.h

···
 typedef  void (*node_registration_func_t)(struct node *);

 #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
-void link_mem_sections(int nid, unsigned long start_pfn,
-		       unsigned long end_pfn,
-		       enum meminit_context context);
+void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
+				       unsigned long end_pfn,
+				       enum meminit_context context);
 #else
-static inline void link_mem_sections(int nid, unsigned long start_pfn,
-				     unsigned long end_pfn,
-				     enum meminit_context context)
+static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
+						     unsigned long end_pfn,
+						     enum meminit_context context)
 {
 }
 #endif

 extern void unregister_node(struct node *node);
 #ifdef CONFIG_NUMA
+extern void node_dev_init(void);
 /* Core of the node registration - only memory hotplug should use this */
 extern int __register_one_node(int nid);

···
 		error = __register_one_node(nid);
 		if (error)
 			return error;
-		/* link memory sections under this node */
-		link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);
+		register_memory_blocks_under_node(nid, start_pfn, end_pfn,
+						  MEMINIT_EARLY);
 	}

 	return error;
···
 					 node_registration_func_t unregister);
 #endif
 #else
+static inline void node_dev_init(void)
+{
+}
 static inline int __register_one_node(int nid)
 {
 	return 0;
···
 #endif

 #define to_node(device) container_of(device, struct node, dev)
+
+static inline bool node_is_toptier(int node)
+{
+	return node_state(node, N_CPU);
+}

 #endif /* _LINUX_NODE_H_ */

+87 -5

include/linux/page-flags.h

···

 #ifndef __GENERATING_BOUNDS_H

+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON,
+			 hugetlb_free_vmemmap_enabled_key);
+
+static __always_inline bool hugetlb_free_vmemmap_enabled(void)
+{
+	return static_branch_maybe(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON,
+				   &hugetlb_free_vmemmap_enabled_key);
+}
+
+/*
+ * If the feature of freeing some vmemmap pages associated with each HugeTLB
+ * page is enabled, the head vmemmap page frame is reused and all of the tail
+ * vmemmap addresses map to the head vmemmap page frame (furture details can
+ * refer to the figure at the head of the mm/hugetlb_vmemmap.c). In other
+ * words, there are more than one page struct with PG_head associated with each
+ * HugeTLB page. We __know__ that there is only one head page struct, the tail
+ * page structs with PG_head are fake head page structs. We need an approach
+ * to distinguish between those two different types of page structs so that
+ * compound_head() can return the real head page struct when the parameter is
+ * the tail page struct but with PG_head.
+ *
+ * The page_fixed_fake_head() returns the real head page struct if the @page is
+ * fake page head, otherwise, returns @page which can either be a true page
+ * head or tail.
+ */
+static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
+{
+	if (!hugetlb_free_vmemmap_enabled())
+		return page;
+
+	/*
+	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
+	 * struct page. The alignment check aims to avoid access the fields (
+	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+	 * cold cacheline in some cases.
+	 */
+	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
+	    test_bit(PG_head, &page->flags)) {
+		/*
+		 * We can safely access the field of the @page[1] with PG_head
+		 * because the @page is a compound page composed with at least
+		 * two contiguous pages.
+		 */
+		unsigned long head = READ_ONCE(page[1].compound_head);
+
+		if (likely(head & 1))
+			return (const struct page *)(head - 1);
+	}
+	return page;
+}
+#else
+static inline const struct page *page_fixed_fake_head(const struct page *page)
+{
+	return page;
+}
+
+static inline bool hugetlb_free_vmemmap_enabled(void)
+{
+	return false;
+}
+#endif
+
+static __always_inline int page_is_fake_head(struct page *page)
+{
+	return page_fixed_fake_head(page) != page;
+}
+
 static inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);

 	if (unlikely(head & 1))
 		return head - 1;
-	return (unsigned long)page;
+	return (unsigned long)page_fixed_fake_head(page);
 }

 #define compound_head(page)	((typeof(page))_compound_head(page))
···

 static __always_inline int PageTail(struct page *page)
 {
-	return READ_ONCE(page->compound_head) & 1;
+	return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
 }

 static __always_inline int PageCompound(struct page *page)
 {
-	return test_bit(PG_head, &page->flags) || PageTail(page);
+	return test_bit(PG_head, &page->flags) ||
+	       READ_ONCE(page->compound_head) & 1;
 }

 #define	PAGE_POISON_PATTERN	-1l
···
 	return set_page_writeback(page);
 }

-__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)
+static __always_inline bool folio_test_head(struct folio *folio)
+{
+	return test_bit(PG_head, folio_flags(folio, FOLIO_PF_ANY));
+}
+
+static __always_inline int PageHead(struct page *page)
+{
+	PF_POISONED_CHECK(page);
+	return test_bit(PG_head, &page->flags) && !page_is_fake_head(page);
+}
+
+__SETPAGEFLAG(Head, head, PF_ANY)
+__CLEARPAGEFLAG(Head, head, PF_ANY)
+CLEARPAGEFLAG(Head, head, PF_ANY)

 /**
  * folio_test_large() - Does this folio contain more than one page?
···

 extern bool is_free_buddy_page(struct page *page);

-__PAGEFLAG(Isolated, isolated, PF_ANY);
+PAGEFLAG(Isolated, isolated, PF_ANY);

 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1UL << PG_mlocked)
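
Both _compound_head() and page_fixed_fake_head() decode the same encoding: bit 0 of compound_head marks a tail page and the remaining bits hold the head pointer. The sketch below models only that tagging in userspace. The one-field struct page and the set_compound_head() helper are simplified stand-ins written for this illustration, and the fake-head alignment and PG_head checks are deliberately not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct page: only the field this sketch needs. */
struct page {
	uintptr_t compound_head;	/* bit 0 set: tail page, rest: head pointer */
};

/* Illustrative helper: mark @tail as a tail page of the page headed by @head. */
static void set_compound_head(struct page *tail, struct page *head)
{
	/* The tag bit is only free if the head pointer is at least 2-byte aligned. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head + 1;
}

/* Decode the tag: return the head for a tail page, else the page itself. */
static struct page *compound_head(struct page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct page *)(head - 1);
	return page;
}
```

The hunk above applies this same decode to page[1], so a PAGE_SIZE-aligned struct page that carries PG_head but is really a tail can still be resolved to the true head.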

+5 -2

include/linux/pageblock-flags.h

···

 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

-/* Huge pages are a constant size */
-#define pageblock_order		HUGETLB_PAGE_ORDER
+/*
+ * Huge pages are a constant size, but don't exceed the maximum allocation
+ * granularity.
+ */
+#define pageblock_order		min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)

 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
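
The new definition caps pageblock_order at MAX_ORDER - 1. A small sketch of the same arithmetic is shown below, with a plain function standing in for the min_t() macro. The function name and the orders used in the example are illustrative assumptions, not values taken from this commit.

```c
#include <assert.h>

/*
 * Illustrative stand-in for
 *   min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)
 * with both inputs passed as parameters.
 */
static unsigned int clamp_pageblock_order(unsigned int hugetlb_page_order,
					  unsigned int max_order)
{
	unsigned int largest;

	assert(max_order > 0);		/* MAX_ORDER - 1 must not wrap around */
	largest = max_order - 1;
	return hugetlb_page_order < largest ? hugetlb_page_order : largest;
}
```

With a huge page order of 9 and a MAX_ORDER of 11 the result stays 9. With a huge page order of 13 and the same MAX_ORDER it is capped at 10.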

-7

include/linux/pagemap.h

···
 unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
 			pgoff_t end, unsigned int nr_pages,
 			struct page **pages);
-static inline unsigned find_get_pages(struct address_space *mapping,
-			pgoff_t *start, unsigned int nr_pages,
-			struct page **pages)
-{
-	return find_get_pages_range(mapping, start, (pgoff_t)-1, nr_pages,
-				    pages);
-}
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,

-1

include/linux/sched.h

···
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */

+10

include/linux/sched/sysctl.h

···
 	SCHED_TUNABLESCALING_END,
 };

+#define NUMA_BALANCING_DISABLED		0x0
+#define NUMA_BALANCING_NORMAL		0x1
+#define NUMA_BALANCING_MEMORY_TIERING	0x2
+
+#ifdef CONFIG_NUMA_BALANCING
+extern int sysctl_numa_balancing_mode;
+#else
+#define sysctl_numa_balancing_mode	0
+#endif
+
 /*
  * control realtime throttling:
  *
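
The three NUMA_BALANCING_* values look like bit flags. Assuming they are meant to be combined with bitwise OR and tested with bitwise AND (the hunk itself does not say so), a mode check could be sketched as follows. numa_balancing_mode_has() and the example function are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the hunk above. */
#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

/* Hypothetical helper: is @flag set in the balancing @mode bitmask? */
static bool numa_balancing_mode_has(int mode, int flag)
{
	return (mode & flag) != 0;
}

/* Hypothetical usage with both modes enabled at once. */
static void numa_balancing_example(void)
{
	int mode = NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING;

	assert(numa_balancing_mode_has(mode, NUMA_BALANCING_NORMAL));
	assert(numa_balancing_mode_has(mode, NUMA_BALANCING_MEMORY_TIERING));
}
```

When CONFIG_NUMA_BALANCING is off, sysctl_numa_balancing_mode is the constant 0, so a test of this shape is always false and the compiler can discard the guarded code.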

+1

include/linux/shmem_fs.h

···
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
+	struct timespec64	i_crtime;	/* file creation time */
 	struct inode		vfs_inode;
 };

+3

include/linux/slab.h

···

 #include <linux/kasan.h>

+struct list_lru;
 struct mem_cgroup;
 /*
  * struct kmem_cache related prototypes
···

 void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;
+void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+			   gfp_t gfpflags) __assume_slab_alignment __malloc;
 void kmem_cache_free(struct kmem_cache *s, void *objp);

 /*

+4 -2

include/linux/swap.h

···

 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
+extern struct list_lru shadow_nodes;
 #define mapping_set_update(xas, mapping) do {	\
-	if (!dax_mapping(mapping) && !shmem_mapping(mapping))	\
+	if (!dax_mapping(mapping) && !shmem_mapping(mapping)) {	\
 		xas_set_update(xas, workingset_update_node);	\
+		xas_set_lru(xas, &shadow_nodes);	\
+	}	\
 } while (0)

 /* linux/mm/page_alloc.c */
···
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern bool __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,

+4 -1

include/linux/thread_info.h

···
 extern void __compiletime_error("copy destination size is too small")
 __bad_copy_to(void);

+void __copy_overflow(int size, unsigned long count);
+
 static inline void copy_overflow(int size, unsigned long count)
 {
-	WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
+	if (IS_ENABLED(CONFIG_BUG))
+		__copy_overflow(size, count);
 }

 static __always_inline __must_check bool

-2

include/linux/uaccess.h

···
 #endif

 #ifdef CONFIG_HARDENED_USERCOPY
-void usercopy_warn(const char *name, const char *detail, bool to_user,
-		   unsigned long offset, unsigned long len);
 void __noreturn usercopy_abort(const char *name, const char *detail,
 			       bool to_user, unsigned long offset,
 			       unsigned long len);

+3

include/linux/vm_event_item.h

···
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#ifdef CONFIG_KSM
+		KSM_SWPIN_COPY,
+#endif
 #endif
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,

+2 -2

include/linux/vmalloc.h

···
 	/*
 	 * The following two variables can be packed, because
 	 * a vmap_area object can be either:
-	 *    1) in "free" tree (root is vmap_area_root)
-	 *    2) or "busy" tree (root is free_vmap_area_root)
+	 *    1) in "free" tree (root is free_vmap_area_root)
+	 *    2) or "busy" tree (root is vmap_area_root)
 	 */
 	union {
 		unsigned long subtree_max_size; /* in "free" tree */

+8 -1

include/linux/xarray.h

···
 	struct xa_node *xa_node;
 	struct xa_node *xa_alloc;
 	xa_update_node_t xa_update;
+	struct list_lru *xa_lru;
 };
 
 /*
···
 	.xa_pad = 0, \
 	.xa_node = XAS_RESTART, \
 	.xa_alloc = NULL, \
-	.xa_update = NULL \
+	.xa_update = NULL, \
+	.xa_lru = NULL, \
 }
 
 /**
···
 static inline void xas_set_update(struct xa_state *xas, xa_update_node_t update)
 {
 	xas->xa_update = update;
+}
+
+static inline void xas_set_lru(struct xa_state *xas, struct list_lru *lru)
+{
+	xas->xa_lru = lru;
 }
 
 /**

+1

include/ras/ras_event.h

···
 	EM ( MF_MSG_BUDDY, "free buddy page" ) \
 	EM ( MF_MSG_DAX, "dax page" ) \
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" ) \
+	EM ( MF_MSG_DIFFERENT_PAGE_SIZE, "different page size" ) \
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*

+13 -13

include/trace/events/compaction.h

··· 67 67 #ifdef CONFIG_COMPACTION 68 68 TRACE_EVENT(mm_compaction_migratepages, 69 69 70 - TP_PROTO(unsigned long nr_all, 70 + TP_PROTO(struct compact_control *cc, 71 71 unsigned int nr_succeeded), 72 72 73 - TP_ARGS(nr_all, nr_succeeded), 73 + TP_ARGS(cc, nr_succeeded), 74 74 75 75 TP_STRUCT__entry( 76 76 __field(unsigned long, nr_migrated) ··· 79 79 80 80 TP_fast_assign( 81 81 __entry->nr_migrated = nr_succeeded; 82 - __entry->nr_failed = nr_all - nr_succeeded; 82 + __entry->nr_failed = cc->nr_migratepages - nr_succeeded; 83 83 ), 84 84 85 85 TP_printk("nr_migrated=%lu nr_failed=%lu", ··· 88 88 ); 89 89 90 90 TRACE_EVENT(mm_compaction_begin, 91 - TP_PROTO(unsigned long zone_start, unsigned long migrate_pfn, 92 - unsigned long free_pfn, unsigned long zone_end, bool sync), 91 + TP_PROTO(struct compact_control *cc, unsigned long zone_start, 92 + unsigned long zone_end, bool sync), 93 93 94 - TP_ARGS(zone_start, migrate_pfn, free_pfn, zone_end, sync), 94 + TP_ARGS(cc, zone_start, zone_end, sync), 95 95 96 96 TP_STRUCT__entry( 97 97 __field(unsigned long, zone_start) ··· 103 103 104 104 TP_fast_assign( 105 105 __entry->zone_start = zone_start; 106 - __entry->migrate_pfn = migrate_pfn; 107 - __entry->free_pfn = free_pfn; 106 + __entry->migrate_pfn = cc->migrate_pfn; 107 + __entry->free_pfn = cc->free_pfn; 108 108 __entry->zone_end = zone_end; 109 109 __entry->sync = sync; 110 110 ), ··· 118 118 ); 119 119 120 120 TRACE_EVENT(mm_compaction_end, 121 - TP_PROTO(unsigned long zone_start, unsigned long migrate_pfn, 122 - unsigned long free_pfn, unsigned long zone_end, bool sync, 121 + TP_PROTO(struct compact_control *cc, unsigned long zone_start, 122 + unsigned long zone_end, bool sync, 123 123 int status), 124 124 125 - TP_ARGS(zone_start, migrate_pfn, free_pfn, zone_end, sync, status), 125 + TP_ARGS(cc, zone_start, zone_end, sync, status), 126 126 127 127 TP_STRUCT__entry( 128 128 __field(unsigned long, zone_start) ··· 135 135 136 136 TP_fast_assign( 137 137 
__entry->zone_start = zone_start; 138 - __entry->migrate_pfn = migrate_pfn; 139 - __entry->free_pfn = free_pfn; 138 + __entry->migrate_pfn = cc->migrate_pfn; 139 + __entry->free_pfn = cc->free_pfn; 140 140 __entry->zone_end = zone_end; 141 141 __entry->sync = sync; 142 142 __entry->status = status;

-28

include/trace/events/writeback.h

··· 735 735 ) 736 736 ); 737 737 738 - DECLARE_EVENT_CLASS(writeback_congest_waited_template, 739 - 740 - TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), 741 - 742 - TP_ARGS(usec_timeout, usec_delayed), 743 - 744 - TP_STRUCT__entry( 745 - __field( unsigned int, usec_timeout ) 746 - __field( unsigned int, usec_delayed ) 747 - ), 748 - 749 - TP_fast_assign( 750 - __entry->usec_timeout = usec_timeout; 751 - __entry->usec_delayed = usec_delayed; 752 - ), 753 - 754 - TP_printk("usec_timeout=%u usec_delayed=%u", 755 - __entry->usec_timeout, 756 - __entry->usec_delayed) 757 - ); 758 - 759 - DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait, 760 - 761 - TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), 762 - 763 - TP_ARGS(usec_timeout, usec_delayed) 764 - ); 765 - 766 738 DECLARE_EVENT_CLASS(writeback_single_inode_template, 767 739 768 740 TP_PROTO(struct inode *inode,

+7 -1

include/uapi/linux/userfaultfd.h

···
 	 UFFD_FEATURE_SIGBUS | \
 	 UFFD_FEATURE_THREAD_ID | \
 	 UFFD_FEATURE_MINOR_HUGETLBFS | \
-	 UFFD_FEATURE_MINOR_SHMEM)
+	 UFFD_FEATURE_MINOR_SHMEM | \
+	 UFFD_FEATURE_EXACT_ADDRESS)
 #define UFFD_API_IOCTLS \
 	((__u64)1 << _UFFDIO_REGISTER | \
 	 (__u64)1 << _UFFDIO_UNREGISTER | \
···
 	 *
 	 * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
 	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
+	 *
+	 * UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
+	 * faults would be provided and the offset within the page would not be
+	 * masked.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
 #define UFFD_FEATURE_EVENT_FORK (1<<1)
···
 #define UFFD_FEATURE_THREAD_ID (1<<8)
 #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
 #define UFFD_FEATURE_MINOR_SHMEM (1<<10)
+#define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
 	__u64 features;
 
 	__u64 ioctls;
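
The comment added for UFFD_FEATURE_EXACT_ADDRESS says the reported fault address keeps its offset within the page instead of being masked. The following is a minimal user-space sketch of that difference, not the kernel implementation. The 4 KiB page size and the helper name sketch_reported_address() are assumptions made for illustration; only the feature bit value comes from the hunk above.

```c
#include <stdint.h>

/* Feature bit as defined in the hunk above. */
#define UFFD_FEATURE_EXACT_ADDRESS (1 << 11)

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper: returns the address a fault message would carry.
 * Without the feature the in-page offset is masked off; with it, the
 * faulting address is passed through unchanged.
 */
static uint64_t sketch_reported_address(uint64_t fault_addr, uint64_t features)
{
	if (features & UFFD_FEATURE_EXACT_ADDRESS)
		return fault_addr;
	return fault_addr & ~(SKETCH_PAGE_SIZE - 1);
}
```

Under these assumptions a fault at 0x1234 would be reported as 0x1000 without the feature and as 0x1234 with it.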

+1 -1

ipc/mqueue.c

···
 {
 	struct mqueue_inode_info *ei;
 
-	ei = kmem_cache_alloc(mqueue_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, mqueue_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;

+1 -3

kernel/dma/contiguous.c

···
 
 static int __init rmem_cma_setup(struct reserved_mem *rmem)
 {
-	phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
-	phys_addr_t mask = align - 1;
 	unsigned long node = rmem->fdt_node;
 	bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL);
 	struct cma *cma;
···
 	    of_get_flat_dt_prop(node, "no-map", NULL))
 		return -EINVAL;
 
-	if ((rmem->base & mask) || (rmem->size & mask)) {
+	if (!IS_ALIGNED(rmem->base | rmem->size, CMA_MIN_ALIGNMENT_BYTES)) {
 		pr_err("Reserved memory: incorrect alignment of CMA region\n");
 		return -EINVAL;
 	}
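
The hunk above folds two mask tests into a single IS_ALIGNED(base | size, ...) check. A small user-space sketch of why the two forms agree for a power-of-two alignment follows. SKETCH_IS_ALIGNED and both helper functions are illustrative stand-ins written for this note, not kernel code: a low bit set in either operand survives the OR, so the combined value is aligned only when both operands are.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in with the same shape as the kernel's IS_ALIGNED(); 'a' must be a power of two. */
#define SKETCH_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Old form: test base and size against the mask separately. */
static bool sketch_aligned_separately(uint64_t base, uint64_t size, uint64_t align)
{
	uint64_t mask = align - 1;

	return !((base & mask) || (size & mask));
}

/* New form: OR the operands first, then test once. */
static bool sketch_aligned_combined(uint64_t base, uint64_t size, uint64_t align)
{
	return SKETCH_IS_ALIGNED(base | size, align);
}
```

Both helpers should return the same answer for every base and size as long as align is a power of two.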

+17 -4

kernel/sched/core.c

···
 
 #ifdef CONFIG_NUMA_BALANCING
 
-void set_numabalancing_state(bool enabled)
+int sysctl_numa_balancing_mode;
+
+static void __set_numabalancing_state(bool enabled)
 {
 	if (enabled)
 		static_branch_enable(&sched_numa_balancing);
 	else
 		static_branch_disable(&sched_numa_balancing);
+}
+
+void set_numabalancing_state(bool enabled)
+{
+	if (enabled)
+		sysctl_numa_balancing_mode = NUMA_BALANCING_NORMAL;
+	else
+		sysctl_numa_balancing_mode = NUMA_BALANCING_DISABLED;
+	__set_numabalancing_state(enabled);
 }
 
 #ifdef CONFIG_PROC_SYSCTL
···
 {
 	struct ctl_table t;
 	int err;
-	int state = static_branch_likely(&sched_numa_balancing);
+	int state = sysctl_numa_balancing_mode;
 
 	if (write && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
···
 	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
 	if (err < 0)
 		return err;
-	if (write)
-		set_numabalancing_state(state);
+	if (write) {
+		sysctl_numa_balancing_mode = state;
+		__set_numabalancing_state(state);
+	}
 	return err;
 }
 #endif
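
The hunk above separates the stored sysctl mode from the on/off static branch, so a write keeps the full integer while the branch only sees whether it is non-zero. A rough user-space model of that split is sketched below. All sketch_* names are hypothetical, the static branch is replaced by a plain bool, and the values 0 and 1 for the disabled and normal modes are assumptions chosen to match the boolean mapping in the hunk.

```c
#include <stdbool.h>

/* Assumed values, chosen to match the bool-to-mode mapping in the hunk. */
enum {
	SKETCH_NUMA_BALANCING_DISABLED = 0,
	SKETCH_NUMA_BALANCING_NORMAL = 1,
};

static int sketch_mode;            /* models sysctl_numa_balancing_mode */
static bool sketch_branch_enabled; /* models the sched_numa_balancing static branch */

/* Models __set_numabalancing_state(): only toggles the branch. */
static void sketch_set_branch(bool enabled)
{
	sketch_branch_enabled = enabled;
}

/* Models set_numabalancing_state(): boolean callers get normal or disabled. */
static void sketch_set_numabalancing_state(bool enabled)
{
	sketch_mode = enabled ? SKETCH_NUMA_BALANCING_NORMAL
			      : SKETCH_NUMA_BALANCING_DISABLED;
	sketch_set_branch(enabled);
}

/* Models the sysctl write path: keep the full mode, enable on any non-zero value. */
static void sketch_sysctl_write(int state)
{
	sketch_mode = state;
	sketch_set_branch(state);
}
```

In this model a write of 2 leaves the mode at 2 and the branch enabled, which the earlier read-back through static_branch_likely() could not represent.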

+1 -1

kernel/sysctl.c

···
 		.mode = 0644,
 		.proc_handler = sysctl_numa_balancing,
 		.extra1 = SYSCTL_ZERO,
-		.extra2 = SYSCTL_ONE,
+		.extra2 = SYSCTL_FOUR,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
 	{

+12

lib/Kconfig.kfence

···
 	  pages are required; with one containing the object and two adjacent
 	  ones used as guard pages.
 
+config KFENCE_DEFERRABLE
+	bool "Use a deferrable timer to trigger allocations"
+	help
+	  Use a deferrable timer to trigger allocations. This avoids forcing
+	  CPU wake-ups if the system is idle, at the risk of a less predictable
+	  sample interval.
+
+	  Warning: The KUnit test suite fails with this option enabled - due to
+	  the unpredictability of the sample interval!
+
+	  Say N if you are unsure.
+
 config KFENCE_STATIC_KEYS
 	bool "Use static keys to set up allocations" if EXPERT
 	depends on JUMP_LABEL

+2 -1

lib/kunit/try-catch.c

···
 	 * If tests timeout due to exceeding sysctl_hung_task_timeout_secs,
 	 * the task will be killed and an oops generated.
 	 */
-	return 300 * MSEC_PER_SEC; /* 5 min */
+	return 300 * msecs_to_jiffies(MSEC_PER_SEC); /* 5 min */
 }
 
 void kunit_try_catch_run(struct kunit_try_catch *try_catch, void *context)
···
 	if (time_remaining == 0) {
 		kunit_err(test, "try timed out\n");
 		try_catch->try_result = -ETIMEDOUT;
+		kthread_stop(task_struct);
 	}
 
 	exit_code = try_catch->try_result;
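
The timeout change above swaps a millisecond count for a jiffies count. The arithmetic can be sketched in user space as follows, assuming the return value is consumed as a jiffies timeout, which the use of msecs_to_jiffies() suggests. SKETCH_HZ is a hypothetical tick rate of 250, and sketch_msecs_to_jiffies() is a simplified round-up stand-in rather than the kernel function.

```c
/* Hypothetical tick rate for illustration; the real HZ is a build-time choice. */
#define SKETCH_HZ 250UL
#define SKETCH_MSEC_PER_SEC 1000UL

/* Simplified stand-in for msecs_to_jiffies(): convert and round up. */
static unsigned long sketch_msecs_to_jiffies(unsigned long ms)
{
	return (ms * SKETCH_HZ + SKETCH_MSEC_PER_SEC - 1) / SKETCH_MSEC_PER_SEC;
}

/* Old expression: a millisecond count, which is not a jiffies value. */
static unsigned long sketch_timeout_before(void)
{
	return 300 * SKETCH_MSEC_PER_SEC;
}

/* New expression: 300 seconds expressed in jiffies. */
static unsigned long sketch_timeout_after(void)
{
	return 300 * sketch_msecs_to_jiffies(SKETCH_MSEC_PER_SEC);
}
```

With the assumed 250 ticks per second, the new expression gives 75000 jiffies, which is 300 seconds. The old value of 300000, read as jiffies, would have been 1200 seconds rather than the 5 minutes named in the comment.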

+5 -5

lib/xarray.c

···
 	}
 	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
 		gfp |= __GFP_ACCOUNT;
-	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+	xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
 	if (!xas->xa_alloc)
 		return false;
 	xas->xa_alloc->parent = NULL;
···
 		gfp |= __GFP_ACCOUNT;
 	if (gfpflags_allow_blocking(gfp)) {
 		xas_unlock_type(xas, lock_type);
-		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
 		xas_lock_type(xas, lock_type);
 	} else {
-		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
 	}
 	if (!xas->xa_alloc)
 		return false;
···
 		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
 			gfp |= __GFP_ACCOUNT;
 
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
 		if (!node) {
 			xas_set_err(xas, -ENOMEM);
 			return NULL;
···
 		void *sibling = NULL;
 		struct xa_node *node;
 
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
 		if (!node)
 			goto nomem;
 		node->array = xas->xa;

+6

mm/Kconfig

···
 	  HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
 	  on a platform.
 
+	  Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
+	  clamped down to MAX_ORDER - 1.
+
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
···
 	  memory footprint of applications without a guaranteed
 	  benefit.
 endchoice
+
+config ARCH_WANT_GENERAL_HUGETLB
+	bool
 
 config ARCH_WANTS_THP_SWAP
 	def_bool n

-57

mm/backing-dev.c

··· 1005 1005 return bdi->dev_name; 1006 1006 } 1007 1007 EXPORT_SYMBOL_GPL(bdi_dev_name); 1008 - 1009 - static wait_queue_head_t congestion_wqh[2] = { 1010 - __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]), 1011 - __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1]) 1012 - }; 1013 - static atomic_t nr_wb_congested[2]; 1014 - 1015 - void clear_bdi_congested(struct backing_dev_info *bdi, int sync) 1016 - { 1017 - wait_queue_head_t *wqh = &congestion_wqh[sync]; 1018 - enum wb_congested_state bit; 1019 - 1020 - bit = sync ? WB_sync_congested : WB_async_congested; 1021 - if (test_and_clear_bit(bit, &bdi->wb.congested)) 1022 - atomic_dec(&nr_wb_congested[sync]); 1023 - smp_mb__after_atomic(); 1024 - if (waitqueue_active(wqh)) 1025 - wake_up(wqh); 1026 - } 1027 - EXPORT_SYMBOL(clear_bdi_congested); 1028 - 1029 - void set_bdi_congested(struct backing_dev_info *bdi, int sync) 1030 - { 1031 - enum wb_congested_state bit; 1032 - 1033 - bit = sync ? WB_sync_congested : WB_async_congested; 1034 - if (!test_and_set_bit(bit, &bdi->wb.congested)) 1035 - atomic_inc(&nr_wb_congested[sync]); 1036 - } 1037 - EXPORT_SYMBOL(set_bdi_congested); 1038 - 1039 - /** 1040 - * congestion_wait - wait for a backing_dev to become uncongested 1041 - * @sync: SYNC or ASYNC IO 1042 - * @timeout: timeout in jiffies 1043 - * 1044 - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit 1045 - * write congestion. If no backing_devs are congested then just wait for the 1046 - * next write to be completed. 
1047 - */ 1048 - long congestion_wait(int sync, long timeout) 1049 - { 1050 - long ret; 1051 - unsigned long start = jiffies; 1052 - DEFINE_WAIT(wait); 1053 - wait_queue_head_t *wqh = &congestion_wqh[sync]; 1054 - 1055 - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); 1056 - ret = io_schedule_timeout(timeout); 1057 - finish_wait(wqh, &wait); 1058 - 1059 - trace_writeback_congestion_wait(jiffies_to_usecs(timeout), 1060 - jiffies_to_usecs(jiffies - start)); 1061 - 1062 - return ret; 1063 - } 1064 - EXPORT_SYMBOL(congestion_wait);

+14 -17

mm/cma.c

··· 131 131 bitmap_free(cma->bitmap); 132 132 out_error: 133 133 /* Expose all pages to the buddy, they are useless for CMA. */ 134 - for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++) 135 - free_reserved_page(pfn_to_page(pfn)); 134 + if (!cma->reserve_pages_on_error) { 135 + for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++) 136 + free_reserved_page(pfn_to_page(pfn)); 137 + } 136 138 totalcma_pages -= cma->count; 137 139 cma->count = 0; 138 140 pr_err("CMA area %s could not be activated\n", cma->name); ··· 151 149 return 0; 152 150 } 153 151 core_initcall(cma_init_reserved_areas); 152 + 153 + void __init cma_reserve_pages_on_error(struct cma *cma) 154 + { 155 + cma->reserve_pages_on_error = true; 156 + } 154 157 155 158 /** 156 159 * cma_init_reserved_mem() - create custom contiguous area from reserved memory ··· 175 168 struct cma **res_cma) 176 169 { 177 170 struct cma *cma; 178 - phys_addr_t alignment; 179 171 180 172 /* Sanity checks */ 181 173 if (cma_area_count == ARRAY_SIZE(cma_areas)) { ··· 185 179 if (!size || !memblock_is_region_reserved(base, size)) 186 180 return -EINVAL; 187 181 188 - /* ensure minimal alignment required by mm core */ 189 - alignment = PAGE_SIZE << 190 - max_t(unsigned long, MAX_ORDER - 1, pageblock_order); 191 - 192 182 /* alignment should be aligned with order_per_bit */ 193 - if (!IS_ALIGNED(alignment >> PAGE_SHIFT, 1 << order_per_bit)) 183 + if (!IS_ALIGNED(CMA_MIN_ALIGNMENT_PAGES, 1 << order_per_bit)) 194 184 return -EINVAL; 195 185 196 - if (ALIGN(base, alignment) != base || ALIGN(size, alignment) != size) 186 + /* ensure minimal alignment required by mm core */ 187 + if (!IS_ALIGNED(base | size, CMA_MIN_ALIGNMENT_BYTES)) 197 188 return -EINVAL; 198 189 199 190 /* ··· 265 262 if (alignment && !is_power_of_2(alignment)) 266 263 return -EINVAL; 267 264 268 - /* 269 - * Sanitise input arguments. 
270 - * Pages both ends in CMA area could be merged into adjacent unmovable 271 - * migratetype page by page allocator's buddy algorithm. In the case, 272 - * you couldn't get a contiguous memory, which is not what we want. 273 - */ 274 - alignment = max(alignment, (phys_addr_t)PAGE_SIZE << 275 - max_t(unsigned long, MAX_ORDER - 1, pageblock_order)); 265 + /* Sanitise input arguments. */ 266 + alignment = max_t(phys_addr_t, alignment, CMA_MIN_ALIGNMENT_BYTES); 276 267 if (fixed && base & (alignment - 1)) { 277 268 ret = -EINVAL; 278 269 pr_err("Region at %pa must be aligned to %pa bytes\n",

+1

mm/cma.h

···
 	/* kobject requires dynamic object */
 	struct cma_kobject *cma_kobj;
 #endif
+	bool reserve_pages_on_error;
 };
 
 extern struct cma cma_areas[MAX_CMA_AREAS];

+47 -13

mm/compaction.c

··· 785 785 * @cc: Compaction control structure. 786 786 * @low_pfn: The first PFN to isolate 787 787 * @end_pfn: The one-past-the-last PFN to isolate, within same pageblock 788 - * @isolate_mode: Isolation mode to be used. 788 + * @mode: Isolation mode to be used. 789 789 * 790 790 * Isolate all pages that can be migrated from the range specified by 791 791 * [low_pfn, end_pfn). The range is expected to be within same pageblock. ··· 798 798 */ 799 799 static int 800 800 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, 801 - unsigned long end_pfn, isolate_mode_t isolate_mode) 801 + unsigned long end_pfn, isolate_mode_t mode) 802 802 { 803 803 pg_data_t *pgdat = cc->zone->zone_pgdat; 804 804 unsigned long nr_scanned = 0, nr_isolated = 0; ··· 806 806 unsigned long flags = 0; 807 807 struct lruvec *locked = NULL; 808 808 struct page *page = NULL, *valid_page = NULL; 809 + struct address_space *mapping; 809 810 unsigned long start_pfn = low_pfn; 810 811 bool skip_on_failure = false; 811 812 unsigned long next_skip_pfn = 0; ··· 991 990 locked = NULL; 992 991 } 993 992 994 - if (!isolate_movable_page(page, isolate_mode)) 993 + if (!isolate_movable_page(page, mode)) 995 994 goto isolate_success; 996 995 } 997 996 ··· 1003 1002 * so avoid taking lru_lock and isolating it unnecessarily in an 1004 1003 * admittedly racy check. 1005 1004 */ 1006 - if (!page_mapping(page) && 1007 - page_count(page) > page_mapcount(page)) 1005 + mapping = page_mapping(page); 1006 + if (!mapping && page_count(page) > page_mapcount(page)) 1008 1007 goto isolate_fail; 1009 1008 1010 1009 /* 1011 1010 * Only allow to migrate anonymous pages in GFP_NOFS context 1012 1011 * because those do not depend on fs locks. 
1013 1012 */ 1014 - if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page)) 1013 + if (!(cc->gfp_mask & __GFP_FS) && mapping) 1015 1014 goto isolate_fail; 1016 1015 1017 1016 /* ··· 1022 1021 if (unlikely(!get_page_unless_zero(page))) 1023 1022 goto isolate_fail; 1024 1023 1025 - if (!__isolate_lru_page_prepare(page, isolate_mode)) 1024 + /* Only take pages on LRU: a check now makes later tests safe */ 1025 + if (!PageLRU(page)) 1026 1026 goto isolate_fail_put; 1027 + 1028 + /* Compaction might skip unevictable pages but CMA takes them */ 1029 + if (!(mode & ISOLATE_UNEVICTABLE) && PageUnevictable(page)) 1030 + goto isolate_fail_put; 1031 + 1032 + /* 1033 + * To minimise LRU disruption, the caller can indicate with 1034 + * ISOLATE_ASYNC_MIGRATE that it only wants to isolate pages 1035 + * it will be able to migrate without blocking - clean pages 1036 + * for the most part. PageWriteback would require blocking. 1037 + */ 1038 + if ((mode & ISOLATE_ASYNC_MIGRATE) && PageWriteback(page)) 1039 + goto isolate_fail_put; 1040 + 1041 + if ((mode & ISOLATE_ASYNC_MIGRATE) && PageDirty(page)) { 1042 + bool migrate_dirty; 1043 + 1044 + /* 1045 + * Only pages without mappings or that have a 1046 + * ->migratepage callback are possible to migrate 1047 + * without blocking. However, we can be racing with 1048 + * truncation so it's necessary to lock the page 1049 + * to stabilise the mapping as truncation holds 1050 + * the page lock until after the page is removed 1051 + * from the page cache. 
1052 + */ 1053 + if (!trylock_page(page)) 1054 + goto isolate_fail_put; 1055 + 1056 + mapping = page_mapping(page); 1057 + migrate_dirty = !mapping || mapping->a_ops->migratepage; 1058 + unlock_page(page); 1059 + if (!migrate_dirty) 1060 + goto isolate_fail_put; 1061 + } 1027 1062 1028 1063 /* Try isolate the page */ 1029 1064 if (!TestClearPageLRU(page)) ··· 2387 2350 update_cached = !sync && 2388 2351 cc->zone->compact_cached_migrate_pfn[0] == cc->zone->compact_cached_migrate_pfn[1]; 2389 2352 2390 - trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, 2391 - cc->free_pfn, end_pfn, sync); 2353 + trace_mm_compaction_begin(cc, start_pfn, end_pfn, sync); 2392 2354 2393 2355 /* lru_add_drain_all could be expensive with involving other CPUs */ 2394 2356 lru_add_drain(); ··· 2437 2401 compaction_free, (unsigned long)cc, cc->mode, 2438 2402 MR_COMPACTION, &nr_succeeded); 2439 2403 2440 - trace_mm_compaction_migratepages(cc->nr_migratepages, 2441 - nr_succeeded); 2404 + trace_mm_compaction_migratepages(cc, nr_succeeded); 2442 2405 2443 2406 /* All pages were either migrated or will be released */ 2444 2407 cc->nr_migratepages = 0; ··· 2513 2478 count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned); 2514 2479 count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned); 2515 2480 2516 - trace_mm_compaction_end(start_pfn, cc->migrate_pfn, 2517 - cc->free_pfn, end_pfn, sync, ret); 2481 + trace_mm_compaction_end(cc, start_pfn, end_pfn, sync, ret); 2518 2482 2519 2483 return ret; 2520 2484 }
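
The checks that replace __isolate_lru_page_prepare() in the hunk above can be read as a single decision function over page state and isolation mode. The sketch below models only that decision in user space. struct sketch_page, its fields, and the SKETCH_ISOLATE_* flag values are hypothetical stand-ins, and reference counting, locking, and the final TestClearPageLRU() step are left out.

```c
#include <stdbool.h>

/* Hypothetical flag values; the kernel's isolate_mode_t bits are not reproduced here. */
#define SKETCH_ISOLATE_UNEVICTABLE   0x1u
#define SKETCH_ISOLATE_ASYNC_MIGRATE 0x2u

/* Hypothetical snapshot of the page state consulted by the checks. */
struct sketch_page {
	bool lru;                     /* models PageLRU() */
	bool unevictable;             /* models PageUnevictable() */
	bool writeback;               /* models PageWriteback() */
	bool dirty;                   /* models PageDirty() */
	bool lock_contended;          /* models trylock_page() failing */
	bool has_mapping;             /* models page_mapping() != NULL */
	bool mapping_has_migratepage; /* models mapping->a_ops->migratepage != NULL */
};

/* Returns true when the page would pass the checks shown in the hunk. */
static bool sketch_can_isolate(const struct sketch_page *p, unsigned int mode)
{
	/* Only take pages on the LRU. */
	if (!p->lru)
		return false;

	/* Unevictable pages are skipped unless the caller asks for them. */
	if (!(mode & SKETCH_ISOLATE_UNEVICTABLE) && p->unevictable)
		return false;

	/* Async migration must not block: writeback would require waiting. */
	if ((mode & SKETCH_ISOLATE_ASYNC_MIGRATE) && p->writeback)
		return false;

	/* Dirty pages are taken only if they can be migrated without blocking. */
	if ((mode & SKETCH_ISOLATE_ASYNC_MIGRATE) && p->dirty) {
		if (p->lock_contended)
			return false;
		if (p->has_mapping && !p->mapping_has_migratepage)
			return false;
	}

	return true;
}
```

In the real code each failing check jumps to isolate_fail_put so that the reference taken by get_page_unless_zero() is dropped, a detail this sketch deliberately ignores.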

+13 -6

mm/damon/Kconfig

···
 	  If unsure, say N.
 
 config DAMON_VADDR
-	bool "Data access monitoring primitives for virtual address spaces"
+	bool "Data access monitoring operations for virtual address spaces"
 	depends on DAMON && MMU
 	select PAGE_IDLE_FLAG
 	help
-	  This builds the default data access monitoring primitives for DAMON
+	  This builds the default data access monitoring operations for DAMON
 	  that work for virtual address spaces.
 
 config DAMON_PADDR
-	bool "Data access monitoring primitives for the physical address space"
+	bool "Data access monitoring operations for the physical address space"
 	depends on DAMON && MMU
 	select PAGE_IDLE_FLAG
 	help
-	  This builds the default data access monitoring primitives for DAMON
+	  This builds the default data access monitoring operations for DAMON
 	  that works for the physical address space.
 
 config DAMON_VADDR_KUNIT_TEST
-	bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS
+	bool "Test for DAMON operations" if !KUNIT_ALL_TESTS
 	depends on DAMON_VADDR && KUNIT=y
 	default KUNIT_ALL_TESTS
 	help
-	  This builds the DAMON virtual addresses primitives Kunit test suite.
+	  This builds the DAMON virtual addresses operations Kunit test suite.
 
 	  For more information on KUnit and unit tests in general, please refer
 	  to the KUnit documentation.
 
 	  If unsure, say N.
+
+config DAMON_SYSFS
+	bool "DAMON sysfs interface"
+	depends on DAMON && SYSFS
+	help
+	  This builds the sysfs interface for DAMON. The user space can use
+	  the interface for arbitrary data access monitoring.
 
 config DAMON_DBGFS
 	bool "DAMON debugfs interface"

+4 -3

mm/damon/Makefile

···
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_DAMON) := core.o
-obj-$(CONFIG_DAMON_VADDR) += prmtv-common.o vaddr.o
-obj-$(CONFIG_DAMON_PADDR) += prmtv-common.o paddr.o
+obj-y := core.o
+obj-$(CONFIG_DAMON_VADDR) += ops-common.o vaddr.o
+obj-$(CONFIG_DAMON_PADDR) += ops-common.o paddr.o
+obj-$(CONFIG_DAMON_SYSFS) += sysfs.o
 obj-$(CONFIG_DAMON_DBGFS) += dbgfs.o
 obj-$(CONFIG_DAMON_RECLAIM) += reclaim.o

+11 -10

mm/damon/core-test.h

··· 24 24 KUNIT_EXPECT_EQ(test, 2ul, r->ar.end); 25 25 KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses); 26 26 27 - t = damon_new_target(42); 27 + t = damon_new_target(); 28 28 KUNIT_EXPECT_EQ(test, 0u, damon_nr_regions(t)); 29 29 30 30 damon_add_region(r, t); ··· 52 52 struct damon_ctx *c = damon_new_ctx(); 53 53 struct damon_target *t; 54 54 55 - t = damon_new_target(42); 56 - KUNIT_EXPECT_EQ(test, 42ul, t->id); 55 + t = damon_new_target(); 57 56 KUNIT_EXPECT_EQ(test, 0u, nr_damon_targets(c)); 58 57 59 58 damon_add_target(c, t); ··· 77 78 static void damon_test_aggregate(struct kunit *test) 78 79 { 79 80 struct damon_ctx *ctx = damon_new_ctx(); 80 - unsigned long target_ids[] = {1, 2, 3}; 81 81 unsigned long saddr[][3] = {{10, 20, 30}, {5, 42, 49}, {13, 33, 55} }; 82 82 unsigned long eaddr[][3] = {{15, 27, 40}, {31, 45, 55}, {23, 44, 66} }; 83 83 unsigned long accesses[][3] = {{42, 95, 84}, {10, 20, 30}, {0, 1, 2} }; ··· 84 86 struct damon_region *r; 85 87 int it, ir; 86 88 87 - damon_set_targets(ctx, target_ids, 3); 89 + for (it = 0; it < 3; it++) { 90 + t = damon_new_target(); 91 + damon_add_target(ctx, t); 92 + } 88 93 89 94 it = 0; 90 95 damon_for_each_target(t, ctx) { ··· 123 122 struct damon_target *t; 124 123 struct damon_region *r; 125 124 126 - t = damon_new_target(42); 125 + t = damon_new_target(); 127 126 r = damon_new_region(0, 100); 128 127 damon_add_region(r, t); 129 128 damon_split_region_at(c, t, r, 25); ··· 144 143 struct damon_region *r, *r2, *r3; 145 144 int i; 146 145 147 - t = damon_new_target(42); 146 + t = damon_new_target(); 148 147 r = damon_new_region(0, 100); 149 148 r->nr_accesses = 10; 150 149 damon_add_region(r, t); ··· 192 191 unsigned long eaddrs[] = {112, 130, 156, 170, 230}; 193 192 int i; 194 193 195 - t = damon_new_target(42); 194 + t = damon_new_target(); 196 195 for (i = 0; i < ARRAY_SIZE(sa); i++) { 197 196 r = damon_new_region(sa[i], ea[i]); 198 197 r->nr_accesses = nrs[i]; ··· 216 215 struct damon_target *t; 217 216 struct 
damon_region *r; 218 217 219 - t = damon_new_target(42); 218 + t = damon_new_target(); 220 219 r = damon_new_region(0, 22); 221 220 damon_add_region(r, t); 222 221 damon_split_regions_of(c, t, 2); 223 222 KUNIT_EXPECT_LE(test, damon_nr_regions(t), 2u); 224 223 damon_free_target(t); 225 224 226 - t = damon_new_target(42); 225 + t = damon_new_target(); 227 226 r = damon_new_region(0, 220); 228 227 damon_add_region(r, t); 229 228 damon_split_regions_of(c, t, 4);

+116 -74

mm/damon/core.c

··· 24 24 25 25 static DEFINE_MUTEX(damon_lock); 26 26 static int nr_running_ctxs; 27 + static bool running_exclusive_ctxs; 28 + 29 + static DEFINE_MUTEX(damon_ops_lock); 30 + static struct damon_operations damon_registered_ops[NR_DAMON_OPS]; 31 + 32 + /* Should be called under damon_ops_lock with id smaller than NR_DAMON_OPS */ 33 + static bool damon_registered_ops_id(enum damon_ops_id id) 34 + { 35 + struct damon_operations empty_ops = {}; 36 + 37 + if (!memcmp(&empty_ops, &damon_registered_ops[id], sizeof(empty_ops))) 38 + return false; 39 + return true; 40 + } 41 + 42 + /** 43 + * damon_register_ops() - Register a monitoring operations set to DAMON. 44 + * @ops: monitoring operations set to register. 45 + * 46 + * This function registers a monitoring operations set of valid &struct 47 + * damon_operations->id so that others can find and use them later. 48 + * 49 + * Return: 0 on success, negative error code otherwise. 50 + */ 51 + int damon_register_ops(struct damon_operations *ops) 52 + { 53 + int err = 0; 54 + 55 + if (ops->id >= NR_DAMON_OPS) 56 + return -EINVAL; 57 + mutex_lock(&damon_ops_lock); 58 + /* Fail for already registered ops */ 59 + if (damon_registered_ops_id(ops->id)) { 60 + err = -EINVAL; 61 + goto out; 62 + } 63 + damon_registered_ops[ops->id] = *ops; 64 + out: 65 + mutex_unlock(&damon_ops_lock); 66 + return err; 67 + } 68 + 69 + /** 70 + * damon_select_ops() - Select a monitoring operations to use with the context. 71 + * @ctx: monitoring context to use the operations. 72 + * @id: id of the registered monitoring operations to select. 73 + * 74 + * This function finds registered monitoring operations set of @id and make 75 + * @ctx to use it. 76 + * 77 + * Return: 0 on success, negative error code otherwise. 
78 + */ 79 + int damon_select_ops(struct damon_ctx *ctx, enum damon_ops_id id) 80 + { 81 + int err = 0; 82 + 83 + if (id >= NR_DAMON_OPS) 84 + return -EINVAL; 85 + 86 + mutex_lock(&damon_ops_lock); 87 + if (!damon_registered_ops_id(id)) 88 + err = -EINVAL; 89 + else 90 + ctx->ops = damon_registered_ops[id]; 91 + mutex_unlock(&damon_ops_lock); 92 + return err; 93 + } 27 94 28 95 /* 29 96 * Construct a damon_region struct ··· 211 144 * 212 145 * Returns the pointer to the new struct if success, or NULL otherwise 213 146 */ 214 - struct damon_target *damon_new_target(unsigned long id) 147 + struct damon_target *damon_new_target(void) 215 148 { 216 149 struct damon_target *t; 217 150 ··· 219 152 if (!t) 220 153 return NULL; 221 154 222 - t->id = id; 155 + t->pid = NULL; 223 156 t->nr_regions = 0; 224 157 INIT_LIST_HEAD(&t->regions_list); 225 158 ··· 271 204 272 205 ctx->sample_interval = 5 * 1000; 273 206 ctx->aggr_interval = 100 * 1000; 274 - ctx->primitive_update_interval = 60 * 1000 * 1000; 207 + ctx->ops_update_interval = 60 * 1000 * 1000; 275 208 276 209 ktime_get_coarse_ts64(&ctx->last_aggregation); 277 - ctx->last_primitive_update = ctx->last_aggregation; 210 + ctx->last_ops_update = ctx->last_aggregation; 278 211 279 212 mutex_init(&ctx->kdamond_lock); 280 213 ··· 291 224 { 292 225 struct damon_target *t, *next_t; 293 226 294 - if (ctx->primitive.cleanup) { 295 - ctx->primitive.cleanup(ctx); 227 + if (ctx->ops.cleanup) { 228 + ctx->ops.cleanup(ctx); 296 229 return; 297 230 } 298 231 ··· 313 246 } 314 247 315 248 /** 316 - * damon_set_targets() - Set monitoring targets. 317 - * @ctx: monitoring context 318 - * @ids: array of target ids 319 - * @nr_ids: number of entries in @ids 320 - * 321 - * This function should not be called while the kdamond is running. 322 - * 323 - * Return: 0 on success, negative error code otherwise. 
324 - */ 325 - int damon_set_targets(struct damon_ctx *ctx, 326 - unsigned long *ids, ssize_t nr_ids) 327 - { 328 - ssize_t i; 329 - struct damon_target *t, *next; 330 - 331 - damon_destroy_targets(ctx); 332 - 333 - for (i = 0; i < nr_ids; i++) { 334 - t = damon_new_target(ids[i]); 335 - if (!t) { 336 - /* The caller should do cleanup of the ids itself */ 337 - damon_for_each_target_safe(t, next, ctx) 338 - damon_destroy_target(t); 339 - return -ENOMEM; 340 - } 341 - damon_add_target(ctx, t); 342 - } 343 - 344 - return 0; 345 - } 346 - 347 - /** 348 249 * damon_set_attrs() - Set attributes for the monitoring. 349 250 * @ctx: monitoring context 350 251 * @sample_int: time interval between samplings 351 252 * @aggr_int: time interval between aggregations 352 - * @primitive_upd_int: time interval between monitoring primitive updates 253 + * @ops_upd_int: time interval between monitoring operations updates 353 254 * @min_nr_reg: minimal number of regions 354 255 * @max_nr_reg: maximum number of regions 355 256 * ··· 327 292 * Return: 0 on success, negative error code otherwise. 328 293 */ 329 294 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, 330 - unsigned long aggr_int, unsigned long primitive_upd_int, 295 + unsigned long aggr_int, unsigned long ops_upd_int, 331 296 unsigned long min_nr_reg, unsigned long max_nr_reg) 332 297 { 333 298 if (min_nr_reg < 3) ··· 337 302 338 303 ctx->sample_interval = sample_int; 339 304 ctx->aggr_interval = aggr_int; 340 - ctx->primitive_update_interval = primitive_upd_int; 305 + ctx->ops_update_interval = ops_upd_int; 341 306 ctx->min_nr_regions = min_nr_reg; 342 307 ctx->max_nr_regions = max_nr_reg; 343 308 ··· 435 400 * damon_start() - Starts the monitorings for a given group of contexts. 
436 401 * @ctxs: an array of the pointers for contexts to start monitoring 437 402 * @nr_ctxs: size of @ctxs 403 + * @exclusive: exclusiveness of this contexts group 438 404 * 439 405 * This function starts a group of monitoring threads for a group of monitoring 440 406 * contexts. One thread per each context is created and run in parallel. The 441 - * caller should handle synchronization between the threads by itself. If a 442 - * group of threads that created by other 'damon_start()' call is currently 443 - * running, this function does nothing but returns -EBUSY. 407 + * caller should handle synchronization between the threads by itself. If 408 + * @exclusive is true and a group of threads that created by other 409 + * 'damon_start()' call is currently running, this function does nothing but 410 + * returns -EBUSY. 444 411 * 445 412 * Return: 0 on success, negative error code otherwise. 446 413 */ 447 - int damon_start(struct damon_ctx **ctxs, int nr_ctxs) 414 + int damon_start(struct damon_ctx **ctxs, int nr_ctxs, bool exclusive) 448 415 { 449 416 int i; 450 417 int err = 0; 451 418 452 419 mutex_lock(&damon_lock); 453 - if (nr_running_ctxs) { 420 + if ((exclusive && nr_running_ctxs) || 421 + (!exclusive && running_exclusive_ctxs)) { 454 422 mutex_unlock(&damon_lock); 455 423 return -EBUSY; 456 424 } ··· 464 426 break; 465 427 nr_running_ctxs++; 466 428 } 429 + if (exclusive && nr_running_ctxs) 430 + running_exclusive_ctxs = true; 467 431 mutex_unlock(&damon_lock); 468 432 469 433 return err; 470 434 } 471 435 472 436 /* 473 - * __damon_stop() - Stops monitoring of given context. 437 + * __damon_stop() - Stops monitoring of a given context. 474 438 * @ctx: monitoring context 475 439 * 476 440 * Return: 0 on success, negative error code otherwise. 
··· 510 470 /* nr_running_ctxs is decremented in kdamond_fn */ 511 471 err = __damon_stop(ctxs[i]); 512 472 if (err) 513 - return err; 473 + break; 514 474 } 515 - 516 475 return err; 517 476 } 518 477 ··· 587 548 { 588 549 bool ret = __damos_valid_target(r, s); 589 550 590 - if (!ret || !s->quota.esz || !c->primitive.get_scheme_score) 551 + if (!ret || !s->quota.esz || !c->ops.get_scheme_score) 591 552 return ret; 592 553 593 - return c->primitive.get_scheme_score(c, t, r, s) >= s->quota.min_score; 554 + return c->ops.get_scheme_score(c, t, r, s) >= s->quota.min_score; 594 555 } 595 556 596 557 static void damon_do_apply_schemes(struct damon_ctx *c, ··· 647 608 continue; 648 609 649 610 /* Apply the scheme */ 650 - if (c->primitive.apply_scheme) { 611 + if (c->ops.apply_scheme) { 651 612 if (quota->esz && 652 613 quota->charged_sz + sz > quota->esz) { 653 614 sz = ALIGN_DOWN(quota->esz - quota->charged_sz, ··· 657 618 damon_split_region_at(c, t, r, sz); 658 619 } 659 620 ktime_get_coarse_ts64(&begin); 660 - sz_applied = c->primitive.apply_scheme(c, t, r, s); 621 + sz_applied = c->ops.apply_scheme(c, t, r, s); 661 622 ktime_get_coarse_ts64(&end); 662 623 quota->total_charged_ns += timespec64_to_ns(&end) - 663 624 timespec64_to_ns(&begin); ··· 731 692 damos_set_effective_quota(quota); 732 693 } 733 694 734 - if (!c->primitive.get_scheme_score) 695 + if (!c->ops.get_scheme_score) 735 696 continue; 736 697 737 698 /* Fill up the score histogram */ ··· 740 701 damon_for_each_region(r, t) { 741 702 if (!__damos_valid_target(r, s)) 742 703 continue; 743 - score = c->primitive.get_scheme_score( 704 + score = c->ops.get_scheme_score( 744 705 c, t, r, s); 745 706 quota->histogram[score] += 746 707 r->ar.end - r->ar.start; ··· 919 880 } 920 881 921 882 /* 922 - * Check whether it is time to check and apply the target monitoring regions 883 + * Check whether it is time to check and apply the operations-related data 884 + * structures. 
923 885 * 924 886 * Returns true if it is. 925 887 */ 926 - static bool kdamond_need_update_primitive(struct damon_ctx *ctx) 888 + static bool kdamond_need_update_operations(struct damon_ctx *ctx) 927 889 { 928 - return damon_check_reset_time_interval(&ctx->last_primitive_update, 929 - ctx->primitive_update_interval); 890 + return damon_check_reset_time_interval(&ctx->last_ops_update, 891 + ctx->ops_update_interval); 930 892 } 931 893 932 894 /* ··· 945 905 if (kthread_should_stop()) 946 906 return true; 947 907 948 - if (!ctx->primitive.target_valid) 908 + if (!ctx->ops.target_valid) 949 909 return false; 950 910 951 911 damon_for_each_target(t, ctx) { 952 - if (ctx->primitive.target_valid(t)) 912 + if (ctx->ops.target_valid(t)) 953 913 return false; 954 914 } 955 915 ··· 1048 1008 1049 1009 pr_debug("kdamond (%d) starts\n", current->pid); 1050 1010 1051 - if (ctx->primitive.init) 1052 - ctx->primitive.init(ctx); 1011 + if (ctx->ops.init) 1012 + ctx->ops.init(ctx); 1053 1013 if (ctx->callback.before_start && ctx->callback.before_start(ctx)) 1054 1014 done = true; 1055 1015 ··· 1059 1019 if (kdamond_wait_activation(ctx)) 1060 1020 continue; 1061 1021 1062 - if (ctx->primitive.prepare_access_checks) 1063 - ctx->primitive.prepare_access_checks(ctx); 1022 + if (ctx->ops.prepare_access_checks) 1023 + ctx->ops.prepare_access_checks(ctx); 1064 1024 if (ctx->callback.after_sampling && 1065 1025 ctx->callback.after_sampling(ctx)) 1066 1026 done = true; 1067 1027 1068 1028 kdamond_usleep(ctx->sample_interval); 1069 1029 1070 - if (ctx->primitive.check_accesses) 1071 - max_nr_accesses = ctx->primitive.check_accesses(ctx); 1030 + if (ctx->ops.check_accesses) 1031 + max_nr_accesses = ctx->ops.check_accesses(ctx); 1072 1032 1073 1033 if (kdamond_aggregate_interval_passed(ctx)) { 1074 1034 kdamond_merge_regions(ctx, ··· 1080 1040 kdamond_apply_schemes(ctx); 1081 1041 kdamond_reset_aggregated(ctx); 1082 1042 kdamond_split_regions(ctx); 1083 - if (ctx->primitive.reset_aggregated) 
1084 - ctx->primitive.reset_aggregated(ctx); 1043 + if (ctx->ops.reset_aggregated) 1044 + ctx->ops.reset_aggregated(ctx); 1085 1045 } 1086 1046 1087 - if (kdamond_need_update_primitive(ctx)) { 1088 - if (ctx->primitive.update) 1089 - ctx->primitive.update(ctx); 1047 + if (kdamond_need_update_operations(ctx)) { 1048 + if (ctx->ops.update) 1049 + ctx->ops.update(ctx); 1090 1050 sz_limit = damon_region_sz_limit(ctx); 1091 1051 } 1092 1052 } ··· 1097 1057 1098 1058 if (ctx->callback.before_terminate) 1099 1059 ctx->callback.before_terminate(ctx); 1100 - if (ctx->primitive.cleanup) 1101 - ctx->primitive.cleanup(ctx); 1060 + if (ctx->ops.cleanup) 1061 + ctx->ops.cleanup(ctx); 1102 1062 1103 1063 pr_debug("kdamond (%d) finishes\n", current->pid); 1104 1064 mutex_lock(&ctx->kdamond_lock); ··· 1107 1067 1108 1068 mutex_lock(&damon_lock); 1109 1069 nr_running_ctxs--; 1070 + if (!nr_running_ctxs && running_exclusive_ctxs) 1071 + running_exclusive_ctxs = false; 1110 1072 mutex_unlock(&damon_lock); 1111 1073 1112 1074 return 0;
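Note on the `exclusive` argument: the admission rule that these hunks add to `damon_start()`, and the matching reset at the end of `kdamond_fn()`, can be modelled in user space. This is a minimal sketch under stated assumptions, not the kernel code: `model_start()` and `model_thread_exit()` are hypothetical names, the `damon_lock` mutex is left out, and every context in a group is assumed to start successfully.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the counters guarded by damon_lock. */
static int nr_running_ctxs;
static bool running_exclusive_ctxs;

/*
 * Model of the admission check in damon_start(): an exclusive group is
 * admitted only when nothing else is running, and a non-exclusive group is
 * refused only while an exclusive group is running.
 */
static int model_start(int nr_ctxs, bool exclusive)
{
	if ((exclusive && nr_running_ctxs) ||
	    (!exclusive && running_exclusive_ctxs))
		return -EBUSY;

	/* Assumption: every context in the group starts successfully. */
	nr_running_ctxs += nr_ctxs;
	if (exclusive && nr_running_ctxs)
		running_exclusive_ctxs = true;
	return 0;
}

/* Model of the tail of kdamond_fn(): one monitoring thread has finished. */
static void model_thread_exit(void)
{
	nr_running_ctxs--;
	if (!nr_running_ctxs && running_exclusive_ctxs)
		running_exclusive_ctxs = false;
}
```

Under this model, several non-exclusive groups can run side by side, while an exclusive group (as requested by the `damon_start(..., true)` callers in dbgfs.c and reclaim.c) excludes everything else until its last thread exits.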

+34 -51

mm/damon/dbgfs-test.h

··· 12 12 13 13 #include <kunit/test.h> 14 14 15 - static void damon_dbgfs_test_str_to_target_ids(struct kunit *test) 15 + static void damon_dbgfs_test_str_to_ints(struct kunit *test) 16 16 { 17 17 char *question; 18 - unsigned long *answers; 19 - unsigned long expected[] = {12, 35, 46}; 18 + int *answers; 19 + int expected[] = {12, 35, 46}; 20 20 ssize_t nr_integers = 0, i; 21 21 22 22 question = "123"; 23 - answers = str_to_target_ids(question, strlen(question), 24 - &nr_integers); 23 + answers = str_to_ints(question, strlen(question), &nr_integers); 25 24 KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers); 26 - KUNIT_EXPECT_EQ(test, 123ul, answers[0]); 25 + KUNIT_EXPECT_EQ(test, 123, answers[0]); 27 26 kfree(answers); 28 27 29 28 question = "123abc"; 30 - answers = str_to_target_ids(question, strlen(question), 31 - &nr_integers); 29 + answers = str_to_ints(question, strlen(question), &nr_integers); 32 30 KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers); 33 - KUNIT_EXPECT_EQ(test, 123ul, answers[0]); 31 + KUNIT_EXPECT_EQ(test, 123, answers[0]); 34 32 kfree(answers); 35 33 36 34 question = "a123"; 37 - answers = str_to_target_ids(question, strlen(question), 38 - &nr_integers); 35 + answers = str_to_ints(question, strlen(question), &nr_integers); 39 36 KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers); 40 37 kfree(answers); 41 38 42 39 question = "12 35"; 43 - answers = str_to_target_ids(question, strlen(question), 44 - &nr_integers); 40 + answers = str_to_ints(question, strlen(question), &nr_integers); 45 41 KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers); 46 42 for (i = 0; i < nr_integers; i++) 47 43 KUNIT_EXPECT_EQ(test, expected[i], answers[i]); 48 44 kfree(answers); 49 45 50 46 question = "12 35 46"; 51 - answers = str_to_target_ids(question, strlen(question), 52 - &nr_integers); 47 + answers = str_to_ints(question, strlen(question), &nr_integers); 53 48 KUNIT_EXPECT_EQ(test, (ssize_t)3, nr_integers); 54 49 for (i = 0; i < nr_integers; i++) 55 50 KUNIT_EXPECT_EQ(test, 
expected[i], answers[i]); 56 51 kfree(answers); 57 52 58 53 question = "12 35 abc 46"; 59 - answers = str_to_target_ids(question, strlen(question), 60 - &nr_integers); 54 + answers = str_to_ints(question, strlen(question), &nr_integers); 61 55 KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers); 62 56 for (i = 0; i < 2; i++) 63 57 KUNIT_EXPECT_EQ(test, expected[i], answers[i]); 64 58 kfree(answers); 65 59 66 60 question = ""; 67 - answers = str_to_target_ids(question, strlen(question), 68 - &nr_integers); 61 + answers = str_to_ints(question, strlen(question), &nr_integers); 69 62 KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers); 70 63 kfree(answers); 71 64 72 65 question = "\n"; 73 - answers = str_to_target_ids(question, strlen(question), 74 - &nr_integers); 66 + answers = str_to_ints(question, strlen(question), &nr_integers); 75 67 KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers); 76 68 kfree(answers); 77 69 } ··· 71 79 static void damon_dbgfs_test_set_targets(struct kunit *test) 72 80 { 73 81 struct damon_ctx *ctx = dbgfs_new_ctx(); 74 - unsigned long ids[] = {1, 2, 3}; 75 82 char buf[64]; 76 83 77 - /* Make DAMON consider target id as plain number */ 78 - ctx->primitive.target_valid = NULL; 79 - ctx->primitive.cleanup = NULL; 84 + /* Make DAMON consider target has no pid */ 85 + damon_select_ops(ctx, DAMON_OPS_PADDR); 80 86 81 - damon_set_targets(ctx, ids, 3); 82 - sprint_target_ids(ctx, buf, 64); 83 - KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2 3\n"); 84 - 85 - damon_set_targets(ctx, NULL, 0); 87 + dbgfs_set_targets(ctx, 0, NULL); 86 88 sprint_target_ids(ctx, buf, 64); 87 89 KUNIT_EXPECT_STREQ(test, (char *)buf, "\n"); 88 90 89 - damon_set_targets(ctx, (unsigned long []){1, 2}, 2); 91 + dbgfs_set_targets(ctx, 1, NULL); 90 92 sprint_target_ids(ctx, buf, 64); 91 - KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2\n"); 93 + KUNIT_EXPECT_STREQ(test, (char *)buf, "42\n"); 92 94 93 - damon_set_targets(ctx, (unsigned long []){2}, 1); 94 - sprint_target_ids(ctx, buf, 64); 95 - 
KUNIT_EXPECT_STREQ(test, (char *)buf, "2\n"); 96 - 97 - damon_set_targets(ctx, NULL, 0); 95 + dbgfs_set_targets(ctx, 0, NULL); 98 96 sprint_target_ids(ctx, buf, 64); 99 97 KUNIT_EXPECT_STREQ(test, (char *)buf, "\n"); 100 98 ··· 94 112 static void damon_dbgfs_test_set_init_regions(struct kunit *test) 95 113 { 96 114 struct damon_ctx *ctx = damon_new_ctx(); 97 - unsigned long ids[] = {1, 2, 3}; 98 - /* Each line represents one region in ``<target id> <start> <end>`` */ 99 - char * const valid_inputs[] = {"2 10 20\n 2 20 30\n2 35 45", 100 - "2 10 20\n", 101 - "2 10 20\n1 39 59\n1 70 134\n 2 20 25\n", 115 + /* Each line represents one region in ``<target idx> <start> <end>`` */ 116 + char * const valid_inputs[] = {"1 10 20\n 1 20 30\n1 35 45", 117 + "1 10 20\n", 118 + "1 10 20\n0 39 59\n0 70 134\n 1 20 25\n", 102 119 ""}; 103 120 /* Reading the file again will show sorted, clean output */ 104 - char * const valid_expects[] = {"2 10 20\n2 20 30\n2 35 45\n", 105 - "2 10 20\n", 106 - "1 39 59\n1 70 134\n2 10 20\n2 20 25\n", 121 + char * const valid_expects[] = {"1 10 20\n1 20 30\n1 35 45\n", 122 + "1 10 20\n", 123 + "0 39 59\n0 70 134\n1 10 20\n1 20 25\n", 107 124 ""}; 108 - char * const invalid_inputs[] = {"4 10 20\n", /* target not exists */ 109 - "2 10 20\n 2 14 26\n", /* regions overlap */ 110 - "1 10 20\n2 30 40\n 1 5 8"}; /* not sorted by address */ 125 + char * const invalid_inputs[] = {"3 10 20\n", /* target not exists */ 126 + "1 10 20\n 1 14 26\n", /* regions overlap */ 127 + "0 10 20\n1 30 40\n 0 5 8"}; /* not sorted by address */ 111 128 char *input, *expect; 112 129 int i, rc; 113 130 char buf[256]; 114 131 115 - damon_set_targets(ctx, ids, 3); 132 + damon_select_ops(ctx, DAMON_OPS_PADDR); 133 + 134 + dbgfs_set_targets(ctx, 3, NULL); 116 135 117 136 /* Put valid inputs and check the results */ 118 137 for (i = 0; i < ARRAY_SIZE(valid_inputs); i++) { ··· 141 158 KUNIT_EXPECT_STREQ(test, (char *)buf, ""); 142 159 } 143 160 144 - damon_set_targets(ctx, NULL, 0); 
161 + dbgfs_set_targets(ctx, 0, NULL); 145 162 damon_destroy_ctx(ctx); 146 163 } 147 164 148 165 static struct kunit_case damon_test_cases[] = { 149 - KUNIT_CASE(damon_dbgfs_test_str_to_target_ids), 166 + KUNIT_CASE(damon_dbgfs_test_str_to_ints), 150 167 KUNIT_CASE(damon_dbgfs_test_set_targets), 151 168 KUNIT_CASE(damon_dbgfs_test_set_init_regions), 152 169 {},
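Note on the `str_to_ints` expectations: the cases exercised above ("123abc" yields one integer, "a123" yields none, "12 35 abc 46" yields two) follow from a `sscanf()` loop that stops at the first token that is not an integer. The following user-space sketch reproduces that behaviour. It is an illustration, not the kernel function: `sketch_str_to_ints()` is a hypothetical name, `malloc()` stands in for `kmalloc_array()`, and, unlike the kernel version, it checks the `sscanf()` result before advancing the position.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * User-space sketch of the parsing loop.  The cap of 32 integers follows the
 * diff.  Returns a caller-freed array, or NULL if the allocation fails.
 */
static int *sketch_str_to_ints(const char *str, ssize_t len, ssize_t *nr_ints)
{
	const int max_nr_ints = 32;
	int *array;
	int nr, parsed;
	ssize_t pos = 0;

	*nr_ints = 0;
	array = malloc(max_nr_ints * sizeof(*array));
	if (!array)
		return NULL;

	while (*nr_ints < max_nr_ints && pos < len) {
		/* %n stores how many characters this call consumed. */
		if (sscanf(&str[pos], "%d%n", &nr, &parsed) != 1)
			break;
		pos += parsed;
		array[(*nr_ints)++] = nr;
	}
	return array;
}
```

As in the test above, the caller frees the returned array even when no integer was parsed.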

+144 -78

mm/damon/dbgfs.c

··· 56 56 mutex_lock(&ctx->kdamond_lock); 57 57 ret = scnprintf(kbuf, ARRAY_SIZE(kbuf), "%lu %lu %lu %lu %lu\n", 58 58 ctx->sample_interval, ctx->aggr_interval, 59 - ctx->primitive_update_interval, ctx->min_nr_regions, 59 + ctx->ops_update_interval, ctx->min_nr_regions, 60 60 ctx->max_nr_regions); 61 61 mutex_unlock(&ctx->kdamond_lock); 62 62 ··· 275 275 return ret; 276 276 } 277 277 278 - static inline bool targetid_is_pid(const struct damon_ctx *ctx) 278 + static inline bool target_has_pid(const struct damon_ctx *ctx) 279 279 { 280 - return ctx->primitive.target_valid == damon_va_target_valid; 280 + return ctx->ops.id == DAMON_OPS_VADDR; 281 281 } 282 282 283 283 static ssize_t sprint_target_ids(struct damon_ctx *ctx, char *buf, ssize_t len) 284 284 { 285 285 struct damon_target *t; 286 - unsigned long id; 286 + int id; 287 287 int written = 0; 288 288 int rc; 289 289 290 290 damon_for_each_target(t, ctx) { 291 - id = t->id; 292 - if (targetid_is_pid(ctx)) 291 + if (target_has_pid(ctx)) 293 292 /* Show pid numbers to debugfs users */ 294 - id = (unsigned long)pid_vnr((struct pid *)id); 293 + id = pid_vnr(t->pid); 294 + else 295 + /* Show 42 for physical address space, just for fun */ 296 + id = 42; 295 297 296 - rc = scnprintf(&buf[written], len - written, "%lu ", id); 298 + rc = scnprintf(&buf[written], len - written, "%d ", id); 297 299 if (!rc) 298 300 return -ENOMEM; 299 301 written += rc; ··· 323 321 } 324 322 325 323 /* 326 - * Converts a string into an array of unsigned long integers 324 + * Converts a string into an integers array 327 325 * 328 - * Returns an array of unsigned long integers if the conversion success, or 329 - * NULL otherwise. 326 + * Returns an array of integers array if the conversion success, or NULL 327 + * otherwise. 
330 328 */ 331 - static unsigned long *str_to_target_ids(const char *str, ssize_t len, 332 - ssize_t *nr_ids) 329 + static int *str_to_ints(const char *str, ssize_t len, ssize_t *nr_ints) 333 330 { 334 - unsigned long *ids; 335 - const int max_nr_ids = 32; 336 - unsigned long id; 331 + int *array; 332 + const int max_nr_ints = 32; 333 + int nr; 337 334 int pos = 0, parsed, ret; 338 335 339 - *nr_ids = 0; 340 - ids = kmalloc_array(max_nr_ids, sizeof(id), GFP_KERNEL); 341 - if (!ids) 336 + *nr_ints = 0; 337 + array = kmalloc_array(max_nr_ints, sizeof(*array), GFP_KERNEL); 338 + if (!array) 342 339 return NULL; 343 - while (*nr_ids < max_nr_ids && pos < len) { 344 - ret = sscanf(&str[pos], "%lu%n", &id, &parsed); 340 + while (*nr_ints < max_nr_ints && pos < len) { 341 + ret = sscanf(&str[pos], "%d%n", &nr, &parsed); 345 342 pos += parsed; 346 343 if (ret != 1) 347 344 break; 348 - ids[*nr_ids] = id; 349 - *nr_ids += 1; 345 + array[*nr_ints] = nr; 346 + *nr_ints += 1; 350 347 } 351 348 352 - return ids; 349 + return array; 353 350 } 354 351 355 - static void dbgfs_put_pids(unsigned long *ids, int nr_ids) 352 + static void dbgfs_put_pids(struct pid **pids, int nr_pids) 356 353 { 357 354 int i; 358 355 359 - for (i = 0; i < nr_ids; i++) 360 - put_pid((struct pid *)ids[i]); 356 + for (i = 0; i < nr_pids; i++) 357 + put_pid(pids[i]); 358 + } 359 + 360 + /* 361 + * Converts a string into an struct pid pointers array 362 + * 363 + * Returns an array of struct pid pointers if the conversion success, or NULL 364 + * otherwise. 
365 + */ 366 + static struct pid **str_to_pids(const char *str, ssize_t len, ssize_t *nr_pids) 367 + { 368 + int *ints; 369 + ssize_t nr_ints; 370 + struct pid **pids; 371 + 372 + *nr_pids = 0; 373 + 374 + ints = str_to_ints(str, len, &nr_ints); 375 + if (!ints) 376 + return NULL; 377 + 378 + pids = kmalloc_array(nr_ints, sizeof(*pids), GFP_KERNEL); 379 + if (!pids) 380 + goto out; 381 + 382 + for (; *nr_pids < nr_ints; (*nr_pids)++) { 383 + pids[*nr_pids] = find_get_pid(ints[*nr_pids]); 384 + if (!pids[*nr_pids]) { 385 + dbgfs_put_pids(pids, *nr_pids); 386 + kfree(ints); 387 + kfree(pids); 388 + return NULL; 389 + } 390 + } 391 + 392 + out: 393 + kfree(ints); 394 + return pids; 395 + } 396 + 397 + /* 398 + * dbgfs_set_targets() - Set monitoring targets. 399 + * @ctx: monitoring context 400 + * @nr_targets: number of targets 401 + * @pids: array of target pids (size is same to @nr_targets) 402 + * 403 + * This function should not be called while the kdamond is running. @pids is 404 + * ignored if the context is not configured to have pid in each target. On 405 + * failure, reference counts of all pids in @pids are decremented. 406 + * 407 + * Return: 0 on success, negative error code otherwise. 
408 + */ 409 + static int dbgfs_set_targets(struct damon_ctx *ctx, ssize_t nr_targets, 410 + struct pid **pids) 411 + { 412 + ssize_t i; 413 + struct damon_target *t, *next; 414 + 415 + damon_for_each_target_safe(t, next, ctx) { 416 + if (target_has_pid(ctx)) 417 + put_pid(t->pid); 418 + damon_destroy_target(t); 419 + } 420 + 421 + for (i = 0; i < nr_targets; i++) { 422 + t = damon_new_target(); 423 + if (!t) { 424 + damon_for_each_target_safe(t, next, ctx) 425 + damon_destroy_target(t); 426 + if (target_has_pid(ctx)) 427 + dbgfs_put_pids(pids, nr_targets); 428 + return -ENOMEM; 429 + } 430 + if (target_has_pid(ctx)) 431 + t->pid = pids[i]; 432 + damon_add_target(ctx, t); 433 + } 434 + 435 + return 0; 361 436 } 362 437 363 438 static ssize_t dbgfs_target_ids_write(struct file *file, 364 439 const char __user *buf, size_t count, loff_t *ppos) 365 440 { 366 441 struct damon_ctx *ctx = file->private_data; 367 - struct damon_target *t, *next_t; 368 442 bool id_is_pid = true; 369 443 char *kbuf; 370 - unsigned long *targets; 444 + struct pid **target_pids = NULL; 371 445 ssize_t nr_targets; 372 446 ssize_t ret; 373 - int i; 374 447 375 448 kbuf = user_input_str(buf, count, ppos); 376 449 if (IS_ERR(kbuf)) ··· 453 376 454 377 if (!strncmp(kbuf, "paddr\n", count)) { 455 378 id_is_pid = false; 456 - /* target id is meaningless here, but we set it just for fun */ 457 - scnprintf(kbuf, count, "42 "); 458 - } 459 - 460 - targets = str_to_target_ids(kbuf, count, &nr_targets); 461 - if (!targets) { 462 - ret = -ENOMEM; 463 - goto out; 379 + nr_targets = 1; 464 380 } 465 381 466 382 if (id_is_pid) { 467 - for (i = 0; i < nr_targets; i++) { 468 - targets[i] = (unsigned long)find_get_pid( 469 - (int)targets[i]); 470 - if (!targets[i]) { 471 - dbgfs_put_pids(targets, i); 472 - ret = -EINVAL; 473 - goto free_targets_out; 474 - } 383 + target_pids = str_to_pids(kbuf, count, &nr_targets); 384 + if (!target_pids) { 385 + ret = -ENOMEM; 386 + goto out; 475 387 } 476 388 } 477 389 478 
390 mutex_lock(&ctx->kdamond_lock); 479 391 if (ctx->kdamond) { 480 392 if (id_is_pid) 481 - dbgfs_put_pids(targets, nr_targets); 393 + dbgfs_put_pids(target_pids, nr_targets); 482 394 ret = -EBUSY; 483 395 goto unlock_out; 484 396 } 485 397 486 398 /* remove previously set targets */ 487 - damon_for_each_target_safe(t, next_t, ctx) { 488 - if (targetid_is_pid(ctx)) 489 - put_pid((struct pid *)t->id); 490 - damon_destroy_target(t); 399 + dbgfs_set_targets(ctx, 0, NULL); 400 + if (!nr_targets) { 401 + ret = count; 402 + goto unlock_out; 491 403 } 492 404 493 405 /* Configure the context for the address space type */ 494 406 if (id_is_pid) 495 - damon_va_set_primitives(ctx); 407 + ret = damon_select_ops(ctx, DAMON_OPS_VADDR); 496 408 else 497 - damon_pa_set_primitives(ctx); 409 + ret = damon_select_ops(ctx, DAMON_OPS_PADDR); 410 + if (ret) 411 + goto unlock_out; 498 412 499 - ret = damon_set_targets(ctx, targets, nr_targets); 500 - if (ret) { 501 - if (id_is_pid) 502 - dbgfs_put_pids(targets, nr_targets); 503 - } else { 413 + ret = dbgfs_set_targets(ctx, nr_targets, target_pids); 414 + if (!ret) 504 415 ret = count; 505 - } 506 416 507 417 unlock_out: 508 418 mutex_unlock(&ctx->kdamond_lock); 509 - free_targets_out: 510 - kfree(targets); 419 + kfree(target_pids); 511 420 out: 512 421 kfree(kbuf); 513 422 return ret; ··· 503 440 { 504 441 struct damon_target *t; 505 442 struct damon_region *r; 443 + int target_idx = 0; 506 444 int written = 0; 507 445 int rc; 508 446 509 447 damon_for_each_target(t, c) { 510 448 damon_for_each_region(r, t) { 511 449 rc = scnprintf(&buf[written], len - written, 512 - "%lu %lu %lu\n", 513 - t->id, r->ar.start, r->ar.end); 450 + "%d %lu %lu\n", 451 + target_idx, r->ar.start, r->ar.end); 514 452 if (!rc) 515 453 return -ENOMEM; 516 454 written += rc; 517 455 } 456 + target_idx++; 518 457 } 519 458 return written; 520 459 } ··· 550 485 return len; 551 486 } 552 487 553 - static int add_init_region(struct damon_ctx *c, 554 - unsigned long 
target_id, struct damon_addr_range *ar) 488 + static int add_init_region(struct damon_ctx *c, int target_idx, 489 + struct damon_addr_range *ar) 555 490 { 556 491 struct damon_target *t; 557 492 struct damon_region *r, *prev; 558 - unsigned long id; 493 + unsigned long idx = 0; 559 494 int rc = -EINVAL; 560 495 561 496 if (ar->start >= ar->end) 562 497 return -EINVAL; 563 498 564 499 damon_for_each_target(t, c) { 565 - id = t->id; 566 - if (targetid_is_pid(c)) 567 - id = (unsigned long)pid_vnr((struct pid *)id); 568 - if (id == target_id) { 500 + if (idx++ == target_idx) { 569 501 r = damon_new_region(ar->start, ar->end); 570 502 if (!r) 571 503 return -ENOMEM; ··· 585 523 struct damon_target *t; 586 524 struct damon_region *r, *next; 587 525 int pos = 0, parsed, ret; 588 - unsigned long target_id; 526 + int target_idx; 589 527 struct damon_addr_range ar; 590 528 int err; 591 529 ··· 595 533 } 596 534 597 535 while (pos < len) { 598 - ret = sscanf(&str[pos], "%lu %lu %lu%n", 599 - &target_id, &ar.start, &ar.end, &parsed); 536 + ret = sscanf(&str[pos], "%d %lu %lu%n", 537 + &target_idx, &ar.start, &ar.end, &parsed); 600 538 if (ret != 3) 601 539 break; 602 - err = add_init_region(c, target_id, &ar); 540 + err = add_init_region(c, target_idx, &ar); 603 541 if (err) 604 542 goto fail; 605 543 pos += parsed; ··· 722 660 { 723 661 struct damon_target *t, *next; 724 662 725 - if (!targetid_is_pid(ctx)) 663 + if (!target_has_pid(ctx)) 726 664 return; 727 665 728 666 mutex_lock(&ctx->kdamond_lock); 729 667 damon_for_each_target_safe(t, next, ctx) { 730 - put_pid((struct pid *)t->id); 668 + put_pid(t->pid); 731 669 damon_destroy_target(t); 732 670 } 733 671 mutex_unlock(&ctx->kdamond_lock); ··· 741 679 if (!ctx) 742 680 return NULL; 743 681 744 - damon_va_set_primitives(ctx); 682 + if (damon_select_ops(ctx, DAMON_OPS_VADDR) && 683 + damon_select_ops(ctx, DAMON_OPS_PADDR)) { 684 + damon_destroy_ctx(ctx); 685 + return NULL; 686 + } 745 687 ctx->callback.before_terminate = 
dbgfs_before_terminate; 746 688 return ctx; 747 689 } ··· 967 901 return -EINVAL; 968 902 } 969 903 } 970 - ret = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs); 904 + ret = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs, true); 971 905 } else if (!strncmp(kbuf, "off", count)) { 972 906 ret = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs); 973 907 } else {
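Note on target indices: with `t->id` gone, `add_init_region()` finds a target by its position in the context's target list rather than by a stored id. A minimal sketch of that lookup follows, assuming a plain singly linked list in place of the kernel's list helpers; `struct target` and `sketch_target_at()` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for struct damon_target: only the list linkage. */
struct target {
	struct target *next;
};

/* Return the idx-th target of the list (0-based), or NULL if out of range. */
static struct target *sketch_target_at(struct target *head, int idx)
{
	int i = 0;
	struct target *t;

	for (t = head; t; t = t->next) {
		if (i++ == idx)
			return t;
	}
	return NULL;
}
```

With three targets, index 3 has no match, which corresponds to the "3 10 20" invalid input in the updated kunit test.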

+19 -17

mm/damon/paddr.c

···
 #include <linux/swap.h>

 #include "../internal.h"
-#include "prmtv-common.h"
+#include "ops-common.h"

 static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma,
 		unsigned long addr, void *arg)
···
 	return max_nr_accesses;
 }

-bool damon_pa_target_valid(void *t)
-{
-	return true;
-}
-
 static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme)
···
 	return DAMOS_MAX_SCORE;
 }

-void damon_pa_set_primitives(struct damon_ctx *ctx)
+static int __init damon_pa_initcall(void)
 {
-	ctx->primitive.init = NULL;
-	ctx->primitive.update = NULL;
-	ctx->primitive.prepare_access_checks = damon_pa_prepare_access_checks;
-	ctx->primitive.check_accesses = damon_pa_check_accesses;
-	ctx->primitive.reset_aggregated = NULL;
-	ctx->primitive.target_valid = damon_pa_target_valid;
-	ctx->primitive.cleanup = NULL;
-	ctx->primitive.apply_scheme = damon_pa_apply_scheme;
-	ctx->primitive.get_scheme_score = damon_pa_scheme_score;
-}
+	struct damon_operations ops = {
+		.id = DAMON_OPS_PADDR,
+		.init = NULL,
+		.update = NULL,
+		.prepare_access_checks = damon_pa_prepare_access_checks,
+		.check_accesses = damon_pa_check_accesses,
+		.reset_aggregated = NULL,
+		.target_valid = NULL,
+		.cleanup = NULL,
+		.apply_scheme = damon_pa_apply_scheme,
+		.get_scheme_score = damon_pa_scheme_score,
+	};
+
+	return damon_register_ops(&ops);
+};
+
+subsys_initcall(damon_pa_initcall);
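Note on the register/select split: the physical address space operations are now registered once at init time under `DAMON_OPS_PADDR`, and users pick them per context with `damon_select_ops()`. The bodies of `damon_register_ops()` and `damon_select_ops()` are not part of this section, so the following is only a user-space sketch of how such a registry could work. All names and types in it are hypothetical, and the choice of `-EINVAL` for a double registration is an assumption. Because the initcall passes a stack-local struct, the sketch copies the operations into the registry instead of keeping the pointer.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
enum ops_id { OPS_VADDR, OPS_PADDR, NR_OPS };

struct ops {
	enum ops_id id;
	unsigned int (*check_accesses)(void *ctx);
};

struct ctx {
	struct ops ops;
};

/* Hypothetical registry: one slot per id, marked free with id == NR_OPS. */
static struct ops registered_ops[NR_OPS] = {
	[OPS_VADDR] = { .id = NR_OPS },
	[OPS_PADDR] = { .id = NR_OPS },
};

static int sketch_register_ops(const struct ops *ops)
{
	if (ops->id >= NR_OPS)
		return -EINVAL;
	if (registered_ops[ops->id].id != NR_OPS)
		return -EINVAL;	/* assumption: double registration is an error */
	registered_ops[ops->id] = *ops;	/* copy, the caller's struct may go away */
	return 0;
}

static int sketch_select_ops(struct ctx *ctx, enum ops_id id)
{
	if (id >= NR_OPS || registered_ops[id].id == NR_OPS)
		return -EINVAL;	/* unknown or not yet registered */
	ctx->ops = registered_ops[id];
	return 0;
}
```

A design like this lets callers detect an operations set that is not built in, which is consistent with `damon_select_ops()` returning an error code that dbgfs.c and reclaim.c now check.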

+1 -1

mm/damon/prmtv-common.c → mm/damon/ops-common.c

···
 #include <linux/pagemap.h>
 #include <linux/rmap.h>

-#include "prmtv-common.h"
+#include "ops-common.h"

 /*
  * Get an online page for a pfn if it's in the LRU list. Otherwise, returns

mm/damon/prmtv-common.h → mm/damon/ops-common.h

+5 -4

mm/damon/reclaim.c

···
 	if (err)
 		goto free_scheme_out;

-	err = damon_start(&ctx, 1);
+	err = damon_start(&ctx, 1, true);
 	if (!err) {
 		kdamond_pid = ctx->kdamond->pid;
 		return 0;
···
 	if (!ctx)
 		return -ENOMEM;

-	damon_pa_set_primitives(ctx);
+	if (damon_select_ops(ctx, DAMON_OPS_PADDR))
+		return -EINVAL;
+
 	ctx->callback.after_aggregation = damon_reclaim_after_aggregation;

-	/* 4242 means nothing but fun */
-	target = damon_new_target(4242);
+	target = damon_new_target();
 	if (!target) {
 		damon_destroy_ctx(ctx);
 		return -ENOMEM;

+2596

mm/damon/sysfs.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * DAMON sysfs Interface 4 + * 5 + * Copyright (c) 2022 SeongJae Park <sj@kernel.org> 6 + */ 7 + 8 + #include <linux/damon.h> 9 + #include <linux/kobject.h> 10 + #include <linux/pid.h> 11 + #include <linux/sched.h> 12 + #include <linux/slab.h> 13 + 14 + static DEFINE_MUTEX(damon_sysfs_lock); 15 + 16 + /* 17 + * unsigned long range directory 18 + */ 19 + 20 + struct damon_sysfs_ul_range { 21 + struct kobject kobj; 22 + unsigned long min; 23 + unsigned long max; 24 + }; 25 + 26 + static struct damon_sysfs_ul_range *damon_sysfs_ul_range_alloc( 27 + unsigned long min, 28 + unsigned long max) 29 + { 30 + struct damon_sysfs_ul_range *range = kmalloc(sizeof(*range), 31 + GFP_KERNEL); 32 + 33 + if (!range) 34 + return NULL; 35 + range->kobj = (struct kobject){}; 36 + range->min = min; 37 + range->max = max; 38 + 39 + return range; 40 + } 41 + 42 + static ssize_t min_show(struct kobject *kobj, struct kobj_attribute *attr, 43 + char *buf) 44 + { 45 + struct damon_sysfs_ul_range *range = container_of(kobj, 46 + struct damon_sysfs_ul_range, kobj); 47 + 48 + return sysfs_emit(buf, "%lu\n", range->min); 49 + } 50 + 51 + static ssize_t min_store(struct kobject *kobj, struct kobj_attribute *attr, 52 + const char *buf, size_t count) 53 + { 54 + struct damon_sysfs_ul_range *range = container_of(kobj, 55 + struct damon_sysfs_ul_range, kobj); 56 + unsigned long min; 57 + int err; 58 + 59 + err = kstrtoul(buf, 0, &min); 60 + if (err) 61 + return -EINVAL; 62 + 63 + range->min = min; 64 + return count; 65 + } 66 + 67 + static ssize_t max_show(struct kobject *kobj, struct kobj_attribute *attr, 68 + char *buf) 69 + { 70 + struct damon_sysfs_ul_range *range = container_of(kobj, 71 + struct damon_sysfs_ul_range, kobj); 72 + 73 + return sysfs_emit(buf, "%lu\n", range->max); 74 + } 75 + 76 + static ssize_t max_store(struct kobject *kobj, struct kobj_attribute *attr, 77 + const char *buf, size_t count) 78 + { 79 + struct damon_sysfs_ul_range 
*range = container_of(kobj, 80 + struct damon_sysfs_ul_range, kobj); 81 + unsigned long max; 82 + int err; 83 + 84 + err = kstrtoul(buf, 0, &max); 85 + if (err) 86 + return -EINVAL; 87 + 88 + range->max = max; 89 + return count; 90 + } 91 + 92 + static void damon_sysfs_ul_range_release(struct kobject *kobj) 93 + { 94 + kfree(container_of(kobj, struct damon_sysfs_ul_range, kobj)); 95 + } 96 + 97 + static struct kobj_attribute damon_sysfs_ul_range_min_attr = 98 + __ATTR_RW_MODE(min, 0600); 99 + 100 + static struct kobj_attribute damon_sysfs_ul_range_max_attr = 101 + __ATTR_RW_MODE(max, 0600); 102 + 103 + static struct attribute *damon_sysfs_ul_range_attrs[] = { 104 + &damon_sysfs_ul_range_min_attr.attr, 105 + &damon_sysfs_ul_range_max_attr.attr, 106 + NULL, 107 + }; 108 + ATTRIBUTE_GROUPS(damon_sysfs_ul_range); 109 + 110 + static struct kobj_type damon_sysfs_ul_range_ktype = { 111 + .release = damon_sysfs_ul_range_release, 112 + .sysfs_ops = &kobj_sysfs_ops, 113 + .default_groups = damon_sysfs_ul_range_groups, 114 + }; 115 + 116 + /* 117 + * schemes/stats directory 118 + */ 119 + 120 + struct damon_sysfs_stats { 121 + struct kobject kobj; 122 + unsigned long nr_tried; 123 + unsigned long sz_tried; 124 + unsigned long nr_applied; 125 + unsigned long sz_applied; 126 + unsigned long qt_exceeds; 127 + }; 128 + 129 + static struct damon_sysfs_stats *damon_sysfs_stats_alloc(void) 130 + { 131 + return kzalloc(sizeof(struct damon_sysfs_stats), GFP_KERNEL); 132 + } 133 + 134 + static ssize_t nr_tried_show(struct kobject *kobj, struct kobj_attribute *attr, 135 + char *buf) 136 + { 137 + struct damon_sysfs_stats *stats = container_of(kobj, 138 + struct damon_sysfs_stats, kobj); 139 + 140 + return sysfs_emit(buf, "%lu\n", stats->nr_tried); 141 + } 142 + 143 + static ssize_t sz_tried_show(struct kobject *kobj, struct kobj_attribute *attr, 144 + char *buf) 145 + { 146 + struct damon_sysfs_stats *stats = container_of(kobj, 147 + struct damon_sysfs_stats, kobj); 148 + 149 + return 
sysfs_emit(buf, "%lu\n", stats->sz_tried);
+}
+
+static ssize_t nr_applied_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_stats *stats = container_of(kobj,
+			struct damon_sysfs_stats, kobj);
+
+	return sysfs_emit(buf, "%lu\n", stats->nr_applied);
+}
+
+static ssize_t sz_applied_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_stats *stats = container_of(kobj,
+			struct damon_sysfs_stats, kobj);
+
+	return sysfs_emit(buf, "%lu\n", stats->sz_applied);
+}
+
+static ssize_t qt_exceeds_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_stats *stats = container_of(kobj,
+			struct damon_sysfs_stats, kobj);
+
+	return sysfs_emit(buf, "%lu\n", stats->qt_exceeds);
+}
+
+static void damon_sysfs_stats_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_stats, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_stats_nr_tried_attr =
+		__ATTR_RO_MODE(nr_tried, 0400);
+
+static struct kobj_attribute damon_sysfs_stats_sz_tried_attr =
+		__ATTR_RO_MODE(sz_tried, 0400);
+
+static struct kobj_attribute damon_sysfs_stats_nr_applied_attr =
+		__ATTR_RO_MODE(nr_applied, 0400);
+
+static struct kobj_attribute damon_sysfs_stats_sz_applied_attr =
+		__ATTR_RO_MODE(sz_applied, 0400);
+
+static struct kobj_attribute damon_sysfs_stats_qt_exceeds_attr =
+		__ATTR_RO_MODE(qt_exceeds, 0400);
+
+static struct attribute *damon_sysfs_stats_attrs[] = {
+	&damon_sysfs_stats_nr_tried_attr.attr,
+	&damon_sysfs_stats_sz_tried_attr.attr,
+	&damon_sysfs_stats_nr_applied_attr.attr,
+	&damon_sysfs_stats_sz_applied_attr.attr,
+	&damon_sysfs_stats_qt_exceeds_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_stats);
+
+static struct kobj_type damon_sysfs_stats_ktype = {
+	.release = damon_sysfs_stats_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_stats_groups,
+};
+
+/*
+ * watermarks directory
+ */
+
+struct damon_sysfs_watermarks {
+	struct kobject kobj;
+	enum damos_wmark_metric metric;
+	unsigned long interval_us;
+	unsigned long high;
+	unsigned long mid;
+	unsigned long low;
+};
+
+static struct damon_sysfs_watermarks *damon_sysfs_watermarks_alloc(
+		enum damos_wmark_metric metric, unsigned long interval_us,
+		unsigned long high, unsigned long mid, unsigned long low)
+{
+	struct damon_sysfs_watermarks *watermarks = kmalloc(
+			sizeof(*watermarks), GFP_KERNEL);
+
+	if (!watermarks)
+		return NULL;
+	watermarks->kobj = (struct kobject){};
+	watermarks->metric = metric;
+	watermarks->interval_us = interval_us;
+	watermarks->high = high;
+	watermarks->mid = mid;
+	watermarks->low = low;
+	return watermarks;
+}
+
+/* Should match with enum damos_wmark_metric */
+static const char * const damon_sysfs_wmark_metric_strs[] = {
+	"none",
+	"free_mem_rate",
+};
+
+static ssize_t metric_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+
+	return sysfs_emit(buf, "%s\n",
+			damon_sysfs_wmark_metric_strs[watermarks->metric]);
+}
+
+static ssize_t metric_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+	enum damos_wmark_metric metric;
+
+	for (metric = 0; metric < NR_DAMOS_WMARK_METRICS; metric++) {
+		if (sysfs_streq(buf, damon_sysfs_wmark_metric_strs[metric])) {
+			watermarks->metric = metric;
+			return count;
+		}
+	}
+	return -EINVAL;
+}
+
+static ssize_t interval_us_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+
+	return sysfs_emit(buf, "%lu\n", watermarks->interval_us);
+}
+
+static ssize_t interval_us_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+	int err = kstrtoul(buf, 0, &watermarks->interval_us);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t high_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+
+	return sysfs_emit(buf, "%lu\n", watermarks->high);
+}
+
+static ssize_t high_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+	int err = kstrtoul(buf, 0, &watermarks->high);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t mid_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+
+	return sysfs_emit(buf, "%lu\n", watermarks->mid);
+}
+
+static ssize_t mid_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+	int err = kstrtoul(buf, 0, &watermarks->mid);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t low_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+
+	return sysfs_emit(buf, "%lu\n", watermarks->low);
+}
+
+static ssize_t low_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_watermarks *watermarks = container_of(kobj,
+			struct damon_sysfs_watermarks, kobj);
+	int err = kstrtoul(buf, 0, &watermarks->low);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static void damon_sysfs_watermarks_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_watermarks, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_watermarks_metric_attr =
+		__ATTR_RW_MODE(metric, 0600);
+
+static struct kobj_attribute damon_sysfs_watermarks_interval_us_attr =
+		__ATTR_RW_MODE(interval_us, 0600);
+
+static struct kobj_attribute damon_sysfs_watermarks_high_attr =
+		__ATTR_RW_MODE(high, 0600);
+
+static struct kobj_attribute damon_sysfs_watermarks_mid_attr =
+		__ATTR_RW_MODE(mid, 0600);
+
+static struct kobj_attribute damon_sysfs_watermarks_low_attr =
+		__ATTR_RW_MODE(low, 0600);
+
+static struct attribute *damon_sysfs_watermarks_attrs[] = {
+	&damon_sysfs_watermarks_metric_attr.attr,
+	&damon_sysfs_watermarks_interval_us_attr.attr,
+	&damon_sysfs_watermarks_high_attr.attr,
+	&damon_sysfs_watermarks_mid_attr.attr,
+	&damon_sysfs_watermarks_low_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_watermarks);
+
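The metric_store() handler above resolves a written keyword by scanning a string table whose order mirrors the enum, and it compares with sysfs_streq() so that the trailing newline from a shell `echo` is tolerated. What follows is a minimal userspace sketch of that lookup, not the kernel code. streq_nl() is a hypothetical stand-in for the kernel-only sysfs_streq(), and the enum and wmark_metric_lookup() names are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two metric keywords in the table above. */
enum wmark_metric { WMARK_NONE, WMARK_FREE_MEM_RATE, NR_WMARK_METRICS };

/* Order must match enum wmark_metric, as the kernel comment requires. */
static const char * const wmark_metric_strs[] = {
	"none",
	"free_mem_rate",
};

/*
 * Userspace stand-in for sysfs_streq(): identical strings match, and a
 * single trailing newline on either side is ignored.
 */
static bool streq_nl(const char *s1, const char *s2)
{
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}
	if (*s1 == *s2)
		return true;
	if (!*s1 && *s2 == '\n' && !s2[1])
		return true;
	if (*s1 == '\n' && !s1[1] && !*s2)
		return true;
	return false;
}

/* Returns the metric index, or -1 where the handler would return -EINVAL. */
static int wmark_metric_lookup(const char *buf)
{
	int metric;

	for (metric = 0; metric < NR_WMARK_METRICS; metric++) {
		if (streq_nl(buf, wmark_metric_strs[metric]))
			return metric;
	}
	return -1;
}
```

Because the table index doubles as the enum value, adding a keyword in this scheme means extending the enum and the table together, which is what the "Should match with" comment in the patch guards against.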
+static struct kobj_type damon_sysfs_watermarks_ktype = {
+	.release = damon_sysfs_watermarks_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_watermarks_groups,
+};
+
+/*
+ * scheme/weights directory
+ */
+
+struct damon_sysfs_weights {
+	struct kobject kobj;
+	unsigned int sz;
+	unsigned int nr_accesses;
+	unsigned int age;
+};
+
+static struct damon_sysfs_weights *damon_sysfs_weights_alloc(unsigned int sz,
+		unsigned int nr_accesses, unsigned int age)
+{
+	struct damon_sysfs_weights *weights = kmalloc(sizeof(*weights),
+			GFP_KERNEL);
+
+	if (!weights)
+		return NULL;
+	weights->kobj = (struct kobject){};
+	weights->sz = sz;
+	weights->nr_accesses = nr_accesses;
+	weights->age = age;
+	return weights;
+}
+
+static ssize_t sz_permil_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_weights *weights = container_of(kobj,
+			struct damon_sysfs_weights, kobj);
+
+	return sysfs_emit(buf, "%u\n", weights->sz);
+}
+
+static ssize_t sz_permil_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_weights *weights = container_of(kobj,
+			struct damon_sysfs_weights, kobj);
+	int err = kstrtouint(buf, 0, &weights->sz);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t nr_accesses_permil_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_weights *weights = container_of(kobj,
+			struct damon_sysfs_weights, kobj);
+
+	return sysfs_emit(buf, "%u\n", weights->nr_accesses);
+}
+
+static ssize_t nr_accesses_permil_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_weights *weights = container_of(kobj,
+			struct damon_sysfs_weights, kobj);
+	int err = kstrtouint(buf, 0, &weights->nr_accesses);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t age_permil_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_weights *weights = container_of(kobj,
+			struct damon_sysfs_weights, kobj);
+
+	return sysfs_emit(buf, "%u\n", weights->age);
+}
+
+static ssize_t age_permil_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_weights *weights = container_of(kobj,
+			struct damon_sysfs_weights, kobj);
+	int err = kstrtouint(buf, 0, &weights->age);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static void damon_sysfs_weights_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_weights, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_weights_sz_attr =
+		__ATTR_RW_MODE(sz_permil, 0600);
+
+static struct kobj_attribute damon_sysfs_weights_nr_accesses_attr =
+		__ATTR_RW_MODE(nr_accesses_permil, 0600);
+
+static struct kobj_attribute damon_sysfs_weights_age_attr =
+		__ATTR_RW_MODE(age_permil, 0600);
+
+static struct attribute *damon_sysfs_weights_attrs[] = {
+	&damon_sysfs_weights_sz_attr.attr,
+	&damon_sysfs_weights_nr_accesses_attr.attr,
+	&damon_sysfs_weights_age_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_weights);
+
+static struct kobj_type damon_sysfs_weights_ktype = {
+	.release = damon_sysfs_weights_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_weights_groups,
+};
+
+/*
+ * quotas directory
+ */
+
+struct damon_sysfs_quotas {
+	struct kobject kobj;
+	struct damon_sysfs_weights *weights;
+	unsigned long ms;
+	unsigned long sz;
+	unsigned long reset_interval_ms;
+};
+
+static struct damon_sysfs_quotas *damon_sysfs_quotas_alloc(void)
+{
+	return kzalloc(sizeof(struct damon_sysfs_quotas), GFP_KERNEL);
+}
+
+static int damon_sysfs_quotas_add_dirs(struct damon_sysfs_quotas *quotas)
+{
+	struct damon_sysfs_weights *weights;
+	int err;
+
+	weights = damon_sysfs_weights_alloc(0, 0, 0);
+	if (!weights)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(&weights->kobj, &damon_sysfs_weights_ktype,
+			&quotas->kobj, "weights");
+	if (err)
+		kobject_put(&weights->kobj);
+	else
+		quotas->weights = weights;
+	return err;
+}
+
+static void damon_sysfs_quotas_rm_dirs(struct damon_sysfs_quotas *quotas)
+{
+	kobject_put(&quotas->weights->kobj);
+}
+
+static ssize_t ms_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+
+	return sysfs_emit(buf, "%lu\n", quotas->ms);
+}
+
+static ssize_t ms_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+	int err = kstrtoul(buf, 0, &quotas->ms);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t bytes_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+
+	return sysfs_emit(buf, "%lu\n", quotas->sz);
+}
+
+static ssize_t bytes_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+	int err = kstrtoul(buf, 0, &quotas->sz);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t reset_interval_ms_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+
+	return sysfs_emit(buf, "%lu\n", quotas->reset_interval_ms);
+}
+
+static ssize_t reset_interval_ms_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+	int err = kstrtoul(buf, 0, &quotas->reset_interval_ms);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static void damon_sysfs_quotas_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_quotas, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_quotas_ms_attr =
+		__ATTR_RW_MODE(ms, 0600);
+
+static struct kobj_attribute damon_sysfs_quotas_sz_attr =
+		__ATTR_RW_MODE(bytes, 0600);
+
+static struct kobj_attribute damon_sysfs_quotas_reset_interval_ms_attr =
+		__ATTR_RW_MODE(reset_interval_ms, 0600);
+
+static struct attribute *damon_sysfs_quotas_attrs[] = {
+	&damon_sysfs_quotas_ms_attr.attr,
+	&damon_sysfs_quotas_sz_attr.attr,
+	&damon_sysfs_quotas_reset_interval_ms_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_quotas);
+
+static struct kobj_type damon_sysfs_quotas_ktype = {
+	.release = damon_sysfs_quotas_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_quotas_groups,
+};
+
+/*
+ * access_pattern directory
+ */
+
+struct damon_sysfs_access_pattern {
+	struct kobject kobj;
+	struct damon_sysfs_ul_range *sz;
+	struct damon_sysfs_ul_range *nr_accesses;
+	struct damon_sysfs_ul_range *age;
+};
+
+static
+struct damon_sysfs_access_pattern *damon_sysfs_access_pattern_alloc(void)
+{
+	struct damon_sysfs_access_pattern *access_pattern =
+		kmalloc(sizeof(*access_pattern), GFP_KERNEL);
+
+	if (!access_pattern)
+		return NULL;
+	access_pattern->kobj = (struct kobject){};
+	return access_pattern;
+}
+
+static int damon_sysfs_access_pattern_add_range_dir(
+		struct damon_sysfs_access_pattern *access_pattern,
+		struct damon_sysfs_ul_range **range_dir_ptr,
+		char *name)
+{
+	struct damon_sysfs_ul_range *range = damon_sysfs_ul_range_alloc(0, 0);
+	int err;
+
+	if (!range)
+		return -ENOMEM;
+	err = kobject_init_and_add(&range->kobj, &damon_sysfs_ul_range_ktype,
+			&access_pattern->kobj, name);
+	if (err)
+		kobject_put(&range->kobj);
+	else
+		*range_dir_ptr = range;
+	return err;
+}
+
+static int damon_sysfs_access_pattern_add_dirs(
+		struct damon_sysfs_access_pattern *access_pattern)
+{
+	int err;
+
+	err = damon_sysfs_access_pattern_add_range_dir(access_pattern,
+			&access_pattern->sz, "sz");
+	if (err)
+		goto put_sz_out;
+
+	err = damon_sysfs_access_pattern_add_range_dir(access_pattern,
+			&access_pattern->nr_accesses, "nr_accesses");
+	if (err)
+		goto put_nr_accesses_sz_out;
+
+	err = damon_sysfs_access_pattern_add_range_dir(access_pattern,
+			&access_pattern->age, "age");
+	if (err)
+		goto put_age_nr_accesses_sz_out;
+	return 0;
+
+put_age_nr_accesses_sz_out:
+	kobject_put(&access_pattern->age->kobj);
+	access_pattern->age = NULL;
+put_nr_accesses_sz_out:
+	kobject_put(&access_pattern->nr_accesses->kobj);
+	access_pattern->nr_accesses = NULL;
+put_sz_out:
+	kobject_put(&access_pattern->sz->kobj);
+	access_pattern->sz = NULL;
+	return err;
+}
+
+static void damon_sysfs_access_pattern_rm_dirs(
+		struct damon_sysfs_access_pattern *access_pattern)
+{
+	kobject_put(&access_pattern->sz->kobj);
+	kobject_put(&access_pattern->nr_accesses->kobj);
+	kobject_put(&access_pattern->age->kobj);
+}
+
+static void damon_sysfs_access_pattern_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_access_pattern, kobj));
+}
+
+static struct attribute *damon_sysfs_access_pattern_attrs[] = {
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_access_pattern);
+
+static struct kobj_type damon_sysfs_access_pattern_ktype = {
+	.release = damon_sysfs_access_pattern_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_access_pattern_groups,
+};
+
+/*
+ * scheme directory
+ */
+
+struct damon_sysfs_scheme {
+	struct kobject kobj;
+	enum damos_action action;
+	struct damon_sysfs_access_pattern *access_pattern;
+	struct damon_sysfs_quotas *quotas;
+	struct damon_sysfs_watermarks *watermarks;
+	struct damon_sysfs_stats *stats;
+};
+
+/* This should match with enum damos_action */
+static const char * const damon_sysfs_damos_action_strs[] = {
+	"willneed",
+	"cold",
+	"pageout",
+	"hugepage",
+	"nohugepage",
+	"stat",
+};
+
+static struct damon_sysfs_scheme *damon_sysfs_scheme_alloc(
+		enum damos_action action)
+{
+	struct damon_sysfs_scheme *scheme = kmalloc(sizeof(*scheme),
+				GFP_KERNEL);
+
+	if (!scheme)
+		return NULL;
+	scheme->kobj = (struct kobject){};
+	scheme->action = action;
+	return scheme;
+}
+
+static int damon_sysfs_scheme_set_access_pattern(
+		struct damon_sysfs_scheme *scheme)
+{
+	struct damon_sysfs_access_pattern *access_pattern;
+	int err;
+
+	access_pattern = damon_sysfs_access_pattern_alloc();
+	if (!access_pattern)
+		return -ENOMEM;
+	err = kobject_init_and_add(&access_pattern->kobj,
+			&damon_sysfs_access_pattern_ktype, &scheme->kobj,
+			"access_pattern");
+	if (err)
+		goto out;
+	err = damon_sysfs_access_pattern_add_dirs(access_pattern);
+	if (err)
+		goto out;
+	scheme->access_pattern = access_pattern;
+	return 0;
+
+out:
+	kobject_put(&access_pattern->kobj);
+	return err;
+}
+
+static int damon_sysfs_scheme_set_quotas(struct damon_sysfs_scheme *scheme)
+{
+	struct damon_sysfs_quotas *quotas = damon_sysfs_quotas_alloc();
+	int err;
+
+	if (!quotas)
+		return -ENOMEM;
+	err = kobject_init_and_add(&quotas->kobj, &damon_sysfs_quotas_ktype,
+			&scheme->kobj, "quotas");
+	if (err)
+		goto out;
+	err = damon_sysfs_quotas_add_dirs(quotas);
+	if (err)
+		goto out;
+	scheme->quotas = quotas;
+	return 0;
+
+out:
+	kobject_put(&quotas->kobj);
+	return err;
+}
+
+static int damon_sysfs_scheme_set_watermarks(struct damon_sysfs_scheme *scheme)
+{
+	struct damon_sysfs_watermarks *watermarks =
+		damon_sysfs_watermarks_alloc(DAMOS_WMARK_NONE, 0, 0, 0, 0);
+	int err;
+
+	if (!watermarks)
+		return -ENOMEM;
+	err = kobject_init_and_add(&watermarks->kobj,
+			&damon_sysfs_watermarks_ktype, &scheme->kobj,
+			"watermarks");
+	if (err)
+		kobject_put(&watermarks->kobj);
+	else
+		scheme->watermarks = watermarks;
+	return err;
+}
+
+static int damon_sysfs_scheme_set_stats(struct damon_sysfs_scheme *scheme)
+{
+	struct damon_sysfs_stats *stats = damon_sysfs_stats_alloc();
+	int err;
+
+	if (!stats)
+		return -ENOMEM;
+	err = kobject_init_and_add(&stats->kobj, &damon_sysfs_stats_ktype,
+			&scheme->kobj, "stats");
+	if (err)
+		kobject_put(&stats->kobj);
+	else
+		scheme->stats = stats;
+	return err;
+}
+
+static int damon_sysfs_scheme_add_dirs(struct damon_sysfs_scheme *scheme)
+{
+	int err;
+
+	err = damon_sysfs_scheme_set_access_pattern(scheme);
+	if (err)
+		return err;
+	err = damon_sysfs_scheme_set_quotas(scheme);
+	if (err)
+		goto put_access_pattern_out;
+	err = damon_sysfs_scheme_set_watermarks(scheme);
+	if (err)
+		goto put_quotas_access_pattern_out;
+	err = damon_sysfs_scheme_set_stats(scheme);
+	if (err)
+		goto put_watermarks_quotas_access_pattern_out;
+	return 0;
+
+put_watermarks_quotas_access_pattern_out:
+	kobject_put(&scheme->watermarks->kobj);
+	scheme->watermarks = NULL;
+put_quotas_access_pattern_out:
+	kobject_put(&scheme->quotas->kobj);
+	scheme->quotas = NULL;
+put_access_pattern_out:
+	kobject_put(&scheme->access_pattern->kobj);
+	scheme->access_pattern = NULL;
+	return err;
+}
+
+static void damon_sysfs_scheme_rm_dirs(struct damon_sysfs_scheme *scheme)
+{
+	damon_sysfs_access_pattern_rm_dirs(scheme->access_pattern);
+	kobject_put(&scheme->access_pattern->kobj);
+	damon_sysfs_quotas_rm_dirs(scheme->quotas);
+	kobject_put(&scheme->quotas->kobj);
+	kobject_put(&scheme->watermarks->kobj);
+	kobject_put(&scheme->stats->kobj);
+}
+
+static ssize_t action_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+
+	return sysfs_emit(buf, "%s\n",
+			damon_sysfs_damos_action_strs[scheme->action]);
+}
+
+static ssize_t action_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_scheme *scheme = container_of(kobj,
+			struct damon_sysfs_scheme, kobj);
+	enum damos_action action;
+
+	for (action = 0; action < NR_DAMOS_ACTIONS; action++) {
+		if (sysfs_streq(buf, damon_sysfs_damos_action_strs[action])) {
+			scheme->action = action;
+			return count;
+		}
+	}
+	return -EINVAL;
+}
+
+static void damon_sysfs_scheme_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_scheme, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_scheme_action_attr =
+		__ATTR_RW_MODE(action, 0600);
+
+static struct attribute *damon_sysfs_scheme_attrs[] = {
+	&damon_sysfs_scheme_action_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_scheme);
+
+static struct kobj_type damon_sysfs_scheme_ktype = {
+	.release = damon_sysfs_scheme_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_scheme_groups,
+};
+
+/*
+ * schemes directory
+ */
+
+struct damon_sysfs_schemes {
+	struct kobject kobj;
+	struct damon_sysfs_scheme **schemes_arr;
+	int nr;
+};
+
+static struct damon_sysfs_schemes *damon_sysfs_schemes_alloc(void)
+{
+	return kzalloc(sizeof(struct damon_sysfs_schemes), GFP_KERNEL);
+}
+
+static void damon_sysfs_schemes_rm_dirs(struct damon_sysfs_schemes *schemes)
+{
+	struct damon_sysfs_scheme **schemes_arr = schemes->schemes_arr;
+	int i;
+
+	for (i = 0; i < schemes->nr; i++) {
+		damon_sysfs_scheme_rm_dirs(schemes_arr[i]);
+		kobject_put(&schemes_arr[i]->kobj);
+	}
+	schemes->nr = 0;
+	kfree(schemes_arr);
+	schemes->schemes_arr = NULL;
+}
+
+static int damon_sysfs_schemes_add_dirs(struct damon_sysfs_schemes *schemes,
+		int nr_schemes)
+{
+	struct damon_sysfs_scheme **schemes_arr, *scheme;
+	int err, i;
+
+	damon_sysfs_schemes_rm_dirs(schemes);
+	if (!nr_schemes)
+		return 0;
+
+	schemes_arr = kmalloc_array(nr_schemes, sizeof(*schemes_arr),
+			GFP_KERNEL | __GFP_NOWARN);
+	if (!schemes_arr)
+		return -ENOMEM;
+	schemes->schemes_arr = schemes_arr;
+
+	for (i = 0; i < nr_schemes; i++) {
+		scheme = damon_sysfs_scheme_alloc(DAMOS_STAT);
+		if (!scheme) {
+			damon_sysfs_schemes_rm_dirs(schemes);
+			return -ENOMEM;
+		}
+
+		err = kobject_init_and_add(&scheme->kobj,
+				&damon_sysfs_scheme_ktype, &schemes->kobj,
+				"%d", i);
+		if (err)
+			goto out;
+		err = damon_sysfs_scheme_add_dirs(scheme);
+		if (err)
+			goto out;
+
+		schemes_arr[i] = scheme;
+		schemes->nr++;
+	}
+	return 0;
+
+out:
+	damon_sysfs_schemes_rm_dirs(schemes);
+	kobject_put(&scheme->kobj);
+	return err;
+}
+
+static ssize_t nr_schemes_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_schemes *schemes = container_of(kobj,
+			struct damon_sysfs_schemes, kobj);
+
+	return sysfs_emit(buf, "%d\n", schemes->nr);
+}
+
+static ssize_t nr_schemes_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_schemes *schemes = container_of(kobj,
+			struct damon_sysfs_schemes, kobj);
+	int nr, err = kstrtoint(buf, 0, &nr);
+
+	if (err)
+		return err;
+	if (nr < 0)
+		return -EINVAL;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	err = damon_sysfs_schemes_add_dirs(schemes, nr);
+	mutex_unlock(&damon_sysfs_lock);
+	if (err)
+		return err;
+	return count;
+}
+
+static void damon_sysfs_schemes_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_schemes, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_schemes_nr_attr =
+		__ATTR_RW_MODE(nr_schemes, 0600);
+
+static struct attribute *damon_sysfs_schemes_attrs[] = {
+	&damon_sysfs_schemes_nr_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_schemes);
+
+static struct kobj_type damon_sysfs_schemes_ktype = {
+	.release = damon_sysfs_schemes_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_schemes_groups,
+};
+
+/*
+ * init region directory
+ */
+
+struct damon_sysfs_region {
+	struct kobject kobj;
+	unsigned long start;
+	unsigned long end;
+};
+
+static struct damon_sysfs_region *damon_sysfs_region_alloc(
+		unsigned long start,
+		unsigned long end)
+{
+	struct damon_sysfs_region *region = kmalloc(sizeof(*region),
+			GFP_KERNEL);
+
+	if (!region)
+		return NULL;
+	region->kobj = (struct kobject){};
+	region->start = start;
+	region->end = end;
+	return region;
+}
+
+static ssize_t start_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_region *region = container_of(kobj,
+			struct damon_sysfs_region, kobj);
+
+	return sysfs_emit(buf, "%lu\n", region->start);
+}
+
+static ssize_t start_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_region *region = container_of(kobj,
+			struct damon_sysfs_region, kobj);
+	int err = kstrtoul(buf, 0, &region->start);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static ssize_t end_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_region *region = container_of(kobj,
+			struct damon_sysfs_region, kobj);
+
+	return sysfs_emit(buf, "%lu\n", region->end);
+}
+
+static ssize_t end_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_region *region = container_of(kobj,
+			struct damon_sysfs_region, kobj);
+	int err = kstrtoul(buf, 0, &region->end);
+
+	if (err)
+		return -EINVAL;
+	return count;
+}
+
+static void damon_sysfs_region_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_region, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_region_start_attr =
+		__ATTR_RW_MODE(start, 0600);
+
+static struct kobj_attribute damon_sysfs_region_end_attr =
+		__ATTR_RW_MODE(end, 0600);
+
+static struct attribute *damon_sysfs_region_attrs[] = {
+	&damon_sysfs_region_start_attr.attr,
+	&damon_sysfs_region_end_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_region);
+
+static struct kobj_type damon_sysfs_region_ktype = {
+	.release = damon_sysfs_region_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_region_groups,
+};
+
+/*
+ * init_regions directory
+ */
+
+struct damon_sysfs_regions {
+	struct kobject kobj;
+	struct damon_sysfs_region **regions_arr;
+	int nr;
+};
+
+static struct damon_sysfs_regions *damon_sysfs_regions_alloc(void)
+{
+	return kzalloc(sizeof(struct damon_sysfs_regions), GFP_KERNEL);
+}
+
+static void damon_sysfs_regions_rm_dirs(struct damon_sysfs_regions *regions)
+{
+	struct damon_sysfs_region **regions_arr = regions->regions_arr;
+	int i;
+
+	for (i = 0; i < regions->nr; i++)
+		kobject_put(&regions_arr[i]->kobj);
+	regions->nr = 0;
+	kfree(regions_arr);
+	regions->regions_arr = NULL;
+}
+
+static int damon_sysfs_regions_add_dirs(struct damon_sysfs_regions *regions,
+		int nr_regions)
+{
+	struct damon_sysfs_region **regions_arr, *region;
+	int err, i;
+
+	damon_sysfs_regions_rm_dirs(regions);
+	if (!nr_regions)
+		return 0;
+
+	regions_arr = kmalloc_array(nr_regions, sizeof(*regions_arr),
+			GFP_KERNEL | __GFP_NOWARN);
+	if (!regions_arr)
+		return -ENOMEM;
+	regions->regions_arr = regions_arr;
+
+	for (i = 0; i < nr_regions; i++) {
+		region = damon_sysfs_region_alloc(0, 0);
+		if (!region) {
+			damon_sysfs_regions_rm_dirs(regions);
+			return -ENOMEM;
+		}
+
+		err = kobject_init_and_add(&region->kobj,
+				&damon_sysfs_region_ktype, &regions->kobj,
+				"%d", i);
+		if (err) {
+			kobject_put(&region->kobj);
+			damon_sysfs_regions_rm_dirs(regions);
+			return err;
+		}
+
+		regions_arr[i] = region;
+		regions->nr++;
+	}
+	return 0;
+}
+
+static ssize_t nr_regions_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_regions *regions = container_of(kobj,
+			struct damon_sysfs_regions, kobj);
+
+	return sysfs_emit(buf, "%d\n", regions->nr);
+}
+
+static ssize_t nr_regions_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_regions *regions = container_of(kobj,
+			struct damon_sysfs_regions, kobj);
+	int nr, err = kstrtoint(buf, 0, &nr);
+
+	if (err)
+		return err;
+	if (nr < 0)
+		return -EINVAL;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	err = damon_sysfs_regions_add_dirs(regions, nr);
+	mutex_unlock(&damon_sysfs_lock);
+	if (err)
+		return err;
+
+	return count;
+}
+
+static void damon_sysfs_regions_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_regions, kobj));
+}
+
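nr_regions_store() and damon_sysfs_regions_add_dirs() above follow a rebuild-from-scratch pattern: a negative count is rejected, every existing entry is released, and a failure partway through releases whatever was created so far. The following is a minimal userspace sketch of that pattern, assuming plain heap objects in place of kobjects and omitting the damon_sysfs_lock handling. struct region, struct regions, regions_clear() and regions_resize() are hypothetical names, and the negative-count check is folded into regions_resize() here although the patch performs it in the store handler.

```c
#include <errno.h>
#include <stdlib.h>

struct region {
	unsigned long start;
	unsigned long end;
};

struct regions {
	struct region **arr;
	int nr;
};

/* Drop every entry and the array itself, leaving an empty set. */
static void regions_clear(struct regions *regions)
{
	int i;

	for (i = 0; i < regions->nr; i++)
		free(regions->arr[i]);
	regions->nr = 0;
	free(regions->arr);
	regions->arr = NULL;
}

/* Replace the current entries with nr zeroed ones; 0 or a negative errno. */
static int regions_resize(struct regions *regions, int nr)
{
	int i;

	if (nr < 0)
		return -EINVAL;
	regions_clear(regions);
	if (!nr)
		return 0;

	regions->arr = calloc(nr, sizeof(*regions->arr));
	if (!regions->arr)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		struct region *region = calloc(1, sizeof(*region));

		if (!region) {
			/* Roll back: only the first regions->nr slots are live. */
			regions_clear(regions);
			return -ENOMEM;
		}
		regions->arr[i] = region;
		regions->nr++;
	}
	return 0;
}
```

Incrementing `nr` only after a slot is filled is what lets the clear routine double as the error path, since it never touches an uninitialized slot.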
static struct kobj_attribute damon_sysfs_regions_nr_attr = 1263 + __ATTR_RW_MODE(nr_regions, 0600); 1264 + 1265 + static struct attribute *damon_sysfs_regions_attrs[] = { 1266 + &damon_sysfs_regions_nr_attr.attr, 1267 + NULL, 1268 + }; 1269 + ATTRIBUTE_GROUPS(damon_sysfs_regions); 1270 + 1271 + static struct kobj_type damon_sysfs_regions_ktype = { 1272 + .release = damon_sysfs_regions_release, 1273 + .sysfs_ops = &kobj_sysfs_ops, 1274 + .default_groups = damon_sysfs_regions_groups, 1275 + }; 1276 + 1277 + /* 1278 + * target directory 1279 + */ 1280 + 1281 + struct damon_sysfs_target { 1282 + struct kobject kobj; 1283 + struct damon_sysfs_regions *regions; 1284 + int pid; 1285 + }; 1286 + 1287 + static struct damon_sysfs_target *damon_sysfs_target_alloc(void) 1288 + { 1289 + return kzalloc(sizeof(struct damon_sysfs_target), GFP_KERNEL); 1290 + } 1291 + 1292 + static int damon_sysfs_target_add_dirs(struct damon_sysfs_target *target) 1293 + { 1294 + struct damon_sysfs_regions *regions = damon_sysfs_regions_alloc(); 1295 + int err; 1296 + 1297 + if (!regions) 1298 + return -ENOMEM; 1299 + 1300 + err = kobject_init_and_add(&regions->kobj, &damon_sysfs_regions_ktype, 1301 + &target->kobj, "regions"); 1302 + if (err) 1303 + kobject_put(&regions->kobj); 1304 + else 1305 + target->regions = regions; 1306 + return err; 1307 + } 1308 + 1309 + static void damon_sysfs_target_rm_dirs(struct damon_sysfs_target *target) 1310 + { 1311 + damon_sysfs_regions_rm_dirs(target->regions); 1312 + kobject_put(&target->regions->kobj); 1313 + } 1314 + 1315 + static ssize_t pid_target_show(struct kobject *kobj, 1316 + struct kobj_attribute *attr, char *buf) 1317 + { 1318 + struct damon_sysfs_target *target = container_of(kobj, 1319 + struct damon_sysfs_target, kobj); 1320 + 1321 + return sysfs_emit(buf, "%d\n", target->pid); 1322 + } 1323 + 1324 + static ssize_t pid_target_store(struct kobject *kobj, 1325 + struct kobj_attribute *attr, const char *buf, size_t count) 1326 + { 1327 + struct 
damon_sysfs_target *target = container_of(kobj, 1328 + struct damon_sysfs_target, kobj); 1329 + int err = kstrtoint(buf, 0, &target->pid); 1330 + 1331 + if (err) 1332 + return -EINVAL; 1333 + return count; 1334 + } 1335 + 1336 + static void damon_sysfs_target_release(struct kobject *kobj) 1337 + { 1338 + kfree(container_of(kobj, struct damon_sysfs_target, kobj)); 1339 + } 1340 + 1341 + static struct kobj_attribute damon_sysfs_target_pid_attr = 1342 + __ATTR_RW_MODE(pid_target, 0600); 1343 + 1344 + static struct attribute *damon_sysfs_target_attrs[] = { 1345 + &damon_sysfs_target_pid_attr.attr, 1346 + NULL, 1347 + }; 1348 + ATTRIBUTE_GROUPS(damon_sysfs_target); 1349 + 1350 + static struct kobj_type damon_sysfs_target_ktype = { 1351 + .release = damon_sysfs_target_release, 1352 + .sysfs_ops = &kobj_sysfs_ops, 1353 + .default_groups = damon_sysfs_target_groups, 1354 + }; 1355 + 1356 + /* 1357 + * targets directory 1358 + */ 1359 + 1360 + struct damon_sysfs_targets { 1361 + struct kobject kobj; 1362 + struct damon_sysfs_target **targets_arr; 1363 + int nr; 1364 + }; 1365 + 1366 + static struct damon_sysfs_targets *damon_sysfs_targets_alloc(void) 1367 + { 1368 + return kzalloc(sizeof(struct damon_sysfs_targets), GFP_KERNEL); 1369 + } 1370 + 1371 + static void damon_sysfs_targets_rm_dirs(struct damon_sysfs_targets *targets) 1372 + { 1373 + struct damon_sysfs_target **targets_arr = targets->targets_arr; 1374 + int i; 1375 + 1376 + for (i = 0; i < targets->nr; i++) { 1377 + damon_sysfs_target_rm_dirs(targets_arr[i]); 1378 + kobject_put(&targets_arr[i]->kobj); 1379 + } 1380 + targets->nr = 0; 1381 + kfree(targets_arr); 1382 + targets->targets_arr = NULL; 1383 + } 1384 + 1385 + static int damon_sysfs_targets_add_dirs(struct damon_sysfs_targets *targets, 1386 + int nr_targets) 1387 + { 1388 + struct damon_sysfs_target **targets_arr, *target; 1389 + int err, i; 1390 + 1391 + damon_sysfs_targets_rm_dirs(targets); 1392 + if (!nr_targets) 1393 + return 0; 1394 + 1395 + 
+	targets_arr = kmalloc_array(nr_targets, sizeof(*targets_arr),
+			GFP_KERNEL | __GFP_NOWARN);
+	if (!targets_arr)
+		return -ENOMEM;
+	targets->targets_arr = targets_arr;
+
+	for (i = 0; i < nr_targets; i++) {
+		target = damon_sysfs_target_alloc();
+		if (!target) {
+			damon_sysfs_targets_rm_dirs(targets);
+			return -ENOMEM;
+		}
+
+		err = kobject_init_and_add(&target->kobj,
+				&damon_sysfs_target_ktype, &targets->kobj,
+				"%d", i);
+		if (err)
+			goto out;
+
+		err = damon_sysfs_target_add_dirs(target);
+		if (err)
+			goto out;
+
+		targets_arr[i] = target;
+		targets->nr++;
+	}
+	return 0;
+
+out:
+	damon_sysfs_targets_rm_dirs(targets);
+	kobject_put(&target->kobj);
+	return err;
+}
+
+static ssize_t nr_targets_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_targets *targets = container_of(kobj,
+			struct damon_sysfs_targets, kobj);
+
+	return sysfs_emit(buf, "%d\n", targets->nr);
+}
+
+static ssize_t nr_targets_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_targets *targets = container_of(kobj,
+			struct damon_sysfs_targets, kobj);
+	int nr, err = kstrtoint(buf, 0, &nr);
+
+	if (err)
+		return err;
+	if (nr < 0)
+		return -EINVAL;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	err = damon_sysfs_targets_add_dirs(targets, nr);
+	mutex_unlock(&damon_sysfs_lock);
+	if (err)
+		return err;
+
+	return count;
+}
+
+static void damon_sysfs_targets_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_targets, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_targets_nr_attr =
+		__ATTR_RW_MODE(nr_targets, 0600);
+
+static struct attribute *damon_sysfs_targets_attrs[] = {
+	&damon_sysfs_targets_nr_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_targets);
+
+static struct kobj_type damon_sysfs_targets_ktype = {
+	.release = damon_sysfs_targets_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_targets_groups,
+};
+
+/*
+ * intervals directory
+ */
+
+struct damon_sysfs_intervals {
+	struct kobject kobj;
+	unsigned long sample_us;
+	unsigned long aggr_us;
+	unsigned long update_us;
+};
+
+static struct damon_sysfs_intervals *damon_sysfs_intervals_alloc(
+		unsigned long sample_us, unsigned long aggr_us,
+		unsigned long update_us)
+{
+	struct damon_sysfs_intervals *intervals = kmalloc(sizeof(*intervals),
+			GFP_KERNEL);
+
+	if (!intervals)
+		return NULL;
+
+	intervals->kobj = (struct kobject){};
+	intervals->sample_us = sample_us;
+	intervals->aggr_us = aggr_us;
+	intervals->update_us = update_us;
+	return intervals;
+}
+
+static ssize_t sample_us_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_intervals *intervals = container_of(kobj,
+			struct damon_sysfs_intervals, kobj);
+
+	return sysfs_emit(buf, "%lu\n", intervals->sample_us);
+}
+
+static ssize_t sample_us_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_intervals *intervals = container_of(kobj,
+			struct damon_sysfs_intervals, kobj);
+	unsigned long us;
+	int err = kstrtoul(buf, 0, &us);
+
+	if (err)
+		return -EINVAL;
+
+	intervals->sample_us = us;
+	return count;
+}
+
+static ssize_t aggr_us_show(struct kobject
*kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_intervals *intervals = container_of(kobj,
+			struct damon_sysfs_intervals, kobj);
+
+	return sysfs_emit(buf, "%lu\n", intervals->aggr_us);
+}
+
+static ssize_t aggr_us_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_intervals *intervals = container_of(kobj,
+			struct damon_sysfs_intervals, kobj);
+	unsigned long us;
+	int err = kstrtoul(buf, 0, &us);
+
+	if (err)
+		return -EINVAL;
+
+	intervals->aggr_us = us;
+	return count;
+}
+
+static ssize_t update_us_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_intervals *intervals = container_of(kobj,
+			struct damon_sysfs_intervals, kobj);
+
+	return sysfs_emit(buf, "%lu\n", intervals->update_us);
+}
+
+static ssize_t update_us_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_intervals *intervals = container_of(kobj,
+			struct damon_sysfs_intervals, kobj);
+	unsigned long us;
+	int err = kstrtoul(buf, 0, &us);
+
+	if (err)
+		return -EINVAL;
+
+	intervals->update_us = us;
+	return count;
+}
+
+static void damon_sysfs_intervals_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_intervals, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_intervals_sample_us_attr =
+		__ATTR_RW_MODE(sample_us, 0600);
+
+static struct kobj_attribute damon_sysfs_intervals_aggr_us_attr =
+		__ATTR_RW_MODE(aggr_us, 0600);
+
+static struct kobj_attribute damon_sysfs_intervals_update_us_attr =
+		__ATTR_RW_MODE(update_us, 0600);
+
+static struct attribute
*damon_sysfs_intervals_attrs[] = {
+	&damon_sysfs_intervals_sample_us_attr.attr,
+	&damon_sysfs_intervals_aggr_us_attr.attr,
+	&damon_sysfs_intervals_update_us_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_intervals);
+
+static struct kobj_type damon_sysfs_intervals_ktype = {
+	.release = damon_sysfs_intervals_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_intervals_groups,
+};
+
+/*
+ * monitoring_attrs directory
+ */
+
+struct damon_sysfs_attrs {
+	struct kobject kobj;
+	struct damon_sysfs_intervals *intervals;
+	struct damon_sysfs_ul_range *nr_regions_range;
+};
+
+static struct damon_sysfs_attrs *damon_sysfs_attrs_alloc(void)
+{
+	struct damon_sysfs_attrs *attrs = kmalloc(sizeof(*attrs), GFP_KERNEL);
+
+	if (!attrs)
+		return NULL;
+	attrs->kobj = (struct kobject){};
+	return attrs;
+}
+
+static int damon_sysfs_attrs_add_dirs(struct damon_sysfs_attrs *attrs)
+{
+	struct damon_sysfs_intervals *intervals;
+	struct damon_sysfs_ul_range *nr_regions_range;
+	int err;
+
+	intervals = damon_sysfs_intervals_alloc(5000, 100000, 60000000);
+	if (!intervals)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(&intervals->kobj,
+			&damon_sysfs_intervals_ktype, &attrs->kobj,
+			"intervals");
+	if (err)
+		goto put_intervals_out;
+	attrs->intervals = intervals;
+
+	nr_regions_range = damon_sysfs_ul_range_alloc(10, 1000);
+	if (!nr_regions_range) {
+		err = -ENOMEM;
+		goto put_intervals_out;
+	}
+
+	err = kobject_init_and_add(&nr_regions_range->kobj,
+			&damon_sysfs_ul_range_ktype, &attrs->kobj,
+			"nr_regions");
+	if (err)
+		goto put_nr_regions_intervals_out;
+	attrs->nr_regions_range = nr_regions_range;
+	return 0;
+
+put_nr_regions_intervals_out:
+	kobject_put(&nr_regions_range->kobj);
+	attrs->nr_regions_range = NULL;
+put_intervals_out:
+	kobject_put(&intervals->kobj);
+	attrs->intervals = NULL;
+	return err;
+}
+
+static void damon_sysfs_attrs_rm_dirs(struct damon_sysfs_attrs *attrs)
+{
+	kobject_put(&attrs->nr_regions_range->kobj);
+	kobject_put(&attrs->intervals->kobj);
+}
+
+static void damon_sysfs_attrs_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_attrs, kobj));
+}
+
+static struct attribute *damon_sysfs_attrs_attrs[] = {
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_attrs);
+
+static struct kobj_type damon_sysfs_attrs_ktype = {
+	.release = damon_sysfs_attrs_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_attrs_groups,
+};
+
+/*
+ * context directory
+ */
+
+/* This should match with enum damon_ops_id */
+static const char * const damon_sysfs_ops_strs[] = {
+	"vaddr",
+	"paddr",
+};
+
+struct damon_sysfs_context {
+	struct kobject kobj;
+	enum damon_ops_id ops_id;
+	struct damon_sysfs_attrs *attrs;
+	struct damon_sysfs_targets *targets;
+	struct damon_sysfs_schemes *schemes;
+};
+
+static struct damon_sysfs_context *damon_sysfs_context_alloc(
+		enum damon_ops_id ops_id)
+{
+	struct damon_sysfs_context *context = kmalloc(sizeof(*context),
+				GFP_KERNEL);
+
+	if (!context)
+		return NULL;
+	context->kobj = (struct kobject){};
+	context->ops_id = ops_id;
+	return context;
+}
+
+static int damon_sysfs_context_set_attrs(struct damon_sysfs_context *context)
+{
+	struct damon_sysfs_attrs *attrs = damon_sysfs_attrs_alloc();
+	int err;
+
+	if (!attrs)
+		return -ENOMEM;
+	err = kobject_init_and_add(&attrs->kobj, &damon_sysfs_attrs_ktype,
+			&context->kobj, "monitoring_attrs");
+	if (err)
+		goto out;
+	err = damon_sysfs_attrs_add_dirs(attrs);
+	if (err)
+		goto out;
+	context->attrs = attrs;
+	return 0;
+
+out:
+	kobject_put(&attrs->kobj);
+	return err;
+}
+
+static int damon_sysfs_context_set_targets(struct damon_sysfs_context *context)
+{
+	struct damon_sysfs_targets *targets = damon_sysfs_targets_alloc();
+	int err;
+
+	if (!targets)
+		return -ENOMEM;
+	err = kobject_init_and_add(&targets->kobj, &damon_sysfs_targets_ktype,
+			&context->kobj, "targets");
+	if (err) {
+		kobject_put(&targets->kobj);
+		return err;
+	}
+	context->targets = targets;
+	return 0;
+}
+
+static int damon_sysfs_context_set_schemes(struct damon_sysfs_context *context)
+{
+	struct damon_sysfs_schemes *schemes = damon_sysfs_schemes_alloc();
+	int err;
+
+	if (!schemes)
+		return -ENOMEM;
+	err = kobject_init_and_add(&schemes->kobj, &damon_sysfs_schemes_ktype,
+			&context->kobj, "schemes");
+	if (err) {
+		kobject_put(&schemes->kobj);
+		return err;
+	}
+	context->schemes = schemes;
+	return 0;
+}
+
+static int damon_sysfs_context_add_dirs(struct damon_sysfs_context *context)
+{
+	int err;
+
+	err = damon_sysfs_context_set_attrs(context);
+	if (err)
+		return err;
+
+	err = damon_sysfs_context_set_targets(context);
+	if (err)
+		goto put_attrs_out;
+
+	err = damon_sysfs_context_set_schemes(context);
+	if (err)
+		goto put_targets_attrs_out;
+	return 0;
+
+put_targets_attrs_out:
+	kobject_put(&context->targets->kobj);
+	context->targets = NULL;
+put_attrs_out:
+	kobject_put(&context->attrs->kobj);
+	context->attrs = NULL;
+	return err;
+}
+
+static void damon_sysfs_context_rm_dirs(struct damon_sysfs_context *context)
+{
+	damon_sysfs_attrs_rm_dirs(context->attrs);
+	kobject_put(&context->attrs->kobj);
+	damon_sysfs_targets_rm_dirs(context->targets);
+	kobject_put(&context->targets->kobj);
+	damon_sysfs_schemes_rm_dirs(context->schemes);
+	kobject_put(&context->schemes->kobj);
+}
+
+static ssize_t operations_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_context *context = container_of(kobj,
+			struct damon_sysfs_context, kobj);
+
+	return sysfs_emit(buf, "%s\n", damon_sysfs_ops_strs[context->ops_id]);
+}
+
+static ssize_t operations_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_context *context = container_of(kobj,
+			struct damon_sysfs_context, kobj);
+	enum damon_ops_id id;
+
+	for (id = 0; id < NR_DAMON_OPS; id++) {
+		if (sysfs_streq(buf, damon_sysfs_ops_strs[id])) {
+			context->ops_id = id;
+			return count;
+		}
+	}
+	return -EINVAL;
+}
+
+static void damon_sysfs_context_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_context, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_context_operations_attr =
+		__ATTR_RW_MODE(operations, 0600);
+
+static struct attribute *damon_sysfs_context_attrs[] = {
+	&damon_sysfs_context_operations_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_context);
+
+static struct kobj_type damon_sysfs_context_ktype = {
+	.release = damon_sysfs_context_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_context_groups,
+};
+
+/*
+ * contexts directory
+ */
+
+struct damon_sysfs_contexts {
+	struct kobject kobj;
+	struct damon_sysfs_context **contexts_arr;
+	int nr;
+};
+
+static struct damon_sysfs_contexts *damon_sysfs_contexts_alloc(void)
+{
+	return kzalloc(sizeof(struct damon_sysfs_contexts), GFP_KERNEL);
+}
+
+static void damon_sysfs_contexts_rm_dirs(struct damon_sysfs_contexts *contexts)
+{
+	struct damon_sysfs_context **contexts_arr = contexts->contexts_arr;
+	int i;
+
+	for (i = 0; i < contexts->nr; i++) {
+		damon_sysfs_context_rm_dirs(contexts_arr[i]);
+		kobject_put(&contexts_arr[i]->kobj);
+	}
+	contexts->nr = 0;
+	kfree(contexts_arr);
+	contexts->contexts_arr = NULL;
+}
+
+static int damon_sysfs_contexts_add_dirs(struct damon_sysfs_contexts *contexts,
+		int nr_contexts)
+{
+	struct damon_sysfs_context **contexts_arr, *context;
+	int err, i;
+
+	damon_sysfs_contexts_rm_dirs(contexts);
+	if (!nr_contexts)
+		return 0;
+
+	contexts_arr = kmalloc_array(nr_contexts, sizeof(*contexts_arr),
+			GFP_KERNEL | __GFP_NOWARN);
+	if (!contexts_arr)
+		return -ENOMEM;
+	contexts->contexts_arr = contexts_arr;
+
+	for (i = 0; i < nr_contexts; i++) {
+		context = damon_sysfs_context_alloc(DAMON_OPS_VADDR);
+		if (!context) {
+			damon_sysfs_contexts_rm_dirs(contexts);
+			return -ENOMEM;
+		}
+
+		err = kobject_init_and_add(&context->kobj,
+				&damon_sysfs_context_ktype, &contexts->kobj,
+				"%d", i);
+		if (err)
+			goto out;
+
+		err = damon_sysfs_context_add_dirs(context);
+		if (err)
+			goto out;
+
+		contexts_arr[i] = context;
+		contexts->nr++;
+	}
+	return 0;
+
+out:
+	damon_sysfs_contexts_rm_dirs(contexts);
+	kobject_put(&context->kobj);
+	return err;
+}
+
+static
ssize_t nr_contexts_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_contexts *contexts = container_of(kobj,
+			struct damon_sysfs_contexts, kobj);
+
+	return sysfs_emit(buf, "%d\n", contexts->nr);
+}
+
+static ssize_t nr_contexts_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_contexts *contexts = container_of(kobj,
+			struct damon_sysfs_contexts, kobj);
+	int nr, err;
+
+	err = kstrtoint(buf, 0, &nr);
+	if (err)
+		return err;
+	/* TODO: support multiple contexts per kdamond */
+	if (nr < 0 || 1 < nr)
+		return -EINVAL;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	err = damon_sysfs_contexts_add_dirs(contexts, nr);
+	mutex_unlock(&damon_sysfs_lock);
+	if (err)
+		return err;
+
+	return count;
+}
+
+static void damon_sysfs_contexts_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_contexts, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_contexts_nr_attr
+		= __ATTR_RW_MODE(nr_contexts, 0600);
+
+static struct attribute *damon_sysfs_contexts_attrs[] = {
+	&damon_sysfs_contexts_nr_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_contexts);
+
+static struct kobj_type damon_sysfs_contexts_ktype = {
+	.release = damon_sysfs_contexts_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_contexts_groups,
+};
+
+/*
+ * kdamond directory
+ */
+
+struct damon_sysfs_kdamond {
+	struct kobject kobj;
+	struct damon_sysfs_contexts *contexts;
+	struct damon_ctx *damon_ctx;
+};
+
+static struct damon_sysfs_kdamond *damon_sysfs_kdamond_alloc(void)
+{
+	return
kzalloc(sizeof(struct damon_sysfs_kdamond), GFP_KERNEL);
+}
+
+static int damon_sysfs_kdamond_add_dirs(struct damon_sysfs_kdamond *kdamond)
+{
+	struct damon_sysfs_contexts *contexts;
+	int err;
+
+	contexts = damon_sysfs_contexts_alloc();
+	if (!contexts)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(&contexts->kobj,
+			&damon_sysfs_contexts_ktype, &kdamond->kobj,
+			"contexts");
+	if (err) {
+		kobject_put(&contexts->kobj);
+		return err;
+	}
+	kdamond->contexts = contexts;
+
+	return err;
+}
+
+static void damon_sysfs_kdamond_rm_dirs(struct damon_sysfs_kdamond *kdamond)
+{
+	damon_sysfs_contexts_rm_dirs(kdamond->contexts);
+	kobject_put(&kdamond->contexts->kobj);
+}
+
+static bool damon_sysfs_ctx_running(struct damon_ctx *ctx)
+{
+	bool running;
+
+	mutex_lock(&ctx->kdamond_lock);
+	running = ctx->kdamond != NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+	return running;
+}
+
+static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
+		char *buf)
+{
+	struct damon_sysfs_kdamond *kdamond = container_of(kobj,
+			struct damon_sysfs_kdamond, kobj);
+	struct damon_ctx *ctx = kdamond->damon_ctx;
+	bool running;
+
+	if (!ctx)
+		running = false;
+	else
+		running = damon_sysfs_ctx_running(ctx);
+
+	return sysfs_emit(buf, "%s\n", running ?
"on" : "off"); 2050 + } 2051 + 2052 + static int damon_sysfs_set_attrs(struct damon_ctx *ctx, 2053 + struct damon_sysfs_attrs *sys_attrs) 2054 + { 2055 + struct damon_sysfs_intervals *sys_intervals = sys_attrs->intervals; 2056 + struct damon_sysfs_ul_range *sys_nr_regions = 2057 + sys_attrs->nr_regions_range; 2058 + 2059 + return damon_set_attrs(ctx, sys_intervals->sample_us, 2060 + sys_intervals->aggr_us, sys_intervals->update_us, 2061 + sys_nr_regions->min, sys_nr_regions->max); 2062 + } 2063 + 2064 + static void damon_sysfs_destroy_targets(struct damon_ctx *ctx) 2065 + { 2066 + struct damon_target *t, *next; 2067 + 2068 + damon_for_each_target_safe(t, next, ctx) { 2069 + if (ctx->ops.id == DAMON_OPS_VADDR) 2070 + put_pid(t->pid); 2071 + damon_destroy_target(t); 2072 + } 2073 + } 2074 + 2075 + static int damon_sysfs_set_regions(struct damon_target *t, 2076 + struct damon_sysfs_regions *sysfs_regions) 2077 + { 2078 + int i; 2079 + 2080 + for (i = 0; i < sysfs_regions->nr; i++) { 2081 + struct damon_sysfs_region *sys_region = 2082 + sysfs_regions->regions_arr[i]; 2083 + struct damon_region *prev, *r; 2084 + 2085 + if (sys_region->start > sys_region->end) 2086 + return -EINVAL; 2087 + r = damon_new_region(sys_region->start, sys_region->end); 2088 + if (!r) 2089 + return -ENOMEM; 2090 + damon_add_region(r, t); 2091 + if (damon_nr_regions(t) > 1) { 2092 + prev = damon_prev_region(r); 2093 + if (prev->ar.end > r->ar.start) { 2094 + damon_destroy_region(r, t); 2095 + return -EINVAL; 2096 + } 2097 + } 2098 + } 2099 + return 0; 2100 + } 2101 + 2102 + static int damon_sysfs_set_targets(struct damon_ctx *ctx, 2103 + struct damon_sysfs_targets *sysfs_targets) 2104 + { 2105 + int i, err; 2106 + 2107 + for (i = 0; i < sysfs_targets->nr; i++) { 2108 + struct damon_sysfs_target *sys_target = 2109 + sysfs_targets->targets_arr[i]; 2110 + struct damon_target *t = damon_new_target(); 2111 + 2112 + if (!t) { 2113 + damon_sysfs_destroy_targets(ctx); 2114 + return -ENOMEM; 2115 + } 
+		if (ctx->ops.id == DAMON_OPS_VADDR) {
+			t->pid = find_get_pid(sys_target->pid);
+			if (!t->pid) {
+				damon_sysfs_destroy_targets(ctx);
+				return -EINVAL;
+			}
+		}
+		damon_add_target(ctx, t);
+		err = damon_sysfs_set_regions(t, sys_target->regions);
+		if (err) {
+			damon_sysfs_destroy_targets(ctx);
+			return err;
+		}
+	}
+	return 0;
+}
+
+static struct damos *damon_sysfs_mk_scheme(
+		struct damon_sysfs_scheme *sysfs_scheme)
+{
+	struct damon_sysfs_access_pattern *pattern =
+		sysfs_scheme->access_pattern;
+	struct damon_sysfs_quotas *sysfs_quotas = sysfs_scheme->quotas;
+	struct damon_sysfs_weights *sysfs_weights = sysfs_quotas->weights;
+	struct damon_sysfs_watermarks *sysfs_wmarks = sysfs_scheme->watermarks;
+	struct damos_quota quota = {
+		.ms = sysfs_quotas->ms,
+		.sz = sysfs_quotas->sz,
+		.reset_interval = sysfs_quotas->reset_interval_ms,
+		.weight_sz = sysfs_weights->sz,
+		.weight_nr_accesses = sysfs_weights->nr_accesses,
+		.weight_age = sysfs_weights->age,
+	};
+	struct damos_watermarks wmarks = {
+		.metric = sysfs_wmarks->metric,
+		.interval = sysfs_wmarks->interval_us,
+		.high = sysfs_wmarks->high,
+		.mid = sysfs_wmarks->mid,
+		.low = sysfs_wmarks->low,
+	};
+
+	return damon_new_scheme(pattern->sz->min, pattern->sz->max,
+			pattern->nr_accesses->min, pattern->nr_accesses->max,
+			pattern->age->min, pattern->age->max,
+			sysfs_scheme->action, &quota, &wmarks);
+}
+
+static int damon_sysfs_set_schemes(struct damon_ctx *ctx,
+		struct damon_sysfs_schemes *sysfs_schemes)
+{
+	int i;
+
+	for (i = 0; i < sysfs_schemes->nr; i++) {
+		struct damos *scheme, *next;
+
+		scheme = damon_sysfs_mk_scheme(sysfs_schemes->schemes_arr[i]);
+		if (!scheme) {
+			damon_for_each_scheme_safe(scheme,
next, ctx)
+				damon_destroy_scheme(scheme);
+			return -ENOMEM;
+		}
+		damon_add_scheme(ctx, scheme);
+	}
+	return 0;
+}
+
+static void damon_sysfs_before_terminate(struct damon_ctx *ctx)
+{
+	struct damon_target *t, *next;
+
+	if (ctx->ops.id != DAMON_OPS_VADDR)
+		return;
+
+	mutex_lock(&ctx->kdamond_lock);
+	damon_for_each_target_safe(t, next, ctx) {
+		put_pid(t->pid);
+		damon_destroy_target(t);
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+}
+
+static struct damon_ctx *damon_sysfs_build_ctx(
+		struct damon_sysfs_context *sys_ctx)
+{
+	struct damon_ctx *ctx = damon_new_ctx();
+	int err;
+
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	err = damon_select_ops(ctx, sys_ctx->ops_id);
+	if (err)
+		goto out;
+	err = damon_sysfs_set_attrs(ctx, sys_ctx->attrs);
+	if (err)
+		goto out;
+	err = damon_sysfs_set_targets(ctx, sys_ctx->targets);
+	if (err)
+		goto out;
+	err = damon_sysfs_set_schemes(ctx, sys_ctx->schemes);
+	if (err)
+		goto out;
+
+	ctx->callback.before_terminate = damon_sysfs_before_terminate;
+	return ctx;
+
+out:
+	damon_destroy_ctx(ctx);
+	return ERR_PTR(err);
+}
+
+static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond)
+{
+	struct damon_ctx *ctx;
+	int err;
+
+	if (kdamond->damon_ctx &&
+			damon_sysfs_ctx_running(kdamond->damon_ctx))
+		return -EBUSY;
+	/* TODO: support multiple contexts per kdamond */
+	if (kdamond->contexts->nr != 1)
+		return -EINVAL;
+
+	if (kdamond->damon_ctx)
+		damon_destroy_ctx(kdamond->damon_ctx);
+	kdamond->damon_ctx = NULL;
+
+	ctx = damon_sysfs_build_ctx(kdamond->contexts->contexts_arr[0]);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+	err = damon_start(&ctx,
1, false);
+	if (err) {
+		damon_destroy_ctx(ctx);
+		return err;
+	}
+	kdamond->damon_ctx = ctx;
+	return err;
+}
+
+static int damon_sysfs_turn_damon_off(struct damon_sysfs_kdamond *kdamond)
+{
+	if (!kdamond->damon_ctx)
+		return -EINVAL;
+	return damon_stop(&kdamond->damon_ctx, 1);
+	/*
+	 * To allow users to show the final monitoring results of an already
+	 * turned-off DAMON, we free kdamond->damon_ctx in the next
+	 * damon_sysfs_turn_damon_on() or nr_kdamonds_store()
+	 */
+}
+
+static int damon_sysfs_update_schemes_stats(struct damon_sysfs_kdamond *kdamond)
+{
+	struct damon_ctx *ctx = kdamond->damon_ctx;
+	struct damos *scheme;
+	int schemes_idx = 0;
+
+	if (!ctx)
+		return -EINVAL;
+	mutex_lock(&ctx->kdamond_lock);
+	damon_for_each_scheme(scheme, ctx) {
+		struct damon_sysfs_schemes *sysfs_schemes;
+		struct damon_sysfs_stats *sysfs_stats;
+
+		sysfs_schemes = kdamond->contexts->contexts_arr[0]->schemes;
+		sysfs_stats = sysfs_schemes->schemes_arr[schemes_idx++]->stats;
+		sysfs_stats->nr_tried = scheme->stat.nr_tried;
+		sysfs_stats->sz_tried = scheme->stat.sz_tried;
+		sysfs_stats->nr_applied = scheme->stat.nr_applied;
+		sysfs_stats->sz_applied = scheme->stat.sz_applied;
+		sysfs_stats->qt_exceeds = scheme->stat.qt_exceeds;
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+	return 0;
+}
+
+static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct damon_sysfs_kdamond *kdamond = container_of(kobj,
+			struct damon_sysfs_kdamond, kobj);
+	ssize_t ret;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	if (sysfs_streq(buf, "on"))
+		ret = damon_sysfs_turn_damon_on(kdamond);
+	else if (sysfs_streq(buf, "off"))
+		ret =
damon_sysfs_turn_damon_off(kdamond);
+	else if (sysfs_streq(buf, "update_schemes_stats"))
+		ret = damon_sysfs_update_schemes_stats(kdamond);
+	else
+		ret = -EINVAL;
+	mutex_unlock(&damon_sysfs_lock);
+	if (!ret)
+		ret = count;
+	return ret;
+}
+
+static ssize_t pid_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_kdamond *kdamond = container_of(kobj,
+			struct damon_sysfs_kdamond, kobj);
+	struct damon_ctx *ctx;
+	int pid;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	ctx = kdamond->damon_ctx;
+	if (!ctx) {
+		pid = -1;
+		goto out;
+	}
+	mutex_lock(&ctx->kdamond_lock);
+	if (!ctx->kdamond)
+		pid = -1;
+	else
+		pid = ctx->kdamond->pid;
+	mutex_unlock(&ctx->kdamond_lock);
+out:
+	mutex_unlock(&damon_sysfs_lock);
+	return sysfs_emit(buf, "%d\n", pid);
+}
+
+static void damon_sysfs_kdamond_release(struct kobject *kobj)
+{
+	struct damon_sysfs_kdamond *kdamond = container_of(kobj,
+			struct damon_sysfs_kdamond, kobj);
+
+	if (kdamond->damon_ctx)
+		damon_destroy_ctx(kdamond->damon_ctx);
+	kfree(kdamond);
+}
+
+static struct kobj_attribute damon_sysfs_kdamond_state_attr =
+		__ATTR_RW_MODE(state, 0600);
+
+static struct kobj_attribute damon_sysfs_kdamond_pid_attr =
+		__ATTR_RO_MODE(pid, 0400);
+
+static struct attribute *damon_sysfs_kdamond_attrs[] = {
+	&damon_sysfs_kdamond_state_attr.attr,
+	&damon_sysfs_kdamond_pid_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(damon_sysfs_kdamond);
+
+static struct kobj_type damon_sysfs_kdamond_ktype = {
+	.release = damon_sysfs_kdamond_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = damon_sysfs_kdamond_groups,
+};
+
+/*
+ *
kdamonds directory
+ */
+
+struct damon_sysfs_kdamonds {
+	struct kobject kobj;
+	struct damon_sysfs_kdamond **kdamonds_arr;
+	int nr;
+};
+
+static struct damon_sysfs_kdamonds *damon_sysfs_kdamonds_alloc(void)
+{
+	return kzalloc(sizeof(struct damon_sysfs_kdamonds), GFP_KERNEL);
+}
+
+static void damon_sysfs_kdamonds_rm_dirs(struct damon_sysfs_kdamonds *kdamonds)
+{
+	struct damon_sysfs_kdamond **kdamonds_arr = kdamonds->kdamonds_arr;
+	int i;
+
+	for (i = 0; i < kdamonds->nr; i++) {
+		damon_sysfs_kdamond_rm_dirs(kdamonds_arr[i]);
+		kobject_put(&kdamonds_arr[i]->kobj);
+	}
+	kdamonds->nr = 0;
+	kfree(kdamonds_arr);
+	kdamonds->kdamonds_arr = NULL;
+}
+
+static int damon_sysfs_nr_running_ctxs(struct damon_sysfs_kdamond **kdamonds,
+		int nr_kdamonds)
+{
+	int nr_running_ctxs = 0;
+	int i;
+
+	for (i = 0; i < nr_kdamonds; i++) {
+		struct damon_ctx *ctx = kdamonds[i]->damon_ctx;
+
+		if (!ctx)
+			continue;
+		mutex_lock(&ctx->kdamond_lock);
+		if (ctx->kdamond)
+			nr_running_ctxs++;
+		mutex_unlock(&ctx->kdamond_lock);
+	}
+	return nr_running_ctxs;
+}
+
+static int damon_sysfs_kdamonds_add_dirs(struct damon_sysfs_kdamonds *kdamonds,
+		int nr_kdamonds)
+{
+	struct damon_sysfs_kdamond **kdamonds_arr, *kdamond;
+	int err, i;
+
+	if (damon_sysfs_nr_running_ctxs(kdamonds->kdamonds_arr, kdamonds->nr))
+		return -EBUSY;
+
+	damon_sysfs_kdamonds_rm_dirs(kdamonds);
+	if (!nr_kdamonds)
+		return 0;
+
+	kdamonds_arr = kmalloc_array(nr_kdamonds, sizeof(*kdamonds_arr),
+			GFP_KERNEL | __GFP_NOWARN);
+	if (!kdamonds_arr)
+		return -ENOMEM;
+	kdamonds->kdamonds_arr = kdamonds_arr;
+
+	for (i = 0; i < nr_kdamonds; i++) {
+		kdamond =
damon_sysfs_kdamond_alloc();
+		if (!kdamond) {
+			damon_sysfs_kdamonds_rm_dirs(kdamonds);
+			return -ENOMEM;
+		}
+
+		err = kobject_init_and_add(&kdamond->kobj,
+				&damon_sysfs_kdamond_ktype, &kdamonds->kobj,
+				"%d", i);
+		if (err)
+			goto out;
+
+		err = damon_sysfs_kdamond_add_dirs(kdamond);
+		if (err)
+			goto out;
+
+		kdamonds_arr[i] = kdamond;
+		kdamonds->nr++;
+	}
+	return 0;
+
+out:
+	damon_sysfs_kdamonds_rm_dirs(kdamonds);
+	kobject_put(&kdamond->kobj);
+	return err;
+}
+
+static ssize_t nr_kdamonds_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_kdamonds *kdamonds = container_of(kobj,
+			struct damon_sysfs_kdamonds, kobj);
+
+	return sysfs_emit(buf, "%d\n", kdamonds->nr);
+}
+
+static ssize_t nr_kdamonds_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_kdamonds *kdamonds = container_of(kobj,
+			struct damon_sysfs_kdamonds, kobj);
+	int nr, err;
+
+	err = kstrtoint(buf, 0, &nr);
+	if (err)
+		return err;
+	if (nr < 0)
+		return -EINVAL;
+
+	if (!mutex_trylock(&damon_sysfs_lock))
+		return -EBUSY;
+	err = damon_sysfs_kdamonds_add_dirs(kdamonds, nr);
+	mutex_unlock(&damon_sysfs_lock);
+	if (err)
+		return err;
+
+	return count;
+}
+
+static void damon_sysfs_kdamonds_release(struct kobject *kobj)
+{
+	kfree(container_of(kobj, struct damon_sysfs_kdamonds, kobj));
+}
+
+static struct kobj_attribute damon_sysfs_kdamonds_nr_attr =
+		__ATTR_RW_MODE(nr_kdamonds, 0600);
+
+static struct attribute *damon_sysfs_kdamonds_attrs[] = {
+	&damon_sysfs_kdamonds_nr_attr.attr,
+	NULL,
+};
ATTRIBUTE_GROUPS(damon_sysfs_kdamonds); 2510 + 2511 + static struct kobj_type damon_sysfs_kdamonds_ktype = { 2512 + .release = damon_sysfs_kdamonds_release, 2513 + .sysfs_ops = &kobj_sysfs_ops, 2514 + .default_groups = damon_sysfs_kdamonds_groups, 2515 + }; 2516 + 2517 + /* 2518 + * damon user interface directory 2519 + */ 2520 + 2521 + struct damon_sysfs_ui_dir { 2522 + struct kobject kobj; 2523 + struct damon_sysfs_kdamonds *kdamonds; 2524 + }; 2525 + 2526 + static struct damon_sysfs_ui_dir *damon_sysfs_ui_dir_alloc(void) 2527 + { 2528 + return kzalloc(sizeof(struct damon_sysfs_ui_dir), GFP_KERNEL); 2529 + } 2530 + 2531 + static int damon_sysfs_ui_dir_add_dirs(struct damon_sysfs_ui_dir *ui_dir) 2532 + { 2533 + struct damon_sysfs_kdamonds *kdamonds; 2534 + int err; 2535 + 2536 + kdamonds = damon_sysfs_kdamonds_alloc(); 2537 + if (!kdamonds) 2538 + return -ENOMEM; 2539 + 2540 + err = kobject_init_and_add(&kdamonds->kobj, 2541 + &damon_sysfs_kdamonds_ktype, &ui_dir->kobj, 2542 + "kdamonds"); 2543 + if (err) { 2544 + kobject_put(&kdamonds->kobj); 2545 + return err; 2546 + } 2547 + ui_dir->kdamonds = kdamonds; 2548 + return err; 2549 + } 2550 + 2551 + static void damon_sysfs_ui_dir_release(struct kobject *kobj) 2552 + { 2553 + kfree(container_of(kobj, struct damon_sysfs_ui_dir, kobj)); 2554 + } 2555 + 2556 + static struct attribute *damon_sysfs_ui_dir_attrs[] = { 2557 + NULL, 2558 + }; 2559 + ATTRIBUTE_GROUPS(damon_sysfs_ui_dir); 2560 + 2561 + static struct kobj_type damon_sysfs_ui_dir_ktype = { 2562 + .release = damon_sysfs_ui_dir_release, 2563 + .sysfs_ops = &kobj_sysfs_ops, 2564 + .default_groups = damon_sysfs_ui_dir_groups, 2565 + }; 2566 + 2567 + static int __init damon_sysfs_init(void) 2568 + { 2569 + struct kobject *damon_sysfs_root; 2570 + struct damon_sysfs_ui_dir *admin; 2571 + int err; 2572 + 2573 + damon_sysfs_root = kobject_create_and_add("damon", mm_kobj); 2574 + if (!damon_sysfs_root) 2575 + return -ENOMEM; 2576 + 2577 + admin = 
damon_sysfs_ui_dir_alloc(); 2578 + if (!admin) { 2579 + kobject_put(damon_sysfs_root); 2580 + return -ENOMEM; 2581 + } 2582 + err = kobject_init_and_add(&admin->kobj, &damon_sysfs_ui_dir_ktype, 2583 + damon_sysfs_root, "admin"); 2584 + if (err) 2585 + goto out; 2586 + err = damon_sysfs_ui_dir_add_dirs(admin); 2587 + if (err) 2588 + goto out; 2589 + return 0; 2590 + 2591 + out: 2592 + kobject_put(&admin->kobj); 2593 + kobject_put(damon_sysfs_root); 2594 + return err; 2595 + } 2596 + subsys_initcall(damon_sysfs_init);
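
The hierarchy built above (`damon/admin/kdamonds/<K>/...`) is meant to be driven from user space. A minimal sketch of how a tool might address the per-kdamond files, assuming the paths documented in Documentation/ABI/testing/sysfs-kernel-mm-damon; the helper names `damon_kdamond_path` and `damon_write_str` are hypothetical and not part of this patch.

```c
#include <stdio.h>
#include <string.h>

/* Root of the admin interface, as created by damon_sysfs_init(). */
#define DAMON_SYSFS_ADMIN "/sys/kernel/mm/damon/admin"

/*
 * Hypothetical helper: format the path of a per-kdamond file such as
 * "state" or "pid". Returns 0 on success, -1 if buf is too small.
 */
int damon_kdamond_path(char *buf, size_t len, int k, const char *file)
{
	int n = snprintf(buf, len, DAMON_SYSFS_ADMIN "/kdamonds/%d/%s", k, file);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write a string to a sysfs file.
 * Returns 0 on success, -1 on failure (for example when not root, or
 * when the kernel side rejects the write).
 */
int damon_write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;
	ret = (fputs(val, f) == EOF) ? -1 : 0;
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

A caller would typically write "1" to `kdamonds/nr_kdamonds` first and then "on" to `kdamonds/0/state`. Note that `nr_kdamonds_store()` above returns -EBUSY when any kdamond is still running or when `damon_sysfs_lock` is contended, so a write may need to be retried.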

+4 -4

mm/damon/vaddr-test.h

···
 	struct damon_region *r;
 	int i;

-	t = damon_new_target(42);
+	t = damon_new_target();
 	for (i = 0; i < nr_regions / 2; i++) {
 		r = damon_new_region(regions[i * 2], regions[i * 2 + 1]);
 		damon_add_region(r, t);
···
 static void damon_test_split_evenly_fail(struct kunit *test,
 		unsigned long start, unsigned long end, unsigned int nr_pieces)
 {
-	struct damon_target *t = damon_new_target(42);
+	struct damon_target *t = damon_new_target();
 	struct damon_region *r = damon_new_region(start, end);

 	damon_add_region(r, t);
···
 static void damon_test_split_evenly_succ(struct kunit *test,
 		unsigned long start, unsigned long end, unsigned int nr_pieces)
 {
-	struct damon_target *t = damon_new_target(42);
+	struct damon_target *t = damon_new_target();
 	struct damon_region *r = damon_new_region(start, end);
 	unsigned long expected_width = (end - start) / nr_pieces;
 	unsigned long i = 0;
···
 };

 static struct kunit_suite damon_test_suite = {
-	.name = "damon-primitives",
+	.name = "damon-operations",
 	.test_cases = damon_test_cases,
 };
 kunit_test_suite(damon_test_suite);

+22 -21

mm/damon/vaddr.c

···
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>

-#include "prmtv-common.h"
+#include "ops-common.h"

 #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST
 #undef DAMON_MIN_REGION
···
 #endif

 /*
- * 't->id' should be the pointer to the relevant 'struct pid' having reference
+ * 't->pid' should be the pointer to the relevant 'struct pid' having reference
  * count. Caller must put the returned task, unless it is NULL.
  */
 static inline struct task_struct *damon_get_task_struct(struct damon_target *t)
 {
-	return get_pid_task((struct pid *)t->id, PIDTYPE_PID);
+	return get_pid_task(t->pid, PIDTYPE_PID);
 }

 /*
···
 	pte_t entry = huge_ptep_get(pte);
 	struct page *page = pte_page(entry);

-	if (!page)
-		return;
-
 	get_page(page);

 	if (pte_young(entry)) {
···
 		goto out;

 	page = pte_page(entry);
-	if (!page)
-		goto out;
-
 	get_page(page);

 	if (pte_young(entry) || !page_is_idle(page) ||
···
  * Functions for the target validity check and cleanup
  */

-bool damon_va_target_valid(void *target)
+static bool damon_va_target_valid(void *target)
 {
 	struct damon_target *t = target;
 	struct task_struct *task;
···
 	return DAMOS_MAX_SCORE;
 }

-void damon_va_set_primitives(struct damon_ctx *ctx)
+static int __init damon_va_initcall(void)
 {
-	ctx->primitive.init = damon_va_init;
-	ctx->primitive.update = damon_va_update;
-	ctx->primitive.prepare_access_checks = damon_va_prepare_access_checks;
-	ctx->primitive.check_accesses = damon_va_check_accesses;
-	ctx->primitive.reset_aggregated = NULL;
-	ctx->primitive.target_valid = damon_va_target_valid;
-	ctx->primitive.cleanup = NULL;
-	ctx->primitive.apply_scheme = damon_va_apply_scheme;
-	ctx->primitive.get_scheme_score = damon_va_scheme_score;
-}
+	struct damon_operations ops = {
+		.id = DAMON_OPS_VADDR,
+		.init = damon_va_init,
+		.update = damon_va_update,
+		.prepare_access_checks = damon_va_prepare_access_checks,
+		.check_accesses = damon_va_check_accesses,
+		.reset_aggregated = NULL,
+		.target_valid = damon_va_target_valid,
+		.cleanup = NULL,
+		.apply_scheme = damon_va_apply_scheme,
+		.get_scheme_score = damon_va_scheme_score,
+	};
+
+	return damon_register_ops(&ops);
+};
+
+subsys_initcall(damon_va_initcall);

 #include "vaddr-test.h"
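
The hunk above replaces per-context callback assignment with a one-time registration of an operations set from an initcall. Because `damon_va_initcall()` passes a stack-allocated struct, the registry presumably stores a copy. The following user-space sketch illustrates that pattern only; the types and names (`struct ops`, `register_ops`, `va_initcall`) are simplified stand-ins, not the kernel's `damon_register_ops()` implementation, and the error codes are assumptions.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; all names are illustrative. */
enum ops_id { OPS_VADDR, OPS_PADDR, NR_OPS };

struct ops {
	enum ops_id id;
	int (*check_accesses)(void);	/* one callback stands in for the set */
};

/* Registry indexed by id; a NULL callback marks a free slot. */
static struct ops registered_ops[NR_OPS];

/*
 * Copy *ops into the registry. Storing a copy (not the pointer) is what
 * allows the caller to pass a struct that lives on its own stack.
 */
int register_ops(const struct ops *ops)
{
	if (ops->id >= NR_OPS || !ops->check_accesses)
		return -EINVAL;
	if (registered_ops[ops->id].check_accesses)
		return -EBUSY;	/* assumed policy: refuse double registration */
	registered_ops[ops->id] = *ops;
	return 0;
}

static int va_check_accesses(void)
{
	return 0;
}

/* Mirrors the shape of damon_va_initcall(): build on the stack, register. */
int va_initcall(void)
{
	struct ops ops = {
		.id = OPS_VADDR,
		.check_accesses = va_check_accesses,
	};

	return register_ops(&ops);
}
```

With a registry like this, a context can select its operations set by id (for example from the `operations` sysfs file) instead of calling a setter exported by each implementation, which is why `damon_va_target_valid()` can become `static` in the same hunk.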

+1

mm/early_ioremap.c

···
 #include <linux/vmalloc.h>
 #include <asm/fixmap.h>
 #include <asm/early_ioremap.h>
+#include "internal.h"

 #ifdef CONFIG_MMU
 static int early_ioremap_debug __initdata;

+2 -3

mm/fadvise.c

···
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!inode_write_congested(mapping->host))
-			__filemap_fdatawrite_range(mapping, offset, endbyte,
-						   WB_SYNC_NONE);
+		__filemap_fdatawrite_range(mapping, offset, endbyte,
+					   WB_SYNC_NONE);

 		/*
 		 * First and last FULL page! Partial pages are deliberately

+12 -5

mm/filemap.c

···
 		init_waitqueue_head(&folio_wait_table[i]);

 	page_writeback_init();
+
+	/*
+	 * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date,
+	 * and splice's page_cache_pipe_buf_confirm() needs to see that.
+	 */
+	SetPageUptodate(ZERO_PAGE(0));
 }

 /*
···
  * @nr_pages:	The maximum number of pages
  * @pages:	Where the resulting pages are placed
  *
- * find_get_pages_contig() works exactly like find_get_pages(), except
- * that the returned number of pages are guaranteed to be contiguous.
+ * find_get_pages_contig() works exactly like find_get_pages_range(),
+ * except that the returned number of pages are guaranteed to be
+ * contiguous.
  *
  * Return: the number of pages which were found.
  */
···
  * @nr_pages:	the maximum number of pages
  * @pages:	where the resulting pages are placed
  *
- * Like find_get_pages(), except we only return head pages which are tagged
- * with @tag. @index is updated to the index immediately after the last
- * page we return, ready for the next iteration.
+ * Like find_get_pages_range(), except we only return head pages which are
+ * tagged with @tag. @index is updated to the index immediately after the
+ * last page we return, ready for the next iteration.
  *
  * Return: the number of pages which were found.
  */

+8 -93

mm/gup.c

···
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
-	/* No page to get reference */
-	if (flags & FOLL_GET)
-		return -EFAULT;
-
 	if (flags & FOLL_TOUCH) {
 		pte_t entry = *pte;

···
 		} else if (PTR_ERR(page) == -EEXIST) {
 			/*
 			 * Proper page table entry exists, but no corresponding
-			 * struct page.
+			 * struct page. If the caller expects **pages to be
+			 * filled in, bail out now, because that can't be done
+			 * for this page.
 			 */
+			if (pages) {
+				ret = PTR_ERR(page);
+				goto out;
+			}
+
 			goto next_page;
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
···
 				       pages, vmas, gup_flags | FOLL_TOUCH);
 }
 EXPORT_SYMBOL(get_user_pages);
-
-/**
- * get_user_pages_locked() - variant of get_user_pages()
- *
- * @start:      starting user address
- * @nr_pages:   number of pages from start to pin
- * @gup_flags:  flags modifying lookup behaviour
- * @pages:      array that receives pointers to the pages pinned.
- *              Should be at least nr_pages long. Or NULL, if caller
- *              only intends to ensure the pages are faulted in.
- * @locked:     pointer to lock flag indicating whether lock is held and
- *              subsequently whether VM_FAULT_RETRY functionality can be
- *              utilised. Lock must initially be held.
- *
- * It is suitable to replace the form:
- *
- *      mmap_read_lock(mm);
- *      do_something()
- *      get_user_pages(mm, ..., pages, NULL);
- *      mmap_read_unlock(mm);
- *
- *  to:
- *
- *      int locked = 1;
- *      mmap_read_lock(mm);
- *      do_something()
- *      get_user_pages_locked(mm, ..., pages, &locked);
- *      if (locked)
- *          mmap_read_unlock(mm);
- *
- * We can leverage the VM_FAULT_RETRY functionality in the page fault
- * paths better by using either get_user_pages_locked() or
- * get_user_pages_unlocked().
- *
- */
-long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
-			   unsigned int gup_flags, struct page **pages,
-			   int *locked)
-{
-	/*
-	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. As there are no users of this flag in this call we simply
-	 * disallow this option for now.
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-		return -EINVAL;
-	/*
-	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
-	 * never directly by the caller, so enforce that:
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
-		return -EINVAL;
-
-	return __get_user_pages_locked(current->mm, start, nr_pages,
-				       pages, NULL, locked,
-				       gup_flags | FOLL_TOUCH);
-}
-EXPORT_SYMBOL(get_user_pages_locked);

 /*
  * get_user_pages_unlocked() is suitable to replace the form:
···
 	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages_unlocked);
-
-/*
- * pin_user_pages_locked() is the FOLL_PIN variant of get_user_pages_locked().
- * Behavior is the same, except that this one sets FOLL_PIN and rejects
- * FOLL_GET.
- */
-long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
-			   unsigned int gup_flags, struct page **pages,
-			   int *locked)
-{
-	/*
-	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. As there are no users of this flag in this call we simply
-	 * disallow this option for now.
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-		return -EINVAL;
-
-	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
-	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
-		return -EINVAL;
-
-	gup_flags |= FOLL_PIN;
-	return __get_user_pages_locked(current->mm, start, nr_pages,
-				       pages, NULL, locked,
-				       gup_flags | FOLL_TOUCH);
-}
-EXPORT_SYMBOL(pin_user_pages_locked);

+4 -5

mm/highmem.c

···
 		list_for_each_entry(pam, &pas->lh, list) {
 			if (pam->page == page) {
 				ret = pam->virtual;
-				goto done;
+				break;
 			}
 		}
 	}
-done:
+
 	spin_unlock_irqrestore(&pas->lock, flags);
 	return ret;
 }
···
 		list_for_each_entry(pam, &pas->lh, list) {
 			if (pam->page == page) {
 				list_del(&pam->list);
-				spin_unlock_irqrestore(&pas->lock, flags);
-				goto done;
+				break;
 			}
 		}
 		spin_unlock_irqrestore(&pas->lock, flags);
 	}
-done:
+
 	return;
 }


+1 -2

mm/hmm.c

···
 	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long addr = start;
 	pud_t pud;
-	int ret = 0;
 	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);

 	if (!ptl)
···

 out_unlock:
 	spin_unlock(ptl);
-	return ret;
+	return 0;
 }
 #else
 #define hmm_vma_walk_pud	NULL

+26 -15

mm/huge_memory.c

···
 #include <linux/oom.h>
 #include <linux/numa.h>
 #include <linux/page_owner.h>
+#include <linux/sched/sysctl.h>

 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
···
 	}
 #endif

-	/*
-	 * Avoid trapping faults against the zero page. The read-only
-	 * data is likely to be read-cached on the local CPU and
-	 * local/remote hits to the zero page are not interesting.
-	 */
-	if (prot_numa && is_huge_zero_pmd(*pmd))
-		goto unlock;
+	if (prot_numa) {
+		struct page *page;
+		/*
+		 * Avoid trapping faults against the zero page. The read-only
+		 * data is likely to be read-cached on the local CPU and
+		 * local/remote hits to the zero page are not interesting.
+		 */
+		if (is_huge_zero_pmd(*pmd))
+			goto unlock;

-	if (prot_numa && pmd_protnone(*pmd))
-		goto unlock;
+		if (pmd_protnone(*pmd))
+			goto unlock;

+		page = pmd_page(*pmd);
+		/*
+		 * Skip scanning top tier node if normal numa
+		 * balancing is disabled
+		 */
+		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+		    node_is_toptier(page_to_nid(page)))
+			goto unlock;
+	}
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
···
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
 		uffd_wp = pmd_uffd_wp(old_pmd);
+		VM_BUG_ON_PAGE(!page_count(page), page);
+		page_ref_add(page, HPAGE_PMD_NR - 1);
 	}
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	page_ref_add(page, HPAGE_PMD_NR - 1);

 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.
···
 	 */
 	for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
 		struct vm_area_struct *vma = find_vma(mm, addr);
-		unsigned int follflags;
 		struct page *page;

 		if (!vma || addr < vma->vm_start)
···
 		}

 		/* FOLL_DUMP to ignore special (like zero) pages */
-		follflags = FOLL_GET | FOLL_DUMP;
-		page = follow_page(vma, addr, follflags);
+		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

 		if (IS_ERR(page))
 			continue;
···
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));

-	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
 		page_add_anon_rmap(new, vma, mmun_start, true);
 	else
···
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
 	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
 		mlock_vma_page(new);
+
+	/* No need to invalidate - it was non-present before */
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif

+13 -10

mm/hugetlb.c

···
 #include <linux/llist.h>
 #include <linux/cma.h>
 #include <linux/migrate.h>
+#include <linux/nospec.h>

 #include <asm/page.h>
 #include <asm/pgalloc.h>
···

 	return page_head[1].compound_dtor == HUGETLB_PAGE_DTOR;
 }
+EXPORT_SYMBOL_GPL(PageHeadHuge);

 /*
  * Find and lock address space (mapping) in write mode.
···
 	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)

 #define HSTATE_ATTR(_name) \
-	static struct kobj_attribute _name##_attr = \
-		__ATTR(_name, 0644, _name##_show, _name##_store)
+	static struct kobj_attribute _name##_attr = __ATTR_RW(_name)

 static struct kobject *hugepages_kobj;
 static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
···
 			}
 			if (tmp >= nr_online_nodes)
 				goto invalid;
-			node = tmp;
+			node = array_index_nospec(tmp, nr_online_nodes);
 			p += count + 1;
 			/* Parse hugepages */
 			if (sscanf(p, "%lu%n", &tmp, &count) != 1)
···
 					   vma->vm_page_prot));
 	}
 	entry = pte_mkyoung(entry);
-	entry = pte_mkhuge(entry);
 	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);

 	return entry;
···
 						  pgoff_t idx,
 						  unsigned int flags,
 						  unsigned long haddr,
+						  unsigned long addr,
 						  unsigned long reason)
 {
 	vm_fault_t ret;
···
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = haddr,
+		.real_address = addr,
 		.flags = flags,

 		/*
···
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr,
+						       flags, haddr, address,
 						       VM_UFFD_MISSING);
 			goto out;
 		}
···
 			unlock_page(page);
 			put_page(page);
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr,
+						       flags, haddr, address,
 						       VM_UFFD_MINOR);
 			goto out;
 		}
···
 			*pagep = NULL;
 			goto out;
 		}
-		folio_copy(page_folio(page), page_folio(*pagep));
+		copy_user_huge_page(page, *pagep, dst_addr, dst_vma,
+				    pages_per_huge_page(h));
 		put_page(*pagep);
 		*pagep = NULL;
 	}
···
 			unsigned int shift = huge_page_shift(hstate_vma(vma));

 			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
-			pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
+			pte = huge_pte_modify(old_pte, newprot);
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
···
 			break;

 		if (s[count] == ':') {
-			nid = tmp;
-			if (nid < 0 || nid >= MAX_NUMNODES)
+			if (tmp >= MAX_NUMNODES)
 				break;
+			nid = array_index_nospec(tmp, MAX_NUMNODES);

 			s += count + 1;
 			tmp = memparse(s, &s);
+37 -31

mm/hugetlb_vmemmap.c

··· 124 124 * page of page structs (page 0) associated with the HugeTLB page contains the 4 125 125 * page structs necessary to describe the HugeTLB. The only use of the remaining 126 126 * pages of page structs (page 1 to page 7) is to point to page->compound_head. 127 - * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs 127 + * Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs 128 128 * will be used for each HugeTLB page. This will allow us to free the remaining 129 - * 6 pages to the buddy allocator. 129 + * 7 pages to the buddy allocator. 130 130 * 131 131 * Here is how things look after remapping. 132 132 * ··· 134 134 * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ 135 135 * | | | 0 | -------------> | 0 | 136 136 * | | +-----------+ +-----------+ 137 - * | | | 1 | -------------> | 1 | 138 - * | | +-----------+ +-----------+ 139 - * | | | 2 | ----------------^ ^ ^ ^ ^ ^ 140 - * | | +-----------+ | | | | | 141 - * | | | 3 | ------------------+ | | | | 142 - * | | +-----------+ | | | | 143 - * | | | 4 | --------------------+ | | | 144 - * | PMD | +-----------+ | | | 145 - * | level | | 5 | ----------------------+ | | 146 - * | mapping | +-----------+ | | 147 - * | | | 6 | ------------------------+ | 148 - * | | +-----------+ | 149 - * | | | 7 | --------------------------+ 137 + * | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ 138 + * | | +-----------+ | | | | | | 139 + * | | | 2 | -----------------+ | | | | | 140 + * | | +-----------+ | | | | | 141 + * | | | 3 | -------------------+ | | | | 142 + * | | +-----------+ | | | | 143 + * | | | 4 | ---------------------+ | | | 144 + * | PMD | +-----------+ | | | 145 + * | level | | 5 | -----------------------+ | | 146 + * | mapping | +-----------+ | | 147 + * | | | 6 | -------------------------+ | 148 + * | | +-----------+ | 149 + * | | | 7 | ---------------------------+ 150 150 * | | +-----------+ 151 151 * | | 152 152 * | | 153 153 * | | 154 154 
* +-----------+ 155 155 * 156 - * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for 156 + * When a HugeTLB is freed to the buddy system, we should allocate 7 pages for 157 157 * vmemmap pages and restore the previous mapping relationship. 158 158 * 159 159 * For the HugeTLB page of the pud level mapping. It is similar to the former. 160 - * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages. 160 + * We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages. 161 161 * 162 162 * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures 163 163 * (e.g. aarch64) provides a contiguous bit in the translation table entries ··· 166 166 * 167 167 * The contiguous bit is used to increase the mapping size at the pmd and pte 168 168 * (last) level. So this type of HugeTLB page can be optimized only when its 169 - * size of the struct page structs is greater than 2 pages. 169 + * size of the struct page structs is greater than 1 page. 170 + * 171 + * Notice: The head vmemmap page is not freed to the buddy allocator and all 172 + * tail vmemmap pages are mapped to the head vmemmap page frame. So we can see 173 + * more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page) 174 + * associated with each HugeTLB page. The compound_head() can handle this 175 + * correctly (more details refer to the comment above compound_head()). 170 176 */ 171 177 #define pr_fmt(fmt) "HugeTLB: " fmt 172 178 ··· 181 175 /* 182 176 * There are a lot of struct page structures associated with each HugeTLB page. 183 177 * For tail pages, the value of compound_head is the same. So we can reuse first 184 - * page of tail page structures. We map the virtual addresses of the remaining 185 - * pages of tail page structures to the first tail page struct, and then free 186 - * these page frames. Therefore, we need to reserve two pages as vmemmap areas. 178 + * page of head page structures. 
We map the virtual addresses of all the pages 179 + * of tail page structures to the head page struct, and then free these page 180 + * frames. Therefore, we need to reserve one pages as vmemmap areas. 187 181 */ 188 - #define RESERVE_VMEMMAP_NR 2U 182 + #define RESERVE_VMEMMAP_NR 1U 189 183 #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) 190 184 191 - bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON); 185 + DEFINE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON, 186 + hugetlb_free_vmemmap_enabled_key); 187 + EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled_key); 192 188 193 189 static int __init early_hugetlb_free_vmemmap_param(char *buf) 194 190 { 195 191 /* We cannot optimize if a "struct page" crosses page boundaries. */ 196 - if ((!is_power_of_2(sizeof(struct page)))) { 192 + if (!is_power_of_2(sizeof(struct page))) { 197 193 pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n"); 198 194 return 0; 199 195 } ··· 204 196 return -EINVAL; 205 197 206 198 if (!strcmp(buf, "on")) 207 - hugetlb_free_vmemmap_enabled = true; 199 + static_branch_enable(&hugetlb_free_vmemmap_enabled_key); 208 200 else if (!strcmp(buf, "off")) 209 - hugetlb_free_vmemmap_enabled = false; 201 + static_branch_disable(&hugetlb_free_vmemmap_enabled_key); 210 202 else 211 203 return -EINVAL; 212 204 ··· 244 236 */ 245 237 ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse, 246 238 GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE); 247 - 248 239 if (!ret) 249 240 ClearHPageVmemmapOptimized(head); 250 241 ··· 284 277 BUILD_BUG_ON(__NR_USED_SUBPAGE >= 285 278 RESERVE_VMEMMAP_SIZE / sizeof(struct page)); 286 279 287 - if (!hugetlb_free_vmemmap_enabled) 280 + if (!hugetlb_free_vmemmap_enabled()) 288 281 return; 289 282 290 283 vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT; 291 284 /* 292 - * The head page and the first tail page are not to be freed to buddy 293 - * allocator, 
the other pages will map to the first tail page, so they 294 - * can be freed. 285 + * The head page is not to be freed to buddy allocator, the other tail 286 + * pages will map to the head page, so they can be freed. 295 287 * 296 288 * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true 297 289 * on some architectures (e.g. aarch64). See Documentation/arm64/
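
The effect of lowering `RESERVE_VMEMMAP_NR` from 2 to 1 can be checked with the same arithmetic the code uses, `vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT`. A small worked sketch, assuming 4 KiB base pages and a 64-byte `struct page` (both are assumptions about the configuration, not stated in the patch); `free_vmemmap_pages` is a hypothetical helper name.

```c
#define PAGE_SHIFT		12	/* assumed: 4 KiB base pages */
#define STRUCT_PAGE_SIZE	64UL	/* assumed: sizeof(struct page) */
#define RESERVE_VMEMMAP_NR	1UL	/* value after this patch */

/*
 * Number of vmemmap pages that could be freed for one HugeTLB page of
 * the given size in bytes.
 */
unsigned long free_vmemmap_pages(unsigned long hugepage_size)
{
	unsigned long nr_pages = hugepage_size >> PAGE_SHIFT;
	unsigned long vmemmap_pages =
		(nr_pages * STRUCT_PAGE_SIZE) >> PAGE_SHIFT;

	/* Nothing to free if the struct pages fit in the reserved area. */
	if (vmemmap_pages <= RESERVE_VMEMMAP_NR)
		return 0;
	return vmemmap_pages - RESERVE_VMEMMAP_NR;
}
```

Under these assumptions a 2 MiB HugeTLB page has 512 struct pages occupying 8 vmemmap pages, so 7 can be freed (6 before the patch), and a 1 GiB page has 4096 vmemmap pages, of which 4095 can be freed, matching the "(PAGE_SIZE - 1)" wording in the comment. A 64 KiB contiguous-PTE huge page needs less than one vmemmap page and gains nothing.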

+4 -3

mm/hwpoison-inject.c

···

 	shake_page(hpage);
 	/*
-	 * This implies unable to support non-LRU pages.
+	 * This implies unable to support non-LRU pages except free page.
 	 */
-	if (!PageLRU(hpage) && !PageHuge(p))
+	if (!PageLRU(hpage) && !PageHuge(p) && !is_free_buddy_page(p))
 		return 0;

 	/*
···

 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, 0);
+	err = memory_failure(pfn, 0);
+	return (err == -EOPNOTSUPP) ? 0 : err;
 }

 static int hwpoison_unpoison(void *data, u64 val)

+8 -11

mm/internal.h

···
 #define MAX_RECLAIM_RETRIES 16

 /*
+ * in mm/early_ioremap.c
+ */
+pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
+					unsigned long size, pgprot_t prot);
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
···
 }
 #endif /* CONFIG_DEBUG_MEMORY_INIT */

-/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
-#if defined(CONFIG_SPARSEMEM)
-extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
-				unsigned long *end_pfn);
-#else
-static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
-				unsigned long *end_pfn)
-{
-}
-#endif /* CONFIG_SPARSEMEM */
-
 #define NODE_RECLAIM_NOSCAN	-2
 #define NODE_RECLAIM_FULL	-1
 #define NODE_RECLAIM_SOME	0
···

 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
+
+DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);

 #endif	/* __MM_INTERNAL_H */

+1 -1

mm/kfence/Makefile

···
 # SPDX-License-Identifier: GPL-2.0

-obj-$(CONFIG_KFENCE) := core.o report.o
+obj-y := core.o report.o

 CFLAGS_kfence_test.o := -g -fno-omit-frame-pointer -fno-optimize-sibling-calls
 obj-$(CONFIG_KFENCE_KUNIT_TEST) += kfence_test.o

+118 -23

mm/kfence/core.c

··· 38 38 #define KFENCE_WARN_ON(cond) \ 39 39 ({ \ 40 40 const bool __cond = WARN_ON(cond); \ 41 - if (unlikely(__cond)) \ 41 + if (unlikely(__cond)) { \ 42 42 WRITE_ONCE(kfence_enabled, false); \ 43 + disabled_by_warn = true; \ 44 + } \ 43 45 __cond; \ 44 46 }) 45 47 46 48 /* === Data ================================================================= */ 47 49 48 50 static bool kfence_enabled __read_mostly; 51 + static bool disabled_by_warn __read_mostly; 49 52 50 53 unsigned long kfence_sample_interval __read_mostly = CONFIG_KFENCE_SAMPLE_INTERVAL; 51 54 EXPORT_SYMBOL_GPL(kfence_sample_interval); /* Export for test modules. */ ··· 58 55 #endif 59 56 #define MODULE_PARAM_PREFIX "kfence." 60 57 58 + static int kfence_enable_late(void); 61 59 static int param_set_sample_interval(const char *val, const struct kernel_param *kp) 62 60 { 63 61 unsigned long num; ··· 69 65 70 66 if (!num) /* Using 0 to indicate KFENCE is disabled. */ 71 67 WRITE_ONCE(kfence_enabled, false); 72 - else if (!READ_ONCE(kfence_enabled) && system_state != SYSTEM_BOOTING) 73 - return -EINVAL; /* Cannot (re-)enable KFENCE on-the-fly. */ 74 68 75 69 *((unsigned long *)kp->arg) = num; 70 + 71 + if (num && !READ_ONCE(kfence_enabled) && system_state != SYSTEM_BOOTING) 72 + return disabled_by_warn ? -EINVAL : kfence_enable_late(); 76 73 return 0; 77 74 } 78 75 ··· 95 90 static unsigned long kfence_skip_covered_thresh __read_mostly = 75; 96 91 module_param_named(skip_covered_thresh, kfence_skip_covered_thresh, ulong, 0644); 97 92 93 + /* If true, use a deferrable timer. */ 94 + static bool kfence_deferrable __read_mostly = IS_ENABLED(CONFIG_KFENCE_DEFERRABLE); 95 + module_param_named(deferrable, kfence_deferrable, bool, 0444); 96 + 98 97 /* The pool of pages used for guard pages and objects. */ 99 - char *__kfence_pool __ro_after_init; 98 + char *__kfence_pool __read_mostly; 100 99 EXPORT_SYMBOL(__kfence_pool); /* Export for test modules. 
*/ 101 100 102 101 /* ··· 541 532 kfence_guarded_free((void *)meta->addr, meta, false); 542 533 } 543 534 544 - static bool __init kfence_init_pool(void) 535 + /* 536 + * Initialization of the KFENCE pool after its allocation. 537 + * Returns 0 on success; otherwise returns the address up to 538 + * which partial initialization succeeded. 539 + */ 540 + static unsigned long kfence_init_pool(void) 545 541 { 546 542 unsigned long addr = (unsigned long)__kfence_pool; 547 543 struct page *pages; 548 544 int i; 549 545 550 - if (!__kfence_pool) 551 - return false; 552 - 553 546 if (!arch_kfence_init_pool()) 554 - goto err; 547 + return addr; 555 548 556 549 pages = virt_to_page(addr); 557 550 ··· 571 560 572 561 /* Verify we do not have a compound head page. */ 573 562 if (WARN_ON(compound_head(&pages[i]) != &pages[i])) 574 - goto err; 563 + return addr; 575 564 576 565 __SetPageSlab(&pages[i]); 577 566 } ··· 584 573 */ 585 574 for (i = 0; i < 2; i++) { 586 575 if (unlikely(!kfence_protect(addr))) 587 - goto err; 576 + return addr; 588 577 589 578 addr += PAGE_SIZE; 590 579 } ··· 601 590 602 591 /* Protect the right redzone. */ 603 592 if (unlikely(!kfence_protect(addr + PAGE_SIZE))) 604 - goto err; 593 + return addr; 605 594 606 595 addr += 2 * PAGE_SIZE; 607 596 } ··· 614 603 */ 615 604 kmemleak_free(__kfence_pool); 616 605 617 - return true; 606 + return 0; 607 + } 618 608 619 - err: 609 + static bool __init kfence_init_pool_early(void) 610 + { 611 + unsigned long addr; 612 + 613 + if (!__kfence_pool) 614 + return false; 615 + 616 + addr = kfence_init_pool(); 617 + 618 + if (!addr) 619 + return true; 620 + 620 621 /* 621 622 * Only release unprotected pages, and do not try to go back and change 622 623 * page attributes due to risk of failing to do so as well. If changing ··· 637 614 * most failure cases. 
638 615 */ 639 616 memblock_free_late(__pa(addr), KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool)); 617 + __kfence_pool = NULL; 618 + return false; 619 + } 620 + 621 + static bool kfence_init_pool_late(void) 622 + { 623 + unsigned long addr, free_size; 624 + 625 + addr = kfence_init_pool(); 626 + 627 + if (!addr) 628 + return true; 629 + 630 + /* Same as above. */ 631 + free_size = KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool); 632 + #ifdef CONFIG_CONTIG_ALLOC 633 + free_contig_range(page_to_pfn(virt_to_page(addr)), free_size / PAGE_SIZE); 634 + #else 635 + free_pages_exact((void *)addr, free_size); 636 + #endif 640 637 __kfence_pool = NULL; 641 638 return false; 642 639 } ··· 744 701 745 702 /* === Allocation Gate Timer ================================================ */ 746 703 704 + static struct delayed_work kfence_timer; 705 + 747 706 #ifdef CONFIG_KFENCE_STATIC_KEYS 748 707 /* Wait queue to wake up allocation-gate timer task. */ 749 708 static DECLARE_WAIT_QUEUE_HEAD(allocation_wait); ··· 768 723 * avoids IPIs, at the cost of not immediately capturing allocations if the 769 724 * instructions remain cached. 
770 725 */ 771 - static struct delayed_work kfence_timer; 772 726 static void toggle_allocation_gate(struct work_struct *work) 773 727 { 774 728 if (!READ_ONCE(kfence_enabled)) ··· 795 751 queue_delayed_work(system_unbound_wq, &kfence_timer, 796 752 msecs_to_jiffies(kfence_sample_interval)); 797 753 } 798 - static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate); 799 754 800 755 /* === Public interface ===================================================== */ 801 756 ··· 809 766 pr_err("failed to allocate pool\n"); 810 767 } 811 768 769 + static void kfence_init_enable(void) 770 + { 771 + if (!IS_ENABLED(CONFIG_KFENCE_STATIC_KEYS)) 772 + static_branch_enable(&kfence_allocation_key); 773 + 774 + if (kfence_deferrable) 775 + INIT_DEFERRABLE_WORK(&kfence_timer, toggle_allocation_gate); 776 + else 777 + INIT_DELAYED_WORK(&kfence_timer, toggle_allocation_gate); 778 + 779 + WRITE_ONCE(kfence_enabled, true); 780 + queue_delayed_work(system_unbound_wq, &kfence_timer, 0); 781 + 782 + pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE, 783 + CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool, 784 + (void *)(__kfence_pool + KFENCE_POOL_SIZE)); 785 + } 786 + 812 787 void __init kfence_init(void) 813 788 { 789 + stack_hash_seed = (u32)random_get_entropy(); 790 + 814 791 /* Setting kfence_sample_interval to 0 on boot disables KFENCE. 
*/ 815 792 if (!kfence_sample_interval) 816 793 return; 817 794 818 - stack_hash_seed = (u32)random_get_entropy(); 819 - if (!kfence_init_pool()) { 795 + if (!kfence_init_pool_early()) { 820 796 pr_err("%s failed\n", __func__); 821 797 return; 822 798 } 823 799 824 - if (!IS_ENABLED(CONFIG_KFENCE_STATIC_KEYS)) 825 - static_branch_enable(&kfence_allocation_key); 800 + kfence_init_enable(); 801 + } 802 + 803 + static int kfence_init_late(void) 804 + { 805 + const unsigned long nr_pages = KFENCE_POOL_SIZE / PAGE_SIZE; 806 + #ifdef CONFIG_CONTIG_ALLOC 807 + struct page *pages; 808 + 809 + pages = alloc_contig_pages(nr_pages, GFP_KERNEL, first_online_node, NULL); 810 + if (!pages) 811 + return -ENOMEM; 812 + __kfence_pool = page_to_virt(pages); 813 + #else 814 + if (nr_pages > MAX_ORDER_NR_PAGES) { 815 + pr_warn("KFENCE_NUM_OBJECTS too large for buddy allocator\n"); 816 + return -EINVAL; 817 + } 818 + __kfence_pool = alloc_pages_exact(KFENCE_POOL_SIZE, GFP_KERNEL); 819 + if (!__kfence_pool) 820 + return -ENOMEM; 821 + #endif 822 + 823 + if (!kfence_init_pool_late()) { 824 + pr_err("%s failed\n", __func__); 825 + return -EBUSY; 826 + } 827 + 828 + kfence_init_enable(); 829 + return 0; 830 + } 831 + 832 + static int kfence_enable_late(void) 833 + { 834 + if (!__kfence_pool) 835 + return kfence_init_late(); 836 + 826 837 WRITE_ONCE(kfence_enabled, true); 827 838 queue_delayed_work(system_unbound_wq, &kfence_timer, 0); 828 - pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE, 829 - CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool, 830 - (void *)(__kfence_pool + KFENCE_POOL_SIZE)); 839 + return 0; 831 840 } 832 841 833 842 void kfence_shutdown_cache(struct kmem_cache *s)
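Note on the hunks above: the reworked `param_set_sample_interval()` stores the new interval first and only then decides whether KFENCE can be switched back on. A minimal userspace sketch of that decision follows. The `kfence_model` struct and every `model_*` name are hypothetical stand-ins for the kernel state, and the late-allocation path is reduced to a single success flag (the real `kfence_init_late()` can also return `-EINVAL` or `-EBUSY`).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the state the setter consults. */
struct kfence_model {
	bool enabled;           /* models kfence_enabled */
	bool disabled_by_warn;  /* models disabled_by_warn */
	bool booting;           /* models system_state == SYSTEM_BOOTING */
	bool pool_allocated;    /* models __kfence_pool != NULL */
	bool late_alloc_ok;     /* assumption: would a late pool allocation succeed? */
	unsigned long sample_interval;
};

/* Models kfence_enable_late(): allocate the pool only if it is missing. */
static int model_enable_late(struct kfence_model *k)
{
	if (!k->pool_allocated) {
		if (!k->late_alloc_ok)
			return -ENOMEM; /* only one of the real failure codes */
		k->pool_allocated = true;
	}
	k->enabled = true;
	return 0;
}

/* Models the new param_set_sample_interval() after parsing succeeded. */
static int model_set_sample_interval(struct kfence_model *k, unsigned long num)
{
	if (!num) /* 0 still means "disable" */
		k->enabled = false;

	k->sample_interval = num; /* stored before the re-enable decision */

	if (num && !k->enabled && !k->booting)
		return k->disabled_by_warn ? -EINVAL : model_enable_late(k);
	return 0;
}
```

In this model, a disable caused by `KFENCE_WARN_ON()` is sticky: once `disabled_by_warn` is set, writing a non-zero interval returns `-EINVAL`, whereas a disable caused by writing 0 can be undone.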

+2 -1

mm/kfence/kfence_test.c

···
623 623 			break;
624 624 		test_free(buf2);
625 625
626 - 		if (i == CONFIG_KFENCE_NUM_OBJECTS) {
626 + 		if (kthread_should_stop() || (i == CONFIG_KFENCE_NUM_OBJECTS)) {
627 627 			kunit_warn(test, "giving up ... cannot get same object back\n");
628 628 			return;
629 629 		}
630 + 		cond_resched();
630 631 	}
631 632
632 633 	for (i = 0; i < size; i++)

+4 -2

mm/ksm.c

···
2595 2595 		SetPageDirty(new_page);
2596 2596 		__SetPageUptodate(new_page);
2597 2597 		__SetPageLocked(new_page);
2598 + #ifdef CONFIG_SWAP
2599 + 		count_vm_event(KSM_SWPIN_COPY);
2600 + #endif
2598 2601 	}
2599 2602
2600 2603 	return new_page;
···
2829 2826 #define KSM_ATTR_RO(_name) \
2830 2827 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
2831 2828 #define KSM_ATTR(_name) \
2832 - 	static struct kobj_attribute _name##_attr = \
2833 - 		__ATTR(_name, 0644, _name##_show, _name##_store)
2829 + 	static struct kobj_attribute _name##_attr = __ATTR_RW(_name)
2834 2830
2835 2831 static ssize_t sleep_millisecs_show(struct kobject *kobj,
2836 2832 				    struct kobj_attribute *attr, char *buf)

+205 -223

mm/list_lru.c

··· 13 13 #include <linux/mutex.h> 14 14 #include <linux/memcontrol.h> 15 15 #include "slab.h" 16 + #include "internal.h" 16 17 17 18 #ifdef CONFIG_MEMCG_KMEM 18 19 static LIST_HEAD(memcg_list_lrus); ··· 50 49 } 51 50 52 51 static inline struct list_lru_one * 53 - list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx) 52 + list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx) 54 53 { 55 - struct list_lru_memcg *memcg_lrus; 56 - /* 57 - * Either lock or RCU protects the array of per cgroup lists 58 - * from relocation (see memcg_update_list_lru_node). 59 - */ 60 - memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, 61 - lockdep_is_held(&nlru->lock)); 62 - if (memcg_lrus && idx >= 0) 63 - return memcg_lrus->lru[idx]; 64 - return &nlru->lru; 54 + if (list_lru_memcg_aware(lru) && idx >= 0) { 55 + struct list_lru_memcg *mlru = xa_load(&lru->xa, idx); 56 + 57 + return mlru ? &mlru->node[nid] : NULL; 58 + } 59 + return &lru->node[nid].lru; 65 60 } 66 61 67 62 static inline struct list_lru_one * 68 - list_lru_from_kmem(struct list_lru_node *nlru, void *ptr, 63 + list_lru_from_kmem(struct list_lru *lru, int nid, void *ptr, 69 64 struct mem_cgroup **memcg_ptr) 70 65 { 66 + struct list_lru_node *nlru = &lru->node[nid]; 71 67 struct list_lru_one *l = &nlru->lru; 72 68 struct mem_cgroup *memcg = NULL; 73 69 74 - if (!nlru->memcg_lrus) 70 + if (!list_lru_memcg_aware(lru)) 75 71 goto out; 76 72 77 73 memcg = mem_cgroup_from_obj(ptr); 78 74 if (!memcg) 79 75 goto out; 80 76 81 - l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); 77 + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg)); 82 78 out: 83 79 if (memcg_ptr) 84 80 *memcg_ptr = memcg; ··· 101 103 } 102 104 103 105 static inline struct list_lru_one * 104 - list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx) 106 + list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx) 105 107 { 106 - return &nlru->lru; 108 + return &lru->node[nid].lru; 107 109 } 108 110 109 111 static inline 
struct list_lru_one * 110 - list_lru_from_kmem(struct list_lru_node *nlru, void *ptr, 112 + list_lru_from_kmem(struct list_lru *lru, int nid, void *ptr, 111 113 struct mem_cgroup **memcg_ptr) 112 114 { 113 115 if (memcg_ptr) 114 116 *memcg_ptr = NULL; 115 - return &nlru->lru; 117 + return &lru->node[nid].lru; 116 118 } 117 119 #endif /* CONFIG_MEMCG_KMEM */ 118 120 ··· 125 127 126 128 spin_lock(&nlru->lock); 127 129 if (list_empty(item)) { 128 - l = list_lru_from_kmem(nlru, item, &memcg); 130 + l = list_lru_from_kmem(lru, nid, item, &memcg); 129 131 list_add_tail(item, &l->list); 130 132 /* Set shrinker bit if the first element was added */ 131 133 if (!l->nr_items++) ··· 148 150 149 151 spin_lock(&nlru->lock); 150 152 if (!list_empty(item)) { 151 - l = list_lru_from_kmem(nlru, item, NULL); 153 + l = list_lru_from_kmem(lru, nid, item, NULL); 152 154 list_del_init(item); 153 155 l->nr_items--; 154 156 nlru->nr_items--; ··· 178 180 unsigned long list_lru_count_one(struct list_lru *lru, 179 181 int nid, struct mem_cgroup *memcg) 180 182 { 181 - struct list_lru_node *nlru = &lru->node[nid]; 182 183 struct list_lru_one *l; 183 184 long count; 184 185 185 186 rcu_read_lock(); 186 - l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); 187 - count = READ_ONCE(l->nr_items); 187 + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg)); 188 + count = l ? 
READ_ONCE(l->nr_items) : 0; 188 189 rcu_read_unlock(); 189 190 190 191 if (unlikely(count < 0)) ··· 203 206 EXPORT_SYMBOL_GPL(list_lru_count_node); 204 207 205 208 static unsigned long 206 - __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx, 209 + __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx, 207 210 list_lru_walk_cb isolate, void *cb_arg, 208 211 unsigned long *nr_to_walk) 209 212 { 210 - 213 + struct list_lru_node *nlru = &lru->node[nid]; 211 214 struct list_lru_one *l; 212 215 struct list_head *item, *n; 213 216 unsigned long isolated = 0; 214 217 215 - l = list_lru_from_memcg_idx(nlru, memcg_idx); 216 218 restart: 219 + l = list_lru_from_memcg_idx(lru, nid, memcg_idx); 220 + if (!l) 221 + goto out; 222 + 217 223 list_for_each_safe(item, n, &l->list) { 218 224 enum lru_status ret; 219 225 ··· 260 260 BUG(); 261 261 } 262 262 } 263 + out: 263 264 return isolated; 264 265 } 265 266 ··· 273 272 unsigned long ret; 274 273 275 274 spin_lock(&nlru->lock); 276 - ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg, 277 - nr_to_walk); 275 + ret = __list_lru_walk_one(lru, nid, memcg_kmem_id(memcg), isolate, 276 + cb_arg, nr_to_walk); 278 277 spin_unlock(&nlru->lock); 279 278 return ret; 280 279 } ··· 289 288 unsigned long ret; 290 289 291 290 spin_lock_irq(&nlru->lock); 292 - ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg, 293 - nr_to_walk); 291 + ret = __list_lru_walk_one(lru, nid, memcg_kmem_id(memcg), isolate, 292 + cb_arg, nr_to_walk); 294 293 spin_unlock_irq(&nlru->lock); 295 294 return ret; 296 295 } ··· 300 299 unsigned long *nr_to_walk) 301 300 { 302 301 long isolated = 0; 303 - int memcg_idx; 304 302 305 303 isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg, 306 304 nr_to_walk); 305 + 306 + #ifdef CONFIG_MEMCG_KMEM 307 307 if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) { 308 - for_each_memcg_cache_index(memcg_idx) { 308 + struct list_lru_memcg *mlru; 309 + unsigned long 
index; 310 + 311 + xa_for_each(&lru->xa, index, mlru) { 309 312 struct list_lru_node *nlru = &lru->node[nid]; 310 313 311 314 spin_lock(&nlru->lock); 312 - isolated += __list_lru_walk_one(nlru, memcg_idx, 315 + isolated += __list_lru_walk_one(lru, nid, index, 313 316 isolate, cb_arg, 314 317 nr_to_walk); 315 318 spin_unlock(&nlru->lock); ··· 322 317 break; 323 318 } 324 319 } 320 + #endif 321 + 325 322 return isolated; 326 323 } 327 324 EXPORT_SYMBOL_GPL(list_lru_walk_node); ··· 335 328 } 336 329 337 330 #ifdef CONFIG_MEMCG_KMEM 338 - static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus, 339 - int begin, int end) 331 + static struct list_lru_memcg *memcg_init_list_lru_one(gfp_t gfp) 340 332 { 341 - int i; 333 + int nid; 334 + struct list_lru_memcg *mlru; 342 335 343 - for (i = begin; i < end; i++) 344 - kfree(memcg_lrus->lru[i]); 336 + mlru = kmalloc(struct_size(mlru, node, nr_node_ids), gfp); 337 + if (!mlru) 338 + return NULL; 339 + 340 + for_each_node(nid) 341 + init_one_lru(&mlru->node[nid]); 342 + 343 + return mlru; 345 344 } 346 345 347 - static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus, 348 - int begin, int end) 346 + static void memcg_list_lru_free(struct list_lru *lru, int src_idx) 349 347 { 350 - int i; 348 + struct list_lru_memcg *mlru = xa_erase_irq(&lru->xa, src_idx); 351 349 352 - for (i = begin; i < end; i++) { 353 - struct list_lru_one *l; 354 - 355 - l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL); 356 - if (!l) 357 - goto fail; 358 - 359 - init_one_lru(l); 360 - memcg_lrus->lru[i] = l; 361 - } 362 - return 0; 363 - fail: 364 - __memcg_destroy_list_lru_node(memcg_lrus, begin, i); 365 - return -ENOMEM; 366 - } 367 - 368 - static int memcg_init_list_lru_node(struct list_lru_node *nlru) 369 - { 370 - struct list_lru_memcg *memcg_lrus; 371 - int size = memcg_nr_cache_ids; 372 - 373 - memcg_lrus = kvmalloc(struct_size(memcg_lrus, lru, size), GFP_KERNEL); 374 - if (!memcg_lrus) 375 - return -ENOMEM; 376 
- 377 - if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) { 378 - kvfree(memcg_lrus); 379 - return -ENOMEM; 380 - } 381 - RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus); 382 - 383 - return 0; 384 - } 385 - 386 - static void memcg_destroy_list_lru_node(struct list_lru_node *nlru) 387 - { 388 - struct list_lru_memcg *memcg_lrus; 389 350 /* 390 - * This is called when shrinker has already been unregistered, 391 - * and nobody can use it. So, there is no need to use kvfree_rcu(). 351 + * The __list_lru_walk_one() can walk the list of this node. 352 + * We need kvfree_rcu() here. And the walking of the list 353 + * is under lru->node[nid]->lock, which can serve as a RCU 354 + * read-side critical section. 392 355 */ 393 - memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true); 394 - __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids); 395 - kvfree(memcg_lrus); 356 + if (mlru) 357 + kvfree_rcu(mlru, rcu); 396 358 } 397 359 398 - static int memcg_update_list_lru_node(struct list_lru_node *nlru, 399 - int old_size, int new_size) 360 + static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) 400 361 { 401 - struct list_lru_memcg *old, *new; 402 - 403 - BUG_ON(old_size > new_size); 404 - 405 - old = rcu_dereference_protected(nlru->memcg_lrus, 406 - lockdep_is_held(&list_lrus_mutex)); 407 - new = kvmalloc(struct_size(new, lru, new_size), GFP_KERNEL); 408 - if (!new) 409 - return -ENOMEM; 410 - 411 - if (__memcg_init_list_lru_node(new, old_size, new_size)) { 412 - kvfree(new); 413 - return -ENOMEM; 414 - } 415 - 416 - memcpy(&new->lru, &old->lru, flex_array_size(new, lru, old_size)); 417 - rcu_assign_pointer(nlru->memcg_lrus, new); 418 - kvfree_rcu(old, rcu); 419 - return 0; 420 - } 421 - 422 - static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru, 423 - int old_size, int new_size) 424 - { 425 - struct list_lru_memcg *memcg_lrus; 426 - 427 - memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, 428 - 
lockdep_is_held(&list_lrus_mutex)); 429 - /* do not bother shrinking the array back to the old size, because we 430 - * cannot handle allocation failures here */ 431 - __memcg_destroy_list_lru_node(memcg_lrus, old_size, new_size); 432 - } 433 - 434 - static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) 435 - { 436 - int i; 437 - 362 + if (memcg_aware) 363 + xa_init_flags(&lru->xa, XA_FLAGS_LOCK_IRQ); 438 364 lru->memcg_aware = memcg_aware; 439 - 440 - if (!memcg_aware) 441 - return 0; 442 - 443 - for_each_node(i) { 444 - if (memcg_init_list_lru_node(&lru->node[i])) 445 - goto fail; 446 - } 447 - return 0; 448 - fail: 449 - for (i = i - 1; i >= 0; i--) { 450 - if (!lru->node[i].memcg_lrus) 451 - continue; 452 - memcg_destroy_list_lru_node(&lru->node[i]); 453 - } 454 - return -ENOMEM; 455 365 } 456 366 457 367 static void memcg_destroy_list_lru(struct list_lru *lru) 458 368 { 459 - int i; 369 + XA_STATE(xas, &lru->xa, 0); 370 + struct list_lru_memcg *mlru; 460 371 461 372 if (!list_lru_memcg_aware(lru)) 462 373 return; 463 374 464 - for_each_node(i) 465 - memcg_destroy_list_lru_node(&lru->node[i]); 466 - } 467 - 468 - static int memcg_update_list_lru(struct list_lru *lru, 469 - int old_size, int new_size) 470 - { 471 - int i; 472 - 473 - for_each_node(i) { 474 - if (memcg_update_list_lru_node(&lru->node[i], 475 - old_size, new_size)) 476 - goto fail; 375 + xas_lock_irq(&xas); 376 + xas_for_each(&xas, mlru, ULONG_MAX) { 377 + kfree(mlru); 378 + xas_store(&xas, NULL); 477 379 } 478 - return 0; 479 - fail: 480 - for (i = i - 1; i >= 0; i--) { 481 - if (!lru->node[i].memcg_lrus) 482 - continue; 483 - 484 - memcg_cancel_update_list_lru_node(&lru->node[i], 485 - old_size, new_size); 486 - } 487 - return -ENOMEM; 380 + xas_unlock_irq(&xas); 488 381 } 489 382 490 - static void memcg_cancel_update_list_lru(struct list_lru *lru, 491 - int old_size, int new_size) 492 - { 493 - int i; 494 - 495 - for_each_node(i) 496 - 
memcg_cancel_update_list_lru_node(&lru->node[i], 497 - old_size, new_size); 498 - } 499 - 500 - int memcg_update_all_list_lrus(int new_size) 501 - { 502 - int ret = 0; 503 - struct list_lru *lru; 504 - int old_size = memcg_nr_cache_ids; 505 - 506 - mutex_lock(&list_lrus_mutex); 507 - list_for_each_entry(lru, &memcg_list_lrus, list) { 508 - ret = memcg_update_list_lru(lru, old_size, new_size); 509 - if (ret) 510 - goto fail; 511 - } 512 - out: 513 - mutex_unlock(&list_lrus_mutex); 514 - return ret; 515 - fail: 516 - list_for_each_entry_continue_reverse(lru, &memcg_list_lrus, list) 517 - memcg_cancel_update_list_lru(lru, old_size, new_size); 518 - goto out; 519 - } 520 - 521 - static void memcg_drain_list_lru_node(struct list_lru *lru, int nid, 522 - int src_idx, struct mem_cgroup *dst_memcg) 383 + static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid, 384 + int src_idx, struct mem_cgroup *dst_memcg) 523 385 { 524 386 struct list_lru_node *nlru = &lru->node[nid]; 525 387 int dst_idx = dst_memcg->kmemcg_id; 526 388 struct list_lru_one *src, *dst; 389 + 390 + /* 391 + * If there is no lru entry in this nlru, we can skip it immediately. 
392 + */ 393 + if (!READ_ONCE(nlru->nr_items)) 394 + return; 527 395 528 396 /* 529 397 * Since list_lru_{add,del} may be called under an IRQ-safe lock, ··· 406 524 */ 407 525 spin_lock_irq(&nlru->lock); 408 526 409 - src = list_lru_from_memcg_idx(nlru, src_idx); 410 - dst = list_lru_from_memcg_idx(nlru, dst_idx); 527 + src = list_lru_from_memcg_idx(lru, nid, src_idx); 528 + if (!src) 529 + goto out; 530 + dst = list_lru_from_memcg_idx(lru, nid, dst_idx); 411 531 412 532 list_splice_init(&src->list, &dst->list); 413 533 ··· 418 534 set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru)); 419 535 src->nr_items = 0; 420 536 } 421 - 537 + out: 422 538 spin_unlock_irq(&nlru->lock); 423 539 } 424 540 425 - static void memcg_drain_list_lru(struct list_lru *lru, 426 - int src_idx, struct mem_cgroup *dst_memcg) 541 + static void memcg_reparent_list_lru(struct list_lru *lru, 542 + int src_idx, struct mem_cgroup *dst_memcg) 427 543 { 428 544 int i; 429 545 430 546 for_each_node(i) 431 - memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg); 547 + memcg_reparent_list_lru_node(lru, i, src_idx, dst_memcg); 548 + 549 + memcg_list_lru_free(lru, src_idx); 432 550 } 433 551 434 - void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg) 552 + void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent) 435 553 { 554 + struct cgroup_subsys_state *css; 436 555 struct list_lru *lru; 556 + int src_idx = memcg->kmemcg_id; 557 + 558 + /* 559 + * Change kmemcg_id of this cgroup and all its descendants to the 560 + * parent's id, and then move all entries from this cgroup's list_lrus 561 + * to ones of the parent. 562 + * 563 + * After we have finished, all list_lrus corresponding to this cgroup 564 + * are guaranteed to remain empty. So we can safely free this cgroup's 565 + * list lrus in memcg_list_lru_free(). 
566 + * 567 + * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc() 568 + * from allocating list lrus for this cgroup after memcg_list_lru_free() 569 + * call. 570 + */ 571 + rcu_read_lock(); 572 + css_for_each_descendant_pre(css, &memcg->css) { 573 + struct mem_cgroup *child; 574 + 575 + child = mem_cgroup_from_css(css); 576 + WRITE_ONCE(child->kmemcg_id, parent->kmemcg_id); 577 + } 578 + rcu_read_unlock(); 437 579 438 580 mutex_lock(&list_lrus_mutex); 439 581 list_for_each_entry(lru, &memcg_list_lrus, list) 440 - memcg_drain_list_lru(lru, src_idx, dst_memcg); 582 + memcg_reparent_list_lru(lru, src_idx, parent); 441 583 mutex_unlock(&list_lrus_mutex); 442 584 } 443 - #else 444 - static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) 585 + 586 + static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg, 587 + struct list_lru *lru) 445 588 { 446 - return 0; 589 + int idx = memcg->kmemcg_id; 590 + 591 + return idx < 0 || xa_load(&lru->xa, idx); 592 + } 593 + 594 + int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru, 595 + gfp_t gfp) 596 + { 597 + int i; 598 + unsigned long flags; 599 + struct list_lru_memcg_table { 600 + struct list_lru_memcg *mlru; 601 + struct mem_cgroup *memcg; 602 + } *table; 603 + XA_STATE(xas, &lru->xa, 0); 604 + 605 + if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru)) 606 + return 0; 607 + 608 + gfp &= GFP_RECLAIM_MASK; 609 + table = kmalloc_array(memcg->css.cgroup->level, sizeof(*table), gfp); 610 + if (!table) 611 + return -ENOMEM; 612 + 613 + /* 614 + * Because the list_lru can be reparented to the parent cgroup's 615 + * list_lru, we should make sure that this cgroup and all its 616 + * ancestors have allocated list_lru_memcg. 
617 + */ 618 + for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) { 619 + if (memcg_list_lru_allocated(memcg, lru)) 620 + break; 621 + 622 + table[i].memcg = memcg; 623 + table[i].mlru = memcg_init_list_lru_one(gfp); 624 + if (!table[i].mlru) { 625 + while (i--) 626 + kfree(table[i].mlru); 627 + kfree(table); 628 + return -ENOMEM; 629 + } 630 + } 631 + 632 + xas_lock_irqsave(&xas, flags); 633 + while (i--) { 634 + int index = READ_ONCE(table[i].memcg->kmemcg_id); 635 + struct list_lru_memcg *mlru = table[i].mlru; 636 + 637 + xas_set(&xas, index); 638 + retry: 639 + if (unlikely(index < 0 || xas_error(&xas) || xas_load(&xas))) { 640 + kfree(mlru); 641 + } else { 642 + xas_store(&xas, mlru); 643 + if (xas_error(&xas) == -ENOMEM) { 644 + xas_unlock_irqrestore(&xas, flags); 645 + if (xas_nomem(&xas, gfp)) 646 + xas_set_err(&xas, 0); 647 + xas_lock_irqsave(&xas, flags); 648 + /* 649 + * The xas lock has been released, this memcg 650 + * can be reparented before us. So reload 651 + * memcg id. More details see the comments 652 + * in memcg_reparent_list_lrus(). 653 + */ 654 + index = READ_ONCE(table[i].memcg->kmemcg_id); 655 + if (index < 0) 656 + xas_set_err(&xas, 0); 657 + else if (!xas_error(&xas) && index != xas.xa_index) 658 + xas_set(&xas, index); 659 + goto retry; 660 + } 661 + } 662 + } 663 + /* xas_nomem() is used to free memory instead of memory allocation. 
*/ 664 + if (xas.xa_alloc) 665 + xas_nomem(&xas, gfp); 666 + xas_unlock_irqrestore(&xas, flags); 667 + kfree(table); 668 + 669 + return xas_error(&xas); 670 + } 671 + #else 672 + static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) 673 + { 447 674 } 448 675 449 676 static void memcg_destroy_list_lru(struct list_lru *lru) ··· 566 571 struct lock_class_key *key, struct shrinker *shrinker) 567 572 { 568 573 int i; 569 - int err = -ENOMEM; 570 574 571 575 #ifdef CONFIG_MEMCG_KMEM 572 576 if (shrinker) ··· 573 579 else 574 580 lru->shrinker_id = -1; 575 581 #endif 576 - memcg_get_cache_ids(); 577 582 578 583 lru->node = kcalloc(nr_node_ids, sizeof(*lru->node), GFP_KERNEL); 579 584 if (!lru->node) 580 - goto out; 585 + return -ENOMEM; 581 586 582 587 for_each_node(i) { 583 588 spin_lock_init(&lru->node[i].lock); ··· 585 592 init_one_lru(&lru->node[i].lru); 586 593 } 587 594 588 - err = memcg_init_list_lru(lru, memcg_aware); 589 - if (err) { 590 - kfree(lru->node); 591 - /* Do this so a list_lru_destroy() doesn't crash: */ 592 - lru->node = NULL; 593 - goto out; 594 - } 595 - 595 + memcg_init_list_lru(lru, memcg_aware); 596 596 list_lru_register(lru); 597 - out: 598 - memcg_put_cache_ids(); 599 - return err; 597 + 598 + return 0; 600 599 } 601 600 EXPORT_SYMBOL_GPL(__list_lru_init); 602 601 ··· 597 612 /* Already destroyed or not yet initialized? */ 598 613 if (!lru->node) 599 614 return; 600 - 601 - memcg_get_cache_ids(); 602 615 603 616 list_lru_unregister(lru); 604 617 ··· 607 624 #ifdef CONFIG_MEMCG_KMEM 608 625 lru->shrinker_id = -1; 609 626 #endif 610 - memcg_put_cache_ids(); 611 627 } 612 628 EXPORT_SYMBOL_GPL(list_lru_destroy);
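Note on the mm/list_lru.c hunks above: once the per-memcg lists are allocated on demand, `list_lru_from_memcg_idx()` may return NULL, and callers such as `list_lru_count_one()` treat a missing entry as an empty list. A minimal sketch of that contract follows. It assumes a small fixed array in place of the XArray (`lru->xa`), and all `model_*` and `MODEL_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_IDS  8 /* arbitrary bound for the sketch */
#define MODEL_NR_NODES 2 /* arbitrary node count for the sketch */

struct model_lru_one {
	long nr_items;
};

/* One per-memcg entry holding a list for every node, as in the new layout. */
struct model_lru_memcg {
	struct model_lru_one node[MODEL_NR_NODES];
};

struct model_lru {
	int memcg_aware;
	struct model_lru_one global[MODEL_NR_NODES];   /* models lru->node[nid].lru */
	struct model_lru_memcg *slots[MODEL_MAX_IDS];  /* stand-in for lru->xa */
};

/* Returns NULL when the memcg has no entry yet, like the reworked lookup. */
static struct model_lru_one *
model_from_memcg_idx(struct model_lru *lru, int nid, int idx)
{
	if (lru->memcg_aware && idx >= 0) {
		struct model_lru_memcg *mlru =
			idx < MODEL_MAX_IDS ? lru->slots[idx] : NULL;

		return mlru ? &mlru->node[nid] : NULL;
	}
	return &lru->global[nid];
}

/* A missing per-memcg list counts as zero items. */
static unsigned long model_count_one(struct model_lru *lru, int nid, int idx)
{
	struct model_lru_one *l = model_from_memcg_idx(lru, nid, idx);

	return l ? (unsigned long)l->nr_items : 0;
}
```

The same NULL check is what lets `__list_lru_walk_one()` and the reparenting path skip a cgroup whose entry was never allocated or has already been freed.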

+6

mm/maccess.c

···
335 335
336 336 	return ret;
337 337 }
338 +
339 + void __copy_overflow(int size, unsigned long count)
340 + {
341 + 	WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
342 + }
343 + EXPORT_SYMBOL(__copy_overflow);
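Note on the mm/maccess.c hunk above: the added `__copy_overflow()` only emits the warning, so the size comparison has to happen in its callers. The sketch below shows one way such a caller could look in userspace. `model_checked_copy()` and `model_copy_overflow()` are hypothetical names, `fprintf()` stands in for `WARN()`, and a counter is added purely so the behaviour can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static int overflow_reports; /* sketch-only counter, not part of the kernel helper */

/* Stand-in for __copy_overflow(): report, do nothing else. */
static void model_copy_overflow(int size, unsigned long count)
{
	overflow_reports++;
	fprintf(stderr, "Buffer overflow detected (%d < %lu)!\n", size, count);
}

/*
 * Hypothetical caller: refuse the copy when the request exceeds the known
 * destination size. A negative dst_size means "size unknown, do not check".
 */
static bool model_checked_copy(void *dst, int dst_size,
			       const void *src, unsigned long count)
{
	if (dst_size >= 0 && count > (unsigned long)dst_size) {
		model_copy_overflow(dst_size, count);
		return false;
	}
	memcpy(dst, src, count);
	return true;
}
```

Keeping the report in one out-of-line, exported function means the warning text lives in a single place while the cheap comparison stays inline at each call site.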

+13 -5

mm/madvise.c

···
849 849 		 * our VMA might have been split.
850 850 		 */
851 851 		if (!vma || start >= vma->vm_end) {
852 - 			vma = find_vma(mm, start);
853 - 			if (!vma || start < vma->vm_start)
852 + 			vma = vma_lookup(mm, start);
853 + 			if (!vma)
854 854 				return -ENOMEM;
855 855 		}
856 856
···
1067 1067 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
1068 1068 				 pfn, start);
1069 1069 			ret = memory_failure(pfn, MF_COUNT_INCREASED);
1070 + 			if (ret == -EOPNOTSUPP)
1071 + 				ret = 0;
1070 1072 		}
1071 1073
1072 1074 		if (ret)
···
1428 1426
1429 1427 	while (iov_iter_count(&iter)) {
1430 1428 		iovec = iov_iter_iovec(&iter);
1429 + 		/*
1430 + 		 * do_madvise returns ENOMEM if unmapped holes are present
1431 + 		 * in the passed VMA. process_madvise() is expected to skip
1432 + 		 * unmapped holes passed to it in the 'struct iovec' list
1433 + 		 * and not fail because of them. Thus treat -ENOMEM return
1434 + 		 * from do_madvise as valid and continue processing.
1435 + 		 */
1431 1436 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
1432 1437 					iovec.iov_len, behavior);
1433 - 		if (ret < 0)
1438 + 		if (ret < 0 && ret != -ENOMEM)
1434 1439 			break;
1435 1440 		iov_iter_advance(&iter, iovec.iov_len);
1436 1441 	}
1437 1442
1438 - 	if (ret == 0)
1439 - 		ret = total_len - iov_iter_count(&iter);
1443 + 	ret = (total_len - iov_iter_count(&iter)) ? : ret;
1440 1444
1441 1445 release_mm:
1442 1446 	mmput(mm);
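Note on the last mm/madvise.c hunk above: the `process_madvise()` loop now steps over ranges for which `do_madvise()` reports `-ENOMEM`, and the final `? :` expression returns the number of bytes advanced whenever that number is non-zero, falling back to the last return code otherwise. A minimal model of that control flow follows. The `model_range` struct and `model_process_madvise()` are hypothetical, and each range carries the result `do_madvise()` is assumed to return for it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One iovec entry plus the result do_madvise() is assumed to return for it. */
struct model_range {
	size_t len;
	int result;
};

static long model_process_madvise(const struct model_range *r, size_t n)
{
	size_t done = 0; /* models total_len - iov_iter_count(&iter) */
	long ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = r[i].result;
		/* -ENOMEM (unmapped hole) is tolerated; other errors stop the loop. */
		if (ret < 0 && ret != -ENOMEM)
			break;
		done += r[i].len;
	}

	/* Portable spelling of the GNU "x ? : y" form used in the hunk. */
	return done ? (long)done : ret;
}
```

One consequence visible in the model: an error on a later range no longer hides the progress made on earlier ranges, because a non-zero byte count takes precedence over the error code.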

+214 -281

mm/memcontrol.c

··· 348 348 } 349 349 350 350 /* 351 - * This will be used as a shrinker list's index. 352 - * The main reason for not using cgroup id for this: 353 - * this works better in sparse environments, where we have a lot of memcgs, 354 - * but only a few kmem-limited. Or also, if we have, for instance, 200 355 - * memcgs, and none but the 200th is kmem-limited, we'd have to have a 356 - * 200 entry array for that. 357 - * 358 - * The current size of the caches array is stored in memcg_nr_cache_ids. It 359 - * will double each time we have to increase it. 360 - */ 361 - static DEFINE_IDA(memcg_cache_ida); 362 - int memcg_nr_cache_ids; 363 - 364 - /* Protects memcg_nr_cache_ids */ 365 - static DECLARE_RWSEM(memcg_cache_ids_sem); 366 - 367 - void memcg_get_cache_ids(void) 368 - { 369 - down_read(&memcg_cache_ids_sem); 370 - } 371 - 372 - void memcg_put_cache_ids(void) 373 - { 374 - up_read(&memcg_cache_ids_sem); 375 - } 376 - 377 - /* 378 - * MIN_SIZE is different than 1, because we would like to avoid going through 379 - * the alloc/free process all the time. In a small machine, 4 kmem-limited 380 - * cgroups is a reasonable guess. In the future, it could be a parameter or 381 - * tunable, but that is strictly not necessary. 382 - * 383 - * MAX_SIZE should be as large as the number of cgrp_ids. Ideally, we could get 384 - * this constant directly from cgroup, but it is understandable that this is 385 - * better kept as an internal representation in cgroup.c. In any case, the 386 - * cgrp_id space is not getting any smaller, and we don't have to necessarily 387 - * increase ours as well if it increases. 388 - */ 389 - #define MEMCG_CACHES_MIN_SIZE 4 390 - #define MEMCG_CACHES_MAX_SIZE MEM_CGROUP_ID_MAX 391 - 392 - /* 393 351 * A lot of the calls to the cache allocation functions are expected to be 394 352 * inlined by the compiler. 
Since the calls to memcg_slab_pre_alloc_hook() are 395 353 * conditional to this static branch, we'll have to allow modules that does ··· 587 629 static DEFINE_PER_CPU(unsigned int, stats_updates); 588 630 static atomic_t stats_flush_threshold = ATOMIC_INIT(0); 589 631 632 + /* 633 + * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can 634 + * not rely on this as part of an acquired spinlock_t lock. These functions are 635 + * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion 636 + * is sufficient. 637 + */ 638 + static void memcg_stats_lock(void) 639 + { 640 + #ifdef CONFIG_PREEMPT_RT 641 + preempt_disable(); 642 + #else 643 + VM_BUG_ON(!irqs_disabled()); 644 + #endif 645 + } 646 + 647 + static void __memcg_stats_lock(void) 648 + { 649 + #ifdef CONFIG_PREEMPT_RT 650 + preempt_disable(); 651 + #endif 652 + } 653 + 654 + static void memcg_stats_unlock(void) 655 + { 656 + #ifdef CONFIG_PREEMPT_RT 657 + preempt_enable(); 658 + #endif 659 + } 660 + 590 661 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) 591 662 { 592 663 unsigned int x; ··· 692 705 pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 693 706 memcg = pn->memcg; 694 707 708 + /* 709 + * The caller from rmap relay on disabled preemption becase they never 710 + * update their counter from in-interrupt context. For these two 711 + * counters we check that the update is never performed from an 712 + * interrupt context while other caller need to have disabled interrupt. 
713 + */ 714 + __memcg_stats_lock(); 715 + if (IS_ENABLED(CONFIG_DEBUG_VM) && !IS_ENABLED(CONFIG_PREEMPT_RT)) { 716 + switch (idx) { 717 + case NR_ANON_MAPPED: 718 + case NR_FILE_MAPPED: 719 + case NR_ANON_THPS: 720 + case NR_SHMEM_PMDMAPPED: 721 + case NR_FILE_PMDMAPPED: 722 + WARN_ON_ONCE(!in_task()); 723 + break; 724 + default: 725 + WARN_ON_ONCE(!irqs_disabled()); 726 + } 727 + } 728 + 695 729 /* Update memcg */ 696 730 __this_cpu_add(memcg->vmstats_percpu->state[idx], val); 697 731 ··· 720 712 __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); 721 713 722 714 memcg_rstat_updated(memcg, val); 715 + memcg_stats_unlock(); 723 716 } 724 717 725 718 /** ··· 803 794 if (mem_cgroup_disabled()) 804 795 return; 805 796 797 + memcg_stats_lock(); 806 798 __this_cpu_add(memcg->vmstats_percpu->events[idx], count); 807 799 memcg_rstat_updated(memcg, count); 800 + memcg_stats_unlock(); 808 801 } 809 802 810 803 static unsigned long memcg_events(struct mem_cgroup *memcg, int event) ··· 869 858 */ 870 859 static void memcg_check_events(struct mem_cgroup *memcg, int nid) 871 860 { 861 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 862 + return; 863 + 872 864 /* threshold event is triggered in finer grain than soft limit */ 873 865 if (unlikely(mem_cgroup_event_ratelimit(memcg, 874 866 MEM_CGROUP_TARGET_THRESH))) { ··· 1385 1371 static const struct memory_stat memory_stats[] = { 1386 1372 { "anon", NR_ANON_MAPPED }, 1387 1373 { "file", NR_FILE_PAGES }, 1374 + { "kernel", MEMCG_KMEM }, 1388 1375 { "kernel_stack", NR_KERNEL_STACK_KB }, 1389 1376 { "pagetables", NR_PAGETABLE }, 1390 1377 { "percpu", MEMCG_PERCPU_B }, ··· 1810 1795 __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); 1811 1796 } 1812 1797 1813 - enum oom_status { 1814 - OOM_SUCCESS, 1815 - OOM_FAILED, 1816 - OOM_ASYNC, 1817 - OOM_SKIPPED 1818 - }; 1819 - 1820 - static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) 1798 + /* 1799 + * Returns true if successfully killed one or more 
processes. Though in some 1800 + * corner cases it can return true even without killing any process. 1801 + */ 1802 + static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) 1821 1803 { 1822 - enum oom_status ret; 1823 - bool locked; 1804 + bool locked, ret; 1824 1805 1825 1806 if (order > PAGE_ALLOC_COSTLY_ORDER) 1826 - return OOM_SKIPPED; 1807 + return false; 1827 1808 1828 1809 memcg_memory_event(memcg, MEMCG_OOM); 1829 1810 ··· 1842 1831 * victim and then we have to bail out from the charge path. 1843 1832 */ 1844 1833 if (memcg->oom_kill_disable) { 1845 - if (!current->in_user_fault) 1846 - return OOM_SKIPPED; 1847 - css_get(&memcg->css); 1848 - current->memcg_in_oom = memcg; 1849 - current->memcg_oom_gfp_mask = mask; 1850 - current->memcg_oom_order = order; 1851 - 1852 - return OOM_ASYNC; 1834 + if (current->in_user_fault) { 1835 + css_get(&memcg->css); 1836 + current->memcg_in_oom = memcg; 1837 + current->memcg_oom_gfp_mask = mask; 1838 + current->memcg_oom_order = order; 1839 + } 1840 + return false; 1853 1841 } 1854 1842 1855 1843 mem_cgroup_mark_under_oom(memcg); ··· 1859 1849 mem_cgroup_oom_notify(memcg); 1860 1850 1861 1851 mem_cgroup_unmark_under_oom(memcg); 1862 - if (mem_cgroup_out_of_memory(memcg, mask, order)) 1863 - ret = OOM_SUCCESS; 1864 - else 1865 - ret = OOM_FAILED; 1852 + ret = mem_cgroup_out_of_memory(memcg, mask, order); 1866 1853 1867 1854 if (locked) 1868 1855 mem_cgroup_oom_unlock(memcg); ··· 2092 2085 folio_memcg_unlock(page_folio(page)); 2093 2086 } 2094 2087 2095 - struct obj_stock { 2088 + struct memcg_stock_pcp { 2089 + local_lock_t stock_lock; 2090 + struct mem_cgroup *cached; /* this never be root cgroup */ 2091 + unsigned int nr_pages; 2092 + 2096 2093 #ifdef CONFIG_MEMCG_KMEM 2097 2094 struct obj_cgroup *cached_objcg; 2098 2095 struct pglist_data *cached_pgdat; 2099 2096 unsigned int nr_bytes; 2100 2097 int nr_slab_reclaimable_b; 2101 2098 int nr_slab_unreclaimable_b; 2102 - #else 2103 - int dummy[0]; 2104 
2099 #endif 2105 - }; 2106 - 2107 - struct memcg_stock_pcp { 2108 - struct mem_cgroup *cached; /* this never be root cgroup */ 2109 - unsigned int nr_pages; 2110 - struct obj_stock task_obj; 2111 - struct obj_stock irq_obj; 2112 2100 2113 2101 struct work_struct work; 2114 2102 unsigned long flags; 2115 2103 #define FLUSHING_CACHED_CHARGE 0 2116 2104 }; 2117 - static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock); 2105 + static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { 2106 + .stock_lock = INIT_LOCAL_LOCK(stock_lock), 2107 + }; 2118 2108 static DEFINE_MUTEX(percpu_charge_mutex); 2119 2109 2120 2110 #ifdef CONFIG_MEMCG_KMEM 2121 - static void drain_obj_stock(struct obj_stock *stock); 2111 + static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock); 2122 2112 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, 2123 2113 struct mem_cgroup *root_memcg); 2114 + static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages); 2124 2115 2125 2116 #else 2126 - static inline void drain_obj_stock(struct obj_stock *stock) 2117 + static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) 2127 2118 { 2119 + return NULL; 2128 2120 } 2129 2121 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, 2130 2122 struct mem_cgroup *root_memcg) 2131 2123 { 2132 2124 return false; 2125 + } 2126 + static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages) 2127 + { 2133 2128 } 2134 2129 #endif 2135 2130 ··· 2155 2146 if (nr_pages > MEMCG_CHARGE_BATCH) 2156 2147 return ret; 2157 2148 2158 - local_irq_save(flags); 2149 + local_lock_irqsave(&memcg_stock.stock_lock, flags); 2159 2150 2160 2151 stock = this_cpu_ptr(&memcg_stock); 2161 2152 if (memcg == stock->cached && stock->nr_pages >= nr_pages) { ··· 2163 2154 ret = true; 2164 2155 } 2165 2156 2166 - local_irq_restore(flags); 2157 + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); 2167 2158 2168 2159 return ret; 2169 2160 } ··· 2192 
2183 static void drain_local_stock(struct work_struct *dummy) 2193 2184 { 2194 2185 struct memcg_stock_pcp *stock; 2186 + struct obj_cgroup *old = NULL; 2195 2187 unsigned long flags; 2196 2188 2197 2189 /* ··· 2200 2190 * drain_stock races is that we always operate on local CPU stock 2201 2191 * here with IRQ disabled 2202 2192 */ 2203 - local_irq_save(flags); 2193 + local_lock_irqsave(&memcg_stock.stock_lock, flags); 2204 2194 2205 2195 stock = this_cpu_ptr(&memcg_stock); 2206 - drain_obj_stock(&stock->irq_obj); 2207 - if (in_task()) 2208 - drain_obj_stock(&stock->task_obj); 2196 + old = drain_obj_stock(stock); 2209 2197 drain_stock(stock); 2210 2198 clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); 2211 2199 2212 - local_irq_restore(flags); 2200 + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); 2201 + if (old) 2202 + obj_cgroup_put(old); 2213 2203 } 2214 2204 2215 2205 /* 2216 2206 * Cache charges(val) to local per_cpu area. 2217 2207 * This will be consumed by consume_stock() function, later. 
2218 2208 */ 2219 - static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) 2209 + static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) 2220 2210 { 2221 2211 struct memcg_stock_pcp *stock; 2222 - unsigned long flags; 2223 - 2224 - local_irq_save(flags); 2225 2212 2226 2213 stock = this_cpu_ptr(&memcg_stock); 2227 2214 if (stock->cached != memcg) { /* reset if necessary */ ··· 2230 2223 2231 2224 if (stock->nr_pages > MEMCG_CHARGE_BATCH) 2232 2225 drain_stock(stock); 2226 + } 2233 2227 2234 - local_irq_restore(flags); 2228 + static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) 2229 + { 2230 + unsigned long flags; 2231 + 2232 + local_lock_irqsave(&memcg_stock.stock_lock, flags); 2233 + __refill_stock(memcg, nr_pages); 2234 + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); 2235 2235 } 2236 2236 2237 2237 /* ··· 2258 2244 * as well as workers from this path always operate on the local 2259 2245 * per-cpu data. CPU up doesn't touch memcg_stock at all. 2260 2246 */ 2261 - curcpu = get_cpu(); 2247 + migrate_disable(); 2248 + curcpu = smp_processor_id(); 2262 2249 for_each_online_cpu(cpu) { 2263 2250 struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); 2264 2251 struct mem_cgroup *memcg; ··· 2282 2267 schedule_work_on(cpu, &stock->work); 2283 2268 } 2284 2269 } 2285 - put_cpu(); 2270 + migrate_enable(); 2286 2271 mutex_unlock(&percpu_charge_mutex); 2287 2272 } 2288 2273 ··· 2556 2541 int nr_retries = MAX_RECLAIM_RETRIES; 2557 2542 struct mem_cgroup *mem_over_limit; 2558 2543 struct page_counter *counter; 2559 - enum oom_status oom_status; 2560 2544 unsigned long nr_reclaimed; 2561 2545 bool passed_oom = false; 2562 2546 bool may_swap = true; ··· 2582 2568 batch = nr_pages; 2583 2569 goto retry; 2584 2570 } 2585 - 2586 - /* 2587 - * Memcg doesn't have a dedicated reserve for atomic 2588 - * allocations. 
But like the global atomic pool, we need to 2589 - * put the burden of reclaim on regular allocation requests 2590 - * and let these go through as privileged allocations. 2591 - */ 2592 - if (gfp_mask & __GFP_ATOMIC) 2593 - goto force; 2594 2571 2595 2572 /* 2596 2573 * Prevent unbounded recursion when reclaim operations need to ··· 2649 2644 * a forward progress or bypass the charge if the oom killer 2650 2645 * couldn't make any progress. 2651 2646 */ 2652 - oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, 2653 - get_order(nr_pages * PAGE_SIZE)); 2654 - if (oom_status == OOM_SUCCESS) { 2647 + if (mem_cgroup_oom(mem_over_limit, gfp_mask, 2648 + get_order(nr_pages * PAGE_SIZE))) { 2655 2649 passed_oom = true; 2656 2650 nr_retries = MAX_RECLAIM_RETRIES; 2657 2651 goto retry; 2658 2652 } 2659 2653 nomem: 2660 - if (!(gfp_mask & __GFP_NOFAIL)) 2654 + /* 2655 + * Memcg doesn't have a dedicated reserve for atomic 2656 + * allocations. But like the global atomic pool, we need to 2657 + * put the burden of reclaim on regular allocation requests 2658 + * and let these go through as privileged allocations. 
2659 + */ 2660 + if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH))) 2661 2661 return -ENOMEM; 2662 2662 force: 2663 2663 /* ··· 2698 2688 READ_ONCE(memcg->swap.high); 2699 2689 2700 2690 /* Don't bother a random interrupted task */ 2701 - if (in_interrupt()) { 2691 + if (!in_task()) { 2702 2692 if (mem_high) { 2703 2693 schedule_work(&memcg->high_work); 2704 2694 break; ··· 2722 2712 } 2723 2713 } while ((memcg = parent_mem_cgroup(memcg))); 2724 2714 2715 + if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH && 2716 + !(current->flags & PF_MEMALLOC) && 2717 + gfpflags_allow_blocking(gfp_mask)) { 2718 + mem_cgroup_handle_over_high(); 2719 + } 2725 2720 return 0; 2726 2721 } 2727 2722 ··· 2763 2748 folio->memcg_data = (unsigned long)memcg; 2764 2749 } 2765 2750 2766 - static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg) 2767 - { 2768 - struct mem_cgroup *memcg; 2769 - 2770 - rcu_read_lock(); 2771 - retry: 2772 - memcg = obj_cgroup_memcg(objcg); 2773 - if (unlikely(!css_tryget(&memcg->css))) 2774 - goto retry; 2775 - rcu_read_unlock(); 2776 - 2777 - return memcg; 2778 - } 2779 - 2780 2751 #ifdef CONFIG_MEMCG_KMEM 2781 2752 /* 2782 2753 * The allocated objcg pointers array is not accounted directly. ··· 2770 2769 * reclaimable. So those GFP bits should be masked off. 2771 2770 */ 2772 2771 #define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT) 2773 - 2774 - /* 2775 - * Most kmem_cache_alloc() calls are from user context. The irq disable/enable 2776 - * sequence used in this case to access content from object stock is slow. 2777 - * To optimize for user context access, there are now two object stocks for 2778 - * task context and interrupt context access respectively. 2779 - * 2780 - * The task context object stock can be accessed by disabling preemption only 2781 - * which is cheap in non-preempt kernel. The interrupt context object stock 2782 - * can only be accessed after disabling interrupt. 
User context code can 2783 - * access interrupt object stock, but not vice versa. 2784 - */ 2785 - static inline struct obj_stock *get_obj_stock(unsigned long *pflags) 2786 - { 2787 - struct memcg_stock_pcp *stock; 2788 - 2789 - if (likely(in_task())) { 2790 - *pflags = 0UL; 2791 - preempt_disable(); 2792 - stock = this_cpu_ptr(&memcg_stock); 2793 - return &stock->task_obj; 2794 - } 2795 - 2796 - local_irq_save(*pflags); 2797 - stock = this_cpu_ptr(&memcg_stock); 2798 - return &stock->irq_obj; 2799 - } 2800 - 2801 - static inline void put_obj_stock(unsigned long flags) 2802 - { 2803 - if (likely(in_task())) 2804 - preempt_enable(); 2805 - else 2806 - local_irq_restore(flags); 2807 - } 2808 2772 2809 2773 /* 2810 2774 * mod_objcg_mlstate() may be called with irq enabled, so ··· 2902 2936 return objcg; 2903 2937 } 2904 2938 2905 - static int memcg_alloc_cache_id(void) 2939 + static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages) 2906 2940 { 2907 - int id, size; 2908 - int err; 2909 - 2910 - id = ida_simple_get(&memcg_cache_ida, 2911 - 0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL); 2912 - if (id < 0) 2913 - return id; 2914 - 2915 - if (id < memcg_nr_cache_ids) 2916 - return id; 2917 - 2918 - /* 2919 - * There's no space for the new id in memcg_caches arrays, 2920 - * so we have to grow them. 
2921 - */ 2922 - down_write(&memcg_cache_ids_sem); 2923 - 2924 - size = 2 * (id + 1); 2925 - if (size < MEMCG_CACHES_MIN_SIZE) 2926 - size = MEMCG_CACHES_MIN_SIZE; 2927 - else if (size > MEMCG_CACHES_MAX_SIZE) 2928 - size = MEMCG_CACHES_MAX_SIZE; 2929 - 2930 - err = memcg_update_all_list_lrus(size); 2931 - if (!err) 2932 - memcg_nr_cache_ids = size; 2933 - 2934 - up_write(&memcg_cache_ids_sem); 2935 - 2936 - if (err) { 2937 - ida_simple_remove(&memcg_cache_ida, id); 2938 - return err; 2941 + mod_memcg_state(memcg, MEMCG_KMEM, nr_pages); 2942 + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { 2943 + if (nr_pages > 0) 2944 + page_counter_charge(&memcg->kmem, nr_pages); 2945 + else 2946 + page_counter_uncharge(&memcg->kmem, -nr_pages); 2939 2947 } 2940 - return id; 2941 2948 } 2942 2949 2943 - static void memcg_free_cache_id(int id) 2944 - { 2945 - ida_simple_remove(&memcg_cache_ida, id); 2946 - } 2947 2950 2948 2951 /* 2949 2952 * obj_cgroup_uncharge_pages: uncharge a number of kernel pages from a objcg ··· 2926 2991 2927 2992 memcg = get_mem_cgroup_from_objcg(objcg); 2928 2993 2929 - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 2930 - page_counter_uncharge(&memcg->kmem, nr_pages); 2994 + memcg_account_kmem(memcg, -nr_pages); 2931 2995 refill_stock(memcg, nr_pages); 2932 2996 2933 2997 css_put(&memcg->css); ··· 2952 3018 if (ret) 2953 3019 goto out; 2954 3020 2955 - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 2956 - page_counter_charge(&memcg->kmem, nr_pages); 3021 + memcg_account_kmem(memcg, nr_pages); 2957 3022 out: 2958 3023 css_put(&memcg->css); 2959 3024 ··· 3008 3075 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, 3009 3076 enum node_stat_item idx, int nr) 3010 3077 { 3078 + struct memcg_stock_pcp *stock; 3079 + struct obj_cgroup *old = NULL; 3011 3080 unsigned long flags; 3012 - struct obj_stock *stock = get_obj_stock(&flags); 3013 3081 int *bytes; 3082 + 3083 + local_lock_irqsave(&memcg_stock.stock_lock, flags); 3084 + stock = 
this_cpu_ptr(&memcg_stock); 3014 3085 3015 3086 /* 3016 3087 * Save vmstat data in stock and skip vmstat array update unless ··· 3022 3085 * changes. 3023 3086 */ 3024 3087 if (stock->cached_objcg != objcg) { 3025 - drain_obj_stock(stock); 3088 + old = drain_obj_stock(stock); 3026 3089 obj_cgroup_get(objcg); 3027 3090 stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) 3028 3091 ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; ··· 3066 3129 if (nr) 3067 3130 mod_objcg_mlstate(objcg, pgdat, idx, nr); 3068 3131 3069 - put_obj_stock(flags); 3132 + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); 3133 + if (old) 3134 + obj_cgroup_put(old); 3070 3135 } 3071 3136 3072 3137 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) 3073 3138 { 3139 + struct memcg_stock_pcp *stock; 3074 3140 unsigned long flags; 3075 - struct obj_stock *stock = get_obj_stock(&flags); 3076 3141 bool ret = false; 3077 3142 3143 + local_lock_irqsave(&memcg_stock.stock_lock, flags); 3144 + 3145 + stock = this_cpu_ptr(&memcg_stock); 3078 3146 if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) { 3079 3147 stock->nr_bytes -= nr_bytes; 3080 3148 ret = true; 3081 3149 } 3082 3150 3083 - put_obj_stock(flags); 3151 + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); 3084 3152 3085 3153 return ret; 3086 3154 } 3087 3155 3088 - static void drain_obj_stock(struct obj_stock *stock) 3156 + static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock) 3089 3157 { 3090 3158 struct obj_cgroup *old = stock->cached_objcg; 3091 3159 3092 3160 if (!old) 3093 - return; 3161 + return NULL; 3094 3162 3095 3163 if (stock->nr_bytes) { 3096 3164 unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT; 3097 3165 unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1); 3098 3166 3099 - if (nr_pages) 3100 - obj_cgroup_uncharge_pages(old, nr_pages); 3167 + if (nr_pages) { 3168 + struct mem_cgroup *memcg; 3169 + 3170 + memcg = get_mem_cgroup_from_objcg(old); 
3171 + 3172 + memcg_account_kmem(memcg, -nr_pages); 3173 + __refill_stock(memcg, nr_pages); 3174 + 3175 + css_put(&memcg->css); 3176 + } 3101 3177 3102 3178 /* 3103 3179 * The leftover is flushed to the centralized per-memcg value. ··· 3145 3195 stock->cached_pgdat = NULL; 3146 3196 } 3147 3197 3148 - obj_cgroup_put(old); 3149 3198 stock->cached_objcg = NULL; 3199 + /* 3200 + * The `old' objects needs to be released by the caller via 3201 + * obj_cgroup_put() outside of memcg_stock_pcp::stock_lock. 3202 + */ 3203 + return old; 3150 3204 } 3151 3205 3152 3206 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, ··· 3158 3204 { 3159 3205 struct mem_cgroup *memcg; 3160 3206 3161 - if (in_task() && stock->task_obj.cached_objcg) { 3162 - memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg); 3163 - if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) 3164 - return true; 3165 - } 3166 - if (stock->irq_obj.cached_objcg) { 3167 - memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg); 3207 + if (stock->cached_objcg) { 3208 + memcg = obj_cgroup_memcg(stock->cached_objcg); 3168 3209 if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) 3169 3210 return true; 3170 3211 } ··· 3170 3221 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, 3171 3222 bool allow_uncharge) 3172 3223 { 3224 + struct memcg_stock_pcp *stock; 3225 + struct obj_cgroup *old = NULL; 3173 3226 unsigned long flags; 3174 - struct obj_stock *stock = get_obj_stock(&flags); 3175 3227 unsigned int nr_pages = 0; 3176 3228 3229 + local_lock_irqsave(&memcg_stock.stock_lock, flags); 3230 + 3231 + stock = this_cpu_ptr(&memcg_stock); 3177 3232 if (stock->cached_objcg != objcg) { /* reset if necessary */ 3178 - drain_obj_stock(stock); 3233 + old = drain_obj_stock(stock); 3179 3234 obj_cgroup_get(objcg); 3180 3235 stock->cached_objcg = objcg; 3181 3236 stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) ··· 3193 3240 stock->nr_bytes &= (PAGE_SIZE - 1); 3194 3241 } 
3195 3242 3196 - put_obj_stock(flags); 3243 + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); 3244 + if (old) 3245 + obj_cgroup_put(old); 3197 3246 3198 3247 if (nr_pages) 3199 3248 obj_cgroup_uncharge_pages(objcg, nr_pages); ··· 3580 3625 static int memcg_online_kmem(struct mem_cgroup *memcg) 3581 3626 { 3582 3627 struct obj_cgroup *objcg; 3583 - int memcg_id; 3584 3628 3585 3629 if (cgroup_memory_nokmem) 3586 3630 return 0; 3587 3631 3588 - BUG_ON(memcg->kmemcg_id >= 0); 3589 - 3590 - memcg_id = memcg_alloc_cache_id(); 3591 - if (memcg_id < 0) 3592 - return memcg_id; 3632 + if (unlikely(mem_cgroup_is_root(memcg))) 3633 + return 0; 3593 3634 3594 3635 objcg = obj_cgroup_alloc(); 3595 - if (!objcg) { 3596 - memcg_free_cache_id(memcg_id); 3636 + if (!objcg) 3597 3637 return -ENOMEM; 3598 - } 3638 + 3599 3639 objcg->memcg = memcg; 3600 3640 rcu_assign_pointer(memcg->objcg, objcg); 3601 3641 3602 3642 static_branch_enable(&memcg_kmem_enabled_key); 3603 3643 3604 - memcg->kmemcg_id = memcg_id; 3644 + memcg->kmemcg_id = memcg->id.id; 3605 3645 3606 3646 return 0; 3607 3647 } ··· 3604 3654 static void memcg_offline_kmem(struct mem_cgroup *memcg) 3605 3655 { 3606 3656 struct mem_cgroup *parent; 3607 - int kmemcg_id; 3608 3657 3609 - if (memcg->kmemcg_id == -1) 3658 + if (cgroup_memory_nokmem) 3659 + return; 3660 + 3661 + if (unlikely(mem_cgroup_is_root(memcg))) 3610 3662 return; 3611 3663 3612 3664 parent = parent_mem_cgroup(memcg); ··· 3617 3665 3618 3666 memcg_reparent_objcgs(memcg, parent); 3619 3667 3620 - kmemcg_id = memcg->kmemcg_id; 3621 - BUG_ON(kmemcg_id < 0); 3622 - 3623 3668 /* 3624 3669 * After we have finished memcg_reparent_objcgs(), all list_lrus 3625 3670 * corresponding to this cgroup are guaranteed to remain empty. 3626 3671 * The ordering is imposed by list_lru_node->lock taken by 3627 - * memcg_drain_all_list_lrus(). 3672 + * memcg_reparent_list_lrus(). 
3628 3673 */ 3629 - memcg_drain_all_list_lrus(kmemcg_id, parent); 3630 - 3631 - memcg_free_cache_id(kmemcg_id); 3632 - memcg->kmemcg_id = -1; 3674 + memcg_reparent_list_lrus(memcg, parent); 3633 3675 } 3634 3676 #else 3635 3677 static int memcg_online_kmem(struct mem_cgroup *memcg) ··· 3709 3763 } 3710 3764 break; 3711 3765 case RES_SOFT_LIMIT: 3712 - memcg->soft_limit = nr_pages; 3713 - ret = 0; 3766 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { 3767 + ret = -EOPNOTSUPP; 3768 + } else { 3769 + memcg->soft_limit = nr_pages; 3770 + ret = 0; 3771 + } 3714 3772 break; 3715 3773 } 3716 3774 return ret ?: nbytes; ··· 4690 4740 char *endp; 4691 4741 int ret; 4692 4742 4743 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 4744 + return -EOPNOTSUPP; 4745 + 4693 4746 buf = strstrip(buf); 4694 4747 4695 4748 efd = simple_strtoul(buf, &endp, 10); ··· 5020 5067 static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) 5021 5068 { 5022 5069 struct mem_cgroup_per_node *pn; 5023 - int tmp = node; 5024 - /* 5025 - * This routine is called against possible nodes. 5026 - * But it's BUG to call kmalloc() against offline node. 5027 - * 5028 - * TODO: this routine can waste much memory for nodes which will 5029 - * never be onlined. It's better to use memory hotplug callback 5030 - * function. 
5031 - */ 5032 - if (!node_state(node, N_NORMAL_MEMORY)) 5033 - tmp = -1; 5034 - pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp); 5070 + 5071 + pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node); 5035 5072 if (!pn) 5036 5073 return 1; 5037 5074 ··· 5033 5090 } 5034 5091 5035 5092 lruvec_init(&pn->lruvec); 5036 - pn->usage_in_excess = 0; 5037 - pn->on_tree = false; 5038 5093 pn->memcg = memcg; 5039 5094 5040 5095 memcg->nodeinfo[node] = pn; ··· 5078 5137 return ERR_PTR(error); 5079 5138 5080 5139 memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 5081 - 1, MEM_CGROUP_ID_MAX, 5082 - GFP_KERNEL); 5140 + 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL); 5083 5141 if (memcg->id.id < 0) { 5084 5142 error = memcg->id.id; 5085 5143 goto fail; ··· 5132 5192 { 5133 5193 struct mem_cgroup *parent = mem_cgroup_from_css(parent_css); 5134 5194 struct mem_cgroup *memcg, *old_memcg; 5135 - long error = -ENOMEM; 5136 5195 5137 5196 old_memcg = set_active_memcg(parent); 5138 5197 memcg = mem_cgroup_alloc(); ··· 5160 5221 return &memcg->css; 5161 5222 } 5162 5223 5163 - /* The following stuff does not apply to the root */ 5164 - error = memcg_online_kmem(memcg); 5165 - if (error) 5166 - goto fail; 5167 - 5168 5224 if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) 5169 5225 static_branch_inc(&memcg_sockets_enabled_key); 5170 5226 5171 5227 return &memcg->css; 5172 - fail: 5173 - mem_cgroup_id_remove(memcg); 5174 - mem_cgroup_free(memcg); 5175 - return ERR_PTR(error); 5176 5228 } 5177 5229 5178 5230 static int mem_cgroup_css_online(struct cgroup_subsys_state *css) 5179 5231 { 5180 5232 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 5181 5233 5234 + if (memcg_online_kmem(memcg)) 5235 + goto remove_id; 5236 + 5182 5237 /* 5183 5238 * A memcg must be visible for expand_shrinker_info() 5184 5239 * by the time the maps are allocated. So, we allocate maps 5185 5240 * here, when for_each_mem_cgroup() can't skip it. 
5186 5241 */ 5187 - if (alloc_shrinker_info(memcg)) { 5188 - mem_cgroup_id_remove(memcg); 5189 - return -ENOMEM; 5190 - } 5242 + if (alloc_shrinker_info(memcg)) 5243 + goto offline_kmem; 5191 5244 5192 5245 /* Online state pins memcg ID, memcg ID pins CSS */ 5193 5246 refcount_set(&memcg->id.ref, 1); ··· 5189 5258 queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 5190 5259 2UL*HZ); 5191 5260 return 0; 5261 + offline_kmem: 5262 + memcg_offline_kmem(memcg); 5263 + remove_id: 5264 + mem_cgroup_id_remove(memcg); 5265 + return -ENOMEM; 5192 5266 } 5193 5267 5194 5268 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) ··· 5251 5315 cancel_work_sync(&memcg->high_work); 5252 5316 mem_cgroup_remove_from_trees(memcg); 5253 5317 free_shrinker_info(memcg); 5254 - 5255 - /* Need to offline kmem if online_css() fails */ 5256 - memcg_offline_kmem(memcg); 5257 5318 mem_cgroup_free(memcg); 5258 5319 } 5259 5320 ··· 6734 6801 page_counter_uncharge(&ug->memcg->memory, ug->nr_memory); 6735 6802 if (do_memsw_account()) 6736 6803 page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory); 6737 - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem) 6738 - page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem); 6804 + if (ug->nr_kmem) 6805 + memcg_account_kmem(ug->memcg, -ug->nr_kmem); 6739 6806 memcg_oom_recover(ug->memcg); 6740 6807 } 6741 6808 ··· 6754 6821 long nr_pages; 6755 6822 struct mem_cgroup *memcg; 6756 6823 struct obj_cgroup *objcg; 6757 - bool use_objcg = folio_memcg_kmem(folio); 6758 6824 6759 6825 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); 6760 6826 ··· 6762 6830 * folio memcg or objcg at this point, we have fully 6763 6831 * exclusive access to the folio. 
6764 6832 */ 6765 - if (use_objcg) { 6833 + if (folio_memcg_kmem(folio)) { 6766 6834 objcg = __folio_objcg(folio); 6767 6835 /* 6768 6836 * This get matches the put at the end of the function and ··· 6790 6858 6791 6859 nr_pages = folio_nr_pages(folio); 6792 6860 6793 - if (use_objcg) { 6861 + if (folio_memcg_kmem(folio)) { 6794 6862 ug->nr_memory += nr_pages; 6795 6863 ug->nr_kmem += nr_pages; 6796 6864 ··· 6900 6968 return; 6901 6969 6902 6970 /* Do not associate the sock with unrelated interrupted task's memcg. */ 6903 - if (in_interrupt()) 6971 + if (!in_task()) 6904 6972 return; 6905 6973 6906 6974 rcu_read_lock(); ··· 6985 7053 if (!strcmp(token, "nokmem")) 6986 7054 cgroup_memory_nokmem = true; 6987 7055 } 6988 - return 0; 7056 + return 1; 6989 7057 } 6990 7058 __setup("cgroup.memory=", cgroup_memory); 6991 7059 ··· 7111 7179 * important here to have the interrupts disabled because it is the 7112 7180 * only synchronisation we have for updating the per-CPU variables. 7113 7181 */ 7114 - VM_BUG_ON(!irqs_disabled()); 7182 + memcg_stats_lock(); 7115 7183 mem_cgroup_charge_statistics(memcg, -nr_entries); 7184 + memcg_stats_unlock(); 7116 7185 memcg_check_events(memcg, page_to_nid(page)); 7117 7186 7118 7187 css_put(&memcg->css);
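The mm/memcontrol.c hunks above change drain_obj_stock() so that it returns the previously cached obj_cgroup, and callers call obj_cgroup_put() only after dropping stock_lock. The following is a minimal userspace sketch of that "detach under the lock, release after unlock" pattern, not the kernel code. The toy lock flag and the names `struct obj`, `struct stock`, `drain_locked()` and `refill()` are hypothetical stand-ins. The assertion in `obj_put()` encodes an assumption: that the final put may itself need the stock lock.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for local_lock_t: tracks only whether the lock is held. */
static int stock_lock_held;

static void stock_lock(void)   { assert(!stock_lock_held); stock_lock_held = 1; }
static void stock_unlock(void) { assert(stock_lock_held);  stock_lock_held = 0; }

/* Hypothetical refcounted object standing in for struct obj_cgroup. */
struct obj {
    int refs;      /* plain int: this sketch is single-threaded */
    int released;  /* set once the last reference is dropped */
};

/* Hypothetical stock standing in for struct memcg_stock_pcp. */
struct stock {
    struct obj *cached;
    unsigned int nr_bytes;
};

static void obj_get(struct obj *o) { o->refs++; }

/*
 * Assumption: dropping the last reference may need the stock lock itself,
 * so a put must never run while the lock is held.
 */
static void obj_put(struct obj *o)
{
    assert(!stock_lock_held);
    if (--o->refs == 0)
        o->released = 1;
}

/*
 * Caller holds the lock. Detaches the cached object and returns it with
 * its reference intact. The kernel flushes leftover bytes at this point;
 * the sketch simply discards them.
 */
static struct obj *drain_locked(struct stock *s)
{
    struct obj *old = s->cached;

    if (!old)
        return NULL;
    s->nr_bytes = 0;
    s->cached = NULL;
    return old;
}

/* Cache nr_bytes for o, replacing any other cached object. */
static void refill(struct stock *s, struct obj *o, unsigned int nr_bytes)
{
    struct obj *old = NULL;

    stock_lock();
    if (s->cached != o) {
        old = drain_locked(s);
        obj_get(o);
        s->cached = o;
    }
    s->nr_bytes += nr_bytes;
    stock_unlock();

    if (old)
        obj_put(old); /* deferred until the lock is dropped */
}
```

Calling `refill()` with a different object than the cached one detaches the old object under the lock and drops its reference only after `stock_unlock()`, which mirrors the `if (old) obj_cgroup_put(old);` lines added in the diff.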

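The new memcg_account_kmem() helper in the diff above takes a signed page count: it always updates the MEMCG_KMEM statistic, and only on the legacy (non-default) hierarchy does it also charge or uncharge the kmem page counter. Below is a rough userspace model of that control flow. `struct counter`, `struct group` and the `legacy_hierarchy` flag are hypothetical stand-ins for struct page_counter, struct mem_cgroup and `!cgroup_subsys_on_dfl(memory_cgrp_subsys)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page_counter. */
struct counter {
    unsigned long usage;
};

static void counter_charge(struct counter *c, unsigned long nr)
{
    c->usage += nr;
}

static void counter_uncharge(struct counter *c, unsigned long nr)
{
    assert(c->usage >= nr); /* the sketch refuses to underflow */
    c->usage -= nr;
}

/* Hypothetical stand-in for the parts of struct mem_cgroup used here. */
struct group {
    long kmem_stat;          /* models the MEMCG_KMEM statistic */
    struct counter kmem;     /* models memcg->kmem */
    bool legacy_hierarchy;   /* models !cgroup_subsys_on_dfl(...) */
};

/* Positive nr_pages charges, negative nr_pages uncharges. */
static void account_kmem(struct group *g, int nr_pages)
{
    g->kmem_stat += nr_pages;
    if (g->legacy_hierarchy) {
        if (nr_pages > 0)
            counter_charge(&g->kmem, (unsigned long)nr_pages);
        else
            counter_uncharge(&g->kmem, (unsigned long)-nr_pages);
    }
}
```

Folding both directions into one signed helper is what lets the diff replace the separate `page_counter_charge()` and `page_counter_uncharge()` call sites with `memcg_account_kmem(memcg, nr_pages)` and `memcg_account_kmem(memcg, -nr_pages)`.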
+84 -64

mm/memory-failure.c

··· 130 130 hwpoison_filter_dev_minor == ~0U) 131 131 return 0; 132 132 133 - /* 134 - * page_mapping() does not accept slab pages. 135 - */ 136 - if (PageSlab(p)) 137 - return -EINVAL; 138 - 139 133 mapping = page_mapping(p); 140 134 if (mapping == NULL || mapping->host == NULL) 141 135 return -EINVAL; ··· 252 258 pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", 253 259 pfn, t->comm, t->pid); 254 260 255 - if (flags & MF_ACTION_REQUIRED) { 256 - if (t == current) 257 - ret = force_sig_mceerr(BUS_MCEERR_AR, 258 - (void __user *)tk->addr, addr_lsb); 259 - else 260 - /* Signal other processes sharing the page if they have PF_MCE_EARLY set. */ 261 - ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr, 262 - addr_lsb, t); 263 - } else { 261 + if ((flags & MF_ACTION_REQUIRED) && (t == current)) 262 + ret = force_sig_mceerr(BUS_MCEERR_AR, 263 + (void __user *)tk->addr, addr_lsb); 264 + else 264 265 /* 266 + * Signal other processes sharing the page if they have 267 + * PF_MCE_EARLY set. 265 268 * Don't use force here, it's convenient if the signal 266 269 * can be temporarily blocked. 267 270 * This could cause a loop when the user sets SIGBUS ··· 266 275 */ 267 276 ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr, 268 277 addr_lsb, t); /* synchronous? */ 269 - } 270 278 if (ret < 0) 271 279 pr_info("Memory failure: Error sending signal to %s:%d: %d\n", 272 280 t->comm, t->pid, ret); ··· 305 315 pmd_t *pmd; 306 316 pte_t *pte; 307 317 318 + VM_BUG_ON_VMA(address == -EFAULT, vma); 308 319 pgd = pgd_offset(vma->vm_mm, address); 309 320 if (!pgd_present(*pgd)) 310 321 return 0; ··· 698 707 (void *)&priv); 699 708 if (ret == 1 && priv.tk.addr) 700 709 kill_proc(&priv.tk, pfn, flags); 710 + else 711 + ret = 0; 701 712 mmap_read_unlock(p->mm); 702 - return ret ? -EFAULT : -EHWPOISON; 713 + return ret > 0 ? 
-EHWPOISON : -EFAULT; 703 714 } 704 715 705 716 static const char *action_name[] = { ··· 732 739 [MF_MSG_BUDDY] = "free buddy page", 733 740 [MF_MSG_DAX] = "dax page", 734 741 [MF_MSG_UNSPLIT_THP] = "unsplit thp", 742 + [MF_MSG_DIFFERENT_PAGE_SIZE] = "different page size", 735 743 [MF_MSG_UNKNOWN] = "unknown page", 736 744 }; 737 745 ··· 1176 1182 * does not return true for hugetlb or device memory pages, so it's assumed 1177 1183 * to be called only in the context where we never have such pages. 1178 1184 */ 1179 - static inline bool HWPoisonHandlable(struct page *page) 1185 + static inline bool HWPoisonHandlable(struct page *page, unsigned long flags) 1180 1186 { 1181 - return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page); 1187 + bool movable = false; 1188 + 1189 + /* Soft offline could mirgate non-LRU movable pages */ 1190 + if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page)) 1191 + movable = true; 1192 + 1193 + return movable || PageLRU(page) || is_free_buddy_page(page); 1182 1194 } 1183 1195 1184 - static int __get_hwpoison_page(struct page *page) 1196 + static int __get_hwpoison_page(struct page *page, unsigned long flags) 1185 1197 { 1186 1198 struct page *head = compound_head(page); 1187 1199 int ret = 0; ··· 1202 1202 * for any unsupported type of page in order to reduce the risk of 1203 1203 * unexpected races caused by taking a page refcount. 1204 1204 */ 1205 - if (!HWPoisonHandlable(head)) 1205 + if (!HWPoisonHandlable(head, flags)) 1206 1206 return -EBUSY; 1207 1207 1208 1208 if (get_page_unless_zero(head)) { ··· 1227 1227 1228 1228 try_again: 1229 1229 if (!count_increased) { 1230 - ret = __get_hwpoison_page(p); 1230 + ret = __get_hwpoison_page(p, flags); 1231 1231 if (!ret) { 1232 1232 if (page_count(p)) { 1233 1233 /* We raced with an allocation, retry. 
*/ ··· 1255 1255 } 1256 1256 } 1257 1257 1258 - if (PageHuge(p) || HWPoisonHandlable(p)) { 1258 + if (PageHuge(p) || HWPoisonHandlable(p, flags)) { 1259 1259 ret = 1; 1260 1260 } else { 1261 1261 /* ··· 1411 1411 if (kill) 1412 1412 collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED); 1413 1413 1414 - if (!PageHuge(hpage)) { 1415 - try_to_unmap(hpage, ttu); 1414 + if (PageHuge(hpage) && !PageAnon(hpage)) { 1415 + /* 1416 + * For hugetlb pages in shared mappings, try_to_unmap 1417 + * could potentially call huge_pmd_unshare. Because of 1418 + * this, take semaphore in write mode here and set 1419 + * TTU_RMAP_LOCKED to indicate we have taken the lock 1420 + * at this higher level. 1421 + */ 1422 + mapping = hugetlb_page_mapping_lock_write(hpage); 1423 + if (mapping) { 1424 + try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED); 1425 + i_mmap_unlock_write(mapping); 1426 + } else 1427 + pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn); 1416 1428 } else { 1417 - if (!PageAnon(hpage)) { 1418 - /* 1419 - * For hugetlb pages in shared mappings, try_to_unmap 1420 - * could potentially call huge_pmd_unshare. Because of 1421 - * this, take semaphore in write mode here and set 1422 - * TTU_RMAP_LOCKED to indicate we have taken the lock 1423 - * at this higher level. 
1424 - */ 1425 - mapping = hugetlb_page_mapping_lock_write(hpage); 1426 - if (mapping) { 1427 - try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED); 1428 - i_mmap_unlock_write(mapping); 1429 - } else 1430 - pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn); 1431 - } else { 1432 - try_to_unmap(hpage, ttu); 1433 - } 1429 + try_to_unmap(hpage, ttu); 1434 1430 } 1435 1431 1436 1432 unmap_success = !page_mapped(hpage); ··· 1522 1526 if (TestClearPageHWPoison(head)) 1523 1527 num_poisoned_pages_dec(); 1524 1528 unlock_page(head); 1525 - return 0; 1529 + return -EOPNOTSUPP; 1526 1530 } 1527 1531 unlock_page(head); 1528 1532 res = MF_FAILED; ··· 1539 1543 } 1540 1544 1541 1545 lock_page(head); 1546 + 1547 + /* 1548 + * The page could have changed compound pages due to race window. 1549 + * If this happens just bail out. 1550 + */ 1551 + if (!PageHuge(p) || compound_head(p) != head) { 1552 + action_result(pfn, MF_MSG_DIFFERENT_PAGE_SIZE, MF_IGNORED); 1553 + res = -EBUSY; 1554 + goto out; 1555 + } 1556 + 1542 1557 page_flags = head->flags; 1558 + 1559 + if (hwpoison_filter(p)) { 1560 + if (TestClearPageHWPoison(head)) 1561 + num_poisoned_pages_dec(); 1562 + put_page(p); 1563 + res = -EOPNOTSUPP; 1564 + goto out; 1565 + } 1543 1566 1544 1567 /* 1545 1568 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so ··· 1628 1613 goto out; 1629 1614 1630 1615 if (hwpoison_filter(page)) { 1631 - rc = 0; 1616 + rc = -EOPNOTSUPP; 1632 1617 goto unlock; 1633 1618 } 1634 1619 ··· 1653 1638 * SIGBUS (i.e. 
MF_MUST_KILL)
 	 */
 	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
-	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+	collect_procs(page, &tokill, true);

 	list_for_each_entry(tk, &tokill, nd)
 		if (tk->size_shift)
···
 		start = (page->index << PAGE_SHIFT) & ~(size - 1);
 		unmap_mapping_range(page->mapping, start, size, 0);
 	}
-	kill_procs(&tokill, flags & MF_MUST_KILL, false, pfn, flags);
+	kill_procs(&tokill, true, false, pfn, flags);
 	rc = 0;
 unlock:
 	dax_unlock_page(page, cookie);
···
  *
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks hold.
+ *
+ * Return: 0 for successfully handled the memory error,
+ * -EOPNOTSUPP for memory_filter() filtered the error event,
+ * < 0(except -EOPNOTSUPP) on failure.
  */
 int memory_failure(unsigned long pfn, int flags)
 {
 	struct page *p;
 	struct page *hpage;
-	struct page *orig_head;
 	struct dev_pagemap *pgmap;
 	int res = 0;
 	unsigned long page_flags;
···
 		goto unlock_mutex;
 	}

-	orig_head = hpage = compound_head(p);
+	hpage = compound_head(p);
 	num_poisoned_pages_inc();

 	/*
···
 	lock_page(p);

 	/*
-	 * The page could have changed compound pages during the locking.
-	 * If this happens just bail out.
+	 * We're only intended to deal with the non-Compound page here.
+	 * However, the page could have changed compound pages due to
+	 * race window. If this happens, we could try again to hopefully
+	 * handle the page next round.
 	 */
-	if (PageCompound(p) && compound_head(p) != orig_head) {
+	if (PageCompound(p)) {
+		if (retry) {
+			if (TestClearPageHWPoison(p))
+				num_poisoned_pages_dec();
+			unlock_page(p);
+			put_page(p);
+			flags &= ~MF_COUNT_INCREASED;
+			retry = false;
+			goto try_again;
+		}
 		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
 		res = -EBUSY;
 		goto unlock_page;
···
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		res = -EOPNOTSUPP;
 		goto unlock_mutex;
 	}

···
 	 * page_lock. We need wait writeback completion for this page or it
 	 * may trigger vfs BUG while evict inode.
 	 */
-	if (!PageTransTail(p) && !PageLRU(p) && !PageWriteback(p))
+	if (!PageLRU(p) && !PageWriteback(p))
 		goto identify_page_state;

 	/*
···
 		.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
 	};

-	/*
-	 * Check PageHWPoison again inside page lock because PageHWPoison
-	 * is set by memory_failure() outside page lock. Note that
-	 * memory_failure() also double-checks PageHWPoison inside page lock,
-	 * so there's no race between soft_offline_page() and memory_failure().
-	 */
 	lock_page(page);
 	if (!PageHuge(page))
 		wait_on_page_writeback(page);
···
 		return 0;
 	}

-	if (!PageHuge(page))
+	if (!PageHuge(page) && PageLRU(page) && !PageSwapCache(page))
 		/*
 		 * Try to invalidate first. This should work for
 		 * non dirty unmapped page cache pages.
···
 		ret = invalidate_inode_page(page);
 	unlock_page(page);

-	/*
-	 * RED-PEN would be better to keep it isolated here, but we
-	 * would need to fix isolation locking first.
-	 */
 	if (ret) {
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
 		page_handle_poison(page, false, true);
···

 retry:
 	get_online_mems();
-	ret = get_hwpoison_page(page, flags);
+	ret = get_hwpoison_page(page, flags | MF_SOFT_OFFLINE);
 	put_online_mems();

 	if (ret > 0) {

+58 -44

mm/memory.c

···
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
 struct zap_details {
-	struct address_space *zap_mapping;	/* Check page->mapping if set */
 	struct folio *single_folio;	/* Locked folio to be unmapped */
+	bool even_cows;			/* Zap COWed private pages too? */
 };

-/*
- * We set details->zap_mapping when we want to unmap shared but keep private
- * pages. Return true if skip zapping this page, false otherwise.
- */
-static inline bool
-zap_skip_check_mapping(struct zap_details *details, struct page *page)
+/* Whether we should zap all COWed (private) pages too */
+static inline bool should_zap_cows(struct zap_details *details)
 {
-	if (!details || !page)
-		return false;
+	/* By default, zap all pages */
+	if (!details)
+		return true;

-	return details->zap_mapping &&
-		(details->zap_mapping != page_rmapping(page));
+	/* Or, we zap COWed pages only if the caller wants to */
+	return details->even_cows;
+}
+
+/* Decides whether we should zap this page with the page pointer specified */
+static inline bool should_zap_page(struct zap_details *details, struct page *page)
+{
+	/* If we can make a decision without *page.. */
+	if (should_zap_cows(details))
+		return true;
+
+	/* E.g. the caller passes NULL for the case of a zero page */
+	if (!page)
+		return true;
+
+	/* Otherwise we should only zap non-anon pages */
+	return !PageAnon(page);
 }

 static unsigned long zap_pte_range(struct mmu_gather *tlb,
···
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
+		struct page *page;
+
 		if (pte_none(ptent))
 			continue;

···
 			break;

 		if (pte_present(ptent)) {
-			struct page *page;
-
 			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(zap_skip_check_mapping(details, page)))
+			if (unlikely(!should_zap_page(details, page)))
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
···
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
-			struct page *page = pfn_swap_entry_to_page(entry);
-
-			if (unlikely(zap_skip_check_mapping(details, page)))
+			page = pfn_swap_entry_to_page(entry);
+			if (unlikely(!should_zap_page(details, page)))
 				continue;
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-
 			if (is_device_private_entry(entry))
 				page_remove_rmap(page, false);
-
 			put_page(page);
-			continue;
-		}
-
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
-		if (!non_swap_entry(entry))
+		} else if (!non_swap_entry(entry)) {
+			/* Genuine swap entry, hence a private anon page */
+			if (!should_zap_cows(details))
+				continue;
 			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
-			struct page *page;
-
+			if (unlikely(!free_swap_and_cache(entry)))
+				print_bad_pte(vma, addr, ptent, NULL);
+		} else if (is_migration_entry(entry)) {
 			page = pfn_swap_entry_to_page(entry);
+			if (!should_zap_page(details, page))
+				continue;
 			rss[mm_counter(page)]--;
+		} else if (is_hwpoison_entry(entry)) {
+			if (!should_zap_cows(details))
+				continue;
+		} else {
+			/* We should have covered all the swap entry types */
+			WARN_ON_ONCE(1);
 		}
-		if (unlikely(!free_swap_and_cache(entry)))
-			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, addr != end);

···
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size)
 {
-	if (address < vma->vm_start || address + size > vma->vm_end ||
+	if (!range_in_vma(vma, address, address + size) ||
 			!(vma->vm_flags & VM_PFNMAP))
 		return;

···
 	vma_interval_tree_foreach(vma, root, first_index, last_index) {
 		vba = vma->vm_pgoff;
 		vea = vba + vma_pages(vma) - 1;
-		zba = first_index;
-		if (zba < vba)
-			zba = vba;
-		zea = last_index;
-		if (zea > vea)
-			zea = vea;
+		zba = max(first_index, vba);
+		zea = min(last_index, vea);

 		unmap_mapping_range_vma(vma,
 			((zba - vba) << PAGE_SHIFT) + vma->vm_start,
···
 	first_index = folio->index;
 	last_index = folio->index + folio_nr_pages(folio) - 1;

-	details.zap_mapping = mapping;
+	details.even_cows = false;
 	details.single_folio = folio;

 	i_mmap_lock_write(mapping);
···
 	pgoff_t first_index = start;
 	pgoff_t last_index = start + nr - 1;

-	details.zap_mapping = even_cows ? NULL : mapping;
+	details.even_cows = even_cows;
 	if (last_index < first_index)
 		last_index = ULONG_MAX;

···
 		return ret;

 	if (unlikely(PageHWPoison(vmf->page))) {
-		if (ret & VM_FAULT_LOCKED)
+		vm_fault_t poisonret = VM_FAULT_HWPOISON;
+		if (ret & VM_FAULT_LOCKED) {
+			/* Retry if a clean page was removed from the cache. */
+			if (invalidate_inode_page(vmf->page))
+				poisonret = 0;
 			unlock_page(vmf->page);
+		}
 		put_page(vmf->page);
 		vmf->page = NULL;
-		return VM_FAULT_HWPOISON;
+		return poisonret;
 	}

 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
···
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = address & PAGE_MASK,
+		.real_address = address,
 		.flags = flags,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
···
 		ret_val -= (PAGE_SIZE - rc);
 		if (rc)
 			break;
+
+		flush_dcache_page(subpage);

 		cond_resched();
 	}
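
The new zap decision in mm/memory.c reduces to a small truth table. Below is a minimal userspace sketch of that logic, not the kernel code itself: `struct page` and `struct zap_details` here are hypothetical stand-ins that carry only the one field each check needs, and `anon` replaces the real `PageAnon()` test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the kernel structures are far richer. */
struct page { bool anon; };
struct zap_details { bool even_cows; };

/* No details at all means "zap everything", COWed pages included. */
static bool should_zap_cows(const struct zap_details *details)
{
	if (!details)
		return true;
	return details->even_cows;
}

static bool should_zap_page(const struct zap_details *details,
			    const struct page *page)
{
	/* Decision possible without looking at the page. */
	if (should_zap_cows(details))
		return true;
	/* No struct page (e.g. the zero page): zap it. */
	if (!page)
		return true;
	/* Otherwise keep anonymous (COWed) pages, zap the rest. */
	return !page->anon;
}
```

Under these assumptions, the only combination that is skipped is an anonymous page seen by a caller that passed `even_cows = false`.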

+30 -102

mm/memory_hotplug.c

···
 }
 EXPORT_SYMBOL_GPL(pfn_to_online_page);

-/*
- * Reasonably generic function for adding memory. It is
- * expected that archs that support memory hotplug will
- * call this function after deciding the zone to which to
- * add the new pages.
- */
 int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		struct mhp_params *params)
 {
···
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int zid;

-	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+	for (zid = 0; zid < ZONE_NORMAL; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];

 		if (zone_intersects(zone, start_pfn, nr_pages))
···
 }

 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-static pg_data_t __ref *hotadd_new_pgdat(int nid)
+static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
 	struct pglist_data *pgdat;

+	/*
+	 * NODE_DATA is preallocated (free_area_init) but its internal
+	 * state is not allocated completely. Add missing pieces.
+	 * Completely offline nodes stay around and they just need
+	 * reintialization.
+	 */
 	pgdat = NODE_DATA(nid);
-	if (!pgdat) {
-		pgdat = arch_alloc_nodedata(nid);
-		if (!pgdat)
-			return NULL;
-
-		pgdat->per_cpu_nodestats =
-			alloc_percpu(struct per_cpu_nodestat);
-		arch_refresh_nodedata(nid, pgdat);
-	} else {
-		int cpu;
-		/*
-		 * Reset the nr_zones, order and highest_zoneidx before reuse.
-		 * Note that kswapd will init kswapd_highest_zoneidx properly
-		 * when it starts in the near future.
-		 */
-		pgdat->nr_zones = 0;
-		pgdat->kswapd_order = 0;
-		pgdat->kswapd_highest_zoneidx = 0;
-		for_each_online_cpu(cpu) {
-			struct per_cpu_nodestat *p;
-
-			p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
-			memset(p, 0, sizeof(*p));
-		}
-	}
-
-	/* we can use NODE_DATA(nid) from here */
-	pgdat->node_id = nid;
-	pgdat->node_start_pfn = 0;

 	/* init node's zones as empty zones, we don't have any present pages.*/
-	free_area_init_core_hotplug(nid);
+	free_area_init_core_hotplug(pgdat);

 	/*
 	 * The node we allocated has no zone fallback lists. For avoiding
···
 	 * When memory is hot-added, all the memory is in offline state. So
 	 * clear all zones' present_pages because they will be updated in
 	 * online_pages() and offline_pages().
+	 * TODO: should be in free_area_init_core_hotplug?
 	 */
 	reset_node_managed_pages(pgdat);
 	reset_node_present_pages(pgdat);

 	return pgdat;
 }
-
-static void rollback_node_hotadd(int nid)
-{
-	pg_data_t *pgdat = NODE_DATA(nid);
-
-	arch_refresh_nodedata(nid, NULL);
-	free_percpu(pgdat->per_cpu_nodestats);
-	arch_free_nodedata(pgdat);
-}
-

 /*
  * __try_online_node - online a node if offlined
···
 	if (node_online(nid))
 		return 0;

-	pgdat = hotadd_new_pgdat(nid);
+	pgdat = hotadd_init_pgdat(nid);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
 		ret = -ENOMEM;
···
 	 * populate a single PMD.
 	 */
 	return memmap_on_memory &&
-	       !hugetlb_free_vmemmap_enabled &&
+	       !hugetlb_free_vmemmap_enabled() &&
 	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
 	       size == memory_block_size_bytes() &&
 	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
···
 		BUG_ON(ret);
 	}

-	/* link memory sections under this node.*/
-	link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1),
-			  MEMINIT_HOTPLUG);
+	register_memory_blocks_under_node(nid, PFN_DOWN(start),
+					  PFN_UP(start + size - 1),
+					  MEMINIT_HOTPLUG);

 	/* create new memmap entry */
 	if (!strcmp(res->name, "System RAM"))
···

 	return ret;
 error:
-	/* rollback pgdat allocation and others */
-	if (new_node)
-		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
 error_mem_hotplug_end:
···
 }

 #ifdef CONFIG_MEMORY_HOTREMOVE
-/*
- * Confirm all pages in a range [start, end) belong to the same zone (skipping
- * memory holes). When true, return the zone.
- */
-struct zone *test_pages_in_a_zone(unsigned long start_pfn,
-				  unsigned long end_pfn)
-{
-	unsigned long pfn, sec_end_pfn;
-	struct zone *zone = NULL;
-	struct page *page;
-
-	for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
-	     pfn < end_pfn;
-	     pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
-		/* Make sure the memory section is present first */
-		if (!present_section_nr(pfn_to_section_nr(pfn)))
-			continue;
-		for (; pfn < sec_end_pfn && pfn < end_pfn;
-		     pfn += MAX_ORDER_NR_PAGES) {
-			/* Check if we got outside of the zone */
-			if (zone && !zone_spans_pfn(zone, pfn))
-				return NULL;
-			page = pfn_to_page(pfn);
-			if (zone && page_zone(page) != zone)
-				return NULL;
-			zone = page_zone(page);
-		}
-	}
-
-	return zone;
-}
-
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
  * non-lru movable pages and hugepages). Will skip over most unmovable
···
 }

 int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			struct memory_group *group)
+			struct zone *zone, struct memory_group *group)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long pfn, system_ram_pages = 0;
+	const int node = zone_to_nid(zone);
 	unsigned long flags;
-	struct zone *zone;
 	struct memory_notify arg;
-	int ret, node;
 	char *reason;
+	int ret;

 	/*
 	 * {on,off}lining is constrained to full memory sections (or more
···
 		goto failed_removal;
 	}

-	/* This makes hotplug much easier...and readable.
-	   we assume this for now. .*/
-	zone = test_pages_in_a_zone(start_pfn, end_pfn);
-	if (!zone) {
+	/*
+	 * We only support offlining of memory blocks managed by a single zone,
+	 * checked by calling code. This is just a sanity check that we might
+	 * want to remove in the future.
+	 */
+	if (WARN_ON_ONCE(page_zone(pfn_to_page(start_pfn)) != zone ||
+			 page_zone(pfn_to_page(end_pfn - 1)) != zone)) {
 		ret = -EINVAL;
 		reason = "multizone range";
 		goto failed_removal;
 	}
-	node = zone_to_nid(zone);

 	/*
 	 * Disable pcplists so that page isolation cannot race with freeing
···
 	return 0;

 failed_removal_isolated:
+	/* pushback to free area */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal_pcplists_disabled:
···
 		 (unsigned long long) start_pfn << PAGE_SHIFT,
 		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
 		 reason);
-	/* pushback to free area */
 	mem_hotplug_done();
 	return ret;
 }
···
 	return mem->nr_vmemmap_pages;
 }

-static int check_cpu_on_node(pg_data_t *pgdat)
+static int check_cpu_on_node(int nid)
 {
 	int cpu;

 	for_each_present_cpu(cpu) {
-		if (cpu_to_node(cpu) == pgdat->node_id)
+		if (cpu_to_node(cpu) == nid)
 			/*
 			 * the cpu on this node isn't removed, and we can't
 			 * offline this node.
···
  */
 void try_offline_node(int nid)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
 	int rc;

 	/*
···
 	 * offline it. A node spans memory after move_pfn_range_to_zone(),
 	 * e.g., after the memory block was onlined.
 	 */
-	if (pgdat->node_spanned_pages)
+	if (node_spanned_pages(nid))
 		return;

 	/*
···
 	if (rc)
 		return;

-	if (check_cpu_on_node(pgdat))
+	if (check_cpu_on_node(nid))
 		return;

 	/*

+10 -19

mm/mempolicy.c

···
 static int mbind_range(struct mm_struct *mm, unsigned long start,
 		       unsigned long end, struct mempolicy *new_pol)
 {
-	struct vm_area_struct *next;
 	struct vm_area_struct *prev;
 	struct vm_area_struct *vma;
 	int err = 0;
···
 	if (start > vma->vm_start)
 		prev = vma;

-	for (; vma && vma->vm_start < end; prev = vma, vma = next) {
-		next = vma->vm_next;
+	for (; vma && vma->vm_start < end; prev = vma, vma = vma->vm_next) {
 		vmstart = max(start, vma->vm_start);
 		vmend = min(end, vma->vm_end);

···
 				 anon_vma_name(vma));
 		if (prev) {
 			vma = prev;
-			next = vma->vm_next;
-			if (mpol_equal(vma_policy(vma), new_pol))
-				continue;
-			/* vma_merge() joined vma && vma->next, case 8 */
 			goto replace;
 		}
 		if (vma->vm_start != vmstart) {
···
 static int lookup_node(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *p = NULL;
-	int err;
+	int ret;

-	int locked = 1;
-	err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
-	if (err > 0) {
-		err = page_to_nid(p);
+	ret = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
+	if (ret > 0) {
+		ret = page_to_nid(p);
 		put_page(p);
 	}
-	if (locked)
-		mmap_read_unlock(mm);
-	return err;
+	return ret;
 }

 /* Retrieve NUMA policy */
···
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
 			/*
-			 * Take a refcount on the mpol, lookup_node()
-			 * will drop the mmap_lock, so after calling
-			 * lookup_node() only "pol" remains valid, "vma"
-			 * is stale.
+			 * Take a refcount on the mpol, because we are about to
+			 * drop the mmap_lock, after which only "pol" remains
+			 * valid, "vma" is stale.
 			 */
 			pol_refcount = pol;
 			vma = NULL;
 			mpol_get(pol);
+			mmap_read_unlock(mm);
 			err = lookup_node(mm, addr);
 			if (err < 0)
 				goto out;

+2 -1

mm/memremap.c

···
 	return 0;

 err_add_memory:
-	kasan_remove_zero_shadow(__va(range->start), range_len(range));
+	if (!is_private)
+		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 err_kasan:
 	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
 err_pfn_remap:

+55 -63

mm/migrate.c

···
 #include <linux/oom.h>
 #include <linux/memory.h>
 #include <linux/random.h>
+#include <linux/sched/sysctl.h>

 #include <asm/tlbflush.h>

···

 	/* Driver shouldn't use PG_isolated bit of page->flags */
 	WARN_ON_ONCE(PageIsolated(page));
-	__SetPageIsolated(page);
+	SetPageIsolated(page);
 	unlock_page(page);

 	return 0;
···

 	mapping = page_mapping(page);
 	mapping->a_ops->putback_page(page);
-	__ClearPageIsolated(page);
+	ClearPageIsolated(page);
 }

 /*
···
 			if (PageMovable(page))
 				putback_movable_page(page);
 			else
-				__ClearPageIsolated(page);
+				ClearPageIsolated(page);
 			unlock_page(page);
 			put_page(page);
 		} else {
···
 		VM_BUG_ON_PAGE(!PageIsolated(page), page);
 		if (!PageMovable(page)) {
 			rc = MIGRATEPAGE_SUCCESS;
-			__ClearPageIsolated(page);
+			ClearPageIsolated(page);
 			goto out;
 		}

···
 			 * We clear PG_movable under page_lock so any compactor
 			 * cannot try to migrate this page.
 			 */
-			__ClearPageIsolated(page);
+			ClearPageIsolated(page);
 		}

 		/*
···
 			page->mapping = NULL;

 		if (likely(!is_zone_device_page(newpage)))
-			flush_dcache_page(newpage);
-
+			flush_dcache_folio(page_folio(newpage));
 	}
 out:
 	return rc;
···
 		if (unlikely(__PageMovable(page))) {
 			lock_page(page);
 			if (!PageMovable(page))
-				__ClearPageIsolated(page);
+				ClearPageIsolated(page);
 			unlock_page(page);
 		}
 		goto out;
···
 	bool is_thp = false;
 	struct page *page;
 	struct page *page2;
-	int swapwrite = current->flags & PF_SWAPWRITE;
 	int rc, nr_subpages;
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(thp_split_pages);
···
 	bool no_subpage_counting = false;

 	trace_mm_migrate_pages_start(mode, reason);
-
-	if (!swapwrite)
-		current->flags |= PF_SWAPWRITE;

 thp_subpage_migration:
 	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
···
 	trace_mm_migrate_pages(nr_succeeded, nr_failed_pages, nr_thp_succeeded,
 			       nr_thp_failed, nr_thp_split, mode, reason);

-	if (!swapwrite)
-		current->flags &= ~PF_SWAPWRITE;
-
 	if (ret_succeeded)
 		*ret_succeeded = nr_succeeded;

···
 {
 	struct vm_area_struct *vma;
 	struct page *page;
-	unsigned int follflags;
 	int err;

 	mmap_read_lock(mm);
···
 		goto out;

 	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
-	page = follow_page(vma, addr, follflags);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

 	err = PTR_ERR(page);
 	if (IS_ERR(page))
···
 			/* The page is successfully queued for migration */
 			continue;
 		}
+
+		/*
+		 * The move_pages() man page does not have an -EEXIST choice, so
+		 * use -EFAULT instead.
+		 */
+		if (err == -EEXIST)
+			err = -EFAULT;

 		/*
 		 * If the page is already on the target node (!err), store the
···
 {
 	int page_lru;
 	int nr_pages = thp_nr_pages(page);
+	int order = compound_order(page);

-	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
+	VM_BUG_ON_PAGE(order && !PageTransHuge(page), page);

 	/* Do not migrate THP mapped by multiple processes */
 	if (PageTransHuge(page) && total_mapcount(page) > 1)
 		return 0;

 	/* Avoid migrating to a node that is nearly full */
-	if (!migrate_balanced_pgdat(pgdat, nr_pages))
+	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
+		int z;
+
+		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
+			return 0;
+		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+			if (populated_zone(pgdat->node_zones + z))
+				break;
+		}
+		wakeup_kswapd(pgdat->node_zones + z, 0, order, ZONE_MOVABLE);
 		return 0;
+	}

 	if (isolate_lru_page(page))
 		return 0;
···
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
 	int nr_remaining;
+	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
 	new_page_t *new;
 	bool compound;
···

 	list_add(&page->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, *new, NULL, node,
-				     MIGRATE_ASYNC, MR_NUMA_MISPLACED, NULL);
+				     MIGRATE_ASYNC, MR_NUMA_MISPLACED,
+				     &nr_succeeded);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
···
 			putback_lru_page(page);
 		}
 		isolated = 0;
-	} else
-		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages);
+	}
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		if (!node_is_toptier(page_to_nid(page)) && node_is_toptier(node))
+			mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
+					    nr_succeeded);
+	}
 	BUG_ON(!list_empty(&migratepages));
 	return isolated;

···
 	if (best_distance != -1) {
 		val = node_distance(node, migration_target);
 		if (val > best_distance)
-			return NUMA_NO_NODE;
+			goto out_clear;
 	}

 	index = nd->nr;
 	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
 		      "Exceeds maximum demotion target nodes\n"))
-		return NUMA_NO_NODE;
+		goto out_clear;

 	nd->nodes[index] = migration_target;
 	nd->nr++;

 	return migration_target;
+out_clear:
+	node_clear(migration_target, *used);
+	return NUMA_NO_NODE;
 }

 /*
···
 /*
  * For callers that do not hold get_online_mems() already.
  */
-static void set_migration_target_nodes(void)
+void set_migration_target_nodes(void)
 {
 	get_online_mems();
 	__set_migration_target_nodes();
···
 	return notifier_from_errno(0);
 }

-/*
- * React to hotplug events that might affect the migration targets
- * like events that online or offline NUMA nodes.
- *
- * The ordering is also currently dependent on which nodes have
- * CPUs. That means we need CPU on/offline notification too.
- */
-static int migration_online_cpu(unsigned int cpu)
+void __init migrate_on_reclaim_init(void)
 {
-	set_migration_target_nodes();
-	return 0;
-}
-
-static int migration_offline_cpu(unsigned int cpu)
-{
-	set_migration_target_nodes();
-	return 0;
-}
-
-static int __init migrate_on_reclaim_init(void)
-{
-	int ret;
-
 	node_demotion = kmalloc_array(nr_node_ids,
 				      sizeof(struct demotion_nodes),
 				      GFP_KERNEL);
 	WARN_ON(!node_demotion);

-	ret = cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline",
-					NULL, migration_offline_cpu);
-	/*
-	 * In the unlikely case that this fails, the automatic
-	 * migration targets may become suboptimal for nodes
-	 * where N_CPU changes. With such a small impact in a
-	 * rare case, do not bother trying to do anything special.
-	 */
-	WARN_ON(ret < 0);
-	ret = cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online",
-				migration_online_cpu, NULL);
-	WARN_ON(ret < 0);
-
 	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-	return 0;
+	/*
+	 * At this point, all numa nodes with memory/CPus have their state
+	 * properly set, so we can build the demotion order now.
+	 * Let us hold the cpu_hotplug lock just, as we could possibily have
+	 * CPU hotplug events during boot.
+	 */
+	cpus_read_lock();
+	set_migration_target_nodes();
+	cpus_read_unlock();
 }
-late_initcall(migrate_on_reclaim_init);
 #endif /* CONFIG_HOTPLUG_CPU */

 bool numa_demotion_enabled = false;

+1

mm/mlock.c

···
 	}
 	if (!get_ucounts(ucounts)) {
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
+		allowed = 0;
 		goto out;
 	}
 	allowed = 1;

+2 -3

mm/mmap.c

···
 		/*
 		 * VM_NORESERVE is used because the reservations will be
 		 * taken when vm_ops->mmap() is called
-		 * A dummy user value is used because we are not locking
-		 * memory so no accounting is necessary
 		 */
 		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
 				VM_NORESERVE,
···
 	if (!*endptr)
 		stack_guard_gap = val << PAGE_SHIFT;

-	return 0;
+	return 1;
 }
 __setup("stack_guard_gap=", cmdline_parse_stack_guard_gap);

···
 	vma->vm_end = addr + len;

 	vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
+	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);

 	vma->vm_ops = ops;
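
The `return 0` to `return 1` change in the `stack_guard_gap=` handler follows the usual `__setup()` convention as I understand it: a handler returns 1 to say the option was consumed, while 0 lets the string be treated as unknown and handed on to init. A rough userspace sketch of the handler's shape follows; `demo_parse_stack_guard_gap`, `demo_stack_guard_gap` and `DEMO_PAGE_SHIFT` are made-up names, 4 KiB pages are assumed, and `strtoul()` stands in for the kernel's `simple_strtoul()`.

```c
#include <stdlib.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the kernel's stack_guard_gap variable. */
static unsigned long demo_stack_guard_gap = 256UL << DEMO_PAGE_SHIFT;

/* Returns 1 to report "option consumed", mirroring the patched handler. */
static int demo_parse_stack_guard_gap(const char *p)
{
	char *endptr;
	unsigned long val = strtoul(p, &endptr, 10);

	/* Accept the value only when the whole string was a number. */
	if (!*endptr)
		demo_stack_guard_gap = val << DEMO_PAGE_SHIFT;

	return 1;
}
```

Note that the sketch, like the hunk above, reports the option as handled even when the value is malformed; the gap is simply left unchanged in that case.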

+4 -3

mm/mmzone.c

···
 	unsigned long old_flags, flags;
 	int last_cpupid;

+	old_flags = READ_ONCE(page->flags);
 	do {
-		old_flags = flags = page->flags;
-		last_cpupid = page_cpupid_last(page);
+		flags = old_flags;
+		last_cpupid = (flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;

 		flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
 		flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
-	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+	} while (unlikely(!try_cmpxchg(&page->flags, &old_flags, flags)));

 	return last_cpupid;
 }
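
The mm/mmzone.c hunk switches to the "read once, then let the compare-exchange refresh the expected value" loop shape. A userspace analogue using C11 atomics might look like the sketch below. It is not the kernel code: `xchg_field`, `FIELD_SHIFT` and `FIELD_MASK` are hypothetical names, and `atomic_compare_exchange_weak()` plays the role of `try_cmpxchg()` in that it writes the current value back into `old` when the exchange fails.

```c
#include <stdatomic.h>

#define FIELD_SHIFT 8UL		/* hypothetical, stands in for LAST_CPUPID_PGSHIFT */
#define FIELD_MASK  0xffUL	/* hypothetical, stands in for LAST_CPUPID_MASK */

/* Store new_val in the bit field of *word; return the previous field value. */
static unsigned long xchg_field(_Atomic unsigned long *word, unsigned long new_val)
{
	/* Single initial load; later iterations reuse the refreshed 'old'. */
	unsigned long old = atomic_load_explicit(word, memory_order_relaxed);
	unsigned long new, last;

	do {
		new = old;
		last = (new >> FIELD_SHIFT) & FIELD_MASK;
		new &= ~(FIELD_MASK << FIELD_SHIFT);
		new |= (new_val & FIELD_MASK) << FIELD_SHIFT;
		/* On failure 'old' now holds the current value of *word. */
	} while (!atomic_compare_exchange_weak(word, &old, new));

	return last;
}
```

Deriving the old field from the same snapshot that is later compared keeps the returned value and the stored value consistent with each other, which is what the patched loop does by computing `last_cpupid` from `flags` instead of re-reading the page.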

+12 -1

mm/mprotect.c

···
 #include <linux/uaccess.h>
 #include <linux/mm_inline.h>
 #include <linux/pgtable.h>
+#include <linux/sched/sysctl.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
···
 			 */
 			if (prot_numa) {
 				struct page *page;
+				int nid;

 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
···
 				 * Don't mess with PTEs if page is already on the node
 				 * a single-threaded process is running on.
 				 */
-				if (target_node == page_to_nid(page))
+				nid = page_to_nid(page);
+				if (target_node == nid)
+					continue;
+
+				/*
+				 * Skip scanning top tier node if normal numa
+				 * balancing is disabled
+				 */
+				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+				    node_is_toptier(nid))
 					continue;
 			}


+2 -2

mm/mremap.c

···

 	if (mmap_write_lock_killable(current->mm))
 		return -EINTR;
-	vma = find_vma(mm, addr);
-	if (!vma || vma->vm_start > addr) {
+	vma = vma_lookup(mm, addr);
+	if (!vma) {
 		ret = EFAULT;
 		goto out;
 	}
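
The mm/mremap.c change relies on the difference between the two lookups: `find_vma()` returns the first area that ends above the address, which may start above it too, whereas `vma_lookup()` only returns an area that actually contains the address. The sketch below models that difference over a sorted array; `struct mini_vma` and the `mini_*` helpers are hypothetical and have nothing to do with the kernel's real VMA data structures.

```c
#include <stddef.h>

/* Hypothetical miniature VMA: a half-open range [start, end). */
struct mini_vma { unsigned long start, end; };

/* Like find_vma(): first area whose end lies above addr (v[] is sorted). */
static const struct mini_vma *mini_find_vma(const struct mini_vma *v, size_t n,
					    unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].end)
			return &v[i];
	return NULL;
}

/* Like vma_lookup(): only an area that actually contains addr. */
static const struct mini_vma *mini_vma_lookup(const struct mini_vma *v, size_t n,
					      unsigned long addr)
{
	const struct mini_vma *vma = mini_find_vma(v, n, addr);

	/* Reject a hit that begins above the address (addr is in a hole). */
	if (vma && addr < vma->start)
		vma = NULL;
	return vma;
}
```

With this model, an address that falls in a hole between two areas gets a non-NULL result from the first helper and NULL from the second, which is why the caller's extra `vm_start > addr` test becomes unnecessary.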

-3

mm/oom_kill.c

···
 	bool ret = false;
 	const nodemask_t *mask = oc->nodemask;

-	if (is_memcg_oom(oc))
-		return true;
-
 	rcu_read_lock();
 	for_each_thread(start, tsk) {
 		if (mask) {

-12

mm/page-writeback.c

···
 	}

 	/*
-	 * Unreclaimable memory (kernel memory or anonymous memory
-	 * without swap) can bring down the dirtyable pages below
-	 * the zone's dirty balance reserve and the above calculation
-	 * will underflow. However we still want to add in nodes
-	 * which are below threshold (negative values) to get a more
-	 * accurate calculation but make sure that the total never
-	 * underflows.
-	 */
-	if ((long)x < 0)
-		x = 0;
-
-	/*
 	 * Make sure that the number of highmem pages is never larger
 	 * than the number of the total dirtyable memory. This can only
 	 * occur in very strange VM situations but we want to make sure

+241 -216

mm/page_alloc.c

···
 struct pagesets {
 	local_lock_t lock;
 };
-static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };

···
 		int migratetype, fpi_t fpi_flags)
 {
 	struct capture_control *capc = task_capc(zone);
+	unsigned int max_order = pageblock_order;
 	unsigned long buddy_pfn;
 	unsigned long combined_pfn;
-	unsigned int max_order;
 	struct page *buddy;
 	bool to_tail;
-
-	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);

 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
···
 	}
 	if (order < MAX_ORDER - 1) {
 		/* If we are here, it means order is >= pageblock_order.
-		 * We want to prevent merge between freepages on isolate
-		 * pageblock and normal pageblock. Without this, pageblock
-		 * isolation could cause incorrect freepage or CMA accounting.
+		 * We want to prevent merge between freepages on pageblock
+		 * without fallbacks and normal pageblock. Without this,
+		 * pageblock isolation could cause incorrect freepage or CMA
+		 * accounting or HIGHATOMIC accounting.
 		 *
 		 * We don't want to hit this code for the more frequent
 		 * low-order merging.
 		 */
-		if (unlikely(has_isolate_pageblock(zone))) {
-			int buddy_mt;
+		int buddy_mt;

-			buddy_pfn = __find_buddy_pfn(pfn, order);
-			buddy = page + (buddy_pfn - pfn);
-			buddy_mt = get_pageblock_migratetype(buddy);
+		buddy_pfn = __find_buddy_pfn(pfn, order);
+		buddy = page + (buddy_pfn - pfn);
+		buddy_mt = get_pageblock_migratetype(buddy);

-			if (migratetype != buddy_mt
-					&& (is_migrate_isolate(migratetype) ||
-						is_migrate_isolate(buddy_mt)))
-				goto done_merging;
-		}
+		if (migratetype != buddy_mt
+				&& (!migratetype_is_mergeable(migratetype) ||
+					!migratetype_is_mergeable(buddy_mt)))
+			goto done_merging;
 		max_order = order + 1;
 		goto continue_merging;
 	}
···
 }
 #endif /* CONFIG_DEBUG_VM */

-static inline void prefetch_buddy(struct page *page)
-{
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
-	struct page *buddy = page + (buddy_pfn - pfn);
-
-	prefetch(buddy);
-}
-
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone.
  * count is the number of pages to free.
  */
 static void free_pcppages_bulk(struct zone *zone, int count,
-					struct per_cpu_pages *pcp)
+					struct per_cpu_pages *pcp,
+					int pindex)
 {
-	int pindex = 0;
-	int batch_free = 0;
-	int nr_freed = 0;
+	int min_pindex = 0;
+	int max_pindex = NR_PCP_LISTS - 1;
 	unsigned int order;
-	int prefetch_nr = READ_ONCE(pcp->batch);
 	bool isolated_pageblocks;
-	struct page *page, *tmp;
-	LIST_HEAD(head);
+	struct page *page;

 	/*
 	 * Ensure proper count is passed which otherwise would stuck in the
 	 * below while (list_empty(list)) loop.
 	 */
 	count = min(pcp->count, count);
-	while (count > 0) {
-		struct list_head *list;

-		/*
-		 * Remove pages from lists in a round-robin fashion. A
-		 * batch_free count is maintained that is incremented when an
-		 * empty list is encountered. This is so more pages are freed
-		 * off fuller lists instead of spinning excessively around empty
-		 * lists
-		 */
-		do {
-			batch_free++;
-			if (++pindex == NR_PCP_LISTS)
-				pindex = 0;
-			list = &pcp->lists[pindex];
-		} while (list_empty(list));
-
-		/* This is the only non-empty list. Free them all. */
-		if (batch_free == NR_PCP_LISTS)
-			batch_free = count;
-
-		order = pindex_to_order(pindex);
-		BUILD_BUG_ON(MAX_ORDER >= (1<<NR_PCP_ORDER_WIDTH));
-		do {
-			page = list_last_entry(list, struct page, lru);
-			/* must delete to avoid corrupting pcp list */
-			list_del(&page->lru);
-			nr_freed += 1 << order;
-			count -= 1 << order;
-
-			if (bulkfree_pcp_prepare(page))
-				continue;
-
-			/* Encode order with the migratetype */
-			page->index <<= NR_PCP_ORDER_WIDTH;
-			page->index |= order;
-
-			list_add_tail(&page->lru, &head);
-
-			/*
-			 * We are going to put the page back to the global
-			 * pool, prefetch its buddy to speed up later access
-			 * under zone->lock. It is believed the overhead of
-			 * an additional test and calculating buddy_pfn here
-			 * can be offset by reduced memory latency later. To
-			 * avoid excessive prefetching due to large count, only
-			 * prefetch buddy for the first pcp->batch nr of pages.
-			 */
-			if (prefetch_nr) {
-				prefetch_buddy(page);
-				prefetch_nr--;
-			}
-		} while (count > 0 && --batch_free && !list_empty(list));
-	}
-	pcp->count -= nr_freed;
+	/* Ensure requested pindex is drained first. */
+	pindex = pindex - 1;

 	/*
 	 * local_lock_irq held so equivalent to spin_lock_irqsave for
···
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);

-	/*
-	 * Use safe version since after __free_one_page(),
-	 * page->lru.next will not point to original list.
-	 */
-	list_for_each_entry_safe(page, tmp, &head, lru) {
-		int mt = get_pcppage_migratetype(page);
+	while (count > 0) {
+		struct list_head *list;
+		int nr_pages;

-		/* mt has been encoded with the order (see above) */
-		order = mt & NR_PCP_ORDER_MASK;
-		mt >>= NR_PCP_ORDER_WIDTH;
+		/* Remove pages from lists in a round-robin fashion. */
+		do {
+			if (++pindex > max_pindex)
+				pindex = min_pindex;
+			list = &pcp->lists[pindex];
+			if (!list_empty(list))
+				break;

-		/* MIGRATE_ISOLATE page should not go to pcplists */
-		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-		/* Pageblock could have been isolated meanwhile */
-		if (unlikely(isolated_pageblocks))
-			mt = get_pageblock_migratetype(page);
+			if (pindex == max_pindex)
+				max_pindex--;
+			if (pindex == min_pindex)
+				min_pindex++;
+		} while (1);

-		__free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE);
-		trace_mm_page_pcpu_drain(page, order, mt);
+		order = pindex_to_order(pindex);
+		nr_pages = 1 << order;
+		BUILD_BUG_ON(MAX_ORDER >= (1<<NR_PCP_ORDER_WIDTH));
+		do {
+			int mt;
+
+			page = list_last_entry(list, struct page, lru);
+			mt = get_pcppage_migratetype(page);
+
+			/* must delete to avoid corrupting pcp list */
+			list_del(&page->lru);
+			count -= nr_pages;
+			pcp->count -= nr_pages;
+
+			if (bulkfree_pcp_prepare(page))
+				continue;
+
+			/* MIGRATE_ISOLATE page should not go to pcplists */
1565 +
VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); 1566 + /* Pageblock could have been isolated meanwhile */ 1567 + if (unlikely(isolated_pageblocks)) 1568 + mt = get_pageblock_migratetype(page); 1569 + 1570 + __free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE); 1571 + trace_mm_page_pcpu_drain(page, order, mt); 1572 + } while (count > 0 && !list_empty(list)); 1482 1573 } 1574 + 1483 1575 spin_unlock(&zone->lock); 1484 1576 } 1485 1577 ··· 2220 2260 } while (++p, --i); 2221 2261 2222 2262 set_pageblock_migratetype(page, MIGRATE_CMA); 2223 - 2224 - if (pageblock_order >= MAX_ORDER) { 2225 - i = pageblock_nr_pages; 2226 - p = page; 2227 - do { 2228 - set_page_refcounted(p); 2229 - __free_pages(p, MAX_ORDER - 1); 2230 - p += MAX_ORDER_NR_PAGES; 2231 - } while (i -= MAX_ORDER_NR_PAGES); 2232 - } else { 2233 - set_page_refcounted(page); 2234 - __free_pages(page, pageblock_order); 2235 - } 2263 + set_page_refcounted(page); 2264 + __free_pages(page, pageblock_order); 2236 2265 2237 2266 adjust_managed_page_count(page, pageblock_nr_pages); 2238 2267 page_zone(page)->cma_pages += pageblock_nr_pages; ··· 2291 2342 return 1; 2292 2343 } 2293 2344 2294 - #ifdef CONFIG_DEBUG_VM 2295 - /* 2296 - * With DEBUG_VM enabled, order-0 pages are checked for expected state when 2297 - * being allocated from pcp lists. With debug_pagealloc also enabled, they are 2298 - * also checked when pcp lists are refilled from the free lists. 2299 - */ 2300 - static inline bool check_pcp_refill(struct page *page) 2301 - { 2302 - if (debug_pagealloc_enabled_static()) 2303 - return check_new_page(page); 2304 - else 2305 - return false; 2306 - } 2307 - 2308 - static inline bool check_new_pcp(struct page *page) 2309 - { 2310 - return check_new_page(page); 2311 - } 2312 - #else 2313 - /* 2314 - * With DEBUG_VM disabled, free order-0 pages are checked for expected state 2315 - * when pcp lists are being refilled from the free lists. 
With debug_pagealloc 2316 - * enabled, they are also checked when being allocated from the pcp lists. 2317 - */ 2318 - static inline bool check_pcp_refill(struct page *page) 2319 - { 2320 - return check_new_page(page); 2321 - } 2322 - static inline bool check_new_pcp(struct page *page) 2323 - { 2324 - if (debug_pagealloc_enabled_static()) 2325 - return check_new_page(page); 2326 - else 2327 - return false; 2328 - } 2329 - #endif /* CONFIG_DEBUG_VM */ 2330 - 2331 2345 static bool check_new_pages(struct page *page, unsigned int order) 2332 2346 { 2333 2347 int i; ··· 2303 2391 2304 2392 return false; 2305 2393 } 2394 + 2395 + #ifdef CONFIG_DEBUG_VM 2396 + /* 2397 + * With DEBUG_VM enabled, order-0 pages are checked for expected state when 2398 + * being allocated from pcp lists. With debug_pagealloc also enabled, they are 2399 + * also checked when pcp lists are refilled from the free lists. 2400 + */ 2401 + static inline bool check_pcp_refill(struct page *page, unsigned int order) 2402 + { 2403 + if (debug_pagealloc_enabled_static()) 2404 + return check_new_pages(page, order); 2405 + else 2406 + return false; 2407 + } 2408 + 2409 + static inline bool check_new_pcp(struct page *page, unsigned int order) 2410 + { 2411 + return check_new_pages(page, order); 2412 + } 2413 + #else 2414 + /* 2415 + * With DEBUG_VM disabled, free order-0 pages are checked for expected state 2416 + * when pcp lists are being refilled from the free lists. With debug_pagealloc 2417 + * enabled, they are also checked when being allocated from the pcp lists. 
2418 + */ 2419 + static inline bool check_pcp_refill(struct page *page, unsigned int order) 2420 + { 2421 + return check_new_pages(page, order); 2422 + } 2423 + static inline bool check_new_pcp(struct page *page, unsigned int order) 2424 + { 2425 + if (debug_pagealloc_enabled_static()) 2426 + return check_new_pages(page, order); 2427 + else 2428 + return false; 2429 + } 2430 + #endif /* CONFIG_DEBUG_VM */ 2306 2431 2307 2432 inline void post_alloc_hook(struct page *page, unsigned int order, 2308 2433 gfp_t gfp_flags) ··· 2428 2479 /* 2429 2480 * This array describes the order lists are fallen back to when 2430 2481 * the free lists for the desirable migrate type are depleted 2482 + * 2483 + * The other migratetypes do not have fallbacks. 2431 2484 */ 2432 2485 static int fallbacks[MIGRATE_TYPES][3] = { 2433 2486 [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES }, 2434 2487 [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES }, 2435 2488 [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES }, 2436 - #ifdef CONFIG_CMA 2437 - [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */ 2438 - #endif 2439 - #ifdef CONFIG_MEMORY_ISOLATION 2440 - [MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */ 2441 - #endif 2442 2489 }; 2443 2490 2444 2491 #ifdef CONFIG_CMA ··· 2740 2795 2741 2796 /* Yoink! 
*/ 2742 2797 mt = get_pageblock_migratetype(page); 2743 - if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt) 2744 - && !is_migrate_cma(mt)) { 2798 + /* Only reserve normal pageblocks (i.e., they can merge with others) */ 2799 + if (migratetype_is_mergeable(mt)) { 2745 2800 zone->nr_reserved_highatomic += pageblock_nr_pages; 2746 2801 set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); 2747 2802 move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); ··· 2982 3037 if (unlikely(page == NULL)) 2983 3038 break; 2984 3039 2985 - if (unlikely(check_pcp_refill(page))) 3040 + if (unlikely(check_pcp_refill(page, order))) 2986 3041 continue; 2987 3042 2988 3043 /* ··· 3031 3086 batch = READ_ONCE(pcp->batch); 3032 3087 to_drain = min(pcp->count, batch); 3033 3088 if (to_drain > 0) 3034 - free_pcppages_bulk(zone, to_drain, pcp); 3089 + free_pcppages_bulk(zone, to_drain, pcp, 0); 3035 3090 local_unlock_irqrestore(&pagesets.lock, flags); 3036 3091 } 3037 3092 #endif ··· 3052 3107 3053 3108 pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); 3054 3109 if (pcp->count) 3055 - free_pcppages_bulk(zone, pcp->count, pcp); 3110 + free_pcppages_bulk(zone, pcp->count, pcp, 0); 3056 3111 3057 3112 local_unlock_irqrestore(&pagesets.lock, flags); 3058 3113 } ··· 3275 3330 return true; 3276 3331 } 3277 3332 3278 - static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch) 3333 + static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch, 3334 + bool free_high) 3279 3335 { 3280 3336 int min_nr_free, max_nr_free; 3337 + 3338 + /* Free everything if batch freeing high-order pages. 
*/ 3339 + if (unlikely(free_high)) 3340 + return pcp->count; 3281 3341 3282 3342 /* Check for PCP disabled or boot pageset */ 3283 3343 if (unlikely(high < batch)) ··· 3304 3354 return batch; 3305 3355 } 3306 3356 3307 - static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone) 3357 + static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, 3358 + bool free_high) 3308 3359 { 3309 3360 int high = READ_ONCE(pcp->high); 3310 3361 3311 - if (unlikely(!high)) 3362 + if (unlikely(!high || free_high)) 3312 3363 return 0; 3313 3364 3314 3365 if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) ··· 3322 3371 return min(READ_ONCE(pcp->batch) << 2, high); 3323 3372 } 3324 3373 3325 - static void free_unref_page_commit(struct page *page, unsigned long pfn, 3326 - int migratetype, unsigned int order) 3374 + static void free_unref_page_commit(struct page *page, int migratetype, 3375 + unsigned int order) 3327 3376 { 3328 3377 struct zone *zone = page_zone(page); 3329 3378 struct per_cpu_pages *pcp; 3330 3379 int high; 3331 3380 int pindex; 3381 + bool free_high; 3332 3382 3333 3383 __count_vm_event(PGFREE); 3334 3384 pcp = this_cpu_ptr(zone->per_cpu_pageset); 3335 3385 pindex = order_to_pindex(migratetype, order); 3336 3386 list_add(&page->lru, &pcp->lists[pindex]); 3337 3387 pcp->count += 1 << order; 3338 - high = nr_pcp_high(pcp, zone); 3388 + 3389 + /* 3390 + * As high-order pages other than THP's stored on PCP can contribute 3391 + * to fragmentation, limit the number stored when PCP is heavily 3392 + * freeing without allocation. The remainder after bulk freeing 3393 + * stops will be drained from vmstat refresh context. 
3394 + */ 3395 + free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER); 3396 + 3397 + high = nr_pcp_high(pcp, zone, free_high); 3339 3398 if (pcp->count >= high) { 3340 3399 int batch = READ_ONCE(pcp->batch); 3341 3400 3342 - free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp); 3401 + free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex); 3343 3402 } 3344 3403 } 3345 3404 ··· 3382 3421 } 3383 3422 3384 3423 local_lock_irqsave(&pagesets.lock, flags); 3385 - free_unref_page_commit(page, pfn, migratetype, order); 3424 + free_unref_page_commit(page, migratetype, order); 3386 3425 local_unlock_irqrestore(&pagesets.lock, flags); 3387 3426 } 3388 3427 ··· 3392 3431 void free_unref_page_list(struct list_head *list) 3393 3432 { 3394 3433 struct page *page, *next; 3395 - unsigned long flags, pfn; 3434 + unsigned long flags; 3396 3435 int batch_count = 0; 3397 3436 int migratetype; 3398 3437 3399 3438 /* Prepare pages for freeing */ 3400 3439 list_for_each_entry_safe(page, next, list, lru) { 3401 - pfn = page_to_pfn(page); 3440 + unsigned long pfn = page_to_pfn(page); 3402 3441 if (!free_unref_page_prepare(page, pfn, 0)) { 3403 3442 list_del(&page->lru); 3404 3443 continue; ··· 3414 3453 free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE); 3415 3454 continue; 3416 3455 } 3417 - 3418 - set_page_private(page, pfn); 3419 3456 } 3420 3457 3421 3458 local_lock_irqsave(&pagesets.lock, flags); 3422 3459 list_for_each_entry_safe(page, next, list, lru) { 3423 - pfn = page_private(page); 3424 - set_page_private(page, 0); 3425 - 3426 3460 /* 3427 3461 * Non-isolated types over MIGRATE_PCPTYPES get added 3428 3462 * to the MIGRATE_MOVABLE pcp list. 
··· 3427 3471 migratetype = MIGRATE_MOVABLE; 3428 3472 3429 3473 trace_mm_page_free_batched(page); 3430 - free_unref_page_commit(page, pfn, migratetype, 0); 3474 + free_unref_page_commit(page, migratetype, 0); 3431 3475 3432 3476 /* 3433 3477 * Guard against excessive IRQ disabled times when we get ··· 3501 3545 struct page *endpage = page + (1 << order) - 1; 3502 3546 for (; page < endpage; page += pageblock_nr_pages) { 3503 3547 int mt = get_pageblock_migratetype(page); 3504 - if (!is_migrate_isolate(mt) && !is_migrate_cma(mt) 3505 - && !is_migrate_highatomic(mt)) 3548 + /* 3549 + * Only change normal pageblocks (i.e., they can merge 3550 + * with others) 3551 + */ 3552 + if (migratetype_is_mergeable(mt)) 3506 3553 set_pageblock_migratetype(page, 3507 3554 MIGRATE_MOVABLE); 3508 3555 } ··· 3600 3641 page = list_first_entry(list, struct page, lru); 3601 3642 list_del(&page->lru); 3602 3643 pcp->count -= 1 << order; 3603 - } while (check_new_pcp(page)); 3644 + } while (check_new_pcp(page, order)); 3604 3645 3605 3646 return page; 3606 3647 } ··· 3665 3706 * allocate greater than order-1 page units with __GFP_NOFAIL. 3666 3707 */ 3667 3708 WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); 3668 - spin_lock_irqsave(&zone->lock, flags); 3669 3709 3670 3710 do { 3671 3711 page = NULL; 3712 + spin_lock_irqsave(&zone->lock, flags); 3672 3713 /* 3673 3714 * order-0 request can reach here when the pcplist is skipped 3674 3715 * due to non-CMA allocation context. 
HIGHATOMIC area is ··· 3680 3721 if (page) 3681 3722 trace_mm_page_alloc_zone_locked(page, order, migratetype); 3682 3723 } 3683 - if (!page) 3724 + if (!page) { 3684 3725 page = __rmqueue(zone, order, migratetype, alloc_flags); 3685 - } while (page && check_new_pages(page, order)); 3686 - if (!page) 3687 - goto failed; 3688 - 3689 - __mod_zone_freepage_state(zone, -(1 << order), 3690 - get_pcppage_migratetype(page)); 3691 - spin_unlock_irqrestore(&zone->lock, flags); 3726 + if (!page) 3727 + goto failed; 3728 + } 3729 + __mod_zone_freepage_state(zone, -(1 << order), 3730 + get_pcppage_migratetype(page)); 3731 + spin_unlock_irqrestore(&zone->lock, flags); 3732 + } while (check_new_pages(page, order)); 3692 3733 3693 3734 __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); 3694 3735 zone_statistics(preferred_zone, zone, 1); ··· 4554 4595 const struct alloc_context *ac) 4555 4596 { 4556 4597 unsigned int noreclaim_flag; 4557 - unsigned long pflags, progress; 4598 + unsigned long progress; 4558 4599 4559 4600 cond_resched(); 4560 4601 4561 4602 /* We now go into synchronous reclaim */ 4562 4603 cpuset_memory_pressure_bump(); 4563 - psi_memstall_enter(&pflags); 4564 4604 fs_reclaim_acquire(gfp_mask); 4565 4605 noreclaim_flag = memalloc_noreclaim_save(); 4566 4606 ··· 4568 4610 4569 4611 memalloc_noreclaim_restore(noreclaim_flag); 4570 4612 fs_reclaim_release(gfp_mask); 4571 - psi_memstall_leave(&pflags); 4572 4613 4573 4614 cond_resched(); 4574 4615 ··· 4581 4624 unsigned long *did_some_progress) 4582 4625 { 4583 4626 struct page *page = NULL; 4627 + unsigned long pflags; 4584 4628 bool drained = false; 4585 4629 4630 + psi_memstall_enter(&pflags); 4586 4631 *did_some_progress = __perform_reclaim(gfp_mask, order, ac); 4587 4632 if (unlikely(!(*did_some_progress))) 4588 - return NULL; 4633 + goto out; 4589 4634 4590 4635 retry: 4591 4636 page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); ··· 4603 4644 drained = true; 4604 4645 goto retry; 4605 
4646 } 4647 + out: 4648 + psi_memstall_leave(&pflags); 4606 4649 4607 4650 return page; 4608 4651 } ··· 6341 6380 #define BOOT_PAGESET_BATCH 1 6342 6381 static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset); 6343 6382 static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats); 6344 - static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); 6383 + DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); 6345 6384 6346 6385 static void __build_all_zonelists(void *data) 6347 6386 { ··· 6363 6402 if (self && !node_online(self->node_id)) { 6364 6403 build_zonelists(self); 6365 6404 } else { 6366 - for_each_online_node(nid) { 6405 + /* 6406 + * All possible nodes have pgdat preallocated 6407 + * in free_area_init 6408 + */ 6409 + for_each_node(nid) { 6367 6410 pg_data_t *pgdat = NODE_DATA(nid); 6368 6411 6369 6412 build_zonelists(pgdat); ··· 7354 7389 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ 7355 7390 void __init set_pageblock_order(void) 7356 7391 { 7357 - unsigned int order; 7392 + unsigned int order = MAX_ORDER - 1; 7358 7393 7359 7394 /* Check that pageblock_nr_pages has not already been setup */ 7360 7395 if (pageblock_order) 7361 7396 return; 7362 7397 7363 - if (HPAGE_SHIFT > PAGE_SHIFT) 7398 + /* Don't let pageblocks exceed the maximum allocation granularity. */ 7399 + if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < order) 7364 7400 order = HUGETLB_PAGE_ORDER; 7365 - else 7366 - order = MAX_ORDER - 1; 7367 7401 7368 7402 /* 7369 7403 * Assume the largest contiguous order of interest is a huge page. 
··· 7466 7502 * NOTE: this function is only called during memory hotplug 7467 7503 */ 7468 7504 #ifdef CONFIG_MEMORY_HOTPLUG 7469 - void __ref free_area_init_core_hotplug(int nid) 7505 + void __ref free_area_init_core_hotplug(struct pglist_data *pgdat) 7470 7506 { 7507 + int nid = pgdat->node_id; 7471 7508 enum zone_type z; 7472 - pg_data_t *pgdat = NODE_DATA(nid); 7509 + int cpu; 7473 7510 7474 7511 pgdat_init_internals(pgdat); 7512 + 7513 + if (pgdat->per_cpu_nodestats == &boot_nodestats) 7514 + pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat); 7515 + 7516 + /* 7517 + * Reset the nr_zones, order and highest_zoneidx before reuse. 7518 + * Note that kswapd will init kswapd_highest_zoneidx properly 7519 + * when it starts in the near future. 7520 + */ 7521 + pgdat->nr_zones = 0; 7522 + pgdat->kswapd_order = 0; 7523 + pgdat->kswapd_highest_zoneidx = 0; 7524 + pgdat->node_start_pfn = 0; 7525 + for_each_online_cpu(cpu) { 7526 + struct per_cpu_nodestat *p; 7527 + 7528 + p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu); 7529 + memset(p, 0, sizeof(*p)); 7530 + } 7531 + 7475 7532 for (z = 0; z < MAX_NR_ZONES; z++) 7476 7533 zone_init_internals(&pgdat->node_zones[z], z, nid, 0); 7477 7534 } ··· 7642 7657 pgdat->node_start_pfn = start_pfn; 7643 7658 pgdat->per_cpu_nodestats = NULL; 7644 7659 7645 - pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid, 7646 - (u64)start_pfn << PAGE_SHIFT, 7647 - end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0); 7660 + if (start_pfn != end_pfn) { 7661 + pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid, 7662 + (u64)start_pfn << PAGE_SHIFT, 7663 + end_pfn ? 
((u64)end_pfn << PAGE_SHIFT) - 1 : 0); 7664 + } else { 7665 + pr_info("Initmem setup node %d as memoryless\n", nid); 7666 + } 7667 + 7648 7668 calculate_node_totalpages(pgdat, start_pfn, end_pfn); 7649 7669 7650 7670 alloc_node_mem_map(pgdat); ··· 7658 7668 free_area_init_core(pgdat); 7659 7669 } 7660 7670 7661 - void __init free_area_init_memoryless_node(int nid) 7671 + static void __init free_area_init_memoryless_node(int nid) 7662 7672 { 7663 7673 free_area_init_node(nid); 7664 7674 } ··· 7962 7972 7963 7973 out2: 7964 7974 /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ 7965 - for (nid = 0; nid < MAX_NUMNODES; nid++) 7975 + for (nid = 0; nid < MAX_NUMNODES; nid++) { 7976 + unsigned long start_pfn, end_pfn; 7977 + 7966 7978 zone_movable_pfn[nid] = 7967 7979 roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); 7980 + 7981 + get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); 7982 + if (zone_movable_pfn[nid] >= end_pfn) 7983 + zone_movable_pfn[nid] = 0; 7984 + } 7968 7985 7969 7986 out: 7970 7987 /* restore the node_state */ ··· 8093 8096 /* Initialise every node */ 8094 8097 mminit_verify_pageflags_layout(); 8095 8098 setup_nr_node_ids(); 8096 - for_each_online_node(nid) { 8097 - pg_data_t *pgdat = NODE_DATA(nid); 8099 + for_each_node(nid) { 8100 + pg_data_t *pgdat; 8101 + 8102 + if (!node_online(nid)) { 8103 + pr_info("Initializing node %d as memoryless\n", nid); 8104 + 8105 + /* Allocator not initialized yet */ 8106 + pgdat = arch_alloc_nodedata(nid); 8107 + if (!pgdat) { 8108 + pr_err("Cannot allocate %zuB for node %d.\n", 8109 + sizeof(*pgdat), nid); 8110 + continue; 8111 + } 8112 + arch_refresh_nodedata(nid, pgdat); 8113 + free_area_init_memoryless_node(nid); 8114 + 8115 + /* 8116 + * We do not want to confuse userspace by sysfs 8117 + * files/directories for node without any memory 8118 + * attached to it, so this node is not marked as 8119 + * N_MEMORY and not marked online so that no sysfs 8120 + * hierarchy will be created via 
register_one_node for 8121 + * it. The pgdat will get fully initialized by 8122 + * hotadd_init_pgdat() when memory is hotplugged into 8123 + * this node. 8124 + */ 8125 + continue; 8126 + } 8127 + 8128 + pgdat = NODE_DATA(nid); 8098 8129 free_area_init_node(nid); 8099 8130 8100 8131 /* Any memory on that node */ ··· 8499 8474 8500 8475 zone->watermark_boost = 0; 8501 8476 zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; 8502 - zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; 8477 + zone->_watermark[WMARK_HIGH] = low_wmark_pages(zone) + tmp; 8478 + zone->_watermark[WMARK_PROMO] = high_wmark_pages(zone) + tmp; 8503 8479 8504 8480 spin_unlock_irqrestore(&zone->lock, flags); 8505 8481 } ··· 9012 8986 #ifdef CONFIG_CONTIG_ALLOC 9013 8987 static unsigned long pfn_max_align_down(unsigned long pfn) 9014 8988 { 9015 - return pfn & ~(max_t(unsigned long, MAX_ORDER_NR_PAGES, 9016 - pageblock_nr_pages) - 1); 8989 + return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES); 9017 8990 } 9018 8991 9019 8992 static unsigned long pfn_max_align_up(unsigned long pfn) 9020 8993 { 9021 - return ALIGN(pfn, max_t(unsigned long, MAX_ORDER_NR_PAGES, 9022 - pageblock_nr_pages)); 8994 + return ALIGN(pfn, MAX_ORDER_NR_PAGES); 9023 8995 } 9024 8996 9025 8997 #if defined(CONFIG_DYNAMIC_DEBUG) || \ ··· 9476 9452 9477 9453 return order < MAX_ORDER; 9478 9454 } 9455 + EXPORT_SYMBOL(is_free_buddy_page); 9479 9456 9480 9457 #ifdef CONFIG_MEMORY_FAILURE 9481 9458 /*
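The reworked list selection in free_pcppages_bulk() above can be sketched in userspace as follows. This is a simplified model, not the kernel code: each PCP list is reduced to a page counter, every page is treated as order-0, and model_drain() is a hypothetical name introduced only for illustration. It shows how starting one slot before the requested pindex drains that list first, and how min_pindex and max_pindex shrink so that empty lists at either end are not rescanned.

```c
/*
 * Userspace model of the list-selection loop in the reworked
 * free_pcppages_bulk(). Assumptions: lists are plain counters and all
 * pages are order-0. model_drain() is not a kernel function.
 * Returns the number of pages "freed".
 */
static int model_drain(int *lists, int nr_lists, int count, int pindex)
{
	int min_pindex = 0;
	int max_pindex = nr_lists - 1;
	int total = 0;
	int freed = 0;
	int i;

	for (i = 0; i < nr_lists; i++)
		total += lists[i];

	/* Never ask for more than is stored, or the scan below cannot end. */
	if (count > total)
		count = total;

	/* Step back one slot so the requested list is visited first. */
	pindex = pindex - 1;

	while (count > 0) {
		/* Find the next non-empty list, shrinking the scan window. */
		for (;;) {
			if (++pindex > max_pindex)
				pindex = min_pindex;
			if (lists[pindex] > 0)
				break;
			if (pindex == max_pindex)
				max_pindex--;
			if (pindex == min_pindex)
				min_pindex++;
		}

		/* Drain the chosen list until it is empty or count is met. */
		do {
			lists[pindex]--;
			count--;
			freed++;
		} while (count > 0 && lists[pindex] > 0);
	}

	return freed;
}
```

With lists holding {2, 0, 3, 1} pages, a request for 4 pages starting at pindex 2 empties list 2 and then list 3, leaving list 0 untouched.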

+5 -2

mm/page_io.c

··· 359 359 struct bio *bio; 360 360 int ret = 0; 361 361 struct swap_info_struct *sis = page_swap_info(page); 362 + bool workingset = PageWorkingset(page); 362 363 unsigned long pflags; 363 364 364 365 VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page); ··· 371 370 * or the submitting cgroup IO-throttled, submission can be a 372 371 * significant part of overall IO time. 373 372 */ 374 - psi_memstall_enter(&pflags); 373 + if (workingset) 374 + psi_memstall_enter(&pflags); 375 375 delayacct_swapin_start(); 376 376 377 377 if (frontswap_load(page) == 0) { ··· 433 431 bio_put(bio); 434 432 435 433 out: 436 - psi_memstall_leave(&pflags); 434 + if (workingset) 435 + psi_memstall_leave(&pflags); 437 436 delayacct_swapin_end(); 438 437 return ret; 439 438 }

+1 -9

mm/page_table_check.c

··· 23 23 24 24 static int __init early_page_table_check_param(char *buf) 25 25 { 26 - if (!buf) 27 - return -EINVAL; 28 - 29 - if (strcmp(buf, "on") == 0) 30 - __page_table_check_enabled = true; 31 - else if (strcmp(buf, "off") == 0) 32 - __page_table_check_enabled = false; 33 - 34 - return 0; 26 + return strtobool(buf, &__page_table_check_enabled); 35 27 } 36 28 37 29 early_param("page_table_check", early_page_table_check_param);
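The switch to strtobool() changes behaviour for malformed input: the removed strcmp() chain returned 0 for any non-NULL string, while strtobool() reports an error. A minimal userspace sketch of the parsing rules, assuming they match what strtobool() accepted around the time of this merge (a leading y/Y/1 or n/N/0, or "on"/"off" decided by the second character), could look like the following. model_strtobool() is a hypothetical name used only here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace approximation of strtobool()-style parsing. This is an
 * assumption-labelled sketch, not a copy of the kernel helper.
 */
static int model_strtobool(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" and "off" are told apart by the second character. */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		default:
			break;
		}
		break;
	default:
		break;
	}

	/* Unlike the removed strcmp() chain, bad input is reported. */
	return -EINVAL;
}
```

Under this model, page_table_check=on and page_table_check=1 both enable the check, and an unrecognised value yields -EINVAL instead of being silently ignored.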

+12 -4

mm/ptdump.c

··· 40 40 if (st->effective_prot) 41 41 st->effective_prot(st, 0, pgd_val(val)); 42 42 43 - if (pgd_leaf(val)) 43 + if (pgd_leaf(val)) { 44 44 st->note_page(st, addr, 0, pgd_val(val)); 45 + walk->action = ACTION_CONTINUE; 46 + } 45 47 46 48 return 0; 47 49 } ··· 63 61 if (st->effective_prot) 64 62 st->effective_prot(st, 1, p4d_val(val)); 65 63 66 - if (p4d_leaf(val)) 64 + if (p4d_leaf(val)) { 67 65 st->note_page(st, addr, 1, p4d_val(val)); 66 + walk->action = ACTION_CONTINUE; 67 + } 68 68 69 69 return 0; 70 70 } ··· 86 82 if (st->effective_prot) 87 83 st->effective_prot(st, 2, pud_val(val)); 88 84 89 - if (pud_leaf(val)) 85 + if (pud_leaf(val)) { 90 86 st->note_page(st, addr, 2, pud_val(val)); 87 + walk->action = ACTION_CONTINUE; 88 + } 91 89 92 90 return 0; 93 91 } ··· 107 101 108 102 if (st->effective_prot) 109 103 st->effective_prot(st, 3, pmd_val(val)); 110 - if (pmd_leaf(val)) 104 + if (pmd_leaf(val)) { 111 105 st->note_page(st, addr, 3, pmd_val(val)); 106 + walk->action = ACTION_CONTINUE; 107 + } 112 108 113 109 return 0; 114 110 }
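The effect of setting walk->action = ACTION_CONTINUE for leaf entries can be illustrated with a toy walker. The sketch below is a userspace model under stated assumptions: the walker resets the action to "descend" before each callback and skips the subtree when the callback asks to continue. All model_* names are hypothetical and do not exist in the kernel.

```c
enum model_action {
	MODEL_ACTION_SUBTREE,   /* descend into the next level (default) */
	MODEL_ACTION_CONTINUE,  /* skip the subtree, move to the next entry */
};

struct model_entry {
	int leaf;       /* non-zero if the entry maps memory directly */
	int children;   /* lower-level entries reachable below it */
};

/* Callback in the style of ptdump_pmd_entry(): note leaves, then skip. */
static void model_entry_cb(const struct model_entry *e,
			   enum model_action *action, int *noted)
{
	if (e->leaf) {
		(*noted)++;
		*action = MODEL_ACTION_CONTINUE;
	}
}

/* Returns how many lower-level entries the walker descended into. */
static int model_walk(const struct model_entry *tbl, int n, int *noted)
{
	int descended = 0;
	int i;

	for (i = 0; i < n; i++) {
		enum model_action action = MODEL_ACTION_SUBTREE;

		model_entry_cb(&tbl[i], &action, noted);
		if (action == MODEL_ACTION_CONTINUE)
			continue;
		descended += tbl[i].children;
	}
	return descended;
}
```

In this model a leaf entry is noted once and the walker never treats it as a pointer to a lower-level table, which appears to be the intent of the hunks above.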

+115 -7

mm/readahead.c

··· 8 8 * Initial version. 9 9 */ 10 10 11 + /** 12 + * DOC: Readahead Overview 13 + * 14 + * Readahead is used to read content into the page cache before it is 15 + * explicitly requested by the application. Readahead only ever 16 + * attempts to read pages that are not yet in the page cache. If a 17 + * page is present but not up-to-date, readahead will not try to read 18 + * it. In that case a simple ->readpage() will be requested. 19 + * 20 + * Readahead is triggered when an application read request (whether a 21 + * systemcall or a page fault) finds that the requested page is not in 22 + * the page cache, or that it is in the page cache and has the 23 + * %PG_readahead flag set. This flag indicates that the page was loaded 24 + * as part of a previous read-ahead request and now that it has been 25 + * accessed, it is time for the next read-ahead. 26 + * 27 + * Each readahead request is partly synchronous read, and partly async 28 + * read-ahead. This is reflected in the struct file_ra_state which 29 + * contains ->size being to total number of pages, and ->async_size 30 + * which is the number of pages in the async section. The first page in 31 + * this async section will have %PG_readahead set as a trigger for a 32 + * subsequent read ahead. Once a series of sequential reads has been 33 + * established, there should be no need for a synchronous component and 34 + * all read ahead request will be fully asynchronous. 35 + * 36 + * When either of the triggers causes a readahead, three numbers need to 37 + * be determined: the start of the region, the size of the region, and 38 + * the size of the async tail. 39 + * 40 + * The start of the region is simply the first page address at or after 41 + * the accessed address, which is not currently populated in the page 42 + * cache. This is found with a simple search in the page cache. 
43 + * 44 + * The size of the async tail is determined by subtracting the size that 45 + * was explicitly requested from the determined request size, unless 46 + * this would be less than zero - then zero is used. NOTE THIS 47 + * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED 48 + * PAGE. 49 + * 50 + * The size of the region is normally determined from the size of the 51 + * previous readahead which loaded the preceding pages. This may be 52 + * discovered from the struct file_ra_state for simple sequential reads, 53 + * or from examining the state of the page cache when multiple 54 + * sequential reads are interleaved. Specifically: where the readahead 55 + * was triggered by the %PG_readahead flag, the size of the previous 56 + * readahead is assumed to be the number of pages from the triggering 57 + * page to the start of the new readahead. In these cases, the size of 58 + * the previous readahead is scaled, often doubled, for the new 59 + * readahead, though see get_next_ra_size() for details. 60 + * 61 + * If the size of the previous read cannot be determined, the number of 62 + * preceding pages in the page cache is used to estimate the size of 63 + * a previous read. This estimate could easily be misled by random 64 + * reads being coincidentally adjacent, so it is ignored unless it is 65 + * larger than the current request, and it is not scaled up, unless it 66 + * is at the start of file. 67 + * 68 + * In general read ahead is accelerated at the start of the file, as 69 + * reads from there are often sequential. There are other minor 70 + * adjustments to the read ahead size in various special cases and these 71 + * are best discovered by reading the code. 72 + * 73 + * The above calculation determines the readahead, to which any requested 74 + * read size may be added. 
75 + * 76 + * Readahead requests are sent to the filesystem using the ->readahead() 77 + * address space operation, for which mpage_readahead() is a canonical 78 + * implementation. ->readahead() should normally initiate reads on all 79 + * pages, but may fail to read any or all pages without causing an IO 80 + * error. The page cache reading code will issue a ->readpage() request 81 + * for any page which ->readahead() does not provided, and only an error 82 + * from this will be final. 83 + * 84 + * ->readahead() will generally call readahead_page() repeatedly to get 85 + * each page from those prepared for read ahead. It may fail to read a 86 + * page by: 87 + * 88 + * * not calling readahead_page() sufficiently many times, effectively 89 + * ignoring some pages, as might be appropriate if the path to 90 + * storage is congested. 91 + * 92 + * * failing to actually submit a read request for a given page, 93 + * possibly due to insufficient resources, or 94 + * 95 + * * getting an error during subsequent processing of a request. 96 + * 97 + * In the last two cases, the page should be unlocked to indicate that 98 + * the read attempt has failed. In the first case the page will be 99 + * unlocked by the caller. 100 + * 101 + * Those pages not in the final ``async_size`` of the request should be 102 + * considered to be important and ->readahead() should not fail them due 103 + * to congestion or temporary resource unavailability, but should wait 104 + * for necessary resources (e.g. memory or indexing information) to 105 + * become available. Pages in the final ``async_size`` may be 106 + * considered less urgent and failure to read them is more acceptable. 107 + * In this case it is best to use delete_from_page_cache() to remove the 108 + * pages from the page cache as is automatically done for pages that 109 + * were not fetched with readahead_page(). This will allow a 110 + * subsequent synchronous read ahead request to try them again. 
If they 111 + * are left in the page cache, then they will be read individually using 112 + * ->readpage(). 113 + * 114 + */ 115 + 11 116 #include <linux/kernel.h> 12 117 #include <linux/dax.h> 13 118 #include <linux/gfp.h> ··· 232 127 233 128 if (aops->readahead) { 234 129 aops->readahead(rac); 235 - /* Clean up the remaining pages */ 130 + /* 131 + * Clean up the remaining pages. The sizes in ->ra 132 + * maybe be used to size next read-ahead, so make sure 133 + * they accurately reflect what happened. 134 + */ 236 135 while ((page = readahead_page(rac))) { 136 + rac->ra->size -= 1; 137 + if (rac->ra->async_size > 0) { 138 + rac->ra->async_size -= 1; 139 + delete_from_page_cache(page); 140 + } 237 141 unlock_page(page); 238 142 put_page(page); 239 143 } ··· 708 594 return; 709 595 710 596 folio_clear_readahead(folio); 711 - 712 - /* 713 - * Defer asynchronous read-ahead on IO congestion. 714 - */ 715 - if (inode_read_congested(ractl->mapping->host)) 716 - return; 717 597 718 598 if (blk_cgroup_congested()) 719 599 return;
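The bookkeeping done by the new cleanup loop in this hunk can be sketched in userspace. This is only an illustration under stated assumptions: `struct ra_state` and `account_unread_pages()` are hypothetical names invented for the example, standing in for the `size` and `async_size` fields of `struct file_ra_state`. The real loop additionally removes each async-tail page from the page cache and unlocks it.

```c
#include <assert.h>

/* Hypothetical stand-in for the two fields of struct file_ra_state used here. */
struct ra_state {
	unsigned int size;       /* pages in the readahead window */
	unsigned int async_size; /* pages in the async tail of the window */
};

/*
 * Account for 'unread' pages that ->readahead() left unconsumed.
 * Each such page shrinks the window by one; while an async tail
 * remains, the tail shrinks as well. Returns how many pages were
 * counted against the async tail, i.e. how many the kernel loop
 * would also delete from the page cache.
 */
static unsigned int account_unread_pages(struct ra_state *ra, unsigned int unread)
{
	unsigned int dropped = 0;

	assert(unread <= ra->size);
	while (unread--) {
		ra->size -= 1;
		if (ra->async_size > 0) {
			ra->async_size -= 1;
			dropped++;
		}
	}
	return dropped;
}
```

With a window of 8 pages and an async tail of 2, leaving 3 pages unread shrinks the window to 5 and empties the tail, so the sizes used to plan the next readahead match what was actually submitted.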

+13 -2

mm/rmap.c

···
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
+
+		/*
+		 * It is racy to ClearPageDoubleMap in page_remove_file_rmap();
+		 * but page lock is held by all page_add_file_rmap() compound
+		 * callers, and SetPageDoubleMap below warns if !PageLocked:
+		 * so here is a place that DoubleMap can be safely cleared.
+		 */
+		VM_WARN_ON_ONCE(!PageLocked(page));
+		if (nr == nr_pages && PageDoubleMap(page))
+			ClearPageDoubleMap(page);
+
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						nr_pages);
···
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);

-		if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (PageHuge(page)) {
 				hugetlb_count_sub(compound_nr(page), mm);
···
 			 * memory are supported.
 			 */
 			subpage = page;
-		} else if (PageHWPoison(page)) {
+		} else if (PageHWPoison(subpage)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (PageHuge(page)) {
 				hugetlb_count_sub(compound_nr(page), mm);

+25 -21

mm/shmem.c

··· 476 476 { 477 477 loff_t i_size; 478 478 479 + if (!S_ISREG(inode->i_mode)) 480 + return false; 479 481 if (shmem_huge == SHMEM_HUGE_DENY) 480 482 return false; 481 483 if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) || ··· 1063 1061 if (shmem_is_huge(NULL, inode, 0)) 1064 1062 stat->blksize = HPAGE_PMD_SIZE; 1065 1063 1064 + if (request_mask & STATX_BTIME) { 1065 + stat->result_mask |= STATX_BTIME; 1066 + stat->btime.tv_sec = info->i_crtime.tv_sec; 1067 + stat->btime.tv_nsec = info->i_crtime.tv_nsec; 1068 + } 1069 + 1066 1070 return 0; 1067 1071 } 1068 1072 ··· 1129 1121 if (shmem_mapping(inode->i_mapping)) { 1130 1122 shmem_unacct_size(info->flags, inode->i_size); 1131 1123 inode->i_size = 0; 1124 + mapping_set_exiting(inode->i_mapping); 1132 1125 shmem_truncate_range(inode, 0, (loff_t)-1); 1133 1126 if (!list_empty(&info->shrinklist)) { 1134 1127 spin_lock(&sbinfo->shrinklist_lock); ··· 1863 1854 return 0; 1864 1855 } 1865 1856 1866 - /* Never use a huge page for shmem_symlink() */ 1867 - if (S_ISLNK(inode->i_mode)) 1868 - goto alloc_nohuge; 1869 1857 if (!shmem_is_huge(vma, inode, index)) 1870 1858 goto alloc_nohuge; 1871 1859 ··· 2271 2265 atomic_set(&info->stop_eviction, 0); 2272 2266 info->seals = F_SEAL_SEAL; 2273 2267 info->flags = flags & VM_NORESERVE; 2268 + info->i_crtime = inode->i_mtime; 2274 2269 INIT_LIST_HEAD(&info->shrinklist); 2275 2270 INIT_LIST_HEAD(&info->swaplist); 2276 2271 simple_xattrs_init(&info->xattrs); ··· 2364 2357 /* don't free the page */ 2365 2358 goto out_unacct_blocks; 2366 2359 } 2360 + 2361 + flush_dcache_page(page); 2367 2362 } else { /* ZEROPAGE */ 2368 - clear_highpage(page); 2363 + clear_user_highpage(page, dst_addr); 2369 2364 } 2370 2365 } else { 2371 2366 page = *pagep; ··· 2501 2492 struct address_space *mapping = inode->i_mapping; 2502 2493 pgoff_t index; 2503 2494 unsigned long offset; 2504 - enum sgp_type sgp = SGP_READ; 2505 2495 int error = 0; 2506 2496 ssize_t retval = 0; 2507 2497 loff_t *ppos = &iocb->ki_pos; 
2508 - 2509 - /* 2510 - * Might this read be for a stacking filesystem? Then when reading 2511 - * holes of a sparse file, we actually need to allocate those pages, 2512 - * and even mark them dirty, so it cannot exceed the max_blocks limit. 2513 - */ 2514 - if (!iter_is_iovec(to)) 2515 - sgp = SGP_CACHE; 2516 2498 2517 2499 index = *ppos >> PAGE_SHIFT; 2518 2500 offset = *ppos & ~PAGE_MASK; ··· 2513 2513 pgoff_t end_index; 2514 2514 unsigned long nr, ret; 2515 2515 loff_t i_size = i_size_read(inode); 2516 + bool got_page; 2516 2517 2517 2518 end_index = i_size >> PAGE_SHIFT; 2518 2519 if (index > end_index) ··· 2524 2523 break; 2525 2524 } 2526 2525 2527 - error = shmem_getpage(inode, index, &page, sgp); 2526 + error = shmem_getpage(inode, index, &page, SGP_READ); 2528 2527 if (error) { 2529 2528 if (error == -EINVAL) 2530 2529 error = 0; 2531 2530 break; 2532 2531 } 2533 2532 if (page) { 2534 - if (sgp == SGP_CACHE) 2535 - set_page_dirty(page); 2536 2533 unlock_page(page); 2537 2534 2538 2535 if (PageHWPoison(page)) { ··· 2570 2571 */ 2571 2572 if (!offset) 2572 2573 mark_page_accessed(page); 2574 + got_page = true; 2573 2575 } else { 2574 2576 page = ZERO_PAGE(0); 2575 - get_page(page); 2577 + got_page = false; 2576 2578 } 2577 2579 2578 2580 /* ··· 2586 2586 index += offset >> PAGE_SHIFT; 2587 2587 offset &= ~PAGE_MASK; 2588 2588 2589 - put_page(page); 2589 + if (got_page) 2590 + put_page(page); 2590 2591 if (!iov_iter_count(to)) 2591 2592 break; 2592 2593 if (ret < nr) { ··· 3197 3196 #endif /* CONFIG_TMPFS_XATTR */ 3198 3197 3199 3198 static const struct inode_operations shmem_short_symlink_operations = { 3199 + .getattr = shmem_getattr, 3200 3200 .get_link = simple_get_link, 3201 3201 #ifdef CONFIG_TMPFS_XATTR 3202 3202 .listxattr = shmem_listxattr, ··· 3205 3203 }; 3206 3204 3207 3205 static const struct inode_operations shmem_symlink_inode_operations = { 3206 + .getattr = shmem_getattr, 3208 3207 .get_link = shmem_get_link, 3209 3208 #ifdef 
CONFIG_TMPFS_XATTR 3210 3209 .listxattr = shmem_listxattr, ··· 3710 3707 static struct inode *shmem_alloc_inode(struct super_block *sb) 3711 3708 { 3712 3709 struct shmem_inode_info *info; 3713 - info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL); 3710 + info = alloc_inode_sb(sb, shmem_inode_cachep, GFP_KERNEL); 3714 3711 if (!info) 3715 3712 return NULL; 3716 3713 return &info->vfs_inode; ··· 3793 3790 3794 3791 static const struct inode_operations shmem_dir_inode_operations = { 3795 3792 #ifdef CONFIG_TMPFS 3793 + .getattr = shmem_getattr, 3796 3794 .create = shmem_create, 3797 3795 .lookup = simple_lookup, 3798 3796 .link = shmem_link, ··· 3815 3811 }; 3816 3812 3817 3813 static const struct inode_operations shmem_special_inode_operations = { 3814 + .getattr = shmem_getattr, 3818 3815 #ifdef CONFIG_TMPFS_XATTR 3819 3816 .listxattr = shmem_listxattr, 3820 3817 #endif ··· 3967 3962 return count; 3968 3963 } 3969 3964 3970 - struct kobj_attribute shmem_enabled_attr = 3971 - __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store); 3965 + struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled); 3972 3966 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */ 3973 3967 3974 3968 #else /* !CONFIG_SHMEM */

+27 -12

mm/slab.c

··· 3211 3211 bool init = false; 3212 3212 3213 3213 flags &= gfp_allowed_mask; 3214 - cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); 3214 + cachep = slab_pre_alloc_hook(cachep, NULL, &objcg, 1, flags); 3215 3215 if (unlikely(!cachep)) 3216 3216 return NULL; 3217 3217 ··· 3287 3287 #endif /* CONFIG_NUMA */ 3288 3288 3289 3289 static __always_inline void * 3290 - slab_alloc(struct kmem_cache *cachep, gfp_t flags, size_t orig_size, unsigned long caller) 3290 + slab_alloc(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags, 3291 + size_t orig_size, unsigned long caller) 3291 3292 { 3292 3293 unsigned long save_flags; 3293 3294 void *objp; ··· 3296 3295 bool init = false; 3297 3296 3298 3297 flags &= gfp_allowed_mask; 3299 - cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); 3298 + cachep = slab_pre_alloc_hook(cachep, lru, &objcg, 1, flags); 3300 3299 if (unlikely(!cachep)) 3301 3300 return NULL; 3302 3301 ··· 3485 3484 __free_one(ac, objp); 3486 3485 } 3487 3486 3487 + static __always_inline 3488 + void *__kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru, 3489 + gfp_t flags) 3490 + { 3491 + void *ret = slab_alloc(cachep, lru, flags, cachep->object_size, _RET_IP_); 3492 + 3493 + trace_kmem_cache_alloc(_RET_IP_, ret, 3494 + cachep->object_size, cachep->size, flags); 3495 + 3496 + return ret; 3497 + } 3498 + 3488 3499 /** 3489 3500 * kmem_cache_alloc - Allocate an object 3490 3501 * @cachep: The cache to allocate from. 
··· 3509 3496 */ 3510 3497 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) 3511 3498 { 3512 - void *ret = slab_alloc(cachep, flags, cachep->object_size, _RET_IP_); 3513 - 3514 - trace_kmem_cache_alloc(_RET_IP_, ret, 3515 - cachep->object_size, cachep->size, flags); 3516 - 3517 - return ret; 3499 + return __kmem_cache_alloc_lru(cachep, NULL, flags); 3518 3500 } 3519 3501 EXPORT_SYMBOL(kmem_cache_alloc); 3502 + 3503 + void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru, 3504 + gfp_t flags) 3505 + { 3506 + return __kmem_cache_alloc_lru(cachep, lru, flags); 3507 + } 3508 + EXPORT_SYMBOL(kmem_cache_alloc_lru); 3520 3509 3521 3510 static __always_inline void 3522 3511 cache_alloc_debugcheck_after_bulk(struct kmem_cache *s, gfp_t flags, ··· 3536 3521 size_t i; 3537 3522 struct obj_cgroup *objcg = NULL; 3538 3523 3539 - s = slab_pre_alloc_hook(s, &objcg, size, flags); 3524 + s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags); 3540 3525 if (!s) 3541 3526 return 0; 3542 3527 ··· 3577 3562 { 3578 3563 void *ret; 3579 3564 3580 - ret = slab_alloc(cachep, flags, size, _RET_IP_); 3565 + ret = slab_alloc(cachep, NULL, flags, size, _RET_IP_); 3581 3566 3582 3567 ret = kasan_kmalloc(cachep, ret, size, flags); 3583 3568 trace_kmalloc(_RET_IP_, ret, ··· 3704 3689 cachep = kmalloc_slab(size, flags); 3705 3690 if (unlikely(ZERO_OR_NULL_PTR(cachep))) 3706 3691 return cachep; 3707 - ret = slab_alloc(cachep, flags, size, caller); 3692 + ret = slab_alloc(cachep, NULL, flags, size, caller); 3708 3693 3709 3694 ret = kasan_kmalloc(cachep, ret, size, flags); 3710 3695 trace_kmalloc(caller, ret,

+21 -4

mm/slab.h

···
 #include <linux/kmemleak.h>
 #include <linux/random.h>
 #include <linux/sched/mm.h>
+#include <linux/list_lru.h>

 /*
  * State of the slab allocator.
···
  * Returns false if the allocation should fail.
  */
 static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+					     struct list_lru *lru,
 					     struct obj_cgroup **objcgp,
 					     size_t objects, gfp_t flags)
 {
···
 	if (!objcg)
 		return true;

-	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
-		obj_cgroup_put(objcg);
-		return false;
+	if (lru) {
+		int ret;
+		struct mem_cgroup *memcg;
+
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		ret = memcg_list_lru_alloc(memcg, lru, flags);
+		css_put(&memcg->css);
+
+		if (ret)
+			goto out;
 	}
+
+	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s)))
+		goto out;

 	*objcgp = objcg;
 	return true;
+out:
+	obj_cgroup_put(objcg);
+	return false;
 }

 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
···
 }

 static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+					     struct list_lru *lru,
 					     struct obj_cgroup **objcgp,
 					     size_t objects, gfp_t flags)
 {
···
 }

 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
+						     struct list_lru *lru,
 						     struct obj_cgroup **objcgp,
 						     size_t size, gfp_t flags)
 {
···
 	if (should_failslab(s, flags))
 		return NULL;

-	if (!memcg_slab_pre_alloc_hook(s, objcgp, size, flags))
+	if (!memcg_slab_pre_alloc_hook(s, lru, objcgp, size, flags))
 		return NULL;

 	return s;

+6

mm/slob.c

···
 }
 EXPORT_SYMBOL(kmem_cache_alloc);

+
+void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags)
+{
+	return slob_alloc_node(cachep, flags, NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_lru);
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t gfp, int node)
 {

+28 -14

mm/slub.c

··· 3131 3131 * 3132 3132 * Otherwise we can simply pick the next object from the lockless free list. 3133 3133 */ 3134 - static __always_inline void *slab_alloc_node(struct kmem_cache *s, 3134 + static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru, 3135 3135 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size) 3136 3136 { 3137 3137 void *object; ··· 3141 3141 struct obj_cgroup *objcg = NULL; 3142 3142 bool init = false; 3143 3143 3144 - s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags); 3144 + s = slab_pre_alloc_hook(s, lru, &objcg, 1, gfpflags); 3145 3145 if (!s) 3146 3146 return NULL; 3147 3147 ··· 3232 3232 return object; 3233 3233 } 3234 3234 3235 - static __always_inline void *slab_alloc(struct kmem_cache *s, 3235 + static __always_inline void *slab_alloc(struct kmem_cache *s, struct list_lru *lru, 3236 3236 gfp_t gfpflags, unsigned long addr, size_t orig_size) 3237 3237 { 3238 - return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr, orig_size); 3238 + return slab_alloc_node(s, lru, gfpflags, NUMA_NO_NODE, addr, orig_size); 3239 3239 } 3240 3240 3241 - void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) 3241 + static __always_inline 3242 + void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru, 3243 + gfp_t gfpflags) 3242 3244 { 3243 - void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size); 3245 + void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size); 3244 3246 3245 3247 trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size, 3246 3248 s->size, gfpflags); 3247 3249 3248 3250 return ret; 3249 3251 } 3252 + 3253 + void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) 3254 + { 3255 + return __kmem_cache_alloc_lru(s, NULL, gfpflags); 3256 + } 3250 3257 EXPORT_SYMBOL(kmem_cache_alloc); 3258 + 3259 + void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru, 3260 + gfp_t gfpflags) 3261 + { 3262 + return __kmem_cache_alloc_lru(s, lru, gfpflags); 3263 + } 3264 
+ EXPORT_SYMBOL(kmem_cache_alloc_lru); 3251 3265 3252 3266 #ifdef CONFIG_TRACING 3253 3267 void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size) 3254 3268 { 3255 - void *ret = slab_alloc(s, gfpflags, _RET_IP_, size); 3269 + void *ret = slab_alloc(s, NULL, gfpflags, _RET_IP_, size); 3256 3270 trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags); 3257 3271 ret = kasan_kmalloc(s, ret, size, gfpflags); 3258 3272 return ret; ··· 3277 3263 #ifdef CONFIG_NUMA 3278 3264 void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node) 3279 3265 { 3280 - void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, s->object_size); 3266 + void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size); 3281 3267 3282 3268 trace_kmem_cache_alloc_node(_RET_IP_, ret, 3283 3269 s->object_size, s->size, gfpflags, node); ··· 3291 3277 gfp_t gfpflags, 3292 3278 int node, size_t size) 3293 3279 { 3294 - void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, size); 3280 + void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, size); 3295 3281 3296 3282 trace_kmalloc_node(_RET_IP_, ret, 3297 3283 size, s->size, gfpflags, node); ··· 3681 3667 struct obj_cgroup *objcg = NULL; 3682 3668 3683 3669 /* memcg and kmem_cache debug support */ 3684 - s = slab_pre_alloc_hook(s, &objcg, size, flags); 3670 + s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags); 3685 3671 if (unlikely(!s)) 3686 3672 return false; 3687 3673 /* ··· 4431 4417 if (unlikely(ZERO_OR_NULL_PTR(s))) 4432 4418 return s; 4433 4419 4434 - ret = slab_alloc(s, flags, _RET_IP_, size); 4420 + ret = slab_alloc(s, NULL, flags, _RET_IP_, size); 4435 4421 4436 4422 trace_kmalloc(_RET_IP_, ret, size, s->size, flags); 4437 4423 ··· 4479 4465 if (unlikely(ZERO_OR_NULL_PTR(s))) 4480 4466 return s; 4481 4467 4482 - ret = slab_alloc_node(s, flags, node, _RET_IP_, size); 4468 + ret = slab_alloc_node(s, NULL, flags, node, _RET_IP_, size); 4483 4469 4484 4470 
trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node); 4485 4471 ··· 4937 4923 if (unlikely(ZERO_OR_NULL_PTR(s))) 4938 4924 return s; 4939 4925 4940 - ret = slab_alloc(s, gfpflags, caller, size); 4926 + ret = slab_alloc(s, NULL, gfpflags, caller, size); 4941 4927 4942 4928 /* Honor the call site pointer we received. */ 4943 4929 trace_kmalloc(caller, ret, size, s->size, gfpflags); ··· 4968 4954 if (unlikely(ZERO_OR_NULL_PTR(s))) 4969 4955 return s; 4970 4956 4971 - ret = slab_alloc_node(s, gfpflags, node, caller, size); 4957 + ret = slab_alloc_node(s, NULL, gfpflags, node, caller, size); 4972 4958 4973 4959 /* Honor the call site pointer we received. */ 4974 4960 trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);

+54 -16

mm/sparse-vmemmap.c

··· 34 34 #include <asm/pgalloc.h> 35 35 #include <asm/tlbflush.h> 36 36 37 + #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 37 38 /** 38 39 * struct vmemmap_remap_walk - walk vmemmap page table 39 40 * ··· 54 53 struct list_head *vmemmap_pages; 55 54 }; 56 55 57 - static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, 58 - struct vmemmap_remap_walk *walk) 56 + static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start) 59 57 { 60 58 pmd_t __pmd; 61 59 int i; ··· 76 76 set_pte_at(&init_mm, addr, pte, entry); 77 77 } 78 78 79 - /* Make pte visible before pmd. See comment in pmd_install(). */ 80 - smp_wmb(); 81 - pmd_populate_kernel(&init_mm, pmd, pgtable); 82 - 83 - flush_tlb_kernel_range(start, start + PMD_SIZE); 79 + spin_lock(&init_mm.page_table_lock); 80 + if (likely(pmd_leaf(*pmd))) { 81 + /* Make pte visible before pmd. See comment in pmd_install(). */ 82 + smp_wmb(); 83 + pmd_populate_kernel(&init_mm, pmd, pgtable); 84 + flush_tlb_kernel_range(start, start + PMD_SIZE); 85 + } else { 86 + pte_free_kernel(&init_mm, pgtable); 87 + } 88 + spin_unlock(&init_mm.page_table_lock); 84 89 85 90 return 0; 91 + } 92 + 93 + static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start) 94 + { 95 + int leaf; 96 + 97 + spin_lock(&init_mm.page_table_lock); 98 + leaf = pmd_leaf(*pmd); 99 + spin_unlock(&init_mm.page_table_lock); 100 + 101 + if (!leaf) 102 + return 0; 103 + 104 + return __split_vmemmap_huge_pmd(pmd, start); 86 105 } 87 106 88 107 static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, ··· 140 121 141 122 pmd = pmd_offset(pud, addr); 142 123 do { 143 - if (pmd_leaf(*pmd)) { 144 - int ret; 124 + int ret; 145 125 146 - ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, walk); 147 - if (ret) 148 - return ret; 149 - } 126 + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK); 127 + if (ret) 128 + return ret; 129 + 150 130 next = pmd_addr_end(addr, end); 151 131 vmemmap_pte_range(pmd, addr, next, walk); 152 132 } while (pmd++, addr = next, addr 
!= end); ··· 263 245 set_pte_at(&init_mm, addr, pte, entry); 264 246 } 265 247 248 + /* 249 + * How many struct page structs need to be reset. When we reuse the head 250 + * struct page, the special metadata (e.g. page->flags or page->mapping) 251 + * cannot copy to the tail struct page structs. The invalid value will be 252 + * checked in the free_tail_pages_check(). In order to avoid the message 253 + * of "corrupted mapping in tail page". We need to reset at least 3 (one 254 + * head struct page struct and two tail struct page structs) struct page 255 + * structs. 256 + */ 257 + #define NR_RESET_STRUCT_PAGE 3 258 + 259 + static inline void reset_struct_pages(struct page *start) 260 + { 261 + int i; 262 + struct page *from = start + NR_RESET_STRUCT_PAGE; 263 + 264 + for (i = 0; i < NR_RESET_STRUCT_PAGE; i++) 265 + memcpy(start + i, from, sizeof(*from)); 266 + } 267 + 266 268 static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, 267 269 struct vmemmap_remap_walk *walk) 268 270 { ··· 296 258 list_del(&page->lru); 297 259 to = page_to_virt(page); 298 260 copy_page(to, (void *)walk->reuse_addr); 261 + reset_struct_pages(to); 299 262 300 263 set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); 301 264 } ··· 339 300 */ 340 301 BUG_ON(start - reuse != PAGE_SIZE); 341 302 342 - mmap_write_lock(&init_mm); 303 + mmap_read_lock(&init_mm); 343 304 ret = vmemmap_remap_range(reuse, end, &walk); 344 - mmap_write_downgrade(&init_mm); 345 - 346 305 if (ret && walk.nr_walked) { 347 306 end = reuse + walk.nr_walked * PAGE_SIZE; 348 307 /* ··· 420 383 421 384 return 0; 422 385 } 386 + #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */ 423 387 424 388 /* 425 389 * Allocate a block of memory to be used to back the virtual memory map
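The `reset_struct_pages()` helper added in this hunk overwrites the first few entries of a restored vmemmap page with a copy of a later, ordinary tail entry. A minimal userspace sketch of that copy pattern follows. `struct fake_page` and `reset_fake_pages()` are hypothetical names used only for illustration, and the two-field struct is an assumption, far smaller than the real `struct page`.

```c
#include <assert.h>
#include <string.h>

#define NR_RESET 3 /* mirrors NR_RESET_STRUCT_PAGE in the hunk */

/* Hypothetical, much-reduced stand-in for struct page. */
struct fake_page {
	unsigned long flags;
	void *mapping;
};

/*
 * Overwrite the first NR_RESET entries with a copy of the entry at
 * index NR_RESET, which is assumed to be a plain tail entry carrying
 * no head-only metadata. The array must hold at least NR_RESET + 1
 * entries.
 */
static void reset_fake_pages(struct fake_page *start)
{
	const struct fake_page *from = start + NR_RESET;
	int i;

	assert(start != NULL);
	for (i = 0; i < NR_RESET; i++)
		memcpy(start + i, from, sizeof(*from));
}
```

Copying from a known-clean entry avoids enumerating which individual fields carry stale head-page metadata, at the cost of assuming the entry at index `NR_RESET` is itself clean.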

+1 -1

mm/sparse.c

···
 }

 /* Validate the physical addressing limitations of the model */
-void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
+static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
 						unsigned long *end_pfn)
 {
 	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);

+15 -10

mm/swap.c

···
 		/*
 		 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
 		 * this list is never rotated or maintained, so marking an
-		 * evictable page accessed has no effect.
+		 * unevictable page accessed has no effect.
 		 */
 	} else if (!folio_test_active(folio)) {
 		/*
···
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

-		if (force_all_cpus ||
-		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
 		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
···
 void lru_cache_disable(void)
 {
 	atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
 	/*
-	 * lru_add_drain_all in the force mode will schedule draining on
-	 * all online CPUs so any calls of lru_cache_disabled wrapped by
-	 * local_lock or preemption disabled would be ordered by that.
-	 * The atomic operation doesn't need to have stronger ordering
-	 * requirements because that is enforced by the scheduling
-	 * guarantees.
+	 * Readers of lru_disable_count are protected by either disabling
+	 * preemption or rcu_read_lock:
+	 *
+	 * preempt_disable, local_irq_disable  [bh_lru_lock()]
+	 * rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
+	 * preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]
+	 *
+	 * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+	 * preempt_disable() regions of code. So any CPU which sees
+	 * lru_disable_count = 0 will have exited the critical
+	 * section when synchronize_rcu() returns.
 	 */
+	synchronize_rcu();
+#ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
 	lru_add_and_bh_lrus_drain();

+1

mm/swapfile.c

···
 			struct vm_fault vmf = {
 				.vma = vma,
 				.address = addr,
+				.real_address = addr,
 				.pmd = pmd,
 			};


+4 -12

mm/usercopy.c

···
  * kmem_cache_create_usercopy() function to create the cache (and
  * carefully audit the whitelist range).
  */
-void usercopy_warn(const char *name, const char *detail, bool to_user,
-		   unsigned long offset, unsigned long len)
-{
-	WARN_ONCE(1, "Bad or missing usercopy whitelist? Kernel memory %s attempt detected %s %s%s%s%s (offset %lu, size %lu)!\n",
-		 to_user ? "exposure" : "overwrite",
-		 to_user ? "from" : "to",
-		 name ? : "unknown?!",
-		 detail ? " '" : "", detail ? : "", detail ? "'" : "",
-		 offset, len);
-}
-
 void __noreturn usercopy_abort(const char *name, const char *detail,
 			       bool to_user, unsigned long offset,
 			       unsigned long len)
···

 static int __init parse_hardened_usercopy(char *str)
 {
-	return strtobool(str, &enable_checks);
+	if (strtobool(str, &enable_checks))
+		pr_warn("Invalid option string for hardened_usercopy: '%s'\n",
+			str);
+	return 1;
 }

 __setup("hardened_usercopy=", parse_hardened_usercopy);
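The change to `parse_hardened_usercopy()` makes the handler always return 1 and only warn on a bad value, on the understanding that a `__setup()` handler returns 1 to mark the option as consumed. A rough userspace sketch of that control flow is below. `parse_bool()` is a simplified, hypothetical stand-in for the kernel's `strtobool()` (it looks only at the first character), and `parse_hardened_usercopy_sketch()` is likewise a name invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool enable_checks = true;

/*
 * Simplified, hypothetical stand-in for strtobool():
 * returns 0 and stores the value on success, -1 on bad input.
 */
static int parse_bool(const char *s, bool *res)
{
	assert(res != NULL);
	if (!s)
		return -1;
	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	default:
		return -1;
	}
}

/* Always reports the option as handled (1); a bad value only warns. */
static int parse_hardened_usercopy_sketch(const char *str)
{
	if (parse_bool(str, &enable_checks))
		fprintf(stderr,
			"Invalid option string for hardened_usercopy: '%s'\n",
			str ? str : "(null)");
	return 1;
}
```

An invalid value leaves `enable_checks` at its previous setting, which matches the hunk: the parse result is used only to decide whether to warn.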

+3

mm/userfaultfd.c

···
 			/* don't free the page */
 			goto out;
 		}
+
+		flush_dcache_page(page);
 	} else {
 		page = *pagep;
 		*pagep = NULL;
···
 				err = -EFAULT;
 				goto out;
 			}
+			flush_dcache_page(page);
 			goto retry;
 		} else
 			BUG_ON(page);

+61 -41

mm/vmalloc.c

··· 118 118 if (size != PAGE_SIZE) { 119 119 pte_t entry = pfn_pte(pfn, prot); 120 120 121 - entry = pte_mkhuge(entry); 122 121 entry = arch_make_huge_pte(entry, ilog2(size), 0); 123 122 set_huge_pte_at(&init_mm, addr, pte, entry); 124 123 pfn += PFN_DOWN(size); ··· 775 776 return va ? va->subtree_max_size : 0; 776 777 } 777 778 778 - /* 779 - * Gets called when remove the node and rotate. 780 - */ 781 - static __always_inline unsigned long 782 - compute_subtree_max_size(struct vmap_area *va) 783 - { 784 - return max3(va_size(va), 785 - get_subtree_max_size(va->rb_node.rb_left), 786 - get_subtree_max_size(va->rb_node.rb_right)); 787 - } 788 - 789 779 RB_DECLARE_CALLBACKS_MAX(static, free_vmap_area_rb_augment_cb, 790 780 struct vmap_area, rb_node, unsigned long, subtree_max_size, va_size) 791 781 792 782 static void purge_vmap_area_lazy(void); 793 783 static BLOCKING_NOTIFIER_HEAD(vmap_notify_list); 794 - static unsigned long lazy_max_pages(void); 784 + static void drain_vmap_area_work(struct work_struct *work); 785 + static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work); 795 786 796 787 static atomic_long_t nr_vmalloc_pages; 797 788 ··· 962 973 } 963 974 964 975 #if DEBUG_AUGMENT_PROPAGATE_CHECK 976 + /* 977 + * Gets called when remove the node and rotate. 978 + */ 979 + static __always_inline unsigned long 980 + compute_subtree_max_size(struct vmap_area *va) 981 + { 982 + return max3(va_size(va), 983 + get_subtree_max_size(va->rb_node.rb_left), 984 + get_subtree_max_size(va->rb_node.rb_right)); 985 + } 986 + 965 987 static void 966 988 augment_tree_propagate_check(void) 967 989 { ··· 1189 1189 /* 1190 1190 * Find the first free block(lowest start address) in the tree, 1191 1191 * that will accomplish the request corresponding to passing 1192 - * parameters. 1192 + * parameters. Please note, with an alignment bigger than PAGE_SIZE, 1193 + * a search length is adjusted to account for worst case alignment 1194 + * overhead. 
1193 1195 */ 1194 1196 static __always_inline struct vmap_area * 1195 - find_vmap_lowest_match(unsigned long size, 1196 - unsigned long align, unsigned long vstart) 1197 + find_vmap_lowest_match(unsigned long size, unsigned long align, 1198 + unsigned long vstart, bool adjust_search_size) 1197 1199 { 1198 1200 struct vmap_area *va; 1199 1201 struct rb_node *node; 1202 + unsigned long length; 1200 1203 1201 1204 /* Start from the root. */ 1202 1205 node = free_vmap_area_root.rb_node; 1203 1206 1207 + /* Adjust the search size for alignment overhead. */ 1208 + length = adjust_search_size ? size + align - 1 : size; 1209 + 1204 1210 while (node) { 1205 1211 va = rb_entry(node, struct vmap_area, rb_node); 1206 1212 1207 - if (get_subtree_max_size(node->rb_left) >= size && 1213 + if (get_subtree_max_size(node->rb_left) >= length && 1208 1214 vstart < va->va_start) { 1209 1215 node = node->rb_left; 1210 1216 } else { ··· 1220 1214 /* 1221 1215 * Does not make sense to go deeper towards the right 1222 1216 * sub-tree if it does not have a free block that is 1223 - * equal or bigger to the requested search size. 1217 + * equal or bigger to the requested search length. 1224 1218 */ 1225 - if (get_subtree_max_size(node->rb_right) >= size) { 1219 + if (get_subtree_max_size(node->rb_right) >= length) { 1226 1220 node = node->rb_right; 1227 1221 continue; 1228 1222 } ··· 1238 1232 if (is_within_this_va(va, size, align, vstart)) 1239 1233 return va; 1240 1234 1241 - if (get_subtree_max_size(node->rb_right) >= size && 1235 + if (get_subtree_max_size(node->rb_right) >= length && 1242 1236 vstart <= va->va_start) { 1243 1237 /* 1244 1238 * Shift the vstart forward. 
Please note, we update it with ··· 1286 1280 get_random_bytes(&rnd, sizeof(rnd)); 1287 1281 vstart = VMALLOC_START + rnd; 1288 1282 1289 - va_1 = find_vmap_lowest_match(size, align, vstart); 1283 + va_1 = find_vmap_lowest_match(size, align, vstart, false); 1290 1284 va_2 = find_vmap_lowest_linear_match(size, align, vstart); 1291 1285 1292 1286 if (va_1 != va_2) ··· 1437 1431 __alloc_vmap_area(unsigned long size, unsigned long align, 1438 1432 unsigned long vstart, unsigned long vend) 1439 1433 { 1434 + bool adjust_search_size = true; 1440 1435 unsigned long nva_start_addr; 1441 1436 struct vmap_area *va; 1442 1437 enum fit_type type; 1443 1438 int ret; 1444 1439 1445 - va = find_vmap_lowest_match(size, align, vstart); 1440 + /* 1441 + * Do not adjust when: 1442 + * a) align <= PAGE_SIZE, because it does not make any sense. 1443 + * All blocks(their start addresses) are at least PAGE_SIZE 1444 + * aligned anyway; 1445 + * b) a short range where a requested size corresponds to exactly 1446 + * specified [vstart:vend] interval and an alignment > PAGE_SIZE. 1447 + * With adjusted search length an allocation would not succeed. 1448 + */ 1449 + if (align <= PAGE_SIZE || (align > PAGE_SIZE && (vend - vstart) == size)) 1450 + adjust_search_size = false; 1451 + 1452 + va = find_vmap_lowest_match(size, align, vstart, adjust_search_size); 1446 1453 if (unlikely(!va)) 1447 1454 return vend; 1448 1455 ··· 1739 1720 } 1740 1721 1741 1722 /* 1742 - * Kick off a purge of the outstanding lazy areas. Don't bother if somebody 1743 - * is already purging. 1744 - */ 1745 - static void try_purge_vmap_area_lazy(void) 1746 - { 1747 - if (mutex_trylock(&vmap_purge_lock)) { 1748 - __purge_vmap_area_lazy(ULONG_MAX, 0); 1749 - mutex_unlock(&vmap_purge_lock); 1750 - } 1751 - } 1752 - 1753 - /* 1754 1723 * Kick off a purge of the outstanding lazy areas. 
1755 1724 */ 1756 1725 static void purge_vmap_area_lazy(void) ··· 1747 1740 purge_fragmented_blocks_allcpus(); 1748 1741 __purge_vmap_area_lazy(ULONG_MAX, 0); 1749 1742 mutex_unlock(&vmap_purge_lock); 1743 + } 1744 + 1745 + static void drain_vmap_area_work(struct work_struct *work) 1746 + { 1747 + unsigned long nr_lazy; 1748 + 1749 + do { 1750 + mutex_lock(&vmap_purge_lock); 1751 + __purge_vmap_area_lazy(ULONG_MAX, 0); 1752 + mutex_unlock(&vmap_purge_lock); 1753 + 1754 + /* Recheck if further work is required. */ 1755 + nr_lazy = atomic_long_read(&vmap_lazy_nr); 1756 + } while (nr_lazy > lazy_max_pages()); 1750 1757 } 1751 1758 1752 1759 /* ··· 1789 1768 1790 1769 /* After this point, we may free va at any time */ 1791 1770 if (unlikely(nr_lazy > lazy_max_pages())) 1792 - try_purge_vmap_area_lazy(); 1771 + schedule_work(&drain_vmap_work); 1793 1772 } 1794 1773 1795 1774 /* ··· 2946 2925 int node) 2947 2926 { 2948 2927 const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; 2949 - const gfp_t orig_gfp_mask = gfp_mask; 2950 2928 bool nofail = gfp_mask & __GFP_NOFAIL; 2951 2929 unsigned long addr = (unsigned long)area->addr; 2952 2930 unsigned long size = get_vm_area_size(area); ··· 2969 2949 } 2970 2950 2971 2951 if (!area->pages) { 2972 - warn_alloc(orig_gfp_mask, NULL, 2952 + warn_alloc(gfp_mask, NULL, 2973 2953 "vmalloc error: size %lu, failed to allocated page array size %lu", 2974 2954 nr_small_pages * PAGE_SIZE, array_size); 2975 2955 free_vm_area(area); ··· 2979 2959 set_vm_area_page_order(area, page_shift - PAGE_SHIFT); 2980 2960 page_order = vm_area_page_order(area); 2981 2961 2982 - area->nr_pages = vm_area_alloc_pages(gfp_mask, node, 2983 - page_order, nr_small_pages, area->pages); 2962 + area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN, 2963 + node, page_order, nr_small_pages, area->pages); 2984 2964 2985 2965 atomic_long_add(area->nr_pages, &nr_vmalloc_pages); 2986 2966 if (gfp_mask & __GFP_ACCOUNT) { ··· 2996 2976 * allocation 
request, free them via __vfree() if any. 2997 2977 */ 2998 2978 if (area->nr_pages != nr_small_pages) { 2999 - warn_alloc(orig_gfp_mask, NULL, 2979 + warn_alloc(gfp_mask, NULL, 3000 2980 "vmalloc error: size %lu, page order %u, failed to allocate pages", 3001 2981 area->nr_pages * PAGE_SIZE, page_order); 3002 2982 goto fail; ··· 3024 3004 memalloc_noio_restore(flags); 3025 3005 3026 3006 if (ret < 0) { 3027 - warn_alloc(orig_gfp_mask, NULL, 3007 + warn_alloc(gfp_mask, NULL, 3028 3008 "vmalloc error: size %lu, failed to map pages", 3029 3009 area->nr_pages * PAGE_SIZE); 3030 3010 goto fail;
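
The comment added to `__alloc_vmap_area()` above names two cases in which the search length is left unadjusted. A minimal userspace sketch of that predicate follows; the helper name `should_adjust_search_size` and the fixed 4096-byte `SKETCH_PAGE_SIZE` are assumptions made for illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical helper mirroring the condition in the hunk above.
 * Returns false when the search length should NOT be adjusted.
 */
static bool should_adjust_search_size(unsigned long size, unsigned long align,
				      unsigned long vstart, unsigned long vend)
{
	/* a) align <= page size: block start addresses are already aligned */
	if (align <= SKETCH_PAGE_SIZE)
		return false;

	/* b) the request fills [vstart:vend] exactly, so padding cannot fit */
	if (vend - vstart == size)
		return false;

	return true;
}
```

Only a request with an alignment above the page size and a range larger than the request itself keeps the adjusted search length.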

+28 -110

mm/vmscan.c

··· 56 56 57 57 #include <linux/swapops.h> 58 58 #include <linux/balloon_compaction.h> 59 + #include <linux/sched/sysctl.h> 59 60 60 61 #include "internal.h" 61 62 ··· 990 989 return page_count(page) - page_has_private(page) == 1 + page_cache_pins; 991 990 } 992 991 993 - static int may_write_to_inode(struct inode *inode) 994 - { 995 - if (current->flags & PF_SWAPWRITE) 996 - return 1; 997 - if (!inode_write_congested(inode)) 998 - return 1; 999 - if (inode_to_bdi(inode) == current->backing_dev_info) 1000 - return 1; 1001 - return 0; 1002 - } 1003 - 1004 992 /* 1005 993 * We detected a synchronous write error writing a page out. Probably 1006 994 * -ENOSPC. We need to propagate that into the address_space for a subsequent ··· 1191 1201 } 1192 1202 if (mapping->a_ops->writepage == NULL) 1193 1203 return PAGE_ACTIVATE; 1194 - if (!may_write_to_inode(mapping->host)) 1195 - return PAGE_KEEP; 1196 1204 1197 1205 if (clear_page_dirty_for_io(page)) { 1198 1206 int res; ··· 1386 1398 /* 1387 1399 * All mapped pages start out with page table 1388 1400 * references from the instantiating fault, so we need 1389 - * to look twice if a mapped file page is used more 1401 + * to look twice if a mapped file/anon page is used more 1390 1402 * than once. 1391 1403 * 1392 1404 * Mark it and spare it for another trip around the ··· 1566 1578 * end of the LRU a second time. 1567 1579 */ 1568 1580 mapping = page_mapping(page); 1569 - if (((dirty || writeback) && mapping && 1570 - inode_write_congested(mapping->host)) || 1571 - (writeback && PageReclaim(page))) 1581 + if (writeback && PageReclaim(page)) 1572 1582 stat->nr_congested++; 1573 1583 1574 1584 /* ··· 2000 2014 } 2001 2015 2002 2016 /* 2003 - * Attempt to remove the specified page from its LRU. Only take this page 2004 - * if it is of the appropriate PageActive status. Pages which are being 2005 - * freed elsewhere are also ignored. 
2006 - * 2007 - * page: page to consider 2008 - * mode: one of the LRU isolation modes defined above 2009 - * 2010 - * returns true on success, false on failure. 2011 - */ 2012 - bool __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode) 2013 - { 2014 - /* Only take pages on the LRU. */ 2015 - if (!PageLRU(page)) 2016 - return false; 2017 - 2018 - /* Compaction should not handle unevictable pages but CMA can do so */ 2019 - if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) 2020 - return false; 2021 - 2022 - /* 2023 - * To minimise LRU disruption, the caller can indicate that it only 2024 - * wants to isolate pages it will be able to operate on without 2025 - * blocking - clean pages for the most part. 2026 - * 2027 - * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages 2028 - * that it is possible to migrate without blocking 2029 - */ 2030 - if (mode & ISOLATE_ASYNC_MIGRATE) { 2031 - /* All the caller can do on PageWriteback is block */ 2032 - if (PageWriteback(page)) 2033 - return false; 2034 - 2035 - if (PageDirty(page)) { 2036 - struct address_space *mapping; 2037 - bool migrate_dirty; 2038 - 2039 - /* 2040 - * Only pages without mappings or that have a 2041 - * ->migratepage callback are possible to migrate 2042 - * without blocking. However, we can be racing with 2043 - * truncation so it's necessary to lock the page 2044 - * to stabilise the mapping as truncation holds 2045 - * the page lock until after the page is removed 2046 - * from the page cache. 2047 - */ 2048 - if (!trylock_page(page)) 2049 - return false; 2050 - 2051 - mapping = page_mapping(page); 2052 - migrate_dirty = !mapping || mapping->a_ops->migratepage; 2053 - unlock_page(page); 2054 - if (!migrate_dirty) 2055 - return false; 2056 - } 2057 - } 2058 - 2059 - if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) 2060 - return false; 2061 - 2062 - return true; 2063 - } 2064 - 2065 - /* 2066 2017 * Update LRU sizes after isolating pages. 
The LRU size updates must 2067 2018 * be complete before mem_cgroup_update_lru_size due to a sanity check. 2068 2019 */ ··· 2050 2127 unsigned long skipped = 0; 2051 2128 unsigned long scan, total_scan, nr_pages; 2052 2129 LIST_HEAD(pages_skipped); 2053 - isolate_mode_t mode = (sc->may_unmap ? 0 : ISOLATE_UNMAPPED); 2054 2130 2055 2131 total_scan = 0; 2056 2132 scan = 0; 2057 2133 while (scan < nr_to_scan && !list_empty(src)) { 2134 + struct list_head *move_to = src; 2058 2135 struct page *page; 2059 2136 2060 2137 page = lru_to_page(src); ··· 2064 2141 total_scan += nr_pages; 2065 2142 2066 2143 if (page_zonenum(page) > sc->reclaim_idx) { 2067 - list_move(&page->lru, &pages_skipped); 2068 2144 nr_skipped[page_zonenum(page)] += nr_pages; 2069 - continue; 2145 + move_to = &pages_skipped; 2146 + goto move; 2070 2147 } 2071 2148 2072 2149 /* ··· 2074 2151 * return with no isolated pages if the LRU mostly contains 2075 2152 * ineligible pages. This causes the VM to not reclaim any 2076 2153 * pages, triggering a premature OOM. 2077 - * 2078 - * Account all tail pages of THP. This would not cause 2079 - * premature OOM since __isolate_lru_page() returns -EBUSY 2080 - * only when the page is being freed somewhere else. 2154 + * Account all tail pages of THP. 2081 2155 */ 2082 2156 scan += nr_pages; 2083 - if (!__isolate_lru_page_prepare(page, mode)) { 2084 - /* It is being freed elsewhere */ 2085 - list_move(&page->lru, src); 2086 - continue; 2087 - } 2157 + 2158 + if (!PageLRU(page)) 2159 + goto move; 2160 + if (!sc->may_unmap && page_mapped(page)) 2161 + goto move; 2162 + 2088 2163 /* 2089 2164 * Be careful not to clear PageLRU until after we're 2090 2165 * sure the page is not being freed elsewhere -- the 2091 2166 * page release code relies on it. 
2092 2167 */ 2093 - if (unlikely(!get_page_unless_zero(page))) { 2094 - list_move(&page->lru, src); 2095 - continue; 2096 - } 2168 + if (unlikely(!get_page_unless_zero(page))) 2169 + goto move; 2097 2170 2098 2171 if (!TestClearPageLRU(page)) { 2099 2172 /* Another thread is already isolating this page */ 2100 2173 put_page(page); 2101 - list_move(&page->lru, src); 2102 - continue; 2174 + goto move; 2103 2175 } 2104 2176 2105 2177 nr_taken += nr_pages; 2106 2178 nr_zone_taken[page_zonenum(page)] += nr_pages; 2107 - list_move(&page->lru, dst); 2179 + move_to = dst; 2180 + move: 2181 + list_move(&page->lru, move_to); 2108 2182 } 2109 2183 2110 2184 /* ··· 2125 2205 } 2126 2206 *nr_scanned = total_scan; 2127 2207 trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, 2128 - total_scan, skipped, nr_taken, mode, lru); 2208 + total_scan, skipped, nr_taken, 2209 + sc->may_unmap ? 0 : ISOLATE_UNMAPPED, lru); 2129 2210 update_lru_sizes(lruvec, lru, nr_zone_taken); 2130 2211 return nr_taken; 2131 2212 } ··· 2300 2379 */ 2301 2380 static int current_may_throttle(void) 2302 2381 { 2303 - return !(current->flags & PF_LOCAL_THROTTLE) || 2304 - current->backing_dev_info == NULL || 2305 - bdi_write_congested(current->backing_dev_info); 2382 + return !(current->flags & PF_LOCAL_THROTTLE); 2306 2383 } 2307 2384 2308 2385 /* ··· 3896 3977 if (!managed_zone(zone)) 3897 3978 continue; 3898 3979 3899 - mark = high_wmark_pages(zone); 3980 + if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) 3981 + mark = wmark_pages(zone, WMARK_PROMO); 3982 + else 3983 + mark = high_wmark_pages(zone); 3900 3984 if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx)) 3901 3985 return true; 3902 3986 } ··· 4396 4474 * us from recursively trying to free more memory as we're 4397 4475 * trying to free the first piece of memory in the first place). 
4398 4476 */ 4399 - tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; 4477 + tsk->flags |= PF_MEMALLOC | PF_KSWAPD; 4400 4478 set_freezable(); 4401 4479 4402 4480 WRITE_ONCE(pgdat->kswapd_order, 0); ··· 4447 4525 goto kswapd_try_sleep; 4448 4526 } 4449 4527 4450 - tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD); 4528 + tsk->flags &= ~(PF_MEMALLOC | PF_KSWAPD); 4451 4529 4452 4530 return 0; 4453 4531 } ··· 4688 4766 fs_reclaim_acquire(sc.gfp_mask); 4689 4767 /* 4690 4768 * We need to be able to allocate from the reserves for RECLAIM_UNMAP 4691 - * and we also need to be able to write out pages for RECLAIM_WRITE 4692 - * and RECLAIM_UNMAP. 4693 4769 */ 4694 4770 noreclaim_flag = memalloc_noreclaim_save(); 4695 - p->flags |= PF_SWAPWRITE; 4696 4771 set_task_reclaim_state(p, &sc.reclaim_state); 4697 4772 4698 4773 if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) { ··· 4703 4784 } 4704 4785 4705 4786 set_task_reclaim_state(p, NULL); 4706 - current->flags &= ~PF_SWAPWRITE; 4707 4787 memalloc_noreclaim_restore(noreclaim_flag); 4708 4788 fs_reclaim_release(sc.gfp_mask); 4709 4789 psi_memstall_leave(&pflags);

+18 -1

mm/vmstat.c

···
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/migrate.h>
 
 #include "internal.h"
 
···
 #ifdef CONFIG_SWAP
 	"nr_swapcached",
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	"pgpromote_success",
+#endif
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
···
 #ifdef CONFIG_SWAP
 	"swap_ra",
 	"swap_ra_hit",
+#ifdef CONFIG_KSM
+	"ksm_swpin_copy",
+#endif
 #endif
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
···
 static int vmstat_cpu_online(unsigned int cpu)
 {
 	refresh_zone_stat_thresholds();
-	node_set_state(cpu_to_node(cpu), N_CPU);
+
+	if (!node_state(cpu_to_node(cpu), N_CPU)) {
+		node_set_state(cpu_to_node(cpu), N_CPU);
+		set_migration_target_nodes();
+	}
+
 	return 0;
 }
 
···
 		return 0;
 
 	node_clear_state(node, N_CPU);
+	set_migration_target_nodes();
+
 	return 0;
 }
 
···
 	cpus_read_unlock();
 
 	start_shepherd_timer();
+#endif
+#if defined(CONFIG_MIGRATION) && defined(CONFIG_HOTPLUG_CPU)
+	migrate_on_reclaim_init();
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);

+5 -2

mm/workingset.c

···
  * point where they would still be useful.
  */
 
-static struct list_lru shadow_nodes;
+struct list_lru shadow_nodes;
 
 void workingset_update_node(struct xa_node *node)
 {
+	struct address_space *mapping;
+
 	/*
 	 * Track non-empty nodes that contain only shadow entries;
 	 * unlink those that contain pages or are being freed.
···
 	 * already where they should be. The list_empty() test is safe
 	 * as node->private_list is protected by the i_pages lock.
 	 */
-	VM_WARN_ON_ONCE(!irqs_disabled());  /* For __inc_lruvec_page_state */
+	mapping = container_of(node->array, struct address_space, i_pages);
+	lockdep_assert_held(&mapping->i_pages.xa_lock);
 
 	if (node->count && node->count == node->nr_values) {
 		if (list_empty(&node->private_list)) {

+14 -1

mm/zswap.c

···
 module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
 		   uint, 0644);
 
-/* Enable/disable handling same-value filled pages (enabled by default) */
+/*
+ * Enable/disable handling same-value filled pages (enabled by default).
+ * If disabled every page is considered non-same-value filled.
+ */
 static bool zswap_same_filled_pages_enabled = true;
 module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
+		   bool, 0644);
+
+/* Enable/disable handling non-same-value filled pages (enabled by default) */
+static bool zswap_non_same_filled_pages_enabled = true;
+module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
 		   bool, 0644);
 
 /*********************************
···
 			goto insert_entry;
 		}
 		kunmap_atomic(src);
+	}
+
+	if (!zswap_non_same_filled_pages_enabled) {
+		ret = -EINVAL;
+		goto freepage;
 	}
 
 	/* if entry is successfully added, it keeps the reference */
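
The two module parameters above combine into three possible outcomes for a page being stored. The following is a simplified userspace sketch of that decision, not the zswap code itself: `classify_store`, `page_is_same_filled` and `enum store_path` are hypothetical names, and the same-filled check is assumed to compare every word against the first one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels for this sketch. */
enum store_path { STORE_SAME_FILLED, STORE_COMPRESSED, STORE_REJECTED };

/* True when every word in the page equals the first word. */
static bool page_is_same_filled(const unsigned long *page, size_t nwords)
{
	for (size_t i = 1; i < nwords; i++)
		if (page[i] != page[0])
			return false;
	return true;
}

static enum store_path classify_store(const unsigned long *page, size_t nwords,
				      bool same_filled_enabled,
				      bool non_same_filled_enabled)
{
	/* With same-filled handling off, every page counts as non-same-filled. */
	if (same_filled_enabled && page_is_same_filled(page, nwords))
		return STORE_SAME_FILLED;

	/* The hunk above bails out with -EINVAL at this point. */
	if (!non_same_filled_enabled)
		return STORE_REJECTED;

	return STORE_COMPRESSED;
}
```

With both parameters disabled, every page is rejected, because a same-filled page is then treated like any other page.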

+1 -1

net/socket.c

···
 {
 	struct socket_alloc *ei;
 
-	ei = kmem_cache_alloc(sock_inode_cachep, GFP_KERNEL);
+	ei = alloc_inode_sb(sb, sock_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
 	init_waitqueue_head(&ei->socket.wq.wait);

+1 -1

net/sunrpc/rpc_pipe.c

···
 rpc_alloc_inode(struct super_block *sb)
 {
 	struct rpc_inode *rpci;
-	rpci = kmem_cache_alloc(rpc_inode_cachep, GFP_KERNEL);
+	rpci = alloc_inode_sb(sb, rpc_inode_cachep, GFP_KERNEL);
 	if (!rpci)
 		return NULL;
 	return &rpci->vfs_inode;

+16

scripts/spelling.txt

··· 180 180 asycronous||asynchronous 181 181 asychronous||asynchronous 182 182 asynchnous||asynchronous 183 + asynchronus||asynchronous 183 184 asynchromous||asynchronous 184 185 asymetric||asymmetric 185 186 asymmeric||asymmetric ··· 232 231 bandwith||bandwidth 233 232 banlance||balance 234 233 batery||battery 234 + battey||battery 235 235 beacuse||because 236 236 becasue||because 237 237 becomming||becoming ··· 335 333 comsume||consume 336 334 comsumer||consumer 337 335 comsuming||consuming 336 + comaptible||compatible 338 337 compability||compatibility 339 338 compaibility||compatibility 340 339 comparsion||comparison ··· 356 353 comppatible||compatible 357 354 compres||compress 358 355 compresion||compression 356 + compresser||compressor 359 357 comression||compression 358 + comsumed||consumed 360 359 comunicate||communicate 361 360 comunication||communication 362 361 conbination||combination ··· 535 530 distiction||distinction 536 531 divisable||divisible 537 532 divsiors||divisors 533 + dsiabled||disabled 538 534 docuentation||documentation 539 535 documantation||documentation 540 536 documentaion||documentation ··· 683 677 frequncy||frequency 684 678 frequancy||frequency 685 679 frome||from 680 + fronend||frontend 686 681 fucntion||function 687 682 fuction||function 688 683 fuctions||functions ··· 768 761 implmenting||implementing 769 762 incative||inactive 770 763 incomming||incoming 764 + incompaitiblity||incompatibility 771 765 incompatabilities||incompatibilities 772 766 incompatable||incompatible 773 767 incompatble||incompatible ··· 950 942 micropone||microphone 951 943 microprocesspr||microprocessor 952 944 migrateable||migratable 945 + millenium||millennium 953 946 milliseonds||milliseconds 954 947 minium||minimum 955 948 minimam||minimum ··· 1016 1007 nubmer||number 1017 1008 numebr||number 1018 1009 numner||number 1010 + nunber||number 1019 1011 obtaion||obtain 1020 1012 obusing||abusing 1021 1013 occassionally||occasionally ··· 1146 1136 
pressre||pressure 1147 1137 presuambly||presumably 1148 1138 previosuly||previously 1139 + previsously||previously 1149 1140 primative||primitive 1150 1141 princliple||principle 1151 1142 priorty||priority ··· 1308 1297 rquest||request 1309 1298 runing||running 1310 1299 runned||ran 1300 + runnnig||running 1311 1301 runnning||running 1312 1302 runtine||runtime 1313 1303 sacrifying||sacrificing ··· 1365 1353 simlar||similar 1366 1354 simliar||similar 1367 1355 simpified||simplified 1356 + simultanous||simultaneous 1368 1357 singaled||signaled 1369 1358 singal||signal 1370 1359 singed||signed ··· 1474 1461 sytem||system 1475 1462 sythesis||synthesis 1476 1463 taht||that 1464 + tained||tainted 1477 1465 tansmit||transmit 1478 1466 targetted||targeted 1479 1467 targetting||targeting ··· 1503 1489 tmis||this 1504 1490 toogle||toggle 1505 1491 torerable||tolerable 1492 + torlence||tolerance 1506 1493 traget||target 1507 1494 traking||tracking 1508 1495 tramsmitted||transmitted ··· 1518 1503 transfered||transferred 1519 1504 transfering||transferring 1520 1505 transision||transition 1506 + transistioned||transitioned 1521 1507 transmittd||transmitted 1522 1508 transormed||transformed 1523 1509 trasfer||transfer

+12 -3

tools/testing/selftests/cgroup/cgroup_util.c

···
 	return 0;
 }
 
-int cg_prepare_for_wait(const char *cgroup)
+static int __prepare_for_wait(const char *cgroup, const char *filename)
 {
 	int fd, ret = -1;
 
···
 	if (fd == -1)
 		return fd;
 
-	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
-				IN_MODIFY);
+	ret = inotify_add_watch(fd, cg_control(cgroup, filename), IN_MODIFY);
 	if (ret == -1) {
 		close(fd);
 		fd = -1;
 	}
 
 	return fd;
+}
+
+int cg_prepare_for_wait(const char *cgroup)
+{
+	return __prepare_for_wait(cgroup, "cgroup.events");
+}
+
+int memcg_prepare_for_wait(const char *cgroup)
+{
+	return __prepare_for_wait(cgroup, "memory.events");
 }
 
 int cg_wait_for(int fd)

+1

tools/testing/selftests/cgroup/cgroup_util.h

···
 extern int clone_into_cgroup_run_wait(const char *cgroup);
 extern int dirfd_open_opath(const char *dir);
 extern int cg_prepare_for_wait(const char *cgroup);
+extern int memcg_prepare_for_wait(const char *cgroup);
 extern int cg_wait_for(int fd);

+78

tools/testing/selftests/cgroup/test_memcontrol.c

··· 16 16 #include <netinet/in.h> 17 17 #include <netdb.h> 18 18 #include <errno.h> 19 + #include <sys/mman.h> 19 20 20 21 #include "../kselftest.h" 21 22 #include "cgroup_util.h" ··· 629 628 return ret; 630 629 } 631 630 631 + static int alloc_anon_mlock(const char *cgroup, void *arg) 632 + { 633 + size_t size = (size_t)arg; 634 + void *buf; 635 + 636 + buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, 637 + 0, 0); 638 + if (buf == MAP_FAILED) 639 + return -1; 640 + 641 + mlock(buf, size); 642 + munmap(buf, size); 643 + return 0; 644 + } 645 + 646 + /* 647 + * This test checks that memory.high is able to throttle big single shot 648 + * allocation i.e. large allocation within one kernel entry. 649 + */ 650 + static int test_memcg_high_sync(const char *root) 651 + { 652 + int ret = KSFT_FAIL, pid, fd = -1; 653 + char *memcg; 654 + long pre_high, pre_max; 655 + long post_high, post_max; 656 + 657 + memcg = cg_name(root, "memcg_test"); 658 + if (!memcg) 659 + goto cleanup; 660 + 661 + if (cg_create(memcg)) 662 + goto cleanup; 663 + 664 + pre_high = cg_read_key_long(memcg, "memory.events", "high "); 665 + pre_max = cg_read_key_long(memcg, "memory.events", "max "); 666 + if (pre_high < 0 || pre_max < 0) 667 + goto cleanup; 668 + 669 + if (cg_write(memcg, "memory.swap.max", "0")) 670 + goto cleanup; 671 + 672 + if (cg_write(memcg, "memory.high", "30M")) 673 + goto cleanup; 674 + 675 + if (cg_write(memcg, "memory.max", "140M")) 676 + goto cleanup; 677 + 678 + fd = memcg_prepare_for_wait(memcg); 679 + if (fd < 0) 680 + goto cleanup; 681 + 682 + pid = cg_run_nowait(memcg, alloc_anon_mlock, (void *)MB(200)); 683 + if (pid < 0) 684 + goto cleanup; 685 + 686 + cg_wait_for(fd); 687 + 688 + post_high = cg_read_key_long(memcg, "memory.events", "high "); 689 + post_max = cg_read_key_long(memcg, "memory.events", "max "); 690 + if (post_high < 0 || post_max < 0) 691 + goto cleanup; 692 + 693 + if (pre_high == post_high || pre_max != post_max) 694 + goto 
cleanup; 695 + 696 + ret = KSFT_PASS; 697 + 698 + cleanup: 699 + if (fd >= 0) 700 + close(fd); 701 + cg_destroy(memcg); 702 + free(memcg); 703 + 704 + return ret; 705 + } 706 + 632 707 /* 633 708 * This test checks that memory.max limits the amount of 634 709 * memory which can be consumed by either anonymous memory ··· 1257 1180 T(test_memcg_min), 1258 1181 T(test_memcg_low), 1259 1182 T(test_memcg_high), 1183 + T(test_memcg_high_sync), 1260 1184 T(test_memcg_max), 1261 1185 T(test_memcg_oom_events), 1262 1186 T(test_memcg_swap_max),
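
`test_memcg_high_sync()` passes only if the `high` counter in `memory.events` moved while the `max` counter stayed put. The sketch below restates that check on in-memory strings. `read_key_long` is a hypothetical stand-in for a keyed lookup and is not claimed to match the selftest helper `cg_read_key_long`; in this sketch the trailing space in the key keeps a lookup such as `"oom "` from matching `oom_kill`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical lookup in a "key value\n" text body.
 * Returns -1 when the key is absent.
 */
static long read_key_long(const char *buf, const char *key)
{
	const char *p = strstr(buf, key);

	if (!p)
		return -1;
	return atol(p + strlen(key));
}

/* Returns 1 when "high" changed and "max" did not, 0 otherwise. */
static int high_sync_passed(const char *before, const char *after)
{
	long pre_high = read_key_long(before, "high ");
	long pre_max = read_key_long(before, "max ");
	long post_high = read_key_long(after, "high ");
	long post_max = read_key_long(after, "max ");

	if (pre_high < 0 || pre_max < 0 || post_high < 0 || post_max < 0)
		return 0;

	return pre_high != post_high && pre_max == post_max;
}
```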

+1

tools/testing/selftests/damon/Makefile

···
 TEST_FILES = _chk_dependency.sh _debugfs_common.sh
 TEST_PROGS = debugfs_attrs.sh debugfs_schemes.sh debugfs_target_ids.sh
 TEST_PROGS += debugfs_empty_targets.sh debugfs_huge_count_read_write.sh
+TEST_PROGS += sysfs.sh
 
 include ../lib.mk

+306

tools/testing/selftests/damon/sysfs.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + # Kselftest framework requirement - SKIP code is 4. 5 + ksft_skip=4 6 + 7 + ensure_write_succ() 8 + { 9 + file=$1 10 + content=$2 11 + reason=$3 12 + 13 + if ! echo "$content" > "$file" 14 + then 15 + echo "writing $content to $file failed" 16 + echo "expected success because $reason" 17 + exit 1 18 + fi 19 + } 20 + 21 + ensure_write_fail() 22 + { 23 + file=$1 24 + content=$2 25 + reason=$3 26 + 27 + if echo "$content" > "$file" 28 + then 29 + echo "writing $content to $file succeeded" 30 + echo "expected failure because $reason" 31 + exit 1 32 + fi 33 + } 34 + 35 + ensure_dir() 36 + { 37 + dir=$1 38 + to_ensure=$2 39 + if [ "$to_ensure" = "exist" ] && [ ! -d "$dir" ] 40 + then 41 + echo "$dir dir is expected but not found" 42 + exit 1 43 + elif [ "$to_ensure" = "not_exist" ] && [ -d "$dir" ] 44 + then 45 + echo "$dir dir is not expected but found" 46 + exit 1 47 + fi 48 + } 49 + 50 + ensure_file() 51 + { 52 + file=$1 53 + to_ensure=$2 54 + permission=$3 55 + if [ "$to_ensure" = "exist" ] 56 + then 57 + if [ ! -f "$file" ] 58 + then 59 + echo "$file is expected but not found" 60 + exit 1 61 + fi 62 + perm=$(stat -c "%a" "$file") 63 + if [ !
"$perm" = "$permission" ] 64 + then 65 + echo "$file permission: expected $permission but $perm" 66 + exit 1 67 + fi 68 + elif [ "$to_ensure" = "not_exist" ] && [ -f "$file" ] 69 + then 70 + echo "$file is not expected but found" 71 + exit 1 72 + fi 73 + } 74 + 75 + test_range() 76 + { 77 + range_dir=$1 78 + ensure_dir "$range_dir" "exist" 79 + ensure_file "$range_dir/min" "exist" 600 80 + ensure_file "$range_dir/max" "exist" 600 81 + } 82 + 83 + test_stats() 84 + { 85 + stats_dir=$1 86 + ensure_dir "$stats_dir" "exist" 87 + for f in nr_tried sz_tried nr_applied sz_applied qt_exceeds 88 + do 89 + ensure_file "$stats_dir/$f" "exist" "400" 90 + done 91 + } 92 + 93 + test_watermarks() 94 + { 95 + watermarks_dir=$1 96 + ensure_dir "$watermarks_dir" "exist" 97 + ensure_file "$watermarks_dir/metric" "exist" "600" 98 + ensure_file "$watermarks_dir/interval_us" "exist" "600" 99 + ensure_file "$watermarks_dir/high" "exist" "600" 100 + ensure_file "$watermarks_dir/mid" "exist" "600" 101 + ensure_file "$watermarks_dir/low" "exist" "600" 102 + } 103 + 104 + test_weights() 105 + { 106 + weights_dir=$1 107 + ensure_dir "$weights_dir" "exist" 108 + ensure_file "$weights_dir/sz_permil" "exist" "600" 109 + ensure_file "$weights_dir/nr_accesses_permil" "exist" "600" 110 + ensure_file "$weights_dir/age_permil" "exist" "600" 111 + } 112 + 113 + test_quotas() 114 + { 115 + quotas_dir=$1 116 + ensure_dir "$quotas_dir" "exist" 117 + ensure_file "$quotas_dir/ms" "exist" 600 118 + ensure_file "$quotas_dir/bytes" "exist" 600 119 + ensure_file "$quotas_dir/reset_interval_ms" "exist" 600 120 + test_weights "$quotas_dir/weights" 121 + } 122 + 123 + test_access_pattern() 124 + { 125 + access_pattern_dir=$1 126 + ensure_dir "$access_pattern_dir" "exist" 127 + test_range "$access_pattern_dir/age" 128 + test_range "$access_pattern_dir/nr_accesses" 129 + test_range "$access_pattern_dir/sz" 130 + } 131 + 132 + test_scheme() 133 + { 134 + scheme_dir=$1 135 + ensure_dir "$scheme_dir" "exist" 136 +
ensure_file "$scheme_dir/action" "exist" "600" 137 + test_access_pattern "$scheme_dir/access_pattern" 138 + test_quotas "$scheme_dir/quotas" 139 + test_watermarks "$scheme_dir/watermarks" 140 + test_stats "$scheme_dir/stats" 141 + } 142 + 143 + test_schemes() 144 + { 145 + schemes_dir=$1 146 + ensure_dir "$schemes_dir" "exist" 147 + ensure_file "$schemes_dir/nr_schemes" "exist" 600 148 + 149 + ensure_write_succ "$schemes_dir/nr_schemes" "1" "valid input" 150 + test_scheme "$schemes_dir/0" 151 + 152 + ensure_write_succ "$schemes_dir/nr_schemes" "2" "valid input" 153 + test_scheme "$schemes_dir/0" 154 + test_scheme "$schemes_dir/1" 155 + 156 + ensure_write_succ "$schemes_dir/nr_schemes" "0" "valid input" 157 + ensure_dir "$schemes_dir/0" "not_exist" 158 + ensure_dir "$schemes_dir/1" "not_exist" 159 + } 160 + 161 + test_region() 162 + { 163 + region_dir=$1 164 + ensure_dir "$region_dir" "exist" 165 + ensure_file "$region_dir/start" "exist" 600 166 + ensure_file "$region_dir/end" "exist" 600 167 + } 168 + 169 + test_regions() 170 + { 171 + regions_dir=$1 172 + ensure_dir "$regions_dir" "exist" 173 + ensure_file "$regions_dir/nr_regions" "exist" 600 174 + 175 + ensure_write_succ "$regions_dir/nr_regions" "1" "valid input" 176 + test_region "$regions_dir/0" 177 + 178 + ensure_write_succ "$regions_dir/nr_regions" "2" "valid input" 179 + test_region "$regions_dir/0" 180 + test_region "$regions_dir/1" 181 + 182 + ensure_write_succ "$regions_dir/nr_regions" "0" "valid input" 183 + ensure_dir "$regions_dir/0" "not_exist" 184 + ensure_dir "$regions_dir/1" "not_exist" 185 + } 186 + 187 + test_target() 188 + { 189 + target_dir=$1 190 + ensure_dir "$target_dir" "exist" 191 + ensure_file "$target_dir/pid_target" "exist" "600" 192 + test_regions "$target_dir/regions" 193 + } 194 + 195 + test_targets() 196 + { 197 + targets_dir=$1 198 + ensure_dir "$targets_dir" "exist" 199 + ensure_file "$targets_dir/nr_targets" "exist" 600 200 + 201 + ensure_write_succ "$targets_dir/nr_targets" 
"1" "valid input" 202 + test_target "$targets_dir/0" 203 + 204 + ensure_write_succ "$targets_dir/nr_targets" "2" "valid input" 205 + test_target "$targets_dir/0" 206 + test_target "$targets_dir/1" 207 + 208 + ensure_write_succ "$targets_dir/nr_targets" "0" "valid input" 209 + ensure_dir "$targets_dir/0" "not_exist" 210 + ensure_dir "$targets_dir/1" "not_exist" 211 + } 212 + 213 + test_intervals() 214 + { 215 + intervals_dir=$1 216 + ensure_dir "$intervals_dir" "exist" 217 + ensure_file "$intervals_dir/aggr_us" "exist" "600" 218 + ensure_file "$intervals_dir/sample_us" "exist" "600" 219 + ensure_file "$intervals_dir/update_us" "exist" "600" 220 + } 221 + 222 + test_monitoring_attrs() 223 + { 224 + monitoring_attrs_dir=$1 225 + ensure_dir "$monitoring_attrs_dir" "exist" 226 + test_intervals "$monitoring_attrs_dir/intervals" 227 + test_range "$monitoring_attrs_dir/nr_regions" 228 + } 229 + 230 + test_context() 231 + { 232 + context_dir=$1 233 + ensure_dir "$context_dir" "exist" 234 + ensure_file "$context_dir/operations" "exist" 600 235 + test_monitoring_attrs "$context_dir/monitoring_attrs" 236 + test_targets "$context_dir/targets" 237 + test_schemes "$context_dir/schemes" 238 + } 239 + 240 + test_contexts() 241 + { 242 + contexts_dir=$1 243 + ensure_dir "$contexts_dir" "exist" 244 + ensure_file "$contexts_dir/nr_contexts" "exist" 600 245 + 246 + ensure_write_succ "$contexts_dir/nr_contexts" "1" "valid input" 247 + test_context "$contexts_dir/0" 248 + 249 + ensure_write_fail "$contexts_dir/nr_contexts" "2" "only 0/1 are supported" 250 + test_context "$contexts_dir/0" 251 + 252 + ensure_write_succ "$contexts_dir/nr_contexts" "0" "valid input" 253 + ensure_dir "$contexts_dir/0" "not_exist" 254 + } 255 + 256 + test_kdamond() 257 + { 258 + kdamond_dir=$1 259 + ensure_dir "$kdamond_dir" "exist" 260 + ensure_file "$kdamond_dir/state" "exist" "600" 261 + ensure_file "$kdamond_dir/pid" "exist" 400 262 + test_contexts "$kdamond_dir/contexts" 263 + } 264 + 265 + 
test_kdamonds() 266 + { 267 + kdamonds_dir=$1 268 + ensure_dir "$kdamonds_dir" "exist" 269 + 270 + ensure_file "$kdamonds_dir/nr_kdamonds" "exist" "600" 271 + 272 + ensure_write_succ "$kdamonds_dir/nr_kdamonds" "1" "valid input" 273 + test_kdamond "$kdamonds_dir/0" 274 + 275 + ensure_write_succ "$kdamonds_dir/nr_kdamonds" "2" "valid input" 276 + test_kdamond "$kdamonds_dir/0" 277 + test_kdamond "$kdamonds_dir/1" 278 + 279 + ensure_write_succ "$kdamonds_dir/nr_kdamonds" "0" "valid input" 280 + ensure_dir "$kdamonds_dir/0" "not_exist" 281 + ensure_dir "$kdamonds_dir/1" "not_exist" 282 + } 283 + 284 + test_damon_sysfs() 285 + { 286 + damon_sysfs=$1 287 + if [ ! -d "$damon_sysfs" ] 288 + then 289 + echo "$damon_sysfs not found" 290 + exit $ksft_skip 291 + fi 292 + 293 + test_kdamonds "$damon_sysfs/kdamonds" 294 + } 295 + 296 + check_dependencies() 297 + { 298 + if [ $EUID -ne 0 ] 299 + then 300 + echo "Run as root" 301 + exit $ksft_skip 302 + fi 303 + } 304 + 305 + check_dependencies 306 + test_damon_sysfs "/sys/kernel/mm/damon/admin"

+1

tools/testing/selftests/vm/.gitignore

···
 hugepage-mmap
 hugepage-mremap
 hugepage-shm
+hugepage-vmemmap
 khugepaged
 map_hugetlb
 map_populate

+4 -3

tools/testing/selftests/vm/Makefile

···
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
+TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
 TEST_GEN_FILES += map_fixed_noreplace
···
 TEST_GEN_FILES += ksm_tests
 
 ifeq ($(MACHINE),x86_64)
-CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32)
-CAN_BUILD_X86_64 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_64bit_program.c)
-CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_program.c -no-pie)
+CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32)
+CAN_BUILD_X86_64 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_64bit_program.c)
+CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie)
 
 TARGETS := protection_keys
 BINARIES_32 := $(TARGETS:%=%_32)

+144

tools/testing/selftests/vm/hugepage-vmemmap.c

@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A test case of using hugepage memory in a user application using the
+ * mmap system call with MAP_HUGETLB flag. Before running this program
+ * make sure the administrator has allocated enough default sized huge
+ * pages to cover the 2 MB allocation.
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+
+#define MAP_LENGTH (2UL * 1024 * 1024)
+
+#ifndef MAP_HUGETLB
+#define MAP_HUGETLB 0x40000 /* arch specific */
+#endif
+
+#define PAGE_SIZE 4096
+
+#define PAGE_COMPOUND_HEAD (1UL << 15)
+#define PAGE_COMPOUND_TAIL (1UL << 16)
+#define PAGE_HUGE (1UL << 17)
+
+#define HEAD_PAGE_FLAGS (PAGE_COMPOUND_HEAD | PAGE_HUGE)
+#define TAIL_PAGE_FLAGS (PAGE_COMPOUND_TAIL | PAGE_HUGE)
+
+#define PM_PFRAME_BITS 55
+#define PM_PFRAME_MASK ~((1UL << PM_PFRAME_BITS) - 1)
+
+/*
+ * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
+ * That means the addresses starting with 0x800000... will need to be
+ * specified. Specifying a fixed address is not required on ppc64, i386
+ * or x86_64.
+ */
+#ifdef __ia64__
+#define MAP_ADDR (void *)(0x8000000000000000UL)
+#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
+#else
+#define MAP_ADDR NULL
+#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
+#endif
+
+static void write_bytes(char *addr, size_t length)
+{
+	unsigned long i;
+
+	for (i = 0; i < length; i++)
+		*(addr + i) = (char)i;
+}
+
+static unsigned long virt_to_pfn(void *addr)
+{
+	int fd;
+	unsigned long pagemap;
+
+	fd = open("/proc/self/pagemap", O_RDONLY);
+	if (fd < 0)
+		return -1UL;
+
+	lseek(fd, (unsigned long)addr / PAGE_SIZE * sizeof(pagemap), SEEK_SET);
+	read(fd, &pagemap, sizeof(pagemap));
+	close(fd);
+
+	return pagemap & ~PM_PFRAME_MASK;
+}
+
+static int check_page_flags(unsigned long pfn)
+{
+	int fd, i;
+	unsigned long pageflags;
+
+	fd = open("/proc/kpageflags", O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	lseek(fd, pfn * sizeof(pageflags), SEEK_SET);
+
+	read(fd, &pageflags, sizeof(pageflags));
+	if ((pageflags & HEAD_PAGE_FLAGS) != HEAD_PAGE_FLAGS) {
+		close(fd);
+		printf("Head page flags (%lx) is invalid\n", pageflags);
+		return -1;
+	}
+
+	/*
+	 * pages other than the first page must be tail and shouldn't be head;
+	 * this also verifies kernel has correctly set the fake page_head to tail
+	 * while hugetlb_free_vmemmap is enabled.
+	 */
+	for (i = 1; i < MAP_LENGTH / PAGE_SIZE; i++) {
+		read(fd, &pageflags, sizeof(pageflags));
+		if ((pageflags & TAIL_PAGE_FLAGS) != TAIL_PAGE_FLAGS ||
+		    (pageflags & HEAD_PAGE_FLAGS) == HEAD_PAGE_FLAGS) {
+			close(fd);
+			printf("Tail page flags (%lx) is invalid\n", pageflags);
+			return -1;
+		}
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	void *addr;
+	unsigned long pfn;
+
+	addr = mmap(MAP_ADDR, MAP_LENGTH, PROT_READ | PROT_WRITE, MAP_FLAGS, -1, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* Trigger allocation of HugeTLB page. */
+	write_bytes(addr, MAP_LENGTH);
+
+	pfn = virt_to_pfn(addr);
+	if (pfn == -1UL) {
+		munmap(addr, MAP_LENGTH);
+		perror("virt_to_pfn");
+		exit(1);
+	}
+
+	printf("Returned address is %p whose pfn is %lx\n", addr, pfn);
+
+	if (check_page_flags(pfn) < 0) {
+		munmap(addr, MAP_LENGTH);
+		perror("check_page_flags");
+		exit(1);
+	}
+
+	/* munmap() length of MAP_HUGETLB memory must be hugepage aligned */
+	if (munmap(addr, MAP_LENGTH)) {
+		perror("munmap");
+		exit(1);
+	}
+
+	return 0;
+}
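The bit tests that the new selftest performs on /proc/self/pagemap and /proc/kpageflags entries can be isolated into small pure functions, which makes the logic easier to read and to check without root access. The following is a minimal sketch, not part of the patch: the helper names are hypothetical, the flag bit positions (15, 16, 17) and the 55-bit PFN width are taken from the test above, and the "present" bit (bit 63) is an assumption based on the kernel's pagemap documentation rather than anything in this diff. Note also that the test defines PM_PFRAME_MASK inverted and inverts it again at the point of use, whereas this sketch uses a direct mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pagemap entry layout: bits 0-54 carry the PFN (width taken from the test). */
#define PM_PFN_BITS 55
#define PM_PFN_MASK ((UINT64_C(1) << PM_PFN_BITS) - 1)
/* Assumption from the pagemap documentation: bit 63 means "page present". */
#define PM_PRESENT (UINT64_C(1) << 63)

/* kpageflags bits, same positions as PAGE_COMPOUND_HEAD/TAIL and PAGE_HUGE. */
#define KPF_COMPOUND_HEAD (UINT64_C(1) << 15)
#define KPF_COMPOUND_TAIL (UINT64_C(1) << 16)
#define KPF_HUGE (UINT64_C(1) << 17)

/*
 * Hypothetical helper: extract the PFN from one 64-bit pagemap entry.
 * Returns false (and leaves *pfn untouched) if the page is not present.
 */
static bool pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT))
		return false;
	*pfn = entry & PM_PFN_MASK;
	return true;
}

/* Hypothetical helper: head page of a HugeTLB page (head and huge both set). */
static bool is_hugetlb_head(uint64_t flags)
{
	const uint64_t want = KPF_COMPOUND_HEAD | KPF_HUGE;

	return (flags & want) == want;
}

/*
 * Hypothetical helper: tail page of a HugeTLB page. Tail and huge must be
 * set and head must be clear, mirroring the loop condition in the test.
 */
static bool is_hugetlb_tail(uint64_t flags)
{
	const uint64_t want = KPF_COMPOUND_TAIL | KPF_HUGE;

	return (flags & want) == want && !(flags & KPF_COMPOUND_HEAD);
}
```

A caller would read one 64-bit entry per page from the respective proc file and pass it to these helpers. On kernels that hide PFNs from unprivileged readers, the PFN field reads as zero, so a check like the one in the test is only meaningful when run with sufficient privileges.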

+11

tools/testing/selftests/vm/run_vmtests.sh

@@ -120,6 +120,17 @@
 fi
 rm -f $mnt/huge_mremap
 
+echo "------------------------"
+echo "running hugepage-vmemmap"
+echo "------------------------"
+./hugepage-vmemmap
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "NOTE: The above hugetlb tests provide minimal coverage. Use"
 echo " https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo " hugetlb regression testing."

+1 -1

tools/testing/selftests/vm/userfaultfd.c

@@ -540,7 +540,7 @@
 static void *locking_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
-	unsigned long page_nr = *(&(page_nr)); /* uninitialized warning */
+	unsigned long page_nr;
 	unsigned long long count;
 
 	if (!(bounces & BOUNCE_RANDOM)) {

+3 -3

tools/testing/selftests/x86/Makefile

@@ -6,9 +6,9 @@
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
 UNAME_M := $(shell uname -m)
-CAN_BUILD_I386 := $(shell ./check_cc.sh $(CC) trivial_32bit_program.c -m32)
-CAN_BUILD_X86_64 := $(shell ./check_cc.sh $(CC) trivial_64bit_program.c)
-CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
+CAN_BUILD_I386 := $(shell ./check_cc.sh "$(CC)" trivial_32bit_program.c -m32)
+CAN_BUILD_X86_64 := $(shell ./check_cc.sh "$(CC)" trivial_64bit_program.c)
+CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh "$(CC)" trivial_program.c -no-pie)
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
