Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

- Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing

When tracefs was first introduced back in 2014, the directory
/sys/kernel/tracing was added and is the designated location to mount
tracefs. To keep backward compatibility, tracefs was auto-mounted in
/sys/kernel/debug/tracing as well.

All distros now mount tracefs on /sys/kernel/tracing. Having it seen
in two different locations has led to various issues and
inconsistencies.

The VFS folks also have to maintain debugfs_create_automount() for
this single user.

It's been over 10 years. Tooling and scripts should start replacing
the debugfs location with the tracefs one. The reason tracefs was
created in the first place was to allow access to the tracing
facilities without the need to configure debugfs into the kernel.
Using tracefs should now be more robust.

A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
default y, so that the kernel is still built with the automount. This
config allows those that want to remove the automount from debugfs to
do so.

When tracefs is accessed from /sys/kernel/debug/tracing, the
following printk is triggered:

pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");

This gives users another 5 years to fix their scripts.

- Use queue_rcu_work() instead of call_rcu() for freeing event filters

The number of filters to be freed can be large, depending on the
number of events within an event system. Freeing them from softirq
context can potentially cause undesired latency. Use the RCU workqueue
to free them instead.

- Remove pointless memory barriers in latency code

Memory barriers were added to some of the latency code a long time
ago with the idea of "making them visible", but that's not what
memory barriers are for. They are to synchronize access between
different variables. There was no synchronization here, making them
pointless.

- Remove "__attribute__()" from the type field of event format

When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with
the following:

field:const char * filename; offset:24; size:8; signed:0;

Turns into:

field:const char __attribute__((btf_type_tag("user"))) * filename; offset:24; size:8; signed:0;

This confuses parsers. Add code to strip these tags from the strings.

- Add eprobe config option CONFIG_EPROBE_EVENTS

Eprobes were added back in 5.15 but were only enabled when another
probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
config option of their own. Add one as they should be a separate
entity.

It defaults to y to match the behavior of older kernels, but it still
depends on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.

- Add eprobe documentation

When eprobes were added back in 5.15 no documentation was added to
describe them. This needs to be rectified.

- Replace open coded cpumask_next_wrap() in move_to_next_cpu()

- Have preemptirq_delay_run() use off-stack CPU mask

- Remove obsolete comment about pelt_cfs event

DECLARE_TRACE() appends "_tp" to trace events now, but the comment
above pelt_cfs still mentioned appending it manually.

- Remove EVENT_FILE_FL_SOFT_MODE flag

The SOFT_MODE flag was required when the soft enabling and disabling
of trace events was first introduced. But there was a bug with this
approach as it only worked for a single instance. When multiple users
required soft enabling and disabling, the code was changed to have a
ref count. The SOFT_MODE flag is now set iff the ref count is
non-zero. This is redundant, and just reading the ref count is good
enough.

- Fix typo in comment
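
The "__attribute__()" stripping described in the list above can be modelled in userspace. The following is a minimal sketch, not the kernel implementation: strip_attributes() is a hypothetical name, it always returns a fresh malloc() copy, and it uses memmove() for the overlapping copy.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define ATTRIBUTE_STR "__attribute__("
#define ATTRIBUTE_STR_LEN (sizeof(ATTRIBUTE_STR) - 1)

/*
 * Hypothetical userspace helper: return a newly allocated copy of @type
 * with every "__attribute__(...)" removed. Returns NULL on allocation
 * failure or unbalanced parentheses. The caller frees the result.
 */
static char *strip_attributes(const char *type)
{
	char *out = malloc(strlen(type) + 1);
	char *scan, *attr;

	if (!out)
		return NULL;
	strcpy(out, type);

	scan = out;
	while ((attr = strstr(scan, ATTRIBUTE_STR))) {
		char *end = attr + ATTRIBUTE_STR_LEN;
		int depth = 1;	/* ATTRIBUTE_STR already holds the first '(' */

		/* Skip matches that are the tail of a longer word. */
		if (attr != out && !isspace((unsigned char)attr[-1])) {
			scan = end;
			continue;
		}

		/* Walk forward until the opening parenthesis is balanced. */
		while (depth > 0) {
			end = strpbrk(end, "()");
			if (!end) {
				free(out);
				return NULL;
			}
			depth += (*end == '(') ? 1 : -1;
			end++;
		}

		/* Drop the attribute and the whitespace that follows it. */
		while (isspace((unsigned char)*end))
			end++;
		memmove(attr, end, strlen(end) + 1);
		scan = attr;
	}
	return out;
}
```

Called on the expanded type string shown in the list, this yields the original "const char * filename" form that parsers expect.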

* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
Documentation: tracing: Add documentation about eprobes
tracing: Have eprobes have their own config option
tracing: Remove "__attribute__()" from the type field of event format
tracing: Deprecate auto-mounting tracefs in debugfs
tracing: Fix comment in trace_module_remove_events()
tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
tracing: Remove pointless memory barriers
tracing/sched: Remove obsolete comment on suffixes
kernel: trace: preemptirq_delay_test: use offstack cpu mask
tracing: Use queue_rcu_work() to free filters
tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
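
The last shortlog entry replaces an open-coded wrap-around search with cpumask_next_wrap(). As a rough userspace model of that kind of search (an assumption-laden sketch over a 64-bit mask, not the kernel helper; next_set_bit_wrap() is a hypothetical name):

```c
#include <stdint.h>

/*
 * Hypothetical model: return the index of the next set bit in @mask
 * strictly after @cur, wrapping to bit 0 after bit @nbits - 1.
 * If @cur is the only set bit it is returned again. Returns -1 when
 * no bit is set. @nbits must be in 1..64 and @cur in 0..@nbits - 1.
 */
static int next_set_bit_wrap(uint64_t mask, int nbits, int cur)
{
	int step;

	for (step = 1; step <= nbits; step++) {
		int bit = (cur + step) % nbits;

		if (mask & (UINT64_C(1) << bit))
			return bit;
	}
	return -1;
}
```

With CPUs 0 and 3 allowed out of 4, a caller on CPU 0 would move to CPU 3, and a caller on CPU 3 would wrap back to CPU 0.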

+499 -86
+20
Documentation/ABI/obsolete/automount-tracefs-debugfs
···
+What: /sys/kernel/debug/tracing
+Date: May 2008
+KernelVersion: 2.6.27
+Contact: linux-trace-kernel@vger.kernel.org
+Description:
+
+	The ftrace was first added to the kernel, its interface was placed
+	into the debugfs file system under the "tracing" directory. Access
+	to the files were in /sys/kernel/debug/tracing. As systems wanted
+	access to the tracing interface without having to enable debugfs, a
+	new interface was created called "tracefs". This was a stand alone
+	file system and was usually mounted in /sys/kernel/tracing.
+
+	To allow older tooling to continue to operate, when mounting
+	debugfs, the tracefs file system would automatically get mounted in
+	the "tracing" directory of debugfs. The tracefs interface was added
+	in January 2015 in the v4.1 kernel.
+
+	All tooling should now be using tracefs directly and the "tracing"
+	directory in debugfs should be removed by January 2030.
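
For tooling that has to run on both old and new systems, one possible approach is to prefer the tracefs path and only fall back to the deprecated debugfs path. The sketch below is an illustration under assumptions: pick_tracing_dir() and tracing_dirs are hypothetical names, and the probe file (for example "trace") is used because the mount point directory can exist without tracefs being mounted on it.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the first directory in @dirs that
 * contains the file @probe, or NULL if none does.
 */
static const char *pick_tracing_dir(const char *const *dirs, size_t n,
				    const char *probe)
{
	char path[4096];
	size_t i;

	for (i = 0; i < n; i++) {
		int len = snprintf(path, sizeof(path), "%s/%s", dirs[i], probe);

		if (len < 0 || (size_t)len >= sizeof(path))
			continue;	/* path too long, skip this candidate */
		if (access(path, F_OK) == 0)
			return dirs[i];
	}
	return NULL;
}

/* Preferred location first; the deprecated debugfs path is only a fallback. */
static const char *const tracing_dirs[] = {
	"/sys/kernel/tracing",
	"/sys/kernel/debug/tracing",
};
```

A caller would use pick_tracing_dir(tracing_dirs, 2, "trace"). The files are usually readable by root only, so an unprivileged caller may get NULL. Because the debugfs path is probed last, the deprecation warning is only triggered on systems where the tracefs path is unusable.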
+269
Documentation/trace/eprobetrace.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Eprobe - Event-based Probe Tracing
+==================================
+
+:Author: Steven Rostedt <rostedt@goodmis.org>
+
+- Written for v6.17
+
+Overview
+========
+
+Eprobes are dynamic events that are placed on existing events to either
+dereference a field that is a pointer, or simply to limit what fields are
+recorded in the trace event.
+
+Eprobes depend on kprobe events so to enable this feature, build your kernel
+with CONFIG_EPROBE_EVENTS=y.
+
+Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
+
+Synopsis of eprobe_events
+-------------------------
+::
+
+  e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]	: Set a probe
+  -:[EGRP/][EEVENT]				: Clear a probe
+
+ EGRP		: Group name of the new event. If omitted, use "eprobes" for it.
+ EEVENT		: Event name. If omitted, the event name is generated and will
+		  be the same event name as the event it attached to.
+ GRP		: Group name of the event to attach to.
+ EVENT		: Event name of the event to attach to.
+
+ FETCHARGS	: Arguments. Each probe can have up to 128 args.
+  $FIELD	: Fetch the value of the event field called FIELD.
+  @ADDR		: Fetch memory at ADDR (ADDR should be in kernel)
+  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
+  $comm		: Fetch current task comm.
+  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  \IMM		: Store an immediate value to the argument.
+  NAME=FETCHARG	: Set NAME as the argument name of FETCHARG.
+  FETCHARG:TYPE	: Set TYPE as the type of FETCHARG. Currently, basic types
+		  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+		  "string", "ustring", "symbol", "symstr" and "bitfield" are
+		  supported.
+
+Types
+-----
+The FETCHARGS above is very similar to the kprobe events as described in
+Documentation/trace/kprobetrace.rst.
+
+The difference between eprobes and kprobes FETCHARGS is that eprobes has a
+$FIELD command that returns the content of the event field of the event
+that is attached. Eprobes do not have access to registers, stacks and function
+arguments that kprobes has.
+
+If a field argument is a pointer, it may be dereferenced just like a memory
+address using the FETCHARGS syntax.
+
+
+Attaching to dynamic events
+---------------------------
+
+Eprobes may attach to dynamic events as well as to normal events. It may
+attach to a kprobe event, a synthetic event or a fprobe event. This is useful
+if the type of a field needs to be changed. See Example 2 below.
+
+Usage examples
+==============
+
+Example 1
+---------
+
+The basic usage of eprobes is to limit the data that is being recorded into
+the tracing buffer. For example, a common event to trace is the sched_switch
+trace event. That has a format of::
+
+	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
+	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
+	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
+	field:int common_pid;	offset:4;	size:4;	signed:1;
+
+	field:char prev_comm[16];	offset:8;	size:16;	signed:0;
+	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
+	field:int prev_prio;	offset:28;	size:4;	signed:1;
+	field:long prev_state;	offset:32;	size:8;	signed:1;
+	field:char next_comm[16];	offset:40;	size:16;	signed:0;
+	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
+	field:int next_prio;	offset:60;	size:4;	signed:1;
+
+The first four fields are common to all events and can not be limited. But the
+rest of the event has 60 bytes of information. It records the names of the
+previous and next tasks being scheduled out and in, as well as their pids and
+priorities. It also records the state of the previous task. If only the pids
+of the tasks are of interest, why waste the ring buffer with all the other
+fields?
+
+An eprobe can limit what gets recorded. Note, it does not help in performance,
+as all the fields are recorded in a temporary buffer to process the eprobe.
+::
+
+  # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
+  # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
+  # cat /sys/kernel/tracing/trace
+
+  # tracer: nop
+  #
+  # entries-in-buffer/entries-written: 2721/2721   #P:8
+  #
+  #                                _-----=> irqs-off/BH-disabled
+  #                               / _----=> need-resched
+  #                              | / _---=> hardirq/softirq
+  #                              || / _--=> preempt-depth
+  #                              ||| / _-=> migrate-disable
+  #                              |||| /     delay
+  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+  #              | |         |   |||||     |         |
+      sshd-session-1082    [004] d..4.  5041.239906: switch: (sched.sched_switch) prev=1082 next=0
+              bash-1085    [001] d..4.  5041.240198: switch: (sched.sched_switch) prev=1085 next=141
+     kworker/u34:5-141     [001] d..4.  5041.240259: switch: (sched.sched_switch) prev=141 next=1085
+            <idle>-0       [004] d..4.  5041.240354: switch: (sched.sched_switch) prev=0 next=1082
+              bash-1085    [001] d..4.  5041.240385: switch: (sched.sched_switch) prev=1085 next=141
+     kworker/u34:5-141     [001] d..4.  5041.240410: switch: (sched.sched_switch) prev=141 next=1085
+              bash-1085    [001] d..4.  5041.240478: switch: (sched.sched_switch) prev=1085 next=0
+      sshd-session-1082    [004] d..4.  5041.240526: switch: (sched.sched_switch) prev=1082 next=0
+            <idle>-0       [001] d..4.  5041.247524: switch: (sched.sched_switch) prev=0 next=90
+            <idle>-0       [002] d..4.  5041.247545: switch: (sched.sched_switch) prev=0 next=16
+       kworker/1:1-90      [001] d..4.  5041.247580: switch: (sched.sched_switch) prev=90 next=0
+         rcu_sched-16      [002] d..4.  5041.247591: switch: (sched.sched_switch) prev=16 next=0
+            <idle>-0       [002] d..4.  5041.257536: switch: (sched.sched_switch) prev=0 next=16
+         rcu_sched-16      [002] d..4.  5041.257573: switch: (sched.sched_switch) prev=16 next=0
+
+Note, without adding the "u32" after the prev_pid and next_pid, the values
+would default showing in hexadecimal.
+
+Example 2
+---------
+
+If a specific system call is to be recorded but the syscalls events are not
+enabled, the raw_syscalls can still be used (syscalls are system call
+events are not normal events, but are created from the raw_syscalls events
+within the kernel). In order to trace the openat system call, one can create
+an event probe on top of the raw_syscalls event:
+::
+
+  # cd /sys/kernel/tracing
+  # cat events/raw_syscalls/sys_enter/format
+  name: sys_enter
+  ID: 395
+  format:
+	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
+	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
+	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
+	field:int common_pid;	offset:4;	size:4;	signed:1;
+
+	field:long id;	offset:8;	size:8;	signed:1;
+	field:unsigned long args[6];	offset:16;	size:48;	signed:0;
+
+  print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
+
+From the source code, the sys_openat() has:
+::
+
+  int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
+  {
+	return my_syscall4(__NR_openat, dirfd, path, flags, mode);
+  }
+
+The path is the second parameter, and that is what is wanted.
+::
+
+  # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
+
+This is being run on x86_64 where the word size is 8 bytes and the openat
+system call __NR_openat is set at 257.
+::
+
+  # echo 'nr == 257' > events/eprobes/openat/filter
+
+Now enable the event and look at the trace.
+::
+
+  # echo 1 > events/eprobes/openat/enable
+  # cat trace
+
+  # tracer: nop
+  #
+  # entries-in-buffer/entries-written: 4/4   #P:8
+  #
+  #                                _-----=> irqs-off/BH-disabled
+  #                               / _----=> need-resched
+  #                              | / _---=> hardirq/softirq
+  #                              || / _--=> preempt-depth
+  #                              ||| / _-=> migrate-disable
+  #                              |||| /     delay
+  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+  #              | |         |   |||||     |         |
+               cat-1298    [003] ...2.  2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+               cat-1298    [003] ...2.  2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+               cat-1298    [003] ...2.  2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+               cat-1298    [003] ...2.  2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+
+The filename shows "(fault)". This is likely because the filename has not been
+pulled into memory yet and currently trace events cannot fault in memory that
+is not present. When an eprobe tries to read memory that has not been faulted
+in yet, it will show the "(fault)" text.
+
+To get around this, as the kernel will likely pull in this filename and make
+it present, attaching it to a synthetic event that can pass the address of the
+filename from the entry of the event to the end of the event, this can be used
+to show the filename when the system call returns.
+
+Remove the old eprobe::
+
+  # echo 1 > events/eprobes/openat/enable
+  # echo '-:openat' >> dynamic_events
+
+This time make an eprobe where the address of the filename is saved::
+
+  # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
+
+Create a synthetic event that passes the address of the filename to the
+end of the event::
+
+  # echo 's:filename u64 file' >> dynamic_events
+  # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
+  # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
+
+Now that the address of the filename has been passed to the end of the
+system call, create another eprobe to attach to the exit event to show the
+string::
+
+  # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+  # echo 1 > events/eprobes/openat/enable
+  # cat trace
+
+  # tracer: nop
+  #
+  # entries-in-buffer/entries-written: 4/4   #P:8
+  #
+  #                                _-----=> irqs-off/BH-disabled
+  #                               / _----=> need-resched
+  #                              | / _---=> hardirq/softirq
+  #                              || / _--=> preempt-depth
+  #                              ||| / _-=> migrate-disable
+  #                              |||| /     delay
+  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+  #              | |         |   |||||     |         |
+               cat-1331    [001] ...5.  2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
+               cat-1331    [001] ...5.  2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
+               cat-1331    [001] ...5.  2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
+               cat-1331    [001] ...5.  2944.831362: openat: (synthetic.filename) filename="trace"
+
+Example 3
+---------
+
+If syscall trace events are available, the above would not need the first
+eprobe, but it would still need the last one::
+
+  # echo 's:filename u64 file' >> dynamic_events
+  # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
+  # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
+  # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+  # echo 1 > events/eprobes/openat/enable
+
+And this would produce the same result as Example 2.
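
The "+8" in the Example 2 definitions is the byte offset of the second element of "args" on a machine with 8-byte words. A small sketch of that arithmetic follows, assuming a hypothetical helper named format_syscall_arg() that only builds the NAME=+OFFS($args):TYPE text and does not talk to tracefs:

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "NAME=+OFFS($args):TYPE" into @buf, where
 * OFFS is @index * @word_size, the byte offset of element @index in the
 * "args" array of raw_syscalls/sys_enter. Returns the snprintf() result.
 */
static int format_syscall_arg(char *buf, size_t size, const char *name,
			      unsigned int index, size_t word_size,
			      const char *type)
{
	return snprintf(buf, size, "%s=+%zu($args):%s",
			name, index * word_size, type);
}
```

For the path argument of openat (index 1, word size 8) this produces "filename=+8($args):ustring", matching the string used in the documentation; on a 32-bit system the word size, and therefore the offset, would differ.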
+1
Documentation/trace/index.rst
···
    kprobes
    kprobetrace
    fprobetrace
+   eprobetrace
    fprobe
    ring-buffer-design
-3
include/linux/trace_events.h
···
 	EVENT_FILE_FL_RECORDED_TGID_BIT,
 	EVENT_FILE_FL_FILTERED_BIT,
 	EVENT_FILE_FL_NO_SET_FILTER_BIT,
-	EVENT_FILE_FL_SOFT_MODE_BIT,
 	EVENT_FILE_FL_SOFT_DISABLED_BIT,
 	EVENT_FILE_FL_TRIGGER_MODE_BIT,
 	EVENT_FILE_FL_TRIGGER_COND_BIT,
···
  * RECORDED_TGID - The tgids should be recorded at sched_switch
  * FILTERED - The event has a filter attached
  * NO_SET_FILTER - Set when filter has error and is to be ignored
- * SOFT_MODE - The event is enabled/disabled by SOFT_DISABLED
  * SOFT_DISABLED - When set, do not trace the event (even though its
  * tracepoint may be enabled)
  * TRIGGER_MODE - When set, invoke the triggers associated with the event
···
 	EVENT_FILE_FL_RECORDED_TGID = (1 << EVENT_FILE_FL_RECORDED_TGID_BIT),
 	EVENT_FILE_FL_FILTERED = (1 << EVENT_FILE_FL_FILTERED_BIT),
 	EVENT_FILE_FL_NO_SET_FILTER = (1 << EVENT_FILE_FL_NO_SET_FILTER_BIT),
-	EVENT_FILE_FL_SOFT_MODE = (1 << EVENT_FILE_FL_SOFT_MODE_BIT),
 	EVENT_FILE_FL_SOFT_DISABLED = (1 << EVENT_FILE_FL_SOFT_DISABLED_BIT),
 	EVENT_FILE_FL_TRIGGER_MODE = (1 << EVENT_FILE_FL_TRIGGER_MODE_BIT),
 	EVENT_FILE_FL_TRIGGER_COND = (1 << EVENT_FILE_FL_TRIGGER_COND_BIT),
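
The reason the SOFT_MODE bit can go away is that it carried no information beyond "the reference count is non-zero". A minimal userspace sketch of that idea, using C11 atomics in place of the kernel's atomic_t (struct event_file and the three helpers are hypothetical stand-ins, not kernel API):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct trace_event_file. */
struct event_file {
	atomic_int sm_ref;	/* number of users requesting soft mode */
};

/* Soft mode is simply "somebody holds a reference"; no separate flag. */
static bool in_soft_mode(struct event_file *file)
{
	return atomic_load(&file->sm_ref) != 0;
}

/* Returns true when this call was the one that entered soft mode. */
static bool soft_mode_get(struct event_file *file)
{
	return atomic_fetch_add(&file->sm_ref, 1) == 0;
}

/* Returns true when this call was the one that left soft mode. */
static bool soft_mode_put(struct event_file *file)
{
	return atomic_fetch_sub(&file->sm_ref, 1) == 1;
}
```

With two users, the first soft_mode_get() enters soft mode and only the second soft_mode_put() leaves it, which is the multi-user behavior a single flag could not express on its own.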
-2
include/trace/events/sched.h
···
 /*
  * Following tracepoints are not exported in tracefs and provide hooking
  * mechanisms only for testing and debugging purposes.
- *
- * Postfixed with _tp to make them easily identifiable in the code.
  */
 DECLARE_TRACE(pelt_cfs,
 	TP_PROTO(struct cfs_rq *cfs_rq),
+27
kernel/trace/Kconfig
···

 if FTRACE

+config TRACEFS_AUTOMOUNT_DEPRECATED
+	bool "Automount tracefs on debugfs [DEPRECATED]"
+	depends on TRACING
+	default y
+	help
+	  The tracing interface was moved from /sys/kernel/debug/tracing
+	  to /sys/kernel/tracing in 2015, but the tracing file system
+	  was still automounted in /sys/kernel/debug for backward
+	  compatibility with tooling.
+
+	  The new interface has been around for more than 10 years and
+	  the old debug mount will soon be removed.
+
 config BOOTTIME_TRACING
 	bool "Boot-time Tracing support"
 	depends on TRACING
···
 	  can probe, and record various registers.
 	  This option is required if you plan to use perf-probe subcommand
 	  of perf tools on user space applications.
+
+config EPROBE_EVENTS
+	bool "Enable event-based dynamic events"
+	depends on TRACING
+	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	select PROBE_EVENTS
+	select DYNAMIC_EVENTS
+	default y
+	help
+	  Eprobes are dynamic events that can be placed on other existing
+	  events. It can be used to limit what fields are recorded in
+	  an event or even dereference a field of an event. It can
+	  convert the type of an event field. For example, turn an
+	  address into a string.

 config BPF_EVENTS
 	depends on BPF_SYSCALL
+1 -1
kernel/trace/Makefile
···
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
-obj-$(CONFIG_PROBE_EVENTS) += trace_eprobe.o
+obj-$(CONFIG_EPROBE_EVENTS) += trace_eprobe.o
 obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o
 obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o
 obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o
+9 -4
kernel/trace/preemptirq_delay_test.c
···
 {
 	int i;
 	int s = MIN(burst_size, NR_TEST_FUNCS);
-	struct cpumask cpu_mask;
+	cpumask_var_t cpu_mask;
+
+	if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+		return -ENOMEM;

 	if (cpu_affinity > -1) {
-		cpumask_clear(&cpu_mask);
-		cpumask_set_cpu(cpu_affinity, &cpu_mask);
-		if (set_cpus_allowed_ptr(current, &cpu_mask))
+		cpumask_clear(cpu_mask);
+		cpumask_set_cpu(cpu_affinity, cpu_mask);
+		if (set_cpus_allowed_ptr(current, cpu_mask))
 			pr_err("cpu_affinity:%d, failed\n", cpu_affinity);
 	}

···
 	}

 	__set_current_state(TASK_RUNNING);
+
+	free_cpumask_var(cpu_mask);

 	return 0;
 }
-6
kernel/trace/rv/rv.c
···
  */
 bool rv_monitoring_on(void)
 {
-	/* Ensures that concurrent monitors read consistent monitoring_on */
-	smp_rmb();
 	return READ_ONCE(monitoring_on);
 }

···
 static void turn_monitoring_off(void)
 {
 	WRITE_ONCE(monitoring_on, false);
-	/* Ensures that concurrent monitors read consistent monitoring_on */
-	smp_wmb();
 }

 static void reset_all_monitors(void)
···
 static void turn_monitoring_on(void)
 {
 	WRITE_ONCE(monitoring_on, true);
-	/* Ensures that concurrent monitors read consistent monitoring_on */
-	smp_wmb();
 }

 static void turn_monitoring_on_with_reset(void)
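
The barriers removed here guarded a single flag, so there was nothing for them to order. As a loose userspace analogy using C11 atomics (an assumption: release/acquire stands in for a paired smp_wmb()/smp_rmb(), and relaxed accesses stand in for WRITE_ONCE()/READ_ONCE(); all names below are hypothetical), ordering only matters when a second variable depends on the flag:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;		/* plain data, written before the flag */
static atomic_bool ready;	/* flag that publishes the payload */

/* Writer: the release store orders the payload write before the flag write. */
static void publish(int value)
{
	payload = value;
	atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: the acquire load orders the flag read before the payload read. */
static bool try_consume(int *value)
{
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return false;
	*value = payload;
	return true;
}

/* A lone flag has no second variable to order against: relaxed is enough. */
static atomic_bool monitoring_on;

static void set_monitoring(bool on)
{
	atomic_store_explicit(&monitoring_on, on, memory_order_relaxed);
}

static bool is_monitoring(void)
{
	return atomic_load_explicit(&monitoring_on, memory_order_relaxed);
}
```

The first pair shows the case barriers exist for, two variables whose accesses must be seen in order. The second pair mirrors the shape of the code left after this change, where one flag is written and read on its own.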
+28 -21
kernel/trace/trace.c
···
 	 * return the mirror variable of the state of the ring buffer.
 	 * It's a little racy, but we don't really care.
 	 */
-	smp_rmb();
 	return !global_trace.buffer_disabled;
 }

···
 	 * important to be fast than accurate.
 	 */
 	tr->buffer_disabled = 0;
-	/* Make the flag seen by readers */
-	smp_wmb();
 }

 /**
···
 	 * important to be fast than accurate.
 	 */
 	tr->buffer_disabled = 1;
-	/* Make the flag seen by readers */
-	smp_wmb();
 }

 /**
···

 static void enable_trace_buffered_event(void *data)
 {
-	/* Probably not needed, but do it anyway */
-	smp_rmb();
 	this_cpu_dec(trace_buffered_event_cnt);
 }

···
 			      struct trace_eval_map **start, int len) { }
 #endif /* !CONFIG_TRACE_EVAL_MAP_FILE */

-static void trace_insert_eval_map(struct module *mod,
-				  struct trace_eval_map **start, int len)
+static void
+trace_event_update_with_eval_map(struct module *mod,
+				 struct trace_eval_map **start,
+				 int len)
 {
 	struct trace_eval_map **map;

-	if (len <= 0)
-		return;
+	/* Always run sanitizer only if btf_type_tag attr exists. */
+	if (len <= 0) {
+		if (!(IS_ENABLED(CONFIG_DEBUG_INFO_BTF) &&
+		      IS_ENABLED(CONFIG_PAHOLE_HAS_BTF_TAG) &&
+		      __has_attribute(btf_type_tag)))
+			return;
+	}

 	map = start;

-	trace_event_eval_update(map, len);
+	trace_event_update_all(map, len);
+
+	if (len <= 0)
+		return;

 	trace_insert_eval_map_file(mod, start, len);
 }
···
 static void add_tracer_options(struct trace_array *tr, struct tracer *t)
 {
 	/* Only enable if the directory has been created already. */
-	if (!tr->dir)
+	if (!tr->dir && !(tr->flags & TRACE_ARRAY_FL_GLOBAL))
 		return;

 	/* Only create trace option files after update_tracer_options finish */
···

 static struct dentry *tracing_get_dentry(struct trace_array *tr)
 {
-	if (WARN_ON(!tr->dir))
-		return ERR_PTR(-ENODEV);
-
 	/* Top directory uses NULL as the parent */
 	if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
 		return NULL;
+
+	if (WARN_ON(!tr->dir))
+		return ERR_PTR(-ENODEV);

 	/* All sub buffers have a descriptor */
 	return tr->dir;
···
 	ftrace_init_tracefs(tr, d_tracer);
 }

+#ifdef CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED
 static struct vfsmount *trace_automount(struct dentry *mntpt, void *ingore)
 {
 	struct vfsmount *mnt;
···
 	if (IS_ERR(fc))
 		return ERR_CAST(fc);

+	pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");
+
 	ret = vfs_parse_fs_string(fc, "source",
 				  "tracefs", strlen("tracefs"));
 	if (!ret)
···
 	put_fs_context(fc);
 	return mnt;
 }
+#endif

 /**
  * tracing_init_dentry - initialize top level trace array
···
 	if (WARN_ON(!tracefs_initialized()))
 		return -ENODEV;

+#ifdef CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED
 	/*
 	 * As there may still be users that expect the tracing
 	 * files to exist in debugfs/tracing, we must automount
···
 	 */
 	tr->dir = debugfs_create_automount("tracing", NULL,
 					   trace_automount, NULL);
+#endif

 	return 0;
 }
···
 	int len;

 	len = __stop_ftrace_eval_maps - __start_ftrace_eval_maps;
-	trace_insert_eval_map(NULL, __start_ftrace_eval_maps, len);
+	trace_event_update_with_eval_map(NULL, __start_ftrace_eval_maps, len);
 }

 static int __init trace_eval_init(void)
···

 static void trace_module_add_evals(struct module *mod)
 {
-	if (!mod->num_trace_evals)
-		return;
-
 	/*
 	 * Modules with bad taint do not have events created, do
 	 * not bother with enums either.
···
 	if (trace_module_has_bad_taint(mod))
 		return;

-	trace_insert_eval_map(mod, mod->trace_evals, mod->num_trace_evals);
+	/* Even if no trace_evals, this need to sanitize field types. */
+	trace_event_update_with_eval_map(mod, mod->trace_evals, mod->num_trace_evals);
 }

 #ifdef CONFIG_TRACE_EVAL_MAP_FILE
+2 -2
kernel/trace/trace.h
···

 #ifdef CONFIG_EVENT_TRACING
 void trace_event_init(void);
-void trace_event_eval_update(struct trace_eval_map **map, int len);
+void trace_event_update_all(struct trace_eval_map **map, int len);
 /* Used from boot time tracer */
 extern int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set);
 extern int trigger_process_regex(struct trace_event_file *file, char *buff);
 #else
 static inline void __init trace_event_init(void) { }
-static inline void trace_event_eval_update(struct trace_eval_map **map, int len) { }
+static inline void trace_event_update_all(struct trace_eval_map **map, int len) { }
 #endif

 #ifdef CONFIG_TRACER_SNAPSHOT
+121 -35
kernel/trace/trace_events.c
···
 {
 	struct trace_event_call *call = file->event_call;
 	struct trace_array *tr = file->tr;
+	bool soft_mode = atomic_read(&file->sm_ref) != 0;
 	int ret = 0;
 	int disable;

···
 		 * is set we do not want the event to be enabled before we
 		 * clear the bit.
 		 *
-		 * When soft_disable is not set but the SOFT_MODE flag is,
+		 * When soft_disable is not set but the soft_mode is,
 		 * we do nothing. Do not disable the tracepoint, otherwise
 		 * "soft enable"s (clearing the SOFT_DISABLED bit) wont work.
 		 */
···
 			if (atomic_dec_return(&file->sm_ref) > 0)
 				break;
 			disable = file->flags & EVENT_FILE_FL_SOFT_DISABLED;
-			clear_bit(EVENT_FILE_FL_SOFT_MODE_BIT, &file->flags);
+			soft_mode = false;
 			/* Disable use of trace_buffered_event */
 			trace_buffered_event_disable();
 		} else
-			disable = !(file->flags & EVENT_FILE_FL_SOFT_MODE);
+			disable = !soft_mode;

 		if (disable && (file->flags & EVENT_FILE_FL_ENABLED)) {
 			clear_bit(EVENT_FILE_FL_ENABLED_BIT, &file->flags);
···

 			WARN_ON_ONCE(ret);
 		}
-		/* If in SOFT_MODE, just set the SOFT_DISABLE_BIT, else clear it */
-		if (file->flags & EVENT_FILE_FL_SOFT_MODE)
+		/* If in soft mode, just set the SOFT_DISABLE_BIT, else clear it */
+		if (soft_mode)
 			set_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);
 		else
 			clear_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);
···
 		 * When soft_disable is set and enable is set, we want to
 		 * register the tracepoint for the event, but leave the event
 		 * as is. That means, if the event was already enabled, we do
-		 * nothing (but set SOFT_MODE). If the event is disabled, we
+		 * nothing (but set soft_mode). If the event is disabled, we
 		 * set SOFT_DISABLED before enabling the event tracepoint, so
 		 * it still seems to be disabled.
 		 */
···
 		else {
 			if (atomic_inc_return(&file->sm_ref) > 1)
 				break;
-			set_bit(EVENT_FILE_FL_SOFT_MODE_BIT, &file->flags);
+			soft_mode = true;
 			/* Enable use of trace_buffered_event */
 			trace_buffered_event_enable();
 		}
···
 		if (!(file->flags & EVENT_FILE_FL_ENABLED)) {
 			bool cmd = false, tgid = false;

-			/* Keep the event disabled, when going to SOFT_MODE. */
+			/* Keep the event disabled, when going to soft mode. */
 			if (soft_disable)
 				set_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);

···
 	    !(flags & EVENT_FILE_FL_SOFT_DISABLED))
 		strcpy(buf, "1");

-	if (flags & EVENT_FILE_FL_SOFT_DISABLED ||
-	    flags & EVENT_FILE_FL_SOFT_MODE)
+	if (atomic_read(&file->sm_ref) != 0)
 		strcat(buf, "*");

 	strcat(buf, "\n");
···
 	list_add(&modstr->next, &module_strings);
 }

+#define ATTRIBUTE_STR "__attribute__("
+#define ATTRIBUTE_STR_LEN (sizeof(ATTRIBUTE_STR) - 1)
+
+/* Remove all __attribute__() from @type. Return allocated string or @type. */
+static char *sanitize_field_type(const char *type)
+{
+	char *attr, *tmp, *next, *ret = (char *)type;
+	int depth;
+
+	next = (char *)type;
+	while ((attr = strstr(next, ATTRIBUTE_STR))) {
+		/* Retry if "__attribute__(" is a part of another word. */
+		if (attr != next && !isspace(attr[-1])) {
+			next = attr + ATTRIBUTE_STR_LEN;
+			continue;
+		}
+
+		if (ret == type) {
+			ret = kstrdup(type, GFP_KERNEL);
+			if (WARN_ON_ONCE(!ret))
+				return NULL;
+			attr = ret + (attr - type);
+		}
+
+		/* the ATTRIBUTE_STR already has the first '(' */
+		depth = 1;
+		next = attr + ATTRIBUTE_STR_LEN;
+		do {
+			tmp = strpbrk(next, "()");
+			/* There is unbalanced parentheses */
+			if (WARN_ON_ONCE(!tmp)) {
+				kfree(ret);
+				return (char *)type;
+			}
+
+			if (*tmp == '(')
+				depth++;
+			else
+				depth--;
+			next = tmp + 1;
+		} while (depth > 0);
+		next = skip_spaces(next);
+		strcpy(attr, next);
+		next = attr;
+	}
+	return ret;
+}
+
+static char *find_replacable_eval(const char *type, const char *eval_string,
+				  int len)
+{
+	char *ptr;
+
+	if (!eval_string)
+		return NULL;
+
+	ptr = strchr(type, '[');
+	if (!ptr)
+		return NULL;
+	ptr++;
+
+	if (!isalpha(*ptr) && *ptr != '_')
+		return NULL;
+
+	if (strncmp(eval_string, ptr, len) != 0)
+		return NULL;
+
+	return ptr;
+}
+
 static void update_event_fields(struct trace_event_call *call,
 				struct trace_eval_map *map)
 {
 	struct ftrace_event_field *field;
+	const char *eval_string = NULL;
 	struct list_head *head;
+	int len = 0;
 	char *ptr;
 	char *str;
-	int len = strlen(map->eval_string);

 	/* Dynamic events should never have field maps */
-	if (WARN_ON_ONCE(call->flags & TRACE_EVENT_FL_DYNAMIC))
+	if (call->flags & TRACE_EVENT_FL_DYNAMIC)
 		return;
+
+	if (map) {
+		eval_string = map->eval_string;
+		len = strlen(map->eval_string);
+	}

 	head = trace_get_fields(call);
 	list_for_each_entry(field, head, link) {
-		ptr = strchr(field->type, '[');
-		if (!ptr)
-			continue;
-		ptr++;
-
-		if (!isalpha(*ptr) && *ptr != '_')
-			continue;
-
-		if (strncmp(map->eval_string, ptr, len) != 0)
-			continue;
-
-		str = kstrdup(field->type, GFP_KERNEL);
-		if (WARN_ON_ONCE(!str))
+		str = sanitize_field_type(field->type);
+		if (!str)
 			return;
-		ptr = str + (ptr - field->type);
-		ptr = eval_replace(ptr, map, len);
-		/* enum/sizeof string smaller than value */
-		if (WARN_ON_ONCE(!ptr)) {
-			kfree(str);
-			continue;
+
+		ptr = find_replacable_eval(str, eval_string, len);
+		if (ptr) {
+			if (str == field->type) {
+				str = kstrdup(field->type, GFP_KERNEL);
+				if (WARN_ON_ONCE(!str))
+					return;
+				ptr = str + (ptr - field->type);
+			}
+
+			ptr = eval_replace(ptr, map, len);
+			/* enum/sizeof string smaller than value */
+			if (WARN_ON_ONCE(!ptr)) {
+				kfree(str);
+				continue;
+			}
 		}

+		if (str == field->type)
+			continue;
 		/*
 		 * If the event is part of a module, then we need to free the string
 		 * when the module is removed.
Otherwise, it will stay allocated ··· 3390 3313 add_str_to_module(call->module, str); 3391 3314 3392 3315 field->type = str; 3316 + if (field->filter_type == FILTER_OTHER) 3317 + field->filter_type = filter_assign_type(field->type); 3393 3318 } 3394 3319 } 3395 3320 3396 - void trace_event_eval_update(struct trace_eval_map **map, int len) 3321 + /* Update all events for replacing eval and sanitizing */ 3322 + void trace_event_update_all(struct trace_eval_map **map, int len) 3397 3323 { 3398 3324 struct trace_event_call *call, *p; 3399 3325 const char *last_system = NULL; 3400 3326 bool first = false; 3327 + bool updated; 3401 3328 int last_i; 3402 3329 int i; 3403 3330 ··· 3414 3333 last_system = call->class->system; 3415 3334 } 3416 3335 3336 + updated = false; 3417 3337 /* 3418 3338 * Since calls are grouped by systems, the likelihood that the 3419 3339 * next call in the iteration belongs to the same system as the ··· 3434 3352 } 3435 3353 update_event_printk(call, map[i]); 3436 3354 update_event_fields(call, map[i]); 3355 + updated = true; 3437 3356 } 3438 3357 } 3358 + /* If not updated yet, update field for sanitizing. */ 3359 + if (!updated) 3360 + update_event_fields(call, NULL); 3439 3361 cond_resched(); 3440 3362 } 3441 3363 up_write(&trace_event_sem); ··· 3673 3587 continue; 3674 3588 /* 3675 3589 * We can't rely on ftrace_event_enable_disable(enable => 0) 3676 - * we are going to do, EVENT_FILE_FL_SOFT_MODE can suppress 3590 + * we are going to do, soft mode can suppress 3677 3591 * TRACE_REG_UNREGISTER. 
3678 3592 */ 3679 3593 if (file->flags & EVENT_FILE_FL_ENABLED) ··· 3784 3698 if (call->module == mod) 3785 3699 __trace_remove_event_call(call); 3786 3700 } 3787 - /* Check for any strings allocade for this module */ 3701 + /* Check for any strings allocated for this module */ 3788 3702 list_for_each_entry_safe(modstr, m, &module_strings, next) { 3789 3703 if (modstr->module != mod) 3790 3704 continue; ··· 4088 4002 4089 4003 edata->ref--; 4090 4004 if (!edata->ref) { 4091 - /* Remove the SOFT_MODE flag */ 4005 + /* Remove soft mode */ 4092 4006 __ftrace_event_enable_disable(edata->file, 0, 1); 4093 4007 trace_event_put_ref(edata->file->event_call); 4094 4008 kfree(edata);
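The parenthesis-depth scan that sanitize_field_type() uses to drop `__attribute__(...)` from a field type can be tried outside the kernel. The following is a minimal user-space sketch of that scan, not the kernel code itself. strip_attributes() is a hypothetical name, malloc()/free() stand in for kstrdup()/kfree(), the function always returns a fresh copy (the kernel helper returns @type itself when nothing changes), and memmove() is used because the source and destination of the copy overlap.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define ATTRIBUTE_STR "__attribute__("
#define ATTRIBUTE_STR_LEN (sizeof(ATTRIBUTE_STR) - 1)

/*
 * Hypothetical user-space analogue of sanitize_field_type().
 * Returns a malloc()ed copy of @type with every "__attribute__(...)"
 * removed, or NULL if allocation fails. The caller frees the result.
 */
char *strip_attributes(const char *type)
{
	char *ret, *attr, *next, *tmp;
	int depth;

	assert(type != NULL);

	ret = malloc(strlen(type) + 1);
	if (!ret)
		return NULL;
	strcpy(ret, type);

	next = ret;
	while ((attr = strstr(next, ATTRIBUTE_STR))) {
		/* Skip a match that is glued to the end of another word. */
		if (attr != next && !isspace((unsigned char)attr[-1])) {
			next = attr + ATTRIBUTE_STR_LEN;
			continue;
		}

		/* ATTRIBUTE_STR already contains the first '('. */
		depth = 1;
		next = attr + ATTRIBUTE_STR_LEN;
		do {
			tmp = strpbrk(next, "()");
			if (!tmp) {
				/* Unbalanced parentheses: give back the input unchanged. */
				strcpy(ret, type);
				return ret;
			}
			depth += (*tmp == '(') ? 1 : -1;
			next = tmp + 1;
		} while (depth > 0);

		/* Drop the attribute and any whitespace that followed it. */
		while (isspace((unsigned char)*next))
			next++;
		memmove(attr, next, strlen(next) + 1);
		next = attr;
	}
	return ret;
}
```

With this sketch, an input such as `const char __attribute__((btf_type_tag("user"))) *` comes back as `const char *`, and a string with unbalanced parentheses is returned unchanged.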
+20 -8
kernel/trace/trace_events_filter.c
···
 
 struct filter_head {
 	struct list_head list;
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct rcu_work rwork;
+	};
 };
 
-
-static void free_filter_list(struct rcu_head *rhp)
+static void free_filter_list(struct filter_head *filter_list)
 {
-	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
 	struct filter_list *filter_item, *tmp;
 
 	list_for_each_entry_safe(filter_item, tmp, &filter_list->list, list) {
···
 	kfree(filter_list);
 }
 
+static void free_filter_list_work(struct work_struct *work)
+{
+	struct filter_head *filter_list;
+
+	filter_list = container_of(to_rcu_work(work), struct filter_head, rwork);
+	free_filter_list(filter_list);
+}
+
 static void free_filter_list_tasks(struct rcu_head *rhp)
 {
-	call_rcu(rhp, free_filter_list);
+	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
+
+	INIT_RCU_WORK(&filter_list->rwork, free_filter_list_work);
+	queue_rcu_work(system_wq, &filter_list->rwork);
 }
 
 /*
···
 	tracepoint_synchronize_unregister();
 
 	if (head)
-		free_filter_list(&head->rcu);
+		free_filter_list(head);
 
 	list_for_each_entry(file, &tr->events, list) {
 		if (file->system != dir || !file->filter)
···
 	return 0;
 fail:
 	/* No call succeeded */
-	free_filter_list(&filter_list->rcu);
+	free_filter_list(filter_list);
 	parse_error(pe, FILT_ERR_BAD_SUBSYS_FILTER, 0);
 	return -EINVAL;
 fail_mem:
···
 	if (!fail)
 		delay_free_filter(filter_list);
 	else
-		free_filter_list(&filter_list->rcu);
+		free_filter_list(filter_list);
 
 	return -ENOMEM;
 }
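In the hunks above, free_filter_list_tasks() receives the rcu_head and then re-initializes the same storage as an rcu_work, with container_of() recovering the enclosing struct filter_head in each callback. A user-space sketch of that layout trick follows. The demo_* types are hypothetical stand-ins, not the kernel's struct rcu_head or struct rcu_work, and the container_of() macro here is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins; the real kernel structures differ. */
struct demo_rcu_head {
	void *next;
	void (*func)(struct demo_rcu_head *);
};

struct demo_rcu_work {
	int pending;
	struct demo_rcu_head rcu;
};

struct demo_filter_head {
	int nr_filters;
	union {
		struct demo_rcu_head rcu;	/* used by the first callback */
		struct demo_rcu_work rwork;	/* reused for the deferred work */
	};
};

/* Both union members start at the same offset inside the container. */
static_assert(offsetof(struct demo_filter_head, rcu) ==
	      offsetof(struct demo_filter_head, rwork),
	      "union members must share storage");

/* Recover the container from the member each callback is handed. */
struct demo_filter_head *head_from_rcu(struct demo_rcu_head *rhp)
{
	return container_of(rhp, struct demo_filter_head, rcu);
}

struct demo_filter_head *head_from_rwork(struct demo_rcu_work *rwork)
{
	return container_of(rwork, struct demo_filter_head, rwork);
}
```

Because only one of the two members is live at any time, the union keeps the structure from growing by the size of the second member.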
+1 -4
kernel/trace/trace_hwlat.c
···
 
 	cpus_read_lock();
 	cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
-	next_cpu = cpumask_next(raw_smp_processor_id(), current_mask);
+	next_cpu = cpumask_next_wrap(raw_smp_processor_id(), current_mask);
 	cpus_read_unlock();
-
-	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(current_mask);
 
 	if (next_cpu >= nr_cpu_ids) /* Shouldn't happen! */
 		goto change_mode;
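The hunk above folds the removed "next CPU, else first CPU" fallback into a single cpumask_next_wrap() call. A small user-space sketch of that wrap-around search over a 64-bit mask is shown below. next_bit_wrap() is a hypothetical helper and NBITS is an assumed stand-in for nr_cpu_ids. It models the behavior of the two removed steps, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NBITS 64u	/* assumed stand-in for nr_cpu_ids */

/*
 * Hypothetical helper: return the index of the first set bit in @mask
 * strictly after @n, wrapping around to the lowest set bit if none is
 * found above @n. Returns NBITS when @mask is empty.
 */
unsigned int next_bit_wrap(unsigned int n, uint64_t mask)
{
	unsigned int i, bit;

	assert(n < NBITS);

	/* Visit n+1, n+2, ..., wrapping past NBITS-1, and ending at n itself. */
	for (i = 1; i <= NBITS; i++) {
		bit = (n + i) % NBITS;
		if (mask & (UINT64_C(1) << bit))
			return bit;
	}
	return NBITS;
}
```

For a mask with bits 1 and 3 set, next_bit_wrap(1, mask) yields 3 and next_bit_wrap(3, mask) wraps around to 1. An empty mask yields NBITS, which corresponds to the "Shouldn't happen!" check that the hunk keeps.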