syzbot |
sign-in | mailing list | source | docs | 🏰 |
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| 549ed145-2388-45c0-a6ff-7db485dc4fec | assessment-kcsan | Benign: ❌ Confident: ✅ | ✅ | KCSAN: data-race in perf_event_set_state / perf_mmap_rb | 2026/02/06 06:52 | 2026/02/06 06:52 | 2026/02/06 07:03 | 9b618abc0cd923b36c7a5ebc58a21174112d8f70 |
| Time | User | Decision |
|---|---|---|
| 2026/02/06 07:35 | dvyukov@google.com | ✅ |
The data race occurs in the `perf` subsystem between two concurrent calls to `perf_event_update_time()`.
### Analysis of the Race
1. **Thread 1 (Task 12011):** This task is executing the `mmap()` system call on a perf event file descriptor. It calls `perf_mmap()`, which acquires the `event->mmap_mutex` and then calls `perf_mmap_rb()`. Inside `perf_mmap_rb()`, it calls `perf_event_update_time(event)` to update the event's internal timing counters before initializing the user-space mapping page.
2. **Thread 2 (Task 12005):** This task is executing `perf_poll()`. Interestingly, `perf_poll()` also tries to acquire the same `event->mmap_mutex`. If Task 12011 already holds the mutex, Task 12005 will block and be scheduled out. During the context switch, the kernel calls `event_sched_out()` for the events associated with the task. `event_sched_out()` calls `perf_event_set_state()`, which in turn calls `perf_event_update_time(event)`.
### The Impact
The function `perf_event_update_time()` performs the following operations:
```c
static void perf_event_update_time(struct perf_event *event)
{
u64 now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled, &event->total_time_running);
event->tstamp = now;
}
```
And `__perf_update_times()` does:
```c
*enabled = event->total_time_enabled;
if (state >= PERF_EVENT_STATE_INACTIVE)
*enabled += delta;
```
Because `perf_mmap_rb()` does not hold the `perf_event_context` lock (`ctx->lock`), which is the intended protection for these timing fields, it races with the `event_sched_out()` path (which does hold `ctx->lock`).
The race on `total_time_enabled` and `total_time_running` involves non-atomic read-modify-write operations. If both threads read the same old value of `total_time_enabled` before either writes back the updated value, one of the updates (representing a chunk of time the event was enabled) will be lost. Additionally, the race on `event->tstamp` can lead to inconsistent state where the timestamp and the total time counters are out of sync, causing further errors in subsequent time calculations.
### Conclusion
This data race is **harmful**. While it does not lead to an immediate kernel crash or memory corruption, it causes permanent inaccuracy in the performance counters. For a subsystem dedicated to high-precision performance monitoring, lost timing updates constitute a significant functional bug. The race is also highly reproducible in scenarios where one task maps an event while another task using the same event is forced to schedule out (e.g., by blocking on the same `mmap_mutex`).================================================================== BUG: KCSAN: data-race in perf_event_set_state / perf_mmap_rb write to 0xffff88812279f1a0 of 8 bytes by task 12011 on cpu 1: perf_event_update_time kernel/events/core.c:737 [inline] perf_mmap_rb+0x71c/0x910 kernel/events/core.c:7037 perf_mmap+0x1ce/0x2f0 kernel/events/core.c:7164 vfs_mmap include/linux/fs.h:2053 [inline] mmap_file mm/internal.h:167 [inline] __mmap_new_file_vma mm/vma.c:2421 [inline] __mmap_new_vma mm/vma.c:2484 [inline] __mmap_region mm/vma.c:2708 [inline] mmap_region+0x1045/0x1410 mm/vma.c:2786 do_mmap+0x9b3/0xbe0 mm/mmap.c:558 vm_mmap_pgoff+0x17a/0x2e0 mm/util.c:581 ksys_mmap_pgoff+0x268/0x310 mm/mmap.c:604 x64_sys_call+0x16bb/0x3000 arch/x86/include/generated/asm/syscalls_64.h:10 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff88812279f1a0 of 8 bytes by task 12005 on cpu 0: __perf_update_times kernel/events/core.c:720 [inline] perf_event_update_time kernel/events/core.c:735 [inline] perf_event_set_state+0x153/0x440 kernel/events/core.c:754 event_sched_out+0x2d4/0x4d0 kernel/events/core.c:2391 group_sched_out kernel/events/core.c:2415 [inline] __pmu_ctx_sched_out+0x3e7/0x530 kernel/events/core.c:3458 ctx_sched_out+0x273/0x2d0 kernel/events/core.c:3539 task_ctx_sched_out+0x4d/0x70 kernel/events/core.c:2859 perf_event_context_sched_out kernel/events/core.c:3746 [inline] __perf_event_task_sched_out+0x286/0x370 kernel/events/core.c:3846 perf_event_task_sched_out include/linux/perf_event.h:1654 [inline] prepare_task_switch kernel/sched/core.c:5045 [inline] context_switch kernel/sched/core.c:5201 [inline] __schedule+0xbf0/0xcd0 kernel/sched/core.c:6863 __schedule_loop kernel/sched/core.c:6945 [inline] schedule+0x5f/0xd0 kernel/sched/core.c:6960 schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:7017 __mutex_lock_common kernel/locking/mutex.c:692 [inline] __mutex_lock+0x4ff/0xe20 kernel/locking/mutex.c:776 __mutex_lock_slowpath+0xa/0x10 kernel/locking/mutex.c:1065 mutex_lock+0x89/0x90 kernel/locking/mutex.c:290 perf_poll+0x180/0x1f0 kernel/events/core.c:6150 vfs_poll include/linux/poll.h:82 [inline] select_poll_one fs/select.c:480 [inline] do_select+0x8f1/0xf40 fs/select.c:536 core_sys_select+0x3dc/0x6e0 fs/select.c:677 do_pselect fs/select.c:759 [inline] __do_sys_pselect6 fs/select.c:798 [inline] __se_sys_pselect6+0x213/0x280 fs/select.c:789 __x64_sys_pselect6+0x78/0x90 fs/select.c:789 x64_sys_call+0x2e98/0x3000 arch/x86/include/generated/asm/syscalls_64.h:271 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x000000000038c145 -> 0x00000000003929d3 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 12005 Comm: syz.4.2772 Tainted: G W syzkaller #0 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 ==================================================================
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/02/06 06:52 | flow | assessment-kcsan |
10mResults:map[Benign:false Confident:true Explanation:The data race occurs in the `perf` subsystem between two concurrent calls to `perf_event_update_time()`.
### Analysis of the Race
1. **Thread 1 (Task 12011):** This task is executing the `mmap()` system call on a perf event file descriptor. It calls `perf_mmap()`, which acquires the `event->mmap_mutex` and then calls `perf_mmap_rb()`. Inside `perf_mmap_rb()`, it calls `perf_event_update_time(event)` to update the event's internal timing counters before initializing the user-space mapping page.
2. **Thread 2 (Task 12005):** This task is executing `perf_poll()`. Interestingly, `perf_poll()` also tries to acquire the same `event->mmap_mutex`. If Task 12011 already holds the mutex, Task 12005 will block and be scheduled out. During the context switch, the kernel calls `event_sched_out()` for the events associated with the task. `event_sched_out()` calls `perf_event_set_state()`, which in turn calls `perf_event_update_time(event)`.
### The Impact
The function `perf_event_update_time()` performs the following operations:
```c
static void perf_event_update_time(struct perf_event *event)
{
u64 now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled, &event->total_time_running);
event->tstamp = now;
}
```
And `__perf_update_times()` does:
```c
*enabled = event->total_time_enabled;
if (state >= PERF_EVENT_STATE_INACTIVE)
*enabled += delta;
```
Because `perf_mmap_rb()` does not hold the `perf_event_context` lock (`ctx->lock`), which is the intended protection for these timing fields, it races with the `event_sched_out()` path (which does hold `ctx->lock`).
The race on `total_time_enabled` and `total_time_running` involves non-atomic read-modify-write operations. If both threads read the same old value of `total_time_enabled` before either writes back the updated value, one of the updates (representing a chunk of time the event was enabled) will be lost. Additionally, the race on `event->tstamp` can lead to inconsistent state where the timestamp and the total time counters are out of sync, causing further errors in subsequent time calculations.
### Conclusion
This data race is **harmful**. While it does not lead to an immediate kernel crash or memory corruption, it causes permanent inaccuracy in the performance counters. For a subsystem dedicated to high-precision performance monitoring, lost timing updates constitute a significant functional bug. The race is also highly reproducible in scenarios where one task maps an event while another task using the same event is forced to schedule out (e.g., by blocking on the same `mmap_mutex`).] |
| 1/1 | 2026/02/06 06:52 | action | kernel-checkouter |
0mResults:map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/45e88afe9f77e1065fe073226d73bb1a7b8fcd49] |
| 2/1 | 2026/02/06 06:53 | action | kernel-builder |
3mResults:map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/cd032fae0a4a355202951678faca55c2ed45746d] |
| 3/1 | 2026/02/06 06:57 | action | codesearch-prepare |
2mResults:map[Index:codesearch-index] |
| 4/1 | 2026/02/06 07:00 | agent | expert |
3mModel:gemini-3-flash-preview Results: map[Benign:false Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. In the final reply explain why you think the given data race is benign or is harmful. Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt: The data race report is: ================================================================== BUG: KCSAN: data-race in perf_event_set_state / perf_mmap_rb write to 0xffff88812279f1a0 of 8 bytes by task 12011 on cpu 1: perf_event_update_time kernel/events/core.c:737 [inline] perf_mmap_rb+0x71c/0x910 kernel/events/core.c:7037 perf_mmap+0x1ce/0x2f0 kernel/events/core.c:7164 vfs_mmap include/linux/fs.h:2053 [inline] mmap_file mm/internal.h:167 [inline] __mmap_new_file_vma mm/vma.c:2421 [inline] __mmap_new_vma mm/vma.c:2484 [inline] __mmap_region mm/vma.c:2708 [inline] mmap_region+0x1045/0x1410 mm/vma.c:2786 do_mmap+0x9b3/0xbe0 mm/mmap.c:558 vm_mmap_pgoff+0x17a/0x2e0 mm/util.c:581 ksys_mmap_pgoff+0x268/0x310 mm/mmap.c:604 x64_sys_call+0x16bb/0x3000 arch/x86/include/generated/asm/syscalls_64.h:10 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff88812279f1a0 of 8 bytes by task 12005 on cpu 0: __perf_update_times kernel/events/core.c:720 [inline] perf_event_update_time kernel/events/core.c:735 [inline] perf_event_set_state+0x153/0x440 kernel/events/core.c:754 event_sched_out+0x2d4/0x4d0 kernel/events/core.c:2391 group_sched_out kernel/events/core.c:2415 [inline] __pmu_ctx_sched_out+0x3e7/0x530 kernel/events/core.c:3458 ctx_sched_out+0x273/0x2d0 kernel/events/core.c:3539 task_ctx_sched_out+0x4d/0x70 kernel/events/core.c:2859 perf_event_context_sched_out kernel/events/core.c:3746 [inline] __perf_event_task_sched_out+0x286/0x370 kernel/events/core.c:3846 perf_event_task_sched_out include/linux/perf_event.h:1654 [inline] prepare_task_switch kernel/sched/core.c:5045 [inline] context_switch kernel/sched/core.c:5201 [inline] __schedule+0xbf0/0xcd0 kernel/sched/core.c:6863 __schedule_loop kernel/sched/core.c:6945 [inline] schedule+0x5f/0xd0 kernel/sched/core.c:6960 schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:7017 __mutex_lock_common kernel/locking/mutex.c:692 [inline] __mutex_lock+0x4ff/0xe20 kernel/locking/mutex.c:776 __mutex_lock_slowpath+0xa/0x10 kernel/locking/mutex.c:1065 mutex_lock+0x89/0x90 kernel/locking/mutex.c:290 perf_poll+0x180/0x1f0 kernel/events/core.c:6150 vfs_poll include/linux/poll.h:82 [inline] select_poll_one fs/select.c:480 [inline] do_select+0x8f1/0xf40 fs/select.c:536 core_sys_select+0x3dc/0x6e0 fs/select.c:677 do_pselect fs/select.c:759 [inline] __do_sys_pselect6 fs/select.c:798 [inline] __se_sys_pselect6+0x213/0x280 fs/select.c:789 __x64_sys_pselect6+0x78/0x90 fs/select.c:789 x64_sys_call+0x2e98/0x3000 arch/x86/include/generated/asm/syscalls_64.h:271 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xca/0x2b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x000000000038c145 -> 0x00000000003929d3 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 12005 Comm: syz.4.2772 Tainted: G W syzkaller #0 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 ================================================================== Reply: The data race occurs in the `perf` subsystem between two concurrent calls to `perf_event_update_time()`.
### Analysis of the Race
1. **Thread 1 (Task 12011):** This task is executing the `mmap()` system call on a perf event file descriptor. It calls `perf_mmap()`, which acquires the `event->mmap_mutex` and then calls `perf_mmap_rb()`. Inside `perf_mmap_rb()`, it calls `perf_event_update_time(event)` to update the event's internal timing counters before initializing the user-space mapping page.
2. **Thread 2 (Task 12005):** This task is executing `perf_poll()`. Interestingly, `perf_poll()` also tries to acquire the same `event->mmap_mutex`. If Task 12011 already holds the mutex, Task 12005 will block and be scheduled out. During the context switch, the kernel calls `event_sched_out()` for the events associated with the task. `event_sched_out()` calls `perf_event_set_state()`, which in turn calls `perf_event_update_time(event)`.
### The Impact
The function `perf_event_update_time()` performs the following operations:
```c
static void perf_event_update_time(struct perf_event *event)
{
u64 now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled, &event->total_time_running);
event->tstamp = now;
}
```
And `__perf_update_times()` does:
```c
*enabled = event->total_time_enabled;
if (state >= PERF_EVENT_STATE_INACTIVE)
*enabled += delta;
```
Because `perf_mmap_rb()` does not hold the `perf_event_context` lock (`ctx->lock`), which is the intended protection for these timing fields, it races with the `event_sched_out()` path (which does hold `ctx->lock`).
The race on `total_time_enabled` and `total_time_running` involves non-atomic read-modify-write operations. If both threads read the same old value of `total_time_enabled` before either writes back the updated value, one of the updates (representing a chunk of time the event was enabled) will be lost. Additionally, the race on `event->tstamp` can lead to inconsistent state where the timestamp and the total time counters are out of sync, causing further errors in subsequent time calculations.
### Conclusion
This data race is **harmful**. While it does not lead to an immediate kernel crash or memory corruption, it causes permanent inaccuracy in the performance counters. For a subsystem dedicated to high-precision performance monitoring, lost timing updates constitute a significant functional bug. The race is also highly reproducible in scenarios where one task maps an event while another task using the same event is forced to schedule out (e.g., by blocking on the same `mmap_mutex`). |
| 5/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 4625 output: 43 thoughts: 216 Thoughts: **Analyzing a Data Race in the Perf Event Subsystem** Okay, here's what I'm seeing. This report is flagging a potential data race. Specifically, it seems like we've got a concurrency issue between `perf_event_update_time`, which gets called from `perf_mmap_rb`, and `perf_event_set_state`, which is triggered by `event_sched_out`. The race is happening on this 8-byte memory location: `0xffff88812279f1a0`. I need to figure out what's going on there. I'm seeing the value flip from `0x000000000038c145` to `0x00000000003929d3`. My first step is to dig into the code for those two functions, `perf_event_update_time` and `perf_event_set_state`. I need to figure out what's being accessed at that memory location. I'm looking at the call stacks now, and that's pointing to `kernel/events/core.c`. Interestingly, `perf_event_update_time` is involved in both of these races. So, that's definitely the key to follow. Time to dive into `perf_event_update_time` in `kernel/events/core.c`. |
| 6/2 | 2026/02/06 07:00 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_event_update_time] Results: map[SourceCode: 731: static void perf_event_update_time(struct perf_event *event)
732: {
733: u64 now = perf_event_time(event);
734:
735: __perf_update_times(event, now, &event->total_time_enabled,
736: &event->total_time_running);
737: event->tstamp = now;
738: }
SourceFile:kernel/events/core.c] |
| 7/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 5346 output: 42 thoughts: 269 Thoughts: **Analyzing the perf_event Timestamp Discrepancy** Okay, so I'm looking at this report and something's not quite right. The report details a write of 8 bytes by task 12011 on CPU 1, specifically at `kernel/events/core.c:737`, which is `perf_event_update_time`. The write is setting `event->tstamp = now;`, which is the expected behavior. But now I see that the report indicates a read of 8 bytes by task 12005 on CPU 0. And this read appears to be happening at `kernel/events/core.c:720`, which is inside `__perf_update_times`. That suggests that data is being read from a location related to the timestamps that's being set in the other function. The puzzle is, if the timestamp write is happening at line 737, and the read is within `__perf_update_times` at line 720, that would make sense since `__perf_update_times` is a function called by `perf_event_update_time`. So there doesn't seem to be a problem based on the location. But, I really need to drill down and look at `__perf_update_times`. I need to figure out how and why that timestamp value is being accessed so I can figure out what is going on. |
| 8/2 | 2026/02/06 07:00 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:__perf_update_times] Results: map[SourceCode: 716: static __always_inline void
717: __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
718: {
719: enum perf_event_state state = __perf_effective_state(event);
720: u64 delta = now - event->tstamp;
721:
722: *enabled = event->total_time_enabled;
723: if (state >= PERF_EVENT_STATE_INACTIVE)
724: *enabled += delta;
725:
726: *running = event->total_time_running;
727: if (state >= PERF_EVENT_STATE_ACTIVE)
728: *running += delta;
729: }
SourceFile:kernel/events/core.c] |
| 9/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 6217 output: 42 thoughts: 96 Thoughts: Okay, here's my breakdown of what's happening, based on what I'm seeing: **Racing the Clock: A Perf Event Timestamp Conundrum** Alright, the race is on. I'm looking at a timing issue, specifically around the `event->tstamp` variable. It looks like we've got a read/write conflict situation going on. Task 12011 is actively writing a timestamp to `event->tstamp` on line 737, and simultaneously, task 12005 is attempting to read from the *very same* `event->tstamp` location on line 720. Clearly, we have a potential data race here! Crucially, I see that both tasks are calling `perf_event_update_time`. I need to dig deeper into that function. The next step is to examine all the call paths, specifically where `perf_mmap_rb` calls `perf_event_update_time`, as it's the most common entry point for events into the ring buffer. That should give me a better understanding of the timing and synchronization aspects and let me identify the root cause of the race. |
| 10/2 | 2026/02/06 07:00 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_mmap_rb] Results: map[SourceCode:6978: static int perf_mmap_rb(struct vm_area_struct *vma, struct perf_event *event,
6979: unsigned long nr_pages)
6980: {
6981: long extra = 0, user_extra = nr_pages;
6982: struct perf_buffer *rb;
6983: int rb_flags = 0;
6984:
6985: nr_pages -= 1;
6986:
6987: /*
6988: * If we have rb pages ensure they're a power-of-two number, so we
6989: * can do bitmasks instead of modulo.
6990: */
6991: if (nr_pages != 0 && !is_power_of_2(nr_pages))
6992: return -EINVAL;
6993:
6994: WARN_ON_ONCE(event->ctx->parent_ctx);
6995:
6996: if (event->rb) {
6997: if (data_page_nr(event->rb) != nr_pages)
6998: return -EINVAL;
6999:
7000: if (refcount_inc_not_zero(&event->rb->mmap_count)) {
7001: /*
7002: * Success -- managed to mmap() the same buffer
7003: * multiple times.
7004: */
7005: perf_mmap_account(vma, user_extra, extra);
7006: refcount_inc(&event->mmap_count);
7007: return 0;
7008: }
7009:
7010: /*
7011: * Raced against perf_mmap_close()'s
7012: * refcount_dec_and_mutex_lock() remove the
7013: * event and continue as if !event->rb
7014: */
7015: ring_buffer_attach(event, NULL);
7016: }
7017:
7018: if (!perf_mmap_calc_limits(vma, &user_extra, &extra))
7019: return -EPERM;
7020:
7021: if (vma->vm_flags & VM_WRITE)
7022: rb_flags |= RING_BUFFER_WRITABLE;
7023:
7024: rb = rb_alloc(nr_pages,
7025: event->attr.watermark ? event->attr.wakeup_watermark : 0,
7026: event->cpu, rb_flags);
7027:
7028: if (!rb)
7029: return -ENOMEM;
7030:
7031: refcount_set(&rb->mmap_count, 1);
7032: rb->mmap_user = get_current_user();
7033: rb->mmap_locked = extra;
7034:
7035: ring_buffer_attach(event, rb);
7036:
7037: perf_event_update_time(event);
7038: perf_event_init_userpage(event);
7039: perf_event_update_userpage(event);
7040:
7041: perf_mmap_account(vma, user_extra, extra);
7042: refcount_set(&event->mmap_count, 1);
7043:
7044: return 0;
7045: }
SourceFile:kernel/events/core.c] |
| 11/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 7619 output: 43 thoughts: 64 Thoughts: Here's my summary, as if I were thinking it: **Digging into the `perf_mmap_rb` Function** Okay, so I'm tracing through this `perf_mmap_rb` function, and I'm zeroing in on line 7037. That's where it's calling `perf_event_update_time(event)`. Aha! Now I remember; this is happening when a new ring buffer gets attached to the event during the `mmap` operation. That makes perfect sense; we need to update our time-related parameters to keep everything synchronized at that pivotal point. Now, that brings up another interesting function to examine - `perf_event_set_state` - which is likely where the final state is set. Let's see how that fits into this overall picture. |
| 12/2 | 2026/02/06 07:00 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_event_set_state] Results: map[SourceCode: 748: static void
749: perf_event_set_state(struct perf_event *event, enum perf_event_state state)
750: {
751: if (event->state == state)
752: return;
753:
754: perf_event_update_time(event);
755: /*
756: * If a group leader gets enabled/disabled all its siblings
757: * are affected too.
758: */
759: if ((event->state < 0) ^ (state < 0))
760: perf_event_update_sibling_time(event);
761:
762: WRITE_ONCE(event->state, state);
763: }
SourceFile:kernel/events/core.c] |
| 13/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 8152 output: 40 thoughts: 247 Thoughts: **My Analysis of a Potential Race Condition in the Perf Event System** Okay, let's break this down. I see that `perf_event_set_state` is calling `perf_event_update_time` and that's immediately triggering a read of `event->tstamp` through `__perf_update_times`. This sequence happens within the context of `event_sched_out`. That's important. I know `event_sched_out` is called during a context switch – classic. Now, the write operation is happening within `perf_mmap_rb`. This is where my alarm bells start to ring. If I recall correctly, `perf_mmap_rb` is the function invoked when a user-space process maps a perf event file descriptor using `mmap`. So, basically someone calling `mmap` on the perf event file will trigger `perf_mmap_rb`. And `perf_event_set_state` is getting called when the event state changes, like when it’s scheduled out. That smells like a race. A read happens during a context switch and a write during a mmap. The critical question is, does the system have proper locking? Is something protecting the timestamp from a read-write conflict? Let's check. `perf_mmap_rb` is called from `perf_mmap`, which does take `event->mmap_mutex`. Okay, so `perf_mmap` takes `event->mmap_mutex`. That's at least a start. I need to see what protection is surrounding the read and write functions. I see that `event_sched_out` is called with the context lock held (and usually with interrupts disabled if it’s on the CPU), meaning there is at least some protection for the read. So `perf_mmap_rb` takes `event->mmap_mutex`, and `event_sched_out` takes the context lock. Those should probably be enough, but I still need to verify what those locks protect. I should really check `perf_mmap` and `perf_event_set_state` more closely, specifically how the mutexes are being acquired and released and how the read occurs. |
| 14/2 | 2026/02/06 07:00 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_mmap] Results: map[SourceCode:7123: static int perf_mmap(struct file *file, struct vm_area_struct *vma)
7124: {
7125: struct perf_event *event = file->private_data;
7126: unsigned long vma_size, nr_pages;
7127: mapped_f mapped;
7128: int ret;
7129:
7130: /*
7131: * Don't allow mmap() of inherited per-task counters. This would
7132: * create a performance issue due to all children writing to the
7133: * same rb.
7134: */
7135: if (event->cpu == -1 && event->attr.inherit)
7136: return -EINVAL;
7137:
7138: if (!(vma->vm_flags & VM_SHARED))
7139: return -EINVAL;
7140:
7141: ret = security_perf_event_read(event);
7142: if (ret)
7143: return ret;
7144:
7145: vma_size = vma->vm_end - vma->vm_start;
7146: nr_pages = vma_size / PAGE_SIZE;
7147:
7148: if (nr_pages > INT_MAX)
7149: return -ENOMEM;
7150:
7151: if (vma_size != PAGE_SIZE * nr_pages)
7152: return -EINVAL;
7153:
7154: scoped_guard (mutex, &event->mmap_mutex) {
7155: /*
7156: * This relies on __pmu_detach_event() taking mmap_mutex after marking
7157: * the event REVOKED. Either we observe the state, or __pmu_detach_event()
7158: * will detach the rb created here.
7159: */
7160: if (event->state <= PERF_EVENT_STATE_REVOKED)
7161: return -ENODEV;
7162:
7163: if (vma->vm_pgoff == 0)
7164: ret = perf_mmap_rb(vma, event, nr_pages);
7165: else
7166: ret = perf_mmap_aux(vma, event, nr_pages);
7167: if (ret)
7168: return ret;
7169: }
7170:
7171: /*
7172: * Since pinned accounting is per vm we cannot allow fork() to copy our
7173: * vma.
7174: */
7175: vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
7176: vma->vm_ops = &perf_mmap_vmops;
7177:
7178: mapped = get_mapped(event, event_mapped);
7179: if (mapped)
7180: mapped(event, vma->vm_mm);
7181:
7182: /*
7183: * Try to map it into the page table. On fail, invoke
7184: * perf_mmap_close() to undo the above, as the callsite expects
7185: * full cleanup in this case and therefore does not invoke
7186: * vmops::close().
7187: */
7188: ret = map_range(event->rb, vma);
7189: if (ret)
7190: perf_mmap_close(vma);
7191:
7192: return ret;
7193: }
SourceFile:kernel/events/core.c] |
| 15/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 9972 output: 34 thoughts: 862 Thoughts: **Analyzing Potential Race Conditions in Performance Events**
Okay, let's break this down. My initial observation is that `perf_mmap` utilizes `event->mmap_mutex`, which is a good starting point for understanding concurrency. Now, the flow seems to be that `event_sched_out` is called down the stack from `__perf_event_task_sched_out`, which itself is called from the context, right? The stack is `event_sched_out` -> `ctx_sched_out` -> `__pmu_ctx_sched_out` -> `group_sched_out`. `event_sched_out` then sets the event to inactive. The key here is that the context lock is held during `event_sched_out`, but *not* during `perf_mmap_rb`.
Now, the main concern here is around `perf_event_update_time`. I need to revisit that code snippet:
```c
static void perf_event_update_time(struct perf_event *event)
{
u64 now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled,
&event->total_time_running);
event->tstamp = now;
}
```
And then the `__perf_update_times` function:
```c
static __always_inline void
__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
{
enum perf_event_state state = __perf_effective_state(event);
u64 delta = now - event->tstamp;
*enabled = event->total_time_enabled;
if (state >= PERF_EVENT_STATE_INACTIVE)
*enabled += delta;
*running = event->total_time_running;
if (state >= PERF_EVENT_STATE_ACTIVE)
*running += delta;
}
```
My concern arises from the potential for concurrent calls to `perf_event_update_time`. If two threads call this function simultaneously, they could both read `event->tstamp`, compute their respective `delta` values, update `total_time_enabled` and `total_time_running` via pointers, and then write their calculated value of `now` to `event->tstamp`.
Specifically, `__perf_update_times` is performing a read-modify-write on `event->total_time_enabled` and `event->total_time_running`. This is essentially `event->total_time_enabled = event->total_time_enabled + delta;` and same with `total_time_running`. If no lock protects these operations, that's a classic race condition on `total_time_enabled`, `total_time_running`, *and* `tstamp`. It looks like the current report only mentions `tstamp`, but the others are also vulnerable. Let me check the offsets of those within the `struct perf_event` to confirm...
|
| 16/2 | 2026/02/06 07:00 | tool | codesearch-struct-layout |
0mArgs:map[ContextFile:kernel/events/core.c Name:perf_event] Results: map[Fields:[map[Name:event_entry OffsetBits:0 SizeBits:128] map[Name:sibling_list OffsetBits:128 SizeBits:128] map[Name:active_list OffsetBits:256 SizeBits:128] map[Name:group_node OffsetBits:384 SizeBits:192] map[Name:group_index OffsetBits:576 SizeBits:64] map[Name:migrate_entry OffsetBits:640 SizeBits:128] map[Name:hlist_entry OffsetBits:768 SizeBits:128] map[Name:active_entry OffsetBits:896 SizeBits:128] map[Name:nr_siblings OffsetBits:1024 SizeBits:32] map[Name:event_caps OffsetBits:1056 SizeBits:32] map[Name:group_caps OffsetBits:1088 SizeBits:32] map[Name:group_generation OffsetBits:1120 SizeBits:32] map[Name:group_leader OffsetBits:1152 SizeBits:64] map[Name:pmu OffsetBits:1216 SizeBits:64] map[Name:pmu_private OffsetBits:1280 SizeBits:64] map[Name:state OffsetBits:1344 SizeBits:32] map[Name:attach_state OffsetBits:1376 SizeBits:32] map[Name:count OffsetBits:1408 SizeBits:64] map[Name:child_count OffsetBits:1472 SizeBits:64] map[Name:total_time_enabled OffsetBits:1536 SizeBits:64] map[Name:total_time_running OffsetBits:1600 SizeBits:64] map[Name:tstamp OffsetBits:1664 SizeBits:64] map[Name:attr OffsetBits:1728 SizeBits:1152] map[Name:header_size OffsetBits:2880 SizeBits:16] map[Name:id_header_size OffsetBits:2896 SizeBits:16] map[Name:read_size OffsetBits:2912 SizeBits:16] map[Name:hw OffsetBits:2944 SizeBits:1664] map[Name:ctx OffsetBits:4608 SizeBits:64] map[Name:pmu_ctx OffsetBits:4672 SizeBits:64] map[Name:refcount OffsetBits:4736 SizeBits:64] map[Name:child_total_time_enabled OffsetBits:4800 SizeBits:64] map[Name:child_total_time_running OffsetBits:4864 SizeBits:64] map[Name:child_mutex OffsetBits:4928 SizeBits:256] map[Name:child_list OffsetBits:5184 SizeBits:128] map[Name:parent OffsetBits:5312 SizeBits:64] map[Name:oncpu OffsetBits:5376 SizeBits:32] map[Name:cpu OffsetBits:5408 SizeBits:32] map[Name:owner_entry OffsetBits:5440 SizeBits:128] map[Name:owner OffsetBits:5568 SizeBits:64] map[Name:mmap_mutex OffsetBits:5632 SizeBits:256] map[Name:mmap_count OffsetBits:5888 SizeBits:32] map[Name:rb OffsetBits:5952 SizeBits:64] map[Name:rb_entry OffsetBits:6016 SizeBits:128] map[Name:rcu_batches OffsetBits:6144 SizeBits:64] map[Name:rcu_pending OffsetBits:6208 SizeBits:32] map[Name:waitq OffsetBits:6272 SizeBits:192] map[Name:fasync OffsetBits:6464 SizeBits:64] map[Name:pending_wakeup OffsetBits:6528 SizeBits:32] map[Name:pending_kill OffsetBits:6560 SizeBits:32] map[Name:pending_disable OffsetBits:6592 SizeBits:32] map[Name:pending_addr OffsetBits:6656 SizeBits:64] map[Name:pending_irq OffsetBits:6720 SizeBits:256] map[Name:pending_disable_irq OffsetBits:6976 SizeBits:256] map[Name:pending_task OffsetBits:7232 SizeBits:128] map[Name:pending_work OffsetBits:7360 SizeBits:32] map[Name:event_limit OffsetBits:7392 SizeBits:32] map[Name:addr_filters OffsetBits:7424 SizeBits:192] map[Name:addr_filter_ranges OffsetBits:7616 SizeBits:64] map[Name:addr_filters_gen OffsetBits:7680 SizeBits:64] map[Name:aux_event OffsetBits:7744 SizeBits:64] map[Name:destroy OffsetBits:7808 SizeBits:64] map[Name:callback_head OffsetBits:7872 SizeBits:128] map[Name:ns OffsetBits:8000 SizeBits:64] map[Name:id OffsetBits:8064 SizeBits:64] map[Name:lost_samples OffsetBits:8128 SizeBits:64] map[Name:clock OffsetBits:8192 SizeBits:64] map[Name:overflow_handler OffsetBits:8256 SizeBits:64] map[Name:overflow_handler_context OffsetBits:8320 SizeBits:64] map[Name:prog OffsetBits:8384 SizeBits:64] map[Name:bpf_cookie OffsetBits:8448 SizeBits:64] map[Name:tp_event OffsetBits:8512 SizeBits:64] map[Name:filter OffsetBits:8576 SizeBits:64] map[Name:cgrp OffsetBits:8640 SizeBits:64] map[Name:security OffsetBits:8704 SizeBits:64] map[Name:sb_list OffsetBits:8768 SizeBits:128] map[Name:pmu_list OffsetBits:8896 SizeBits:128] map[Name:orig_type OffsetBits:9024 SizeBits:32]]] |
| 17/2 | 2026/02/06 07:00 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 13264 output: 21 thoughts: 1994 Thoughts: **Analysis of Potential Race Condition in Perf Event Timing** Okay, so I've been looking at this potential issue in the perf event code, and it seems like there's a definite race condition related to updating timing counters. The core problem revolves around the interaction between `perf_mmap_rb` and `event_sched_out`, specifically concerning the `tstamp`, `total_time_enabled`, and `total_time_running` fields. The initial observation is that `tstamp` (at offset 208 bytes), `total_time_enabled` (192 bytes), and `total_time_running` (200 bytes) are all adjacent in memory. Given the address `0xffff88812279f1a0` for `tstamp`, the adjacent addresses for the others are as shown. The concern stems from `perf_mmap_rb` potentially updating these values without holding the context lock, which is normally used to protect access to these event-related times. My initial thought was to see if `perf_mmap_rb` should have been holding this lock. It does hold `event->mmap_mutex` in `perf_mmap`, but not the context lock itself, and `perf_mmap_rb` calls `perf_event_update_time(event)`, so this bears further scrutiny. I had to look at `perf_event_update_time`. Typically, this function is called when event state transitions happen, like enabling, disabling, or scheduling. During these transitions, the context lock *is* held. However, in `perf_mmap_rb`, it appears to be called when a new ring buffer is being attached. The code comments indicate this is tied to initializing the userpage with the latest time data, but the lack of lock is a problem. The potential race becomes clearer when considering the interleaving of `perf_mmap_rb` and `event_sched_out`. If `perf_mmap_rb` updates the event times concurrently with an event being scheduled out, `total_time_enabled` and `total_time_running` could be updated incorrectly. The race specifically involves this sequence in `__perf_update_times`: reading `event->total_time_enabled`, conditionally incrementing a local variable (which points to the event's actual counter) by a delta, and writing the updated value back to the event structure. Even though `perf_mmap_rb` is only called once per ring buffer allocation, it *can* race with the scheduling path. If the kernel takes a snapshot of the event and updates its counters while the other process is updating the same values, the updates can be lost, causing incorrect timing data in the user-visible page. The next step was to confirm if `perf_event_update_time` *should* be protected. The answer appears to be yes; most callers of `perf_event_update_time` hold the context lock. I checked the callers of `perf_event_set_state` (which calls `perf_event_update_time` as a result) and noticed that `event_sched_out` is one of them. `perf_event_enable` and `perf_event_disable` also call it. Now, let's trace this possible race in action, if `perf_mmap_rb` and `event_sched_out` are running at the same time: `perf_event_time(event)` is called, so it gets the now-current time. If it is active, `delta = now - event->tstamp` and the local variable `event->total_time_enabled += delta;` is added. However, with concurrent execution, one thread's update is lost because of the read-modify-write pattern: * Thread A (sched_out): reads and calculates `delta_A` based on its own snapshot time, updates `event->total_time_enabled`. * Thread B (mmap): reads and calculates `delta_B` based on its snapshot time. Because it is later, the timing data should be later as well. Because of this interleaving, Thread B's update, in this example, can overwrite the value written by A, using an outdated `event->tstamp`. `perf_event_update_time` is the problem as it updates the counters: it is called by `perf_mmap_rb` to initialize the userpage. So, in summary, a real race exists here leading to inaccurate timing within perf events. This isn't a crash, but incorrect timing information is a bug, even if not a critical one. Although, `perf_mmap` and `perf_mmap_rb` are not so common as scheduling, |
| 18/2 | 2026/02/06 07:01 | tool | grepper |
0mArgs:map[Expression:perf_event_update_time] Results: map[Output:kernel/events/core.c=717=__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
--
kernel/events/core.c-730-
kernel/events/core.c:731:static void perf_event_update_time(struct perf_event *event)
kernel/events/core.c-732-{
--
kernel/events/core.c=740=static void perf_event_update_sibling_time(struct perf_event *leader)
--
kernel/events/core.c-744- for_each_sibling_event(sibling, leader)
kernel/events/core.c:745: perf_event_update_time(sibling);
kernel/events/core.c-746-}
--
kernel/events/core.c=749=perf_event_set_state(struct perf_event *event, enum perf_event_state state)
--
kernel/events/core.c-753-
kernel/events/core.c:754: perf_event_update_time(event);
kernel/events/core.c-755- /*
--
kernel/events/core.c=3580=static void __perf_event_sync_stat(struct perf_event *event,
--
kernel/events/core.c-3596-
kernel/events/core.c:3597: perf_event_update_time(event);
kernel/events/core.c-3598-
--
kernel/events/core.c=3973=static inline bool event_update_userpage(struct perf_event *event)
--
kernel/events/core.c-3977-
kernel/events/core.c:3978: perf_event_update_time(event);
kernel/events/core.c-3979- perf_event_update_userpage(event);
--
kernel/events/core.c=4668=static void __perf_event_read(void *info)
--
kernel/events/core.c-4688-
kernel/events/core.c:4689: perf_event_update_time(event);
kernel/events/core.c-4690- if (data->group)
--
kernel/events/core.c=4817=static int perf_event_read(struct perf_event *event, bool group)
--
kernel/events/core.c-4881-
kernel/events/core.c:4882: perf_event_update_time(event);
kernel/events/core.c-4883- if (group)
--
kernel/events/core.c=6978=static int perf_mmap_rb(struct vm_area_struct *vma, struct perf_event *event,
--
kernel/events/core.c-7036-
kernel/events/core.c:7037: perf_event_update_time(event);
kernel/events/core.c-7038- perf_event_init_userpage(event);
] |
| 19/2 | 2026/02/06 07:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 17092 output: 39 thoughts: 97 Thoughts: Here's my thought process, as it stands: **Diving into `perf_event_read` and its Context** Okay, I need to understand this read path. First, I'm going to look at `perf_event_read`. I see it updates the time with `perf_event_update_time`. Now, I recall that `perf_event_read` is called from `perf_read`. That's where I need to focus next. I wonder if the `perf_read` function is holding any kind of context lock? If it is, that's crucial for understanding the potential implications and contention around this read operation. I need to investigate `perf_read` now to confirm. |
| 20/2 | 2026/02/06 07:01 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_read] Results: map[SourceCode:6107: static ssize_t
6108: perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
6109: {
6110: struct perf_event *event = file->private_data;
6111: struct perf_event_context *ctx;
6112: int ret;
6113:
6114: ret = security_perf_event_read(event);
6115: if (ret)
6116: return ret;
6117:
6118: ctx = perf_event_ctx_lock(event);
6119: ret = __perf_read(event, buf, count);
6120: perf_event_ctx_unlock(event, ctx);
6121:
6122: return ret;
6123: }
SourceFile:kernel/events/core.c] |
| 21/2 | 2026/02/06 07:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 17647 output: 44 thoughts: 1878 Thoughts: **Analyzing a Potential Race Condition in Perf Event Time Tracking** Okay, so I'm digging into how `perf_read` and `perf_mmap` interact with performance event context locks and time updates. I know `perf_read` takes the context lock, which is expected. The question is, does `perf_mmap_rb` need it too? It's called from `perf_mmap`, which holds `event->mmap_mutex`, but not the context lock itself. That's the key point. Should `perf_mmap` *also* be taking the context lock? Looking at the code, in `perf_mmap`, it scopes around `event->mmap_mutex` and then calls `perf_mmap_rb`. If `perf_mmap_rb` subsequently calls `perf_event_update_time`, that should *absolutely* be protected. I'm starting to think this is where the vulnerability lies. The thing is, `perf_mmap_rb` might be executed on a different CPU than where the event is running. This means its timestamp and time-related counters are being updated asynchronously by the PMU interrupt or scheduler hooks. In reality, `perf_event_update_time` usually gets called on the CPU where the event is running, like when the event is being scheduled out, for example. However, `perf_mmap` can be called from *any* CPU. If `perf_mmap` calls `perf_event_update_time` on an event actively running elsewhere, that's a classic race condition. It's a read-modify-write on 64-bit values – even on x86_64, those aren't atomic, and that's the vulnerability. The potential damage? Incorrect values in `total_time_enabled` and `total_time_running`, ultimately leading to flawed statistics. It's not a crash, so some might consider it "benign," but I don't buy that. This is more than a simple counter. It's a state update involving multiple interdependent fields: `total_time_enabled`, `total_time_running`, and `tstamp`. They need to be consistent. For example: If `tstamp` is updated but `total_time_enabled` is not, the next call to `perf_event_update_time` will calculate the wrong delta, leading to a ripple effect of incorrect statistics. If one thread calculates a delta from an old timestamp, while another thread may have advanced the timestamp and total enabled value, the calculations will be off. Let's say we have an initial condition, `tstamp = 100`, `total_time_enabled = 1000`. Thread A schedules out at time 110. A calculates `delta = 10` and adds to `total_time_enabled`. It also writes the new timestamp. Then, Thread B, at time 115, reads the *old* `tstamp = 100`, calculates delta from *that* timestamp, reads the *old* `total_time_enabled`, and overwrites. It doesn't correctly capture A's work. This is a real problem. The counters will be inaccurate due to the racing of threads, so while it may not be a critical error for KCSAN, it's a bug in the code. I'm going to determine if the context lock should be taken within `perf_mmap_rb` or `perf_mmap`. I should probably also check the other call sites for `perf_event_update_time`, like in `perf_event_update_userpage`. Now, there are a number of calls that call that function. So it looks like `event_update_userpage` inlined is calling another version of the function (non-inline). It's going to be something I have to track down. |
| 22/2 | 2026/02/06 07:01 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_event_update_userpage] Results: map[SourceCode:6534: /*
6535: * Callers need to ensure there can be no nesting of this function, otherwise
6536: * the seqlock logic goes bad. We can not serialize this because the arch
6537: * code calls this from NMI context.
6538: */
6539: void perf_event_update_userpage(struct perf_event *event)
6540: {
6541: struct perf_event_mmap_page *userpg;
6542: struct perf_buffer *rb;
6543: u64 enabled, running, now;
6544:
6545: rcu_read_lock();
6546: rb = rcu_dereference(event->rb);
6547: if (!rb)
6548: goto unlock;
6549:
6550: /*
6551: * compute total_time_enabled, total_time_running
6552: * based on snapshot values taken when the event
6553: * was last scheduled in.
6554: *
6555: * we cannot simply called update_context_time()
6556: * because of locking issue as we can be called in
6557: * NMI context
6558: */
6559: calc_timer_values(event, &now, &enabled, &running);
6560:
6561: userpg = rb->user_page;
6562: /*
6563: * Disable preemption to guarantee consistent time stamps are stored to
6564: * the user page.
6565: */
6566: preempt_disable();
6567: ++userpg->lock;
6568: barrier();
6569: userpg->index = perf_event_index(event);
6570: userpg->offset = perf_event_count(event, false);
6571: if (userpg->index)
6572: userpg->offset -= local64_read(&event->hw.prev_count);
6573:
6574: userpg->time_enabled = enabled +
6575: atomic64_read(&event->child_total_time_enabled);
6576:
6577: userpg->time_running = running +
6578: atomic64_read(&event->child_total_time_running);
6579:
6580: arch_perf_update_userpage(event, userpg, now);
6581:
6582: barrier();
6583: ++userpg->lock;
6584: preempt_enable();
6585: unlock:
6586: rcu_read_unlock();
6587: }
SourceFile:kernel/events/core.c] |
| 23/2 | 2026/02/06 07:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 21196 output: 42 thoughts: 75 Thoughts: Okay, here's my thought process as I digest this information: **Tracing the Timer Values in `perf_event`** Hold on a second... I was under the impression that `perf_event_update_userpage` was the function that utilized `perf_event_update_time`, but my understanding is wrong. It looks like `perf_event_update_userpage` actually calls `calc_timer_values`. That's a key distinction I need to file away. But wait, I saw `perf_event_update_time` in the `grep` output when looking for `event_update_userpage`. Something isn't adding up. I need to get to the bottom of this. My grep must have been too broad! Okay, let me check the inline implementation of `event_update_userpage` again. I need to pinpoint where the timer values are actually being calculated and how `perf_event_update_time` fits into the picture. Perhaps they're related in some other way. |
| 24/2 | 2026/02/06 07:01 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:event_update_userpage] Results: map[SourceCode:3966: /*
3967: * Because the userpage is strictly per-event (there is no concept of context,
3968: * so there cannot be a context indirection), every userpage must be updated
3969: * when context time starts :-(
3970: *
3971: * IOW, we must not miss EVENT_TIME edges.
3972: */
3973: static inline bool event_update_userpage(struct perf_event *event)
3974: {
3975: if (likely(!refcount_read(&event->mmap_count)))
3976: return false;
3977:
3978: perf_event_update_time(event);
3979: perf_event_update_userpage(event);
3980:
3981: return true;
3982: }
SourceFile:kernel/events/core.c] |
| 25/2 | 2026/02/06 07:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 21800 output: 41 thoughts: 152 Thoughts: **Tracing the `perf_event_update_time` Call Chain** Okay, so I'm looking at how `perf_event_update_time` gets called. Right, the code says `event_update_userpage` does call `perf_event_update_time`, that's clear. But then I see it's linked to `ctx_sched_in` and `ctx_sched_out` indirectly, going through a cascade of functions: `__pmu_ctx_sched_out`, `group_sched_out`, `event_sched_out`, `perf_event_set_state`, and finally `perf_event_update_time`. I'm tracking this because I want to understand the timing implications. Wait a second...I'm seeing `event_sched_out` here calls `perf_event_set_state`, which *also* goes to `perf_event_update_time`. And *it* also calls `event_update_userpage`? Okay, that's not what I was expecting. I need to make sure I understand the call graph correctly. This looks like a potentially significant double-call situation. Right, time to dive deeper into `event_sched_out` and sort this out. |
| 26/2 | 2026/02/06 07:01 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:event_sched_out] Results: map[SourceCode:2358: static void
2359: event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
2360: {
2361: struct perf_event_pmu_context *epc = event->pmu_ctx;
2362: struct perf_cpu_pmu_context *cpc = this_cpc(epc->pmu);
2363: enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
2364:
2365: // XXX cpc serialization, probably per-cpu IRQ disabled
2366:
2367: WARN_ON_ONCE(event->ctx != ctx);
2368: lockdep_assert_held(&ctx->lock);
2369:
2370: if (event->state != PERF_EVENT_STATE_ACTIVE)
2371: return;
2372:
2373: /*
2374: * Asymmetry; we only schedule events _IN_ through ctx_sched_in(), but
2375: * we can schedule events _OUT_ individually through things like
2376: * __perf_remove_from_context().
2377: */
2378: list_del_init(&event->active_list);
2379:
2380: perf_pmu_disable(event->pmu);
2381:
2382: event->pmu->del(event, 0);
2383: event->oncpu = -1;
2384:
2385: if (event->pending_disable) {
2386: event->pending_disable = 0;
2387: perf_cgroup_event_disable(event, ctx);
2388: state = PERF_EVENT_STATE_OFF;
2389: }
2390:
2391: perf_event_set_state(event, state);
2392:
2393: if (!is_software_event(event))
2394: cpc->active_oncpu--;
2395: if (is_event_in_freq_mode(event)) {
2396: ctx->nr_freq--;
2397: epc->nr_freq--;
2398: }
2399: if (event->attr.exclusive || !cpc->active_oncpu)
2400: cpc->exclusive = 0;
2401:
2402: perf_pmu_enable(event->pmu);
2403: }
SourceFile:kernel/events/core.c] |
| 27/2 | 2026/02/06 07:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 22983 output: 41 thoughts: 825 Thoughts: **Analyzing Time Updates in the Perf Event System** Okay, so I'm digging into how `perf_event_update_time` is handled, specifically in relation to potential race conditions. It seems that `event_sched_out` calls `perf_event_set_state`, which in turn calls `perf_event_update_time`. Got it. Now, I'm revisiting `perf_mmap_rb`. I see that it calls `perf_event_update_time` *and* `perf_event_update_userpage`. But, I realize, the non-inline version of `perf_event_update_userpage` calls `calc_timer_values`, while the inline version ( `event_update_userpage`) does call `perf_event_update_time`. Looking at the code for `perf_mmap_rb`, it explicitly calls the *non-inline* `perf_event_update_userpage`. So, `perf_mmap_rb` *does* explicitly call `perf_event_update_time` directly. This is where the problem lies. If `perf_mmap_rb` is racing with `event_sched_out`, it's because it doesn't hold the context lock. The question then becomes: can we safely take the context lock within `perf_mmap_rb`? `perf_mmap_rb` is called from `perf_mmap`, which is called from the `mmap` syscall. So, yeah, it *can* take the context lock. The function `perf_event_ctx_lock(event)` takes the context lock and handles other things. I need to see if there are other places where `perf_event_update_time` is called without the context lock. Let's see what the grep turned up: * `perf_mmap_rb` (line 7037): No context lock, which is the problem I'm trying to solve. * `perf_event_read` (line 4882): The context lock *is* held by `perf_read`. That's good. * `__perf_event_read` (line 4689): Called via `smp_call_function_single`. It's running on the CPU where the event is running. This synchronization is also likely fine. * `__perf_event_sync_stat` (line 3597): Called from `perf_event_terminate_to_free` and `perf_event_exit_task_context`. Those usually involve some synchronization. So, it's confirmed. `perf_mmap_rb` is missing the context lock and is, therefore, potentially unsafe when calling `perf_event_update_time`. Now, I see a crucial comment in the non-inline `perf_event_update_userpage`: ```c /* * compute total_time_enabled, total_time_running * based on snapshot values taken when the event * was last scheduled in. * * we cannot simply called update_context_time() * because of locking issue as we can be called in * NMI context */ calc_timer_values(event, &now, &enabled, &running); ``` This is key! `calc_timer_values` appears to be a safe way to get the time values without actually updating them, avoiding the locking issues. We can call calc_timer_values, and then that might get us what we need. I should check the implementation of `calc_timer_values` to verify that assumption. |
| 28/2 | 2026/02/06 07:01 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:calc_timer_values] Results: map[SourceCode:4723: static void calc_timer_values(struct perf_event *event,
4724: u64 *now,
4725: u64 *enabled,
4726: u64 *running)
4727: {
4728: u64 ctx_time;
4729:
4730: *now = perf_clock();
4731: ctx_time = perf_event_time_now(event, *now);
4732: __perf_update_times(event, ctx_time, enabled, running);
4733: }
SourceFile:kernel/events/core.c] |
| 29/2 | 2026/02/06 07:01 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 24830 output: 44 thoughts: 585 Thoughts: **Analyzing the Time Update in Perf Event Ring Buffer Setup** Okay, so I'm digging into how the timing information is handled when a performance event's ring buffer is mapped, specifically in `perf_mmap_rb`. I see that `calc_timer_values` is called, which in turn calls `__perf_update_times`. But, and this is the crux, it doesn't directly update the `event->total_time_enabled` field, instead relying on pointers to local variables that are then potentially discarded. That seems odd, but then I remember that `perf_event_update_time` *does* call `__perf_update_times` and *does* update `event->total_time_enabled` and `event->tstamp`. Now, in `perf_mmap_rb`, we have this sequence: `perf_event_update_time(event)`, followed by `perf_event_init_userpage(event)` and then `perf_event_update_userpage(event)`. My initial thought is that if the primary goal was to update the userpage, just calling `perf_event_update_userpage(event)` directly would have been sufficient since it calls `calc_timer_values` and so `__perf_update_times`. Why the extra call to `perf_event_update_time`? Perhaps there's a need to "checkpoint" the time at ring buffer attachment, but I'm not convinced yet. When `perf_mmap_rb` is called, we are effectively setting up the ring buffer, and the userpage will show the accumulated times. If we *didn't* call `perf_event_update_time`, the `total_time_enabled` in the `event` structure might be stale, reflecting the last time the event was scheduled in or out. But here is the thing, `perf_event_update_userpage` uses `calc_timer_values`, which then uses the current timestamp and compares it to the saved `tstamp`. Ah, that's right. If `perf_event_update_time` *isn't* called, `event->tstamp` remains unchanged. `calc_timer_values` will then use that potentially older `tstamp` to calculate the delta, and so the userpage should still reflect the correct current accumulated time. So, why is `perf_event_update_time(event)` being called in `perf_mmap_rb`? It feels redundant, or at least, the reason is not immediately obvious. It was added at some point, so I'm going to have to dig around a bit to figure out the history of this. No, I don't have git history, but I can look at the surrounding code to try and divine the reason. The key question: if I *remove* that call to `perf_event_update_time(event)` from `perf_mmap_rb`, will things break? `perf_event_init_userpage` and `perf_event_update_userpage` follow it in the code path. Let's see if there is something in `perf_event_init_userpage` that necessitates it. |
| 30/2 | 2026/02/06 07:02 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_event_init_userpage] Results: map[SourceCode:6507: static void perf_event_init_userpage(struct perf_event *event)
6508: {
6509: struct perf_event_mmap_page *userpg;
6510: struct perf_buffer *rb;
6511:
6512: rcu_read_lock();
6513: rb = rcu_dereference(event->rb);
6514: if (!rb)
6515: goto unlock;
6516:
6517: userpg = rb->user_page;
6518:
6519: /* Allow new userspace to detect that bit 0 is deprecated */
6520: userpg->cap_bit0_is_deprecated = 1;
6521: userpg->size = offsetof(struct perf_event_mmap_page, __reserved);
6522: userpg->data_offset = PAGE_SIZE;
6523: userpg->data_size = perf_data_size(rb);
6524:
6525: unlock:
6526: rcu_read_unlock();
6527: }
SourceFile:kernel/events/core.c] |
| 31/2 | 2026/02/06 07:02 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 26486 output: 39 thoughts: 4379 Thoughts: **Data Race Analysis: `perf_event_update_time` in `perf_mmap_rb`** Alright, let's break this down. It seems we have a data race issue involving `perf_event_update_time` within `perf_mmap_rb`. The crux of the problem is that `perf_mmap_rb` calls this function, which updates the event's internal timing state, specifically `total_time_enabled` and `tstamp`. The concern stems from the possibility of a race condition with `event_sched_out`, another function that can also call `perf_event_update_time` to update the same fields. The initial thought is that the primary purpose of `perf_event_update_time` is to update internal state, not really timekeeping as the function name might suggest. The report shows a change in `tstamp` which is indicative that this function call is happening, and is the key to understanding the race. The real danger here is the read-modify-write operation on `total_time_enabled`. If two CPUs are updating this value concurrently, one update will be lost. This is a classic data race that can lead to incorrect statistics. Even though an initial thought is that this is not a memory corruption, it is a logic error that breaks the integrity of the collected statistics and thus is a bug. Looking at the code, `perf_event_update_time` itself doesn't have internal synchronization. It relies on the caller to provide that. It turns out that `perf_mmap_rb` is the odd one out. It appears to be an unprotected call to `perf_event_update_time`. The next question is whether `perf_mmap_rb` is called when events are inactive, but it's not the case because `perf_mmap` can be called anytime after the event's creation, and the event can indeed be active. The report says the timing issue can happen, but this is an issue. Now, considering the impact: slight inaccuracies in `total_time_enabled` translate directly into inaccuracies in `perf` output. While this might be negligible for some users, it is critical for high-precision profiling. The report also highlights a change in `tstamp` suggesting data race. It's evident that `perf_event_update_time` is *meant* to be called under a lock. The use of `WRITE_ONCE` for the state, while not using similar methods for the time-related fields, underscores this expectation. The call in `perf_mmap_rb` is redundant, and because it is unprotected, it's a bug. The question then becomes: is it a "harmful" bug? From a practical perspective, it corrupts the timing data. This impacts the accuracy of performance counters, which is the core function of the `perf` subsystem. `data_race()` is not used here, suggesting this race isn't deemed harmless. Further investigation reveals that while `perf_mmap` *does* hold `event->mmap_mutex`, protecting against multiple mmaps, it *doesn't* protect against `event_sched_out`. This is where the race happens. It is possible that the event is being scheduled out, and the call to `perf_poll` is happening on the `mmap_mutex` as well, making this race even clearer. Therefore, the `perf_event_update_time` call within `perf_mmap_rb` represents a data race, a bug that can lead to corrupted performance data, thus negatively impacting the accuracy and reliability of the `perf` tool. |
| 32/2 | 2026/02/06 07:02 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_poll] Results: map[SourceCode:6125: static __poll_t perf_poll(struct file *file, poll_table *wait)
6126: {
6127: struct perf_event *event = file->private_data;
6128: struct perf_buffer *rb;
6129: __poll_t events = EPOLLHUP;
6130:
6131: if (event->state <= PERF_EVENT_STATE_REVOKED)
6132: return EPOLLERR;
6133:
6134: poll_wait(file, &event->waitq, wait);
6135:
6136: if (event->state <= PERF_EVENT_STATE_REVOKED)
6137: return EPOLLERR;
6138:
6139: if (is_event_hup(event))
6140: return events;
6141:
6142: if (unlikely(READ_ONCE(event->state) == PERF_EVENT_STATE_ERROR &&
6143: event->attr.pinned))
6144: return EPOLLERR;
6145:
6146: /*
6147: * Pin the event->rb by taking event->mmap_mutex; otherwise
6148: * perf_event_set_output() can swizzle our rb and make us miss wakeups.
6149: */
6150: mutex_lock(&event->mmap_mutex);
6151: rb = event->rb;
6152: if (rb)
6153: events = atomic_xchg(&rb->poll, 0);
6154: mutex_unlock(&event->mmap_mutex);
6155: return events;
6156: }
SourceFile:kernel/events/core.c] |
| 33/2 | 2026/02/06 07:02 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 32185 output: 41 thoughts: 2282 Thoughts: **Race Condition Analysis in Perf Event mmap** Alright, let's break this down. Task 12005 is in `perf_poll`, and it's calling `mutex_lock` on `event->mmap_mutex`. If Task 12011 is already holding that mutex in `perf_mmap`, Task 12005 blocks and calls `schedule()`. Now, `schedule()` calls `prepare_task_switch` which then calls `perf_event_task_sched_out`, and finally, we call `event_sched_out`. The key point is: if the event Task 12011 is mmapping is also in Task 12005's context, `event_sched_out` is called for it. Wait a second, if Task 12005 is being scheduled *out*, that means its events were active, and `event_sched_out` deactivates them. Now `event_sched_out` is calling `perf_event_set_state` which then calls `perf_event_update_time`. This is where it gets interesting because this could be racing with `perf_event_update_time` in `perf_mmap_rb` which is being called by Task 12011. Okay, this is a clear race. Is it harmful? Well, it will lead to inaccurate timing counters, which is generally a bug in the kernel. Is it "benign" though? I'm leaning toward "harmful," because this isn't just a simple counter increment, it's a read-modify-write on multiple fields that should be consistent. So, the worst-case scenario is that `total_time_enabled` and `total_time_running` are slightly wrong, which won't cause a crash, nor does it cause any memory corruption or deadlock. But `perf_event_set_state` also updates `event->state`, but that is protected by `WRITE_ONCE`. So the race is purely on timing fields. Let's see, Is `perf_event_update_time` called anywhere else without the context lock? Nope, `perf_mmap_rb` seems to be the only place. But wait, I recall seeing some patches regarding `perf_mmap` and context locks... Let's look again at `perf_mmap`. Okay, so there's `scoped_guard (mutex, &event->mmap_mutex)`. If there was a context lock here too, it would be safe. But, the context lock is a spinlock and `mmap_mutex` is a mutex, which could cause tricky lock ordering issues. And, `perf_event_ctx_lock` returns the context and holds the lock. But `perf_mmap_rb` calls `rb_alloc` which might sleep, so the context lock can't be held there. If `perf_mmap_rb` can't hold the context lock, then `perf_event_update_time` should *not* be called from it, unless it's safe. Could we make `perf_event_update_time` safe somehow? Maybe by using atomics for `total_time_enabled` and `tstamp`? But then you'd have to update them together, and that's not how it's implemented. Actually, `perf_event_update_userpage` uses a seqlock-like mechanism, but that's for the userpage, not the internal `perf_event` fields. So, if `perf_mmap_rb` only wants to update the userpage, it should just call `perf_event_update_userpage(event)`. Wait, it DOES call it: there's `perf_event_update_time`, then `perf_event_init_userpage`, then `perf_event_update_userpage`. What if we remove the `perf_event_update_time` call? `perf_event_update_userpage` calls `calc_timer_values`, and that calls `__perf_update_times` with *local* variables. The "real" current values are calculated using `event->tstamp` and `event->total_time_enabled`, and then those values are written to the userpage. This is safe, because it doesn't update the internal fields. So, why was `perf_event_update_time(event)` added there? Maybe to update internal fields, but if the event is running, the internal fields are always stale until the next `sched_out` or `read()`. That's why `calc_timer_values` is used to compute the accurate current time. So calling `perf |
| 34/2 | 2026/02/06 07:02 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:kernel/events/core.c IncludeLines:true Name:perf_event_alloc] Results: map[SourceCode:12924: /*
12925: * Allocate and initialize an event structure
12926: */
12927: static struct perf_event *
12928: perf_event_alloc(struct perf_event_attr *attr, int cpu,
12929: struct task_struct *task,
12930: struct perf_event *group_leader,
12931: struct perf_event *parent_event,
12932: perf_overflow_handler_t overflow_handler,
12933: void *context, int cgroup_fd)
12934: {
12935: struct pmu *pmu;
12936: struct hw_perf_event *hwc;
12937: long err = -EINVAL;
12938: int node;
12939:
12940: if ((unsigned)cpu >= nr_cpu_ids) {
12941: if (!task || cpu != -1)
12942: return ERR_PTR(-EINVAL);
12943: }
12944: if (attr->sigtrap && !task) {
12945: /* Requires a task: avoid signalling random tasks. */
12946: return ERR_PTR(-EINVAL);
12947: }
12948:
12949: node = (cpu >= 0) ? cpu_to_node(cpu) : -1;
12950: struct perf_event *event __free(__free_event) =
12951: kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO, node);
12952: if (!event)
12953: return ERR_PTR(-ENOMEM);
12954:
12955: /*
12956: * Single events are their own group leaders, with an
12957: * empty sibling list:
12958: */
12959: if (!group_leader)
12960: group_leader = event;
12961:
12962: mutex_init(&event->child_mutex);
12963: INIT_LIST_HEAD(&event->child_list);
12964:
12965: INIT_LIST_HEAD(&event->event_entry);
12966: INIT_LIST_HEAD(&event->sibling_list);
12967: INIT_LIST_HEAD(&event->active_list);
12968: init_event_group(event);
12969: INIT_LIST_HEAD(&event->rb_entry);
12970: INIT_LIST_HEAD(&event->active_entry);
12971: INIT_LIST_HEAD(&event->addr_filters.list);
12972: INIT_HLIST_NODE(&event->hlist_entry);
12973: INIT_LIST_HEAD(&event->pmu_list);
12974:
12975:
12976: init_waitqueue_head(&event->waitq);
12977: init_irq_work(&event->pending_irq, perf_pending_irq);
12978: event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
12979: init_task_work(&event->pending_task, perf_pending_task);
12980:
12981: mutex_init(&event->mmap_mutex);
12982: raw_spin_lock_init(&event->addr_filters.lock);
12983:
12984: atomic_long_set(&event->refcount, 1);
12985: event->cpu = cpu;
12986: event->attr = *attr;
12987: event->group_leader = group_leader;
12988: event->pmu = NULL;
12989: event->oncpu = -1;
12990:
12991: event->parent = parent_event;
12992:
12993: event->ns = get_pid_ns(task_active_pid_ns(current));
12994: event->id = atomic64_inc_return(&perf_event_id);
12995:
12996: event->state = PERF_EVENT_STATE_INACTIVE;
12997:
12998: if (parent_event)
12999: event->event_caps = parent_event->event_caps;
13000:
13001: if (task) {
13002: event->attach_state = PERF_ATTACH_TASK;
13003: /*
13004: * XXX pmu::event_init needs to know what task to account to
13005: * and we cannot use the ctx information because we need the
13006: * pmu before we get a ctx.
13007: */
13008: event->hw.target = get_task_struct(task);
13009: }
13010:
13011: event->clock = &local_clock;
13012: if (parent_event)
13013: event->clock = parent_event->clock;
13014:
13015: if (!overflow_handler && parent_event) {
13016: overflow_handler = parent_event->overflow_handler;
13017: context = parent_event->overflow_handler_context;
13018: #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
13019: if (parent_event->prog) {
13020: struct bpf_prog *prog = parent_event->prog;
13021:
13022: bpf_prog_inc(prog);
13023: event->prog = prog;
13024: }
13025: #endif
13026: }
13027:
13028: if (overflow_handler) {
13029: event->overflow_handler = overflow_handler;
13030: event->overflow_handler_context = context;
13031: } else if (is_write_backward(event)){
13032: event->overflow_handler = perf_event_output_backward;
13033: event->overflow_handler_context = NULL;
13034: } else {
13035: event->overflow_handler = perf_event_output_forward;
13036: event->overflow_handler_context = NULL;
13037: }
13038:
13039: perf_event__state_init(event);
13040:
13041: pmu = NULL;
13042:
13043: hwc = &event->hw;
13044: hwc->sample_period = attr->sample_period;
13045: if (is_event_in_freq_mode(event))
13046: hwc->sample_period = 1;
13047: hwc->last_period = hwc->sample_period;
13048:
13049: local64_set(&hwc->period_left, hwc->sample_period);
13050:
13051: /*
13052: * We do not support PERF_SAMPLE_READ on inherited events unless
13053: * PERF_SAMPLE_TID is also selected, which allows inherited events to
13054: * collect per-thread samples.
13055: * See perf_output_read().
13056: */
13057: if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID))
13058: return ERR_PTR(-EINVAL);
13059:
13060: if (!has_branch_stack(event))
13061: event->attr.branch_sample_type = 0;
13062:
13063: pmu = perf_init_event(event);
13064: if (IS_ERR(pmu))
13065: return (void*)pmu;
13066:
13067: /*
13068: * The PERF_ATTACH_TASK_DATA is set in the event_init()->hw_config().
13069: * The attach should be right after the perf_init_event().
13070: * Otherwise, the __free_event() would mistakenly detach the non-exist
13071: * perf_ctx_data because of the other errors between them.
13072: */
13073: if (event->attach_state & PERF_ATTACH_TASK_DATA) {
13074: err = attach_perf_ctx_data(event);
13075: if (err)
13076: return ERR_PTR(err);
13077: }
13078:
13079: /*
13080: * Disallow uncore-task events. Similarly, disallow uncore-cgroup
13081: * events (they don't make sense as the cgroup will be different
13082: * on other CPUs in the uncore mask).
13083: */
13084: if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1))
13085: return ERR_PTR(-EINVAL);
13086:
13087: if (event->attr.aux_output &&
13088: (!(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT) ||
13089: event->attr.aux_pause || event->attr.aux_resume))
13090: return ERR_PTR(-EOPNOTSUPP);
13091:
13092: if (event->attr.aux_pause && event->attr.aux_resume)
13093: return ERR_PTR(-EINVAL);
13094:
13095: if (event->attr.aux_start_paused) {
13096: if (!(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE))
13097: return ERR_PTR(-EOPNOTSUPP);
13098: event->hw.aux_paused = 1;
13099: }
13100:
13101: if (cgroup_fd != -1) {
13102: err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
13103: if (err)
13104: return ERR_PTR(err);
13105: }
13106:
13107: err = exclusive_event_init(event);
13108: if (err)
13109: return ERR_PTR(err);
13110:
13111: if (has_addr_filter(event)) {
13112: event->addr_filter_ranges = kcalloc(pmu->nr_addr_filters,
13113: sizeof(struct perf_addr_filter_range),
13114: GFP_KERNEL);
13115: if (!event->addr_filter_ranges)
13116: return ERR_PTR(-ENOMEM);
13117:
13118: /*
13119: * Clone the parent's vma offsets: they are valid until exec()
13120: * even if the mm is not shared with the parent.
13121: */
13122: if (event->parent) {
13123: struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
13124:
13125: raw_spin_lock_irq(&ifh->lock);
13126: memcpy(event->addr_filter_ranges,
13127: event->parent->addr_filter_ranges,
13128: pmu->nr_addr_filters * sizeof(struct perf_addr_filter_range));
13129: raw_spin_unlock_irq(&ifh->lock);
13130: }
13131:
13132: /* force hw sync on the address filters */
13133: event->addr_filters_gen = 1;
13134: }
13135:
13136: if (!event->parent) {
13137: if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
13138: err = get_callchain_buffers(attr->sample_max_stack);
13139: if (err)
13140: return ERR_PTR(err);
13141: event->attach_state |= PERF_ATTACH_CALLCHAIN;
13142: }
13143: }
13144:
13145: err = security_perf_event_alloc(event);
13146: if (err)
13147: return ERR_PTR(err);
13148:
13149: /* symmetric to unaccount_event() in _free_event() */
13150: account_event(event);
13151:
13152: /*
13153: * Event creation should be under SRCU, see perf_pmu_unregister().
13154: */
13155: lockdep_assert_held(&pmus_srcu);
13156: scoped_guard (spinlock, &pmu->events_lock)
13157: list_add(&event->pmu_list, &pmu->events);
13158:
13159: return_ptr(event);
13160: }
SourceFile:kernel/events/core.c] |
| 35/2 | 2026/02/06 07:02 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 39427 output: 19 thoughts: 2457 Thoughts: **Analysis of a Potential Data Race in Perf Event Handling**
Okay, here's what I've got. I'm looking at a section of the perf event code and immediately see a potential problem. I noticed `tstamp` isn't initialized explicitly, but then I remember that `kmem_cache_alloc_node` with `__GFP_ZERO` should take care of that, setting it to zero initially. My next thought is, when does `tstamp` actually get a valid timestamp? I figure it's probably when an event is enabled or scheduled.
Now, the code calls `perf_mmap_rb`, and that suggests the event might have been running for a bit. Bingo, I've got a potential race. I start to weigh if it's benign or harmful. I need to re-familiarize myself with what 'benign' means in this context. Ah, "statistics counters and flags". `total_time_enabled` is a statistics counter and `tstamp` is used for updating that counter. If it was just an `atomic64_add`, maybe it could be benign. But it's not. I'm seeing a 64-bit value updated via Read-Modify-Write (RMW), which raises a red flag.
Wait, I need to check upstream. Okay, I don't have search, but I see some recent code. I see a `scoped_guard (mutex, &event->mmap_mutex)` in `perf_mmap`. This is fairly recent code. If I look at `perf_event_update_time` again, it's NOT using `data_race()` or anything. This is probably a real bug that's either just been discovered or is still there. This is a classic race between `mmap` and `sched_out` on the same event. It's an interesting sequence of events.
If Task A is mmap'ing an event that Task B is using, and Task B gets scheduled out because it's waiting for the same `mmap_mutex` that Task A is holding... that is very specific and unusual.
I see Task 12005 (Task B) is calling `perf_poll` which acquires `event->mmap_mutex`. Task 12011 (Task A) calls `perf_mmap` which also acquires `event->mmap_mutex`.
If Task A holds the mutex, Task B blocks, gets scheduled out, and its events are also scheduled out. This means `event_sched_out` is called for the event being mmap'ed by Task A, calling `perf_event_update_time`. Task A also calls `perf_event_update_time` from `perf_mmap_rb`.
It's a race, plain and simple.
This looks like a *harmful* race. The `mmap_mutex` is supposed to protect the ring buffer, but it doesn't protect the event timing fields. Those are protected by the context lock. However, `perf_mmap_rb` doesn't take the context lock. Logic errors in synchronization usually translate into a harmful race. It can lead to inconsistent state and lost updates of the `total_time_enabled` counter.
I'm confident it's harmful, but I'll make sure there's no way this race is benign. No crash, no corruption, and hopefully no security implications. But incorrect results? Yes, absolutely. In the kernel, some statistics races are okay if they are statistics only, but this is different. This is part of the core timing mechanism of perf. And other KCSAN results, if they can be fixed, are usually harmful. This can potentially be fixed by locking.
Taking the context lock in `perf_mmap_rb` is hard because it's a spinlock, but `perf_mmap` can do it. `perf_mmap` could take the context lock, call `perf_event_update_time`, and then call `perf_mmap_rb`. Wait, no, `perf_mmap` already handles that.
Okay, let me check the code again... `perf_mmap` includes:
```c
7154: scoped_guard (mutex, &event->mmap_mutex) {
...
7164: ret = perf_mmap_rb(vma, event, nr_pages);
```
I look at `perf_event_set_state` - it gets called from many places. It's the `perf_mmap_rb` that's the culprit.
Is there locking in `perf_event_update_time`?
```c
731: static void perf
|
| 36/2 | 2026/02/06 07:03 | tool | set-results |
0mArgs:map[Benign:false Confident:true] Results: map[Benign:false Confident:true] |
| 37/2 | 2026/02/06 07:03 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 42945 output: 694 thoughts: 233 Thoughts: Here's my understanding of the situation, distilled for you: **My Analysis of the Timing Race in the Perf Subsystem** Okay, so the user wants a concise overview of this perf issue. Right, let me break it down: Essentially, we've got a race condition affecting timing accuracy. It centers around the interaction between `event->tstamp`, `event->total_time_enabled`, and `event->total_time_running`, specifically when these are updated. The race occurs because these timing counters get tweaked using non-atomic read-modify-write operations, making the sequence vulnerable. The race specifically happens in the window of execution time between `perf_mmap_rb` (triggered during `mmap`) and `event_sched_out` (which is called during a context switch). The key is that `perf_mmap_rb` holds `event->mmap_mutex`, while `event_sched_out` holds `ctx->lock`. These are *different* locks, and that's the vulnerability. Because they're different locks, we can get an update lost. The contention results in lost updates to those timing counters, leading to incorrect timing data for the perf event. This is a functional bug in the perf subsystem – the timing is simply inaccurate. The race is especially likely because `perf_poll` can block on `mmap_mutex`. This blocking behavior opens the door for the task to be scheduled out. Then, *boom*, `event_sched_out` gets called on the same event, and the race condition is triggered. |
| Total Calls | Total Tokens | Avg Tokens | Total Duration (Seconds) | Avg Duration (Seconds) |
|---|
| Total Calls | Total Duration (Seconds) | Avg Duration (Seconds) |
|---|