syzbot

ID	Workflow	Result	Correct	Bug	Created	Started	Finished	Revision	Error
058e6d8f-794d-47ca-a691-77776660c17e	assessment-kcsan	Benign: ✅ Confident: ✅	❓	KCSAN: data-race in hrtimer_interrupt / print_tickdevice (16)	2026/02/25 03:30	2026/02/25 03:30	2026/02/25 03:32	305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9

The data race occurs between `hrtimer_interrupt()` and `print_tickdevice()`.

The write access is in `hrtimer_interrupt()` (at `kernel/time/hrtimer.c:1882` in the provided source), where `dev->next_event` is set to `KTIME_MAX` at the beginning of the high-resolution timer interrupt handler. Later in the same function, it is updated again via `tick_program_event()` to the next scheduled expiry time.

The read access is in `print_tickdevice()` (at `kernel/time/timer_list.c:206`), which is called when a user reads the `/proc/timer_list` file. This function iterates over the system's tick devices and prints their status, including the `next_event` field.

Analysis:
1. **Field Type**: The field being accessed is `dev->next_event`, which is of type `ktime_t`. In modern kernels, `ktime_t` is a 64-bit signed integer (`s64`).
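2. **Atomicity**: On 64-bit systems (like the x86_64 system in the report), naturally aligned 8-byte loads and stores are performed by the hardware as single accesses. The `struct clock_event_device` is cacheline-aligned, and the `next_event` field is at an 8-byte aligned offset within the struct. Thus, with a conservative compiler that does not split or fuse plain accesses, the read and write are effectively atomic and the reader cannot observe a torn value.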
3. **Context**: `hrtimer_interrupt()` runs in interrupt context on one CPU, while `print_tickdevice()` runs in task context (reading `/proc/timer_list`) potentially on another CPU.
4. **Impact**: The `next_event` field is used by the timer subsystem to track the next programmed event. `print_tickdevice()` uses it solely for displaying information to the user for debugging purposes. A data race here means that the user might see a slightly stale value or `KTIME_MAX` if the read happens exactly during an interrupt. This does not affect the actual operation of the timers or the stability of the kernel.
5. **Synchronization**: The `timer_list` interface is known to perform lockless reads of various timer-related fields to avoid the overhead and potential deadlocks of taking global or per-CPU locks while generating a potentially large amount of output.
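The atomicity and impact points above can be illustrated with a small userspace sketch. It is not kernel code: `demo_clock_event_device`, the `DEMO_*` macros, and all `demo_*` functions are hypothetical stand-ins, the struct layout is invented for illustration, and the `volatile` casts only approximate what the kernel's `READ_ONCE()`/`WRITE_ONCE()` do. It assumes GCC or Clang (for `__typeof__` and `__attribute__`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's ktime_t (s64) and KTIME_MAX. */
typedef int64_t demo_ktime_t;
#define DEMO_KTIME_MAX INT64_MAX
static_assert(sizeof(demo_ktime_t) == 8, "ktime_t stand-in must be 64-bit");

/* Userspace approximations of WRITE_ONCE()/READ_ONCE(): one volatile access. */
#define DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define DEMO_READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, cut-down device struct; the real layout differs. */
struct demo_clock_event_device {
	uint64_t placeholder;        /* some preceding 8-byte field */
	demo_ktime_t next_event;     /* the raced field */
} __attribute__((aligned(64)));  /* cacheline-aligned, as in the analysis */

/* Nonzero if next_event sits at a naturally aligned (8-byte) offset. */
int demo_next_event_naturally_aligned(void)
{
	return offsetof(struct demo_clock_event_device, next_event) % 8 == 0 &&
	       _Alignof(struct demo_clock_event_device) >= 8;
}

/* Writer side, step 1: the interrupt handler parks the field at the maximum. */
void demo_irq_entry(struct demo_clock_event_device *dev)
{
	DEMO_WRITE_ONCE(dev->next_event, DEMO_KTIME_MAX);
}

/* Writer side, step 2: the next expiry is programmed later in the handler. */
void demo_program_event(struct demo_clock_event_device *dev, demo_ktime_t expires)
{
	DEMO_WRITE_ONCE(dev->next_event, expires);
}

/* Reader side: a display-only snapshot, as a /proc reader would take. */
demo_ktime_t demo_print_read(struct demo_clock_event_device *dev)
{
	return DEMO_READ_ONCE(dev->next_event);
}
```

A reader that runs between the two writer steps sees `DEMO_KTIME_MAX`; one that runs before or after sees an ordinary expiry time. In either case the value is a whole 64-bit quantity that was actually stored, which is the property the benign verdict rests on.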

Conclusion:
The data race is benign: it only affects a debug/statistics output (`/proc/timer_list`) and involves a naturally aligned 64-bit field whose plain accesses are effectively atomic on x86_64. It does not lead to memory corruption, crashes, or incorrect logic in the timer subsystem.

==================================================================
BUG: KCSAN: data-race in hrtimer_interrupt / print_tickdevice

write to 0xffff888237d17b58 of 8 bytes by interrupt on cpu 1:
 hrtimer_interrupt+0x84/0x810 kernel/time/hrtimer.c:1882
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1045 [inline]
 __sysvec_apic_timer_interrupt+0x5f/0x1d0 arch/x86/kernel/apic/apic.c:1062
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1056 [inline]
 sysvec_apic_timer_interrupt+0x6f/0x80 arch/x86/kernel/apic/apic.c:1056
 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
 lru_gen_inc_refs mm/swap.c:391 [inline]
 folio_mark_accessed+0x6f/0x3a0 mm/swap.c:460
 follow_page_pte+0x572/0x6e0 mm/gup.c:886
 follow_pmd_mask mm/gup.c:915 [inline]
 follow_pud_mask mm/gup.c:967 [inline]
 follow_p4d_mask mm/gup.c:984 [inline]
 follow_page_mask mm/gup.c:1023 [inline]
 __get_user_pages+0x756/0x1ea0 mm/gup.c:1426
 __get_user_pages_locked mm/gup.c:1692 [inline]
 faultin_page_range+0x107/0x4d0 mm/gup.c:1912
 madvise_populate mm/madvise.c:976 [inline]
 madvise_do_behavior+0x19f/0xa10 mm/madvise.c:1935
 do_madvise+0x10e/0x190 mm/madvise.c:2030
 __do_sys_madvise mm/madvise.c:2039 [inline]
 __se_sys_madvise mm/madvise.c:2037 [inline]
 __x64_sys_madvise+0x63/0x80 mm/madvise.c:2037
 x64_sys_call+0x1c6b/0x3000 arch/x86/include/generated/asm/syscalls_64.h:29
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff888237d17b58 of 8 bytes by task 5743 on cpu 0:
 print_tickdevice+0x142/0x340 kernel/time/timer_list.c:206
 timer_list_show+0x6c/0x170 kernel/time/timer_list.c:298
 seq_read_iter+0x636/0x940 fs/seq_file.c:272
 proc_reg_read_iter+0x110/0x180 fs/proc/inode.c:299
 copy_splice_read+0x471/0x6c0 fs/splice.c:363
 do_splice_read fs/splice.c:982 [inline]
 splice_direct_to_actor+0x26e/0x670 fs/splice.c:1086
 do_splice_direct_actor fs/splice.c:1204 [inline]
 do_splice_direct+0x119/0x1a0 fs/splice.c:1230
 do_sendfile+0x382/0x650 fs/read_write.c:1370
 __do_sys_sendfile64 fs/read_write.c:1431 [inline]
 __se_sys_sendfile64 fs/read_write.c:1417 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1417
 x64_sys_call+0x2db1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x000000129f80e76b -> 0x000000129f81f8db

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 5743 Comm: syz.2.615 Tainted: G        W           syzkaller #0 PREEMPT(voluntary) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
==================================================================
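As a worked check of the `value changed` line in the report above, the sketch below decodes the two observed values. It assumes they are `ktime_t` values in nanoseconds, which is how the analysis describes the field. The `demo_*` names and `DEMO_*` macros are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's KTIME_MAX (largest signed 64-bit value). */
#define DEMO_KTIME_MAX INT64_MAX
static_assert(DEMO_KTIME_MAX == 0x7fffffffffffffffLL,
	      "KTIME_MAX stand-in must be the largest s64");

/* Values copied from the "value changed" line of the report. */
#define DEMO_OLD_VALUE 0x000000129f80e76bLL
#define DEMO_NEW_VALUE 0x000000129f81f8dbLL

/* Difference between two nanosecond timestamps. */
int64_t demo_delta_ns(int64_t old_ns, int64_t new_ns)
{
	return new_ns - old_ns;
}

/* Nonzero if the value is the "no event programmed" marker. */
int demo_is_ktime_max(int64_t value)
{
	return value == DEMO_KTIME_MAX;
}
```

Under that assumption, the old value is 79,985,436,523 ns (about 79.985 s) and the new value is exactly 70,000 ns (70 µs) later. Neither is `KTIME_MAX`, so the two samples appear to be consecutive programmed expiry times, and the transient `KTIME_MAX` store at the top of the interrupt handler is not what the `value changed` line captured.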

Trajectory:
Seq	Timestamp	Type	Name	Duration
0/0	2026/02/25 03:30	flow	assessment-kcsan	2m Results: map[Benign:true Confident:true Explanation:The data race occurs between `hrtimer_interrupt()` and `print_tickdevice()`. The write access is in `hrtimer_interrupt()` (at `kernel/time/hrtimer.c:1882` in the provided source), where `dev->next_event` is set to `KTIME_MAX` at the beginning of the high-resolution timer interrupt handler. Later in the same function, it is updated again via `tick_program_event()` to the next scheduled expiry time. The read access is in `print_tickdevice()` (at `kernel/time/timer_list.c:206`), which is called when a user reads the `/proc/timer_list` file. This function iterates over the system's tick devices and prints their status, including the `next_event` field. Analysis: 1. Field Type: The field being accessed is `dev->next_event`, which is of type `ktime_t`. In modern kernels, `ktime_t` is a 64-bit signed integer (`s64`). 2. Atomicity: On 64-bit systems (like the x86_64 system in the report), 8-byte aligned memory accesses are atomic. The `struct clock_event_device` is cacheline-aligned, and the `next_event` field is at an 8-byte aligned offset within the struct. Thus, the read and write are atomic. 3. Context: `hrtimer_interrupt()` runs in interrupt context on one CPU, while `print_tickdevice()` runs in task context (reading `/proc/timer_list`) potentially on another CPU. 4. Impact: The `next_event` field is used by the timer subsystem to track the next programmed event. `print_tickdevice()` uses it solely for displaying information to the user for debugging purposes. A data race here means that the user might see a slightly stale value or `KTIME_MAX` if the read happens exactly during an interrupt. This does not affect the actual operation of the timers or the stability of the kernel. 5. 
Synchronization: The `timer_list` interface is known to perform lockless reads of various timer-related fields to avoid the overhead and potential deadlocks of taking global or per-CPU locks while generating a potentially large amount of output. Conclusion: The data race is benign as it only affects a debug/statistics output (`/proc/timer_list`) and involves an atomic 64-bit variable. It does not lead to memory corruption, crashes, or incorrect logic in the timer subsystem.]
1/1	2026/02/25 03:30	action	kernel-checkouter	0m Results: map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/a24e0351858b280a0e17500759ec155b5ad3e184]
2/1	2026/02/25 03:30	action	kernel-builder	0m Results: map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/5a81561bbe40439496542cdb84200c114a114a5e]
3/1	2026/02/25 03:30	action	codesearch-prepare	0m Results: map[Index:codesearch-index]
4/1	2026/02/25 03:30	agent	expert	1m Model: gemini-3-flash-preview Results: map[Benign:true Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. 
In the final reply explain why you think the given data race is benign or is harmful. Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt: The data race report is: ================================================================== BUG: KCSAN: data-race in hrtimer_interrupt / print_tickdevice write to 0xffff888237d17b58 of 8 bytes by interrupt on cpu 1: hrtimer_interrupt+0x84/0x810 kernel/time/hrtimer.c:1882 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1045 [inline] __sysvec_apic_timer_interrupt+0x5f/0x1d0 arch/x86/kernel/apic/apic.c:1062 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1056 [inline] sysvec_apic_timer_interrupt+0x6f/0x80 arch/x86/kernel/apic/apic.c:1056 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697 lru_gen_inc_refs mm/swap.c:391 [inline] folio_mark_accessed+0x6f/0x3a0 mm/swap.c:460 follow_page_pte+0x572/0x6e0 mm/gup.c:886 follow_pmd_mask mm/gup.c:915 [inline] follow_pud_mask mm/gup.c:967 [inline] follow_p4d_mask mm/gup.c:984 [inline] follow_page_mask mm/gup.c:1023 [inline] __get_user_pages+0x756/0x1ea0 mm/gup.c:1426 __get_user_pages_locked mm/gup.c:1692 [inline] faultin_page_range+0x107/0x4d0 mm/gup.c:1912 madvise_populate mm/madvise.c:976 [inline] madvise_do_behavior+0x19f/0xa10 mm/madvise.c:1935 do_madvise+0x10e/0x190 mm/madvise.c:2030 __do_sys_madvise mm/madvise.c:2039 [inline] __se_sys_madvise mm/madvise.c:2037 [inline] __x64_sys_madvise+0x63/0x80 mm/madvise.c:2037 x64_sys_call+0x1c6b/0x3000 arch/x86/include/generated/asm/syscalls_64.h:29 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc0/0x2a0 
arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff888237d17b58 of 8 bytes by task 5743 on cpu 0: print_tickdevice+0x142/0x340 kernel/time/timer_list.c:206 timer_list_show+0x6c/0x170 kernel/time/timer_list.c:298 seq_read_iter+0x636/0x940 fs/seq_file.c:272 proc_reg_read_iter+0x110/0x180 fs/proc/inode.c:299 copy_splice_read+0x471/0x6c0 fs/splice.c:363 do_splice_read fs/splice.c:982 [inline] splice_direct_to_actor+0x26e/0x670 fs/splice.c:1086 do_splice_direct_actor fs/splice.c:1204 [inline] do_splice_direct+0x119/0x1a0 fs/splice.c:1230 do_sendfile+0x382/0x650 fs/read_write.c:1370 __do_sys_sendfile64 fs/read_write.c:1431 [inline] __se_sys_sendfile64 fs/read_write.c:1417 [inline] __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1417 x64_sys_call+0x2db1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:41 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x000000129f80e76b -> 0x000000129f81f8db Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 5743 Comm: syz.2.615 Tainted: G W syzkaller #0 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 ================================================================== Reply: The data race occurs between `hrtimer_interrupt()` and `print_tickdevice()`. The write access is in `hrtimer_interrupt()` (at `kernel/time/hrtimer.c:1882` in the provided source), where `dev->next_event` is set to `KTIME_MAX` at the beginning of the high-resolution timer interrupt handler. Later in the same function, it is updated again via `tick_program_event()` to the next scheduled expiry time. The read access is in `print_tickdevice()` (at `kernel/time/timer_list.c:206`), which is called when a user reads the `/proc/timer_list` file. 
This function iterates over the system's tick devices and prints their status, including the `next_event` field. Analysis: 1. Field Type: The field being accessed is `dev->next_event`, which is of type `ktime_t`. In modern kernels, `ktime_t` is a 64-bit signed integer (`s64`). 2. Atomicity: On 64-bit systems (like the x86_64 system in the report), 8-byte aligned memory accesses are atomic. The `struct clock_event_device` is cacheline-aligned, and the `next_event` field is at an 8-byte aligned offset within the struct. Thus, the read and write are atomic. 3. Context: `hrtimer_interrupt()` runs in interrupt context on one CPU, while `print_tickdevice()` runs in task context (reading `/proc/timer_list`) potentially on another CPU. 4. Impact: The `next_event` field is used by the timer subsystem to track the next programmed event. `print_tickdevice()` uses it solely for displaying information to the user for debugging purposes. A data race here means that the user might see a slightly stale value or `KTIME_MAX` if the read happens exactly during an interrupt. This does not affect the actual operation of the timers or the stability of the kernel. 5. Synchronization: The `timer_list` interface is known to perform lockless reads of various timer-related fields to avoid the overhead and potential deadlocks of taking global or per-CPU locks while generating a potentially large amount of output. Conclusion: The data race is benign as it only affects a debug/statistics output (`/proc/timer_list`) and involves an atomic 64-bit variable. It does not lead to memory corruption, crashes, or incorrect logic in the timer subsystem.
5/2	2026/02/25 03:30	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 4527 output: 22 thoughts: 180 Thoughts: Here's my analysis of the reported data race: Data Race Deep Dive: `hrtimer_interrupt` vs. `print_tickdevice` Okay, so the report indicates a nasty data race. Specifically, it's between `hrtimer_interrupt`, which is writing to a memory address (`0xffff888237d17b58`), and `print_tickdevice`, which is reading from the same address. The key thing is where these functions are running: `hrtimer_interrupt` is in interrupt context (the APIC timer interrupt, no less!) and `print_tickdevice` is task context, triggered when someone reads `/proc/timer_list` through the `timer_list_show` function. That's a classic recipe for trouble. My initial thought is to pin down the exact field they're both touching. I need to understand what's being written to and read from that address. The report helpfully gives me the function locations: `kernel/time/hrtimer.c:1882` for `hrtimer_interrupt` and `kernel/time/timer_list.c:206` for `print_tickdevice`. So, I'm going to pull up those files right now and get right to reading those lines. First, I’ll get a good overview of the file contexts. Then, I need to see exactly what memory access is happening on line 1882 in `hrtimer.c` and line 206 in `timer_list.c`. That should reveal the shared data that's causing the problem. I'm going to look for any potential locking or atomic operations that might be missing or insufficient. Time to get my hands dirty.
6/2	2026/02/25 03:30	tool	codesearch-dir-index	0m Args: map[Dir:kernel/time/] Results: map[Files:[alarmtimer.c clockevents.c clocksource-wdtest.c clocksource.c hrtimer.c itimer.c jiffies.c namespace.c ntp.c ntp_internal.h posix-clock.c posix-cpu-timers.c posix-stubs.c posix-timers.c posix-timers.h sched_clock.c sleep_timeout.c test_udelay.c tick-broadcast-hrtimer.c tick-broadcast.c tick-common.c tick-internal.h tick-legacy.c tick-oneshot.c tick-sched.c tick-sched.h time.c time_test.c timeconv.c timecounter.c timekeeping.c timekeeping.h timekeeping_debug.c timekeeping_internal.h timer.c timer_list.c timer_migration.c timer_migration.h vsyscall.c] Subdirs:<nil>]
7/2	2026/02/25 03:30	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 5438 output: 23 thoughts: 13 Thoughts: Okay, I'm ready to dive in. Let's see what we're dealing with. Line-by-Line Analysis Incoming First, I'll identify the reported lines of code. Then, I'll thoroughly read and analyze the code at each of those lines, giving the necessary consideration to the surrounding context. I need to understand exactly what each line is doing and, crucially, how it interacts with the rest of the program. I'll be looking for potential issues, inefficiencies, or anything that might explain the reported problem, if any was reported. I'll be mindful of the broader architecture and the overall goals of this section of the codebase. I'll need to consider this in light of what I know about the function or module in question.
8/2	2026/02/25 03:31	tool	read-file	0m Args: map[File:kernel/time/hrtimer.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0 /* * Copyright(C) 2005-2006, Linutronix GmbH, Thomas Gleixner <tglx@kernel.org> * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar * Copyright(C) 2006-2007 Timesys Corp., Thomas Gleixner * * High-resolution kernel timers * * In contrast to the low-resolution timeout API, aka timer wheel, * hrtimers provide finer resolution and accuracy depending on system * configuration and capabilities. * * Started by: Thomas Gleixner and Ingo Molnar * * Credits: * Based on the original timer wheel code * * Help, testing, suggestions, bugfixes, improvements were * provided by: * * George Anzinger, Andrew Morton, Steven Rostedt, Roman Zippel * et. al. / #include <linux/cpu.h> #include <linux/export.h> #include <linux/percpu.h> #include <linux/hrtimer.h> #include <linux/notifier.h> #include <linux/syscalls.h> #include <linux/interrupt.h> #include <linux/tick.h> #include <linux/err.h> #include <linux/debugobjects.h> #include <linux/sched/signal.h> #include <linux/sched/sysctl.h> #include <linux/sched/rt.h> #include <linux/sched/deadline.h> #include <linux/sched/nohz.h> #include <linux/sched/debug.h> #include <linux/sched/isolation.h> #include <linux/timer.h> #include <linux/freezer.h> #include <linux/compat.h> #include <linux/uaccess.h> #include <trace/events/timer.h> #include "tick-internal.h" / * Masks for selecting the soft and hard context timers from * cpu_base->active / #define MASK_SHIFT (HRTIMER_BASE_MONOTONIC_SOFT) #define HRTIMER_ACTIVE_HARD ((1U << MASK_SHIFT) - 1) #define HRTIMER_ACTIVE_SOFT (HRTIMER_ACTIVE_HARD << MASK_SHIFT) #define HRTIMER_ACTIVE_ALL (HRTIMER_ACTIVE_SOFT \| HRTIMER_ACTIVE_HARD) static void retrigger_next_event(void arg); static ktime_t __hrtimer_cb_get_time(clockid_t clock_id); /* * The timer bases: * * There are more clockids than hrtimer bases. Thus, we index * into the timer bases by the hrtimer_base_type enum. 
When trying * to reach a base using a clockid, hrtimer_clockid_to_base() * is used to convert from clockid to the proper hrtimer_base_type. / DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) = { .lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock), .clock_base = { { .index = HRTIMER_BASE_MONOTONIC, .clockid = CLOCK_MONOTONIC, }, { .index = HRTIMER_BASE_REALTIME, .clockid = CLOCK_REALTIME, }, { .index = HRTIMER_BASE_BOOTTIME, .clockid = CLOCK_BOOTTIME, }, { .index = HRTIMER_BASE_TAI, .clockid = CLOCK_TAI, }, { .index = HRTIMER_BASE_MONOTONIC_SOFT, .clockid = CLOCK_MONOTONIC, }, { .index = HRTIMER_BASE_REALTIME_SOFT, .clockid = CLOCK_REALTIME, }, { .index = HRTIMER_BASE_BOOTTIME_SOFT, .clockid = CLOCK_BOOTTIME, }, { .index = HRTIMER_BASE_TAI_SOFT, .clockid = CLOCK_TAI, }, }, .csd = CSD_INIT(retrigger_next_event, NULL) }; static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base base) { if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) return true; else return likely(base->online); } /* * Functions and macros which are different for UP/SMP systems are kept in a * single place / #ifdef CONFIG_SMP / * We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base() * such that hrtimer_callback_running() can unconditionally dereference * timer->base->cpu_base / static struct hrtimer_cpu_base migration_cpu_base = { .clock_base = { { .cpu_base = &migration_cpu_base, .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq, &migration_cpu_base.lock), }, }, }; #define migration_base migration_cpu_base.clock_base[0] / * We are using hashed locking: holding per_cpu(hrtimer_bases)[n].lock * means that all timers which are tied to this base via timer->base are * locked, and the base itself is locked too. * * So __run_timers/migrate_timers can safely modify all timers which could * be found on the lists/queues. 
* * When the timer's base is locked, and the timer removed from list, it is * possible to set timer->base = &migration_base and drop the lock: the timer * remains locked. / static struct hrtimer_clock_base lock_hrtimer_base(const struct hrtimer timer, unsigned long flags) __acquires(&timer->base->lock) { struct hrtimer_clock_base base; for (;;) { base = READ_ONCE(timer->base); if (likely(base != &migration_base)) { raw_spin_lock_irqsave(&base->cpu_base->lock, flags); if (likely(base == timer->base)) return base; /* The timer has migrated to another CPU: / raw_spin_unlock_irqrestore(&base->cpu_base->lock, flags); } cpu_relax(); } } /* * Check if the elected target is suitable considering its next * event and the hotplug state of the current CPU. * * If the elected target is remote and its next event is after the timer * to queue, then a remote reprogram is necessary. However there is no * guarantee the IPI handling the operation would arrive in time to meet * the high resolution deadline. In this case the local CPU becomes a * preferred target, unless it is offline. * * High and low resolution modes are handled the same way for simplicity. * * Called with cpu_base->lock of target cpu held. / static bool hrtimer_suitable_target(struct hrtimer timer, struct hrtimer_clock_base new_base, struct hrtimer_cpu_base new_cpu_base, struct hrtimer_cpu_base this_cpu_base) { ktime_t expires; / * The local CPU clockevent can be reprogrammed. Also get_target_base() * guarantees it is online. / if (new_cpu_base == this_cpu_base) return true; / * The offline local CPU can't be the default target if the * next remote target event is after this timer. Keep the * elected new base. An IPI will be issued to reprogram * it as a last resort. 
/ if (!hrtimer_base_is_online(this_cpu_base)) return true; expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset); return expires >= new_base->cpu_base->expires_next; } static inline struct hrtimer_cpu_base get_target_base(struct hrtimer_cpu_base base, int pinned) { if (!hrtimer_base_is_online(base)) { int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER)); return &per_cpu(hrtimer_bases, cpu); } #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) if (static_branch_likely(&timers_migration_enabled) && !pinned) return &per_cpu(hrtimer_bases, get_nohz_timer_target()); #endif return base; } / * We switch the timer base to a power-optimized selected CPU target, * if: * - NO_HZ_COMMON is enabled * - timer migration is enabled * - the timer callback is not running * - the timer is not the first expiring timer on the new target * * If one of the above requirements is not fulfilled we move the timer * to the current CPU or leave it on the previously assigned CPU if * the timer callback is currently running. / static inline struct hrtimer_clock_base switch_hrtimer_base(struct hrtimer timer, struct hrtimer_clock_base base, int pinned) { struct hrtimer_cpu_base new_cpu_base, this_cpu_base; struct hrtimer_clock_base new_base; int basenum = base->index; this_cpu_base = this_cpu_ptr(&hrtimer_bases); new_cpu_base = get_target_base(this_cpu_base, pinned); again: new_base = &new_cpu_base->clock_base[basenum]; if (base != new_base) { / * We are trying to move timer to new_base. * However we can't change timer's base while it is running, * so we keep it on the same CPU. No hassle vs. reprogramming * the event source in the high resolution case. The softirq * code will take care of this when the timer function has * completed. There is no conflict as we hold the lock until * the timer is enqueued. 
/ if (unlikely(hrtimer_callback_running(timer))) return base; / See the comment in lock_hrtimer_base() / WRITE_ONCE(timer->base, &migration_base); raw_spin_unlock(&base->cpu_base->lock); raw_spin_lock(&new_base->cpu_base->lock); if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) { raw_spin_unlock(&new_base->cpu_base->lock); raw_spin_lock(&base->cpu_base->lock); new_cpu_base = this_cpu_base; WRITE_ONCE(timer->base, base); goto again; } WRITE_ONCE(timer->base, new_base); } else { if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) { new_cpu_base = this_cpu_base; goto again; } } return new_base; } #else / CONFIG_SMP / static inline struct hrtimer_clock_base lock_hrtimer_base(const struct hrtimer timer, unsigned long flags) __acquires(&timer->base->cpu_base->lock) { struct hrtimer_clock_base base = timer->base; raw_spin_lock_irqsave(&base->cpu_base->lock, flags); return base; } # define switch_hrtimer_base(t, b, p) (b) #endif /* !CONFIG_SMP / / * Functions for the union type storage format of ktime_t which are * too large for inlining: / #if BITS_PER_LONG < 64 / * Divide a ktime value by a nanosecond value / s64 __ktime_divns(const ktime_t kt, s64 div) { int sft = 0; s64 dclc; u64 tmp; dclc = ktime_to_ns(kt); tmp = dclc < 0 ? -dclc : dclc; / Make sure the divisor is less than 2^32: / while (div >> 32) { sft++; div >>= 1; } tmp >>= sft; do_div(tmp, (u32) div); return dclc < 0 ? 
-tmp : tmp; } EXPORT_SYMBOL_GPL(__ktime_divns); #endif / BITS_PER_LONG >= 64 / / * Add two ktime values and do a safety check for overflow: / ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs) { ktime_t res = ktime_add_unsafe(lhs, rhs); / * We use KTIME_SEC_MAX here, the maximum timeout which we can * return to user space in a timespec: / if (res < 0 \|\| res < lhs \|\| res < rhs) res = ktime_set(KTIME_SEC_MAX, 0); return res; } EXPORT_SYMBOL_GPL(ktime_add_safe); #ifdef CONFIG_DEBUG_OBJECTS_TIMERS static const struct debug_obj_descr hrtimer_debug_descr; static void hrtimer_debug_hint(void addr) { return ACCESS_PRIVATE((struct hrtimer )addr, function); } /* * fixup_init is called when: * - an active object is initialized / static bool hrtimer_fixup_init(void addr, enum debug_obj_state state) { struct hrtimer timer = addr; switch (state) { case ODEBUG_STATE_ACTIVE: hrtimer_cancel(timer); debug_object_init(timer, &hrtimer_debug_descr); return true; default: return false; } } / * fixup_activate is called when: * - an active object is activated * - an unknown non-static object is activated / static bool hrtimer_fixup_activate(void addr, enum debug_obj_state state) { switch (state) { case ODEBUG_STATE_ACTIVE: WARN_ON(1); fallthrough; default: return false; } } /* * fixup_free is called when: * - an active object is freed / static bool hrtimer_fixup_free(void addr, enum debug_obj_state state) { struct hrtimer timer = addr; switch (state) { case ODEBUG_STATE_ACTIVE: hrtimer_cancel(timer); debug_object_free(timer, &hrtimer_debug_descr); return true; default: return false; } } static const struct debug_obj_descr hrtimer_debug_descr = { .name = "hrtimer", .debug_hint = hrtimer_debug_hint, .fixup_init = hrtimer_fixup_init, .fixup_activate = hrtimer_fixup_activate, .fixup_free = hrtimer_fixup_free, }; static inline void debug_hrtimer_init(struct hrtimer timer) { debug_object_init(timer, &hrtimer_debug_descr); } static inline void debug_hrtimer_init_on_stack(struct 
hrtimer timer) { debug_object_init_on_stack(timer, &hrtimer_debug_descr); } static inline void debug_hrtimer_activate(struct hrtimer timer, enum hrtimer_mode mode) { debug_object_activate(timer, &hrtimer_debug_descr); } static inline void debug_hrtimer_deactivate(struct hrtimer timer) { debug_object_deactivate(timer, &hrtimer_debug_descr); } void destroy_hrtimer_on_stack(struct hrtimer timer) { debug_object_free(timer, &hrtimer_debug_descr); } EXPORT_SYMBOL_GPL(destroy_hrtimer_on_stack); #else static inline void debug_hrtimer_init(struct hrtimer timer) { } static inline void debug_hrtimer_init_on_stack(struct hrtimer timer) { } static inline void debug_hrtimer_activate(struct hrtimer timer, enum hrtimer_mode mode) { } static inline void debug_hrtimer_deactivate(struct hrtimer timer) { } #endif static inline void debug_setup(struct hrtimer timer, clockid_t clockid, enum hrtimer_mode mode) { debug_hrtimer_init(timer); trace_hrtimer_setup(timer, clockid, mode); } static inline void debug_setup_on_stack(struct hrtimer timer, clockid_t clockid, enum hrtimer_mode mode) { debug_hrtimer_init_on_stack(timer); trace_hrtimer_setup(timer, clockid, mode); } static inline void debug_activate(struct hrtimer timer, enum hrtimer_mode mode) { debug_hrtimer_activate(timer, mode); trace_hrtimer_start(timer, mode); } static inline void debug_deactivate(struct hrtimer timer) { debug_hrtimer_deactivate(timer); trace_hrtimer_cancel(timer); } static struct hrtimer_clock_base * __next_base(struct hrtimer_cpu_base cpu_base, unsigned int active) { unsigned int idx; if (!active) return NULL; idx = __ffs(active); active &= ~(1U << idx); return &cpu_base->clock_base[idx]; } #define for_each_active_base(base, cpu_base, active) \ while ((base = __next_base((cpu_base), &(active)))) static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base cpu_base, const struct hrtimer exclude, unsigned int active, ktime_t expires_next) { struct hrtimer_clock_base base; ktime_t expires; 
for_each_active_base(base, cpu_base, active) { struct timerqueue_node next; struct hrtimer timer; next = timerqueue_getnext(&base->active); timer = container_of(next, struct hrtimer, node); if (timer == exclude) { /* Get to the next timer in the queue. / next = timerqueue_iterate_next(next); if (!next) continue; timer = container_of(next, struct hrtimer, node); } expires = ktime_sub(hrtimer_get_expires(timer), base->offset); if (expires < expires_next) { expires_next = expires; / Skip cpu_base update if a timer is being excluded. / if (exclude) continue; if (timer->is_soft) cpu_base->softirq_next_timer = timer; else cpu_base->next_timer = timer; } } / * clock_was_set() might have changed base->offset of any of * the clock bases so the result might be negative. Fix it up * to prevent a false positive in clockevents_program_event(). / if (expires_next < 0) expires_next = 0; return expires_next; } / * Recomputes cpu_base::next_timer and returns the earliest expires_next but does not set cpu_base::expires_next, that is done by hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating * cpu_base::expires_next right away, reprogramming logic would no longer work. * * When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases, * those timers will get run whenever the softirq gets handled, at the end of * hrtimer_run_softirq(), hrtimer_update_softirq_timer() will re-add these bases. * * Therefore softirq values are those from the HRTIMER_ACTIVE_SOFT clock bases. * The !softirq values are the minima across HRTIMER_ACTIVE_ALL, unless an actual * softirq is pending, in which case they're the minima of HRTIMER_ACTIVE_HARD. * * @active_mask must be one of: * - HRTIMER_ACTIVE_ALL, * - HRTIMER_ACTIVE_SOFT, or * - HRTIMER_ACTIVE_HARD. 
 */
static ktime_t
__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
{
	unsigned int active;
	struct hrtimer *next_timer = NULL;
	ktime_t expires_next = KTIME_MAX;

	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
		cpu_base->softirq_next_timer = NULL;
		expires_next = __hrtimer_next_event_base(cpu_base, NULL,
							 active, KTIME_MAX);

		next_timer = cpu_base->softirq_next_timer;
	}

	if (active_mask & HRTIMER_ACTIVE_HARD) {
		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
		cpu_base->next_timer = next_timer;
		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
							 expires_next);
	}

	return expires_next;
}

static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
{
	ktime_t expires_next, soft = KTIME_MAX;

	/*
	 * If the soft interrupt has already been activated, ignore the
	 * soft bases. They will be handled in the already raised soft
	 * interrupt.
	 */
	if (!cpu_base->softirq_activated) {
		soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
		/*
		 * Update the soft expiry time. clock_settime() might have
		 * affected it.
		 */
		cpu_base->softirq_expires_next = soft;
	}

	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
	/*
	 * If a softirq timer is expiring first, update cpu_base->next_timer
	 * and program the hardware with the soft expiry time.
	 */
	if (expires_next > soft) {
		cpu_base->next_timer = cpu_base->softirq_next_timer;
		expires_next = soft;
	}

	return expires_next;
}

static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
	ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
	ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;

	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
					    offs_real, offs_boot, offs_tai);

	base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
	base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
	base->clock_base[HRTIMER_BASE_TAI_SOFT].offset = *offs_tai;

	return now;
}

/*
 * Is the high resolution mode active ?
 */
static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
{
	return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
		cpu_base->hres_active : 0;
}

static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
				struct hrtimer *next_timer,
				ktime_t expires_next)
{
	cpu_base->expires_next = expires_next;

	/*
	 * If hres is not active, hardware does not have to be
	 * reprogrammed yet.
	 *
	 * If a hang was detected in the last timer interrupt then we
	 * leave the hang delay active in the hardware. We want the
	 * system to make progress. That also prevents the following
	 * scenario:
	 * T1 expires 50ms from now
	 * T2 expires 5s from now
	 *
	 * T1 is removed, so this code is called and would reprogram
	 * the hardware to 5s from now. Any hrtimer_start after that
	 * will not reprogram the hardware due to hang_detected being
	 * set. So we'd effectively block all timers until the T2 event
	 * fires.
	 */
	if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
		return;

	tick_program_event(expires_next, 1);
}

/*
 * Reprogram the event source with checking both queues for the
 * next event
 * Called with interrupts disabled and base->lock held
 */
static void
hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
	ktime_t expires_next;

	expires_next = hrtimer_update_next_event(cpu_base);

	if (skip_equal && expires_next == cpu_base->expires_next)
		return;

	__hrtimer_reprogram(cpu_base, cpu_base->next_timer, expires_next);
}

/* High resolution timer related functions */
#ifdef CONFIG_HIGH_RES_TIMERS

/*
 * High resolution timer enabled ?
 */
static bool hrtimer_hres_enabled __read_mostly = true;
unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
EXPORT_SYMBOL_GPL(hrtimer_resolution);

/*
 * Enable / Disable high resolution mode
 */
static int __init setup_hrtimer_hres(char *str)
{
	return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
}

__setup("highres=", setup_hrtimer_hres);

/*
 * hrtimer_high_res_enabled - query, if the highres mode is enabled
 */
static inline int hrtimer_is_hres_enabled(void)
{
	return hrtimer_hres_enabled;
}

/*
 * Switch to high resolution mode
 */
static void hrtimer_switch_to_hres(void)
{
	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

	if (tick_init_highres()) {
		pr_warn("Could not switch to high resolution mode on CPU %u\n",
			base->cpu);
		return;
	}
	base->hres_active = 1;
	hrtimer_resolution = HIGH_RES_NSEC;

	tick_setup_sched_timer(true);
	/* "Retrigger" the interrupt to get things going */
	retrigger_next_event(NULL);
}

#else

static inline int hrtimer_is_hres_enabled(void) { return 0; }
static inline void hrtimer_switch_to_hres(void) { }

#endif /* CONFIG_HIGH_RES_TIMERS */
/*
 * Retrigger next event is called after clock was set with interrupts
 * disabled through an SMP function call or directly from low level
 * resume code.
 *
 * This is only invoked when:
 *	- CONFIG_HIGH_RES_TIMERS is enabled.
 *	- CONFIG_NOHZ_COMMON is enabled
 *
 * For the other cases this function is empty and because the call sites
 * are optimized out it vanishes as well, i.e. no need for lots of
 * #ifdeffery.
 */
static void retrigger_next_event(void *arg)
{
	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

	/*
	 * When high resolution mode or nohz is active, then the offsets of
	 * CLOCK_REALTIME/TAI/BOOTTIME have to be updated. Otherwise the
	 * next tick will take care of that.
	 *
	 * If high resolution mode is active then the next expiring timer
	 * must be reevaluated and the clock event device reprogrammed if
	 * necessary.
	 *
	 * In the NOHZ case the update of the offset and the reevaluation
	 * of the next expiring timer is enough. The return from the SMP
	 * function call will take care of the reprogramming in case the
	 * CPU was in a NOHZ idle sleep.
	 *
	 * In periodic low resolution mode, the next softirq expiration
	 * must also be updated.
	 */
	raw_spin_lock(&base->lock);
	hrtimer_update_base(base);
	if (hrtimer_hres_active(base))
		hrtimer_force_reprogram(base, 0);
	else
		hrtimer_update_next_event(base);
	raw_spin_unlock(&base->lock);
}

/*
 * When a timer is enqueued and expires earlier than the already enqueued
 * timers, we have to check, whether it expires earlier than the timer for
 * which the clock event device was armed.
 *
 * Called with interrupts disabled and base->cpu_base.lock held
 */
static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	struct hrtimer_clock_base *base = timer->base;
	ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);

	WARN_ON_ONCE(hrtimer_get_expires_tv64(timer) < 0);

	/*
	 * CLOCK_REALTIME timer might be requested with an absolute
	 * expiry time which is less than base->offset. Set it to 0.
	 */
	if (expires < 0)
		expires = 0;

	if (timer->is_soft) {
		/*
		 * soft hrtimer could be started on a remote CPU. In this
		 * case softirq_expires_next needs to be updated on the
		 * remote CPU.
		 * The soft hrtimer will not expire before the
		 * first hard hrtimer on the remote CPU -
		 * hrtimer_check_target() prevents this case.
		 */
		struct hrtimer_cpu_base *timer_cpu_base = base->cpu_base;

		if (timer_cpu_base->softirq_activated)
			return;

		if (!ktime_before(expires, timer_cpu_base->softirq_expires_next))
			return;

		timer_cpu_base->softirq_next_timer = timer;
		timer_cpu_base->softirq_expires_next = expires;

		if (!ktime_before(expires, timer_cpu_base->expires_next) ||
		    !reprogram)
			return;
	}

	/*
	 * If the timer is not on the current cpu, we cannot reprogram
	 * the other cpus clock event device.
	 */
	if (base->cpu_base != cpu_base)
		return;

	if (expires >= cpu_base->expires_next)
		return;

	/*
	 * If the hrtimer interrupt is running, then it will reevaluate the
	 * clock bases and reprogram the clock event device.
	 */
	if (cpu_base->in_hrtirq)
		return;

	cpu_base->next_timer = timer;

	__hrtimer_reprogram(cpu_base, timer, expires);
}

static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
			     unsigned int active)
{
	struct hrtimer_clock_base *base;
	unsigned int seq;
	ktime_t expires;

	/*
	 * Update the base offsets unconditionally so the following
	 * checks whether the SMP function call is required works.
	 *
	 * The update is safe even when the remote CPU is in the hrtimer
	 * interrupt or the hrtimer soft interrupt and expiring affected
	 * bases. Either it will see the update before handling a base or
	 * it will see it when it finishes the processing and reevaluates
	 * the next expiring timer.
	 */
	seq = cpu_base->clock_was_set_seq;
	hrtimer_update_base(cpu_base);

	/*
	 * If the sequence did not change over the update then the
	 * remote CPU already handled it.
	 */
	if (seq == cpu_base->clock_was_set_seq)
		return false;

	/*
	 * If the remote CPU is currently handling an hrtimer interrupt, it
	 * will reevaluate the first expiring timer of all clock bases
	 * before reprogramming. Nothing to do here.
	 */
	if (cpu_base->in_hrtirq)
		return false;

	/*
	 * Walk the affected clock bases and check whether the first expiring
	 * timer in a clock base is moving ahead of the first expiring timer of
	 * @cpu_base. If so, the IPI must be invoked because per CPU clock
	 * event devices cannot be remotely reprogrammed.
	 */
	active &= cpu_base->active_bases;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *next;

		next = timerqueue_getnext(&base->active);
		expires = ktime_sub(next->expires, base->offset);
		if (expires < cpu_base->expires_next)
			return true;

		/* Extra check for softirq clock bases */
		if (base->index < HRTIMER_BASE_MONOTONIC_SOFT)
			continue;
		if (cpu_base->softirq_activated)
			continue;
		if (expires < cpu_base->softirq_expires_next)
			return true;
	}
	return false;
}

/*
 * Clock was set. This might affect CLOCK_REALTIME, CLOCK_TAI and
 * CLOCK_BOOTTIME (for late sleep time injection).
 *
 * This requires to update the offsets for these clocks
 * vs. CLOCK_MONOTONIC. When high resolution timers are enabled, then this
 * also requires to eventually reprogram the per CPU clock event devices
 * when the change moves an affected timer ahead of the first expiring
 * timer on that CPU. Obviously remote per CPU clock event devices cannot
 * be reprogrammed. The other reason why an IPI has to be sent is when the
 * system is in !HIGH_RES and NOHZ mode. The NOHZ mode updates the offsets
 * in the tick, which obviously might be stopped, so this has to bring out
 * the remote CPU which might sleep in idle to get this sorted.
 */
void clock_was_set(unsigned int bases)
{
	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
	cpumask_var_t mask;
	int cpu;

	if (!hrtimer_hres_active(cpu_base) && !tick_nohz_active)
		goto out_timerfd;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
		on_each_cpu(retrigger_next_event, NULL, 1);
		goto out_timerfd;
	}

	/* Avoid interrupting CPUs if possible */
	cpus_read_lock();
	for_each_online_cpu(cpu) {
		unsigned long flags;

		cpu_base = &per_cpu(hrtimer_bases, cpu);
		raw_spin_lock_irqsave(&cpu_base->lock, flags);

		if (update_needs_ipi(cpu_base, bases))
			cpumask_set_cpu(cpu, mask);

		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
	}

	preempt_disable();
	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
	preempt_enable();
	cpus_read_unlock();
	free_cpumask_var(mask);

out_timerfd:
	timerfd_clock_was_set();
}

static void clock_was_set_work(struct work_struct *work)
{
	clock_was_set(CLOCK_SET_WALL);
}

static DECLARE_WORK(hrtimer_work, clock_was_set_work);

/*
 * Called from timekeeping code to reprogram the hrtimer interrupt device
 * on all cpus and to notify timerfd.
 */
void clock_was_set_delayed(void)
{
	schedule_work(&hrtimer_work);
}

/*
 * Called during resume either directly from via timekeeping_resume()
 * or in the case of s2idle from tick_unfreeze() to ensure that the
 * hrtimers are up to date.
 */
void hrtimers_resume_local(void)
{
	lockdep_assert_irqs_disabled();
	/* Retrigger on the local CPU */
	retrigger_next_event(NULL);
}

/*
 * Counterpart to lock_hrtimer_base above:
 */
static inline
void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
	__releases(&timer->base->cpu_base->lock)
{
	raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
}

/**
 * hrtimer_forward() - forward the timer expiry
 * @timer:	hrtimer to forward
 * @now:	forward past this time
 * @interval:	the interval to forward
 *
 * Forward the timer expiry so it will expire in the future.
 *
 * .. note::
 *  This only updates the timer expiry value and does not requeue the timer.
 *
 * There is also a variant of the function hrtimer_forward_now().
 *
 * Context: Can be safely called from the callback function of @timer. If called
 *          from other contexts @timer must neither be enqueued nor running the
 *          callback and the caller needs to take care of serialization.
 *
 * Return: The number of overruns are returned.
 */
u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
{
	u64 orun = 1;
	ktime_t delta;

	delta = ktime_sub(now, hrtimer_get_expires(timer));

	if (delta < 0)
		return 0;

	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
		return 0;

	if (interval < hrtimer_resolution)
		interval = hrtimer_resolution;

	if (unlikely(delta >= interval)) {
		s64 incr = ktime_to_ns(interval);

		orun = ktime_divns(delta, incr);
		hrtimer_add_expires_ns(timer, incr * orun);
		if (hrtimer_get_expires_tv64(timer) > now)
			return orun;
		/*
		 * This (and the ktime_add() below) is the
		 * correction for exact:
		 */
		orun++;
	}
	hrtimer_add_expires(timer, interval);

	return orun;
}
EXPORT_SYMBOL_GPL(hrtimer_forward);

/*
 * enqueue_hrtimer - internal function to (re)start a timer
 *
 * The timer is inserted in expiry order. Insertion into the
 * red black tree is O(log(n)). Must hold the base lock.
 *
 * Returns true when the new timer is the leftmost timer in the tree.
 */
static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
			    enum hrtimer_mode mode)
{
	debug_activate(timer, mode);
	WARN_ON_ONCE(!base->cpu_base->online);

	base->cpu_base->active_bases |= 1 << base->index;

	/* Pairs with the lockless read in hrtimer_is_queued() */
	WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);

	return timerqueue_add(&base->active, &timer->node);
}

/*
 * __remove_hrtimer - internal function to remove a timer
 *
 * Caller must hold the base lock.
 *
 * High resolution timer mode reprograms the clock event device when the
 * timer is the one which expires next. The caller can disable this by setting
 * reprogram to zero. This is useful, when the context does a reprogramming
 * anyway (e.g.
 * timer interrupt)
 */
static void __remove_hrtimer(struct hrtimer *timer,
			     struct hrtimer_clock_base *base,
			     u8 newstate, int reprogram)
{
	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
	u8 state = timer->state;

	/* Pairs with the lockless read in hrtimer_is_queued() */
	WRITE_ONCE(timer->state, newstate);
	if (!(state & HRTIMER_STATE_ENQUEUED))
		return;

	if (!timerqueue_del(&base->active, &timer->node))
		cpu_base->active_bases &= ~(1 << base->index);

	/*
	 * Note: If reprogram is false we do not update
	 * cpu_base->next_timer. This happens when we remove the first
	 * timer on a remote cpu. No harm as we never dereference
	 * cpu_base->next_timer. So the worst thing what can happen is
	 * an superfluous call to hrtimer_force_reprogram() on the
	 * remote cpu later on if the same timer gets enqueued again.
	 */
	if (reprogram && timer == cpu_base->next_timer)
		hrtimer_force_reprogram(cpu_base, 1);
}

/*
 * remove hrtimer, called with base lock held
 */
static inline int
remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
	       bool restart, bool keep_local)
{
	u8 state = timer->state;

	if (state & HRTIMER_STATE_ENQUEUED) {
		bool reprogram;

		/*
		 * Remove the timer and force reprogramming when high
		 * resolution mode is active and the timer is on the current
		 * CPU. If we remove a timer on another CPU, reprogramming is
		 * skipped. The interrupt event on this CPU is fired and
		 * reprogramming happens in the interrupt handler. This is a
		 * rare case and less expensive than a smp call.
		 */
		debug_deactivate(timer);
		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);

		/*
		 * If the timer is not restarted then reprogramming is
		 * required if the timer is local. If it is local and about
		 * to be restarted, avoid programming it twice (on removal
		 * and a moment later when it's requeued).
		 */
		if (!restart)
			state = HRTIMER_STATE_INACTIVE;
		else
			reprogram &= !keep_local;

		__remove_hrtimer(timer, base, state, reprogram);
		return 1;
	}
	return 0;
}

static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
					    const enum hrtimer_mode mode)
{
#ifdef CONFIG_TIME_LOW_RES
	/*
	 * CONFIG_TIME_LOW_RES indicates that the system has no way to return
	 * granular time values. For relative timers we add hrtimer_resolution
	 * (i.e. one jiffy) to prevent short timeouts.
	 */
	timer->is_rel = mode & HRTIMER_MODE_REL;
	if (timer->is_rel)
		tim = ktime_add_safe(tim, hrtimer_resolution);
#endif
	return tim;
}

static void
hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
{
	ktime_t expires;

	/*
	 * Find the next SOFT expiration.
	 */
	expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);

	/*
	 * reprogramming needs to be triggered, even if the next soft
	 * hrtimer expires at the same time than the next hard
	 * hrtimer. cpu_base->softirq_expires_next needs to be updated!
	 */
	if (expires == KTIME_MAX)
		return;

	/*
	 * cpu_base->next_timer is recomputed by __hrtimer_get_next_event()
	 * cpu_base->expires_next is only set by hrtimer_reprogram()
	 */
	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
}

static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
				    u64 delta_ns, const enum hrtimer_mode mode,
				    struct hrtimer_clock_base *base)
{
	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
	struct hrtimer_clock_base *new_base;
	bool force_local, first;

	/*
	 * If the timer is on the local cpu base and is the first expiring
	 * timer then this might end up reprogramming the hardware twice
	 * (on removal and on enqueue). To avoid that by prevent the
	 * reprogram on removal, keep the timer local to the current CPU
	 * and enforce reprogramming after it is queued no matter whether
	 * it is the new first expiring timer again or not.
	 */
	force_local = base->cpu_base == this_cpu_base;
	force_local &= base->cpu_base->next_timer == timer;

	/*
	 * Don't force local queuing if this enqueue happens on a unplugged
	 * CPU after hrtimer_cpu_dying() has been invoked.
	 */
	force_local &= this_cpu_base->online;

	/*
	 * Remove an active timer from the queue. In case it is not queued
	 * on the current CPU, make sure that remove_hrtimer() updates the
	 * remote data correctly.
	 *
	 * If it's on the current CPU and the first expiring timer, then
	 * skip reprogramming, keep the timer local and enforce
	 * reprogramming later if it was the first expiring timer. This
	 * avoids programming the underlying clock event twice (once at
	 * removal and once after enqueue).
	 */
	remove_hrtimer(timer, base, true, force_local);

	if (mode & HRTIMER_MODE_REL)
		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));

	tim = hrtimer_update_lowres(timer, tim, mode);

	hrtimer_set_expires_range_ns(timer, tim, delta_ns);

	/* Switch the timer base, if necessary: */
	if (!force_local) {
		new_base = switch_hrtimer_base(timer, base,
					       mode & HRTIMER_MODE_PINNED);
	} else {
		new_base = base;
	}

	first = enqueue_hrtimer(timer, new_base, mode);
	if (!force_local) {
		/*
		 * If the current CPU base is online, then the timer is
		 * never queued on a remote CPU if it would be the first
		 * expiring timer there.
		 */
		if (hrtimer_base_is_online(this_cpu_base))
			return first;

		/*
		 * Timer was enqueued remote because the current base is
		 * already offline. If the timer is the first to expire,
		 * kick the remote CPU to reprogram the clock event.
		 */
		if (first) {
			struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;

			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
		}
		return 0;
	}

	/*
	 * Timer was forced to stay on the current CPU to avoid
	 * reprogramming on removal and enqueue. Force reprogram the
	 * hardware by evaluating the new first expiring timer.
	 */
	hrtimer_force_reprogram(new_base->cpu_base, 1);
	return 0;
}

/**
 * hrtimer_start_range_ns - (re)start an hrtimer
 * @timer:	the timer to be added
 * @tim:	expiry time
 * @delta_ns:	"slack" range for the timer
 * @mode:	timer mode: absolute (HRTIMER_MODE_ABS) or
 *		relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
 *		softirq based mode is considered for debug purpose only!
 */
void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
			    u64 delta_ns, const enum hrtimer_mode mode)
{
	struct hrtimer_clock_base *base;
	unsigned long flags;

	/*
	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
	 * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
	 * expiry mode because unmarked timers are moved to softirq expiry.
	 */
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		WARN_ON_ONCE(!(mode & HRTIMER_MODE_SOFT) ^ !timer->is_soft);
	else
		WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard);

	base = lock_hrtimer_base(timer, &flags);

	if (__hrtimer_start_range_ns(timer, tim, delta_ns, mode, base))
		hrtimer_reprogram(timer, true);

	unlock_hrtimer_base(timer, &flags);
}
EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);

/**
 * hrtimer_try_to_cancel - try to deactivate a timer
 * @timer:	hrtimer to stop
 *
 * Returns:
 *
 *  *  0 when the timer was not active
 *  *  1 when the timer was active
 *  * -1 when the timer is currently executing the callback function and
 *    cannot be stopped
 */
int hrtimer_try_to_cancel(struct hrtimer *timer)
{
	struct hrtimer_clock_base *base;
	unsigned long flags;
	int ret = -1;

	/*
	 * Check lockless first. If the timer is not active (neither
	 * enqueued nor running the callback, nothing to do here. The
	 * base lock does not serialize against a concurrent enqueue,
	 * so we can avoid taking it.
	 */
	if (!hrtimer_active(timer))
		return 0;

	base = lock_hrtimer_base(timer, &flags);

	if (!hrtimer_callback_running(timer))
		ret = remove_hrtimer(timer, base, false, false);

	unlock_hrtimer_base(timer, &flags);

	return ret;

}
EXPORT_SYMBOL_GPL(hrtimer_try_to_cancel);

#ifdef CONFIG_PREEMPT_RT
static void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base)
{
	spin_lock_init(&base->softirq_expiry_lock);
}

static void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base)
	__acquires(&base->softirq_expiry_lock)
{
	spin_lock(&base->softirq_expiry_lock);
}

static void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base)
	__releases(&base->softirq_expiry_lock)
{
	spin_unlock(&base->softirq_expiry_lock);
}

/*
 * The counterpart to hrtimer_cancel_wait_running().
 *
 * If there is a waiter for cpu_base->expiry_lock, then it was waiting for
 * the timer callback to finish. Drop expiry_lock and reacquire it. That
 * allows the waiter to acquire the lock and make progress.
 */
static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
				      unsigned long flags)
{
	if (atomic_read(&cpu_base->timer_waiters)) {
		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
		spin_unlock(&cpu_base->softirq_expiry_lock);
		spin_lock(&cpu_base->softirq_expiry_lock);
		raw_spin_lock_irq(&cpu_base->lock);
	}
}

#ifdef CONFIG_SMP
static __always_inline bool is_migration_base(struct hrtimer_clock_base *base)
{
	return base == &migration_base;
}
#else
static __always_inline bool is_migration_base(struct hrtimer_clock_base *base)
{
	return false;
}
#endif

/*
 * This function is called on PREEMPT_RT kernels when the fast path
 * deletion of a timer failed because the timer callback function was
 * running.
 *
 * This prevents priority inversion: if the soft irq thread is preempted
 * in the middle of a timer callback, then calling hrtimer_cancel() can
 * lead to two issues:
 *
 *  - If the caller is on a remote CPU then it has to spin wait for the timer
 *    handler to complete.
 *    This can result in unbound priority inversion.
 *
 *  - If the caller originates from the task which preempted the timer
 *    handler on the same CPU, then spin waiting for the timer handler to
 *    complete is never going to end.
 */
void hrtimer_cancel_wait_running(const struct hrtimer *timer)
{
	/* Lockless read. Prevent the compiler from reloading it below */
	struct hrtimer_clock_base *base = READ_ONCE(timer->base);

	/*
	 * Just relax if the timer expires in hard interrupt context or if
	 * it is currently on the migration base.
	 */
	if (!timer->is_soft || is_migration_base(base)) {
		cpu_relax();
		return;
	}

	/*
	 * Mark the base as contended and grab the expiry lock, which is
	 * held by the softirq across the timer callback. Drop the lock
	 * immediately so the softirq can expire the next timer. In theory
	 * the timer could already be running again, but that's more than
	 * unlikely and just causes another wait loop.
	 */
	atomic_inc(&base->cpu_base->timer_waiters);
	spin_lock_bh(&base->cpu_base->softirq_expiry_lock);
	atomic_dec(&base->cpu_base->timer_waiters);
	spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
}
#else
static inline void
hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
static inline void
hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
static inline void
hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
					     unsigned long flags) { }
#endif

/**
 * hrtimer_cancel - cancel a timer and wait for the handler to finish.
 * @timer:	the timer to be cancelled
 *
 * Returns:
 *  0 when the timer was not active
 *  1 when the timer was active
 */
int hrtimer_cancel(struct hrtimer *timer)
{
	int ret;

	do {
		ret = hrtimer_try_to_cancel(timer);

		if (ret < 0)
			hrtimer_cancel_wait_running(timer);
	} while (ret < 0);
	return ret;
}
EXPORT_SYMBOL_GPL(hrtimer_cancel);

/**
 * __hrtimer_get_remaining - get remaining time for the timer
 * @timer:	the timer to read
 * @adjust:	adjust relative timers when CONFIG_TIME_LOW_RES=y
 */
ktime_t __hrtimer_get_remaining(const struct hrtimer *timer, bool adjust)
{
	unsigned long flags;
	ktime_t rem;

	lock_hrtimer_base(timer, &flags);
	if (IS_ENABLED(CONFIG_TIME_LOW_RES) && adjust)
		rem = hrtimer_expires_remaining_adjusted(timer);
	else
		rem = hrtimer_expires_remaining(timer);
	unlock_hrtimer_base(timer, &flags);

	return rem;
}
EXPORT_SYMBOL_GPL(__hrtimer_get_remaining);

#ifdef CONFIG_NO_HZ_COMMON
/**
 * hrtimer_get_next_event - get the time until next expiry event
 *
 * Returns the next expiry time or KTIME_MAX if no timer is pending.
 */
u64 hrtimer_get_next_event(void)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	u64 expires = KTIME_MAX;
	unsigned long flags;

	raw_spin_lock_irqsave(&cpu_base->lock, flags);

	if (!hrtimer_hres_active(cpu_base))
		expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);

	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

	return expires;
}

/**
 * hrtimer_next_event_without - time until next expiry event w/o one timer
 * @exclude:	timer to exclude
 *
 * Returns the next expiry time over all timers except for the @exclude one or
 * KTIME_MAX if none of them is pending.
 */
u64 hrtimer_next_event_without(const struct hrtimer *exclude)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	u64 expires = KTIME_MAX;
	unsigned long flags;

	raw_spin_lock_irqsave(&cpu_base->lock, flags);

	if (hrtimer_hres_active(cpu_base)) {
		unsigned int active;

		if (!cpu_base->softirq_activated) {
			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
			expires = __hrtimer_next_event_base(cpu_base, exclude,
							    active, KTIME_MAX);
		}
		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
		expires = __hrtimer_next_event_base(cpu_base, exclude, active,
						    expires);
	}

	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

	return expires;
}
#endif

static inline int hrtimer_clockid_to_base(clockid_t clock_id)
{
	switch (clock_id) {
	case CLOCK_MONOTONIC:
		return HRTIMER_BASE_MONOTONIC;
	case CLOCK_REALTIME:
		return HRTIMER_BASE_REALTIME;
	case CLOCK_BOOTTIME:
		return HRTIMER_BASE_BOOTTIME;
	case CLOCK_TAI:
		return HRTIMER_BASE_TAI;
	default:
		WARN(1, "Invalid clockid %d. Using MONOTONIC\n", clock_id);
		return HRTIMER_BASE_MONOTONIC;
	}
}

static ktime_t __hrtimer_cb_get_time(clockid_t clock_id)
{
	switch (clock_id) {
	case CLOCK_MONOTONIC:
		return ktime_get();
	case CLOCK_REALTIME:
		return ktime_get_real();
	case CLOCK_BOOTTIME:
		return ktime_get_boottime();
	case CLOCK_TAI:
		return ktime_get_clocktai();
	default:
		WARN(1, "Invalid clockid %d. Using MONOTONIC\n", clock_id);
		return ktime_get();
	}
}

ktime_t hrtimer_cb_get_time(const struct hrtimer *timer)
{
	return __hrtimer_cb_get_time(timer->base->clockid);
}
EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);

static void __hrtimer_setup(struct hrtimer *timer,
			    enum hrtimer_restart (*function)(struct hrtimer *),
			    clockid_t clock_id, enum hrtimer_mode mode)
{
	bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
	struct hrtimer_cpu_base *cpu_base;
	int base;

	/*
	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
	 * marked for hard interrupt expiry mode are moved into soft
	 * interrupt context for latency reasons and because the callbacks
	 * can invoke functions which might sleep on RT, e.g. spin_lock().
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !(mode & HRTIMER_MODE_HARD))
		softtimer = true;

	memset(timer, 0, sizeof(struct hrtimer));

	cpu_base = raw_cpu_ptr(&hrtimer_bases);

	/*
	 * POSIX magic: Relative CLOCK_REALTIME timers are not affected by
	 * clock modifications, so they needs to become CLOCK_MONOTONIC to
	 * ensure POSIX compliance.
	 */
	if (clock_id == CLOCK_REALTIME && mode & HRTIMER_MODE_REL)
		clock_id = CLOCK_MONOTONIC;

	base = softtimer ?
		HRTIMER_MAX_CLOCK_BASES / 2 : 0;
	base += hrtimer_clockid_to_base(clock_id);
	timer->is_soft = softtimer;
	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
	timer->base = &cpu_base->clock_base[base];
	timerqueue_init(&timer->node);

	if (WARN_ON_ONCE(!function))
		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
	else
		ACCESS_PRIVATE(timer, function) = function;
}

/**
 * hrtimer_setup - initialize a timer to the given clock
 * @timer:	the timer to be initialized
 * @function:	the callback function
 * @clock_id:	the clock to be used
 * @mode:	The modes which are relevant for initialization:
 *		HRTIMER_MODE_ABS, HRTIMER_MODE_REL, HRTIMER_MODE_ABS_SOFT,
 *		HRTIMER_MODE_REL_SOFT
 *
 *		The PINNED variants of the above can be handed in,
 *		but the PINNED bit is ignored as pinning happens
 *		when the hrtimer is started
 */
void hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*function)(struct hrtimer *),
		   clockid_t clock_id, enum hrtimer_mode mode)
{
	debug_setup(timer, clock_id, mode);
	__hrtimer_setup(timer, function, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_setup);

/**
 * hrtimer_setup_on_stack - initialize a timer on stack memory
 * @timer:	The timer to be initialized
 * @function:	the callback function
 * @clock_id:	The clock to be used
 * @mode:	The timer mode
 *
 * Similar to hrtimer_setup(), except that this one must be used if struct hrtimer is in stack
 * memory.
 */
void hrtimer_setup_on_stack(struct hrtimer *timer,
			    enum hrtimer_restart (*function)(struct hrtimer *),
			    clockid_t clock_id, enum hrtimer_mode mode)
{
	debug_setup_on_stack(timer, clock_id, mode);
	__hrtimer_setup(timer, function, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_setup_on_stack);

/*
 * A timer is active, when it is enqueued into the rbtree or the
 * callback function is running or it's in the state of being migrated
 * to another cpu.
 *
 * It is important for this function to not return a false negative.
 */
bool hrtimer_active(const struct hrtimer *timer)
{
	struct hrtimer_clock_base *base;
	unsigned int seq;

	do {
		base = READ_ONCE(timer->base);
		seq = raw_read_seqcount_begin(&base->seq);

		if (timer->state != HRTIMER_STATE_INACTIVE ||
		    base->running == timer)
			return true;

	} while (read_seqcount_retry(&base->seq, seq) ||
		 base != READ_ONCE(timer->base));

	return false;
}
EXPORT_SYMBOL_GPL(hrtimer_active);

/*
 * The write_seqcount_barrier()s in __run_hrtimer() split the thing into 3
 * distinct sections:
 *
 *  - queued:	the timer is queued
 *  - callback:	the timer is being ran
 *  - post:	the timer is inactive or (re)queued
 *
 * On the read side we ensure we observe timer->state and cpu_base->running
 * from the same section, if anything changed while we looked at it, we retry.
 * This includes timer->base changing because sequence numbers alone are
 * insufficient for that.
 *
 * The sequence numbers are required because otherwise we could still observe
 * a false negative if the read side got smeared over multiple consecutive
 * __run_hrtimer() invocations.
 */

static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
			  struct hrtimer_clock_base *base,
			  struct hrtimer *timer, ktime_t *now,
			  unsigned long flags) __must_hold(&cpu_base->lock)
{
	enum hrtimer_restart (*fn)(struct hrtimer *);
	bool expires_in_hardirq;
	int restart;

	lockdep_assert_held(&cpu_base->lock);

	debug_deactivate(timer);
	base->running = timer;

	/*
	 * Separate the ->running assignment from the ->state assignment.
	 *
	 * As with a regular write barrier, this ensures the read side in
	 * hrtimer_active() cannot observe base->running == NULL &&
	 * timer->state == INACTIVE.
	 */
	raw_write_seqcount_barrier(&base->seq);

	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
	fn = ACCESS_PRIVATE(timer, function);

	/*
	 * Clear the 'is relative' flag for the TIME_LOW_RES case. If the
	 * timer is restarted with a period then it becomes an absolute
	 * timer. If its not restarted it does not matter.
/ if (IS_ENABLED(CONFIG_TIME_LOW_RES)) timer->is_rel = false; / * The timer is marked as running in the CPU base, so it is * protected against migration to a different CPU even if the lock * is dropped. / raw_spin_unlock_irqrestore(&cpu_base->lock, flags); trace_hrtimer_expire_entry(timer, now); expires_in_hardirq = lockdep_hrtimer_enter(timer); restart = fn(timer); lockdep_hrtimer_exit(expires_in_hardirq); trace_hrtimer_expire_exit(timer); raw_spin_lock_irq(&cpu_base->lock); / * Note: We clear the running state after enqueue_hrtimer and * we do not reprogram the event hardware. Happens either in * hrtimer_start_range_ns() or in hrtimer_interrupt() * * Note: Because we dropped the cpu_base->lock above, * hrtimer_start_range_ns() can have popped in and enqueued the timer * for us already. / if (restart != HRTIMER_NORESTART && !(timer->state & HRTIMER_STATE_ENQUEUED)) enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS); / * Separate the ->running assignment from the ->state assignment. * * As with a regular write barrier, this ensures the read side in * hrtimer_active() cannot observe base->running.timer == NULL && * timer->state == INACTIVE. / raw_write_seqcount_barrier(&base->seq); WARN_ON_ONCE(base->running != timer); base->running = NULL; } static void __hrtimer_run_queues(struct hrtimer_cpu_base cpu_base, ktime_t now, unsigned long flags, unsigned int active_mask) { struct hrtimer_clock_base base; unsigned int active = cpu_base->active_bases & active_mask; for_each_active_base(base, cpu_base, active) { struct timerqueue_node node; ktime_t basenow; basenow = ktime_add(now, base->offset); while ((node = timerqueue_getnext(&base->active))) { struct hrtimer timer; timer = container_of(node, struct hrtimer, node); / * The immediate goal for using the softexpires is * minimizing wakeups, not running timers at the * earliest interrupt after their soft expiration. 
* This allows us to avoid using a Priority Search * Tree, which can answer a stabbing query for * overlapping intervals and instead use the simple * BST we already have. * We don't add extra wakeups by delaying timers that * are right-of a not yet expired timer, because that * timer will have to trigger a wakeup anyway. / if (basenow < hrtimer_get_softexpires_tv64(timer)) break; __run_hrtimer(cpu_base, base, timer, &basenow, flags); if (active_mask == HRTIMER_ACTIVE_SOFT) hrtimer_sync_wait_running(cpu_base, flags); } } } static __latent_entropy void hrtimer_run_softirq(void) { struct hrtimer_cpu_base cpu_base = this_cpu_ptr(&hrtimer_bases); unsigned long flags; ktime_t now; hrtimer_cpu_base_lock_expiry(cpu_base); raw_spin_lock_irqsave(&cpu_base->lock, flags); now = hrtimer_update_base(cpu_base); __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT); cpu_base->softirq_activated = 0; hrtimer_update_softirq_timer(cpu_base, true); raw_spin_unlock_irqrestore(&cpu_base->lock, flags); hrtimer_cpu_base_unlock_expiry(cpu_base); } #ifdef CONFIG_HIGH_RES_TIMERS /* * High resolution timer interrupt * Called with interrupts disabled / void hrtimer_interrupt(struct clock_event_device dev) { struct hrtimer_cpu_base cpu_base = this_cpu_ptr(&hrtimer_bases); ktime_t expires_next, now, entry_time, delta; unsigned long flags; int retries = 0; BUG_ON(!cpu_base->hres_active); cpu_base->nr_events++; dev->next_event = KTIME_MAX; raw_spin_lock_irqsave(&cpu_base->lock, flags); entry_time = now = hrtimer_update_base(cpu_base); retry: cpu_base->in_hrtirq = 1; / * We set expires_next to KTIME_MAX here with cpu_base->lock * held to prevent that a timer is enqueued in our queue via * the migration code. This does not affect enqueueing of * timers which run their callback and need to be requeued on * this CPU. 
/ cpu_base->expires_next = KTIME_MAX; if (!ktime_before(now, cpu_base->softirq_expires_next)) { cpu_base->softirq_expires_next = KTIME_MAX; cpu_base->softirq_activated = 1; raise_timer_softirq(HRTIMER_SOFTIRQ); } __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD); / Reevaluate the clock bases for the [soft] next expiry / expires_next = hrtimer_update_next_event(cpu_base); / * Store the new expiry value so the migration code can verify * against it. / cpu_base->expires_next = expires_next; cpu_base->in_hrtirq = 0; raw_spin_unlock_irqrestore(&cpu_base->lock, flags); / Reprogramming necessary ? / if (!tick_program_event(expires_next, 0)) { cpu_base->hang_detected = 0; return; } / * The next timer was already expired due to: * - tracing * - long lasting callbacks * - being scheduled away when running in a VM * * We need to prevent that we loop forever in the hrtimer * interrupt routine. We give it 3 attempts to avoid * overreacting on some spurious event. * * Acquire base lock for updating the offsets and retrieving * the current time. / raw_spin_lock_irqsave(&cpu_base->lock, flags); now = hrtimer_update_base(cpu_base); cpu_base->nr_retries++; if (++retries < 3) goto retry; / * Give the system a chance to do something else than looping * here. We stored the entry time, so we know exactly how long * we spent here. We schedule the next event this amount of * time away. / cpu_base->nr_hangs++; cpu_base->hang_detected = 1; raw_spin_unlock_irqrestore(&cpu_base->lock, flags); delta = ktime_sub(now, entry_time); if ((unsigned int)delta > cpu_base->max_hang_time) cpu_base->max_hang_time = (unsigned int) delta; / * Limit it to a sensible value as we enforce a longer * delay. Give the CPU at least 100ms to catch up. 
/ if (delta > 100 NSEC_PER_MSEC) expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC); else expires_next = ktime_add(now, delta); tick_program_event(expires_next, 1); pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta)); } #endif /* !CONFIG_HIGH_RES_TIMERS / / * Called from run_local_timers in hardirq context every jiffy / void hrtimer_run_queues(void) { struct hrtimer_cpu_base cpu_base = this_cpu_ptr(&hrtimer_bases); unsigned long flags; ktime_t now; if (hrtimer_hres_active(cpu_base)) return; /* * This _is_ ugly: We have to check periodically, whether we * can switch to highres and / or nohz mode. The clocksource * switch happens with xtime_lock held. Notification from * there only sets the check bit in the tick_oneshot code, * otherwise we might deadlock vs. xtime_lock. / if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())) { hrtimer_switch_to_hres(); return; } raw_spin_lock_irqsave(&cpu_base->lock, flags); now = hrtimer_update_base(cpu_base); if (!ktime_before(now, cpu_base->softirq_expires_next)) { cpu_base->softirq_expires_next = KTIME_MAX; cpu_base->softirq_activated = 1; raise_timer_softirq(HRTIMER_SOFTIRQ); } __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD); raw_spin_unlock_irqrestore(&cpu_base->lock, flags); } / * Sleep related functions: / static enum hrtimer_restart hrtimer_wakeup(struct hrtimer timer) { struct hrtimer_sleeper t = container_of(timer, struct hrtimer_sleeper, timer); struct task_struct task = t->task; t->task = NULL; if (task) wake_up_process(task); return HRTIMER_NORESTART; } /** * hrtimer_sleeper_start_expires - Start a hrtimer sleeper timer * @sl: sleeper to be started * @mode: timer mode abs/rel * * Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers * to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context) / void hrtimer_sleeper_start_expires(struct hrtimer_sleeper sl, enum hrtimer_mode mode) { /* * Make the enqueue delivery mode check work on RT. 
If the sleeper * was initialized for hard interrupt delivery, force the mode bit. * This is a special case for hrtimer_sleepers because * __hrtimer_setup_sleeper() determines the delivery mode on RT so the * fiddling with this decision is avoided at the call sites. / if (IS_ENABLED(CONFIG_PREEMPT_RT) && sl->timer.is_hard) mode \|= HRTIMER_MODE_HARD; hrtimer_start_expires(&sl->timer, mode); } EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires); static void __hrtimer_setup_sleeper(struct hrtimer_sleeper sl, clockid_t clock_id, enum hrtimer_mode mode) { /* * On PREEMPT_RT enabled kernels hrtimers which are not explicitly * marked for hard interrupt expiry mode are moved into soft * interrupt context either for latency reasons or because the * hrtimer callback takes regular spinlocks or invokes other * functions which are not suitable for hard interrupt context on * PREEMPT_RT. * * The hrtimer_sleeper callback is RT compatible in hard interrupt * context, but there is a latency concern: Untrusted userspace can * spawn many threads which arm timers for the same expiry time on * the same CPU. That causes a latency spike due to the wakeup of * a gazillion threads. * * OTOH, privileged real-time user space applications rely on the * low latency of hard interrupt wakeups. If the current task is in * a real-time scheduling class, mark the mode for hard interrupt * expiry. 
/ if (IS_ENABLED(CONFIG_PREEMPT_RT)) { if (rt_or_dl_task_policy(current) && !(mode & HRTIMER_MODE_SOFT)) mode \|= HRTIMER_MODE_HARD; } __hrtimer_setup(&sl->timer, hrtimer_wakeup, clock_id, mode); sl->task = current; } /* * hrtimer_setup_sleeper_on_stack - initialize a sleeper in stack memory * @sl: sleeper to be initialized * @clock_id: the clock to be used * @mode: timer mode abs/rel / void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper sl, clockid_t clock_id, enum hrtimer_mode mode) { debug_setup_on_stack(&sl->timer, clock_id, mode); __hrtimer_setup_sleeper(sl, clock_id, mode); } EXPORT_SYMBOL_GPL(hrtimer_setup_sleeper_on_stack); int nanosleep_copyout(struct restart_block restart, struct timespec64 ts) { switch(restart->nanosleep.type) { #ifdef CONFIG_COMPAT_32BIT_TIME case TT_COMPAT: if (put_old_timespec32(ts, restart->nanosleep.compat_rmtp)) return -EFAULT; break; #endif case TT_NATIVE: if (put_timespec64(ts, restart->nanosleep.rmtp)) return -EFAULT; break; default: BUG(); } return -ERESTART_RESTARTBLOCK; } static int __sched do_nanosleep(struct hrtimer_sleeper t, enum hrtimer_mode mode) { struct restart_block restart; do { set_current_state(TASK_INTERRUPTIBLE\|TASK_FREEZABLE); hrtimer_sleeper_start_expires(t, mode); if (likely(t->task)) schedule(); hrtimer_cancel(&t->timer); mode = HRTIMER_MODE_ABS; } while (t->task && !signal_pending(current)); __set_current_state(TASK_RUNNING); if (!t->task) return 0; restart = &current->restart_block; if (restart->nanosleep.type != TT_NONE) { ktime_t rem = hrtimer_expires_remaining(&t->timer); struct timespec64 rmt; if (rem <= 0) return 0; rmt = ktime_to_timespec64(rem); return nanosleep_copyout(restart, &rmt); } return -ERESTART_RESTARTBLOCK; } static long __sched hrtimer_nanosleep_restart(struct restart_block restart) { struct hrtimer_sleeper t; int ret; hrtimer_setup_sleeper_on_stack(&t, restart->nanosleep.clockid, HRTIMER_MODE_ABS); hrtimer_set_expires(&t.timer, restart->nanosleep.expires); ret = do_nanosleep(&t, 
HRTIMER_MODE_ABS); destroy_hrtimer_on_stack(&t.timer); return ret; } long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode, const clockid_t clockid) { struct restart_block restart; struct hrtimer_sleeper t; int ret = 0; hrtimer_setup_sleeper_on_stack(&t, clockid, mode); hrtimer_set_expires_range_ns(&t.timer, rqtp, current->timer_slack_ns); ret = do_nanosleep(&t, mode); if (ret != -ERESTART_RESTARTBLOCK) goto out; /* Absolute timers do not update the rmtp value and restart: / if (mode == HRTIMER_MODE_ABS) { ret = -ERESTARTNOHAND; goto out; } restart = &current->restart_block; restart->nanosleep.clockid = t.timer.base->clockid; restart->nanosleep.expires = hrtimer_get_expires(&t.timer); set_restart_fn(restart, hrtimer_nanosleep_restart); out: destroy_hrtimer_on_stack(&t.timer); return ret; } #ifdef CONFIG_64BIT SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user , rqtp, struct __kernel_timespec __user , rmtp) { struct timespec64 tu; if (get_timespec64(&tu, rqtp)) return -EFAULT; if (!timespec64_valid(&tu)) return -EINVAL; current->restart_block.fn = do_no_restart_syscall; current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE; current->restart_block.nanosleep.rmtp = rmtp; return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC); } #endif #ifdef CONFIG_COMPAT_32BIT_TIME SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user , rqtp, struct old_timespec32 __user , rmtp) { struct timespec64 tu; if (get_old_timespec32(&tu, rqtp)) return -EFAULT; if (!timespec64_valid(&tu)) return -EINVAL; current->restart_block.fn = do_no_restart_syscall; current->restart_block.nanosleep.type = rmtp ? 
TT_COMPAT : TT_NONE; current->restart_block.nanosleep.compat_rmtp = rmtp; return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC); } #endif / * Functions related to boot-time initialization: / int hrtimers_prepare_cpu(unsigned int cpu) { struct hrtimer_cpu_base cpu_base = &per_cpu(hrtimer_bases, cpu); int i; for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { struct hrtimer_clock_base clock_b = &cpu_base->clock_base[i]; clock_b->cpu_base = cpu_base; seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock); timerqueue_init_head(&clock_b->active); } cpu_base->cpu = cpu; hrtimer_cpu_base_init_expiry_lock(cpu_base); return 0; } int hrtimers_cpu_starting(unsigned int cpu) { struct hrtimer_cpu_base cpu_base = this_cpu_ptr(&hrtimer_bases); /* Clear out any left over state from a CPU down operation / cpu_base->active_bases = 0; cpu_base->hres_active = 0; cpu_base->hang_detected = 0; cpu_base->next_timer = NULL; cpu_base->softirq_next_timer = NULL; cpu_base->expires_next = KTIME_MAX; cpu_base->softirq_expires_next = KTIME_MAX; cpu_base->online = 1; return 0; } #ifdef CONFIG_HOTPLUG_CPU static void migrate_hrtimer_list(struct hrtimer_clock_base old_base, struct hrtimer_clock_base new_base) { struct hrtimer timer; struct timerqueue_node node; while ((node = timerqueue_getnext(&old_base->active))) { timer = container_of(node, struct hrtimer, node); BUG_ON(hrtimer_callback_running(timer)); debug_deactivate(timer); / * Mark it as ENQUEUED not INACTIVE otherwise the * timer could be seen as !active and just vanish away * under us on another CPU / __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0); timer->base = new_base; / * Enqueue the timers on the new cpu. This does not * reprogram the event device in case the timer * expires before the earliest on this CPU, but we run * hrtimer_interrupt after we migrated everything to * sort out already expired timers and reprogram the * event device. 
/ enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS); } } int hrtimers_cpu_dying(unsigned int dying_cpu) { int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER)); struct hrtimer_cpu_base old_base, new_base; old_base = this_cpu_ptr(&hrtimer_bases); new_base = &per_cpu(hrtimer_bases, ncpu); / * The caller is globally serialized and nobody else * takes two locks at once, deadlock is not possible. / raw_spin_lock(&old_base->lock); raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING); for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { migrate_hrtimer_list(&old_base->clock_base[i], &new_base->clock_base[i]); } / Tell the other CPU to retrigger the next event / smp_call_function_single(ncpu, retrigger_next_event, NULL, 0); raw_spin_unlock(&new_base->lock); old_base->online = 0; raw_spin_unlock(&old_base->lock); return 0; } #endif / CONFIG_HOTPLUG_CPU */ void __init hrtimers_init(void) { hrtimers_prepare_cpu(smp_processor_id()); hrtimers_cpu_starting(smp_processor_id()); open_softirq(HRTIMER_SOFTIRQ, hrtimer_run_softirq); } ]
9/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 28251 output: 24 thoughts: 334
Thoughts:
**Analyzing `hrtimer_interrupt` and Identifying a Discrepancy**

First, I need to confirm which store the report refers to. The report attributes the write to line 1882 of `hrtimer_interrupt`, which I initially took to be `cpu_base->expires_next = KTIME_MAX;`. In my copy of the source, however, line 1882 sits in the block that stores `expires_next` into `cpu_base->expires_next`:

```c
1856: void hrtimer_interrupt(struct clock_event_device *dev)
1857: {
...
1871:         cpu_base->expires_next = KTIME_MAX;
...
1881:         expires_next = hrtimer_update_next_event(cpu_base);
1882:         /*
1883:          * Store the new expiry value so the migration code can verify
1884:          * against it.
1885:          */
1886:         cpu_base->expires_next = expires_next;
```

The line numbers differ slightly between the report and this file, but the code itself is clear: `cpu_base->expires_next` is first set to `KTIME_MAX` and is later updated with the result of `hrtimer_update_next_event()`. Line 1882 belongs to that second assignment.

With that line-numbering discrepancy noted, the next step is `print_tickdevice` in `kernel/time/timer_list.c:206`.
10/2	2026/02/25 03:31	tool	read-file	0m Args: map[File:kernel/time/timer_list.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0 /* * List pending timers * * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar / #include <linux/proc_fs.h> #include <linux/module.h> #include <linux/spinlock.h> #include <linux/sched.h> #include <linux/seq_file.h> #include <linux/kallsyms.h> #include <linux/nmi.h> #include <linux/uaccess.h> #include "tick-internal.h" struct timer_list_iter { int cpu; bool second_pass; u64 now; }; / * This allows printing both to /proc/timer_list and * to the console (on SysRq-Q): / __printf(2, 3) static void SEQ_printf(struct seq_file m, const char fmt, ...) { va_list args; va_start(args, fmt); if (m) seq_vprintf(m, fmt, args); else vprintk(fmt, args); va_end(args); } static void print_timer(struct seq_file m, struct hrtimer taddr, struct hrtimer timer, int idx, u64 now) { SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function)); SEQ_printf(m, ", S:%02x", timer->state); SEQ_printf(m, "\n"); SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n", (unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)), (unsigned long long)ktime_to_ns(hrtimer_get_expires(timer)), (long long)(ktime_to_ns(hrtimer_get_softexpires(timer)) - now), (long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now)); } static void print_active_timers(struct seq_file m, struct hrtimer_clock_base base, u64 now) { struct hrtimer timer, tmp; unsigned long next = 0, i; struct timerqueue_node curr; unsigned long flags; next_one: i = 0; touch_nmi_watchdog(); raw_spin_lock_irqsave(&base->cpu_base->lock, flags); curr = timerqueue_getnext(&base->active); /* * Crude but we have to do this O(NN) thing, because we have to unlock the base when printing: / while (curr && i < next) { curr = timerqueue_iterate_next(curr); i++; } if (curr) { timer = container_of(curr, struct hrtimer, node); tmp = timer; raw_spin_unlock_irqrestore(&base->cpu_base->lock, 
flags); print_timer(m, timer, &tmp, i, now); next++; goto next_one; } raw_spin_unlock_irqrestore(&base->cpu_base->lock, flags); } static void print_base(struct seq_file m, struct hrtimer_clock_base base, u64 now) { SEQ_printf(m, " .base: %p\n", base); SEQ_printf(m, " .index: %d\n", base->index); SEQ_printf(m, " .resolution: %u nsecs\n", hrtimer_resolution); #ifdef CONFIG_HIGH_RES_TIMERS SEQ_printf(m, " .offset: %Lu nsecs\n", (unsigned long long) ktime_to_ns(base->offset)); #endif SEQ_printf(m, "active timers:\n"); print_active_timers(m, base, now + ktime_to_ns(base->offset)); } static void print_cpu(struct seq_file m, int cpu, u64 now) { struct hrtimer_cpu_base cpu_base = &per_cpu(hrtimer_bases, cpu); int i; SEQ_printf(m, "cpu: %d\n", cpu); for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { SEQ_printf(m, " clock %d:\n", i); print_base(m, cpu_base->clock_base + i, now); } #define P(x) \ SEQ_printf(m, " .%-15s: %Lu\n", #x, \ (unsigned long long)(cpu_base->x)) #define P_ns(x) \ SEQ_printf(m, " .%-15s: %Lu nsecs\n", #x, \ (unsigned long long)(ktime_to_ns(cpu_base->x))) #ifdef CONFIG_HIGH_RES_TIMERS P_ns(expires_next); P(hres_active); P(nr_events); P(nr_retries); P(nr_hangs); P(max_hang_time); #endif #undef P #undef P_ns #ifdef CONFIG_TICK_ONESHOT # define P(x) \ SEQ_printf(m, " .%-15s: %Lu\n", #x, \ (unsigned long long)(ts->x)) # define P_ns(x) \ SEQ_printf(m, " .%-15s: %Lu nsecs\n", #x, \ (unsigned long long)(ktime_to_ns(ts->x))) # define P_flag(x, f) \ SEQ_printf(m, " .%-15s: %d\n", #x, !!(ts->flags & (f))) { struct tick_sched ts = tick_get_tick_sched(cpu); P_flag(nohz, TS_FLAG_NOHZ); P_flag(highres, TS_FLAG_HIGHRES); P_ns(last_tick); P_flag(tick_stopped, TS_FLAG_STOPPED); P(idle_jiffies); P(idle_calls); P(idle_sleeps); P_ns(idle_entrytime); P_ns(idle_waketime); P_ns(idle_exittime); P_ns(idle_sleeptime); P_ns(iowait_sleeptime); P(last_jiffies); P(next_timer); P_ns(idle_expires); SEQ_printf(m, "jiffies: %Lu\n", (unsigned long long)jiffies); } #endif #undef P #undef P_ns 
SEQ_printf(m, "\n"); } #ifdef CONFIG_GENERIC_CLOCKEVENTS static void print_tickdevice(struct seq_file m, struct tick_device td, int cpu) { struct clock_event_device dev = td->evtdev; touch_nmi_watchdog(); SEQ_printf(m, "Tick Device: mode: %d\n", td->mode); if (cpu < 0) SEQ_printf(m, "Broadcast device\n"); else SEQ_printf(m, "Per CPU device: %d\n", cpu); SEQ_printf(m, "Clock Event Device: "); if (!dev) { SEQ_printf(m, "<NULL>\n"); return; } SEQ_printf(m, "%s\n", dev->name); SEQ_printf(m, " max_delta_ns: %llu\n", (unsigned long long) dev->max_delta_ns); SEQ_printf(m, " min_delta_ns: %llu\n", (unsigned long long) dev->min_delta_ns); SEQ_printf(m, " mult: %u\n", dev->mult); SEQ_printf(m, " shift: %u\n", dev->shift); SEQ_printf(m, " mode: %d\n", clockevent_get_state(dev)); SEQ_printf(m, " next_event: %Ld nsecs\n", (unsigned long long) ktime_to_ns(dev->next_event)); SEQ_printf(m, " set_next_event: %ps\n", dev->set_next_event); if (dev->set_state_shutdown) SEQ_printf(m, " shutdown: %ps\n", dev->set_state_shutdown); if (dev->set_state_periodic) SEQ_printf(m, " periodic: %ps\n", dev->set_state_periodic); if (dev->set_state_oneshot) SEQ_printf(m, " oneshot: %ps\n", dev->set_state_oneshot); if (dev->set_state_oneshot_stopped) SEQ_printf(m, " oneshot stopped: %ps\n", dev->set_state_oneshot_stopped); if (dev->tick_resume) SEQ_printf(m, " resume: %ps\n", dev->tick_resume); SEQ_printf(m, " event_handler: %ps\n", dev->event_handler); SEQ_printf(m, "\n"); SEQ_printf(m, " retries: %lu\n", dev->retries); #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST if (cpu >= 0) { const struct clock_event_device wd = tick_get_wakeup_device(cpu); SEQ_printf(m, "Wakeup Device: %s\n", wd ? 
wd->name : "<NULL>"); } #endif SEQ_printf(m, "\n"); } static void timer_list_show_tickdevices_header(struct seq_file m) { #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST print_tickdevice(m, tick_get_broadcast_device(), -1); SEQ_printf(m, "tick_broadcast_mask: %pb\n", cpumask_pr_args(tick_get_broadcast_mask())); #ifdef CONFIG_TICK_ONESHOT SEQ_printf(m, "tick_broadcast_oneshot_mask: %pb\n", cpumask_pr_args(tick_get_broadcast_oneshot_mask())); #endif SEQ_printf(m, "\n"); #endif } #endif static inline void timer_list_header(struct seq_file m, u64 now) { SEQ_printf(m, "Timer List Version: v0.10\n"); SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES); SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now); SEQ_printf(m, "\n"); } void sysrq_timer_list_show(void) { u64 now = ktime_to_ns(ktime_get()); int cpu; timer_list_header(NULL, now); for_each_online_cpu(cpu) print_cpu(NULL, cpu, now); #ifdef CONFIG_GENERIC_CLOCKEVENTS timer_list_show_tickdevices_header(NULL); for_each_online_cpu(cpu) print_tickdevice(NULL, tick_get_device(cpu), cpu); #endif return; } #ifdef CONFIG_PROC_FS static int timer_list_show(struct seq_file m, void v) { struct timer_list_iter iter = v; if (iter->cpu == -1 && !iter->second_pass) timer_list_header(m, iter->now); else if (!iter->second_pass) print_cpu(m, iter->cpu, iter->now); #ifdef CONFIG_GENERIC_CLOCKEVENTS else if (iter->cpu == -1 && iter->second_pass) timer_list_show_tickdevices_header(m); else print_tickdevice(m, tick_get_device(iter->cpu), iter->cpu); #endif return 0; } static void move_iter(struct timer_list_iter iter, loff_t offset) { for (; offset; offset--) { iter->cpu = cpumask_next(iter->cpu, cpu_online_mask); if (iter->cpu >= nr_cpu_ids) { #ifdef CONFIG_GENERIC_CLOCKEVENTS if (!iter->second_pass) { iter->cpu = -1; iter->second_pass = true; } else return NULL; #else return NULL; #endif } } return iter; } static void timer_list_start(struct seq_file file, loff_t offset) { struct timer_list_iter iter = 
file->private; if (!offset) iter->now = ktime_to_ns(ktime_get()); iter->cpu = -1; iter->second_pass = false; return move_iter(iter, offset); } static void timer_list_next(struct seq_file file, void v, loff_t offset) { struct timer_list_iter iter = file->private; ++offset; return move_iter(iter, 1); } static void timer_list_stop(struct seq_file seq, void v) { } static const struct seq_operations timer_list_sops = { .start = timer_list_start, .next = timer_list_next, .stop = timer_list_stop, .show = timer_list_show, }; static int __init init_timer_list_procfs(void) { struct proc_dir_entry *pe; pe = proc_create_seq_private("timer_list", 0400, NULL, &timer_list_sops, sizeof(struct timer_list_iter), NULL); if (!pe) return -ENOMEM; return 0; } __initcall(init_timer_list_procfs); #endif ]
11/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 32732 output: 42 thoughts: 1126
Thoughts:
**Race Condition Identified: `dev->next_event` in `clock_event_device`**

The report shows a read of memory location `0xffff888237d17b58` by `print_tickdevice` at `kernel/time/timer_list.c:206` and a write to the same location by `hrtimer_interrupt` at `kernel/time/hrtimer.c:1882`. I first assumed the field was `cpu_base->expires_next`, so I had to double-check which memory location is actually being accessed.

Re-examining the code together with the line numbers in the report settles it. Line 206 of `print_tickdevice` reads `dev->next_event`. In the version of `hrtimer_interrupt` I am looking at, the store to `dev->next_event` is at line 1865 rather than 1882. Because the reported addresses match, `print_tickdevice` is reading `dev->next_event` and `hrtimer_interrupt` is writing `dev->next_event`. This is a potential race on the `next_event` field of `struct clock_event_device`.

`hrtimer_interrupt` also calls `tick_program_event`, so the next step is to check what that function does, in order to understand how it interacts with this field and what the race implies.
12/2	2026/02/25 03:31	tool	codesearch-definition-source	0m Args: map[ContextFile:kernel/time/hrtimer.c IncludeLines:true Name:tick_program_event]
Results: map[SourceCode:
  20:	/**
  21:	 * tick_program_event - program the CPU local timer device for the next event
  22:	 * @expires: the time at which the next timer event should occur
  23:	 * @force: flag to force reprograming even if the event time hasn't changed
  24:	 *
  25:	 * Return: 0 on success, negative error code on failure
  26:	 */
  27:	int tick_program_event(ktime_t expires, int force)
  28:	{
  29:		struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
  30:	
  31:		if (unlikely(expires == KTIME_MAX)) {
  32:			/*
  33:			 * We don't need the clock event device any more, stop it.
  34:			 */
  35:			clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
  36:			dev->next_event = KTIME_MAX;
  37:			return 0;
  38:		}
  39:	
  40:		if (unlikely(clockevent_state_oneshot_stopped(dev))) {
  41:			/*
  42:			 * We need the clock event again, configure it in ONESHOT mode
  43:			 * before using it.
  44:			 */
  45:			clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
  46:		}
  47:	
  48:		return clockevents_program_event(dev, expires, force);
  49:	}
 SourceFile:kernel/time/tick-oneshot.c]
13/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 34747 output: 45 thoughts: 9
Thoughts:
**Clock Event Programming and its Implications**

Next is `clockevents_program_event`, which sits at the heart of how the kernel manages its timers and interrupts. The function ultimately programs the underlying hardware: a Programmable Interval Timer (PIT), an Advanced Configuration and Power Interface (ACPI) timer, a High Precision Event Timer (HPET), or whatever clock event device the system uses. The architecture and the specific device determine how the event is programmed.

The broader context matters too. Is the function called during normal operation or during boot? Does it interact with other interrupt mechanisms, such as the Advanced Programmable Interrupt Controller (APIC) or the Local APIC (LAPIC)? The scheduler's time management, context switching, and the accuracy of time-related functions all depend on what happens here. A mistake in programming the event, especially on multi-CPU or heterogeneous systems, leads to timing jitter and performance problems.

Correct behaviour also depends on synchronization primitives, interrupt masking, and memory barriers. The event must be programmed correctly for the specific timer device, and potential race conditions and edge cases have to be accounted for.
14/2	2026/02/25 03:31	tool	codesearch-definition-source	0m Args: map[ContextFile:kernel/time/tick-oneshot.c IncludeLines:true Name:clockevents_program_event]
Results: map[SourceCode:
 295:	/**
 296:	 * clockevents_program_event - Reprogram the clock event device.
 297:	 * @dev:	device to program
 298:	 * @expires:	absolute expiry time (monotonic clock)
 299:	 * @force:	program minimum delay if expires can not be set
 300:	 *
 301:	 * Returns 0 on success, -ETIME when the event is in the past.
 302:	 */
 303:	int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 304:				      bool force)
 305:	{
 306:		unsigned long long clc;
 307:		int64_t delta;
 308:		int rc;
 309:	
 310:		if (WARN_ON_ONCE(expires < 0))
 311:			return -ETIME;
 312:	
 313:		dev->next_event = expires;
 314:	
 315:		if (clockevent_state_shutdown(dev))
 316:			return 0;
 317:	
 318:		/* We must be in ONESHOT state here */
 319:		WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
 320:			  clockevent_get_state(dev));
 321:	
 322:		/* Shortcut for clockevent devices that can deal with ktime. */
 323:		if (dev->features & CLOCK_EVT_FEAT_KTIME)
 324:			return dev->set_next_ktime(expires, dev);
 325:	
 326:		delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
 327:		if (delta <= 0)
 328:			return force ? clockevents_program_min_delta(dev) : -ETIME;
 329:	
 330:		delta = min(delta, (int64_t) dev->max_delta_ns);
 331:		delta = max(delta, (int64_t) dev->min_delta_ns);
 332:	
 333:		clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
 334:		rc = dev->set_next_event((unsigned long) clc, dev);
 335:	
 336:		return (rc && force) ? clockevents_program_min_delta(dev) : rc;
 337:	}
 SourceFile:kernel/time/clockevents.c]
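The clamp-and-scale step at the end of `clockevents_program_event()` (limit the delta to the device's range, then multiply by `mult` and shift right by `shift`) can be illustrated with a small userspace sketch. The helper name `ns_to_cycles` and its parameter list are hypothetical and not part of the kernel; the arithmetic follows the listing above.

```c
#include <stdint.h>

/*
 * Hypothetical userspace helper (not a kernel API). It mirrors the tail of
 * clockevents_program_event(): clamp the nanosecond delta to the device
 * limits, then convert it to device cycles with the mult/shift pair.
 */
static uint64_t ns_to_cycles(int64_t delta_ns, uint64_t min_delta_ns,
                             uint64_t max_delta_ns, uint32_t mult,
                             uint32_t shift)
{
    /* Equivalent of: delta = min(delta, (int64_t) dev->max_delta_ns); */
    if (delta_ns > (int64_t)max_delta_ns)
        delta_ns = (int64_t)max_delta_ns;

    /* Equivalent of: delta = max(delta, (int64_t) dev->min_delta_ns); */
    if (delta_ns < (int64_t)min_delta_ns)
        delta_ns = (int64_t)min_delta_ns;

    /* Equivalent of: clc = ((unsigned long long) delta * dev->mult) >> dev->shift; */
    return ((uint64_t)delta_ns * mult) >> shift;
}
```

With `mult = 1 << 20` and `shift = 20` the conversion is the identity, which makes the clamping easy to see: a delta below `min_delta_ns` comes back as `min_delta_ns`. None of this touches `dev->next_event`, which was already stored at line 313 before the conversion runs.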
15/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 35875 output: 42 thoughts: 807
Thoughts:
**Analyzing a Potential Data Race in the Kernel's Timer System**

The KCSAN report points to a data race on `dev->next_event`. `clockevents_program_event` writes the field at line 313. `hrtimer_interrupt` writes it directly as well, and then again through `tick_program_event`, which calls `clockevents_program_event`. `hrtimer_interrupt` is called with interrupts disabled, so all of these writes happen with interrupts off on the local CPU.

The race is between the write in `hrtimer_interrupt` and the read in `print_tickdevice`. `print_tickdevice` is called when `/proc/timer_list` is read; it iterates over the CPUs and their tick devices and reads `dev->next_event` directly, without taking any lock. An unlocked read of a field that another CPU writes is a classic red flag.

Is it actually a problem? `dev->next_event` is a `ktime_t`, a 64-bit value, and on 64-bit systems a naturally aligned 8-byte access is atomic. The report shows the value changing from `0x000000129f80e76b` to `0x000000129f81f8db`. Given the atomic access, this means the field was updated concurrently with the read, not that the read was torn. The race therefore looks benign: at worst, `/proc/timer_list` shows a slightly stale `next_event`. The file exists for debugging, so strict synchronization is not expected.

KCSAN flagged it, though, so it is still worth checking whether `dev->next_event` is used for anything sensitive, in which case the race could affect the timer system. It is a field of `struct clock_event_device`, so the next step is to see what else uses it.
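The reasoning above (one writer in interrupt context, one lockless reader that only prints the value) can be sketched in userspace C with relaxed C11 atomics. This is only an analogue: the kernel would typically mark such accesses with `READ_ONCE()`/`WRITE_ONCE()` or `data_race()`, and the log does not say whether this site should be annotated. All `mock_` names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Same numeric value as the kernel's KTIME_MAX (largest signed 64-bit value). */
#define MOCK_KTIME_MAX INT64_MAX

/* Hypothetical stand-in for the one field of interest. */
struct mock_clock_event_device {
    _Atomic int64_t next_event;
};

/* Writer side: models the stores done from the timer interrupt path. */
static void mock_set_next_event(struct mock_clock_event_device *dev,
                                int64_t expires)
{
    /* Relaxed store: single-copy atomic, no ordering guarantees implied. */
    atomic_store_explicit(&dev->next_event, expires, memory_order_relaxed);
}

/* Reader side: models the debug read done for /proc/timer_list. */
static int64_t mock_read_next_event(struct mock_clock_event_device *dev)
{
    /* Relaxed load: may return a stale value, but never a torn one. */
    return atomic_load_explicit(&dev->next_event, memory_order_relaxed);
}
```

A reader built this way can observe `MOCK_KTIME_MAX` or an older expiry if it runs while the writer is mid-interrupt, which matches the "slightly stale value" outcome described in the assessment.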
16/2	2026/02/25 03:31	tool	codesearch-definition-source	0m Args: map[ContextFile:kernel/time/clockevents.c IncludeLines:true Name:clock_event_device] Results: map[SourceCode:
70: /**
71:  * struct clock_event_device - clock event device descriptor
72:  * @event_handler: Assigned by the framework to be called by the low
73:  *		level handler of the event source
74:  * @set_next_event: set next event function using a clocksource delta
75:  * @set_next_ktime: set next event function using a direct ktime value
76:  * @next_event: local storage for the next event in oneshot mode
77:  * @max_delta_ns: maximum delta value in ns
78:  * @min_delta_ns: minimum delta value in ns
79:  * @mult: nanosecond to cycles multiplier
80:  * @shift: nanoseconds to cycles divisor (power of two)
81:  * @state_use_accessors:current state of the device, assigned by the core code
82:  * @features: features
83:  * @retries: number of forced programming retries
84:  * @set_state_periodic: switch state to periodic
85:  * @set_state_oneshot: switch state to oneshot
86:  * @set_state_oneshot_stopped: switch state to oneshot_stopped
87:  * @set_state_shutdown: switch state to shutdown
88:  * @tick_resume: resume clkevt device
89:  * @broadcast: function to broadcast events
90:  * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration
91:  * @max_delta_ticks: maximum delta value in ticks stored for reconfiguration
92:  * @name: ptr to clock event name
93:  * @rating: variable to rate clock event devices
94:  * @irq: IRQ number (only for non CPU local devices)
95:  * @bound_on: Bound on CPU
96:  * @cpumask: cpumask to indicate for which CPUs this device works
97:  * @list: list head for the management code
98:  * @owner: module reference
99:  */
100: struct clock_event_device {
101: 	void (*event_handler)(struct clock_event_device *);
102: 	int (*set_next_event)(unsigned long evt, struct clock_event_device *);
103: 	int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
104: 	ktime_t next_event;
105: 	u64 max_delta_ns;
106: 	u64 min_delta_ns;
107: 	u32 mult;
108: 	u32 shift;
109: 	enum clock_event_state state_use_accessors;
110: 	unsigned int features;
111: 	unsigned long retries;
112:
113: 	int (*set_state_periodic)(struct clock_event_device *);
114: 	int (*set_state_oneshot)(struct clock_event_device *);
115: 	int (*set_state_oneshot_stopped)(struct clock_event_device *);
116: 	int (*set_state_shutdown)(struct clock_event_device *);
117: 	int (*tick_resume)(struct clock_event_device *);
118:
119: 	void (*broadcast)(const struct cpumask *mask);
120: 	void (*suspend)(struct clock_event_device *);
121: 	void (*resume)(struct clock_event_device *);
122: 	unsigned long min_delta_ticks;
123: 	unsigned long max_delta_ticks;
124:
125: 	const char *name;
126: 	int rating;
127: 	int irq;
128: 	int bound_on;
129: 	const struct cpumask *cpumask;
130: 	struct list_head list;
131: 	struct module *owner;
132: } ____cacheline_aligned;
 SourceFile:include/linux/clockchips.h]
17/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 38360 output: 38 thoughts: 90 Thoughts: Here's my thought process, summarized as if I were considering this code: Analyzing the `next_event` Variable Okay, so `next_event` is defined as a `ktime_t`, which, as I recall, is a 64-bit signed integer (`s64`). That's straightforward enough. The comment indicates it's used as local storage for the next event specifically in the one-shot mode of whatever this code is managing. So, it's essentially a timestamp, an absolute point in time. The comment goes on to highlight that it stores the expiry time, which makes sense. I see it's used in `clockevents_program_event`, confirming it's directly related to setting up a future timer event. But, I need to know the full picture. I need to see the breadth of its use. Is this a local variable used only for calculation within this method, or is this variable referenced and mutated elsewhere? Is this variable subject to race conditions if used in an interrupt context? To answer this question, I'll search for every instance of `next_event` within the `kernel/time/` directory. That should give me a good handle on where it's being read from, written to, and potentially, how it's being used across the timekeeping subsystem. This search will be key to understanding the scope and potential implications of this variable.
18/2	2026/02/25 03:31	tool	codesearch-find-references	0m Args: map[Name:clock_event_device::next_event SourceTreePrefix:kernel/time/] Results: map[References:[map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:848] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:848] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:849] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:883] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:918] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:926] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:__clockevents_update_freq SourceFile:kernel/time/clockevents.c SourceLine:528] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:broadcast_needs_cpu SourceFile:kernel/time/tick-broadcast.c SourceLine:774] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:broadcast_shutdown_local SourceFile:kernel/time/tick-broadcast.c SourceLine:790] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:broadcast_shutdown_local SourceFile:kernel/time/tick-broadcast.c SourceLine:790] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:clockevents_increase_min_delta SourceFile:kernel/time/clockevents.c SourceLine:208] map[ReferenceKind:writes 
ReferencingEntityKind:function ReferencingEntityName:clockevents_program_event SourceFile:kernel/time/clockevents.c SourceLine:313] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:clockevents_program_min_delta SourceFile:kernel/time/clockevents.c SourceLine:241] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:clockevents_shutdown SourceFile:kernel/time/clockevents.c SourceLine:174] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:hotplug_cpu__broadcast_tick_pull SourceFile:kernel/time/tick-broadcast.c SourceLine:1174] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:hotplug_cpu__broadcast_tick_pull SourceFile:kernel/time/tick-broadcast.c SourceLine:1178] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:hrtimer_interrupt SourceFile:kernel/time/hrtimer.c SourceLine:1882] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:print_tickdevice SourceFile:kernel/time/timer_list.c SourceLine:206] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_broadcast_init_next_event SourceFile:kernel/time/tick-broadcast.c SourceLine:1001] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_broadcast_setup_oneshot SourceFile:kernel/time/tick-broadcast.c SourceLine:1066] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:698] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:713] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:721] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast 
SourceFile:kernel/time/tick-broadcast.c SourceLine:722] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_periodic SourceFile:kernel/time/tick-common.c SourceLine:111] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_periodic_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:416] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_nohz_get_sleep_length SourceFile:kernel/time/tick-sched.c SourceLine:1375] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_nohz_lowres_handler SourceFile:kernel/time/tick-sched.c SourceLine:1506] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_nohz_stop_tick SourceFile:kernel/time/tick-sched.c SourceLine:1049] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_oneshot_wakeup_control SourceFile:kernel/time/tick-broadcast.c SourceLine:952] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_program_event SourceFile:kernel/time/tick-oneshot.c SourceLine:36] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_setup_device SourceFile:kernel/time/tick-common.c SourceLine:232]] TruncatedOutput:false]
19/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 40518 output: 41 thoughts: 266 Thoughts:

Analyzing the `next_event` Race Condition

Okay, so the references indicate `next_event` is a widely used field. However, the specific race condition identified focuses on interactions between `hrtimer_interrupt` (writing) and `print_tickdevice` (reading). My initial thought was to quickly check where this write happens within `hrtimer_interrupt`, and I went to the line indicated in the initial reference, which I thought was 1882:

```c
1882: dev->next_event = KTIME_MAX;
```

Hmm, let me check that file again.

```c
1856: void hrtimer_interrupt(struct clock_event_device *dev)
...
1865: dev->next_event = KTIME_MAX;
```

Wait a second. The file is different to the reference result. Line numbers don't match exactly. I bet the line numbers changed between the reference index and my `read-file` output, probably due to some file updates or different versions. No problem, this happens. I need to be careful about that.

Instead of hunting for line 1882 in the file contents, let's go straight to the definitive source. Let me use `codesearch-definition-source` for `hrtimer_interrupt`, with line numbers. That should give me a more reliable starting point. This should clarify exactly where the write to `dev->next_event` occurs within `hrtimer_interrupt` and prevent these silly confusion errors.
20/2	2026/02/25 03:31	tool	codesearch-definition-source	0m Args: map[ContextFile:kernel/time/hrtimer.c IncludeLines:true Name:hrtimer_interrupt] Results: map[SourceCode:
1869: /*
1870:  * High resolution timer interrupt
1871:  * Called with interrupts disabled
1872:  */
1873: void hrtimer_interrupt(struct clock_event_device *dev)
1874: {
1875: 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
1876: 	ktime_t expires_next, now, entry_time, delta;
1877: 	unsigned long flags;
1878: 	int retries = 0;
1879:
1880: 	BUG_ON(!cpu_base->hres_active);
1881: 	cpu_base->nr_events++;
1882: 	dev->next_event = KTIME_MAX;
1883:
1884: 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
1885: 	entry_time = now = hrtimer_update_base(cpu_base);
1886: retry:
1887: 	cpu_base->in_hrtirq = 1;
1888: 	/*
1889: 	 * We set expires_next to KTIME_MAX here with cpu_base->lock
1890: 	 * held to prevent that a timer is enqueued in our queue via
1891: 	 * the migration code. This does not affect enqueueing of
1892: 	 * timers which run their callback and need to be requeued on
1893: 	 * this CPU.
1894: 	 */
1895: 	cpu_base->expires_next = KTIME_MAX;
1896:
1897: 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
1898: 		cpu_base->softirq_expires_next = KTIME_MAX;
1899: 		cpu_base->softirq_activated = 1;
1900: 		raise_timer_softirq(HRTIMER_SOFTIRQ);
1901: 	}
1902:
1903: 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
1904:
1905: 	/* Reevaluate the clock bases for the [soft] next expiry */
1906: 	expires_next = hrtimer_update_next_event(cpu_base);
1907: 	/*
1908: 	 * Store the new expiry value so the migration code can verify
1909: 	 * against it.
1910: 	 */
1911: 	cpu_base->expires_next = expires_next;
1912: 	cpu_base->in_hrtirq = 0;
1913: 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
1914:
1915: 	/* Reprogramming necessary ? */
1916: 	if (!tick_program_event(expires_next, 0)) {
1917: 		cpu_base->hang_detected = 0;
1918: 		return;
1919: 	}
1920:
1921: 	/*
1922: 	 * The next timer was already expired due to:
1923: 	 * - tracing
1924: 	 * - long lasting callbacks
1925: 	 * - being scheduled away when running in a VM
1926: 	 *
1927: 	 * We need to prevent that we loop forever in the hrtimer
1928: 	 * interrupt routine. We give it 3 attempts to avoid
1929: 	 * overreacting on some spurious event.
1930: 	 *
1931: 	 * Acquire base lock for updating the offsets and retrieving
1932: 	 * the current time.
1933: 	 */
1934: 	raw_spin_lock_irqsave(&cpu_base->lock, flags);
1935: 	now = hrtimer_update_base(cpu_base);
1936: 	cpu_base->nr_retries++;
1937: 	if (++retries < 3)
1938: 		goto retry;
1939: 	/*
1940: 	 * Give the system a chance to do something else than looping
1941: 	 * here. We stored the entry time, so we know exactly how long
1942: 	 * we spent here. We schedule the next event this amount of
1943: 	 * time away.
1944: 	 */
1945: 	cpu_base->nr_hangs++;
1946: 	cpu_base->hang_detected = 1;
1947: 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
1948:
1949: 	delta = ktime_sub(now, entry_time);
1950: 	if ((unsigned int)delta > cpu_base->max_hang_time)
1951: 		cpu_base->max_hang_time = (unsigned int) delta;
1952: 	/*
1953: 	 * Limit it to a sensible value as we enforce a longer
1954: 	 * delay. Give the CPU at least 100ms to catch up.
1955: 	 */
1956: 	if (delta > 100 * NSEC_PER_MSEC)
1957: 		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
1958: 	else
1959: 		expires_next = ktime_add(now, delta);
1960: 	tick_program_event(expires_next, 1);
1961: 	pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
1962: }
 SourceFile:kernel/time/hrtimer.c]
21/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 42703 output: 37 thoughts: 288 Thoughts: Analyzing the Timer Race Condition Okay, so the report points directly to line 1882 in `hrtimer.c`: `dev->next_event = KTIME_MAX;`. That's a perfect match. The report indicates a race condition centered around `dev->next_event`. Specifically, `hrtimer_interrupt` is writing `KTIME_MAX` to this variable, while `print_tickdevice` is reading it for display, likely for debugging or status output. Initially, I thought this was a completely benign race. `print_tickdevice`'s purpose is to show the timer system's state. If it catches a value while it's being updated, particularly with `KTIME_MAX`, it's not a functional issue. But hang on… is there a torn read risk? We need to verify data consistency. `ktime_t`, which is what `dev->next_event` is, is an `s64`, meaning it's 8 bytes in size. The report confirms "read to ... of 8 bytes" and "write to ... of 8 bytes." This is crucial. It means the compiler is employing 8-byte instructions, which is important for atomicity. On x86_64 architectures, 8-byte aligned loads and stores are atomic. Now we need to confirm the alignment. `dev->next_event` is located within the `struct clock_event_device`. I need to double-check that `next_event` is properly 8-byte aligned within that structure. If it is, the race is harmless because the reads and writes are atomic. If it's not aligned, then we have a problem.
22/2	2026/02/25 03:31	tool	codesearch-struct-layout	0m Args: map[ContextFile:kernel/time/clockevents.c Name:clock_event_device] Results: map[Fields:[map[Name:event_handler OffsetBits:0 SizeBits:64] map[Name:set_next_event OffsetBits:64 SizeBits:64] map[Name:set_next_ktime OffsetBits:128 SizeBits:64] map[Name:next_event OffsetBits:192 SizeBits:64] map[Name:max_delta_ns OffsetBits:256 SizeBits:64] map[Name:min_delta_ns OffsetBits:320 SizeBits:64] map[Name:mult OffsetBits:384 SizeBits:32] map[Name:shift OffsetBits:416 SizeBits:32] map[Name:state_use_accessors OffsetBits:448 SizeBits:32] map[Name:features OffsetBits:480 SizeBits:32] map[Name:retries OffsetBits:512 SizeBits:64] map[Name:set_state_periodic OffsetBits:576 SizeBits:64] map[Name:set_state_oneshot OffsetBits:640 SizeBits:64] map[Name:set_state_oneshot_stopped OffsetBits:704 SizeBits:64] map[Name:set_state_shutdown OffsetBits:768 SizeBits:64] map[Name:tick_resume OffsetBits:832 SizeBits:64] map[Name:broadcast OffsetBits:896 SizeBits:64] map[Name:suspend OffsetBits:960 SizeBits:64] map[Name:resume OffsetBits:1024 SizeBits:64] map[Name:min_delta_ticks OffsetBits:1088 SizeBits:64] map[Name:max_delta_ticks OffsetBits:1152 SizeBits:64] map[Name:name OffsetBits:1216 SizeBits:64] map[Name:rating OffsetBits:1280 SizeBits:32] map[Name:irq OffsetBits:1312 SizeBits:32] map[Name:bound_on OffsetBits:1344 SizeBits:32] map[Name:cpumask OffsetBits:1408 SizeBits:64] map[Name:list OffsetBits:1472 SizeBits:128] map[Name:owner OffsetBits:1600 SizeBits:64]]]
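The layout result above places `next_event` at `OffsetBits:192`. As a rough sketch of the alignment check this feeds into, the fragment below converts that bit offset to bytes and tests for natural 8-byte alignment, both for the offset and for the address printed in the KCSAN report. The helper names are hypothetical and exist only for this illustration; nothing here is kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a bit offset from the layout tool to bytes. */
static inline size_t offset_bits_to_bytes(size_t offset_bits)
{
	return offset_bits / 8;
}

/* Hypothetical helper: true if `value` is a multiple of `size` bytes. */
static inline int is_naturally_aligned(uint64_t value, size_t size)
{
	return value % size == 0;
}
```

With the values from this trace, 192 bits is byte offset 24, and both 24 and the reported address `0xffff888237d17b58` are multiples of 8, which is consistent with the 8-byte accesses KCSAN reported.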
23/2	2026/02/25 03:31	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 44013 output: 41 thoughts: 1917 Thoughts:

Benign Data Race in Timer Debug Output

Okay, so I've been looking at a reported data race on the `next_event` field of `struct clock_event_device`, reached through the per-CPU tick device. The struct is cacheline-aligned and `next_event` sits at bit offset 192 (byte offset 24), a multiple of 8, so the access should be atomic on a 64-bit system. Even if it were not atomic on 32-bit, the reader only produces debug output, so it is probably fine either way.

The write stack shows `hrtimer_interrupt` running on CPU 1, where the APIC timer interrupt happened to arrive while the task was in `folio_mark_accessed`. The interrupted code is unrelated to the race; an interrupt occurring mid-execution is normal. The read stack shows `print_tickdevice` running in task context on CPU 0.

Digging further, `hrtimer_interrupt` writes to `dev->next_event`, where `dev` is the local CPU's clock event device. The crucial point is that `print_tickdevice` is called for each CPU's tick device when `/proc/timer_list` is read. So CPU 0 is reading CPU 1's `next_event` field while CPU 1 is updating it from its interrupt handler. This cross-CPU read is a side effect of the design.

The report places the write at line 1882 of `kernel/time/hrtimer.c`, and the source confirms that this line is `dev->next_event = KTIME_MAX;`.

Now, the important question is: is this harmful? `print_tickdevice` dumps timer information via `/proc/timer_list`, so it is primarily a debugging tool and is not on any critical path. In `hrtimer_interrupt`, `next_event` is set to `KTIME_MAX` on entry and later updated via `tick_program_event()`. A reader on another CPU that observes `KTIME_MAX` could temporarily conclude there is no pending event, but the value is only printed, and the next read picks up whatever was stored last.

Finally, I also looked at `cpu_base->expires_next`, which `hrtimer_interrupt` writes at lines 1895 and 1911 of `hrtimer.c`. Those writes happen with `cpu_base->lock` held and are not part of this report.

So, ultimately this race condition, which affects only the debug output, looks benign.
24/2	2026/02/25 03:32	tool	codesearch-definition-source	0m Args: map[ContextFile:include/linux/clockchips.h IncludeLines:true Name:ktime_t] Results: map[SourceCode: 126: /* Nanosecond scalar representation for kernel time values */ 127: typedef s64 ktime_t; SourceFile:include/linux/types.h]
25/2	2026/02/25 03:32	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 46785 output: 19 thoughts: 1592 Thoughts:

Benign Race Analysis of `dev->next_event` in `hrtimer_interrupt`

Okay, so I'm looking at this code and confirming my initial assessment. It's a `s64` - got it. The race condition seems benign, and I'm confident about that. I need to justify this conclusion, though, given the KCSAN report.

The specific issue is that `dev->next_event` is written in `hrtimer_interrupt` outside of a lock. Specifically:

```c
1882: dev->next_event = KTIME_MAX;
```

This write isn't protected by `cpu_base->lock`. I thought that seemed odd at first. However, the interrupt context of this write is local to the CPU that's running it. So, while it's written before the lock is acquired, the typical use case is that no other task on that same CPU can interfere. I need to be sure. The concern is cross-CPU access, and the context here is `print_tickdevice`, specifically looking at `/proc/timer_list`.

My first thought was, maybe they should be using `READ_ONCE` or `WRITE_ONCE`. But `ktime_t` is a 64-bit value, so it's already atomic. No compiler magic should be able to make this non-atomic, which is good.

I need to confirm there aren't other races in `timer_list.c`. I browsed the code quickly, and saw that many fields within `timer_list.c` are read without locks. It seemed like the locking was used, for instance, `print_active_timers` does take a lock on `base->cpu_base->lock`, but `print_base` and `print_cpu` read stuff without that lock. Then, the specific issue at hand, `print_tickdevice` doesn't take any lock on `dev`, which is what I'm concerned about.

`struct tick_device` is just a pointer to `clock_event_device` along with an enum. There's no lock here. And the `clock_event_device` is usually protected by the per-CPU `hrtimer_cpu_base->lock`, or that the accesses happen with interrupts disabled. Since `print_tickdevice` is just for `/proc/timer_list`, I think it is acceptable that it races. It's not controlling anything critical, it's just presenting information.

Back to line 1882 in `hrtimer_interrupt`, the writes to `dev->next_event` (direct at 1882 and indirect through `tick_program_event` at 1916) are outside the protection of the lock. I see that! `tick_program_event` gets called after `raw_spin_unlock_irqrestore(&cpu_base->lock, flags);`. So, it's confirmed. This is very likely an "unprotected" field for a specific reason – it's mostly per-CPU, and it's being read in a context that does not have the lock.

I should check if there are other reports for this. Something like "hrtimer_interrupt data-race print_tickdevice" would be ideal. I don't have a search tool for that. But I'll trust my judgement for now. It's benign. Interrupts are disabled locally, so no other task on this CPU can interfere. But another CPU can read it.

The field is being accessed like so:

```c
SEQ_printf(m, " next_event: %Ld nsecs\n", (unsigned long long) ktime_to_ns(dev->next_event));
```

`ktime_to_ns` just does a direct read. No funny business.

I should go through the "harmful" criteria. Is there:
- Memory corruption/crash? No.
- Complex container? No, it's a simple `s64`.
- Synchronization between other memory accesses? No.

The contents of `dev->next_event` represents the programmed expiry time. If `print_tickdevice` sees `KTIME_MAX` instead of the next event, it just prints `KTIME_MAX`. This doesn't affect the hardware which is what sets the timer off with `dev->set_next_event`. The hardware is programmed after `dev->next_event` is updated. That helps.

In `hrtimer_interrupt`: It's set to `KTIME_MAX` while the interrupt handler is running and then set to a new `expires
26/2	2026/02/25 03:32	tool	set-results	0m Args: map[Benign:true Confident:true] Results: map[Benign:true Confident:true]
27/2	2026/02/25 03:32	llm	expert	0m Model: gemini-3-flash-preview Tokens: input: 49438 output: 560 thoughts: 250 Thoughts: Benign Data Race in the Timer System Okay, so I've looked at this data race, and I'm pretty confident it's harmless. The race is between the `hrtimer_interrupt` function, which is updating `dev->next_event`, and `print_tickdevice`, which is reading it. Now, `dev->next_event` is just a 64-bit integer, a `ktime_t` in the `clock_event_device` struct. `hrtimer_interrupt` is, of course, the interrupt handler that sets this field to point to the next timer event. `print_tickdevice` gets called when someone reads `/proc/timer_list`, that diagnostic interface for the timer system. Since this is a 64-bit integer on a 64-bit system, the access is atomic, which in itself all but eliminates the problem. But even if it wasn't atomic, the only consequence would be a slightly incorrect value being displayed in `/proc/timer_list`. Crucially, and this is important, `print_tickdevice` doesn't actually use the value for any critical decision-making or timer logic. It's purely for display purposes, so even a bad read wouldn't affect the timer itself. Furthermore, `hrtimer_interrupt` writes with interrupts disabled on the local CPU, which certainly helps. And the read in `print_tickdevice` is essentially a cross-CPU read for diagnostics. This is a classic example of a benign race condition, one in statistical or debug output, where a small inconsistency isn't catastrophic. It's a non-issue as far as I can tell.
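Step 25 of the trace briefly considered `READ_ONCE`/`WRITE_ONCE` for this field. Purely as an illustration of what marked accesses look like, and not as a proposed kernel patch, the userspace sketch below uses simplified stand-ins for those macros. Everything prefixed `MOCK_` or `mock_` is hypothetical; the real kernel macros are more elaborate, and the stand-ins rely on the GNU `__typeof__` extension.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ktime_t;          /* mirrors "typedef s64 ktime_t" */
#define MOCK_KTIME_MAX INT64_MAX  /* stand-in for the kernel's KTIME_MAX */

/* Simplified stand-ins: force exactly one volatile load or store. */
#define MOCK_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MOCK_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical one-field mock; the real struct has many more members. */
struct mock_clock_event_device {
	ktime_t next_event;
};

/* Models the store at the top of the interrupt handler. */
static inline void mock_irq_entry(struct mock_clock_event_device *dev)
{
	MOCK_WRITE_ONCE(dev->next_event, MOCK_KTIME_MAX);
}

/* Models the diagnostic reader taking a single snapshot of the field. */
static inline ktime_t mock_snapshot_next_event(const struct mock_clock_event_device *dev)
{
	return MOCK_READ_ONCE(dev->next_event);
}
```

A reader built this way sees either the previous expiry or `MOCK_KTIME_MAX`, matching the behaviour the analysis describes for aligned 8-byte accesses on a 64-bit system.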

map[Benign:true Confident:true Explanation:The data race occurs between `hrtimer_interrupt()` and `print_tickdevice()`.

map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/a24e0351858b280a0e17500759ec155b5ad3e184]

map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/5a81561bbe40439496542cdb84200c114a114a5e]

map[Index:codesearch-index]

gemini-3-flash-preview

map[Benign:true Confident:true]

You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

"Benign" data races are on a simple int/bool variable or similar field,
and the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful" data race) can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data race is a race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

Also consider races that happen at the same time with the given one.
If there is no synchronization in between, other memory accesses in the involved threads
race with each other if they access the same memory. For example, if both threads execute:

some_struct->have_elements = true;
list_add(new_node, &some_struct->list_head);

the race on some_struct->have_elements may appear benign, however it also implies there
is a race on some_struct->list_head which is not benign, since the list is not thread-safe.

Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses,
and thus even with a conservative compiler may not be fully atomic. However, such races may
still be benign depending on how writes are done, and how read data is used.
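To make the 32-bit caveat concrete, here is a minimal sketch that models a 64-bit load split into two 32-bit halves, where each half may come from a different stored value. The helper name `mix_halves` is hypothetical, and the model ignores which half is loaded first; it only shows which values a torn read could produce.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a torn 64-bit read: the upper 32 bits are taken
 * from `high_src` and the lower 32 bits from `low_src`.
 */
static inline uint64_t mix_halves(uint64_t high_src, uint64_t low_src)
{
	return (high_src & 0xffffffff00000000ULL) |
	       (low_src  & 0x00000000ffffffffULL);
}
```

For example, with an old value of `0x000000129f80e76b` and a new value of `0x7fffffffffffffff`, a torn read could yield `0x7fffffff9f80e76b` or `0x00000012ffffffff`, neither of which was ever stored. Whether that matters depends on how the reader uses the result.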

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.

Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

The data race report is:

==================================================================
BUG: KCSAN: data-race in hrtimer_interrupt / print_tickdevice

write to 0xffff888237d17b58 of 8 bytes by interrupt on cpu 1:
 hrtimer_interrupt+0x84/0x810 kernel/time/hrtimer.c:1882
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1045 [inline]
 __sysvec_apic_timer_interrupt+0x5f/0x1d0 arch/x86/kernel/apic/apic.c:1062
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1056 [inline]
 sysvec_apic_timer_interrupt+0x6f/0x80 arch/x86/kernel/apic/apic.c:1056
 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
 lru_gen_inc_refs mm/swap.c:391 [inline]
 folio_mark_accessed+0x6f/0x3a0 mm/swap.c:460
 follow_page_pte+0x572/0x6e0 mm/gup.c:886
 follow_pmd_mask mm/gup.c:915 [inline]
 follow_pud_mask mm/gup.c:967 [inline]
 follow_p4d_mask mm/gup.c:984 [inline]
 follow_page_mask mm/gup.c:1023 [inline]
 __get_user_pages+0x756/0x1ea0 mm/gup.c:1426
 __get_user_pages_locked mm/gup.c:1692 [inline]
 faultin_page_range+0x107/0x4d0 mm/gup.c:1912
 madvise_populate mm/madvise.c:976 [inline]
 madvise_do_behavior+0x19f/0xa10 mm/madvise.c:1935
 do_madvise+0x10e/0x190 mm/madvise.c:2030
 __do_sys_madvise mm/madvise.c:2039 [inline]
 __se_sys_madvise mm/madvise.c:2037 [inline]
 __x64_sys_madvise+0x63/0x80 mm/madvise.c:2037
 x64_sys_call+0x1c6b/0x3000 arch/x86/include/generated/asm/syscalls_64.h:29
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff888237d17b58 of 8 bytes by task 5743 on cpu 0:
 print_tickdevice+0x142/0x340 kernel/time/timer_list.c:206
 timer_list_show+0x6c/0x170 kernel/time/timer_list.c:298
 seq_read_iter+0x636/0x940 fs/seq_file.c:272
 proc_reg_read_iter+0x110/0x180 fs/proc/inode.c:299
 copy_splice_read+0x471/0x6c0 fs/splice.c:363
 do_splice_read fs/splice.c:982 [inline]
 splice_direct_to_actor+0x26e/0x670 fs/splice.c:1086
 do_splice_direct_actor fs/splice.c:1204 [inline]
 do_splice_direct+0x119/0x1a0 fs/splice.c:1230
 do_sendfile+0x382/0x650 fs/read_write.c:1370
 __do_sys_sendfile64 fs/read_write.c:1431 [inline]
 __se_sys_sendfile64 fs/read_write.c:1417 [inline]
 __x64_sys_sendfile64+0x105/0x150 fs/read_write.c:1417
 x64_sys_call+0x2db1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:41
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x000000129f80e76b -> 0x000000129f81f8db

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 5743 Comm: syz.2.615 Tainted: G        W           syzkaller #0 PREEMPT(voluntary) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
==================================================================
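As a small worked check of the `value changed` line in the report above: if both values are interpreted as `ktime_t` nanoseconds (an assumption based on the type of the field), their difference can be computed as below. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: signed difference in ns between two observed values. */
static inline int64_t ktime_delta_ns(uint64_t before, uint64_t after)
{
	return (int64_t)(after - before);
}
```

For `0x000000129f80e76b -> 0x000000129f81f8db` the difference is `0x11170`, which is 70000 ns, so the two observed values are expiry times 70 µs apart.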

The data race occurs between `hrtimer_interrupt()` and `print_tickdevice()`.

gemini-3-flash-preview

							input: 4527
							output: 22
							thoughts: 180

Here's my analysis of the reported data race:

**Data Race Deep Dive: `hrtimer_interrupt` vs. `print_tickdevice`**

Okay, so the report indicates a data race between `hrtimer_interrupt`, which writes to a memory address (`0xffff888237d17b58`), and `print_tickdevice`, which reads from the same address. The key thing is *where* these functions run: `hrtimer_interrupt` runs in interrupt context (the APIC timer interrupt), while `print_tickdevice` runs in task context, triggered when someone reads `/proc/timer_list` through the `timer_list_show` function. Whether that combination is harmful depends on which field is shared and how the reader uses it.

My initial thought is to pin down the *exact* field they're both touching. I need to understand what's being written to and read from that address. The report helpfully gives me the function locations: `kernel/time/hrtimer.c:1882` for `hrtimer_interrupt` and `kernel/time/timer_list.c:206` for `print_tickdevice`.

So, I'm going to pull up those files right now and get right to reading those lines. First, I’ll get a good overview of the file contexts. Then, I need to see *exactly* what memory access is happening on line 1882 in `hrtimer.c` and line 206 in `timer_list.c`. That should reveal the shared data that's causing the problem. I'm going to look for any potential locking or atomic operations that might be missing or insufficient. Time to get my hands dirty.
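
The access pattern described above can be modelled in userspace as a rough analogy. This is only a sketch under stated assumptions, not the kernel implementation: the struct and function names prefixed with `sketch_` are hypothetical, and C11 relaxed atomics stand in for the idea of a marked lockless access. The writer first parks the field at a maximum value and later stores the next expiry, while the reader only takes a snapshot for display.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The racing access in the report is 8 bytes wide. */
static_assert(sizeof(int64_t) == 8, "expected a 64-bit field");

/* Hypothetical stand-in for the maximum ktime value. */
#define SKETCH_KTIME_MAX INT64_MAX

/* Hypothetical stand-in for the device struct; only the racing field is modelled. */
struct sketch_clock_event_device {
	_Atomic int64_t next_event;
};

/* Writer side: models the two stores made during the timer interrupt. */
static void sketch_hrtimer_interrupt(struct sketch_clock_event_device *dev,
				     int64_t next_expiry)
{
	/* Park the field at "no event programmed" while timers are expired. */
	atomic_store_explicit(&dev->next_event, SKETCH_KTIME_MAX,
			      memory_order_relaxed);

	/* ... expired timers would run here ... */

	/* Publish the next programmed expiry. */
	atomic_store_explicit(&dev->next_event, next_expiry,
			      memory_order_relaxed);
}

/* Reader side: models the snapshot taken purely for printing. */
static int64_t sketch_print_tickdevice_read(struct sketch_clock_event_device *dev)
{
	return atomic_load_explicit(&dev->next_event, memory_order_relaxed);
}
```

A concurrent reader in this model can observe the old expiry, `SKETCH_KTIME_MAX`, or the new expiry, but never a torn mix of two values, because each access is a single 64-bit atomic operation.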

map[Dir:kernel/time/]

map[Files:[alarmtimer.c clockevents.c clocksource-wdtest.c clocksource.c hrtimer.c itimer.c jiffies.c namespace.c ntp.c ntp_internal.h posix-clock.c posix-cpu-timers.c posix-stubs.c posix-timers.c posix-timers.h sched_clock.c sleep_timeout.c test_udelay.c tick-broadcast-hrtimer.c tick-broadcast.c tick-common.c tick-internal.h tick-legacy.c tick-oneshot.c tick-sched.c tick-sched.h time.c time_test.c timeconv.c timecounter.c timekeeping.c timekeeping.h timekeeping_debug.c timekeeping_internal.h timer.c timer_list.c timer_migration.c timer_migration.h vsyscall.c] Subdirs:<nil>]

gemini-3-flash-preview

							input: 5438
							output: 23
							thoughts: 13

Okay, I'm ready to dive in. Let's see what we're dealing with.

**Line-by-Line Analysis Incoming**

First, I'll locate the lines named in the report. Then I'll read the code at each of those lines together with its surrounding context, so that I understand exactly what each access does and how it interacts with the rest of the function. I'll be looking for missing or insufficient synchronization, or anything else that explains the reported race.

I'll also keep the broader design of this part of the timer code in mind when judging whether the access matters.

map[File:kernel/time/hrtimer.c]

map[Contents:// SPDX-License-Identifier: GPL-2.0
/*
 *  Copyright(C) 2005-2006, Linutronix GmbH, Thomas Gleixner <tglx@kernel.org>
 *  Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar
 *  Copyright(C) 2006-2007  Timesys Corp., Thomas Gleixner
 *
 *  High-resolution kernel timers
 *
 *  In contrast to the low-resolution timeout API, aka timer wheel,
 *  hrtimers provide finer resolution and accuracy depending on system
 *  configuration and capabilities.
 *
 *  Started by: Thomas Gleixner and Ingo Molnar
 *
 *  Credits:
 *	Based on the original timer wheel code
 *
 *	Help, testing, suggestions, bugfixes, improvements were
 *	provided by:
 *
 *	George Anzinger, Andrew Morton, Steven Rostedt, Roman Zippel
 *	et. al.
 */

#include <linux/cpu.h>
#include <linux/export.h>
#include <linux/percpu.h>
#include <linux/hrtimer.h>
#include <linux/notifier.h>
#include <linux/syscalls.h>
#include <linux/interrupt.h>
#include <linux/tick.h>
#include <linux/err.h>
#include <linux/debugobjects.h>
#include <linux/sched/signal.h>
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/deadline.h>
#include <linux/sched/nohz.h>
#include <linux/sched/debug.h>
#include <linux/sched/isolation.h>
#include <linux/timer.h>
#include <linux/freezer.h>
#include <linux/compat.h>

#include <linux/uaccess.h>

#include <trace/events/timer.h>

#include "tick-internal.h"

/*
 * Masks for selecting the soft and hard context timers from
 * cpu_base->active
 */
#define MASK_SHIFT		(HRTIMER_BASE_MONOTONIC_SOFT)
#define HRTIMER_ACTIVE_HARD	((1U << MASK_SHIFT) - 1)
#define HRTIMER_ACTIVE_SOFT	(HRTIMER_ACTIVE_HARD << MASK_SHIFT)
#define HRTIMER_ACTIVE_ALL	(HRTIMER_ACTIVE_SOFT | HRTIMER_ACTIVE_HARD)

static void retrigger_next_event(void *arg);
static ktime_t __hrtimer_cb_get_time(clockid_t clock_id);

/*
 * The timer bases:
 *
 * There are more clockids than hrtimer bases. Thus, we index
 * into the timer bases by the hrtimer_base_type enum. When trying
 * to reach a base using a clockid, hrtimer_clockid_to_base()
 * is used to convert from clockid to the proper hrtimer_base_type.
 */
DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
{
	.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
	.clock_base =
	{
		{
			.index = HRTIMER_BASE_MONOTONIC,
			.clockid = CLOCK_MONOTONIC,
		},
		{
			.index = HRTIMER_BASE_REALTIME,
			.clockid = CLOCK_REALTIME,
		},
		{
			.index = HRTIMER_BASE_BOOTTIME,
			.clockid = CLOCK_BOOTTIME,
		},
		{
			.index = HRTIMER_BASE_TAI,
			.clockid = CLOCK_TAI,
		},
		{
			.index = HRTIMER_BASE_MONOTONIC_SOFT,
			.clockid = CLOCK_MONOTONIC,
		},
		{
			.index = HRTIMER_BASE_REALTIME_SOFT,
			.clockid = CLOCK_REALTIME,
		},
		{
			.index = HRTIMER_BASE_BOOTTIME_SOFT,
			.clockid = CLOCK_BOOTTIME,
		},
		{
			.index = HRTIMER_BASE_TAI_SOFT,
			.clockid = CLOCK_TAI,
		},
	},
	.csd = CSD_INIT(retrigger_next_event, NULL)
};

static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
{
	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
		return true;
	else
		return likely(base->online);
}

/*
 * Functions and macros which are different for UP/SMP systems are kept in a
 * single place
 */
#ifdef CONFIG_SMP

/*
 * We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base()
 * such that hrtimer_callback_running() can unconditionally dereference
 * timer->base->cpu_base
 */
static struct hrtimer_cpu_base migration_cpu_base = {
	.clock_base = { {
		.cpu_base = &migration_cpu_base,
		.seq      = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
						     &migration_cpu_base.lock),
	}, },
};

#define migration_base	migration_cpu_base.clock_base[0]

/*
 * We are using hashed locking: holding per_cpu(hrtimer_bases)[n].lock
 * means that all timers which are tied to this base via timer->base are
 * locked, and the base itself is locked too.
 *
 * So __run_timers/migrate_timers can safely modify all timers which could
 * be found on the lists/queues.
 *
 * When the timer's base is locked, and the timer removed from list, it is
 * possible to set timer->base = &migration_base and drop the lock: the timer
 * remains locked.
 */
static
struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
					     unsigned long *flags)
	__acquires(&timer->base->lock)
{
	struct hrtimer_clock_base *base;

	for (;;) {
		base = READ_ONCE(timer->base);
		if (likely(base != &migration_base)) {
			raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
			if (likely(base == timer->base))
				return base;
			/* The timer has migrated to another CPU: */
			raw_spin_unlock_irqrestore(&base->cpu_base->lock, *flags);
		}
		cpu_relax();
	}
}

/*
 * Check if the elected target is suitable considering its next
 * event and the hotplug state of the current CPU.
 *
 * If the elected target is remote and its next event is after the timer
 * to queue, then a remote reprogram is necessary. However there is no
 * guarantee the IPI handling the operation would arrive in time to meet
 * the high resolution deadline. In this case the local CPU becomes a
 * preferred target, unless it is offline.
 *
 * High and low resolution modes are handled the same way for simplicity.
 *
 * Called with cpu_base->lock of target cpu held.
 */
static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base,
				    struct hrtimer_cpu_base *new_cpu_base,
				    struct hrtimer_cpu_base *this_cpu_base)
{
	ktime_t expires;

	/*
	 * The local CPU clockevent can be reprogrammed. Also get_target_base()
	 * guarantees it is online.
	 */
	if (new_cpu_base == this_cpu_base)
		return true;

	/*
	 * The offline local CPU can't be the default target if the
	 * next remote target event is after this timer. Keep the
	 * elected new base. An IPI will be issued to reprogram
	 * it as a last resort.
	 */
	if (!hrtimer_base_is_online(this_cpu_base))
		return true;

	expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);

	return expires >= new_base->cpu_base->expires_next;
}

static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
{
	if (!hrtimer_base_is_online(base)) {
		int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));

		return &per_cpu(hrtimer_bases, cpu);
	}

#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
	if (static_branch_likely(&timers_migration_enabled) && !pinned)
		return &per_cpu(hrtimer_bases, get_nohz_timer_target());
#endif
	return base;
}

/*
 * We switch the timer base to a power-optimized selected CPU target,
 * if:
 *	- NO_HZ_COMMON is enabled
 *	- timer migration is enabled
 *	- the timer callback is not running
 *	- the timer is not the first expiring timer on the new target
 *
 * If one of the above requirements is not fulfilled we move the timer
 * to the current CPU or leave it on the previously assigned CPU if
 * the timer callback is currently running.
 */
static inline struct hrtimer_clock_base *
switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
		    int pinned)
{
	struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
	struct hrtimer_clock_base *new_base;
	int basenum = base->index;

	this_cpu_base = this_cpu_ptr(&hrtimer_bases);
	new_cpu_base = get_target_base(this_cpu_base, pinned);
again:
	new_base = &new_cpu_base->clock_base[basenum];

	if (base != new_base) {
		/*
		 * We are trying to move timer to new_base.
		 * However we can't change timer's base while it is running,
		 * so we keep it on the same CPU. No hassle vs. reprogramming
		 * the event source in the high resolution case. The softirq
		 * code will take care of this when the timer function has
		 * completed. There is no conflict as we hold the lock until
		 * the timer is enqueued.
		 */
		if (unlikely(hrtimer_callback_running(timer)))
			return base;

		/* See the comment in lock_hrtimer_base() */
		WRITE_ONCE(timer->base, &migration_base);
		raw_spin_unlock(&base->cpu_base->lock);
		raw_spin_lock(&new_base->cpu_base->lock);

		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
					     this_cpu_base)) {
			raw_spin_unlock(&new_base->cpu_base->lock);
			raw_spin_lock(&base->cpu_base->lock);
			new_cpu_base = this_cpu_base;
			WRITE_ONCE(timer->base, base);
			goto again;
		}
		WRITE_ONCE(timer->base, new_base);
	} else {
		if (!hrtimer_suitable_target(timer, new_base,  new_cpu_base, this_cpu_base)) {
			new_cpu_base = this_cpu_base;
			goto again;
		}
	}
	return new_base;
}

#else /* CONFIG_SMP */

static inline struct hrtimer_clock_base *
lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
	__acquires(&timer->base->cpu_base->lock)
{
	struct hrtimer_clock_base *base = timer->base;

	raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);

	return base;
}

# define switch_hrtimer_base(t, b, p)	(b)

#endif	/* !CONFIG_SMP */

/*
 * Functions for the union type storage format of ktime_t which are
 * too large for inlining:
 */
#if BITS_PER_LONG < 64
/*
 * Divide a ktime value by a nanosecond value
 */
s64 __ktime_divns(const ktime_t kt, s64 div)
{
	int sft = 0;
	s64 dclc;
	u64 tmp;

	dclc = ktime_to_ns(kt);
	tmp = dclc < 0 ? -dclc : dclc;

	/* Make sure the divisor is less than 2^32: */
	while (div >> 32) {
		sft++;
		div >>= 1;
	}
	tmp >>= sft;
	do_div(tmp, (u32) div);
	return dclc < 0 ? -tmp : tmp;
}
EXPORT_SYMBOL_GPL(__ktime_divns);
#endif /* BITS_PER_LONG >= 64 */

/*
 * Add two ktime values and do a safety check for overflow:
 */
ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs)
{
	ktime_t res = ktime_add_unsafe(lhs, rhs);

	/*
	 * We use KTIME_SEC_MAX here, the maximum timeout which we can
	 * return to user space in a timespec:
	 */
	if (res < 0 || res < lhs || res < rhs)
		res = ktime_set(KTIME_SEC_MAX, 0);

	return res;
}

EXPORT_SYMBOL_GPL(ktime_add_safe);

#ifdef CONFIG_DEBUG_OBJECTS_TIMERS

static const struct debug_obj_descr hrtimer_debug_descr;

static void *hrtimer_debug_hint(void *addr)
{
	return ACCESS_PRIVATE((struct hrtimer *)addr, function);
}

/*
 * fixup_init is called when:
 * - an active object is initialized
 */
static bool hrtimer_fixup_init(void *addr, enum debug_obj_state state)
{
	struct hrtimer *timer = addr;

	switch (state) {
	case ODEBUG_STATE_ACTIVE:
		hrtimer_cancel(timer);
		debug_object_init(timer, &hrtimer_debug_descr);
		return true;
	default:
		return false;
	}
}

/*
 * fixup_activate is called when:
 * - an active object is activated
 * - an unknown non-static object is activated
 */
static bool hrtimer_fixup_activate(void *addr, enum debug_obj_state state)
{
	switch (state) {
	case ODEBUG_STATE_ACTIVE:
		WARN_ON(1);
		fallthrough;
	default:
		return false;
	}
}

/*
 * fixup_free is called when:
 * - an active object is freed
 */
static bool hrtimer_fixup_free(void *addr, enum debug_obj_state state)
{
	struct hrtimer *timer = addr;

	switch (state) {
	case ODEBUG_STATE_ACTIVE:
		hrtimer_cancel(timer);
		debug_object_free(timer, &hrtimer_debug_descr);
		return true;
	default:
		return false;
	}
}

static const struct debug_obj_descr hrtimer_debug_descr = {
	.name		= "hrtimer",
	.debug_hint	= hrtimer_debug_hint,
	.fixup_init	= hrtimer_fixup_init,
	.fixup_activate	= hrtimer_fixup_activate,
	.fixup_free	= hrtimer_fixup_free,
};

static inline void debug_hrtimer_init(struct hrtimer *timer)
{
	debug_object_init(timer, &hrtimer_debug_descr);
}

static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer)
{
	debug_object_init_on_stack(timer, &hrtimer_debug_descr);
}

static inline void debug_hrtimer_activate(struct hrtimer *timer,
					  enum hrtimer_mode mode)
{
	debug_object_activate(timer, &hrtimer_debug_descr);
}

static inline void debug_hrtimer_deactivate(struct hrtimer *timer)
{
	debug_object_deactivate(timer, &hrtimer_debug_descr);
}

void destroy_hrtimer_on_stack(struct hrtimer *timer)
{
	debug_object_free(timer, &hrtimer_debug_descr);
}
EXPORT_SYMBOL_GPL(destroy_hrtimer_on_stack);

#else

static inline void debug_hrtimer_init(struct hrtimer *timer) { }
static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
static inline void debug_hrtimer_activate(struct hrtimer *timer,
					  enum hrtimer_mode mode) { }
static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
#endif

static inline void debug_setup(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode)
{
	debug_hrtimer_init(timer);
	trace_hrtimer_setup(timer, clockid, mode);
}

static inline void debug_setup_on_stack(struct hrtimer *timer, clockid_t clockid,
					enum hrtimer_mode mode)
{
	debug_hrtimer_init_on_stack(timer);
	trace_hrtimer_setup(timer, clockid, mode);
}

static inline void debug_activate(struct hrtimer *timer,
				  enum hrtimer_mode mode)
{
	debug_hrtimer_activate(timer, mode);
	trace_hrtimer_start(timer, mode);
}

static inline void debug_deactivate(struct hrtimer *timer)
{
	debug_hrtimer_deactivate(timer);
	trace_hrtimer_cancel(timer);
}

static struct hrtimer_clock_base *
__next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
{
	unsigned int idx;

	if (!*active)
		return NULL;

	idx = __ffs(*active);
	*active &= ~(1U << idx);

	return &cpu_base->clock_base[idx];
}

#define for_each_active_base(base, cpu_base, active)	\
	while ((base = __next_base((cpu_base), &(active))))

static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
					 const struct hrtimer *exclude,
					 unsigned int active,
					 ktime_t expires_next)
{
	struct hrtimer_clock_base *base;
	ktime_t expires;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *next;
		struct hrtimer *timer;

		next = timerqueue_getnext(&base->active);
		timer = container_of(next, struct hrtimer, node);
		if (timer == exclude) {
			/* Get to the next timer in the queue. */
			next = timerqueue_iterate_next(next);
			if (!next)
				continue;

			timer = container_of(next, struct hrtimer, node);
		}
		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
		if (expires < expires_next) {
			expires_next = expires;

			/* Skip cpu_base update if a timer is being excluded. */
			if (exclude)
				continue;

			if (timer->is_soft)
				cpu_base->softirq_next_timer = timer;
			else
				cpu_base->next_timer = timer;
		}
	}
	/*
	 * clock_was_set() might have changed base->offset of any of
	 * the clock bases so the result might be negative. Fix it up
	 * to prevent a false positive in clockevents_program_event().
	 */
	if (expires_next < 0)
		expires_next = 0;
	return expires_next;
}

/*
 * Recomputes cpu_base::*next_timer and returns the earliest expires_next
 * but does not set cpu_base::*expires_next, that is done by
 * hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating
 * cpu_base::*expires_next right away, reprogramming logic would no longer
 * work.
 *
 * When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases,
 * those timers will get run whenever the softirq gets handled, at the end of
 * hrtimer_run_softirq(), hrtimer_update_softirq_timer() will re-add these bases.
 *
 * Therefore softirq values are those from the HRTIMER_ACTIVE_SOFT clock bases.
 * The !softirq values are the minima across HRTIMER_ACTIVE_ALL, unless an actual
 * softirq is pending, in which case they're the minima of HRTIMER_ACTIVE_HARD.
 *
 * @active_mask must be one of:
 *  - HRTIMER_ACTIVE_ALL,
 *  - HRTIMER_ACTIVE_SOFT, or
 *  - HRTIMER_ACTIVE_HARD.
 */
static ktime_t
__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
{
	unsigned int active;
	struct hrtimer *next_timer = NULL;
	ktime_t expires_next = KTIME_MAX;

	if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
		active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
		cpu_base->softirq_next_timer = NULL;
		expires_next = __hrtimer_next_event_base(cpu_base, NULL,
							 active, KTIME_MAX);

		next_timer = cpu_base->softirq_next_timer;
	}

	if (active_mask & HRTIMER_ACTIVE_HARD) {
		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
		cpu_base->next_timer = next_timer;
		expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
							 expires_next);
	}

	return expires_next;
}

static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
{
	ktime_t expires_next, soft = KTIME_MAX;

	/*
	 * If the soft interrupt has already been activated, ignore the
	 * soft bases. They will be handled in the already raised soft
	 * interrupt.
	 */
	if (!cpu_base->softirq_activated) {
		soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
		/*
		 * Update the soft expiry time. clock_settime() might have
		 * affected it.
		 */
		cpu_base->softirq_expires_next = soft;
	}

	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
	/*
	 * If a softirq timer is expiring first, update cpu_base->next_timer
	 * and program the hardware with the soft expiry time.
	 */
	if (expires_next > soft) {
		cpu_base->next_timer = cpu_base->softirq_next_timer;
		expires_next = soft;
	}

	return expires_next;
}

static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
{
	ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
	ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;

	ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
					    offs_real, offs_boot, offs_tai);

	base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
	base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
	base->clock_base[HRTIMER_BASE_TAI_SOFT].offset = *offs_tai;

	return now;
}

/*
 * Is the high resolution mode active ?
 */
static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
{
	return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
		cpu_base->hres_active : 0;
}

static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
				struct hrtimer *next_timer,
				ktime_t expires_next)
{
	cpu_base->expires_next = expires_next;

	/*
	 * If hres is not active, hardware does not have to be
	 * reprogrammed yet.
	 *
	 * If a hang was detected in the last timer interrupt then we
	 * leave the hang delay active in the hardware. We want the
	 * system to make progress. That also prevents the following
	 * scenario:
	 * T1 expires 50ms from now
	 * T2 expires 5s from now
	 *
	 * T1 is removed, so this code is called and would reprogram
	 * the hardware to 5s from now. Any hrtimer_start after that
	 * will not reprogram the hardware due to hang_detected being
	 * set. So we'd effectively block all timers until the T2 event
	 * fires.
	 */
	if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
		return;

	tick_program_event(expires_next, 1);
}

/*
 * Reprogram the event source with checking both queues for the
 * next event
 * Called with interrupts disabled and base->lock held
 */
static void
hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
{
	ktime_t expires_next;

	expires_next = hrtimer_update_next_event(cpu_base);

	if (skip_equal && expires_next == cpu_base->expires_next)
		return;

	__hrtimer_reprogram(cpu_base, cpu_base->next_timer, expires_next);
}

/* High resolution timer related functions */
#ifdef CONFIG_HIGH_RES_TIMERS

/*
 * High resolution timer enabled ?
 */
static bool hrtimer_hres_enabled __read_mostly  = true;
unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
EXPORT_SYMBOL_GPL(hrtimer_resolution);

/*
 * Enable / Disable high resolution mode
 */
static int __init setup_hrtimer_hres(char *str)
{
	return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
}

__setup("highres=", setup_hrtimer_hres);

/*
 * hrtimer_high_res_enabled - query, if the highres mode is enabled
 */
static inline int hrtimer_is_hres_enabled(void)
{
	return hrtimer_hres_enabled;
}

/*
 * Switch to high resolution mode
 */
static void hrtimer_switch_to_hres(void)
{
	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

	if (tick_init_highres()) {
		pr_warn("Could not switch to high resolution mode on CPU %u\n",
			base->cpu);
		return;
	}
	base->hres_active = 1;
	hrtimer_resolution = HIGH_RES_NSEC;

	tick_setup_sched_timer(true);
	/* "Retrigger" the interrupt to get things going */
	retrigger_next_event(NULL);
}

#else

static inline int hrtimer_is_hres_enabled(void) { return 0; }
static inline void hrtimer_switch_to_hres(void) { }

#endif /* CONFIG_HIGH_RES_TIMERS */
/*
 * Retrigger next event is called after clock was set with interrupts
 * disabled through an SMP function call or directly from low level
 * resume code.
 *
 * This is only invoked when:
 *	- CONFIG_HIGH_RES_TIMERS is enabled.
 *	- CONFIG_NOHZ_COMMON is enabled
 *
 * For the other cases this function is empty and because the call sites
 * are optimized out it vanishes as well, i.e. no need for lots of
 * #ifdeffery.
 */
static void retrigger_next_event(void *arg)
{
	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

	/*
	 * When high resolution mode or nohz is active, then the offsets of
	 * CLOCK_REALTIME/TAI/BOOTTIME have to be updated. Otherwise the
	 * next tick will take care of that.
	 *
	 * If high resolution mode is active then the next expiring timer
	 * must be reevaluated and the clock event device reprogrammed if
	 * necessary.
	 *
	 * In the NOHZ case the update of the offset and the reevaluation
	 * of the next expiring timer is enough. The return from the SMP
	 * function call will take care of the reprogramming in case the
	 * CPU was in a NOHZ idle sleep.
	 *
	 * In periodic low resolution mode, the next softirq expiration
	 * must also be updated.
	 */
	raw_spin_lock(&base->lock);
	hrtimer_update_base(base);
	if (hrtimer_hres_active(base))
		hrtimer_force_reprogram(base, 0);
	else
		hrtimer_update_next_event(base);
	raw_spin_unlock(&base->lock);
}

/*
 * When a timer is enqueued and expires earlier than the already enqueued
 * timers, we have to check, whether it expires earlier than the timer for
 * which the clock event device was armed.
 *
 * Called with interrupts disabled and base->cpu_base.lock held
 */
static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	struct hrtimer_clock_base *base = timer->base;
	ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);

	WARN_ON_ONCE(hrtimer_get_expires_tv64(timer) < 0);

	/*
	 * CLOCK_REALTIME timer might be requested with an absolute
	 * expiry time which is less than base->offset. Set it to 0.
	 */
	if (expires < 0)
		expires = 0;

	if (timer->is_soft) {
		/*
		 * soft hrtimer could be started on a remote CPU. In this
		 * case softirq_expires_next needs to be updated on the
		 * remote CPU. The soft hrtimer will not expire before the
		 * first hard hrtimer on the remote CPU -
		 * hrtimer_check_target() prevents this case.
		 */
		struct hrtimer_cpu_base *timer_cpu_base = base->cpu_base;

		if (timer_cpu_base->softirq_activated)
			return;

		if (!ktime_before(expires, timer_cpu_base->softirq_expires_next))
			return;

		timer_cpu_base->softirq_next_timer = timer;
		timer_cpu_base->softirq_expires_next = expires;

		if (!ktime_before(expires, timer_cpu_base->expires_next) ||
		    !reprogram)
			return;
	}

	/*
	 * If the timer is not on the current cpu, we cannot reprogram
	 * the other cpus clock event device.
	 */
	if (base->cpu_base != cpu_base)
		return;

	if (expires >= cpu_base->expires_next)
		return;

	/*
	 * If the hrtimer interrupt is running, then it will reevaluate the
	 * clock bases and reprogram the clock event device.
	 */
	if (cpu_base->in_hrtirq)
		return;

	cpu_base->next_timer = timer;

	__hrtimer_reprogram(cpu_base, timer, expires);
}

static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
			     unsigned int active)
{
	struct hrtimer_clock_base *base;
	unsigned int seq;
	ktime_t expires;

	/*
	 * Update the base offsets unconditionally so the following
	 * checks whether the SMP function call is required works.
	 *
	 * The update is safe even when the remote CPU is in the hrtimer
	 * interrupt or the hrtimer soft interrupt and expiring affected
	 * bases. Either it will see the update before handling a base or
	 * it will see it when it finishes the processing and reevaluates
	 * the next expiring timer.
	 */
	seq = cpu_base->clock_was_set_seq;
	hrtimer_update_base(cpu_base);

	/*
	 * If the sequence did not change over the update then the
	 * remote CPU already handled it.
	 */
	if (seq == cpu_base->clock_was_set_seq)
		return false;

	/*
	 * If the remote CPU is currently handling an hrtimer interrupt, it
	 * will reevaluate the first expiring timer of all clock bases
	 * before reprogramming. Nothing to do here.
	 */
	if (cpu_base->in_hrtirq)
		return false;

	/*
	 * Walk the affected clock bases and check whether the first expiring
	 * timer in a clock base is moving ahead of the first expiring timer of
	 * @cpu_base. If so, the IPI must be invoked because per CPU clock
	 * event devices cannot be remotely reprogrammed.
	 */
	active &= cpu_base->active_bases;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *next;

		next = timerqueue_getnext(&base->active);
		expires = ktime_sub(next->expires, base->offset);
		if (expires < cpu_base->expires_next)
			return true;

		/* Extra check for softirq clock bases */
		if (base->index < HRTIMER_BASE_MONOTONIC_SOFT)
			continue;
		if (cpu_base->softirq_activated)
			continue;
		if (expires < cpu_base->softirq_expires_next)
			return true;
	}
	return false;
}

/*
 * Clock was set. This might affect CLOCK_REALTIME, CLOCK_TAI and
 * CLOCK_BOOTTIME (for late sleep time injection).
 *
 * This requires to update the offsets for these clocks
 * vs. CLOCK_MONOTONIC. When high resolution timers are enabled, then this
 * also requires to eventually reprogram the per CPU clock event devices
 * when the change moves an affected timer ahead of the first expiring
 * timer on that CPU. Obviously remote per CPU clock event devices cannot
 * be reprogrammed. The other reason why an IPI has to be sent is when the
 * system is in !HIGH_RES and NOHZ mode. The NOHZ mode updates the offsets
 * in the tick, which obviously might be stopped, so this has to bring out
 * the remote CPU which might sleep in idle to get this sorted.
 */
void clock_was_set(unsigned int bases)
{
	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
	cpumask_var_t mask;
	int cpu;

	if (!hrtimer_hres_active(cpu_base) && !tick_nohz_active)
		goto out_timerfd;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
		on_each_cpu(retrigger_next_event, NULL, 1);
		goto out_timerfd;
	}

	/* Avoid interrupting CPUs if possible */
	cpus_read_lock();
	for_each_online_cpu(cpu) {
		unsigned long flags;

		cpu_base = &per_cpu(hrtimer_bases, cpu);
		raw_spin_lock_irqsave(&cpu_base->lock, flags);

		if (update_needs_ipi(cpu_base, bases))
			cpumask_set_cpu(cpu, mask);

		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
	}

	preempt_disable();
	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
	preempt_enable();
	cpus_read_unlock();
	free_cpumask_var(mask);

out_timerfd:
	timerfd_clock_was_set();
}

static void clock_was_set_work(struct work_struct *work)
{
	clock_was_set(CLOCK_SET_WALL);
}

static DECLARE_WORK(hrtimer_work, clock_was_set_work);

/*
 * Called from timekeeping code to reprogram the hrtimer interrupt device
 * on all cpus and to notify timerfd.
 */
void clock_was_set_delayed(void)
{
	schedule_work(&hrtimer_work);
}

/*
 * Called during resume either directly from via timekeeping_resume()
 * or in the case of s2idle from tick_unfreeze() to ensure that the
 * hrtimers are up to date.
 */
void hrtimers_resume_local(void)
{
	lockdep_assert_irqs_disabled();
	/* Retrigger on the local CPU */
	retrigger_next_event(NULL);
}

/*
 * Counterpart to lock_hrtimer_base above:
 */
static inline
void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
	__releases(&timer->base->cpu_base->lock)
{
	raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
}

/**
 * hrtimer_forward() - forward the timer expiry
 * @timer:	hrtimer to forward
 * @now:	forward past this time
 * @interval:	the interval to forward
 *
 * Forward the timer expiry so it will expire in the future.
 *
 * .. note::
 *  This only updates the timer expiry value and does not requeue the timer.
 *
 * There is also a variant of the function hrtimer_forward_now().
 *
 * Context: Can be safely called from the callback function of @timer. If called
 *          from other contexts @timer must neither be enqueued nor running the
 *          callback and the caller needs to take care of serialization.
 *
 * Return: The number of overruns are returned.
 */
u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
{
	u64 orun = 1;
	ktime_t delta;

	delta = ktime_sub(now, hrtimer_get_expires(timer));

	if (delta < 0)
		return 0;

	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
		return 0;

	if (interval < hrtimer_resolution)
		interval = hrtimer_resolution;

	if (unlikely(delta >= interval)) {
		s64 incr = ktime_to_ns(interval);

		orun = ktime_divns(delta, incr);
		hrtimer_add_expires_ns(timer, incr * orun);
		if (hrtimer_get_expires_tv64(timer) > now)
			return orun;
		/*
		 * This (and the ktime_add() below) is the
		 * correction for exact:
		 */
		orun++;
	}
	hrtimer_add_expires(timer, interval);

	return orun;
}
EXPORT_SYMBOL_GPL(hrtimer_forward);

/*
 * enqueue_hrtimer - internal function to (re)start a timer
 *
 * The timer is inserted in expiry order. Insertion into the
 * red black tree is O(log(n)). Must hold the base lock.
 *
 * Returns true when the new timer is the leftmost timer in the tree.
 */
static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
			    enum hrtimer_mode mode)
{
	debug_activate(timer, mode);
	WARN_ON_ONCE(!base->cpu_base->online);

	base->cpu_base->active_bases |= 1 << base->index;

	/* Pairs with the lockless read in hrtimer_is_queued() */
	WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);

	return timerqueue_add(&base->active, &timer->node);
}

/*
 * __remove_hrtimer - internal function to remove a timer
 *
 * Caller must hold the base lock.
 *
 * High resolution timer mode reprograms the clock event device when the
 * timer is the one which expires next. The caller can disable this by setting
 * reprogram to zero. This is useful, when the context does a reprogramming
 * anyway (e.g. timer interrupt)
 */
static void __remove_hrtimer(struct hrtimer *timer,
			     struct hrtimer_clock_base *base,
			     u8 newstate, int reprogram)
{
	struct hrtimer_cpu_base *cpu_base = base->cpu_base;
	u8 state = timer->state;

	/* Pairs with the lockless read in hrtimer_is_queued() */
	WRITE_ONCE(timer->state, newstate);
	if (!(state & HRTIMER_STATE_ENQUEUED))
		return;

	if (!timerqueue_del(&base->active, &timer->node))
		cpu_base->active_bases &= ~(1 << base->index);

	/*
	 * Note: If reprogram is false we do not update
	 * cpu_base->next_timer. This happens when we remove the first
	 * timer on a remote cpu. No harm as we never dereference
	 * cpu_base->next_timer. So the worst thing that can happen is
	 * a superfluous call to hrtimer_force_reprogram() on the
	 * remote cpu later on if the same timer gets enqueued again.
	 */
	if (reprogram && timer == cpu_base->next_timer)
		hrtimer_force_reprogram(cpu_base, 1);
}

/*
 * remove hrtimer, called with base lock held
 */
static inline int
remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
	       bool restart, bool keep_local)
{
	u8 state = timer->state;

	if (state & HRTIMER_STATE_ENQUEUED) {
		bool reprogram;

		/*
		 * Remove the timer and force reprogramming when high
		 * resolution mode is active and the timer is on the current
		 * CPU. If we remove a timer on another CPU, reprogramming is
		 * skipped. The interrupt event on this CPU is fired and
		 * reprogramming happens in the interrupt handler. This is a
		 * rare case and less expensive than a smp call.
		 */
		debug_deactivate(timer);
		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);

		/*
		 * If the timer is not restarted then reprogramming is
		 * required if the timer is local. If it is local and about
		 * to be restarted, avoid programming it twice (on removal
		 * and a moment later when it's requeued).
		 */
		if (!restart)
			state = HRTIMER_STATE_INACTIVE;
		else
			reprogram &= !keep_local;

		__remove_hrtimer(timer, base, state, reprogram);
		return 1;
	}
	return 0;
}

static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
					    const enum hrtimer_mode mode)
{
#ifdef CONFIG_TIME_LOW_RES
	/*
	 * CONFIG_TIME_LOW_RES indicates that the system has no way to return
	 * granular time values. For relative timers we add hrtimer_resolution
	 * (i.e. one jiffy) to prevent short timeouts.
	 */
	timer->is_rel = mode & HRTIMER_MODE_REL;
	if (timer->is_rel)
		tim = ktime_add_safe(tim, hrtimer_resolution);
#endif
	return tim;
}

static void
hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
{
	ktime_t expires;

	/*
	 * Find the next SOFT expiration.
	 */
	expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);

	/*
	 * reprogramming needs to be triggered, even if the next soft
	 * hrtimer expires at the same time as the next hard
	 * hrtimer. cpu_base->softirq_expires_next needs to be updated!
	 */
	if (expires == KTIME_MAX)
		return;

	/*
	 * cpu_base->*next_timer is recomputed by __hrtimer_get_next_event()
	 * cpu_base->*expires_next is only set by hrtimer_reprogram()
	 */
	hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
}

static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
				    u64 delta_ns, const enum hrtimer_mode mode,
				    struct hrtimer_clock_base *base)
{
	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
	struct hrtimer_clock_base *new_base;
	bool force_local, first;

	/*
	 * If the timer is on the local cpu base and is the first expiring
	 * timer then this might end up reprogramming the hardware twice
	 * (on removal and on enqueue). To avoid that, prevent the
	 * reprogram on removal, keep the timer local to the current CPU
	 * and enforce reprogramming after it is queued no matter whether
	 * it is the new first expiring timer again or not.
	 */
	force_local = base->cpu_base == this_cpu_base;
	force_local &= base->cpu_base->next_timer == timer;

	/*
	 * Don't force local queuing if this enqueue happens on an unplugged
	 * CPU after hrtimer_cpu_dying() has been invoked.
	 */
	force_local &= this_cpu_base->online;

	/*
	 * Remove an active timer from the queue. In case it is not queued
	 * on the current CPU, make sure that remove_hrtimer() updates the
	 * remote data correctly.
	 *
	 * If it's on the current CPU and the first expiring timer, then
	 * skip reprogramming, keep the timer local and enforce
	 * reprogramming later if it was the first expiring timer.  This
	 * avoids programming the underlying clock event twice (once at
	 * removal and once after enqueue).
	 */
	remove_hrtimer(timer, base, true, force_local);

	if (mode & HRTIMER_MODE_REL)
		tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));

	tim = hrtimer_update_lowres(timer, tim, mode);

	hrtimer_set_expires_range_ns(timer, tim, delta_ns);

	/* Switch the timer base, if necessary: */
	if (!force_local) {
		new_base = switch_hrtimer_base(timer, base,
					       mode & HRTIMER_MODE_PINNED);
	} else {
		new_base = base;
	}

	first = enqueue_hrtimer(timer, new_base, mode);
	if (!force_local) {
		/*
		 * If the current CPU base is online, then the timer is
		 * never queued on a remote CPU if it would be the first
		 * expiring timer there.
		 */
		if (hrtimer_base_is_online(this_cpu_base))
			return first;

		/*
		 * Timer was enqueued remote because the current base is
		 * already offline. If the timer is the first to expire,
		 * kick the remote CPU to reprogram the clock event.
		 */
		if (first) {
			struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;

			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
		}
		return 0;
	}

	/*
	 * Timer was forced to stay on the current CPU to avoid
	 * reprogramming on removal and enqueue. Force reprogram the
	 * hardware by evaluating the new first expiring timer.
	 */
	hrtimer_force_reprogram(new_base->cpu_base, 1);
	return 0;
}

/**
 * hrtimer_start_range_ns - (re)start an hrtimer
 * @timer:	the timer to be added
 * @tim:	expiry time
 * @delta_ns:	"slack" range for the timer
 * @mode:	timer mode: absolute (HRTIMER_MODE_ABS) or
 *		relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
 *		softirq based mode is considered for debug purpose only!
 */
void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
			    u64 delta_ns, const enum hrtimer_mode mode)
{
	struct hrtimer_clock_base *base;
	unsigned long flags;

	/*
	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
	 * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
	 * expiry mode because unmarked timers are moved to softirq expiry.
	 */
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		WARN_ON_ONCE(!(mode & HRTIMER_MODE_SOFT) ^ !timer->is_soft);
	else
		WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard);

	base = lock_hrtimer_base(timer, &flags);

	if (__hrtimer_start_range_ns(timer, tim, delta_ns, mode, base))
		hrtimer_reprogram(timer, true);

	unlock_hrtimer_base(timer, &flags);
}
EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);

/**
 * hrtimer_try_to_cancel - try to deactivate a timer
 * @timer:	hrtimer to stop
 *
 * Returns:
 *
 *  *  0 when the timer was not active
 *  *  1 when the timer was active
 *  * -1 when the timer is currently executing the callback function and
 *    cannot be stopped
 */
int hrtimer_try_to_cancel(struct hrtimer *timer)
{
	struct hrtimer_clock_base *base;
	unsigned long flags;
	int ret = -1;

	/*
	 * Check lockless first. If the timer is not active (neither
	 * enqueued nor running the callback), nothing to do here.  The
	 * base lock does not serialize against a concurrent enqueue,
	 * so we can avoid taking it.
	 */
	if (!hrtimer_active(timer))
		return 0;

	base = lock_hrtimer_base(timer, &flags);

	if (!hrtimer_callback_running(timer))
		ret = remove_hrtimer(timer, base, false, false);

	unlock_hrtimer_base(timer, &flags);

	return ret;
}
EXPORT_SYMBOL_GPL(hrtimer_try_to_cancel);

#ifdef CONFIG_PREEMPT_RT
static void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base)
{
	spin_lock_init(&base->softirq_expiry_lock);
}

static void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base)
	__acquires(&base->softirq_expiry_lock)
{
	spin_lock(&base->softirq_expiry_lock);
}

static void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base)
	__releases(&base->softirq_expiry_lock)
{
	spin_unlock(&base->softirq_expiry_lock);
}

/*
 * The counterpart to hrtimer_cancel_wait_running().
 *
 * If there is a waiter for cpu_base->softirq_expiry_lock, then it was
 * waiting for the timer callback to finish. Drop softirq_expiry_lock and
 * reacquire it. That allows the waiter to acquire the lock and make progress.
 */
static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
				      unsigned long flags)
{
	if (atomic_read(&cpu_base->timer_waiters)) {
		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
		spin_unlock(&cpu_base->softirq_expiry_lock);
		spin_lock(&cpu_base->softirq_expiry_lock);
		raw_spin_lock_irq(&cpu_base->lock);
	}
}

#ifdef CONFIG_SMP
static __always_inline bool is_migration_base(struct hrtimer_clock_base *base)
{
	return base == &migration_base;
}
#else
static __always_inline bool is_migration_base(struct hrtimer_clock_base *base)
{
	return false;
}
#endif

/*
 * This function is called on PREEMPT_RT kernels when the fast path
 * deletion of a timer failed because the timer callback function was
 * running.
 *
 * This prevents priority inversion: if the soft irq thread is preempted
 * in the middle of a timer callback, then calling hrtimer_cancel() can
 * lead to two issues:
 *
 *  - If the caller is on a remote CPU then it has to spin wait for the timer
 *    handler to complete. This can result in unbounded priority inversion.
 *
 *  - If the caller originates from the task which preempted the timer
 *    handler on the same CPU, then spin waiting for the timer handler to
 *    complete is never going to end.
 */
void hrtimer_cancel_wait_running(const struct hrtimer *timer)
{
	/* Lockless read. Prevent the compiler from reloading it below */
	struct hrtimer_clock_base *base = READ_ONCE(timer->base);

	/*
	 * Just relax if the timer expires in hard interrupt context or if
	 * it is currently on the migration base.
	 */
	if (!timer->is_soft || is_migration_base(base)) {
		cpu_relax();
		return;
	}

	/*
	 * Mark the base as contended and grab the expiry lock, which is
	 * held by the softirq across the timer callback. Drop the lock
	 * immediately so the softirq can expire the next timer. In theory
	 * the timer could already be running again, but that's more than
	 * unlikely and just causes another wait loop.
	 */
	atomic_inc(&base->cpu_base->timer_waiters);
	spin_lock_bh(&base->cpu_base->softirq_expiry_lock);
	atomic_dec(&base->cpu_base->timer_waiters);
	spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
}
#else
static inline void
hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
static inline void
hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
static inline void
hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
					     unsigned long flags) { }
#endif

/**
 * hrtimer_cancel - cancel a timer and wait for the handler to finish.
 * @timer:	the timer to be cancelled
 *
 * Returns:
 *  0 when the timer was not active
 *  1 when the timer was active
 */
int hrtimer_cancel(struct hrtimer *timer)
{
	int ret;

	do {
		ret = hrtimer_try_to_cancel(timer);

		if (ret < 0)
			hrtimer_cancel_wait_running(timer);
	} while (ret < 0);
	return ret;
}
EXPORT_SYMBOL_GPL(hrtimer_cancel);

/**
 * __hrtimer_get_remaining - get remaining time for the timer
 * @timer:	the timer to read
 * @adjust:	adjust relative timers when CONFIG_TIME_LOW_RES=y
 */
ktime_t __hrtimer_get_remaining(const struct hrtimer *timer, bool adjust)
{
	unsigned long flags;
	ktime_t rem;

	lock_hrtimer_base(timer, &flags);
	if (IS_ENABLED(CONFIG_TIME_LOW_RES) && adjust)
		rem = hrtimer_expires_remaining_adjusted(timer);
	else
		rem = hrtimer_expires_remaining(timer);
	unlock_hrtimer_base(timer, &flags);

	return rem;
}
EXPORT_SYMBOL_GPL(__hrtimer_get_remaining);

#ifdef CONFIG_NO_HZ_COMMON
/**
 * hrtimer_get_next_event - get the time until next expiry event
 *
 * Returns the next expiry time or KTIME_MAX if no timer is pending.
 */
u64 hrtimer_get_next_event(void)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	u64 expires = KTIME_MAX;
	unsigned long flags;

	raw_spin_lock_irqsave(&cpu_base->lock, flags);

	if (!hrtimer_hres_active(cpu_base))
		expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);

	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

	return expires;
}

/**
 * hrtimer_next_event_without - time until next expiry event w/o one timer
 * @exclude:	timer to exclude
 *
 * Returns the next expiry time over all timers except for the @exclude one or
 * KTIME_MAX if none of them is pending.
 */
u64 hrtimer_next_event_without(const struct hrtimer *exclude)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	u64 expires = KTIME_MAX;
	unsigned long flags;

	raw_spin_lock_irqsave(&cpu_base->lock, flags);

	if (hrtimer_hres_active(cpu_base)) {
		unsigned int active;

		if (!cpu_base->softirq_activated) {
			active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
			expires = __hrtimer_next_event_base(cpu_base, exclude,
							    active, KTIME_MAX);
		}
		active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
		expires = __hrtimer_next_event_base(cpu_base, exclude, active,
						    expires);
	}

	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

	return expires;
}
#endif

static inline int hrtimer_clockid_to_base(clockid_t clock_id)
{
	switch (clock_id) {
	case CLOCK_MONOTONIC:
		return HRTIMER_BASE_MONOTONIC;
	case CLOCK_REALTIME:
		return HRTIMER_BASE_REALTIME;
	case CLOCK_BOOTTIME:
		return HRTIMER_BASE_BOOTTIME;
	case CLOCK_TAI:
		return HRTIMER_BASE_TAI;
	default:
		WARN(1, "Invalid clockid %d. Using MONOTONIC\n", clock_id);
		return HRTIMER_BASE_MONOTONIC;
	}
}

static ktime_t __hrtimer_cb_get_time(clockid_t clock_id)
{
	switch (clock_id) {
	case CLOCK_MONOTONIC:
		return ktime_get();
	case CLOCK_REALTIME:
		return ktime_get_real();
	case CLOCK_BOOTTIME:
		return ktime_get_boottime();
	case CLOCK_TAI:
		return ktime_get_clocktai();
	default:
		WARN(1, "Invalid clockid %d. Using MONOTONIC\n", clock_id);
		return ktime_get();
	}
}

ktime_t hrtimer_cb_get_time(const struct hrtimer *timer)
{
	return __hrtimer_cb_get_time(timer->base->clockid);
}
EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);

static void __hrtimer_setup(struct hrtimer *timer,
			    enum hrtimer_restart (*function)(struct hrtimer *),
			    clockid_t clock_id, enum hrtimer_mode mode)
{
	bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
	struct hrtimer_cpu_base *cpu_base;
	int base;

	/*
	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
	 * marked for hard interrupt expiry mode are moved into soft
	 * interrupt context for latency reasons and because the callbacks
	 * can invoke functions which might sleep on RT, e.g. spin_lock().
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !(mode & HRTIMER_MODE_HARD))
		softtimer = true;

	memset(timer, 0, sizeof(struct hrtimer));

	cpu_base = raw_cpu_ptr(&hrtimer_bases);

	/*
	 * POSIX magic: Relative CLOCK_REALTIME timers are not affected by
	 * clock modifications, so they need to become CLOCK_MONOTONIC to
	 * ensure POSIX compliance.
	 */
	if (clock_id == CLOCK_REALTIME && mode & HRTIMER_MODE_REL)
		clock_id = CLOCK_MONOTONIC;

	base = softtimer ? HRTIMER_MAX_CLOCK_BASES / 2 : 0;
	base += hrtimer_clockid_to_base(clock_id);
	timer->is_soft = softtimer;
	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
	timer->base = &cpu_base->clock_base[base];
	timerqueue_init(&timer->node);

	if (WARN_ON_ONCE(!function))
		ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
	else
		ACCESS_PRIVATE(timer, function) = function;
}

/**
 * hrtimer_setup - initialize a timer to the given clock
 * @timer:	the timer to be initialized
 * @function:	the callback function
 * @clock_id:	the clock to be used
 * @mode:       The modes which are relevant for initialization:
 *              HRTIMER_MODE_ABS, HRTIMER_MODE_REL, HRTIMER_MODE_ABS_SOFT,
 *              HRTIMER_MODE_REL_SOFT
 *
 *              The PINNED variants of the above can be handed in,
 *              but the PINNED bit is ignored as pinning happens
 *              when the hrtimer is started
 */
void hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*function)(struct hrtimer *),
		   clockid_t clock_id, enum hrtimer_mode mode)
{
	debug_setup(timer, clock_id, mode);
	__hrtimer_setup(timer, function, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_setup);

/**
 * hrtimer_setup_on_stack - initialize a timer on stack memory
 * @timer:	The timer to be initialized
 * @function:	the callback function
 * @clock_id:	The clock to be used
 * @mode:       The timer mode
 *
 * Similar to hrtimer_setup(), except that this one must be used if struct hrtimer is in stack
 * memory.
 */
void hrtimer_setup_on_stack(struct hrtimer *timer,
			    enum hrtimer_restart (*function)(struct hrtimer *),
			    clockid_t clock_id, enum hrtimer_mode mode)
{
	debug_setup_on_stack(timer, clock_id, mode);
	__hrtimer_setup(timer, function, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_setup_on_stack);

/*
 * A timer is active, when it is enqueued into the rbtree or the
 * callback function is running or it's in the state of being migrated
 * to another cpu.
 *
 * It is important for this function to not return a false negative.
 */
bool hrtimer_active(const struct hrtimer *timer)
{
	struct hrtimer_clock_base *base;
	unsigned int seq;

	do {
		base = READ_ONCE(timer->base);
		seq = raw_read_seqcount_begin(&base->seq);

		if (timer->state != HRTIMER_STATE_INACTIVE ||
		    base->running == timer)
			return true;

	} while (read_seqcount_retry(&base->seq, seq) ||
		 base != READ_ONCE(timer->base));

	return false;
}
EXPORT_SYMBOL_GPL(hrtimer_active);

/*
 * The write_seqcount_barrier()s in __run_hrtimer() split the thing into 3
 * distinct sections:
 *
 *  - queued:	the timer is queued
 *  - callback:	the timer is being run
 *  - post:	the timer is inactive or (re)queued
 *
 * On the read side we ensure we observe timer->state and base->running
 * from the same section, if anything changed while we looked at it, we retry.
 * This includes timer->base changing because sequence numbers alone are
 * insufficient for that.
 *
 * The sequence numbers are required because otherwise we could still observe
 * a false negative if the read side got smeared over multiple consecutive
 * __run_hrtimer() invocations.
 */

static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
			  struct hrtimer_clock_base *base,
			  struct hrtimer *timer, ktime_t *now,
			  unsigned long flags) __must_hold(&cpu_base->lock)
{
	enum hrtimer_restart (*fn)(struct hrtimer *);
	bool expires_in_hardirq;
	int restart;

	lockdep_assert_held(&cpu_base->lock);

	debug_deactivate(timer);
	base->running = timer;

	/*
	 * Separate the ->running assignment from the ->state assignment.
	 *
	 * As with a regular write barrier, this ensures the read side in
	 * hrtimer_active() cannot observe base->running == NULL &&
	 * timer->state == INACTIVE.
	 */
	raw_write_seqcount_barrier(&base->seq);

	__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
	fn = ACCESS_PRIVATE(timer, function);

	/*
	 * Clear the 'is relative' flag for the TIME_LOW_RES case. If the
	 * timer is restarted with a period then it becomes an absolute
	 * timer. If it's not restarted it does not matter.
	 */
	if (IS_ENABLED(CONFIG_TIME_LOW_RES))
		timer->is_rel = false;

	/*
	 * The timer is marked as running in the CPU base, so it is
	 * protected against migration to a different CPU even if the lock
	 * is dropped.
	 */
	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
	trace_hrtimer_expire_entry(timer, now);
	expires_in_hardirq = lockdep_hrtimer_enter(timer);

	restart = fn(timer);

	lockdep_hrtimer_exit(expires_in_hardirq);
	trace_hrtimer_expire_exit(timer);
	raw_spin_lock_irq(&cpu_base->lock);

	/*
	 * Note: We clear the running state after enqueue_hrtimer and
	 * we do not reprogram the event hardware. Happens either in
	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
	 *
	 * Note: Because we dropped the cpu_base->lock above,
	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
	 * for us already.
	 */
	if (restart != HRTIMER_NORESTART &&
	    !(timer->state & HRTIMER_STATE_ENQUEUED))
		enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);

	/*
	 * Separate the ->running assignment from the ->state assignment.
	 *
	 * As with a regular write barrier, this ensures the read side in
	 * hrtimer_active() cannot observe base->running == NULL &&
	 * timer->state == INACTIVE.
	 */
	raw_write_seqcount_barrier(&base->seq);

	WARN_ON_ONCE(base->running != timer);
	base->running = NULL;
}

static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
				 unsigned long flags, unsigned int active_mask)
{
	struct hrtimer_clock_base *base;
	unsigned int active = cpu_base->active_bases & active_mask;

	for_each_active_base(base, cpu_base, active) {
		struct timerqueue_node *node;
		ktime_t basenow;

		basenow = ktime_add(now, base->offset);

		while ((node = timerqueue_getnext(&base->active))) {
			struct hrtimer *timer;

			timer = container_of(node, struct hrtimer, node);

			/*
			 * The immediate goal for using the softexpires is
			 * minimizing wakeups, not running timers at the
			 * earliest interrupt after their soft expiration.
			 * This allows us to avoid using a Priority Search
			 * Tree, which can answer a stabbing query for
			 * overlapping intervals and instead use the simple
			 * BST we already have.
			 * We don't add extra wakeups by delaying timers that
			 * are right-of a not yet expired timer, because that
			 * timer will have to trigger a wakeup anyway.
			 */
			if (basenow < hrtimer_get_softexpires_tv64(timer))
				break;

			__run_hrtimer(cpu_base, base, timer, &basenow, flags);
			if (active_mask == HRTIMER_ACTIVE_SOFT)
				hrtimer_sync_wait_running(cpu_base, flags);
		}
	}
}

static __latent_entropy void hrtimer_run_softirq(void)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	unsigned long flags;
	ktime_t now;

	hrtimer_cpu_base_lock_expiry(cpu_base);
	raw_spin_lock_irqsave(&cpu_base->lock, flags);

	now = hrtimer_update_base(cpu_base);
	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);

	cpu_base->softirq_activated = 0;
	hrtimer_update_softirq_timer(cpu_base, true);

	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
	hrtimer_cpu_base_unlock_expiry(cpu_base);
}

#ifdef CONFIG_HIGH_RES_TIMERS

/*
 * High resolution timer interrupt
 * Called with interrupts disabled
 */
void hrtimer_interrupt(struct clock_event_device *dev)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	ktime_t expires_next, now, entry_time, delta;
	unsigned long flags;
	int retries = 0;

	BUG_ON(!cpu_base->hres_active);
	cpu_base->nr_events++;
	dev->next_event = KTIME_MAX;

	raw_spin_lock_irqsave(&cpu_base->lock, flags);
	entry_time = now = hrtimer_update_base(cpu_base);
retry:
	cpu_base->in_hrtirq = 1;
	/*
	 * We set expires_next to KTIME_MAX here with cpu_base->lock
	 * held to prevent that a timer is enqueued in our queue via
	 * the migration code. This does not affect enqueueing of
	 * timers which run their callback and need to be requeued on
	 * this CPU.
	 */
	cpu_base->expires_next = KTIME_MAX;

	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
		cpu_base->softirq_expires_next = KTIME_MAX;
		cpu_base->softirq_activated = 1;
		raise_timer_softirq(HRTIMER_SOFTIRQ);
	}

	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);

	/* Reevaluate the clock bases for the [soft] next expiry */
	expires_next = hrtimer_update_next_event(cpu_base);
	/*
	 * Store the new expiry value so the migration code can verify
	 * against it.
	 */
	cpu_base->expires_next = expires_next;
	cpu_base->in_hrtirq = 0;
	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

	/* Reprogramming necessary ? */
	if (!tick_program_event(expires_next, 0)) {
		cpu_base->hang_detected = 0;
		return;
	}

	/*
	 * The next timer was already expired due to:
	 * - tracing
	 * - long lasting callbacks
	 * - being scheduled away when running in a VM
	 *
	 * We need to prevent that we loop forever in the hrtimer
	 * interrupt routine. We give it 3 attempts to avoid
	 * overreacting on some spurious event.
	 *
	 * Acquire base lock for updating the offsets and retrieving
	 * the current time.
	 */
	raw_spin_lock_irqsave(&cpu_base->lock, flags);
	now = hrtimer_update_base(cpu_base);
	cpu_base->nr_retries++;
	if (++retries < 3)
		goto retry;
	/*
	 * Give the system a chance to do something other than looping
	 * here. We stored the entry time, so we know exactly how long
	 * we spent here. We schedule the next event this amount of
	 * time away.
	 */
	cpu_base->nr_hangs++;
	cpu_base->hang_detected = 1;
	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

	delta = ktime_sub(now, entry_time);
	if ((unsigned int)delta > cpu_base->max_hang_time)
		cpu_base->max_hang_time = (unsigned int) delta;
	/*
	 * Limit it to a sensible value as we enforce a longer
	 * delay. Give the CPU at least 100ms to catch up.
	 */
	if (delta > 100 * NSEC_PER_MSEC)
		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
	else
		expires_next = ktime_add(now, delta);
	tick_program_event(expires_next, 1);
	pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
}
#endif /* CONFIG_HIGH_RES_TIMERS */

/*
 * Called from run_local_timers in hardirq context every jiffy
 */
void hrtimer_run_queues(void)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
	unsigned long flags;
	ktime_t now;

	if (hrtimer_hres_active(cpu_base))
		return;

	/*
	 * This _is_ ugly: We have to check periodically, whether we
	 * can switch to highres and / or nohz mode. The clocksource
	 * switch happens with xtime_lock held. Notification from
	 * there only sets the check bit in the tick_oneshot code,
	 * otherwise we might deadlock vs. xtime_lock.
	 */
	if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())) {
		hrtimer_switch_to_hres();
		return;
	}

	raw_spin_lock_irqsave(&cpu_base->lock, flags);
	now = hrtimer_update_base(cpu_base);

	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
		cpu_base->softirq_expires_next = KTIME_MAX;
		cpu_base->softirq_activated = 1;
		raise_timer_softirq(HRTIMER_SOFTIRQ);
	}

	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}

/*
 * Sleep related functions:
 */
static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
	struct hrtimer_sleeper *t =
		container_of(timer, struct hrtimer_sleeper, timer);
	struct task_struct *task = t->task;

	t->task = NULL;
	if (task)
		wake_up_process(task);

	return HRTIMER_NORESTART;
}

/**
 * hrtimer_sleeper_start_expires - Start a hrtimer sleeper timer
 * @sl:		sleeper to be started
 * @mode:	timer mode abs/rel
 *
 * Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers
 * to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context)
 */
void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
				   enum hrtimer_mode mode)
{
	/*
	 * Make the enqueue delivery mode check work on RT. If the sleeper
	 * was initialized for hard interrupt delivery, force the mode bit.
	 * This is a special case for hrtimer_sleepers because
	 * __hrtimer_setup_sleeper() determines the delivery mode on RT so the
	 * fiddling with this decision is avoided at the call sites.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && sl->timer.is_hard)
		mode |= HRTIMER_MODE_HARD;

	hrtimer_start_expires(&sl->timer, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);

static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
				    clockid_t clock_id, enum hrtimer_mode mode)
{
	/*
	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
	 * marked for hard interrupt expiry mode are moved into soft
	 * interrupt context either for latency reasons or because the
	 * hrtimer callback takes regular spinlocks or invokes other
	 * functions which are not suitable for hard interrupt context on
	 * PREEMPT_RT.
	 *
	 * The hrtimer_sleeper callback is RT compatible in hard interrupt
	 * context, but there is a latency concern: Untrusted userspace can
	 * spawn many threads which arm timers for the same expiry time on
	 * the same CPU. That causes a latency spike due to the wakeup of
	 * a gazillion threads.
	 *
	 * OTOH, privileged real-time user space applications rely on the
	 * low latency of hard interrupt wakeups. If the current task is in
	 * a real-time scheduling class, mark the mode for hard interrupt
	 * expiry.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
		if (rt_or_dl_task_policy(current) && !(mode & HRTIMER_MODE_SOFT))
			mode |= HRTIMER_MODE_HARD;
	}

	__hrtimer_setup(&sl->timer, hrtimer_wakeup, clock_id, mode);
	sl->task = current;
}

/**
 * hrtimer_setup_sleeper_on_stack - initialize a sleeper in stack memory
 * @sl:		sleeper to be initialized
 * @clock_id:	the clock to be used
 * @mode:	timer mode abs/rel
 */
void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl,
				    clockid_t clock_id, enum hrtimer_mode mode)
{
	debug_setup_on_stack(&sl->timer, clock_id, mode);
	__hrtimer_setup_sleeper(sl, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_setup_sleeper_on_stack);

int nanosleep_copyout(struct restart_block *restart, struct timespec64 *ts)
{
	switch (restart->nanosleep.type) {
#ifdef CONFIG_COMPAT_32BIT_TIME
	case TT_COMPAT:
		if (put_old_timespec32(ts, restart->nanosleep.compat_rmtp))
			return -EFAULT;
		break;
#endif
	case TT_NATIVE:
		if (put_timespec64(ts, restart->nanosleep.rmtp))
			return -EFAULT;
		break;
	default:
		BUG();
	}
	return -ERESTART_RESTARTBLOCK;
}

static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{
	struct restart_block *restart;

	do {
		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
		hrtimer_sleeper_start_expires(t, mode);

		if (likely(t->task))
			schedule();

		hrtimer_cancel(&t->timer);
		mode = HRTIMER_MODE_ABS;

	} while (t->task && !signal_pending(current));

	__set_current_state(TASK_RUNNING);

	if (!t->task)
		return 0;

	restart = &current->restart_block;
	if (restart->nanosleep.type != TT_NONE) {
		ktime_t rem = hrtimer_expires_remaining(&t->timer);
		struct timespec64 rmt;

		if (rem <= 0)
			return 0;
		rmt = ktime_to_timespec64(rem);

		return nanosleep_copyout(restart, &rmt);
	}
	return -ERESTART_RESTARTBLOCK;
}

static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
{
	struct hrtimer_sleeper t;
	int ret;

	hrtimer_setup_sleeper_on_stack(&t, restart->nanosleep.clockid, HRTIMER_MODE_ABS);
	hrtimer_set_expires(&t.timer, restart->nanosleep.expires);
	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
	destroy_hrtimer_on_stack(&t.timer);
	return ret;
}

long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
		       const clockid_t clockid)
{
	struct restart_block *restart;
	struct hrtimer_sleeper t;
	int ret = 0;

	hrtimer_setup_sleeper_on_stack(&t, clockid, mode);
	hrtimer_set_expires_range_ns(&t.timer, rqtp, current->timer_slack_ns);
	ret = do_nanosleep(&t, mode);
	if (ret != -ERESTART_RESTARTBLOCK)
		goto out;

	/* Absolute timers do not update the rmtp value and restart: */
	if (mode == HRTIMER_MODE_ABS) {
		ret = -ERESTARTNOHAND;
		goto out;
	}

	restart = &current->restart_block;
	restart->nanosleep.clockid = t.timer.base->clockid;
	restart->nanosleep.expires = hrtimer_get_expires(&t.timer);
	set_restart_fn(restart, hrtimer_nanosleep_restart);
out:
	destroy_hrtimer_on_stack(&t.timer);
	return ret;
}

#ifdef CONFIG_64BIT

SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
		struct __kernel_timespec __user *, rmtp)
{
	struct timespec64 tu;

	if (get_timespec64(&tu, rqtp))
		return -EFAULT;

	if (!timespec64_valid(&tu))
		return -EINVAL;

	current->restart_block.fn = do_no_restart_syscall;
	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
	current->restart_block.nanosleep.rmtp = rmtp;
	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
				 CLOCK_MONOTONIC);
}

#endif

#ifdef CONFIG_COMPAT_32BIT_TIME

SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
		       struct old_timespec32 __user *, rmtp)
{
	struct timespec64 tu;

	if (get_old_timespec32(&tu, rqtp))
		return -EFAULT;

	if (!timespec64_valid(&tu))
		return -EINVAL;

	current->restart_block.fn = do_no_restart_syscall;
	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
	current->restart_block.nanosleep.compat_rmtp = rmtp;
	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
				 CLOCK_MONOTONIC);
}
#endif

/*
 * Functions related to boot-time initialization:
 */
int hrtimers_prepare_cpu(unsigned int cpu)
{
	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
	int i;

	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
		struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i];

		clock_b->cpu_base = cpu_base;
		seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock);
		timerqueue_init_head(&clock_b->active);
	}

	cpu_base->cpu = cpu;
	hrtimer_cpu_base_init_expiry_lock(cpu_base);
	return 0;
}

int hrtimers_cpu_starting(unsigned int cpu)
{
	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);

	/* Clear out any left over state from a CPU down operation */
	cpu_base->active_bases = 0;
	cpu_base->hres_active = 0;
	cpu_base->hang_detected = 0;
	cpu_base->next_timer = NULL;
	cpu_base->softirq_next_timer = NULL;
	cpu_base->expires_next = KTIME_MAX;
	cpu_base->softirq_expires_next = KTIME_MAX;
	cpu_base->online = 1;
	return 0;
}

#ifdef CONFIG_HOTPLUG_CPU

static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
				struct hrtimer_clock_base *new_base)
{
	struct hrtimer *timer;
	struct timerqueue_node *node;

	while ((node = timerqueue_getnext(&old_base->active))) {
		timer = container_of(node, struct hrtimer, node);
		BUG_ON(hrtimer_callback_running(timer));
		debug_deactivate(timer);

		/*
		 * Mark it as ENQUEUED not INACTIVE otherwise the
		 * timer could be seen as !active and just vanish away
		 * under us on another CPU
		 */
		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
		timer->base = new_base;
		/*
		 * Enqueue the timers on the new cpu. This does not
		 * reprogram the event device in case the timer
		 * expires before the earliest on this CPU, but we run
		 * hrtimer_interrupt after we migrated everything to
		 * sort out already expired timers and reprogram the
		 * event device.
		 */
		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
	}
}

int hrtimers_cpu_dying(unsigned int dying_cpu)
{
	int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
	struct hrtimer_cpu_base *old_base, *new_base;

	old_base = this_cpu_ptr(&hrtimer_bases);
	new_base = &per_cpu(hrtimer_bases, ncpu);

	/*
	 * The caller is globally serialized and nobody else
	 * takes two locks at once, deadlock is not possible.
	 */
	raw_spin_lock(&old_base->lock);
	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);

	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
		migrate_hrtimer_list(&old_base->clock_base[i],
				     &new_base->clock_base[i]);
	}

	/* Tell the other CPU to retrigger the next event */
	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);

	raw_spin_unlock(&new_base->lock);
	old_base->online = 0;
	raw_spin_unlock(&old_base->lock);

	return 0;
}

#endif /* CONFIG_HOTPLUG_CPU */

void __init hrtimers_init(void)
{
	hrtimers_prepare_cpu(smp_processor_id());
	hrtimers_cpu_starting(smp_processor_id());
	open_softirq(HRTIMER_SOFTIRQ, hrtimer_run_softirq);
}
]

gemini-3-flash-preview

							input: 28251
							output: 24
							thoughts: 334

**Analyzing `hrtimer_interrupt` and Identifying a Discrepancy**

First, I need to confirm which statement the report refers to. The report cites line 1882 of `hrtimer_interrupt`, which I initially took to be `cpu_base->expires_next = KTIME_MAX;`. In my copy of the file, however, the code near line 1882 assigns the freshly computed `expires_next` to `cpu_base->expires_next`. I need to verify this against the surrounding code.

Let me take another look at the code:

```c
1856: void hrtimer_interrupt(struct clock_event_device *dev)
1857: {
...
1871: 	cpu_base->expires_next = KTIME_MAX;
...
1881: 	expires_next = hrtimer_update_next_event(cpu_base);
1882: 	/*
1883: 	 * Store the new expiry value so the migration code can verify
1884: 	 * against it.
1885: 	 */
1886: 	cpu_base->expires_next = expires_next;
```

Ah, right. The line numbers are slightly off between the report and this listing, but the code itself is clear. The first assignment (line 1871) sets `cpu_base->expires_next` to `KTIME_MAX`; the second (line 1886 in this listing, close to the reported line 1882) stores the value returned by `hrtimer_update_next_event()`.

Now that I've resolved this potential line numbering confusion, let's turn our attention to `print_tickdevice` in `kernel/time/timer_list.c:206`.

map[File:kernel/time/timer_list.c]

map[Contents:// SPDX-License-Identifier: GPL-2.0
/*
 * List pending timers
 *
 * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar
 */

#include <linux/proc_fs.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/kallsyms.h>
#include <linux/nmi.h>

#include <linux/uaccess.h>

#include "tick-internal.h"

struct timer_list_iter {
	int cpu;
	bool second_pass;
	u64 now;
};

/*
 * This allows printing both to /proc/timer_list and
 * to the console (on SysRq-Q):
 */
__printf(2, 3)
static void SEQ_printf(struct seq_file *m, const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);

	if (m)
		seq_vprintf(m, fmt, args);
	else
		vprintk(fmt, args);

	va_end(args);
}

static void
print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
	    int idx, u64 now)
{
	SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function));
	SEQ_printf(m, ", S:%02x", timer->state);
	SEQ_printf(m, "\n");
	SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n",
		(unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)),
		(unsigned long long)ktime_to_ns(hrtimer_get_expires(timer)),
		(long long)(ktime_to_ns(hrtimer_get_softexpires(timer)) - now),
		(long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now));
}

static void
print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
		    u64 now)
{
	struct hrtimer *timer, tmp;
	unsigned long next = 0, i;
	struct timerqueue_node *curr;
	unsigned long flags;

next_one:
	i = 0;

	touch_nmi_watchdog();

	raw_spin_lock_irqsave(&base->cpu_base->lock, flags);

	curr = timerqueue_getnext(&base->active);
	/*
	 * Crude but we have to do this O(N*N) thing, because
	 * we have to unlock the base when printing:
	 */
	while (curr && i < next) {
		curr = timerqueue_iterate_next(curr);
		i++;
	}

	if (curr) {

		timer = container_of(curr, struct hrtimer, node);
		tmp = *timer;
		raw_spin_unlock_irqrestore(&base->cpu_base->lock, flags);

		print_timer(m, timer, &tmp, i, now);
		next++;
		goto next_one;
	}
	raw_spin_unlock_irqrestore(&base->cpu_base->lock, flags);
}

static void
print_base(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
{
	SEQ_printf(m, "  .base:       %p\n", base);
	SEQ_printf(m, "  .index:      %d\n", base->index);

	SEQ_printf(m, "  .resolution: %u nsecs\n", hrtimer_resolution);
#ifdef CONFIG_HIGH_RES_TIMERS
	SEQ_printf(m, "  .offset:     %Lu nsecs\n",
		   (unsigned long long) ktime_to_ns(base->offset));
#endif
	SEQ_printf(m,   "active timers:\n");
	print_active_timers(m, base, now + ktime_to_ns(base->offset));
}

static void print_cpu(struct seq_file *m, int cpu, u64 now)
{
	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
	int i;

	SEQ_printf(m, "cpu: %d\n", cpu);
	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
		SEQ_printf(m, " clock %d:\n", i);
		print_base(m, cpu_base->clock_base + i, now);
	}
#define P(x) \
	SEQ_printf(m, "  .%-15s: %Lu\n", #x, \
		   (unsigned long long)(cpu_base->x))
#define P_ns(x) \
	SEQ_printf(m, "  .%-15s: %Lu nsecs\n", #x, \
		   (unsigned long long)(ktime_to_ns(cpu_base->x)))

#ifdef CONFIG_HIGH_RES_TIMERS
	P_ns(expires_next);
	P(hres_active);
	P(nr_events);
	P(nr_retries);
	P(nr_hangs);
	P(max_hang_time);
#endif
#undef P
#undef P_ns

#ifdef CONFIG_TICK_ONESHOT
# define P(x) \
	SEQ_printf(m, "  .%-15s: %Lu\n", #x, \
		   (unsigned long long)(ts->x))
# define P_ns(x) \
	SEQ_printf(m, "  .%-15s: %Lu nsecs\n", #x, \
		   (unsigned long long)(ktime_to_ns(ts->x)))
# define P_flag(x, f)			    \
	SEQ_printf(m, "  .%-15s: %d\n", #x, !!(ts->flags & (f)))

	{
		struct tick_sched *ts = tick_get_tick_sched(cpu);
		P_flag(nohz, TS_FLAG_NOHZ);
		P_flag(highres, TS_FLAG_HIGHRES);
		P_ns(last_tick);
		P_flag(tick_stopped, TS_FLAG_STOPPED);
		P(idle_jiffies);
		P(idle_calls);
		P(idle_sleeps);
		P_ns(idle_entrytime);
		P_ns(idle_waketime);
		P_ns(idle_exittime);
		P_ns(idle_sleeptime);
		P_ns(iowait_sleeptime);
		P(last_jiffies);
		P(next_timer);
		P_ns(idle_expires);
		SEQ_printf(m, "jiffies: %Lu\n",
			   (unsigned long long)jiffies);
	}
#endif

#undef P
#undef P_ns
	SEQ_printf(m, "\n");
}

#ifdef CONFIG_GENERIC_CLOCKEVENTS
static void
print_tickdevice(struct seq_file *m, struct tick_device *td, int cpu)
{
	struct clock_event_device *dev = td->evtdev;

	touch_nmi_watchdog();

	SEQ_printf(m, "Tick Device: mode:     %d\n", td->mode);
	if (cpu < 0)
		SEQ_printf(m, "Broadcast device\n");
	else
		SEQ_printf(m, "Per CPU device: %d\n", cpu);

	SEQ_printf(m, "Clock Event Device: ");
	if (!dev) {
		SEQ_printf(m, "<NULL>\n");
		return;
	}
	SEQ_printf(m, "%s\n", dev->name);
	SEQ_printf(m, " max_delta_ns:   %llu\n",
		   (unsigned long long) dev->max_delta_ns);
	SEQ_printf(m, " min_delta_ns:   %llu\n",
		   (unsigned long long) dev->min_delta_ns);
	SEQ_printf(m, " mult:           %u\n", dev->mult);
	SEQ_printf(m, " shift:          %u\n", dev->shift);
	SEQ_printf(m, " mode:           %d\n", clockevent_get_state(dev));
	SEQ_printf(m, " next_event:     %Ld nsecs\n",
		   (unsigned long long) ktime_to_ns(dev->next_event));

	SEQ_printf(m, " set_next_event: %ps\n", dev->set_next_event);

	if (dev->set_state_shutdown)
		SEQ_printf(m, " shutdown:       %ps\n",
			dev->set_state_shutdown);

	if (dev->set_state_periodic)
		SEQ_printf(m, " periodic:       %ps\n",
			dev->set_state_periodic);

	if (dev->set_state_oneshot)
		SEQ_printf(m, " oneshot:        %ps\n",
			dev->set_state_oneshot);

	if (dev->set_state_oneshot_stopped)
		SEQ_printf(m, " oneshot stopped: %ps\n",
			dev->set_state_oneshot_stopped);

	if (dev->tick_resume)
		SEQ_printf(m, " resume:         %ps\n",
			dev->tick_resume);

	SEQ_printf(m, " event_handler:  %ps\n", dev->event_handler);
	SEQ_printf(m, "\n");
	SEQ_printf(m, " retries:        %lu\n", dev->retries);

#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
	if (cpu >= 0) {
		const struct clock_event_device *wd = tick_get_wakeup_device(cpu);

		SEQ_printf(m, "Wakeup Device: %s\n", wd ? wd->name : "<NULL>");
	}
#endif
	SEQ_printf(m, "\n");
}

static void timer_list_show_tickdevices_header(struct seq_file *m)
{
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
	print_tickdevice(m, tick_get_broadcast_device(), -1);
	SEQ_printf(m, "tick_broadcast_mask: %*pb\n",
		   cpumask_pr_args(tick_get_broadcast_mask()));
#ifdef CONFIG_TICK_ONESHOT
	SEQ_printf(m, "tick_broadcast_oneshot_mask: %*pb\n",
		   cpumask_pr_args(tick_get_broadcast_oneshot_mask()));
#endif
	SEQ_printf(m, "\n");
#endif
}
#endif

static inline void timer_list_header(struct seq_file *m, u64 now)
{
	SEQ_printf(m, "Timer List Version: v0.10\n");
	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
	SEQ_printf(m, "\n");
}

void sysrq_timer_list_show(void)
{
	u64 now = ktime_to_ns(ktime_get());
	int cpu;

	timer_list_header(NULL, now);

	for_each_online_cpu(cpu)
		print_cpu(NULL, cpu, now);

#ifdef CONFIG_GENERIC_CLOCKEVENTS
	timer_list_show_tickdevices_header(NULL);
	for_each_online_cpu(cpu)
		print_tickdevice(NULL, tick_get_device(cpu), cpu);
#endif
	return;
}

#ifdef CONFIG_PROC_FS
static int timer_list_show(struct seq_file *m, void *v)
{
	struct timer_list_iter *iter = v;

	if (iter->cpu == -1 && !iter->second_pass)
		timer_list_header(m, iter->now);
	else if (!iter->second_pass)
		print_cpu(m, iter->cpu, iter->now);
#ifdef CONFIG_GENERIC_CLOCKEVENTS
	else if (iter->cpu == -1 && iter->second_pass)
		timer_list_show_tickdevices_header(m);
	else
		print_tickdevice(m, tick_get_device(iter->cpu), iter->cpu);
#endif
	return 0;
}

static void *move_iter(struct timer_list_iter *iter, loff_t offset)
{
	for (; offset; offset--) {
		iter->cpu = cpumask_next(iter->cpu, cpu_online_mask);
		if (iter->cpu >= nr_cpu_ids) {
#ifdef CONFIG_GENERIC_CLOCKEVENTS
			if (!iter->second_pass) {
				iter->cpu = -1;
				iter->second_pass = true;
			} else
				return NULL;
#else
			return NULL;
#endif
		}
	}
	return iter;
}

static void *timer_list_start(struct seq_file *file, loff_t *offset)
{
	struct timer_list_iter *iter = file->private;

	if (!*offset)
		iter->now = ktime_to_ns(ktime_get());
	iter->cpu = -1;
	iter->second_pass = false;
	return move_iter(iter, *offset);
}

static void *timer_list_next(struct seq_file *file, void *v, loff_t *offset)
{
	struct timer_list_iter *iter = file->private;
	++*offset;
	return move_iter(iter, 1);
}

static void timer_list_stop(struct seq_file *seq, void *v)
{
}

static const struct seq_operations timer_list_sops = {
	.start = timer_list_start,
	.next = timer_list_next,
	.stop = timer_list_stop,
	.show = timer_list_show,
};

static int __init init_timer_list_procfs(void)
{
	struct proc_dir_entry *pe;

	pe = proc_create_seq_private("timer_list", 0400, NULL, &timer_list_sops,
			sizeof(struct timer_list_iter), NULL);
	if (!pe)
		return -ENOMEM;
	return 0;
}
__initcall(init_timer_list_procfs);
#endif
]

gemini-3-flash-preview

							input: 32732
							output: 42
							thoughts: 1126

**Race Condition Identified: `dev->next_event` in `clock_event_device`**

Okay, so the report points to a potential issue: a read of memory location `0xffff888237d17b58` by `print_tickdevice` at `kernel/time/timer_list.c:206`, and a write to the same location by `hrtimer_interrupt` at `kernel/time/hrtimer.c:1882`. My initial thought was that the location was the `cpu_base->expires_next` field.

I had to double-check that assumption. Line 206 of `timer_list.c`, in `print_tickdevice`, reads `dev->next_event`, and in the version of `hrtimer_interrupt` I'm looking at, the statement that writes `dev->next_event` is at line 1865 rather than 1882.

Since both accesses hit the same address, `print_tickdevice` is reading `dev->next_event` and `hrtimer_interrupt` is writing it. This is a potential race on the `next_event` field of `struct clock_event_device`. `hrtimer_interrupt` also calls `tick_program_event`, so next I'll look at what that function does to understand how it interacts with this race.

map[ContextFile:kernel/time/hrtimer.c IncludeLines:true Name:tick_program_event]

map[SourceCode:  20:	/**
  21:	 * tick_program_event - program the CPU local timer device for the next event
  22:	 * @expires: the time at which the next timer event should occur
  23:	 * @force: flag to force reprograming even if the event time hasn't changed
  24:	 *
  25:	 * Return: 0 on success, negative error code on failure
  26:	 */
  27:	int tick_program_event(ktime_t expires, int force)
  28:	{
  29:		struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
  30:	
  31:		if (unlikely(expires == KTIME_MAX)) {
  32:			/*
  33:			 * We don't need the clock event device any more, stop it.
  34:			 */
  35:			clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
  36:			dev->next_event = KTIME_MAX;
  37:			return 0;
  38:		}
  39:	
  40:		if (unlikely(clockevent_state_oneshot_stopped(dev))) {
  41:			/*
  42:			 * We need the clock event again, configure it in ONESHOT mode
  43:			 * before using it.
  44:			 */
  45:			clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
  46:		}
  47:	
  48:		return clockevents_program_event(dev, expires, force);
  49:	}
 SourceFile:kernel/time/tick-oneshot.c]

gemini-3-flash-preview

							input: 34747
							output: 45
							thoughts: 9

**Clock Event Programming and its Implications**

Alright, let's break down this `clockevents_program_event` function. This is the heart of how the kernel manages its timers and interrupts. When I see this function, my mind immediately jumps to the underlying hardware: Programmable Interval Timers (PITs), Advanced Configuration and Power Interface (ACPI) timers, High Precision Event Timers (HPETs), or whatever the system employs. I need to figure out which hardware is in play, since the architecture determines how events are programmed.

My immediate thought is, what's the broader context? Is this being called during normal operation, or is it a part of system boot? Does this code interact with other interrupt mechanisms, like the Advanced Programmable Interrupt Controller (APIC) or Local APIC (LAPIC)? The scheduler's time management, context switching, and the accuracy of time-related functions rely heavily on what goes on here. A misstep in how we program the event, especially if we're dealing with multiple CPUs or heterogeneous systems, will lead to all kinds of timing jitter and performance issues. I have to think about synchronization primitives, interrupt masking, and memory barriers to ensure correct behaviour. We've got to ensure the event is programmed correctly for the specific timer device and that we account for potential race conditions. I need to check the data sheets for the particular hardware and consider edge-case scenarios carefully to ensure we program the event accurately and efficiently.

map[ContextFile:kernel/time/tick-oneshot.c IncludeLines:true Name:clockevents_program_event]

map[SourceCode: 295:	/**
 296:	 * clockevents_program_event - Reprogram the clock event device.
 297:	 * @dev:	device to program
 298:	 * @expires:	absolute expiry time (monotonic clock)
 299:	 * @force:	program minimum delay if expires can not be set
 300:	 *
 301:	 * Returns 0 on success, -ETIME when the event is in the past.
 302:	 */
 303:	int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 304:				      bool force)
 305:	{
 306:		unsigned long long clc;
 307:		int64_t delta;
 308:		int rc;
 309:	
 310:		if (WARN_ON_ONCE(expires < 0))
 311:			return -ETIME;
 312:	
 313:		dev->next_event = expires;
 314:	
 315:		if (clockevent_state_shutdown(dev))
 316:			return 0;
 317:	
 318:		/* We must be in ONESHOT state here */
 319:		WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
 320:			  clockevent_get_state(dev));
 321:	
 322:		/* Shortcut for clockevent devices that can deal with ktime. */
 323:		if (dev->features & CLOCK_EVT_FEAT_KTIME)
 324:			return dev->set_next_ktime(expires, dev);
 325:	
 326:		delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
 327:		if (delta <= 0)
 328:			return force ? clockevents_program_min_delta(dev) : -ETIME;
 329:	
 330:		delta = min(delta, (int64_t) dev->max_delta_ns);
 331:		delta = max(delta, (int64_t) dev->min_delta_ns);
 332:	
 333:		clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
 334:		rc = dev->set_next_event((unsigned long) clc, dev);
 335:	
 336:		return (rc && force) ? clockevents_program_min_delta(dev) : rc;
 337:	}
 SourceFile:kernel/time/clockevents.c]

gemini-3-flash-preview

							input: 35875
							output: 42
							thoughts: 807

**Analyzing a Potential Data Race in the Kernel's Timer System**

Okay, so this KCSAN report points to a potential data race on `dev->next_event` in the kernel's timer code. `clockevents_program_event` writes `dev->next_event` at line 313, and `hrtimer_interrupt` writes it directly as well. The important thing to note is that `tick_program_event`, which calls `clockevents_program_event` and therefore also writes `dev->next_event`, makes that call with interrupts disabled.

The data race is between `hrtimer_interrupt` writing, and `print_tickdevice` reading `dev->next_event`. `print_tickdevice` gets called when `/proc/timer_list` is read, iterating over the CPUs and their tick devices. And looking at the code, `print_tickdevice` reads `dev->next_event` directly, without any locks! That's a classic red flag.

My immediate thought is, "Wait, is this actually a problem?" `dev->next_event` is a `ktime_t`, a 64-bit value, and on 64-bit systems naturally aligned 8-byte loads and stores are atomic. The report shows the value changing from `0x000000129f80e76b` to `0x000000129f81f8db`. Because each access is atomic, the reader sees either the old or the new value, never a torn mix; KCSAN simply observed the value change while the read was in flight. So even though this is a data race, it is likely benign: at worst `/proc/timer_list` shows a slightly stale `next_event`. The file is intended for debugging, after all, and I wouldn't expect strict synchronization there.

However, KCSAN flagged it, which means we should still investigate. The race could still impact the timer system if `dev->next_event` is also used for something sensitive. It's a field within `struct clock_event_device`, so let me check what else uses `dev->next_event`.
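
The "atomic but still flagged" reasoning above is commonly expressed in code by marking the racy load and store so the compiler cannot split or repeat them. The following is a minimal userspace sketch of that idea, not the kernel's code: `MOCK_READ_ONCE`/`MOCK_WRITE_ONCE` are simplified stand-ins for the kernel's `READ_ONCE()`/`WRITE_ONCE()`, and `struct mock_clock_event_device` and both helper functions are hypothetical names introduced only for illustration. It assumes GCC or Clang (for `__typeof__`). Whether this report would be addressed with marked accesses, with `data_race()`, or left alone is not stated in the source.

```c
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel macros READ_ONCE()/WRITE_ONCE()
 * (assumption: the real definitions are more elaborate). The volatile
 * access makes the compiler emit exactly one load or store of the object.
 */
#define MOCK_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MOCK_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

typedef int64_t mock_ktime_t;		/* models ktime_t, an s64 */
#define MOCK_KTIME_MAX INT64_MAX	/* models KTIME_MAX */

/* Hypothetical, trimmed-down model of struct clock_event_device. */
struct mock_clock_event_device {
	mock_ktime_t next_event;
};

/* Writer side: models the interrupt path updating next_event. */
static inline void mock_set_next_event(struct mock_clock_event_device *dev,
				       mock_ktime_t expires)
{
	MOCK_WRITE_ONCE(dev->next_event, expires);
}

/* Reader side: models a lockless debug printer sampling next_event. */
static inline mock_ktime_t
mock_get_next_event(const struct mock_clock_event_device *dev)
{
	return MOCK_READ_ONCE(dev->next_event);
}
```

With both sides marked, a reader on a 64-bit target gets one whole value per call, which matches the "stale but never torn" outcome described above.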

map[ContextFile:kernel/time/clockevents.c IncludeLines:true Name:clock_event_device]

map[SourceCode:  70:	/**
  71:	 * struct clock_event_device - clock event device descriptor
  72:	 * @event_handler:	Assigned by the framework to be called by the low
  73:	 *			level handler of the event source
  74:	 * @set_next_event:	set next event function using a clocksource delta
  75:	 * @set_next_ktime:	set next event function using a direct ktime value
  76:	 * @next_event:		local storage for the next event in oneshot mode
  77:	 * @max_delta_ns:	maximum delta value in ns
  78:	 * @min_delta_ns:	minimum delta value in ns
  79:	 * @mult:		nanosecond to cycles multiplier
  80:	 * @shift:		nanoseconds to cycles divisor (power of two)
  81:	 * @state_use_accessors:current state of the device, assigned by the core code
  82:	 * @features:		features
  83:	 * @retries:		number of forced programming retries
  84:	 * @set_state_periodic:	switch state to periodic
  85:	 * @set_state_oneshot:	switch state to oneshot
  86:	 * @set_state_oneshot_stopped: switch state to oneshot_stopped
  87:	 * @set_state_shutdown:	switch state to shutdown
  88:	 * @tick_resume:	resume clkevt device
  89:	 * @broadcast:		function to broadcast events
  90:	 * @min_delta_ticks:	minimum delta value in ticks stored for reconfiguration
  91:	 * @max_delta_ticks:	maximum delta value in ticks stored for reconfiguration
  92:	 * @name:		ptr to clock event name
  93:	 * @rating:		variable to rate clock event devices
  94:	 * @irq:		IRQ number (only for non CPU local devices)
  95:	 * @bound_on:		Bound on CPU
  96:	 * @cpumask:		cpumask to indicate for which CPUs this device works
  97:	 * @list:		list head for the management code
  98:	 * @owner:		module reference
  99:	 */
 100:	struct clock_event_device {
 101:		void			(*event_handler)(struct clock_event_device *);
 102:		int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
 103:		int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
 104:		ktime_t			next_event;
 105:		u64			max_delta_ns;
 106:		u64			min_delta_ns;
 107:		u32			mult;
 108:		u32			shift;
 109:		enum clock_event_state	state_use_accessors;
 110:		unsigned int		features;
 111:		unsigned long		retries;
 112:	
 113:		int			(*set_state_periodic)(struct clock_event_device *);
 114:		int			(*set_state_oneshot)(struct clock_event_device *);
 115:		int			(*set_state_oneshot_stopped)(struct clock_event_device *);
 116:		int			(*set_state_shutdown)(struct clock_event_device *);
 117:		int			(*tick_resume)(struct clock_event_device *);
 118:	
 119:		void			(*broadcast)(const struct cpumask *mask);
 120:		void			(*suspend)(struct clock_event_device *);
 121:		void			(*resume)(struct clock_event_device *);
 122:		unsigned long		min_delta_ticks;
 123:		unsigned long		max_delta_ticks;
 124:	
 125:		const char		*name;
 126:		int			rating;
 127:		int			irq;
 128:		int			bound_on;
 129:		const struct cpumask	*cpumask;
 130:		struct list_head	list;
 131:		struct module		*owner;
 132:	} ____cacheline_aligned;
 SourceFile:include/linux/clockchips.h]

gemini-3-flash-preview

							input: 38360
							output: 38
							thoughts: 90

**Analyzing the `next_event` Variable**

Okay, so `next_event` is defined as a `ktime_t`, which, as I recall, is a 64-bit signed integer (`s64`). That's straightforward enough. The comment indicates it's used as local storage for the *next event* specifically in the one-shot mode of whatever this code is managing. So, it's essentially a timestamp, an absolute point in time.

The comment goes on to highlight that it stores the expiry time, which makes sense. I see it's used in `clockevents_program_event`, confirming it's directly related to setting up a future timer event. But, I need to know the full picture.

I need to see the breadth of its use. Is this field touched only on the local event-programming path, or is it read and written elsewhere? Is it subject to race conditions when accessed from interrupt context? To answer this, I'll search for every reference to `next_event` within the `kernel/time/` directory. That should show where it is read, where it is written, and how it is used across the timekeeping subsystem, which is key to understanding the scope and implications of a race on this field.
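
The width and alignment of the field can be checked mechanically. The sketch below is a hedged illustration, assuming a 64-bit (LP64) target, GCC or Clang, and a 64-byte cache line: `struct mock_ced_layout` is a hypothetical model that copies only the leading members of `struct clock_event_device` in the order shown in the listing, and `aligned(64)` stands in for `____cacheline_aligned`. It is not the kernel's definition, and the real offset depends on the actual configuration.

```c
#include <stddef.h>
#include <stdint.h>

typedef int64_t layout_ktime_t;		/* models ktime_t, an s64 */

/*
 * Hypothetical model of the first members of struct clock_event_device,
 * in listing order; all later members are omitted.
 */
struct mock_ced_layout {
	void (*event_handler)(struct mock_ced_layout *);
	int  (*set_next_event)(unsigned long evt, struct mock_ced_layout *);
	int  (*set_next_ktime)(layout_ktime_t expires, struct mock_ced_layout *);
	layout_ktime_t next_event;
	uint64_t max_delta_ns;
	uint64_t min_delta_ns;
} __attribute__((aligned(64)));		/* assumed cache line size */

/* Compile-time checks: the field is 8 bytes wide at an 8-byte offset. */
_Static_assert(sizeof(layout_ktime_t) == 8,
	       "ktime_t model must be 64 bits wide");
_Static_assert(offsetof(struct mock_ced_layout, next_event) % 8 == 0,
	       "next_event must sit at an 8-byte aligned offset");

/* Run-time check for a concrete instance: returns 1 if naturally aligned. */
static inline int layout_next_event_is_aligned(const struct mock_ced_layout *dev)
{
	return ((uintptr_t)&dev->next_event % _Alignof(layout_ktime_t)) == 0;
}
```

On a 32-bit target the static assertion on the offset may fail, which is the case where an 8-byte access could be split into two 4-byte accesses.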

map[Name:clock_event_device::next_event SourceTreePrefix:kernel/time/]

map[References:[map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:848] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:848] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:849] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:883] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:918] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:___tick_broadcast_oneshot_control SourceFile:kernel/time/tick-broadcast.c SourceLine:926] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:__clockevents_update_freq SourceFile:kernel/time/clockevents.c SourceLine:528] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:broadcast_needs_cpu SourceFile:kernel/time/tick-broadcast.c SourceLine:774] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:broadcast_shutdown_local SourceFile:kernel/time/tick-broadcast.c SourceLine:790] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:broadcast_shutdown_local SourceFile:kernel/time/tick-broadcast.c SourceLine:790] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:clockevents_increase_min_delta SourceFile:kernel/time/clockevents.c SourceLine:208] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:clockevents_program_event SourceFile:kernel/time/clockevents.c SourceLine:313] map[ReferenceKind:writes 
ReferencingEntityKind:function ReferencingEntityName:clockevents_program_min_delta SourceFile:kernel/time/clockevents.c SourceLine:241] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:clockevents_shutdown SourceFile:kernel/time/clockevents.c SourceLine:174] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:hotplug_cpu__broadcast_tick_pull SourceFile:kernel/time/tick-broadcast.c SourceLine:1174] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:hotplug_cpu__broadcast_tick_pull SourceFile:kernel/time/tick-broadcast.c SourceLine:1178] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:hrtimer_interrupt SourceFile:kernel/time/hrtimer.c SourceLine:1882] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:print_tickdevice SourceFile:kernel/time/timer_list.c SourceLine:206] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_broadcast_init_next_event SourceFile:kernel/time/tick-broadcast.c SourceLine:1001] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_broadcast_setup_oneshot SourceFile:kernel/time/tick-broadcast.c SourceLine:1066] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:698] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:713] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:721] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_oneshot_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:722] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_periodic 
SourceFile:kernel/time/tick-common.c SourceLine:111] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_handle_periodic_broadcast SourceFile:kernel/time/tick-broadcast.c SourceLine:416] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_nohz_get_sleep_length SourceFile:kernel/time/tick-sched.c SourceLine:1375] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_nohz_lowres_handler SourceFile:kernel/time/tick-sched.c SourceLine:1506] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_nohz_stop_tick SourceFile:kernel/time/tick-sched.c SourceLine:1049] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_oneshot_wakeup_control SourceFile:kernel/time/tick-broadcast.c SourceLine:952] map[ReferenceKind:writes ReferencingEntityKind:function ReferencingEntityName:tick_program_event SourceFile:kernel/time/tick-oneshot.c SourceLine:36] map[ReferenceKind:reads ReferencingEntityKind:function ReferencingEntityName:tick_setup_device SourceFile:kernel/time/tick-common.c SourceLine:232]] TruncatedOutput:false]

gemini-3-flash-preview

							input: 40518
							output: 41
							thoughts: 266

**Analyzing the `next_event` Race Condition**

Okay, so the references indicate `next_event` is a widely used field. However, the specific race condition identified focuses on interactions between `hrtimer_interrupt` (writing) and `print_tickdevice` (reading). My initial thought was to quickly check where this write happens within `hrtimer_interrupt`, and I went to the line indicated in the initial reference, which I thought was 1882:

```c
1882: 	dev->next_event = KTIME_MAX;
```

Hmm, let me check that file again.

```c
1856: void hrtimer_interrupt(struct clock_event_device *dev)
...
1865: 	dev->next_event = KTIME_MAX;
```

Wait a second. The file is different to the reference result. Line numbers don't match exactly. I bet the line numbers changed between the reference index and my `read-file` output, probably due to some file updates or different versions. No problem, this happens. I need to be careful about that.

Instead of hunting for line 1882 in the file contents, let's go straight to the definitive source. Let me use `codesearch-definition-source` for `hrtimer_interrupt`, with line numbers. That should give me a more reliable starting point. This should clarify exactly where the write to `dev->next_event` occurs within `hrtimer_interrupt` and prevent these silly confusion errors.

map[ContextFile:kernel/time/hrtimer.c IncludeLines:true Name:hrtimer_interrupt]

map[SourceCode:1869:	/*
1870:	 * High resolution timer interrupt
1871:	 * Called with interrupts disabled
1872:	 */
1873:	void hrtimer_interrupt(struct clock_event_device *dev)
1874:	{
1875:		struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
1876:		ktime_t expires_next, now, entry_time, delta;
1877:		unsigned long flags;
1878:		int retries = 0;
1879:	
1880:		BUG_ON(!cpu_base->hres_active);
1881:		cpu_base->nr_events++;
1882:		dev->next_event = KTIME_MAX;
1883:	
1884:		raw_spin_lock_irqsave(&cpu_base->lock, flags);
1885:		entry_time = now = hrtimer_update_base(cpu_base);
1886:	retry:
1887:		cpu_base->in_hrtirq = 1;
1888:		/*
1889:		 * We set expires_next to KTIME_MAX here with cpu_base->lock
1890:		 * held to prevent that a timer is enqueued in our queue via
1891:		 * the migration code. This does not affect enqueueing of
1892:		 * timers which run their callback and need to be requeued on
1893:		 * this CPU.
1894:		 */
1895:		cpu_base->expires_next = KTIME_MAX;
1896:	
1897:		if (!ktime_before(now, cpu_base->softirq_expires_next)) {
1898:			cpu_base->softirq_expires_next = KTIME_MAX;
1899:			cpu_base->softirq_activated = 1;
1900:			raise_timer_softirq(HRTIMER_SOFTIRQ);
1901:		}
1902:	
1903:		__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
1904:	
1905:		/* Reevaluate the clock bases for the [soft] next expiry */
1906:		expires_next = hrtimer_update_next_event(cpu_base);
1907:		/*
1908:		 * Store the new expiry value so the migration code can verify
1909:		 * against it.
1910:		 */
1911:		cpu_base->expires_next = expires_next;
1912:		cpu_base->in_hrtirq = 0;
1913:		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
1914:	
1915:		/* Reprogramming necessary ? */
1916:		if (!tick_program_event(expires_next, 0)) {
1917:			cpu_base->hang_detected = 0;
1918:			return;
1919:		}
1920:	
1921:		/*
1922:		 * The next timer was already expired due to:
1923:		 * - tracing
1924:		 * - long lasting callbacks
1925:		 * - being scheduled away when running in a VM
1926:		 *
1927:		 * We need to prevent that we loop forever in the hrtimer
1928:		 * interrupt routine. We give it 3 attempts to avoid
1929:		 * overreacting on some spurious event.
1930:		 *
1931:		 * Acquire base lock for updating the offsets and retrieving
1932:		 * the current time.
1933:		 */
1934:		raw_spin_lock_irqsave(&cpu_base->lock, flags);
1935:		now = hrtimer_update_base(cpu_base);
1936:		cpu_base->nr_retries++;
1937:		if (++retries < 3)
1938:			goto retry;
1939:		/*
1940:		 * Give the system a chance to do something else than looping
1941:		 * here. We stored the entry time, so we know exactly how long
1942:		 * we spent here. We schedule the next event this amount of
1943:		 * time away.
1944:		 */
1945:		cpu_base->nr_hangs++;
1946:		cpu_base->hang_detected = 1;
1947:		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
1948:	
1949:		delta = ktime_sub(now, entry_time);
1950:		if ((unsigned int)delta > cpu_base->max_hang_time)
1951:			cpu_base->max_hang_time = (unsigned int) delta;
1952:		/*
1953:		 * Limit it to a sensible value as we enforce a longer
1954:		 * delay. Give the CPU at least 100ms to catch up.
1955:		 */
1956:		if (delta > 100 * NSEC_PER_MSEC)
1957:			expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
1958:		else
1959:			expires_next = ktime_add(now, delta);
1960:		tick_program_event(expires_next, 1);
1961:		pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
1962:	}
 SourceFile:kernel/time/hrtimer.c]

gemini-3-flash-preview

							input: 42703
							output: 37
							thoughts: 288

**Analyzing the Timer Race Condition**

Okay, so the report points directly to line 1882 in `hrtimer.c`: `dev->next_event = KTIME_MAX;`. That's a perfect match. The report indicates a race condition centered around `dev->next_event`. Specifically, `hrtimer_interrupt` is writing `KTIME_MAX` to this variable, while `print_tickdevice` is reading it for display, likely for debugging or status output.

Initially, I thought this was a completely benign race. `print_tickdevice`'s purpose is to *show* the timer system's state. If it catches a value while it's being updated, particularly with `KTIME_MAX`, it's not a functional issue.

But hang on… is there a *torn read* risk? We need to verify data consistency. `ktime_t`, which is what `dev->next_event` is, is an `s64`, meaning it's 8 bytes in size. The report confirms "read to ... of 8 bytes" and "write to ... of 8 bytes." This is crucial. It means the compiler is employing 8-byte instructions, which is important for atomicity. On x86_64 architectures, 8-byte aligned loads and stores are *atomic*.
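To make the torn-read worry concrete, here is a minimal user-space sketch of what a reader could see if the 8-byte access were split into two 4-byte loads (as could happen on a 32-bit target). The helper `torn_read` is hypothetical and exists only for illustration; `ktime_t` and `KTIME_MAX` are redefined locally to mirror the kernel's `s64` type and its maximum value.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the kernel types: ktime_t is an s64, KTIME_MAX its maximum. */
typedef int64_t ktime_t;
#define KTIME_MAX INT64_MAX

static_assert(sizeof(ktime_t) == 8, "ktime_t must be 64 bits");

/*
 * Hypothetical helper: models a reader that does two 32-bit loads and sees
 * the low half of the old value combined with the high half of the new one.
 */
static ktime_t torn_read(ktime_t old_val, ktime_t new_val)
{
	uint64_t lo = (uint64_t)old_val & 0xffffffffu;
	uint64_t hi = (uint64_t)new_val & ~(uint64_t)0xffffffffu;

	return (ktime_t)(hi | lo);
}
```

With an old expiry of 5000000000 ns and a new value of `KTIME_MAX`, the torn result matches neither value. That is the outcome a single aligned 8-byte access rules out.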

Now we need to confirm the alignment. `dev->next_event` is located within the `struct clock_event_device`. I need to double-check that `next_event` is properly 8-byte aligned within that structure. If it is, the race is harmless because the reads and writes are atomic. If it's not aligned, then we have a problem.

map[ContextFile:kernel/time/clockevents.c Name:clock_event_device]

map[Fields:[map[Name:event_handler OffsetBits:0 SizeBits:64] map[Name:set_next_event OffsetBits:64 SizeBits:64] map[Name:set_next_ktime OffsetBits:128 SizeBits:64] map[Name:next_event OffsetBits:192 SizeBits:64] map[Name:max_delta_ns OffsetBits:256 SizeBits:64] map[Name:min_delta_ns OffsetBits:320 SizeBits:64] map[Name:mult OffsetBits:384 SizeBits:32] map[Name:shift OffsetBits:416 SizeBits:32] map[Name:state_use_accessors OffsetBits:448 SizeBits:32] map[Name:features OffsetBits:480 SizeBits:32] map[Name:retries OffsetBits:512 SizeBits:64] map[Name:set_state_periodic OffsetBits:576 SizeBits:64] map[Name:set_state_oneshot OffsetBits:640 SizeBits:64] map[Name:set_state_oneshot_stopped OffsetBits:704 SizeBits:64] map[Name:set_state_shutdown OffsetBits:768 SizeBits:64] map[Name:tick_resume OffsetBits:832 SizeBits:64] map[Name:broadcast OffsetBits:896 SizeBits:64] map[Name:suspend OffsetBits:960 SizeBits:64] map[Name:resume OffsetBits:1024 SizeBits:64] map[Name:min_delta_ticks OffsetBits:1088 SizeBits:64] map[Name:max_delta_ticks OffsetBits:1152 SizeBits:64] map[Name:name OffsetBits:1216 SizeBits:64] map[Name:rating OffsetBits:1280 SizeBits:32] map[Name:irq OffsetBits:1312 SizeBits:32] map[Name:bound_on OffsetBits:1344 SizeBits:32] map[Name:cpumask OffsetBits:1408 SizeBits:64] map[Name:list OffsetBits:1472 SizeBits:128] map[Name:owner OffsetBits:1600 SizeBits:64]]]
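The alignment question can be checked mechanically. The sketch below is a user-space approximation, not the kernel struct: `clock_event_device_head` is a hypothetical mirror of only the first four fields listed above, with simplified callback signatures, and it assumes an LP64 target where pointers are 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t ktime_t;

/* Hypothetical mirror of the leading fields reported by the tool (LP64 assumed). */
struct clock_event_device_head {
	void (*event_handler)(void *);                /* OffsetBits:0   */
	int (*set_next_event)(unsigned long, void *); /* OffsetBits:64  */
	int (*set_next_ktime)(ktime_t, void *);       /* OffsetBits:128 */
	ktime_t next_event;                           /* OffsetBits:192 */
};

/* A field is naturally aligned when its offset is a multiple of its size. */
static int is_naturally_aligned(size_t offset_bits, size_t size_bits)
{
	return offset_bits % size_bits == 0;
}
```

For `next_event`, the reported `OffsetBits:192` and `SizeBits:64` give 192 % 64 == 0, i.e. a byte offset of 24 from the start of the struct, so the field is 8-byte aligned as long as the struct itself is.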

gemini-3-flash-preview

							input: 44013
							output: 41
							thoughts: 1917

**Benign Data Race in Timer Debug Output**

Okay, so I've been looking at a reported data race on the `next_event` field of `struct clock_event_device`, which each per-CPU `tick_device` points to. Given the struct's cacheline alignment and the offset of `next_event` (192 bits, i.e. 24 bytes, a multiple of 8), the access should be atomic on a 64-bit system. Even if it weren't atomic on 32-bit, the value is only used for debug output here, so it's probably fine either way.

The write-side stack trace shows `hrtimer_interrupt` running in an interrupt that arrived while `folio_mark_accessed` was executing; that interrupted code is unrelated to the race. The stack traces confirm that the write happens in `hrtimer_interrupt` and the read in `print_tickdevice` on a different CPU. An interrupt arriving in the middle of other code is normal.

Digging further, I realized that `hrtimer_interrupt` writes to `dev->next_event`, where `dev` is the local CPU's clock event device. The crucial point here is that `print_tickdevice` iterates through *all* CPUs to read their respective `tick_device` structures. This means CPU 0 is reading CPU 1's `next_event` field while CPU 1 is potentially updating it. That confirms that the race occurs between the interrupt handler on CPU 1 and a task on CPU 0.

So `print_tickdevice` (on CPU 0) reads the `next_event` of CPU 1's clock event device while `hrtimer_interrupt` (on CPU 1) writes to *its own* `next_event`. Because the `/proc/timer_list` output covers every CPU's tick device, this cross-CPU read is expected and a side effect of the design.

Specifically, the report indicates the write in `hrtimer_interrupt` happens at line 1882 in the current version of the kernel. I can confirm that this line indeed corresponds to `dev->next_event = KTIME_MAX;`.

Now, the important question is: is this *harmful*? `print_tickdevice` dumps timer information via `/proc/timer_list`, so it's primarily a debugging tool and not on any critical path. The race involves the owning CPU writing `KTIME_MAX` (the largest `ktime_t` value, used here as a placeholder) during interrupt handling while another CPU reads the field. The handler calls `tick_program_event()` before returning, so a later read should pick up the newly programmed expiry.

In `hrtimer_interrupt`, the `next_event` field is set to `KTIME_MAX` and later updated. This value indicates the device is either being reprogrammed or the current event has been handled. Another CPU reading the `KTIME_MAX` value could temporarily think there's no pending event, but this is fine, it's only for the debug interface.
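The sequence of values a concurrent reader can observe is small enough to model directly. This is a single-threaded user-space sketch under stated assumptions: `dev_model` and `handler_model` are hypothetical names, and the second write stands in for the update that the kernel performs through `tick_program_event()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t ktime_t;
#define KTIME_MAX INT64_MAX

static_assert(sizeof(ktime_t) == 8, "ktime_t must be 64 bits");

/* Hypothetical stand-in for the one field of interest. */
struct dev_model {
	ktime_t next_event;
};

/*
 * Models the handler's two writes and records what a reader would see if it
 * sampled before the handler, during it, and after reprogramming.
 */
static size_t handler_model(struct dev_model *dev, ktime_t new_expiry,
			    ktime_t seen[3])
{
	size_t n = 0;

	seen[n++] = dev->next_event;  /* sample before the handler runs */
	dev->next_event = KTIME_MAX;  /* entry write, as at line 1882 */
	seen[n++] = dev->next_event;  /* sample while the handler runs */
	dev->next_event = new_expiry; /* reprogramming step */
	seen[n++] = dev->next_event;  /* sample after reprogramming */

	return n;
}
```

Every sample is either the old expiry, `KTIME_MAX`, or the new expiry; none of them is a value the field never held.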

Finally, the report also mentioned `cpu_base->expires_next`. I verified the write operations happen in `hrtimer_interrupt`, lines 1895 and 1911 in `hrtimer.c`, but both sit between `raw_spin_lock_irqsave()` at line 1884 and `raw_spin_unlock_irqrestore()` at line 1913, so they are made with `cpu_base->lock` held and are not the unprotected access in question. So, ultimately this race condition, specifically affecting the debug output, is completely benign.

map[ContextFile:include/linux/clockchips.h IncludeLines:true Name:ktime_t]

map[SourceCode: 126:	/* Nanosecond scalar representation for kernel time values */
 127:	typedef s64	ktime_t;
 SourceFile:include/linux/types.h]

gemini-3-flash-preview

							input: 46785
							output: 19
							thoughts: 1592

**Benign Race Analysis of `dev->next_event` in `hrtimer_interrupt`**

Okay, so I'm looking at this code and confirming my initial assessment. It's a `s64` - got it. The race condition seems benign, and I'm confident about that. I need to justify this conclusion, though, given the KCSAN report.

The specific issue is that `dev->next_event` is written in `hrtimer_interrupt` outside of a lock. Specifically:

```c
1882:		dev->next_event = KTIME_MAX;
```

This write isn't protected by `cpu_base->lock`. I thought that seemed odd at first. However, the interrupt context of this write *is* local to the CPU that's running it. So, while it's written before the lock is acquired, the typical use case is that no other task on that *same* CPU can interfere. I need to be sure. The concern is cross-CPU access, and the context here is `print_tickdevice`, specifically looking at `/proc/timer_list`.

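My first thought was, maybe they should be using `READ_ONCE` or `WRITE_ONCE`. Those would be the usual way to mark an intentionally lockless access, since plain C accesses carry no formal guarantee against the compiler splitting or repeating them. In practice, though, `ktime_t` is an aligned 64-bit value, the KCSAN report shows single 8-byte accesses on both sides, and such accesses are atomic on x86_64. So the annotation would be a cleanup rather than a functional fix.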
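For reference, here is a minimal user-space sketch of what a `READ_ONCE`/`WRITE_ONCE`-style annotation does for this field. The helpers `read_once_ktime` and `write_once_ktime` are hypothetical stand-ins, not kernel APIs; they use volatile casts, which is the same basic idea as the kernel macros but not their actual implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ktime_t;
#define KTIME_MAX INT64_MAX

static_assert(sizeof(ktime_t) == 8, "ktime_t must be 64 bits");

/* Hypothetical helper: perform exactly one load of *p via a volatile access. */
static inline ktime_t read_once_ktime(const ktime_t *p)
{
	return *(const volatile ktime_t *)p;
}

/* Hypothetical helper: perform exactly one store to *p via a volatile access. */
static inline void write_once_ktime(ktime_t *p, ktime_t v)
{
	*(volatile ktime_t *)p = v;
}
```

The volatile access stops the compiler from fusing or repeating the load or store. It adds no ordering against other memory accesses, which matches what a display-only reader needs here.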

I need to confirm there aren't *other* races in `timer_list.c`. I browsed the code quickly and saw that many fields are read without locks. Locking is used in places: `print_active_timers` takes `base->cpu_base->lock`, but `print_base` and `print_cpu` read fields without it. And `print_tickdevice`, the function I'm concerned about, takes no lock on `dev`.

`struct tick_device` is just a pointer to a `clock_event_device` along with an enum. There's no lock here. The `clock_event_device` is normally touched by its owning CPU with interrupts disabled, and, as noted above, the write in question is not covered by the per-CPU `hrtimer_cpu_base->lock`.

Since `print_tickdevice` is just for `/proc/timer_list`, I think it is acceptable that it races. It's not controlling anything critical, it's just presenting information.

Back to line 1882 in `hrtimer_interrupt`, the writes to `dev->next_event` (direct at 1882 and indirect through `tick_program_event` at 1916) are outside the protection of the lock. I see that! `tick_program_event` gets called *after* `raw_spin_unlock_irqrestore(&cpu_base->lock, flags);`. So, it's confirmed.

This is very likely an "unprotected" field for a specific reason – it's mostly per-CPU, and it's being read in a context that does *not* have the lock. I should check if there are other reports for this. Something like "hrtimer_interrupt data-race print_tickdevice" would be ideal. I don't have a search tool for that. But I'll trust my judgement for now. It's benign.

Interrupts are disabled locally, so no other task on *this* CPU can interfere. But another CPU *can* read it.

The field is being accessed like so:

```c
	SEQ_printf(m, " next_event:     %Ld nsecs\n",
		   (unsigned long long) ktime_to_ns(dev->next_event));
```

`ktime_to_ns` just does a direct read. No funny business.
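To see what a reader of `/proc/timer_list` would get if the read lands while the field holds `KTIME_MAX`, here is a small user-space sketch of the same formatting. `format_next_event` is a hypothetical helper, and it uses `snprintf` with `%lld` in place of the kernel's `SEQ_printf` with `%Ld`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t ktime_t;
#define KTIME_MAX INT64_MAX

static_assert(sizeof(ktime_t) == 8, "ktime_t must be 64 bits");

/* Hypothetical helper: formats the field the way the next_event line is laid out. */
static int format_next_event(char *buf, size_t size, ktime_t next_event)
{
	return snprintf(buf, size, " next_event:     %lld nsecs\n",
			(long long)next_event);
}
```

For `KTIME_MAX` the line shows 9223372036854775807 nsecs. That looks odd in the output but is a legitimate value of the field, not corruption.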

I should go through the "harmful" criteria. Is there:
- Memory corruption/crash? No.
- Complex container? No, it's a simple `s64`.
- Synchronization between other memory accesses? No.

`dev->next_event` holds the programmed expiry time. If `print_tickdevice` sees `KTIME_MAX` instead of the next event, it just prints `KTIME_MAX`. The read has no effect on the hardware, which is what actually fires the timer once it has been programmed through `dev->set_next_event`.

The hardware is programmed *after* `dev->next_event` is updated. That helps.
In `hrtimer_interrupt`: It's set to `KTIME_MAX` while the interrupt handler is running and then set to a new `expires

map[Benign:true Confident:true]

map[Benign:true Confident:true]

gemini-3-flash-preview

							input: 49438
							output: 560
							thoughts: 250

**Benign Data Race in the Timer System**

Okay, so I've looked at this data race, and I'm pretty confident it's harmless. The race is between `hrtimer_interrupt`, which updates `dev->next_event`, and `print_tickdevice`, which reads it. `dev->next_event` is just a 64-bit integer, a `ktime_t` in the `clock_event_device` struct. `hrtimer_interrupt` is the interrupt handler; it sets this field to `KTIME_MAX` on entry, and the field is later updated to the next expiry via `tick_program_event()`. `print_tickdevice` gets called when someone reads `/proc/timer_list`, the diagnostic interface for the timer system.

Since this is a 64-bit integer on a 64-bit system, the access is atomic, which in itself all but eliminates the problem. But even if it wasn't atomic, the only consequence would be a slightly incorrect value being displayed in `/proc/timer_list`. Crucially, and this is important, `print_tickdevice` doesn't actually *use* the value for any critical decision-making or timer logic. It's purely for display purposes, so even a bad read wouldn't affect the timer itself.

Furthermore, `hrtimer_interrupt` writes with interrupts disabled on the local CPU, which certainly helps. And the read in `print_tickdevice` is essentially a cross-CPU read for diagnostics. This is a classic example of a benign race condition, one in statistical or debug output, where a small inconsistency isn't catastrophic. It's a non-issue as far as I can tell.
