syzbot


ID: a5ba41ad-d030-416f-8c90-39f924d83fc6
Workflow: assessment-kcsan
Result: Benign: ✅  Confident: ✅
Bug: KCSAN: data-race in fill_mg_cmtime / inode_set_ctime_current (2)
Created: 2026/01/26 23:20
Started: 2026/01/26 23:20
Finished: 2026/01/26 23:22
Revision: efb3e894dd7e30475c92970989c58e5aad3d5e7b

Explanation:
The data race occurs on the `i_ctime_sec` field of `struct inode`, which is a `time64_t` (8-byte integer) representing the seconds part of the inode's change time.

### Analysis

The race involves two functions:
1.  **`inode_set_ctime_current` (Writer):** This function is part of the multigrain timestamp implementation. It updates the inode's `ctime`. It first uses `try_cmpxchg` to atomically update `i_ctime_nsec` (nanoseconds part and some flag bits) and then updates `i_ctime_sec` with a simple assignment:
    ```c
    if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) {
        inode->i_ctime_sec = now.tv_sec; // Write to i_ctime_sec
        ...
    }
    ```
2.  **`fill_mg_cmtime` (Reader):** This function is called during `getattr` to fill in the status information. It reads the seconds part and then the nanoseconds part:
    ```c
    stat->ctime.tv_sec = inode->i_ctime_sec; // Read from i_ctime_sec
    stat->ctime.tv_nsec = (u32)atomic_read(pcn); // Read from i_ctime_nsec
    ```

Because the writer updates `nsec` before `sec`, while the reader reads `sec` before `nsec`, there is a small window in which the reader sees the old `sec` value combined with the new `nsec` value. If the update crosses a second boundary (e.g., from 100.999s to 101.001s), the reader can observe an inconsistent timestamp (e.g., 100.001s) that is earlier than the ctime that was current before the update.
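The problematic interleaving can be replayed as a deterministic userspace simulation. This is an illustrative sketch, not kernel code: the globals are hypothetical stand-ins mirroring the inode fields, and the four steps are executed in one thread in the order the race would interleave them:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for inode->i_ctime_sec / inode->i_ctime_nsec.
 * Starting ctime is 100.999s; the writer is updating it to 101.001s. */
int64_t i_ctime_sec = 100;
uint32_t i_ctime_nsec = 999000000;

int64_t observed_sec;
uint32_t observed_nsec;

/* One possible interleaving of inode_set_ctime_current (writer) and
 * fill_mg_cmtime (reader), replayed deterministically in one thread. */
void race_window(void)
{
    i_ctime_nsec = 1000000;       /* writer: cmpxchg publishes new nsec  */
    observed_sec = i_ctime_sec;   /* reader: still sees the old seconds  */
    i_ctime_sec = 101;            /* writer: plain store of new seconds  */
    observed_nsec = i_ctime_nsec; /* reader: sees the new nanoseconds    */
}
```

Replaying this interleaving yields an observed ctime of 100.001s, which is earlier than the 100.999s that was current before the update began.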

### Conclusion

Despite the potential for a slightly inconsistent timestamp being reported to userspace for a very brief period, this data race is **benign** for the following reasons:

1.  **Primitive Data Type:** The race is on a `time64_t` field, which is a primitive integer type. On 64-bit architectures, aligned 8-byte loads and stores are atomic, so no "torn" values (half-old, half-new) will be read.
2.  **No Corruption or Stability Issues:** The race does not involve pointers, linked lists, or complex data structures. It cannot lead to memory corruption, kernel crashes, or invalid memory accesses.
3.  **Traditional VFS Behavior:** Timestamps in the Linux VFS have traditionally been accessed without heavy locking in the `getattr` (stat) path for performance reasons. While the multigrain timestamp feature introduces more complexity in the nanoseconds field to track queries, the seconds field remains a simple informational value.
4.  **Limited Impact:** The inconsistency is transient and only affects the consistency of a timestamp reported during the race window. It does not violate any core kernel invariants.

Following the KCSAN definition of benign races (accesses to simple fields not requiring mutual exclusion and not leading to corruption), this report is a benign data race.
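If a report like this were to be silenced rather than fixed, the usual kernel idiom is to mark the intentionally racy plain accesses. The following is a minimal userspace sketch of that idiom, using volatile-cast stand-ins for the kernel's `READ_ONCE()`/`WRITE_ONCE()` macros; whether maintainers would annotate this particular site is an assumption, not something the report states:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() macros,
 * approximated here with volatile casts. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x) (*(volatile const __typeof__(x) *)&(x))

int64_t i_ctime_sec; /* stand-in for the racy inode field */

/* Writer side: a marked store documents that the concurrent access
 * is intentional and prevents the compiler from tearing it. */
void set_ctime_sec(int64_t now_sec)
{
    WRITE_ONCE(i_ctime_sec, now_sec);
}

/* Reader side: a marked load, as the fill_mg_cmtime snapshot could be. */
int64_t get_ctime_sec(void)
{
    return READ_ONCE(i_ctime_sec);
}
```

In-kernel, wrapping the read in `data_race()` is an alternative when the racy value is only used opportunistically; either marking would stop KCSAN from flagging the access pair.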

Crash report:
==================================================================
BUG: KCSAN: data-race in fill_mg_cmtime / inode_set_ctime_current

write to 0xffff8881039e7250 of 8 bytes by task 3305 on cpu 0:
 inode_set_ctime_current+0x592/0x810 fs/inode.c:2843
 shmem_mknod+0x127/0x170 mm/shmem.c:3889
 shmem_create+0x34/0x50 mm/shmem.c:3943
 lookup_open fs/namei.c:4449 [inline]
 open_last_lookups fs/namei.c:4549 [inline]
 path_openat+0x1121/0x23c0 fs/namei.c:4793
 do_filp_open+0x15d/0x280 fs/namei.c:4823
 do_sys_openat2+0xa6/0x150 fs/open.c:1430
 do_sys_open fs/open.c:1436 [inline]
 __do_sys_openat fs/open.c:1452 [inline]
 __se_sys_openat fs/open.c:1447 [inline]
 __x64_sys_openat+0xf2/0x120 fs/open.c:1447
 x64_sys_call+0x2b07/0x3000 arch/x86/include/generated/asm/syscalls_64.h:258
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881039e7250 of 8 bytes by task 3306 on cpu 1:
 fill_mg_cmtime+0x86/0x260 fs/stat.c:56
 generic_fillattr+0x24a/0x340 fs/stat.c:99
 shmem_getattr+0x181/0x200 mm/shmem.c:1274
 vfs_getattr_nosec+0x146/0x1e0 fs/stat.c:213
 vfs_getattr+0x48/0x60 fs/stat.c:262
 vfs_statx_path+0x28/0x140 fs/stat.c:299
 vfs_statx+0xc4/0x170 fs/stat.c:356
 vfs_fstatat+0x115/0x170 fs/stat.c:375
 __do_sys_newfstatat fs/stat.c:542 [inline]
 __se_sys_newfstatat+0x55/0x3e0 fs/stat.c:536
 __x64_sys_newfstatat+0x55/0x70 fs/stat.c:536
 x64_sys_call+0x111f/0x3000 arch/x86/include/generated/asm/syscalls_64.h:263
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000000773d97c1 -> 0x00000000773d97c2

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 3306 Comm: udevd Tainted: G        W           syzkaller #0 PREEMPT(voluntary) 
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
==================================================================

Trajectory:
Seq Timestamp Type Name Duration
0/0 2026/01/26 23:20 flow assessment-kcsan
2m Results:
map[Benign:true Confident:true Explanation:(verbatim duplicate of the Explanation section above)]

1/1 2026/01/26 23:20 action kernel-checkouter
0m Results:
map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/8c64ca34582568f44cbf475e4f60a6f0b9d98e80]

2/1 2026/01/26 23:20 action kernel-builder
0m Results:
map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/3ce537d20de7716e72da0174003c11b07ad22515]

3/1 2026/01/26 23:20 action codesearch-prepare
0m Results:
map[Index:codesearch-index]

4/1 2026/01/26 23:20 agent expert
2m Model:
gemini-3-flash-preview

Results:
map[Benign:true Confident:true]

Instruction:
You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data race is one on a simple int/bool variable or similar field,
where the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flag fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful") data race can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data race is a race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.


Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

Prompt:
The data race report is:

(verbatim duplicate of the KCSAN crash report shown above)


Reply:
(verbatim duplicate of the Explanation section above)

5/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 3326
							output: 81
							thoughts: 127
						

Thoughts:

**Investigating a Potential Data Race**

Alright, this report has caught my attention. It suggests a potential data race, and it's flagged on the interaction between `inode_set_ctime_current` and `fill_mg_cmtime`. Specifically, it appears there's a read/write conflict on a memory location that contains either a `timespec64` structure or something similar – eight bytes, and the reported values changing are strongly suggestive of a timestamp.

My immediate next step is to drill into the code. I need to understand what these two functions are actually doing. I'll need to locate the definitions of `inode_set_ctime_current` and `fill_mg_cmtime` within the relevant kernel code base. From there, I'll review their implementations very closely to see how they're handling this timestamp or related data. The key is understanding the access patterns: what locks are in place, how the reads and writes are performed, and what other data might be involved.



6/2 2026/01/26 23:20 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/inode.c IncludeLines:true Name:inode_set_ctime_current]

Results:
map[SourceCode:2784:	/**
2785:	 * inode_set_ctime_current - set the ctime to current_time
2786:	 * @inode: inode
2787:	 *
2788:	 * Set the inode's ctime to the current value for the inode. Returns the
2789:	 * current value that was assigned. If this is not a multigrain inode, then we
2790:	 * set it to the later of the coarse time and floor value.
2791:	 *
2792:	 * If it is multigrain, then we first see if the coarse-grained timestamp is
2793:	 * distinct from what is already there. If so, then use that. Otherwise, get a
2794:	 * fine-grained timestamp.
2795:	 *
2796:	 * After that, try to swap the new value into i_ctime_nsec. Accept the
2797:	 * resulting ctime, regardless of the outcome of the swap. If it has
2798:	 * already been replaced, then that timestamp is later than the earlier
2799:	 * unacceptable one, and is thus acceptable.
2800:	 */
2801:	struct timespec64 inode_set_ctime_current(struct inode *inode)
2802:	{
2803:		struct timespec64 now;
2804:		u32 cns, cur;
2805:	
2806:		ktime_get_coarse_real_ts64_mg(&now);
2807:		now = timestamp_truncate(now, inode);
2808:	
2809:		/* Just return that if this is not a multigrain fs */
2810:		if (!is_mgtime(inode)) {
2811:			inode_set_ctime_to_ts(inode, now);
2812:			goto out;
2813:		}
2814:	
2815:		/*
2816:		 * A fine-grained time is only needed if someone has queried
2817:		 * for timestamps, and the current coarse grained time isn't
2818:		 * later than what's already there.
2819:		 */
2820:		cns = smp_load_acquire(&inode->i_ctime_nsec);
2821:		if (cns & I_CTIME_QUERIED) {
2822:			struct timespec64 ctime = { .tv_sec = inode->i_ctime_sec,
2823:						    .tv_nsec = cns & ~I_CTIME_QUERIED };
2824:	
2825:			if (timespec64_compare(&now, &ctime) <= 0) {
2826:				ktime_get_real_ts64_mg(&now);
2827:				now = timestamp_truncate(now, inode);
2828:				mgtime_counter_inc(mg_fine_stamps);
2829:			}
2830:		}
2831:		mgtime_counter_inc(mg_ctime_updates);
2832:	
2833:		/* No need to cmpxchg if it's exactly the same */
2834:		if (cns == now.tv_nsec && inode->i_ctime_sec == now.tv_sec) {
2835:			trace_ctime_xchg_skip(inode, &now);
2836:			goto out;
2837:		}
2838:		cur = cns;
2839:	retry:
2840:		/* Try to swap the nsec value into place. */
2841:		if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) {
2842:			/* If swap occurred, then we're (mostly) done */
2843:			inode->i_ctime_sec = now.tv_sec;
2844:			trace_ctime_ns_xchg(inode, cns, now.tv_nsec, cur);
2845:			mgtime_counter_inc(mg_ctime_swaps);
2846:		} else {
2847:			/*
2848:			 * Was the change due to someone marking the old ctime QUERIED?
2849:			 * If so then retry the swap. This can only happen once since
2850:			 * the only way to clear I_CTIME_QUERIED is to stamp the inode
2851:			 * with a new ctime.
2852:			 */
2853:			if (!(cns & I_CTIME_QUERIED) && (cns | I_CTIME_QUERIED) == cur) {
2854:				cns = cur;
2855:				goto retry;
2856:			}
2857:			/* Otherwise, keep the existing ctime */
2858:			now.tv_sec = inode->i_ctime_sec;
2859:			now.tv_nsec = cur & ~I_CTIME_QUERIED;
2860:		}
2861:	out:
2862:		return now;
2863:	}
 SourceFile:fs/inode.c]

7/2 2026/01/26 23:20 tool codesearch-definition-source
0m Args:
map[ContextFile:fs/stat.c IncludeLines:true Name:fill_mg_cmtime]

Results:
map[SourceCode:  31:	/**
  32:	 * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
  33:	 * @stat: where to store the resulting values
  34:	 * @request_mask: STATX_* values requested
  35:	 * @inode: inode from which to grab the c/mtime
  36:	 *
  37:	 * Given @inode, grab the ctime and mtime out if it and store the result
  38:	 * in @stat. When fetching the value, flag it as QUERIED (if not already)
  39:	 * so the next write will record a distinct timestamp.
  40:	 *
  41:	 * NB: The QUERIED flag is tracked in the ctime, but we set it there even
  42:	 * if only the mtime was requested, as that ensures that the next mtime
  43:	 * change will be distinct.
  44:	 */
  45:	void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
  46:	{
  47:		atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
  48:	
  49:		/* If neither time was requested, then don't report them */
  50:		if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
  51:			stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
  52:			return;
  53:		}
  54:	
  55:		stat->mtime = inode_get_mtime(inode);
  56:		stat->ctime.tv_sec = inode->i_ctime_sec;
  57:		stat->ctime.tv_nsec = (u32)atomic_read(pcn);
  58:		if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
  59:			stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
  60:		stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
  61:		trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
  62:	}
 SourceFile:fs/stat.c]

8/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 5760
							output: 44
							thoughts: 792
						

Thoughts:
**Analyzing a Potential Time Stamp Inconsistency in the Kernel**

Okay, here's what's running through my head. The report is pointing to a potential data race related to file ctime updates. The core of it seems to be in the interaction between `inode_set_ctime_current` and `fill_mg_cmtime`, specifically concerning the `i_ctime_sec` and `i_ctime_nsec` fields of an inode.

The report clearly states a write of 8 bytes to `0xffff8881039e7250` by task 3305 on cpu 0, and importantly, it references `inode_set_ctime_current+0x592/0x810 fs/inode.c:2843`. That address and write size correspond to `i_ctime_sec`, which is a `time64_t`, so the write matches what the kernel would do. We have a matching read at `fill_mg_cmtime+0x86/0x260 fs/stat.c:56`, which accesses the same `i_ctime_sec`.

So, we're looking at a race condition: `inode_set_ctime_current` updating this timestamp, and `fill_mg_cmtime` reading it. Digging into the code... `inode_set_ctime_current` does a CAS on the nanoseconds first: `try_cmpxchg(&inode->i_ctime_nsec, ...)` and *then* updates the seconds: `inode->i_ctime_sec = now.tv_sec;`. That's an important order of operations.

In `fill_mg_cmtime`, the code reads the seconds *before* the nanoseconds: `stat->ctime.tv_sec = inode->i_ctime_sec;` then `stat->ctime.tv_nsec = (u32)atomic_read(pcn);`. So, it's reading the timestamp in a non-atomic manner.

The crucial issue is the read-write order: If `fill_mg_cmtime` reads `i_ctime_sec` *after* the `i_ctime_nsec` is updated but *before* `i_ctime_sec` is updated, it can grab an inconsistent time stamp value. I can see how this can happen:

1.  `inode_set_ctime_current` updates `i_ctime_nsec`.
2.  `fill_mg_cmtime` reads `i_ctime_sec` (potentially the *old* value).
3.  `inode_set_ctime_current` updates `i_ctime_sec`.
4.  `fill_mg_cmtime` reads `i_ctime_nsec` (the *new* value).

This can cause issues. The timestamp can go backward.

Ah, but let's not jump to conclusions yet. The nanosecond field in multigrain timestamps has special bits; I recall that `i_ctime_nsec` carries `I_CTIME_QUERIED`, which is likely bit 31. So we need to check how these bits affect the code path and whether they might interfere with our analysis. I'll need to double-check the use of `I_CTIME_QUERIED` and how it might influence the race condition. Let's see...



9/2 2026/01/26 23:20 tool codesearch-definition-source
0m Error:
requested entity does not exist

Args:
map[ContextFile:include/linux/fs.h IncludeLines:true Name:I_CTIME_QUERIED]

Results:
map[SourceCode: SourceFile:]

10/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 7345
							output: 26
							thoughts: 23
						

Thoughts:

**Pinpointing the Location**

Hmm, I'm trying to track down the definition of this thing... and it's not immediately obvious where it's located. My first hunch is that it's probably tucked away in a header file, likely something related to the file system, perhaps `fs.h` or something similarly named. I should run a search. It's time to dig into the header files to find its declaration.



11/2 2026/01/26 23:20 tool codesearch-find-references
0m Error:
requested entity does not exist

Args:
map[Name:I_CTIME_QUERIED]

Results:
map[References:<nil> TruncatedOutput:false]

12/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 7535
							output: 20
							thoughts: 60
						

Thoughts:
**Investigating an Unfamiliar Definition**

Hmm, that's odd. I don't recognize that particular term. Perhaps it's being defined somewhere with a prefix I'm overlooking, or maybe I just fat-fingered the input and mis-typed it somehow? Time to go hunting. Since the error indicates it's used within `fs/inode.c`, that's a good starting point. I'll open up that file and immediately zero in on the lines referencing the term, to quickly identify where the compiler is complaining and hopefully find the definition lurking nearby. A quick scan should clear this up.



13/2 2026/01/26 23:20 tool read-file
0m Args:
map[File:fs/inode.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
 * (C) 1997 Linus Torvalds
 * (C) 1999 Andrea Arcangeli <andrea@suse.de> (dynamic inode allocation)
 */
#include <linux/export.h>
#include <linux/fs.h>
#include <linux/filelock.h>
#include <linux/mm.h>
#include <linux/backing-dev.h>
#include <linux/hash.h>
#include <linux/swap.h>
#include <linux/security.h>
#include <linux/cdev.h>
#include <linux/memblock.h>
#include <linux/fsnotify.h>
#include <linux/mount.h>
#include <linux/posix_acl.h>
#include <linux/buffer_head.h> /* for inode_has_buffers */
#include <linux/ratelimit.h>
#include <linux/list_lru.h>
#include <linux/iversion.h>
#include <linux/rw_hint.h>
#include <linux/seq_file.h>
#include <linux/debugfs.h>
#include <trace/events/writeback.h>
#define CREATE_TRACE_POINTS
#include <trace/events/timestamp.h>

#include "internal.h"

/*
 * Inode locking rules:
 *
 * inode->i_lock protects:
 *   inode->i_state, inode->i_hash, __iget(), inode->i_io_list
 * Inode LRU list locks protect:
 *   inode->i_sb->s_inode_lru, inode->i_lru
 * inode->i_sb->s_inode_list_lock protects:
 *   inode->i_sb->s_inodes, inode->i_sb_list
 * bdi->wb.list_lock protects:
 *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
 * inode_hash_lock protects:
 *   inode_hashtable, inode->i_hash
 *
 * Lock ordering:
 *
 * inode->i_sb->s_inode_list_lock
 *   inode->i_lock
 *     Inode LRU list locks
 *
 * bdi->wb.list_lock
 *   inode->i_lock
 *
 * inode_hash_lock
 *   inode->i_sb->s_inode_list_lock
 *   inode->i_lock
 *
 * iunique_lock
 *   inode_hash_lock
 */

static unsigned int i_hash_mask __ro_after_init;
static unsigned int i_hash_shift __ro_after_init;
static struct hlist_head *inode_hashtable __ro_after_init;
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);

/*
 * Empty aops. Can be used for the cases where the user does not
 * define any of the address_space operations.
 */
const struct address_space_operations empty_aops = {
};
EXPORT_SYMBOL(empty_aops);

static DEFINE_PER_CPU(unsigned long, nr_inodes);
static DEFINE_PER_CPU(unsigned long, nr_unused);

static struct kmem_cache *inode_cachep __ro_after_init;

static long get_nr_inodes(void)
{
	int i;
	long sum = 0;
	for_each_possible_cpu(i)
		sum += per_cpu(nr_inodes, i);
	return sum < 0 ? 0 : sum;
}

static inline long get_nr_inodes_unused(void)
{
	int i;
	long sum = 0;
	for_each_possible_cpu(i)
		sum += per_cpu(nr_unused, i);
	return sum < 0 ? 0 : sum;
}

long get_nr_dirty_inodes(void)
{
	/* not actually dirty inodes, but a wild approximation */
	long nr_dirty = get_nr_inodes() - get_nr_inodes_unused();
	return nr_dirty > 0 ? nr_dirty : 0;
}

#ifdef CONFIG_DEBUG_FS
static DEFINE_PER_CPU(long, mg_ctime_updates);
static DEFINE_PER_CPU(long, mg_fine_stamps);
static DEFINE_PER_CPU(long, mg_ctime_swaps);

static unsigned long get_mg_ctime_updates(void)
{
	unsigned long sum = 0;
	int i;

	for_each_possible_cpu(i)
		sum += data_race(per_cpu(mg_ctime_updates, i));
	return sum;
}

static unsigned long get_mg_fine_stamps(void)
{
	unsigned long sum = 0;
	int i;

	for_each_possible_cpu(i)
		sum += data_race(per_cpu(mg_fine_stamps, i));
	return sum;
}

static unsigned long get_mg_ctime_swaps(void)
{
	unsigned long sum = 0;
	int i;

	for_each_possible_cpu(i)
		sum += data_race(per_cpu(mg_ctime_swaps, i));
	return sum;
}

#define mgtime_counter_inc(__var)	this_cpu_inc(__var)

static int mgts_show(struct seq_file *s, void *p)
{
	unsigned long ctime_updates = get_mg_ctime_updates();
	unsigned long ctime_swaps = get_mg_ctime_swaps();
	unsigned long fine_stamps = get_mg_fine_stamps();
	unsigned long floor_swaps = timekeeping_get_mg_floor_swaps();

	seq_printf(s, "%lu %lu %lu %lu\n",
		   ctime_updates, ctime_swaps, fine_stamps, floor_swaps);
	return 0;
}

DEFINE_SHOW_ATTRIBUTE(mgts);

static int __init mg_debugfs_init(void)
{
	debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
	return 0;
}
late_initcall(mg_debugfs_init);

#else /* ! CONFIG_DEBUG_FS */

#define mgtime_counter_inc(__var)	do { } while (0)

#endif /* CONFIG_DEBUG_FS */

/*
 * Handle nr_inode sysctl
 */
#ifdef CONFIG_SYSCTL
/*
 * Statistics gathering..
 */
static struct inodes_stat_t inodes_stat;

static int proc_nr_inodes(const struct ctl_table *table, int write, void *buffer,
			  size_t *lenp, loff_t *ppos)
{
	inodes_stat.nr_inodes = get_nr_inodes();
	inodes_stat.nr_unused = get_nr_inodes_unused();
	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}

static const struct ctl_table inodes_sysctls[] = {
	{
		.procname	= "inode-nr",
		.data		= &inodes_stat,
		.maxlen		= 2*sizeof(long),
		.mode		= 0444,
		.proc_handler	= proc_nr_inodes,
	},
	{
		.procname	= "inode-state",
		.data		= &inodes_stat,
		.maxlen		= 7*sizeof(long),
		.mode		= 0444,
		.proc_handler	= proc_nr_inodes,
	},
};

static int __init init_fs_inode_sysctls(void)
{
	register_sysctl_init("fs", inodes_sysctls);
	return 0;
}
early_initcall(init_fs_inode_sysctls);
#endif

static int no_open(struct inode *inode, struct file *file)
{
	return -ENXIO;
}

/**
 * inode_init_always_gfp - perform inode structure initialisation
 * @sb: superblock inode belongs to
 * @inode: inode to initialise
 * @gfp: allocation flags
 *
 * These are initializations that need to be done on every inode
 * allocation as the fields are not initialised by slab allocation.
 * If there are additional allocations required @gfp is used.
 */
int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp)
{
	static const struct inode_operations empty_iops;
	static const struct file_operations no_open_fops = {.open = no_open};
	struct address_space *const mapping = &inode->i_data;

	inode->i_sb = sb;
	inode->i_blkbits = sb->s_blocksize_bits;
	inode->i_flags = 0;
	inode_state_assign_raw(inode, 0);
	atomic64_set(&inode->i_sequence, 0);
	atomic_set(&inode->i_count, 1);
	inode->i_op = &empty_iops;
	inode->i_fop = &no_open_fops;
	inode->i_ino = 0;
	inode->__i_nlink = 1;
	inode->i_opflags = 0;
	if (sb->s_xattr)
		inode->i_opflags |= IOP_XATTR;
	if (sb->s_type->fs_flags & FS_MGTIME)
		inode->i_opflags |= IOP_MGTIME;
	i_uid_write(inode, 0);
	i_gid_write(inode, 0);
	atomic_set(&inode->i_writecount, 0);
	inode->i_size = 0;
	inode->i_write_hint = WRITE_LIFE_NOT_SET;
	inode->i_blocks = 0;
	inode->i_bytes = 0;
	inode->i_generation = 0;
	inode->i_pipe = NULL;
	inode->i_cdev = NULL;
	inode->i_link = NULL;
	inode->i_dir_seq = 0;
	inode->i_rdev = 0;
	inode->dirtied_when = 0;

#ifdef CONFIG_CGROUP_WRITEBACK
	inode->i_wb_frn_winner = 0;
	inode->i_wb_frn_avg_time = 0;
	inode->i_wb_frn_history = 0;
#endif

	spin_lock_init(&inode->i_lock);
	lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);

	init_rwsem(&inode->i_rwsem);
	lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);

	atomic_set(&inode->i_dio_count, 0);

	mapping->a_ops = &empty_aops;
	mapping->host = inode;
	mapping->flags = 0;
	mapping->wb_err = 0;
	atomic_set(&mapping->i_mmap_writable, 0);
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	atomic_set(&mapping->nr_thps, 0);
#endif
	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
	mapping->i_private_data = NULL;
	mapping->writeback_index = 0;
	init_rwsem(&mapping->invalidate_lock);
	lockdep_set_class_and_name(&mapping->invalidate_lock,
				   &sb->s_type->invalidate_lock_key,
				   "mapping.invalidate_lock");
	if (sb->s_iflags & SB_I_STABLE_WRITES)
		mapping_set_stable_writes(mapping);
	inode->i_private = NULL;
	inode->i_mapping = mapping;
	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
#ifdef CONFIG_FS_POSIX_ACL
	inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED;
#endif

#ifdef CONFIG_FSNOTIFY
	inode->i_fsnotify_mask = 0;
#endif
	inode->i_flctx = NULL;

	if (unlikely(security_inode_alloc(inode, gfp)))
		return -ENOMEM;

	this_cpu_inc(nr_inodes);

	return 0;
}
EXPORT_SYMBOL(inode_init_always_gfp);

void free_inode_nonrcu(struct inode *inode)
{
	kmem_cache_free(inode_cachep, inode);
}
EXPORT_SYMBOL(free_inode_nonrcu);

static void i_callback(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);
	if (inode->free_inode)
		inode->free_inode(inode);
	else
		free_inode_nonrcu(inode);
}

/**
 *	alloc_inode 	- obtain an inode
 *	@sb: superblock
 *
 *	Allocates a new inode for the given superblock.
 *	The inode won't be chained on the superblock's s_inodes list.
 *	This means:
 *	- the fs can't be unmounted
 *	- quotas, fsnotify and writeback can't work on it
 */
struct inode *alloc_inode(struct super_block *sb)
{
	const struct super_operations *ops = sb->s_op;
	struct inode *inode;

	if (ops->alloc_inode)
		inode = ops->alloc_inode(sb);
	else
		inode = alloc_inode_sb(sb, inode_cachep, GFP_KERNEL);

	if (!inode)
		return NULL;

	if (unlikely(inode_init_always(sb, inode))) {
		if (ops->destroy_inode) {
			ops->destroy_inode(inode);
			if (!ops->free_inode)
				return NULL;
		}
		inode->free_inode = ops->free_inode;
		i_callback(&inode->i_rcu);
		return NULL;
	}

	return inode;
}

void __destroy_inode(struct inode *inode)
{
	BUG_ON(inode_has_buffers(inode));
	inode_detach_wb(inode);
	security_inode_free(inode);
	fsnotify_inode_delete(inode);
	locks_free_lock_context(inode);
	if (!inode->i_nlink) {
		WARN_ON(atomic_long_read(&inode->i_sb->s_remove_count) == 0);
		atomic_long_dec(&inode->i_sb->s_remove_count);
	}

#ifdef CONFIG_FS_POSIX_ACL
	if (inode->i_acl && !is_uncached_acl(inode->i_acl))
		posix_acl_release(inode->i_acl);
	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
		posix_acl_release(inode->i_default_acl);
#endif
	this_cpu_dec(nr_inodes);
}
EXPORT_SYMBOL(__destroy_inode);

static void destroy_inode(struct inode *inode)
{
	const struct super_operations *ops = inode->i_sb->s_op;

	BUG_ON(!list_empty(&inode->i_lru));
	__destroy_inode(inode);
	if (ops->destroy_inode) {
		ops->destroy_inode(inode);
		if (!ops->free_inode)
			return;
	}
	inode->free_inode = ops->free_inode;
	call_rcu(&inode->i_rcu, i_callback);
}

/**
 * drop_nlink - directly drop an inode's link count
 * @inode: inode
 *
 * This is a low-level filesystem helper to replace any
 * direct filesystem manipulation of i_nlink.  In cases
 * where we are attempting to track writes to the
 * filesystem, a decrement to zero means an imminent
 * write when the file is truncated and actually unlinked
 * on the filesystem.
 */
void drop_nlink(struct inode *inode)
{
	WARN_ON(inode->i_nlink == 0);
	inode->__i_nlink--;
	if (!inode->i_nlink)
		atomic_long_inc(&inode->i_sb->s_remove_count);
}
EXPORT_SYMBOL(drop_nlink);

/**
 * clear_nlink - directly zero an inode's link count
 * @inode: inode
 *
 * This is a low-level filesystem helper to replace any
 * direct filesystem manipulation of i_nlink.  See
 * drop_nlink() for why we care about i_nlink hitting zero.
 */
void clear_nlink(struct inode *inode)
{
	if (inode->i_nlink) {
		inode->__i_nlink = 0;
		atomic_long_inc(&inode->i_sb->s_remove_count);
	}
}
EXPORT_SYMBOL(clear_nlink);

/**
 * set_nlink - directly set an inode's link count
 * @inode: inode
 * @nlink: new nlink (should be non-zero)
 *
 * This is a low-level filesystem helper to replace any
 * direct filesystem manipulation of i_nlink.
 */
void set_nlink(struct inode *inode, unsigned int nlink)
{
	if (!nlink) {
		clear_nlink(inode);
	} else {
		/* Yes, some filesystems do change nlink from zero to one */
		if (inode->i_nlink == 0)
			atomic_long_dec(&inode->i_sb->s_remove_count);

		inode->__i_nlink = nlink;
	}
}
EXPORT_SYMBOL(set_nlink);

/**
 * inc_nlink - directly increment an inode's link count
 * @inode: inode
 *
 * This is a low-level filesystem helper to replace any
 * direct filesystem manipulation of i_nlink.  Currently,
 * it is only here for parity with dec_nlink().
 */
void inc_nlink(struct inode *inode)
{
	if (unlikely(inode->i_nlink == 0)) {
		WARN_ON(!(inode_state_read_once(inode) & I_LINKABLE));
		atomic_long_dec(&inode->i_sb->s_remove_count);
	}

	inode->__i_nlink++;
}
EXPORT_SYMBOL(inc_nlink);

static void __address_space_init_once(struct address_space *mapping)
{
	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
	init_rwsem(&mapping->i_mmap_rwsem);
	INIT_LIST_HEAD(&mapping->i_private_list);
	spin_lock_init(&mapping->i_private_lock);
	mapping->i_mmap = RB_ROOT_CACHED;
}

void address_space_init_once(struct address_space *mapping)
{
	memset(mapping, 0, sizeof(*mapping));
	__address_space_init_once(mapping);
}
EXPORT_SYMBOL(address_space_init_once);

/*
 * These are initializations that only need to be done
 * once, because the fields are idempotent across use
 * of the inode, so make the slab allocator aware of that.
 */
void inode_init_once(struct inode *inode)
{
	memset(inode, 0, sizeof(*inode));
	INIT_HLIST_NODE(&inode->i_hash);
	INIT_LIST_HEAD(&inode->i_devices);
	INIT_LIST_HEAD(&inode->i_io_list);
	INIT_LIST_HEAD(&inode->i_wb_list);
	INIT_LIST_HEAD(&inode->i_lru);
	INIT_LIST_HEAD(&inode->i_sb_list);
	__address_space_init_once(&inode->i_data);
	i_size_ordered_init(inode);
}
EXPORT_SYMBOL(inode_init_once);

static void init_once(void *foo)
{
	struct inode *inode = (struct inode *) foo;

	inode_init_once(inode);
}

/*
 * Get an additional reference to the inode; the caller must already hold one.
 */
void ihold(struct inode *inode)
{
	WARN_ON(atomic_inc_return(&inode->i_count) < 2);
}
EXPORT_SYMBOL(ihold);

struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
					    struct inode *inode, u32 bit)
{
	void *bit_address;

	bit_address = inode_state_wait_address(inode, bit);
	init_wait_var_entry(wqe, bit_address, 0);
	return __var_waitqueue(bit_address);
}
EXPORT_SYMBOL(inode_bit_waitqueue);

void wait_on_new_inode(struct inode *inode)
{
	struct wait_bit_queue_entry wqe;
	struct wait_queue_head *wq_head;

	spin_lock(&inode->i_lock);
	if (!(inode_state_read(inode) & I_NEW)) {
		spin_unlock(&inode->i_lock);
		return;
	}

	wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
	for (;;) {
		prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
		if (!(inode_state_read(inode) & I_NEW))
			break;
		spin_unlock(&inode->i_lock);
		schedule();
		spin_lock(&inode->i_lock);
	}
	finish_wait(wq_head, &wqe.wq_entry);
	WARN_ON(inode_state_read(inode) & I_NEW);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(wait_on_new_inode);

static void __inode_lru_list_add(struct inode *inode, bool rotate)
{
	lockdep_assert_held(&inode->i_lock);

	if (inode_state_read(inode) & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
		return;
	if (icount_read(inode))
		return;
	if (!(inode->i_sb->s_flags & SB_ACTIVE))
		return;
	if (!mapping_shrinkable(&inode->i_data))
		return;

	if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
		this_cpu_inc(nr_unused);
	else if (rotate)
		inode_state_set(inode, I_REFERENCED);
}

/*
 * Add inode to LRU if needed (inode is unused and clean).
 */
void inode_lru_list_add(struct inode *inode)
{
	__inode_lru_list_add(inode, false);
}

static void inode_lru_list_del(struct inode *inode)
{
	if (list_empty(&inode->i_lru))
		return;

	if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
		this_cpu_dec(nr_unused);
}

static void inode_pin_lru_isolating(struct inode *inode)
{
	lockdep_assert_held(&inode->i_lock);
	WARN_ON(inode_state_read(inode) & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
	inode_state_set(inode, I_LRU_ISOLATING);
}

static void inode_unpin_lru_isolating(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	WARN_ON(!(inode_state_read(inode) & I_LRU_ISOLATING));
	inode_state_clear(inode, I_LRU_ISOLATING);
	/* Called with inode->i_lock which ensures memory ordering. */
	inode_wake_up_bit(inode, __I_LRU_ISOLATING);
	spin_unlock(&inode->i_lock);
}

static void inode_wait_for_lru_isolating(struct inode *inode)
{
	struct wait_bit_queue_entry wqe;
	struct wait_queue_head *wq_head;

	lockdep_assert_held(&inode->i_lock);
	if (!(inode_state_read(inode) & I_LRU_ISOLATING))
		return;

	wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
	for (;;) {
		prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
		/*
		 * Checking I_LRU_ISOLATING with inode->i_lock guarantees
		 * memory ordering.
		 */
		if (!(inode_state_read(inode) & I_LRU_ISOLATING))
			break;
		spin_unlock(&inode->i_lock);
		schedule();
		spin_lock(&inode->i_lock);
	}
	finish_wait(wq_head, &wqe.wq_entry);
	WARN_ON(inode_state_read(inode) & I_LRU_ISOLATING);
}

/**
 * inode_sb_list_add - add inode to the superblock list of inodes
 * @inode: inode to add
 */
void inode_sb_list_add(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	spin_lock(&sb->s_inode_list_lock);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	spin_unlock(&sb->s_inode_list_lock);
}
EXPORT_SYMBOL_GPL(inode_sb_list_add);

static inline void inode_sb_list_del(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (!list_empty(&inode->i_sb_list)) {
		spin_lock(&sb->s_inode_list_lock);
		list_del_init(&inode->i_sb_list);
		spin_unlock(&sb->s_inode_list_lock);
	}
}

static unsigned long hash(struct super_block *sb, unsigned long hashval)
{
	unsigned long tmp;

	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
			L1_CACHE_BYTES;
	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
	return tmp & i_hash_mask;
}
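The bucket computation above can be exercised in a stand-alone sketch. Note that `GOLDEN_RATIO_PRIME`, the shift and the mask below are stand-in values chosen for illustration: in the kernel, `i_hash_shift` and `i_hash_mask` are computed at boot from the hash table size, so these are assumptions, not the real runtime values.

```c
#include <assert.h>

/* Stand-in constants for illustration only; the kernel derives the real
 * shift and mask from the allocated inode hash table size. */
#define SIM_GOLDEN_RATIO_PRIME	0x9e37fffffffc0001UL
#define SIM_L1_CACHE_BYTES	64
#define SIM_HASH_SHIFT		16
#define SIM_HASH_MASK		((1UL << SIM_HASH_SHIFT) - 1)

/* Same mixing steps as hash() above: fold the superblock pointer and the
 * per-fs hash value together, then spread the entropy with a shift-xor
 * before masking down to a bucket index. */
static unsigned long sim_hash(unsigned long sb, unsigned long hashval)
{
	unsigned long tmp;

	tmp = (hashval * sb) ^ (SIM_GOLDEN_RATIO_PRIME + hashval) /
			SIM_L1_CACHE_BYTES;
	tmp = tmp ^ ((tmp ^ SIM_GOLDEN_RATIO_PRIME) >> SIM_HASH_SHIFT);
	return tmp & SIM_HASH_MASK;
}
```

Mixing in the superblock pointer keeps identical inode numbers from different filesystems out of the same bucket.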

/**
 *	__insert_inode_hash - hash an inode
 *	@inode: unhashed inode
 *	@hashval: unsigned long value used to locate this object in the
 *		inode_hashtable.
 *
 *	Add an inode to the inode hash for this superblock.
 */
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);

	spin_lock(&inode_hash_lock);
	spin_lock(&inode->i_lock);
	hlist_add_head_rcu(&inode->i_hash, b);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_hash_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);

/**
 *	__remove_inode_hash - remove an inode from the hash
 *	@inode: inode to unhash
 *
 *	Remove an inode from the inode hash table.
 */
void __remove_inode_hash(struct inode *inode)
{
	spin_lock(&inode_hash_lock);
	spin_lock(&inode->i_lock);
	hlist_del_init_rcu(&inode->i_hash);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_hash_lock);
}
EXPORT_SYMBOL(__remove_inode_hash);

void dump_mapping(const struct address_space *mapping)
{
	struct inode *host;
	const struct address_space_operations *a_ops;
	struct hlist_node *dentry_first;
	struct dentry *dentry_ptr;
	struct dentry dentry;
	char fname[64] = {};
	unsigned long ino;

	/*
	 * If mapping is an invalid pointer, we don't want to crash
	 * accessing it, so probe everything depending on it carefully.
	 */
	if (get_kernel_nofault(host, &mapping->host) ||
	    get_kernel_nofault(a_ops, &mapping->a_ops)) {
		pr_warn("invalid mapping:%px\n", mapping);
		return;
	}

	if (!host) {
		pr_warn("aops:%ps\n", a_ops);
		return;
	}

	if (get_kernel_nofault(dentry_first, &host->i_dentry.first) ||
	    get_kernel_nofault(ino, &host->i_ino)) {
		pr_warn("aops:%ps invalid inode:%px\n", a_ops, host);
		return;
	}

	if (!dentry_first) {
		pr_warn("aops:%ps ino:%lx\n", a_ops, ino);
		return;
	}

	dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
	if (get_kernel_nofault(dentry, dentry_ptr) ||
	    !dentry.d_parent || !dentry.d_name.name) {
		pr_warn("aops:%ps ino:%lx invalid dentry:%px\n",
				a_ops, ino, dentry_ptr);
		return;
	}

	if (strncpy_from_kernel_nofault(fname, dentry.d_name.name, 63) < 0)
		strscpy(fname, "<invalid>");
	/*
	 * Even if strncpy_from_kernel_nofault() succeeded,
	 * the fname could be unreliable
	 */
	pr_warn("aops:%ps ino:%lx dentry name(?):\"%s\"\n",
		a_ops, ino, fname);
}

void clear_inode(struct inode *inode)
{
	/*
	 * We have to cycle the i_pages lock here because reclaim can be in the
	 * process of removing the last page (in __filemap_remove_folio())
	 * and we must not free the mapping under it.
	 */
	xa_lock_irq(&inode->i_data.i_pages);
	BUG_ON(inode->i_data.nrpages);
	/*
	 * Almost always, mapping_empty(&inode->i_data) here; but there are
	 * two known and long-standing ways in which nodes may get left behind
	 * (when deep radix-tree node allocation failed partway; or when THP
	 * collapse_file() failed). Until those two known cases are cleaned up,
	 * or a cleanup function is called here, do not BUG_ON(!mapping_empty),
	 * nor even WARN_ON(!mapping_empty).
	 */
	xa_unlock_irq(&inode->i_data.i_pages);
	BUG_ON(!list_empty(&inode->i_data.i_private_list));
	BUG_ON(!(inode_state_read_once(inode) & I_FREEING));
	BUG_ON(inode_state_read_once(inode) & I_CLEAR);
	BUG_ON(!list_empty(&inode->i_wb_list));
	/* don't need i_lock here, no concurrent mods to i_state */
	inode_state_assign_raw(inode, I_FREEING | I_CLEAR);
}
EXPORT_SYMBOL(clear_inode);

/*
 * Free the inode passed in, removing it from the lists it is still connected
 * to. We remove any pages still attached to the inode and wait for any IO that
 * is still in progress before finally destroying the inode.
 *
 * An inode must already be marked I_FREEING so that we avoid the inode being
 * moved back onto lists if we race with other code that manipulates the lists
 * (e.g. writeback_single_inode). The caller is responsible for setting this.
 *
 * An inode must already be removed from the LRU list before being evicted from
 * the cache. This should occur atomically with setting the I_FREEING state
 * flag, so no inodes here should ever be on the LRU when being evicted.
 */
static void evict(struct inode *inode)
{
	const struct super_operations *op = inode->i_sb->s_op;

	BUG_ON(!(inode_state_read_once(inode) & I_FREEING));
	BUG_ON(!list_empty(&inode->i_lru));

	inode_io_list_del(inode);
	inode_sb_list_del(inode);

	spin_lock(&inode->i_lock);
	inode_wait_for_lru_isolating(inode);

	/*
	 * Wait for flusher thread to be done with the inode so that filesystem
	 * does not start destroying it while writeback is still running. Since
	 * the inode has I_FREEING set, flusher thread won't start new work on
	 * the inode.  We just have to wait for running writeback to finish.
	 */
	inode_wait_for_writeback(inode);
	spin_unlock(&inode->i_lock);

	if (op->evict_inode) {
		op->evict_inode(inode);
	} else {
		truncate_inode_pages_final(&inode->i_data);
		clear_inode(inode);
	}
	if (S_ISCHR(inode->i_mode) && inode->i_cdev)
		cd_forget(inode);

	remove_inode_hash(inode);

	/*
	 * Wake up waiters in __wait_on_freeing_inode().
	 *
	 * It is an invariant that any thread we need to wake up is already
	 * accounted for before remove_inode_hash() acquires ->i_lock -- both
	 * sides take the lock and sleep is aborted if the inode is found
	 * unhashed. Thus either the sleeper wins and goes off CPU, or removal
	 * wins and the sleeper aborts after testing with the lock.
	 *
	 * This also means we don't need any fences for the call below.
	 */
	inode_wake_up_bit(inode, __I_NEW);
	BUG_ON(inode_state_read_once(inode) != (I_FREEING | I_CLEAR));

	destroy_inode(inode);
}

/*
 * dispose_list - dispose of the contents of a local list
 * @head: the head of the list to free
 *
 * Dispose-list gets a local list with local inodes in it, so it doesn't
 * need to worry about list corruption and SMP locks.
 */
static void dispose_list(struct list_head *head)
{
	while (!list_empty(head)) {
		struct inode *inode;

		inode = list_first_entry(head, struct inode, i_lru);
		list_del_init(&inode->i_lru);

		evict(inode);
		cond_resched();
	}
}

/**
 * evict_inodes	- evict all evictable inodes for a superblock
 * @sb:		superblock to operate on
 *
 * Make sure that no inodes with zero refcount are retained.  This is
 * called by superblock shutdown after having SB_ACTIVE flag removed,
 * so any inode reaching zero refcount during or after that call will
 * be immediately evicted.
 */
void evict_inodes(struct super_block *sb)
{
	struct inode *inode;
	LIST_HEAD(dispose);

again:
	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		if (icount_read(inode))
			continue;

		spin_lock(&inode->i_lock);
		if (icount_read(inode)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		if (inode_state_read(inode) & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}

		inode_state_set(inode, I_FREEING);
		inode_lru_list_del(inode);
		spin_unlock(&inode->i_lock);
		list_add(&inode->i_lru, &dispose);

		/*
		 * We can have a ton of inodes to evict at unmount time given
		 * enough memory, check to see if we need to go to sleep for a
		 * bit so we don't livelock.
		 */
		if (need_resched()) {
			spin_unlock(&sb->s_inode_list_lock);
			cond_resched();
			dispose_list(&dispose);
			goto again;
		}
	}
	spin_unlock(&sb->s_inode_list_lock);

	dispose_list(&dispose);
}
EXPORT_SYMBOL_GPL(evict_inodes);
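The dispose-list pattern used by evict_inodes() and dispose_list() can be sketched in userspace. This is a simplified, single-threaded stand-in (the `struct node` list and `busy` flag are illustrative assumptions): under a notional lock, unlink victims from a shared list onto a private local list, then tear them down after dropping the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an inode on a superblock list; `busy` plays the role of a
 * nonzero refcount that makes the object ineligible for eviction. */
struct node {
	int busy;
	struct node *next;
};

/* Move every non-busy node from *shared onto *dispose. In the kernel the
 * equivalent step runs under s_inode_list_lock; the expensive teardown of
 * the dispose list then happens with no lock held. */
static void collect_victims(struct node **shared, struct node **dispose)
{
	struct node **pp = shared;

	while (*pp) {
		struct node *n = *pp;

		if (n->busy) {
			pp = &n->next;
			continue;
		}
		*pp = n->next;		/* unlink from the shared list */
		n->next = *dispose;	/* push onto the private list */
		*dispose = n;
	}
}
```

Because the private list is only reachable by the caller, disposal needs no list locking, which is exactly why dispose_list() can call evict() and cond_resched() freely.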

/*
 * Isolate the inode from the LRU in preparation for freeing it.
 *
 * If the inode has the I_REFERENCED flag set, then it means that it has been
 * used recently - the flag is set in iput_final(). When we encounter such an
 * inode, clear the flag and move it to the back of the LRU so it gets another
 * pass through the LRU before it gets reclaimed. This is necessary because of
 * the fact we are doing lazy LRU updates to minimise lock contention so the
 * LRU does not have strict ordering. Hence we don't want to reclaim inodes
 * with this flag set because they are the inodes that are out of order.
 */
static enum lru_status inode_lru_isolate(struct list_head *item,
		struct list_lru_one *lru, void *arg)
{
	struct list_head *freeable = arg;
	struct inode	*inode = container_of(item, struct inode, i_lru);

	/*
	 * We are inverting the lru lock/inode->i_lock here, so use a
	 * trylock. If we fail to get the lock, just skip it.
	 */
	if (!spin_trylock(&inode->i_lock))
		return LRU_SKIP;

	/*
	 * Inodes can get referenced, redirtied, or repopulated while
	 * they're already on the LRU, and this can make them
	 * unreclaimable for a while. Remove them lazily here; iput,
	 * sync, or the last page cache deletion will requeue them.
	 */
	if (icount_read(inode) ||
	    (inode_state_read(inode) & ~I_REFERENCED) ||
	    !mapping_shrinkable(&inode->i_data)) {
		list_lru_isolate(lru, &inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

	/* Recently referenced inodes get one more pass */
	if (inode_state_read(inode) & I_REFERENCED) {
		inode_state_clear(inode, I_REFERENCED);
		spin_unlock(&inode->i_lock);
		return LRU_ROTATE;
	}

	/*
	 * On highmem systems, mapping_shrinkable() permits dropping
	 * page cache in order to free up struct inodes: lowmem might
	 * be under pressure before the cache inside the highmem zone.
	 */
	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
		inode_pin_lru_isolating(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&lru->lock);
		if (remove_inode_buffers(inode)) {
			unsigned long reap;
			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
			if (current_is_kswapd())
				__count_vm_events(KSWAPD_INODESTEAL, reap);
			else
				__count_vm_events(PGINODESTEAL, reap);
			mm_account_reclaimed_pages(reap);
		}
		inode_unpin_lru_isolating(inode);
		return LRU_RETRY;
	}

	WARN_ON(inode_state_read(inode) & I_NEW);
	inode_state_set(inode, I_FREEING);
	list_lru_isolate_move(lru, &inode->i_lru, freeable);
	spin_unlock(&inode->i_lock);

	this_cpu_dec(nr_unused);
	return LRU_REMOVED;
}

/*
 * Walk the superblock inode LRU for freeable inodes and attempt to free them.
 * This is called from the superblock shrinker function with a number of inodes
 * to trim from the LRU. Inodes to be freed are moved to a temporary list and
 * then are freed outside inode_lock by dispose_list().
 */
long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
{
	LIST_HEAD(freeable);
	long freed;

	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
				     inode_lru_isolate, &freeable);
	dispose_list(&freeable);
	return freed;
}

static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked);
/*
 * May be called with or without the inode_hash_lock held, as indicated
 * by @is_inode_hash_locked.
 */
static struct inode *find_inode(struct super_block *sb,
				struct hlist_head *head,
				int (*test)(struct inode *, void *),
				void *data, bool is_inode_hash_locked,
				bool *isnew)
{
	struct inode *inode = NULL;

	if (is_inode_hash_locked)
		lockdep_assert_held(&inode_hash_lock);
	else
		lockdep_assert_not_held(&inode_hash_lock);

	rcu_read_lock();
repeat:
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		if (!test(inode, data))
			continue;
		spin_lock(&inode->i_lock);
		if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) {
			__wait_on_freeing_inode(inode, is_inode_hash_locked);
			goto repeat;
		}
		if (unlikely(inode_state_read(inode) & I_CREATING)) {
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			return ERR_PTR(-ESTALE);
		}
		__iget(inode);
		*isnew = !!(inode_state_read(inode) & I_NEW);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;
}

/*
 * find_inode_fast is the fast path version of find_inode, see the comment at
 * iget_locked for details.
 */
static struct inode *find_inode_fast(struct super_block *sb,
				struct hlist_head *head, unsigned long ino,
				bool is_inode_hash_locked, bool *isnew)
{
	struct inode *inode = NULL;

	if (is_inode_hash_locked)
		lockdep_assert_held(&inode_hash_lock);
	else
		lockdep_assert_not_held(&inode_hash_lock);

	rcu_read_lock();
repeat:
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino != ino)
			continue;
		if (inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);
		if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) {
			__wait_on_freeing_inode(inode, is_inode_hash_locked);
			goto repeat;
		}
		if (unlikely(inode_state_read(inode) & I_CREATING)) {
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			return ERR_PTR(-ESTALE);
		}
		__iget(inode);
		*isnew = !!(inode_state_read(inode) & I_NEW);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;
}

/*
 * Each cpu owns a range of LAST_INO_BATCH numbers.
 * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
 * to renew the exhausted range.
 *
 * This does not significantly increase overflow rate because every CPU can
 * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
 * NR_CPUS*(LAST_INO_BATCH-1) wastage. With NR_CPUS=4096 and
 * LAST_INO_BATCH=1024, this is ~0.1% of the 2^32 range, and is a worst
 * case. Even a 50% wastage would only increase the overflow rate by 2x,
 * which does not seem too significant.
 *
 * On a 32-bit, non-LFS stat() call, glibc will generate an EOVERFLOW
 * error if st_ino won't fit in the target struct field. Use a 32-bit
 * counter here to attempt to avoid that.
 */
#define LAST_INO_BATCH 1024
static DEFINE_PER_CPU(unsigned int, last_ino);

unsigned int get_next_ino(void)
{
	unsigned int *p = &get_cpu_var(last_ino);
	unsigned int res = *p;

#ifdef CONFIG_SMP
	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
		static atomic_t shared_last_ino;
		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);

		res = next - LAST_INO_BATCH;
	}
#endif

	res++;
	/* get_next_ino should not provide a 0 inode number */
	if (unlikely(!res))
		res++;
	*p = res;
	put_cpu_var(last_ino);
	return res;
}
EXPORT_SYMBOL(get_next_ino);
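The batching scheme above can be sketched in userspace. This is a single-threaded stand-in (plain variables instead of a per-CPU variable and an atomic counter), keeping the kernel's `LAST_INO_BATCH` granularity and the skip over inode number 0:

```c
#include <assert.h>

#define LAST_INO_BATCH 1024

/* Stand-in for the kernel's atomic shared_last_ino. */
static unsigned int shared_last_ino;
/* Stand-in for one CPU's per-CPU last_ino cursor. */
static unsigned int last_ino;

static unsigned int sim_get_next_ino(void)
{
	unsigned int res = last_ino;

	/* Batch exhausted (or never claimed): grab a fresh range of
	 * LAST_INO_BATCH numbers from the shared counter. */
	if ((res & (LAST_INO_BATCH - 1)) == 0) {
		shared_last_ino += LAST_INO_BATCH;
		res = shared_last_ino - LAST_INO_BATCH;
	}

	res++;
	/* Inode number 0 is reserved; skip it on 32-bit wraparound. */
	if (res == 0)
		res++;
	last_ino = res;
	return res;
}
```

With a single simulated CPU the numbers come out strictly sequential; with several CPUs each would consume its own disjoint batch, touching the shared counter only once per 1024 allocations.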

/**
 *	new_inode 	- obtain an inode
 *	@sb: superblock
 *
 *	Allocates a new inode for the given superblock. The default gfp_mask
 *	for allocations related to inode->i_mapping is GFP_HIGHUSER_MOVABLE.
 *	If HIGHMEM pages are unsuitable or it is known that pages allocated
 *	for the page cache are not reclaimable or migratable,
 *	mapping_set_gfp_mask() must be called with suitable flags on the
 *	newly created inode's mapping.
 */
struct inode *new_inode(struct super_block *sb)
{
	struct inode *inode;

	inode = alloc_inode(sb);
	if (inode)
		inode_sb_list_add(inode);
	return inode;
}
EXPORT_SYMBOL(new_inode);

#ifdef CONFIG_DEBUG_LOCK_ALLOC
void lockdep_annotate_inode_mutex_key(struct inode *inode)
{
	if (S_ISDIR(inode->i_mode)) {
		struct file_system_type *type = inode->i_sb->s_type;

		/* Set new key only if filesystem hasn't already changed it */
		if (lockdep_match_class(&inode->i_rwsem, &type->i_mutex_key)) {
			/*
			 * ensure nobody is actually holding i_rwsem
			 */
			init_rwsem(&inode->i_rwsem);
			lockdep_set_class(&inode->i_rwsem,
					  &type->i_mutex_dir_key);
		}
	}
}
EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
#endif

/**
 * unlock_new_inode - clear the I_NEW state and wake up any waiters
 * @inode:	new inode to unlock
 *
 * Called when the inode is fully initialised to clear the new state of the
 * inode and wake up anyone waiting for the inode to finish initialisation.
 */
void unlock_new_inode(struct inode *inode)
{
	lockdep_annotate_inode_mutex_key(inode);
	spin_lock(&inode->i_lock);
	WARN_ON(!(inode_state_read(inode) & I_NEW));
	inode_state_clear(inode, I_NEW | I_CREATING);
	inode_wake_up_bit(inode, __I_NEW);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(unlock_new_inode);

void discard_new_inode(struct inode *inode)
{
	lockdep_annotate_inode_mutex_key(inode);
	spin_lock(&inode->i_lock);
	WARN_ON(!(inode_state_read(inode) & I_NEW));
	inode_state_clear(inode, I_NEW);
	inode_wake_up_bit(inode, __I_NEW);
	spin_unlock(&inode->i_lock);
	iput(inode);
}
EXPORT_SYMBOL(discard_new_inode);

/**
 * lock_two_nondirectories - take two i_mutexes on non-directory objects
 *
 * Lock any non-NULL argument. Passed objects must not be directories.
 * Zero, one or two objects may be locked by this function.
 *
 * @inode1: first inode to lock
 * @inode2: second inode to lock
 */
void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
{
	if (inode1)
		WARN_ON_ONCE(S_ISDIR(inode1->i_mode));
	if (inode2)
		WARN_ON_ONCE(S_ISDIR(inode2->i_mode));
	if (inode1 > inode2)
		swap(inode1, inode2);
	if (inode1)
		inode_lock(inode1);
	if (inode2 && inode2 != inode1)
		inode_lock_nested(inode2, I_MUTEX_NONDIR2);
}
EXPORT_SYMBOL(lock_two_nondirectories);

/**
 * unlock_two_nondirectories - release locks from lock_two_nondirectories()
 * @inode1: first inode to unlock
 * @inode2: second inode to unlock
 */
void unlock_two_nondirectories(struct inode *inode1, struct inode *inode2)
{
	if (inode1) {
		WARN_ON_ONCE(S_ISDIR(inode1->i_mode));
		inode_unlock(inode1);
	}
	if (inode2 && inode2 != inode1) {
		WARN_ON_ONCE(S_ISDIR(inode2->i_mode));
		inode_unlock(inode2);
	}
}
EXPORT_SYMBOL(unlock_two_nondirectories);
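The deadlock-avoidance idea in lock_two_nondirectories() is to canonicalise the pair into a fixed (address) order before locking, so two callers passing the same two objects in opposite order always acquire them in the same sequence and cannot ABBA-deadlock. A minimal userspace sketch of just the ordering step (`struct obj` is a hypothetical stand-in; the comparison goes through `uintptr_t` to stay within portable C):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lockable object standing in for a struct inode. */
struct obj {
	int dummy;
};

/* Canonicalise (*a, *b) into ascending address order, mirroring the
 * `if (inode1 > inode2) swap(...)` step above. NULL sorts first. */
static void order_pair(struct obj **a, struct obj **b)
{
	if ((uintptr_t)*a > (uintptr_t)*b) {
		struct obj *tmp = *a;
		*a = *b;
		*b = tmp;
	}
}
```

After this step, the caller locks the first pointer, then the second if it is non-NULL and distinct, exactly as lock_two_nondirectories() does.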

/**
 * inode_insert5 - obtain an inode from a mounted file system
 * @inode:	pre-allocated inode to use for insert to cache
 * @hashval:	hash value (usually inode number) to get
 * @test:	callback used for comparisons between inodes
 * @set:	callback used to initialize a new struct inode
 * @data:	opaque data pointer to pass to @test and @set
 * @isnew:	pointer to a bool which will indicate whether I_NEW is set
 *
 * Search for the inode specified by @hashval and @data in the inode cache,
 * and if present return it with an increased reference count. This is a
 * variant of iget5_locked() that doesn't allocate an inode.
 *
 * If the inode is not present in the cache, insert the pre-allocated inode and
 * return it locked, hashed, and with the I_NEW flag set. The file system gets
 * to fill it in before unlocking it via unlock_new_inode().
 *
 * Note that both @test and @set are called with the inode_hash_lock held, so
 * they can't sleep.
 */
struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
			    int (*test)(struct inode *, void *),
			    int (*set)(struct inode *, void *), void *data)
{
	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
	struct inode *old;
	bool isnew;

	might_sleep();

again:
	spin_lock(&inode_hash_lock);
	old = find_inode(inode->i_sb, head, test, data, true, &isnew);
	if (unlikely(old)) {
		/*
		 * Uhhuh, somebody else created the same inode under us.
		 * Use the old inode instead of the preallocated one.
		 */
		spin_unlock(&inode_hash_lock);
		if (IS_ERR(old))
			return NULL;
		if (unlikely(isnew))
			wait_on_new_inode(old);
		if (unlikely(inode_unhashed(old))) {
			iput(old);
			goto again;
		}
		return old;
	}

	if (set && unlikely(set(inode, data))) {
		spin_unlock(&inode_hash_lock);
		return NULL;
	}

	/*
	 * Return the locked inode with I_NEW set, the
	 * caller is responsible for filling in the contents
	 */
	spin_lock(&inode->i_lock);
	inode_state_set(inode, I_NEW);
	hlist_add_head_rcu(&inode->i_hash, head);
	spin_unlock(&inode->i_lock);

	spin_unlock(&inode_hash_lock);

	/*
	 * Add the inode to the sb list if it's not already on it. It has I_NEW
	 * at this point, so it should be safe to test i_sb_list locklessly.
	 */
	if (list_empty(&inode->i_sb_list))
		inode_sb_list_add(inode);

	return inode;
}
EXPORT_SYMBOL(inode_insert5);

/**
 * iget5_locked - obtain an inode from a mounted file system
 * @sb:		super block of file system
 * @hashval:	hash value (usually inode number) to get
 * @test:	callback used for comparisons between inodes
 * @set:	callback used to initialize a new struct inode
 * @data:	opaque data pointer to pass to @test and @set
 *
 * Search for the inode specified by @hashval and @data in the inode cache,
 * and if present return it with an increased reference count. This is a
 * generalized version of iget_locked() for file systems where the inode
 * number is not sufficient for unique identification of an inode.
 *
 * If the inode is not present in the cache, allocate and insert a new inode
 * and return it locked, hashed, and with the I_NEW flag set. The file system
 * gets to fill it in before unlocking it via unlock_new_inode().
 *
 * Note that both @test and @set are called with the inode_hash_lock held, so
 * they can't sleep.
 */
struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
		int (*test)(struct inode *, void *),
		int (*set)(struct inode *, void *), void *data)
{
	struct inode *inode = ilookup5(sb, hashval, test, data);

	if (!inode) {
		struct inode *new = alloc_inode(sb);

		if (new) {
			inode = inode_insert5(new, hashval, test, set, data);
			if (unlikely(inode != new))
				destroy_inode(new);
		}
	}
	return inode;
}
EXPORT_SYMBOL(iget5_locked);

/**
 * iget5_locked_rcu - obtain an inode from a mounted file system
 * @sb:		super block of file system
 * @hashval:	hash value (usually inode number) to get
 * @test:	callback used for comparisons between inodes
 * @set:	callback used to initialize a new struct inode
 * @data:	opaque data pointer to pass to @test and @set
 *
 * This is equivalent to iget5_locked, except the @test callback must
 * tolerate the inode not being stable, including being mid-teardown.
 */
struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
		int (*test)(struct inode *, void *),
		int (*set)(struct inode *, void *), void *data)
{
	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
	struct inode *inode, *new;
	bool isnew;

	might_sleep();

again:
	inode = find_inode(sb, head, test, data, false, &isnew);
	if (inode) {
		if (IS_ERR(inode))
			return NULL;
		if (unlikely(isnew))
			wait_on_new_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
		return inode;
	}

	new = alloc_inode(sb);
	if (new) {
		inode = inode_insert5(new, hashval, test, set, data);
		if (unlikely(inode != new))
			destroy_inode(new);
	}
	return inode;
}
EXPORT_SYMBOL_GPL(iget5_locked_rcu);

/**
 * iget_locked - obtain an inode from a mounted file system
 * @sb:		super block of file system
 * @ino:	inode number to get
 *
 * Search for the inode specified by @ino in the inode cache and if present
 * return it with an increased reference count. This is for file systems
 * where the inode number is sufficient for unique identification of an inode.
 *
 * If the inode is not in cache, allocate a new inode and return it locked,
 * hashed, and with the I_NEW flag set.  The file system gets to fill it in
 * before unlocking it via unlock_new_inode().
 */
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
	struct hlist_head *head = inode_hashtable + hash(sb, ino);
	struct inode *inode;
	bool isnew;

	might_sleep();

again:
	inode = find_inode_fast(sb, head, ino, false, &isnew);
	if (inode) {
		if (IS_ERR(inode))
			return NULL;
		if (unlikely(isnew))
			wait_on_new_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
		return inode;
	}

	inode = alloc_inode(sb);
	if (inode) {
		struct inode *old;

		spin_lock(&inode_hash_lock);
		/* We released the lock, so.. */
		old = find_inode_fast(sb, head, ino, true, &isnew);
		if (!old) {
			inode->i_ino = ino;
			spin_lock(&inode->i_lock);
			inode_state_assign(inode, I_NEW);
			hlist_add_head_rcu(&inode->i_hash, head);
			spin_unlock(&inode->i_lock);
			spin_unlock(&inode_hash_lock);
			inode_sb_list_add(inode);

			/* Return the locked inode with I_NEW set, the
			 * caller is responsible for filling in the contents
			 */
			return inode;
		}

		/*
		 * Uhhuh, somebody else created the same inode under
		 * us. Use the old inode instead of the one we just
		 * allocated.
		 */
		spin_unlock(&inode_hash_lock);
		destroy_inode(inode);
		if (IS_ERR(old))
			return NULL;
		inode = old;
		if (unlikely(isnew))
			wait_on_new_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
	}
	return inode;
}
EXPORT_SYMBOL(iget_locked);

/*
 * search the inode cache for a matching inode number.
 * If we find one, then the inode number we are trying to
 * allocate is not unique and so we should not use it.
 *
 * Returns 1 if the inode number is unique, 0 if it is not.
 */
static int test_inode_iunique(struct super_block *sb, unsigned long ino)
{
	struct hlist_head *b = inode_hashtable + hash(sb, ino);
	struct inode *inode;

	hlist_for_each_entry_rcu(inode, b, i_hash) {
		if (inode->i_ino == ino && inode->i_sb == sb)
			return 0;
	}
	return 1;
}

/**
 *	iunique - get a unique inode number
 *	@sb: superblock
 *	@max_reserved: highest reserved inode number
 *
 *	Obtain an inode number that is unique on the system for a given
 *	superblock. This is used by file systems that have no natural
 *	permanent inode numbering system. An inode number is returned that
 *	is higher than the reserved limit but unique.
 *
 *	BUGS:
 *	With a large number of inodes live on the file system this function
 *	currently becomes quite slow.
 */
ino_t iunique(struct super_block *sb, ino_t max_reserved)
{
	/*
	 * On a 32-bit, non-LFS stat() call, glibc will generate an EOVERFLOW
	 * error if st_ino won't fit in the target struct field. Use a 32-bit
	 * counter here to attempt to avoid that.
	 */
	static DEFINE_SPINLOCK(iunique_lock);
	static unsigned int counter;
	ino_t res;

	rcu_read_lock();
	spin_lock(&iunique_lock);
	do {
		if (counter <= max_reserved)
			counter = max_reserved + 1;
		res = counter++;
	} while (!test_inode_iunique(sb, res));
	spin_unlock(&iunique_lock);
	rcu_read_unlock();

	return res;
}
EXPORT_SYMBOL(iunique);

struct inode *igrab(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (!(inode_state_read(inode) & (I_FREEING | I_WILL_FREE))) {
		__iget(inode);
		spin_unlock(&inode->i_lock);
	} else {
		spin_unlock(&inode->i_lock);
		/*
		 * Handle the case where s_op->clear_inode has not been
		 * called yet and somebody is calling igrab while the
		 * inode is getting freed.
		 */
		inode = NULL;
	}
	return inode;
}
EXPORT_SYMBOL(igrab);

/**
 * ilookup5_nowait - search for an inode in the inode cache
 * @sb:		super block of file system to search
 * @hashval:	hash value (usually inode number) to search for
 * @test:	callback used for comparisons between inodes
 * @data:	opaque data pointer to pass to @test
 * @isnew:	return argument telling whether I_NEW was set when
 *		the inode was found in hash (the caller needs to
 *		wait for I_NEW to clear)
 *
 * Search for the inode specified by @hashval and @data in the inode cache.
 * If the inode is in the cache, the inode is returned with an incremented
 * reference count.
 *
 * Note: I_NEW is not waited upon so you have to be very careful what you do
 * with the returned inode.  You probably should be using ilookup5() instead.
 *
 * Note2: @test is called with the inode_hash_lock held, so can't sleep.
 */
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
		int (*test)(struct inode *, void *), void *data, bool *isnew)
{
	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
	struct inode *inode;

	spin_lock(&inode_hash_lock);
	inode = find_inode(sb, head, test, data, true, isnew);
	spin_unlock(&inode_hash_lock);

	return IS_ERR(inode) ? NULL : inode;
}
EXPORT_SYMBOL(ilookup5_nowait);

/**
 * ilookup5 - search for an inode in the inode cache
 * @sb:		super block of file system to search
 * @hashval:	hash value (usually inode number) to search for
 * @test:	callback used for comparisons between inodes
 * @data:	opaque data pointer to pass to @test
 *
 * Search for the inode specified by @hashval and @data in the inode cache,
 * and if the inode is in the cache, return it with an incremented
 * reference count. Waits on I_NEW to clear before returning the inode.
 *
 * This is a generalized version of ilookup() for file systems where the
 * inode number is not sufficient for unique identification of an inode.
 *
 * Note: @test is called with the inode_hash_lock held, so can't sleep.
 */
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
		int (*test)(struct inode *, void *), void *data)
{
	struct inode *inode;
	bool isnew;

	might_sleep();

again:
	inode = ilookup5_nowait(sb, hashval, test, data, &isnew);
	if (inode) {
		if (unlikely(isnew))
			wait_on_new_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
	}
	return inode;
}
EXPORT_SYMBOL(ilookup5);

/**
 * ilookup - search for an inode in the inode cache
 * @sb:		super block of file system to search
 * @ino:	inode number to search for
 *
 * Search for the inode @ino in the inode cache, and if the inode is in the
 * cache, the inode is returned with an incremented reference count.
 */
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
	struct hlist_head *head = inode_hashtable + hash(sb, ino);
	struct inode *inode;
	bool isnew;

	might_sleep();

again:
	inode = find_inode_fast(sb, head, ino, false, &isnew);

	if (inode) {
		if (IS_ERR(inode))
			return NULL;
		if (unlikely(isnew))
			wait_on_new_inode(inode);
		if (unlikely(inode_unhashed(inode))) {
			iput(inode);
			goto again;
		}
	}
	return inode;
}
EXPORT_SYMBOL(ilookup);

/**
 * find_inode_nowait - find an inode in the inode cache
 * @sb:		super block of file system to search
 * @hashval:	hash value (usually inode number) to search for
 * @match:	callback used for comparisons between inodes
 * @data:	opaque data pointer to pass to @match
 *
 * Search for the inode specified by @hashval and @data in the inode
 * cache, where the helper function @match will return 0 if the inode
 * does not match, 1 if the inode does match, and -1 if the search
 * should be stopped.  The @match function must be responsible for
 * taking the i_lock spin_lock and checking i_state for an inode being
 * freed or being initialized, and incrementing the reference count
 * before returning 1.  It also must not sleep, since it is called with
 * the inode_hash_lock spinlock held.
 *
 * This is an even more generalized version of ilookup5() for when the
 * function must never block --- find_inode() can block in
 * __wait_on_freeing_inode() --- or when the caller cannot increment
 * the reference count because the resulting iput() might cause an
 * inode eviction.  The tradeoff is that the @match function must be
 * very carefully implemented.
 */
struct inode *find_inode_nowait(struct super_block *sb,
				unsigned long hashval,
				int (*match)(struct inode *, unsigned long,
					     void *),
				void *data)
{
	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
	struct inode *inode, *ret_inode = NULL;
	int mval;

	spin_lock(&inode_hash_lock);
	hlist_for_each_entry(inode, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		mval = match(inode, hashval, data);
		if (mval == 0)
			continue;
		if (mval == 1)
			ret_inode = inode;
		goto out;
	}
out:
	spin_unlock(&inode_hash_lock);
	return ret_inode;
}
EXPORT_SYMBOL(find_inode_nowait);

/**
 * find_inode_rcu - find an inode in the inode cache
 * @sb:		Super block of file system to search
 * @hashval:	Key to hash
 * @test:	Function to test match on an inode
 * @data:	Data for test function
 *
 * Search for the inode specified by @hashval and @data in the inode cache,
 * where the helper function @test will return 0 if the inode does not match
 * and 1 if it does.  The @test function must be responsible for taking the
 * i_lock spin_lock and checking i_state for an inode being freed or being
 * initialized.
 *
 * If successful, this will return the inode for which the @test function
 * returned 1 and NULL otherwise.
 *
 * The @test function is not permitted to take a ref on any inode presented.
 * It is also not permitted to sleep.
 *
 * The caller must hold the RCU read lock.
 */
struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
			     int (*test)(struct inode *, void *), void *data)
{
	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
	struct inode *inode;

	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "suspicious find_inode_rcu() usage");

	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_sb == sb &&
		    !(inode_state_read_once(inode) & (I_FREEING | I_WILL_FREE)) &&
		    test(inode, data))
			return inode;
	}
	return NULL;
}
EXPORT_SYMBOL(find_inode_rcu);

/**
 * find_inode_by_ino_rcu - Find an inode in the inode cache
 * @sb:		Super block of file system to search
 * @ino:	The inode number to match
 *
 * Search for the inode with inode number @ino on the super block @sb in
 * the inode cache, skipping inodes that are being freed.
 *
 * If successful, this will return the matching inode and NULL otherwise.
 * No reference is taken on the returned inode, and the caller is not
 * permitted to take one.
 *
 * The caller must hold the RCU read lock.
 */
struct inode *find_inode_by_ino_rcu(struct super_block *sb,
				    unsigned long ino)
{
	struct hlist_head *head = inode_hashtable + hash(sb, ino);
	struct inode *inode;

	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "suspicious find_inode_by_ino_rcu() usage");

	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino == ino &&
		    inode->i_sb == sb &&
		    !(inode_state_read_once(inode) & (I_FREEING | I_WILL_FREE)))
		    return inode;
	}
	return NULL;
}
EXPORT_SYMBOL(find_inode_by_ino_rcu);

int insert_inode_locked(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	ino_t ino = inode->i_ino;
	struct hlist_head *head = inode_hashtable + hash(sb, ino);
	bool isnew;

	might_sleep();

	while (1) {
		struct inode *old = NULL;
		spin_lock(&inode_hash_lock);
		hlist_for_each_entry(old, head, i_hash) {
			if (old->i_ino != ino)
				continue;
			if (old->i_sb != sb)
				continue;
			spin_lock(&old->i_lock);
			if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) {
				spin_unlock(&old->i_lock);
				continue;
			}
			break;
		}
		if (likely(!old)) {
			spin_lock(&inode->i_lock);
			inode_state_set(inode, I_NEW | I_CREATING);
			hlist_add_head_rcu(&inode->i_hash, head);
			spin_unlock(&inode->i_lock);
			spin_unlock(&inode_hash_lock);
			return 0;
		}
		if (unlikely(inode_state_read(old) & I_CREATING)) {
			spin_unlock(&old->i_lock);
			spin_unlock(&inode_hash_lock);
			return -EBUSY;
		}
		__iget(old);
		isnew = !!(inode_state_read(old) & I_NEW);
		spin_unlock(&old->i_lock);
		spin_unlock(&inode_hash_lock);
		if (isnew)
			wait_on_new_inode(old);
		if (unlikely(!inode_unhashed(old))) {
			iput(old);
			return -EBUSY;
		}
		iput(old);
	}
}
EXPORT_SYMBOL(insert_inode_locked);

int insert_inode_locked4(struct inode *inode, unsigned long hashval,
		int (*test)(struct inode *, void *), void *data)
{
	struct inode *old;

	might_sleep();

	inode_state_set_raw(inode, I_CREATING);
	old = inode_insert5(inode, hashval, test, NULL, data);

	if (old != inode) {
		iput(old);
		return -EBUSY;
	}
	return 0;
}
EXPORT_SYMBOL(insert_inode_locked4);


int inode_just_drop(struct inode *inode)
{
	return 1;
}
EXPORT_SYMBOL(inode_just_drop);

/*
 * Called when we're dropping the last reference
 * to an inode.
 *
 * Call the FS "drop_inode()" function, defaulting to
 * the legacy UNIX filesystem behaviour.  If it tells
 * us to evict inode, do so.  Otherwise, retain inode
 * in cache if fs is alive, sync and evict if fs is
 * shutting down.
 */
static void iput_final(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	const struct super_operations *op = inode->i_sb->s_op;
	int drop;

	WARN_ON(inode_state_read(inode) & I_NEW);
	VFS_BUG_ON_INODE(atomic_read(&inode->i_count) != 0, inode);

	if (op->drop_inode)
		drop = op->drop_inode(inode);
	else
		drop = inode_generic_drop(inode);

	if (!drop &&
	    !(inode_state_read(inode) & I_DONTCACHE) &&
	    (sb->s_flags & SB_ACTIVE)) {
		__inode_lru_list_add(inode, true);
		spin_unlock(&inode->i_lock);
		return;
	}

	/*
	 * Re-check ->i_count in case the ->drop_inode() hooks played games.
	 * Note we only execute this if the verdict was to drop the inode.
	 */
	VFS_BUG_ON_INODE(atomic_read(&inode->i_count) != 0, inode);

	if (drop) {
		inode_state_set(inode, I_FREEING);
	} else {
		inode_state_set(inode, I_WILL_FREE);
		spin_unlock(&inode->i_lock);

		write_inode_now(inode, 1);

		spin_lock(&inode->i_lock);
		WARN_ON(inode_state_read(inode) & I_NEW);
		inode_state_replace(inode, I_WILL_FREE, I_FREEING);
	}

	inode_lru_list_del(inode);
	spin_unlock(&inode->i_lock);

	evict(inode);
}

/**
 *	iput	- put an inode
 *	@inode: inode to put
 *
 *	Puts an inode, dropping its usage count. If the inode use count hits
 *	zero, the inode is then freed and may also be destroyed.
 *
 *	Consequently, iput() can sleep.
 */
void iput(struct inode *inode)
{
	might_sleep();
	if (unlikely(!inode))
		return;

retry:
	lockdep_assert_not_held(&inode->i_lock);
	VFS_BUG_ON_INODE(inode_state_read_once(inode) & (I_FREEING | I_CLEAR), inode);
	/*
	 * Note this assert is technically racy as if the count is bogusly
	 * equal to one, then two CPUs racing to further drop it can both
	 * conclude it's fine.
	 */
	VFS_BUG_ON_INODE(atomic_read(&inode->i_count) < 1, inode);

	if (atomic_add_unless(&inode->i_count, -1, 1))
		return;

	if ((inode_state_read_once(inode) & I_DIRTY_TIME) && inode->i_nlink) {
		trace_writeback_lazytime_iput(inode);
		mark_inode_dirty_sync(inode);
		goto retry;
	}

	spin_lock(&inode->i_lock);
	if (unlikely((inode_state_read(inode) & I_DIRTY_TIME) && inode->i_nlink)) {
		spin_unlock(&inode->i_lock);
		goto retry;
	}

	if (!atomic_dec_and_test(&inode->i_count)) {
		spin_unlock(&inode->i_lock);
		return;
	}

	/*
	 * iput_final() drops ->i_lock, we can't assert on it as the inode may
	 * be deallocated by the time the call returns.
	 */
	iput_final(inode);
}
EXPORT_SYMBOL(iput);

/**
 *	iput_not_last	- put an inode assuming this is not the last reference
 *	@inode: inode to put
 */
void iput_not_last(struct inode *inode)
{
	VFS_BUG_ON_INODE(inode_state_read_once(inode) & (I_FREEING | I_CLEAR), inode);
	VFS_BUG_ON_INODE(atomic_read(&inode->i_count) < 2, inode);

	WARN_ON(atomic_sub_return(1, &inode->i_count) == 0);
}
EXPORT_SYMBOL(iput_not_last);

#ifdef CONFIG_BLOCK
/**
 *	bmap	- find a block number in a file
 *	@inode:  inode owning the block number being requested
 *	@block: pointer containing the block to find
 *
 *	Replaces the value in ``*block`` with the block number on the device
 *	that holds the data for the requested block number in the file.
 *	That is, asked for block 4 of inode 1, the function will replace the
 *	4 in ``*block`` with the disk block, relative to the start of the
 *	disk, that holds that block of the file.
 *
 *	Returns -EINVAL in case of error, 0 otherwise. If mapping falls into a
 *	hole, returns 0 and ``*block`` is also set to 0.
 */
int bmap(struct inode *inode, sector_t *block)
{
	if (!inode->i_mapping->a_ops->bmap)
		return -EINVAL;

	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
	return 0;
}
EXPORT_SYMBOL(bmap);
#endif

/*
 * With relative atime, only update atime if the previous atime is
 * earlier than or equal to either the ctime or mtime,
 * or if at least a day has passed since the last atime update.
 */
static bool relatime_need_update(struct vfsmount *mnt, struct inode *inode,
			     struct timespec64 now)
{
	struct timespec64 atime, mtime, ctime;

	if (!(mnt->mnt_flags & MNT_RELATIME))
		return true;
	/*
	 * Is mtime younger than or equal to atime? If yes, update atime:
	 */
	atime = inode_get_atime(inode);
	mtime = inode_get_mtime(inode);
	if (timespec64_compare(&mtime, &atime) >= 0)
		return true;
	/*
	 * Is ctime younger than or equal to atime? If yes, update atime:
	 */
	ctime = inode_get_ctime(inode);
	if (timespec64_compare(&ctime, &atime) >= 0)
		return true;

	/*
	 * Is the previous atime value older than a day? If yes,
	 * update atime:
	 */
	if ((long)(now.tv_sec - atime.tv_sec) >= 24*60*60)
		return true;
	/*
	 * Good, we can skip the atime update:
	 */
	return false;
}

/**
 * inode_update_timestamps - update the timestamps on the inode
 * @inode: inode to be updated
 * @flags: S_* flags that needed to be updated
 *
 * The update_time function is called when an inode's timestamps need to be
 * updated for a read or write operation. This function handles updating the
 * actual timestamps. It's up to the caller to ensure that the inode is marked
 * dirty appropriately.
 *
 * In the case where any of S_MTIME, S_CTIME, or S_VERSION need to be updated,
 * attempt to update all three of them. S_ATIME updates can be handled
 * independently of the rest.
 *
 * Returns a set of S_* flags indicating which values changed.
 */
int inode_update_timestamps(struct inode *inode, int flags)
{
	int updated = 0;
	struct timespec64 now;

	if (flags & (S_MTIME|S_CTIME|S_VERSION)) {
		struct timespec64 ctime = inode_get_ctime(inode);
		struct timespec64 mtime = inode_get_mtime(inode);

		now = inode_set_ctime_current(inode);
		if (!timespec64_equal(&now, &ctime))
			updated |= S_CTIME;
		if (!timespec64_equal(&now, &mtime)) {
			inode_set_mtime_to_ts(inode, now);
			updated |= S_MTIME;
		}
		if (IS_I_VERSION(inode) && inode_maybe_inc_iversion(inode, updated))
			updated |= S_VERSION;
	} else {
		now = current_time(inode);
	}

	if (flags & S_ATIME) {
		struct timespec64 atime = inode_get_atime(inode);

		if (!timespec64_equal(&now, &atime)) {
			inode_set_atime_to_ts(inode, now);
			updated |= S_ATIME;
		}
	}
	return updated;
}
EXPORT_SYMBOL(inode_update_timestamps);

/**
 * generic_update_time - update the timestamps on the inode
 * @inode: inode to be updated
 * @flags: S_* flags that needed to be updated
 *
 * The update_time function is called when an inode's timestamps need to be
 * updated for a read or write operation. In the case where any of S_MTIME,
 * S_CTIME, or S_VERSION need to be updated we attempt to update all three of
 * them. S_ATIME updates can be handled independently of the rest.
 *
 * Returns a S_* mask indicating which fields were updated.
 */
int generic_update_time(struct inode *inode, int flags)
{
	int updated = inode_update_timestamps(inode, flags);
	int dirty_flags = 0;

	if (updated & (S_ATIME|S_MTIME|S_CTIME))
		dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ? I_DIRTY_TIME : I_DIRTY_SYNC;
	if (updated & S_VERSION)
		dirty_flags |= I_DIRTY_SYNC;
	__mark_inode_dirty(inode, dirty_flags);
	return updated;
}
EXPORT_SYMBOL(generic_update_time);

/*
 * This does the actual work of updating an inode's time or version.  The
 * caller must have called mnt_want_write() before calling this.
 */
int inode_update_time(struct inode *inode, int flags)
{
	if (inode->i_op->update_time)
		return inode->i_op->update_time(inode, flags);
	generic_update_time(inode, flags);
	return 0;
}
EXPORT_SYMBOL(inode_update_time);

/**
 *	atime_needs_update	-	check if the access time needs updating
 *	@path: the &struct path to check
 *	@inode: inode to check
 *
 *	Determine whether the access time on an inode should be updated.
 *	This automatically handles read-only file systems and media, as well
 *	as the "noatime" flag, inode-specific "noatime" markers, and the
 *	relatime policy.
 */
bool atime_needs_update(const struct path *path, struct inode *inode)
{
	struct vfsmount *mnt = path->mnt;
	struct timespec64 now, atime;

	if (inode->i_flags & S_NOATIME)
		return false;

	/* Atime updates will likely cause i_uid and i_gid to be written
	 * back improperly if their true value is unknown to the vfs.
	 */
	if (HAS_UNMAPPED_ID(mnt_idmap(mnt), inode))
		return false;

	if (IS_NOATIME(inode))
		return false;
	if ((inode->i_sb->s_flags & SB_NODIRATIME) && S_ISDIR(inode->i_mode))
		return false;

	if (mnt->mnt_flags & MNT_NOATIME)
		return false;
	if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
		return false;

	now = current_time(inode);

	if (!relatime_need_update(mnt, inode, now))
		return false;

	atime = inode_get_atime(inode);
	if (timespec64_equal(&atime, &now))
		return false;

	return true;
}

void touch_atime(const struct path *path)
{
	struct vfsmount *mnt = path->mnt;
	struct inode *inode = d_inode(path->dentry);

	if (!atime_needs_update(path, inode))
		return;

	if (!sb_start_write_trylock(inode->i_sb))
		return;

	if (mnt_get_write_access(mnt) != 0)
		goto skip_update;
	/*
	 * File systems can error out when updating inodes if they need to
	 * allocate new space to modify an inode (such is the case for
	 * Btrfs), but since we touch atime while walking down the path we
	 * really don't care if we failed to update the atime of the file,
	 * so just ignore the return value.
	 * We may also fail on filesystems that have the ability to make parts
	 * of the fs read only, e.g. subvolumes in Btrfs.
	 */
	inode_update_time(inode, S_ATIME);
	mnt_put_write_access(mnt);
skip_update:
	sb_end_write(inode->i_sb);
}
EXPORT_SYMBOL(touch_atime);

/*
 * Return mask of changes for notify_change() that need to be done as a
 * response to write or truncate. Return 0 if nothing has to be changed.
 * Negative value on error (change should be denied).
 */
int dentry_needs_remove_privs(struct mnt_idmap *idmap,
			      struct dentry *dentry)
{
	struct inode *inode = d_inode(dentry);
	int mask = 0;
	int ret;

	if (IS_NOSEC(inode))
		return 0;

	mask = setattr_should_drop_suidgid(idmap, inode);
	ret = security_inode_need_killpriv(dentry);
	if (ret < 0)
		return ret;
	if (ret)
		mask |= ATTR_KILL_PRIV;
	return mask;
}

static int __remove_privs(struct mnt_idmap *idmap,
			  struct dentry *dentry, int kill)
{
	struct iattr newattrs;

	newattrs.ia_valid = ATTR_FORCE | kill;
	/*
	 * Note we call this on write, so notify_change will not
	 * encounter any conflicting delegations:
	 */
	return notify_change(idmap, dentry, &newattrs, NULL);
}

static int file_remove_privs_flags(struct file *file, unsigned int flags)
{
	struct dentry *dentry = file_dentry(file);
	struct inode *inode = file_inode(file);
	int error = 0;
	int kill;

	if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode))
		return 0;

	kill = dentry_needs_remove_privs(file_mnt_idmap(file), dentry);
	if (kill < 0)
		return kill;

	if (kill) {
		if (flags & IOCB_NOWAIT)
			return -EAGAIN;

		error = __remove_privs(file_mnt_idmap(file), dentry, kill);
	}

	if (!error)
		inode_has_no_xattr(inode);
	return error;
}

/**
 * file_remove_privs - remove special file privileges (suid, capabilities)
 * @file: file to remove privileges from
 *
 * When file is modified by a write or truncation ensure that special
 * file privileges are removed.
 *
 * Return: 0 on success, negative errno on failure.
 */
int file_remove_privs(struct file *file)
{
	return file_remove_privs_flags(file, 0);
}
EXPORT_SYMBOL(file_remove_privs);

/**
 * current_time - Return FS time (possibly fine-grained)
 * @inode: inode.
 *
 * Return the current time truncated to the time granularity supported by
 * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
 * as having been QUERIED, get a fine-grained timestamp, but don't update
 * the floor.
 *
 * For a multigrain inode, this is effectively an estimate of the timestamp
 * that a file would receive. An actual update must go through
 * inode_set_ctime_current().
 */
struct timespec64 current_time(struct inode *inode)
{
	struct timespec64 now;
	u32 cns;

	ktime_get_coarse_real_ts64_mg(&now);

	if (!is_mgtime(inode))
		goto out;

	/* If nothing has queried it, then coarse time is fine */
	cns = smp_load_acquire(&inode->i_ctime_nsec);
	if (cns & I_CTIME_QUERIED) {
		/*
		 * If there is no apparent change, then get a fine-grained
		 * timestamp.
		 */
		if (now.tv_nsec == (cns & ~I_CTIME_QUERIED))
			ktime_get_real_ts64(&now);
	}
out:
	return timestamp_truncate(now, inode);
}
EXPORT_SYMBOL(current_time);

static int file_update_time_flags(struct file *file, unsigned int flags)
{
	struct inode *inode = file_inode(file);
	struct timespec64 now, ts;
	int sync_mode = 0;
	int ret = 0;

	/* First try to exhaust all avenues to not sync */
	if (IS_NOCMTIME(inode))
		return 0;
	if (unlikely(file->f_mode & FMODE_NOCMTIME))
		return 0;

	now = current_time(inode);

	ts = inode_get_mtime(inode);
	if (!timespec64_equal(&ts, &now))
		sync_mode |= S_MTIME;
	ts = inode_get_ctime(inode);
	if (!timespec64_equal(&ts, &now))
		sync_mode |= S_CTIME;
	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
		sync_mode |= S_VERSION;

	if (!sync_mode)
		return 0;

	if (flags & IOCB_NOWAIT)
		return -EAGAIN;

	if (mnt_get_write_access_file(file))
		return 0;
	ret = inode_update_time(inode, sync_mode);
	mnt_put_write_access_file(file);
	return ret;
}

/**
 * file_update_time - update mtime and ctime
 * @file: file accessed
 *
 * Update the mtime and ctime members of an inode and mark the inode for
 * writeback. Note that this function is meant exclusively for usage in
 * the file write path of filesystems, and filesystems may choose to
 * explicitly ignore updates via this function with the _NOCMTIME inode
 * flag, e.g. for network filesystems where these timestamps are handled
 * by the server. This can return an error for file systems that need to
 * allocate space in order to update an inode.
 *
 * Return: 0 on success, negative errno on failure.
 */
int file_update_time(struct file *file)
{
	return file_update_time_flags(file, 0);
}
EXPORT_SYMBOL(file_update_time);

/**
 * file_modified_flags - handle mandated vfs changes when modifying a file
 * @file: file that was modified
 * @flags: kiocb flags
 *
 * When file has been modified ensure that special
 * file privileges are removed and time settings are updated.
 *
 * If IOCB_NOWAIT is set, special file privileges will not be removed and
 * time settings will not be updated. It will return -EAGAIN.
 *
 * Context: Caller must hold the file's inode lock.
 *
 * Return: 0 on success, negative errno on failure.
 */
static int file_modified_flags(struct file *file, int flags)
{
	int ret;

	/*
	 * Clear the security bits if the process is not being run by root.
	 * This keeps people from modifying setuid and setgid binaries.
	 */
	ret = file_remove_privs_flags(file, flags);
	if (ret)
		return ret;
	return file_update_time_flags(file, flags);
}

/**
 * file_modified - handle mandated vfs changes when modifying a file
 * @file: file that was modified
 *
 * When file has been modified ensure that special
 * file privileges are removed and time settings are updated.
 *
 * Context: Caller must hold the file's inode lock.
 *
 * Return: 0 on success, negative errno on failure.
 */
int file_modified(struct file *file)
{
	return file_modified_flags(file, 0);
}
EXPORT_SYMBOL(file_modified);

/**
 * kiocb_modified - handle mandated vfs changes when modifying a file
 * @iocb: iocb that was modified
 *
 * When file has been modified ensure that special
 * file privileges are removed and time settings are updated.
 *
 * Context: Caller must hold the file's inode lock.
 *
 * Return: 0 on success, negative errno on failure.
 */
int kiocb_modified(struct kiocb *iocb)
{
	return file_modified_flags(iocb->ki_filp, iocb->ki_flags);
}
EXPORT_SYMBOL_GPL(kiocb_modified);

int inode_needs_sync(struct inode *inode)
{
	if (IS_SYNC(inode))
		return 1;
	if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
		return 1;
	return 0;
}
EXPORT_SYMBOL(inode_needs_sync);

/*
 * If we try to find an inode in the inode hash while it is being
 * deleted, we have to wait until the filesystem completes its
 * deletion before reporting that it isn't found.  This function waits
 * until the deletion _might_ have completed.  Callers are responsible
 * for rechecking inode state.
 *
 * It doesn't matter if I_NEW is not set initially, a call to
 * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
 * will DTRT.
 */
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked)
{
	struct wait_bit_queue_entry wqe;
	struct wait_queue_head *wq_head;

	/*
	 * Handle racing against evict(), see that routine for more details.
	 */
	if (unlikely(inode_unhashed(inode))) {
		WARN_ON(is_inode_hash_locked);
		spin_unlock(&inode->i_lock);
		return;
	}

	wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
	prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();
	if (is_inode_hash_locked)
		spin_unlock(&inode_hash_lock);
	schedule();
	finish_wait(wq_head, &wqe.wq_entry);
	if (is_inode_hash_locked)
		spin_lock(&inode_hash_lock);
	rcu_read_lock();
}

static __initdata unsigned long ihash_entries;
static int __init set_ihash_entries(char *str)
{
	if (!str)
		return 0;
	ihash_entries = simple_strtoul(str, &str, 0);
	return 1;
}
__setup("ihash_entries=", set_ihash_entries);

/*
 * Initialize the waitqueues and inode hash table.
 */
void __init inode_init_early(void)
{
	/* If hashes are distributed across NUMA nodes, defer
	 * hash allocation until vmalloc space is available.
	 */
	if (hashdist)
		return;

	inode_hashtable =
		alloc_large_system_hash("Inode-cache",
					sizeof(struct hlist_head),
					ihash_entries,
					14,
					HASH_EARLY | HASH_ZERO,
					&i_hash_shift,
					&i_hash_mask,
					0,
					0);
}

void __init inode_init(void)
{
	/* inode slab cache */
	inode_cachep = kmem_cache_create("inode_cache",
					 sizeof(struct inode),
					 0,
					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
					 SLAB_ACCOUNT),
					 init_once);

	/* Hash may have been set up in inode_init_early */
	if (!hashdist)
		return;

	inode_hashtable =
		alloc_large_system_hash("Inode-cache",
					sizeof(struct hlist_head),
					ihash_entries,
					14,
					HASH_ZERO,
					&i_hash_shift,
					&i_hash_mask,
					0,
					0);
}

void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
	inode->i_mode = mode;
	switch (inode->i_mode & S_IFMT) {
	case S_IFCHR:
		inode->i_fop = &def_chr_fops;
		inode->i_rdev = rdev;
		break;
	case S_IFBLK:
		if (IS_ENABLED(CONFIG_BLOCK))
			inode->i_fop = &def_blk_fops;
		inode->i_rdev = rdev;
		break;
	case S_IFIFO:
		inode->i_fop = &pipefifo_fops;
		break;
	case S_IFSOCK:
		/* leave it no_open_fops */
		break;
	default:
		printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for"
				  " inode %s:%lu\n", mode, inode->i_sb->s_id,
				  inode->i_ino);
		break;
	}
}
EXPORT_SYMBOL(init_special_inode);

/**
 * inode_init_owner - Init uid,gid,mode for new inode according to posix standards
 * @idmap: idmap of the mount the inode was created from
 * @inode: New inode
 * @dir: Directory inode
 * @mode: mode of the new inode
 *
 * If the inode has been created through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions
 * and initializing i_uid and i_gid. On non-idmapped mounts or if permission
 * checking is to be performed on the raw inode simply pass @nop_mnt_idmap.
 */
void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
		      const struct inode *dir, umode_t mode)
{
	inode_fsuid_set(inode, idmap);
	if (dir && dir->i_mode & S_ISGID) {
		inode->i_gid = dir->i_gid;

		/* Directories are special, and always inherit S_ISGID */
		if (S_ISDIR(mode))
			mode |= S_ISGID;
	} else
		inode_fsgid_set(inode, idmap);
	inode->i_mode = mode;
}
EXPORT_SYMBOL(inode_init_owner);

/**
 * inode_owner_or_capable - check current task permissions to inode
 * @idmap: idmap of the mount the inode was found from
 * @inode: inode being checked
 *
 * Return true if current either has CAP_FOWNER in a namespace with the
 * inode owner uid mapped, or owns the file.
 *
 * If the inode has been found through an idmapped mount the idmap of
 * the vfsmount must be passed through @idmap. This function will then take
 * care to map the inode according to @idmap before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply pass @nop_mnt_idmap.
 */
bool inode_owner_or_capable(struct mnt_idmap *idmap,
			    const struct inode *inode)
{
	vfsuid_t vfsuid;
	struct user_namespace *ns;

	vfsuid = i_uid_into_vfsuid(idmap, inode);
	if (vfsuid_eq_kuid(vfsuid, current_fsuid()))
		return true;

	ns = current_user_ns();
	if (vfsuid_has_mapping(ns, vfsuid) && ns_capable(ns, CAP_FOWNER))
		return true;
	return false;
}
EXPORT_SYMBOL(inode_owner_or_capable);

/*
 * Direct i/o helper functions
 */
bool inode_dio_finished(const struct inode *inode)
{
	return atomic_read(&inode->i_dio_count) == 0;
}
EXPORT_SYMBOL(inode_dio_finished);

/**
 * inode_dio_wait - wait for outstanding DIO requests to finish
 * @inode: inode to wait for
 *
 * Waits for all pending direct I/O requests to finish so that we can
 * proceed with a truncate or equivalent operation.
 *
 * Must be called under a lock that serializes taking new references
 * to i_dio_count, usually by inode->i_rwsem.
 */
void inode_dio_wait(struct inode *inode)
{
	wait_var_event(&inode->i_dio_count, inode_dio_finished(inode));
}
EXPORT_SYMBOL(inode_dio_wait);

void inode_dio_wait_interruptible(struct inode *inode)
{
	wait_var_event_interruptible(&inode->i_dio_count,
				     inode_dio_finished(inode));
}
EXPORT_SYMBOL(inode_dio_wait_interruptible);

/*
 * inode_set_flags - atomically set some inode flags
 *
 * Note: the caller should be holding i_rwsem exclusively, or else be sure that
 * they have exclusive access to the inode structure (i.e., while the
 * inode is being instantiated).  The reason for the cmpxchg() loop
 * --- which wouldn't be necessary if all code paths which modify
 * i_flags actually followed this rule --- is that there is at least one
 * code path which doesn't today, so we use cmpxchg() out of an abundance
 * of caution.
 *
 * In the long run, i_rwsem is overkill, and we should probably look
 * at using the i_lock spinlock to protect i_flags, and then make sure
 * it is so documented in include/linux/fs.h and that all code follows
 * the locking convention!!
 */
void inode_set_flags(struct inode *inode, unsigned int flags,
		     unsigned int mask)
{
	WARN_ON_ONCE(flags & ~mask);
	set_mask_bits(&inode->i_flags, mask, flags);
}
EXPORT_SYMBOL(inode_set_flags);

void inode_nohighmem(struct inode *inode)
{
	mapping_set_gfp_mask(inode->i_mapping, GFP_USER);
}
EXPORT_SYMBOL(inode_nohighmem);

struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
{
	trace_inode_set_ctime_to_ts(inode, &ts);
	set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
	inode->i_ctime_sec = ts.tv_sec;
	inode->i_ctime_nsec = ts.tv_nsec;
	return ts;
}
EXPORT_SYMBOL(inode_set_ctime_to_ts);

/**
 * timestamp_truncate - Truncate timespec to a granularity
 * @t: Timespec
 * @inode: inode being updated
 *
 * Truncate a timespec to the granularity supported by the fs
 * containing the inode. Always rounds down. gran must
 * not be 0 nor greater than a second (NSEC_PER_SEC, or 10^9 ns).
 */
struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned int gran = sb->s_time_gran;

	t.tv_sec = clamp(t.tv_sec, sb->s_time_min, sb->s_time_max);
	if (unlikely(t.tv_sec == sb->s_time_max || t.tv_sec == sb->s_time_min))
		t.tv_nsec = 0;

	/* Avoid division in the common cases 1 ns and 1 s. */
	if (gran == 1)
		; /* nothing */
	else if (gran == NSEC_PER_SEC)
		t.tv_nsec = 0;
	else if (gran > 1 && gran < NSEC_PER_SEC)
		t.tv_nsec -= t.tv_nsec % gran;
	else
		WARN(1, "invalid file time granularity: %u", gran);
	return t;
}
EXPORT_SYMBOL(timestamp_truncate);

/**
 * inode_set_ctime_current - set the ctime to current_time
 * @inode: inode
 *
 * Set the inode's ctime to the current value for the inode. Returns the
 * current value that was assigned. If this is not a multigrain inode, then we
 * set it to the later of the coarse time and floor value.
 *
 * If it is multigrain, then we first see if the coarse-grained timestamp is
 * distinct from what is already there. If so, then use that. Otherwise, get a
 * fine-grained timestamp.
 *
 * After that, try to swap the new value into i_ctime_nsec. Accept the
 * resulting ctime, regardless of the outcome of the swap. If it has
 * already been replaced, then that timestamp is later than the earlier
 * unacceptable one, and is thus acceptable.
 */
struct timespec64 inode_set_ctime_current(struct inode *inode)
{
	struct timespec64 now;
	u32 cns, cur;

	ktime_get_coarse_real_ts64_mg(&now);
	now = timestamp_truncate(now, inode);

	/* Just return that if this is not a multigrain fs */
	if (!is_mgtime(inode)) {
		inode_set_ctime_to_ts(inode, now);
		goto out;
	}

	/*
	 * A fine-grained time is only needed if someone has queried
	 * for timestamps, and the current coarse grained time isn't
	 * later than what's already there.
	 */
	cns = smp_load_acquire(&inode->i_ctime_nsec);
	if (cns & I_CTIME_QUERIED) {
		struct timespec64 ctime = { .tv_sec = inode->i_ctime_sec,
					    .tv_nsec = cns & ~I_CTIME_QUERIED };

		if (timespec64_compare(&now, &ctime) <= 0) {
			ktime_get_real_ts64_mg(&now);
			now = timestamp_truncate(now, inode);
			mgtime_counter_inc(mg_fine_stamps);
		}
	}
	mgtime_counter_inc(mg_ctime_updates);

	/* No need to cmpxchg if it's exactly the same */
	if (cns == now.tv_nsec && inode->i_ctime_sec == now.tv_sec) {
		trace_ctime_xchg_skip(inode, &now);
		goto out;
	}
	cur = cns;
retry:
	/* Try to swap the nsec value into place. */
	if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) {
		/* If swap occurred, then we're (mostly) done */
		inode->i_ctime_sec = now.tv_sec;
		trace_ctime_ns_xchg(inode, cns, now.tv_nsec, cur);
		mgtime_counter_inc(mg_ctime_swaps);
	} else {
		/*
		 * Was the change due to someone marking the old ctime QUERIED?
		 * If so then retry the swap. This can only happen once since
		 * the only way to clear I_CTIME_QUERIED is to stamp the inode
		 * with a new ctime.
		 */
		if (!(cns & I_CTIME_QUERIED) && (cns | I_CTIME_QUERIED) == cur) {
			cns = cur;
			goto retry;
		}
		/* Otherwise, keep the existing ctime */
		now.tv_sec = inode->i_ctime_sec;
		now.tv_nsec = cur & ~I_CTIME_QUERIED;
	}
out:
	return now;
}
EXPORT_SYMBOL(inode_set_ctime_current);

/**
 * inode_set_ctime_deleg - try to update the ctime on a delegated inode
 * @inode: inode to update
 * @update: timespec64 to set the ctime
 *
 * Attempt to atomically update the ctime on behalf of a delegation holder.
 *
 * The nfs server can call back the holder of a delegation to get updated
 * inode attributes, including the mtime. When updating the mtime, update
 * the ctime to a value at least equal to that.
 *
 * This can race with concurrent updates to the inode, in which
 * case the update is skipped.
 *
 * Note that this works even when multigrain timestamps are not enabled,
 * so it is used in either case.
 */
struct timespec64 inode_set_ctime_deleg(struct inode *inode, struct timespec64 update)
{
	struct timespec64 now, cur_ts;
	u32 cur, old;

	/* pairs with try_cmpxchg below */
	cur = smp_load_acquire(&inode->i_ctime_nsec);
	cur_ts.tv_nsec = cur & ~I_CTIME_QUERIED;
	cur_ts.tv_sec = inode->i_ctime_sec;

	/* If the update is older than the existing value, skip it. */
	if (timespec64_compare(&update, &cur_ts) <= 0)
		return cur_ts;

	ktime_get_coarse_real_ts64_mg(&now);

	/* Clamp the update to "now" if it's in the future */
	if (timespec64_compare(&update, &now) > 0)
		update = now;

	update = timestamp_truncate(update, inode);

	/* No need to update if the values are already the same */
	if (timespec64_equal(&update, &cur_ts))
		return cur_ts;

	/*
	 * Try to swap the nsec value into place. If it fails, that means
	 * it raced with an update due to a write or similar activity. That
	 * stamp takes precedence, so just skip the update.
	 */
retry:
	old = cur;
	if (try_cmpxchg(&inode->i_ctime_nsec, &cur, update.tv_nsec)) {
		inode->i_ctime_sec = update.tv_sec;
		mgtime_counter_inc(mg_ctime_swaps);
		return update;
	}

	/*
	 * Was the change due to another task marking the old ctime QUERIED?
	 *
	 * If so, then retry the swap. This can only happen once since
	 * the only way to clear I_CTIME_QUERIED is to stamp the inode
	 * with a new ctime.
	 */
	if (!(old & I_CTIME_QUERIED) && (cur == (old | I_CTIME_QUERIED)))
		goto retry;

	/* Otherwise, it was a new timestamp. */
	cur_ts.tv_sec = inode->i_ctime_sec;
	cur_ts.tv_nsec = cur & ~I_CTIME_QUERIED;
	return cur_ts;
}
EXPORT_SYMBOL(inode_set_ctime_deleg);

/**
 * in_group_or_capable - check whether caller is CAP_FSETID privileged
 * @idmap:	idmap of the mount @inode was found from
 * @inode:	inode to check
 * @vfsgid:	the new/current vfsgid of @inode
 *
 * Check whether @vfsgid is in the caller's group list or if the caller is
 * privileged with CAP_FSETID over @inode. This can be used to determine
 * whether the setgid bit can be kept or must be dropped.
 *
 * Return: true if the caller is sufficiently privileged, false if not.
 */
bool in_group_or_capable(struct mnt_idmap *idmap,
			 const struct inode *inode, vfsgid_t vfsgid)
{
	if (vfsgid_in_group_p(vfsgid))
		return true;
	if (capable_wrt_inode_uidgid(idmap, inode, CAP_FSETID))
		return true;
	return false;
}
EXPORT_SYMBOL(in_group_or_capable);

/**
 * mode_strip_sgid - handle the sgid bit for non-directories
 * @idmap: idmap of the mount the inode was created from
 * @dir: parent directory inode
 * @mode: mode of the file to be created in @dir
 *
 * If the @mode of the new file has both the S_ISGID and S_IXGRP bit
 * raised and @dir has the S_ISGID bit raised ensure that the caller is
 * either in the group of the parent directory or they have CAP_FSETID
 * in their user namespace and are privileged over the parent directory.
 * In all other cases, strip the S_ISGID bit from @mode.
 *
 * Return: the new mode to use for the file
 */
umode_t mode_strip_sgid(struct mnt_idmap *idmap,
			const struct inode *dir, umode_t mode)
{
	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
		return mode;
	if (S_ISDIR(mode) || !dir || !(dir->i_mode & S_ISGID))
		return mode;
	if (in_group_or_capable(idmap, dir, i_gid_into_vfsgid(idmap, dir)))
		return mode;
	return mode & ~S_ISGID;
}
EXPORT_SYMBOL(mode_strip_sgid);

#ifdef CONFIG_DEBUG_VFS
/*
 * Dump an inode.
 *
 * TODO: add a proper inode dumping routine, this is a stub to get debug off the
 * ground.
 *
 * TODO: handle getting to fs type with get_kernel_nofault()?
 * See dump_mapping() above.
 */
void dump_inode(struct inode *inode, const char *reason)
{
	struct super_block *sb = inode->i_sb;

	pr_warn("%s encountered for inode %px\n"
		"fs %s mode %ho opflags 0x%hx flags 0x%x state 0x%x count %d\n",
		reason, inode, sb->s_type->name, inode->i_mode, inode->i_opflags,
		inode->i_flags, inode_state_read_once(inode), atomic_read(&inode->i_count));
}
EXPORT_SYMBOL(dump_inode);
#endif
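The QUERIED-bit handoff that `inode_set_ctime_current` and `inode_set_ctime_deleg` retry on can be replayed deterministically in userspace. Below is a minimal single-threaded sketch with invented names (`QUERIED_BIT`, `model_try_cmpxchg`, `stamp_nsec` are not kernel symbols); it models the flag as the top bit of the nsec word, as the `~I_CTIME_QUERIED` masking in the code above implies, and mirrors the rule that the first CAS may lose at most once to a reader setting the flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-in for I_CTIME_QUERIED; the kernel masks it out of the
 * nsec word with ~I_CTIME_QUERIED, so model it here as the top bit. */
#define QUERIED_BIT 0x80000000u

/* Single-threaded stand-in for try_cmpxchg() on the nsec word: on failure
 * it loads the current value back into *expected, as the real one does. */
static bool model_try_cmpxchg(uint32_t *word, uint32_t *expected, uint32_t new)
{
	if (*word == *expected) {
		*word = new;
		return true;
	}
	*expected = *word;
	return false;
}

/* Replay the retry in inode_set_ctime_current: if the first CAS loses
 * only because a reader set QUERIED_BIT on the snapshotted value, retry
 * exactly once; if it loses to a full restamp, keep the existing value. */
static uint32_t stamp_nsec(uint32_t *word, uint32_t snapshot, uint32_t new_nsec)
{
	uint32_t cur = snapshot;

	if (model_try_cmpxchg(word, &cur, new_nsec))
		return new_nsec;			/* swap worked */
	if (!(snapshot & QUERIED_BIT) && cur == (snapshot | QUERIED_BIT)) {
		if (model_try_cmpxchg(word, &cur, new_nsec))
			return new_nsec;		/* retry after QUERIED race */
	}
	return cur & ~QUERIED_BIT;			/* lost to a newer stamp */
}
```

The one-retry bound holds because, as the comments in both kernel functions note, the only way to clear the queried flag is to stamp the inode with a new ctime, which the third case above treats as a lost race.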
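The benign inconsistency KCSAN flagged follows from store/load ordering: `inode_set_ctime_current` publishes nsec (via `try_cmpxchg`) before sec, while the reader in `fill_mg_cmtime` loads sec before nsec. A minimal single-threaded sketch (invented names; not the kernel's `struct inode`) replays that interleaving deterministically:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature model of the two timestamp fields. */
struct ts_model {
	int64_t sec;
	uint32_t nsec;
};

/* Writer step 1: publish the new nsec, as the try_cmpxchg does. */
static void writer_store_nsec(struct ts_model *m, uint32_t new_nsec)
{
	m->nsec = new_nsec;
}

/* Writer step 2: publish the new sec with a plain store. */
static void writer_store_sec(struct ts_model *m, int64_t new_sec)
{
	m->sec = new_sec;
}

/* Reader: sec first, then nsec -- the order fill_mg_cmtime uses. */
static void reader_snapshot(const struct ts_model *m, int64_t *sec, uint32_t *nsec)
{
	*sec = m->sec;
	*nsec = m->nsec;
}

/* Run the reader between the writer's two stores: the snapshot pairs the
 * old seconds with the new nanoseconds, i.e. 100.001s while the timestamp
 * moves from 100.999s to 101.001s.  Returns 1 if that mix was observed. */
static int demo_inconsistent_snapshot(void)
{
	struct ts_model m = { .sec = 100, .nsec = 999000000 };	/* 100.999s */
	int64_t sec;
	uint32_t nsec;

	writer_store_nsec(&m, 1000000);		/* update begins */
	reader_snapshot(&m, &sec, &nsec);	/* reader interleaves here */
	writer_store_sec(&m, 101);		/* update completes: 101.001s */

	return sec == 100 && nsec == 1000000;
}
```

Each individual load and store is still atomic (no torn values on a naturally aligned 8-byte field), which is why the race is assessed as benign despite the momentarily backwards-looking snapshot.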

14/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
	input: 37224
	output: 21
	thoughts: 88

Thoughts:
Okay, here's my summary, taking the role of the expert:

**Investigating the Origin of `I_CTIME_QUERIED`**

Right, so I'm digging into `inode_set_ctime_current` and I see this `I_CTIME_QUERIED` constant being used. It's a bit of a head-scratcher. My first instinct, given my experience, is to check `fs/inode.c` directly for its definition. Nope, not there. Okay, so it has to be defined in an included header file.

I tried the usual suspects, and my initial search for its definition in `include/linux/fs.h` using `codesearch-definition-source` came up empty. That's unusual, but not unheard of. Sometimes these things are buried a bit deeper. Time to broaden the search. I need to run a full, exhaustive search across the entire code base to be certain I haven't missed something and to understand its intended purpose. I'll need to use a more general search, probably across all files to track it down. Let's get to it.



15/2 2026/01/26 23:20 tool codesearch-dir-index
0m Args:
map[Dir:include/linux]

Results:
map[Files:[(full listing of include/linux header files omitted)] Subdirs:[amba atomic avf bcma bnxt bus byteorder can cdx ceph clk comedi crush decompress device dma dma-buf dsa extcon firmware fpga framer fs fsl generic_pt gpio greybus habanalabs hsi i3c iio input io_uring irqchip isdn kho lockd lsm mailbox mdio memory mfd misc mlx4 mlx5 mmc mtd mux net netfilter netfilter_arp netfilter_bridge netfilter_ipv4 netfilter_ipv6 ns pcs pds perf phy pinctrl platform_data power pse-pd pwrseq qat qed raid raspberrypi regulator remoteproc reset rpmsg rtc sched soc soundwire spi ssb sunrpc surface_aggregator ulpi unaligned usb]]

16/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 50646
							output: 26
							thoughts: 34
						

Thoughts:
Okay, here's my thought process on that:

**Investigating a Potential Header File Location**

Hmm, so I need to find where that particular macro, `I_CTIME_QUERIED`, is defined. I've got a hunch it might be buried within the kernel's file system headers. Given the context, `include/linux/fs.h` seems like a likely candidate. Instead of a broad search, I should start with a more targeted approach. I'll do a focused search specifically within `include/linux/fs.h` to see if that macro shows up. That will be more efficient than casting a wider net initially. It is a good starting point, and I will adjust the search as needed if this approach doesn't turn up what I am after.



17/2 2026/01/26 23:20 tool codesearch-find-references
0m Error:
requested entity does not exist

Args:
map[Name:I_CTIME_QUERIED]

Results:
map[References:<nil> TruncatedOutput:false]

18/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 50894
							output: 41
							thoughts: 1452
						

Thoughts:
**Analyzing a Potential Race Condition in Multigrain Timestamps**

Okay, I need to understand this timestamping behavior. Something feels off. It spans `fs/inode.c` (the writer side) and `fs/stat.c` (the reader side). I was hoping to find a quick definition, but I can only use `codesearch-find-references`. I should try searching with the `I_CTIME_` prefix.

Ah, it seems the race condition is centered around `inode->i_ctime_sec`. I see task 3305 setting `inode->i_ctime_sec = now.tv_sec` in `fs/inode.c` (line 2843), and task 3306 reading `stat->ctime.tv_sec = inode->i_ctime_sec` in `fs/stat.c` (line 56). Let's go through the code block.

Let me examine `fill_mg_cmtime` again: I see that `i_ctime_sec` is read before `i_ctime_nsec`.

```c
  45:	void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
  46:	{
  47:		atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
  ...
  55:		stat->mtime = inode_get_mtime(inode);
  56:		stat->ctime.tv_sec = inode->i_ctime_sec;
  57:		stat->ctime.tv_nsec = (u32)atomic_read(pcn);
  58:		if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
  59:			stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
  60:		stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
  61:		trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
  62:	}
```

Now, let's look at `inode_set_ctime_current`.

```c
2841:		if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) {
2842:			/* If swap occurred, then we're (mostly) done */
2843:			inode->i_ctime_sec = now.tv_sec;
2844:			trace_ctime_ns_xchg(inode, cns, now.tv_nsec, cur);
2845:			mgtime_counter_inc(mg_ctime_swaps);
2846:		} else {
```

I see it! In `inode_set_ctime_current`, they're updating `i_ctime_nsec` *before* `i_ctime_sec`. The `try_cmpxchg` is visible to others, and could be read before `i_ctime_sec` gets updated! If `fill_mg_cmtime` reads the values in the wrong order, I have a problem.

Here's my scenario. Suppose the stored ctime is `S.999999999` and the clock has just crossed into second `S+1`:
1.  `inode_set_ctime_current` is running.
2.  `try_cmpxchg(&inode->i_ctime_nsec, ..., 000)`: `i_ctime_nsec` becomes 000.
3.  `inode->i_ctime_sec = S+1`: `i_ctime_sec` becomes S+1.

Meanwhile, `fill_mg_cmtime` interleaves between steps 2 and 3:
1.  `stat->ctime.tv_sec = inode->i_ctime_sec` (reads S -- the *old* seconds value)
2.  `stat->ctime.tv_nsec = atomic_read(&inode->i_ctime_nsec)` (reads 000 -- the *new* nanoseconds value)

The result is `S.000` -- a timestamp *older* than the `S.999999999` that was visible before the update began. That's non-monotonic! I've confirmed a race that can make the reported ctime go backward, which looks like a bug. This multigrain timestamp feature is causing headaches; it needs careful attention.

Now, what about barriers? `try_cmpxchg` is atomic, and is a full barrier. However, there's no ordering after the assignment to `i_ctime_sec`. And `fill_mg_cmtime`


19/2 2026/01/26 23:20 tool codesearch-definition-source
0m Args:
map[ContextFile:include/linux/fs.h IncludeLines:true Name:inode_get_ctime]

Results:
map[SourceCode:1691:	static inline struct timespec64 inode_get_ctime(const struct inode *inode)
1692:	{
1693:		struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
1694:					 .tv_nsec = inode_get_ctime_nsec(inode) };
1695:	
1696:		return ts;
1697:	}
 SourceFile:include/linux/fs.h]

20/2 2026/01/26 23:20 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 53555
							output: 87
							thoughts: 20
						

Thoughts:
Okay, here's my interpretation of those thoughts, tailored for an expert audience, presented in the first person:

**Analyzing Time Retrieval in File System Metadata**

Alright, so I'm thinking about how this file system implementation handles timestamp information, focusing on the low-level accessors. Specifically, I'm weighing the implications of having separate functions for the change time (ctime) -- `inode_get_ctime_sec` and `inode_get_ctime_nsec`. This tells me the seconds and nanoseconds are stored in separate fields. That's a key design choice: it ripples into atomicity, caching strategies, and potential data races whenever this metadata is accessed. Since the two halves cannot be read or written as a single unit, a concurrent update can be observed half-applied unless something orders the accesses, and that will shape how I analyze the consistency model and the performance trade-offs here. It also raises the question of *why* the split exists -- does it align with specific hardware instructions, or is there another design reason for storing seconds and nanoseconds separately?



21/2 2026/01/26 23:21 tool codesearch-definition-source
0m Args:
map[ContextFile:include/linux/fs.h IncludeLines:true Name:inode_get_ctime_sec]

Results:
map[SourceCode:1681:	static inline time64_t inode_get_ctime_sec(const struct inode *inode)
1682:	{
1683:		return inode->i_ctime_sec;
1684:	}
 SourceFile:include/linux/fs.h]

22/2 2026/01/26 23:21 tool codesearch-definition-source
0m Args:
map[ContextFile:include/linux/fs.h IncludeLines:true Name:inode_get_ctime_nsec]

Results:
map[SourceCode:1686:	static inline long inode_get_ctime_nsec(const struct inode *inode)
1687:	{
1688:		return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
1689:	}
 SourceFile:include/linux/fs.h]

23/2 2026/01/26 23:21 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 54120
							output: 26
							thoughts: 34
						

Thoughts:
Okay, here's my take on those thoughts, tailored for an expert audience:

**Kernel Timestamp Investigation**

Alright, that's interesting. I've spotted `I_CTIME_QUERIED`. My gut tells me that's a key timestamp flag. Let me dig a bit deeper. Yup, there it is, right in `include/linux/fs.h`. Okay, no surprises there, that's where I'd expect it to be defined. Now the hunt is on to trace where this macro is actually *used*. I'm going to need to track down all the read and write instances now to have a clear understanding of its function.



24/2 2026/01/26 23:21 tool codesearch-find-references
0m Error:
requested entity does not exist

Args:
map[Name:I_CTIME_QUERIED]

Results:
map[References:<nil> TruncatedOutput:false]

25/2 2026/01/26 23:21 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 54342
							output: 22
							thoughts: 59
						

Thoughts:
Okay, here's my take on those thoughts, framed as my own:

**Investigating a Discrepancy**

Right, this is strange. I'm looking at this code snippet, and I'm sure I see the function right there. But, `codesearch-find-references` is telling me it's nowhere to be found. That's a classic head-scratcher. It's making me wonder if I'm dealing with an `enum` or perhaps a macro definition. That would certainly explain the disconnect between the apparent declaration in the code and the search tool's findings.

Time to dig into the source. I'll check `include/linux/fs.h` around line 1680. That seems like a good place to start, given the context. Let's see what's lurking in there.



26/2 2026/01/26 23:21 tool read-file
0m Args:
map[File:include/linux/fs.h]

Results:
map[Contents:/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_FS_H
#define _LINUX_FS_H

#include <linux/fs/super.h>
#include <linux/vfsdebug.h>
#include <linux/linkage.h>
#include <linux/wait_bit.h>
#include <linux/kdev_t.h>
#include <linux/dcache.h>
#include <linux/path.h>
#include <linux/stat.h>
#include <linux/cache.h>
#include <linux/list.h>
#include <linux/llist.h>
#include <linux/radix-tree.h>
#include <linux/xarray.h>
#include <linux/rbtree.h>
#include <linux/init.h>
#include <linux/pid.h>
#include <linux/bug.h>
#include <linux/mutex.h>
#include <linux/rwsem.h>
#include <linux/mm_types.h>
#include <linux/capability.h>
#include <linux/semaphore.h>
#include <linux/fcntl.h>
#include <linux/rculist_bl.h>
#include <linux/atomic.h>
#include <linux/shrinker.h>
#include <linux/migrate_mode.h>
#include <linux/uidgid.h>
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/workqueue.h>
#include <linux/delayed_call.h>
#include <linux/uuid.h>
#include <linux/errseq.h>
#include <linux/ioprio.h>
#include <linux/build_bug.h>
#include <linux/stddef.h>
#include <linux/mount.h>
#include <linux/cred.h>
#include <linux/mnt_idmapping.h>
#include <linux/slab.h>
#include <linux/maple_tree.h>
#include <linux/rw_hint.h>
#include <linux/file_ref.h>
#include <linux/unicode.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>

struct bdi_writeback;
struct bio;
struct io_comp_batch;
struct fiemap_extent_info;
struct hd_geometry;
struct iovec;
struct kiocb;
struct kobject;
struct pipe_inode_info;
struct poll_table_struct;
struct kstatfs;
struct vm_area_struct;
struct vfsmount;
struct cred;
struct swap_info_struct;
struct seq_file;
struct iov_iter;
struct fsnotify_mark_connector;
struct fs_context;
struct fs_parameter_spec;
struct file_kattr;
struct iomap_ops;
struct delegated_inode;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
extern void __init files_init(void);
extern void __init files_maxfiles_init(void);

extern unsigned long get_max_files(void);
extern unsigned int sysctl_nr_open;

typedef __kernel_rwf_t rwf_t;

struct buffer_head;
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			struct buffer_head *bh_result, int create);
typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
			ssize_t bytes, void *private);

#define MAY_EXEC		0x00000001
#define MAY_WRITE		0x00000002
#define MAY_READ		0x00000004
#define MAY_APPEND		0x00000008
#define MAY_ACCESS		0x00000010
#define MAY_OPEN		0x00000020
#define MAY_CHDIR		0x00000040
/* called from RCU mode, don't block */
#define MAY_NOT_BLOCK		0x00000080

/*
 * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
 * to O_WRONLY and O_RDWR via the strange trick in do_dentry_open()
 */

/* file is open for reading */
#define FMODE_READ		((__force fmode_t)(1 << 0))
/* file is open for writing */
#define FMODE_WRITE		((__force fmode_t)(1 << 1))
/* file is seekable */
#define FMODE_LSEEK		((__force fmode_t)(1 << 2))
/* file can be accessed using pread */
#define FMODE_PREAD		((__force fmode_t)(1 << 3))
/* file can be accessed using pwrite */
#define FMODE_PWRITE		((__force fmode_t)(1 << 4))
/* File is opened for execution with sys_execve / sys_uselib */
#define FMODE_EXEC		((__force fmode_t)(1 << 5))
/* File writes are restricted (block device specific) */
#define FMODE_WRITE_RESTRICTED	((__force fmode_t)(1 << 6))
/* File supports atomic writes */
#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)(1 << 7))

/* FMODE_* bit 8 */

/* 32bit hashes as llseek() offset (for directories) */
#define FMODE_32BITHASH         ((__force fmode_t)(1 << 9))
/* 64bit hashes as llseek() offset (for directories) */
#define FMODE_64BITHASH         ((__force fmode_t)(1 << 10))

/*
 * Don't update ctime and mtime.
 *
 * Currently a special hack for the XFS open_by_handle ioctl, but we'll
 * hopefully graduate it to a proper O_CMTIME flag supported by open(2) soon.
 */
#define FMODE_NOCMTIME		((__force fmode_t)(1 << 11))

/* Expect random access pattern */
#define FMODE_RANDOM		((__force fmode_t)(1 << 12))

/* Supports IOCB_HAS_METADATA */
#define FMODE_HAS_METADATA	((__force fmode_t)(1 << 13))

/* File is opened with O_PATH; almost nothing can be done with it */
#define FMODE_PATH		((__force fmode_t)(1 << 14))

/* File needs atomic accesses to f_pos */
#define FMODE_ATOMIC_POS	((__force fmode_t)(1 << 15))
/* Write access to underlying fs */
#define FMODE_WRITER		((__force fmode_t)(1 << 16))
/* Has read method(s) */
#define FMODE_CAN_READ          ((__force fmode_t)(1 << 17))
/* Has write method(s) */
#define FMODE_CAN_WRITE         ((__force fmode_t)(1 << 18))

#define FMODE_OPENED		((__force fmode_t)(1 << 19))
#define FMODE_CREATED		((__force fmode_t)(1 << 20))

/* File is stream-like */
#define FMODE_STREAM		((__force fmode_t)(1 << 21))

/* File supports DIRECT IO */
#define	FMODE_CAN_ODIRECT	((__force fmode_t)(1 << 22))

#define	FMODE_NOREUSE		((__force fmode_t)(1 << 23))

/* File is embedded in backing_file object */
#define FMODE_BACKING		((__force fmode_t)(1 << 24))

/*
 * Together with FMODE_NONOTIFY_PERM defines which fsnotify events shouldn't be
 * generated (see below)
 */
#define FMODE_NONOTIFY		((__force fmode_t)(1 << 25))

/*
 * Together with FMODE_NONOTIFY defines which fsnotify events shouldn't be
 * generated (see below)
 */
#define FMODE_NONOTIFY_PERM	((__force fmode_t)(1 << 26))

/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT		((__force fmode_t)(1 << 27))

/* File represents mount that needs unmounting */
#define FMODE_NEED_UNMOUNT	((__force fmode_t)(1 << 28))

/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT		((__force fmode_t)(1 << 29))

/*
 * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
 * for an open file. These are the possible values of
 * (f->f_mode & FMODE_FSNOTIFY_MASK) and their meaning:
 *
 * FMODE_NONOTIFY - suppress all (incl. non-permission) events.
 * FMODE_NONOTIFY_PERM - suppress permission (incl. pre-content) events.
 * FMODE_NONOTIFY | FMODE_NONOTIFY_PERM - suppress only FAN_ACCESS_PERM.
 */
#define FMODE_FSNOTIFY_MASK \
	(FMODE_NONOTIFY | FMODE_NONOTIFY_PERM)

#define FMODE_FSNOTIFY_NONE(mode) \
	((mode & FMODE_FSNOTIFY_MASK) == FMODE_NONOTIFY)
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
#define FMODE_FSNOTIFY_HSM(mode) \
	((mode & FMODE_FSNOTIFY_MASK) == 0 || \
	 (mode & FMODE_FSNOTIFY_MASK) == (FMODE_NONOTIFY | FMODE_NONOTIFY_PERM))
#define FMODE_FSNOTIFY_ACCESS_PERM(mode) \
	((mode & FMODE_FSNOTIFY_MASK) == 0)
#else
#define FMODE_FSNOTIFY_ACCESS_PERM(mode) 0
#define FMODE_FSNOTIFY_HSM(mode)	0
#endif

/*
 * Attribute flags.  These should be or-ed together to figure out what
 * has been changed!
 */
#define ATTR_MODE	(1 << 0)
#define ATTR_UID	(1 << 1)
#define ATTR_GID	(1 << 2)
#define ATTR_SIZE	(1 << 3)
#define ATTR_ATIME	(1 << 4)
#define ATTR_MTIME	(1 << 5)
#define ATTR_CTIME	(1 << 6)
#define ATTR_ATIME_SET	(1 << 7)
#define ATTR_MTIME_SET	(1 << 8)
#define ATTR_FORCE	(1 << 9) /* Not a change, but a change it */
#define ATTR_CTIME_SET	(1 << 10)
#define ATTR_KILL_SUID	(1 << 11)
#define ATTR_KILL_SGID	(1 << 12)
#define ATTR_FILE	(1 << 13)
#define ATTR_KILL_PRIV	(1 << 14)
#define ATTR_OPEN	(1 << 15) /* Truncating from open(O_TRUNC) */
#define ATTR_TIMES_SET	(1 << 16)
#define ATTR_TOUCH	(1 << 17)
#define ATTR_DELEG	(1 << 18) /* Delegated attrs. Don't break write delegations */

/*
 * Whiteout is represented by a char device.  The following constants define the
 * mode and device number to use.
 */
#define WHITEOUT_MODE 0
#define WHITEOUT_DEV 0

/*
 * This is the Inode Attributes structure, used for notify_change().  It
 * uses the above definitions as flags, to know which values have changed.
 * Also, in this manner, a Filesystem can look at only the values it cares
 * about.  Basically, these are the attributes that the VFS layer can
 * request to change from the FS layer.
 *
 * Derek Atkins <warlord@MIT.EDU> 94-10-20
 */
struct iattr {
	unsigned int	ia_valid;
	umode_t		ia_mode;
	/*
	 * The two anonymous unions wrap structures with the same member.
	 *
	 * Filesystems raising FS_ALLOW_IDMAP need to use ia_vfs{g,u}id which
	 * are a dedicated type requiring the filesystem to use the dedicated
	 * helpers. Other filesystem can continue to use ia_{g,u}id until they
	 * have been ported.
	 *
	 * They always contain the same value. In other words FS_ALLOW_IDMAP
	 * pass down the same value on idmapped mounts as they would on regular
	 * mounts.
	 */
	union {
		kuid_t		ia_uid;
		vfsuid_t	ia_vfsuid;
	};
	union {
		kgid_t		ia_gid;
		vfsgid_t	ia_vfsgid;
	};
	loff_t		ia_size;
	struct timespec64 ia_atime;
	struct timespec64 ia_mtime;
	struct timespec64 ia_ctime;

	/*
	 * Not an attribute, but an auxiliary info for filesystems wanting to
	 * implement an ftruncate() like method.  NOTE: filesystem should
	 * check for (ia_valid & ATTR_FILE), and not for (ia_file != NULL).
	 */
	struct file	*ia_file;
};

/*
 * Maximum number of layers of fs stack.  Needs to be limited to
 * prevent kernel stack overflow
 */
#define FILESYSTEM_MAX_STACK_DEPTH 2

/** 
 * enum positive_aop_returns - aop return codes with specific semantics
 *
 * @AOP_WRITEPAGE_ACTIVATE: Informs the caller that page writeback has
 * 			    completed, that the page is still locked, and
 * 			    should be considered active.  The VM uses this hint
 * 			    to return the page to the active list -- it won't
 * 			    be a candidate for writeback again in the near
 * 			    future.  Other callers must be careful to unlock
 * 			    the page if they get this return.  Returned by
 * 			    writepage(); 
 *
 * @AOP_TRUNCATED_PAGE: The AOP method that was handed a locked page has
 *  			unlocked it and the page might have been truncated.
 *  			The caller should back up to acquiring a new page and
 *  			trying again.  The aop will be taking reasonable
 *  			precautions not to livelock.  If the caller held a page
 *  			reference, it should drop it before retrying.  Returned
 *  			by read_folio().
 *
 * address_space_operation functions return these large constants to indicate
 * special semantics to the caller.  These are much larger than the bytes in a
 * page to allow for functions that return the number of bytes operated on in a
 * given page.
 */

enum positive_aop_returns {
	AOP_WRITEPAGE_ACTIVATE	= 0x80000,
	AOP_TRUNCATED_PAGE	= 0x80001,
};

/*
 * oh the beauties of C type declarations.
 */
struct page;
struct address_space;
struct writeback_control;
struct readahead_control;

/* Match RWF_* bits to IOCB bits */
#define IOCB_HIPRI		(__force int) RWF_HIPRI
#define IOCB_DSYNC		(__force int) RWF_DSYNC
#define IOCB_SYNC		(__force int) RWF_SYNC
#define IOCB_NOWAIT		(__force int) RWF_NOWAIT
#define IOCB_APPEND		(__force int) RWF_APPEND
#define IOCB_ATOMIC		(__force int) RWF_ATOMIC
#define IOCB_DONTCACHE		(__force int) RWF_DONTCACHE
#define IOCB_NOSIGNAL		(__force int) RWF_NOSIGNAL

/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD		(1 << 16)
#define IOCB_DIRECT		(1 << 17)
#define IOCB_WRITE		(1 << 18)
/* iocb->ki_waitq is valid */
#define IOCB_WAITQ		(1 << 19)
#define IOCB_NOIO		(1 << 20)
/* can use bio alloc cache */
#define IOCB_ALLOC_CACHE	(1 << 21)
/* kiocb is a read or write operation submitted by fs/aio.c. */
#define IOCB_AIO_RW		(1 << 22)
#define IOCB_HAS_METADATA	(1 << 23)

/* for use in trace events */
#define TRACE_IOCB_STRINGS \
	{ IOCB_HIPRI,		"HIPRI" }, \
	{ IOCB_DSYNC,		"DSYNC" }, \
	{ IOCB_SYNC,		"SYNC" }, \
	{ IOCB_NOWAIT,		"NOWAIT" }, \
	{ IOCB_APPEND,		"APPEND" }, \
	{ IOCB_ATOMIC,		"ATOMIC" }, \
	{ IOCB_DONTCACHE,	"DONTCACHE" }, \
	{ IOCB_EVENTFD,		"EVENTFD"}, \
	{ IOCB_DIRECT,		"DIRECT" }, \
	{ IOCB_WRITE,		"WRITE" }, \
	{ IOCB_WAITQ,		"WAITQ" }, \
	{ IOCB_NOIO,		"NOIO" }, \
	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
	{ IOCB_AIO_RW,		"AIO_RW" }, \
	{ IOCB_HAS_METADATA,	"AIO_HAS_METADATA" }

struct kiocb {
	struct file		*ki_filp;
	loff_t			ki_pos;
	void (*ki_complete)(struct kiocb *iocb, long ret);
	void			*private;
	int			ki_flags;
	u16			ki_ioprio; /* See linux/ioprio.h */
	u8			ki_write_stream;

	/*
	 * Only used for async buffered reads, where it denotes the page
	 * waitqueue associated with completing the read.
	 * Valid IFF IOCB_WAITQ is set.
	 */
	struct wait_page_queue	*ki_waitq;
};

static inline bool is_sync_kiocb(struct kiocb *kiocb)
{
	return kiocb->ki_complete == NULL;
}

struct address_space_operations {
	int (*read_folio)(struct file *, struct folio *);

	/* Write back some dirty pages from this mapping. */
	int (*writepages)(struct address_space *, struct writeback_control *);

	/* Mark a folio dirty.  Return true if this dirtied it */
	bool (*dirty_folio)(struct address_space *, struct folio *);

	void (*readahead)(struct readahead_control *);

	int (*write_begin)(const struct kiocb *, struct address_space *mapping,
				loff_t pos, unsigned len,
				struct folio **foliop, void **fsdata);
	int (*write_end)(const struct kiocb *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned copied,
				struct folio *folio, void *fsdata);

	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
	sector_t (*bmap)(struct address_space *, sector_t);
	void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
	bool (*release_folio)(struct folio *, gfp_t);
	void (*free_folio)(struct folio *folio);
	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
	/*
	 * migrate the contents of a folio to the specified target. If
	 * migrate_mode is MIGRATE_ASYNC, it must not block.
	 */
	int (*migrate_folio)(struct address_space *, struct folio *dst,
			struct folio *src, enum migrate_mode);
	int (*launder_folio)(struct folio *);
	bool (*is_partially_uptodate) (struct folio *, size_t from,
			size_t count);
	void (*is_dirty_writeback) (struct folio *, bool *dirty, bool *wb);
	int (*error_remove_folio)(struct address_space *, struct folio *);

	/* swapfile support */
	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
				sector_t *span);
	void (*swap_deactivate)(struct file *file);
	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};

extern const struct address_space_operations empty_aops;

/**
 * struct address_space - Contents of a cacheable, mappable object.
 * @host: Owner, either the inode or the block_device.
 * @i_pages: Cached pages.
 * @invalidate_lock: Guards coherency between page cache contents and
 *   file offset->disk block mappings in the filesystem during invalidates.
 *   It is also used to block modification of page cache contents through
 *   memory mappings.
 * @gfp_mask: Memory allocation flags to use for allocating pages.
 * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
 * @nr_thps: Number of THPs in the pagecache (non-shmem only).
 * @i_mmap: Tree of private and shared mappings.
 * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
 * @nrpages: Number of page entries, protected by the i_pages lock.
 * @writeback_index: Writeback starts here.
 * @a_ops: Methods.
 * @flags: Error bits and flags (AS_*).
 * @wb_err: The most recent error which has occurred.
 * @i_private_lock: For use by the owner of the address_space.
 * @i_private_list: For use by the owner of the address_space.
 * @i_private_data: For use by the owner of the address_space.
 */
struct address_space {
	struct inode		*host;
	struct xarray		i_pages;
	struct rw_semaphore	invalidate_lock;
	gfp_t			gfp_mask;
	atomic_t		i_mmap_writable;
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	/* number of thp, only for non-shmem files */
	atomic_t		nr_thps;
#endif
	struct rb_root_cached	i_mmap;
	unsigned long		nrpages;
	pgoff_t			writeback_index;
	const struct address_space_operations *a_ops;
	unsigned long		flags;
	errseq_t		wb_err;
	spinlock_t		i_private_lock;
	struct list_head	i_private_list;
	struct rw_semaphore	i_mmap_rwsem;
	void *			i_private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
	/*
	 * On most architectures that alignment is already the case; but
	 * must be enforced here for CRIS, to let the least significant bit
	 * of struct folio's "mapping" pointer be used for FOLIO_MAPPING_ANON.
	 */

/* XArray tags, for tagging dirty and writeback pages in the pagecache. */
#define PAGECACHE_TAG_DIRTY	XA_MARK_0
#define PAGECACHE_TAG_WRITEBACK	XA_MARK_1
#define PAGECACHE_TAG_TOWRITE	XA_MARK_2

/*
 * Returns true if any of the pages in the mapping are marked with the tag.
 */
static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t tag)
{
	return xa_marked(&mapping->i_pages, tag);
}

static inline void i_mmap_lock_write(struct address_space *mapping)
{
	down_write(&mapping->i_mmap_rwsem);
}

static inline int i_mmap_trylock_write(struct address_space *mapping)
{
	return down_write_trylock(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_unlock_write(struct address_space *mapping)
{
	up_write(&mapping->i_mmap_rwsem);
}

static inline int i_mmap_trylock_read(struct address_space *mapping)
{
	return down_read_trylock(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_lock_read(struct address_space *mapping)
{
	down_read(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_unlock_read(struct address_space *mapping)
{
	up_read(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_assert_locked(struct address_space *mapping)
{
	lockdep_assert_held(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_assert_write_locked(struct address_space *mapping)
{
	lockdep_assert_held_write(&mapping->i_mmap_rwsem);
}

/*
 * Might pages of this file be mapped into userspace?
 */
static inline int mapping_mapped(const struct address_space *mapping)
{
	return	!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root);
}

/*
 * Might pages of this file have been modified in userspace?
 * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
 * marks vma as VM_SHARED if it is shared, and the file was opened for
 * writing i.e. vma may be mprotected writable even if now readonly.
 *
 * If i_mmap_writable is negative, no new writable mappings are allowed. You
 * can only deny writable mappings, if none exists right now.
 */
static inline int mapping_writably_mapped(const struct address_space *mapping)
{
	return atomic_read(&mapping->i_mmap_writable) > 0;
}

static inline int mapping_map_writable(struct address_space *mapping)
{
	return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
		0 : -EPERM;
}

static inline void mapping_unmap_writable(struct address_space *mapping)
{
	atomic_dec(&mapping->i_mmap_writable);
}

static inline int mapping_deny_writable(struct address_space *mapping)
{
	return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
		0 : -EBUSY;
}

static inline void mapping_allow_writable(struct address_space *mapping)
{
	atomic_inc(&mapping->i_mmap_writable);
}

/*
 * Use sequence counter to get consistent i_size on 32-bit processors.
 */
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
#include <linux/seqlock.h>
#define __NEED_I_SIZE_ORDERED
#define i_size_ordered_init(inode) seqcount_init(&inode->i_size_seqcount)
#else
#define i_size_ordered_init(inode) do { } while (0)
#endif

struct posix_acl;
#define ACL_NOT_CACHED ((void *)(-1))
/*
 * ACL_DONT_CACHE is for stacked filesystems, that rely on underlying fs to
 * cache the ACL.  This also means that ->get_inode_acl() can be called in RCU
 * mode with the LOOKUP_RCU flag.
 */
#define ACL_DONT_CACHE ((void *)(-3))

static inline struct posix_acl *
uncached_acl_sentinel(struct task_struct *task)
{
	return (void *)task + 1;
}

static inline bool
is_uncached_acl(struct posix_acl *acl)
{
	return (long)acl & 1;
}

#define IOP_FASTPERM		0x0001
#define IOP_LOOKUP		0x0002
#define IOP_NOFOLLOW		0x0004
#define IOP_XATTR		0x0008
#define IOP_DEFAULT_READLINK	0x0010
#define IOP_MGTIME		0x0020
#define IOP_CACHED_LINK		0x0040
#define IOP_FASTPERM_MAY_EXEC	0x0080

/*
 * Inode state bits.  Protected by inode->i_lock
 *
 * Four bits determine the dirty state of the inode: I_DIRTY_SYNC,
 * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME.
 *
 * Four bits define the lifetime of an inode.  Initially, inodes are I_NEW,
 * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
 * various stages of removing an inode.
 *
 * Two bits are used for locking and completion notification, I_NEW and I_SYNC.
 *
 * I_DIRTY_SYNC		Inode is dirty, but doesn't have to be written on
 *			fdatasync() (unless I_DIRTY_DATASYNC is also set).
 *			Timestamp updates are the usual cause.
 * I_DIRTY_DATASYNC	Data-related inode changes pending.  We keep track of
 *			these changes separately from I_DIRTY_SYNC so that we
 *			don't have to write inode on fdatasync() when only
 *			e.g. the timestamps have changed.
 * I_DIRTY_PAGES	Inode has dirty pages.  Inode itself may be clean.
 * I_DIRTY_TIME		The inode itself has dirty timestamps, and the
 *			lazytime mount option is enabled.  We keep track of this
 *			separately from I_DIRTY_SYNC in order to implement
 *			lazytime.  This gets cleared if I_DIRTY_INODE
 *			(I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
 *			I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
 *			in place because writeback might already be in progress
 *			and we don't want to lose the time update
 * I_NEW		Serves as both a mutex and completion notification.
 *			New inodes set I_NEW.  If two processes both create
 *			the same inode, one of them will release its inode and
 *			wait for I_NEW to be released before returning.
 *			Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
 *			also cause waiting on I_NEW, without I_NEW actually
 *			being set.  find_inode() uses this to prevent returning
 *			nearly-dead inodes.
 * I_WILL_FREE		Must be set when calling write_inode_now() if i_count
 *			is zero.  I_FREEING must be set when I_WILL_FREE is
 *			cleared.
 * I_FREEING		Set when inode is about to be freed but still has dirty
 *			pages or buffers attached or the inode itself is still
 *			dirty.
 * I_CLEAR		Added by clear_inode().  In this state the inode is
 *			clean and can be destroyed.  Inode keeps I_FREEING.
 *
 *			Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
 *			prohibited for many purposes.  iget() must wait for
 *			the inode to be completely released, then create it
 *			anew.  Other functions will just ignore such inodes,
 *			if appropriate.  I_NEW is used for waiting.
 *
 * I_SYNC		Writeback of inode is running. The bit is set during
 *			data writeback, and cleared with a wakeup on the bit
 *			address once it is done. The bit is also used to pin
 *			the inode in memory for flusher thread.
 *
 * I_REFERENCED		Marks the inode as recently references on the LRU list.
 *
 * I_WB_SWITCH		Cgroup bdi_writeback switching in progress.  Used to
 *			synchronize competing switching instances and to tell
 *			wb stat updates to grab the i_pages lock.  See
 *			inode_switch_wbs_work_fn() for details.
 *
 * I_OVL_INUSE		Used by overlayfs to get exclusive ownership on upper
 *			and work dirs among overlayfs mounts.
 *
 * I_CREATING		New object's inode in the middle of setting up.
 *
 * I_DONTCACHE		Evict inode as soon as it is not used anymore.
 *
 * I_SYNC_QUEUED	Inode is queued in b_io or b_more_io writeback lists.
 *			Used to detect that mark_inode_dirty() should not move
 *			inode between dirty lists.
 *
 * I_PINNING_NETFS_WB	Inode is pinning a netfs (fscache) object for writeback.
 *
 * I_LRU_ISOLATING	Inode is pinned while being isolated from the LRU
 *			without holding i_count.
 *
 * Q: What is the difference between I_WILL_FREE and I_FREEING?
 *
 * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait
 * upon. There's one free address left.
 */

enum inode_state_bits {
	__I_NEW			= 0U,
	__I_SYNC		= 1U,
	__I_LRU_ISOLATING	= 2U
	/* reserved wait address bit 3 */
};

enum inode_state_flags_enum {
	I_NEW			= (1U << __I_NEW),
	I_SYNC			= (1U << __I_SYNC),
	I_LRU_ISOLATING         = (1U << __I_LRU_ISOLATING),
	/* reserved flag bit 3 */
	I_DIRTY_SYNC		= (1U << 4),
	I_DIRTY_DATASYNC	= (1U << 5),
	I_DIRTY_PAGES		= (1U << 6),
	I_WILL_FREE		= (1U << 7),
	I_FREEING		= (1U << 8),
	I_CLEAR			= (1U << 9),
	I_REFERENCED		= (1U << 10),
	I_LINKABLE		= (1U << 11),
	I_DIRTY_TIME		= (1U << 12),
	I_WB_SWITCH		= (1U << 13),
	I_OVL_INUSE		= (1U << 14),
	I_CREATING		= (1U << 15),
	I_DONTCACHE		= (1U << 16),
	I_SYNC_QUEUED		= (1U << 17),
	I_PINNING_NETFS_WB	= (1U << 18)
};

#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)

/*
 * Use inode_state_read() & friends to access.
 */
struct inode_state_flags {
	enum inode_state_flags_enum __state;
};

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	unsigned int		i_flags;
#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif
	kuid_t			i_uid;
	kgid_t			i_gid;

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	time64_t		i_atime_sec;
	time64_t		i_mtime_sec;
	time64_t		i_ctime_sec;
	u32			i_atime_nsec;
	u32			i_mtime_nsec;
	u32			i_ctime_nsec;
	u32			i_generation;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	u8			i_blkbits;
	enum rw_hint		i_write_hint;
	blkcnt_t		i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif

	/* Misc */
	struct inode_state_flags i_state;
	/* 32-bit hole */
	struct rw_semaphore	i_rwsem;

	unsigned long		dirtied_when;	/* jiffies of first dirtying */
	unsigned long		dirtied_time_when;

	struct hlist_node	i_hash;
	struct list_head	i_io_list;	/* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct bdi_writeback	*i_wb;		/* the associated cgroup wb */

	/* foreign inode detection, see wbc_detach_inode() */
	int			i_wb_frn_winner;
	u16			i_wb_frn_avg_time;
	u16			i_wb_frn_history;
#endif
	struct list_head	i_lru;		/* inode LRU list */
	struct list_head	i_sb_list;
	struct list_head	i_wb_list;	/* backing dev writeback list */
	union {
		struct hlist_head	i_dentry;
		struct rcu_head		i_rcu;
	};
	atomic64_t		i_version;
	atomic64_t		i_sequence; /* see futex */
	atomic_t		i_count;
	atomic_t		i_dio_count;
	atomic_t		i_writecount;
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
	atomic_t		i_readcount; /* struct files open RO */
#endif
	union {
		const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
		void (*free_inode)(struct inode *);
	};
	struct file_lock_context	*i_flctx;
	struct address_space	i_data;
	union {
		struct list_head	i_devices;
		int			i_linklen;
	};
	union {
		struct pipe_inode_info	*i_pipe;
		struct cdev		*i_cdev;
		char			*i_link;
		unsigned		i_dir_seq;
	};


#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	/* 32-bit hole reserved for expanding i_fsnotify_mask */
	struct fsnotify_mark_connector __rcu	*i_fsnotify_marks;
#endif

	void			*i_private; /* fs or device private pointer */
} __randomize_layout;

/*
 * i_state handling
 *
 * We hide all of it behind helpers so that we can validate consumers.
 */
static inline enum inode_state_flags_enum inode_state_read_once(struct inode *inode)
{
	return READ_ONCE(inode->i_state.__state);
}

static inline enum inode_state_flags_enum inode_state_read(struct inode *inode)
{
	lockdep_assert_held(&inode->i_lock);
	return inode->i_state.__state;
}

static inline void inode_state_set_raw(struct inode *inode,
				       enum inode_state_flags_enum flags)
{
	WRITE_ONCE(inode->i_state.__state, inode->i_state.__state | flags);
}

static inline void inode_state_set(struct inode *inode,
				   enum inode_state_flags_enum flags)
{
	lockdep_assert_held(&inode->i_lock);
	inode_state_set_raw(inode, flags);
}

static inline void inode_state_clear_raw(struct inode *inode,
					 enum inode_state_flags_enum flags)
{
	WRITE_ONCE(inode->i_state.__state, inode->i_state.__state & ~flags);
}

static inline void inode_state_clear(struct inode *inode,
				     enum inode_state_flags_enum flags)
{
	lockdep_assert_held(&inode->i_lock);
	inode_state_clear_raw(inode, flags);
}

static inline void inode_state_assign_raw(struct inode *inode,
					  enum inode_state_flags_enum flags)
{
	WRITE_ONCE(inode->i_state.__state, flags);
}

static inline void inode_state_assign(struct inode *inode,
				      enum inode_state_flags_enum flags)
{
	lockdep_assert_held(&inode->i_lock);
	inode_state_assign_raw(inode, flags);
}

static inline void inode_state_replace_raw(struct inode *inode,
					   enum inode_state_flags_enum clearflags,
					   enum inode_state_flags_enum setflags)
{
	enum inode_state_flags_enum flags;
	flags = inode->i_state.__state;
	flags &= ~clearflags;
	flags |= setflags;
	inode_state_assign_raw(inode, flags);
}

static inline void inode_state_replace(struct inode *inode,
				       enum inode_state_flags_enum clearflags,
				       enum inode_state_flags_enum setflags)
{
	lockdep_assert_held(&inode->i_lock);
	inode_state_replace_raw(inode, clearflags, setflags);
}

static inline void inode_set_cached_link(struct inode *inode, char *link, int linklen)
{
	VFS_WARN_ON_INODE(strlen(link) != linklen, inode);
	VFS_WARN_ON_INODE(inode->i_opflags & IOP_CACHED_LINK, inode);
	inode->i_link = link;
	inode->i_linklen = linklen;
	inode->i_opflags |= IOP_CACHED_LINK;
}

/*
 * Get bit address from inode->i_state to use with wait_var_event()
 * infrastructure.
 */
#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))

struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
					    struct inode *inode, u32 bit);

static inline void inode_wake_up_bit(struct inode *inode, u32 bit)
{
	/* Caller is responsible for correct memory barriers. */
	wake_up_var(inode_state_wait_address(inode, bit));
}

struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);

static inline unsigned int i_blocksize(const struct inode *node)
{
	return (1 << node->i_blkbits);
}

static inline int inode_unhashed(struct inode *inode)
{
	return hlist_unhashed(&inode->i_hash);
}

/*
 * __mark_inode_dirty expects inodes to be hashed.  Since we don't
 * want special inodes in the fileset inode space, we make them
 * appear hashed, but do not put on any lists.  hlist_del()
 * will work fine and require no locking.
 */
static inline void inode_fake_hash(struct inode *inode)
{
	hlist_add_fake(&inode->i_hash);
}

void wait_on_new_inode(struct inode *inode);

/*
 * inode->i_rwsem nesting subclasses for the lock validator:
 *
 * 0: the object of the current VFS operation
 * 1: parent
 * 2: child/target
 * 3: xattr
 * 4: second non-directory
 * 5: second parent (when locking independent directories in rename)
 *
 * I_MUTEX_NONDIR2 is for certain operations (such as rename) which lock two
 * non-directories at once.
 *
 * The locking order between these classes is
 * parent[2] -> child -> grandchild -> normal -> xattr -> second non-directory
 */
enum inode_i_mutex_lock_class
{
	I_MUTEX_NORMAL,
	I_MUTEX_PARENT,
	I_MUTEX_CHILD,
	I_MUTEX_XATTR,
	I_MUTEX_NONDIR2,
	I_MUTEX_PARENT2,
};

static inline void inode_lock(struct inode *inode)
{
	down_write(&inode->i_rwsem);
}

static inline __must_check int inode_lock_killable(struct inode *inode)
{
	return down_write_killable(&inode->i_rwsem);
}

static inline void inode_unlock(struct inode *inode)
{
	up_write(&inode->i_rwsem);
}

static inline void inode_lock_shared(struct inode *inode)
{
	down_read(&inode->i_rwsem);
}

static inline __must_check int inode_lock_shared_killable(struct inode *inode)
{
	return down_read_killable(&inode->i_rwsem);
}

static inline void inode_unlock_shared(struct inode *inode)
{
	up_read(&inode->i_rwsem);
}

static inline int inode_trylock(struct inode *inode)
{
	return down_write_trylock(&inode->i_rwsem);
}

static inline int inode_trylock_shared(struct inode *inode)
{
	return down_read_trylock(&inode->i_rwsem);
}

static inline int inode_is_locked(struct inode *inode)
{
	return rwsem_is_locked(&inode->i_rwsem);
}

static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
{
	down_write_nested(&inode->i_rwsem, subclass);
}

static inline void inode_lock_shared_nested(struct inode *inode, unsigned subclass)
{
	down_read_nested(&inode->i_rwsem, subclass);
}

static inline void filemap_invalidate_lock(struct address_space *mapping)
{
	down_write(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_unlock(struct address_space *mapping)
{
	up_write(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_lock_shared(struct address_space *mapping)
{
	down_read(&mapping->invalidate_lock);
}

static inline int filemap_invalidate_trylock_shared(
					struct address_space *mapping)
{
	return down_read_trylock(&mapping->invalidate_lock);
}

static inline void filemap_invalidate_unlock_shared(
					struct address_space *mapping)
{
	up_read(&mapping->invalidate_lock);
}

void lock_two_nondirectories(struct inode *, struct inode*);
void unlock_two_nondirectories(struct inode *, struct inode*);

void filemap_invalidate_lock_two(struct address_space *mapping1,
				 struct address_space *mapping2);
void filemap_invalidate_unlock_two(struct address_space *mapping1,
				   struct address_space *mapping2);


/*
 * NOTE: on a 32bit arch with a preemptible kernel and a UP compile,
 * i_size_read/write must be atomic with respect to the local cpu
 * (unlike with preempt disabled), but they don't need to be atomic
 * with respect to other cpus as in true SMP (so they either need to
 * locally disable irqs around the read, or, for example on x86, they
 * can still be implemented as a cmpxchg8b without the lock prefix).
 * For SMP compiles and 64bit archs it makes no difference whether
 * preempt is enabled or not.
 */
static inline loff_t i_size_read(const struct inode *inode)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
	loff_t i_size;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&inode->i_size_seqcount);
		i_size = inode->i_size;
	} while (read_seqcount_retry(&inode->i_size_seqcount, seq));
	return i_size;
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION)
	loff_t i_size;

	preempt_disable();
	i_size = inode->i_size;
	preempt_enable();
	return i_size;
#else
	/* Pairs with smp_store_release() in i_size_write() */
	return smp_load_acquire(&inode->i_size);
#endif
}

/*
 * NOTE: unlike i_size_read(), i_size_write() does need locking around it
 * (normally i_rwsem), otherwise on 32bit/SMP an update of i_size_seqcount
 * can be lost, resulting in subsequent i_size_read() calls spinning forever.
 */
static inline void i_size_write(struct inode *inode, loff_t i_size)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
	preempt_disable();
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
	preempt_enable();
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION)
	preempt_disable();
	inode->i_size = i_size;
	preempt_enable();
#else
	/*
	 * Pairs with smp_load_acquire() in i_size_read() to ensure
	 * changes related to inode size (such as page contents) are
	 * visible before we see the changed inode size.
	 */
	smp_store_release(&inode->i_size, i_size);
#endif
}

static inline unsigned iminor(const struct inode *inode)
{
	return MINOR(inode->i_rdev);
}

static inline unsigned imajor(const struct inode *inode)
{
	return MAJOR(inode->i_rdev);
}

struct fown_struct {
	struct file *file;	/* backpointer for security modules */
	rwlock_t lock;          /* protects pid, uid, euid fields */
	struct pid *pid;	/* pid or -pgrp where SIGIO should be sent */
	enum pid_type pid_type;	/* Kind of process group SIGIO should be sent to */
	kuid_t uid, euid;	/* uid/euid of process setting the owner */
	int signum;		/* posix.1b rt signal to be delivered on IO */
};

/**
 * struct file_ra_state - Track a file's readahead state.
 * @start: Where the most recent readahead started.
 * @size: Number of pages read in the most recent readahead.
 * @async_size: Number of pages that were/are not needed immediately
 *      and so were/are genuinely "ahead".  Start next readahead when
 *      the first of these pages is accessed.
 * @ra_pages: Maximum size of a readahead request, copied from the bdi.
 * @order: Preferred folio order used for most recent readahead.
 * @mmap_miss: How many mmap accesses missed in the page cache.
 * @prev_pos: The last byte in the most recent read request.
 *
 * When this structure is passed to ->readahead(), the "most recent"
 * readahead means the current readahead.
 */
struct file_ra_state {
	pgoff_t start;
	unsigned int size;
	unsigned int async_size;
	unsigned int ra_pages;
	unsigned short order;
	unsigned short mmap_miss;
	loff_t prev_pos;
};

/*
 * Check if @index falls in the readahead window.
 */
static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
{
	return (index >= ra->start &&
		index <  ra->start + ra->size);
}

/**
 * struct file - Represents a file
 * @f_lock: Protects f_ep, f_flags. Must not be taken from IRQ context.
 * @f_mode: FMODE_* flags often used in hotpaths
 * @f_op: file operations
 * @f_mapping: Contents of a cacheable, mappable object.
 * @private_data: filesystem or driver specific data
 * @f_inode: cached inode
 * @f_flags: file flags
 * @f_iocb_flags: iocb flags
 * @f_cred: stashed credentials of creator/opener
 * @f_owner: file owner
 * @f_path: path of the file
 * @__f_path: writable alias for @f_path; *ONLY* for core VFS and only before
 *   the file is opened
 * @f_pos_lock: lock protecting file position
 * @f_pipe: specific to pipes
 * @f_pos: file position
 * @f_security: LSM security context of this file
 * @f_wb_err: writeback error
 * @f_sb_err: per sb writeback errors
 * @f_ep: link of all epoll hooks for this file
 * @f_task_work: task work entry point
 * @f_llist: work queue entrypoint
 * @f_ra: file's readahead state
 * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
 * @f_ref: reference count
 */
struct file {
	spinlock_t			f_lock;
	fmode_t				f_mode;
	const struct file_operations	*f_op;
	struct address_space		*f_mapping;
	void				*private_data;
	struct inode			*f_inode;
	unsigned int			f_flags;
	unsigned int			f_iocb_flags;
	const struct cred		*f_cred;
	struct fown_struct		*f_owner;
	/* --- cacheline 1 boundary (64 bytes) --- */
	union {
		const struct path	f_path;
		struct path		__f_path;
	};
	union {
		/* regular files (with FMODE_ATOMIC_POS) and directories */
		struct mutex		f_pos_lock;
		/* pipes */
		u64			f_pipe;
	};
	loff_t				f_pos;
#ifdef CONFIG_SECURITY
	void				*f_security;
#endif
	/* --- cacheline 2 boundary (128 bytes) --- */
	errseq_t			f_wb_err;
	errseq_t			f_sb_err;
#ifdef CONFIG_EPOLL
	struct hlist_head		*f_ep;
#endif
	union {
		struct callback_head	f_task_work;
		struct llist_node	f_llist;
		struct file_ra_state	f_ra;
		freeptr_t		f_freeptr;
	};
	file_ref_t			f_ref;
	/* --- cacheline 3 boundary (192 bytes) --- */
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

struct file_handle {
	__u32 handle_bytes;
	int handle_type;
	/* file identifier */
	unsigned char f_handle[] __counted_by(handle_bytes);
};

static inline struct file *get_file(struct file *f)
{
	file_ref_inc(&f->f_ref);
	return f;
}

struct file *get_file_rcu(struct file __rcu **f);
struct file *get_file_active(struct file **f);

#define file_count(f)	file_ref_read(&(f)->f_ref)

#define	MAX_NON_LFS	((1UL<<31) - 1)

/* Page cache limit. Filesystems should factor this into their s_maxbytes
   limits, otherwise bad things can happen in the VM. */
#if BITS_PER_LONG==32
#define MAX_LFS_FILESIZE	((loff_t)ULONG_MAX << PAGE_SHIFT)
#elif BITS_PER_LONG==64
#define MAX_LFS_FILESIZE 	((loff_t)LLONG_MAX)
#endif

/* legacy typedef, should eventually be removed */
typedef void *fl_owner_t;

struct file_lock;
struct file_lease;

/* The following constant reflects the upper bound of the file/locking space */
#ifndef OFFSET_MAX
#define OFFSET_MAX	type_max(loff_t)
#define OFFT_OFFSET_MAX	type_max(off_t)
#endif

int file_f_owner_allocate(struct file *file);
static inline struct fown_struct *file_f_owner(const struct file *file)
{
	return READ_ONCE(file->f_owner);
}

extern void send_sigio(struct fown_struct *fown, int fd, int band);

static inline struct inode *file_inode(const struct file *f)
{
	return f->f_inode;
}

/*
 * file_dentry() is a relic from the days that overlayfs was using files with a
 * "fake" path, meaning, f_path on overlayfs and f_inode on underlying fs.
 * In those days, file_dentry() was needed to get the underlying fs dentry that
 * matches f_inode.
 * Files with "fake" path should not exist nowadays, so use an assertion to make
 * sure that file_dentry() was not papering over filesystem bugs.
 */
static inline struct dentry *file_dentry(const struct file *file)
{
	struct dentry *dentry = file->f_path.dentry;

	WARN_ON_ONCE(d_inode(dentry) != file_inode(file));
	return dentry;
}

struct fasync_struct {
	rwlock_t		fa_lock;
	int			magic;
	int			fa_fd;
	struct fasync_struct	*fa_next; /* singly linked list */
	struct file		*fa_file;
	struct rcu_head		fa_rcu;
};

#define FASYNC_MAGIC 0x4601

/* SMP safe fasync helpers: */
extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
extern struct fasync_struct *fasync_insert_entry(int, struct file *, struct fasync_struct **, struct fasync_struct *);
extern int fasync_remove_entry(struct file *, struct fasync_struct **);
extern struct fasync_struct *fasync_alloc(void);
extern void fasync_free(struct fasync_struct *);

/* can be called from interrupts */
extern void kill_fasync(struct fasync_struct **, int, int);

extern void __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
extern int f_setown(struct file *filp, int who, int force);
extern void f_delown(struct file *filp);
extern pid_t f_getown(struct file *filp);
extern int send_sigurg(struct file *file);

/*
 *	Umount options
 */

#define MNT_FORCE	0x00000001	/* Attempt to forcibly umount */
#define MNT_DETACH	0x00000002	/* Just detach from the tree */
#define MNT_EXPIRE	0x00000004	/* Mark for expiry */
#define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
#define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */

static inline struct user_namespace *i_user_ns(const struct inode *inode)
{
	return inode->i_sb->s_user_ns;
}

/* Helper functions so that in most cases filesystems will
 * not need to deal directly with kuid_t and kgid_t and can
 * instead deal with the raw numeric values that are stored
 * in the filesystem.
 */
static inline uid_t i_uid_read(const struct inode *inode)
{
	return from_kuid(i_user_ns(inode), inode->i_uid);
}

static inline gid_t i_gid_read(const struct inode *inode)
{
	return from_kgid(i_user_ns(inode), inode->i_gid);
}

static inline void i_uid_write(struct inode *inode, uid_t uid)
{
	inode->i_uid = make_kuid(i_user_ns(inode), uid);
}

static inline void i_gid_write(struct inode *inode, gid_t gid)
{
	inode->i_gid = make_kgid(i_user_ns(inode), gid);
}

/**
 * i_uid_into_vfsuid - map an inode's i_uid down according to an idmapping
 * @idmap: idmap of the mount the inode was found from
 * @inode: inode to map
 *
 * Return: the inode's i_uid mapped down according to @idmap.
 * If the inode's i_uid has no mapping INVALID_VFSUID is returned.
 */
static inline vfsuid_t i_uid_into_vfsuid(struct mnt_idmap *idmap,
					 const struct inode *inode)
{
	return make_vfsuid(idmap, i_user_ns(inode), inode->i_uid);
}

/**
 * i_uid_needs_update - check whether inode's i_uid needs to be updated
 * @idmap: idmap of the mount the inode was found from
 * @attr: the new attributes of @inode
 * @inode: the inode to update
 *
 * Check whether the @inode's i_uid field needs to be updated taking idmapped
 * mounts into account if the filesystem supports it.
 *
 * Return: true if @inode's i_uid field needs to be updated, false if not.
 */
static inline bool i_uid_needs_update(struct mnt_idmap *idmap,
				      const struct iattr *attr,
				      const struct inode *inode)
{
	return ((attr->ia_valid & ATTR_UID) &&
		!vfsuid_eq(attr->ia_vfsuid,
			   i_uid_into_vfsuid(idmap, inode)));
}

/**
 * i_uid_update - update @inode's i_uid field
 * @idmap: idmap of the mount the inode was found from
 * @attr: the new attributes of @inode
 * @inode: the inode to update
 *
 * Safely update @inode's i_uid field translating the vfsuid of any idmapped
 * mount into the filesystem kuid.
 */
static inline void i_uid_update(struct mnt_idmap *idmap,
				const struct iattr *attr,
				struct inode *inode)
{
	if (attr->ia_valid & ATTR_UID)
		inode->i_uid = from_vfsuid(idmap, i_user_ns(inode),
					   attr->ia_vfsuid);
}

/**
 * i_gid_into_vfsgid - map an inode's i_gid down according to an idmapping
 * @idmap: idmap of the mount the inode was found from
 * @inode: inode to map
 *
 * Return: the inode's i_gid mapped down according to @idmap.
 * If the inode's i_gid has no mapping INVALID_VFSGID is returned.
 */
static inline vfsgid_t i_gid_into_vfsgid(struct mnt_idmap *idmap,
					 const struct inode *inode)
{
	return make_vfsgid(idmap, i_user_ns(inode), inode->i_gid);
}

/**
 * i_gid_needs_update - check whether inode's i_gid needs to be updated
 * @idmap: idmap of the mount the inode was found from
 * @attr: the new attributes of @inode
 * @inode: the inode to update
 *
 * Check whether the @inode's i_gid field needs to be updated taking idmapped
 * mounts into account if the filesystem supports it.
 *
 * Return: true if @inode's i_gid field needs to be updated, false if not.
 */
static inline bool i_gid_needs_update(struct mnt_idmap *idmap,
				      const struct iattr *attr,
				      const struct inode *inode)
{
	return ((attr->ia_valid & ATTR_GID) &&
		!vfsgid_eq(attr->ia_vfsgid,
			   i_gid_into_vfsgid(idmap, inode)));
}

/**
 * i_gid_update - update @inode's i_gid field
 * @idmap: idmap of the mount the inode was found from
 * @attr: the new attributes of @inode
 * @inode: the inode to update
 *
 * Safely update @inode's i_gid field translating the vfsgid of any idmapped
 * mount into the filesystem kgid.
 */
static inline void i_gid_update(struct mnt_idmap *idmap,
				const struct iattr *attr,
				struct inode *inode)
{
	if (attr->ia_valid & ATTR_GID)
		inode->i_gid = from_vfsgid(idmap, i_user_ns(inode),
					   attr->ia_vfsgid);
}

/**
 * inode_fsuid_set - initialize inode's i_uid field with caller's fsuid
 * @inode: inode to initialize
 * @idmap: idmap of the mount the inode was found from
 *
 * Initialize the i_uid field of @inode. If the inode was found/created via
 * an idmapped mount map the caller's fsuid according to @idmap.
 */
static inline void inode_fsuid_set(struct inode *inode,
				   struct mnt_idmap *idmap)
{
	inode->i_uid = mapped_fsuid(idmap, i_user_ns(inode));
}

/**
 * inode_fsgid_set - initialize inode's i_gid field with caller's fsgid
 * @inode: inode to initialize
 * @idmap: idmap of the mount the inode was found from
 *
 * Initialize the i_gid field of @inode. If the inode was found/created via
 * an idmapped mount map the caller's fsgid according to @idmap.
 */
static inline void inode_fsgid_set(struct inode *inode,
				   struct mnt_idmap *idmap)
{
	inode->i_gid = mapped_fsgid(idmap, i_user_ns(inode));
}

/**
 * fsuidgid_has_mapping() - check whether caller's fsuid/fsgid is mapped
 * @sb: the superblock we want a mapping in
 * @idmap: idmap of the relevant mount
 *
 * Check whether the caller's fsuid and fsgid have a valid mapping in the
 * s_user_ns of the superblock @sb. If the caller is on an idmapped mount map
 * the caller's fsuid and fsgid according to the @idmap first.
 *
 * Return: true if fsuid and fsgid are mapped, false if not.
 */
static inline bool fsuidgid_has_mapping(struct super_block *sb,
					struct mnt_idmap *idmap)
{
	struct user_namespace *fs_userns = sb->s_user_ns;
	kuid_t kuid;
	kgid_t kgid;

	kuid = mapped_fsuid(idmap, fs_userns);
	if (!uid_valid(kuid))
		return false;
	kgid = mapped_fsgid(idmap, fs_userns);
	if (!gid_valid(kgid))
		return false;
	return kuid_has_mapping(fs_userns, kuid) &&
	       kgid_has_mapping(fs_userns, kgid);
}

struct timespec64 current_time(struct inode *inode);
struct timespec64 inode_set_ctime_current(struct inode *inode);
struct timespec64 inode_set_ctime_deleg(struct inode *inode,
					struct timespec64 update);

static inline time64_t inode_get_atime_sec(const struct inode *inode)
{
	return inode->i_atime_sec;
}

static inline long inode_get_atime_nsec(const struct inode *inode)
{
	return inode->i_atime_nsec;
}

static inline struct timespec64 inode_get_atime(const struct inode *inode)
{
	struct timespec64 ts = { .tv_sec  = inode_get_atime_sec(inode),
				 .tv_nsec = inode_get_atime_nsec(inode) };

	return ts;
}

static inline struct timespec64 inode_set_atime_to_ts(struct inode *inode,
						      struct timespec64 ts)
{
	inode->i_atime_sec = ts.tv_sec;
	inode->i_atime_nsec = ts.tv_nsec;
	return ts;
}

static inline struct timespec64 inode_set_atime(struct inode *inode,
						time64_t sec, long nsec)
{
	struct timespec64 ts = { .tv_sec  = sec,
				 .tv_nsec = nsec };

	return inode_set_atime_to_ts(inode, ts);
}

static inline time64_t inode_get_mtime_sec(const struct inode *inode)
{
	return inode->i_mtime_sec;
}

static inline long inode_get_mtime_nsec(const struct inode *inode)
{
	return inode->i_mtime_nsec;
}

static inline struct timespec64 inode_get_mtime(const struct inode *inode)
{
	struct timespec64 ts = { .tv_sec  = inode_get_mtime_sec(inode),
				 .tv_nsec = inode_get_mtime_nsec(inode) };
	return ts;
}

static inline struct timespec64 inode_set_mtime_to_ts(struct inode *inode,
						      struct timespec64 ts)
{
	inode->i_mtime_sec = ts.tv_sec;
	inode->i_mtime_nsec = ts.tv_nsec;
	return ts;
}

static inline struct timespec64 inode_set_mtime(struct inode *inode,
						time64_t sec, long nsec)
{
	struct timespec64 ts = { .tv_sec  = sec,
				 .tv_nsec = nsec };
	return inode_set_mtime_to_ts(inode, ts);
}

/*
 * Multigrain timestamps
 *
 * Conditionally use fine-grained ctime and mtime timestamps when there
 * are users actively observing them via getattr. The primary use-case
 * for this is NFS clients that use the ctime to distinguish between
 * different states of the file, and that are often fooled by multiple
 * operations that occur in the same coarse-grained timer tick.
 */
#define I_CTIME_QUERIED		((u32)BIT(31))

static inline time64_t inode_get_ctime_sec(const struct inode *inode)
{
	return inode->i_ctime_sec;
}

static inline long inode_get_ctime_nsec(const struct inode *inode)
{
	return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
}

static inline struct timespec64 inode_get_ctime(const struct inode *inode)
{
	struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
				 .tv_nsec = inode_get_ctime_nsec(inode) };

	return ts;
}

struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);

/**
 * inode_set_ctime - set the ctime in the inode
 * @inode: inode in which to set the ctime
 * @sec: tv_sec value to set
 * @nsec: tv_nsec value to set
 *
 * Set the ctime in @inode to { @sec, @nsec }
 */
static inline struct timespec64 inode_set_ctime(struct inode *inode,
						time64_t sec, long nsec)
{
	struct timespec64 ts = { .tv_sec  = sec,
				 .tv_nsec = nsec };

	return inode_set_ctime_to_ts(inode, ts);
}

struct timespec64 simple_inode_init_ts(struct inode *inode);

/*
 * Snapshotting support.
 */

/**
 * file_write_started - check if SB_FREEZE_WRITE is held
 * @file: the file we write to
 *
 * May be false positive with !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN.
 * May be false positive with !S_ISREG, because file_start_write() has
 * no effect on !S_ISREG.
 */
static inline bool file_write_started(const struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return true;
	return sb_write_started(file_inode(file)->i_sb);
}

/**
 * file_write_not_started - check if SB_FREEZE_WRITE is not held
 * @file: the file we write to
 *
 * May be false positive with !CONFIG_LOCKDEP/LOCK_STATE_UNKNOWN.
 * May be false positive with !S_ISREG, because file_start_write() has
 * no effect on !S_ISREG.
 */
static inline bool file_write_not_started(const struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return true;
	return sb_write_not_started(file_inode(file)->i_sb);
}

bool inode_owner_or_capable(struct mnt_idmap *idmap,
			    const struct inode *inode);

/*
 * VFS helper functions..
 */
int vfs_create(struct mnt_idmap *, struct dentry *, umode_t,
	       struct delegated_inode *);
struct dentry *vfs_mkdir(struct mnt_idmap *, struct inode *,
			 struct dentry *, umode_t, struct delegated_inode *);
int vfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *,
	      umode_t, dev_t, struct delegated_inode *);
int vfs_symlink(struct mnt_idmap *, struct inode *,
		struct dentry *, const char *, struct delegated_inode *);
int vfs_link(struct dentry *, struct mnt_idmap *, struct inode *,
	     struct dentry *, struct delegated_inode *);
int vfs_rmdir(struct mnt_idmap *, struct inode *, struct dentry *,
	      struct delegated_inode *);
int vfs_unlink(struct mnt_idmap *, struct inode *, struct dentry *,
	       struct delegated_inode *);

/**
 * struct renamedata - contains all information required for renaming
 * @mnt_idmap:     idmap of the mount in which the rename is happening.
 * @old_parent:        parent of source
 * @old_dentry:                source
 * @new_parent:        parent of destination
 * @new_dentry:                destination
 * @delegated_inode:   returns an inode needing a delegation break
 * @flags:             rename flags
 */
struct renamedata {
	struct mnt_idmap *mnt_idmap;
	struct dentry *old_parent;
	struct dentry *old_dentry;
	struct dentry *new_parent;
	struct dentry *new_dentry;
	struct delegated_inode *delegated_inode;
	unsigned int flags;
} __randomize_layout;

int vfs_rename(struct renamedata *);

static inline int vfs_whiteout(struct mnt_idmap *idmap,
			       struct inode *dir, struct dentry *dentry)
{
	return vfs_mknod(idmap, dir, dentry, S_IFCHR | WHITEOUT_MODE,
			 WHITEOUT_DEV, NULL);
}

struct file *kernel_tmpfile_open(struct mnt_idmap *idmap,
				 const struct path *parentpath,
				 umode_t mode, int open_flag,
				 const struct cred *cred);
struct file *kernel_file_open(const struct path *path, int flags,
			      const struct cred *cred);

int vfs_mkobj(struct dentry *, umode_t,
		int (*f)(struct dentry *, umode_t, void *),
		void *);

int vfs_fchown(struct file *file, uid_t user, gid_t group);
int vfs_fchmod(struct file *file, umode_t mode);
int vfs_utimes(const struct path *path, struct timespec64 *times);

#ifdef CONFIG_COMPAT
extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
					unsigned long arg);
#else
#define compat_ptr_ioctl NULL
#endif

/*
 * VFS file helper functions.
 */
void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
		      const struct inode *dir, umode_t mode);
extern bool may_open_dev(const struct path *path);
umode_t mode_strip_sgid(struct mnt_idmap *idmap,
			const struct inode *dir, umode_t mode);
bool in_group_or_capable(struct mnt_idmap *idmap,
			 const struct inode *inode, vfsgid_t vfsgid);

/*
 * This is the "filldir" function type, used by readdir() to let
 * the kernel specify what kind of dirent layout it wants to have.
 * This allows the kernel to read directories into kernel space or
 * to have different dirent layouts depending on the binary type.
 * Return 'true' to keep going and 'false' if there are no more entries.
 */
struct dir_context;
typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
			 unsigned);

struct dir_context {
	filldir_t actor;
	loff_t pos;
	/*
	 * Filesystems MUST NOT MODIFY count, but may use it as a hint:
	 * 0	    unknown
	 * > 0      space in buffer (assume at least one entry)
	 * INT_MAX  unlimited
	 */
	int count;
};

/* If OR-ed with d_type, pending signals are not checked */
#define FILLDIR_FLAG_NOINTR	0x1000

/*
 * These flags let !MMU mmap() govern direct device mapping vs immediate
 * copying more easily for MAP_PRIVATE, especially for ROM filesystems.
 *
 * NOMMU_MAP_COPY:	Copy can be mapped (MAP_PRIVATE)
 * NOMMU_MAP_DIRECT:	Can be mapped directly (MAP_SHARED)
 * NOMMU_MAP_READ:	Can be mapped for reading
 * NOMMU_MAP_WRITE:	Can be mapped for writing
 * NOMMU_MAP_EXEC:	Can be mapped for execution
 */
#define NOMMU_MAP_COPY		0x00000001
#define NOMMU_MAP_DIRECT	0x00000008
#define NOMMU_MAP_READ		VM_MAYREAD
#define NOMMU_MAP_WRITE		VM_MAYWRITE
#define NOMMU_MAP_EXEC		VM_MAYEXEC

#define NOMMU_VMFLAGS \
	(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)

/*
 * These flags control the behavior of the remap_file_range function pointer.
 * If it is called with len == 0 that means "remap to end of source file".
 * See Documentation/filesystems/vfs.rst for more details about this call.
 *
 * REMAP_FILE_DEDUP: only remap if contents identical (i.e. deduplicate)
 * REMAP_FILE_CAN_SHORTEN: caller can handle a shortened request
 */
#define REMAP_FILE_DEDUP		(1 << 0)
#define REMAP_FILE_CAN_SHORTEN		(1 << 1)

/*
 * These flags signal that the caller is ok with altering various aspects of
 * the behavior of the remap operation.  The changes must be made by the
 * implementation; the vfs remap helper functions can take advantage of them.
 * Flags in this category exist to preserve the quirky behavior of the hoisted
 * btrfs clone/dedupe ioctls.
 */
#define REMAP_FILE_ADVISORY		(REMAP_FILE_CAN_SHORTEN)

/*
 * These flags control the behavior of vfs_copy_file_range().
 * They are not available to the user via syscall.
 *
 * COPY_FILE_SPLICE: call splice direct instead of fs clone/copy ops
 */
#define COPY_FILE_SPLICE		(1 << 0)

struct iov_iter;
struct io_uring_cmd;
struct offset_ctx;

typedef unsigned int __bitwise fop_flags_t;

struct file_operations {
	struct module *owner;
	fop_flags_t fop_flags;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
			unsigned int flags);
	int (*iterate_shared) (struct file *, struct dir_context *);
	__poll_t (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *, fl_owner_t id);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
	void (*splice_eof)(struct file *file);
	int (*setlease)(struct file *, int, struct file_lease **, void **);
	long (*fallocate)(struct file *file, int mode, loff_t offset,
			  loff_t len);
	void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
	unsigned (*mmap_capabilities)(struct file *);
#endif
	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
			loff_t, size_t, unsigned int);
	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
				   struct file *file_out, loff_t pos_out,
				   loff_t len, unsigned int remap_flags);
	int (*fadvise)(struct file *, loff_t, loff_t, int);
	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
				unsigned int poll_flags);
	int (*mmap_prepare)(struct vm_area_desc *);
} __randomize_layout;

/* Supports async buffered reads */
#define FOP_BUFFER_RASYNC	((__force fop_flags_t)(1 << 0))
/* Supports async buffered writes */
#define FOP_BUFFER_WASYNC	((__force fop_flags_t)(1 << 1))
/* Supports synchronous page faults for mappings */
#define FOP_MMAP_SYNC		((__force fop_flags_t)(1 << 2))
/* Supports non-exclusive O_DIRECT writes from multiple threads */
#define FOP_DIO_PARALLEL_WRITE	((__force fop_flags_t)(1 << 3))
/* Contains huge pages */
#define FOP_HUGE_PAGES		((__force fop_flags_t)(1 << 4))
/* Treat loff_t as unsigned (e.g., /dev/mem) */
#define FOP_UNSIGNED_OFFSET	((__force fop_flags_t)(1 << 5))
/* Supports asynchronous lock callbacks */
#define FOP_ASYNC_LOCK		((__force fop_flags_t)(1 << 6))
/* File system supports uncached read/write buffered IO */
#define FOP_DONTCACHE		((__force fop_flags_t)(1 << 7))

/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
			    int (*) (struct file *, struct dir_context *));
#define WRAP_DIR_ITER(x) \
	static int shared_##x(struct file *file, struct dir_context *ctx) \
	{ return wrap_directory_iterator(file, ctx, x); }

struct inode_operations {
	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
	int (*permission) (struct mnt_idmap *, struct inode *, int);
	struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);

	int (*readlink) (struct dentry *, char __user *,int);

	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
		       umode_t, bool);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
			const char *);
	struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,
				 struct dentry *, umode_t);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,
		      umode_t,dev_t);
	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
			struct inode *, struct dentry *, unsigned int);
	int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
	int (*getattr) (struct mnt_idmap *, const struct path *,
			struct kstat *, u32, unsigned int);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
		      u64 len);
	int (*update_time)(struct inode *, int);
	int (*atomic_open)(struct inode *, struct dentry *,
			   struct file *, unsigned open_flag,
			   umode_t create_mode);
	int (*tmpfile) (struct mnt_idmap *, struct inode *,
			struct file *, umode_t);
	struct posix_acl *(*get_acl)(struct mnt_idmap *, struct dentry *,
				     int);
	int (*set_acl)(struct mnt_idmap *, struct dentry *,
		       struct posix_acl *, int);
	int (*fileattr_set)(struct mnt_idmap *idmap,
			    struct dentry *dentry, struct file_kattr *fa);
	int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
	struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
} ____cacheline_aligned;

/* Did the driver provide valid mmap hook configuration? */
static inline bool can_mmap_file(struct file *file)
{
	bool has_mmap = file->f_op->mmap;
	bool has_mmap_prepare = file->f_op->mmap_prepare;

	/* Hooks are mutually exclusive. */
	if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
		return false;
	if (!has_mmap && !has_mmap_prepare)
		return false;

	return true;
}

int __compat_vma_mmap(const struct file_operations *f_op,
		struct file *file, struct vm_area_struct *vma);
int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);

static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
{
	if (file->f_op->mmap_prepare)
		return compat_vma_mmap(file, vma);

	return file->f_op->mmap(file, vma);
}

static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
{
	return file->f_op->mmap_prepare(desc);
}

extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
				   loff_t, size_t, unsigned int);
int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
				    struct file *file_out, loff_t pos_out,
				    loff_t *len, unsigned int remap_flags,
				    const struct iomap_ops *dax_read_ops);
int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
				  struct file *file_out, loff_t pos_out,
				  loff_t *count, unsigned int remap_flags);
extern loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
				   struct file *file_out, loff_t pos_out,
				   loff_t len, unsigned int remap_flags);
extern int vfs_dedupe_file_range(struct file *file,
				 struct file_dedupe_range *same);
extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
					struct file *dst_file, loff_t dst_pos,
					loff_t len, unsigned int remap_flags);

/*
 * Inode flags - they have no relation to superblock flags now
 */
#define S_SYNC		(1 << 0)  /* Writes are synced at once */
#define S_NOATIME	(1 << 1)  /* Do not update access times */
#define S_APPEND	(1 << 2)  /* Append-only file */
#define S_IMMUTABLE	(1 << 3)  /* Immutable file */
#define S_DEAD		(1 << 4)  /* removed, but still open directory */
#define S_NOQUOTA	(1 << 5)  /* Inode is not counted to quota */
#define S_DIRSYNC	(1 << 6)  /* Directory modifications are synchronous */
#define S_NOCMTIME	(1 << 7)  /* Do not update file c/mtime */
#define S_SWAPFILE	(1 << 8)  /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE	(1 << 9)  /* Inode is fs-internal */
#define S_IMA		(1 << 10) /* Inode has an associated IMA struct */
#define S_AUTOMOUNT	(1 << 11) /* Automount/referral quasi-directory */
#define S_NOSEC		(1 << 12) /* no suid or xattr security attributes */
#ifdef CONFIG_FS_DAX
#define S_DAX		(1 << 13) /* Direct Access, avoiding the page cache */
#else
#define S_DAX		0	  /* Make all the DAX code disappear */
#endif
#define S_ENCRYPTED	(1 << 14) /* Encrypted file (using fs/crypto/) */
#define S_CASEFOLD	(1 << 15) /* Casefolded file */
#define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
#define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
#define S_ANON_INODE	(1 << 19) /* Inode is an anonymous inode */

/*
 * Note that nosuid etc flags are inode-specific: setting some file-system
 * flags just means all the inodes inherit those flags by default. It might be
 * possible to override it selectively if you really wanted to with some
 * ioctl() that is not currently implemented.
 *
 * Exception: SB_RDONLY is always applied to the entire file system.
 *
 * Unfortunately, it is possible to change a filesystem's flags while it is
 * mounted and files are in use.  This means that the existing inodes will not
 * have their i_flags updated.  Hence, i_flags no longer inherit the superblock
 * mount flags, so these have to be checked separately. -- rmk@arm.uk.linux.org
 */
#define __IS_FLG(inode, flg)	((inode)->i_sb->s_flags & (flg))

#define IS_RDONLY(inode)	sb_rdonly((inode)->i_sb)
#define IS_SYNC(inode)		(__IS_FLG(inode, SB_SYNCHRONOUS) || \
					((inode)->i_flags & S_SYNC))
#define IS_DIRSYNC(inode)	(__IS_FLG(inode, SB_SYNCHRONOUS|SB_DIRSYNC) || \
					((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
#define IS_MANDLOCK(inode)	__IS_FLG(inode, SB_MANDLOCK)
#define IS_NOATIME(inode)	__IS_FLG(inode, SB_RDONLY|SB_NOATIME)
#define IS_I_VERSION(inode)	__IS_FLG(inode, SB_I_VERSION)

#define IS_NOQUOTA(inode)	((inode)->i_flags & S_NOQUOTA)
#define IS_APPEND(inode)	((inode)->i_flags & S_APPEND)
#define IS_IMMUTABLE(inode)	((inode)->i_flags & S_IMMUTABLE)

#ifdef CONFIG_FS_POSIX_ACL
#define IS_POSIXACL(inode)	__IS_FLG(inode, SB_POSIXACL)
#else
#define IS_POSIXACL(inode)	0
#endif

#define IS_DEADDIR(inode)	((inode)->i_flags & S_DEAD)
#define IS_NOCMTIME(inode)	((inode)->i_flags & S_NOCMTIME)

#ifdef CONFIG_SWAP
#define IS_SWAPFILE(inode)	((inode)->i_flags & S_SWAPFILE)
#else
#define IS_SWAPFILE(inode)	((void)(inode), 0U)
#endif

#define IS_PRIVATE(inode)	((inode)->i_flags & S_PRIVATE)
#define IS_IMA(inode)		((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
#define IS_ENCRYPTED(inode)	((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode)	((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode)	((inode)->i_flags & S_VERITY)

#define IS_WHITEOUT(inode)	(S_ISCHR((inode)->i_mode) && \
				 (inode)->i_rdev == WHITEOUT_DEV)
#define IS_ANON_FILE(inode)	((inode)->i_flags & S_ANON_INODE)

static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
				   struct inode *inode)
{
	return !vfsuid_valid(i_uid_into_vfsuid(idmap, inode)) ||
	       !vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
}

static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
	*kiocb = (struct kiocb) {
		.ki_filp = filp,
		.ki_flags = filp->f_iocb_flags,
		.ki_ioprio = get_current_ioprio(),
	};
}

static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
			       struct file *filp)
{
	*kiocb = (struct kiocb) {
		.ki_filp = filp,
		.ki_flags = kiocb_src->ki_flags,
		.ki_ioprio = kiocb_src->ki_ioprio,
		.ki_pos = kiocb_src->ki_pos,
	};
}

extern void __mark_inode_dirty(struct inode *, int);
static inline void mark_inode_dirty(struct inode *inode)
{
	__mark_inode_dirty(inode, I_DIRTY);
}

static inline void mark_inode_dirty_sync(struct inode *inode)
{
	__mark_inode_dirty(inode, I_DIRTY_SYNC);
}

static inline int icount_read(const struct inode *inode)
{
	return atomic_read(&inode->i_count);
}

/*
 * Returns true if the given inode itself only has dirty timestamps (its pages
 * may still be dirty) and isn't currently being allocated or freed.
 * Filesystems should call this when writing an inode with lazytime enabled,
 * if they want to opportunistically write the timestamps of other inodes
 * located very nearby on-disk, e.g. in the same inode block.  This returns true
 * if the given inode is in need of such an opportunistic update.  Requires
 * i_lock, or at least later re-checking under i_lock.
 */
static inline bool inode_is_dirtytime_only(struct inode *inode)
{
	return (inode_state_read_once(inode) &
	       (I_DIRTY_TIME | I_NEW | I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
}

extern void inc_nlink(struct inode *inode);
extern void drop_nlink(struct inode *inode);
extern void clear_nlink(struct inode *inode);
extern void set_nlink(struct inode *inode, unsigned int nlink);

static inline void inode_inc_link_count(struct inode *inode)
{
	inc_nlink(inode);
	mark_inode_dirty(inode);
}

static inline void inode_dec_link_count(struct inode *inode)
{
	drop_nlink(inode);
	mark_inode_dirty(inode);
}

enum file_time_flags {
	S_ATIME = 1,
	S_MTIME = 2,
	S_CTIME = 4,
	S_VERSION = 8,
};

extern bool atime_needs_update(const struct path *, struct inode *);
extern void touch_atime(const struct path *);
int inode_update_time(struct inode *inode, int flags);

static inline void file_accessed(struct file *file)
{
	if (!(file->f_flags & O_NOATIME))
		touch_atime(&file->f_path);
}

extern int file_modified(struct file *file);
int kiocb_modified(struct kiocb *iocb);

int sync_inode_metadata(struct inode *inode, int wait);

struct file_system_type {
	const char *name;
	int fs_flags;
#define FS_REQUIRES_DEV		1
#define FS_BINARY_MOUNTDATA	2
#define FS_HAS_SUBTYPE		4
#define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
#define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
#define FS_ALLOW_IDMAP		32	/* FS has been updated to handle vfs idmappings. */
#define FS_MGTIME		64	/* FS uses multigrain timestamps */
#define FS_LBS			128	/* FS supports LBS */
#define FS_POWER_FREEZE		256	/* Always freeze on suspend/hibernate */
#define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
	int (*init_fs_context)(struct fs_context *);
	const struct fs_parameter_spec *parameters;
	struct dentry *(*mount) (struct file_system_type *, int,
		       const char *, void *);
	void (*kill_sb) (struct super_block *);
	struct module *owner;
	struct file_system_type * next;
	struct hlist_head fs_supers;

	struct lock_class_key s_lock_key;
	struct lock_class_key s_umount_key;
	struct lock_class_key s_vfs_rename_key;
	struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

	struct lock_class_key i_lock_key;
	struct lock_class_key i_mutex_key;
	struct lock_class_key invalidate_lock_key;
	struct lock_class_key i_mutex_dir_key;
};

#define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)

/**
 * is_mgtime: is this inode using multigrain timestamps
 * @inode: inode to test for multigrain timestamps
 *
 * Return true if the inode uses multigrain timestamps, false otherwise.
 */
static inline bool is_mgtime(const struct inode *inode)
{
	return inode->i_opflags & IOP_MGTIME;
}

extern struct dentry *mount_subtree(struct vfsmount *mnt, const char *path);
void retire_super(struct super_block *sb);
void generic_shutdown_super(struct super_block *sb);
void kill_block_super(struct super_block *sb);
void kill_anon_super(struct super_block *sb);
void deactivate_super(struct super_block *sb);
void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
struct super_block *sget_fc(struct fs_context *fc,
			    int (*test)(struct super_block *, struct fs_context *),
			    int (*set)(struct super_block *, struct fs_context *));
struct super_block *sget(struct file_system_type *type,
			int (*test)(struct super_block *,void *),
			int (*set)(struct super_block *,void *),
			int flags, void *data);
struct super_block *sget_dev(struct fs_context *fc, dev_t dev);

/* Alas, no aliases. Too much hassle with bringing module.h everywhere */
#define fops_get(fops) ({						\
	const struct file_operations *_fops = (fops);			\
	(((_fops) && try_module_get((_fops)->owner) ? (_fops) : NULL));	\
})

#define fops_put(fops) ({						\
	const struct file_operations *_fops = (fops);			\
	if (_fops)							\
		module_put((_fops)->owner);				\
})

/*
 * This one is to be used *ONLY* from ->open() instances.
 * fops must be non-NULL, pinned down *and* module dependencies
 * should be sufficient to pin the caller down as well.
 */
#define replace_fops(f, fops) \
	do {	\
		struct file *__file = (f); \
		fops_put(__file->f_op); \
		BUG_ON(!(__file->f_op = (fops))); \
	} while(0)

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
extern int vfs_statfs(const struct path *, struct kstatfs *);
extern int user_statfs(const char __user *, struct kstatfs *);
extern int fd_statfs(int, struct kstatfs *);
extern __printf(2, 3)
int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
extern int super_setup_bdi(struct super_block *sb);

static inline void super_set_uuid(struct super_block *sb, const u8 *uuid, unsigned len)
{
	if (WARN_ON(len > sizeof(sb->s_uuid)))
		len = sizeof(sb->s_uuid);
	sb->s_uuid_len = len;
	memcpy(&sb->s_uuid, uuid, len);
}

/* set sb sysfs name based on sb->s_bdev */
static inline void super_set_sysfs_name_bdev(struct super_block *sb)
{
	snprintf(sb->s_sysfs_name, sizeof(sb->s_sysfs_name), "%pg", sb->s_bdev);
}

/* set sb sysfs name based on sb->s_uuid */
static inline void super_set_sysfs_name_uuid(struct super_block *sb)
{
	WARN_ON(sb->s_uuid_len != sizeof(sb->s_uuid));
	snprintf(sb->s_sysfs_name, sizeof(sb->s_sysfs_name), "%pU", sb->s_uuid.b);
}

/* set sb sysfs name based on sb->s_id */
static inline void super_set_sysfs_name_id(struct super_block *sb)
{
	strscpy(sb->s_sysfs_name, sb->s_id, sizeof(sb->s_sysfs_name));
}

/* try to use something standard before you use this */
__printf(2, 3)
static inline void super_set_sysfs_name_generic(struct super_block *sb, const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	vsnprintf(sb->s_sysfs_name, sizeof(sb->s_sysfs_name), fmt, args);
	va_end(args);
}

extern void ihold(struct inode * inode);
extern void iput(struct inode *);
void iput_not_last(struct inode *);
int inode_update_timestamps(struct inode *inode, int flags);
int generic_update_time(struct inode *, int);

/* /sys/fs */
extern struct kobject *fs_kobj;

#define MAX_RW_COUNT (INT_MAX & PAGE_MASK)

/* fs/open.c */
struct audit_names;
struct filename {
	const char		*name;	/* pointer to actual string */
	const __user char	*uptr;	/* original userland pointer */
	atomic_t		refcnt;
	struct audit_names	*aname;
	const char		iname[];
};
static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);

static inline struct mnt_idmap *file_mnt_idmap(const struct file *file)
{
	return mnt_idmap(file->f_path.mnt);
}

/**
 * is_idmapped_mnt - check whether a mount is idmapped
 * @mnt: the mount to check
 *
 * If @mnt has a non-@nop_mnt_idmap attached to it then @mnt is idmapped.
 *
 * Return: true if the mount is idmapped, false if not.
 */
static inline bool is_idmapped_mnt(const struct vfsmount *mnt)
{
	return mnt_idmap(mnt) != &nop_mnt_idmap;
}

int vfs_truncate(const struct path *, loff_t);
int do_truncate(struct mnt_idmap *, struct dentry *, loff_t start,
		unsigned int time_attrs, struct file *filp);
extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
			loff_t len);
int do_sys_open(int dfd, const char __user *filename, int flags,
		umode_t mode);
extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(const struct path *,
				   const char *, int, umode_t);
static inline struct file *file_open_root_mnt(struct vfsmount *mnt,
				   const char *name, int flags, umode_t mode)
{
	return file_open_root(&(struct path){.mnt = mnt, .dentry = mnt->mnt_root},
			      name, flags, mode);
}
struct file *dentry_open(const struct path *path, int flags,
			 const struct cred *creds);
struct file *dentry_open_nonotify(const struct path *path, int flags,
				  const struct cred *cred);
struct file *dentry_create(const struct path *path, int flags, umode_t mode,
			   const struct cred *cred);
const struct path *backing_file_user_path(const struct file *f);

/*
 * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
 * stored in ->vm_file is a backing file whose f_inode is on the underlying
 * filesystem.  When the mapped file path and inode number are displayed to
 * user (e.g. via /proc/<pid>/maps), these helpers should be used to get the
 * path and inode number to display to the user, which is the path of the fd
 * that user has requested to map and the inode number that would be returned
 * by fstat() on that same fd.
 */
/* Get the path to display in /proc/<pid>/maps */
static inline const struct path *file_user_path(const struct file *f)
{
	if (unlikely(f->f_mode & FMODE_BACKING))
		return backing_file_user_path(f);
	return &f->f_path;
}
/* Get the inode whose inode number to display in /proc/<pid>/maps */
static inline const struct inode *file_user_inode(const struct file *f)
{
	if (unlikely(f->f_mode & FMODE_BACKING))
		return d_inode(backing_file_user_path(f)->dentry);
	return file_inode(f);
}

static inline struct file *file_clone_open(struct file *file)
{
	return dentry_open(&file->f_path, file->f_flags, file->f_cred);
}
extern int filp_close(struct file *, fl_owner_t id);

extern struct filename *getname_flags(const char __user *, int);
extern struct filename *getname_uflags(const char __user *, int);
static inline struct filename *getname(const char __user *name)
{
	return getname_flags(name, 0);
}
extern struct filename *getname_kernel(const char *);
extern struct filename *__getname_maybe_null(const char __user *);
static inline struct filename *getname_maybe_null(const char __user *name, int flags)
{
	if (!(flags & AT_EMPTY_PATH))
		return getname(name);

	if (!name)
		return NULL;
	return __getname_maybe_null(name);
}
extern void putname(struct filename *name);
DEFINE_FREE(putname, struct filename *, if (!IS_ERR_OR_NULL(_T)) putname(_T))

static inline struct filename *refname(struct filename *name)
{
	atomic_inc(&name->refcnt);
	return name;
}

extern int finish_open(struct file *file, struct dentry *dentry,
			int (*open)(struct inode *, struct file *));
extern int finish_no_open(struct file *file, struct dentry *dentry);

/* Helper for the simple case when original dentry is used */
static inline int finish_open_simple(struct file *file, int error)
{
	if (error)
		return error;

	return finish_open(file, file->f_path.dentry, NULL);
}

/* fs/dcache.c */
extern void __init vfs_caches_init_early(void);
extern void __init vfs_caches_init(void);

extern struct kmem_cache *names_cachep;

#define __getname()		kmem_cache_alloc(names_cachep, GFP_KERNEL)
#define __putname(name)		kmem_cache_free(names_cachep, (void *)(name))

void emergency_thaw_all(void);
extern int sync_filesystem(struct super_block *);
extern const struct file_operations def_blk_fops;
extern const struct file_operations def_chr_fops;

/* fs/char_dev.c */
#define CHRDEV_MAJOR_MAX 512
/* Marks the bottom of the first segment of free char majors */
#define CHRDEV_MAJOR_DYN_END 234
/* Marks the top and bottom of the second segment of free char majors */
#define CHRDEV_MAJOR_DYN_EXT_START 511
#define CHRDEV_MAJOR_DYN_EXT_END 384

extern int alloc_chrdev_region(dev_t *, unsigned, unsigned, const char *);
extern int register_chrdev_region(dev_t, unsigned, const char *);
extern int __register_chrdev(unsigned int major, unsigned int baseminor,
			     unsigned int count, const char *name,
			     const struct file_operations *fops);
extern void __unregister_chrdev(unsigned int major, unsigned int baseminor,
				unsigned int count, const char *name);
extern void unregister_chrdev_region(dev_t, unsigned);
extern void chrdev_show(struct seq_file *,off_t);

static inline int register_chrdev(unsigned int major, const char *name,
				  const struct file_operations *fops)
{
	return __register_chrdev(major, 0, 256, name, fops);
}

static inline void unregister_chrdev(unsigned int major, const char *name)
{
	__unregister_chrdev(major, 0, 256, name);
}

extern void init_special_inode(struct inode *, umode_t, dev_t);

/* Invalid inode operations -- fs/bad_inode.c */
extern void make_bad_inode(struct inode *);
extern bool is_bad_inode(struct inode *);

extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
						loff_t lend);
extern int __must_check file_check_and_advance_wb_err(struct file *file);
extern int __must_check file_write_and_wait_range(struct file *file,
						loff_t start, loff_t end);
int filemap_flush_range(struct address_space *mapping, loff_t start,
		loff_t end);

static inline int file_write_and_wait(struct file *file)
{
	return file_write_and_wait_range(file, 0, LLONG_MAX);
}

extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
			   int datasync);
extern int vfs_fsync(struct file *file, int datasync);

extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
				unsigned int flags);

static inline bool iocb_is_dsync(const struct kiocb *iocb)
{
	return (iocb->ki_flags & IOCB_DSYNC) ||
		IS_SYNC(iocb->ki_filp->f_mapping->host);
}

/*
 * Sync the bytes written if this was a synchronous write.  Expects ki_pos
 * to already be updated for the write, and returns either the number of
 * bytes passed in, or an error if syncing the file failed.
 */
static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
	if (iocb_is_dsync(iocb)) {
		int ret = vfs_fsync_range(iocb->ki_filp,
				iocb->ki_pos - count, iocb->ki_pos - 1,
				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
		if (ret)
			return ret;
	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
		struct address_space *mapping = iocb->ki_filp->f_mapping;

		filemap_flush_range(mapping, iocb->ki_pos - count,
				iocb->ki_pos - 1);
	}

	return count;
}

extern void emergency_sync(void);
extern void emergency_remount(void);

#ifdef CONFIG_BLOCK
extern int bmap(struct inode *inode, sector_t *block);
#else
static inline int bmap(struct inode *inode, sector_t *block)
{
	return -EINVAL;
}
#endif

int notify_change(struct mnt_idmap *, struct dentry *,
		  struct iattr *, struct delegated_inode *);
int inode_permission(struct mnt_idmap *, struct inode *, int);
int generic_permission(struct mnt_idmap *, struct inode *, int);
static inline int file_permission(struct file *file, int mask)
{
	return inode_permission(file_mnt_idmap(file),
				file_inode(file), mask);
}
static inline int path_permission(const struct path *path, int mask)
{
	return inode_permission(mnt_idmap(path->mnt),
				d_inode(path->dentry), mask);
}
int __check_sticky(struct mnt_idmap *idmap, struct inode *dir,
		   struct inode *inode);

static inline bool execute_ok(struct inode *inode)
{
	return (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode);
}

static inline bool inode_wrong_type(const struct inode *inode, umode_t mode)
{
	return (inode->i_mode ^ mode) & S_IFMT;
}

/**
 * file_start_write - get write access to a superblock for regular file io
 * @file: the file we want to write to
 *
 * This is a variant of sb_start_write() which is a no-op on non-regular files.
 * Should be matched with a call to file_end_write().
 */
static inline void file_start_write(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return;
	sb_start_write(file_inode(file)->i_sb);
}

static inline bool file_start_write_trylock(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return true;
	return sb_start_write_trylock(file_inode(file)->i_sb);
}

/**
 * file_end_write - drop write access to a superblock of a regular file
 * @file: the file we wrote to
 *
 * Should be matched with a call to file_start_write().
 */
static inline void file_end_write(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return;
	sb_end_write(file_inode(file)->i_sb);
}

/**
 * kiocb_start_write - get write access to a superblock for async file io
 * @iocb: the io context we want to submit the write with
 *
 * This is a variant of sb_start_write() for async io submission.
 * Should be matched with a call to kiocb_end_write().
 */
static inline void kiocb_start_write(struct kiocb *iocb)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	sb_start_write(inode->i_sb);
	/*
	 * Fool lockdep by telling it the lock got released so that it
	 * doesn't complain about the held lock when we return to userspace.
	 */
	__sb_writers_release(inode->i_sb, SB_FREEZE_WRITE);
}

/**
 * kiocb_end_write - drop write access to a superblock after async file io
 * @iocb: the io context we submitted the write with
 *
 * Should be matched with a call to kiocb_start_write().
 */
static inline void kiocb_end_write(struct kiocb *iocb)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	/*
	 * Tell lockdep we inherited freeze protection from submission thread.
	 */
	__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
	sb_end_write(inode->i_sb);
}

/*
 * This is used for regular files where some users -- especially the
 * currently executed binary in a process, previously handled via
 * VM_DENYWRITE -- cannot handle concurrent write (and maybe mmap
 * read-write shared) accesses.
 *
 * get_write_access() gets write permission for a file.
 * put_write_access() releases this write permission.
 * deny_write_access() denies write access to a file.
 * allow_write_access() re-enables write access to a file.
 *
 * The i_writecount field of an inode can have the following values:
 * 0: no write access, no denied write access
 * < 0: (-i_writecount) users that denied write access to the file.
 * > 0: (i_writecount) users that have write access to the file.
 *
 * Normally we operate on that counter with atomic_{inc,dec} and it's safe
 * except for the cases where we don't hold i_writecount yet. Then we need to
 * use {get,deny}_write_access() - these functions check the sign and refuse
 * to do the change if sign is wrong.
 */
static inline int get_write_access(struct inode *inode)
{
	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
}
static inline int deny_write_access(struct file *file)
{
	struct inode *inode = file_inode(file);
	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
}
static inline void put_write_access(struct inode * inode)
{
	atomic_dec(&inode->i_writecount);
}
static inline void allow_write_access(struct file *file)
{
	if (file)
		atomic_inc(&file_inode(file)->i_writecount);
}

/*
 * Do not prevent write to executable file when watched by pre-content events.
 *
 * Note that FMODE_FSNOTIFY_HSM mode is set depending on pre-content watches at
 * the time of file open and remains constant for entire lifetime of the file,
 * so if pre-content watches are added post execution or removed before the end
 * of the execution, it will not cause i_writecount reference leak.
 */
static inline int exe_file_deny_write_access(struct file *exe_file)
{
	if (unlikely(FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
		return 0;
	return deny_write_access(exe_file);
}
static inline void exe_file_allow_write_access(struct file *exe_file)
{
	if (unlikely(!exe_file || FMODE_FSNOTIFY_HSM(exe_file->f_mode)))
		return;
	allow_write_access(exe_file);
}

static inline void file_set_fsnotify_mode(struct file *file, fmode_t mode)
{
	file->f_mode &= ~FMODE_FSNOTIFY_MASK;
	file->f_mode |= mode;
}

static inline bool inode_is_open_for_write(const struct inode *inode)
{
	return atomic_read(&inode->i_writecount) > 0;
}

#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
static inline void i_readcount_dec(struct inode *inode)
{
	BUG_ON(atomic_dec_return(&inode->i_readcount) < 0);
}
static inline void i_readcount_inc(struct inode *inode)
{
	atomic_inc(&inode->i_readcount);
}
#else
static inline void i_readcount_dec(struct inode *inode)
{
	return;
}
static inline void i_readcount_inc(struct inode *inode)
{
	return;
}
#endif
extern int do_pipe_flags(int *, int);

extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos);
extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
extern struct file * open_exec(const char *);
 
/* fs/dcache.c -- generic fs support functions */
extern bool is_subdir(struct dentry *, struct dentry *);
extern bool path_is_under(const struct path *, const struct path *);
u64 vfsmount_to_propagation_flags(struct vfsmount *mnt);

extern char *file_path(struct file *, char *, int);

/**
 * is_dot_dotdot - returns true only if @name is "." or ".."
 * @name: file name to check
 * @len: length of file name, in bytes
 */
static inline bool is_dot_dotdot(const char *name, size_t len)
{
	return len && unlikely(name[0] == '.') &&
		(len == 1 || (len == 2 && name[1] == '.'));
}

/**
 * name_contains_dotdot - check if a file name contains ".." path components
 * @name: File path string to check
 * Search for ".." surrounded by either '/' or start/end of string.
 */
static inline bool name_contains_dotdot(const char *name)
{
	size_t name_len;

	name_len = strlen(name);
	return strcmp(name, "..") == 0 ||
	       strncmp(name, "../", 3) == 0 ||
	       strstr(name, "/../") != NULL ||
	       (name_len >= 3 && strcmp(name + name_len - 3, "/..") == 0);
}

#include <linux/err.h>

/* needed for stackable file system support */
extern loff_t default_llseek(struct file *file, loff_t offset, int whence);

extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence);

extern int inode_init_always_gfp(struct super_block *, struct inode *, gfp_t);
static inline int inode_init_always(struct super_block *sb, struct inode *inode)
{
	return inode_init_always_gfp(sb, inode, GFP_NOFS);
}

extern void inode_init_once(struct inode *);
extern void address_space_init_once(struct address_space *mapping);
extern struct inode * igrab(struct inode *);
extern ino_t iunique(struct super_block *, ino_t);
extern int inode_needs_sync(struct inode *inode);
extern int inode_just_drop(struct inode *inode);
static inline int inode_generic_drop(struct inode *inode)
{
	return !inode->i_nlink || inode_unhashed(inode);
}
extern void d_mark_dontcache(struct inode *inode);

extern struct inode *ilookup5_nowait(struct super_block *sb,
		unsigned long hashval, int (*test)(struct inode *, void *),
		void *data, bool *isnew);
extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
		int (*test)(struct inode *, void *), void *data);
extern struct inode *ilookup(struct super_block *sb, unsigned long ino);

extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
		int (*test)(struct inode *, void *),
		int (*set)(struct inode *, void *),
		void *data);
struct inode *iget5_locked(struct super_block *, unsigned long,
			   int (*test)(struct inode *, void *),
			   int (*set)(struct inode *, void *), void *);
struct inode *iget5_locked_rcu(struct super_block *, unsigned long,
			       int (*test)(struct inode *, void *),
			       int (*set)(struct inode *, void *), void *);
extern struct inode * iget_locked(struct super_block *, unsigned long);
extern struct inode *find_inode_nowait(struct super_block *,
				       unsigned long,
				       int (*match)(struct inode *,
						    unsigned long, void *),
				       void *data);
extern struct inode *find_inode_rcu(struct super_block *, unsigned long,
				    int (*)(struct inode *, void *), void *);
extern struct inode *find_inode_by_ino_rcu(struct super_block *, unsigned long);
extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
extern int insert_inode_locked(struct inode *);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern void lockdep_annotate_inode_mutex_key(struct inode *inode);
#else
static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
#endif
extern void unlock_new_inode(struct inode *);
extern void discard_new_inode(struct inode *);
extern unsigned int get_next_ino(void);
extern void evict_inodes(struct super_block *sb);
void dump_mapping(const struct address_space *);

/*
 * Userspace may rely on the inode number being non-zero. For example, glibc
 * simply ignores files with zero i_ino in unlink() and other places.
 *
 * As an additional complication, if userspace was compiled with
 * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the
 * lower 32 bits, so we need to check that those aren't zero explicitly. With
 * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but
 * better safe than sorry.
 */
static inline bool is_zero_ino(ino_t ino)
{
	return (u32)ino == 0;
}

static inline void __iget(struct inode *inode)
{
	lockdep_assert_held(&inode->i_lock);
	atomic_inc(&inode->i_count);
}

extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void __destroy_inode(struct inode *);
struct inode *alloc_inode(struct super_block *sb);
static inline struct inode *new_inode_pseudo(struct super_block *sb)
{
	return alloc_inode(sb);
}
extern struct inode *new_inode(struct super_block *sb);
extern void free_inode_nonrcu(struct inode *inode);
extern int setattr_should_drop_suidgid(struct mnt_idmap *, struct inode *);
extern int file_remove_privs(struct file *);
int setattr_should_drop_sgid(struct mnt_idmap *idmap,
			     const struct inode *inode);

/*
 * This must be used for allocating filesystems specific inodes to set
 * up the inode reclaim context correctly.
 */
#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)

extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
{
	__insert_inode_hash(inode, inode->i_ino);
}

extern void __remove_inode_hash(struct inode *);
static inline void remove_inode_hash(struct inode *inode)
{
	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
		__remove_inode_hash(inode);
}

extern void inode_sb_list_add(struct inode *inode);
extern void inode_lru_list_add(struct inode *inode);

int generic_file_mmap(struct file *, struct vm_area_struct *);
int generic_file_mmap_prepare(struct vm_area_desc *desc);
int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
int generic_file_readonly_mmap_prepare(struct vm_area_desc *desc);
extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
int generic_write_checks_count(struct kiocb *iocb, loff_t *count);
extern int generic_write_check_limits(struct file *file, loff_t pos,
		loff_t *count);
extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *to,
		ssize_t already_read);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
ssize_t generic_perform_write(struct kiocb *, struct iov_iter *);
ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
		ssize_t direct_written, ssize_t buffered_written);

ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
		rwf_t flags);
ssize_t vfs_iter_write(struct file *file, struct iov_iter *iter, loff_t *ppos,
		rwf_t flags);
ssize_t vfs_iocb_iter_read(struct file *file, struct kiocb *iocb,
			   struct iov_iter *iter);
ssize_t vfs_iocb_iter_write(struct file *file, struct kiocb *iocb,
			    struct iov_iter *iter);

/* fs/splice.c */
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
			    struct pipe_inode_info *pipe,
			    size_t len, unsigned int flags);
ssize_t copy_splice_read(struct file *in, loff_t *ppos,
			 struct pipe_inode_info *pipe,
			 size_t len, unsigned int flags);
extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
		struct file *, loff_t *, size_t, unsigned int);


extern void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
extern loff_t noop_llseek(struct file *file, loff_t offset, int whence);
extern loff_t vfs_setpos(struct file *file, loff_t offset, loff_t maxsize);
extern loff_t generic_file_llseek(struct file *file, loff_t offset, int whence);
extern loff_t generic_file_llseek_size(struct file *file, loff_t offset,
		int whence, loff_t maxsize, loff_t eof);
loff_t generic_llseek_cookie(struct file *file, loff_t offset, int whence,
			     u64 *cookie);
extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
		int whence, loff_t size);
extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
int rw_verify_area(int, struct file *, const loff_t *, size_t);
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);
extern int stream_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(struct bio *bio, struct inode *inode,
			    loff_t file_offset);

enum {
	/* need locking between buffered and direct access */
	DIO_LOCKING	= 0x01,

	/* filesystem does not support filling holes */
	DIO_SKIP_HOLES	= 0x02,
};

ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
			     struct block_device *bdev, struct iov_iter *iter,
			     get_block_t get_block,
			     dio_iodone_t end_io,
			     int flags);

static inline ssize_t blockdev_direct_IO(struct kiocb *iocb,
					 struct inode *inode,
					 struct iov_iter *iter,
					 get_block_t get_block)
{
	return __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
			get_block, NULL, DIO_LOCKING | DIO_SKIP_HOLES);
}
#endif

bool inode_dio_finished(const struct inode *inode);
void inode_dio_wait(struct inode *inode);
void inode_dio_wait_interruptible(struct inode *inode);

/**
 * inode_dio_begin - signal start of a direct I/O request
 * @inode: inode the direct I/O happens on
 *
 * This is called before we start processing a direct I/O request,
 * and should be paired with a call to inode_dio_end() once the
 * request has completed.
 */
static inline void inode_dio_begin(struct inode *inode)
{
	atomic_inc(&inode->i_dio_count);
}

/**
 * inode_dio_end - signal finish of a direct I/O request
 * @inode: inode the direct I/O happens on
 *
 * This is called once we've finished processing a direct I/O request,
 * and is used to wake up callers waiting for direct I/O to be quiesced.
 */
static inline void inode_dio_end(struct inode *inode)
{
	if (atomic_dec_and_test(&inode->i_dio_count))
		wake_up_var(&inode->i_dio_count);
}

extern void inode_set_flags(struct inode *inode, unsigned int flags,
			    unsigned int mask);

extern const struct file_operations generic_ro_fops;

#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))

extern int readlink_copy(char __user *, int, const char *, int);
extern int page_readlink(struct dentry *, char __user *, int);
extern const char *page_get_link_raw(struct dentry *, struct inode *,
				     struct delayed_call *);
extern const char *page_get_link(struct dentry *, struct inode *,
				 struct delayed_call *);
extern void page_put_link(void *);
extern int page_symlink(struct inode *inode, const char *symname, int len);
extern const struct inode_operations page_symlink_inode_operations;
extern void kfree_link(void *);
void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
void generic_fill_statx_atomic_writes(struct kstat *stat,
				      unsigned int unit_min,
				      unsigned int unit_max,
				      unsigned int unit_max_opt);
extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
void __inode_add_bytes(struct inode *inode, loff_t bytes);
void inode_add_bytes(struct inode *inode, loff_t bytes);
void __inode_sub_bytes(struct inode *inode, loff_t bytes);
void inode_sub_bytes(struct inode *inode, loff_t bytes);
static inline loff_t __inode_get_bytes(struct inode *inode)
{
	return (((loff_t)inode->i_blocks) << 9) + inode->i_bytes;
}
loff_t inode_get_bytes(struct inode *inode);
void inode_set_bytes(struct inode *inode, loff_t bytes);
const char *simple_get_link(struct dentry *, struct inode *,
			    struct delayed_call *);
extern const struct inode_operations simple_symlink_inode_operations;

extern int iterate_dir(struct file *, struct dir_context *);

int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
		int flags);
int vfs_fstat(int fd, struct kstat *stat);

static inline int vfs_stat(const char __user *filename, struct kstat *stat)
{
	return vfs_fstatat(AT_FDCWD, filename, stat, 0);
}
static inline int vfs_lstat(const char __user *name, struct kstat *stat)
{
	return vfs_fstatat(AT_FDCWD, name, stat, AT_SYMLINK_NOFOLLOW);
}

extern const char *vfs_get_link(struct dentry *, struct delayed_call *);
extern int vfs_readlink(struct dentry *, char __user *, int);

extern struct file_system_type *get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
extern struct file_system_type *get_fs_type(const char *name);
extern void drop_super(struct super_block *sb);
extern void drop_super_exclusive(struct super_block *sb);
extern void iterate_supers(void (*f)(struct super_block *, void *), void *arg);
extern void iterate_supers_type(struct file_system_type *,
			        void (*)(struct super_block *, void *), void *);
void filesystems_freeze(bool freeze_all);
void filesystems_thaw(void);

void end_dirop(struct dentry *de);

extern int dcache_dir_open(struct inode *, struct file *);
extern int dcache_dir_close(struct inode *, struct file *);
extern loff_t dcache_dir_lseek(struct file *, loff_t, int);
extern int dcache_readdir(struct file *, struct dir_context *);
extern int simple_setattr(struct mnt_idmap *, struct dentry *,
			  struct iattr *);
extern int simple_getattr(struct mnt_idmap *, const struct path *,
			  struct kstat *, u32, unsigned int);
extern int simple_statfs(struct dentry *, struct kstatfs *);
extern int simple_open(struct inode *inode, struct file *file);
extern int simple_link(struct dentry *, struct inode *, struct dentry *);
extern int simple_unlink(struct inode *, struct dentry *);
extern int simple_rmdir(struct inode *, struct dentry *);
extern void __simple_unlink(struct inode *, struct dentry *);
extern void __simple_rmdir(struct inode *, struct dentry *);
void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
			     struct inode *new_dir, struct dentry *new_dentry);
extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
				  struct inode *new_dir, struct dentry *new_dentry);
extern int simple_rename(struct mnt_idmap *, struct inode *,
			 struct dentry *, struct inode *, struct dentry *,
			 unsigned int);
extern void simple_recursive_removal(struct dentry *,
                              void (*callback)(struct dentry *));
extern void simple_remove_by_name(struct dentry *, const char *,
                              void (*callback)(struct dentry *));
extern void locked_recursive_removal(struct dentry *,
                              void (*callback)(struct dentry *));
extern int noop_fsync(struct file *, loff_t, loff_t, int);
extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
extern int simple_empty(struct dentry *);
extern int simple_write_begin(const struct kiocb *iocb,
			      struct address_space *mapping,
			      loff_t pos, unsigned len,
			      struct folio **foliop, void **fsdata);
extern const struct address_space_operations ram_aops;
extern int always_delete_dentry(const struct dentry *);
extern struct inode *alloc_anon_inode(struct super_block *);
struct inode *anon_inode_make_secure_inode(struct super_block *sb, const char *name,
					   const struct inode *context_inode);
extern int simple_nosetlease(struct file *, int, struct file_lease **, void **);

extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
extern const struct file_operations simple_dir_operations;
extern const struct inode_operations simple_dir_inode_operations;
extern void make_empty_dir_inode(struct inode *inode);
extern bool is_empty_dir_inode(struct inode *inode);
struct tree_descr { const char *name; const struct file_operations *ops; int mode; };
struct dentry *d_alloc_name(struct dentry *, const char *);
extern int simple_fill_super(struct super_block *, unsigned long,
			     const struct tree_descr *);
extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count);
extern void simple_release_fs(struct vfsmount **mount, int *count);
struct dentry *simple_start_creating(struct dentry *, const char *);
void simple_done_creating(struct dentry *);

extern ssize_t simple_read_from_buffer(void __user *to, size_t count,
			loff_t *ppos, const void *from, size_t available);
extern ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
		const void __user *from, size_t count);

struct offset_ctx {
	struct maple_tree	mt;
	unsigned long		next_offset;
};

void simple_offset_init(struct offset_ctx *octx);
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry);
void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry);
void simple_offset_rename(struct inode *old_dir, struct dentry *old_dentry,
			 struct inode *new_dir, struct dentry *new_dentry);
int simple_offset_rename_exchange(struct inode *old_dir,
				  struct dentry *old_dentry,
				  struct inode *new_dir,
				  struct dentry *new_dentry);
void simple_offset_destroy(struct offset_ctx *octx);

extern const struct file_operations simple_offset_dir_operations;

extern int __generic_file_fsync(struct file *, loff_t, loff_t, int);
extern int generic_file_fsync(struct file *, loff_t, loff_t, int);

extern int generic_check_addressable(unsigned, u64);

extern void generic_set_sb_d_ops(struct super_block *sb);
extern int generic_ci_match(const struct inode *parent,
			    const struct qstr *name,
			    const struct qstr *folded_name,
			    const u8 *de_name, u32 de_name_len);

#if IS_ENABLED(CONFIG_UNICODE)
int generic_ci_d_hash(const struct dentry *dentry, struct qstr *str);
int generic_ci_d_compare(const struct dentry *dentry, unsigned int len,
			 const char *str, const struct qstr *name);

/**
 * generic_ci_validate_strict_name - Check if a given name is suitable
 * for a directory
 *
 * This function checks if the proposed filename is valid for the
 * parent directory. That means that only valid UTF-8 filenames will be
 * accepted for casefold directories from filesystems created with the
 * strict encoding flag. It also means that any name will be accepted
 * for directories that don't have casefold enabled, or that aren't
 * strict about the encoding.
 *
 * @dir: inode of the directory where the new file will be created
 * @name: name of the new file
 *
 * Return:
 * * True if the filename is suitable for this directory. This includes
 *   names that would be rejected by a strict-encoding directory when
 *   the directory being used isn't strict
 * * False if the filename isn't suitable for this directory. This only
 *   happens when a directory is casefolded and the filesystem is strict
 *   about its encoding.
 */
static inline bool generic_ci_validate_strict_name(struct inode *dir,
						   const struct qstr *name)
{
	if (!IS_CASEFOLDED(dir) || !sb_has_strict_encoding(dir->i_sb))
		return true;

	/*
	 * A casefold dir must have an encoding set, unless the filesystem
	 * is corrupted
	 */
	if (WARN_ON_ONCE(!dir->i_sb->s_encoding))
		return true;

	return !utf8_validate(dir->i_sb->s_encoding, name);
}
#else
static inline bool generic_ci_validate_strict_name(struct inode *dir,
						   const struct qstr *name)
{
	return true;
}
#endif

int may_setattr(struct mnt_idmap *idmap, struct inode *inode,
		unsigned int ia_valid);
int setattr_prepare(struct mnt_idmap *, struct dentry *, struct iattr *);
extern int inode_newsize_ok(const struct inode *, loff_t offset);
void setattr_copy(struct mnt_idmap *, struct inode *inode,
		  const struct iattr *attr);

extern int file_update_time(struct file *file);

static inline bool file_is_dax(const struct file *file)
{
	return file && IS_DAX(file->f_mapping->host);
}

static inline bool vma_is_dax(const struct vm_area_struct *vma)
{
	return file_is_dax(vma->vm_file);
}

static inline bool vma_is_fsdax(struct vm_area_struct *vma)
{
	struct inode *inode;

	if (!IS_ENABLED(CONFIG_FS_DAX) || !vma->vm_file)
		return false;
	if (!vma_is_dax(vma))
		return false;
	inode = file_inode(vma->vm_file);
	if (S_ISCHR(inode->i_mode))
		return false; /* device-dax */
	return true;
}

static inline int iocb_flags(struct file *file)
{
	int res = 0;
	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
	if (file->f_flags & O_DIRECT)
		res |= IOCB_DIRECT;
	if (file->f_flags & O_DSYNC)
		res |= IOCB_DSYNC;
	if (file->f_flags & __O_SYNC)
		res |= IOCB_SYNC;
	return res;
}

static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
				     int rw_type)
{
	int kiocb_flags = 0;

	/* make sure there's no overlap between RWF and private IOCB flags */
	BUILD_BUG_ON((__force int) RWF_SUPPORTED & IOCB_EVENTFD);

	if (!flags)
		return 0;
	if (unlikely(flags & ~RWF_SUPPORTED))
		return -EOPNOTSUPP;
	if (unlikely((flags & RWF_APPEND) && (flags & RWF_NOAPPEND)))
		return -EINVAL;

	if (flags & RWF_NOWAIT) {
		if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
			return -EOPNOTSUPP;
	}
	if (flags & RWF_ATOMIC) {
		if (rw_type != WRITE)
			return -EOPNOTSUPP;
		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
			return -EOPNOTSUPP;
	}
	if (flags & RWF_DONTCACHE) {
		/* file system must support it */
		if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
			return -EOPNOTSUPP;
		/* DAX mappings not supported */
		if (IS_DAX(ki->ki_filp->f_mapping->host))
			return -EOPNOTSUPP;
	}
	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
	if (flags & RWF_SYNC)
		kiocb_flags |= IOCB_DSYNC;

	if ((flags & RWF_NOAPPEND) && (ki->ki_flags & IOCB_APPEND)) {
		if (IS_APPEND(file_inode(ki->ki_filp)))
			return -EPERM;
		ki->ki_flags &= ~IOCB_APPEND;
	}

	ki->ki_flags |= kiocb_flags;
	return 0;
}

/* Transaction based IO helpers */

/*
 * An argresp is stored in an allocated page and holds the
 * size of the argument or response, along with its content
 */
struct simple_transaction_argresp {
	ssize_t size;
	char data[];
};

#define SIMPLE_TRANSACTION_LIMIT (PAGE_SIZE - sizeof(struct simple_transaction_argresp))

char *simple_transaction_get(struct file *file, const char __user *buf,
				size_t size);
ssize_t simple_transaction_read(struct file *file, char __user *buf,
				size_t size, loff_t *pos);
int simple_transaction_release(struct inode *inode, struct file *file);

void simple_transaction_set(struct file *file, size_t n);

/*
 * simple attribute files
 *
 * These attributes behave similarly to those in sysfs:
 *
 * Writing to an attribute immediately sets a value, an open file can be
 * written to multiple times.
 *
 * Reading from an attribute creates a buffer from the value that might get
 * read with multiple read calls. When the attribute has been read
 * completely, no further read calls are possible until the file is opened
 * again.
 *
 * All attributes contain a text representation of a numeric value
 * that is accessed with the get() and set() functions.
 */
#define DEFINE_SIMPLE_ATTRIBUTE_XSIGNED(__fops, __get, __set, __fmt, __is_signed)	\
static int __fops ## _open(struct inode *inode, struct file *file)	\
{									\
	__simple_attr_check_format(__fmt, 0ull);			\
	return simple_attr_open(inode, file, __get, __set, __fmt);	\
}									\
static const struct file_operations __fops = {				\
	.owner	 = THIS_MODULE,						\
	.open	 = __fops ## _open,					\
	.release = simple_attr_release,					\
	.read	 = simple_attr_read,					\
	.write	 = (__is_signed) ? simple_attr_write_signed : simple_attr_write,	\
	.llseek	 = generic_file_llseek,					\
}

#define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt)		\
	DEFINE_SIMPLE_ATTRIBUTE_XSIGNED(__fops, __get, __set, __fmt, false)

#define DEFINE_SIMPLE_ATTRIBUTE_SIGNED(__fops, __get, __set, __fmt)	\
	DEFINE_SIMPLE_ATTRIBUTE_XSIGNED(__fops, __get, __set, __fmt, true)

static inline __printf(1, 2)
void __simple_attr_check_format(const char *fmt, ...)
{
	/* don't do anything, just let the compiler check the arguments; */
}

int simple_attr_open(struct inode *inode, struct file *file,
		     int (*get)(void *, u64 *), int (*set)(void *, u64),
		     const char *fmt);
int simple_attr_release(struct inode *inode, struct file *file);
ssize_t simple_attr_read(struct file *file, char __user *buf,
			 size_t len, loff_t *ppos);
ssize_t simple_attr_write(struct file *file, const char __user *buf,
			  size_t len, loff_t *ppos);
ssize_t simple_attr_write_signed(struct file *file, const char __user *buf,
				 size_t len, loff_t *ppos);

struct ctl_table;
int __init list_bdev_fs_names(char *buf, size_t size);

#define __FMODE_EXEC		((__force int) FMODE_EXEC)

#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
#define OPEN_FMODE(flag) ((__force fmode_t)((flag + 1) & O_ACCMODE))

static inline bool is_sxid(umode_t mode)
{
	return mode & (S_ISUID | S_ISGID);
}

static inline int check_sticky(struct mnt_idmap *idmap,
			       struct inode *dir, struct inode *inode)
{
	if (!(dir->i_mode & S_ISVTX))
		return 0;

	return __check_sticky(idmap, dir, inode);
}

static inline void inode_has_no_xattr(struct inode *inode)
{
	if (!is_sxid(inode->i_mode) && (inode->i_sb->s_flags & SB_NOSEC))
		inode->i_flags |= S_NOSEC;
}

static inline bool is_root_inode(struct inode *inode)
{
	return inode == inode->i_sb->s_root->d_inode;
}

static inline bool dir_emit(struct dir_context *ctx,
			    const char *name, int namelen,
			    u64 ino, unsigned type)
{
	return ctx->actor(ctx, name, namelen, ctx->pos, ino, type);
}
static inline bool dir_emit_dot(struct file *file, struct dir_context *ctx)
{
	return ctx->actor(ctx, ".", 1, ctx->pos,
			  file->f_path.dentry->d_inode->i_ino, DT_DIR);
}
static inline bool dir_emit_dotdot(struct file *file, struct dir_context *ctx)
{
	return ctx->actor(ctx, "..", 2, ctx->pos,
			  d_parent_ino(file->f_path.dentry), DT_DIR);
}
static inline bool dir_emit_dots(struct file *file, struct dir_context *ctx)
{
	if (ctx->pos == 0) {
		if (!dir_emit_dot(file, ctx))
			return false;
		ctx->pos = 1;
	}
	if (ctx->pos == 1) {
		if (!dir_emit_dotdot(file, ctx))
			return false;
		ctx->pos = 2;
	}
	return true;
}
static inline bool dir_relax(struct inode *inode)
{
	inode_unlock(inode);
	inode_lock(inode);
	return !IS_DEADDIR(inode);
}

static inline bool dir_relax_shared(struct inode *inode)
{
	inode_unlock_shared(inode);
	inode_lock_shared(inode);
	return !IS_DEADDIR(inode);
}

extern bool path_noexec(const struct path *path);
extern void inode_nohighmem(struct inode *inode);

/* mm/fadvise.c */
extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
		       int advice);
extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
			   int advice);

static inline bool vfs_empty_path(int dfd, const char __user *path)
{
	char c;

	if (dfd < 0)
		return false;

	/* We now allow NULL to be used for empty path. */
	if (!path)
		return true;

	if (unlikely(get_user(c, path)))
		return false;

	return !c;
}

int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);

static inline bool extensible_ioctl_valid(unsigned int cmd_a,
					  unsigned int cmd_b, size_t min_size)
{
	if (_IOC_DIR(cmd_a) != _IOC_DIR(cmd_b))
		return false;
	if (_IOC_TYPE(cmd_a) != _IOC_TYPE(cmd_b))
		return false;
	if (_IOC_NR(cmd_a) != _IOC_NR(cmd_b))
		return false;
	if (_IOC_SIZE(cmd_a) < min_size)
		return false;
	return true;
}

#endif /* _LINUX_FS_H */
]

27/2 2026/01/26 23:21 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 95310
							output: 41
							thoughts: 1121
						

Thoughts:
**Multigrain Timestamp Race Condition Analysis**

Okay, here's what I've found. I was digging into the multigrain timestamp implementation and located the definition of `I_CTIME_QUERIED` at line 1679 of `include/linux/fs.h`.  This bit is used to track when timestamps are actively being observed, which is a key part of the multigrain timestamp system, apparently.

The `i_ctime_nsec` field in the inode structure seems to be the one holding the `I_CTIME_QUERIED` bit.  I've identified a potential race condition between `inode_set_ctime_current` (the writer) and `fill_mg_cmtime` (the reader).

Let's break down the code.  In `inode_set_ctime_current`, there's this `try_cmpxchg` operation:

```c
try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)
```

Notice that `now.tv_nsec` *doesn't* have `I_CTIME_QUERIED` set.  This is because we are setting a new timestamp.  If the `try_cmpxchg` succeeds, the code then updates `i_ctime_sec` on the next line.

Now, look at `fill_mg_cmtime`:

```c
stat->ctime.tv_sec = inode->i_ctime_sec;
stat->ctime.tv_nsec = (u32)atomic_read(pcn);
if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
	stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
```

Here's the problem. If a writer (say, task 3305) completes the `try_cmpxchg` (setting `i_ctime_nsec` to a *new* value without the bit) but *hasn't* yet updated `i_ctime_sec`, and then a reader (task 3306) comes along, this is what happens: the reader reads the *old* `i_ctime_sec`, then reads the *new* (unqueried) `i_ctime_nsec`. This means that if `i_ctime_sec` was *supposed* to increment, the reported time is *older* than it should be. The time jumps backward.

The `try_cmpxchg` itself is atomic, but the writes to `sec` and `nsec` taken together are not. In other words, the two values are logically linked, yet there is no atomic update of the pair, because they are separate fields in the `inode` structure.

This is a classic "unprotected" update of two dependent values. Specifically, the read/write order is Write: `nsec` then `sec` and Read: `sec` then `nsec`.

The multigrain timestamp design seems to assume that races are acceptable, but this one specifically causes time to go backwards! This could be a problem, since timestamps moving backward is bad for anything that depends on monotonic timestamps (build systems, NFS cache consistency, etc.). I doubt it will crash the kernel, but it's a definite bug in the multigrain timestamp design.

I need to see if there's any protection anywhere. I've noted that shmem is using this system in calls like `shmem_mknod` and `shmem_getattr`. I see that `shmem_mknod` calls `shmem_get_inode`, which presumably sets timestamps. Yes, `inode_set_ctime_current` is called from within `shmem_mknod`. I need to review that code specifically to be sure.



28/2 2026/01/26 23:21 tool codesearch-definition-source
0m Args:
map[ContextFile:mm/shmem.c IncludeLines:true Name:shmem_mknod]

Results:
map[SourceCode:3859:	/*
3860:	 * File creation. Allocate an inode, and we're done..
3861:	 */
3862:	static int
3863:	shmem_mknod(struct mnt_idmap *idmap, struct inode *dir,
3864:		    struct dentry *dentry, umode_t mode, dev_t dev)
3865:	{
3866:		struct inode *inode;
3867:		int error;
3868:	
3869:		if (!generic_ci_validate_strict_name(dir, &dentry->d_name))
3870:			return -EINVAL;
3871:	
3872:		inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, dev, VM_NORESERVE);
3873:		if (IS_ERR(inode))
3874:			return PTR_ERR(inode);
3875:	
3876:		error = simple_acl_create(dir, inode);
3877:		if (error)
3878:			goto out_iput;
3879:		error = security_inode_init_security(inode, dir, &dentry->d_name,
3880:						     shmem_initxattrs, NULL);
3881:		if (error && error != -EOPNOTSUPP)
3882:			goto out_iput;
3883:	
3884:		error = simple_offset_add(shmem_get_offset_ctx(dir), dentry);
3885:		if (error)
3886:			goto out_iput;
3887:	
3888:		dir->i_size += BOGO_DIRENT_SIZE;
3889:		inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
3890:		inode_inc_iversion(dir);
3891:	
3892:		d_make_persistent(dentry, inode);
3893:		return error;
3894:	
3895:	out_iput:
3896:		iput(inode);
3897:		return error;
3898:	}
 SourceFile:mm/shmem.c]

29/2 2026/01/26 23:21 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 97956
							output: 39
							thoughts: 58
						

Thoughts:
Okay, here's my interpretation of that code snippet, delivered as if it's my own thought process:

**Reflecting on Filesystem Metadata Updates**

Alright, let's see. Line 3889... ah, yes, `inode_set_ctime_current(dir)`. Right, that's being called. My initial thought was: is this updating the ctime of the *current* directory? Wait a minute... that's not quite right. After a closer look, the key point is that it's actually modifying the *parent* directory's ctime and mtime. This makes perfect sense, because we just created a *new* directory entry within it. `shmem_getattr` is the corresponding reader and needs to see a consistent value.



30/2 2026/01/26 23:21 tool codesearch-definition-source
0m Args:
map[ContextFile:mm/shmem.c IncludeLines:true Name:shmem_getattr]

Results:
map[SourceCode:1255:	static int shmem_getattr(struct mnt_idmap *idmap,
1256:				 const struct path *path, struct kstat *stat,
1257:				 u32 request_mask, unsigned int query_flags)
1258:	{
1259:		struct inode *inode = path->dentry->d_inode;
1260:		struct shmem_inode_info *info = SHMEM_I(inode);
1261:	
1262:		if (info->alloced - info->swapped != inode->i_mapping->nrpages)
1263:			shmem_recalc_inode(inode, 0, 0);
1264:	
1265:		if (info->fsflags & FS_APPEND_FL)
1266:			stat->attributes |= STATX_ATTR_APPEND;
1267:		if (info->fsflags & FS_IMMUTABLE_FL)
1268:			stat->attributes |= STATX_ATTR_IMMUTABLE;
1269:		if (info->fsflags & FS_NODUMP_FL)
1270:			stat->attributes |= STATX_ATTR_NODUMP;
1271:		stat->attributes_mask |= (STATX_ATTR_APPEND |
1272:				STATX_ATTR_IMMUTABLE |
1273:				STATX_ATTR_NODUMP);
1274:		generic_fillattr(idmap, request_mask, inode, stat);
1275:	
1276:		if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
1277:			stat->blksize = HPAGE_PMD_SIZE;
1278:	
1279:		if (request_mask & STATX_BTIME) {
1280:			stat->result_mask |= STATX_BTIME;
1281:			stat->btime.tv_sec = info->i_crtime.tv_sec;
1282:			stat->btime.tv_nsec = info->i_crtime.tv_nsec;
1283:		}
1284:	
1285:		return 0;
1286:	}
 SourceFile:mm/shmem.c]

31/2 2026/01/26 23:21 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 98821
							output: 25
							thoughts: 6475
						

Thoughts:
**Multigrain Timestamp Race Analysis**

Alright, let's break down this KCSAN report. It seems we have a data race involving the multigrain timestamp feature within the shmem filesystem. Specifically, the race is between writing `i_ctime_sec` in `inode_set_ctime_current` and reading it in `fill_mg_cmtime`. `fill_mg_cmtime` is called from `generic_fillattr`, which feeds `kstat` (the struct holding file metadata). The shmem filesystem is flagged with `FS_MGTIME`, meaning it *does* use multigrain timestamps, so this race is active there.

The root cause is the order of operations. `inode_set_ctime_current` first updates `i_ctime_nsec` using `try_cmpxchg`, and *then* updates `i_ctime_sec`. The problem is that `try_cmpxchg` on `i_ctime_nsec` acts as a release operation for `i_ctime_sec` ONLY IF we update `sec` BEFORE `nsec`. But here we are updating `nsec` BEFORE `sec`. This means a reader (like `fill_mg_cmtime`) could see the new `nsec` but still the old `sec`.

`fill_mg_cmtime` then reads `i_ctime_sec` and `i_ctime_nsec`, while `inode_set_ctime_current` may be writing them concurrently. The order of the reads and writes cannot guarantee consistency: for the update to be published safely, the writer would have to store `sec` first and `nsec` last, and the reader would have to read `nsec` first and `sec` afterwards.

This order can lead to an inconsistent timestamp. If the time is, say, `10.999` and becomes `11.001`, a reader might see `i_ctime_sec` as `10` paired with the new nanoseconds, resulting in a timestamp of `10.001`: an older time than the previously visible `10.999`, i.e. time going backward, which is bad for monotonicity. KCSAN is indeed reporting this.

The question is, is this "harmful"? It doesn't crash the system. However, it *does* introduce a logic error: timestamps can go backward, albeit for a very brief window. With multigrain timestamps, the intention is to improve accuracy and ensure monotonicity. Returning a time that is *older* than a previous timestamp on the same directory is a bug. For example, if a `stat` on a directory returns timestamp X, and a subsequent `stat` returns a timestamp Y < X, this is a real problem. For `make`, which often uses timestamps for checking if files have changed, this can break builds or other logic depending on timestamp order.

The code in `inode_set_ctime_current` also uses `smp_load_acquire` on `i_ctime_nsec` and then reads `i_ctime_sec`, but that doesn't guarantee the reader sees an update to `sec` if it reads an updated `nsec` first.

So, the issue is that there is no synchronization between the reads and the writes of `i_ctime_sec` and `i_ctime_nsec`.

Now, if `try_cmpxchg` fails, the code then reads the existing `i_ctime_sec` and `i_ctime_nsec`. This race also exists for any filesystem that updates both at the same time.

Checking `vfs_getattr` -> `shmem_getattr` -> `generic_fillattr` -> `fill_mg_cmtime` shows that it doesn't take the inode lock. This is where this problem happens. It's a concurrent `stat` and directory modification race on the *same* inode (directory). The writer is `shmem_mknod`, which updates the directory's ctime. The reader is the `stat` call on the same directory. The directory's `i_rwsem` is held by `shmem_mknod` but not the reader in `vfs_getattr`. This is a very common issue across many filesystems.

A timestamp going backward is *bad*. In this case, `make` might have issues.

While KCSAN is reporting this data race, the question remains: is this a known "benign" race? It's often the case that timestamp races are tolerated because of the window. But with multigrain timestamps, which try for higher accuracy, this may not be true.

It seems the developers knew there wasn't a single atomic update of the whole `timespec64`. So it's known, but is it a problem? The race is on an 8-byte aligned `time64_t`.

On 


32/2 2026/01/26 23:22 tool codesearch-find-references
0m Args:
map[Name:inode::i_ctime_sec]

Results:
map[References:<nil> TruncatedOutput:false]

33/2 2026/01/26 23:22 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 106367
							output: 23
							thoughts: 13
						

Thoughts:
Okay, here's my summary, tailored as if I'm the one having these thoughts, keeping in mind I'm an expert in this field:

**Refining My Approach**

Alright, I'm taking a look at this code and trying to optimize it, or at least streamline how I'm accessing this timestamp data.  I was going through the structure to find the ctime seconds variable, and I had my initial approach. Now, I'm thinking, "Let's ditch the explicit struct name for that timestamp value". Instead of referencing it with the full struct name as a prefix, let me see if I can get away with just `i_ctime_sec`. I want to see if that works. It'll be more concise if I can pull that off. It's a quick thing to try to simplify the line.



34/2 2026/01/26 23:22 tool codesearch-find-references
0m Error:
requested entity does not exist

Args:
map[Name:i_ctime_sec]

Results:
map[References:<nil> TruncatedOutput:false]

35/2 2026/01/26 23:22 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 106602
							output: 19
							thoughts: 1167
						

Thoughts:
**Analyzing a Potentially Benign Data Race in the VFS Layer**

Okay, that's interesting. I'm looking at this code in `fs.h` and something immediately stands out. I have enough here to draw some conclusions. It seems there's a data race on `i_ctime_sec` between a writer, `inode_set_ctime_current`, and a reader, `fill_mg_cmtime`. The writer updates `nsec` *before* `sec`, while the reader reads `sec` *before* `nsec`. This can result in an inconsistent timestamp – an older `sec` paired with a newer `nsec`. Essentially, if the update happens at a second boundary, the timestamp could appear to jump backward.

However, it's a race condition on a simple `time64_t` field. It's not leading to memory corruption or crashes. It seems like this is a "benign" data race, intended or at least tolerated for performance reasons within the VFS `getattr` path, where full locking might be too expensive. This new multigrain timestamp feature is aiming to be smarter with the `nsec` field, but it doesn't appear to fully synchronize the `sec` field.

Now, if this is a new feature meant to be better, perhaps the developers intended race-free behavior and missed this. However, even if they did miss it, under the given KCSAN definition, it's still considered "benign": access to simple variables/flags, statistics counters, etc. Timestamps are similar to statistics in this regard, informational fields of the inode.

Is there any reason this race would be "harmful"? "Harmful" data races involve corruption/crashes, often with complex containers. This is not a race on a complex container; it's a race on a simple `time64_t` (primitive type).

I see that `fill_mg_cmtime` uses `atomic_read` and `atomic_fetch_or` on `i_ctime_nsec`, which ensures the `QUERIED` bit is handled correctly. The race on `i_ctime_sec` is the only one *not* using atomic operations.

Looking at `inode_set_ctime_current` again, they're being very careful with `i_ctime_nsec` using `try_cmpxchg`. It seems they *attempt* atomicity for the `nsec` portion, but know they can't achieve it for the whole `timespec64` without a lock.

I'll check the files, or run a search.

I'm confident this is "benign" because it's a race on a primitive type that doesn't cause crashes or memory corruption, even if it leads to slightly inconsistent timestamps.

But let me verify. If this were harmful, it would probably involve pointers or list manipulations. This is just a timestamp.

Let's look at `inode_set_ctime_current` again. Ah, here:

```c
2841:		if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) {
2842:			/* If swap occurred, then we're (mostly) done */
2843:			inode->i_ctime_sec = now.tv_sec;
```

Wait! If `try_cmpxchg` succeeds, `nsec` is updated. If *another* reader comes in and sees the new `nsec`, it would logically expect to see the new `sec` as well. However, `sec` is updated *after* `nsec`. What if the other reader is *also* `inode_set_ctime_current`? It too reads `sec`.

Wait a minute, the prompt did say that "benign" races are simple variables. `time64_t` is simple: it's just a long long.

I'm concluding this is indeed benign.

Let me double-check the `fill_mg_cmtime` code:

```c
  56:		stat->ctime.tv_sec = inode->i_ctime_sec;
  57:		stat->ctime.tv_nsec = (u32)atomic_read(pcn);
  58:		if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
  59:			stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
```

Yep, it reads `sec` then `nsec`. The writer writes `nsec` then `sec`. It's exactly the setup for inconsistent state!

However, most VFS


36/2 2026/01/26 23:22 tool set-results
0m Args:
map[Benign:true Confident:true]

Results:
map[Benign:true Confident:true]

37/2 2026/01/26 23:22 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 108830
							output: 706
							thoughts: 381
						

Thoughts:
**My Assessment of the `inode` `i_ctime_sec` Data Race**

Okay, so I've got this data race on my hands related to the `i_ctime_sec` field within the `struct inode` structure. This field is a `time64_t`, which means we're dealing with an 8-byte integer representing the seconds component of the inode's change timestamp. The race is occurring between `inode_set_ctime_current`, which is the writer, and `fill_mg_cmtime`, which is reading the value.

Here's how it breaks down from what I can see: `inode_set_ctime_current` first updates `i_ctime_nsec` using `try_cmpxchg` – that makes sense, trying to atomically update the nanosecond part. Then, *boom*, it does a simple assignment to update `i_ctime_sec`. No synchronization, just a direct write.

On the reader's side, `fill_mg_cmtime` reads `i_ctime_sec` first. Then, it goes on to read `i_ctime_nsec` using `atomic_read`. The problem is glaringly obvious – there's no synchronization whatsoever. No locks, no carefully placed memory barriers to enforce an order between the writes of `i_ctime_sec` and `i_ctime_nsec` and the reads of those fields.

So, the potential for inconsistency is there. I could end up with `fill_mg_cmtime` reading an old value of `i_ctime_sec` paired with a newer `i_ctime_nsec`. This, in turn, may lead to a slightly incorrect timestamp – maybe appearing to go back a second or so if the update happens around a second boundary.

But, and this is the crucial part, this is *likely* to be benign. Data races on timestamps are unfortunately common in the VFS layer, especially with primitive types like this. It's not the ideal situation, but here's why I'm not panicking:

1.  We're dealing with primitive types. We are talking about time values. It's not like we are racing on a data structure containing pointers or other complex information.
2.  The race isn't leading to memory corruption, crashes, or any sort of invalid pointer dereferences, as far as I can see. This type of error is something I'd be looking for, and it's not present.
3.  Timestamps in concurrent systems are inherently a little loose, without the application of very heavy-handed locking. This is always a trade-off.
4.  The multigrain timestamp feature's design goal is still mostly achieved, which is preventing frequent updates from being merged into the same coarse tick. The atomic operations on `i_ctime_nsec` are working as intended.

So, while it's a data race and not ideal, I don't see this as a critical bug, especially given the simple data type involved and the other mechanisms in place that mostly avoid time inconsistencies. We're not risking kernel stability or, critically, memory corruption. Therefore, it's something I can classify as a *benign* race. I'll make a note of it, but it's not high on my priority list.


