[WIP] coccinelle: rt: Add coccicheck on sleep in atomic context on PREEMPT_RT
Posted by Yunseong Kim 1 month, 2 weeks ago
I'm working on a new Coccinelle script to detect sleep-in-atomic bugs in
PREEMPT_RT kernels. The script identifies calls to sleeping functions
(e.g., mutex_lock, msleep, kmalloc with GFP_KERNEL, and spin_lock, which
becomes a sleeping lock on PREEMPT_RT) within atomic contexts (e.g.,
raw_spin_lock, preempt_disable, bit_spin_lock).

It supports both direct calls and indirect call chains through
inter-procedural analysis based on a function call graph. Memory
allocation functions are also matched, including calls that pass
GFP_ATOMIC/GFP_NOWAIT. This is a WIP patch posted for early feedback.
I've tested it with make coccicheck on various subsystems, but there
are still issues with position variables sometimes being bound as
tuples, leading to "Invalid position info" warnings and incomplete
data collection.
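
For reference, each Python rule below duplicates the same defensive
unwrapping; one option I'm considering is factoring it into a helper in
the initialize block, roughly like this (first_pos is only an
illustrative name, not a coccilib API):

  # Hypothetical helper: Coccinelle binds a position metavariable to a
  # list of Location objects, but I sometimes see tuples or empty
  # bindings, so unwrap defensively before touching any attributes.
  def first_pos(p):
      if isinstance(p, (list, tuple)):
          p = p[0] if p else None
      if p is not None and hasattr(p, 'file') and hasattr(p, 'line'):
          return p
      return None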

The script includes defensive checks, but indirect bugs are not always
detected. I'd appreciate any suggestions on improving the Python
handling of position variables, or on the SmPL rules, for better
matching in complex code (e.g., macros and inline functions). The
script is added under scripts/coccinelle/rt/.

Detects sleep-in-atomic bugs in PREEMPT_RT kernels by identifying improper
calls to functions that may sleep, such as mutex locks, explicit sleep
functions (e.g., msleep), memory allocations, and sleepable spinlocks,
within atomic contexts created by preempt_disable, raw_spin_lock,
local_irq_disable, or bit_spin_lock.

1. Detection of direct calls to sleeping functions in atomic scopes.
2. Analysis of inter-procedural call chains to uncover indirect calls to
   sleeping functions via a function call graph (see the sketch below).
3. Handling of memory allocation functions that may sleep
   (including calls that pass GFP_ATOMIC).
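
At its core, the indirect analysis inverts the collected call graph and
propagates a may-sleep summary upward from the known sleepers. Here is
a condensed, standalone sketch of that step, with toy data standing in
for the collected tables:

  import collections

  # caller -> set of callees, as built by the collect_call_graph rule
  CALL_GRAPH = {"a": {"b"}, "b": {"mutex_lock"}}
  SLEEPERS = {"mutex_lock"}

  # Invert the graph: callee -> set of callers.
  reverse = {}
  for caller, callees in CALL_GRAPH.items():
      for callee in callees:
          reverse.setdefault(callee, set()).add(caller)

  # BFS upward: anything that can reach a sleeper may itself sleep.
  may_sleep, queue = set(), collections.deque(SLEEPERS)
  while queue:
      fn = queue.popleft()
      if fn in may_sleep:
          continue
      may_sleep.add(fn)
      queue.extend(reverse.get(fn, ()))

  assert may_sleep == {"mutex_lock", "b", "a"}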

This cocci script should identify direct and indirect sleep-in-atomic
violations, improving PREEMPT_RT compatibility across kernel code.
For an example of the kind of bug it targets, see:
Link: https://lore.kernel.org/linux-rt-devel/7a68c944-0199-468e-a0f2-ae2a9f21225b@kzalloc.com/t/#u

I understand that it’s not working perfectly yet, but I’d like to ask
how the detected violations could be reported more clearly.

$ make coccicheck COCCI=../../scripts/coccinelle/rt/sleep_in_atomic.cocci MODE=report M=fs/gfs2 2>&1 | tee gfs2-rt.log

The desired final output looks like this:

  [!] BUG (Indirect): Function 'lockref_get_not_dead' can sleep and is called from an atomic context.
    - Sleep details:
      - spin_lock() at fs/gfs2/quota.c:XXX (is a sleeping lock on PREEMPT_RT)
    - Atomic call sites:
      - gfs2_qd_search_bucket at fs/gfs2/quota.c:YYY from atomic context at fs/gfs2/quota.c:ZZZ
    - Call paths to sleepers:
      - lockref_get_not_dead -> gfs2_qd_search_bucket -> spin_lock_bucket -> gfs2_quota_init

The current output looks as follows:
Link: https://gist.github.com/kzall0c/8d081b36f5f6b23498441007c8a835cd

Signed-off-by: Yunseong Kim <ysk@kzalloc.com>
---
 scripts/coccinelle/rt/sleep_in_atomic.cocci | 509 ++++++++++++++++++++
 1 file changed, 509 insertions(+)
 create mode 100644 scripts/coccinelle/rt/sleep_in_atomic.cocci

diff --git a/scripts/coccinelle/rt/sleep_in_atomic.cocci b/scripts/coccinelle/rt/sleep_in_atomic.cocci
new file mode 100644
index 000000000000..f675eeff5c34
--- /dev/null
+++ b/scripts/coccinelle/rt/sleep_in_atomic.cocci
@@ -0,0 +1,509 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/// @description: Finds direct and inter-procedural calls to sleeping functions
+///   (including spin_lock in PREEMPT_RT) within atomic contexts, such as
+///   preemption- or interrupt-disabled regions. Detects two types of bugs:
+///   1. Direct: A sleeping function is called inside an atomic context.
+///   2. Indirect: A function called from an atomic context eventually calls
+///      a sleeping function through a chain of calls.
+///
+// Confidence: High
+// Copyright: (C) 2025 Yunseong Kim <ysk@kzalloc.com>
+// Options: --no-includes --include-headers
+
+virtual report
+
+// =========================================================================
+// 1. Main Rules to Find Violations
+// =========================================================================
+
+// --- PART 1: Direct (Intra-procedural) Violation Detection ---
+
+@find_direct_sleep_in_atomic@
+position p_atomic, p_call;
+identifier bad_func =~ "^(mutex_lock|mutex_lock_interruptible|mutex_lock_killable|down|down_interruptible|down_killable|rwsem_down_read|rwsem_down_write|ww_mutex_lock|msleep|ssleep|usleep_range|wait_for_completion|schedule|cond_resched|copy_from_user|copy_to_user|get_user|put_user|vmalloc|spin_lock|read_lock|write_lock)$";
+expression lock, flags;
+@@
+(
+  raw_spin_lock@p_atomic(...)
+| raw_spin_lock_irq@p_atomic(...)
+| raw_spin_lock_irqsave@p_atomic(...)
+| raw_spin_lock_bh@p_atomic(...)
+| raw_read_lock@p_atomic(...)
+| raw_read_lock_irq@p_atomic(...)
+| raw_read_lock_irqsave@p_atomic(...)
+| raw_read_lock_bh@p_atomic(...)
+| raw_write_lock@p_atomic(...)
+| raw_write_lock_irq@p_atomic(...)
+| raw_write_lock_irqsave@p_atomic(...)
+| raw_write_lock_bh@p_atomic(...)
+| preempt_disable@p_atomic()
+| local_irq_disable@p_atomic()
+| local_irq_save@p_atomic(...)
+| local_bh_disable@p_atomic()
+| bit_spin_lock@p_atomic(...)
+)
+<...
+  bad_func@p_call(...)
+...>
+(
+  raw_spin_unlock(lock)
+| raw_spin_unlock_irq(lock)
+| raw_spin_unlock_irqrestore(lock, flags)
+| raw_spin_unlock_bh(lock)
+| raw_read_unlock(lock)
+| raw_read_unlock_irq(lock)
+| raw_read_unlock_irqrestore(lock, flags)
+| raw_read_unlock_bh(lock)
+| raw_write_unlock(lock)
+| raw_write_unlock_irq(lock)
+| raw_write_unlock_irqrestore(lock, flags)
+| raw_write_unlock_bh(lock)
+| preempt_enable()
+| local_irq_enable()
+| local_irq_restore(flags)
+| local_bh_enable()
+| bit_spin_unlock(...)
+)
+
+@find_direct_sleep_alloc_in_atomic@
+position p_atomic, p_call;
+identifier alloc_func =~ "^(kmalloc|kzalloc|kcalloc|kvmalloc|kvzalloc|kvcalloc)$";
+expression gfp, lock, flags;
+@@
+(
+  raw_spin_lock@p_atomic(...)
+| raw_spin_lock_irq@p_atomic(...)
+| raw_spin_lock_irqsave@p_atomic(...)
+| raw_spin_lock_bh@p_atomic(...)
+| raw_read_lock@p_atomic(...)
+| raw_read_lock_irq@p_atomic(...)
+| raw_read_lock_irqsave@p_atomic(...)
+| raw_read_lock_bh@p_atomic(...)
+| raw_write_lock@p_atomic(...)
+| raw_write_lock_irq@p_atomic(...)
+| raw_write_lock_irqsave@p_atomic(...)
+| raw_write_lock_bh@p_atomic(...)
+| preempt_disable@p_atomic()
+| local_irq_disable@p_atomic()
+| local_irq_save@p_atomic(...)
+| local_bh_disable@p_atomic()
+| bit_spin_lock@p_atomic(...)
+)
+<...
+  alloc_func@p_call(..., gfp)
+...>
+(
+  raw_spin_unlock(lock)
+| raw_spin_unlock_irq(lock)
+| raw_spin_unlock_irqrestore(lock, flags)
+| raw_spin_unlock_bh(lock)
+| raw_read_unlock(lock)
+| raw_read_unlock_irq(lock)
+| raw_read_unlock_irqrestore(lock, flags)
+| raw_read_unlock_bh(lock)
+| raw_write_unlock(lock)
+| raw_write_unlock_irq(lock)
+| raw_write_unlock_irqrestore(lock, flags)
+| raw_write_unlock_bh(lock)
+| preempt_enable()
+| local_irq_enable()
+| local_irq_restore(flags)
+| local_bh_enable()
+| bit_spin_unlock(...)
+)
+
+// --- PART 2: Indirect (Inter-procedural) Violation Data Collection ---
+
+@collect_atomic_callees@
+position p_atomic, p_callee_call;
+identifier callee_func !~ "^(raw_spin|raw_read|raw_write|preempt|local_irq|local_bh|printk|pr_|dev_)";
+expression lock, flags;
+@@
+(
+  raw_spin_lock@p_atomic(...)
+| raw_spin_lock_irq@p_atomic(...)
+| raw_spin_lock_irqsave@p_atomic(...)
+| raw_spin_lock_bh@p_atomic(...)
+| raw_read_lock@p_atomic(...)
+| raw_read_lock_irq@p_atomic(...)
+| raw_read_lock_irqsave@p_atomic(...)
+| raw_read_lock_bh@p_atomic(...)
+| raw_write_lock@p_atomic(...)
+| raw_write_lock_irq@p_atomic(...)
+| raw_write_lock_irqsave@p_atomic(...)
+| raw_write_lock_bh@p_atomic(...)
+| preempt_disable@p_atomic()
+| local_irq_disable@p_atomic()
+| local_irq_save@p_atomic(...)
+| local_bh_disable@p_atomic()
+| bit_spin_lock@p_atomic(...)
+)
+<...
+  callee_func@p_callee_call(...)
+...>
+(
+  raw_spin_unlock(lock)
+| raw_spin_unlock_irq(lock)
+| raw_spin_unlock_irqrestore(lock, flags)
+| raw_spin_unlock_bh(lock)
+| raw_read_unlock(lock)
+| raw_read_unlock_irq(lock)
+| raw_read_unlock_irqrestore(lock, flags)
+| raw_read_unlock_bh(lock)
+| raw_write_unlock(lock)
+| raw_write_unlock_irq(lock)
+| raw_write_unlock_irqrestore(lock, flags)
+| raw_write_unlock_bh(lock)
+| preempt_enable()
+| local_irq_enable()
+| local_irq_restore(flags)
+| local_bh_enable()
+| bit_spin_unlock(...)
+)
+
+@collect_potential_sleepers@
+position p_def, p_bad_call;
+identifier func_def;
+identifier bad_func =~ "^(mutex_lock|mutex_lock_interruptible|mutex_lock_killable|down|down_interruptible|down_killable|rwsem_down_read|rwsem_down_write|ww_mutex_lock|msleep|ssleep|usleep_range|wait_for_completion|schedule|cond_resched|copy_from_user|copy_to_user|get_user|put_user|vmalloc|spin_lock|read_lock|write_lock)$";
+@@
+(
+func_def@p_def(...) {
+  <...
+    bad_func@p_bad_call(...)
+  ...>
+}
+|
+static inline func_def@p_def(...) {
+  <...
+    bad_func@p_bad_call(...)
+  ...>
+}
+)
+
+@collect_potential_alloc_sleepers@
+position p_def, p_bad_call;
+identifier func_def;
+identifier alloc_func =~ "^(kmalloc|kzalloc|kcalloc|kvmalloc|kvzalloc|kvcalloc)$";
+expression gfp;
+@@
+(
+func_def@p_def(...) {
+  <...
+    alloc_func@p_bad_call(..., gfp)
+  ...>
+}
+|
+static inline func_def@p_def(...) {
+  <...
+    alloc_func@p_bad_call(..., gfp)
+  ...>
+}
+)
+
+@collect_call_graph@
+position p_def, p_call;
+identifier caller_func, callee_func !~ "^(raw_spin|raw_read|raw_write|preempt|local_irq|local_bh|printk|pr_|dev_)";
+@@
+(
+caller_func@p_def(...) {
+  <...
+    callee_func@p_call(...)
+  ...>
+}
+|
+static inline caller_func@p_def(...) {
+  <...
+    callee_func@p_call(...)
+  ...>
+}
+)
+
+// =========================================================================
+// 2. Python Scripts for Data Collection and Rich Reporting
+// =========================================================================
+
+@initialize:python@
+@@
+REASONS = {
+    "mutex_lock": "is a sleeping lock",
+    "down": "is a sleeping semaphore operation",
+    "rwsem_down": "is a sleeping lock",
+    "ww_mutex_lock": "is a sleeping lock",
+    "msleep": "is an explicit sleep",
+    "ssleep": "is an explicit sleep",
+    "usleep_range": "is an explicit sleep",
+    "wait_for_completion": "waits for an event and sleeps",
+    "schedule": "explicitly invokes the scheduler",
+    "cond_resched": "may invoke the scheduler",
+    "copy_from_user": "may sleep on page fault",
+    "copy_to_user": "may sleep on page fault",
+    "get_user": "may sleep on page fault",
+    "put_user": "may sleep on page fault",
+    "vmalloc": "can sleep",
+    "spin_lock": "is a sleeping lock on PREEMPT_RT",
+    "read_lock": "is a sleeping lock on PREEMPT_RT",
+    "write_lock": "is a sleeping lock on PREEMPT_RT",
+    "kmalloc": "may sleep in PREEMPT_RT",
+    "kzalloc": "may sleep in PREEMPT_RT",
+    "kcalloc": "may sleep in PREEMPT_RT",
+    "kvmalloc": "may sleep in PREEMPT_RT",
+    "kvzalloc": "may sleep in PREEMPT_RT",
+    "kvcalloc": "may sleep in PREEMPT_RT",
+}
+
+def get_reason(func_name):
+    for key in REASONS:
+        if func_name.startswith(key):
+            return REASONS[key]
+    return "is prohibited in atomic context"
+
+// --- PART 1 Report: Direct Violations ---
+
+@script:python depends on report && find_direct_sleep_in_atomic@
+p_atomic << find_direct_sleep_in_atomic.p_atomic;
+p_call << find_direct_sleep_in_atomic.p_call;
+bad_func << find_direct_sleep_in_atomic.bad_func;
+@@
+bad_func_name = str(bad_func)
+reason_str = get_reason(bad_func_name)
+
+# Handle p_call and p_atomic as list or tuple or single Location
+if isinstance(p_call, (list, tuple)):
+    if p_call:
+        p_call = p_call[0]
+    else:
+        p_call = None
+if isinstance(p_atomic, (list, tuple)):
+    if p_atomic:
+        p_atomic = p_atomic[0]
+    else:
+        p_atomic = None
+
+if p_call and hasattr(p_call, 'file') and hasattr(p_call, 'line') and hasattr(p_call, 'current_element') and p_atomic and hasattr(p_atomic, 'line'):
+    coccilib.report.print_report(p_call,
+        f"BUG (Direct): Prohibited call to {bad_func_name}() ({reason_str}) "
+        f"inside atomic context started at line {p_atomic.line} "
+        f"in function {p_call.current_element}.")
+else:
+    print(f"Warning: Invalid position info for direct sleep {bad_func_name} at p_call={repr(p_call)}, p_atomic={repr(p_atomic)}")
+
+@script:python depends on report && find_direct_sleep_alloc_in_atomic@
+p_atomic << find_direct_sleep_alloc_in_atomic.p_atomic;
+p_call << find_direct_sleep_alloc_in_atomic.p_call;
+alloc_func << find_direct_sleep_alloc_in_atomic.alloc_func;
+gfp << find_direct_sleep_alloc_in_atomic.gfp;
+@@
+alloc_func_name = str(alloc_func)
+reason_str = get_reason(alloc_func_name)
+
+# Handle p_call and p_atomic as list or tuple or single Location
+if isinstance(p_call, (list, tuple)):
+    if p_call:
+        p_call = p_call[0]
+    else:
+        p_call = None
+if isinstance(p_atomic, (list, tuple)):
+    if p_atomic:
+        p_atomic = p_atomic[0]
+    else:
+        p_atomic = None
+
+if p_call and hasattr(p_call, 'file') and hasattr(p_call, 'line') and hasattr(p_call, 'current_element') and p_atomic and hasattr(p_atomic, 'line'):
+    coccilib.report.print_report(p_call,
+        f"BUG (Direct): Prohibited call to {alloc_func_name}() with {gfp} ({reason_str}) "
+        f"inside atomic context started at line {p_atomic.line} "
+        f"in function {p_call.current_element}.")
+else:
+    print(f"Warning: Invalid position info for direct alloc {alloc_func_name} at p_call={repr(p_call)}, p_atomic={repr(p_atomic)}")
+
+// --- PART 2 Collect: Data for Indirect Violations ---
+
+@script:python depends on collect_atomic_callees@
+p_atomic << collect_atomic_callees.p_atomic;
+p_callee_call << collect_atomic_callees.p_callee_call;
+callee_func << collect_atomic_callees.callee_func;
+@@
+if "ATOMIC_CALLEES" not in globals():
+    ATOMIC_CALLEES = {}
+
+# Handle p_callee_call and p_atomic as list or tuple or single Location
+if isinstance(p_callee_call, (list, tuple)):
+    if p_callee_call:
+        p_callee_call = p_callee_call[0]
+    else:
+        p_callee_call = None
+if isinstance(p_atomic, (list, tuple)):
+    if p_atomic:
+        p_atomic = p_atomic[0]
+    else:
+        p_atomic = None
+
+if p_callee_call and hasattr(p_callee_call, 'file') and hasattr(p_callee_call, 'line') and hasattr(p_callee_call, 'current_element') and p_atomic and hasattr(p_atomic, 'file') and hasattr(p_atomic, 'line'):
+    context_info = (f"{p_callee_call.current_element} at {p_callee_call.file}:{p_callee_call.line} "
+                    f"from atomic context at {p_atomic.file}:{p_atomic.line}")
+else:
+    print(f"Warning: Invalid position info for {callee_func} at p_callee_call={repr(p_callee_call)}, p_atomic={repr(p_atomic)}")
+    context_info = f"{callee_func} (unknown location)"
+
+key = str(callee_func)
+if key not in ATOMIC_CALLEES:
+    ATOMIC_CALLEES[key] = set()
+ATOMIC_CALLEES[key].add(context_info)
+
+@script:python depends on collect_potential_sleepers@
+p_def << collect_potential_sleepers.p_def;
+p_bad_call << collect_potential_sleepers.p_bad_call;
+func_def << collect_potential_sleepers.func_def;
+bad_func << collect_potential_sleepers.bad_func;
+@@
+if "POTENTIAL_SLEEPERS" not in globals():
+    POTENTIAL_SLEEPERS = {}
+
+bad_func_name = str(bad_func)
+reason_str = get_reason(bad_func_name)
+
+# Handle p_bad_call as list or tuple or single Location
+if isinstance(p_bad_call, (list, tuple)):
+    if p_bad_call:
+        p_bad_call = p_bad_call[0]
+    else:
+        p_bad_call = None
+
+if p_bad_call and hasattr(p_bad_call, 'file') and hasattr(p_bad_call, 'line'):
+    sleeper_info = (f"{bad_func_name}() at {p_bad_call.file}:{p_bad_call.line} ({reason_str})")
+else:
+    print(f"Warning: Invalid position info for sleeper {bad_func_name} at p_bad_call={repr(p_bad_call)}")
+    sleeper_info = f"{bad_func_name}() (unknown location) ({reason_str})"
+
+key = str(func_def)
+if key not in POTENTIAL_SLEEPERS:
+    POTENTIAL_SLEEPERS[key] = set()
+POTENTIAL_SLEEPERS[key].add(sleeper_info)
+
+@script:python depends on collect_potential_alloc_sleepers@
+p_def << collect_potential_alloc_sleepers.p_def;
+p_bad_call << collect_potential_alloc_sleepers.p_bad_call;
+func_def << collect_potential_alloc_sleepers.func_def;
+alloc_func << collect_potential_alloc_sleepers.alloc_func;
+gfp << collect_potential_alloc_sleepers.gfp;
+@@
+if "POTENTIAL_SLEEPERS" not in globals():
+    POTENTIAL_SLEEPERS = {}
+
+alloc_func_name = str(alloc_func)
+reason_str = get_reason(alloc_func_name)
+
+# Handle p_bad_call as list or tuple or single Location
+if isinstance(p_bad_call, (list, tuple)):
+    if p_bad_call:
+        p_bad_call = p_bad_call[0]
+    else:
+        p_bad_call = None
+
+if p_bad_call and hasattr(p_bad_call, 'file') and hasattr(p_bad_call, 'line'):
+    sleeper_info = (f"{alloc_func_name}() with {gfp} at {p_bad_call.file}:{p_bad_call.line} ({reason_str})")
+else:
+    print(f"Warning: Invalid position info for alloc {alloc_func_name} at p_bad_call={repr(p_bad_call)}")
+    sleeper_info = f"{alloc_func_name}() with {gfp} (unknown location) ({reason_str})"
+
+key = str(func_def)
+if key not in POTENTIAL_SLEEPERS:
+    POTENTIAL_SLEEPERS[key] = set()
+POTENTIAL_SLEEPERS[key].add(sleeper_info)
+
+@script:python depends on collect_call_graph@
+p_def << collect_call_graph.p_def;
+caller_func << collect_call_graph.caller_func;
+callee_func << collect_call_graph.callee_func;
+@@
+if "CALL_GRAPH" not in globals():
+    CALL_GRAPH = {}
+
+key = str(caller_func)
+callee_str = str(callee_func)
+if key not in CALL_GRAPH:
+    CALL_GRAPH[key] = set()
+CALL_GRAPH[key].add(callee_str)
+
+// --- PART 3 Report: Indirect Violations ---
+
+@finalize:python@
+@@
+import collections
+
+def build_call_path(func, reverse_graph, visited=None):
+    if visited is None:
+        visited = set()
+    if func not in reverse_graph or func in visited:
+        return []
+    visited.add(func)
+    paths = []
+    for parent in reverse_graph.get(func, set()):
+        sub_paths = build_call_path(parent, reverse_graph, visited.copy())
+        if sub_paths:
+            for path in sub_paths:
+                paths.append([func] + path)
+        else:
+            paths.append([func])
+    return paths if paths else [[func]]
+
+if "ATOMIC_CALLEES" in globals() and "POTENTIAL_SLEEPERS" in globals() and "CALL_GRAPH" in globals():
+    print("\n--- Inter-procedural Sleep-in-Atomic Analysis ---")
+    found_bugs = False
+
+    # Build reverse graph
+    REVERSE_GRAPH = {}
+    for caller, callees in CALL_GRAPH.items():
+        for callee in callees:
+            if callee not in REVERSE_GRAPH:
+                REVERSE_GRAPH[callee] = set()
+            REVERSE_GRAPH[callee].add(caller)
+
+    # Find transitive sleepers
+    may_sleep = set()
+    transitive_reasons = {}
+    queue = collections.deque(POTENTIAL_SLEEPERS.keys())
+    visited = set()
+
+    while queue:
+        func = queue.popleft()
+        if func in visited:
+            continue
+        visited.add(func)
+        may_sleep.add(func)
+        if func in POTENTIAL_SLEEPERS:
+            transitive_reasons[func] = POTENTIAL_SLEEPERS[func]
+        else:
+            transitive_reasons[func] = set()
+        for parent in REVERSE_GRAPH.get(func, set()):
+            if parent not in visited:
+                queue.append(parent)
+                transitive_reasons[parent] = transitive_reasons.get(parent, set()) | transitive_reasons[func]
+
+    # Find risky functions
+    risky_funcs = set(ATOMIC_CALLEES.keys()) & may_sleep
+    for func in sorted(risky_funcs):
+        found_bugs = True
+        print(f"\n[!] BUG (Indirect): Function '{func}' can sleep and is called from an atomic context.")
+        print(f"  - Sleep details:")
+        for reason in sorted(transitive_reasons.get(func, set())):
+            print(f"    - {reason}")
+        print(f"  - Atomic call sites:")
+        for context in sorted(ATOMIC_CALLEES.get(func, set())):
+            print(f"    - {context}")
+        # Print call paths
+        print(f"  - Call paths to sleepers:")
+        for path in build_call_path(func, REVERSE_GRAPH):
+            print(f"    - {' -> '.join(path)}")
+
+    if not found_bugs:
+        print("No indirect sleep-in-atomic bugs found.")
+else:
+    print("Error: Required data (ATOMIC_CALLEES, POTENTIAL_SLEEPERS, or CALL_GRAPH) not collected.")
+    # Debug globals state
+    if "ATOMIC_CALLEES" in globals():
+        print(f"Debug: ATOMIC_CALLEES = {ATOMIC_CALLEES}")
+    if "POTENTIAL_SLEEPERS" in globals():
+        print(f"Debug: POTENTIAL_SLEEPERS = {POTENTIAL_SLEEPERS}")
+    if "CALL_GRAPH" in globals():
+        print(f"Debug: CALL_GRAPH = {CALL_GRAPH}")
\ No newline at end of file
-- 
2.50.0

Re: [WIP] coccinelle: rt: Add coccicheck on sleep in atomic context on PREEMPT_RT
Posted by Tomas Glozar 1 month, 1 week ago
Hi Yunseong,

On Sat, Aug 16, 2025 at 6:56 AM Yunseong Kim <ysk@kzalloc.com> wrote:
>
> I'm working on a new Coccinelle script to detect sleep-in-atomic bugs in
> PREEMPT_RT kernels. The script identifies calls to sleeping functions
> (e.g., mutex_lock, msleep, kmalloc with GFP_KERNEL, and spin_lock, which
> becomes a sleeping lock on PREEMPT_RT) within atomic contexts (e.g.,
> raw_spin_lock, preempt_disable, bit_spin_lock).
>
> It supports both direct calls and indirect call chains through
> inter-procedural analysis based on a function call graph. Memory
> allocation functions are also matched, including calls that pass
> GFP_ATOMIC/GFP_NOWAIT. This is a WIP patch posted for early feedback.
> I've tested it with make coccicheck on various subsystems, but there
> are still issues with position variables sometimes being bound as
> tuples, leading to "Invalid position info" warnings and incomplete
> data collection.

I can share some of my own experience. I wrote a similar tool for the
same problem two years ago, called rtlockscope [1], which uses ctags
to get a list of all functions and CScope to get a function call
graph, and assigns a summary to each function based on its callees.
The results could use some improvement, since it reduces control flow
to an ordering of callees and assumes that all symbols are global
(e.g. an ARM-only function is seen as called from x86-only code).

[1] Repo: https://gitlab.com/tglozar/rtlockscope, LPC talk slides:
https://lpc.events/event/18/contributions/1735/attachments/1428/3051/lpc2024talk.pdf;
currently I'm focusing on getting more reliable results using automata
abstractions.

>
> The script includes defensive checks, but indirect bugs are not always
> detected. I'd appreciate any suggestions on improving the Python
> handling of position variables, or on the SmPL rules, for better
> matching in complex code (e.g., macros and inline functions). The
> script is added under scripts/coccinelle/rt/.
>

My tool captures macros, but it reports a lot of false positives via
various KASAN and printing routines. For example:

Sleeping lock called at:
__cache_free at mm/slab.c:3617
___cache_free at mm/slab.c:3378
do_slab_free at mm/slub.c:3816
__slab_free at mm/slub.c:3796
put_cpu_partial at mm/slub.c:3679
local_lock_irqsave at mm/slub.c:2703
__local_lock_irqsave at include/linux/local_lock.h:31
__local_lock at include/linux/local_lock_internal.h:128
spin_lock at include/linux/local_lock_internal.h:119

preemption disabled at:
__cache_free at mm/slab.c:3617
kasan_slab_free at mm/slab.c:3370
__kasan_slab_free at include/linux/kasan.h:164
____kasan_slab_free at mm/kasan/common.c:244
kasan_quarantine_put at mm/kasan/common.c:238
raw_spin_lock at mm/kasan/quarantine.c:224
_raw_spin_lock at include/linux/spinlock.h:217
__raw_spin_lock at kernel/locking/spinlock.c:154
preempt_disable at include/linux/spinlock_api_smp.h:132

But that might just be because I'm also tracking indirect
preempt_disable (see below). I'm not familiar with Coccinelle,
unfortunately; I considered it for a while, but opted for a different
approach.

> Detects sleep-in-atomic bugs in PREEMPT_RT kernels by identifying improper
> calls to functions that may sleep, such as mutex locks, explicit sleep
> functions (e.g., msleep), memory allocations, and sleepable spinlocks,
> within atomic contexts created by preempt_disable, raw_spin_lock,
> local_irq_disable, or bit_spin_lock.
>
> 1. Detection of direct calls to sleeping functions in atomic scopes.
> 2. Analysis of inter-procedural call chains to uncover indirect calls to
>    sleeping functions via a function call graph.
> 3. Handling of memory allocation functions that may sleep
>    (including calls that pass GFP_ATOMIC).
>

If I understand your code properly, you only match a specific case of
sleeping in atomic context, where the offending call sits directly
between "preempt disable" and "preempt enable".

That means that your script only takes indirection into account for
sleeping functions, not for disabling preemption/entering an atomic
context. There are some occurrences in the kernel where custom "lock"
functions call preempt_disable, so this is needed in order not to miss
those. But it might be better to skip them to prevent flooding the
output with a lot of false positives, since one unmatched
preempt_disable will pollute the rest of the function (and every
function that calls it).
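
In terms of the data your script already collects, that would mean
running the same upward walk over the reversed call graph, but seeded
with the atomic-context primitives instead of the sleepers, so that
wrapper functions get treated as atomic-context entry points too. An
untested sketch (my_lock and f are made-up names):

  import collections

  # Toy data: my_lock is a custom wrapper around preempt_disable.
  CALL_GRAPH = {"my_lock": {"preempt_disable"}, "f": {"my_lock"}}

  reverse = {}
  for caller, callees in CALL_GRAPH.items():
      for callee in callees:
          reverse.setdefault(callee, set()).add(caller)

  enters_atomic, queue = set(), collections.deque({"preempt_disable"})
  while queue:
      fn = queue.popleft()
      if fn in enters_atomic:
          continue
      enters_atomic.add(fn)
      queue.extend(reverse.get(fn, ()))

  # enters_atomic now also contains "my_lock" and "f"; this is exactly
  # where the false-positive flood can come from.
  assert enters_atomic == {"preempt_disable", "my_lock", "f"}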

> This cocci script should identify direct and indirect sleep-in-atomic
> violations, improving PREEMPT_RT compatibility across kernel code.
> For an example of the kind of bug it targets, see:
> Link: https://lore.kernel.org/linux-rt-devel/7a68c944-0199-468e-a0f2-ae2a9f21225b@kzalloc.com/t/#u
>

There are likely still tens of these bugs across different subsystems;
I remember fixing one in nvdimm and one in BPF.

There is also a 2018 paper, Effective Detection of
Sleep-in-Atomic-Context Bugs in the Linux Kernel [2], which covers
this problem without taking PREEMPT_RT into account. They identify
three challenges: accurately processing control flow, handling
function pointers, and handling different code paths. Notably, they
also use summaries, and handle sleeping in atomic context in interrupt
handlers. Overall, it looks like it uses the same general approach as
rtlockscope and your Coccinelle script, just more polished, so you
might want to have a look at it (if you have not seen it yet). Of
course, on PREEMPT_RT, there is an additional challenge in
distinguishing between RT and non-RT paths (like code that sleeps only
on RT and disables preemption only on non-RT).

[2] https://hal.science/hal-03032244



Tomas