From nobody Sat Oct  4 00:26:49 2025
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>,
	Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>,
	"Mauro Carvalho Chehab" <mchehab+huawei@kernel.org>,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 02/24] docs: parse-headers.py: convert parse-headers.pl
Date: Fri, 22 Aug 2025 16:19:14 +0200
Message-ID: 
 <ae5cfa8dff37e280cc9493fc95a51cd0cc0ba127.1755872208.git.mchehab+huawei@kernel.org>
X-Mailer: git-send-email 2.50.1
In-Reply-To: <cover.1755872208.git.mchehab+huawei@kernel.org>
References: <cover.1755872208.git.mchehab+huawei@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Content-Type: text/plain; charset="utf-8"

When the kernel started to use Sphinx, we had to come up with
a solution to parse media headers. At that time, we didn't have
much experience with Sphinx extensions, so we came up with our
own script-based solution that basically implemented a set of
rules we used to have in the Makefile.

Convert it to Python, keeping it bug-compatible with the
original script.

While here, try to better document it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
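Note for reviewers: the exceptions-file syntax described in the class
docstring can be tried in isolation. The snippet below is only a sketch
and is not part of this patch. parse_rule() and VALID_TYPES are
hypothetical names made up for illustration; the two regular expressions
are copied from process_exceptions(), and VALID_TYPES is assumed to
mirror the keys of DEF_SYMBOL_TYPES.

```python
import re

# Hypothetical mirror of the DEF_SYMBOL_TYPES keys (assumption, see above)
VALID_TYPES = {"ioctl", "define", "symbol", "typedef", "enum", "struct"}


def parse_rule(line):
    """
    Classify one line of an exceptions file.

    Returns None for blank lines and comments, ("ignore", type, symbol)
    or ("replace", type, old, new) for rules. Raises ValueError otherwise.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None

    # Same pattern used by process_exceptions() for ignore rules
    match = re.match(r"^ignore\s+(\w+)\s+(\S+)", line)
    if match:
        rule = ("ignore",) + match.groups()
    else:
        # Same pattern used by process_exceptions() for replace rules
        match = re.match(r"^replace\s+(\S+)\s+(\S+)\s+(\S+)", line)
        if not match:
            raise ValueError(f"invalid line: {line}")
        rule = ("replace",) + match.groups()

    if rule[1] not in VALID_TYPES:
        raise ValueError(f"{rule[1]} is invalid")

    return rule


for text in ("# a comment",
             "ignore ioctl VIDIOC_ENUM_FMT",
             "replace ioctl VIDIOC_DQBUF vidioc_qbuf"):
    print(parse_rule(text))
```

Unlike the script, which calls sys.exit() on a malformed rule, the sketch
raises ValueError so that it can be used from a test.
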
 Documentation/sphinx/parse-headers.py | 429 ++++++++++++++++++++++++++
 1 file changed, 429 insertions(+)
 create mode 100755 Documentation/sphinx/parse-headers.py

diff --git a/Documentation/sphinx/parse-headers.py b/Documentation/sphinx/p=
arse-headers.py
new file mode 100755
index 000000000000..b39284d21090
--- /dev/null
+++ b/Documentation/sphinx/parse-headers.py
@@ -0,0 +1,429 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2016 by Mauro Carvalho Chehab <mchehab@kernel.org>.
+# pylint: disable=3DC0103,R0902,R0912,R0914,R0915
+
+"""
+Convert a C header or source file (C_FILE) into a reStructuredText
+file that includes it via a ".. parsed-literal::" block, with
+cross-references to the documentation files that describe the API. It
+accepts an optional EXCEPTIONS_FILE that describes which elements will
+be either ignored or pointed to a non-default reference.
+
+The output is written to OUT_FILE.
+
+It is capable of identifying defines, functions, structs, typedefs,
+enums and enum symbols, and of creating cross-references for all of
+them. It is also capable of distinguishing #defines used for
+specifying a Linux ioctl.
+
+The EXCEPTIONS_FILE contains a set of rules like:
+
+    ignore ioctl VIDIOC_ENUM_FMT
+    replace ioctl VIDIOC_DQBUF vidioc_qbuf
+    replace define V4L2_EVENT_MD_FL_HAVE_FRAME_SEQ :c:type:`v4l2_event_mot=
ion_det`
+"""
+
+import argparse
+import os
+import re
+import sys
+
+
+class ParseHeader:
+    """
+    Creates an enriched version of a Kernel header file with cross-links
+    to each C data structure type.
+
+    It is meant to allow having a more comprehensive documentation, where
+    uAPI headers will create cross-reference links to the code.
+
+    It is capable of identifying defines, functions, structs, typedefs,
+    enums and enum symbols, and of creating cross-references for all of
+    them. It is also capable of distinguishing #defines used for
+    specifying a Linux ioctl.
+
+    By default, it creates rules for all symbols and defines, but it also
+    allows parsing an exception file. Such a file contains a set of rules
+    using the syntax below:
+
+    1. Ignore rules:
+
+        ignore <type> <symbol>
+
+    Removes the symbol from reference generation.
+
+    2. Replace rules:
+
+        replace <type> <old_symbol> <new_reference>
+
+    Replaces the reference of old_symbol with new_reference, which can be:
+        - A simple symbol name;
+        - A full Sphinx reference.
+
+    In both cases, <type> can be:
+        - ioctl: for defines expanding to _IO*, i.e. ioctl definitions;
+        - define: for other defines;
+        - symbol: for symbols defined within enums;
+        - typedef: for typedefs;
+        - enum: for the name of a non-anonymous enum;
+        - struct: for structs.
+
+    Examples:
+
+        ignore define __LINUX_MEDIA_H
+        ignore ioctl VIDIOC_ENUM_FMT
+        replace ioctl VIDIOC_DQBUF vidioc_qbuf
+        replace define V4L2_EVENT_MD_FL_HAVE_FRAME_SEQ :c:type:`v4l2_event=
_motion_det`
+    """
+
+    # Parser regexes with multiple ways to capture enums and structs
+    RE_ENUMS =3D [
+        re.compile(r"^\s*enum\s+([\w_]+)\s*\{"),
+        re.compile(r"^\s*enum\s+([\w_]+)\s*$"),
+        re.compile(r"^\s*typedef\s*enum\s+([\w_]+)\s*\{"),
+        re.compile(r"^\s*typedef\s*enum\s+([\w_]+)\s*$"),
+    ]
+    RE_STRUCTS =3D [
+        re.compile(r"^\s*struct\s+([_\w][\w\d_]+)\s*\{"),
+        re.compile(r"^\s*struct\s+([_\w][\w\d_]+)$"),
+        re.compile(r"^\s*typedef\s*struct\s+([_\w][\w\d_]+)\s*\{"),
+        re.compile(r"^\s*typedef\s*struct\s+([_\w][\w\d_]+)$"),
+    ]
+
+    # FIXME: the original code was written long before the Sphinx C
+    # domain gained multiple namespaces. To avoid too much churn in the
+    # existing hyperlinks, the code keeps using "c:type" instead of the
+    # right types. To change that, we need to change the types not only
+    # here, but also in the uAPI media documentation.
+    DEF_SYMBOL_TYPES =3D {
+        "ioctl": {
+            "prefix": "\\ ",
+            "suffix": "\\ ",
+            "ref_type": ":ref",
+        },
+        "define": {
+            "prefix": "\\ ",
+            "suffix": "\\ ",
+            "ref_type": ":ref",
+        },
+        # We're calling each definition inside an enum as "symbol"
+        "symbol": {
+            "prefix": "\\ ",
+            "suffix": "\\ ",
+            "ref_type": ":ref",
+        },
+        "typedef": {
+            "prefix": "\\ ",
+            "suffix": "\\ ",
+            "ref_type": ":c:type",
+        },
+        # This is the name of the enum itself
+        "enum": {
+            "prefix": "",
+            "suffix": "\\ ",
+            "ref_type": ":c:type",
+        },
+        "struct": {
+            "prefix": "",
+            "suffix": "\\ ",
+            "ref_type": ":c:type",
+        },
+    }
+
+    def __init__(self, debug: int =3D 0):
+        """Initialize internal vars"""
+        self.debug =3D debug
+        self.data =3D ""
+
+        self.symbols =3D {}
+
+        for symbol_type in self.DEF_SYMBOL_TYPES:
+            self.symbols[symbol_type] =3D {}
+
+    def store_type(self, symbol_type: str, symbol: str,
+                   ref_name: str =3D None, replace_underscores: bool =3D T=
rue):
+        """
+        Stores a new symbol at self.symbols under symbol_type.
+
+        By default, underscores are replaced by "-"
+        """
+        defs =3D self.DEF_SYMBOL_TYPES[symbol_type]
+
+        prefix =3D defs.get("prefix", "")
+        suffix =3D defs.get("suffix", "")
+        ref_type =3D defs.get("ref_type")
+
+        # Determine ref_link based on symbol type
+        if ref_type:
+            if symbol_type =3D=3D "enum":
+                ref_link =3D f"{ref_type}:`{symbol}`"
+            else:
+                if not ref_name:
+                    ref_name =3D symbol.lower()
+
+                if replace_underscores:
+                    ref_name =3D ref_name.replace("_", "-")
+
+                ref_link =3D f"{ref_type}:`{symbol} <{ref_name}>`"
+        else:
+            ref_link =3D symbol
+
+        self.symbols[symbol_type][symbol] =3D f"{prefix}{ref_link}{suffix}"
+
+    def store_line(self, line):
+        """Stores a line at self.data, properly indented"""
+        line =3D "    " + line.expandtabs()
+        self.data +=3D line.rstrip(" ")
+
+    def parse_file(self, file_in: str):
+        """Reads a C source file and gets its identifiers"""
+        self.data =3D ""
+        is_enum =3D False
+        is_comment =3D False
+        multiline =3D ""
+
+        with open(file_in, "r",
+                  encoding=3D"utf-8", errors=3D"backslashreplace") as f:
+            for line_no, line in enumerate(f):
+                self.store_line(line)
+                line =3D line.strip("\n")
+
+                # Handle continuation lines
+                if line.endswith("\\"):
+                    multiline +=3D line[:-1]
+                    continue
+
+                if multiline:
+                    line =3D multiline + line
+                    multiline =3D ""
+
+                # Handle comments. They can be multilined
+                if not is_comment:
+                    if re.search(r"/\*.*", line):
+                        is_comment =3D True
+                    else:
+                        # Strip C99-style comments
+                        line =3D re.sub(r"(//.*)", "", line)
+
+                if is_comment:
+                    if re.search(r".*\*/", line):
+                        is_comment =3D False
+                    else:
+                        multiline =3D line
+                        continue
+
+                # At this point, the line variable may hold a multi-line
+                # statement, if lines end with \ or contain multi-line
+                # comments. With that, the entire comments can be safely
+                # removed, with no need for re.DOTALL in the logic below.
+
+                line =3D re.sub(r"(/\*.*\*/)", "", line)
+                if not line.strip():
+                    continue
+
+                # It can be useful for debug purposes to print the file af=
ter
+                # having comments stripped and multi-lines grouped.
+                if self.debug > 1:
+                    print(f"line {line_no + 1}: {line}")
+
+                # Now the fun begins: parse each type and store it.
+
+                # We opted for a two-pass logic here because:
+                # 1. it makes it easier to debug unparsed symbols;
+                # 2. we want symbol replacement across the entire content,
+                #    not just where the symbol is detected.
+
+                if is_enum:
+                    match =3D re.match(r"^\s*([_\w][\w\d_]+)\s*[\,=3D]?", =
line)
+                    if match:
+                        self.store_type("symbol", match.group(1))
+                    if "}" in line:
+                        is_enum =3D False
+                    continue
+
+                match =3D re.match(r"^\s*#\s*define\s+([\w_]+)\s+_IO", lin=
e)
+                if match:
+                    self.store_type("ioctl", match.group(1),
+                                    replace_underscores=3DFalse)
+                    continue
+
+                match =3D re.match(r"^\s*#\s*define\s+([\w_]+)(\s+|$)", li=
ne)
+                if match:
+                    self.store_type("define", match.group(1))
+                    continue
+
+                match =3D re.match(r"^\s*typedef\s+([_\w][\w\d_]+)\s+(.*)\=
s+([_\w][\w\d_]+);",
+                                 line)
+                if match:
+                    name =3D match.group(2).strip()
+                    symbol =3D match.group(3)
+                    self.store_type("typedef", symbol, ref_name=3Dname,
+                                    replace_underscores=3DFalse)
+                    continue
+
+                for re_enum in self.RE_ENUMS:
+                    match =3D re_enum.match(line)
+                    if match:
+                        self.store_type("enum", match.group(1))
+                        is_enum =3D True
+                        break
+
+                for re_struct in self.RE_STRUCTS:
+                    match =3D re_struct.match(line)
+                    if match:
+                        self.store_type("struct", match.group(1),
+                                        replace_underscores=3DFalse)
+                        break
+
+    def process_exceptions(self, fname: str):
+        """
+        Process exceptions file with rules to ignore or replace references.
+        """
+        if not fname:
+            return
+
+        name =3D os.path.basename(fname)
+
+        with open(fname, "r", encoding=3D"utf-8", errors=3D"backslashrepla=
ce") as f:
+            for ln, line in enumerate(f):
+                ln +=3D 1
+                line =3D line.strip()
+                if not line or line.startswith("#"):
+                    continue
+
+                # Handle ignore rules
+                match =3D re.match(r"^ignore\s+(\w+)\s+(\S+)", line)
+                if match:
+                    c_type =3D match.group(1)
+                    symbol =3D match.group(2)
+
+                    if c_type not in self.DEF_SYMBOL_TYPES:
+                        sys.exit(f"{name}:{ln}: {c_type} is invalid")
+
+                    d =3D self.symbols[c_type]
+                    if symbol in d:
+                        del d[symbol]
+
+                    continue
+
+                # Handle replace rules
+                match =3D re.match(r"^replace\s+(\S+)\s+(\S+)\s+(\S+)", li=
ne)
+                if not match:
+                    sys.exit(f"{name}:{ln}: invalid line: {line}")
+
+                c_type, old, new =3D match.groups()
+
+                if c_type not in self.DEF_SYMBOL_TYPES:
+                    sys.exit(f"{name}:{ln}: {c_type} is invalid")
+
+                reftype =3D None
+
+                # Parse reference type when the type is specified
+
+                match =3D re.match(r"^\:c\:(data|func|macro|type)\:\`(.+)\=
`", new)
+                if match:
+                    reftype =3D f":c:{match.group(1)}"
+                    new =3D match.group(2)
+                else:
+                    match =3D re.search(r"(\:ref)\:\`(.+)\`", new)
+                    if match:
+                        reftype =3D match.group(1)
+                        new =3D match.group(2)
+
+                # If the replacement rule doesn't have a type, get default
+                if not reftype:
+                    reftype =3D self.DEF_SYMBOL_TYPES[c_type].get("ref_typ=
e")
+                    if not reftype:
+                        reftype =3D self.DEF_SYMBOL_TYPES[c_type].get("rea=
l_type")
+
+                new_ref =3D f"{reftype}:`{old} <{new}>`"
+
+                # Change self.symbols to use the replacement rule
+                if old in self.symbols[c_type]:
+                    self.symbols[c_type][old] =3D new_ref
+                else:
+                    print(f"{name}:{ln}: Warning: can't find {old} {c_type=
}")
+
+    def debug_print(self):
+        """
+        Print debug information containing the replacement rules per
+        symbol. To make them easier to check, they are grouped per type.
+        """
+        if not self.debug:
+            return
+
+        for c_type, refs in self.symbols.items():
+            if not refs:  # Skip empty dictionaries
+                continue
+
+            print(f"{c_type}:")
+
+            for symbol, ref in sorted(refs.items()):
+                print(f"  {symbol} -> {ref}")
+
+            print()
+
+    def write_output(self, file_in: str, file_out: str):
+        """Write the formatted output to a file."""
+
+        # Avoid extra blank lines
+        text =3D re.sub(r"\s+$", "", self.data) + "\n"
+        text =3D re.sub(r"\n\s+\n", "\n\n", text)
+
+        # Escape Sphinx special characters
+        text =3D re.sub(r"([\_\`\*\<\>\&\\\\:\/\|\%\$\#\{\}\~\^])", r"\\\1=
", text)
+
+        # Source uAPI files may have special notes. Use bold font for them
+        text =3D re.sub(r"DEPRECATED", "**DEPRECATED**", text)
+
+        # Delimiters to catch the entire symbol after escaping
+        start_delim =3D r"([ \n\t\(=3D\*\@])"
+        end_delim =3D r"(\s|,|\\=3D|\\:|\;|\)|\}|\{)"
+
+        # Process all reference types
+        for ref_dict in self.symbols.values():
+            for symbol, replacement in ref_dict.items():
+                symbol =3D re.escape(re.sub(r"([\_\`\*\<\>\&\\\\:\/])", r"=
\\\1", symbol))
+                text =3D re.sub(fr'{start_delim}{symbol}{end_delim}',
+                              fr'\1{replacement}\2', text)
+
+        # Remove "\ " where not needed: before spaces and at the end of li=
nes
+        text =3D re.sub(r"\\ ([\n ])", r"\1", text)
+
+        title =3D os.path.basename(file_in)
+
+        with open(file_out, "w", encoding=3D"utf-8", errors=3D"backslashre=
place") as f:
+            f.write(".. -*- coding: utf-8; mode: rst -*-\n\n")
+            f.write(f"{title}\n")
+            f.write("=3D" * len(title))
+            f.write("\n\n.. parsed-literal::\n\n")
+            f.write(text)
+
+
+def main():
+    """Main function"""
+    parser =3D argparse.ArgumentParser(description=3D__doc__,
+                                     formatter_class=3Dargparse.RawDescrip=
tionHelpFormatter)
+
+    parser.add_argument("-d", "--debug", action=3D"count", default=3D0,
+                        help=3D"Increase debug level. Can be used multiple=
 times")
+    parser.add_argument("file_in", help=3D"Input C file")
+    parser.add_argument("file_out", help=3D"Output RST file")
+    parser.add_argument("file_exceptions", nargs=3D"?",
+                        help=3D"Exceptions file (optional)")
+
+    args =3D parser.parse_args()
+
+    parser =3D ParseHeader(debug=3Dargs.debug)
+    parser.parse_file(args.file_in)
+
+    if args.file_exceptions:
+        parser.process_exceptions(args.file_exceptions)
+
+    parser.debug_print()
+    parser.write_output(args.file_in, args.file_out)
+
+
+if __name__ =3D=3D "__main__":
+    main()
--=20
2.50.1