From: Paolo Bonzini
To: patchew-devel@redhat.com
Cc: Fam Zheng
Date: Thu, 24 Feb 2022 12:00:49 +0100
Message-Id: <20220224110049.355325-1-pbonzini@redhat.com>
Subject: [Patchew-devel] [PATCH] add public-inbox importer

This is a spiced-up version of Fam's code from
https://github.com/famz/patchew/commit/925998bf6. The differences are:

- there is a new importer script and playbook, so we have two

- there is config file support so that the playbook can follow the same
  model as the existing ones

- there is an age limit so that patches older than a few months are not
  imported

- once it is up to date, the script only works on the most recent repos
  and does not attempt to clone all of them

- all git invocations are done from Python instead of shell

The configuration file supports all command-line options, while the
playbook is a bit more limited.

Co-developed-by: Fam Zheng
---
 scripts/deploy                                |   8 +
 scripts/dockerfiles/importer-lore.docker      |   5 +
 scripts/patchew-importer-lore                 | 244 ++++++++++++++++++
 scripts/playbooks/deploy-importers-lore.yml   |  38 +++
 .../templates/importer-lore-config.j2         |   4 +
 5 files changed, 299 insertions(+)
 create mode 100644 scripts/dockerfiles/importer-lore.docker
 create mode 100755 scripts/patchew-importer-lore
 create mode 100644 scripts/playbooks/deploy-importers-lore.yml
 create mode 100644 scripts/playbooks/templates/importer-lore-config.j2

diff --git a/scripts/deploy b/scripts/deploy
index e05984d..8bd23c4 100755
--- a/scripts/deploy
+++ b/scripts/deploy
@@ -21,6 +21,8 @@ def parse_args():
                         help="Database host address")
     parser.add_argument("--tester", "-t", nargs="*", dest="testers",
                         help="Tester host address")
+    parser.add_argument("--public-inbox", "-p", nargs="?",
+                        help="Importer host address")
     parser.add_argument("--importer", "-i", nargs="?",
                         help="Importer host address")
     parser.add_argument("--applier", "-a", nargs="?",
@@ -41,6 +43,9 @@ def generate_inventory_file(args):
 [appliers]
 %s
 
+[importers_lore]
+%s
+
 [importers]
 %s
 
@@ -49,6 +54,7 @@ def generate_inventory_file(args):
         % (args.web_server or "",
            args.db_server or "",
            args.applier or "",
+           args.public_inbox or "",
            args.importer or "",
            "\n".join(args.testers or [])))
     f.flush()
@@ -68,6 +74,8 @@ def main():
         playbooks.append("deploy-testers.yml")
     if args.applier:
         playbooks.append("deploy-appliers.yml")
+    if args.public_inbox:
+        playbooks.append("deploy-importers-lore.yml")
     if args.importer:
         playbooks.append("deploy-importers.yml")
     if not playbooks:
diff --git a/scripts/dockerfiles/importer-lore.docker b/scripts/dockerfiles/importer-lore.docker
new file mode 100644
index 0000000..1e7e14b
--- /dev/null
+++ b/scripts/dockerfiles/importer-lore.docker
@@ -0,0 +1,5 @@
+FROM fedora:latest
+RUN dnf install -y python findutils git wget
+ENV LC_ALL en_US.UTF-8
+COPY . /opt/patchew/
+CMD /opt/patchew/scripts/patchew-importer-lore -d /data/patchew -c /data/patchew/config
diff --git a/scripts/patchew-importer-lore b/scripts/patchew-importer-lore
new file mode 100755
index 0000000..e034998
--- /dev/null
+++ b/scripts/patchew-importer-lore
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+#
+# Copyright 2021-2022 Bytedance Inc.
+#
+# Authors:
+#     Fam Zheng
+#
+# This work is licensed under the MIT License.  Please see the LICENSE file or
+# http://opensource.org/licenses/MIT.
+
+import os
+import sys
+import time
+import argparse
+import logging
+import tempfile
+import subprocess
+import dbm
+
+BASE_DIR = os.path.realpath(os.path.dirname(__file__) + "/..")
+PATCHEW_CLI = os.path.join(BASE_DIR, "patchew-cli")
+
+CONFIG_ITEMS = {
+    "data_dir": {
+        "short": "d",
+        "help": "directory to put data in",
+        "metavar": "PATH",
+    },
+    "patchew_server": {
+        "short": "S",
+        "help": "Patchew server to log into",
+        "metavar": "HOST",
+    },
+    "patchew_username": {
+        "short": "U",
+        "help": "Username for patchew server",
+        "metavar": "USER",
+    },
+    "patchew_password": {
+        "short": "P",
+        "help": "Password for patchew server",
+        "metavar": "PASSWORD",
+    },
+    "git_root": {
+        "short": "g",
+        "help": "Root of public-inbox repository",
+        "metavar": "URL",
+    },
+    "limit": {
+        "short": "l",
+        "default": "2.months.ago",
+        "help": "How old to import backlog (default 2 months)",
+        "metavar": "DATE",
+    },
+    "max": {
+        "short": "m",
+        "default": "4",
+        "help": "How many public-inbox repositories to import (default 4)",
+        "metavar": "N",
+    },
+    "batch": {
+        "short": "b",
+        "default": "500",
+        "help": "How many messages to import between git-pull",
+        "metavar": "N",
+    },
+}
+
+CONFIG = {}
+HIGHEST_REPO = 0
+
+
+def config_from_file(args):
+    global CONFIG
+    import configparser
+
+    parser = configparser.ConfigParser()
+    parser.read(args.config)
+
+    # default section applies to all git repos
+    CONFIG.update(parser["DEFAULT"])
+    if not args.git_root:
+        # no -g flag, there needs to be exactly one non-DEFAULT section
+        if len(parser.sections()) > 1:
+            raise Exception("please specify desired git root")
+        git_root = parser.sections()[0]
+        CONFIG["git_root"] = git_root
+        CONFIG.update(parser[git_root])
+    else:
+        # -g flag, use the named section or just the defaults
+        if args.git_root in parser.sections():
+            CONFIG.update(parser[args.git_root])
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--config", "-c", help="Path to config file", metavar="FILE")
+    for k, v in CONFIG_ITEMS.items():
+        long = "--" + k.replace("_", "-")
+        short = "-" + v["short"]
+        parser.add_argument(short, long, help=v["help"], metavar=v["metavar"])
+        if "default" in v:
+            CONFIG[k] = v["default"]
+
+    args = parser.parse_args()
+    if args.config:
+        config_from_file(args)
+    # Arguments override config file
+    for k in CONFIG_ITEMS.keys():
+        if getattr(args, k) is not None:
+            CONFIG[k] = getattr(args, k)
+
+
+def git_clone(src, dest):
+    logging.info("cloning " + src + " into " + os.path.join(os.getcwd(), dest))
+    subprocess.check_call(["git", "clone", src, dest])
+
+
+def git_pull(wd):
+    logging.info("updating " + os.path.join(os.getcwd(), wd))
+    subprocess.check_call(["git", "pull"], cwd=wd)
+
+
+def find_commits(git_root, first_repo, max_repos):
+    global HIGHEST_REPO
+    base = "public-inbox"
+    if not os.path.exists(base):
+        os.mkdir(base)
+    for i in range(first_repo, -1, -1):
+        if max_repos < 1:
+            break
+
+        i_str = str(i)
+        wd = os.path.join(base, i_str)
+        if not os.path.exists(wd):
+            try:
+                git_clone(git_root + i_str, wd)
+            except subprocess.CalledProcessError:
+                continue
+
+        HIGHEST_REPO = max(HIGHEST_REPO, i)
+        try:
+            git_pull(wd)
+        except subprocess.CalledProcessError:
+            break
+
+        max_repos -= 1
+        p = subprocess.Popen(
+            ["git", "log", "--oneline", "--since=" + CONFIG["limit"], "--format=%h"],
+            cwd=wd,
+            stdout=subprocess.PIPE,
+            encoding="utf-8",
+        )
+        for line in p.stdout:
+            yield (wd, line.strip())
+
+
+def show_commit(d, c):
+    return subprocess.check_output(["git", "show", "%s:m" % c], cwd=d)
+
+
+def import_public_inbox(git_root, max_imports, first_repo, max_repos):
+    if not git_root.endswith("/"):
+        git_root += "/"
+
+    db = dbm.open("patchew-importer-lore.db", "c")
+
+    for (d, commit) in find_commits(git_root, first_repo, max_repos):
+        if max_imports < 1:
+            break
+        if db.get(commit):
+            continue
+        max_imports -= 1
+        with tempfile.NamedTemporaryFile() as tf:
+            try:
+                tf.write(show_commit(d, commit))
+                tf.flush()
+                what = subprocess.check_output(
+                    "git log -n 1 {commit} --oneline --format='%aD - %aN <%aE> - %s'".format(
+                        commit=commit
+                    ),
+                    shell=True,
+                    cwd=d,
+                    encoding="utf-8",
+                )
+                logging.info("importing %s" % what)
+                cmd = [PATCHEW_CLI, "-s", CONFIG["patchew_server"], "import", tf.name]
+                subprocess.check_output(cmd, stderr=subprocess.PIPE)
+                db[commit] = "imported"
+            except Exception as e:
+                logging.error(
+                    "failed to import commit %s in archive %s: %s" % (commit, d, e)
+                )
+                db[commit] = "failed"
+    else:
+        time.sleep(60)
+
+
+def main():
+    global CONFIG, HIGHEST_REPO
+
+    parse_args()
+    if not CONFIG["patchew_server"]:
+        logging.error(
+            "you need to specify a patchew server within the config file or with -S"
+        )
+    if not CONFIG["patchew_username"]:
+        logging.error(
+            "you need to specify a patchew username within the config file or with -U"
+        )
+    if not CONFIG["patchew_password"]:
+        logging.error(
+            "you need to specify a patchew username password the config file or with -P"
+        )
+
+    logging.basicConfig(level=logging.DEBUG)
+    if CONFIG["data_dir"]:
+        if not os.path.exists(CONFIG["data_dir"]):
+            os.mkdir(CONFIG["data_dir"])
+        os.chdir(CONFIG["data_dir"])
+    cmd = [
+        PATCHEW_CLI,
+        "-s",
+        CONFIG["patchew_server"],
+        "login",
+        CONFIG["patchew_username"],
+        CONFIG["patchew_password"],
+    ]
+    subprocess.check_call(cmd, stderr=subprocess.STDOUT)
+
+    # no need to be stingy, high repos are checked only once per run
+    first_repo = 40
+    max_repos = int(CONFIG["max"])
+    max_imports = int(CONFIG["batch"])
+    git_root = CONFIG["git_root"]
+    while True:
+        # restart and import the latest mails every once in a while to make
+        # sure new patches are imported timely, before the backlog
+        import_public_inbox(git_root, max_imports, first_repo, max_repos)
+        first_repo = HIGHEST_REPO + 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/scripts/playbooks/deploy-importers-lore.yml b/scripts/playbooks/deploy-importers-lore.yml
new file mode 100644
index 0000000..de0a0f2
--- /dev/null
+++ b/scripts/playbooks/deploy-importers-lore.yml
@@ -0,0 +1,38 @@
+- hosts: importers_lore
+  vars_prompt:
+    - name: instance_name
+      prompt: "The instance name"
+      default: patchew-importer-lore
+      private: no
+    - name: "patchew_server"
+      prompt: "The address of patchew server"
+      default: "https://patchew.org"
+      private: no
+    - name: "importer_user"
+      prompt: "Username for the importer to login to the server"
+      private: no
+      default: "importer"
+    - name: "importer_pass"
+      prompt: "Password for the importer to login to the server"
+      private: yes
+    - name: "git_repo_base"
+      prompt: "URL in which to find public-inbox git repositories"
+      default: "https://lore.kernel.org/lkml/"
+      private: no
+  vars:
+    base_dir: "/data/{{ instance_name }}"
+    src_dir: "{{ base_dir }}/src"
+    data_dir: "{{ base_dir }}/data"
+    config_file: "{{ data_dir }}/config"
+  tasks:
+    - name: Create data dir
+      file:
+        path: "{{ data_dir }}"
+        state: directory
+    - name: Create config
+      template:
+        src: "templates/importer-lore-config.j2"
+        dest: "{{ config_file }}"
+    - import_tasks: tasks/docker-deploy.yml
+      vars:
+        instance_role: importer-lore
diff --git a/scripts/playbooks/templates/importer-lore-config.j2 b/scripts/playbooks/templates/importer-lore-config.j2
new file mode 100644
index 0000000..e3a1437
--- /dev/null
+++ b/scripts/playbooks/templates/importer-lore-config.j2
@@ -0,0 +1,4 @@
+[{{ git_repo_base }}]
+patchew_server={{ patchew_server }}
+patchew_username={{ importer_user }}
+patchew_password={{ importer_pass }}
-- 
2.34.1
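
The lookup order that patchew-importer-lore applies to its settings
(built-in defaults, then the [DEFAULT] section, then the section named
after the git root, then command-line options) can be sketched in
isolation as below. This is a minimal sketch under stated assumptions,
not the code from the patch: resolve_config, EXAMPLE_CONFIG and
BUILTIN_DEFAULTS are hypothetical names, the config text and the
password are made up, and unlike the script the sketch insists on
exactly one named section when no git root is given.

```python
import configparser

# Hypothetical config text. The section name is the public-inbox git root,
# the same layout that importer-lore-config.j2 generates.
EXAMPLE_CONFIG = """
[DEFAULT]
limit = 1.month.ago

[https://lore.kernel.org/lkml/]
patchew_server = https://patchew.org
patchew_username = importer
patchew_password = secret
"""

# Mirrors the "default" entries of CONFIG_ITEMS in the script.
BUILTIN_DEFAULTS = {"limit": "2.months.ago", "max": "4", "batch": "500"}


def resolve_config(config_text, cli_overrides, git_root=None):
    """Merge built-in defaults, config file and command-line values, in that order."""
    config = dict(BUILTIN_DEFAULTS)
    parser = configparser.ConfigParser()
    parser.read_string(config_text)

    # The DEFAULT section applies to every git root.
    config.update(parser["DEFAULT"])
    if git_root is None:
        # Assumption of this sketch: without an explicit root there must be
        # exactly one named section, and its name is the git root.
        if len(parser.sections()) != 1:
            raise ValueError("please specify desired git root")
        git_root = parser.sections()[0]
    config["git_root"] = git_root
    if git_root in parser.sections():
        config.update(parser[git_root])

    # Command-line values win over everything read from the file;
    # None stands for an option that was not given.
    config.update({k: v for k, v in cli_overrides.items() if v is not None})
    return config


resolved = resolve_config(EXAMPLE_CONFIG, {"batch": "100", "max": None})
print(resolved["limit"], resolved["batch"], resolved["max"], resolved["git_root"])
```

In this example the [DEFAULT] section changes limit, the command line
changes batch, and max keeps its built-in value.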