search: treat dot and slash as word separators

[Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Paolo Bonzini 2 years, 8 months ago

Postgres full text search is pretty strict about only searching for exact word matches.
For patches it is useful to add a word break on dots and slashes, but unfortunately
it is not really possible to teach it what delimiters to use.

One possibility could be to include the text search vector directly in the database
using a trigger, but that is a bit hard to do with django, so just create a new
field that is used for the full text search.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 .../0061_message_searched_subject.py          | 19 +++++++++++++++++++
 .../0062_populate_searched_subject.py         | 18 ++++++++++++++++++
 .../0063_postgres_fts_searched_subject.py     | 19 +++++++++++++++++++
 api/models.py                                 |  4 ++++
 api/search.py                                 |  2 +-
 5 files changed, 61 insertions(+), 1 deletion(-)
 create mode 100644 api/migrations/0061_message_searched_subject.py
 create mode 100644 api/migrations/0062_populate_searched_subject.py
 create mode 100644 api/migrations/0063_postgres_fts_searched_subject.py

diff --git a/api/migrations/0061_message_searched_subject.py b/api/migrations/0061_message_searched_subject.py
new file mode 100644
index 0000000..7edc67d
--- /dev/null
+++ b/api/migrations/0061_message_searched_subject.py
@@ -0,0 +1,19 @@
+# Generated by Django 3.1.14 on 2022-02-22 10:45
+
+from django.db import migrations, models
+
+
+class Migration(migrations.Migration):
+
+    dependencies = [
+        ('api', '0060_auto_20210106_1120'),
+    ]
+
+    operations = [
+        migrations.AddField(
+            model_name='message',
+            name='searched_subject',
+            field=models.CharField(default='', max_length=4096),
+            preserve_default=False,
+        ),
+    ]
diff --git a/api/migrations/0062_populate_searched_subject.py b/api/migrations/0062_populate_searched_subject.py
new file mode 100644
index 0000000..0f2812c
--- /dev/null
+++ b/api/migrations/0062_populate_searched_subject.py
@@ -0,0 +1,18 @@
+# -*- coding: utf-8 -*-
+from __future__ import unicode_literals
+
+from django.db import migrations
+from django.db.models import Q
+
+class Migration(migrations.Migration):
+
+    dependencies = [
+        ('api', '0061_message_searched_subject'),
+    ]
+
+    operations = [
+        migrations.RunSQL(
+            'update api_message set searched_subject = replace(replace(subject, ".", " "), "/", " ") where subject like "%.%" or subject like "%/%"'
+        )
+
+    ]
diff --git a/api/migrations/0063_postgres_fts_searched_subject.py b/api/migrations/0063_postgres_fts_searched_subject.py
new file mode 100644
index 0000000..557bd9c
--- /dev/null
+++ b/api/migrations/0063_postgres_fts_searched_subject.py
@@ -0,0 +1,19 @@
+# -*- coding: utf-8 -*-
+from __future__ import unicode_literals
+
+from django.db import migrations
+
+from api.migrations import PostgresOnlyMigration
+
+class Migration(PostgresOnlyMigration):
+
+    dependencies = [
+        ('api', '0062_populate_searched_subject'),
+    ]
+
+    operations = [
+        migrations.RunSQL("create index api_message_searched_subject_gin on api_message using gin(to_tsvector('english', searched_subject::text));",
+                          "drop index api_message_searched_subject_gin"),
+        migrations.RunSQL("drop index api_message_subject_gin",
+                          "create index api_message_subject_gin on api_message using gin(to_tsvector('english', subject::text));"),
+    ]
diff --git a/api/models.py b/api/models.py
index 18392b1..3edb4ca 100644
--- a/api/models.py
+++ b/api/models.py
@@ -472,6 +472,9 @@ class MessageManager(models.Manager):
         if "in_reply_to" not in validated_data:
             msg.in_reply_to = m.get_in_reply_to() or ""
         msg.stripped_subject = m.get_subject(strip_tags=True)
+        msg.searched_subject = msg.subject \
+            .replace(".", " ") \
+            .replace("/", " ")
         msg.version = m.get_version()
         msg.prefixes = m.get_prefixes()
         if m.is_series_head():
@@ -596,6 +599,7 @@ class Message(models.Model):
     last_comment_date = models.DateTimeField(db_index=True, null=True)
     subject = HeaderFieldModel()
     stripped_subject = HeaderFieldModel(db_index=True)
+    searched_subject = HeaderFieldModel()
     version = models.PositiveSmallIntegerField(default=0)
     sender = jsonfield.JSONCharField(max_length=4096, db_index=True)
     recipients = jsonfield.JSONField()
diff --git a/api/search.py b/api/search.py
index 08b40c3..c3d75e5 100644
--- a/api/search.py
+++ b/api/search.py
@@ -411,7 +411,7 @@ Search text keyword in the email message. Example:
         if self._last_keywords:
             if connection.vendor == "postgresql":
                 queryset = queryset.annotate(
-                    subjsearch=NonNullSearchVector("subject", config="english")
+                    subjsearch=NonNullSearchVector("searched_subject", config="english")
                 )
                 searchq = reduce(
                     lambda x, y: x & y,
-- 
2.34.1

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel

Re: [Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Paolo Bonzini 2 years, 8 months ago

On 2/22/22 12:00, Paolo Bonzini wrote:
> Postgres full text search is pretty strict about only searching for exact word matches.
> For patches it is useful to add a word break on dots and slashes, but unfortunately
> it is not really possible to teach it what delimiters to use.
> 
> One possibility could be to include the text search vector directly in the database
> using a trigger, but that is a bit hard to do with django, so just create a new
> field that is used for the full text search.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Unfortunately this broke horribly when applied to next.patchew.org.  I'm 
not sure why, but the "UPDATE" takes an incredibly long time even if 
it's just for 10000 rows.

Paolo

> ---
>   .../0061_message_searched_subject.py          | 19 +++++++++++++++++++
>   .../0062_populate_searched_subject.py         | 18 ++++++++++++++++++
>   .../0063_postgres_fts_searched_subject.py     | 19 +++++++++++++++++++
>   api/models.py                                 |  4 ++++
>   api/search.py                                 |  2 +-
>   5 files changed, 61 insertions(+), 1 deletion(-)
>   create mode 100644 api/migrations/0061_message_searched_subject.py
>   create mode 100644 api/migrations/0062_populate_searched_subject.py
>   create mode 100644 api/migrations/0063_postgres_fts_searched_subject.py
> 
> diff --git a/api/migrations/0061_message_searched_subject.py b/api/migrations/0061_message_searched_subject.py
> new file mode 100644
> index 0000000..7edc67d
> --- /dev/null
> +++ b/api/migrations/0061_message_searched_subject.py
> @@ -0,0 +1,19 @@
> +# Generated by Django 3.1.14 on 2022-02-22 10:45
> +
> +from django.db import migrations, models
> +
> +
> +class Migration(migrations.Migration):
> +
> +    dependencies = [
> +        ('api', '0060_auto_20210106_1120'),
> +    ]
> +
> +    operations = [
> +        migrations.AddField(
> +            model_name='message',
> +            name='searched_subject',
> +            field=models.CharField(default='', max_length=4096),
> +            preserve_default=False,
> +        ),
> +    ]
> diff --git a/api/migrations/0062_populate_searched_subject.py b/api/migrations/0062_populate_searched_subject.py
> new file mode 100644
> index 0000000..0f2812c
> --- /dev/null
> +++ b/api/migrations/0062_populate_searched_subject.py
> @@ -0,0 +1,18 @@
> +# -*- coding: utf-8 -*-
> +from __future__ import unicode_literals
> +
> +from django.db import migrations
> +from django.db.models import Q
> +
> +class Migration(migrations.Migration):
> +
> +    dependencies = [
> +        ('api', '0061_message_searched_subject'),
> +    ]
> +
> +    operations = [
> +        migrations.RunSQL(
> +            'update api_message set searched_subject = replace(replace(subject, ".", " "), "/", " ") where subject like "%.%" or subject like "%/%"'
> +        )
> +
> +    ]
> diff --git a/api/migrations/0063_postgres_fts_searched_subject.py b/api/migrations/0063_postgres_fts_searched_subject.py
> new file mode 100644
> index 0000000..557bd9c
> --- /dev/null
> +++ b/api/migrations/0063_postgres_fts_searched_subject.py
> @@ -0,0 +1,19 @@
> +# -*- coding: utf-8 -*-
> +from __future__ import unicode_literals
> +
> +from django.db import migrations
> +
> +from api.migrations import PostgresOnlyMigration
> +
> +class Migration(PostgresOnlyMigration):
> +
> +    dependencies = [
> +        ('api', '0062_populate_searched_subject'),
> +    ]
> +
> +    operations = [
> +        migrations.RunSQL("create index api_message_searched_subject_gin on api_message using gin(to_tsvector('english', searched_subject::text));",
> +                          "drop index api_message_searched_subject_gin"),
> +        migrations.RunSQL("drop index api_message_subject_gin",
> +                          "create index api_message_subject_gin on api_message using gin(to_tsvector('english', subject::text));"),
> +    ]
> diff --git a/api/models.py b/api/models.py
> index 18392b1..3edb4ca 100644
> --- a/api/models.py
> +++ b/api/models.py
> @@ -472,6 +472,9 @@ class MessageManager(models.Manager):
>           if "in_reply_to" not in validated_data:
>               msg.in_reply_to = m.get_in_reply_to() or ""
>           msg.stripped_subject = m.get_subject(strip_tags=True)
> +        msg.searched_subject = msg.subject \
> +            .replace(".", " ") \
> +            .replace("/", " ")
>           msg.version = m.get_version()
>           msg.prefixes = m.get_prefixes()
>           if m.is_series_head():
> @@ -596,6 +599,7 @@ class Message(models.Model):
>       last_comment_date = models.DateTimeField(db_index=True, null=True)
>       subject = HeaderFieldModel()
>       stripped_subject = HeaderFieldModel(db_index=True)
> +    searched_subject = HeaderFieldModel()
>       version = models.PositiveSmallIntegerField(default=0)
>       sender = jsonfield.JSONCharField(max_length=4096, db_index=True)
>       recipients = jsonfield.JSONField()
> diff --git a/api/search.py b/api/search.py
> index 08b40c3..c3d75e5 100644
> --- a/api/search.py
> +++ b/api/search.py
> @@ -411,7 +411,7 @@ Search text keyword in the email message. Example:
>           if self._last_keywords:
>               if connection.vendor == "postgresql":
>                   queryset = queryset.annotate(
> -                    subjsearch=NonNullSearchVector("subject", config="english")
> +                    subjsearch=NonNullSearchVector("searched_subject", config="english")
>                   )
>                   searchq = reduce(
>                       lambda x, y: x & y,

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel

Re: [Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Fam Zheng 2 years, 8 months ago

On 2022-02-22 14:57, Paolo Bonzini wrote:
> On 2/22/22 12:00, Paolo Bonzini wrote:
> > Postgres full text search is pretty strict about only searching for exact word matches.
> > For patches it is useful to add a word break on dots and slashes, but unfortunately
> > it is not really possible to teach it what delimiters to use.
> > 
> > One possibility could be to include the text search vector directly in the database
> > using a trigger, but that is a bit hard to do with django, so just create a new
> > field that is used for the full text search.
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Unfortunately this broke horribly when applied to next.patchew.org.  I'm not
> sure why, but the "UPDATE" takes an incredibly long time even if it's just
> for 10000 rows.
> 
> Paolo

Oops.  Maybe the VM is overwhelmed by IO? It seems to be up now, have you
aborted the change?

The HTTP 500 downtime we experience for every code change is quite bad. I'm
inclined we switch to a load balanced server using k3s (single node) and run as
Kubernetes Deployment. It has some quite noticable memory tax (500MB to 1G) is
nicer in many ways.  It will take some work to chance but I can manage.

What do you think?

Fam

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel

Re: [Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Paolo Bonzini 2 years, 8 months ago

On 2/22/22 15:06, Fam Zheng wrote:
>> Unfortunately this broke horribly when applied to next.patchew.org.  I'm not
>> sure why, but the "UPDATE" takes an incredibly long time even if it's just
>> for 10000 rows.
>
> Oops.  Maybe the VM is overwhelmed by IO? It seems to be up now, have you
> aborted the change?

Yes, I reverted it and undid the django migration (with ALTER TABLE 
api_messages DROP COLUMN + DELETE FROM django_migrations WHERE ...) by hand.

> The HTTP 500 downtime we experience for every code change is quite bad. I'm
> inclined we switch to a load balanced server using k3s (single node) and run as
> Kubernetes Deployment. It has some quite noticable memory tax (500MB to 1G) is
> nicer in many ways.  It will take some work to chance but I can manage.

Most of the downtime is for chown -R of the mail data, is that actually 
necessary?  But really I think we should just make the cron deployment 
to patchew.org for master only run once a day.

That said:

1) we should set up next.patchew.org's importer to bring at least all of 
QEMU, so that we have some recent useful data

2) we should also set up an alternate applier for next.patchew.org, 
again at least for QEMU

3) if you want, I have Azure credits that were given to QEMU, and we can 
use them for a Patchew instance as well since it uses QEMU.  I'm not 
sure how hard it is to set up our services on AKS, I can start with the 
next.patchew.org importer next Friday, as I have a day of learning...

Paolo

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel

Re: [Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Fam Zheng 2 years, 8 months ago

On 2022-02-22 17:22, Paolo Bonzini wrote:
> On 2/22/22 15:06, Fam Zheng wrote:
> > > Unfortunately this broke horribly when applied to next.patchew.org.  I'm not
> > > sure why, but the "UPDATE" takes an incredibly long time even if it's just
> > > for 10000 rows.
> > 
> > Oops.  Maybe the VM is overwhelmed by IO? It seems to be up now, have you
> > aborted the change?
> 
> Yes, I reverted it and undid the django migration (with ALTER TABLE
> api_messages DROP COLUMN + DELETE FROM django_migrations WHERE ...) by hand.
> 
> > The HTTP 500 downtime we experience for every code change is quite bad. I'm
> > inclined we switch to a load balanced server using k3s (single node) and run as
> > Kubernetes Deployment. It has some quite noticable memory tax (500MB to 1G) is
> > nicer in many ways.  It will take some work to chance but I can manage.
> 
> Most of the downtime is for chown -R of the mail data, is that actually
> necessary?

Likely redundant now. But the problem still exists for long migrations. It's
not a big step to taking advantage of kubernetes from where we are. (I already
have a downstream deployment done that way.)

> But really I think we should just make the cron deployment to
> patchew.org for master only run once a day.
> 
> That said:
> 
> 1) we should set up next.patchew.org's importer to bring at least all of
> QEMU, so that we have some recent useful data
> 
> 2) we should also set up an alternate applier for next.patchew.org, again at
> least for QEMU

I'd like to do it, but the disk space on next.patchew.org is quite tight.

> 
> 3) if you want, I have Azure credits that were given to QEMU, and we can use
> them for a Patchew instance as well since it uses QEMU.  I'm not sure how
> hard it is to set up our services on AKS, I can start with the
> next.patchew.org importer next Friday, as I have a day of learning...

AKS sounds great. I think we can first move importers and appliers there.
Their Deployment YAML will be fairly straightforward; for the container
registry we can use GitLab or just push as docker public image.

Fam

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel

Re: [Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Paolo Bonzini 2 years, 8 months ago

On 2/22/22 17:39, Fam Zheng wrote:
>> 3) if you want, I have Azure credits that were given to QEMU, and we can use
>> them for a Patchew instance as well since it uses QEMU.  I'm not sure how
>> hard it is to set up our services on AKS, I can start with the
>> next.patchew.org importer next Friday, as I have a day of learning...
>
> AKS sounds great. I think we can first move importers and appliers there.
> Their Deployment YAML will be fairly straightforward; for the container
> registry we can use GitLab or just push as docker public image.

Ok, so we'll need 2x3 docker public images for {next,master} x 
{importer,applier,new-importer}.  I'll use either quay.io or azure 
container registry for now, then we can decide.  Later we can add server 
and db as well.

Paolo

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel

Re: [Patchew-devel] [PATCH] search: treat dot and slash as word separators

Posted by Fam Zheng 2 years, 8 months ago

On 2022-02-22 17:47, Paolo Bonzini wrote:
> On 2/22/22 17:39, Fam Zheng wrote:
> > > 3) if you want, I have Azure credits that were given to QEMU, and we can use
> > > them for a Patchew instance as well since it uses QEMU.  I'm not sure how
> > > hard it is to set up our services on AKS, I can start with the
> > > next.patchew.org importer next Friday, as I have a day of learning...
> > 
> > AKS sounds great. I think we can first move importers and appliers there.
> > Their Deployment YAML will be fairly straightforward; for the container
> > registry we can use GitLab or just push as docker public image.
> 
> Ok, so we'll need 2x3 docker public images for {next,master} x
> {importer,applier,new-importer}.  I'll use either quay.io or azure container
> registry for now, then we can decide.  Later we can add server and db as
> well.

No problem.

Making kubelet pull from private registries is a little painful [1].
Alternatively, we can also use a public container, do git-pull in the container
[2] to avoid using registry at all. This will greatly simplify the CI/CD script
in the patchew repo (just a kubectl apply, no more docker-build and
docker-pull).

[1]: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

[2]: example YAML for importer (modified from my local version, untested)

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: importer-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: importer
  name: importer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: importer
  template:
    metadata:
      labels:
        app: importer
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: importer-data
      containers:
      - image: alpine:3.15
        imagePullPolicy: IfNotPresent
        name: importer
        volumeMounts:
          - mountPath: "/data/patchew"
            name: data
        command: /bin/sh
        env:
          - name: PATCHEW_BRANCH
            value: next
        args:
          - "-c"
          - |
          set -e
          apk add git py3-pip
          git clone https://github.com/patchew-project/patchew -b $PATCHEW_BRANCH /opt/patchew
          cd /opt/patchew
          ...

_______________________________________________
Patchew-devel mailing list
Patchew-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/patchew-devel