From nobody Mon Apr 6 22:48:54 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 97A4EC433F5 for ; Fri, 7 Oct 2022 16:56:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229665AbiJGQ4v (ORCPT ); Fri, 7 Oct 2022 12:56:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55474 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229556AbiJGQ4p (ORCPT ); Fri, 7 Oct 2022 12:56:45 -0400 Received: from mail-il1-x12e.google.com (mail-il1-x12e.google.com [IPv6:2607:f8b0:4864:20::12e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0AAC35FDFD for ; Fri, 7 Oct 2022 09:56:44 -0700 (PDT) Received: by mail-il1-x12e.google.com with SMTP id u10so2798308ilm.5 for ; Fri, 07 Oct 2022 09:56:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=4UKLX7j9DF0b0nVBHqCcizEjwTs/7OhMEZz7ThiWmPU=; b=4XlkKsLZxCrt+c9fmGB5bq2MTyFakYDlhyvSOt757T7suNWNTtXfhxzaLcHiqtH5yD /J6qS/O2YJPeWXDF/LOzsCLN3i9uKcAyuvDvLzX6z9+F4k7G5bikxOcHzaS+b2BAPc14 NMKcjMkjJ7eplGMRtkarPXexSfQUOo1BMenVGhCL9/IdP7OJqW4W5MW2uDzaN0iwWm3H Exnsa3Nkdn9g7WVMgmFtbAxQr1OKWGz9dZi4YgDKRSgXAn98OHwfw4Lm2kz3Rjj1HfpA CgTJbpLBY/zARLGF7FlJM2gjujUuUsLZnuBxAyw50Je2StP4d2HAn7rUXZ5MZvBnS8lv dNRQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=4UKLX7j9DF0b0nVBHqCcizEjwTs/7OhMEZz7ThiWmPU=; 
b=42AtqEjTgy+ZO9ItI+0nzyeY5ZN7fCxo7r62ml5JCqFLRQ/0q/2ZPMaAGs5wqjx1kE YB5gzB/u+ynbj2G9c4wCdCpJjKI7DTTOv9huQKSH5cplbJ/n97SjeAlVvCOBug/7PB57 NCVEFL9vlF/bUa8BwRryaWjg7r20nwFvsYl4o6mucOP4weNIJh758WUYaBwKLBXzcexf wP8oXnCQCDz8+ft2Y4P5WNdjxwDd7dEKqzlqrUejoyI7u/IzVjF/yCyUEK5Q2yeCaZup /vIP5FdLP+7okBTUeT9P+eI+nb1kggUOMIDx/iP+rGZ0dfabn4T+VXHye7DwfseKuKen PNMw== X-Gm-Message-State: ACrzQf1fW2x4t7GlYUSX13hG6MhBd3va0oXldhDT9CZtG9F4OkkL8VKd VZHGmXKig/X4n/170dTtXRpD/W7Emjpjhw== X-Google-Smtp-Source: AMsMyM4OvaNbHc5o7fDN2kfr0Oa71t2vvJjYaC8Pb78cW2Rj1iHSsk0tf7UB7bsSc96VGdKdM1lWdw== X-Received: by 2002:a92:6912:0:b0:2ea:fa2e:462d with SMTP id e18-20020a926912000000b002eafa2e462dmr2830314ilc.155.1665161803214; Fri, 07 Oct 2022 09:56:43 -0700 (PDT) Received: from m1max.localdomain ([207.135.234.126]) by smtp.gmail.com with ESMTPSA id a6-20020a056e020e0600b002eb5eb4f8f9sm1055584ilk.77.2022.10.07.09.56.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 07 Oct 2022 09:56:42 -0700 (PDT) From: Jens Axboe To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe Subject: [PATCH 1/4] eventpoll: cleanup branches around sleeping for events Date: Fri, 7 Oct 2022 10:56:34 -0600 Message-Id: <20221007165637.22374-2-axboe@kernel.dk> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20221007165637.22374-1-axboe@kernel.dk> References: <20221007165637.22374-1-axboe@kernel.dk> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Rather than have two separate branches here, collapse them into a single one instead. No functional changes here, just a cleanup in preparation for changes in this area. 
Signed-off-by: Jens Axboe --- fs/eventpoll.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 8b56b94e2f56..8a75ae70e312 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1869,14 +1869,15 @@ static int ep_poll(struct eventpoll *ep, struct epo= ll_event __user *events, * important. */ eavail =3D ep_events_available(ep); - if (!eavail) + if (!eavail) { __add_wait_queue_exclusive(&ep->wq, &wait); - - write_unlock_irq(&ep->lock); - - if (!eavail) + write_unlock_irq(&ep->lock); timed_out =3D !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS); + } else { + write_unlock_irq(&ep->lock); + } + __set_current_state(TASK_RUNNING); =20 /* --=20 2.35.1 From nobody Mon Apr 6 22:48:54 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2CD20C433FE for ; Fri, 7 Oct 2022 16:56:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229899AbiJGQ45 (ORCPT ); Fri, 7 Oct 2022 12:56:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55508 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229834AbiJGQ4q (ORCPT ); Fri, 7 Oct 2022 12:56:46 -0400 Received: from mail-io1-xd2e.google.com (mail-io1-xd2e.google.com [IPv6:2607:f8b0:4864:20::d2e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3040F46DB9 for ; Fri, 7 Oct 2022 09:56:45 -0700 (PDT) Received: by mail-io1-xd2e.google.com with SMTP id 4so4072355iou.9 for ; Fri, 07 Oct 2022 09:56:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=RG3Q+CFVM1jwdBZfpkSgMBovuxPpkHVLQimnfbDB7qo=; 
b=Ccn6u7yVrMjm4PfhysRT3eZJKgn/nieVJoC5uVIx4IaYNlhnUM3QK47Us6ax5DQgkJ 90L+c9/TwnUfzgBXT0g3JoyAkM+vHiRL0xr1TUfB86vmCZE4ALJPTL97tAsqgd5k717M c5AzVt5LrbaZJcVczzLXh/1DvVTkNpQGrYsjnlX7zlW+dNB+UAFHe/Cg6pRXnyfTP5Zk PTywbUxXsLI1KEtM0yN6pIeIqS7lGG06PG9Z77r5TrvvRR1hYm9PNQYH/8AtrpoeGrQe ReUinQgm1CytQaM0wKeLii6nqIgy57XSohHaR/JrdcpRIlu0ReF4jnw2sBXKsC/LAHSp DCJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=RG3Q+CFVM1jwdBZfpkSgMBovuxPpkHVLQimnfbDB7qo=; b=xTDkdJ1A182LoAkozh6XxXxnJQiX7ctjjq1IC/GRjlN8QpQYVXaLfU/RaHK+osrVsK H1rW4nuvMM+7Z+ooh91zFxo7tw3H7kg+eeZIXzKfkyyVUezcp0RzVdTCtCce7dTWgNgR TUhdUT+/FSrbn3gXMQXVwegiOxn9x14Lb6W8K9efpSnqeruKVSz0/ikdDVpmI1mPoDbk G82Zwsxyo2o6YvGpp8dorv28HtkYCz7LwSzjYRHY/XT579EwYi5OoofwHmhVwce9p7Vc JKbv3uJ/WTc76XzH0uUPhe3JBK3a/qVlUhz+6gWrpbucTiSRxG7Crqjn8UHvCXDFsmsk DVnQ== X-Gm-Message-State: ACrzQf1eqlYmAfwVxLsloEs44NQSBx9PQQi15pX+Vpout3xE3nmG8UFp JZ14g3QsRhquWpntYiP+Fp0h+iJHecZnLA== X-Google-Smtp-Source: AMsMyM4yKZrnfwt2NxMdj+RlCbLqnMTEgMZPruyOyr65JkMUdnD3dNvnizz9JoX9XFqH/FAnikkodw== X-Received: by 2002:a5d:924b:0:b0:6a4:c19d:c5b3 with SMTP id e11-20020a5d924b000000b006a4c19dc5b3mr2740837iol.147.1665161804288; Fri, 07 Oct 2022 09:56:44 -0700 (PDT) Received: from m1max.localdomain ([207.135.234.126]) by smtp.gmail.com with ESMTPSA id a6-20020a056e020e0600b002eb5eb4f8f9sm1055584ilk.77.2022.10.07.09.56.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 07 Oct 2022 09:56:43 -0700 (PDT) From: Jens Axboe To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe Subject: [PATCH 2/4] eventpoll: split out wait handling Date: Fri, 7 Oct 2022 10:56:35 -0600 Message-Id: <20221007165637.22374-3-axboe@kernel.dk> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20221007165637.22374-1-axboe@kernel.dk> References: 
<20221007165637.22374-1-axboe@kernel.dk> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" In preparation for making changes to how wakeups and sleeps are done, move the timeout scheduling into a helper and manage it rather than rely on schedule_hrtimeout_range(). Signed-off-by: Jens Axboe --- fs/eventpoll.c | 70 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 56 insertions(+), 14 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 8a75ae70e312..01b9dab2b68c 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1762,6 +1762,47 @@ static int ep_autoremove_wake_function(struct wait_q= ueue_entry *wq_entry, return ret; } =20 +struct epoll_wq { + wait_queue_entry_t wait; + struct hrtimer timer; + bool timed_out; +}; + +static enum hrtimer_restart ep_timer(struct hrtimer *timer) +{ + struct epoll_wq *ewq =3D container_of(timer, struct epoll_wq, timer); + struct task_struct *task =3D ewq->wait.private; + + ewq->timed_out =3D true; + wake_up_process(task); + return HRTIMER_NORESTART; +} + +static void ep_schedule(struct eventpoll *ep, struct epoll_wq *ewq, ktime_= t *to, + u64 slack) +{ + if (ewq->timed_out) + return; + if (to && *to =3D=3D 0) { + ewq->timed_out =3D true; + return; + } + if (!to) { + schedule(); + return; + } + + hrtimer_init_on_stack(&ewq->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); + ewq->timer.function =3D ep_timer; + hrtimer_set_expires_range_ns(&ewq->timer, *to, slack); + hrtimer_start_expires(&ewq->timer, HRTIMER_MODE_ABS); + + schedule(); + + hrtimer_cancel(&ewq->timer); + destroy_hrtimer_on_stack(&ewq->timer); +} + /** * ep_poll - Retrieves ready events, and delivers them to the caller-suppl= ied * event buffer. 
@@ -1782,13 +1823,15 @@ static int ep_autoremove_wake_function(struct wait_= queue_entry *wq_entry, static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents, struct timespec64 *timeout) { - int res, eavail, timed_out =3D 0; + int res, eavail; u64 slack =3D 0; - wait_queue_entry_t wait; ktime_t expires, *to =3D NULL; + struct epoll_wq ewq; =20 lockdep_assert_irqs_enabled(); =20 + ewq.timed_out =3D false; + if (timeout && (timeout->tv_sec | timeout->tv_nsec)) { slack =3D select_estimate_accuracy(timeout); to =3D &expires; @@ -1798,7 +1841,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, * Avoid the unnecessary trip to the wait queue loop, if the * caller specified a non blocking operation. */ - timed_out =3D 1; + ewq.timed_out =3D 1; } =20 /* @@ -1823,10 +1866,10 @@ static int ep_poll(struct eventpoll *ep, struct epo= ll_event __user *events, return res; } =20 - if (timed_out) + if (ewq.timed_out) return 0; =20 - eavail =3D ep_busy_loop(ep, timed_out); + eavail =3D ep_busy_loop(ep, ewq.timed_out); if (eavail) continue; =20 @@ -1850,8 +1893,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, * performance issue if a process is killed, causing all of its * threads to wake up without being removed normally. 
*/ - init_wait(&wait); - wait.func =3D ep_autoremove_wake_function; + init_wait(&ewq.wait); + ewq.wait.func =3D ep_autoremove_wake_function; =20 write_lock_irq(&ep->lock); /* @@ -1870,10 +1913,9 @@ static int ep_poll(struct eventpoll *ep, struct epol= l_event __user *events, */ eavail =3D ep_events_available(ep); if (!eavail) { - __add_wait_queue_exclusive(&ep->wq, &wait); + __add_wait_queue_exclusive(&ep->wq, &ewq.wait); write_unlock_irq(&ep->lock); - timed_out =3D !schedule_hrtimeout_range(to, slack, - HRTIMER_MODE_ABS); + ep_schedule(ep, &ewq, to, slack); } else { write_unlock_irq(&ep->lock); } @@ -1887,7 +1929,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, */ eavail =3D 1; =20 - if (!list_empty_careful(&wait.entry)) { + if (!list_empty_careful(&ewq.wait.entry)) { write_lock_irq(&ep->lock); /* * If the thread timed out and is not on the wait queue, @@ -1896,9 +1938,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, * Thus, when wait.entry is empty, it needs to harvest * events. 
*/ - if (timed_out) - eavail =3D list_empty(&wait.entry); - __remove_wait_queue(&ep->wq, &wait); + if (ewq.timed_out) + eavail =3D list_empty(&ewq.wait.entry); + __remove_wait_queue(&ep->wq, &ewq.wait); write_unlock_irq(&ep->lock); } } --=20 2.35.1 From nobody Mon Apr 6 22:48:54 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EE7C3C433FE for ; Fri, 7 Oct 2022 16:57:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230120AbiJGQ5B (ORCPT ); Fri, 7 Oct 2022 12:57:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55514 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229513AbiJGQ4q (ORCPT ); Fri, 7 Oct 2022 12:56:46 -0400 Received: from mail-io1-xd31.google.com (mail-io1-xd31.google.com [IPv6:2607:f8b0:4864:20::d31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0ACAB60496 for ; Fri, 7 Oct 2022 09:56:46 -0700 (PDT) Received: by mail-io1-xd31.google.com with SMTP id 4so4072377iou.9 for ; Fri, 07 Oct 2022 09:56:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=268Ewj1SkBC8anSnii6XOQa/SAiq+mm/PdGL6GPMikM=; b=um2f4lAP90/5sPo8x7U02EbQNucStn9G1tMe/aYsv1/wSysgTPyhwAHdy3uyIZbO3I R12S393agNl2gl6/84qpfQwxUX3x333wI6RscwNaHiEOZHuZfjA8/+52qLIVb219+sdy q6VKkIsHKVSXARmljPQZoxd4ftfCeBmrHOQYKFhKVTFbuHOAlX2kyDD/GgtCDRJT9v6q QlbLUIWcFfRHG8DxYC84sN0izcQBncZ06z3HsRa/OZMQkGR/UwolfKYT6TjLHlP1aj9U d8X7hX7JoO1ZY/CUnu7TEeMcO95rb4rJVOsH9iWdvOfTZK3aJjYuxqDaeodo6W5cBhPh SO8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=268Ewj1SkBC8anSnii6XOQa/SAiq+mm/PdGL6GPMikM=; b=AzcwAzWQhkUPKJ9rYbQaQyq0cbVEJTQXN+dwBhzdl0X0N6drbZ5neJEjp9YCVyKcKz 7qcJPE3G+pEpbgy0CSSvvzpyJTNdEm35RBpSCPCNLDDAGDTtQdbEOh10iKLNZc6qbOVV +Jl9l/89DmnDmYwVUm0Beo84eMI7yFRoRh5ZLorZDSG6ASyEZhO+PIILpZBb93rh+zlQ roUaZ6Clz3GeiGYWrgodvxXtNy8Sk32HZLtPqY1/C042S2XvAcGXd/fr1zuLTzpWWYB3 09TkxTfwdibKTVGDo4hot+M59duCw0+C45xXWOgimqzwu9C0AwPOSpUF2CfNCyp1pvVB UFjw== X-Gm-Message-State: ACrzQf2pjWFdBqHeJr4Iyz8NSGd71wtP/WjmK1eu7bDGfADSPXq5wKI2 b2lOAdHLKcYqQnPax0P3RCz8LLNIwg8RNQ== X-Google-Smtp-Source: AMsMyM5cab4DrLGmE2R6/vLkQVAh+Bjf5n/flcaTschxphy9xhCDDpVwHtoi+2sscPY8PSz1btOb/g== X-Received: by 2002:a05:6602:2cd3:b0:6a2:167d:1d1c with SMTP id j19-20020a0566022cd300b006a2167d1d1cmr2679773iow.18.1665161805095; Fri, 07 Oct 2022 09:56:45 -0700 (PDT) Received: from m1max.localdomain ([207.135.234.126]) by smtp.gmail.com with ESMTPSA id a6-20020a056e020e0600b002eb5eb4f8f9sm1055584ilk.77.2022.10.07.09.56.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 07 Oct 2022 09:56:44 -0700 (PDT) From: Jens Axboe To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe Subject: [PATCH 3/4] eventpoll: move expires to epoll_wq Date: Fri, 7 Oct 2022 10:56:36 -0600 Message-Id: <20221007165637.22374-4-axboe@kernel.dk> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20221007165637.22374-1-axboe@kernel.dk> References: <20221007165637.22374-1-axboe@kernel.dk> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" This makes the expiration available to the wakeup handler. No functional changes expected in this patch, purely in preparation for being able to use the timeout on the wakeup side. 
Signed-off-by: Jens Axboe --- fs/eventpoll.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 01b9dab2b68c..79aa61a951df 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1765,6 +1765,7 @@ static int ep_autoremove_wake_function(struct wait_qu= eue_entry *wq_entry, struct epoll_wq { wait_queue_entry_t wait; struct hrtimer timer; + ktime_t timeout_ts; bool timed_out; }; =20 @@ -1825,7 +1826,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, { int res, eavail; u64 slack =3D 0; - ktime_t expires, *to =3D NULL; + ktime_t *to =3D NULL; struct epoll_wq ewq; =20 lockdep_assert_irqs_enabled(); @@ -1834,7 +1835,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, =20 if (timeout && (timeout->tv_sec | timeout->tv_nsec)) { slack =3D select_estimate_accuracy(timeout); - to =3D &expires; + to =3D &ewq.timeout_ts; *to =3D timespec64_to_ktime(*timeout); } else if (timeout) { /* --=20 2.35.1 From nobody Mon Apr 6 22:48:54 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A191CC433F5 for ; Fri, 7 Oct 2022 16:57:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230075AbiJGQ5E (ORCPT ); Fri, 7 Oct 2022 12:57:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55630 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229817AbiJGQ4t (ORCPT ); Fri, 7 Oct 2022 12:56:49 -0400 Received: from mail-il1-x12e.google.com (mail-il1-x12e.google.com [IPv6:2607:f8b0:4864:20::12e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E651946DB9 for ; Fri, 7 Oct 2022 09:56:46 -0700 (PDT) Received: by mail-il1-x12e.google.com with SMTP id u10so2798362ilm.5 for ; Fri, 07 Oct 2022 09:56:46 -0700 (PDT) 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=dbBXQdH6Biw9ZL6BC4iovaHCIiANA2YGcpyF37wDdyQ=; b=j2Zh2OIT0OBTcwGeFc70NIm4U/Jv7GRw+g0fEMIWqc4lpGkyFIj1mqhS66tb0RPbt1 4f9Of5Armjd6tTRT51mUf2jemRrDBuUb5rLeHNaNC5wAFLV9+6+dL8xRtWOgwSjsqwIZ AUsb9ML3qkDeMdrTsYLAPPTFof5Tl+Kf+z5ZbqEMQ/NDXgslx+VWq3c/u0ExvrOxygJE crUC9NFh6Em0RBufTSeEcMjopu8sE/b3OhmWtH91382vgW5T2yLCXdMNOj3kLyJmo+Be GemtPUuBbbaJoGpceIckTAcc521ad0TBqdglZcF4kb4PgSx/0XtP6+1ES8gMtT6JqTaS KJAw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=dbBXQdH6Biw9ZL6BC4iovaHCIiANA2YGcpyF37wDdyQ=; b=KecPT3HORS8rAAitIsTNAuXpQdHY2NUSY0fuw0+VEkeIogI+3hsRgLb2EriTcDAiPT QMVlAxX1PO0n3sYwFgwbmxy3Q38bDAG5B7Ql52z0RAKkLADMJtI6K4s09AoFb5o2M7wC fo4pbML1IV+GWAkPDmlYoVb5cQH9M7jmsjSpMYDkAYUackgmM1ggWcQ6trLK2qH4bhB+ klqL6TQwGA/QmI8b4q1iyNIZS1xrDERwakLobFy0ivAeIx52BZnitzloHTaGY29vqi5F bEDYR0yWK7M5NTHjbi8CHp9hxaHhvV6QkIESoR5YPwXBJ3ye1C7zt5NwCxHqVXLxYmwB YsMA== X-Gm-Message-State: ACrzQf3m8S5PMMaSt0b3L74lx+sPNTfYYhe9Db8sEtdxYEhD36DjIypD 2VA/F49Szo0S5zat6zxZVfJB3RaDbnXphQ== X-Google-Smtp-Source: AMsMyM4cd19LOkCQXBpjx9j+X3RTH9HqoJfd+r/KT/Uep1nNboL/FXYLPHw/WV/JS7akfIniviVdtA== X-Received: by 2002:a92:c568:0:b0:2f9:e77d:293c with SMTP id b8-20020a92c568000000b002f9e77d293cmr2680095ilj.319.1665161806162; Fri, 07 Oct 2022 09:56:46 -0700 (PDT) Received: from m1max.localdomain ([207.135.234.126]) by smtp.gmail.com with ESMTPSA id a6-20020a056e020e0600b002eb5eb4f8f9sm1055584ilk.77.2022.10.07.09.56.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 07 Oct 2022 09:56:45 -0700 (PDT) From: Jens Axboe To: linux-kernel@vger.kernel.org, 
netdev@vger.kernel.org Cc: Jens Axboe Subject: [PATCH 4/4] eventpoll: add support for min-wait Date: Fri, 7 Oct 2022 10:56:37 -0600 Message-Id: <20221007165637.22374-5-axboe@kernel.dk> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20221007165637.22374-1-axboe@kernel.dk> References: <20221007165637.22374-1-axboe@kernel.dk> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

Rather than just have a timeout value for waiting on events, add EPOLL_CTL_MIN_WAIT to allow setting a minimum time that epoll_wait() should always wait for events to arrive.

To improve efficiency at medium load, some production workloads inject artificial timers or sleeps before calling epoll_wait() to get better batching. While this does help, it's not as efficient as it could be. By adding support for this directly in epoll_wait(), we can avoid extra context switches and scheduler and timer overhead.

As an example, running an AB test on an identical workload at about 370K reqs/second, without this change and with the sleep hack mentioned above (using 200 usec as the timeout), we're doing 310K-340K non-voluntary context switches per second. Idle CPU on the host is 27-34%. With the sleep hack removed and epoll set to the same 200 usec value, we're handling the exact same load but at 292K-315K non-voluntary context switches per second and idle CPU of 33-41%, a substantial win.
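For comparison, the userspace sleep hack described above might look roughly like the sketch below. This is only an illustration, not the production code the numbers refer to: sleep_then_wait() and demo_batched_events() are hypothetical names, and the 200 msec interval is picked purely so the demo is deterministic. It runs on an unpatched kernel because it does not use EPOLL_CTL_MIN_WAIT.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: the "sleep hack". Sleep for a fixed interval so
 * that several events can queue up, then do a normal epoll_wait().
 */
static int sleep_then_wait(int efd, struct epoll_event *events, int maxevents,
			   useconds_t batch_usec, int timeout_ms)
{
	usleep(batch_usec);
	return epoll_wait(efd, events, maxevents, timeout_ms);
}

/*
 * Hypothetical demo: a child process makes two pipes readable 10 msec
 * apart while the parent batches for 200 msec. Returns the number of
 * events harvested by the single epoll_wait() call, or -1 on error.
 */
static int demo_batched_events(void)
{
	struct epoll_event ev, events[2];
	int p1[2], p2[2], efd, ret;
	char b = 'x';
	pid_t pid;

	efd = epoll_create1(0);
	if (efd < 0 || pipe(p1) < 0 || pipe(p2) < 0)
		return -1;

	ev.events = EPOLLIN;
	ev.data.fd = p1[0];
	if (epoll_ctl(efd, EPOLL_CTL_ADD, p1[0], &ev) < 0)
		return -1;
	ev.events = EPOLLIN;
	ev.data.fd = p2[0];
	if (epoll_ctl(efd, EPOLL_CTL_ADD, p2[0], &ev) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: generate 2 events 10 msec apart, then exit. */
		usleep(10000);
		if (write(p1[1], &b, sizeof(b)) != sizeof(b))
			_exit(1);
		usleep(10000);
		if (write(p2[1], &b, sizeof(b)) != sizeof(b))
			_exit(1);
		_exit(0);
	}

	/* Parent: both events should be pending once the sleep is over. */
	ret = sleep_then_wait(efd, events, 2, 200000, -1);

	waitpid(pid, NULL, 0);
	close(p1[0]);
	close(p1[1]);
	close(p2[0]);
	close(p2[1]);
	close(efd);
	return ret;
}
```

Calling demo_batched_events() from a small main() is enough to try it out. The cost of this approach is that every wait pays for a separate timer and wakeup before epoll_wait() is even entered, which is the overhead the min-wait support is meant to remove.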
Basic test case:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

struct d {
	int p1, p2;
};

static void *fn(void *data)
{
	struct d *d = data;
	char b = 0x89;

	/* Generate 2 events 10 msec apart */
	usleep(10000);
	write(d->p1, &b, sizeof(b));
	usleep(10000);
	write(d->p2, &b, sizeof(b));

	return NULL;
}

int main(int argc, char *argv[])
{
	struct epoll_event ev, events[2];
	pthread_t thread;
	int p1[2], p2[2];
	struct d d;
	int efd, ret;

	efd = epoll_create1(0);
	if (efd < 0) {
		perror("epoll_create");
		return 1;
	}

	if (pipe(p1) < 0) {
		perror("pipe");
		return 1;
	}
	if (pipe(p2) < 0) {
		perror("pipe");
		return 1;
	}

	ev.events = EPOLLIN;
	ev.data.fd = p1[0];
	if (epoll_ctl(efd, EPOLL_CTL_ADD, p1[0], &ev) < 0) {
		perror("epoll add");
		return 1;
	}
	ev.events = EPOLLIN;
	ev.data.fd = p2[0];
	if (epoll_ctl(efd, EPOLL_CTL_ADD, p2[0], &ev) < 0) {
		perror("epoll add");
		return 1;
	}

	/* always wait 200 msec for events (the value is in usecs) */
	ev.data.u64 = 200000;
	if (epoll_ctl(efd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0) {
		perror("epoll add set timeout");
		return 1;
	}

	d.p1 = p1[1];
	d.p2 = p2[1];
	pthread_create(&thread, NULL, fn, &d);

	/* expect to get 2 events here rather than just 1 */
	ret = epoll_wait(efd, events, 2, -1);
	printf("epoll_wait=%d\n", ret);

	return 0;
}

Signed-off-by: Jens Axboe --- fs/eventpoll.c | 132 ++++++++++++++++++++++++++------- include/linux/eventpoll.h | 2 +- include/uapi/linux/eventpoll.h | 1 + 3 files changed, 109 insertions(+), 26 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 79aa61a951df..ccb8400e2252 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -39,6 +39,11 @@ #include #include =20 +/* + * If a default min_wait timeout is desired, set this to non-zero. In usec= s. 
+ */ +#define EPOLL_DEF_MIN_WAIT 0 + /* * LOCKING: * There are three level of locking required by epoll : @@ -117,6 +122,9 @@ struct eppoll_entry { /* The "base" pointer is set to the container "struct epitem" */ struct epitem *base; =20 + /* min wait time if (min_wait_ts) & 1 !=3D 0 */ + ktime_t min_wait_ts; + /* * Wait queue item that will be linked to the target file wait * queue head. @@ -217,6 +225,9 @@ struct eventpoll { u64 gen; struct hlist_head refs; =20 + /* min wait for epoll_wait() */ + unsigned int min_wait_ts; + #ifdef CONFIG_NET_RX_BUSY_POLL /* used to track busy poll napi_id */ unsigned int napi_id; @@ -953,6 +964,7 @@ static int ep_alloc(struct eventpoll **pep) ep->rbr =3D RB_ROOT_CACHED; ep->ovflist =3D EP_UNACTIVE_PTR; ep->user =3D user; + ep->min_wait_ts =3D EPOLL_DEF_MIN_WAIT; =20 *pep =3D ep; =20 @@ -1747,6 +1759,32 @@ static struct timespec64 *ep_timeout_to_timespec(str= uct timespec64 *to, long ms) return to; } =20 +struct epoll_wq { + wait_queue_entry_t wait; + struct hrtimer timer; + ktime_t timeout_ts; + ktime_t min_wait_ts; + struct eventpoll *ep; + bool timed_out; + int maxevents; + int wakeups; +}; + +static bool ep_should_min_wait(struct epoll_wq *ewq) +{ + if (ewq->min_wait_ts & 1) { + /* just an approximation */ + if (++ewq->wakeups >=3D ewq->maxevents) + goto stop_wait; + if (ktime_before(ktime_get_ns(), ewq->min_wait_ts)) + return true; + } + +stop_wait: + ewq->min_wait_ts &=3D ~(u64) 1; + return false; +} + /* * autoremove_wake_function, but remove even on failure to wake up, becaus= e we * know that default_wake_function/ttwu will only fail if the thread is al= ready @@ -1756,27 +1794,37 @@ static struct timespec64 *ep_timeout_to_timespec(st= ruct timespec64 *to, long ms) static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned int mode, int sync, void *key) { - int ret =3D default_wake_function(wq_entry, mode, sync, key); + struct epoll_wq *ewq =3D container_of(wq_entry, struct epoll_wq, wait); + int 
ret; + + /* + * If min wait time hasn't been satisfied yet, keep waiting + */ + if (ep_should_min_wait(ewq)) + return 0; =20 + ret =3D default_wake_function(wq_entry, mode, sync, key); list_del_init(&wq_entry->entry); return ret; } =20 -struct epoll_wq { - wait_queue_entry_t wait; - struct hrtimer timer; - ktime_t timeout_ts; - bool timed_out; -}; - static enum hrtimer_restart ep_timer(struct hrtimer *timer) { struct epoll_wq *ewq =3D container_of(timer, struct epoll_wq, timer); struct task_struct *task =3D ewq->wait.private; + const bool is_min_wait =3D ewq->min_wait_ts & 1; + + if (!is_min_wait || ep_events_available(ewq->ep)) { + if (!is_min_wait) + ewq->timed_out =3D true; + ewq->min_wait_ts &=3D ~(u64) 1; + wake_up_process(task); + return HRTIMER_NORESTART; + } =20 - ewq->timed_out =3D true; - wake_up_process(task); - return HRTIMER_NORESTART; + ewq->min_wait_ts &=3D ~(u64) 1; + hrtimer_set_expires_range_ns(&ewq->timer, ewq->timeout_ts, 0); + return HRTIMER_RESTART; } =20 static void ep_schedule(struct eventpoll *ep, struct epoll_wq *ewq, ktime_= t *to, @@ -1831,12 +1879,14 @@ static int ep_poll(struct eventpoll *ep, struct epo= ll_event __user *events, =20 lockdep_assert_irqs_enabled(); =20 + ewq.ep =3D ep; ewq.timed_out =3D false; + ewq.maxevents =3D maxevents; + ewq.wakeups =3D 0; =20 if (timeout && (timeout->tv_sec | timeout->tv_nsec)) { slack =3D select_estimate_accuracy(timeout); - to =3D &ewq.timeout_ts; - *to =3D timespec64_to_ktime(*timeout); + ewq.timeout_ts =3D timespec64_to_ktime(*timeout); } else if (timeout) { /* * Avoid the unnecessary trip to the wait queue loop, if the @@ -1845,6 +1895,21 @@ static int ep_poll(struct eventpoll *ep, struct epol= l_event __user *events, ewq.timed_out =3D 1; } =20 + /* + * If min_wait is set for this epoll instance, note the min_wait + * time. Ensure the lowest bit is set in ewq.min_wait_ts, that's + * the state bit for whether or not min_wait is enabled. 
+ */ + if (ep->min_wait_ts) { + ewq.min_wait_ts =3D ktime_add_us(ktime_get_ns(), + ep->min_wait_ts); + ewq.min_wait_ts |=3D (u64) 1; + to =3D &ewq.min_wait_ts; + } else { + ewq.min_wait_ts =3D 0; + to =3D &ewq.timeout_ts; + } + /* * This call is racy: We may or may not see events that are being added * to the ready list under the lock (e.g., in IRQ callbacks). For cases @@ -1913,7 +1978,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll= _event __user *events, * important. */ eavail =3D ep_events_available(ep); - if (!eavail) { + if (!eavail || ewq.min_wait_ts & 1) { __add_wait_queue_exclusive(&ep->wq, &ewq.wait); write_unlock_irq(&ep->lock); ep_schedule(ep, &ewq, to, slack); @@ -2111,6 +2176,31 @@ int do_epoll_ctl(int epfd, int op, int fd, struct ep= oll_event *epds, if (!f.file) goto error_return; =20 + /* + * We have to check that the file structure underneath the file + * descriptor the user passed to us _is_ an eventpoll file. + */ + error =3D -EINVAL; + if (!is_file_epoll(f.file)) + goto error_fput; + + /* + * At this point it is safe to assume that the "private_data" contains + * our own data structure. + */ + ep =3D f.file->private_data; + + /* + * Handle EPOLL_CTL_MIN_WAIT upfront as we don't need to care about + * the fd being passed in. + */ + if (op =3D=3D EPOLL_CTL_MIN_WAIT) { + /* return old value */ + error =3D ep->min_wait_ts; + ep->min_wait_ts =3D epds->data; + goto error_fput; + } + /* Get the "struct file *" for the target file */ tf =3D fdget(fd); if (!tf.file) @@ -2126,12 +2216,10 @@ int do_epoll_ctl(int epfd, int op, int fd, struct e= poll_event *epds, ep_take_care_of_epollwakeup(epds); =20 /* - * We have to check that the file structure underneath the file descriptor - * the user passed to us _is_ an eventpoll file. And also we do not permit - * adding an epoll file descriptor inside itself. + * We do not permit adding an epoll file descriptor inside itself. 
*/ error =3D -EINVAL; - if (f.file =3D=3D tf.file || !is_file_epoll(f.file)) + if (f.file =3D=3D tf.file) goto error_tgt_fput; =20 /* @@ -2147,12 +2235,6 @@ int do_epoll_ctl(int epfd, int op, int fd, struct ep= oll_event *epds, goto error_tgt_fput; } =20 - /* - * At this point it is safe to assume that the "private_data" contains - * our own data structure. - */ - ep =3D f.file->private_data; - /* * When we insert an epoll file descriptor inside another epoll file * descriptor, there is the chance of creating closed loops, which are @@ -2251,7 +2333,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, f= d, { struct epoll_event epds; =20 - if (ep_op_has_event(op) && + if ((ep_op_has_event(op) || op =3D=3D EPOLL_CTL_MIN_WAIT) && copy_from_user(&epds, event, sizeof(struct epoll_event))) return -EFAULT; =20 diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h index 3337745d81bd..cbef635cb7e4 100644 --- a/include/linux/eventpoll.h +++ b/include/linux/eventpoll.h @@ -59,7 +59,7 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_e= vent *epds, /* Tells if the epoll_ctl(2) operation needs an event copy from userspace = */ static inline int ep_op_has_event(int op) { - return op !=3D EPOLL_CTL_DEL; + return op !=3D EPOLL_CTL_DEL && op !=3D EPOLL_CTL_MIN_WAIT; } =20 #else diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 8a3432d0f0dc..81ecb1ca36e0 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -26,6 +26,7 @@ #define EPOLL_CTL_ADD 1 #define EPOLL_CTL_DEL 2 #define EPOLL_CTL_MOD 3 +#define EPOLL_CTL_MIN_WAIT 4 =20 /* Epoll event masks */ #define EPOLLIN (__force __poll_t)0x00000001 --=20 2.35.1
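The state-bit convention that patch 4/4 uses for ewq.min_wait_ts (lowest bit set means min-wait is still in effect) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: demo_ktime_t, struct demo_waiter and the demo_*() functions are hypothetical stand-ins, and the caller passes the current time explicitly instead of reading a clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's ktime_t: nanoseconds. */
typedef int64_t demo_ktime_t;

/* Lowest bit of the deadline doubles as the "min-wait active" flag. */
#define DEMO_MIN_WAIT_ACTIVE ((int64_t)1)

/* Hypothetical, reduced model of the min-wait fields in struct epoll_wq. */
struct demo_waiter {
	demo_ktime_t min_wait_ts;
	int maxevents;
	int wakeups;
};

/* Compute the min-wait deadline and mark it active via the low bit. */
static void demo_arm_min_wait(struct demo_waiter *w, demo_ktime_t now_ns,
			      uint32_t min_wait_usec, int maxevents)
{
	w->min_wait_ts = (now_ns + (int64_t)min_wait_usec * 1000) |
			 DEMO_MIN_WAIT_ACTIVE;
	w->maxevents = maxevents;
	w->wakeups = 0;
}

static bool demo_min_wait_active(const struct demo_waiter *w)
{
	return (w->min_wait_ts & DEMO_MIN_WAIT_ACTIVE) != 0;
}

/*
 * Modelled on ep_should_min_wait(): keep waiting only while the flag is
 * set, fewer than maxevents wakeups have been seen (an approximation of
 * "enough events are ready"), and the deadline has not passed. In every
 * other case the flag is cleared so later wakeups are delivered normally.
 */
static bool demo_should_min_wait(struct demo_waiter *w, demo_ktime_t now_ns)
{
	if (demo_min_wait_active(w) &&
	    ++w->wakeups < w->maxevents &&
	    now_ns < w->min_wait_ts)
		return true;

	w->min_wait_ts &= ~DEMO_MIN_WAIT_ACTIVE;
	return false;
}
```

Folding the flag into the timestamp avoids a separate field, at the price of shifting the deadline by at most one nanosecond, which is far below the timer slack involved here.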