If a lot of requests are in the queue, this message spams the logs,
thus rate limit it.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
drivers/nvme/host/tcp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index dc5bbca58c6dcbce40cfa3de893592d768ebc939..1ed0bc10b2dffe534536b1073abc0302056aa51e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1257,8 +1257,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	if (ret == -EAGAIN) {
 		ret = 0;
 	} else if (ret < 0) {
-		dev_err(queue->ctrl->ctrl.device,
-			"failed to send request %d\n", ret);
+		dev_err_ratelimited(queue->ctrl->ctrl.device,
+			"failed to send request %d\n", ret);
 		nvme_tcp_fail_request(queue->request);
 		nvme_tcp_done_send_req(queue);
 	}
--
2.48.1
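For reference, the dev_*_ratelimited() helpers wrap the plain printers
with a per-callsite ratelimit state. A paraphrased sketch of what
dev_err_ratelimited() expands to (modeled on include/linux/dev_printk.h;
the exact macro body varies between kernel versions):

#define dev_err_ratelimited(dev, fmt, ...)				\
do {									\
	static DEFINE_RATELIMIT_STATE(_rs,				\
				      DEFAULT_RATELIMIT_INTERVAL,	\
				      DEFAULT_RATELIMIT_BURST);		\
	if (__ratelimit(&_rs))						\
		dev_err(dev, fmt, ##__VA_ARGS__);			\
} while (0)

With the default interval of 5*HZ and burst of 10, at most ten of these
errors reach the log per five-second window; further occurrences are
dropped and summarized in a "callbacks suppressed" line.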
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
> If a lot of requests are in the queue, this message spams the logs,
> thus rate limit it.

Are in the queue when what happens?  Not that I'm against this,
but if we have a known condition where this error is printed a lot
we should probably skip it entirely for that?
On Wed, Jan 29, 2025 at 07:05:34AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
> > If a lot of requests are in the queue, this message spams the logs,
> > thus rate limit it.
>
> Are in the queue when what happens?  Not that I'm against this,
> but if we have a known condition where this error is printed a lot
> we should probably skip it entirely for that?

The condition is that all the elements in the queue->send_list could
fail as a batch. I had a bug in my patches which re-queued all the
failed commands immediately and sent them out again, thus spamming the
log.

This behavior doesn't exist upstream. I just thought it might make
sense to rate limit as a precaution. I don't know if it is worth the
code churn.
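To illustrate why the batch-failure case above stays readable with rate
limiting, here is a hypothetical, self-contained userspace model of the
kernel's interval+burst semantics (illustrative only; not the kernel
implementation):

#include <stdio.h>
#include <time.h>

/* Minimal stand-in for the kernel's struct ratelimit_state. */
struct rl_state {
	time_t begin;	/* start of the current window */
	int interval;	/* window length, seconds */
	int burst;	/* messages allowed per window */
	int printed;	/* emitted in this window */
	int missed;	/* suppressed in this window */
};

/* Rough model of __ratelimit(): 1 = print allowed, 0 = suppress. */
static int rl_ok(struct rl_state *rs)
{
	time_t now = time(NULL);

	if (!rs->begin || now - rs->begin >= rs->interval) {
		/* New window: report what was dropped, reset counters. */
		if (rs->missed)
			fprintf(stderr, "%d callbacks suppressed\n",
				rs->missed);
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	rs->missed++;
	return 0;
}

int main(void)
{
	struct rl_state rs = { .interval = 5, .burst = 10 };

	/* Simulate a whole send_list failing as one batch. */
	for (int i = 0; i < 100; i++)
		if (rl_ok(&rs))
			fprintf(stderr, "failed to send request %d\n", -32);
	return 0;
}

Only the first ten lines reach the log; the remaining ninety are
counted and would be reported once the next window opens.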
On Thu, Jan 30, 2025 at 04:25:35PM +0100, Daniel Wagner wrote:
> On Wed, Jan 29, 2025 at 07:05:34AM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
> > > If a lot of requests are in the queue, this message spams the logs,
> > > thus rate limit it.
> >
> > Are in the queue when what happens?  Not that I'm against this,
> > but if we have a known condition where this error is printed a lot
> > we should probably skip it entirely for that?
>
> The condition is that all the elements in the queue->send_list could
> fail as a batch. I had a bug in my patches which re-queued all the
> failed commands immediately and sent them out again, thus spamming the
> log.
>
> This behavior doesn't exist upstream. I just thought it might make
> sense to rate limit as a precaution. I don't know if it is worth the
> code churn.

I'm fine with the rate limiting. I was just wondering if there is a
case where we'd easily hit it and could do even better.
On 31/01/2025 9:29, Christoph Hellwig wrote:
> On Thu, Jan 30, 2025 at 04:25:35PM +0100, Daniel Wagner wrote:
>> On Wed, Jan 29, 2025 at 07:05:34AM +0100, Christoph Hellwig wrote:
>>> On Tue, Jan 28, 2025 at 05:34:46PM +0100, Daniel Wagner wrote:
>>>> If a lot of requests are in the queue, this message spams the logs,
>>>> thus rate limit it.
>>> Are in the queue when what happens?  Not that I'm against this,
>>> but if we have a known condition where this error is printed a lot
>>> we should probably skip it entirely for that?
>> The condition is that all the elements in the queue->send_list could
>> fail as a batch. I had a bug in my patches which re-queued all the
>> failed commands immediately and sent them out again, thus spamming the
>> log.
>>
>> This behavior doesn't exist upstream. I just thought it might make
>> sense to rate limit as a precaution. I don't know if it is worth the
>> code churn.
> I'm fine with the rate limiting. I was just wondering if there is a
> case where we'd easily hit it and could do even better.

I agree with the change. The reason I think it's useful to keep the
message is that it has been a really useful indication when debugging
UAF panics.