From: "D. Wythe" <alibuda@linux.alibaba.com>
To: "David S. Miller", Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Sidraya Jayagond, Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
    linux-s390@vger.kernel.org, netdev@vger.kernel.org,
    oliver.yang@linux.alibaba.com, pasic@linux.ibm.com
Subject: [PATCH RFC net-next] net/smc: transition to RDMA core CQ pooling
Date: Mon, 2 Feb 2026 17:48:00 +0800
Message-ID: <20260202094800.30373-1-alibuda@linux.alibaba.com>
X-Mailer: git-send-email 2.45.0

The current SMC-R implementation relies on global per-device CQs and
manual polling within tasklets. This introduces severe scalability
bottlenecks due to global lock contention and tasklet scheduling
overhead, so performance degrades as concurrency increases.

Refactor the completion handling to use the ib_cqe API and the
standard RDMA core CQ pool. This transition provides several key
advantages:

1. Multi-CQ: Shift from a single shared per-device CQ to multiple
   link-specific CQs obtained from the CQ pool. Completion processing
   can then be parallelized across CPU cores, effectively eliminating
   the global CQ bottleneck.

2. Leverage DIM: Using the standard CQ pool with IB_POLL_SOFTIRQ
   enables Dynamic Interrupt Moderation (DIM) from the RDMA core,
   optimizing interrupt frequency and reducing CPU load under high
   pressure (see the pool-usage note after the '---' marker below).

3. O(1) Context Retrieval: Replace the expensive wr_id-based lookup
   logic (e.g. smc_wr_tx_find_pending_index()) with direct context
   retrieval using container_of() on the embedded ib_cqe, as sketched
   below.

4. Code Simplification: The refactoring removes ~150 lines of code,
   dropping redundant sequence tracking, complex lookup helpers and
   manual CQ management, which significantly improves maintainability.
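For reference, the completion path now follows the standard ib_cqe
pattern, roughly like the minimal sketch below (illustrative names,
not code taken from this patch):

	#include <rdma/ib_verbs.h>

	struct my_pend {			/* hypothetical pending-send context */
		struct ib_cqe cqe;		/* embedded completion entry */
		struct ib_send_wr wr;		/* the work request itself */
	};

	/* The RDMA core CQ poller invokes .done directly; the context is
	 * recovered in O(1) from wc->wr_cqe instead of a wr_id table scan.
	 */
	static void my_send_done(struct ib_cq *cq, struct ib_wc *wc)
	{
		struct my_pend *pend =
			container_of(wc->wr_cqe, struct my_pend, cqe);

		/* ... check wc->status, release pend ... */
	}

	static int my_post(struct ib_qp *qp, struct my_pend *pend)
	{
		pend->cqe.done = my_send_done;
		pend->wr.wr_cqe = &pend->cqe;	/* replaces wr.wr_id */
		return ib_post_send(qp, &pend->wr, NULL);
	}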
Performance Test: redis-benchmark with max 32 connections per QP.
Data format: Requests Per Second (RPS); the percentage in brackets
is the gain/loss relative to TCP.

| Clients | TCP    | SMC (original)    | SMC (cq_pool)  |
|---------|--------|-------------------|----------------|
| c = 1   | 24449  | 31172  (+27%)     | 34039  (+39%)  |
| c = 2   | 46420  | 53216  (+14%)     | 64391  (+38%)  |
| c = 16  | 159673 | 83668  (-48%) <-- | 216947 (+36%)  |
| c = 32  | 164956 | 97631  (-41%) <-- | 249376 (+51%)  |
| c = 64  | 166322 | 118192 (-29%) <-- | 249488 (+50%)  |
| c = 128 | 167700 | 121497 (-27%) <-- | 249480 (+48%)  |
| c = 256 | 175021 | 146109 (-16%) <-- | 240384 (+37%)  |
| c = 512 | 168987 | 101479 (-40%) <-- | 226634 (+34%)  |

The results demonstrate that this optimization effectively resolves
the scalability bottleneck: RPS increases by over 110% at c = 64
compared to the original implementation.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
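Note for reviewers (kept below the '---' so it is not applied with the
patch): the per-link CQ handling reduces to the standard RDMA core pool
pattern sketched here; error paths are trimmed and names are
illustrative, not taken from the patch:

	#include <rdma/ib_verbs.h>

	static struct ib_cq *link_cq_get(struct ib_device *ibdev,
					 unsigned int nr_cqe)
	{
		/* comp_vector hint -1 lets the core spread CQs across
		 * completion vectors; IB_POLL_SOFTIRQ polls completions
		 * in softirq context, where the core can apply DIM when
		 * the device supports it.
		 */
		return ib_cq_pool_get(ibdev, nr_cqe, -1, IB_POLL_SOFTIRQ);
	}

	static void link_cq_put(struct ib_cq *cq, unsigned int nr_cqe)
	{
		/* hand the entries back; the pooled CQ itself may be
		 * shared with and reused by other links
		 */
		ib_cq_pool_put(cq, nr_cqe);
	}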
 net/smc/smc_core.c |   8 +-
 net/smc/smc_core.h |  16 ++-
 net/smc/smc_ib.c   | 114 ++++++-------------
 net/smc/smc_ib.h   |   5 -
 net/smc/smc_tx.c   |   1 -
 net/smc/smc_wr.c   | 267 ++++++++++++++++-----------------------------
 net/smc/smc_wr.h   |  38 ++-----
 7 files changed, 150 insertions(+), 299 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 8aca5dc54be7..9590c8aed3dd 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -815,17 +815,11 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
-	lnk->wr_rx_id_compl = 0;
 	smc_ibdev_cnt_inc(lnk);
 	smcr_copy_dev_info_to_link(lnk);
 	atomic_set(&lnk->conn_cnt, 0);
 	smc_llc_link_set_uid(lnk);
 	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
-	if (!lnk->smcibdev->initialized) {
-		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
-		if (rc)
-			goto out;
-	}
 	get_random_bytes(rndvec, sizeof(rndvec));
 	lnk->psn_initial = rndvec[0] + (rndvec[1] << 8) +
			   (rndvec[2] << 16);
@@ -1373,7 +1367,7 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
 	smc_llc_link_clear(lnk, log);
 	smcr_buf_unmap_lgr(lnk);
 	smcr_rtoken_clear_link(lnk);
-	smc_ib_modify_qp_error(lnk);
+	smc_wr_drain_qp(lnk);
 	smc_wr_free_link(lnk);
 	smc_ib_destroy_queue_pair(lnk);
 	smc_ib_dealloc_protection_domain(lnk);
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 5c18f08a4c8a..00468b7a279f 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -92,6 +92,12 @@ struct smc_rdma_wr {				/* work requests per message
 	struct ib_rdma_wr	wr_tx_rdma[SMC_MAX_RDMA_WRITES];
 };
 
+struct smc_ib_recv_wr {
+	struct ib_cqe		cqe;
+	struct ib_recv_wr	wr;
+	int			index;
+};
+
 #define SMC_LGR_ID_SIZE		4
 
 struct smc_link {
@@ -100,6 +106,8 @@ struct smc_link {
 	struct ib_pd		*roce_pd;	/* IB protection domain,
 						 * unique for every RoCE QP
 						 */
+	int			nr_cqe;
+	struct ib_cq		*ib_cq;
 	struct ib_qp		*roce_qp;	/* IB queue pair */
 	struct ib_qp_attr	qp_attr;	/* IB queue pair attributes */
 
@@ -107,6 +115,7 @@ struct smc_link {
 	struct ib_send_wr	*wr_tx_ibs;	/* WR send meta data */
 	struct ib_sge		*wr_tx_sges;	/* WR send gather meta data */
 	struct smc_rdma_sges	*wr_tx_rdma_sges;/*RDMA WRITE gather meta data*/
+	struct ib_cqe		tx_rdma_cqe;	/* CQE RDMA WRITE */
 	struct smc_rdma_wr	*wr_tx_rdmas;	/* WR RDMA WRITE */
 	struct smc_wr_tx_pend	*wr_tx_pends;	/* WR send waiting for CQE */
 	struct completion	*wr_tx_compl;	/* WR send CQE completion */
@@ -116,7 +125,6 @@ struct smc_link {
 	struct smc_wr_tx_pend	*wr_tx_v2_pend;	/* WR send v2 waiting for CQE */
 	dma_addr_t		wr_tx_dma_addr;	/* DMA address of wr_tx_bufs */
 	dma_addr_t		wr_tx_v2_dma_addr; /* DMA address of v2 tx buf*/
-	atomic_long_t		wr_tx_id;	/* seq # of last sent WR */
 	unsigned long		*wr_tx_mask;	/* bit mask of used indexes */
 	u32			wr_tx_cnt;	/* number of WR send buffers */
 	wait_queue_head_t	wr_tx_wait;	/* wait for free WR send buf */
@@ -126,7 +134,7 @@ struct smc_link {
 	struct completion	tx_ref_comp;
 
 	u8			*wr_rx_bufs;	/* WR recv payload buffers */
-	struct ib_recv_wr	*wr_rx_ibs;	/* WR recv meta data */
+	struct smc_ib_recv_wr	*wr_rx_ibs;	/* WR recv meta data */
 	struct ib_sge		*wr_rx_sges;	/* WR recv scatter meta data */
 	/* above three vectors have wr_rx_cnt elements and use the same index */
 	int			wr_rx_sge_cnt;	/* rx sge, V1 is 1, V2 is either 2 or 1 */
@@ -135,13 +143,11 @@ struct smc_link {
 							 */
 	dma_addr_t		wr_rx_dma_addr;	/* DMA address of wr_rx_bufs */
 	dma_addr_t		wr_rx_v2_dma_addr; /* DMA address of v2 rx buf*/
-	u64			wr_rx_id;	/* seq # of last recv WR */
-	u64			wr_rx_id_compl; /* seq # of last completed WR */
 	u32			wr_rx_cnt;	/* number of WR recv buffers */
 	unsigned long		wr_rx_tstamp;	/* jiffies when last buf rx */
-	wait_queue_head_t	wr_rx_empty_wait; /* wait for RQ empty */
 
 	struct ib_reg_wr	wr_reg;		/* WR register memory region */
+	struct ib_cqe		wr_reg_cqe;
 	wait_queue_head_t	wr_reg_wait;	/* wait for wr_reg result */
 	struct {
 		struct percpu_ref	wr_reg_refs;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 67211d44a1db..77047ad7d452 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -112,15 +112,6 @@ int smc_ib_modify_qp_rts(struct smc_link *lnk)
 			    IB_QP_MAX_QP_RD_ATOMIC);
 }
 
-int smc_ib_modify_qp_error(struct smc_link *lnk)
-{
-	struct ib_qp_attr qp_attr;
-
-	memset(&qp_attr, 0, sizeof(qp_attr));
-	qp_attr.qp_state = IB_QPS_ERR;
-	return ib_modify_qp(lnk->roce_qp, &qp_attr, IB_QP_STATE);
-}
-
 int smc_ib_ready_link(struct smc_link *lnk)
 {
 	struct smc_link_group *lgr = smc_get_lgr(lnk);
@@ -134,10 +125,7 @@ int smc_ib_ready_link(struct smc_link *lnk)
 	if (rc)
 		goto out;
 	smc_wr_remember_qp_attr(lnk);
-	rc = ib_req_notify_cq(lnk->smcibdev->roce_cq_recv,
-			      IB_CQ_SOLICITED_MASK);
-	if (rc)
-		goto out;
+	rc = smc_wr_rx_post_init(lnk);
 	if (rc)
 		goto out;
 
@@ -658,38 +646,60 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 	if (lnk->roce_qp)
 		ib_destroy_qp(lnk->roce_qp);
 	lnk->roce_qp = NULL;
+	if (lnk->ib_cq) {
+		ib_cq_pool_put(lnk->ib_cq, lnk->nr_cqe);
+		lnk->ib_cq = NULL;
+	}
 }
 
 /* create a queue pair within the protection domain for a link */
 int smc_ib_create_queue_pair(struct smc_link *lnk)
 {
+	int max_send_wr, max_recv_wr, rc;
+	struct ib_cq *cq;
+
+	/* include unsolicited rdma_writes as well,
+	 * there are max. 2 RDMA_WRITE per 1 WR_SEND.
+	 * +1 for ib_drain_qp()
+	 */
+	max_send_wr = 3 * lnk->lgr->max_send_wr + 1;
+	max_recv_wr = lnk->lgr->max_recv_wr + 1;
+
+	cq = ib_cq_pool_get(lnk->smcibdev->ibdev, max_send_wr + max_recv_wr, -1,
+			    IB_POLL_SOFTIRQ);
+
+	if (IS_ERR(cq)) {
+		rc = PTR_ERR(cq);
+		return rc;
+	}
+
 	struct ib_qp_init_attr qp_attr = {
 		.event_handler = smc_ib_qp_event_handler,
 		.qp_context = lnk,
-		.send_cq = lnk->smcibdev->roce_cq_send,
-		.recv_cq = lnk->smcibdev->roce_cq_recv,
+		.send_cq = cq,
+		.recv_cq = cq,
 		.srq = NULL,
 		.cap = {
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
+			.max_send_wr = max_send_wr,
+			.max_recv_wr = max_recv_wr,
 			.max_inline_data = 0,
 		},
 		.sq_sig_type = IB_SIGNAL_REQ_WR,
 		.qp_type = IB_QPT_RC,
 	};
-	int rc;
 
-	/* include unsolicited rdma_writes as well,
-	 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-	 */
-	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
-	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
-	if (IS_ERR(lnk->roce_qp))
+	if (IS_ERR(lnk->roce_qp)) {
 		lnk->roce_qp = NULL;
-	else
+		ib_cq_pool_put(cq, max_send_wr + max_recv_wr);
+	} else {
 		smc_wr_remember_qp_attr(lnk);
+		lnk->nr_cqe = max_send_wr + max_recv_wr;
+		lnk->ib_cq = cq;
+	}
 	return rc;
 }
 
@@ -855,62 +865,6 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
 	buf_slot->sgt[lnk->link_idx].sgl->dma_address = 0;
 }
 
-long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
-{
-	struct ib_cq_init_attr cqattr = {
-		.cqe = SMC_MAX_CQE, .comp_vector = 0 };
-	int cqe_size_order, smc_order;
-	long rc;
-
-	mutex_lock(&smcibdev->mutex);
-	rc = 0;
-	if (smcibdev->initialized)
-		goto out;
-	/* the calculated number of cq entries fits to mlx5 cq allocation */
-	cqe_size_order = cache_line_size() == 128 ? 7 : 6;
-	smc_order = MAX_PAGE_ORDER - cqe_size_order;
-	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
-		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
-	smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
-					      smc_wr_tx_cq_handler, NULL,
-					      smcibdev, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_send);
-	if (IS_ERR(smcibdev->roce_cq_send)) {
-		smcibdev->roce_cq_send = NULL;
-		goto out;
-	}
-	smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
-					      smc_wr_rx_cq_handler, NULL,
-					      smcibdev, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_recv);
-	if (IS_ERR(smcibdev->roce_cq_recv)) {
-		smcibdev->roce_cq_recv = NULL;
-		goto err;
-	}
-	smc_wr_add_dev(smcibdev);
-	smcibdev->initialized = 1;
-	goto out;
-
-err:
-	ib_destroy_cq(smcibdev->roce_cq_send);
-out:
-	mutex_unlock(&smcibdev->mutex);
-	return rc;
-}
-
-static void smc_ib_cleanup_per_ibdev(struct smc_ib_device *smcibdev)
-{
-	mutex_lock(&smcibdev->mutex);
-	if (!smcibdev->initialized)
-		goto out;
-	smcibdev->initialized = 0;
-	ib_destroy_cq(smcibdev->roce_cq_recv);
-	ib_destroy_cq(smcibdev->roce_cq_send);
-	smc_wr_remove_dev(smcibdev);
-out:
-	mutex_unlock(&smcibdev->mutex);
-}
-
 static struct ib_client smc_ib_client;
 
 static void smc_copy_netdev_ifindex(struct smc_ib_device *smcibdev, int port)
@@ -969,7 +923,6 @@ static int smc_ib_add_dev(struct ib_device *ibdev)
 	INIT_WORK(&smcibdev->port_event_work, smc_ib_port_event_work);
 	atomic_set(&smcibdev->lnk_cnt, 0);
 	init_waitqueue_head(&smcibdev->lnks_deleted);
-	mutex_init(&smcibdev->mutex);
 	mutex_lock(&smc_ib_devices.mutex);
 	list_add_tail(&smcibdev->list, &smc_ib_devices.list);
 	mutex_unlock(&smc_ib_devices.mutex);
@@ -1018,7 +971,6 @@ static void smc_ib_remove_dev(struct ib_device *ibdev, void *client_data)
 	pr_warn_ratelimited("smc: removing ib device %s\n",
 			    smcibdev->ibdev->name);
 	smc_smcr_terminate_all(smcibdev);
-	smc_ib_cleanup_per_ibdev(smcibdev);
 	ib_unregister_event_handler(&smcibdev->event_handler);
 	cancel_work_sync(&smcibdev->port_event_work);
 	kfree(smcibdev);
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index ef8ac2b7546d..c5a0d773b73f 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -37,10 +37,6 @@ struct smc_ib_device {				/* ib-device infos for smc */
 	struct ib_device	*ibdev;
 	struct ib_port_attr	pattr[SMC_MAX_PORTS];	/* ib dev. port attrs */
 	struct ib_event_handler	event_handler;	/* global ib_event handler */
-	struct ib_cq		*roce_cq_send;	/* send completion queue */
-	struct ib_cq		*roce_cq_recv;	/* recv completion queue */
-	struct tasklet_struct	send_tasklet;	/* called by send cq handler */
-	struct tasklet_struct	recv_tasklet;	/* called by recv cq handler */
 	char			mac[SMC_MAX_PORTS][ETH_ALEN];
 						/* mac address per port*/
 	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
@@ -96,7 +92,6 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk);
 int smc_ib_create_queue_pair(struct smc_link *lnk);
 int smc_ib_ready_link(struct smc_link *lnk);
 int smc_ib_modify_qp_rts(struct smc_link *lnk);
-int smc_ib_modify_qp_error(struct smc_link *lnk);
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev);
 int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
 			     struct smc_buf_desc *buf_slot, u8 link_idx);
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 3144b4b1fe29..d301df9ed58b 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -321,7 +321,6 @@ static int smc_tx_rdma_write(struct smc_connection *conn, int peer_rmbe_offset,
 	struct smc_link *link = conn->lnk;
 	int rc;
 
-	rdma_wr->wr.wr_id = smc_wr_tx_get_next_wr_id(link);
 	rdma_wr->wr.num_sge = num_sges;
 	rdma_wr->remote_addr =
 		lgr->rtokens[conn->rtoken_idx][link->link_idx].dma_addr +
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 5feafa98ab1a..3a361aa020ab 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -38,7 +38,7 @@ static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
 static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
 
 struct smc_wr_tx_pend {	/* control data for a pending send request */
-	u64			wr_id;		/* work request id sent */
+	struct ib_cqe		cqe;
 	smc_wr_tx_handler	handler;
 	enum ib_wc_status	wc_status;	/* CQE status */
 	struct smc_link		*link;
@@ -63,62 +63,51 @@ void smc_wr_tx_wait_no_pending_sends(struct smc_link *link)
 	wait_event(link->wr_tx_wait, !smc_wr_is_tx_pend(link));
 }
 
-static inline int smc_wr_tx_find_pending_index(struct smc_link *link, u64 wr_id)
+static void smc_wr_tx_rdma_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
 {
-	u32 i;
+	struct smc_link *link = wc->qp->qp_context;
 
-	for (i = 0; i < link->wr_tx_cnt; i++) {
-		if (link->wr_tx_pends[i].wr_id == wr_id)
-			return i;
-	}
-	return link->wr_tx_cnt;
+	/* terminate link */
+	if (unlikely(wc->status))
+		smcr_link_down_cond_sched(link);
+}
+
+static void smc_wr_reg_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct smc_link *link = wc->qp->qp_context;
+
+	if (wc->status)
+		link->wr_reg_state = FAILED;
+	else
+		link->wr_reg_state = CONFIRMED;
+	smc_wr_wakeup_reg_wait(link);
 }
 
-static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
+static void smc_wr_tx_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct smc_wr_tx_pend pnd_snd;
+	struct smc_wr_tx_pend *tx_pend, pnd_snd;
 	struct smc_link *link;
 	u32 pnd_snd_idx;
 
 	link = wc->qp->qp_context;
 
-	if (wc->opcode == IB_WC_REG_MR) {
-		if (wc->status)
-			link->wr_reg_state = FAILED;
-		else
-			link->wr_reg_state = CONFIRMED;
-		smc_wr_wakeup_reg_wait(link);
-		return;
-	}
+	tx_pend = container_of(wc->wr_cqe, struct smc_wr_tx_pend, cqe);
+	pnd_snd_idx = tx_pend->idx;
+
+	tx_pend->wc_status = wc->status;
+	memcpy(&pnd_snd, tx_pend, sizeof(pnd_snd));
 
-	pnd_snd_idx = smc_wr_tx_find_pending_index(link, wc->wr_id);
 	if (pnd_snd_idx == link->wr_tx_cnt) {
-		if (link->lgr->smc_version != SMC_V2 ||
-		    link->wr_tx_v2_pend->wr_id != wc->wr_id)
-			return;
-		link->wr_tx_v2_pend->wc_status = wc->status;
-		memcpy(&pnd_snd, link->wr_tx_v2_pend, sizeof(pnd_snd));
-		/* clear the full struct smc_wr_tx_pend including .priv */
-		memset(link->wr_tx_v2_pend, 0,
-		       sizeof(*link->wr_tx_v2_pend));
-		memset(link->lgr->wr_tx_buf_v2, 0,
-		       sizeof(*link->lgr->wr_tx_buf_v2));
+		memset(link->lgr->wr_tx_buf_v2, 0, sizeof(*link->lgr->wr_tx_buf_v2));
 	} else {
-		link->wr_tx_pends[pnd_snd_idx].wc_status = wc->status;
-		if (link->wr_tx_pends[pnd_snd_idx].compl_requested)
+		if (tx_pend->compl_requested)
 			complete(&link->wr_tx_compl[pnd_snd_idx]);
-		memcpy(&pnd_snd, &link->wr_tx_pends[pnd_snd_idx],
-		       sizeof(pnd_snd));
-		/* clear the full struct smc_wr_tx_pend including .priv */
-		memset(&link->wr_tx_pends[pnd_snd_idx], 0,
-		       sizeof(link->wr_tx_pends[pnd_snd_idx]));
-		memset(&link->wr_tx_bufs[pnd_snd_idx], 0,
-		       sizeof(link->wr_tx_bufs[pnd_snd_idx]));
+		memset(&link->wr_tx_bufs[tx_pend->idx], 0, sizeof(link->wr_tx_bufs[tx_pend->idx]));
 		if (!test_and_clear_bit(pnd_snd_idx, link->wr_tx_mask))
 			return;
 	}
 
-	if (wc->status) {
+	if (unlikely(wc->status)) {
 		if (link->lgr->smc_version == SMC_V2) {
 			memset(link->wr_tx_v2_pend, 0,
 			       sizeof(*link->wr_tx_v2_pend));
@@ -128,44 +117,12 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
 		/* terminate link */
 		smcr_link_down_cond_sched(link);
 	}
+
 	if (pnd_snd.handler)
 		pnd_snd.handler(&pnd_snd.priv, link, wc->status);
 	wake_up(&link->wr_tx_wait);
 }
 
-static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
-{
-	struct smc_ib_device *dev = from_tasklet(dev, t, send_tasklet);
-	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
-	int i = 0, rc;
-	int polled = 0;
-
-again:
-	polled++;
-	do {
-		memset(&wc, 0, sizeof(wc));
-		rc = ib_poll_cq(dev->roce_cq_send, SMC_WR_MAX_POLL_CQE, wc);
-		if (polled == 1) {
-			ib_req_notify_cq(dev->roce_cq_send,
-					 IB_CQ_NEXT_COMP |
-					 IB_CQ_REPORT_MISSED_EVENTS);
-		}
-		if (!rc)
-			break;
-		for (i = 0; i < rc; i++)
-			smc_wr_tx_process_cqe(&wc[i]);
-	} while (rc > 0);
-	if (polled == 1)
-		goto again;
-}
-
-void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
-{
-	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
-
-	tasklet_schedule(&dev->send_tasklet);
-}
-
 /*---------------------------- request submission ---------------------------*/
 
 static inline int smc_wr_tx_get_free_slot_index(struct smc_link *link, u32 *idx)
@@ -202,7 +159,6 @@ int smc_wr_tx_get_free_slot(struct smc_link *link,
 	struct smc_wr_tx_pend *wr_pend;
 	u32 idx = link->wr_tx_cnt;
 	struct ib_send_wr *wr_ib;
-	u64 wr_id;
 	int rc;
 
 	*wr_buf = NULL;
@@ -226,14 +182,13 @@ int smc_wr_tx_get_free_slot(struct smc_link *link,
 		if (idx == link->wr_tx_cnt)
 			return -EPIPE;
 	}
-	wr_id = smc_wr_tx_get_next_wr_id(link);
+
 	wr_pend = &link->wr_tx_pends[idx];
-	wr_pend->wr_id = wr_id;
 	wr_pend->handler = handler;
 	wr_pend->link = link;
 	wr_pend->idx = idx;
 	wr_ib = &link->wr_tx_ibs[idx];
-	wr_ib->wr_id = wr_id;
+	wr_ib->wr_cqe = &wr_pend->cqe;
 	*wr_buf = &link->wr_tx_bufs[idx];
 	if (wr_rdma_buf)
 		*wr_rdma_buf = &link->wr_tx_rdmas[idx];
@@ -248,21 +203,18 @@ int smc_wr_tx_get_v2_slot(struct smc_link *link,
 {
 	struct smc_wr_tx_pend *wr_pend;
 	struct ib_send_wr *wr_ib;
-	u64 wr_id;
 
 	if (link->wr_tx_v2_pend->idx == link->wr_tx_cnt)
 		return -EBUSY;
 
 	*wr_buf = NULL;
 	*wr_pend_priv = NULL;
-	wr_id = smc_wr_tx_get_next_wr_id(link);
 	wr_pend = link->wr_tx_v2_pend;
-	wr_pend->wr_id = wr_id;
 	wr_pend->handler = handler;
 	wr_pend->link = link;
 	wr_pend->idx = link->wr_tx_cnt;
 	wr_ib = link->wr_tx_v2_ib;
-	wr_ib->wr_id = wr_id;
+	wr_ib->wr_cqe = &wr_pend->cqe;
 	*wr_buf = link->lgr->wr_tx_buf_v2;
 	*wr_pend_priv = &wr_pend->priv;
 	return 0;
@@ -306,8 +258,6 @@ int smc_wr_tx_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv)
 	struct smc_wr_tx_pend *pend;
 	int rc;
 
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
-			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	pend = container_of(priv, struct smc_wr_tx_pend, priv);
 	rc = ib_post_send(link->roce_qp, &link->wr_tx_ibs[pend->idx], NULL);
 	if (rc) {
@@ -323,8 +273,6 @@ int smc_wr_tx_v2_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv,
 	int rc;
 
 	link->wr_tx_v2_ib->sg_list[0].length = len;
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
-			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	rc = ib_post_send(link->roce_qp, link->wr_tx_v2_ib, NULL);
 	if (rc) {
 		smc_wr_tx_put_slot(link, priv);
@@ -367,10 +315,7 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr)
 {
 	int rc;
 
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
-			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	link->wr_reg_state = POSTED;
-	link->wr_reg.wr.wr_id = (u64)(uintptr_t)mr;
 	link->wr_reg.mr = mr;
 	link->wr_reg.key = mr->rkey;
 	rc = ib_post_send(link->roce_qp, &link->wr_reg.wr, NULL);
@@ -431,94 +376,76 @@ static inline void smc_wr_rx_demultiplex(struct ib_wc *wc)
 {
 	struct smc_link *link = (struct smc_link *)wc->qp->qp_context;
 	struct smc_wr_rx_handler *handler;
+	struct smc_ib_recv_wr *recv_wr;
 	struct smc_wr_rx_hdr *wr_rx;
-	u64 temp_wr_id;
-	u32 index;
 
 	if (wc->byte_len < sizeof(*wr_rx))
 		return; /* short message */
-	temp_wr_id = wc->wr_id;
-	index = do_div(temp_wr_id, link->wr_rx_cnt);
-	wr_rx = (struct smc_wr_rx_hdr *)(link->wr_rx_bufs + index * link->wr_rx_buflen);
+
+	recv_wr = container_of(wc->wr_cqe, struct smc_ib_recv_wr, cqe);
+
+	wr_rx = (struct smc_wr_rx_hdr *)(link->wr_rx_bufs + recv_wr->index * link->wr_rx_buflen);
 	hash_for_each_possible(smc_wr_rx_hash, handler, list, wr_rx->type) {
 		if (handler->type == wr_rx->type)
 			handler->handler(wc, wr_rx);
 	}
 }
 
-static inline void smc_wr_rx_process_cqes(struct ib_wc wc[], int num)
+static void smc_wr_rx_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
 {
-	struct smc_link *link;
-	int i;
-
-	for (i = 0; i < num; i++) {
-		link = wc[i].qp->qp_context;
-		link->wr_rx_id_compl = wc[i].wr_id;
-		if (wc[i].status == IB_WC_SUCCESS) {
-			link->wr_rx_tstamp = jiffies;
-			smc_wr_rx_demultiplex(&wc[i]);
-			smc_wr_rx_post(link); /* refill WR RX */
-		} else {
-			/* handle status errors */
-			switch (wc[i].status) {
-			case IB_WC_RETRY_EXC_ERR:
-			case IB_WC_RNR_RETRY_EXC_ERR:
-			case IB_WC_WR_FLUSH_ERR:
-				smcr_link_down_cond_sched(link);
-				if (link->wr_rx_id_compl == link->wr_rx_id)
-					wake_up(&link->wr_rx_empty_wait);
-				break;
-			default:
-				smc_wr_rx_post(link); /* refill WR RX */
-				break;
-			}
+	struct smc_link *link = wc->qp->qp_context;
+
+	if (wc->status == IB_WC_SUCCESS) {
+		link->wr_rx_tstamp = jiffies;
+		smc_wr_rx_demultiplex(wc);
+		smc_wr_rx_post(link, wc->wr_cqe); /* refill WR RX */
+	} else {
+		/* handle status errors */
+		switch (wc->status) {
+		case IB_WC_RETRY_EXC_ERR:
+		case IB_WC_RNR_RETRY_EXC_ERR:
+		case IB_WC_WR_FLUSH_ERR:
+			smcr_link_down_cond_sched(link);
+			break;
+		default:
+			smc_wr_rx_post(link, wc->wr_cqe); /* refill WR RX */
+			break;
 		}
 	}
 }
 
-static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
+int smc_wr_rx_post_init(struct smc_link *link)
 {
-	struct smc_ib_device *dev = from_tasklet(dev, t, recv_tasklet);
-	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
-	int polled = 0;
-	int rc;
+	u32 i;
+	int rc = 0;
 
-again:
-	polled++;
-	do {
-		memset(&wc, 0, sizeof(wc));
-		rc = ib_poll_cq(dev->roce_cq_recv, SMC_WR_MAX_POLL_CQE, wc);
-		if (polled == 1) {
-			ib_req_notify_cq(dev->roce_cq_recv,
-					 IB_CQ_SOLICITED_MASK
-					 | IB_CQ_REPORT_MISSED_EVENTS);
-		}
-		if (!rc)
-			break;
-		smc_wr_rx_process_cqes(&wc[0], rc);
-	} while (rc > 0);
-	if (polled == 1)
-		goto again;
+	for (i = 0; i < link->wr_rx_cnt; i++)
+		rc = smc_wr_rx_post(link, &link->wr_rx_ibs[i].cqe);
+	return rc;
 }
 
-void smc_wr_rx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
-{
-	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
+/***************************** init, exit, misc ******************************/
+
 
-	tasklet_schedule(&dev->recv_tasklet);
+static inline void smc_wr_reg_init_cqe(struct ib_cqe *cqe)
+{
+	cqe->done = smc_wr_reg_process_cqe;
 }
 
-int smc_wr_rx_post_init(struct smc_link *link)
+static inline void smc_wr_tx_init_cqe(struct ib_cqe *cqe)
 {
-	u32 i;
-	int rc = 0;
+	cqe->done = smc_wr_tx_process_cqe;
+}
 
-	for (i = 0; i < link->wr_rx_cnt; i++)
-		rc = smc_wr_rx_post(link);
-	return rc;
+static inline void smc_wr_rx_init_cqe(struct ib_cqe *cqe)
+{
+	cqe->done = smc_wr_rx_process_cqe;
 }
 
-/***************************** init, exit, misc ******************************/
+static inline void smc_wr_tx_rdma_init_cqe(struct ib_cqe *cqe)
+{
+	cqe->done = smc_wr_tx_rdma_process_cqe;
+}
 
 void smc_wr_remember_qp_attr(struct smc_link *lnk)
 {
@@ -548,9 +475,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    &init_attr);
 
 	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
-			       lnk->qp_attr.cap.max_send_wr);
+			       lnk->qp_attr.cap.max_send_wr - 1);
 	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
-			       lnk->qp_attr.cap.max_recv_wr);
+			       lnk->qp_attr.cap.max_recv_wr - 1);
 }
 
 static void smc_wr_init_sge(struct smc_link *lnk)
@@ -585,6 +512,8 @@ static void smc_wr_init_sge(struct smc_link *lnk)
 			lnk->wr_tx_rdma_sges[i].tx_rdma_sge[0].wr_tx_rdma_sge;
 		lnk->wr_tx_rdmas[i].wr_tx_rdma[1].wr.sg_list =
 			lnk->wr_tx_rdma_sges[i].tx_rdma_sge[1].wr_tx_rdma_sge;
+		lnk->wr_tx_rdmas[i].wr_tx_rdma[0].wr.wr_cqe = &lnk->tx_rdma_cqe;
+		lnk->wr_tx_rdmas[i].wr_tx_rdma[1].wr.wr_cqe = &lnk->tx_rdma_cqe;
 	}
 
 	if (lnk->lgr->smc_version == SMC_V2) {
@@ -622,10 +551,13 @@ static void smc_wr_init_sge(struct smc_link *lnk)
 			lnk->wr_rx_sges[x + 1].lkey =
 				lnk->roce_pd->local_dma_lkey;
 		}
-		lnk->wr_rx_ibs[i].next = NULL;
-		lnk->wr_rx_ibs[i].sg_list = &lnk->wr_rx_sges[x];
-		lnk->wr_rx_ibs[i].num_sge = lnk->wr_rx_sge_cnt;
+		lnk->wr_rx_ibs[i].wr.next = NULL;
+		lnk->wr_rx_ibs[i].wr.sg_list = &lnk->wr_rx_sges[x];
+		lnk->wr_rx_ibs[i].wr.num_sge = lnk->wr_rx_sge_cnt;
 	}
+
+	smc_wr_reg_init_cqe(&lnk->wr_reg_cqe);
+	lnk->wr_reg.wr.wr_cqe = &lnk->wr_reg_cqe;
 	lnk->wr_reg.wr.next = NULL;
 	lnk->wr_reg.wr.num_sge = 0;
 	lnk->wr_reg.wr.send_flags = IB_SEND_SIGNALED;
@@ -641,7 +573,6 @@ void smc_wr_free_link(struct smc_link *lnk)
 		return;
 	ibdev = lnk->smcibdev->ibdev;
 
-	smc_wr_drain_cq(lnk);
 	smc_wr_wakeup_reg_wait(lnk);
 	smc_wr_wakeup_tx_wait(lnk);
 
@@ -758,11 +689,19 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
+	/* init wr_rx_ibs cqe */
+	for (int i = 0; i < link->max_recv_wr; i++) {
+		smc_wr_rx_init_cqe(&link->wr_rx_ibs[i].cqe);
+		link->wr_rx_ibs[i].wr.wr_cqe = &link->wr_rx_ibs[i].cqe;
+		link->wr_rx_ibs[i].index = i;
+	}
 	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
+
+	smc_wr_tx_rdma_init_cqe(&link->tx_rdma_cqe);
 	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
@@ -785,6 +724,8 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 				     GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
+	for (int i = 0; i < link->max_send_wr; i++)
+		smc_wr_tx_init_cqe(&link->wr_tx_pends[i].cqe);
 	link->wr_tx_compl = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
@@ -804,6 +745,7 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 					      GFP_KERNEL);
 		if (!link->wr_tx_v2_pend)
 			goto no_mem_v2_sge;
+		smc_wr_tx_init_cqe(&link->wr_tx_v2_pend->cqe);
 	}
 	return 0;
 
@@ -837,18 +779,6 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 	return -ENOMEM;
 }
 
-void smc_wr_remove_dev(struct smc_ib_device *smcibdev)
-{
-	tasklet_kill(&smcibdev->recv_tasklet);
-	tasklet_kill(&smcibdev->send_tasklet);
-}
-
-void smc_wr_add_dev(struct smc_ib_device *smcibdev)
-{
-	tasklet_setup(&smcibdev->recv_tasklet, smc_wr_rx_tasklet_fn);
-	tasklet_setup(&smcibdev->send_tasklet, smc_wr_tx_tasklet_fn);
-}
-
 static void smcr_wr_tx_refs_free(struct percpu_ref *ref)
 {
 	struct smc_link *lnk = container_of(ref, struct smc_link, wr_tx_refs);
@@ -868,8 +798,6 @@ int smc_wr_create_link(struct smc_link *lnk)
 	struct ib_device *ibdev = lnk->smcibdev->ibdev;
 	int rc = 0;
 
-	smc_wr_tx_set_wr_id(&lnk->wr_tx_id, 0);
-	lnk->wr_rx_id = 0;
 	lnk->wr_rx_dma_addr = ib_dma_map_single(
 		ibdev, lnk->wr_rx_bufs, lnk->wr_rx_buflen * lnk->wr_rx_cnt,
 		DMA_FROM_DEVICE);
@@ -917,7 +845,6 @@ int smc_wr_create_link(struct smc_link *lnk)
 	if (rc)
 		goto cancel_ref;
 	init_completion(&lnk->reg_ref_comp);
-	init_waitqueue_head(&lnk->wr_rx_empty_wait);
 	return rc;
 
 cancel_ref:
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index aa4533af9122..4268dfcd84d3 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -44,19 +44,6 @@ struct smc_wr_rx_handler {
 	u8			type;
 };
 
-/* Only used by RDMA write WRs.
- * All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly
- */
-static inline long smc_wr_tx_get_next_wr_id(struct smc_link *link)
-{
-	return atomic_long_inc_return(&link->wr_tx_id);
-}
-
-static inline void smc_wr_tx_set_wr_id(atomic_long_t *wr_tx_id, long val)
-{
-	atomic_long_set(wr_tx_id, val);
-}
-
 static inline bool smc_wr_tx_link_hold(struct smc_link *link)
 {
 	if (!smc_link_sendable(link))
@@ -70,9 +57,10 @@ static inline void smc_wr_tx_link_put(struct smc_link *link)
 	percpu_ref_put(&link->wr_tx_refs);
 }
 
-static inline void smc_wr_drain_cq(struct smc_link *lnk)
+static inline void smc_wr_drain_qp(struct smc_link *lnk)
 {
-	wait_event(lnk->wr_rx_empty_wait, lnk->wr_rx_id_compl == lnk->wr_rx_id);
+	if (lnk->qp_attr.cur_qp_state != IB_QPS_RESET)
+		ib_drain_qp(lnk->roce_qp);
 }
 
 static inline void smc_wr_wakeup_tx_wait(struct smc_link *lnk)
@@ -86,18 +74,12 @@ static inline void smc_wr_wakeup_reg_wait(struct smc_link *lnk)
 }
 
 /* post a new receive work request to fill a completed old work request entry */
-static inline int smc_wr_rx_post(struct smc_link *link)
+static inline int smc_wr_rx_post(struct smc_link *link, struct ib_cqe *cqe)
 {
-	int rc;
-	u64 wr_id, temp_wr_id;
-	u32 index;
-
-	wr_id = ++link->wr_rx_id; /* tasklet context, thus not atomic */
-	temp_wr_id = wr_id;
-	index = do_div(temp_wr_id, link->wr_rx_cnt);
-	link->wr_rx_ibs[index].wr_id = wr_id;
-	rc = ib_post_recv(link->roce_qp, &link->wr_rx_ibs[index], NULL);
-	return rc;
+	struct smc_ib_recv_wr *recv_wr;
+
+	recv_wr = container_of(cqe, struct smc_ib_recv_wr, cqe);
+	return ib_post_recv(link->roce_qp, &recv_wr->wr, NULL);
 }
 
 int smc_wr_create_link(struct smc_link *lnk);
@@ -107,8 +89,6 @@ void smc_wr_free_link(struct smc_link *lnk);
 void smc_wr_free_link_mem(struct smc_link *lnk);
 void smc_wr_free_lgr_mem(struct smc_link_group *lgr);
 void smc_wr_remember_qp_attr(struct smc_link *lnk);
-void smc_wr_remove_dev(struct smc_ib_device *smcibdev);
-void smc_wr_add_dev(struct smc_ib_device *smcibdev);
 
 int smc_wr_tx_get_free_slot(struct smc_link *link, smc_wr_tx_handler handler,
 			    struct smc_wr_buf **wr_buf,
@@ -126,12 +106,10 @@ int smc_wr_tx_v2_send(struct smc_link *link,
 		      struct smc_wr_tx_pend_priv *priv, int len);
 int smc_wr_tx_send_wait(struct smc_link *link, struct smc_wr_tx_pend_priv *priv,
 			unsigned long timeout);
-void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context);
 void smc_wr_tx_wait_no_pending_sends(struct smc_link *link);
 
 int smc_wr_rx_register_handler(struct smc_wr_rx_handler *handler);
 int smc_wr_rx_post_init(struct smc_link *link);
-void smc_wr_rx_cq_handler(struct ib_cq *ib_cq, void *cq_context);
 int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr);
 
 #endif /* SMC_WR_H */
-- 
2.45.0