From: Hyunchul Lee
To: Namjae Jeon
Cc: Hyunchul Lee, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, cheol.lee@lge.com
Subject: [PATCH] ntfs: skip extent mft records in writeback to prevent deadlock
Date: Thu, 21 May 2026 14:37:03 +0900
Message-ID: <20260521053703.1850487-1-hyc.lee@gmail.com>

This patch fixes the ABBA deadlock between extent_lock and extent
mrec_lock triggered by xfstests generic/113, which has occurred since
commit 6994acf33bae ("ntfs: use base mft_no when looking up base inode
for extent record").

Path A (inode writeback):
  VFS writeback
    -> ntfs_write_inode()
      -> __ntfs_write_inode()
        -> mutex_lock(&ni->extent_lock)
        -> mutex_lock(&tni->mrec_lock)

Path B (MFT folio writeback):
  VFS writeback of $MFT dirty folios
    -> ntfs_mft_writepages()
      -> ntfs_write_mft_block()
        -> ntfs_may_write_mft_record()
          -> holds one extent mrec_lock from a previous iteration
          -> tries to acquire another base inode extent_lock

By removing all extent_lock and extent mrec_lock acquisition from the
MFT folio writeback path, the ABBA lock ordering is eliminated:

  Path A: __ntfs_write_inode(): extent_lock -> mrec_lock
  Path B (removed): ntfs_write_mft_block(): mrec_lock -> extent_lock

Path B is always redundant for extent records because:

1. mark_mft_record_dirty(ext_ni) does NOT dirty the MFT folio. It only
   sets NInoDirty(ext_ni) and marks the base VFS inode dirty via
   __mark_inode_dirty(I_DIRTY_DATASYNC), which triggers Path A.
   Therefore, normal extent modifications never create a situation
   where the MFT folio is dirty and Path B is not scheduled.

2. The MFT folio only gets dirtied via ntfs_mft_mark_dirty() inside
   ntfs_mft_record_alloc(). But all identified callers in attrib.c
   (ntfs_attr_add, ntfs_attr_record_move_away,
   ntfs_attr_make_non_resident, ntfs_attr_record_resize) follow through
   with mark_mft_record_dirty(), which triggers Path A to write the
   complete record.

3. ntfs_evict_big_inode() calls ntfs_commit_inode() before freeing
   extent inodes, ensuring all dirty extents are flushed via Path A
   before the base inode leaves the icache.

Signed-off-by: Hyunchul Lee
---
 fs/ntfs/mft.c | 129 ++------------------------------------------------
 1 file changed, 4 insertions(+), 125 deletions(-)

diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index a7d10ee41b34..a5019e80951b 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -743,23 +743,6 @@ static int ntfs_test_inode_wb(struct inode *vi, u64 ino, void *data)
  *
  * If the mft record is not a FILE record or it is a base mft record, we can
  * safely write it and return 'true'.
- *
- * We now know the mft record is an extent mft record. We check if the inode
- * corresponding to its base mft record is in icache. If it is not, we cannot
- * safely determine the state of the extent inode, so we return 'false'.
- *
- * We now have the base inode for the extent mft record. We check if it has an
- * ntfs inode for the extent mft record attached. If not, it is safe to write
- * the extent mft record and we return 'true'.
- *
- * If the extent inode is attached, we check if it is dirty. If so, we return
- * 'false' (letting the standard write_inode path handle it).
- *
- * If it is not dirty, we attempt to lock the extent mft record. If the lock
- * was already taken, it is not safe to write and we return 'false'.
- *
- * If we manage to obtain the lock we have exclusive access to the extent mft
- * record. We set @locked_ni to the now locked ntfs inode and return 'true'.
  */
 static bool ntfs_may_write_mft_record(struct ntfs_volume *vol, const u64 mft_no,
 		const struct mft_record *m, struct ntfs_inode **locked_ni,
@@ -768,8 +751,7 @@ static bool ntfs_may_write_mft_record(struct ntfs_volume *vol, const u64 mft_no,
 	struct super_block *sb = vol->sb;
 	struct inode *mft_vi = vol->mft_ino;
 	struct inode *vi;
-	struct ntfs_inode *ni, *eni, **extent_nis;
-	int i;
+	struct ntfs_inode *ni;
 	struct ntfs_attr na = {0};
 
 	ntfs_debug("Entering for inode 0x%llx.", mft_no);
@@ -849,100 +831,10 @@ static bool ntfs_may_write_mft_record(struct ntfs_volume *vol, const u64 mft_no,
 				mft_no);
 		return true;
 	}
-	/*
-	 * This is an extent mft record. Check if the inode corresponding to
-	 * its base mft record is in icache and obtain a reference to it if it
-	 * is.
-	 */
-	na.mft_no = MREF_LE(m->base_mft_record);
-	na.state = 0;
-	ntfs_debug("Mft record 0x%llx is an extent record. Looking for base inode 0x%llx in icache.",
-			mft_no, na.mft_no);
-	if (!na.mft_no) {
-		/* Balance the below iput(). */
-		vi = igrab(mft_vi);
-		WARN_ON(vi != mft_vi);
-	} else {
-		vi = find_inode_nowait(sb, na.mft_no, ntfs_test_inode_wb, &na);
-		if (na.state == NI_BeingDeleted || na.state == NI_BeingCreated)
-			return false;
-	}
 
-	if (!vi)
-		return false;
-	ntfs_debug("Base inode 0x%llx is in icache.", na.mft_no);
-	/*
-	 * The base inode is in icache. Check if it has the extent inode
-	 * corresponding to this extent mft record attached.
-	 */
-	ni = NTFS_I(vi);
-	mutex_lock(&ni->extent_lock);
-	if (ni->nr_extents <= 0) {
-		/*
-		 * The base inode has no attached extent inodes, write this
-		 * extent mft record.
-		 */
-		mutex_unlock(&ni->extent_lock);
-		*ref_vi = vi;
-		ntfs_debug("Base inode 0x%llx has no attached extent inodes, write the extent record.",
-				na.mft_no);
-		return true;
-	}
-	/* Iterate over the attached extent inodes. */
-	extent_nis = ni->ext.extent_ntfs_inos;
-	for (eni = NULL, i = 0; i < ni->nr_extents; ++i) {
-		if (mft_no == extent_nis[i]->mft_no) {
-			/*
-			 * Found the extent inode corresponding to this extent
-			 * mft record.
-			 */
-			eni = extent_nis[i];
-			break;
-		}
-	}
-	/*
-	 * If the extent inode was not attached to the base inode, write this
-	 * extent mft record.
-	 */
-	if (!eni) {
-		mutex_unlock(&ni->extent_lock);
-		*ref_vi = vi;
-		ntfs_debug("Extent inode 0x%llx is not attached to its base inode 0x%llx, write the extent record.",
-				mft_no, na.mft_no);
-		return true;
-	}
-	ntfs_debug("Extent inode 0x%llx is attached to its base inode 0x%llx.",
-			mft_no, na.mft_no);
-	/* Take a reference to the extent ntfs inode. */
-	atomic_inc(&eni->count);
-	mutex_unlock(&ni->extent_lock);
-
-	/* if extent inode is dirty, write_inode will write it */
-	if (NInoDirty(eni)) {
-		atomic_dec(&eni->count);
-		*ref_vi = vi;
-		return false;
-	}
-
-	/*
-	 * Found the extent inode coresponding to this extent mft record.
-	 * Try to take the mft record lock.
-	 */
-	if (unlikely(!mutex_trylock(&eni->mrec_lock))) {
-		atomic_dec(&eni->count);
-		*ref_vi = vi;
-		ntfs_debug("Extent mft record 0x%llx is already locked, do not write it.",
-				mft_no);
-		return false;
-	}
-	ntfs_debug("Managed to lock extent mft record 0x%llx, write it.",
-			mft_no);
-	/*
-	 * The write has to occur while we hold the mft record lock so return
-	 * the locked extent ntfs inode.
-	 */
-	*locked_ni = eni;
-	return true;
+	ntfs_debug("Mft record 0x%llx is an extent record, skip it.",
+			mft_no);
+	return false;
 }
 
 static const char *es = " Leaving inconsistent metadata. Unmount and run chkdsk.";
@@ -2791,19 +2683,6 @@ static int ntfs_write_mft_block(struct folio *folio, struct writeback_control *w
 		unsigned int mft_record_off = 0;
 		s64 vcn_off = vcn;
 
-		/*
-		 * Skip $MFT extent mft records and let them being written
-		 * by writeback to avioid deadlocks. the $MFT runlist
-		 * lock must be taken before $MFT extent mrec_lock is taken.
-		 */
-		if (tni && tni->nr_extents < 0 &&
-				tni->ext.base_ntfs_ino == NTFS_I(vol->mft_ino)) {
-			mutex_unlock(&tni->mrec_lock);
-			atomic_dec(&tni->count);
-			iput(vol->mft_ino);
-			continue;
-		}
-
 		/*
 		 * The record should be written. If a locked ntfs
 		 * inode was returned, add it to the array of locked
-- 
2.43.0