From: "liubo (AW)"
To: "akpm@linux-foundation.org", "linux-mm@kvack.org", "linux-kernel@vger.kernel.org"
CC: "ying.huang@intel.com", "willy@infradead.org", "vbabka@suse.cz", "surenb@google.com", "peterx@redhat.com", "neilb@suse.de", "naoya.horiguchi@nec.com", "minchan@kernel.org", linmiaohe, Louhongxiang, linfeilong
Subject: Re: [PATCH] mm/swapfile: release swap info when swap device is unpluged
Date: Mon, 30 May 2022 13:00:24 +0000
Message-ID: <80d54ab2864e4011a9f5e5b198ccfe8e@huawei.com>
References: <20220528084941.28391-1-liubo254@huawei.com>
In-Reply-To: <20220528084941.28391-1-liubo254@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Friendly ping.

-----Original Message-----
From: liubo (AW)
Sent: May 28, 2022 16:50
To: akpm@linux-foundation.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org
Cc: ying.huang@intel.com; willy@infradead.org; vbabka@suse.cz; surenb@google.com; peterx@redhat.com; neilb@suse.de; naoya.horiguchi@nec.com; minchan@kernel.org; linmiaohe; Louhongxiang <louhongxiang@huawei.com>; linfeilong; liubo (AW)
Subject: [PATCH] mm/swapfile: release swap info when swap device is unpluged

When a swap partition is enabled with the swapon command, the kernel
creates a swap_info_struct, initializes it, and saves it in the global
swap_info array.

When the swap partition is no longer in use, it is disabled with the
swapoff command.

However, if the disk is pulled out after swapon, swapoff on that disk
fails, and the swap_info_struct remains in the kernel and cannot be
cleared.

This patch identifies swap devices that are no longer available by
traversing swap_active_head in the swapon and swapoff paths, so that
the stale data structures can be cleared and the corresponding
resources released.

Example:

[root@localhost ~]# swapon -s
[root@localhost ~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    0  1.1T  0 disk
├─sda1        8:1    0  600M  0 part /boot/efi
├─sda2        8:2    0    1G  0 part /boot
└─sda3        8:3    0  1.1T  0 part
  ├─root    253:0    0   70G  0 lvm  /
  ├─swap    253:1    0    4G  0 lvm
  └─home    253:2    0    1T  0 lvm  /home
nvme0n1     259:0    0  3.6T  0 disk
└─nvme0n1p1 259:5    0   60G  0 part
[root@localhost ~]# swapon /dev/nvme0n1p1
[root@localhost ~]# swapon -s
Filename                Type            Size            Used    Priority
/dev/nvme0n1p1          partition       62914556        0       -2
[root@localhost ~]# echo 1 > /sys/bus/pci/devices/0000:d8:00.0/remove
[root@localhost ~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    0  1.1T  0 disk
├─sda1        8:1    0  600M  0 part /boot/efi
├─sda2        8:2    0    1G  0 part /boot
└─sda3        8:3    0  1.1T  0 part
  ├─root    253:0    0   70G  0 lvm  /
  ├─swap    253:1    0    4G  0 lvm
  └─home    253:2    0    1T  0 lvm  /home
[root@localhost ~]# swapon -s
Filename                Type            Size            Used    Priority
/dev/nvme0n1p1          partition       62914556        0       -2
[root@localhost ~]# swapoff /dev/nvme0n1p1
swapoff: /dev/nvme0n1p1: swapoff failed: No such file or directory
[root@localhost ~]# swapoff -a
[root@localhost ~]# swapon -s
Filename                Type            Size            Used    Priority
/dev/nvme0n1p1          partition       62914556        0       -2

The swapoff syscall opens the device as shown below. Because the device
has already been unplugged, opening "victim" fails and the syscall
returns an error immediately, so the invalid swap_info_struct is never
released.

	pathname = getname(specialfile);
	if (IS_ERR(pathname))
		return PTR_ERR(pathname);

	victim = file_open_name(pathname, O_RDWR|O_LARGEFILE, 0);
	err = PTR_ERR(victim);
	if (IS_ERR(victim))
		goto out;

To solve this, the swapoff and swapon paths now traverse
swap_active_head, find every swap_info_struct whose disk partition has
been unplugged, and release its resources.

The check for unavailable swap areas is also added to swapon because
swapoff is run by the user, so its timing cannot be controlled. The
system supports swapon on multiple disks, so unavailable swap areas can
be removed whenever a new one is enabled.

To reuse the swapoff resource-release code, some of the operations are
split out into separate functions.

del_useless_swap_info():
Remove a specific swap_info_struct from swap_active_head and update
total_swap_pages.

release_swap_info_memory():
Release the resources held by a swap_info_struct.

swapoff_invalid_swapinfo():
Traverse swap_active_head and release the resources of invalid swap
areas.

Signed-off-by: liubo
---
 mm/swapfile.c | 262 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 182 insertions(+), 80 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a2e66d855b19..8d2e75891ff4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -68,7 +68,7 @@ static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
 static const char Bad_offset[] = "Bad swap offset entry ";
 static const char Unused_offset[] = "Unused swap offset entry ";
-
+static const char invalid_info[] = "deleted";
 /*
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
@@ -2384,18 +2384,184 @@ bool has_usable_swap(void)
 	return ret;
 }
 
-SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
+static void release_swap_info_memory(struct swap_info_struct *p)
 {
-	struct swap_info_struct *p = NULL;
 	unsigned char *swap_map;
 	struct swap_cluster_info *cluster_info;
 	unsigned long *frontswap_map;
-	struct file *swap_file, *victim;
+	struct file *swap_file;
 	struct address_space *mapping;
 	struct inode *inode;
+	unsigned int old_block_size;
+
+	mutex_lock(&swapon_mutex);
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	drain_mmlist();
+
+	/* wait for anyone still in scan_swap_map */
+	p->highest_bit = 0;		/* cuts scans short */
+	while (p->flags >= SWP_SCANNING) {
+		spin_unlock(&p->lock);
+		spin_unlock(&swap_lock);
+		schedule_timeout_uninterruptible(1);
+		spin_lock(&swap_lock);
+		spin_lock(&p->lock);
+	}
+
+	swap_file = p->swap_file;
+	mapping = p->swap_file->f_mapping;
+	old_block_size = p->old_block_size;
+	p->swap_file = NULL;
+	p->max = 0;
+	swap_map = p->swap_map;
+	p->swap_map = NULL;
+	cluster_info = p->cluster_info;
+	p->cluster_info = NULL;
+	frontswap_map = frontswap_map_get(p);
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	arch_swap_invalidate_area(p->type);
+	frontswap_invalidate_area(p->type);
+	frontswap_map_set(p, NULL);
+	mutex_unlock(&swapon_mutex);
+	free_percpu(p->percpu_cluster);
+	p->percpu_cluster = NULL;
+	free_percpu(p->cluster_next_cpu);
+	p->cluster_next_cpu = NULL;
+	vfree(swap_map);
+	kvfree(cluster_info);
+	kvfree(frontswap_map);
+	/* Destroy swap account information */
+	swap_cgroup_swapoff(p->type);
+	exit_swap_address_space(p->type);
+
+	inode = mapping->host;
+	if (S_ISBLK(inode->i_mode)) {
+		struct block_device *bdev = I_BDEV(inode);
+
+		set_blocksize(bdev, old_block_size);
+		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+	}
+
+	inode_lock(inode);
+	inode->i_flags &= ~S_SWAPFILE;
+	inode_unlock(inode);
+	filp_close(swap_file, NULL);
+}
+
+static void del_useless_swap_info(struct swap_info_struct *p)
+{
+	del_from_avail_list(p);
+	spin_lock(&p->lock);
+	if (p->prio < 0) {
+		struct swap_info_struct *si = p;
+		int nid;
+
+		plist_for_each_entry_continue(si, &swap_active_head, list) {
+			si->prio++;
+			si->list.prio--;
+			for_each_node(nid) {
+				if (si->avail_lists[nid].prio != 1)
+					si->avail_lists[nid].prio--;
+			}
+		}
+		least_priority++;
+	}
+	plist_del(&p->list, &swap_active_head);
+	atomic_long_sub(p->pages, &nr_swap_pages);
+	total_swap_pages -= p->pages;
+	p->flags &= ~SWP_WRITEOK;
+	spin_unlock(&p->lock);
+}
+
+static int swapoff_invalid_swapinfo(void)
+{
+	struct swap_info_struct *p = NULL;
+	struct file *swap_file;
+	int err, found = 0;
+
+	char *tmp = NULL;
+	char *swap_name = NULL;
+
+	tmp = kvzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+rescan:
+	memset(tmp, 0, PAGE_SIZE);
+	spin_lock(&swap_lock);
+	plist_for_each_entry(p, &swap_active_head, list) {
+		if (p->flags & SWP_WRITEOK) {
+			swap_file = p->swap_file;
+			swap_name = d_path(&swap_file->f_path, tmp, PAGE_SIZE);
+
+			if (strstr(swap_name, invalid_info)) {
+				found = 1;
+				break;
+			}
+		}
+	}
+
+	if (!found) {
+		err = 0;
+		spin_unlock(&swap_lock);
+		goto out;
+	}
+
+	del_useless_swap_info(p);
+	spin_unlock(&swap_lock);
+
+	disable_swap_slots_cache_lock();
+	set_current_oom_origin();
+	try_to_unuse(p->type);
+	clear_current_oom_origin();
+
+	reenable_swap_slots_cache_unlock();
+
+	/*
+	 * wait for swap operations protected by get/put_swap_device()
+	 * to complete
+	 */
+	synchronize_rcu();
+
+	flush_work(&p->discard_work);
+
+	destroy_swap_extents(p);
+	if (p->flags & SWP_CONTINUED)
+		free_swap_count_continuations(p);
+
+	if (!p->bdev || !blk_queue_nonrot(bdev_get_queue(p->bdev)))
+		atomic_dec(&nr_rotate_swap);
+
+	release_swap_info_memory(p);
+
+	/*
+	 * Clear the SWP_USED flag after all resources are freed so that swapon
+	 * can reuse this swap_info in alloc_swap_info() safely. It is ok to
+	 * not hold p->lock after we cleared its SWP_WRITEOK.
+	 */
+	spin_lock(&swap_lock);
+	p->flags = 0;
+	spin_unlock(&swap_lock);
+
+	err = 0;
+	atomic_inc(&proc_poll_event);
+	wake_up_interruptible(&proc_poll_wait);
+
+	found = 0;
+	goto rescan;
+out:
+	kfree(tmp);
+	return err;
+}
+
+SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
+{
+	struct swap_info_struct *p = NULL;
+	struct file *victim;
+	struct address_space *mapping;
 	struct filename *pathname;
 	int err, found = 0;
-	unsigned int old_block_size;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -2408,8 +2574,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	victim = file_open_name(pathname, O_RDWR|O_LARGEFILE, 0);
 	err = PTR_ERR(victim);
-	if (IS_ERR(victim))
+	if (IS_ERR(victim)) {
+		/* check if the pathname is a device that has been unpluged */
+		err = swapoff_invalid_swapinfo();
+		err = err < 0 ? err : PTR_ERR(victim);
 		goto out;
+	}
 
 	mapping = victim->f_mapping;
 	spin_lock(&swap_lock);
@@ -2433,27 +2603,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	del_from_avail_list(p);
-	spin_lock(&p->lock);
-	if (p->prio < 0) {
-		struct swap_info_struct *si = p;
-		int nid;
 
-		plist_for_each_entry_continue(si, &swap_active_head, list) {
-			si->prio++;
-			si->list.prio--;
-			for_each_node(nid) {
-				if (si->avail_lists[nid].prio != 1)
-					si->avail_lists[nid].prio--;
-			}
-		}
-		least_priority++;
-	}
-	plist_del(&p->list, &swap_active_head);
-	atomic_long_sub(p->pages, &nr_swap_pages);
-	total_swap_pages -= p->pages;
-	p->flags &= ~SWP_WRITEOK;
-	spin_unlock(&p->lock);
+	del_useless_swap_info(p);
 	spin_unlock(&swap_lock);
 
 	disable_swap_slots_cache_lock();
@@ -2491,60 +2642,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (!p->bdev || !bdev_nonrot(p->bdev))
 		atomic_dec(&nr_rotate_swap);
 
-	mutex_lock(&swapon_mutex);
-	spin_lock(&swap_lock);
-	spin_lock(&p->lock);
-	drain_mmlist();
-
-	/* wait for anyone still in scan_swap_map_slots */
-	p->highest_bit = 0;		/* cuts scans short */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
-	swap_file = p->swap_file;
-	old_block_size = p->old_block_size;
-	p->swap_file = NULL;
-	p->max = 0;
-	swap_map = p->swap_map;
-	p->swap_map = NULL;
-	cluster_info = p->cluster_info;
-	p->cluster_info = NULL;
-	frontswap_map = frontswap_map_get(p);
-	spin_unlock(&p->lock);
-	spin_unlock(&swap_lock);
-	arch_swap_invalidate_area(p->type);
-	frontswap_invalidate_area(p->type);
-	frontswap_map_set(p, NULL);
-	mutex_unlock(&swapon_mutex);
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
-	free_percpu(p->cluster_next_cpu);
-	p->cluster_next_cpu = NULL;
-	vfree(swap_map);
-	kvfree(cluster_info);
-	kvfree(frontswap_map);
-	/* Destroy swap account information */
-	swap_cgroup_swapoff(p->type);
-	exit_swap_address_space(p->type);
-
-	inode = mapping->host;
-	if (S_ISBLK(inode->i_mode)) {
-		struct block_device *bdev = I_BDEV(inode);
-
-		set_blocksize(bdev, old_block_size);
-		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
-	}
-
-	inode_lock(inode);
-	inode->i_flags &= ~S_SWAPFILE;
-	inode_unlock(inode);
-	filp_close(swap_file, NULL);
-
+	release_swap_info_memory(p);
 	/*
 	 * Clear the SWP_USED flag after all resources are freed so that swapon
 	 * can reuse this swap_info in alloc_swap_info() safely. It is ok to
@@ -3008,6 +3106,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (!swap_avail_heads)
 		return -ENOMEM;
 
+	error = swapoff_invalid_swapinfo();
+	if (error < 0)
+		return error;
+
 	p = alloc_swap_info();
 	if (IS_ERR(p))
 		return PTR_ERR(p);
-- 
2.27.0
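
A note on the detection step in swapoff_invalid_swapinfo(): it relies on
d_path() appending " (deleted)" to the path of an unlinked dentry and
matches that with strstr(swap_name, "deleted"). A substring test like
this would also match a live swap file whose path merely contains
"deleted". The following is a minimal userspace sketch of a stricter
suffix comparison. The helper name path_has_deleted_suffix() and the
DELETED_SUFFIX macro are hypothetical and not part of the patch, and the
sketch only illustrates the string check, not the locking or any kernel
API.

```c
#include <stdbool.h>
#include <string.h>

/* Marker that d_path() appends to the path of an unlinked dentry. */
#define DELETED_SUFFIX " (deleted)"

/*
 * Hypothetical helper (not part of the patch): return true only when
 * `path` ends with " (deleted)", instead of merely containing the
 * substring "deleted" somewhere in the name.
 */
static bool path_has_deleted_suffix(const char *path)
{
	size_t len = strlen(path);
	size_t suffix_len = strlen(DELETED_SUFFIX);

	if (len < suffix_len)
		return false;

	/* Compare only the tail of the path against the marker. */
	return strcmp(path + len - suffix_len, DELETED_SUFFIX) == 0;
}
```

With this check, "/dev/nvme0n1p1 (deleted)" is reported as removed,
while a path such as "/mnt/deleted_files/swapfile" is not, even though
strstr() would find "deleted" in it.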