Skip to content

Commit 3ca864d

Browse files
fdmananakdave
authored andcommitted
btrfs: fix a race between renames and directory logging
We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: 259c4b9 ("btrfs: stop doing unnecessary log updates during a rename") CC: [email protected] # 5.18+ Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent 65d5112 commit 3ca864d

File tree

1 file changed

+64
-17
lines changed

1 file changed

+64
-17
lines changed

fs/btrfs/inode.c

Lines changed: 64 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -8059,6 +8059,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
80598059
int ret;
80608060
int ret2;
80618061
bool need_abort = false;
8062+
bool logs_pinned = false;
80628063
struct fscrypt_name old_fname, new_fname;
80638064
struct fscrypt_str *old_name, *new_name;
80648065

@@ -8182,6 +8183,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
81828183
inode_inc_iversion(new_inode);
81838184
simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
81848185

8186+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID &&
8187+
new_ino != BTRFS_FIRST_FREE_OBJECTID) {
8188+
/*
8189+
* If we are renaming in the same directory (and it's not for
8190+
* root entries) pin the log early to prevent any concurrent
8191+
* task from logging the directory after we removed the old
8192+
* entries and before we add the new entries, otherwise that
8193+
* task can sync a log without any entry for the inodes we are
8194+
* renaming and therefore replaying that log, if a power failure
8195+
* happens after syncing the log, would result in deleting the
8196+
* inodes.
8197+
*
8198+
* If the rename affects two different directories, we want to
8199+
* make sure the that there's no log commit that contains
8200+
* updates for only one of the directories but not for the
8201+
* other.
8202+
*
8203+
* If we are renaming an entry for a root, we don't care about
8204+
* log updates since we called btrfs_set_log_full_commit().
8205+
*/
8206+
btrfs_pin_log_trans(root);
8207+
btrfs_pin_log_trans(dest);
8208+
logs_pinned = true;
8209+
}
8210+
81858211
if (old_dentry->d_parent != new_dentry->d_parent) {
81868212
btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
81878213
BTRFS_I(old_inode), true);
@@ -8253,30 +8279,23 @@ static int btrfs_rename_exchange(struct inode *old_dir,
82538279
BTRFS_I(new_inode)->dir_index = new_idx;
82548280

82558281
/*
8256-
* Now pin the logs of the roots. We do it to ensure that no other task
8257-
* can sync the logs while we are in progress with the rename, because
8258-
* that could result in an inconsistency in case any of the inodes that
8259-
* are part of this rename operation were logged before.
8282+
* Do the log updates for all inodes.
8283+
*
8284+
* If either entry is for a root we don't need to update the logs since
8285+
* we've called btrfs_set_log_full_commit() before.
82608286
*/
8261-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8262-
btrfs_pin_log_trans(root);
8263-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
8264-
btrfs_pin_log_trans(dest);
8265-
8266-
/* Do the log updates for all inodes. */
8267-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8287+
if (logs_pinned) {
82688288
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
82698289
old_rename_ctx.index, new_dentry->d_parent);
8270-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
82718290
btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
82728291
new_rename_ctx.index, old_dentry->d_parent);
8292+
}
82738293

8274-
/* Now unpin the logs. */
8275-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8294+
out_fail:
8295+
if (logs_pinned) {
82768296
btrfs_end_log_trans(root);
8277-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
82788297
btrfs_end_log_trans(dest);
8279-
out_fail:
8298+
}
82808299
ret2 = btrfs_end_transaction(trans);
82818300
ret = ret ? ret : ret2;
82828301
out_notrans:
@@ -8326,6 +8345,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
83268345
int ret2;
83278346
u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
83288347
struct fscrypt_name old_fname, new_fname;
8348+
bool logs_pinned = false;
83298349

83308350
if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
83318351
return -EPERM;
@@ -8460,6 +8480,29 @@ static int btrfs_rename(struct mnt_idmap *idmap,
84608480
inode_inc_iversion(old_inode);
84618481
simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
84628482

8483+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
8484+
/*
8485+
* If we are renaming in the same directory (and it's not a
8486+
* root entry) pin the log to prevent any concurrent task from
8487+
* logging the directory after we removed the old entry and
8488+
* before we add the new entry, otherwise that task can sync
8489+
* a log without any entry for the inode we are renaming and
8490+
* therefore replaying that log, if a power failure happens
8491+
* after syncing the log, would result in deleting the inode.
8492+
*
8493+
* If the rename affects two different directories, we want to
8494+
* make sure the that there's no log commit that contains
8495+
* updates for only one of the directories but not for the
8496+
* other.
8497+
*
8498+
* If we are renaming an entry for a root, we don't care about
8499+
* log updates since we called btrfs_set_log_full_commit().
8500+
*/
8501+
btrfs_pin_log_trans(root);
8502+
btrfs_pin_log_trans(dest);
8503+
logs_pinned = true;
8504+
}
8505+
84638506
if (old_dentry->d_parent != new_dentry->d_parent)
84648507
btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
84658508
BTRFS_I(old_inode), true);
@@ -8524,7 +8567,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
85248567
if (old_inode->i_nlink == 1)
85258568
BTRFS_I(old_inode)->dir_index = index;
85268569

8527-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8570+
if (logs_pinned)
85288571
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
85298572
rename_ctx.index, new_dentry->d_parent);
85308573

@@ -8540,6 +8583,10 @@ static int btrfs_rename(struct mnt_idmap *idmap,
85408583
}
85418584
}
85428585
out_fail:
8586+
if (logs_pinned) {
8587+
btrfs_end_log_trans(root);
8588+
btrfs_end_log_trans(dest);
8589+
}
85438590
ret2 = btrfs_end_transaction(trans);
85448591
ret = ret ? ret : ret2;
85458592
out_notrans:

0 commit comments

Comments
 (0)