Hello everyone!

Recently, while running some pressure tests on MySQL, I noticed that
a "corrupted data in log event" error would occasionally be reported.
After analyzing the error, I found that an extending DIO write and a
buffered read were racing, resulting in the zero-filled end of a page
being read. Since the ext4 buffered read doesn't hold the inode lock,
and there is no field in the page to indicate the valid data size, it
seems to me that it is impossible to solve this problem perfectly
without changing these two things.
In this series, the first patch reads the inode size twice and takes the
smaller of the two values as the copyout limit, to avoid copying data
that was not actually read (0-padding) into the user buffer and causing
data corruption. This greatly reduces the probability of the problem
with 4k pages; however, it is still easily triggered with 64k pages.
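A minimal sketch of that idea (not the actual hunk; the names follow the
current mainline filemap_read() and the placement is illustrative):

```
	/*
	 * Sketch only: re-read i_size right before copying and clamp the
	 * copyout to the smaller of the two values, so that a racing
	 * extending write cannot make us copy bytes the read never fetched.
	 */
	isize = i_size_read(inode);			/* first read */
	if (unlikely(iocb->ki_pos >= isize))
		goto put_folios;
	end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);

	...

	/* second read, just before copy_folio_to_iter() */
	isize = min_t(loff_t, isize, i_size_read(inode));
	end_offset = min_t(loff_t, end_offset, isize);
```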
The second patch waits for the existing DIO write to complete and
invalidates the stale page cache before performing a new buffered read
in ext4, avoiding data corruption caused by copying stale page cache to
the user buffer. This makes the problem much less likely to trigger
with 64k pages.
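And a similarly hedged sketch of the second patch's idea in
fs/ext4/file.c (the locking and the invalidation range here are
simplifications, not the actual diff):

```
/*
 * Sketch only: before a buffered read, wait for in-flight DIO writes to
 * finish and drop page cache that may have gone stale, so the read does
 * not copy stale pages to the user buffer.
 */
static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (!(iocb->ki_flags & IOCB_DIRECT) &&
	    atomic_read(&inode->i_dio_count)) {
		inode_lock_shared(inode);
		inode_dio_wait(inode);	/* wait for extending DIO writes */
		invalidate_mapping_pages(inode->i_mapping, 0, -1);
		inode_unlock_shared(inode);
	}

	return generic_file_read_iter(iocb, to);
}
```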
Do we have a plan to add a lock to the ext4 buffered read path, or a
field in the page that indicates the size of the valid data in the page?
Or does anyone have a better idea?

Comments and questions are, as always, welcome.
Baokun Li (2):
mm: avoid data corruption when extending DIO write race with buffered
read
ext4: avoid data corruption when extending DIO write race with
buffered read
fs/ext4/file.c | 3 +++
mm/filemap.c | 5 +++--
2 files changed, 6 insertions(+), 2 deletions(-)
--
2.31.1
Hello!

On Sat 02-12-23 17:14:30, Baokun Li wrote:
> Recently, while running some pressure tests on MYSQL, noticed that
> occasionally a "corrupted data in log event" error would be reported.
> After analyzing the error, I found that extending DIO write and buffered
> read were competing, resulting in some zero-filled page end being read.
> Since ext4 buffered read doesn't hold an inode lock, and there is no
> field in the page to indicate the valid data size, it seems to me that
> it is impossible to solve this problem perfectly without changing these
> two things.

Yes, combining buffered reads with direct IO writes is a recipe for
problems and pretty much in the "don't do it" territory. So honestly I'd
consider this a MYSQL bug. Were you able to identify why does MYSQL use
buffered read in this case? It is just something specific to the test
you're doing?

> In this series, the first patch reads the inode size twice, and takes the
> smaller of the two values as the copyout limit to avoid copying data that
> was not actually read (0-padding) into the user buffer and causing data
> corruption. This greatly reduces the probability of problems under 4k
> page. However, the problem is still easily triggered under 64k page.
>
> The second patch waits for the existing dio write to complete and
> invalidate the stale page cache before performing a new buffered read
> in ext4, avoiding data corruption by copying the stale page cache to
> the user buffer. This makes it much less likely that the problem will
> be triggered in a 64k page.
>
> Do we have a plan to add a lock to the ext4 buffered read or a field in
> the page that indicates the size of the valid data in the page? Or does
> anyone have a better idea?

No, there are no plans to address this AFAIK. Because such locking will
slow down all the well behaved applications to fix a corner case for
application doing unsupported things. Sure we must not crash the kernel,
corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
combines buffered and direct IO but returning zeros instead of valid data
is in my opinion fully within the range of acceptable behavior for such
case.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2023/12/4 20:11, Jan Kara wrote: > Hello! Thank you for your reply! > > On Sat 02-12-23 17:14:30, Baokun Li wrote: >> Recently, while running some pressure tests on MYSQL, noticed that >> occasionally a "corrupted data in log event" error would be reported. >> After analyzing the error, I found that extending DIO write and buffered >> read were competing, resulting in some zero-filled page end being read. >> Since ext4 buffered read doesn't hold an inode lock, and there is no >> field in the page to indicate the valid data size, it seems to me that >> it is impossible to solve this problem perfectly without changing these >> two things. > Yes, combining buffered reads with direct IO writes is a recipe for > problems and pretty much in the "don't do it" territory. So honestly I'd > consider this a MYSQL bug. Were you able to identify why does MYSQL use > buffered read in this case? It is just something specific to the test > you're doing? The problem is with a one-master-twoslave MYSQL database with three physical machines, and using sysbench pressure testing on each of the three machines, the problem occurs about once every two to three hours. The problem is with the relay log file, and when the problem occurs, the middle dozens of bytes of the file are read as all zeros, while the data on disk is not. This is a journal-like file where a write process gets the data from the master node and writes it locally, and another replay process reads the file and performs the replay operation accordingly (some SQL statements). The problem is that when replaying, it finds that the data read is corrupted, not valid SQL data, while the data on disk is normal. It's not confirmed that buffered reads vs direct IO writes is actually causing this issue, but this is the only scenario that we can reproduce with our local simplified scripts. Also, after merging in patch 1, the MYSQL pressure test scenario has now been tested for 5 days and has not been reproduced. I'll double-check the problem scenario, although buffered reads with buffered writes doesn't seem to have this problem. >> In this series, the first patch reads the inode size twice, and takes the >> smaller of the two values as the copyout limit to avoid copying data that >> was not actually read (0-padding) into the user buffer and causing data >> corruption. This greatly reduces the probability of problems under 4k >> page. However, the problem is still easily triggered under 64k page. >> >> The second patch waits for the existing dio write to complete and >> invalidate the stale page cache before performing a new buffered read >> in ext4, avoiding data corruption by copying the stale page cache to >> the user buffer. This makes it much less likely that the problem will >> be triggered in a 64k page. >> >> Do we have a plan to add a lock to the ext4 buffered read or a field in >> the page that indicates the size of the valid data in the page? Or does >> anyone have a better idea? > No, there are no plans to address this AFAIK. Because such locking will > slow down all the well behaved applications to fix a corner case for > application doing unsupported things. Sure we must not crash the kernel, > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > combines buffered and direct IO but returning zeros instead of valid data > is in my opinion fully within the range of acceptable behavior for such > case. > > Honza I also feel that a scenario like buffered reads + DIO writes is strange. 
But theoretically when read doesn't return an error, the data read shouldn't be wrong. And I tested that xfs guarantees data consistency in this scenario, which is why I thought it might be buggy. Thanks! -- With Best Regards, Baokun Li .
On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: > On 2023/12/4 20:11, Jan Kara wrote: > > Hello! > Thank you for your reply! > > > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > Recently, while running some pressure tests on MYSQL, noticed that > > > occasionally a "corrupted data in log event" error would be reported. > > > After analyzing the error, I found that extending DIO write and buffered > > > read were competing, resulting in some zero-filled page end being read. > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > field in the page to indicate the valid data size, it seems to me that > > > it is impossible to solve this problem perfectly without changing these > > > two things. > > Yes, combining buffered reads with direct IO writes is a recipe for > > problems and pretty much in the "don't do it" territory. So honestly I'd > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > buffered read in this case? It is just something specific to the test > > you're doing? > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, the > middle dozens of bytes of the file are read as all zeros, while the data on > disk is not. This is a journal-like file where a write process gets the data > from > the master node and writes it locally, and another replay process reads the > file and performs the replay operation accordingly (some SQL statements). > The problem is that when replaying, it finds that the data read is > corrupted, > not valid SQL data, while the data on disk is normal. > > It's not confirmed that buffered reads vs direct IO writes is actually > causing > this issue, but this is the only scenario that we can reproduce with our > local > simplified scripts. Also, after merging in patch 1, the MYSQL pressure test > scenario has now been tested for 5 days and has not been reproduced. Mixing overlapping buffered read with direct writes - especially partial block extending DIO writes - is a recipe for data corruption. It's not a matter of if, it's a matter of when. Fundamentally, when you have overlapping write IO involving DIO, the result of the overlapping IOs is undefined. One cannot control submission order, the order that the overlapping IO hit the media, or completion ordering that might clear flags like unwritten extents. The only guarantee that we give in this case is that we won't expose stale data from the disk to the user read. As such, it is the responsibility of the application to avoid overlapping IOs when doing direct IO. The fact that you['ve observed data corruption due to overlapping IO ranges from the one application indicates that this is, indeed, an application level problem and can only be solved by fixing the application.... > I'll double-check the problem scenario, although buffered reads with > buffered > writes doesn't seem to have this problem. > > > In this series, the first patch reads the inode size twice, and takes the > > > smaller of the two values as the copyout limit to avoid copying data that > > > was not actually read (0-padding) into the user buffer and causing data > > > corruption. This greatly reduces the probability of problems under 4k > > > page. However, the problem is still easily triggered under 64k page. 
> > > > > > The second patch waits for the existing dio write to complete and > > > invalidate the stale page cache before performing a new buffered read > > > in ext4, avoiding data corruption by copying the stale page cache to > > > the user buffer. This makes it much less likely that the problem will > > > be triggered in a 64k page. > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > the page that indicates the size of the valid data in the page? Or does > > > anyone have a better idea? > > No, there are no plans to address this AFAIK. Because such locking will > > slow down all the well behaved applications to fix a corner case for > > application doing unsupported things. Sure we must not crash the kernel, > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > combines buffered and direct IO but returning zeros instead of valid data > > is in my opinion fully within the range of acceptable behavior for such > > case. > > > > Honza > I also feel that a scenario like buffered reads + DIO writes is strange. But > theoretically when read doesn't return an error, the data read shouldn't be > wrong. The data that was read isn't wrong - it just wasn't what the application expected. > And I tested that xfs guarantees data consistency in this scenario, which is > why I > thought it might be buggy. XFS certainly does not guarantee data consistency between buffered reads and DIO writes, especially for overlapping IO ranges using DIO (see above). IOWs, the fact that the data inconsistency doesn't reproduce on XFS doesn't mean that XFS is providing some guarantee of consistency for this IO pattern. All it means is that ext4 and XFS behave differently in an situation where the operational result is indeterminate and all we guarantee is that we won't expose stale data read from disk.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
On 2023/12/6 16:35, Dave Chinner wrote:
> On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote:
>> On 2023/12/4 20:11, Jan Kara wrote:
>>> Hello!
>> Thank you for your reply!
>>> On Sat 02-12-23 17:14:30, Baokun Li wrote:
>>>> Recently, while running some pressure tests on MYSQL, noticed that
>>>> occasionally a "corrupted data in log event" error would be reported.
>>>> After analyzing the error, I found that extending DIO write and buffered
>>>> read were competing, resulting in some zero-filled page end being read.
>>>> Since ext4 buffered read doesn't hold an inode lock, and there is no
>>>> field in the page to indicate the valid data size, it seems to me that
>>>> it is impossible to solve this problem perfectly without changing these
>>>> two things.
>>> Yes, combining buffered reads with direct IO writes is a recipe for
>>> problems and pretty much in the "don't do it" territory. So honestly I'd
>>> consider this a MYSQL bug. Were you able to identify why does MYSQL use
>>> buffered read in this case? It is just something specific to the test
>>> you're doing?
>> The problem is with a one-master-twoslave MYSQL database with three
>> physical machines, and using sysbench pressure testing on each of the
>> three machines, the problem occurs about once every two to three hours.
>>
>> The problem is with the relay log file, and when the problem occurs, the
>> middle dozens of bytes of the file are read as all zeros, while the data
>> on disk is not. This is a journal-like file where a write process gets
>> the data from the master node and writes it locally, and another replay
>> process reads the file and performs the replay operation accordingly
>> (some SQL statements). The problem is that when replaying, it finds that
>> the data read is corrupted, not valid SQL data, while the data on disk
>> is normal.
>>
>> It's not confirmed that buffered reads vs direct IO writes is actually
>> causing this issue, but this is the only scenario that we can reproduce
>> with our local simplified scripts. Also, after merging in patch 1, the
>> MYSQL pressure test scenario has now been tested for 5 days and has not
>> been reproduced.
> Mixing overlapping buffered read with direct writes - especially partial block
> extending DIO writes - is a recipe for data corruption. It's not a
> matter of if, it's a matter of when.
>
> Fundamentally, when you have overlapping write IO involving DIO, the
> result of the overlapping IOs is undefined. One cannot control
> submission order, the order that the overlapping IO hit the
> media, or completion ordering that might clear flags like unwritten
> extents. The only guarantee that we give in this case is that we
> won't expose stale data from the disk to the user read.
>
> As such, it is the responsibility of the application to avoid
> overlapping IOs when doing direct IO. The fact that you've observed
> data corruption due to overlapping IO ranges from the one
> application indicates that this is, indeed, an application level
> problem and can only be solved by fixing the application....
OK, thank you very much for your detailed explanation!
>> I'll double-check the problem scenario, although buffered reads with
>> buffered writes doesn't seem to have this problem.
>>>> In this series, the first patch reads the inode size twice, and takes the
>>>> smaller of the two values as the copyout limit to avoid copying data that
>>>> was not actually read (0-padding) into the user buffer and causing data
>>>> corruption. This greatly reduces the probability of problems under 4k
>>>> page. However, the problem is still easily triggered under 64k page.
>>>>
>>>> The second patch waits for the existing dio write to complete and
>>>> invalidate the stale page cache before performing a new buffered read
>>>> in ext4, avoiding data corruption by copying the stale page cache to
>>>> the user buffer. This makes it much less likely that the problem will
>>>> be triggered in a 64k page.
>>>>
>>>> Do we have a plan to add a lock to the ext4 buffered read or a field in
>>>> the page that indicates the size of the valid data in the page? Or does
>>>> anyone have a better idea?
>>> No, there are no plans to address this AFAIK. Because such locking will
>>> slow down all the well behaved applications to fix a corner case for
>>> application doing unsupported things. Sure we must not crash the kernel,
>>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
>>> combines buffered and direct IO but returning zeros instead of valid data
>>> is in my opinion fully within the range of acceptable behavior for such
>>> case.
>>>
>>> Honza
>> I also feel that a scenario like buffered reads + DIO writes is strange. But
>> theoretically when read doesn't return an error, the data read shouldn't be
>> wrong.
> The data that was read isn't wrong - it just wasn't what the
> application expected.
>
>> And I tested that xfs guarantees data consistency in this scenario,
>> which is why I thought it might be buggy.
> XFS certainly does not guarantee data consistency between buffered
> reads and DIO writes, especially for overlapping IO ranges using DIO
> (see above).
>
> IOWs, the fact that the data inconsistency doesn't reproduce on XFS
> doesn't mean that XFS is providing some guarantee of consistency for
> this IO pattern. All it means is that ext4 and XFS behave differently
> in a situation where the operational result is indeterminate and all
> we guarantee is that we won't expose stale data read from disk....
>
> Cheers,
>
> Dave.
Here are my test scripts: put all three shell scripts in the same
directory, change the disk and dir variables, then run `sh test-io.sh`.
The inconsistency is reproduced under ext4 in just one round, and after
switching the filesystem to xfs, the problem no longer reproduces. So it
looks like xfs guarantees consistency in this scenario. But you're right,
the fact that a test case can't trigger a problem doesn't mean the
problem doesn't exist.
Thank you very much for the correction.
------ test-write.sh
disk="/dev/sda"
dir="/tmp/test"

write_t()
{
	pos=0
	touch $dir/writefile
	while [ $pos -lt 16777216 ]
	do
		xfs_io -c "open -d $dir/writefile" \
		       -c "pwrite -q -b 1024 -i $dir/testfile -s $pos $pos 1024" &> /dev/null
		pos=`ls -l $dir/writefile | awk '{print $5}'`
	done
}
write_t
------------------------------------------------------------
------ test-read.sh
disk="/dev/sda"
dir="/tmp/test"

read_t()
{
	pos=0
	touch $dir/readfile
	while [ $pos -lt 16777216 ]
	do
		xfs_io -c "open -d $dir/readfile" \
		       -c "pwrite -q -b 65536 -i $dir/writefile -s $pos $pos 65536" &> /dev/null
		pos=`ls -l $dir/readfile | awk '{print $5}'`
	done
}
read_t
------------------------------------------------------------
------ test-io.sh
disk="/dev/sda"
dir="/tmp/test"

mkdir -p $dir
umount $dir
mkfs.ext4 -F $disk
#mkfs.xfs -f -m crc=1,rmapbt=0,reflink=0 $disk
mount $disk $dir

xfs_io -f -c "pwrite -S 0xff 0 16777216" $dir/testfile
sync

round=0
while true
do
	echo "******* round $round ********"
	touch $dir/writefile
	taskset 0x00000001 sh test-read.sh &
	pid_read=$!
	taskset 0x0000000f sh test-write.sh &
	pid_write=$!
	wait $pid_read
	wait $pid_write
	sync
	md_raw=`md5sum $dir/testfile | awk '{print $1}'`
	md_write=`md5sum $dir/writefile | awk '{print $1}'`
	md_read=`md5sum $dir/readfile | awk '{print $1}'`
	if [ $md_write != $md_raw ]
	then
		echo "write error csum"
		echo "raw csum:   $md_raw"
		echo "write csum: $md_write"
		exit 1
	elif [ $md_read != $md_write ]
	then
		echo "read error csum"
		echo "write csum: $md_write"
		echo "read csum:  $md_read"
		exit 1
	fi
	rm -f $dir/writefile
	rm -f $dir/readfile
	sync
	round=$(($round+1))
done
------------------------------------------------------------
Thanks!
--
With Best Regards,
Baokun Li
.
On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote:
> Mixing overlapping buffered read with direct writes - especially partial block
> extending DIO writes - is a recipe for data corruption. It's not a
> matter of if, it's a matter of when.
>
> Fundamentally, when you have overlapping write IO involving DIO, the
> result of the overlapping IOs is undefined. One cannot control
> submission order, the order that the overlapping IO hit the
> media, or completion ordering that might clear flags like unwritten
> extents. The only guarantee that we give in this case is that we
> won't expose stale data from the disk to the user read.

Btw, one thing we could do to kill these races forever is to track if
there are any buffered openers for an inode and just fall back to
buffered I/O for that case. With that and an inode_dio_wait for when
opening for buffered I/O we'd avoid the races and various crazy
workarounds entirely. nfs and ocfs2 do (or did, I haven't checked for a
while) something like that.
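If it helps to visualize the idea, a purely hypothetical sketch (the
i_buffered_openers counter does not exist in ext4 today, and the hook
points are assumptions):

```
/*
 * Hypothetical sketch, not existing kernel code: i_buffered_openers is an
 * invented per-inode counter, incremented in ->open() for non-O_DIRECT
 * file descriptions and decremented in ->release(); the open side would
 * also need the inode_dio_wait() mentioned above. Dispatch is simplified.
 */
static ssize_t ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_DIRECT) {
		if (atomic_read(&EXT4_I(inode)->i_buffered_openers)) {
			/* buffered users present: don't bypass the page cache */
			iocb->ki_flags &= ~IOCB_DIRECT;
			return ext4_buffered_write_iter(iocb, from);
		}
		return ext4_dio_write_iter(iocb, from);
	}

	return ext4_buffered_write_iter(iocb, from);
}
```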
On Wed, Dec 06, 2023 at 01:02:43AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote:
> > Mixing overlapping buffered read with direct writes - especially partial block
> > extending DIO writes - is a recipe for data corruption. It's not a
> > matter of if, it's a matter of when.
> >
> > Fundamentally, when you have overlapping write IO involving DIO, the
> > result of the overlapping IOs is undefined. One cannot control
> > submission order, the order that the overlapping IO hit the
> > media, or completion ordering that might clear flags like unwritten
> > extents. The only guarantee that we give in this case is that we
> > won't expose stale data from the disk to the user read.
>
> Btw, one thing we could do to kill these races forever is to track if
> there are any buffered openers for an inode and just fall back to
> buffered I/O for that case. With that and an inode_dio_wait for when
> opening for buffered I/O we'd avoid the races and various crazy
> workarounds entirely.

That's basically what Solaris did 20-25 years ago. The inode held a flag
that indicated what IO was being done, and if the "buffered" flag was set
(either through mmap() based access or buffered read/write syscalls) then
direct IO would also do buffered IO until the flag was cleared and the
cache cleaned and invalidated.

That had .... problems.

Largely they were performance problems - unpredictable IO latency and CPU
overhead for IO meant applications would randomly miss SLAs. The
application would see IO suddenly lose all concurrency, go real slow
and/or burn lots more CPU when the inode switched to buffered mode.

I'm not sure that's a particularly viable model given the raw IO
throughput even cheap modern SSDs largely exceeds the capability of
buffered IO through the page cache. The differences in concurrency,
latency and throughput between buffered and DIO modes will be even more
stark today than they were 20 years ago....

-Dave.
--
Dave Chinner
david@fromorbit.com
On Wed, Dec 06, 2023 at 09:34:49PM +1100, Dave Chinner wrote:
> Largely they were performance problems - unpredictable IO latency
> and CPU overhead for IO meant applications would randomly miss SLAs.
> The application would see IO suddenly lose all concurrency, go real
> slow and/or burn lots more CPU when the inode switched to buffered
> mode.
>
> I'm not sure that's a particularly viable model given the raw IO
> throughput even cheap modern SSDs largely exceeds the capability of
> buffered IO through the page cache. The differences in concurrency,
> latency and throughput between buffered and DIO modes will be even
> more stark today than they were 20 years ago....

The question is what's worse: random performance drops or random
corruption. I suspect the former is less bad, especially if we have
good tracepoints to pin it down.
On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, > the middle dozens of bytes of the file are read as all zeros, while > the data on disk is not. This is a journal-like file where a write > process gets the data from the master node and writes it locally, > and another replay process reads the file and performs the replay > operation accordingly (some SQL statements). The problem is that > when replaying, it finds that the data read is corrupted, not valid > SQL data, while the data on disk is normal. You mentioned "scripts" --- are these locally developped scripts by any chance? The procedure suggested in a few places that I looked up don't involve needing to read the replay log. For example from[1]: On the master server: root@repl-master:~# mysql -uroot -p; mysql> CREATE USER ‘slave’@’12.34.56.789‘ IDENTIFIED BY ‘SLAVE_PASSWORD‘; mysql> GRANT REPLICATION SLAVE ON . TO ‘slave’@’12.34.56.222 ‘; mysql> FLUSH PRIVILEGES; mysql> FLUSH TABLES WITH READ LOCK; This will make the master server read-only, with all pending writes flushed out (so you don't need to worry about the replay log), and then you move the data from the master to slave: root@repl-master:~# mysqldump -u root -p –all-databases –master-data > data.sql root@repl-master:~# scp data.sql root@12.34.56.222 Then on the slave: root@repl-slave:~# mysql -uroot -p < data.sql root@repl-slave:~# mysql -uroot -p; mysql> STOP SLAVE; ... and then on the master: root@repl-master:~# mysql -uroot -p; mysql> UNLOCK TABLES; ... and back on the slave: root@repl-slave:~# mysql -uroot -p; mysql> START SLAVE; [1] https://hevodata.com/learn/mysql-master-slave-replication/ ... or you could buy the product advertised at [1] which is easier for the database administrators, but results in $$$ flowing to the Hevo company. :-) In any case, I'm pretty sure that the official documented way of setting up a failover replication setup doesn't involve buffered reads of the replay file. It is certainly the case that mysqldump uses buffered reads, but that's why you have to temporary make the database read-only using "FLUSH TABLES WITH READ LOCK" before taking a database snapshot, and then re-enable database updates the "UNLOCK TABLES" SQL commands. Cheers, - Ted
On 2023/12/5 12:17, Theodore Ts'o wrote: > On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, >> the middle dozens of bytes of the file are read as all zeros, while >> the data on disk is not. This is a journal-like file where a write >> process gets the data from the master node and writes it locally, >> and another replay process reads the file and performs the replay >> operation accordingly (some SQL statements). The problem is that >> when replaying, it finds that the data read is corrupted, not valid >> SQL data, while the data on disk is normal. > You mentioned "scripts" --- are these locally developped scripts by > any chance? This refers to the sql commands to be replayed in the relay log file. I don't know much about this file, but you can read the official documentation. https://dev.mysql.com/doc/refman/8.0/en/replica-logs-relaylog.html > The procedure suggested in a few places that I looked up > don't involve needing to read the replay log. For example from[1]: > > On the master server: > > root@repl-master:~# mysql -uroot -p; > mysql> CREATE USER ‘slave’@’12.34.56.789‘ IDENTIFIED BY ‘SLAVE_PASSWORD‘; > mysql> GRANT REPLICATION SLAVE ON . TO ‘slave’@’12.34.56.222 ‘; > mysql> FLUSH PRIVILEGES; > mysql> FLUSH TABLES WITH READ LOCK; > > This will make the master server read-only, with all pending writes > flushed out (so you don't need to worry about the replay log), and > then you move the data from the master to slave: > > root@repl-master:~# mysqldump -u root -p –all-databases –master-data > data.sql > root@repl-master:~# scp data.sql root@12.34.56.222 > > Then on the slave: > > root@repl-slave:~# mysql -uroot -p < data.sql > root@repl-slave:~# mysql -uroot -p; > mysql> STOP SLAVE; > > ... and then on the master: > > root@repl-master:~# mysql -uroot -p; > mysql> UNLOCK TABLES; > > ... and back on the slave: > > root@repl-slave:~# mysql -uroot -p; > mysql> START SLAVE; > > [1] https://hevodata.com/learn/mysql-master-slave-replication/ > > ... or you could buy the product advertised at [1] which is easier for > the database administrators, but results in $$$ flowing to the Hevo > company. :-) > > In any case, I'm pretty sure that the official documented way of > setting up a failover replication setup doesn't involve buffered reads > of the replay file. > > It is certainly the case that mysqldump uses buffered reads, but > that's why you have to temporary make the database read-only using > "FLUSH TABLES WITH READ LOCK" before taking a database snapshot, and > then re-enable database updates the "UNLOCK TABLES" SQL commands. > > Cheers, > > - Ted Thank you very much for your detailed explanation! But the downstream users do have buffered reads to read the relay log file, as I confirmed with bpftrace. Here's an introduction to turning on relay logging, but I'm not sure if you can access this link: https://blog.csdn.net/javaanddonet/article/details/112596148 Thanks! -- With Best Regards, Baokun Li .
On Tue, Dec 05, 2023 at 09:19:03PM +0800, Baokun Li wrote:
> Thank you very much for your detailed explanation!
> But the downstream users do have buffered reads to read the relay log
> file, as I confirmed with bpftrace. Here's an introduction to turning on
> relay logging, but I'm not sure if you can access this link:
> https://blog.csdn.net/javaanddonet/article/details/112596148
Well, if a mysql-supplied program is trying to read the relay log using a
buffered read while it's being written using Direct I/O, that's a bug
in mysql, and it should be reported as such to the mysql folks.
It does look like there is a newer way of doing replication which
doesn't rely on messing with log files. From:
https://dev.mysql.com/doc/refman/8.0/en/replication.html
MySQL 8.0 supports different methods of replication. The
traditional method is based on replicating events from the
source's binary log, and requires the log files and positions in
them to be synchronized between source and replica. The newer
method based on global transaction identifiers (GTIDs) is
transactional and therefore does not require working with log
files or positions within these files, which greatly simplifies
many common replication tasks. Replication using GTIDs guarantees
consistency between source and replica as long as all transactions
committed on the source have also been applied on the replica. For
more information about GTIDs and GTID-based replication in MySQL,
see Section 17.1.3, “Replication with Global Transaction
Identifiers”. For information on using binary log file position
based replication, see Section 17.1, “Configuring Replication”.
Perhaps you can try and see how mysql handles GTID-based replication
using bpftrace?
Cheers,
- Ted
On 2023/12/7 5:55, Theodore Ts'o wrote: > On Tue, Dec 05, 2023 at 09:19:03PM +0800, Baokun Li wrote: >> Thank you very much for your detailed explanation! >> But the downstream users do have buffered reads to read the relay log >> file, as I confirmed with bpftrace. Here's an introduction to turning on >> relay logging, but I'm not sure if you can access this link: >> https://blog.csdn.net/javaanddonet/article/details/112596148 > Well, if a mysql-supplied program is trying read the relay log using a > buffered read, when it's being written using Direct I/O, that's a bug > in mysql, and it should be reported as such to the mysql folks. I was mistaken here, it now looks like reads and writes to the relay log are buffered IO, and I'm still trying to locate the issue. > > It does look like there is a newer way of doing replication which > doesn't rely on messign with log files. From: > > https://dev.mysql.com/doc/refman/8.0/en/replication.html > > MySQL 8.0 supports different methods of replication. The > traditional method is based on replicating events from the > source's binary log, and requires the log files and positions in > them to be synchronized between source and replica. The newer > method based on global transaction identifiers (GTIDs) is > transactional and therefore does not require working with log > files or positions within these files, which greatly simplifies > many common replication tasks. Replication using GTIDs guarantees > consistency between source and replica as long as all transactions > committed on the source have also been applied on the replica. For > more information about GTIDs and GTID-based replication in MySQL, > see Section 17.1.3, “Replication with Global Transaction > Identifiers”. For information on using binary log file position > based replication, see Section 17.1, “Configuring Replication”. > > Perhaps you can try and see how mysql handles GTID-based replication > using bpftrace? > > Cheers, > > - Ted Thank you very much for your solution! I'll try it. Thanks! -- With Best Regards, Baokun Li .
On Mon 04-12-23 21:50:18, Baokun Li wrote: > On 2023/12/4 20:11, Jan Kara wrote: > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > Recently, while running some pressure tests on MYSQL, noticed that > > > occasionally a "corrupted data in log event" error would be reported. > > > After analyzing the error, I found that extending DIO write and buffered > > > read were competing, resulting in some zero-filled page end being read. > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > field in the page to indicate the valid data size, it seems to me that > > > it is impossible to solve this problem perfectly without changing these > > > two things. > > Yes, combining buffered reads with direct IO writes is a recipe for > > problems and pretty much in the "don't do it" territory. So honestly I'd > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > buffered read in this case? It is just something specific to the test > > you're doing? > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, the > middle dozens of bytes of the file are read as all zeros, while the data on > disk is not. This is a journal-like file where a write process gets the data > from > the master node and writes it locally, and another replay process reads the > file and performs the replay operation accordingly (some SQL statements). > The problem is that when replaying, it finds that the data read is > corrupted, > not valid SQL data, while the data on disk is normal. > > It's not confirmed that buffered reads vs direct IO writes is actually > causing this issue, but this is the only scenario that we can reproduce > with our local simplified scripts. Also, after merging in patch 1, the > MYSQL pressure test scenario has now been tested for 5 days and has not > been reproduced. > > I'll double-check the problem scenario, although buffered reads with > buffered writes doesn't seem to have this problem. Yeah, from what you write it seems that the replay code is using buffered reads on the journal file. I guess you could confirm that with a bit of kernel tracing but the symptoms look pretty convincing. Did you try talking to MYSQL guys about why they are doing this? > > > In this series, the first patch reads the inode size twice, and takes the > > > smaller of the two values as the copyout limit to avoid copying data that > > > was not actually read (0-padding) into the user buffer and causing data > > > corruption. This greatly reduces the probability of problems under 4k > > > page. However, the problem is still easily triggered under 64k page. > > > > > > The second patch waits for the existing dio write to complete and > > > invalidate the stale page cache before performing a new buffered read > > > in ext4, avoiding data corruption by copying the stale page cache to > > > the user buffer. This makes it much less likely that the problem will > > > be triggered in a 64k page. > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > the page that indicates the size of the valid data in the page? Or does > > > anyone have a better idea? > > No, there are no plans to address this AFAIK. 
Because such locking will > > slow down all the well behaved applications to fix a corner case for > > application doing unsupported things. Sure we must not crash the kernel, > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > combines buffered and direct IO but returning zeros instead of valid data > > is in my opinion fully within the range of acceptable behavior for such > > case. > > > I also feel that a scenario like buffered reads + DIO writes is strange. > But theoretically when read doesn't return an error, the data read > shouldn't be wrong. And I tested that xfs guarantees data consistency in > this scenario, which is why I thought it might be buggy. Yes, XFS has inherited stronger consistency guarantees from IRIX times than Linux filesystems traditionally had. We generally don't even guarantee buffered read vs buffered write atomicity (i.e., buffered read can see a torn buffered write). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On 2023/12/4 22:41, Jan Kara wrote:
> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>> On 2023/12/4 20:11, Jan Kara wrote:
>>> On Sat 02-12-23 17:14:30, Baokun Li wrote:
>>>> Recently, while running some pressure tests on MYSQL, noticed that
>>>> occasionally a "corrupted data in log event" error would be reported.
>>>> After analyzing the error, I found that extending DIO write and buffered
>>>> read were competing, resulting in some zero-filled page end being read.
>>>> Since ext4 buffered read doesn't hold an inode lock, and there is no
>>>> field in the page to indicate the valid data size, it seems to me that
>>>> it is impossible to solve this problem perfectly without changing these
>>>> two things.
>>> Yes, combining buffered reads with direct IO writes is a recipe for
>>> problems and pretty much in the "don't do it" territory. So honestly I'd
>>> consider this a MYSQL bug. Were you able to identify why does MYSQL use
>>> buffered read in this case? It is just something specific to the test
>>> you're doing?
>> The problem is with a one-master-twoslave MYSQL database with three
>> physical machines, and using sysbench pressure testing on each of the
>> three machines, the problem occurs about once every two to three hours.
>>
>> The problem is with the relay log file, and when the problem occurs, the
>> middle dozens of bytes of the file are read as all zeros, while the data on
>> disk is not. This is a journal-like file where a write process gets the data
>> from
>> the master node and writes it locally, and another replay process reads the
>> file and performs the replay operation accordingly (some SQL statements).
>> The problem is that when replaying, it finds that the data read is
>> corrupted,
>> not valid SQL data, while the data on disk is normal.
>>
>> It's not confirmed that buffered reads vs direct IO writes is actually
>> causing this issue, but this is the only scenario that we can reproduce
>> with our local simplified scripts. Also, after merging in patch 1, the
>> MYSQL pressure test scenario has now been tested for 5 days and has not
>> been reproduced.
>>
>> I'll double-check the problem scenario, although buffered reads with
>> buffered writes doesn't seem to have this problem.
> Yeah, from what you write it seems that the replay code is using buffered
> reads on the journal file. I guess you could confirm that with a bit of
> kernel tracing but the symptoms look pretty convincing. Did you try talking
> to MYSQL guys about why they are doing this?
The operations performed on the relay log file are buffered reads and
writes, which I confirmed with the following bpftrace script:
```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:generic_file_buffered_read
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
	printf("read path: %s\n",
	       str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}

kprobe:ext4_buffered_write_iter
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
	printf("write path: %s\n",
	       str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}
```
I suspect there are DIO writes causing the problem, but I haven't caught
any DIO writes to such files via bpftrace.
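If DIO does turn out to be involved, a probe along these lines could
catch it (the iomap_dio_rw symbol and the O_DIRECT flag value are
assumptions that depend on kernel version and architecture):
```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

/* hedged example, not part of the script above: catch DIO submission on
 * relay* files; ext4's DIO write path ends up in iomap_dio_rw() on recent
 * kernels, but the symbol may differ elsewhere. */
kprobe:iomap_dio_rw
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
	printf("dio path: %s\n",
	       str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}

/* also flag opens with O_DIRECT (0x4000 here assumes x86-64; check the
 * value for the target architecture) */
tracepoint:syscalls:sys_enter_openat
/(args->flags & 0x4000)/
{
	printf("O_DIRECT open: %s (comm %s)\n", str(args->filename), comm);
}
```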
>>>> In this series, the first patch reads the inode size twice, and takes the
>>>> smaller of the two values as the copyout limit to avoid copying data that
>>>> was not actually read (0-padding) into the user buffer and causing data
>>>> corruption. This greatly reduces the probability of problems under 4k
>>>> page. However, the problem is still easily triggered under 64k page.
>>>>
>>>> The second patch waits for the existing dio write to complete and
>>>> invalidate the stale page cache before performing a new buffered read
>>>> in ext4, avoiding data corruption by copying the stale page cache to
>>>> the user buffer. This makes it much less likely that the problem will
>>>> be triggered in a 64k page.
>>>>
>>>> Do we have a plan to add a lock to the ext4 buffered read or a field in
>>>> the page that indicates the size of the valid data in the page? Or does
>>>> anyone have a better idea?
>>> No, there are no plans to address this AFAIK. Because such locking will
>>> slow down all the well behaved applications to fix a corner case for
>>> application doing unsupported things. Sure we must not crash the kernel,
>>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
>>> combines buffered and direct IO but returning zeros instead of valid data
>>> is in my opinion fully within the range of acceptable behavior for such
>>> case.
>>>
>> I also feel that a scenario like buffered reads + DIO writes is strange.
>> But theoretically when read doesn't return an error, the data read
>> shouldn't be wrong. And I tested that xfs guarantees data consistency in
>> this scenario, which is why I thought it might be buggy.
> Yes, XFS has inherited stronger consistency guarantees from IRIX times than
> Linux filesystems traditionally had. We generally don't even guarantee
> buffered read vs buffered write atomicity (i.e., buffered read can see a
> torn buffered write).
>
> Honza
I'm a bit confused here: buffered read vs buffered write uses the same
page and appears to be protected by a memory barrier, so how does the
inconsistency occur?
Thanks!
--
With Best Regards,
Baokun Li
.
On Tue 05-12-23 20:50:30, Baokun Li wrote:
> On 2023/12/4 22:41, Jan Kara wrote:
> > On Mon 04-12-23 21:50:18, Baokun Li wrote:
> > > On 2023/12/4 20:11, Jan Kara wrote:
> > > > On Sat 02-12-23 17:14:30, Baokun Li wrote:
> > > > > Recently, while running some pressure tests on MYSQL, noticed that
> > > > > occasionally a "corrupted data in log event" error would be reported.
> > > > > After analyzing the error, I found that extending DIO write and buffered
> > > > > read were competing, resulting in some zero-filled page end being read.
> > > > > Since ext4 buffered read doesn't hold an inode lock, and there is no
> > > > > field in the page to indicate the valid data size, it seems to me that
> > > > > it is impossible to solve this problem perfectly without changing these
> > > > > two things.
> > > > Yes, combining buffered reads with direct IO writes is a recipe for
> > > > problems and pretty much in the "don't do it" territory. So honestly I'd
> > > > consider this a MYSQL bug. Were you able to identify why does MYSQL use
> > > > buffered read in this case? It is just something specific to the test
> > > > you're doing?
> > > The problem is with a one-master-twoslave MYSQL database with three
> > > physical machines, and using sysbench pressure testing on each of the
> > > three machines, the problem occurs about once every two to three hours.
> > >
> > > The problem is with the relay log file, and when the problem occurs, the
> > > middle dozens of bytes of the file are read as all zeros, while the data on
> > > disk is not. This is a journal-like file where a write process gets the data
> > > from
> > > the master node and writes it locally, and another replay process reads the
> > > file and performs the replay operation accordingly (some SQL statements).
> > > The problem is that when replaying, it finds that the data read is
> > > corrupted,
> > > not valid SQL data, while the data on disk is normal.
> > >
> > > It's not confirmed that buffered reads vs direct IO writes is actually
> > > causing this issue, but this is the only scenario that we can reproduce
> > > with our local simplified scripts. Also, after merging in patch 1, the
> > > MYSQL pressure test scenario has now been tested for 5 days and has not
> > > been reproduced.
> > >
> > > I'll double-check the problem scenario, although buffered reads with
> > > buffered writes doesn't seem to have this problem.
> > Yeah, from what you write it seems that the replay code is using buffered
> > reads on the journal file. I guess you could confirm that with a bit of
> > kernel tracing but the symptoms look pretty convincing. Did you try talking
> > to MYSQL guys about why they are doing this?
>
> The operations performed on the relay log file are buffered reads and
> writes, which I confirmed with the following bpftrace script:
> ```
> #include <linux/fs.h>
> #include <linux/path.h>
> #include <linux/dcache.h>
>
> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
> printf("read path: %s\n", str(((struct kiocb
> *)arg0)->ki_filp->f_path.dentry->d_name.name));
> }
>
> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
> printf("write path: %s\n", str(((struct kiocb
> *)arg0)->ki_filp->f_path.dentry->d_name.name));
> }
> ```
> I suspect there are DIO writes causing the problem, but I haven't caught
> any DIO writes to such files via bpftrace.
Interesting. Not sure how your partially zeroed-out buffers could happen
with fully buffered IO.
> > > > > In this series, the first patch reads the inode size twice, and takes the
> > > > > smaller of the two values as the copyout limit to avoid copying data that
> > > > > was not actually read (0-padding) into the user buffer and causing data
> > > > > corruption. This greatly reduces the probability of problems under 4k
> > > > > page. However, the problem is still easily triggered under 64k page.
> > > > >
> > > > > The second patch waits for the existing dio write to complete and
> > > > > invalidate the stale page cache before performing a new buffered read
> > > > > in ext4, avoiding data corruption by copying the stale page cache to
> > > > > the user buffer. This makes it much less likely that the problem will
> > > > > be triggered in a 64k page.
> > > > >
> > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in
> > > > > the page that indicates the size of the valid data in the page? Or does
> > > > > anyone have a better idea?
> > > > No, there are no plans to address this AFAIK. Because such locking will
> > > > slow down all the well behaved applications to fix a corner case for
> > > > application doing unsupported things. Sure we must not crash the kernel,
> > > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
> > > > combines buffered and direct IO but returning zeros instead of valid data
> > > > is in my opinion fully within the range of acceptable behavior for such
> > > > case.
> > > >
> > > I also feel that a scenario like buffered reads + DIO writes is strange.
> > > But theoretically when read doesn't return an error, the data read
> > > shouldn't be wrong. And I tested that xfs guarantees data consistency in
> > > this scenario, which is why I thought it might be buggy.
> > Yes, XFS has inherited stronger consistency guarantees from IRIX times than
> > Linux filesystems traditionally had. We generally don't even guarantee
> > buffered read vs buffered write atomicity (i.e., buffered read can see a
> > torn buffered write).
>
> I'm a bit confused here, buffered read vs buffered write uses the same
> page and appears to be protected by a memory barrier, how does the
> inconsistency occur?
Within the same page buffered reads and writes should be consistent because
they are synchronized by the page lock. However once reads and writes
involve multiple pages, there is no serialization so you can get contents
of some pages before write and some pages after being written. However this
doesn't seem to be your particular case here. I just wanted to point out
that in general even buffered reads vs writes are not fully consistent.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Dec 06, 2023 at 08:37:57PM +0100, Jan Kara wrote:
> Within the same page buffered reads and writes should be consistent because
> they are synchronized by the page lock. However once reads and writes
> involve multiple pages, there is no serialization so you can get contents
> of some pages before write and some pages after being written. However this
> doesn't seem to be your particular case here. I just wanted to point out
> that in general even buffered reads vs writes are not fully consistent.

Buffered reads don't take the page/folio lock. We only use the folio
lock to avoid reading stale data from the page cache while we're
fetching the data from storage. Once the uptodate flag is set on the
folio, we never take the folio lock for reads.
On Tue 12-12-23 04:36:45, Matthew Wilcox wrote:
> On Wed, Dec 06, 2023 at 08:37:57PM +0100, Jan Kara wrote:
> > Within the same page buffered reads and writes should be consistent because
> > they are synchronized by the page lock. However once reads and writes
> > involve multiple pages, there is no serialization so you can get contents
> > of some pages before write and some pages after being written. However this
> > doesn't seem to be your particular case here. I just wanted to point out
> > that in general even buffered reads vs writes are not fully consistent.
>
> Buffered reads don't take the page/folio lock. We only use the folio
> lock to avoid reading stale data from the page cache while we're
> fetching the data from storage. Once the uptodate flag is set on the
> folio, we never take the folio lock for reads.

Right, I've noticed once I've reread the code but thanks for correction
anyway. :)

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2023/12/7 3:37, Jan Kara wrote:
> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>> On 2023/12/4 22:41, Jan Kara wrote:
>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>> The problem is with a one-master-twoslave MYSQL database with three
>>>> physical machines, and using sysbench pressure testing on each of the
>>>> three machines, the problem occurs about once every two to three hours.
>>>>
>>>> The problem is with the relay log file, and when the problem occurs, the
>>>> middle dozens of bytes of the file are read as all zeros, while the data on
>>>> disk is not. This is a journal-like file where a write process gets the data
>>>> from
>>>> the master node and writes it locally, and another replay process reads the
>>>> file and performs the replay operation accordingly (some SQL statements).
>>>> The problem is that when replaying, it finds that the data read is
>>>> corrupted,
>>>> not valid SQL data, while the data on disk is normal.
>>>>
>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>> causing this issue, but this is the only scenario that we can reproduce
>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>> been reproduced.
>>>>
>>>> I'll double-check the problem scenario, although buffered reads with
>>>> buffered writes doesn't seem to have this problem.
>>> Yeah, from what you write it seems that the replay code is using buffered
>>> reads on the journal file. I guess you could confirm that with a bit of
>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>> to MYSQL guys about why they are doing this?
>> The operations performed on the relay log file are buffered reads and
>> writes, which I confirmed with the following bpftrace script:
>> ```
>> #include <linux/fs.h>
>> #include <linux/path.h>
>> #include <linux/dcache.h>
>>
>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>> printf("read path: %s\n", str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>>
>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>> printf("write path: %s\n", str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>> ```
>> I suspect there are DIO writes causing the problem, but I haven't caught
>> any DIO writes to such files via bpftrace.
> Interesting. Not sure how your partially zeroed-out buffers could happen
> with fully buffered IO.
>
After looking at the code again and again, the following concurrency
seems to bypass the memory barrier:
ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
   i_size_write(inode, pos + copied)   // write isize 2048
  unlock_page(page)

                                  ext4_file_read_iter
                                   generic_file_read_iter
                                    filemap_read
                                     filemap_get_pages
                                      filemap_get_read_batch
                                      folio_test_uptodate(folio)
                                       ret = test_bit(PG_uptodate, folio_flags(folio, 0));
                                       if (ret)
                                        smp_rmb();
                                        // The read barrier here ensures that
                                        // data 0-2048 in the page is synchronized.

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
   i_size_write(inode, pos + copied)   // write isize 4096
  unlock_page(page)

                                     // read isize 4096
                                     isize = i_size_read(inode)
                                     // But there is no read barrier here,
                                     // so the data in the 2048-4096 range
                                     // may not be synchronized yet !!!
                                     copy_page_to_iter()   // copyout 4096
In the race above, the reader sees the updated i_size, but there is no
read barrier to ensure that the data in the page is as up to date as
that i_size, so we may copy the not-yet-synchronized page out. Is it
normal for us to read zero-filled data in this case?
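One way to close that window, sketched here against filemap_read()
purely as an illustration (I'm not claiming this is the final fix),
would be a read barrier between reading i_size and copying the page
contents, pairing with the write-side ordering shown above:

```
	/* Sketch only: in mm/filemap.c:filemap_read(), right after the
	 * i_size read that bounds the copy to user space. */
	isize = i_size_read(inode);
	if (unlikely(iocb->ki_pos >= isize))
		goto put_folios;
	/*
	 * Pairs with the write side, which publishes the page contents
	 * before i_size_write(): a reader that observes the enlarged
	 * i_size must also observe the bytes written before it.
	 */
	smp_rmb();
```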
Thanks!
--
With Best Regards,
Baokun Li
.
On Thu 07-12-23 22:15:55, Baokun Li wrote:
> On 2023/12/7 3:37, Jan Kara wrote:
> > On Tue 05-12-23 20:50:30, Baokun Li wrote:
> > > On 2023/12/4 22:41, Jan Kara wrote:
> > > > On Mon 04-12-23 21:50:18, Baokun Li wrote:
> > > > > On 2023/12/4 20:11, Jan Kara wrote:
> > > > > The problem is with a one-master-two-slave MYSQL database on three
> > > > > physical machines, with sysbench pressure testing running on each of the
> > > > > three machines; the problem occurs about once every two to three hours.
> > > > >
> > > > > The problem is with the relay log file: when the problem occurs, a few
> > > > > dozen bytes in the middle of the file are read as all zeros, while the
> > > > > data on disk is not. This is a journal-like file where a writer process
> > > > > gets the data from the master node and writes it locally, and another
> > > > > replay process reads the file and performs the replay operation
> > > > > accordingly (some SQL statements). The problem is that when replaying,
> > > > > it finds that the data it read is corrupted and not valid SQL data,
> > > > > while the data on disk is normal.
> > > > >
> > > > > It's not confirmed that buffered reads vs direct IO writes is actually
> > > > > causing this issue, but this is the only scenario that we can reproduce
> > > > > with our local simplified scripts. Also, after merging in patch 1, the
> > > > > MYSQL pressure test scenario has now been tested for 5 days and has not
> > > > > been reproduced.
> > > > >
> > > > > I'll double-check the problem scenario, although buffered reads with
> > > > > buffered writes doesn't seem to have this problem.
> > > > Yeah, from what you write it seems that the replay code is using buffered
> > > > reads on the journal file. I guess you could confirm that with a bit of
> > > > kernel tracing but the symptoms look pretty convincing. Did you try talking
> > > > to MYSQL guys about why they are doing this?
> > > The operations performed on the relay log file are buffered reads and
> > > writes, which I confirmed with the following bpftrace script:
> > > ```
> > > #include <linux/fs.h>
> > > #include <linux/path.h>
> > > #include <linux/dcache.h>
> > >
> > > kprobe:generic_file_buffered_read
> > > /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
> > > {
> > >         printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
> > > }
> > >
> > > kprobe:ext4_buffered_write_iter
> > > /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
> > > {
> > >         printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
> > > }
> > > ```
> > > I suspect there are DIO writes causing the problem, but I haven't caught
> > > any DIO writes to such files via bpftrace.
> > Interesting. Not sure how your partially zeroed-out buffers could happen
> > with fully buffered IO.
> >
> After looking at the code again and again, the following concurrency
> seems to bypass the memory barrier:
>
> // writer (buffered write)                // reader (buffered read)
> ext4_buffered_write_iter
>  generic_perform_write
>   copy_page_from_iter_atomic
>   ext4_da_write_end
>    ext4_da_do_write_end
>     block_write_end
>      __block_commit_write
>       folio_mark_uptodate
>        smp_wmb()
>        set_bit(PG_uptodate, folio_flags(folio, 0))
>     i_size_write(inode, pos + copied)    // write isize 2048
>     unlock_page(page)
>
>                                           ext4_file_read_iter
>                                            generic_file_read_iter
>                                             filemap_read
>                                              filemap_get_pages
>                                               filemap_get_read_batch
>                                               folio_test_uptodate(folio)
>                                                ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>                                                if (ret)
>                                                        smp_rmb();
>                                                // The read barrier here ensures
>                                                // that data 0-2048 in the page is synchronized.
>
> ext4_buffered_write_iter
>  generic_perform_write
>   copy_page_from_iter_atomic
>   ext4_da_write_end
>    ext4_da_do_write_end
>     block_write_end
>      __block_commit_write
>       folio_mark_uptodate
>        smp_wmb()
>        set_bit(PG_uptodate, folio_flags(folio, 0))
>     i_size_write(inode, pos + copied)    // write isize 4096
>     unlock_page(page)
>
>                                             isize = i_size_read(inode)    // read isize 4096
>                                             // But there is no read barrier here,
>                                             // so the data in the 2048-4096 range
>                                             // may not be synchronized yet !!!
>                                             copy_page_to_iter()           // copyout 4096
>
> In the concurrency above, we read the updated i_size, but there is
> no read barrier to ensure that the data in the page is the same as
> the i_size at this point. Therefore, we may copy the unsynchronized
> page out. Is it normal for us to read zero-filled data in this case?
Indeed, I have checked and filemap_read() (and this dates back even to the
old do_generic_file_read() code) copies data only after checking the
uptodate flag and then sampling i_size, so we may be copying state from the
middle of a racing write, and there is nothing that would prevent the page
data from being prefetched before the inode size is fetched. I agree this
is kind of nasty, so I think adding a read barrier between i_size_read()
and copy_page_to_iter() makes sense. Does it fix your issue with MYSQL?
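For reference, here is a simplified sketch of the ordering being discussed
(a hypothetical helper, not the actual mm/filemap.c code nor the eventual
patch): the read side samples i_size and then issues smp_rmb() before
copying folio contents out, so the data loads pair with the smp_wmb() in
folio_mark_uptodate():

```
/*
 * Simplified sketch of the copy-out step of a buffered read
 * (hypothetical helper name, not the real filemap_read() body).
 */
static size_t copyout_one_folio_sketch(struct kiocb *iocb,
				       struct iov_iter *iter,
				       struct folio *folio)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	loff_t isize = i_size_read(inode);
	size_t offset, bytes;

	if (unlikely(iocb->ki_pos >= isize))
		return 0;

	/*
	 * Proposed barrier: pairs with the smp_wmb() in folio_mark_uptodate()
	 * so that the folio data read below is at least as new as the i_size
	 * value sampled above.
	 */
	smp_rmb();

	offset = iocb->ki_pos & (folio_size(folio) - 1);
	bytes = min_t(loff_t, isize - iocb->ki_pos, folio_size(folio) - offset);
	return copy_folio_to_iter(folio, offset, bytes, iter);
}
```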
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2023/12/12 1:49, Jan Kara wrote:
> On Thu 07-12-23 22:15:55, Baokun Li wrote:
>> On 2023/12/7 3:37, Jan Kara wrote:
>>> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>>>> On 2023/12/4 22:41, Jan Kara wrote:
>>>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>>>> The problem is with a one-master-two-slave MYSQL database on three
>>>>>> physical machines, with sysbench pressure testing running on each of the
>>>>>> three machines; the problem occurs about once every two to three hours.
>>>>>>
>>>>>> The problem is with the relay log file: when the problem occurs, a few
>>>>>> dozen bytes in the middle of the file are read as all zeros, while the
>>>>>> data on disk is not. This is a journal-like file where a writer process
>>>>>> gets the data from the master node and writes it locally, and another
>>>>>> replay process reads the file and performs the replay operation
>>>>>> accordingly (some SQL statements). The problem is that when replaying,
>>>>>> it finds that the data it read is corrupted and not valid SQL data,
>>>>>> while the data on disk is normal.
>>>>>>
>>>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>>>> causing this issue, but this is the only scenario that we can reproduce
>>>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>>>> been reproduced.
>>>>>>
>>>>>> I'll double-check the problem scenario, although buffered reads with
>>>>>> buffered writes doesn't seem to have this problem.
>>>>> Yeah, from what you write it seems that the replay code is using buffered
>>>>> reads on the journal file. I guess you could confirm that with a bit of
>>>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>>>> to MYSQL guys about why they are doing this?
>>>> The operations performed on the relay log file are buffered reads and
>>>> writes, which I confirmed with the following bpftrace script:
>>>> ```
>>>> #include <linux/fs.h>
>>>> #include <linux/path.h>
>>>> #include <linux/dcache.h>
>>>>
>>>> kprobe:generic_file_buffered_read
>>>> /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
>>>> {
>>>>         printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
>>>> }
>>>>
>>>> kprobe:ext4_buffered_write_iter
>>>> /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>>>> {
>>>>         printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
>>>> }
>>>> ```
>>>> I suspect there are DIO writes causing the problem, but I haven't caught
>>>> any DIO writes to such files via bpftrace.
>>> Interesting. Not sure how your partially zeroed-out buffers could happen
>>> with fully buffered IO.
>>>
>> After looking at the code again and again, the following concurrency
>> seems to bypass the memory barrier:
>>
>> // writer (buffered write)                // reader (buffered read)
>> ext4_buffered_write_iter
>>  generic_perform_write
>>   copy_page_from_iter_atomic
>>   ext4_da_write_end
>>    ext4_da_do_write_end
>>     block_write_end
>>      __block_commit_write
>>       folio_mark_uptodate
>>        smp_wmb()
>>        set_bit(PG_uptodate, folio_flags(folio, 0))
>>     i_size_write(inode, pos + copied)    // write isize 2048
>>     unlock_page(page)
>>
>>                                           ext4_file_read_iter
>>                                            generic_file_read_iter
>>                                             filemap_read
>>                                              filemap_get_pages
>>                                               filemap_get_read_batch
>>                                               folio_test_uptodate(folio)
>>                                                ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>>                                                if (ret)
>>                                                        smp_rmb();
>>                                                // The read barrier here ensures
>>                                                // that data 0-2048 in the page is synchronized.
>>
>> ext4_buffered_write_iter
>>  generic_perform_write
>>   copy_page_from_iter_atomic
>>   ext4_da_write_end
>>    ext4_da_do_write_end
>>     block_write_end
>>      __block_commit_write
>>       folio_mark_uptodate
>>        smp_wmb()
>>        set_bit(PG_uptodate, folio_flags(folio, 0))
>>     i_size_write(inode, pos + copied)    // write isize 4096
>>     unlock_page(page)
>>
>>                                             isize = i_size_read(inode)    // read isize 4096
>>                                             // But there is no read barrier here,
>>                                             // so the data in the 2048-4096 range
>>                                             // may not be synchronized yet !!!
>>                                             copy_page_to_iter()           // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is
>> no read barrier to ensure that the data in the page is the same as
>> the i_size at this point. Therefore, we may copy the unsynchronized
>> page out. Is it normal for us to read zero-filled data in this case?
> Indeed, I have checked and filemap_read() (and this dates back even to the
> old do_generic_file_read() code) copies data only after checking the
> uptodate flag and then sampling i_size, so we may be copying state from the
> middle of a racing write, and there is nothing that would prevent the page
> data from being prefetched before the inode size is fetched. I agree this
> is kind of nasty, so I think adding a read barrier between i_size_read()
> and copy_page_to_iter() makes sense. Does it fix your issue with MYSQL?
>
> Honza
Thank you very much for confirming this issue!
Adding a read barrier between i_size_read() and copy_page_to_iter()
does seem to have solved the problem. The environment, which previously
reproduced the problem once every 2-3 hours, has now been running for
more than 60 hours without a reproduction after adding the read barrier.
This is a load-load reordering issue, which only occurs on some
weakly ordered architectures (e.g. ARM64, ALPHA) and not on
strongly ordered architectures (e.g. X86). The issue was in fact
discovered while migrating MYSQL services from X86 to ARM64.
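To illustrate why only the ARM64 machines reproduce this, here is a rough
paraphrase (not verbatim kernel source) of how smp_rmb() expands on the two
architectures; x86-64 already orders loads against loads, so the barrier is
compiler-only there, while arm64 needs a real fence:

```
/* Rough paraphrase of the per-architecture definitions, for illustration. */
#ifdef __x86_64__
#define smp_rmb()	__asm__ __volatile__("" ::: "memory")	/* compiler barrier only */
#elif defined(__aarch64__)
#define smp_rmb()	__asm__ __volatile__("dmb ishld" ::: "memory")
#endif
```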
And in theory the problem is not limited to ext4: any filesystem that
calls filemap_read() without holding the inode lock (e.g. btrfs, f2fs,
ubifs ...) can hit it, while filesystems that take the inode lock around
buffered reads (e.g. xfs, nfs) won't, roughly as sketched below.
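As a simplified, from-memory sketch of that difference (the function names
below are made up for illustration and this is not verbatim fs/ext4 or
fs/xfs code): the ext4-style path enters filemap_read() with no inode lock
held, while the xfs-style path wraps the same call in the shared IOLOCK
that its writers also take:

```
/* Hypothetical names; both paths end up in filemap_read(). */

/* ext4-style buffered read: nothing serializes against racing writes. */
static ssize_t ext4_style_buffered_read(struct kiocb *iocb, struct iov_iter *to)
{
	return generic_file_read_iter(iocb, to);
}

/* xfs-style buffered read: shared inode/IO lock held across the copy-out. */
static ssize_t xfs_style_buffered_read(struct kiocb *iocb, struct iov_iter *to)
{
	struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
	ssize_t ret;

	xfs_ilock(ip, XFS_IOLOCK_SHARED);
	ret = generic_file_read_iter(iocb, to);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	return ret;
}
```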
Later I will send the patch that adds the memory barrier.
Thanks!
--
With Best Regards,
Baokun Li
On 2023/12/7 3:37, Jan Kara wrote:
> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>> On 2023/12/4 22:41, Jan Kara wrote:
>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>>> On Sat 02-12-23 17:14:30, Baokun Li wrote:
>>>>>> Recently, while running some pressure tests on MYSQL, noticed that
>>>>>> occasionally a "corrupted data in log event" error would be reported.
>>>>>> After analyzing the error, I found that extending DIO write and buffered
>>>>>> read were competing, resulting in some zero-filled page end being read.
>>>>>> Since ext4 buffered read doesn't hold an inode lock, and there is no
>>>>>> field in the page to indicate the valid data size, it seems to me that
>>>>>> it is impossible to solve this problem perfectly without changing these
>>>>>> two things.
>>>>> Yes, combining buffered reads with direct IO writes is a recipe for
>>>>> problems and pretty much in the "don't do it" territory. So honestly I'd
>>>>> consider this a MYSQL bug. Were you able to identify why does MYSQL use
>>>>> buffered read in this case? It is just something specific to the test
>>>>> you're doing?
>>>> The problem is with a one-master-two-slave MYSQL database on three
>>>> physical machines, with sysbench pressure testing running on each of the
>>>> three machines; the problem occurs about once every two to three hours.
>>>>
>>>> The problem is with the relay log file: when the problem occurs, a few
>>>> dozen bytes in the middle of the file are read as all zeros, while the
>>>> data on disk is not. This is a journal-like file where a writer process
>>>> gets the data from the master node and writes it locally, and another
>>>> replay process reads the file and performs the replay operation
>>>> accordingly (some SQL statements). The problem is that when replaying,
>>>> it finds that the data it read is corrupted and not valid SQL data,
>>>> while the data on disk is normal.
>>>>
>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>> causing this issue, but this is the only scenario that we can reproduce
>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>> been reproduced.
>>>>
>>>> I'll double-check the problem scenario, although buffered reads with
>>>> buffered writes doesn't seem to have this problem.
>>> Yeah, from what you write it seems that the replay code is using buffered
>>> reads on the journal file. I guess you could confirm that with a bit of
>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>> to MYSQL guys about why they are doing this?
>> The operations performed on the relay log file are buffered reads and
>> writes, which I confirmed with the following bpftrace script:
>> ```
>> #include <linux/fs.h>
>> #include <linux/path.h>
>> #include <linux/dcache.h>
>>
>> kprobe:generic_file_buffered_read
>> /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
>> {
>>         printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>>
>> kprobe:ext4_buffered_write_iter
>> /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
>> {
>>         printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>> ```
>> I suspect there are DIO writes causing the problem, but I haven't caught
>> any DIO writes to such files via bpftrace.
> Interesting. Not sure how your partially zeroed-out buffers could happen
> with fully buffered IO.
I'm still working on root-causing this issue and will follow up here
if there is any progress.
>>>>>> In this series, the first patch reads the inode size twice, and takes the
>>>>>> smaller of the two values as the copyout limit to avoid copying data that
>>>>>> was not actually read (0-padding) into the user buffer and causing data
>>>>>> corruption. This greatly reduces the probability of problems under 4k
>>>>>> page. However, the problem is still easily triggered under 64k page.
>>>>>>
>>>>>> The second patch waits for the existing dio write to complete and
>>>>>> invalidate the stale page cache before performing a new buffered read
>>>>>> in ext4, avoiding data corruption by copying the stale page cache to
>>>>>> the user buffer. This makes it much less likely that the problem will
>>>>>> be triggered in a 64k page.
>>>>>>
>>>>>> Do we have a plan to add a lock to the ext4 buffered read or a field in
>>>>>> the page that indicates the size of the valid data in the page? Or does
>>>>>> anyone have a better idea?
>>>>> No, there are no plans to address this AFAIK. Because such locking will
>>>>> slow down all the well behaved applications to fix a corner case for
>>>>> application doing unsupported things. Sure we must not crash the kernel,
>>>>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app
>>>>> combines buffered and direct IO but returning zeros instead of valid data
>>>>> is in my opinion fully within the range of acceptable behavior for such
>>>>> case.
>>>>>
>>>> I also feel that a scenario like buffered reads + DIO writes is strange.
>>>> But in theory, when a read doesn't return an error, the data it returns
>>>> shouldn't be wrong. And I tested that xfs guarantees data consistency in
>>>> this scenario, which is why I thought this might be a bug.
>>> Yes, XFS has inherited stronger consistency guarantees from IRIX times than
>>> Linux filesystems traditionally had. We generally don't even guarantee
>>> buffered read vs buffered write atomicity (i.e., buffered read can see a
>>> torn buffered write).
>> I'm a bit confused here: buffered read vs. buffered write uses the same
>> page and appears to be protected by a memory barrier, so how does the
>> inconsistency occur?
> Within the same page, buffered reads and writes should be consistent
> because they are synchronized by the page lock. However, once reads and
> writes involve multiple pages there is no serialization, so you can see
> the contents of some pages from before the write and of others from after
> it. That doesn't seem to be your particular case here, though; I just
> wanted to point out that in general even buffered reads vs writes are not
> fully consistent.
>
> Honza
>
Yes, when a write spans multiple pages it is possible to read some pages
that have already been updated and some that have not. Thank you very
much for answering my query!
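For illustration, here is a tiny user-space sketch of that multi-page
tearing (file name, sizes and loop counts are made up): one thread keeps
rewriting two pages with a single pwrite() while the main thread rereads
them, and a read can return one page from before the write and one from
after it:

```
/* Illustration only: buffered read vs buffered write tearing across pages. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SZ (2 * 4096)			/* a write spanning two pages */

static int fd;

static void *writer(void *arg)
{
	char oldbuf[SZ], newbuf[SZ];

	memset(oldbuf, 'o', SZ);
	memset(newbuf, 'n', SZ);
	for (int i = 0; i < 100000; i++) {
		pwrite(fd, newbuf, SZ, 0);
		pwrite(fd, oldbuf, SZ, 0);
	}
	return NULL;
}

int main(void)
{
	char buf[SZ];
	pthread_t w;

	fd = open("tear-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
	memset(buf, 'o', SZ);
	pwrite(fd, buf, SZ, 0);		/* start from consistent contents */
	pthread_create(&w, NULL, writer, NULL);

	for (int i = 0; i < 100000; i++) {
		pread(fd, buf, SZ, 0);
		/* Torn read: first page already rewritten, second page not yet. */
		if (buf[0] != buf[SZ - 1])
			printf("torn read at iteration %d\n", i);
	}
	pthread_join(w, NULL);
	close(fd);
	return 0;
}
```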
Thanks!
--
With Best Regards,
Baokun Li