2018-10-25

1. Concepts

A single handle is an atomic operation. Multiple handles are bundled into one transaction.

The term disk refers to the actual block device, whereas the term journal refers to the log area.

Log record: describes a single update of a disk block of the journaling filesystem. (From [Understanding the Linux Kernel, 3rd Edition].)

commit record: a special block, called the commit record, is written to the journal. The commit record indicates that all the blocks belonging to a single atomic operation have been written to the journal. fong: this way, after a crash, we can tell which log entries are complete. The commit record indicates that this is a completed operation that can be written to the disk.
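The role of the commit record during recovery can be sketched with a toy model. The flat list of `(kind, transaction_id)` tuples and the function name `complete_transactions` are assumptions made for illustration; they are not JBD's on-disk format or API.

```python
# Toy model only: a journal as a flat list of (kind, transaction_id) records.
def complete_transactions(journal):
    """Return the IDs of transactions whose commit record reached the journal."""
    return [tid for kind, tid in journal if kind == "commit"]


journal = [
    ("log", 1), ("log", 1), ("commit", 1),  # transaction 1 is complete
    ("log", 2),                             # crash before transaction 2 committed
]
replayable = complete_transactions(journal)
```

Only transaction 1 is considered for replay; the log record of transaction 2 is ignored because no commit record follows it.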

checkpointing: the process of writing finished transactions to the disk (outside the log area) so that the corresponding log space can be reclaimed.
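A minimal sketch of why checkpointing frees log space, assuming a toy journal whose size is counted in blocks. The class `ToyJournal`, its methods, and the policy of checkpointing everything when the log is full are hypothetical simplifications, not how JBD schedules checkpoints.

```python
class ToyJournal:
    def __init__(self, capacity):
        self.capacity = capacity  # size of the log area, in blocks
        self.committed = []       # committed transactions, oldest first
        self.disk = {}            # home locations: block number -> contents

    def used(self):
        """Number of log blocks held by transactions not yet checkpointed."""
        return sum(len(blocks) for blocks in self.committed)

    def checkpoint(self):
        """Write finished transactions to their home locations, then free the log."""
        for blocks in self.committed:
            self.disk.update(blocks)
        self.committed.clear()

    def commit(self, blocks):
        if len(blocks) > self.capacity:
            raise ValueError("transaction larger than the log area")
        if self.used() + len(blocks) > self.capacity:
            self.checkpoint()  # reclaim log space before accepting more
        self.committed.append(dict(blocks))


j = ToyJournal(capacity=3)
j.commit({10: "a", 11: "b"})
j.commit({10: "c", 12: "d"})  # does not fit, so the first transaction is checkpointed
```

After the second commit, the first transaction's blocks are at their home locations and only the second transaction still occupies the log.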

2. States

T_RUNNING
the transaction can accept new handles.

T_LOCKED
the transaction does not accept any new handles but existing handles are not complete. fong: here a "complete" handle means that its ordinary writes have finished.

T_FLUSH
all the handles in a transaction are complete. The transaction is writing itself to the journal. Here "itself" means the log: the log is written to the log area. In the code this corresponds to journal_submit_data_buffers() followed by an unplug, plus journal_write_revoke_records().

T_COMMIT
See linux-3.10.86_utf8/fs/jbd/commit.c: we have now written out all of the data for a transaction. Now comes the tricky part: we need to write out metadata.

T_COMMIT_RECORD
start writing the commit record and do cleanup.

T_FINISHED

[Understanding the Linux Kernel, 3rd Edition] All log records included in the transaction have been physically written onto the journal. When recovering from a system failure, e2fsck considers every complete transaction of the journal and writes the corresponding blocks into the filesystem. Because checkpointing removes a transaction from the log area, the transactions still present in the log area are the ones that need to be replayed.
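The states above form a one-way sequence, which can be sketched as a small state machine. The numeric values and the helper names `accepts_new_handles` and `advance` are assumptions for illustration; only the state names and their order come from the notes above.

```python
from enum import IntEnum


class TState(IntEnum):
    # Order follows the list above; the numeric values are illustrative.
    T_RUNNING = 0
    T_LOCKED = 1
    T_FLUSH = 2
    T_COMMIT = 3
    T_COMMIT_RECORD = 4
    T_FINISHED = 5


def accepts_new_handles(state):
    """Only a running transaction can accept new handles."""
    return state == TState.T_RUNNING


def advance(state):
    """Move to the next state; a transaction never goes backwards."""
    if state == TState.T_FINISHED:
        raise ValueError("T_FINISHED is the last state")
    return TState(state + 1)


state = TState.T_RUNNING
visited = [state]
while state != TState.T_FINISHED:
    state = advance(state)
    visited.append(state)
```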

3. Main process

Phase 1: the transaction enters into the flush state (T_FLUSH).
Phase 2: the actual buffers of the transaction are flushed to the disk. Data buffers go first. There are no complications here, as data buffers are not saved in the log area. Instead, they are flushed directly to their actual positions on the disk. This phase ends when the I/O completion notifications for all such buffers are received.
Data is not written to the log area first; it is written directly to its actual location on the disk.

(The data=writeback, ordered, and journal modes exist only at the filesystem level, so from JBD's point of view data is not journaled.)

Phase 3: all the data buffers are written to the disk, but their metadata is still in volatile memory. Metadata flushing is not as straightforward as data buffer flushing, because metadata needs to be written to the log area and the actual positions on the disk need to be remembered; in other words, writing metadata requires recording a mapping. This phase starts with flushing these metadata buffers, for which a journal descriptor block is acquired. The journal descriptor block stores the mapping of each metadata buffer in the journal to its actual location on the disk in the form of tags.
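The descriptor block idea can be sketched as follows, assuming a toy journal made of Python tuples. Real tags carry more than a block number; here a tag is reduced to the home block number, and the function names are hypothetical.

```python
def write_metadata_to_journal(journal, buffers):
    """Append one descriptor block followed by the metadata blocks it describes.

    buffers: list of (home_block_number, contents) pairs.
    """
    tags = [home for home, _ in buffers]  # one tag per metadata block, in order
    journal.append(("descriptor", tags))
    for _, contents in buffers:
        journal.append(("metadata", contents))


def replay(journal, disk):
    """Use the tags to copy each journaled block back to its home location."""
    i = 0
    while i < len(journal):
        kind, payload = journal[i]
        if kind == "descriptor":
            for offset, home in enumerate(payload, start=1):
                disk[home] = journal[i + offset][1]
            i += 1 + len(payload)
        else:
            i += 1


journal = []
write_metadata_to_journal(journal, [(120, "inode table"), (7, "block bitmap")])
disk = {}
replay(journal, disk)
```

The metadata blocks sit at consecutive positions in the journal; the tags are the only record of where each one belongs on the disk.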

Phase 4 and Phase 5: both phase 4 and phase 5 wait on I/O completion notifications.

Phase 6: all the data and metadata are on safe storage, data at its actual locations and metadata in the journal. Now the transaction needs to be marked as committed so that it is known that all the updates are safe in the journal. After this, the transaction is moved to the committed state, T_COMMIT.

Phase 7: do checkpoint processing.

Phase 8: the transaction is marked as being in the finished state, T_FINISHED.

4. Revoke

For example, consider the following sequence of steps when the filesystem is mounted in metadata-only journaling mode:

a) A metadata block 'B' is journalled and contents are copied to journal.
b) Later 'B' gets freed
c) 'B' is now used to write contents of user data, this is not journalled.
The block originally held metadata and was later repurposed to store data.

Now if we crash and replay, we need to avoid replaying the contents of block 'B' in journal over the user contents.

During replay after a crash, we must not touch blocks that were once journaled and have since been reused for something else.

If there are transactions for the block after the last revoke record of that block, these ops are safe to replay. Any transactions which appear before the revoke record aren't replayed. The basic idea is that you don't want to replay ops corresponding to a block which may have been freed. Also note that if there are multiple revoke records corresponding to a block in a journal, we only need to worry about the latest record, i.e. the one with the highest transaction ID.
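The replay rule can be sketched as a two-pass scan. This is a toy model under stated assumptions: journal entries are `(transaction_id, kind, block, contents)` tuples, block 42 plays the role of block 'B', and a log entry in the same transaction as the revoke is also skipped.

```python
def scan_revokes(journal):
    """Pass 1: remember, per block, the highest transaction ID that revoked it."""
    latest = {}
    for tid, kind, block, _ in journal:
        if kind == "revoke":
            latest[block] = max(tid, latest.get(block, tid))
    return latest


def replay(journal, disk):
    """Pass 2: replay log entries unless a revoke at the same or a later ID covers them."""
    latest = scan_revokes(journal)
    for tid, kind, block, contents in journal:
        if kind != "log":
            continue
        if block in latest and tid <= latest[block]:
            continue  # stale entry for a block that was freed afterwards
        disk[block] = contents


# Block 42 was journaled as metadata, freed (revoked), then reused for user data.
journal_a = [
    (1, "log", 42, "old metadata"),
    (2, "revoke", 42, None),
    (2, "log", 7, "block bitmap"),
]
disk_a = {42: "user data"}
replay(journal_a, disk_a)

# A log entry after the last revoke is still replayed.
journal_b = journal_a + [(3, "log", 42, "new metadata")]
disk_b = {}
replay(journal_b, disk_b)
```

In the first case the user data in block 42 survives replay; in the second, the entry from transaction 3 is written because it comes after the last revoke.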


linux-3.10.86/fs/jbd/revoke.c

Revoke is the mechanism used to prevent old log records for deleted metadata from being replayed on top of newer data (data does not need to be journaled) using the same blocks. The revoke mechanism is used in two separate places:
Commit: during commit we write the entire list of the current transaction's revoked blocks to the journal.
Recovery: during recovery we record the transaction ID of all revoked blocks. If there are multiple revoke records in the log for a single block, only the last one counts, and if there is a log entry for a block beyond the last revoke (here "beyond" means outside the reach of the revoke, that is, later in time than the revoke), then that log entry still gets replayed.

We can get interactions between revokes and new log data within a single transaction:

  • Block is revoked and then journaled:
    The desired end result is the journaling of the new block, so we cancel the revoke before the transaction commits.
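    The block is revoked first and then journaled, so the net effect is that the block should be recorded in the log area. When the transaction commits, the revoke therefore does not need to be recorded.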

  • Block is journaled and then revoked:
    The revoke must take precedence over the write of the block, so we need either to cancel the journal entry or to write the revoke later in the log than the log block. In this case (block is journaled and then revoked), we choose the latter: journaling a block cancels any revoke record for that block in the current transaction, so any revoke for that block in the transaction must have happened after the block was journaled and so the revoke must take precedence.
    There are two ways to handle this; the second is to finish journaling the block first and write the revoke afterwards. TODO: I have not yet understood the explanation of why.

  • Block is revoked and then written as data:
    The data write is allowed to succeed, but the revoke is not cancelled. We still need to prevent old log records from overwriting the new data. We don't even need to clear the revoke bit here.
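The three interactions can be sketched with two sets per running transaction. The class `ToyTransaction` and its method names are hypothetical and are not the functions in revoke.c; the sketch only mirrors the rules quoted above.

```python
class ToyTransaction:
    def __init__(self):
        self.journaled = set()  # blocks with a log entry in this transaction
        self.revoked = set()    # blocks with a pending revoke in this transaction

    def journal_block(self, block):
        # Journaling cancels any earlier revoke of the same block.
        self.revoked.discard(block)
        self.journaled.add(block)

    def revoke_block(self, block):
        # The log entry, if any, is kept; the revoke wins at replay time.
        self.revoked.add(block)

    def write_data(self, block):
        # A plain data write does not touch the revoke state.
        pass

    def revoke_records_at_commit(self):
        return sorted(self.revoked)


t = ToyTransaction()
t.revoke_block(1); t.journal_block(1)  # revoked, then journaled
t.journal_block(2); t.revoke_block(2)  # journaled, then revoked
t.revoke_block(3); t.write_data(3)     # revoked, then written as data
records = t.revoke_records_at_commit()
```

Because `journal_block` always clears a pending revoke, any revoke still present at commit time must have been issued after the block was last journaled, which is why it is allowed to take precedence.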

5. References

Linux: The Journaling Block Device https://web.archive.org/web/20070926223043/http://kerneltrap.org/node/6741
http://mkatiyar.blogspot.com/2011/07/journal-jbd-revoke-mechanism.html
Understanding the Linux Kernel, 3rd Edition

Article URL: https://awakening-fong.github.io/posts/fs/jbd

Please credit the source when reposting: https://awakening-fong.github.io

