Some basic concepts of database recovery 2018-10-24

1. REDO, UNDO, checkpoint

Write Ahead Logging (WAL) is a standard approach to transaction logging. Briefly, WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged - that is, when log records have been flushed to permanent storage. When we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages will first be redone from the log records (this is roll-forward recovery, also known as REDO) and then changes made by uncommitted transactions will be removed from the data pages (roll-backward recovery - UNDO). [1]
That is: changes that were committed but not yet applied to the data pages are rolled forward (REDO), while changes made by uncommitted transactions must be rolled back (UNDO).
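
As a toy illustration of that ordering rule (my own sketch, not any particular engine's code), the snippet below never writes a data page to "disk" before the log records describing its changes, and treats commit as nothing more than flushing the log:

# Toy WAL sketch: log records reach stable storage before the pages they describe.
log_buffer, stable_log = [], []      # in-memory log tail vs. log on stable storage
page_cache, stable_pages = {}, {}    # dirty pages in memory vs. pages on disk

def log_flush():
    stable_log.extend(log_buffer)
    del log_buffer[:]

def update(txn, obj, before, after):
    log_buffer.append((txn, obj, before, after))   # WAL: write the log record first...
    page_cache[obj] = after                        # ...then modify the cached page

def commit(txn):
    log_buffer.append((txn, 'COMMIT'))
    log_flush()                      # commit only requires the log to be durable

def flush_page(obj):
    log_flush()                      # never let a page overtake its log records
    stable_pages[obj] = page_cache[obj]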

------------------

The transaction recovery log contains log records of the following form:
<txnId, objectId, beforeValue, afterValue>

Given the following log:

1 <T1 BEGIN>
2 <T1, X, 1, 2>
3 <T2 BEGIN>
4 <T3 BEGIN>
5 <T2, Y, 1, 2>
6 <T2 COMMIT>
7 <T1, Y, 2, 3>
8 <T3, Z, 1, 2>
9 <CHECKPOINT>
10 <T1, X, 2, 3>
11 <T1, Y, 3, 4>
12 <T3, Z, 2, 3>
13 <T3 COMMIT>
14 <T1, Z, 3, 4>
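
A worked pass over this log (my own sketch of one simple recovery scheme, not of any particular DBMS): assume the <CHECKPOINT> record means all dirty pages had been flushed at that point and nothing written afterwards reached disk before the crash. T2 and T3 committed, so their changes are rolled forward from the checkpoint; T1 never committed, so all of its changes are rolled back using the before-values.

# Recovery over the log above: REDO committed transactions from the checkpoint
# onwards, then UNDO every change of transactions that never committed,
# scanning the log backwards and restoring before-values.
LOG = [
    ('BEGIN', 'T1'),                 #  1
    ('WRITE', 'T1', 'X', 1, 2),      #  2
    ('BEGIN', 'T2'),                 #  3
    ('BEGIN', 'T3'),                 #  4
    ('WRITE', 'T2', 'Y', 1, 2),      #  5
    ('COMMIT', 'T2'),                #  6
    ('WRITE', 'T1', 'Y', 2, 3),      #  7
    ('WRITE', 'T3', 'Z', 1, 2),      #  8
    ('CHECKPOINT',),                 #  9
    ('WRITE', 'T1', 'X', 2, 3),      # 10
    ('WRITE', 'T1', 'Y', 3, 4),      # 11
    ('WRITE', 'T3', 'Z', 2, 3),      # 12
    ('COMMIT', 'T3'),                # 13
    ('WRITE', 'T1', 'Z', 3, 4),      # 14
]

committed = {r[1] for r in LOG if r[0] == 'COMMIT'}               # {'T2', 'T3'}
ckpt = max(i for i, r in enumerate(LOG) if r[0] == 'CHECKPOINT')
db = {'X': 2, 'Y': 3, 'Z': 2}   # page state flushed at the checkpoint (records 1-8 applied)

for r in LOG[ckpt + 1:]:                          # REDO (roll forward)
    if r[0] == 'WRITE' and r[1] in committed:
        db[r[2]] = r[4]                           # apply after-value: replays record 12

for r in reversed(LOG):                           # UNDO (roll backward)
    if r[0] == 'WRITE' and r[1] not in committed:
        db[r[2]] = r[3]                           # restore before-value: records 14, 11, 10, 7, 2

print(db)   # {'X': 1, 'Y': 2, 'Z': 3}: T2 and T3 kept, every change of T1 undone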

Understanding Raft through the labs 2018-10-16

1. Background

To understand Raft, I worked through lab 2A and lab 2B from the 2018 course material at https://pdos.csail.mit.edu/6.824/labs/lab-raft.html.

2. Implementation and protocol details

2.1 Excerpts from the article

https://thesquareplanet.com/blog/students-guide-to-raft/

  • you might reasonably reset a peer’s election timer whenever you receive an AppendEntries or RequestVote RPC, as both indicate that some other peer either thinks it’s the leader, or is trying to become the leader. Intuitively, this means that we shouldn’t be interfering. However, if you read Figure 2 carefully, it says:

    If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate.

  • many would simply reset their election timer when they received a heartbeat, and then return success, without performing any of the checks specified in Figure 2. This is extremely dangerous. By accepting the RPC, the follower is implicitly telling the leader that their log matches the leader’s log up to and including the prevLogIndex included in the AppendEntries arguments. Upon receiving the reply, the leader might then decide (incorrectly) that some entry has been replicated to a majority of servers, and start committing it. In other words, on receiving a heartbeat the follower must not simply return success; it has to check the prevLogIndex sent by the leader against its own log. This also handles the case where a node loses network connectivity: once it rejoins the cluster, its log is promptly brought back up to date. A minimal sketch of such a handler follows this list.
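
Below is a minimal Python sketch of the follower-side AppendEntries handler that these two points describe. The lab itself is written in Go, and the names used here (Follower, Entry, reset_election_timer) are invented for illustration rather than taken from the lab code.

class Entry(object):
    def __init__(self, term, cmd=None):
        self.term, self.cmd = term, cmd

class Follower(object):
    def __init__(self):
        self.current_term = 0
        self.log = [Entry(0)]        # index 0 is a dummy entry, matching the paper's 1-based indexing
        self.commit_index = 0

    def reset_election_timer(self):
        pass                         # placeholder: restart the randomized election timeout

    def append_entries(self, term, prev_log_index, prev_log_term, entries, leader_commit):
        # A stale leader: reject, and do NOT reset the election timer.
        if term < self.current_term:
            return self.current_term, False

        # The RPC is from the current leader (possibly with a newer term); this
        # is the only reason to reset the election timer in this handler.
        self.current_term = term
        self.reset_election_timer()

        # Figure 2 consistency check: our log must contain an entry at
        # prev_log_index whose term matches prev_log_term.  Returning success
        # without this check would let the leader commit entries we do not hold.
        if prev_log_index >= len(self.log) or self.log[prev_log_index].term != prev_log_term:
            return self.current_term, False

        # Append the new entries; on a conflict (same index, different term),
        # truncate our log from that point before appending.
        for i, e in enumerate(entries):
            idx = prev_log_index + 1 + i
            if idx < len(self.log) and self.log[idx].term != e.term:
                del self.log[idx:]
            if idx >= len(self.log):
                self.log.append(e)

        # Advance commitIndex to min(leaderCommit, index of last new entry).
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, prev_log_index + len(entries))

        return self.current_term, True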

Common RCU problems 2018-03-22

1. No lock protection

Given the following structure:

struct your_obj {
    struct hlist_node obj_node_hlist;
    struct rcu_head rcu_head;   
    atomic_t refcnt;
    int id;
};

Wrong (the removal is not protected against concurrent writers):

hlist_del_init_rcu(&your_obj->obj_node_hlist);  

Correct. RCU only lets readers run without locks; writers that modify the same hlist must still be serialized against each other, here with the hash-bucket spinlock:

spin_lock(&obj_hash_lock[hash]);
hlist_del_init_rcu(&your_obj->obj_node_hlist);
spin_unlock(&obj_hash_lock[hash]);

Under what circumstances is bio->bi_end_io called with interrupts disabled? 2018-03-11

1. The problem

Normally bio->bi_end_io runs in softirq context: if you check in_irq(), in_softirq(), in_serving_softirq() and irqs_disabled() inside bio->bi_end_io, only in_softirq() and in_serving_softirq() are true.

Now the problem: sometimes spin_lock_bh() fails to keep this completion handler from running. Why?

2. The answer

Testing shows that inside bio->bi_end_io(), sometimes in_irq(), in_softirq(), in_serving_softirq() and irqs_disabled() are all true, and sometimes only in_irq() and irqs_disabled() are true.

The call stack contains the virtio_blk module, i.e. the system is running in a virtual machine: request completion is driven directly from the hard interrupt handler (vring_interrupt -> blk_done -> ... -> bio_endio), so bio->bi_end_io runs in hard-IRQ context with interrupts disabled. spin_lock_bh() only disables bottom halves and therefore does not exclude this path; data shared with bi_end_io needs an irq-disabling lock such as spin_lock_irqsave(). The call stack for the case where only in_irq() and irqs_disabled() are true:

Pid: 3660, comm: mount Not tainted 2.6.32-debug #2
Call Trace:
 <IRQ>  [<ffffffffa03e18f5>] ? your_bio_end_io+0x2b5/0x310 [your_kmod]
 [<ffffffff811e3a5d>] ? bio_endio+0x1d/0x40
 [<ffffffffa0003efc>] ? dec_pending+0x1cc/0x320 [dm_mod]
 [<ffffffffa0003d7d>] ? dec_pending+0x4d/0x320 [dm_mod]
 [<ffffffffa00040ef>] ? clone_endio+0x9f/0xd0 [dm_mod]
 [<ffffffff811e3a5d>] ? bio_endio+0x1d/0x40
 [<ffffffff8128ef7b>] ? req_bio_endio+0x9b/0xe0
 [<ffffffff812906dc>] ? blk_update_request+0x11c/0x520
 [<ffffffff81290999>] ? blk_update_request+0x3d9/0x520
 [<ffffffff81290b07>] ? blk_update_bidi_request+0x27/0xa0
 [<ffffffff81291aae>] ? __blk_end_request_all+0x2e/0x60
 [<ffffffffa006321a>] ? blk_done+0x4a/0x110 [virtio_blk]
 [<ffffffffa005638c>] ? vring_interrupt+0x3c/0xe0 [virtio_ring]
 [<ffffffff810fc970>] ? handle_IRQ_event+0x50/0x160
 [<ffffffff810ff2f0>] ? handle_edge_irq+0xe0/0x170
 [<ffffffff8100fdc9>] ? handle_irq+0x49/0xa0
 [<ffffffff81570e7c>] ? do_IRQ+0x6c/0xf0
 [<ffffffff8100ba93>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff8118dd73>] ? __kmalloc+0x143/0x2c0
 [<ffffffffa00c4e87>] ? ext4_mb_add_groupinfo+0xd7/0x1e0 [ext4]
 [<ffffffffa00c4e87>] ? ext4_mb_add_groupinfo+0xd7/0x1e0 [ext4]
 [<ffffffffa00c5152>] ? ext4_mb_init+0x1c2/0x450 [ext4]
 [<ffffffffa00b7128>] ? ext4_fill_super+0x2358/0x2950 [ext4]
 [<ffffffff812b8684>] ? snprintf+0x34/0x40
 [<ffffffff811ac141>] ? get_sb_bdev+0x191/0x1d0
 [<ffffffffa00b4dd0>] ? ext4_fill_super+0x0/0x2950 [ext4]
 [<ffffffffa00b04f8>] ? ext4_get_sb+0x18/0x20 [ext4]
 [<ffffffff811ab51b>] ? vfs_kern_mount+0x7b/0x1b0
 [<ffffffff811ab6c2>] ? do_kern_mount+0x52/0x130
 [<ffffffff811cdc5b>] ? do_mount+0x2fb/0x920
 [<ffffffff8115a6d4>] ? strndup_user+0x64/0xc0
 [<ffffffff811ce310>] ? sys_mount+0x90/0xe0
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b

Measuring the interval between two trace points with ftrace 2018-02-03

1. Extracting the intervals with Python

# -*- coding: utf-8 -*-
# Extract the interval between two trace points from ftrace output.
# After adding the corresponding trace_printk() calls, the ftrace output
# looks like:
#           <...>-3084  [000] 352166.459260: your_func_01: your_point_01
#           <...>-2892  [000] 352166.459529: your_func_02: your_point_02

read_point_start = 1            # 1: looking for the start point, 0: looking for the end point
f_in = open('interval.ftrace', 'r')
f_out = open('interval.dat', 'w')
line_no = 1

while 1:
    line = f_in.readline()
    if not line:
        break

    if read_point_start == 1:
        if "your_func_01" in line:
            # The timestamp is the last field before the first ':', e.g. "352166.459260".
            val_point_start = line.split(":")[0].split(" ")[-1]
            read_point_start = 0
            print val_point_start,
    elif "your_func_02" in line:
        val_point_end = line.split(":")[0].split(" ")[-1]
        print val_point_end,
        # Interval in microseconds (the timestamps are in seconds).
        interval = int((float(val_point_end) - float(val_point_start)) * 1000 * 1000)
        print interval
        # One "index interval" pair per line.
        f_out.write(str(line_no) + " " + str(interval) + '\n')
        line_no += 1
        read_point_start = 1

f_out.close()
f_in.close()
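
To get a quick feel for the extracted numbers, a small follow-up snippet (my own addition, reading the interval.dat produced above) prints a summary; it runs under both Python 2 and 3:

# Summarize the intervals (in microseconds) written to interval.dat above.
intervals = [int(line.split()[1]) for line in open('interval.dat')]
if intervals:
    print("count=%d min=%d avg=%.1f max=%d" % (
        len(intervals), min(intervals),
        sum(intervals) / float(len(intervals)), max(intervals)))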