2019-09-11

目录:

  • 1. log的格式
  • 1.1 整体格式
  • 1.2 log文件的 header
  • 1.3 大小端
  • 1.4 trailer
  • 1.5 fragment
  • 2. MemTable
  • 2.1 SkipList
  • 3. WriteBatch
  • 4. get
  • 5. ldb, sst格式
  • 6. BlockBuilder
  • 7. 杂
  • 7.1 个人标记方法
  • 7.2 tag

  • 1. log的格式

    1.1 整体格式

    参考 Writer::AddRecord -> Writer::EmitPhysicalRecord

    kHeader   WriteBatch::rep_
    |-------|-------------------|
    
    
    
    kHeader   WriteBatch::rep_   kHeader rep_(1/2) trailer kHeader rep_(2/2)
    |-------|-------------------|-------|.........|------|-------|.........|
    \-------------------------v--------------------------/
                         kBlockSize(32KB)
    

    1.2 log文件的 header

    checksum  len type
    |--------|---|---|--------------|
    \------- v -----/\------v------/
        kHeader          len
    

    kHeaderSize:
    checksum (4B), length (2B), type (1B)

    header中的len为 header后面跟的数据的长度. 根据
    leveldb/db/log_reader.cc
    ReadPhysicalRecord()
    if (kHeaderSize + length > buffer_.size()) {错误情况处理}

    unsigned int Reader::ReadPhysicalRecord(Slice* result) {
      ...略...
      *result = Slice(header + kHeaderSize, length);
      return type;
    }
    

    1.3 大小端

    unsigned int Reader::ReadPhysicalRecord(...) {
      // Parse the header
      const char* header = buffer_.data();
      const uint32_t a = static_cast<uint32_t>(header[4]) & 0xff;
      const uint32_t b = static_cast<uint32_t>(header[5]) & 0xff;
      const unsigned int type = header[6];
      const uint32_t length = a | (b << 8);
    }
    

    写入时:

    Writer::EmitPhysicalRecord
    {
      char buf[kHeaderSize];
      buf[4] = static_cast<char>(n & 0xff);
      buf[5] = static_cast<char>(n >> 8);
      buf[6] = static_cast<char>(t);
    }
    

    1.4 trailer

    https://github.com/google/leveldb/blob/master/doc/log_format.md

    A record never starts within the last six bytes of a block (since it won't fit). Any leftover bytes here form the trailer, which must consist entirely of zero bytes and must be skipped by readers. 若一个记录块只剩下6个字节的空余, 那么, 这些空间不会用来存储key-value, 我们称块的这最后几个字节空间(可以小于6字节)为 trailer.

    awakening-fong注释: see Writer::AddRecord()

    1.5 fragment

    leveldb/db/log_format.h

    enum RecordType {
      // Zero is reserved for preallocated files
      kZeroType = 0,
    
      kFullType = 1,
    
      // For fragments
      kFirstType = 2,
      kMiddleType = 3,
      kLastType = 4
    };
    

    一个完整块 被分为 多个fragment 来存储.

    2. MemTable

    leveldb/db/write_batch.cc

    class MemTableInserter : public WriteBatch::Handler {
    
    };
    

    leveldb/db/memtable.cc void MemTable::Add(....)

    每条记录在内存中的格式:

    size+8     userkey  seq|type size  value
    |--------|--------|---------|-----|------|
    内部大小              8字节
    

    2.1 SkipList

     MemTable::Add
     |--typedef SkipList<const char*, KeyComparator> Table;
     |--Table table_;
     |--char* buf = arena_.Allocate(encoded_len)
     |--buf = xxx
     |--table_.Insert(buf)
    

    参考 MemTable::Get SkipList节点的格式:

                                   也称为tag
                                   /---^---\   
    |  internal_key_size | user key|seq type|val size|val|
    \--user key size +8--/          \--8B--/
    
                         ^
                      key_ptr   
    

    MemTable::Get()中 称seq+type为tag. 是不是只有type是tag??

    SkipList 的内存组织:

    template<typename Key, class Comparator>
    struct SkipList<Key,Comparator>::Node {
      explicit Node(const Key& k) : key(k) { }
    
      // Array of length equal to the node height.  next_[0] is lowest level link.
    port::AtomicPointer next_[1];
    
    };
    
    class MemTable {
    
    typedef SkipList<const char*, KeyComparator> Table;
    };
    

    Skiplist add element

    (图片来源 https://en.wikipedia.org/wiki/Skip_list)

    wiki动图中, 何时要放在第2层?
    答:好像是随机的, wiki上的动图中的coin flip就是这个意思.

    问题: leveldb代码上, 如何体现这点?
    答: wiki上, 高是固定的; 对新的节点, 随机地决定, 是否移动到较高的level上.
    而leveldb中, 在新添加一个节点时, 总是尝试生成一个新的高度,

    level3  o'     o'       o     
    level2  o      o'       o      o
    level1  o      o        o      o
    level0  o      o        o      o
    新节点值 11      17      24    43          
    生成高度 3       2       4      3
    插入顺序 A       B       D      C
    

    然后对新节点x执行:

    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
    prev[i]->SetNext(i, x);
    

    上图中 B 上方后续会有两个o' 吗?
    答: 会的.

    
    void SkipList<Key,Comparator>::Insert(const Key& key) {
    
      int height = RandomHeight();
      if (height > GetMaxHeight()) {
        对新的高度, prev[i] = head_;
      }
    }
    

    3. WriteBatch

    根据 WriteBatch::Put() 以及 WriteBatch 的ctor

    rep_:

      seq     cnt  type len+key  len+value
    |--------|---|-----|--------|----------|
        8B    4B  \----------v------------/
                          一条记录
    cnt为记录的个数
    
    根据 WriteBatch::Iterate
    type为: kTypeValue 或 kTypeDeletion                      
    

    4. get

    see LookupKey ctor

                       internal key
               -------------v--------------
              /                            \
    |usize + 8| user_key  | seq(7B) type(1B)|
    ^         ^                             ^
    start_   kstart_                       end_
    
    \-------------------v------------------/
                   memtable_key
    

    5. ldb, sst格式

    doc/table_format.md

    footer(参考Table::Open):

          metaindex     index
    |off(8B)|size(8B)|off|size|--------  |magic(8B)|
      kEncodedLength
    

    size不包括 kBlockTrailerSize,

    kBlockTrailerSize: // 1-byte type + 32-bit crc
    type:kNoCompression 或 kSnappyCompression

    metaindex 用来告知每段meta block的起始位置和长度.
    index 用来告知每段data block的起始位置和长度.

    block:

              /-----NumRestarts个-------\
    |        |restart|restart|...|restart|NumRestarts| kBlockTrailerSize |
             ^
        restart_offset_
    \--------------------size (handle给出)-----------/    
    
    1. 读foot file->Read()
    2. 读index ReadBlock(,,index_handle,)
    3. 读meta ReadMeta
    4. 得到 Filter
    // An entry for a particular key-value pair has the form:
    //     shared_bytes: varint32
    //     unshared_bytes: varint32
    //     value_length: varint32
    //     key_delta: char[unshared_bytes]
    //     value: char[value_length]
    // shared_bytes == 0 for restart points.
    
    Table::ReadMeta() {
     if (rep_->options.filter_policy == nullptr) {
       return;  // Do not need any metadata
     }
    }
    
    Table::ReadMeta
    |--Block* meta = new Block(contents);
    |--std::string key = "filter.";
    |--key.append(rep_->options.filter_policy->Name());
    |--iter->Seek(key)
    |--ReadFilter(iter->value()) //参数是 BlockHandle
    |       |--ReadBlock()  // 读出 filter
    |       |--FilterBlockReader //用来读取 filter block
    

    filter block: see FilterBlockReader()

              /-- num_*4个--\
    |........ |.............|offset array(4B)|base_lg_(1B)|
    ^         ^
    data_     offset_
    
    The filter block is formatted as follows:
    
        [filter 0]
        [filter 1]
        [filter 2]
        ...
        [filter N-1]
    
        [offset of filter 0]                  : 4 bytes
        [offset of filter 1]                  : 4 bytes
        [offset of filter 2]                  : 4 bytes
        ...
        [offset of filter N-1]                : 4 bytes
    
        [offset of beginning of offset array] : 4 bytes
        lg(base)                              : 1 byte
    

    6. BlockBuilder

    see BlockBuilder::Add()

    <shared><non_shared><value_size><key中non_shared的那部分><value>
    

    7. 杂

    7.1 个人标记方法

    ^ 表示读出 \---v---/ 表示圈定的范围

    static Slice GetLengthPrefixedSlice(const char* data) {
      uint32_t len;
      const char* p = data;
      /*
      开头给出长度,
       len
        ^
      |---|--------------|
           \------v-----/
                 len
      */
      p = GetVarint32Ptr(p, p + 5, &len);  // +5: we assume "p" is not corrupted
      return Slice(p, len);
    }
    

    所以, 上面表示, 开头内容为len, 随后是长度为len的数据.

    7.2 tag

    leveldb/db/version_set.cc:

    VersionSet::Recover()
    {
      reader.ReadRecord(&record, &scratch);
      VersionEdit edit;
      edit.DecodeFrom(record)
    }
    
    Status VersionEdit::DecodeFrom(..)
    {
    
      /*
      MANIFEST文件 record的数据是:
    
      |------|------|-----|
       header  tags
      */
      while (msg == nullptr && GetVarint32(&input, &tag)) {
        switch (tag) {
          case kComparator: ...
          case kLogNumber: ...
          case kPrevLogNumber: ...
          ...
      }
    }
    

    本文地址: https://awakening-fong.github.io/posts/database/leveldb_03_data_format

    转载请注明出处: https://awakening-fong.github.io


    若无法评论, 请打开JavaScript, 并通过proxy.


    blog comments powered by Disqus