LevelDB 03: 数据格式
目录:
1. log的格式
1.1 整体格式
参考 Writer::AddRecord -> Writer::EmitPhysicalRecord
kHeader   WriteBatch::rep_
|-------|-------------------|
kHeader   WriteBatch::rep_   kHeader rep_(1/2) trailer kHeader rep_(2/2)
|-------|-------------------|-------|.........|------|-------|.........|
\-------------------------v--------------------------/
                     kBlockSize(32KB)
1.2 log文件的 header
checksum  len type
|--------|---|---|--------------|
\------- v -----/\------v------/
    kHeader          len
kHeaderSize:  
checksum (4B), length (2B), type (1B)
header中的len为 header后面跟的数据的长度.
根据
leveldb/db/log_reader.cc
ReadPhysicalRecord()
if (kHeaderSize + length > buffer_.size()) {错误情况处理}
unsigned int Reader::ReadPhysicalRecord(Slice* result) {
  ...略...
  *result = Slice(header + kHeaderSize, length);
  return type;
}
1.3 大小端
unsigned int Reader::ReadPhysicalRecord(...) {
  // Parse the header
  const char* header = buffer_.data();
  const uint32_t a = static_cast<uint32_t>(header[4]) & 0xff;
  const uint32_t b = static_cast<uint32_t>(header[5]) & 0xff;
  const unsigned int type = header[6];
  const uint32_t length = a | (b << 8);
}
写入时:
Writer::EmitPhysicalRecord
{
  char buf[kHeaderSize];
  buf[4] = static_cast<char>(n & 0xff);
  buf[5] = static_cast<char>(n >> 8);
  buf[6] = static_cast<char>(t);
}
1.4 trailer
https://github.com/google/leveldb/blob/master/doc/log_format.md
A record never starts within the last six bytes of a block (since it won't fit). Any leftover bytes here form the trailer, which must consist entirely of zero bytes and must be skipped by readers. 若一个记录块只剩下6个字节的空余, 那么, 这些空间不会用来存储key-value, 我们称块的这最后几个字节空间(可以小于6字节)为 trailer.
awakening-fong注释: see Writer::AddRecord()
1.5 fragment
leveldb/db/log_format.h
enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,
  kFullType = 1,
  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4
};
一个完整块 被分为 多个fragment 来存储.
2. MemTable
leveldb/db/write_batch.cc
class MemTableInserter : public WriteBatch::Handler {
};
leveldb/db/memtable.cc void MemTable::Add(....)
每条记录在内存中的格式:
size+8     userkey  seq|type size  value
|--------|--------|---------|-----|------|
内部大小              8字节
2.1 SkipList
 MemTable::Add
 |--typedef SkipList<const char*, KeyComparator> Table;
 |--Table table_;
 |--char* buf = arena_.Allocate(encoded_len)
 |--buf = xxx
 |--table_.Insert(buf)
参考 MemTable::Get SkipList节点的格式:
                               也称为tag
                               /---^---\   
|  internal_key_size | user key|seq type|val size|val|
\--user key size +8--/          \--8B--/
                     ^
                  key_ptr   
MemTable::Get()中 称seq+type为tag. 是不是只有type是tag??
SkipList 的内存组织:
template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
  explicit Node(const Key& k) : key(k) { }
  // Array of length equal to the node height.  next_[0] is lowest level link.
port::AtomicPointer next_[1];
};
class MemTable {
typedef SkipList<const char*, KeyComparator> Table;
};


(图片来源 https://en.wikipedia.org/wiki/Skip_list)
wiki动图中, 何时要放在第2层?
答:好像是随机的, wiki上的动图中的coin flip就是这个意思.
问题: leveldb代码上, 如何体现这点?
答:
wiki上, 高是固定的; 对新的节点, 随机地决定, 是否移动到较高的level上.
而leveldb中,  在新添加一个节点时, 总是尝试生成一个新的高度,
level3  o'     o'       o     
level2  o      o'       o      o
level1  o      o        o      o
level0  o      o        o      o
新节点值 11      17      24    43          
生成高度 3       2       4      3
插入顺序 A       B       D      C
然后对新节点x执行:
x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
prev[i]->SetNext(i, x);
上图中 B 上方后续会有两个o' 吗?
答: 会的.
void SkipList<Key,Comparator>::Insert(const Key& key) {
  int height = RandomHeight();
  if (height > GetMaxHeight()) {
    对新的高度, prev[i] = head_;
  }
}
3. WriteBatch
根据 WriteBatch::Put() 以及 WriteBatch 的ctor
rep_:
  seq     cnt  type len+key  len+value
|--------|---|-----|--------|----------|
    8B    4B  \----------v------------/
                      一条记录
cnt为记录的个数
根据 WriteBatch::Iterate
type为: kTypeValue 或 kTypeDeletion                      
4. get
see LookupKey ctor
                   internal key
           -------------v--------------
          /                            \
|usize + 8| user_key  | seq(7B) type(1B)|
^         ^                             ^
start_   kstart_                       end_
\-------------------v------------------/
               memtable_key
5. ldb, sst格式
doc/table_format.md
footer(参考Table::Open):
      metaindex     index
|off(8B)|size(8B)|off|size|--------  |magic(8B)|
  kEncodedLength
size不包括 kBlockTrailerSize,
kBlockTrailerSize:
// 1-byte type + 32-bit crc
type:kNoCompression 或 kSnappyCompression
metaindex 用来告知每段meta block的起始位置和长度.
index 用来告知每段data block的起始位置和长度.
block:
          /-----NumRestarts个-------\
|        |restart|restart|...|restart|NumRestarts| kBlockTrailerSize |
         ^
    restart_offset_
\--------------------size (handle给出)-----------/    
- 读foot  file->Read()
 - 读index   ReadBlock(,,index_handle,)
 - 读meta   ReadMeta
 - 得到 Filter
 
// An entry for a particular key-value pair has the form:
//     shared_bytes: varint32
//     unshared_bytes: varint32
//     value_length: varint32
//     key_delta: char[unshared_bytes]
//     value: char[value_length]
// shared_bytes == 0 for restart points.
Table::ReadMeta() {
 if (rep_->options.filter_policy == nullptr) {
   return;  // Do not need any metadata
 }
}
Table::ReadMeta
|--Block* meta = new Block(contents);
|--std::string key = "filter.";
|--key.append(rep_->options.filter_policy->Name());
|--iter->Seek(key)
|--ReadFilter(iter->value()) //参数是 BlockHandle
|       |--ReadBlock()  // 读出 filter
|       |--FilterBlockReader //用来读取 filter block
filter block: see FilterBlockReader()
          /-- num_*4个--\
|........ |.............|offset array(4B)|base_lg_(1B)|
^         ^
data_     offset_
The filter block is formatted as follows:
    [filter 0]
    [filter 1]
    [filter 2]
    ...
    [filter N-1]
    [offset of filter 0]                  : 4 bytes
    [offset of filter 1]                  : 4 bytes
    [offset of filter 2]                  : 4 bytes
    ...
    [offset of filter N-1]                : 4 bytes
    [offset of beginning of offset array] : 4 bytes
    lg(base)                              : 1 byte
6. BlockBuilder
see BlockBuilder::Add()
<shared><non_shared><value_size><key中non_shared的那部分><value>
7. 杂
7.1 个人标记方法
^ 表示读出
\---v---/ 表示圈定的范围
static Slice GetLengthPrefixedSlice(const char* data) {
  uint32_t len;
  const char* p = data;
  /*
  开头给出长度,
   len
    ^
  |---|--------------|
       \------v-----/
             len
  */
  p = GetVarint32Ptr(p, p + 5, &len);  // +5: we assume "p" is not corrupted
  return Slice(p, len);
}
所以, 上面表示, 开头内容为len, 随后是长度为len的数据.
7.2 tag
leveldb/db/version_set.cc:
VersionSet::Recover()
{
  reader.ReadRecord(&record, &scratch);
  VersionEdit edit;
  edit.DecodeFrom(record)
}
Status VersionEdit::DecodeFrom(..)
{
  /*
  MANIFEST文件 record的数据是:
  |------|------|-----|
   header  tags
  */
  while (msg == nullptr && GetVarint32(&input, &tag)) {
    switch (tag) {
      case kComparator: ...
      case kLogNumber: ...
      case kPrevLogNumber: ...
      ...
  }
}
	  本文地址: https://awakening-fong.github.io/posts/database/leveldb_03_data_format
转载请注明出处: https://awakening-fong.github.io
若无法评论, 请打开JavaScript, 并通过proxy.
blog comments powered by Disqus