LevelDB 03: 数据格式
目录:
1. log的格式
1.1 整体格式
参考 Writer::AddRecord -> Writer::EmitPhysicalRecord
kHeader WriteBatch::rep_
|-------|-------------------|
kHeader WriteBatch::rep_ kHeader rep_(1/2) trailer kHeader rep_(2/2)
|-------|-------------------|-------|.........|------|-------|.........|
\-------------------------v--------------------------/
kBlockSize(32KB)
1.2 log文件的 header
checksum len type
|--------|---|---|--------------|
\------- v -----/\------v------/
kHeader len
kHeaderSize:
checksum (4B), length (2B), type (1B)
header中的len为 header后面跟的数据的长度.
根据
leveldb/db/log_reader.cc
ReadPhysicalRecord()
if (kHeaderSize + length > buffer_.size()) {错误情况处理}
unsigned int Reader::ReadPhysicalRecord(Slice* result) {
...略...
*result = Slice(header + kHeaderSize, length);
return type;
}
1.3 大小端
unsigned int Reader::ReadPhysicalRecord(...) {
// Parse the header
const char* header = buffer_.data();
const uint32_t a = static_cast<uint32_t>(header[4]) & 0xff;
const uint32_t b = static_cast<uint32_t>(header[5]) & 0xff;
const unsigned int type = header[6];
const uint32_t length = a | (b << 8);
}
写入时:
Writer::EmitPhysicalRecord
{
char buf[kHeaderSize];
buf[4] = static_cast<char>(n & 0xff);
buf[5] = static_cast<char>(n >> 8);
buf[6] = static_cast<char>(t);
}
1.4 trailer
https://github.com/google/leveldb/blob/master/doc/log_format.md
A record never starts within the last six bytes of a block (since it won't fit). Any leftover bytes here form the trailer, which must consist entirely of zero bytes and must be skipped by readers. 若一个记录块只剩下6个字节的空余, 那么, 这些空间不会用来存储key-value, 我们称块的这最后几个字节空间(可以小于6字节)为 trailer.
awakening-fong注释: see Writer::AddRecord()
1.5 fragment
leveldb/db/log_format.h
enum RecordType {
// Zero is reserved for preallocated files
kZeroType = 0,
kFullType = 1,
// For fragments
kFirstType = 2,
kMiddleType = 3,
kLastType = 4
};
一个完整块 被分为 多个fragment 来存储.
2. MemTable
leveldb/db/write_batch.cc
class MemTableInserter : public WriteBatch::Handler {
};
leveldb/db/memtable.cc void MemTable::Add(....)
每条记录在内存中的格式:
size+8 userkey seq|type size value
|--------|--------|---------|-----|------|
内部大小 8字节
2.1 SkipList
MemTable::Add
|--typedef SkipList<const char*, KeyComparator> Table;
|--Table table_;
|--char* buf = arena_.Allocate(encoded_len)
|--buf = xxx
|--table_.Insert(buf)
参考 MemTable::Get SkipList节点的格式:
也称为tag
/---^---\
| internal_key_size | user key|seq type|val size|val|
\--user key size +8--/ \--8B--/
^
key_ptr
MemTable::Get()中 称seq+type为tag. 是不是只有type是tag??
SkipList 的内存组织:
template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
explicit Node(const Key& k) : key(k) { }
// Array of length equal to the node height. next_[0] is lowest level link.
port::AtomicPointer next_[1];
};
class MemTable {
typedef SkipList<const char*, KeyComparator> Table;
};
(图片来源 https://en.wikipedia.org/wiki/Skip_list)
wiki动图中, 何时要放在第2层?
答:好像是随机的, wiki上的动图中的coin flip就是这个意思.
问题: leveldb代码上, 如何体现这点?
答:
wiki上, 高是固定的; 对新的节点, 随机地决定, 是否移动到较高的level上.
而leveldb中, 在新添加一个节点时, 总是尝试生成一个新的高度,
level3 o' o' o
level2 o o' o o
level1 o o o o
level0 o o o o
新节点值 11 17 24 43
生成高度 3 2 4 3
插入顺序 A B D C
然后对新节点x执行:
x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
prev[i]->SetNext(i, x);
上图中 B 上方后续会有两个o' 吗?
答: 会的.
void SkipList<Key,Comparator>::Insert(const Key& key) {
int height = RandomHeight();
if (height > GetMaxHeight()) {
对新的高度, prev[i] = head_;
}
}
3. WriteBatch
根据 WriteBatch::Put() 以及 WriteBatch 的ctor
rep_:
seq cnt type len+key len+value
|--------|---|-----|--------|----------|
8B 4B \----------v------------/
一条记录
cnt为记录的个数
根据 WriteBatch::Iterate
type为: kTypeValue 或 kTypeDeletion
4. get
see LookupKey ctor
internal key
-------------v--------------
/ \
|usize + 8| user_key | seq(7B) type(1B)|
^ ^ ^
start_ kstart_ end_
\-------------------v------------------/
memtable_key
5. ldb, sst格式
doc/table_format.md
footer(参考Table::Open):
metaindex index
|off(8B)|size(8B)|off|size|-------- |magic(8B)|
kEncodedLength
size不包括 kBlockTrailerSize,
kBlockTrailerSize:
// 1-byte type + 32-bit crc
type:kNoCompression 或 kSnappyCompression
metaindex 用来告知每段meta block的起始位置和长度.
index 用来告知每段data block的起始位置和长度.
block:
/-----NumRestarts个-------\
| |restart|restart|...|restart|NumRestarts| kBlockTrailerSize |
^
restart_offset_
\--------------------size (handle给出)-----------/
- 读foot file->Read()
- 读index ReadBlock(,,index_handle,)
- 读meta ReadMeta
- 得到 Filter
// An entry for a particular key-value pair has the form:
// shared_bytes: varint32
// unshared_bytes: varint32
// value_length: varint32
// key_delta: char[unshared_bytes]
// value: char[value_length]
// shared_bytes == 0 for restart points.
Table::ReadMeta() {
if (rep_->options.filter_policy == nullptr) {
return; // Do not need any metadata
}
}
Table::ReadMeta
|--Block* meta = new Block(contents);
|--std::string key = "filter.";
|--key.append(rep_->options.filter_policy->Name());
|--iter->Seek(key)
|--ReadFilter(iter->value()) //参数是 BlockHandle
| |--ReadBlock() // 读出 filter
| |--FilterBlockReader //用来读取 filter block
filter block: see FilterBlockReader()
/-- num_*4个--\
|........ |.............|offset array(4B)|base_lg_(1B)|
^ ^
data_ offset_
The filter block is formatted as follows:
[filter 0]
[filter 1]
[filter 2]
...
[filter N-1]
[offset of filter 0] : 4 bytes
[offset of filter 1] : 4 bytes
[offset of filter 2] : 4 bytes
...
[offset of filter N-1] : 4 bytes
[offset of beginning of offset array] : 4 bytes
lg(base) : 1 byte
6. BlockBuilder
see BlockBuilder::Add()
<shared><non_shared><value_size><key中non_shared的那部分><value>
7. 杂
7.1 个人标记方法
^
表示读出
\---v---/
表示圈定的范围
static Slice GetLengthPrefixedSlice(const char* data) {
uint32_t len;
const char* p = data;
/*
开头给出长度,
len
^
|---|--------------|
\------v-----/
len
*/
p = GetVarint32Ptr(p, p + 5, &len); // +5: we assume "p" is not corrupted
return Slice(p, len);
}
所以, 上面表示, 开头内容为len, 随后是长度为len的数据.
7.2 tag
leveldb/db/version_set.cc:
VersionSet::Recover()
{
reader.ReadRecord(&record, &scratch);
VersionEdit edit;
edit.DecodeFrom(record)
}
Status VersionEdit::DecodeFrom(..)
{
/*
MANIFEST文件 record的数据是:
|------|------|-----|
header tags
*/
while (msg == nullptr && GetVarint32(&input, &tag)) {
switch (tag) {
case kComparator: ...
case kLogNumber: ...
case kPrevLogNumber: ...
...
}
}
本文地址: https://awakening-fong.github.io/posts/database/leveldb_03_data_format
转载请注明出处: https://awakening-fong.github.io
若无法评论, 请打开JavaScript, 并通过proxy.
blog comments powered by Disqus