Commit fb6fab2

update readme.
1 parent 7ccc87a commit fb6fab2

2 files changed

+248
-1
lines changed

:wqa

Lines changed: 247 additions & 0 deletions
@@ -0,0 +1,247 @@
1+
# Access Bitcoin UTXO Set With the Bitcoin Core chainstate Database
2+
C++ project to access Bitcoin UTXO data by parsing the `chainstate` database.
3+
4+
Build
5+
-----
6+
* Clone this project
7+
* Run `make` in project root to build
8+
9+
Usage
10+
-----
11+
`./bin/main <option(s)> <SOURCE DATABASE>`
12+
13+
Options:
14+
* `-h,--help`: Show this help message
15+
* `-m,--mode <mode>`: `dump_all` to dump all UTXOs, `single` for a single txid. Default is `single`.
16+
* `-t, --txid <txid>`: Hex string representation of the txid to look up. If mode is `single` and no txid is provided, the user will be prompted to enter one.
17+
* `-o <vout>`: Output index of the required outpoint
18+
19+
Examples:
20+
* Run `./bin/main <path to chainstate database>` to look up an individual UTXO: the programme prompts for `txid` and `vout` on `stdin`.
21+
* Run `./bin/main -m dump_all <path to chainstate database>` to dump all UTXOs into CSV format on `stdout` (this will be a lengthy process)
22+
* As above, redirect output to a file: `./bin/main -m dump_all <path to chainstate database> > path/results.csv`
23+
24+
The `chainstate` database should be a copy, not currently being accessed by Bitcoin Core.
25+
26+
UTXOs
27+
-----
28+
The Unspent Transaction Output (UTXO) set is the subset of all Bitcoin transaction outputs that have not yet been spent.
29+
30+
For a new transaction to be valid, it must have access to UTXOs that can be used as inputs - the creator of the transaction must be able to meet the spending conditions of the transaction input UTXOs. Transactions consume UTXOs as inputs and create new UTXOs as outputs - with spending conditions locked such that the intended recipient can unlock the new UTXOs.
31+
32+
The UTXO set contains all unspent outputs - it therefore contains all the data necessary to validate new transactions.
33+
34+
UTXOs consist of two parts:
35+
36+
1. The amount transferred to the output
37+
2. The locking script `scriptPubKey` that specifies the spending conditions for the output
38+
39+
Outpoints
40+
---------
41+
Each transaction (with the exception of coinbase transactions) spends at least one UTXO from a previous transaction. Within a transaction, the UTXOs being spent are referenced by "outpoints".
42+
43+
A single transaction can have multiple outputs - so each outpoint consists of the transaction ID (`txid`) and the output index number (often referred to as `vout`) of the UTXO it references.
44+
45+
Together, the `txid` and output index are known as the UTXO outpoint.
46+
47+
Local Database: chainstate
48+
--------------------------
49+
Bitcoin Core full nodes store UTXO data in the `chainstate` LevelDB database. Data is stored in a per-output model - each entry in the chainstate database represents a single UTXO.
50+
51+
UTXO data is stored in this way so that transactions can be validated and new transactions created without the necessity of checking the entire blockchain.
52+
53+
At the time of writing, the `chainstate` database is approximately 4GB in size.
54+
55+
Data Storage Format
56+
-------------------
57+
[LevelDB][8] is a simple on-disk key-value store. Keys and values are arbitrary byte arrays, and entries are stored sorted by key.
58+
59+
Indexing is not supported.
60+
61+
Though LevelDB keys & values are strings, they are not C-style null-terminated strings - this is because LevelDB keys & values may contain null bytes.
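
A minimal C++ sketch of why this matters: a `std::string` built with an explicit length keeps embedded null bytes, whereas C string functions stop at the first one. The helper names `obfuscation_key_name` and `c_string_length` are hypothetical; the bytes are those of the obfuscation key entry described later in this document.

```c++
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: the 15-byte chainstate key under which the
// obfuscation key is stored (0x0e, 0x00, then "obfuscate_key").
std::string obfuscation_key_name() {
    static const char raw[] = {0x0e, 0x00, 'o', 'b', 'f', 'u', 's', 'c',
                               'a', 't', 'e', '_', 'k', 'e', 'y'};
    // Pass the length explicitly so the embedded 0x00 is preserved.
    return std::string(raw, sizeof raw);
}

// Length as seen by C string functions, which stop at the first 0x00.
std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

Here `obfuscation_key_name().size()` is 15, while `c_string_length` reports 1 for the same data. LevelDB's `Slice` type carries an explicit length for the same reason.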
62+
63+
Chainstate Keys
64+
---------------
65+
Keys in the `chainstate` database consist of a little-endian representation of the `txid` prepended with the single byte `0x43` ('C') and appended with a Varint (see following section) representation of the vout:
66+
67+
`0x43<txid, little endian><vout>`
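
The key layout can be sketched as follows. This is an illustration rather than this project's actual code: the function names are hypothetical, and it assumes the txid is supplied as the usual 64-character hex string in display order (as shown by block explorers), so its bytes are reversed to obtain the stored order.

```c++
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Base-128 Varint encoding, as described in the Varint section of this document.
std::string encode_varint(uint64_t n) {
    std::string out(1, static_cast<char>(n & 0x7F));   // final digit, MSB clear
    while (n > 0x7F) {
        n = (n >> 7) - 1;                               // 1 is subtracted for every non-final digit
        out.insert(out.begin(), static_cast<char>((n & 0x7F) | 0x80));
    }
    return out;
}

// Convert one hex character to its 4-bit value.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex character");
}

// Hypothetical helper: 0x43, then the txid bytes reversed, then the Varint vout.
std::string make_chainstate_key(const std::string& txid_hex, uint32_t vout) {
    if (txid_hex.size() != 64) {
        throw std::invalid_argument("txid must be 64 hex characters");
    }
    std::string key(1, 'C');                            // 0x43 prefix
    // Walk the hex string backwards, one byte (two characters) at a time.
    for (std::size_t i = txid_hex.size(); i > 0; i -= 2) {
        key.push_back(static_cast<char>((hex_value(txid_hex[i - 2]) << 4) |
                                        hex_value(txid_hex[i - 1])));
    }
    key += encode_varint(vout);
    return key;
}
```

For output indexes below 128 the Varint is a single byte, so most keys are 34 bytes long.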
68+
69+
Chainstate Values
70+
-----------------
71+
Values in the `chainstate` database contain the following data:
72+
73+
* Block height
74+
* Whether or not the UTXO was created by a coinbase transaction
75+
* Amount (in Satoshis)
76+
* nSize - an indication of the type/size of locking script
77+
* Locking script - hash 160 for P2PKH & P2SH, public key for P2PK, otherwise full script
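
As a rough sketch of how `nSize` might be interpreted: the numbering below (0 and 1 for hashes, 2 to 5 for public keys, and an offset of 6 for all other scripts) reflects my understanding of Bitcoin Core's script compression and is not stated in this README, so treat it as an assumption. The function names are hypothetical.

```c++
#include <cstdint>
#include <string>

// Number of script payload bytes that follow nSize in a chainstate value.
// ASSUMPTION: six special script types, as in Bitcoin Core's script compression.
uint64_t script_payload_size(uint64_t n_size) {
    if (n_size <= 1) return 20;   // P2PKH (0) or P2SH (1): a 20-byte hash 160
    if (n_size <= 5) return 32;   // P2PK: 32-byte x coordinate of the public key
    return n_size - 6;            // any other script: full script of this length
}

// Human-readable label for the script type indicated by nSize.
std::string script_type(uint64_t n_size) {
    switch (n_size) {
        case 0: return "P2PKH";
        case 1: return "P2SH";
        case 2:
        case 3: return "P2PK (compressed key)";
        case 4:
        case 5: return "P2PK (uncompressed key, stored compressed)";
        default: return "other";
    }
}
```

In the worked example later in this document, the byte following the amount is `0x00`, which under this scheme indicates P2PKH and matches the 20 bytes that remain in the value.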
78+
79+
Value Obfuscation
80+
-----------------
81+
Values in the `chainstate` database are obfuscated - this was a [change][11] added to the database in order to prevent false positives triggered in Windows anti-virus software.
82+
83+
Obfuscation is a simple XOR operation against a repeating 8-byte obfuscation key. The key is a random value unique to each node and is stored in the `chainstate` database itself, under the key `0x0e00` concatenated with the raw bytes of the string "obfuscate_key", i.e. `0e006f62667573636174655f6b6579` ([see here][12]).
84+
85+
This means that values must be XORed against the obfuscation key after they have been retrieved.
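
The XOR step can be sketched as below. The function name `xor_with_key` is hypothetical, and the sketch assumes the raw key bytes have already been read from the database.

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR every byte of `value` with the obfuscation key, repeating the key as needed.
// Applying the function twice with the same key restores the original bytes.
std::vector<uint8_t> xor_with_key(const std::vector<uint8_t>& value,
                                  const std::vector<uint8_t>& key) {
    std::vector<uint8_t> out(value);
    if (key.empty()) {
        return out;                      // no key: data is stored unobfuscated
    }
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] ^= key[i % key.size()];   // key index wraps around at the key length
    }
    return out;
}
```

Because XOR is its own inverse, the same function serves for both obfuscation and de-obfuscation.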
86+
87+
**TODO**: If this tool is expanded to provide analytic data on UTXOs - possibly by building a relational database from UTXO data - consider building a custom version of Bitcoin Core without obfuscation and a rebuilt `chainstate` database. If dumping all UTXOs, this will likely provide a performance boost. This might be achieved by removing the following code and re-building the `chainstate` database - [see this note][13]:
88+
89+
```c++
// The base-case obfuscation key, which is a noop.
obfuscate_key = std::vector<unsigned char>(OBFUSCATE_KEY_NUM_BYTES, '\000');

bool key_exists = Read(OBFUSCATE_KEY_KEY, obfuscate_key);

if (!key_exists && obfuscate && IsEmpty()) {
    // Initialize non-degenerate obfuscation if it won't upset
    // existing, non-obfuscated data.
    std::vector<unsigned char> new_key = CreateObfuscateKey();

    // Write `new_key` so we don't obfuscate the key with itself
    Write(OBFUSCATE_KEY_KEY, new_key);
    obfuscate_key = new_key;

    LogPrintf("Wrote new obfuscate key for %s: %s\n", path.string(), HexStr(obfuscate_key));
}

LogPrintf("Using obfuscation key for %s: %s\n", path.string(), HexStr(obfuscate_key));
```
109+
See [here][10].
110+
111+
Once the value has been de-obfuscated, data is held in the following format:
112+
113+
`<Varint: block height * 2 + coinbase flag><Varint: compressed amount><nSize><locking script>`
114+
115+
Varints in the context of the `chainstate` database are described below.
116+
117+
Variable Length Integers: Varints
118+
---------------------------------
119+
Bitcoin uses Varints to transmit and store values where the minimum number of bytes required to store a value is not known.
120+
121+
For example, a block height that is less than or equal to 255 could be stored in a single byte (as a `uint8_t` or `unsigned char` data type) whereas the block height 649392 would require a minimum of three (unsigned) bytes:
122+
123+
| Value | Minimum Byte Representation |
|-|-|
| 649392 | 09 E8 B0 |
| | (9 * 256²) + (232 * 256) + 176 = 649392 |
127+
128+
To efficiently allow for such variability, Bitcoin uses a system of variable-length integers such that a minimal amount of space is used to store integers, whilst allowing for integers to be as large or as small as necessary.
129+
130+
**Varints serialize integers into one or more bytes, with smaller numbers requiring fewer bytes to be encoded.**
131+
132+
### Varints vs compactSize Integers
133+
Bitcoin has multiple methods for encoding variable length integers, with different methods used in different parts of the codebase.
134+
135+
The raw transaction format and peer-to-peer network messages within Bitcoin use a type of variable length integer encoding known as "compactSize". This involves prepending integers with a byte that indicates integer length for numbers greater than 252.
136+
137+
Used in the transaction format, compactSize integers form part of the [Bitcoin consensus rules][9].
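
For contrast with the Varints described below, here is a short sketch of compactSize encoding. The function name is hypothetical; the prefix bytes (`0xFD`, `0xFE`, `0xFF`) and the little-endian payload follow the Bitcoin developer reference linked above.

```c++
#include <cstdint>
#include <vector>

// Encode n as a compactSize integer: values up to 252 take one byte,
// larger values get a prefix byte followed by a little-endian payload.
std::vector<uint8_t> encode_compact_size(uint64_t n) {
    std::vector<uint8_t> out;
    int payload_bytes = 0;
    if (n <= 252) {
        out.push_back(static_cast<uint8_t>(n));
        return out;
    } else if (n <= 0xFFFF) {
        out.push_back(0xFD);
        payload_bytes = 2;
    } else if (n <= 0xFFFFFFFF) {
        out.push_back(0xFE);
        payload_bytes = 4;
    } else {
        out.push_back(0xFF);
        payload_bytes = 8;
    }
    for (int i = 0; i < payload_bytes; ++i) {
        out.push_back(static_cast<uint8_t>((n >> (8 * i)) & 0xFF));  // least significant byte first
    }
    return out;
}
```

Unlike the base 128 Varint, the length of a compactSize integer is known after reading its first byte.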
138+
139+
This document and repo are concerned with Varints, as this is the method that Bitcoin Core uses to serialize data to disk in the LevelDB database.
140+
141+
Varints in the LevelDB chainstate Database
142+
-------------------------------------------
143+
In the context of storing data in the LevelDB `chainstate` database (which stores UTXO data), integers are stored as base 128 encoded numbers.
144+
145+
In this system, the lower 7 bits of each byte represent a digit, and the position of the byte determines the power of 128 by which that digit is multiplied.
146+
147+
This leaves the most significant bit (MSB) of each byte available to carry information regarding whether or not the integer is complete.
148+
149+
If the MSB of a byte is set, the next digit (byte) should be read as part of the integer. If the MSB is not set, the byte represents the final digit in the encoded base 128 integer.
150+
151+
To ensure that each integer has a unique representation in the encoding system, 1 is subtracted from all bytes except for the byte representing the last digit.
152+
153+
| Decimal Number | Hexadecimal Representation | Binary |
|-|-|-|
| 128 | 0x80 0x00 | 1000 0000 0000 0000 |
| 256 | 0x81 0x00 | 1000 0001 0000 0000 |
| 65535 | 0x82 0xFE 0x7F | 1000 0010 1111 1110 0111 1111 |
158+
159+
This system is compact:
160+
161+
* Integers 0-127 are represented by 1 byte
162+
* 128-16511 require 2 bytes
163+
* 16512-2113663 require 3 bytes.
164+
165+
Each integer has a unique encoding, and the encoding is infinite in capacity - integers of any size can be represented.
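
The encoding rules above can be sketched in C++ as follows. This is an illustration limited to 64-bit integers, with hypothetical function names, and it omits the overflow checks a production decoder would need.

```c++
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Encode n as a base-128 Varint, most significant digit first.
std::vector<uint8_t> encode_varint(uint64_t n) {
    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(n & 0x7F));            // final digit, MSB clear
    while (n > 0x7F) {
        n = (n >> 7) - 1;                                      // 1 is subtracted for every non-final digit
        out.insert(out.begin(), static_cast<uint8_t>((n & 0x7F) | 0x80));
    }
    return out;
}

// Decode one Varint starting at data[pos]; pos is advanced past the bytes consumed.
uint64_t decode_varint(const std::vector<uint8_t>& data, std::size_t& pos) {
    uint64_t n = 0;
    while (pos < data.size()) {
        const uint8_t byte = data[pos++];
        n = (n << 7) | (byte & 0x7F);
        if ((byte & 0x80) == 0) {
            return n;                                          // MSB clear: final digit
        }
        ++n;                                                   // MSB set: undo the subtraction
    }
    throw std::out_of_range("truncated Varint");
}
```

Encoding 128, 256 and 65535 with this sketch reproduces the byte sequences in the table above.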
166+
167+
Worked Manual Example
168+
---------------------
169+
This example takes a value from the `chainstate` database of UTXOs from Bitcoin Core and decodes the value to provide:
170+
171+
* Block height (First Varint, excluding least significant bit)
172+
* Coinbase status (Last bit of first Varint)
173+
* Amount (Second Varint)
174+
* Script type (Third Varint)
175+
* Unique script value (Remainder of the value)
176+
177+
This worked example is drawn from the README of [this GitHub repo][3] for a Bitcoin chainstate parser in Ruby.
178+
179+
Start value: `c0842680ed5900a38f35518de4487c108e3810e6794fb68b189d8b`
180+
181+
### First Varint: Block Height
182+
| | Byte₀ | Byte₁ | Byte₂ |
|-|-|-|-|
| Start, hexadecimal | 0xC0 | 0x84 | 0x26 |
| Start, binary | 1100 0000 | 1000 0100 | 0010 0110 |
| Last 7 bits of each byte | 100 0000 | 000 0100 | 010 0110 |
| Add 1 to each byte except last | 100 0001 | 000 0101 | 010 0110 |
188+
189+
tmp array:

| Byte₀ | Byte₁ | Byte₂ |
|-|-|-|
| 0x41 | 0x05 | 0x26 |
| 0100 0001 | 0000 0101 | 0010 0110 |
194+
195+
Concatenate the 7-bit groups and remove the final bit, which is the flag showing coinbase status (0 here, so the output was not created by a coinbase transaction):

| | Byte₀ | Byte₁ | Byte₂ |
|-|-|-|-|
| Concatenate consecutive bits to get value | 0000 1000 | 0010 0001 | 0101 0011 |
| Result, hexadecimal | 0x08 | 0x21 | 0x53 |
| Result, decimal | 8 | 33 | 83 |
201+
202+
In decimal: (8 * 256²) + (33 * 256) + 83 = 532819
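
The same steps can be checked in code. This sketch decodes the first three bytes of the start value and splits the result into height and coinbase flag; the struct and function names are hypothetical.

```c++
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct HeightAndCoinbase {
    uint64_t height;
    bool coinbase;
};

// Decode one base-128 Varint starting at data[pos]; pos is advanced past it.
uint64_t decode_varint(const std::vector<uint8_t>& data, std::size_t& pos) {
    uint64_t n = 0;
    while (pos < data.size()) {
        const uint8_t byte = data[pos++];
        n = (n << 7) | (byte & 0x7F);
        if ((byte & 0x80) == 0) {
            return n;            // MSB clear: this was the final digit
        }
        ++n;                     // MSB set: add back the 1 subtracted on encoding
    }
    throw std::out_of_range("truncated Varint");
}

// The lowest bit of the first Varint is the coinbase flag; the rest is the height.
HeightAndCoinbase split_first_varint(uint64_t code) {
    return HeightAndCoinbase{code >> 1, (code & 1) != 0};
}
```

For the bytes `0xC0 0x84 0x26` the Varint decodes to 1065638, which splits into a height of 532819 and a coinbase flag of false, in agreement with the tables above.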
203+
204+
### Second Varint: Amount
205+
| | Byte₀ | Byte₁ | Byte₂ |
|-|-:|-:|-:|
| Start, hexadecimal | 0x80 | 0xED | 0x59 |
| Start, binary | 1000 0000 | 1110 1101 | 0101 1001 |
| Last 7 bits | 000 0000 | 110 1101 | 101 1001 |
| Add 1 to each byte except last | 000 0001 | 110 1110 | 101 1001 |
| Concatenate | 0000 0000 | 0111 0111 | 0101 1001 |
| Result, hexadecimal | 0x00 | 0x77 | 0x59 |
| Result, decimal | 0 | 119 | 89 |
214+
215+
In decimal: (0 * 256²) + (119 * 256) + 89 = 30553
216+
217+
Amount Compression
218+
------------------
219+
To further save on space Bitcoin Core compresses numbers in the `amount` field of the UTXO. For this project I've used the Bitcoin Core `DecompressAmount` function: [see here][14].
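
For illustration, the decompression arithmetic can be sketched as below. This mirrors my reading of the linked `DecompressAmount` function; the name `decompress_amount` belongs to this sketch, and the linked source remains the authority.

```c++
#include <cstdint>

// Reverse Bitcoin Core's amount compression.
// A compressed value x is 0, or encodes (mantissa, exponent) so that
// amounts with trailing decimal zeros take fewer bytes.
uint64_t decompress_amount(uint64_t x) {
    if (x == 0) {
        return 0;
    }
    x--;
    int e = static_cast<int>(x % 10);    // number of trailing zeros stripped (0..9)
    x /= 10;
    uint64_t n = 0;
    if (e < 9) {
        int d = static_cast<int>(x % 9) + 1;   // last non-zero digit (1..9)
        x /= 9;
        n = x * 10 + d;
    } else {
        n = x + 1;
    }
    while (e > 0) {                      // restore the trailing zeros
        n *= 10;
        e--;
    }
    return n;
}
```

Applied to the worked example, the decoded Varint 30553 decompresses to 339500 satoshis (0.003395 BTC).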
220+
221+
References
222+
----------
223+
* [LevelDB Project docs][8] - not very useful
224+
* [Compact Integers in Bitcoin][9] - Bitcoin Developer on bitcoin.org
225+
* [Variable length quantity][7], Wikipedia
226+
* [Bitcoin SE answer on CVarint format in chainstate DB][2]
227+
* [Bitcoin chainstate parser by in3rsha][3] - very useful README
228+
* [https://jonnydee.wordpress.com/2011/05/01/convert-a-block-of-digits-from-base-x-to-base-y/][4]
229+
* [Comment relating to Variable-length integers][5], Bitcoin Core `/src/serialize.h#L339`
230+
* [SO Answer on Varint encoding in chainstate DB][6]
231+
* [Remove lines for non-obfuscated values][10]
232+
* [Obfuscate chainstate, PR #6650][11]
233+
234+
[1]: https://github.com/bitcoin/bitcoin/blob/v0.13.2/src/serialize.h#L307L372
235+
[2]: https://bitcoin.stackexchange.com/a/51639/56514
236+
[3]: https://github.com/in3rsha/bitcoin-chainstate-parser
237+
[4]: https://jonnydee.wordpress.com/2011/05/01/convert-a-block-of-digits-from-base-x-to-base-y/
238+
[5]: https://github.com/bitcoin/bitcoin/blob/master/src/serialize.h#L339
239+
[6]: https://bitcoin.stackexchange.com/a/51639/56514
240+
[7]: https://en.wikipedia.org/wiki/Variable-length_quantity
241+
[8]: https://github.com/google/leveldb/blob/master/doc/index.md
242+
[9]: https://developer.bitcoin.org/reference/transactions.html#compactsize-unsigned-integers
243+
[10]: https://github.com/bitcoin/bitcoin/blob/80aa83aa406447d9b0932301b37966a30d0e1b6e/src/dbwrapper.cpp#L149-L166
244+
[11]: https://github.com/bitcoin/bitcoin/pull/6650
245+
[12]: https://github.com/csknk/parse-chainstate/blob/51434fbf8cfde3f19e2a3ac0ff8a5ee35259b6e0/DBWrapper.cpp#L24-L25
246+
[13]: https://bitcoin.stackexchange.com/a/62700/56514
247+
[14]: https://github.com/bitcoin/bitcoin/blob/9e8d2bd076d78ba59abceb80106f44fe26246b14/src/compressor.cpp#L168-L192 "Decompress amount"

README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Examples:
2121
* Run `./bin/main -m dump_all <path to chainstate database>` to dump all UTXOs into CSV format on `stdout` (this will be a lengthy process)
2222
* As above, redirect output to a file: `./bin/main -m dump_all <path to chainstate database> > path/results.csv`
2323

24-
The `chainstate` database should be a copy, not currently being accessed by Bitcoin Core and it's location is currently hardcoded into `main.cpp`.
24+
The `chainstate` database should be a copy, not currently being accessed by Bitcoin Core.
2525

2626
UTXOs
2727
-----
