modify elasticsearch rehash

puppylpg · puppylpg · commit a48a919a7f73 · 2024-03-21T18:22:08.000+08:00
diff --git a/_posts/2023-02-09-es-rehash.md b/_posts/2023-02-09-es-rehash.md
@@ -1,3 +1,5 @@
+[toc]
+
 ---
 layout: post
 title: "Elasticsearch: Data Redistribution"
@@ -6,8 +8,8 @@ categories: elasticsearch rehash
 tags: elasticsearch rehash
 ---
 
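-When I first read how elasticsearch assigns documents to shards by the hash of `_routing`, I somehow never noticed that elasticsearch implements virtual sharding, then maps it onto physical shards...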
-```
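+When I first read how elasticsearch assigns documents to shards by the hash of `_routing`, I somehow never noticed that elasticsearch first assigns documents to virtual shards and then maps those onto physical shards.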
+```bash
 routing_factor = num_routing_shards / num_primary_shards
 shard_num = (hash(_routing) % num_routing_shards) / routing_factor
 ```
@@ -27,35 +29,45 @@ shard_num = (hash(_routing) % num_routing_shards) / routing_factor
 ## Consistent Hashing
 - https://bbs.huaweicloud.com/blogs/333158
 
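-In essence, **a newly added node only takes over part of the keys of the previous node, so keys on every other node are untouched and very little data has to move**.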
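+Two hashes are involved:
+1. Each node is hashed (for example, by its IP address) to find its position on the ring;
+2. Each piece of data is hashed to find its position on the ring.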
+
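+A piece of data rarely lands exactly on a node, so it is assigned to the first node reached by walking clockwise around the ring from its position.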
 
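-**But this easily leads to uneven data distribution, so consistent hashing usually uses a large number of virtual nodes**, which slice the data finely; mapping many virtual nodes onto the physical nodes then evens things out. **Adding a physical node adds a corresponding set of virtual nodes, which take over data from other virtual nodes, in effect taking it evenly from the other nodes. Removing a physical node removes its virtual nodes, and their data goes to the next virtual node on the ring, so the other physical machines also gain data roughly evenly**.
+In essence, **a node newly added to the hash ring only takes over part of the keys of its successor node, so keys on every other node are untouched and very little data has to move**. Likewise, when a node is removed, its data goes to its successor.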
+
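+**However, when there are only a few nodes, their positions on the ring are unbalanced and the data ends up unevenly distributed, so consistent hashing also relies on a large number of virtual nodes**, which slice the data finely; mapping many virtual nodes onto one physical node then evens things out. **Adding a physical node adds a corresponding set of virtual nodes, which take over data from other virtual nodes, in effect taking it evenly from the other nodes. Removing a physical node removes its virtual nodes, and their data goes to the next virtual node on the ring, so the other physical machines also gain data roughly evenly**.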
+
+> es does not use consistent hashing, though.
 
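 ## Split-in-two hashing
 In the JDK's HashMap, resizing works like this: the capacity doubles each time, so every bucket splits into exactly two corresponding buckets.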
 
 - https://tech.meituan.com/2016/06/24/java-hashmap.html
 
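+**Compared with an ordinary rehash, split-in-two hashing moves data faster, because only about half of it has to move**. For example, with two buckets only the last bit of the hash matters; after splitting two into four, **the second-to-last bit of each entry's hash tells whether it moves (1) or stays (0)**. The hash value can therefore be recorded up front and never needs to be computed again.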
+
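 **elasticsearch splits shards on almost the same principle as HashMap**. When elasticsearch shards grow too large, you can increase the shard count and use the [split API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html) to split large shards into smaller ones.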
 
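 Now look at es's [`_routing`](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) again: documents are assigned to shards by hash modulo, but the hash is first taken modulo the number of virtual (routing) shards and only then mapped to the actual shard count: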
-```
+```bash
 routing_factor = num_routing_shards / num_primary_shards
 shard_num = (hash(_routing) % num_routing_shards) / routing_factor
 ```
 
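-The number of virtual shards is set via [`index.number_of_routing_shards`](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-number-of-routing-shards). **By default it is the number of primary shards multiplied by 2^n, capped at 1024**. For example, with 30 primary shards the number of virtual shards is 30x2^5=960.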
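+The number of virtual shards is set via [`index.number_of_routing_shards`](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-number-of-routing-shards). **By default it is the number of primary shards multiplied by 2^n, capped at 1024**. For example, with 30 primary shards the number of virtual shards is 30x2^5=960, **so each primary shard holds `2^5` virtual shards**.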
 
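-When splitting an index, the primary shard count can grow by a factor of 2^n. A one-to-two split, for example, means setting the new index's primary shard count to 30x2=60.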
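+When splitting an index, the primary shard count can grow by a factor of 2^n. A one-to-two split, for example, means setting the new index's primary shard count to 30x2=60, **at which point each shard holds `2^4` virtual shards**, half as many as before.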
 
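 ### elasticsearch's considerations
-Why doesn't elasticsearch use an ordinary rehash? A rehash is too expensive. That holds for key-value systems, and even more so for es.
+Question 1: why doesn't elasticsearch use an ordinary rehash? A rehash is too expensive. That holds for key-value systems, and even more so for es.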
 
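-[Why not use consistent hashing](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#incremental-resharding)? **Even though only 1/N of the data has to move, es still considers that too expensive**. Unlike a simple key-value system, es has to index every document when it is created. **Compared with indexing new documents, deleting data is faster:**
+Question 2: [why not use consistent hashing](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#incremental-resharding)? **Even though only 1/N of the data has to move, es still considers that too expensive**. Unlike a simple key-value system, es has to index every document when it is created. **Compared with indexing new documents, deleting data is faster:**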
 
 > The most common way that key-value stores do this efficiently is by using consistent hashing. Consistent hashing only requires 1/N-th of the keys to be relocated when growing the number of shards from N to N+1. However Elasticsearch’s unit of storage, shards, are Lucene indices. Because of their search-oriented data structure, taking a significant portion of a Lucene index, be it only 5% of documents, deleting them and indexing them on another shard typically comes with a much higher cost than with a key-value store.
 
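-Why does splitting by 2/4/8/... work? Because es can hard-link the new index to the existing data and then simply delete the half that no longer belongs to each shard. That is simpler than reindexing documents one by one.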
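+Question 3: compared with consistent hashing, doesn't a one-to-two split have to move data as well? Why is splitting by 2/4/8/... faster? **Because es hard-links the source shard's existing data into the new index's shards**, and each new shard then only marks as deleted the documents that now belong to a different shard. That is much cheaper than reindexing documents one by one.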
 
 > The split is done efficiently by hard-linking the data in the source primary shard into multiple primary shards in the new index, **then running a fast Lucene Delete-By-Query to mark documents which should belong to a different shard as deleted**. These deleted documents will be physically removed over time by the background merge process.
 
@@ -95,7 +107,7 @@ A split operation:
 4. Recovers the target index as though it were a closed index which had just been re-opened.
 
 **First, [block writes to the index](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-blocks.html)**:
-```
+```json
 PUT stored_kol_split_lhb/_settings
 {
   "settings": {
@@ -104,7 +116,7 @@ PUT stored_kol_split_lhb/_settings
 }
 ```
 Since the source index has 2 primary shards, the new index is set to 2^5=32 times that, i.e. 64 shards, so each original shard is split into 32:
-```
+```json
 POST /stored_kol_split_lhb/_split/stored_kol_split_after_lhb
 {
   "settings": {
@@ -113,7 +125,7 @@ POST /stored_kol_split_lhb/_split/stored_kol_split_after_lhb
 }
 ```
 The new index inherits the write block as well, so remember to remove it once the split is done:
-```
+```json
 PUT stored_kol_split_after_lhb/_settings
 {
   "settings": {
@@ -124,7 +136,7 @@ PUT stored_kol_split_after_lhb/_settings
 After the shard count increased, the document count stayed the same, but disk usage grew considerably.
 
 If some shards fail to be allocated, the [diagnose api](https://www.elastic.co/guide/en/elasticsearch/reference/current/diagnose-unassigned-shards.html) can help find the cause:
-```
+```json
 GET _cluster/allocation/explain
 {
   "index": "stored_kol_split_after_lhb",