1 | | -## Link analysisLink analysis is a technique used to evaluate relationships between nodes. Link analysis is used on several fields,such as search engines, fraud detection, among others. There is several algorithms of different kinds to perform link analysis.Here we are only going to focus on the Hyperlink-Induced Topic Search \(HITS\) algorithm.This algorithm was originally developed to rate web pages. But, nowadays modern search engines do not usethis algorithm since there is more advanced techniques. HITS has been also used to identify the important classes that should becommented in a large software system or the classes that a developer should read to get an insight of the key classes.### Hyperlink-Induced Topic Search \(HITS\) algorithmHyperlink-Induced Topic Search \(HITS\) algorithm, also knows as Hubs and Authorities,is an algorithm that rates every the nodes of a graph. Every node has a hub and a authority score. A hub is a node that may notbe relevant but references relevant nodes. An authority is a node that contains relevant information.The algorithm does the following:1. Assign to each node a hub and an authority score equal to 1.1. Run the authority update rule for each node.1. Run the hub update rule for each.1. Normalize the values by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and dividing each Authority score by the square root of the sum of the squares of all Authority scores.1. Repeat from the second step as necessary.The update rules are simple:**Authority update rule**Update each node's authority score to be equal to the sum of the hub scores of each node that points to it.**Hub update rule**Update each node's hub score to be equal to the sum of the authority scores of each node that it points to.### HITS implementationThe Pharo implementation is as follows. The `k` number is the number of times that the scores are going to beupdated. 
The default value is `20` but it can also be set manually.```AIHits >> run |
2 | | - |
3 | | - self initializeNodes. |
4 | | - k timesRepeat: [ |
5 | | - nodes do: [ :node | self computeAuthoritiesFor: node ]. |
6 | | - nodes do: [ :node | self computeHubsFor: node ]. |
7 | | - self normalizeScores ]. |
8 | | - ^ nodes``````AIHits >> initializeNodes |
9 | | - |
10 | | - "Here we are using float instead of int because of the normalization." |
11 | | - nodes do: [ :n | |
12 | | - n auth: 1.0. |
13 | | - n hub: 1.0 ]``````AIHits >> computeAuthoritiesFor: aNode |
14 | | - |
15 | | - aNode auth: |
16 | | - (aNode incomingNodes |
17 | | - inject: 0 |
18 | | - into: [ :sum :node | sum + node hub ])``````AIHits >> computeHubsFor: aNode |
19 | | - |
20 | | - aNode hub: |
21 | | - (aNode adjacentNodes |
22 | | - inject: 0 |
23 | | - into: [ :sum :node | sum + node auth ])``````AIHits >> normalizeScores |
24 | | - |
25 | | - | authNorm hubNorm | |
26 | | - authNorm := 0. |
27 | | - hubNorm := 0. |
28 | | - |
29 | | - nodes do: [ :node | |
30 | | - authNorm := authNorm + node auth squared. |
31 | | - hubNorm := hubNorm + node hub squared ]. |
32 | | - |
33 | | - authNorm := authNorm sqrt. |
34 | | - hubNorm := hubNorm sqrt. |
35 | | - |
36 | | - "To avoid dividing by 0" |
37 | | - authNorm = 0 ifTrue: [ authNorm := 1.0 ]. |
38 | | - hubNorm = 0 ifTrue: [ hubNorm := 1.0 ]. |
39 | | - |
40 | | - nodes do: [ :n | |
41 | | - n auth: n auth / authNorm. |
42 | | - n hub: n hub / hubNorm ]```### Case studyHere we calculate the hubs and authorities scores for all the nodes of the graph shown in Figure *@hits@* with 3 iterations.```nodes := #( 'A' 'B' 'C' 'D' ). |
| 1 | +## Link Analysis |
| 2 | + |
| 3 | +Link analysis is a technique used to evaluate relationships between nodes. It is used in several fields, such as search engines and fraud detection. There are several kinds of algorithms for performing link analysis.
| 4 | +Here we are only going to focus on the Hyperlink-Induced Topic Search (HITS) algorithm. |
| 5 | + |
| 6 | +This algorithm was originally developed to rate web pages. Modern search engines no longer use it, since more advanced techniques are available. HITS has also been used to identify the important classes that should be commented in a large software system, or the key classes that a developer should read to gain insight into the system.
| 7 | + |
| 8 | +### Hyperlink-Induced Topic Search (HITS) algorithm |
| 9 | + |
| 10 | +The Hyperlink-Induced Topic Search (HITS) algorithm, also known as *Hubs and Authorities*,
| 11 | +is an algorithm that rates every node of a graph. Every node has a hub score and an authority score. A **hub** is a node that may not be relevant itself but references relevant nodes. An **authority** is a node that contains relevant information.
| 12 | + |
| 13 | +The algorithm does the following: |
| 14 | + |
| 15 | +1. Assign to each node a hub and an authority score equal to $1$. |
| 16 | +2. Run the authority update rule for each node. |
| 17 | +3. Run the hub update rule for each node.
| 18 | +4. Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and dividing each authority score by the square root of the sum of the squares of all authority scores.
| 19 | +5. Repeat from the second step as necessary. |
| 20 | + |
| 21 | +The update rules are simple: |
| 22 | + |
| 23 | +- **Authority update rule**: Update each node's authority score to be equal to the sum of the hub scores of each node that points to it. |
| 24 | + |
| 25 | +- **Hub update rule**: Update each node's hub score to be equal to the sum of the authority scores of each node that it points to. |
| 26 | + |
| 27 | +### HITS implementation |
| 28 | + |
| 29 | +The Pharo implementation is as follows. The `k` number is the number of times that the scores are going to be updated. The default value is $20$ but it can also be set manually. |
| 30 | + |
| 31 | +``` |
| 32 | +AIHits >> run |
| 33 | +
|
| 34 | + self initializeNodes. |
| 35 | + k timesRepeat: [ |
| 36 | + nodes do: [ :node | self computeAuthoritiesFor: node ]. |
| 37 | + nodes do: [ :node | self computeHubsFor: node ]. |
| 38 | + self normalizeScores ]. |
| 39 | + ^ nodes |
| 40 | +``` |
| 41 | + |
| 42 | +``` |
| 43 | +AIHits >> initializeNodes |
| 44 | +
|
| 45 | + "Here we are using float instead of int because of the normalization." |
| 46 | + nodes do: [ :n | |
| 47 | + n auth: 1.0. |
| 48 | + n hub: 1.0 ] |
| 49 | +``` |
| 50 | + |
| 51 | +``` |
| 52 | +AIHits >> computeAuthoritiesFor: aNode |
| 53 | +
|
| 54 | + aNode auth: |
| 55 | + (aNode incomingNodes |
| 56 | + inject: 0 |
| 57 | + into: [ :sum :node | sum + node hub ]) |
| 58 | +``` |
| 59 | + |
| 60 | +``` |
| 61 | +AIHits >> computeHubsFor: aNode |
| 62 | +
|
| 63 | + aNode hub: |
| 64 | + (aNode adjacentNodes |
| 65 | + inject: 0 |
| 66 | + into: [ :sum :node | sum + node auth ]) |
| 67 | +``` |
| 68 | + |
| 69 | +``` |
| 70 | +AIHits >> normalizeScores |
| 71 | +
|
| 72 | + | authNorm hubNorm | |
| 73 | + authNorm := 0. |
| 74 | + hubNorm := 0. |
| 75 | +
|
| 76 | + nodes do: [ :node | |
| 77 | + authNorm := authNorm + node auth squared. |
| 78 | + hubNorm := hubNorm + node hub squared ]. |
| 79 | +
|
| 80 | + authNorm := authNorm sqrt. |
| 81 | + hubNorm := hubNorm sqrt. |
| 82 | +
|
| 83 | + "To avoid dividing by 0" |
| 84 | + authNorm = 0 ifTrue: [ authNorm := 1.0 ]. |
| 85 | + hubNorm = 0 ifTrue: [ hubNorm := 1.0 ]. |
| 86 | +
|
| 87 | + nodes do: [ :n | |
| 88 | + n auth: n auth / authNorm. |
| 89 | + n hub: n hub / hubNorm ] |
| 90 | +``` |
| 91 | + |
| 92 | +### Case study |
| 93 | + |
| 94 | +Here we calculate the hub and authority scores for all the nodes of the graph shown in Figure *@hits@* with three iterations.
| 95 | + |
| 96 | + |
| 97 | + |
| 98 | +``` |
| 99 | +nodes := #( 'A' 'B' 'C' 'D' ). |
43 | 100 | edges := #( #( 'A' 'B' ) #( 'A' 'C' ) #( 'A' 'D' ) #( 'B' 'C' ) |
44 | 101 | #( 'B' 'D' ) #( 'C' 'A' ) #( 'C' 'D' ) #( 'D' 'D' ) ). |
45 | 102 | hits := AIHits new. |
46 | 103 | hits |
47 | | - nodes: nodes; |
48 | | - edges: edges from: #first to: #second; |
| 104 | + nodes: nodes; |
| 105 | + edges: edges from: #first to: #second; |
49 | 106 | k: 3. |
50 | | -nodes := hits run```If we inspect the nodes, these are the scores calculated after 3 iterations.```('A' auth: 0.17 hub: 0.65) |
| 107 | +nodes := hits run |
| 108 | +``` |
| 109 | + |
| 110 | +If we inspect the nodes, these are the scores calculated after 3 iterations. |
| 111 | + |
| 112 | +``` |
| 113 | +('A' auth: 0.17 hub: 0.65) |
51 | 114 | ('B' auth: 0.27 hub: 0.54) |
52 | 115 | ('C' auth: 0.49 hub: 0.41) |
53 | | -('D' auth: 0.81 hub: 0.34)```### Weighted HITSThere are cases where the Hits algorithm does not behave as expected and sometimes the Hits algorithm puts 0 as valuesfor the hubs and authorities. Using weights in a graph helps in obtaining better results. Establishing the weights is aresponsibility of the user.For more information, you can read these papers:- _Modifications of Kleinberg's HITS Algorithm Using Matrix Exponentiation and Web Log Records_ by Miller et al. % ${cite:Mill01a}$- _An Improved Weighted HITS Algorithm Based on Similarity andPopularity_ by Zhang et al. % ${cite:Zhan07a}$In terms of implementation, it is only necessary to multiply the weights with the scores in each iteration.That means changing `computeAuthoritiesFor:` and `computeHubsFor:` methods.This is done in `AIWeightedHits` class.```AIWeightedHits >> computeAuthoritiesFor: aNode |
| 116 | +('D' auth: 0.81 hub: 0.34) |
| 117 | +``` |
| 118 | + |
| 119 | +### Weighted HITS |
| 120 | + |
| 121 | +There are cases where the HITS algorithm does not behave as expected, and it sometimes assigns 0 to the hub and authority scores. Using weights in a graph helps in obtaining better results. Establishing the weights is the responsibility of the user.
| 122 | + |
| 123 | +For more information, you can read these papers: |
| 124 | + |
| 125 | +- _Modifications of Kleinberg's HITS Algorithm Using Matrix Exponentiation and Web Log Records_, by Miller et al. (2001) |
| 126 | +- *An Improved Weighted HITS Algorithm Based on Similarity and Popularity*, by Zhang et al. (2007)
| 127 | + |
| 128 | +In terms of implementation, it is only necessary to multiply the weights with the scores in each iteration. |
| 129 | +That means changing the `computeAuthoritiesFor:` and `computeHubsFor:` methods.
| 130 | +This is done in the `AIWeightedHits` class.
| 131 | + |
| 132 | +``` |
| 133 | +AIWeightedHits >> computeAuthoritiesFor: aNode |
| 134 | +
|
| 135 | + aNode auth: (aNode incomingEdges |
| 136 | + inject: 0 |
| 137 | + into: [ :sum :edge | sum + (edge weight * edge from hub) ]) |
| 138 | +``` |
| 139 | + |
| 140 | +``` |
| 141 | +AIWeightedHits >> computeHubsFor: aNode |
| 142 | +
|
| 143 | + aNode hub: (aNode outgoingEdges |
| 144 | + inject: 0 |
| 145 | + into: [ :sum :edge | sum + (edge weight * edge to auth) ]) |
| 146 | +``` |
54 | 147 |
|
55 | | - aNode auth: (aNode incomingEdges |
56 | | - inject: 0 |
57 | | - into: [ :sum :edge | sum + (edge weight * edge from hub) ])``````AIWeightedHits >> computeHubsFor: aNode |
| 148 | +### Conclusion |
58 | 149 |
|
59 | | - aNode hub: (aNode outgoingEdges |
60 | | - inject: 0 |
61 | | - into: [ :sum :edge | sum + (edge weight * edge to auth) ])```### ConclusionEven if the HITS algorithm is not used anymore in the modern search engines, it is a very good algorithm forhaving a first look on how to classify links according to their relevance in the network. |
| 150 | +Even though the HITS algorithm is no longer used in modern search engines, it is a very good algorithm for a first look at how to classify links according to their relevance in the network.
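
As a cross-check on the case study figures, here is a minimal standalone sketch of the same procedure in Python. It is not part of the Pharo library: the function name `hits` and the dictionary-based graph representation are assumptions made for illustration only. It follows the steps as described in the text (initialize scores to 1, update authorities, update hubs from the new authorities, normalize by the Euclidean norm, repeat `k` times).

```python
from math import sqrt


def hits(nodes, edges, k=20):
    """Run k rounds of HITS and return {node: (authority, hub)}.

    Hypothetical helper for illustration; it mirrors the steps described
    in the text, not the internals of the Pharo AIHits class.
    """
    incoming = {n: [] for n in nodes}
    outgoing = {n: [] for n in nodes}
    for source, target in edges:
        outgoing[source].append(target)
        incoming[target].append(source)

    # Step 1: every node starts with hub = authority = 1.
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(k):
        # Authority update: sum of the hub scores of the nodes pointing to n.
        auth = {n: sum(hub[m] for m in incoming[n]) for n in nodes}
        # Hub update: sum of the (freshly updated) authority scores of the
        # nodes that n points to.
        hub = {n: sum(auth[m] for m in outgoing[n]) for n in nodes}
        # Normalize by the Euclidean norm, falling back to 1.0 when the
        # norm is 0 to avoid dividing by zero.
        auth_norm = sqrt(sum(a * a for a in auth.values())) or 1.0
        hub_norm = sqrt(sum(h * h for h in hub.values())) or 1.0
        auth = {n: a / auth_norm for n, a in auth.items()}
        hub = {n: h / hub_norm for n, h in hub.items()}

    return {n: (auth[n], hub[n]) for n in nodes}


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
         ("B", "D"), ("C", "A"), ("C", "D"), ("D", "D")]

scores = hits(nodes, edges, k=3)
for name in nodes:
    authority, hub_score = scores[name]
    print(name, round(authority, 2), round(hub_score, 2))
```

With `k=3` on the case study graph, the rounded values agree with the scores listed in the case study (for example, `D` gets authority 0.81 and hub 0.34), which suggests the documented numbers are consistent with the described update order.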