Added a zim parser to the surrogate import option.
You can now import ZIM files into YaCy by moving them into the
DATA/SURROGATE/IN folder. They are picked up from there and, after
parsing, moved to DATA/SURROGATE/OUT.
In some cases the parser cannot identify the original URL of the
documents in a ZIM file. Such a file is simply ignored.
This commit also carries an important fix to the PDF parser and raises
the maximum parsing speed to 60000 PPM (pages per minute), which should
make it possible to index up to 1000 files per second.
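
As an illustration of the drop-in workflow described in the commit message, the following is a minimal sketch of queuing a ZIM file by moving it into DATA/SURROGATE/IN. The class name `SurrogateDropIn`, the method `queueForImport`, and the idea of passing the YaCy installation directory as a parameter are assumptions made for this example; they are not part of YaCy. Only the folder names come from the commit message.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of YaCy: it only moves a file into the
// surrogate input folder named in the commit message.
final class SurrogateDropIn {

    /**
     * Moves zimFile into <yacyHome>/DATA/SURROGATE/IN and returns the new path.
     * Creating the folder if it is missing is an assumption of this sketch;
     * a running YaCy installation is expected to have it already.
     */
    static Path queueForImport(final Path yacyHome, final Path zimFile) throws IOException {
        final Path inFolder = yacyHome.resolve("DATA").resolve("SURROGATE").resolve("IN");
        Files.createDirectories(inFolder);
        final Path target = inFolder.resolve(zimFile.getFileName());
        return Files.move(zimFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(final String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SurrogateDropIn <yacy-home> <file.zim>");
            return;
        }
        final Path moved = queueForImport(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("queued for import: " + moved);
    }
}
```

According to the commit message, YaCy then parses the file and moves it on to DATA/SURROGATE/OUT; the sketch does not wait for or verify that step.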
Orbiter committed Nov 5, 2023
1 parent 70e2993 commit 7db0534
Showing 12 changed files with 279 additions and 186 deletions.
21 changes: 0 additions & 21 deletions htroot/ConfigParser_p.html
@@ -51,27 +51,6 @@ <h2>Parser Configuration</h2>
</tr>
</table>
</fieldset>
-<fieldset><legend id="parser">PDF Parser Attributes</legend>
-<p>
-This is an experimental setting which makes it possible to split PDF documents into individual index entries.
-Every page will become a single index hit and the url is artifically extended with a post/get attribute value containing
-the page number as value. When such an url is displayed within a search result, then the post/get attribute is transformed into an anchor hash link.
-This makes it possible to view the individual page directly in the pdf.js viewer built-in into firefox,
-for reference see https://github.com/mozilla/pdf.js/wiki/Viewer-options
-</p>
-<table border="0">
-<tr class="TableCellLight">
-<td class="small" align="right" width="90">Split PDF</td>
-<td class="small" align="left" width="300"><input type="checkbox" name="individualPages" #(individualPages)#::checked="checked" #(/individualPages)#/></td>
-</tr>
-<tr class="TableCellLight">
-<td class="small" align="right">Property Name</td>
-<td class="small" align="left"><input type="text" name="individualPagePropertyname" value="#[individualPagePropertyname]#"/></td>
-</tr>
-<tr class="TableCellDark">
-<td colspan="3" class="small" ><input type="submit" name="pdfSettings" value="Submit" class="btn btn-primary"/></td>
-</tr>
-</table>
</form>
#%env/templates/footer.template%#
</body>
4 changes: 2 additions & 2 deletions htroot/Crawler_p.html
@@ -134,7 +134,7 @@ <h2>Crawler</h2>
<tr class="TableCellLight">
<td align="left">Speed / PPM<br/>(Pages Per Minute)</td>
<td align="left" colspan="4">
-<input id="customPPM" name="customPPM" type="number" min="10" max="30000" style="width:5em" value="#[customPPMdefault]#" /><label for="customPPM"><abbr title="Pages Per Minute">PPM</abbr></label>
+<input id="customPPM" name="customPPM" type="number" min="10" max="60000" style="width:5em" value="#[customPPMdefault]#" /><label for="customPPM"><abbr title="Pages Per Minute">PPM</abbr></label>
<input id="latencyFactor" name="latencyFactor" type="number" min="0.1" max="3.0" step="0.1" style="width:3.5em" value="#[latencyFactorDefault]#" />
<label for="latencyFactor"><abbr title="Latency Factor">LF</abbr></label>
<input id="MaxSameHostInQueue" name="MaxSameHostInQueue" type="number" min="1" max="30" style="width:3em" value="#[MaxSameHostInQueueDefault]#" />
@@ -147,7 +147,7 @@ <h2>Crawler</h2>
<td align="left">Crawler PPM</td>
<td align="left" width="60"><span id="ppmNum">&nbsp;&nbsp;&nbsp;</span></td>
<td align="left" width="260px" colspan="3">
-<progress id="ppmbar" max="30000" value="0" style="width:94%;"/>
+<progress id="ppmbar" max="60000" value="0" style="width:94%;"/>
</td>
</tr>
<tr class="TableCellLight">
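
The new upper bound in this hunk relates to the commit message by simple arithmetic: 60000 pages per minute divided by 60 is 1000 pages per second. A minimal sketch of that conversion follows, together with a clamp to the `min="10"` and `max="60000"` bounds of the `customPPM` input. The class `CrawlerPpm` and its methods are hypothetical names for this example; the diff shows only the HTML attributes, so whether YaCy also enforces these bounds on the server side is not shown here.

```java
// Hypothetical helper, not part of YaCy: mirrors the bounds of the
// customPPM input field and converts PPM to pages per second.
final class CrawlerPpm {

    static final int MIN_PPM = 10;     // min attribute of the customPPM input
    static final int MAX_PPM = 60000;  // max attribute after this commit (was 30000)

    /** Limits a requested PPM value to the range the form accepts. */
    static int clamp(final int requestedPpm) {
        return Math.max(MIN_PPM, Math.min(MAX_PPM, requestedPpm));
    }

    /** Converts pages per minute to pages per second. */
    static double pagesPerSecond(final int ppm) {
        return ppm / 60.0;
    }

    public static void main(final String[] args) {
        System.out.println(pagesPerSecond(clamp(60000))); // prints 1000.0
        System.out.println(pagesPerSecond(clamp(30000))); // prints 500.0 (the old maximum)
    }
}
```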
1 change: 1 addition & 0 deletions ivy.xml
@@ -28,6 +28,7 @@
<dependency org="io.opentracing" name="opentracing-noop" rev="0.33.0"/>
<dependency org="io.opentracing" name="opentracing-util" rev="0.33.0"/>
<dependency org="javax.servlet" name="javax.servlet-api" rev="3.1.0"/>
+<dependency org="javainetlocator" name="inetaddresslocator" rev="2.18" />
<dependency org="jcifs" name="jcifs" rev="1.3.17" conf="compile->master" />
<dependency org="net.arnx" name="jsonic" rev="1.3.10"/>
<dependency org="net.jthink" name="jaudiotagger" rev="2.2.5"/>
14 changes: 9 additions & 5 deletions source/net/yacy/cora/document/id/MultiProtocolURL.java
@@ -2593,14 +2593,18 @@ public boolean exists(final ClientIdentification.Agent agent) {
return client.fileSize(path) > 0;
}
if (isHTTP() || isHTTPS()) {
-try (final HTTPClient client = new HTTPClient(agent)) {
-client.setHost(getHost());
-org.apache.http.HttpResponse response = client.HEADResponse(this, true);
-return response != null && (response.getStatusLine().getStatusCode() == 200 || response.getStatusLine().getStatusCode() == 301);
-}
+final HTTPClient client = new HTTPClient(agent);
+client.setHost(getHost());
+org.apache.http.HttpResponse response = client.HEADResponse(this, true);
+client.close();
+if (response == null) return false;
+int status = response.getStatusLine().getStatusCode();
+return status == 200 || status == 301 || status == 302;
}
return false;
} catch (IOException e) {
+if (e.getMessage().contains("Circular redirect to")) return true; // exception; this is a 302 which the client actually accepts
+//e.printStackTrace();
return false;
}
}
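
The two decisions this hunk makes can be isolated for illustration: the status codes 200, 301 and 302 count as "exists", and an `IOException` whose message contains "Circular redirect to" is also treated as "exists". The sketch below is not the code from the commit. The class `HeadCheckSketch` and both method names are hypothetical, and the null check on the exception message is an addition of this sketch, since `Throwable.getMessage()` may return null.

```java
import java.io.IOException;

// Hypothetical helper, not part of YaCy: isolates the status and
// exception checks that the changed exists() method performs inline.
final class HeadCheckSketch {

    /** Status codes the changed method treats as "resource exists". */
    static boolean isAcceptedStatus(final int status) {
        return status == 200 || status == 301 || status == 302;
    }

    /**
     * True if the exception message contains the "Circular redirect to"
     * marker that the commit checks for. The null guard is an assumption
     * of this sketch and does not appear in the diff.
     */
    static boolean isCircularRedirect(final IOException e) {
        final String message = e.getMessage(); // may be null
        return message != null && message.contains("Circular redirect to");
    }

    public static void main(final String[] args) {
        System.out.println(isAcceptedStatus(302)); // prints true
        System.out.println(isAcceptedStatus(404)); // prints false
        System.out.println(isCircularRedirect(new IOException())); // prints false
    }
}
```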