Bring old readability-server into this repo #21

Merged
merged 4 commits into from
Mar 17, 2025
26 changes: 20 additions & 6 deletions README.md
@@ -1,6 +1,6 @@
# Web Monitoring Task Sheets

**⚠️ This is a work in progress, and is pretty messy! There will be lots of commits directly on `main` for now. ⚠️**
**⚠️ This is a messy work in progress! There will be lots of commits directly on `main` for now. ⚠️**

This project is a re-envisioning of our process for generating weekly analyst tasking spreadsheets. It pulls down information from a web-monitoring-db instance about all the changes that have occurred over a given timeframe and then attempts to analyze them and assign a meaningful priority to each page.

@@ -9,7 +9,7 @@ Run it like:
```sh
# Generate task sheets covering the timeframe between November 10, 2019 and now
# and save them in the ./task-sheets directory.
> python generate_task_sheets.py --after '2019-11-10T00:00:00Z' --output ./task-sheets
> python generate_task_sheets.py --after '2019-11-10T00:00:00Z' --skip-readability --output ./task-sheets
```

There are a slew of other options you can find out about with the `--help` option:
@@ -18,10 +18,24 @@ There are a slew of other options you can find out about with the `--help` optio
> python generate_task_sheets.py --help
```

It requires a copy of `readability-server` from [web-monitoring-changed-terms-analysis](https://github.com/edgi-govdata-archiving/web-monitoring-changed-terms-analysis) to be running.

The actual analysis routines can be found in [`./analyst_sheets/analyze.py`](./analyst_sheets/analyze.py).

---

In current production usage, we use [Mozilla’s “Readability” tool](https://github.com/mozilla/readability) (what generates the reader view in Firefox) for some parts of the analysis. It has some issues, though, so there is a partially built alternative/fallback for it in `analyst_sheets/normalize.py:get_main_content` (for more info, see [#9](https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/issues/9)). It’s likely too simplistic to work for a lot of potential cases, though.

- To use Readability, you’ll need Node.js v20 or later installed. Before running `generate_task_sheets.py`, start the Readability server:

    ```sh
    > cd readability-server
    > npm install
    > npm start
    ```

    Then, in a different shell session, run `generate_task_sheets.py` with whatever arguments you want. Afterward, you can shut down the Readability server.

- To use the in-development fallback, specify the `--skip-readability` option when running `generate_task_sheets.py` instead of starting the Readability server.
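
One way to confirm the Readability server is reachable before running `generate_task_sheets.py` is to request one of its endpoints directly. The following is a minimal sketch, not part of this repository: `readabilityUrl` and `fetchReadableText` are hypothetical helper names, and it assumes the server is on its default port (7323) and that a global `fetch` is available (Node.js 18 or later).

```javascript
// Hypothetical helper: builds a request URL for the local Readability server.
// The default port (7323) and the `url`/`force` query parameters come from
// readability-server/index.js; the helper names are illustrative only.
function readabilityUrl (endpoint, pageUrl, { port = 7323, force = false } = {}) {
  const target = new URL(`http://localhost:${port}/${endpoint}`);
  target.searchParams.set('url', pageUrl);
  if (force) target.searchParams.set('force', 'true');
  return target.toString();
}

// Hypothetical helper: asks the `/text` endpoint for the plain-text main
// content of a page. The server must already be running (`npm start`).
async function fetchReadableText (pageUrl) {
  const response = await fetch(readabilityUrl('text', pageUrl));
  if (!response.ok) {
    throw new Error(`Readability server responded with ${response.status}`);
  }
  return response.text();
}
```

Building the URL with `URL` and `searchParams` keeps the target page's own query string correctly escaped inside the `url` parameter.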


## Installation

@@ -37,7 +51,7 @@ The actual analysis routines can be found in [`./analyst_sheets/analyze.py`](./a

```sh
> cd xyz
> pyenv virtualenv 3.7.4 wm-task-sheets
> pyenv virtualenv 3.10.16 wm-task-sheets
> pyenv activate wm-task-sheets
> pip install --upgrade pip # Make sure pip is up-to-date
> pip install -r requirements.txt
@@ -52,7 +66,7 @@ The actual analysis routines can be found in [`./analyst_sheets/analyze.py`](./a

## License & Copyright

Copyright (C) 2019 Environmental Data and Governance Initiative (EDGI)
Copyright (C) 2019–2025 Environmental Data and Governance Initiative (EDGI)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
1 change: 1 addition & 0 deletions readability-server/.node-version
@@ -0,0 +1 @@
22.14.0
155 changes: 155 additions & 0 deletions readability-server/index.js
@@ -0,0 +1,155 @@
/**
 * A minimal HTTP server that uses Mozilla’s Readability fork (the best working
 * and most maintained thing I could find) to convert the contents of a URL to
 * plain text.
 *
 * Make a request to `/proxy?url={some_url}` and it will return a plain-text
 * version of the main body of the content at `some_url`.
 *
 * NOTE: Readability is wrapped in a worker pool implementation because it is
 * not async. It is also not published as a standalone package on NPM, so we
 * depend directly on its git URL. See the source here:
 * https://github.com/mozilla/readability/
 */
'use strict';

import bodyParser from 'body-parser';
import express from 'express';
import { WorkerPool } from './worker-pool.js';

const serverPort = process.env.PORT || 7323;

const workerPool = new WorkerPool('./readability-worker.js', 10);
const app = express();

// TODO: add logging/warning for slow requests

const booleanTrue = /^(t|true|1)*$/i;

function timedFetch (url, options = {}) {
  const timeout = options.totalTimeout || (options.timeout * 2) || 10000;
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);

  options = Object.assign({signal: controller.signal}, options);
  return fetch(url, options).finally(() => clearTimeout(timer));
}

async function loadUrlMiddleware (request, response, next) {
  const url = request.query.url;

  if (!url) {
    return response.status(400).json({
      error: 'You must set the `?url=<url>` querystring parameter.'
    });
  }

  console.log('Proxying', url);

  try {
    const upstream = await timedFetch(url);
    const html = await upstream.text();
    request.htmlBody = html;
    next();
  }
  catch (error) {
    if (error.name === 'AbortError') {
      response.status(504).json({error: `Upstream request timed out: ${url}`});
      console.error('TIMEOUT:', url);
    }
    else {
      response.status(500).json({error: error.message});
      console.error(error);
      console.error('  While loading:', url);
    }
  }
}

const _bodyMiddleware = bodyParser.text({type: 'text/*', limit: '5MB'})
function readBodyMiddleware (request, response, next) {
  _bodyMiddleware(request, response, function () {
    const url = request.query.url;
    console.log('Processing POST data for:', url);
    request.htmlBody = request.body;
    next();
  });
}

async function readabilityMiddleware (request, response, next) {
  const force = booleanTrue.test(request.query.force);
  const url = request.query.url;

  try {
    const html = request.htmlBody;
    const parsed = await workerPool.send({timeout: 45000}, html, url, force);

    if (parsed) {
      request.parsedPage = parsed;
      next();
    }
    else {
      response
        .status(422)
        .json({error: `Could not parse content for ${url}`});
    }
  }
  catch (error) {
    response.status(500).json({error: error.message});
    console.error(error);
    console.error('  While processing:', url);
  }
}

app.get('/proxy', loadUrlMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/plain')
    .send(request.parsedPage.text);
});

app.get('/text', loadUrlMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/plain')
    .send(request.parsedPage.text);
});

app.post('/text', readBodyMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/plain')
    .send(request.parsedPage.text);
});

app.get('/html', loadUrlMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/html')
    .send(request.parsedPage.html);
});

app.post('/html', readBodyMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/html')
    .send(request.parsedPage.html);
});

app.get('/non-content-html', loadUrlMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/html')
    .send(request.parsedPage.nonContentHtml);
});

app.post('/non-content-html', readBodyMiddleware, readabilityMiddleware, function (request, response) {
  response
    .type('text/html')
    .send(request.parsedPage.nonContentHtml);
});

app.get('/all', loadUrlMiddleware, readabilityMiddleware, function (request, response) {
  response.json(request.parsedPage);
});

app.post('/all', readBodyMiddleware, readabilityMiddleware, function (request, response) {
  response.json(request.parsedPage);
});

app.listen(serverPort, function () {
  console.log(`Listening on port ${serverPort}`);
});
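
Two expressions in `index.js` are compact enough that their edge cases are easy to miss: the `booleanTrue` pattern applied to `?force`, and the timeout fallback chain in `timedFetch`. The sketch below copies both expressions from the file and wraps them in two hypothetical helpers (`isForced` and `resolveTimeout`, which are not in the PR) so their behavior can be checked in isolation. It assumes that a bare `?force` arrives from the query parser as an empty string.

```javascript
// Same pattern as `booleanTrue` in index.js. The trailing `*` means the
// empty string also matches, so a bare `?force` counts as true.
const booleanTrue = /^(t|true|1)*$/i;

// Hypothetical helper mirroring how readabilityMiddleware reads `?force`.
// When the parameter is absent, `test(undefined)` checks the string
// "undefined", which does not match.
function isForced (queryValue) {
  return booleanTrue.test(queryValue);
}

// Hypothetical helper mirroring the timeout expression in timedFetch:
// an explicit `totalTimeout` wins, then twice `timeout`, then 10 seconds.
// With no `timeout`, `undefined * 2` is NaN (falsy), so the default applies.
function resolveTimeout (options = {}) {
  return options.totalTimeout || (options.timeout * 2) || 10000;
}

console.log(isForced('TRUE'));     // true
console.log(isForced(''));         // true
console.log(isForced(undefined));  // false
console.log(isForced('false'));    // false

console.log(resolveTimeout());                       // 10000
console.log(resolveTimeout({ timeout: 3000 }));      // 6000
console.log(resolveTimeout({ totalTimeout: 500 }));  // 500
```

Because the group can repeat, values such as `tt` or `11` also count as true. Anchoring the pattern as `/^(t|true|1)?$/i` would be a stricter alternative if that ever matters.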