Feature/port to py3 #184

Open · wants to merge 33 commits into base: develop

Commits (33)
702dbf1
Uses pycld2 instead of the (outdated) chrom[...]tector
flavioamieiro Nov 23, 2016
cb3d1d2
Removes pyparsing from requirements
flavioamieiro Nov 23, 2016
915efa7
fix cld import
geron Nov 23, 2016
0589592
prevent mongo from connecting at import time
geron Nov 23, 2016
0b4ccf6
run 2to3
geron Nov 24, 2016
e41028e
Removes redundant try/except block in urlparse import
flavioamieiro Nov 24, 2016
ccfb5d9
Pins celery version
flavioamieiro Nov 24, 2016
01a5fa6
Removes unnecessary cast to list that 2to3 inserted
flavioamieiro Nov 24, 2016
b16be95
Fixes test that expected str but received bytes
flavioamieiro Nov 24, 2016
21aa0a6
Adds test to make sure the 'process' method receives the expected data
flavioamieiro Nov 24, 2016
7d540d0
Fixes existing base task test
flavioamieiro Nov 24, 2016
aa4478a
Uses BytesIO instead of StringIO in wordcloud
flavioamieiro Nov 24, 2016
d311b74
Changes Wordcloud test not to touch the database
flavioamieiro Nov 24, 2016
65c07b1
Changes palavras_raw test to not touch the database
flavioamieiro Nov 24, 2016
9c8f952
Fix freqdist test and sorting
geron Nov 25, 2016
05594a1
fix spellchecker tests
geron Nov 25, 2016
7b31c98
spellchecker: warn if dictionary is missing
geron Nov 25, 2016
00cce60
fix test_unknown_mimetype_should_be_flagged test
geron Nov 25, 2016
afaaa0b
Update TestExtractorWorker.test_unknown_encoding_should_be_ignored
geron Nov 25, 2016
427da7d
fix TestExtractorWorker.test_unescape_html_entities
geron Nov 25, 2016
2c0f8e8
fix TestExtractorWorker.test_should_detect_encoding_and_return_a_unic…
geron Nov 25, 2016
6989936
fix TestExtractorWorker.test_should_guess_mimetype_for_file_without_e…
geron Nov 25, 2016
17e47cb
updated more extractor tests
geron Nov 26, 2016
4eb5f61
fix extractor.extract_pdf
geron Nov 26, 2016
24c266f
Rewrite extractor.trial_decode and write tests for it
geron Nov 27, 2016
c084132
extractor: convert text to string before calling parse_html
geron Nov 27, 2016
8e67779
extractor: fix language detection
geron Nov 27, 2016
11c203c
extractor: remove checks for text being a str, it will always be
geron Dec 2, 2016
c6b3296
extractor: remove up to 1k bytes that cld says are invalid
geron Dec 2, 2016
25a8e54
SpellingChecker: no need to check for KeyError from document keys
geron Dec 2, 2016
573a111
extractor: turn redundant tests into integration test
geron Dec 6, 2016
0265786
extractor tests: support newer version of pdfinfo
geron Dec 6, 2016
7b84def
change bigram worker to return metric names and respect bigram order
geron Jan 31, 2017
12 changes: 6 additions & 6 deletions doc/conf.py
@@ -46,8 +46,8 @@
master_doc = 'index'

# General information about the project.
project = u'PyPLN'
copyright = u'2011, Flávio Codeço Coelho'
project = 'PyPLN'
copyright = '2011, Flávio Codeço Coelho'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -187,8 +187,8 @@
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
('index', 'PyPLN.tex', u'PyPLN Documentation',
u'Flávio Codeço Coelho', 'manual'),
('index', 'PyPLN.tex', 'PyPLN Documentation',
'Flávio Codeço Coelho', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
@@ -220,6 +220,6 @@
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'pypln', u'PyPLN Documentation',
[u'Flávio Codeço Coelho'], 1)
('index', 'pypln', 'PyPLN Documentation',
['Flávio Codeço Coelho'], 1)
]
2 changes: 1 addition & 1 deletion pypln/backend/celery_app.py
@@ -19,7 +19,7 @@

from celery import Celery
from kombu import Exchange, Queue
import config
from . import config

app = Celery('pypln_workers', backend='mongodb',
broker='amqp://', include=['pypln.backend.workers'])
2 changes: 1 addition & 1 deletion pypln/backend/celery_task.py
@@ -31,7 +31,7 @@
from pypln.backend import config


mongo_client = pymongo.MongoClient(host=config.MONGODB_URIS)
mongo_client = pymongo.MongoClient(host=config.MONGODB_URIS, _connect=False)
database = mongo_client[config.MONGODB_DBNAME]
document_collection = database[config.MONGODB_COLLECTION]

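Note: a minimal sketch of the lazy-connection behaviour this change aims for. The underscore-prefixed _connect keyword above targets pymongo 2.x; in pymongo 3.x the public spelling is connect=False, so treat the exact keyword as version-dependent and this snippet as an assumption:

import pymongo

# No socket is opened here; the client only connects when the first
# operation (find, insert, ...) actually runs.
client = pymongo.MongoClient('mongodb://localhost:27017', connect=False)
database = client['pypln']
document_collection = database['analysis']
# document_collection.find_one()  # the connection is established at this point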
12 changes: 4 additions & 8 deletions pypln/backend/config.py
@@ -1,16 +1,12 @@
import os
import urllib.parse

from decouple import config, Csv

try:
import urlparse
except ImportError:
import urllib.parse as urlparse

def parse_url(url):
urlparse.uses_netloc.append('mongodb')
urlparse.uses_netloc.append('celery')
url = urlparse.urlparse(url)
urllib.parse.uses_netloc.append('mongodb')
urllib.parse.uses_netloc.append('celery')
url = urllib.parse.urlparse(url)

path = url.path[1:]
path = path.split('?', 2)[0]
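Note: a small illustration of what the ported parse_url now relies on; the example URL and database name are made up, not taken from the PR:

import urllib.parse

# Mirror the scheme registration done in config.py, then let urlparse
# expose credentials, host, port and the database name from the path.
urllib.parse.uses_netloc.append('mongodb')
url = urllib.parse.urlparse('mongodb://user:secret@localhost:27017/pypln')
# url.username == 'user', url.hostname == 'localhost', url.port == 27017
# url.path[1:] == 'pypln'  (the database name)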
24 changes: 12 additions & 12 deletions pypln/backend/workers/__init__.py
@@ -17,18 +17,18 @@
# You should have received a copy of the GNU General Public License
# along with PyPLN. If not, see <http://www.gnu.org/licenses/>.

from extractor import Extractor
from tokenizer import Tokenizer
from freqdist import FreqDist
from pos import POS
from statistics import Statistics
from bigrams import Bigrams
from palavras_raw import PalavrasRaw
from lemmatizer_pt import Lemmatizer
from palavras_noun_phrase import NounPhrase
from palavras_semantic_tagger import SemanticTagger
from word_cloud import WordCloud
from elastic_indexer import ElasticIndexer
from .extractor import Extractor
from .tokenizer import Tokenizer
from .freqdist import FreqDist
from .pos import POS
from .statistics import Statistics
from .bigrams import Bigrams
from .palavras_raw import PalavrasRaw
from .lemmatizer_pt import Lemmatizer
from .palavras_noun_phrase import NounPhrase
from .palavras_semantic_tagger import SemanticTagger
from .word_cloud import WordCloud
from .elastic_indexer import ElasticIndexer


__all__ = ['Extractor', 'Tokenizer', 'FreqDist', 'POS', 'Statistics',
44 changes: 21 additions & 23 deletions pypln/backend/workers/bigrams.py
@@ -16,33 +16,31 @@
#
# You should have received a copy of the GNU General Public License
# along with PyPLN. If not, see <http://www.gnu.org/licenses/>.
from collections import OrderedDict

import nltk
from collections import defaultdict

from nltk.collocations import BigramCollocationFinder
from nltk import BigramCollocationFinder, BigramAssocMeasures
from pypln.backend.celery_task import PyPLNTask

METRICS = ['chi_sq',
'dice',
'jaccard',
'likelihood_ratio',
'mi_like',
'phi_sq',
'pmi',
'poisson_stirling',
'raw_freq',
'student_t']

class Bigrams(PyPLNTask):
"""Create a NLTK bigram finder and return a table in JSON format"""

class Bigrams(PyPLNTask):
def process(self, document):
#todo: support filtering by stopwords
bigram_measures = nltk.collocations.BigramAssocMeasures()
metrics = ['chi_sq',
'dice',
'jaccard',
'likelihood_ratio',
'mi_like',
'phi_sq',
'pmi',
'poisson_stirling',
'raw_freq',
'student_t']
bigram_finder = BigramCollocationFinder.from_words(document['tokens'])
br = defaultdict(lambda :[])
for m in metrics:
for res in bigram_finder.score_ngrams(getattr(bigram_measures,m)):
br[res[0]].append(res[1])
return {'metrics': metrics, 'bigram_rank': br.items()}
bigram_rankings = OrderedDict()
for metric_name in METRICS:
metric = getattr(BigramAssocMeasures, metric_name)
for ranking in bigram_finder.score_ngrams(metric):
bigram = ranking[0]
d = bigram_rankings.setdefault(bigram, {})
d[metric_name] = ranking[1]
return {'bigram_rankings': list(bigram_rankings.items())}
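Note: a rough usage sketch of the reworked worker's output shape, using a made-up token list; each bigram ends up mapped to one score per association metric, so per-metric rankings can be rebuilt by the caller:

from nltk import BigramCollocationFinder, BigramAssocMeasures

tokens = ['the', 'quick', 'fox', 'the', 'quick', 'dog']
finder = BigramCollocationFinder.from_words(tokens)

rankings = {}
for metric_name in ('raw_freq', 'pmi'):
    metric = getattr(BigramAssocMeasures, metric_name)
    for bigram, score in finder.score_ngrams(metric):
        rankings.setdefault(bigram, {})[metric_name] = score

# rankings[('the', 'quick')] -> {'raw_freq': 0.4, 'pmi': ...}
# ('the', 'quick') occurs twice among the five bigrams, hence raw_freq == 0.4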
138 changes: 74 additions & 64 deletions pypln/backend/workers/extractor.py
@@ -18,16 +18,15 @@
# along with PyPLN. If not, see <http://www.gnu.org/licenses/>.

import base64
import html
import shlex

from HTMLParser import HTMLParser
from tempfile import NamedTemporaryFile
from os import unlink
from subprocess import Popen, PIPE
from mimetypes import guess_type
from re import compile as regexp_compile, DOTALL, escape

import cld
import pycld2 as cld
import magic

from pypln.backend.celery_task import PyPLNTask
@@ -46,6 +45,10 @@
'/h2', 'h3', '/h3', 'h4', '/h4', 'h5', '/h5', 'h6', '/h6',
'br', 'br/']
double_breakline = ['table', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
cld_error_re = regexp_compile('input contains invalid UTF-8 around byte '
'(?P<index>\d+) \(of \d+\)')
MAX_CLD_BYTES_TO_REMOVE = 1024


def clean(text):
text = regexp_spaces_start.sub(r'\1', text)
@@ -84,10 +84,10 @@ def parse_html(html, remove_tags=None, remove_inside=None,
[''] * (total_to_remove - 2)
content_between[index + 1] = '\n'
complete_tags.append('')
result = ''.join(sum(zip(content_between, complete_tags), tuple()))
result = ''.join(sum(list(zip(content_between, complete_tags)), tuple()))
return clean(result)

def get_pdf_metadata(data):
def get_pdf_metadata(data: str) -> dict:
lines = data.strip().splitlines()
metadata = {}
for line in lines:
@@ -98,7 +101,7 @@ def get_pdf_metadata(data):
metadata[key.strip()] = value.strip()
return metadata

def extract_pdf(data):
def extract_pdf(data: bytes) -> (str, dict):
temp = NamedTemporaryFile(delete=False)
filename = temp.name
temp.close()
@@ -112,14 +115,16 @@ def extract_pdf(data):
unlink(filename + '_ind.html')
unlink(filename + 's.html')
text = parse_html(html.replace('&#160;', ' '), True, ['script', 'style'])
pdfinfo = Popen(shlex.split('pdfinfo -'), stdin=PIPE, stdout=PIPE,
stderr=PIPE)
meta_out, meta_err = pdfinfo.communicate(input=data)

info_process = Popen(shlex.split('pdfinfo -'), stdin=PIPE, stdout=PIPE,
stderr=PIPE)
meta_out, meta_err = info_process.communicate(input=data)
try:
metadata = get_pdf_metadata(meta_out)
except:
metadata = get_pdf_metadata(meta_out.decode('utf-8'))
except Exception:
# TODO: what should I do here?
metadata = {}
#TODO: what should I do here?

if not (text and metadata):
return '', {}
elif not html_err:
@@ -128,41 +133,57 @@
return '', {}


def trial_decode(text):
def decode_text_bytes(text: bytes) -> str:
"""
Tries to detect text encoding using `magic`. If the detected encoding is
not supported, try utf-8, iso-8859-1 and ultimately falls back to decoding
as utf-8 replacing invalid chars with `U+FFFD` (the replacement character).

This is far from an ideal solution, but the extractor and the rest of the
pipeline need an unicode object.
Tries to detect text encoding using file magic. If that fails or the
detected encoding is not supported, tries using utf-8. If that doesn't work
tries using iso8859-1.
"""
with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
content_encoding = m.id_buffer(text)
try:
with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
content_encoding = m.id_buffer(text)
except magic.MagicError:
pass # This can happen for instance if text is a single char
else:
try:
return text.decode(content_encoding)
except LookupError: # The detected encoding is not supported
pass

forced_decoding = False
try:
result = text.decode(content_encoding)
except LookupError:
# If the detected encoding is not supported, we try to decode it as
# utf-8.
result = text.decode('utf-8')
except UnicodeDecodeError:
# Decoding with iso8859-1 doesn't raise UnicodeDecodeError, so this is
# a last resort.
result = text.decode('iso8859-1')
return result


def detect_language(text: str) -> str:
# CLD seems to have an issue with some bytes that Python considers
# to be valid utf-8. Remove up to MAX_CLD_BYTES_TO_REMOVE of such
# "invalid" bytes
# TODO: alert the user somehow if we give up removing them
detected_language = None
text_bytes = text.encode('utf-8')
for i in range(MAX_CLD_BYTES_TO_REMOVE):
try:
result = text.decode('utf-8')
except UnicodeDecodeError:
# Is there a better way of doing this than nesting try/except
# blocks? This smells really bad.
try:
result = text.decode('iso-8859-1')
except UnicodeDecodeError:
# If neither utf-8 nor iso-885901 work are capable of handling
# this text, we just decode it using utf-8 and replace invalid
# chars with U+FFFD.
# Two somewhat arbitrary decisions were made here: use utf-8
# and use 'replace' instead of 'ignore'.
result = text.decode('utf-8', 'replace')
forced_decoding = True

return result, forced_decoding
languages = cld.detect(text_bytes)[2]
except cld.error as exc:
message = exc.args[0] if exc.args else ''
match = cld_error_re.match(message)
if match:
byte_index = int(match.group('index'))
text_bytes = (text_bytes[:byte_index]
+ text_bytes[byte_index + 1:])
else:
raise
else:
if languages:
detected_language = languages[0][1]
break

return detected_language


class Extractor(PyPLNTask):
@@ -173,11 +194,12 @@ def process(self, file_data):
contents = base64.b64decode(file_data['contents'])
with magic.Magic(flags=magic.MAGIC_MIME_TYPE) as m:
file_mime_type = m.id_buffer(contents)

metadata = {}
if file_mime_type == 'text/plain':
text = contents
elif file_mime_type == 'text/html':
text = parse_html(contents, True, ['script', 'style'])
if file_mime_type in ('text/plain', 'text/html'):
text = decode_text_bytes(contents)
if file_mime_type == 'text/html':
text = parse_html(text, True, ['script', 'style'])
elif file_mime_type == 'application/pdf':
text, metadata = extract_pdf(contents)
else:
@@ -191,22 +213,10 @@ def process(self, file_data):
return {'mimetype': 'unknown', 'text': "",
'file_metadata': {}, 'language': ""}

text, forced_decoding = trial_decode(text)

if isinstance(text, unicode):
# HTMLParser only handles unicode objects. We can't pass the text
# through it if we don't know the encoding, and it's possible we
# also shouldn't. There's no way of knowing if it's a badly encoded
# html or a binary blob that happens do have bytes that look liked
# html entities.
text = HTMLParser().unescape(text)

text = html.unescape(text)
text = clean(text)

if isinstance(text, unicode):
language = cld.detect(text.encode('utf-8'))[1]
else:
language = cld.detect(text)[1]

return {'text': text, 'file_metadata': metadata, 'language': language,
'mimetype': file_mime_type, 'forced_decoding': forced_decoding}
return {'text': text,
'file_metadata': metadata,
'language': detect_language(text),
'mimetype': file_mime_type,
'forced_decoding': None}
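Note: a condensed sketch of the decode-then-detect flow introduced above (the sample string is illustrative). decode_text_bytes falls back from the magic-detected encoding to utf-8 and finally to iso8859-1, and detect_language runs pycld2 on the re-encoded UTF-8 bytes:

import pycld2 as cld

data = 'Flávio Codeço Coelho'.encode('iso8859-1')  # not valid UTF-8
try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    text = data.decode('iso8859-1')  # last-resort fallback, as in decode_text_bytes

# pycld2 returns (is_reliable, bytes_found, details); details holds
# (language_name, language_code, percent, score) tuples.
is_reliable, bytes_found, details = cld.detect(text.encode('utf-8'))
language = details[0][1] if details else None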
4 changes: 2 additions & 2 deletions pypln/backend/workers/freqdist.py
@@ -27,7 +27,7 @@ def process(self, document):
tokens = [info.lower() for info in document_tokens]
frequency_distribution = {token: tokens.count(token) \
for token in set(tokens)}
fd = frequency_distribution.items()
fd.sort(lambda x, y: cmp(y[1], x[1]))
fd = list(frequency_distribution.items())
fd.sort(key=lambda x: (-x[1], x[0]))

return {'freqdist': fd}
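Note: a tiny worked example of the new sort key (sample tokens are made up); counts sort in descending order and ties now break alphabetically, which the old cmp-based sort left unspecified:

tokens = ['b', 'a', 'a', 'c', 'b']
frequency_distribution = {token: tokens.count(token) for token in set(tokens)}
fd = list(frequency_distribution.items())
fd.sort(key=lambda x: (-x[1], x[0]))
# fd == [('a', 2), ('b', 2), ('c', 1)]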
2 changes: 1 addition & 1 deletion pypln/backend/workers/palavras_noun_phrase.py
@@ -40,7 +40,7 @@ def process(self, document):
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
palavras_output = document['palavras_raw']
if isinstance(palavras_output, unicode):
if isinstance(palavras_output, str):
# we *need* to send a 'str' to the process. Otherwise it's going to try to use ascii.
palavras_output = palavras_output.encode('utf-8')
stdout, stderr = process.communicate(palavras_output)
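Note: a brief aside on the encode step above; Popen.communicate expects bytes when the pipes are in binary mode (the default), so str output from a previous worker is encoded as UTF-8 first. The command here is a placeholder, not the real palavras invocation:

import subprocess

palavras_output = 'São Paulo é grande'  # str produced by an earlier worker
process = subprocess.Popen(['cat'], stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate(palavras_output.encode('utf-8'))
text = stdout.decode('utf-8')  # bytes back to str for the rest of the pipeline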