How to make wagtail_textract work with Media Storage Backend #15
Comments
Hi @danieltomasku, sorry, there is nothing yet for other storage backends, since we only use the default backend for now. PRs are always welcome ;).
Hi @allcaps, Thanks for your response. I ended up using:
import logging
import tempfile
from urllib.request import urlretrieve

logger = logging.getLogger(__name__)

def index_documents(self):
    """Loops through all the documents in a ResourcePage and returns a string with all the extracted text."""
    alltext = ''
    for block in self.body:
        if block.block_type == 'resource_document':
            # file_extension has no leading dot, so the entries must not have one either
            if block.value['document'].file_extension.lower() in ['pdf', 'pptx', 'html', 'htm', 'xls', 'xlsx', 'doc', 'docx', 'rtf', 'txt']:
                try:
                    # Local storage: the file can be read straight from disk
                    path = block.value['document'].file.path
                    text = self.extract_text(path)
                except NotImplementedError:
                    # Remote storage: download to a temporary file first
                    logger.info('Downloading for search index %s' % block.value['document'].file.url)
                    remote_file_url = block.value['document'].file.url
                    f = tempfile.NamedTemporaryFile('w+b', suffix='.%s' % block.value['document'].file_extension)
                    urlretrieve(remote_file_url, filename=f.name)
                    path = f.name
                    text = self.extract_text(path)
                    f.close()  # closing also deletes the temporary file
                alltext += text.decode("utf-8")
    return alltext
The relevant section is in the Thoughts on the approach?
Hi @danieltomasku, Thanks for bringing this use case to our attention. That approach looks pretty good to me. Is there some way we can make this easier? Maybe the
Seems like Wagtail does the 'is-this-a-local-file' check in the same way: https://github.com/wagtail/wagtail/blob/7034cd131774b8971ff3c7424999a28164480f29/wagtail/documents/views/serve.py#L35
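For readers skimming the thread, that check boils down to asking the file for its `.path` and treating `NotImplementedError` as "not a local file", since Django's base `Storage.path()` raises that error for backends without a filesystem. A minimal sketch follows; `local_path_or_none`, `LocalFile`, and `RemoteFile` are hypothetical names used only for illustration, and the two classes are plain stand-ins rather than real Django or Wagtail objects.

```python
def local_path_or_none(field_file):
    """Return the file's filesystem path, or None for non-local storage."""
    try:
        return field_file.path
    except NotImplementedError:
        # Remote backends cannot provide an absolute local path.
        return None


class LocalFile:
    """Stand-in for a file stored on the default filesystem backend."""
    path = "/media/documents/report.pdf"


class RemoteFile:
    """Stand-in for a file stored on a remote backend."""

    @property
    def path(self):
        raise NotImplementedError("This backend doesn't support absolute paths.")


print(local_path_or_none(LocalFile()))   # /media/documents/report.pdf
print(local_path_or_none(RemoteFile()))  # None
```

A caller can then branch on `None` to decide whether the file has to be fetched before it is handed to textract.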
Hi @danieltomasku, I'd be happy to accept a PR if that helps.
Hi there,
I am trying to use wagtail_textract for my project. I previously tried using just textract, but I am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a Media Storage backend, such as Azure.
The line here is referencing a file path:
text = textract.process(document.file.path).strip()
but when in production using a Media Storage backend, it seems like this will fail because it does not have a proper file system. Has this been tested or does anybody know how I might be able to get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.
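One possible way around this, sketched under assumptions rather than taken from wagtail_textract itself, is to copy the file through the storage API into a temporary file whenever `.path` is unavailable, and give textract that temporary path. The helper name `as_local_path` and the `RemoteDocumentFile` stand-in are hypothetical; the sketch only assumes the file object offers `.path`, `.open()`, `.read()`, and `.close()`, as Django's `FieldFile` does.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_path(field_file, suffix=""):
    """Yield a filesystem path for field_file, copying it locally if needed."""
    try:
        path = field_file.path  # works on filesystem-backed storage
    except NotImplementedError:
        path = None  # remote storage: no local path available
    if path is not None:
        yield path
        return
    # Copy the remote content into a named temporary file.
    tmp = tempfile.NamedTemporaryFile("w+b", suffix=suffix, delete=False)
    try:
        field_file.open("rb")
        try:
            shutil.copyfileobj(field_file, tmp)
        finally:
            field_file.close()
        tmp.close()  # flush to disk so other readers see the full content
        yield tmp.name
    finally:
        tmp.close()
        os.unlink(tmp.name)  # always remove the temporary copy


class RemoteDocumentFile:
    """Stand-in for a file on a remote backend (not a real Django class)."""

    def __init__(self, data):
        self._data = data
        self._buffer = None

    @property
    def path(self):
        raise NotImplementedError("This backend doesn't support absolute paths.")

    def open(self, mode="rb"):
        self._buffer = io.BytesIO(self._data)
        return self

    def read(self, size=-1):
        return self._buffer.read(size)

    def close(self):
        self._buffer.close()


with as_local_path(RemoteDocumentFile(b"hello"), suffix=".txt") as local_path:
    with open(local_path, "rb") as handle:
        print(handle.read())  # b'hello'
```

With a real document, the body of the `with` block would be the existing call, `textract.process(local_path).strip()`, and the suffix would come from the document's file extension so textract can pick the right parser.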