Question: What is the best way to use website crawler in a workspace? #605
Comments
Bedrock Knowledge Bases supports crawling websites and has an API to sync the data. You might be able to set up EventBridge to periodically call the Bedrock API StartIngestionJob. An alternative is to periodically remove the website from the workspace and add it back. The integration test has an example where it adds an RSS feed and then removes it (document).
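For illustration, here is a minimal sketch of the periodic StartIngestionJob idea, written as a Lambda handler that an EventBridge schedule could invoke. It assumes the website is already configured as a data source on a Bedrock knowledge base; the environment variable names `KNOWLEDGE_BASE_ID` and `DATA_SOURCE_ID` and the helper `start_sync` are hypothetical and not part of this project.

```python
import os


def start_sync(client, knowledge_base_id: str, data_source_id: str) -> str:
    """Start a Bedrock knowledge base ingestion (sync) job and return its ID.

    `client` is expected to be a boto3 "bedrock-agent" client.
    """
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]


def lambda_handler(event, context):
    # Imported here so start_sync can be exercised locally without AWS access.
    import boto3

    client = boto3.client("bedrock-agent")
    # Hypothetical environment variable names; set them on the Lambda function.
    job_id = start_sync(
        client,
        os.environ["KNOWLEDGE_BASE_ID"],
        os.environ["DATA_SOURCE_ID"],
    )
    return {"ingestionJobId": job_id}
```

The schedule itself (for example, a daily rule targeting this function) would be defined separately in EventBridge; how often to run it depends on how frequently the site changes.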
@charles-marion Thanks for the links, I will have a look.
@charles-marion Currently I am using OpenSearch vector storage and primarily uploading PDF documents to the workspace using the file upload option. I am thinking of using the website crawler so that I don't have to upload the documents manually, since the same documents are also published as web pages on the website.
This issue is stale because it has been open for 60 days with no activity.
The website crawler is a great feature and can be used to ingest web pages into the workspace. I am wondering what the best way is to update the workspace when some of the web pages have changed after the initial crawl.
I believe we would need to crawl the website again. Would that result in duplicate documents in the workspace and the vector database?
How can we avoid the duplication and update the workspace with the updated web pages?
Should we create a new workspace and crawl the website again? This doesn't seem scalable when the website content is updated frequently.
What would be the best approach in this situation?
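To make the duplication concern concrete, here is a generic sketch of one way a re-crawl could avoid duplicates: key each page by its URL and compare a content hash, so unchanged pages are skipped and changed pages replace the old entry. This is not how this project's crawler is known to work; `plan_updates` and its inputs are hypothetical names used only for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(indexed: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to do with each URL after a re-crawl.

    indexed: URL -> content hash stored at the previous crawl
    crawled: URL -> page text from the current crawl
    """
    plan: dict[str, list[str]] = {"add": [], "replace": [], "skip": [], "delete": []}
    for url, text in crawled.items():
        digest = content_hash(text)
        if url not in indexed:
            plan["add"].append(url)        # new page
        elif indexed[url] != digest:
            plan["replace"].append(url)    # content changed: re-index in place
        else:
            plan["skip"].append(url)       # unchanged: nothing to do
    # Pages that were indexed before but no longer appear on the site.
    plan["delete"] = [url for url in indexed if url not in crawled]
    return plan
```

With this kind of bookkeeping, only the "add" and "replace" URLs would need to be re-embedded, which matters when the site is updated frequently.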