added context.dev web tools #244
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| # Convert any webpage to Clean Markdown or Structured JSON with context.dev | ||
|
|
||
| This project lets you convert any webpage into **LLM-ready Markdown** or **structured JSON** using [context.dev](https://context.dev). | ||
|
|
||
| - [context.dev](https://context.dev) is used to scrape, crawl, and extract structured data from websites. | ||
| - A Jupyter notebook walks through each API step by step. | ||
|
|
||
| --- | ||
|
|
||
| ## Setup and Installation | ||
|
|
||
| **Get a context.dev API key**: | ||
|
|
||
| - Go to [context.dev](https://context.dev) and sign up for an account. | ||
| - Paste your API key into the first code cell of `notebook.ipynb`. | ||
|
Contributor
Root cause: the tutorial currently standardizes an insecure API-key workflow. Both the docs and the notebook teach storing the key directly in notebook content. Switch the tutorial contract to environment-variable-based auth and keep both files aligned to that single secure pattern.
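The environment-variable pattern this comment asks for could look roughly like the sketch below. It uses only the standard library. The variable name `CONTEXT_DEV_API_KEY` mirrors the notebook's constant but is an assumption, and `load_api_key` is a hypothetical helper, not part of the PR or of any SDK.

```python
import os

# Assumed variable name; it mirrors the constant used in the notebook.
ENV_VAR = "CONTEXT_DEV_API_KEY"


def load_api_key(env_var: str = ENV_VAR) -> str:
    """Return the API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

With a helper like this, the notebook's client cell could pass `api_key=load_api_key()` instead of a literal string, and the README step could tell readers to export the variable rather than paste the key into a cell.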
||
|
|
||
| **Install dependencies**: | ||
|
|
||
| Ensure you have Python 3.11 or later installed. | ||
|
|
||
| ```bash | ||
| pip install context.dev jupyter | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Run the project | ||
|
|
||
| ```bash | ||
| jupyter notebook notebook.ipynb | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 📬 Stay Updated with Our Newsletter! | ||
|
|
||
| **Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com) | ||
|
|
||
| [](https://join.dailydoseofds.com) | ||
|
|
||
| --- | ||
|
|
||
| ## Contribution | ||
|
|
||
| Contributions are welcome! Please fork the repository and submit a pull request with your improvements. | ||
|
❤️
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,185 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "a1b2c3d4", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Convert any webpage to Clean Markdown or Structured JSON with context.dev\n", | ||
| "\n", | ||
| "This notebook walks through the three core context.dev web APIs step by step.\n", | ||
| "\n", | ||
| "**Prerequisites:** Get an API key at [context.dev](https://context.dev) and set it in the cell below." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "b2c3d4e5", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "!pip install -U context.dev\n", | ||
| "\n", | ||
| "CONTEXT_DEV_API_KEY = \"your_api_key_here\"" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "c3d4e5f6", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import json\n", | ||
| "\n", | ||
| "from context.dev import ContextDev\n", | ||
| "\n", | ||
| "client = ContextDev(api_key=CONTEXT_DEV_API_KEY)\n", | ||
| "URL = \"https://stripe.com/pricing\"" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "d4e5f6a7", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 1. Single page → Clean Markdown\n", | ||
| "\n", | ||
| "`web.web_scrape_md` converts a URL into GitHub Flavored Markdown, with optional main-content-only stripping." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "e5f6a7b8", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "scrape = client.web.web_scrape_md(\n", | ||
| " url=URL,\n", | ||
| " use_main_content_only=True,\n", | ||
| " include_links=True,\n", | ||
| ")\n", | ||
| "\n", | ||
| "print(f\"URL: {scrape.url}\")\n", | ||
| "print(f\"Credits used: {scrape.key_metadata.credits_consumed}\")\n", | ||
| "print(\"\\n--- Preview ---\\n\")\n", | ||
| "print(scrape.markdown[:2000])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "f6a7b8c9", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 2. Crawl a site → Markdown per page\n", | ||
| "\n", | ||
| "`web.web_crawl_md` follows internal links and returns Markdown for each page." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "a7b8c9d0", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "crawl = client.web.web_crawl_md(\n", | ||
| " url=\"https://docs.stripe.com\",\n", | ||
| " max_pages=3,\n", | ||
| " max_depth=1,\n", | ||
| " use_main_content_only=True,\n", | ||
| ")\n", | ||
| "\n", | ||
| "print(crawl.metadata)\n", | ||
| "for page in crawl.results:\n", | ||
| " status = \"ok\" if page.metadata.success else \"failed\"\n", | ||
| " print(f\"\\n[{status}] {page.metadata.title} — {page.metadata.url}\")\n", | ||
| " print(page.markdown[:500])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "b8c9d0e1", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## 3. Extract structured JSON with a schema\n", | ||
| "\n", | ||
| "`web.extract` crawls relevant pages and returns typed JSON matching your JSON Schema." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "c9d0e1f2", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "pricing_schema = {\n", | ||
| " \"type\": \"object\",\n", | ||
| " \"properties\": {\n", | ||
| " \"currency\": {\"type\": \"string\"},\n", | ||
| " \"plans\": {\n", | ||
| " \"type\": \"array\",\n", | ||
| " \"items\": {\n", | ||
| " \"type\": \"object\",\n", | ||
| " \"properties\": {\n", | ||
| " \"name\": {\"type\": \"string\"},\n", | ||
| " \"price\": {\"type\": \"string\"},\n", | ||
| " \"billing_period\": {\"type\": \"string\"},\n", | ||
| " \"features\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}},\n", | ||
| " },\n", | ||
| " \"required\": [\"name\"],\n", | ||
| " \"additionalProperties\": False,\n", | ||
| " },\n", | ||
| " },\n", | ||
| " },\n", | ||
| " \"required\": [\"plans\"],\n", | ||
| " \"additionalProperties\": False,\n", | ||
| "}\n", | ||
| "\n", | ||
| "extracted = client.web.extract(\n", | ||
| " url=URL,\n", | ||
| " schema=pricing_schema,\n", | ||
| " instructions=\"Prioritize the pricing page. Include currency when visible.\",\n", | ||
| " max_pages=5,\n", | ||
| ")\n", | ||
| "\n", | ||
| "print(\"Pages analyzed:\", extracted.urls_analyzed)\n", | ||
| "print(json.dumps(extracted.data, indent=2))" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "d0e1f2a3", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Customize the schema\n", | ||
| "\n", | ||
| "Change `pricing_schema` above to extract whatever you need — company info, job postings, article metadata, etc. context.dev returns JSON matching your schema." | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.11.0" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } |
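The notebook's closing cell says `pricing_schema` can be swapped for other shapes such as job postings. As a rough illustration, a schema in the same style might look like the sketch below. Every field name in it is hypothetical and chosen only for the example.

```python
import json

# Hypothetical schema for job postings; all field names are illustrative.
job_schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "jobs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "location": {"type": "string"},
                    "employment_type": {"type": "string"},
                },
                "required": ["title"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["jobs"],
    "additionalProperties": False,
}

# Serializing confirms the schema is plain JSON-compatible data.
print(json.dumps(job_schema, indent=2))
```

A schema like this would be passed as the `schema` argument of `client.web.extract`, in the same way the notebook passes `pricing_schema`.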
Update beginner project count metadata to match the new entry.
Line 95 adds a beginner project, but the displayed beginner count (Line 34: Beginner Projects (22)) now appears stale. Please update the count(s) in the README headers/TOC so registry metadata stays accurate.