Open
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -92,6 +92,7 @@ Perfect for getting started with AI engineering. These projects focus on single
- [**Video RAG with Gemini**](./video-rag-gemini) - Chat with videos using Gemini AI

#### Other Tools
- [**Webpage to Markdown & JSON with context.dev**](./context-dev-website-to-md-and-json) - Notebook tutorial for clean Markdown and structured JSON extraction

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update beginner project count metadata to match the new entry.

Line 95 adds a beginner project, but the displayed beginner count (Line 34: Beginner Projects (22)) now appears stale. Please update the count(s) in the README headers/TOC so registry metadata stays accurate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 95, Update the README metadata that lists beginner project
counts to reflect the newly added entry: locate the "Beginner Projects (22)"
header/TOC entry and any other occurrences of that count text in README.md and
increment it to "Beginner Projects (23)" (or adjust to the correct total if
there are multiple new/removed entries), ensuring all instances match the new
list including the newly added "[Webpage to Markdown & JSON with context.dev]"
link.
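
A count like this can be checked mechanically rather than by eye. The sketch below is one possible check, not part of the PR: it assumes the count lives in a Markdown heading of the form `Beginner Projects (N)`, that each project is a bullet shaped like `- [**Name**](./path)`, and that the section runs until the next heading of the same or a higher level. The function name `check_section_count` is hypothetical.

```python
import re

# Assumed shapes: headings like "## Beginner Projects (22)" and
# entries like "- [**Name**](./folder) - description".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
ENTRY_RE = re.compile(r"^- \[\*\*.+?\*\*\]\(\./[^)]+\)")


def check_section_count(readme_text, title="Beginner Projects"):
    """Return (declared, actual) entry counts for the section named `title`."""
    lines = readme_text.splitlines()
    start = level = declared = None
    end = len(lines)
    for i, line in enumerate(lines):
        heading = HEADING_RE.match(line)
        if not heading:
            continue
        if start is None:
            # Look for the section heading that carries the declared count.
            found = re.search(re.escape(title) + r".*\((\d+)\)", heading.group(2))
            if found:
                start, level, declared = i, len(heading.group(1)), int(found.group(1))
        elif len(heading.group(1)) <= level:
            # A heading at the same or a higher level closes the section.
            end = i
            break
    if start is None:
        raise ValueError(f"no heading with a count found for {title!r}")
    actual = sum(1 for line in lines[start + 1:end] if ENTRY_RE.match(line))
    return declared, actual
```

If the two numbers returned differ, the heading (and any matching TOC entry) needs updating. Sub-headings such as `#### Other Tools` are deeper than the section heading, so their entries are still counted.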

- [**Website to API with FireCrawl**](./Website-to-API-with-FireCrawl) - Convert websites to APIs
- [**AI News Generator**](./ai_news_generator) - News generation with CrewAI and Cohere
- [**Siamese Network**](./siamese-network) - Digit similarity detection on MNIST
45 changes: 45 additions & 0 deletions context-dev-website-to-md-and-json/README.md
@@ -0,0 +1,45 @@
# Convert any webpage to Clean Markdown or Structured JSON with context.dev

This project lets you convert any webpage into **LLM-ready Markdown** or **structured JSON** using [context.dev](https://context.dev).

- [context.dev](https://context.dev) is used to scrape, crawl, and extract structured data from websites.
- A Jupyter notebook walks through each API step by step.

---

## Setup and Installation

**Get a context.dev API key**:

- Go to [context.dev](https://context.dev) and sign up for an account.
- Paste your API key into the first code cell of `notebook.ipynb`.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Root cause: the tutorial currently standardizes an insecure API-key workflow.

Both docs and notebook teach storing the key directly in notebook content. Switch the tutorial contract to environment-variable-based auth and keep both files aligned to that single secure pattern.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@context-dev-website-to-md-and-json/README.md` at line 15, Replace the
insecure “paste API key into notebook” workflow with environment-variable-based
auth: update README.md text to instruct users to export their context.dev key
into an env var (e.g., CONTEXT_DEV_API_KEY) and to run the notebook after that,
and modify notebook.ipynb to read the key from
os.environ.get('CONTEXT_DEV_API_KEY') (or via getpass if you prefer an
interactive prompt) and raise a clear error if the var is missing; ensure any
in-notebook hardcoded key cells are removed or replaced with the env-var
retrieval and that examples reference the same CONTEXT_DEV_API_KEY variable name
so docs and notebook stay aligned.
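
A minimal sketch of what the env-var retrieval could look like in the notebook's first code cell. It reuses the `CONTEXT_DEV_API_KEY` name the notebook already defines; the helper name `load_api_key` and its parameters are hypothetical, and only standard-library calls are used.

```python
import os
from getpass import getpass

# Variable name taken from the notebook; everything else here is illustrative.
ENV_VAR = "CONTEXT_DEV_API_KEY"


def load_api_key(env=os.environ, prompt_if_missing=False):
    """Read the API key from the environment, optionally prompting for it."""
    key = env.get(ENV_VAR, "").strip()
    if not key and prompt_if_missing:
        # Interactive fallback: input is not echoed and never saved in the notebook.
        key = getpass(f"{ENV_VAR}: ").strip()
    if not key:
        raise RuntimeError(
            f"{ENV_VAR} is not set. Export it in your shell before starting Jupyter."
        )
    return key
```

In the notebook this would replace the hardcoded assignment, for example `CONTEXT_DEV_API_KEY = load_api_key(prompt_if_missing=True)`, so the key never lands in committed cell content.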


**Install dependencies**:

Ensure you have Python 3.11 or later installed.

```bash
pip install context.dev jupyter
```

---

## Run the project

```bash
jupyter notebook notebook.ipynb
```

---

## 📬 Stay Updated with Our Newsletter!

**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com)

[![Daily Dose of Data Science Newsletter](https://github.com/patchy631/ai-engineering/blob/main/resources/join_ddods.png)](https://join.dailydoseofds.com)

---

## Contribution

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
185 changes: 185 additions & 0 deletions context-dev-website-to-md-and-json/notebook.ipynb

❤️

@@ -0,0 +1,185 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {},
"source": [
"# Convert any webpage to Clean Markdown or Structured JSON with context.dev\n",
"\n",
"This notebook walks through the three core context.dev web APIs step by step.\n",
"\n",
"**Prerequisites:** Get an API key at [context.dev](https://context.dev) and set it in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e5",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U context.dev\n",
"\n",
"CONTEXT_DEV_API_KEY = \"your_api_key_here\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d4e5f6",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from context.dev import ContextDev\n",
"\n",
"client = ContextDev(api_key=CONTEXT_DEV_API_KEY)\n",
"URL = \"https://stripe.com/pricing\""
]
},
{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"## 1. Single page → Clean Markdown\n",
"\n",
"`web.web_scrape_md` converts a URL into GitHub Flavored Markdown, with optional main-content-only stripping."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"scrape = client.web.web_scrape_md(\n",
" url=URL,\n",
" use_main_content_only=True,\n",
" include_links=True,\n",
")\n",
"\n",
"print(f\"URL: {scrape.url}\")\n",
"print(f\"Credits used: {scrape.key_metadata.credits_consumed}\")\n",
"print(\"\\n--- Preview ---\\n\")\n",
"print(scrape.markdown[:2000])"
]
},
{
"cell_type": "markdown",
"id": "f6a7b8c9",
"metadata": {},
"source": [
"## 2. Crawl a site → Markdown per page\n",
"\n",
"`web.web_crawl_md` follows internal links and returns Markdown for each page."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b8c9d0",
"metadata": {},
"outputs": [],
"source": [
"crawl = client.web.web_crawl_md(\n",
" url=\"https://docs.stripe.com\",\n",
" max_pages=3,\n",
" max_depth=1,\n",
" use_main_content_only=True,\n",
")\n",
"\n",
"print(crawl.metadata)\n",
"for page in crawl.results:\n",
" status = \"ok\" if page.metadata.success else \"failed\"\n",
" print(f\"\\n[{status}] {page.metadata.title} — {page.metadata.url}\")\n",
" print(page.markdown[:500])"
]
},
{
"cell_type": "markdown",
"id": "b8c9d0e1",
"metadata": {},
"source": [
"## 3. Extract structured JSON with a schema\n",
"\n",
"`web.extract` crawls relevant pages and returns typed JSON matching your JSON Schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9d0e1f2",
"metadata": {},
"outputs": [],
"source": [
"pricing_schema = {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"currency\": {\"type\": \"string\"},\n",
" \"plans\": {\n",
" \"type\": \"array\",\n",
" \"items\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"name\": {\"type\": \"string\"},\n",
" \"price\": {\"type\": \"string\"},\n",
" \"billing_period\": {\"type\": \"string\"},\n",
" \"features\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}},\n",
" },\n",
" \"required\": [\"name\"],\n",
" \"additionalProperties\": False,\n",
" },\n",
" },\n",
" },\n",
" \"required\": [\"plans\"],\n",
" \"additionalProperties\": False,\n",
"}\n",
"\n",
"extracted = client.web.extract(\n",
" url=URL,\n",
" schema=pricing_schema,\n",
" instructions=\"Prioritize the pricing page. Include currency when visible.\",\n",
" max_pages=5,\n",
")\n",
"\n",
"print(\"Pages analyzed:\", extracted.urls_analyzed)\n",
"print(json.dumps(extracted.data, indent=2))"
]
},
{
"cell_type": "markdown",
"id": "d0e1f2a3",
"metadata": {},
"source": [
"## Customize the schema\n",
"\n",
"Change `pricing_schema` above to extract whatever you need — company info, job postings, article metadata, etc. context.dev returns JSON matching your schema."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
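
The notebook's closing cell suggests swapping `pricing_schema` for other shapes. As an illustration only, a job-postings schema written in the same JSON Schema style might look like the sketch below. The name `job_postings_schema` and every field in it are assumptions, not part of the PR.

```python
import json

# Hypothetical schema mirroring the structure of pricing_schema in the notebook.
job_postings_schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "postings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "location": {"type": "string"},
                    "employment_type": {"type": "string"},
                    "requirements": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["title"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["postings"],
    "additionalProperties": False,
}

# Round-trip through json to confirm the schema is plain, serializable data.
schema_json = json.dumps(job_postings_schema, indent=2)
```

Keeping `required` short and `additionalProperties` set to `False`, as the notebook's pricing schema does, constrains the output shape while tolerating pages where optional fields are absent.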