Commit c435542

Merge pull request #30 from ScrapeGraphAI/feat-searchscraper
SearchScraper
2 parents c898e99 + bcb9b0b

26 files changed: +911 −315 lines (selected diffs shown below)

README.md (+12 −11)

@@ -9,15 +9,15 @@
 <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
 </p>
 
-Official SDKs for the ScrapeGraph AI API - Intelligent web scraping powered by AI. Extract structured data from any webpage with natural language prompts.
+Official SDKs for the ScrapeGraph AI API - Intelligent web scraping and search powered by AI. Extract structured data from any webpage or perform AI-powered web searches with natural language prompts.
 
 Get your [API key](https://scrapegraphai.com)!
 
 ## 🚀 Quick Links
 
 - [Python SDK Documentation](scrapegraph-py/README.md)
 - [JavaScript SDK Documentation](scrapegraph-js/README.md)
-- [API Documentation](https://docs.scrapegraphai.com)
+- [API Documentation](https://docs.scrapegraphai.com)
 - [Website](https://scrapegraphai.com)
 
 ## 📦 Installation
@@ -34,31 +34,31 @@ npm install scrapegraph-js
 
 ## 🎯 Core Features
 
-- 🤖 **AI-Powered Extraction**: Use natural language to describe what data you want
+- 🤖 **AI-Powered Extraction & Search**: Use natural language to extract data or search the web
 - 📊 **Structured Output**: Get clean, structured data with optional schema validation
 - 🔄 **Multiple Formats**: Extract data as JSON, Markdown, or custom schemas
 - ⚡ **High Performance**: Concurrent processing and automatic retries
 - 🔒 **Enterprise Ready**: Production-grade security and rate limiting
 
 ## 🛠️ Available Endpoints
 
-### 🔍 SmartScraper
-Extract structured data from any webpage using natural language prompts.
+### 🤖 SmartScraper
+Use AI to extract structured data from any webpage or HTML content with natural language prompts.
+
+### 🔍 SearchScraper
+Perform AI-powered web searches with structured results and reference URLs.
 
 ### 📝 Markdownify
 Convert any webpage into clean, formatted markdown.
 
-### 💻 LocalScraper
-Extract information from a local HTML file using AI.
-
-
 ## 🌟 Key Benefits
 
 - 📝 **Natural Language Queries**: No complex selectors or XPath needed
 - 🎯 **Precise Extraction**: AI understands context and structure
-- 🔄 **Adaptive Scraping**: Works with dynamic and static content
+- 🔄 **Adaptive Processing**: Works with both web content and direct HTML
 - 📊 **Schema Validation**: Ensure data consistency with Pydantic/TypeScript
 - ⚡ **Async Support**: Handle multiple requests efficiently
+- 🔍 **Source Attribution**: Get reference URLs for search results
 
 ## 💡 Use Cases
 
@@ -67,13 +67,14 @@ Extract information from a local HTML file using AI.
 - 📰 **Content Aggregation**: Convert articles to structured formats
 - 🔍 **Data Mining**: Extract specific information from multiple sources
 - 📱 **App Integration**: Feed clean data into your applications
+- 🌐 **Web Research**: Perform AI-powered searches with structured results
 
 ## 📖 Documentation
 
 For detailed documentation and examples, visit:
 - [Python SDK Guide](scrapegraph-py/README.md)
 - [JavaScript SDK Guide](scrapegraph-js/README.md)
-- [API Documentation](https://docs.scrapegraphai.com)
+- [API Documentation](https://docs.scrapegraphai.com)
 
 ## 💬 Support & Feedback
 
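For orientation, the three endpoints above map onto the Python SDK as sketched below — a minimal example assembled from the scrapegraph-py snippets elsewhere in this commit (the prompts and URL are placeholders):

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# SmartScraper: extract structured data from a page (or raw HTML)
data = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description",
)

# SearchScraper (new in this PR): AI-powered search with reference URLs
search = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?",
)
print(search["result"], search["reference_urls"])

# Markdownify: convert a page into clean markdown
markdown = client.markdownify(website_url="https://example.com")
```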
scrapegraph-py/README.md (+53 −28)

@@ -4,7 +4,7 @@
 [![Python Support](https://img.shields.io/pypi/pyversions/scrapegraph-py.svg)](https://pypi.org/project/scrapegraph-py/)
 [![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-[![Documentation Status](https://readthedocs.org/projects/scrapegraph-py/badge/?version=latest)](https://docs.scrapegraphai.com)
+[![Documentation Status](https://readthedocs.org/projects/scrapegraph-py/badge/?version=latest)](https://docs.scrapegraphai.com)
 
 <p align="left">
 <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
@@ -20,7 +20,7 @@ pip install scrapegraph-py
 
 ## 🚀 Features
 
-- 🤖 AI-powered web scraping
+- 🤖 AI-powered web scraping and search
 - 🔄 Both sync and async clients
 - 📊 Structured output with Pydantic schemas
 - 🔍 Detailed logging
@@ -40,21 +40,36 @@ client = Client(api_key="your-api-key-here")
 
 ## 📚 Available Endpoints
 
-### 🔍 SmartScraper
+### 🤖 SmartScraper
 
-Scrapes any webpage using AI to extract specific information.
+Extract structured data from any webpage or HTML content using AI.
 
 ```python
 from scrapegraph_py import Client
 
 client = Client(api_key="your-api-key-here")
 
-# Basic usage
+# Using a URL
 response = client.smartscraper(
     website_url="https://example.com",
     user_prompt="Extract the main heading and description"
 )
 
+# Or using HTML content
+html_content = """
+<html>
+    <body>
+        <h1>Company Name</h1>
+        <p>We are a technology company focused on AI solutions.</p>
+    </body>
+</html>
+"""
+
+response = client.smartscraper(
+    website_html=html_content,
+    user_prompt="Extract the company description"
+)
+
 print(response)
 ```
 
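(Note: judging from the example above, website_url and website_html are alternative inputs to smartscraper — you supply one or the other — with the HTML path effectively absorbing the LocalScraper endpoint this commit removes.)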
@@ -80,46 +95,56 @@ response = client.smartscraper(
 
 </details>
 
-### 📝 Markdownify
+### 🔍 SearchScraper
 
-Converts any webpage into clean, formatted markdown.
+Perform AI-powered web searches with structured results and reference URLs.
 
 ```python
 from scrapegraph_py import Client
 
 client = Client(api_key="your-api-key-here")
 
-response = client.markdownify(
-    website_url="https://example.com"
+response = client.searchscraper(
+    user_prompt="What is the latest version of Python and its main features?"
 )
 
-print(response)
+print(f"Answer: {response['result']}")
+print(f"Sources: {response['reference_urls']}")
 ```
 
-### 💻 LocalScraper
-
-Extracts information from HTML content using AI.
+<details>
+<summary>Output Schema (Optional)</summary>
 
 ```python
+from pydantic import BaseModel, Field
 from scrapegraph_py import Client
 
 client = Client(api_key="your-api-key-here")
 
-html_content = """
-<html>
-    <body>
-        <h1>Company Name</h1>
-        <p>We are a technology company focused on AI solutions.</p>
-        <div class="contact">
-            <p>Email: [email protected]</p>
-        </div>
-    </body>
-</html>
-"""
+class PythonVersionInfo(BaseModel):
+    version: str = Field(description="The latest Python version number")
+    release_date: str = Field(description="When this version was released")
+    major_features: list[str] = Field(description="List of main features")
+
+response = client.searchscraper(
+    user_prompt="What is the latest version of Python and its main features?",
+    output_schema=PythonVersionInfo
+)
+```
+
+</details>
 
-response = client.localscraper(
-    user_prompt="Extract the company description",
-    website_html=html_content
+### 📝 Markdownify
+
+Converts any webpage into clean, formatted markdown.
+
+```python
+from scrapegraph_py import Client
+
+client = Client(api_key="your-api-key-here")
+
+response = client.markdownify(
+    website_url="https://example.com"
 )
 
 print(response)
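One step the optional-schema SearchScraper example above leaves implicit is consuming the structured result. Judging from the async example added later in this commit, the response's `result` field can be validated back into the model — a sketch:

```python
# Sketch: validate the SearchScraper result against the schema passed above.
info = PythonVersionInfo.model_validate(response["result"])
print(info.version, info.release_date, info.major_features)
```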
@@ -177,7 +202,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 ## 🔗 Links
 
 - [Website](https://scrapegraphai.com)
-- [Documentation](https://docs.scrapegraphai.com)
+- [Documentation](https://docs.scrapegraphai.com)
 - [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)
 
 ---
New file (+46): async SearchScraper example

```python
"""
Example of using the async searchscraper functionality to search for information concurrently.
"""

import asyncio

from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")


async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key-here")

    # List of search queries
    queries = [
        "What is the latest version of Python and what are its main features?",
        "What are the key differences between Python 2 and Python 3?",
        "What is Python's GIL and how does it work?",
    ]

    # Create tasks for concurrent execution
    tasks = [sgai_client.searchscraper(user_prompt=query) for query in queries]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for query {i+1}: {response}")
        else:
            print(f"\nSearch {i+1}:")
            print(f"Query: {queries[i]}")
            print(f"Result: {response['result']}")
            print("Reference URLs:")
            for url in response["reference_urls"]:
                print(f"- {url}")

    await sgai_client.close()


if __name__ == "__main__":
    asyncio.run(main())
```
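A possible refinement, not part of this commit: `asyncio.gather` fires every request at once, so a long query list may warrant a concurrency cap. A sketch under the same assumed `AsyncClient` API, with a hypothetical `bounded_searches` helper:

```python
import asyncio

from scrapegraph_py import AsyncClient


async def bounded_searches(queries: list[str], max_concurrency: int = 3):
    # Hypothetical helper (not part of the SDK): run searchscraper calls
    # with at most `max_concurrency` requests in flight at a time.
    sem = asyncio.Semaphore(max_concurrency)
    client = AsyncClient(api_key="your-api-key-here")

    async def one(query: str):
        async with sem:
            return await client.searchscraper(user_prompt=query)

    try:
        return await asyncio.gather(*(one(q) for q in queries), return_exceptions=True)
    finally:
        await client.close()
```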
New file (+119): async SearchScraper example with output schemas

```python
"""
Example of using the async searchscraper functionality with output schemas for extraction.
"""

import asyncio
from typing import List

from pydantic import BaseModel

from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")


# Define schemas for extracting structured data
class PythonVersionInfo(BaseModel):
    version: str
    release_date: str
    major_features: List[str]


class PythonComparison(BaseModel):
    key_differences: List[str]
    backward_compatible: bool
    migration_difficulty: str


class GILInfo(BaseModel):
    definition: str
    purpose: str
    limitations: List[str]
    workarounds: List[str]


async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key-here")

    # Define search queries with their corresponding schemas
    searches = [
        {
            "prompt": "What is the latest version of Python? Include the release date and main features.",
            "schema": PythonVersionInfo,
        },
        {
            "prompt": "Compare Python 2 and Python 3, including backward compatibility and migration difficulty.",
            "schema": PythonComparison,
        },
        {
            "prompt": "Explain Python's GIL, its purpose, limitations, and possible workarounds.",
            "schema": GILInfo,
        },
    ]

    # Create tasks for concurrent execution
    tasks = [
        sgai_client.searchscraper(
            user_prompt=search["prompt"],
            output_schema=search["schema"],
        )
        for search in searches
    ]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for search {i+1}: {response}")
        else:
            print(f"\nSearch {i+1}:")
            print(f"Query: {searches[i]['prompt']}")
            # print(f"Raw Result: {response['result']}")

            try:
                # Try to extract structured data using the schema
                result = searches[i]["schema"].model_validate(response["result"])

                # Print extracted structured data
                if isinstance(result, PythonVersionInfo):
                    print("\nExtracted Data:")
                    print(f"Python Version: {result.version}")
                    print(f"Release Date: {result.release_date}")
                    print("Major Features:")
                    for feature in result.major_features:
                        print(f"- {feature}")

                elif isinstance(result, PythonComparison):
                    print("\nExtracted Data:")
                    print("Key Differences:")
                    for diff in result.key_differences:
                        print(f"- {diff}")
                    print(f"Backward Compatible: {result.backward_compatible}")
                    print(f"Migration Difficulty: {result.migration_difficulty}")

                elif isinstance(result, GILInfo):
                    print("\nExtracted Data:")
                    print(f"Definition: {result.definition}")
                    print(f"Purpose: {result.purpose}")
                    print("Limitations:")
                    for limit in result.limitations:
                        print(f"- {limit}")
                    print("Workarounds:")
                    for workaround in result.workarounds:
                        print(f"- {workaround}")
            except Exception as e:
                print(f"\nCould not extract structured data: {e}")

            print("\nReference URLs:")
            for url in response["reference_urls"]:
                print(f"- {url}")

    await sgai_client.close()


if __name__ == "__main__":
    asyncio.run(main())
```
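An editorial aside on the `isinstance` chain above: each branch only pretty-prints the fields of the validated model, so a generic printer over Pydantic v2's `model_dump()` could serve all three schemas — a sketch, not part of the commit:

```python
from pydantic import BaseModel


def print_result(result: BaseModel) -> None:
    # Generic pretty-printer for any of the schemas above.
    print("\nExtracted Data:")
    for field, value in result.model_dump().items():
        if isinstance(value, list):
            print(f"{field}:")
            for item in value:
                print(f"- {item}")
        else:
            print(f"{field}: {value}")
```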
