Nexus News Agent is an AI-powered project that provides a suite of tools for searching, crawling, and summarizing news articles from the web. It features a powerful AI engine to process text and a server component to generate images with the summarized content.
- Search: Find news articles on any topic using Google Custom Search.
- Crawl: Fetch and parse the content of articles from their URLs.
- Summarize: Generate concise, one-line summaries of articles using the Gemini API.
- Image Generation: Create images with the title, summary, and source of an article overlaid on a template.
- Modular Architecture: The project is divided into an `ai_engine` for core processing and a `server` for API and image generation.
```
PGAI/
├── .gitignore
├── .python-version
├── pyproject.toml
├── README.md
├── requirements.txt
├── uv.lock
├── ai_engine/
│   ├── app.py              # Main entry point for the AI engine
│   ├── tools_crawl.py      # Web crawling utilities
│   ├── tools_search.py     # Search utilities
│   └── tools_summarize.py  # Summarization logic
└── server/
    ├── main.py             # FastAPI server
    ├── workflow.py         # Image generation workflow
    ├── assets/
    │   ├── base.png
    │   ├── glyphnames.json
    │   └── JetBrainsMonoNerdFontMono-BoldItalic.ttf
    └── output/
        └── processed_image.png
```
- Python 3.11 or higher
- An environment manager like `venv` or `uv`
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd PGAI
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the dependencies, using either `pip` with `requirements.txt` or `uv` with `pyproject.toml`:

   - Using `pip`:

     ```bash
     pip install -r requirements.txt
     ```

   - Using `uv`:

     ```bash
     uv sync
     ```
The project requires API keys for Google Custom Search and the Gemini API.

1. Create a `.env` file in the root of the `PGAI` directory.
2. Add your API keys to the `.env` file:

   ```env
   GOOGLE_CSE_KEY="your_google_custom_search_engine_key"
   GOOGLE_CSE_CX="your_google_custom_search_engine_cx"
   GOOGLE_API_KEY="your_gemini_api_key"
   ```
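For reference, a minimal sketch of how these variables could be loaded at startup, assuming `python-dotenv` is available (it is not named in the dependency summary below, so the project may load them differently):

```python
# Minimal sketch, assuming python-dotenv; the project may read these differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads PGAI/.env into the process environment

GOOGLE_CSE_KEY = os.environ["GOOGLE_CSE_KEY"]  # Custom Search API key
GOOGLE_CSE_CX = os.environ["GOOGLE_CSE_CX"]    # Programmable Search Engine ID
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]  # Gemini API key
```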
The AI engine is a command-line tool that takes a topic as input, searches for relevant articles, crawls them, and generates summaries.
1. Run the AI engine:

   ```bash
   python -m ai_engine.app
   ```

2. Enter a topic when prompted.

3. The summaries will be saved in a JSON file in the root of the `PGAI` directory.
The server is a FastAPI application that can generate an image with the summarized content of an article.
1. Start the server:

   ```bash
   uvicorn server.main:app --reload
   ```

2. Send a POST request to the `/content` endpoint with the following JSON payload:

   ```json
   {
     "title": "Your Title",
     "content": "Your one-line summary.",
     "reference": "your.source.com"
   }
   ```

3. The generated image will be saved as `processed_image.png` in the `server/output` directory.
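For example, you can exercise the endpoint with Python's `requests` library (already a project dependency), assuming the server is running locally on uvicorn's default port 8000:

```python
# Sends a sample payload to the /content endpoint; assumes the server
# is running locally at uvicorn's default http://127.0.0.1:8000.
import requests

payload = {
    "title": "Your Title",
    "content": "Your one-line summary.",
    "reference": "your.source.com",
}
response = requests.post("http://127.0.0.1:8000/content", json=payload, timeout=30)
response.raise_for_status()
print(response.status_code)  # 200 on success; the image lands in server/output/
```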
The main dependencies are listed in `pyproject.toml` and `requirements.txt`. They include:
- FastAPI: For the web server.
- LangChain & LangGraph: For building the AI workflow.
- Pillow & OpenCV: For image manipulation.
- Beautiful Soup & Requests: For web crawling.
- Pydantic: For data validation.
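To illustrate what the image workflow does with these libraries, here is a sketch of a Pillow text overlay on the bundled template; the coordinates, font sizes, and colors are assumptions, not the values used in `server/workflow.py`:

```python
# Illustrative Pillow overlay using the repo's bundled assets; layout
# values are assumptions and will differ from server/workflow.py.
from PIL import Image, ImageDraw, ImageFont

def render_card(title: str, content: str, reference: str) -> None:
    img = Image.open("server/assets/base.png").convert("RGB")
    font_path = "server/assets/JetBrainsMonoNerdFontMono-BoldItalic.ttf"
    title_font = ImageFont.truetype(font_path, 48)
    body_font = ImageFont.truetype(font_path, 28)

    draw = ImageDraw.Draw(img)
    draw.text((60, 80), title, font=title_font, fill="white")     # headline
    draw.text((60, 180), content, font=body_font, fill="white")   # summary
    draw.text((60, 320), reference, font=body_font, fill="gray")  # source

    img.save("server/output/processed_image.png")

render_card("Your Title", "Your one-line summary.", "your.source.com")
```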
Contributions are welcome! Please feel free to submit a pull request.
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
A CLI tool to automatically search, crawl, and summarize news and articles on any given topic, creating a concise brief from multiple sources.
- Overview
- Features
- Architecture
- Prerequisites
- Installation
- Configuration
- Quickstart
- Usage
- Output Example
- Troubleshooting
- FAQ
- Limitations
- Contributing
- License
News Agent Nexus is a Python-based command-line tool that acts as an autonomous research agent. You provide a topic, and it performs a web search to find relevant seed articles, crawls those pages and linked articles, and uses a large language model to generate structured summaries. The final output is a JSON file containing a list of concise, easy-to-read summaries with source references.
- Topic-based Search: Initiates research from a simple user-provided topic.
- Web Crawling: Fetches content from seed URLs and discovers related articles.
- AI-Powered Summarization: Uses Google's Gemini models to generate structured, concise summaries.
- Parallel Processing: Summarizes multiple articles concurrently for faster results.
- Structured Output: Saves results in a clean, timestamped JSON file for easy use in other applications.
- Configurable: API keys and search engine details are managed via environment variables.
The agent operates in a three-stage pipeline orchestrated by `app.py`:

1. Search: The user's topic is fed to the `tools_search` module, which uses the Google Custom Search API to find a list of initial "seed" URLs.
2. Crawl: The `tools_crawl` module fetches the HTML content for each seed URL. It also identifies and crawls other promising article-like links on those pages to broaden the content base.
3. Summarize: The extracted text from the crawled pages is passed to the `tools_summarize` module, which uses a large language model (Gemini) to generate a short, structured summary for each article in parallel.
The final summaries are collected and written to a single JSON file.
```mermaid
graph TD
    A[User Topic] --> B{app.py};
    B --> C["1. Search Seeds<br>(tools_search.py)"];
    C -- Google CSE API --> D[Seed URLs];
    D --> E["2. Crawl Pages<br>(tools_crawl.py)"];
    E -- HTTP Requests --> F[Page Content];
    F --> G["3. Summarize Articles<br>(tools_summarize.py)"];
    G -- Gemini API --> H[Structured Summaries];
    H --> I{output.json};
```
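In code, the orchestration looks roughly like the sketch below. The helper names (`search_seed_urls`, `crawl_pages`, `summarize_article`) are hypothetical stand-ins for the actual functions in the three modules:

```python
# Hypothetical sketch of the pipeline in app.py; helper names are
# illustrative stand-ins for the real functions in the three modules.
import json
import time
from concurrent.futures import ThreadPoolExecutor

from tools_search import search_seed_urls      # assumed: topic -> seed URLs
from tools_crawl import crawl_pages            # assumed: URLs -> page texts
from tools_summarize import summarize_article  # assumed: text -> summary dict

def run(topic: str) -> str:
    seeds = search_seed_urls(topic)  # 1. Google CSE finds seed URLs
    pages = crawl_pages(seeds)       # 2. fetch seeds plus discovered links
    # 3. Summarize in parallel, one Gemini call per page
    with ThreadPoolExecutor(max_workers=5) as pool:
        summaries = list(pool.map(summarize_article, pages))
    out_path = f"summaries_{int(time.time())}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(summaries, f, ensure_ascii=False, indent=2)
    return out_path

if __name__ == "__main__":
    print(run(input("Enter your topic: ")))
```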
- Python 3.8+
- Access to Google Cloud Platform for:
- Google Custom Search API
- Google AI (Gemini) API
Follow these steps to set up the project locally.
1. Clone the repository:

   ```bash
   git clone <YOUR_REPOSITORY_URL>
   cd News-Agent-Nexus
   ```

2. Create and activate a Python virtual environment:

   - macOS / Linux (bash)

     ```bash
     python3 -m venv .venv
     source .venv/bin/activate
     ```

   - Windows (Command Prompt)

     ```bat
     python -m venv .venv
     .venv\Scripts\activate.bat
     ```

   - Windows (PowerShell)

     ```powershell
     python -m venv .venv
     .venv\Scripts\Activate.ps1
     ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

The tool requires three environment variables for Google's APIs. Create a file named `.env` in the root of the project directory and add the following, replacing the placeholder values with your actual credentials.
```env
# .env

# For AI-powered summarization via Google AI Studio or GCP
GOOGLE_API_KEY="AIzaSy..."

# For the initial web search via Google Custom Search Engine API
GOOGLE_CSE_KEY="AIzaSy..."
GOOGLE_CSE_CX="your_custom_search_engine_id"
```

| Variable | Purpose | Example Value |
|---|---|---|
| `GOOGLE_API_KEY` | API key for the Google Gemini model used in summarization. | `AIzaSy...` |
| `GOOGLE_CSE_KEY` | API key for the Google Custom Search Engine API. | `AIzaSy...` |
| `GOOGLE_CSE_CX` | The unique ID for your Programmable Search Engine instance. | `a1b2c3d4e5f67890` |
The `.env` file is ignored by Git, so your keys will not be committed.
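As a sketch, a startup check along these lines would produce the errors described under Troubleshooting (the project's actual check may differ):

```python
# Illustrative startup validation; mirrors the RuntimeErrors listed in
# Troubleshooting, though the project's actual check may differ.
import os

def require_credentials() -> None:
    if not (os.getenv("GOOGLE_CSE_KEY") and os.getenv("GOOGLE_CSE_CX")):
        raise RuntimeError("Google Custom Search credentials missing...")
    if not os.getenv("GOOGLE_API_KEY"):
        raise RuntimeError("GOOGLE_API_KEY is missing...")
```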
Once you have completed the installation and configuration steps, you can run the agent immediately.
1. Activate your virtual environment (if you haven't already).

2. Run the application:

   ```bash
   python app.py
   ```

3. Enter a topic when prompted:

   ```text
   Enter your topic (e.g. 'latest AI safety blog posts'): developments in solid-state batteries
   ```

The script will then execute all stages and save the results to a `summaries_{timestamp}.json` file.
To run the agent, simply execute the main application script:

```bash
python app.py
```

The application will prompt you to enter a topic. After you provide the topic and press Enter, it will begin the search, crawl, and summarization process, printing its progress to the console.
```text
Enter your topic (e.g. 'latest AI safety blog posts'): latest news on quantum computing hardware
[1/3] Searching…
Using 8 seed URLs.
[2/3] Crawling (seed + top links)…
Unique pages: 21
[3/3] Summarizing (parallel)…
Summaries written: 5
Wrote summaries_1756808756.json with 5 summaries.
Timing breakdown (seconds):
  search     1.12
  crawl      8.45
  summarize  15.20
  TOTAL      24.77
```
The output is a JSON array of summary objects, saved to a file like `summaries_1756808756.json`. Each object contains a title, the summarized content, and a reference URL.
```json
[
  {
    "title": "Quantum Leap",
    "content": "Researchers at XYZ University have developed a new qubit stabilization technique, potentially extending coherence times by over 200%.",
    "reference": "https://example.com/news/quant"
  },
  {
    "title": "Scaling Up",
    "content": "A major tech firm announced a 1,000-qubit processor, a significant milestone in building fault-tolerant quantum computers. Details remain sparse.",
    "reference": "https://tech-journal.com/artic"
  }
]
```

- `RuntimeError: Google Custom Search credentials missing...`: This error means the `GOOGLE_CSE_KEY` or `GOOGLE_CSE_CX` environment variables are not set. Ensure your `.env` file is correct and in the project root.
- `RuntimeError: GOOGLE_API_KEY is missing...`: This error means the `GOOGLE_API_KEY` for the Gemini model is not set. Check your `.env` file.
- Slow Performance: The crawling and summarization steps depend on network speed and API response times. The script runs summarization in parallel to mitigate this, but it can still take time.
- No Summaries Generated: This can happen if the initial search yields no results, the web pages cannot be crawled, or the content is too sparse to summarize. Check the console output for errors.
Q: Can I use a different search engine or language model?
A: Currently, the tool is hardcoded to use Google Custom Search and Google Gemini. Replacing these would require modifying `tools_search.py` and `tools_summarize.py`, respectively.
Q: How many articles does it summarize?
A: The script is currently configured to summarize the top 5 most article-like pages it finds to keep the process quick and focused. This can be changed in `app.py`.
Q: Why are the summaries so short?
A: The summary length (title, content) is constrained in the prompt sent to the language model to produce very brief, tweet-sized outputs. You can adjust the `max_length` constraints in the `ArticleSummary` Pydantic model in `tools_summarize.py`.
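For orientation, the model presumably looks something like the sketch below; the field names match the JSON output example above, but the exact `max_length` values are assumptions:

```python
# Hypothetical shape of ArticleSummary in tools_summarize.py; the field
# names match the JSON output, but the length limits are assumptions.
from pydantic import BaseModel, Field

class ArticleSummary(BaseModel):
    title: str = Field(..., max_length=80)     # short headline
    content: str = Field(..., max_length=280)  # tweet-sized summary body
    reference: str                             # source URL
```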
- API Costs: This tool makes calls to paid Google Cloud APIs. Be mindful of the potential costs, especially if running it frequently or on many topics.
- Crawl Quality: The web crawler uses simple heuristics to find articles and may miss content or fail on sites with heavy JavaScript.
- Summarization Accuracy: Summaries are generated by an AI and may contain inaccuracies or misinterpret the source material. Always consult the reference link for critical information.
TODO: Please add contribution guidelines, such as how to submit pull requests, coding standards, and testing procedures.
TODO: A license has not yet been specified for this project. Please choose an open-source license and add a LICENSE file.