The original search was failing with a "No URLs found" error because:
- The `googlesearch-python` library is frequently blocked by Google's anti-bot measures
- Google rate-limits automated searches aggressively
Replaced Google search with DuckDuckGo search:
- New Library: Replaced `googlesearch-python` with `ddgs` (DuckDuckGo Search)
  - More reliable for automation
  - Doesn't get blocked as easily
  - Open and permissive for bots
- Region Filtering: Added `region='wt-wt'` parameter
  - Requests DuckDuckGo's global (no specific region) results, which are mostly English
  - Reduces the number of non-English websites in the results
- Enhanced Filtering: Added more domains to skip list
  - Filters out listing sites (clutch, goodfirms, techbehemoths, etc.)
  - Focuses on actual company websites
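
The changes above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `SKIP_DOMAINS`, `is_listing_site()`, and `filter_company_urls()` are hypothetical names, the three domains are assumed from the site names mentioned above, and the standalone `search_companies()` function only stands in for `SearchHandler.search_companies()`. It assumes `ddgs` exposes `DDGS().text()` returning dictionaries with an `href` key.

```python
from urllib.parse import urlparse

# Assumption: illustrative skip list; the project's real list may differ.
SKIP_DOMAINS = {"clutch.co", "goodfirms.co", "techbehemoths.com"}


def is_listing_site(url, skip_domains=SKIP_DOMAINS):
    """Return True if the URL's host is a skipped domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in skip_domains)


def filter_company_urls(urls, limit=5):
    """Drop listing sites and duplicates, keeping at most `limit` URLs in order."""
    kept = []
    for url in urls:
        if len(kept) >= limit:
            break
        if url and url not in kept and not is_listing_site(url):
            kept.append(url)
    return kept


def search_companies(query, limit=5):
    """Hypothetical stand-in for SearchHandler.search_companies()."""
    # Imported lazily so the filter helpers work without ddgs installed.
    from ddgs import DDGS

    # Over-fetch so that enough results remain after filtering.
    results = DDGS().text(query, region="wt-wt", max_results=limit * 4)
    return filter_company_urls([r.get("href") for r in results], limit=limit)
```

Matching on the parsed hostname rather than a substring of the URL avoids skipping unrelated sites whose address merely contains a listed name.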
- `input_handler.py`: Updated `SearchHandler.search_companies()` method
- `requirements.txt`: Changed dependency from `googlesearch-python` to `ddgs>=9.0.0`
✅ Search now returns 5 relevant URLs for "software development consultancy finland"
- Start the Streamlit app:

      cd ~/Desktop/ai-web-crawler-bootcamp
      source venv/bin/activate
      streamlit run app.py

- Try these search terms:
  - "software development consultancy finland" (specific location + industry)
  - "tech startup San Francisco"
  - "IT consulting Helsinki"
  - "web development agency London"
✅ Good Search Terms (Specific):
- "software development consultancy helsinki"
- "IT services company Toronto"
- "web agency Berlin"
- "tech startup New York"
❌ Poor Search Terms (Too generic):
- "software companies" (too broad, mixed languages)
- "tech firms" (vague)
- "IT business" (not specific enough)
- Generic terms → listing sites, directories, Wikipedia
- Specific location + industry → actual company websites
- DuckDuckGo performs better with clear, targeted queries
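
One way to encourage the "location + industry" pattern is to build the query from two separate inputs. The helper below is only a sketch; `build_query()` is a hypothetical name and is not part of the existing app.

```python
def build_query(industry, location):
    """Combine an industry phrase and a location into one targeted search query."""
    # Collapse repeated whitespace and trim both parts.
    industry = " ".join(industry.split())
    location = " ".join(location.split())
    if not industry or not location:
        raise ValueError("both an industry and a location are needed for a specific query")
    return f"{industry} {location}"
```

For example, `build_query("IT consulting", "Helsinki")` yields one of the search terms listed above, while a missing location raises an error instead of producing a generic query.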
The search is now fixed and ready to use! Just:
- Run `streamlit run app.py`
- Enter a specific search term (location + industry)
- The crawler will find and extract company information
Note: If you prefer CSV upload, that option remains available in the Streamlit interface.