Search Fix Summary

Problem

The original search was failing with a "No URLs found" error because:

  • The googlesearch-python library is frequently blocked by Google's anti-bot measures
  • Google aggressively rate-limits automated searches

Solution

Replaced Google search with DuckDuckGo search.

Changes Made

  1. New Library: Replaced googlesearch-python with ddgs (DuckDuckGo Search)

    • More reliable for automation
    • Doesn't get blocked as easily
    • More tolerant of automated queries than Google
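  2. Region Filtering: Added the region='wt-wt' parameter

    • Requests worldwide ("no region") results instead of a country-specific set
    • With English queries this mostly returns English pages, though it does not strictly exclude non-English websites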
  3. Enhanced Filtering: Added more domains to the skip list

    • Filters out listing sites (clutch, goodfirms, techbehemoths, etc.)
    • Focuses on actual company websites
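
The three changes above could fit together roughly as follows. This is a minimal sketch, not the code in input_handler.py: it assumes ddgs exposes DDGS().text(...) returning dicts with an "href" key, and the SKIP_DOMAINS entries, function names, and over-fetch factor are illustrative guesses.

```python
from urllib.parse import urlparse

# Hypothetical skip list: the summary names these sites, the exact
# domains and the full list in the project may differ.
SKIP_DOMAINS = {"clutch.co", "goodfirms.co", "techbehemoths.com"}


def is_listing_site(url, skip_domains=SKIP_DOMAINS):
    """Return True if the URL's host is a skipped domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in skip_domains)


def filter_company_urls(urls, limit=5):
    """Drop listing sites and keep at most `limit` URLs, preserving order."""
    kept = []
    for url in urls:
        if not url or is_listing_site(url):
            continue
        kept.append(url)
        if len(kept) == limit:
            break
    return kept


def search_companies(query, limit=5):
    """Query DuckDuckGo through ddgs and return filtered result URLs."""
    # Imported lazily so the pure helpers above work without ddgs installed.
    from ddgs import DDGS

    # Over-fetch (assumed factor of 4) so enough results survive filtering.
    results = DDGS().text(query, region="wt-wt", max_results=limit * 4)
    return filter_company_urls([r.get("href") for r in results], limit=limit)
```

Matching on the parsed hostname rather than a substring of the URL avoids skipping unrelated sites whose address merely contains a listed name.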

Files Modified

  • input_handler.py: Updated SearchHandler.search_companies() method
  • requirements.txt: Changed dependency from googlesearch-python to ddgs>=9.0.0

Testing

✅ Search now returns 5 relevant URLs for "software development consultancy finland"

How to Test

  1. Start the Streamlit app:

    cd ~/Desktop/ai-web-crawler-bootcamp
    source venv/bin/activate
    streamlit run app.py
  2. Try these search terms:

    • "software development consultancy finland" (specific location + industry)
    • "tech startup San Francisco"
    • "IT consulting Helsinki"
    • "web development agency London"

Tips for Best Results

✅ Good Search Terms (Specific):

  • "software development consultancy helsinki"
  • "IT services company Toronto"
  • "web agency Berlin"
  • "tech startup New York"

❌ Poor Search Terms (Too generic):

  • "software companies" (too broad, returns results in mixed languages)
  • "tech firms" (vague)
  • "IT business" (not specific enough)

Why Specific Is Better

  • Generic terms → listing sites, directories, Wikipedia
  • Specific location + industry → actual company websites
  • DuckDuckGo performs better with clear, targeted queries
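
One way to encourage specific queries is to build them from separate industry and location inputs. The helper below is a hypothetical sketch (build_query is not part of the project) that refuses to produce a query when either part is missing.

```python
def build_query(industry, location):
    """Combine an industry phrase and a location into one search query.

    Hypothetical helper: raises ValueError if either part is blank, so a
    generic query such as "tech firms" without a location is rejected.
    """
    industry, location = industry.strip(), location.strip()
    if not industry or not location:
        raise ValueError("Provide both an industry and a location")
    return f"{industry} {location}"
```

For example, build_query("IT consulting", "Helsinki") yields the query "IT consulting Helsinki".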

Next Steps

The search is now fixed and ready to use! Just:

  1. Run streamlit run app.py
  2. Enter a specific search term (location + industry)
  3. The crawler will find and extract company information

Note: If you prefer CSV upload, that option is still available in the Streamlit interface.